
Senior Site Reliability Engineer, SRE
Posted 1 day ago

This is a fully remote position, open to applicants in Brazil.
• Design, implement, and enhance Site Reliability Engineering practices within production environments.
• Define, oversee, and continually enhance Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets.
• Lead and engage in incident response and command processes.
• Develop and refine observability strategies, including monitoring, logging, alerting, and distributed tracing.
• Enhance system reliability, availability, scalability, and operational efficiency.
• Collaborate with engineering teams to boost application performance and readiness for production.
• Create automation solutions that minimize operational overhead and enhance reliability.
• Engage in root cause analysis and conduct post-incident reviews.
• Propel continuous improvement initiatives grounded in operational insights and lessons learned from incidents.
• Assist in establishing reliability best practices across teams and services.
• Over 5 years of professional experience in Site Reliability Engineering, DevOps, or Production Engineering roles.
• Solid understanding of Site Reliability Engineering principles and best practices.
• Experience in supporting and managing production systems at scale.
• Strong knowledge of monitoring, observability, and reliability engineering concepts.
• Experience in cloud-based environments.
• Excellent troubleshooting and problem-solving abilities.
• Experience with distributed systems and contemporary application architectures.
• Proven track record in Site Reliability Engineering.
• Experience in defining and managing:
  • Service Level Objectives (SLOs)
  • Service Level Indicators (SLIs)
  • Error Budgets
• Experience in leading or actively participating in Incident Command and Incident Response processes.
• Experience in designing and implementing observability strategies.
• Hands-on experience with:
  • Monitoring
  • Logging
  • Alerting
  • Distributed Tracing
• Experience in enhancing system reliability, availability, and operational excellence.
• Experience in supporting mission-critical production environments.
• Familiarity with cloud platforms (AWS preferred).
• Strong automation mindset.
• Experience in conducting root cause analysis and postmortems.
• Experience with Kubernetes.
• Experience with Terraform or Infrastructure as Code.
• CI/CD pipeline experience.
• Familiarity with containerized environments.
• Experience with distributed microservices architectures.
• Background in performance engineering.
• Experience mentoring engineers on reliability practices.
• Multi-cloud experience.
• Experience in highly regulated or high-availability environments.
• Home office option;
• Competitive compensation based on experience;
• Career development plans to support significant growth within the company;
• Opportunities to work on international projects;
• Oowlish English Program (Technical and Conversational);
• Oowlish Fitness with Total Pass;
• Engaging games and competitions.