This is a fully remote position, open to applicants in Brazil.

📋 Description

• Design, implement, and enhance Site Reliability Engineering practices within production environments.

• Define, oversee, and continually enhance Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets.

• Lead and engage in incident response and command processes.

• Develop and refine observability strategies, including monitoring, logging, alerting, and distributed tracing.

• Enhance system reliability, availability, scalability, and operational efficiency.

• Collaborate with engineering teams to boost application performance and readiness for production.

• Create automation solutions that minimize operational overhead and enhance reliability.

• Engage in root cause analysis and conduct post-incident reviews.

• Drive continuous improvement initiatives grounded in operational insights and lessons learned from incidents.

• Assist in establishing reliability best practices across teams and services.

⛳️ Requirements

• Over 5 years of professional experience in Site Reliability Engineering, DevOps, or Production Engineering roles.

• Solid understanding of Site Reliability Engineering principles and best practices.

• Experience in supporting and managing production systems at scale.

• Strong knowledge of monitoring, observability, and reliability engineering concepts.

• Experience in cloud-based environments.

• Excellent troubleshooting and problem-solving abilities.

• Experience with distributed systems and contemporary application architectures.

• Proven track record in Site Reliability Engineering.

• Experience in defining and managing Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets.

• Experience in leading or actively participating in Incident Command and Incident Response processes.

• Experience in designing and implementing observability strategies.

• Hands-on experience with monitoring, logging, alerting, and distributed tracing.

• Experience in enhancing system reliability, availability, and operational excellence.

• Experience in supporting mission-critical production environments.

• Familiarity with cloud platforms (AWS preferred).

• Strong automation mindset.

• Experience in conducting root cause analysis and postmortems.

• Experience with Kubernetes.

• Experience with Terraform or Infrastructure as Code.

• CI/CD pipeline experience.

• Familiarity with containerized environments.

• Experience with distributed microservices architectures.

• Background in performance engineering.

• Experience mentoring engineers on reliability practices.

• Multi-cloud experience.

• Experience in highly regulated or high-availability environments.

🏝️ Benefits

• Home office option;

• Competitive compensation based on experience;

• Career development plans to support significant growth within the company;

• Opportunities to work on international projects;

• Oowlish English Program (Technical and Conversational);

• Oowlish Fitness with Total Pass;

• Engaging games and competitions.

Senior Site Reliability Engineer, SRE
