This is a fully remote position, open to applicants in Brazil.

📋 Description

• **Incident Leadership:**

• Serve as the Incident Response Lead during War Rooms, facilitating technical remediation and stakeholder communication.

• **Observability Engineering:**

• Design and improve telemetry in Datadog (logs, APM, traces, and business metrics) to reduce mean time to detect (MTTD) and lessen the team's cognitive load.

• **Workload Management on AWS Amplify:**

• Ensure the resilience and scalability of hosted front-end applications and essential APIs.

• **SRE Governance:**

• Define and oversee SLIs, SLOs, and SLAs while managing the Error Budget to balance delivery speed with system stability.

• **Mitigation Automation:**

• Create auto-healing tools and scripts (automatic rollback, controlled restarts, component isolation).

• **Root Cause Analysis:**

• Lead blameless post-mortem sessions and ensure that structural improvements are implemented to prevent recurrence.

• **Systems Modernization:**

• Collaborate with development teams to adopt resilience patterns (Circuit Breakers, Bulkheads, and Rate Limiting) in both modern architectures and legacy systems.

• **AI in Operations:**

• Implement anomaly detection and intelligent response solutions using AIOps (Datadog Bits AI or AWS DevOps Agent).

⛳️ Requirements

• **Proven Seniority in SRE or DevOps:** Solid experience in high-scale, mission-critical environments.

• **Deep AWS Expertise:** Advanced knowledge of EC2, RDS, S3, IAM, EKS, and Amplify.

• **Observability Tools:** Strong experience in monitoring, logging, and APM (preferably with Datadog).

• **Containers & Orchestration:** In-depth knowledge of Docker and Kubernetes (EKS/GKE).

• **Infrastructure as Code (IaC):** Proficiency in Terraform.

• **Development/Scripting:** Skilled in Python, Go, or Shell scripting for automation purposes.

• **Incident Management:** Practical experience with on-call rotations and real-time problem resolution.

• **Plus / Nice-to-haves:**

• **Analytical Profile for Legacy Systems:** Experience troubleshooting .NET Framework applications and Oracle or PostgreSQL databases.

• **Chaos Engineering:** Experience conducting controlled stress and resilience tests.

• **Certifications:** AWS Certified DevOps Engineer - Professional or official Datadog certifications.

🏝️ Benefits

• 📚 Educational Incentives (Partnerships with Educational Institutions)

• 🌴 Paid Vacation

• 🏋️ TotalPass

• 🎂 Birthday off

• 🏥 Health Insurance

• 🦷 Dental Insurance

• 🤰 Maternity Leave

• 👨‍👩‍👧‍👦 Paternity Leave

• 🌟 Reimbursement for AWS Certifications

Senior Site Reliability Engineer
