
Senior Site Reliability Engineer
Posted May 23

This is a fully remote position, open to applicants in Brazil.
• **Incident Leadership:** Serve as the Incident Response Lead during War Rooms, facilitating technical remediation and stakeholder communication.
• **Observability Engineering:** Design and enhance telemetry within Datadog (Logs, APM, Traces, and business metrics) to minimize mean time to detect (MTTD) and reduce the cognitive load on the team.
• **Workload Management on AWS Amplify:** Ensure the resilience and scalability of hosted front-end applications and essential APIs.
• **SRE Governance:** Define and oversee SLIs, SLOs, and SLAs while managing the Error Budget to balance delivery speed with system stability.
• **Mitigation Automation:** Create auto-healing tools and scripts (automatic rollback, controlled restarts, component isolation).
• **Root Cause Analysis:** Lead blameless post-mortem sessions and ensure that structural improvements are implemented to prevent recurrence.
• **Systems Modernization:** Collaborate with development teams to adopt resilience patterns (Circuit Breakers, Bulkheads, and Rate Limiting) in both modern architectures and legacy systems.
• **AI in Operations:** Implement anomaly detection and intelligent response solutions using AIOps (Datadog Bits AI or AWS DevOps Agent).
• **Proven Seniority in SRE or DevOps:** Solid experience in high-scale, mission-critical environments.
• **Deep AWS Expertise:** Advanced knowledge of EC2, RDS, S3, IAM, EKS, and Amplify.
• **Observability Tools:** Strong experience in monitoring, logging, and APM (preferably with Datadog).
• **Containers & Orchestration:** In-depth knowledge of Docker and Kubernetes (EKS/GKE).
• **Infrastructure as Code (IaC):** Proficiency in Terraform.
• **Development/Scripting:** Skilled in Python, Go, or Shell scripting for automation purposes.
• **Incident Management:** Practical experience with on-call rotations and real-time problem resolution.
• **Plus / Nice-to-haves:**
• **Analytical Profile for Legacy Systems:** Experience in troubleshooting .NET Framework applications and Oracle or PostgreSQL databases.
• **Chaos Engineering:** Experience in conducting controlled stress and resilience tests.
• **Certifications:** AWS Certified DevOps Engineer - Professional or official Datadog certifications.
• 📚 Educational Incentives (Partnerships with Educational Institutions)
• 🌴 Paid Vacation
• 🏋️ TotalPass
• 🎂 Birthday off
• 🏥 Health Insurance
• 🦷 Dental Insurance
• 🤰 Maternity Leave
• 👨‍👩‍👧‍👦 Paternity Leave
• 🌟 Reimbursement for AWS Certifications