
SRE Partner – Affirmative Action Position for Persons with Disabilities
Posted May 23

This is a fully remote position, open to applicants in Brazil.
Responsibilities
• Ensure the reliability, availability, and scalability of systems and services within the assigned Product Areas (PAs).
• Design and implement solutions for monitoring, observability, and alerting that integrate with the Agentic Engineering Platform.
• Assist teams in defining and tracking Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.
• Organize and enhance on-call management within the PAs, including rotation, escalation procedures, alerting tools, and incident management.
• Collaborate closely with the Engineering Platform to ensure that platform capabilities are effectively utilized and adopted by product teams.
• Contribute to the evolution of the Agentic Engineering Platform by relaying candid feedback from the PAs on friction points, gaps, and opportunities for improvement.
• Participate in and help foster a Site Reliability Engineering (SRE) culture throughout the organization.
• Assist in the migration of critical systems, the segregation of environments, and the deprecation of legacy technologies.
Requirements
• Experience working with cloud environments, ideally Google Cloud Platform (GCP).
• Expertise in observability tools and methodologies (Prometheus, Grafana, Loki, Thanos, Elasticsearch, Alertmanager, etc.).
• Strong understanding of Kubernetes and distributed system architectures.
• Solid knowledge of Infrastructure as Code (IaC) principles and Terraform.
• Practical experience in incident management, on-call duties, and conducting post-mortems.
• Experience in defining and tracking SLOs and error budgets.
• Ability to analyze logs and assess the performance of distributed systems.
• Excellent communication and influencing skills, with the ability to advocate for technical solutions to a variety of audiences, including engineers, product managers, and leadership.
• A data-driven approach: using data to identify risks, prioritize actions, and demonstrate impact.