
Site Reliability Engineer
Posted May 20

This is a fully remote position, open to applicants in Brazil.
• Design, implement, and maintain monitoring and observability solutions using tools such as Prometheus, the Grafana stack (Loki, Grafana, Tempo, Alertmanager), Datadog, and OpenTelemetry.
• Define and execute SLOs, SLIs, and error budgets to assess system reliability.
• Create and enhance dashboards, alerts, and reports for both system performance and business metrics.
• Develop actionable alerting strategies to reduce noise and lower MTTR.
• Integrate alerting systems with Jira for streamlined incident management.
• Establish and improve runbooks for on-call teams to manage alerts effectively.
• Equip teams to guarantee observability coverage and effective incident response practices.
• Analyze system performance metrics, pinpoint bottlenecks, and implement enhancements to boost system efficiency, scalability, and cost-effectiveness.
• Assist in conducting load testing and capacity planning to ensure systems can accommodate peak traffic loads.
• Identify automation opportunities and create tools to optimize operational processes, such as fail-over, configuration management, and monitoring.
• Implement monitoring and alerting systems within automations to proactively detect and resolve issues.
• Collaborate closely with cross-functional teams, including software engineers, operations, and infrastructure teams, to comprehend system requirements, provide technical guidance, and drive solutions.
• Effectively communicate with stakeholders regarding system changes, incidents, and enhancements.
• Promote and disseminate SRE principles and practices throughout the organization.
• Must be located in Latin America.
• Proficient in English at C1 or C2 level.
• Demonstrated experience as a Site Reliability Engineer or in a comparable role.
• Expertise in logging, metrics, and tracing frameworks (Datadog, Loki, Prometheus, OpenTelemetry).
• Familiarity with cloud platforms (Azure preferred) and infrastructure-as-code tools (e.g., Terraform).
• Strong programming and scripting skills (Python, Bash).
• Proficiency in containerization technologies and orchestration tools (Docker, Kubernetes).
• Understanding of Linux-based systems, networking, and security principles related to containerized applications.
• Robust problem-solving and troubleshooting abilities, with a keen interest in identifying and resolving complex technical challenges.
• Excellent communication and collaborative skills.
• Ability to thrive in a dynamic, fast-paced environment.
• Experience with PostgreSQL monitoring and optimization (Optional/Nice to have).
• 2+ year contract.
• 15 business days of vacation.
• Access to tech courses and conferences.
• State-of-the-art MacBook.
• Flexible working hours.