This is a fully remote position, open to applicants in Brazil.

📋 Description

• Design, implement, and maintain monitoring and observability solutions using tools such as Prometheus, the Grafana Stack (Loki/Grafana/Tempo/Alertmanager), Datadog, and OpenTelemetry.

• Define and track SLOs, SLIs, and error budgets to measure system reliability.

• Create and enhance dashboards, alerts, and reports for both system performance and business metrics.

• Develop actionable alerting strategies to reduce noise and shorten MTTR.

• Integrate alerting systems with Jira for streamlined incident management.

• Establish and improve runbooks for on-call teams to manage alerts effectively.

• Equip teams to ensure observability coverage and follow effective incident response practices.

• Analyze system performance metrics, pinpoint bottlenecks, and implement enhancements to boost system efficiency, scalability, and cost-effectiveness.

• Assist in conducting load testing and capacity planning to ensure systems can accommodate peak traffic loads.

• Identify automation opportunities and create tools to optimize operational processes, such as fail-over, configuration management, and monitoring.

• Build monitoring and alerting into automations to proactively detect and resolve issues.

• Collaborate closely with cross-functional teams, including software engineering, operations, and infrastructure, to understand system requirements, provide technical guidance, and drive solutions.

• Effectively communicate with stakeholders regarding system changes, incidents, and enhancements.

• Promote and disseminate SRE principles and practices throughout the organization.

⛳️ Requirements

• Must be located in Latin America.

• Proficient in English at C1 or C2 level.

• Demonstrated experience as a Site Reliability Engineer or in a comparable role.

• Expertise in logging, metrics, and tracing frameworks (Datadog, Loki, Prometheus, OpenTelemetry).

• Familiarity with cloud platforms (Azure preferred) and infrastructure-as-code tools (e.g., Terraform).

• Strong programming and scripting skills (Python, Bash).

• Proficiency in containerization technologies and orchestration tools (Docker, Kubernetes).

• Understanding of Linux-based systems, networking, and security principles related to containerized applications.

• Robust problem-solving and troubleshooting abilities, with a keen interest in identifying and resolving complex technical challenges.

• Excellent communication and collaboration skills.

• Ability to thrive in a dynamic, fast-paced environment.

• Experience with PostgreSQL monitoring and optimization (nice to have).

🏝️ Benefits

• 2+ year contract.

• 15 business days of vacation.

• Access to tech courses and conferences.

• State-of-the-art MacBook.

• Flexible working hours.

Site Reliability Engineer
