This is a fully remote position, open to applicants in Brazil.

📋 Description

• Lead, nurture, and retain the SRE team, fostering a culture of high performance, collaboration, and continuous learning.

• Oversee hiring, onboarding, feedback cycles, individual development plans (IDPs), and performance evaluations.

• Establish the SRE team's strategy and roadmap to align with Cloud and business objectives.

• Advocate for SRE and observability culture, serving as a technical reference for Engineering.

• Manage team priorities, capacity, and trade-offs to ensure quality deliverables.

• Align initiatives with leadership in Cloud Engineering, Platform Engineering, and Cloud Security.

• Report team metrics, risks, and progress to Cloud leadership.

• Define and spearhead the observability strategy (metrics, logs, and traces).

• Evolve the observability platform (Prometheus, Grafana, OpenTelemetry, Loki, Tempo).

• Establish and govern SLIs, SLOs, and Error Budgets for critical services.

• Define instrumentation standards for applications and infrastructure, promoting adoption across teams.

• Implement an actionable alerting strategy to minimize noise.

• Plan and execute capacity management based on metrics.

• Optimize costs and performance of observability solutions at scale.

• Structure and lead the incident management process (escalation, war room, and communication).

• Ensure blameless post-mortems and follow up on corrective actions.

• Identify recurring issues and suggest systemic, data-driven enhancements.

• Lead toil reduction by automating operational tasks.

• Keep operational documentation (runbooks, procedures, and architectures) up to date and accessible.

⛳️ Requirements

• Proven experience leading technical teams (SRE, DevOps, Cloud Engineering).

• Familiarity with SRE practices, including SLIs, SLOs, Error Budgets, and toil reduction.

• Experience with APM tools (Datadog, New Relic, Dynatrace).

• Knowledge of observability and telemetry (metrics, logs, traces), specifically with Prometheus, OpenTelemetry, and Grafana.

• Hands-on experience with Infrastructure as Code (AWS CDK, Terraform).

• Proficient in scripting languages (Python, Bash) and at least one programming language (Go, Java).

• Experience with large-scale logging and tracing solutions (Loki, Tempo, Jaeger, ELK Stack).

• Cloud experience, preferably with AWS.

• Familiarity with containers (Docker) and orchestration platforms (Kubernetes, ECS).

• Experience in incident management and conducting post-mortems.

• Understanding of Linux systems and diagnostic tools.

• Proficient in technical English (reading and writing).

🏝️ Benefits

• Comprehensive medical and dental plans with no co-pay.

• Life insurance coverage.

• Assistance with pharmacy/medication expenses.

• Support for physical activities through a fitness subsidy.

• Neon partnership to promote employee financial health.

• Access to Zenklub for mental and physical health (4 free monthly sessions for therapy or nutrition).

• Quick massage services available at headquarters.

• Flexible meal benefits provided via a Visa credit card.

• Free on-site food.

• Childcare allowance.

• Parental support program.

• Extended maternity and paternity leave.

• Access to an in-company training platform.

• Educational assistance covering 70% of tuition for degree programs and language courses.

• Home office allowance provided.

• Work equipment supplied.

• Furniture allowance available.

• Partnerships with coworking spaces across Brazil.

• Birthday day off.

• Happy hour allowance.

• Referral bonus for new hires.

• Performance bonus based on annual targets.

• Stock option plan.

• A relaxed, casual work environment with no dress code.

Lead Site Reliability Engineer – Observability
