
Lead Site Reliability Engineer – Observability
Posted 4 hours ago

Responsibilities
• Lead, nurture, and retain the SRE team, fostering a culture of high performance, collaboration, and continuous learning.
• Oversee hiring, onboarding, feedback cycles, individual development plans (IDPs), and performance evaluations.
• Establish the SRE team's strategy and roadmap to align with Cloud and business objectives.
• Advocate for SRE and observability culture, serving as a technical reference for Engineering.
• Manage team priorities, capacity, and trade-offs to ensure quality deliverables.
• Align initiatives with leadership in Cloud Engineering, Platform Engineering, and Cloud Security.
• Report team metrics, risks, and progress to Cloud leadership.
• Define and spearhead the observability strategy (metrics, logs, and traces).
• Evolve the observability platform (Prometheus, Grafana, OpenTelemetry, Loki, Tempo).
• Establish and govern SLIs, SLOs, and Error Budgets for critical services.
• Define instrumentation standards for applications and infrastructure, promoting adoption across teams.
• Implement an actionable alerting strategy to minimize noise.
• Plan and execute capacity management based on metrics.
• Optimize costs and performance of observability solutions at scale.
• Structure and lead the incident management process (escalation, war room, and communication).
• Ensure blameless post-mortems and follow up on corrective actions.
• Identify recurring issues and suggest systemic, data-driven enhancements.
• Lead toil reduction by automating operational tasks.
• Keep operational documentation (runbooks, procedures, and architectures) up to date and accessible.
Requirements
• Proven experience in leading technical teams (SRE, DevOps, Cloud Engineering).
• Familiarity with SRE practices, including SLIs, SLOs, Error Budgets, and toil reduction.
• Experience with APM tools (Datadog, New Relic, Dynatrace).
• Knowledge of observability and telemetry (metrics, logs, traces), specifically with Prometheus, OpenTelemetry, and Grafana.
• Hands-on experience with Infrastructure as Code (AWS CDK, Terraform).
• Proficiency in scripting languages (Python, Bash) and at least one programming language (Go, Java).
• Experience with large-scale logging and tracing solutions (Loki, Tempo, Jaeger, ELK Stack).
• Cloud experience, preferably with AWS.
• Familiarity with containers (Docker) and orchestration platforms (Kubernetes, ECS).
• Experience in incident management and conducting post-mortems.
• Understanding of Linux systems and diagnostic tools.
• Proficiency in technical English (reading and writing).
Benefits
• Comprehensive medical and dental plans with no co-pay.
• Life insurance coverage.
• Assistance with pharmacy/medication expenses.
• Support for physical activities through a fitness subsidy.
• Neon partnership to promote employee financial health.
• Access to Zenklub for mental and physical health (4 free monthly sessions for therapy or nutrition).
• Quick massage services available at headquarters.
• Flexible meal benefits provided via a Visa credit card.
• Free on-site food.
• Childcare allowance.
• Parental support program.
• Extended maternity and paternity leave.
• Access to an in-company training platform.
• Educational assistance covering 70% of tuition for degree programs and language courses.
• Home office allowance provided.
• Work equipment supplied.
• Furniture allowance available.
• Partnerships with coworking spaces across Brazil.
• Birthday day off.
• Happy hour allowance.
• Referral bonus for new hires.
• Performance bonus based on annual targets.
• Stock option plan.
• A relaxed, casual work environment with no dress code.
PandaDoc