This is a fully remote position, open to applicants in Australia.

📋 Description

• Build and maintain observability solutions using platforms such as Datadog, Prometheus, and Grafana.

• Lead incident management, including coordinating response efforts, diagnosing issues, and determining follow-up actions.

• Collaborate with product engineering teams to design dependable systems, recover from incidents, and derive lessons from mistakes.

• Work alongside teams to establish and uphold SLOs, monitoring, and alerting strategies that guarantee reliability at scale.

• Develop and implement automation and support tools to enhance system resilience, ensure operational safety, and minimize operational overhead.

• Oversee the creation and upkeep of runbooks, alert definitions, and incident response protocols.

• Engage in on-call rotations to provide 24/7 support for critical production systems.

⛳️ Requirements

• A minimum of 6 years of experience in Site Reliability Engineering or comparable DevOps positions focused on system reliability and incident management.

• Extensive experience with contemporary monitoring stacks including Prometheus, Grafana, and Datadog.

• Proficiency in at least one general-purpose programming language, such as Python, Go, Rust, C/C++, or Java.

• Mastery of Infrastructure as Code tools, including Terraform and Helm.

• Familiarity with at least one major cloud service provider (AWS, GCP, Azure).

• Strong communication skills, including the ability to lead incident response and collaborate effectively across teams.

• Willingness to participate in, and experience with, on-call rotations and emergency response processes.

• A high degree of autonomy and a proactive approach to identifying and resolving issues.

• Exceptional problem-solving abilities and a systematic approach to troubleshooting complex challenges.

🏝️ Benefits

• Health, dental, vision, life, and disability insurance.

• 401(k) plan and flexible spending accounts.

• Flexible time off.

• Option to work from the Atlanta or San Francisco offices.

Senior Site Reliability Engineer
