Remotery

Senior Site Reliability Engineer

Posted May 19

This is a fully remote position, open to applicants in Australia.

📋 Description

• Create and sustain observability solutions utilizing platforms such as Datadog, Prometheus, and Grafana.

• Assume a leadership role in incident management, which includes coordinating response efforts, diagnosing issues, and determining follow-up actions.

• Collaborate with product engineering teams to design dependable systems, recover from incidents, and derive lessons from mistakes.

• Work alongside teams to establish and uphold SLOs, monitoring, and alerting strategies that guarantee reliability at scale.

• Develop and implement automation and support tools to enhance system resilience, ensure operational safety, and minimize operational overhead.

• Oversee the creation and upkeep of runbooks, alert definitions, and incident response protocols.

• Participate in on-call rotations to provide 24/7 support for critical production systems.


⛳️ Requirements

• A minimum of 6 years of experience in Site Reliability Engineering or comparable DevOps positions focused on system reliability and incident management.

• Extensive experience with contemporary monitoring stacks including Prometheus, Grafana, and Datadog.

• Proficiency in at least one programming language, such as Python, Go, Rust, C/C++, or Java.

• Mastery of Infrastructure as Code tools, including Terraform and Helm.

• Familiarity with at least one major cloud service provider (AWS, GCP, Azure).

• Strong communication skills, capable of leading incident responses and effectively collaborating across teams.

• Willingness to participate in on-call rotations and emergency response processes, with prior experience doing so.

• A high degree of autonomy and a proactive approach to identifying and resolving issues.

• Exceptional problem-solving abilities and a systematic approach to troubleshooting complex challenges.


🏝️ Benefits

• Health, dental, vision, life, and disability insurance.

• 401(k) plan and flexible spending accounts.

• Flexible time off.

• Option to work from the Atlanta or San Francisco offices.
