This is a fully remote position, open to applicants anywhere in the world.

📋 Description

• Collaborate with service teams to establish meaningful SLIs and SLOs that are rooted in customer experience, and develop error budget policies that translate them into engineering decisions.

• Take ownership of and enhance the Operational Readiness Review (ORR) process, conducting evaluations for new services and major changes related to observability, alerting, runbooks, capacity, and graceful degradation.

• Enhance the incident-to-improvement pipeline by linking postmortem insights to operational readiness shortcomings, pinpointing recurring failure patterns, and driving systematic resolutions.

• Serve as the reliability authority that teams consult for architecture reviews, failure mode analyses, dependency mapping, and resilience design.

• Identify and assess operational toil across the organization, advocating for or creating automation solutions that eliminate it.

• Assist teams in developing sustainable on-call practices, focusing on alert quality, escalation procedures, runbook coverage, and noise reduction.

• Monitor and report on the overall operational maturity of the organization, highlighting systemic deficiencies and promoting remediation efforts.

⛳️ Requirements

• Possess over 7 years of experience in SRE, production engineering, or roles focused on reliability, including shaping SRE practices and fostering their adoption within engineering teams.

• Have a software engineering mindset—capable of writing code and building tools, not just configuring them.

• Demonstrate hands-on experience in defining and operationalizing SLOs/SLIs at scale, including error budget policies that have genuinely influenced engineering decisions.

• Have extensive experience in incident response, facilitating postmortems, and turning incident learnings into systemic improvements.

• Have experience with large-scale multi-tenant systems (bonus: managed database platforms or Postgres).

• Be proficient with cloud infrastructure (AWS preferred) and infrastructure-as-code (Pulumi preferred, Terraform/CDK also acceptable).

• Communicate effectively and persuasively; this position requires influencing without authority in a distributed organization.

• Have experience working in asynchronous or globally distributed teams.

• Be motivated by empowering other teams to be more effective rather than being the sole problem-solver.

🏝️ Benefits

• Fully Remote

• ESOP

• Tech Allowance

• Health Benefits

• Annual Off-Sites

• Flexible Work

• Professional Development

Site Reliability Engineer
