
Site Reliability Engineer
Posted 6 days ago

This is a fully remote position, open to applicants anywhere in the world.
• Collaborate with service teams to establish meaningful SLIs and SLOs rooted in customer experience, and develop error budget policies that translate them into engineering decisions.
• Take ownership of and enhance the Operational Readiness Review (ORR) process, conducting evaluations for new services and major changes related to observability, alerting, runbooks, capacity, and graceful degradation.
• Enhance the incident-to-improvement pipeline by linking postmortem insights to operational readiness shortcomings, pinpointing recurring failure patterns, and driving systematic resolutions.
• Serve as the reliability authority that teams consult for architecture reviews, failure mode analyses, dependency mapping, and resilience design.
• Identify and assess operational toil across the organization, advocating for or creating automation solutions that eliminate it.
• Assist teams in developing sustainable on-call practices, focusing on alert quality, escalation procedures, runbook coverage, and noise reduction.
• Monitor and report on the overall operational maturity of the organization, highlighting systemic deficiencies and promoting remediation efforts.
• Possess over 7 years of experience in SRE, production engineering, or roles focused on reliability, including shaping SRE practices and fostering their adoption within engineering teams.
• Have a software engineering mindset: capable of writing code and building tools, not just configuring them.
• Demonstrate hands-on experience in defining and operationalizing SLOs/SLIs at scale, including error budget policies that have genuinely influenced engineering decisions.
• Hold extensive experience in incident response, facilitating postmortems, and transforming incident learnings into systemic enhancements.
• Have experience with large-scale multi-tenant systems (bonus: managed database platforms or Postgres).
• Be proficient with cloud infrastructure (AWS preferred) and infrastructure-as-code (Pulumi preferred, Terraform/CDK also acceptable).
• Communicate effectively and persuasively. This position requires the ability to influence without authority in a distributed organization.
• Have experience working in asynchronous or globally distributed teams.
• Be motivated by empowering other teams to be more effective rather than being the sole problem-solver.
• Fully Remote
• ESOP
• Tech Allowance
• Health Benefits
• Annual Off-Sites
• Flexible Work
• Professional Development