Remotery

Lead Site Reliability Developer

Posted 10 hours ago

This is a fully remote position, open to applicants in Arizona, +3 more states.

📋 Description

• Lead consulting initiatives from the discovery phase through to delivery by ensuring stakeholder alignment on priorities, sequencing tasks, and communicating measurable outcomes.

• Establish a regular working rhythm and facilitate decision-making forums to highlight risks, map dependencies, and drive accountability for timelines.

• Align stakeholders from product, platform, and engineering on reliability objectives and trade-offs using Service Level Objectives (SLOs) and error budgets.

• Collaborate frequently with Engineering Managers, product managers, Staff and Principal engineers, and platform leads to ensure alignment on dependencies, decisions, and delivery timelines.

• Identify systemic risks across shared dependencies and coordinate remediation efforts across various teams to minimize recurring incidents.

• Foster change adoption by integrating reliability mechanisms into partner team routines, including planning, Production Readiness Reviews (PRRs), and on-call practices.

• Design and implement reusable reliability mechanisms, templates, and tooling that can be utilized across multiple teams.

• Establish and refine production readiness review practices with partner teams to enhance launch quality and change safety.

• Drive the observability strategy for partner domains by enhancing signal quality, alerting philosophy, and operational dashboards.

• Lead complex incident investigations, ensuring that learnings result in sustainable fixes with assigned owners and verification processes.

• Conduct reliability-focused design and code reviews, guiding teams towards simpler and safer architectural solutions.

• Mentor Senior engineers and other consultants through pairing, reviews, and structured coaching to amplify impact.

• Collaborate with internal platform engineering to influence roadmaps and deliver shared capabilities that accelerate Site Reliability Engineering (SRE) adoption.

• Enhance CSRE Consulting playbooks and operational practices based on recurring patterns identified across teams.


⛳️ Requirements

• In-depth, practical understanding of SRE principles, including SLO governance and error budget policies.

• Demonstrated ability to lead technical efforts across teams and influence without direct authority.

• Extensive experience in designing and troubleshooting distributed systems with cross-service failure modes.

• Proven experience in shaping observability and alerting strategies, along with improving operational signal quality.

• Strong expertise in Kubernetes and AWS, including governance and cost-related trade-offs.

• Ability to design reliability automation and tooling that is reusable and can be adopted by various teams.

• Experience in leading production readiness and resilience practices, including Disaster Recovery (DR) validation and controlled testing.

• Strong software engineering fundamentals, with the capability to deliver and review high-quality changes in enterprise-level codebases.

• Advanced incident analysis skills focused on reducing systemic risk and promoting organizational learning.

• Excellent communication skills, including the ability to create executive-ready summaries and clear technical diagrams.


🏝️ Benefits

• Health: Medical, vision, dental, and mental health benefits for you and your family, along with access to a health care concierge, and Flexible or Health Savings Accounts (FSA or HSA).

• Yourself: Complimentary concert tickets, generous paid time off including holidays, sick leave, and personal days.

• Wealth: 401(k) program with company matching, stock reimbursement program.

• Family: New parent programs including caregiver leave, plus support for fertility, adoption, foster care, or surrogacy.

• Career: Career and skill development initiatives with School of Live, tuition reimbursement, and student loan repayment options.

• Others: Volunteer time off and crowdfunding match.
