This is a fully remote position, open to applicants in Arizona and 3 other states.

📋 Description

• Lead consulting initiatives from the discovery phase to delivery by aligning stakeholders on priorities, sequencing tasks, and communicating measurable results.

• Establish a working rhythm and facilitate decision-making forums to identify risks, map dependencies, and ensure clear ownership and timelines.

• Align product, platform, and engineering stakeholders on reliability objectives and trade-offs using Service Level Objectives (SLOs) and error budgets.

• Collaborate regularly with engineering managers, product managers, staff and principal engineers, and platform leaders to maintain alignment on dependencies, decisions, and delivery.

• Identify systemic risks across shared dependencies and coordinate solutions across multiple teams to minimize recurring incidents.

• Promote change adoption by integrating reliability mechanisms into partner team routines such as planning, Production Readiness Reviews (PRRs), and on-call practices.

• Design and implement reusable reliability mechanisms, templates, and tools that can be utilized across various teams.

• Establish and refine production readiness review practices with partner teams to enhance launch quality and safety of changes.

• Drive the observability strategy for partner domains by improving signal quality, alerting philosophy, and operational dashboards.

• Lead complex incident investigations and ensure that lessons learned become lasting fixes with designated owners and verification steps.

• Lead design and code reviews focused on reliability and guide teams towards simpler, safer architectures.

• Mentor senior engineers and other consultants through pairing, reviews, and structured coaching to amplify impact.

• Collaborate with internal platform engineering to influence roadmaps and deliver shared capabilities that facilitate SRE adoption.

• Enhance CSRE Consulting playbooks and operational practices based on recurring patterns observed across teams.

⛳️ Requirements

• In-depth practical knowledge of SRE principles, including SLO governance and the application of error budget policies.

• Proven capability to lead cross-team technical efforts and influence stakeholders without direct authority.

• Extensive experience in designing and troubleshooting distributed systems with cross-service failure modes.

• Experience in developing observability and alerting strategies while enhancing operational signal quality.

• Strong expertise in Kubernetes and AWS, including governance and cost trade-offs.

• Ability to create reliability automation and tooling that is reusable and adopted by multiple teams.

• Experience in leading production readiness and resilience practices, including disaster recovery validation and controlled testing.

• Solid software engineering fundamentals with the ability to produce and evaluate high-quality changes in enterprise codebases.

• Advanced incident analysis capabilities focused on reducing systemic risks and fostering organizational learning.

• Exceptional communication skills, including the ability to create executive-ready summaries and clear technical diagrams.

🏝️ Benefits

• Comprehensive medical, vision, dental, and mental health benefits for you and your family, with access to a health care concierge, as well as Flexible or Health Savings Accounts (FSA or HSA).

• Complimentary concert tickets and generous paid time off, including holidays, sick leave, and personal days.

• 401(k) plan with company matching, along with a stock reimbursement program.

• New parent programs including caregiver leave, plus support for fertility, adoption, foster care, or surrogacy.

• Career and skill development programs through School of Live, tuition reimbursement, and student loan repayment assistance.

• Volunteer time off and crowdfunding matching opportunities.

Lead Site Reliability Developer – CSRE Consulting
