
Lead Site Reliability Developer – CSRE Consulting
Posted 6 days ago

This is a fully remote position, open to applicants in Arizona, +3 more states.
• Lead consulting initiatives from the discovery phase to delivery by aligning stakeholders on priorities, sequencing tasks, and communicating measurable results.
• Establish a working rhythm and facilitate decision-making forums to identify risks, map dependencies, and ensure clear ownership and timelines.
• Align product, platform, and engineering stakeholders on reliability objectives and trade-offs using Service Level Objectives (SLOs) and error budgets.
• Collaborate regularly with Engineering Managers, product managers, Staff and Principal engineers, and platform leaders to maintain alignment on dependencies, decisions, and delivery.
• Identify systemic risks across shared dependencies and coordinate solutions across multiple teams to minimize recurring incidents.
• Promote change adoption by integrating reliability mechanisms into partner team routines such as planning, Production Readiness Reviews (PRRs), and on-call practices.
• Design and implement reusable reliability mechanisms, templates, and tools that can be utilized across various teams.
• Establish and refine production readiness review practices with partner teams to enhance launch quality and safety of changes.
• Drive the observability strategy for partner domains by improving signal quality, alerting philosophy, and operational dashboards.
• Lead complex incident investigations and ensure that lessons learned are turned into sustainable solutions with designated owners and verification processes.
• Lead design and code reviews focused on reliability and guide teams towards simpler, safer architectures.
• Mentor Senior engineers and other consultants through pairing, reviews, and structured coaching to amplify impact.
• Collaborate with internal platform engineering to influence roadmaps and deliver shared capabilities that facilitate SRE adoption.
• Enhance CSRE Consulting playbooks and operational practices based on recurring patterns observed across teams.
• In-depth practical knowledge of SRE principles, including SLO governance and the application of error budget policies.
• Proven capability to lead cross-team technical efforts and influence stakeholders without direct authority.
• Extensive experience in designing and troubleshooting distributed systems with cross-service failure modes.
• Experience in developing observability and alerting strategies while enhancing operational signal quality.
• Strong expertise in Kubernetes and AWS, including governance and cost trade-offs.
• Ability to create reliability automation and tooling that is reusable and adopted by multiple teams.
• Experience in leading production readiness and resilience practices, including disaster recovery validation and controlled testing.
• Solid software engineering fundamentals with the ability to produce and evaluate high-quality changes in enterprise codebases.
• Advanced incident analysis capabilities focused on reducing systemic risks and fostering organizational learning.
• Exceptional communication skills, including the ability to create executive-ready summaries and clear technical diagrams.
• Comprehensive medical, vision, dental, and mental health benefits for you and your family, with access to a health care concierge, as well as Flexible or Health Savings Accounts (FSA or HSA).
• Complimentary concert tickets and generous paid time off, including holidays, sick leave, and personal days.
• 401(k) plan with company matching, along with a stock reimbursement program.
• New parent programs including caregiver leave, plus support for fertility, adoption, foster care, or surrogacy.
• Career and skill development programs through School of Live, tuition reimbursement, and student loan repayment assistance.
• Volunteer time off and crowdfunding matching opportunities.