
VP of Site Reliability
Posted Jun 21

This is a fully remote position, open to applicants in the United States.
• Establish and initially operate the SRE practice, including the SLO framework, on-call rotation, and incident command process.
• Develop SLOs, manage the rotation, lead incident responses for live banking customers, and create postmortem reports.
• Determine severity tiers, SLA commitments for each customer tier, and escalation procedures for production support.
• Establish operational standards across all four engineering lanes: sprint discipline, release rituals, code review standards, and change management documentation.
• A minimum of ten years of experience in engineering.
• At least five years of hands-on experience building SRE or platform operations functions within a software company catering to enterprise or regulated markets.
• Familiarity with organizations that deliver software to customers and manage it at scale, such as ServiceNow, MongoDB, AWS, GCP, or similar.
• Experience managing multi-tenant and multi-deployment-model infrastructure, understanding the complexities involved in the final stages.
• Ability to create SLOs that teams actually use.
• Experience establishing an on-call rotation from the ground up.
• Experience acting as the technical lead during production incidents, with an understanding of the consequences of lacking a defined process.
• Ability to build trust with senior engineers based on merit rather than title.
• Self-motivated and dedicated to developing processes and infrastructure.
• Competitive base salary and significant equity opportunities.
• Strong preference for candidates in Atlanta, GA. Consideration for West Coast applicants on an individual basis. Remote work is an option for the right candidate.