
Principal Service Reliability Engineer
Posted Jun 20

This is a fully remote position, open to applicants in Virginia and 4 other states.
• Develop and enhance build and deployment pipelines for secure, dependable production releases.
• Oversee the design and management of pre-production and production cloud infrastructure to guarantee high availability, performance, and security.
• Collaborate with Engineering teams to streamline the release flow from development to testing and production.
• Establish and enforce monitoring, alerting, and incident response protocols.
• Lead complex troubleshooting efforts, conduct root cause analysis, and drive systematic post-incident improvements.
• Assess, recommend, and integrate new infrastructure technologies and services.
• Ensure platforms meet or exceed healthcare security and compliance standards.
• Promote and facilitate the adoption of SRE best practices (SLOs, SLIs, error budgets, reliability engineering standards).
• Act as a technical leader and advisor across Service Reliability, DevOps, and engineering teams.
• Guide engineers through design reviews, knowledge sharing, and advice on best practices.
• Influence system design and architectural choices to enhance scalability and resilience.
• Collaborate with teams to prioritize reliability initiatives and minimize operational risk.
• Assist in defining engineering standards, best practices, and operational runbooks.
• Cultivate a culture of ownership, accountability, reliability, and continuous improvement.
• Bachelor’s degree in Computer Science, Computer Engineering, Information Security, or a related field with practical experience.
• Over 8 years of experience in SRE, DevOps, or infrastructure engineering positions.
• Extensive expertise in managing cloud-based infrastructure, preferably in Azure.
• Hands-on experience with Kubernetes, including cluster setup, networking, access control, and authorization.
• Knowledge of deployments, services, config maps, secrets, and cronjobs.
• Experience designing, deploying, and maintaining service mesh infrastructure.
• Strong experience with GitHub Actions and CI/CD pipelines.
• Experience in supporting production environments and high-availability systems.
• Familiarity with Agile methodologies (Scrum, sprints, backlogs).
• Experience managing certificates, secrets, and monitoring systems.
• Excellent collaboration skills within a large, evolving engineering organization.
• Proven ability to lead complex technical initiatives across multiple teams.
• Proactive approach with a focus on continuous improvement.
• Flexible time off, including 12 paid holidays.
• 401k match along with 100% employer-paid medical, dental, and vision premiums.
• Company contributions to Health Savings Account.
• Stock options.