
Engineering Manager, Fleet Reliability
Posted Jun 20

This is a fully remote position, open to applicants in the United States.
• Establish and oversee the Fleet Reliability team: recruit, nurture, and retain talent.
• Ensure 24/7 support for node provisioning, validation, and triage processes.
• Lead the automation strategy: implement event-driven remediation and self-healing mechanisms, and enhance observability.
• Define and uphold the SLAs that ensure production GPUs remain operational and serving traffic.
• Cultivate the team culture: establish performance metrics, communication practices, and growth opportunities.
• Over 7 years of experience in infrastructure, software, or SRE, with a minimum of 2 years in a leadership role.
• Experience managing a fleet reliability or hardware operations team in a production environment.
• Developed SRE principles within a team from the ground up, including incident management, postmortem analysis, observability, and change management.
• Advocated for automation to reduce manual toil within teams.
• Comprehensive benefits package including health, dental, and vision insurance.
• Flexible work hours and the option for remote work.
• Opportunities for professional development and growth.