
Engineering Manager, Fleet Reliability
Posted 18 hours ago

• Establish and lead the Fleet Reliability team: recruit, develop, and retain talent.
• Maintain 24/7 support for node provisioning, validation, and triage processes.
• Drive the automation strategy: implement event-driven remediation and self-healing mechanisms, and enhance observability.
• Define and uphold the Service Level Agreements (SLAs) that ensure production GPUs are consistently serving traffic.
• Cultivate the team culture: establish metrics for performance, facilitate communication, and promote professional growth.
• Over 7 years of experience in infrastructure, software, or Site Reliability Engineering (SRE), with a minimum of 2 years in a leadership role.
• Experience managing a fleet reliability or hardware operations team in a production environment.
• Track record of building SRE practices within a team from the ground up, including incident management, postmortems, observability, and change management.
• History of advocating for automation and directing teams to reduce manual toil.
• Competitive salary and performance-based incentives.
• Comprehensive health, dental, and vision insurance plans.
• Flexible work arrangements and paid time off.
• Opportunities for professional development and continuous learning.