
Operations Engineer, Fleet Reliability
Posted 19 hours ago

• Provision, validate, and manage GPU nodes across B300, H200, and H100 clusters.
• Diagnose hardware and software challenges within compute, network, and storage environments.
• Oversee fleet health, implement remediation actions, and escalate fixes when necessary.
• Create runbooks, enhance existing ones, and eliminate those that are ineffective.
• Experience administering Linux systems in critical environments.
• Experience resolving GPU node issues involving NVLink, NCCL, and InfiniBand (IB), as well as driver and firmware problems.
• Familiarity with observability tools such as Grafana and Prometheus.
• Experience writing scripts to automate repetitive tasks (in Bash, Python, Go, or other languages).
• Competitive salary and performance-based incentives.
• Opportunities for professional development and career growth.
• Comprehensive health and wellness benefits.
• Collaborative and innovative work environment.
fal