
Operations Engineer, Fleet Reliability
Posted Jun 19

This is a fully remote position, open to applicants in the United States.
• Provision, validate, and manage GPU nodes within the B300, H200, and H100 clusters.
• Diagnose and resolve hardware and software issues related to compute, network, and storage systems.
• Oversee fleet health, implement corrective actions, and escalate fixes as necessary.
• Create runbooks, enhance existing ones, and eliminate those that are ineffective.
• Prior experience administering Linux systems in critical environments.
• Expertise in troubleshooting GPU node problems including NVLink, NCCL, IB, and driver or firmware bugs.
• Familiarity with observability tools such as Grafana and Prometheus.
• Proficiency in scripting to automate repetitive tasks in Bash, Python, Go, or a similar language.
• Competitive salary and performance-based incentives.
• Comprehensive health and wellness benefits.
• Opportunities for professional development and growth.
• Flexible working arrangements and a supportive team environment.