Remotery

Operations Engineer, Fleet Reliability

Posted Jun 19

This is a fully remote position, open to applicants in the United States.

📋 Description

• Provision, validate, and manage GPU nodes within the B300, H200, and H100 clusters.

• Diagnose and resolve hardware and software issues related to compute, network, and storage systems.

• Oversee fleet health, implement corrective actions, and escalate issues as necessary.

• Create runbooks, enhance existing ones, and eliminate those that are ineffective.


⛳️ Requirements

• Prior experience administering Linux systems in critical environments.

• Expertise in troubleshooting GPU node problems, including NVLink, NCCL, InfiniBand (IB), and driver or firmware bugs.

• Familiarity with observability tools such as Grafana and Prometheus.

• Proficiency in scripting to automate repetitive tasks, using languages such as Bash, Python, or Go.


🏝️ Benefits

• Competitive salary and performance-based incentives.

• Comprehensive health and wellness benefits.

• Opportunities for professional development and growth.

• Flexible working arrangements and a supportive team environment.
