This is a fully remote position, open to applicants in the United States.

📋 Description

• Provision, validate, and manage GPU nodes within the B300, H200, and H100 clusters.

• Diagnose and resolve hardware and software issues related to compute, network, and storage systems.

• Oversee fleet health, implement corrective actions, and escalate fixes as necessary.

• Create new runbooks, improve existing ones, and retire those that are ineffective.

⛳️ Requirements

• Prior experience administering Linux systems in critical environments.

• Expertise in troubleshooting GPU node problems, including NVLink, NCCL, InfiniBand (IB), and driver or firmware bugs.

• Familiarity with observability tools such as Grafana and Prometheus.

• Proficiency in scripting to automate repetitive tasks, using languages such as Bash, Python, or Go.

🏝️ Benefits

• Competitive salary and performance-based incentives.

• Comprehensive health and wellness benefits.

• Opportunities for professional development and growth.

• Flexible working arrangements and a supportive team environment.

Operations Engineer, Fleet Reliability
