Remotery

Operations Engineer, Fleet Reliability

Posted May 14

This is a fully remote position, open to applicants in the United States.

📋 Description

• Provision, validate, and manage GPU nodes across B300, H200, and H100 clusters.

• Diagnose hardware and software issues across compute, network, and storage environments.

• Oversee fleet health, implement remediation actions, and escalate fixes when necessary.

• Create runbooks, enhance existing ones, and eliminate those that are ineffective.


⛳️ Requirements

• Experience administering Linux systems in critical environments.

• Experience resolving GPU node issues involving NVLink, NCCL, and IB, as well as driver and firmware problems.

• Familiarity with observability tools such as Grafana and Prometheus.

• Experience writing scripts to automate repetitive tasks (using Bash, Python, Go, or other languages).


🏝️ Benefits

• Competitive salary and performance-based incentives.

• Opportunities for professional development and career growth.

• Comprehensive health and wellness benefits.

• Collaborative and innovative work environment.
