Remotery

Operations Engineer, Fleet Reliability

at fal · United States · Full-time · Operations · Mid-level, Senior

Posted 19 hours ago

📋 Description

• Provision, validate, and manage GPU nodes across B300, H200, and H100 clusters.

• Diagnose hardware and software challenges within compute, network, and storage environments.

• Oversee fleet health, carry out remediation actions, and escalate issues when necessary.

• Create runbooks, enhance existing ones, and eliminate those that are ineffective.


⛳️ Requirements

• Experience administering Linux systems in critical environments.

• Experience resolving GPU node issues involving NVLink, NCCL, IB, drivers, and firmware.

• Familiarity with observability tools such as Grafana and Prometheus.

• Experience writing scripts to automate repetitive tasks (in Bash, Python, Go, or other languages).


🏝️ Benefits

• Competitive salary and performance-based incentives.

• Opportunities for professional development and career growth.

• Comprehensive health and wellness benefits.

• Collaborative and innovative work environment.

People also viewed

Pearl West · 19 hours ago
Operations Specialist
Canada Only · Full-time · Operations · $600 – $800/month

Guild Mortgage · 19 hours ago
Process Improvement Consultant
United States Only · Full-time · Operations · $79.3k – $119k/year

Recruiting.com · 19 hours ago
Senior Director – Cencora University Operations
Pennsylvania Only (US) · Full-time · Operations · $156.3k – $241k/year

fal · 19 hours ago
Operations Engineer, HPC Networking
United States Only · Full-time · Operations

EY · 19 hours ago
Senior Manager – TechOps, Service Management, ITSM
India Only · Full-time · Operations

Siemens Healthineers · 19 hours ago
Head of Total Rewards Operations
Alabama, +2 more states (US) · Full-time · Operations · $189.9k – $261.2k/year
