Remotery

Staff SRE, AI Infrastructure

Posted Jun 21

This is a fully remote position, open to applicants in California.

📋 Description

• Take ownership of the reliability of Andromeda's infrastructure from start to finish.

• Lead incident response for top customers' training runs and document the postmortems.

• Ensure the operational health of thousands of GPUs across multiple providers.

• Develop telemetry, conduct GPU health assessments, and implement automated remediation.

• Establish on-call procedures, including rotations and escalation protocols.

• Serve as the reliability advocate during customer incident reviews.

• Collaborate closely with the product team to define Service Level Objectives (SLOs).

• Partner with providers and data center teams to enhance physical design.

• Elevate the skills of other engineers through mentorship.


⛳️ Requirements

• Several years of experience building and operating large-scale GPU infrastructure as your primary focus.

• A proven track record of ensuring the reliability of load-bearing infrastructure.

• Extensive hands-on experience with NVIDIA H100/H200/B200/GB200 (or equivalent) at scale.

• Real-world experience with InfiniBand, RoCE, and NVLink fabric technologies.

• Familiarity with how large training jobs run, including NCCL, CUDA, and PyTorch distributed training.

• Strong proficiency in Go, Python, or Rust.

• Expert-level knowledge of Linux and systems internals.

• Comfortable taking the lead as the senior engineer during a P0 bridge with customers.

• Confident in serving as the senior technical voice for AI infrastructure customers.


🏝️ Benefits

• Significant autonomy in your work.

• Opportunity to work on infrastructure relied upon by the most ambitious AI labs.
