
Staff SRE, AI Infrastructure
Posted Jun 21

This is a fully remote position, open to applicants in California.
• Take ownership of the reliability of Andromeda's infrastructure from start to finish.
• Lead incident response for top customers' training runs and write the postmortems.
• Ensure the operational health of thousands of GPUs across multiple providers.
• Develop telemetry, conduct GPU health assessments, and implement automated remediation.
• Establish on-call procedures, including rotations and escalation protocols.
• Serve as the reliability advocate during customer incident reviews.
• Collaborate closely with the product team to define Service Level Objectives (SLOs).
• Partner with providers and data center teams to enhance physical design.
• Elevate the skills of other engineers through mentorship.
• Several years of experience in building and operating large-scale GPU infrastructure as your main focus.
• A proven track record of ensuring the reliability of load-bearing infrastructure.
• Extensive hands-on experience with NVIDIA H100/H200/B200/GB200 (or equivalent) at scale.
• Real-world experience with InfiniBand, RoCE, and NVLink fabric technologies.
• Familiarity with how large training jobs run, including NCCL, CUDA, and PyTorch distributed training.
• Strong proficiency in Go, Python, or Rust programming languages.
• Expert-level knowledge of Linux and systems internals.
• Comfortable taking the lead as the senior engineer during a P0 bridge with customers.
• Confident in serving as the senior technical voice for AI infrastructure customers.
• Significant autonomy in your work.
• Opportunity to work on infrastructure relied upon by the most ambitious AI labs.