This is a fully remote position, open to applicants in California.

• Take ownership of the reliability of Andromeda's infrastructure from start to finish.

• Lead incident response for top customers' training runs and document the postmortems.

• Ensure the operational health of thousands of GPUs across multiple providers.

• Develop telemetry, conduct GPU health assessments, and implement automated remediation.

• Establish on-call procedures, including rotations and escalation protocols.

• Serve as the reliability advocate during customer incident reviews.

• Collaborate closely with the product team to define Service Level Objectives (SLOs).

• Partner with providers and data center teams to enhance physical design.

• Elevate the skills of other engineers through mentorship.

• Several years of experience building and operating large-scale GPU infrastructure as your main focus.

• A proven track record of ensuring the reliability of load-bearing infrastructure.

• Extensive hands-on experience with NVIDIA H100/H200/B200/GB200 (or equivalent) at scale.

• Real-world experience with InfiniBand, RoCE, and NVLink fabric technologies.

• Familiarity with how large training jobs run, including NCCL, CUDA, and PyTorch distributed training.

• Strong proficiency in Go, Python, or Rust programming languages.

• Expert-level knowledge of Linux and systems internals.

• Comfortable taking the lead as the senior engineer during a P0 bridge with customers.

• Confident in serving as the senior technical voice for AI infrastructure customers.

• Significant autonomy in your work.

• Opportunity to work on infrastructure relied upon by the most ambitious AI labs.

Staff SRE, AI Infrastructure
