This is a fully remote position, open to applicants in California and 3 more states.

📋 Description

• Lead the bring-up, validation, and troubleshooting of large-scale AI clusters, infrastructure, and end-to-end workloads, establishing operational standards for the team.

• Set up, optimize, and benchmark AI pre-training, post-training, and inference tasks utilizing PyTorch, NeMo / Megatron, TensorRT-LLM, and related NVIDIA AI software frameworks.

• Assess and enhance end-to-end workload performance across computing, memory, networking, and communication elements using tools such as Nsight Systems, NCCL tests, and tailored microbenchmarks.

• Evaluate scaling efficiency for distributed LLM workloads through data, tensor, pipeline, and expert parallelism in contemporary GPU clusters, and convert insights into specific tuning recommendations.

• Conduct root-cause analysis for complex failures, including hangs, performance regressions, and topology sensitivity in large-scale distributed environments.

• Establish and develop the resilience and failure-attribution framework: identifying, prioritizing, and attributing node, fabric, and workload failures across the cluster at scale.

• Create repeatable benchmarking suites, automation processes, acceptance criteria, and qualification workflows for new platforms.

• Adjust runtime settings, communication parameters, and deployment configurations in close collaboration with framework, systems, and platform teams.

• Provide actionable, data-driven insights based on profiling, benchmark outcomes, and cluster characterization.

• Mentor engineers, promote technical standards, and serve as a force multiplier across the wider performance and infrastructure organization.

⛳️ Requirements

• Bachelor’s or Master’s degree in Computer Science or a related technical discipline (or equivalent experience).

• Over 8 years of experience in developing software infrastructure for large-scale AI or HPC systems, demonstrating a history of technical leadership.

• Proficiency in debugging and triaging AI applications across the entire stack, from the application layer to the hardware.

• Extensive hands-on experience with NCCL, CUDA-aware distributed execution, and troubleshooting multi-GPU and multi-node workloads at scale.

• Proven experience in designing, debugging, and scaling large-scale distributed systems.

• Expert-level programming skills in Python and C/C++.

• Familiarity with operating workloads in scheduled, containerized cluster environments.

• Exceptional analytical, debugging, and communication skills, with the ability to influence across teams.

🏝️ Benefits

• Equity

• Benefits

Senior Software Engineer, DGX Cloud AI Infrastructure
