
Senior Software Engineer, DGX Cloud AI Infrastructure
Posted 1 hour ago

This is a fully remote position, open to applicants in California and 3 more states.
• Lead the bring-up, validation, and troubleshooting of large-scale AI clusters, infrastructure, and end-to-end workloads, establishing operational standards for the team.
• Set up, optimize, and benchmark AI pre-training, post-training, and inference tasks utilizing PyTorch, NeMo / Megatron, TensorRT-LLM, and related NVIDIA AI software frameworks.
• Assess and enhance end-to-end workload performance across computing, memory, networking, and communication elements using tools such as Nsight Systems, NCCL tests, and tailored microbenchmarks.
• Evaluate scaling efficiency for distributed LLM workloads across data, tensor, pipeline, and expert parallelism on modern GPU clusters, and convert the findings into specific tuning recommendations.
• Conduct root-cause analysis for complex failures, including hangs, performance degradations, and topology sensitivity in extensive distributed settings.
• Establish and develop the resilience and failure-attribution framework: identifying, prioritizing, and attributing node, fabric, and workload failures across the cluster at scale.
• Create repeatable benchmarking suites, automation processes, acceptance criteria, and qualification workflows for new platforms.
• Adjust runtime settings, communication parameters, and deployment configurations in close collaboration with framework, systems, and platform teams.
• Provide actionable, data-driven insights based on profiling, benchmark outcomes, and cluster characterization.
• Mentor engineers, promote technical standards, and serve as a force multiplier across the wider performance and infrastructure organization.
• Bachelor’s or Master’s degree in Computer Science or a related technical discipline (or equivalent experience).
• Over 8 years of experience developing software infrastructure for large-scale AI or HPC systems, with a track record of technical leadership.
• Proficiency in debugging and triaging AI applications across the entire stack, from the application layer to the hardware.
• Extensive hands-on experience with NCCL, CUDA-aware distributed execution, and troubleshooting multi-GPU and multi-node workloads at scale.
• Proven experience in designing, debugging, and scaling large-scale distributed systems.
• Expert-level programming skills in Python and C/C++.
• Familiarity with operating workloads in scheduled, containerized cluster environments.
• Exceptional analytical, debugging, and communication skills, with the ability to influence across teams.
• Equity
• Benefits