Remotery

Senior Software Engineer, DGX Cloud AI Infrastructure

Posted 1 hour ago

This is a fully remote position, open to applicants in California, +3 more states.

📋 Description

• Lead the bring-up, validation, and troubleshooting of large-scale AI clusters, infrastructure, and end-to-end workloads, establishing operational standards for the team.

• Set up, optimize, and benchmark AI pre-training, post-training, and inference workloads using PyTorch, NeMo / Megatron, TensorRT-LLM, and related NVIDIA AI software frameworks.

• Assess and improve end-to-end workload performance across compute, memory, networking, and communication components using tools such as Nsight Systems, NCCL tests, and custom microbenchmarks.

• Evaluate scaling efficiency for distributed LLM workloads across data, tensor, pipeline, and expert parallelism on modern GPU clusters, and convert the findings into specific tuning recommendations.

• Conduct root-cause analysis for complex failures, including hangs, performance regressions, and topology sensitivity in large-scale distributed environments.

• Establish and develop the resilience and failure-attribution framework: detecting, prioritizing, and attributing node, fabric, and workload failures across the cluster at scale.

• Create repeatable benchmarking suites, automation processes, acceptance criteria, and qualification workflows for new platforms.

• Adjust runtime settings, communication parameters, and deployment configurations in close collaboration with framework, systems, and platform teams.

• Provide actionable, data-driven insights based on profiling, benchmark outcomes, and cluster characterization.

• Mentor engineers, promote technical standards, and serve as a force multiplier across the wider performance and infrastructure organization.


⛳️ Requirements

• Bachelor’s or Master’s degree in Computer Science or a related technical discipline (or equivalent experience).

• Over 8 years of experience in developing software infrastructure for large-scale AI or HPC systems, demonstrating a history of technical leadership.

• Proficiency in debugging and triaging AI applications across the entire stack, from the application layer to the hardware.

• Extensive hands-on experience with NCCL, CUDA-aware distributed execution, and troubleshooting multi-GPU and multi-node workloads at scale.

• Proven experience in designing, debugging, and scaling large-scale distributed systems.

• Expert-level programming skills in Python and C/C++.

• Familiarity with operating workloads in scheduled, containerized cluster environments.

• Exceptional analytical, debugging, and communication skills, with the ability to influence across teams.


🏝️ Benefits

• Equity

• Benefits
