
Principal Software Engineer – Distributed Systems Engineer, DGX Cloud
Posted 4 hours ago

• You will join the DGX Cloud team, which is responsible for the production systems that run large, scalable GPU clusters for a variety of AI workloads.
• This role involves working on specialized software related to the scheduling of GPU resources on Kubernetes.
• You will be implementing monitoring and health management features that ensure industry-leading reliability, availability, and scalability of GPU assets.
• Your responsibilities will include managing multiple data streams, encompassing GPU hardware diagnostics as well as cluster and network telemetry.
• Collaboration with teams across NVIDIA will be essential to ensure that production AI clusters operate reliably and consistently at peak performance.
• You will assess system failures and enhance services based on a well-defined incident management procedure.
• Proven experience in a software engineering role within a highly technical environment, demonstrating significant impact from your contributions.
• Software development expertise with Kubernetes APIs and frameworks, going beyond merely operating a cluster.
• Highly driven with excellent communication skills, capable of working effectively with multi-functional teams, principals, and architects, while coordinating across organizational boundaries and geographies.
• Over 15 years of experience in a similar position, particularly with large-scale production systems.
• Familiarity with standard software engineering principles, tools, and techniques.
• A BS in Computer Science, Engineering, Physics, Mathematics, or a comparable degree, or equivalent experience.
• Technical proficiency, including knowledge of a systems programming language (Go, Python) and a solid grasp of data structures and algorithms.
• Equity
• Benefits