
Principal Software Engineer – Distributed Systems Engineer, DGX Cloud
Posted Jun 20

This is a fully remote position, open to applicants in North Carolina.
• Join the DGX Cloud team, where you will contribute to production systems that support large, scalable GPU clusters for diverse AI workloads.
• Your role will involve developing custom software aimed at optimizing GPU resource scheduling on Kubernetes.
• You will implement monitoring and health management features to ensure exceptional reliability, availability, and scalability of GPU resources.
• The position requires you to manage multiple data streams, including GPU hardware diagnostics and cluster and network telemetry.
• Collaborate with teams across NVIDIA to guarantee that production AI clusters operate reliably and consistently at peak performance.
• Assess system failures and enhance services following a clearly defined incident management process.
• Proven experience in a software engineering position within a highly technical organization, demonstrating the impact of your contributions.
• Proficiency in software development using Kubernetes APIs and frameworks, rather than merely managing a cluster.
• A highly motivated individual with excellent communication skills, able to collaborate with cross-functional teams, principals, and architects and to coordinate effectively across organizational boundaries and locations.
• At least 15 years of experience in a similar role, particularly with large-scale production systems.
• Familiarity with standard software engineering principles, tools, and techniques.
• A Bachelor’s degree in Computer Science, Engineering, Physics, Mathematics, or a comparable discipline, or equivalent experience.
• Technical expertise, including knowledge of a systems programming language (Go, Python) and a strong understanding of data structures and algorithms.
• Equity
• Benefits