
Senior Production Engineer – DGX Cloud
Posted 11 hours ago

This is a fully remote position, open to applicants in California, +4 more states.
• Join the DGX Cloud team, where you will be responsible for the production systems that facilitate the use of large, scalable GPU clusters for various AI workloads.
• Your role will involve developing custom software for GPU asset provisioning, configuration, and lifecycle management across different cloud providers.
• You will implement monitoring and health management features to ensure industry-leading reliability, availability, and scalability of GPU assets.
• This position will require you to integrate multiple data streams, including GPU hardware diagnostics and cluster and network telemetry.
• Collaborate with teams across NVIDIA to ensure that production AI clusters operate reliably and efficiently with optimal performance.
• You will assess system failures and enhance services through a well-defined incident management process.
• Proven experience in a Production Engineering, DevOps, or SRE role within a highly technical organization, with demonstrable impact from your contributions.
• A self-driven individual with strong communication abilities, capable of effectively collaborating with multi-functional teams, principals, and architects while coordinating across organizational boundaries and geographies.
• Over 8 years of experience in a similar role, particularly with large-scale production systems.
• Familiarity with the principles, tools, and techniques related to Production Engineering, DevOps, and SRE.
• A Bachelor’s degree in Computer Science, Engineering, Physics, Mathematics, or a related field, or equivalent experience.
• Technical expertise, including proficiency in a systems programming language (such as Go or Python) and a strong grasp of data structures and algorithms.
• Equity
• Comprehensive benefits