
Senior Software Engineer, DGX Cloud Production Engineering
Posted 1 day ago

This is a fully remote position, open to applicants in California.
• Design and manage automation for large-scale GPU clusters across NVIDIA Cloud Partners (NCP) and on-premises environments.
• Create tools and services for provisioning, validation, upgrades, monitoring, maintenance, and the overall lifecycle management of clusters.
• Enhance Day 0, Day 1, and Day 2 workflows related to cluster deployment, transitions, and operational procedures.
• Minimize manual interventions in production through the use of APIs, GitOps, automation, and agent-assisted workflows.
• Participate in on-call rotations, incident management, troubleshooting, and thorough post-incident follow-up.
• Collaborate with platform, storage, networking, security, and workload teams to ensure infrastructure is ready for production.
• Over 8 years of experience in building or managing production infrastructure.
• Strong programming skills in Python, Go, or a similar language.
• Familiarity with Linux, Kubernetes, containers, cloud infrastructure, or infrastructure automation.
• Capable of troubleshooting distributed systems in a production environment.
• Excellent communication skills and ability to collaborate across various teams.
• BS/MS degree in Computer Science or equivalent professional experience.
• Equity
• Benefits