
Senior Software Engineer - NVLink Rack Scale Stability and Reliability
Posted 2 hours ago

This is a fully remote position, open to applicants in Arizona and 3 more states.
• Lead the platform bring-up, feature activation, comprehensive software validation, and troubleshooting for cutting-edge NVLink-based GPU and rack-scale systems.
• Create tools, diagnostics, automation, and infrastructure for system validation, regression testing, and fleet maintenance.
• Direct reliability and MTBI validation through stress testing, telemetry evaluation, failure injection, and problem resolution.
• Analyze intricate software, firmware, networking, and platform challenges across validation, deployment, and production settings.
• Work alongside architecture, hardware, firmware, software, and customer engagement teams to enhance system quality and reliability.
• Develop and sustain SRE-style validation infrastructure, encompassing provisioning, monitoring, and operational readiness.
• Design automation, dashboards, runbooks, and debugging workflows that enhance root-cause analysis and operational efficiency.
• BS or MS in Computer Science, Computer Engineering, Electrical Engineering, or a related field, or equivalent experience.
• Over 5 years of experience in system software, firmware, networking, platform enablement, data center infrastructure, or distributed systems.
• Proficient programming skills in C/C++ and Python; experience in Bash/Shell scripting is advantageous.
• Strong system-level debugging capabilities across software, firmware, hardware, and networking layers.
• Solid understanding of networking fundamentals, including TCP/IP, Ethernet and/or InfiniBand, RDMA/RoCE, routing, switching, and fabric performance analysis.
• Experience with large-scale AI systems, including platform bring-up, validation, reliability engineering, stress testing, telemetry analysis, and root-cause debugging.
• Ability to diagnose complex multi-domain issues utilizing logs, telemetry, experiments, and structured debugging techniques.
• Excellent communication and collaboration skills with engineering, customer, and operations teams.
• Enthusiasm for developing reliable next-generation AI infrastructure and addressing complex system-level challenges at scale.
• Eligible for equity and benefits