Remotery

Senior Software Engineer – NVLink Rack Scale Stability and Reliability

Posted 2 hours ago

This is a fully remote position, open to applicants in Arizona, +3 more states.

📋 Description

• Lead the platform bring-up, feature activation, comprehensive software validation, and troubleshooting for cutting-edge NVLink-based GPU and rack-scale systems.

• Create tools, diagnostics, automation, and infrastructure for system validation, regression testing, and fleet maintenance.

• Direct reliability and MTBI validation through stress testing, telemetry evaluation, failure injection, and problem resolution.

• Analyze intricate software, firmware, networking, and platform challenges across validation, deployment, and production settings.

• Work alongside architecture, hardware, firmware, software, and customer engagement teams to enhance system quality and reliability.

• Develop and sustain SRE-style validation infrastructure, encompassing provisioning, monitoring, and operational readiness.

• Design automation, dashboards, runbooks, and debugging workflows that enhance root-cause analysis and operational efficiency.


⛳️ Requirements

• BS or MS in Computer Science, Computer Engineering, Electrical Engineering, or a related field, or equivalent experience.

• More than 5 years of experience in system software, firmware, networking, platform enablement, data center infrastructure, or distributed systems.

• Strong programming skills in C/C++ and Python; experience with Bash/shell scripting is a plus.

• Strong system-level debugging capabilities across software, firmware, hardware, and networking layers.

• Solid understanding of networking fundamentals, including TCP/IP, Ethernet and/or InfiniBand, RDMA/RoCE, routing, switching, and fabric performance analysis.

• Experience with large-scale AI systems, including platform bring-up, validation, reliability engineering, stress testing, telemetry analysis, and root-cause debugging.

• Ability to diagnose complex multi-domain issues using logs, telemetry, experiments, and structured debugging techniques.

• Excellent communication and collaboration skills with engineering, customer, and operations teams.

• Enthusiasm for developing reliable next-generation AI infrastructure and addressing complex system-level challenges at scale.


🏝️ Benefits

• Eligible for equity and benefits
