
Operations Engineer, HPC Networking
Posted 18 hours ago

• Oversee the health and performance of InfiniBand and Ethernet networks, including switches, HCAs, transceivers, and links.
• Analyze and resolve fabric-related issues such as connectivity problems, congestion, and performance regressions.
• Assist in fabric deployment in collaboration with data center operations and customer-facing teams.
• Conduct maintenance and upgrades on switches and control plane elements.
• Collaborate with cluster operations on cross-domain incidents where the boundary between compute and network is unclear.
• Improve tools and runbooks so that future incidents are resolved faster.
• Experience operating InfiniBand fabrics in a production environment, including subnet management, routing, partitioning, and monitoring.
• Proficient in debugging the entire stack, from cables and transceivers to switch firmware, HCAs, drivers, and NCCL.
• Track record of bringing up new fabrics, from cable installation through validation.
• Experience writing scripts to automate repetitive operational tasks (using Bash, Python, Go, or similar languages).
• Preferred experience: Ethernet RoCE, Spectrum-X, or large-scale GPU cluster networking.
• Comprehensive health and wellness programs.
• Opportunities for professional development and training.
• Flexible work arrangements to promote work-life balance.
• Competitive salary and performance-based incentives.