
Operations Engineer, HPC Networking
Posted 1 day ago

This is a fully remote position, open to applicants in the United States.
• Oversee the health and performance of InfiniBand and Ethernet fabrics including switches, HCAs, transceivers, and links.
• Analyze and resolve fabric-related issues such as connectivity, congestion, and performance regressions.
• Collaborate with DC operations and customer-facing teams to support fabric bring-up.
• Conduct maintenance and upgrades on switches and control plane components.
• Work alongside cluster operations on cross-domain incidents where the distinction between compute and network is unclear.
• Improve tooling and runbooks so that each incident is resolved faster than the last.
• Experience operating InfiniBand fabrics in a production environment, including subnet management, routing, partitioning, and monitoring.
• Proficient in debugging the entire stack, including cables, transceivers, switch firmware, HCAs, drivers, and NCCL.
• Track record of bringing up new fabrics, from cable installation through validation.
• Capable of automating repetitive operational tasks using scripting languages such as Bash, Python, Go, or others.
• Preferred: Familiarity with Ethernet RoCE, Spectrum-X, or large-scale GPU cluster networking.
• Competitive salary and comprehensive benefits package.
• Opportunities for professional development and career growth.
• Collaborative and innovative work environment.
• Flexibility in work arrangements and schedules.