
HPC Cluster Architect
Posted 10 hours ago

This is a fully remote position, open to applicants in the United Kingdom.
• Take responsibility for the entire cluster architecture for large-scale NVIDIA GPU implementations — from understanding customer requirements to creating rack layouts, bill of materials, power and cooling designs, and handing over to production.
• Develop high-performance network fabrics for compute (InfiniBand, RDMA, NVLink/NVSwitch), storage, and WAN — defining topologies, oversubscription models, and strategies for scaling.
• Collaborate directly with OEMs and vendors — verifying hardware configurations, assessing quotes, and ensuring that designs are both technically robust and commercially viable.
• Offer technical guidance during deployment and bring-up — assisting with hardware validation, performance testing, and serving as an escalation point for intricate integration challenges.
• Function as a senior technical authority across Solutions Architecture, Cloud Engineering, and data center partners — contributing to standardized reference designs and enhancing the HPC engineering function.
• Demonstrated experience in designing and delivering GPU-based HPC or AI clusters at scale — encompassing the entire lifecycle from design through procurement, deployment, and validation.
• Extensive hands-on expertise with NVIDIA GPU platforms (H100/H200/B-series) and NVIDIA reference architectures.
• Strong experience in InfiniBand/RDMA design, including topology and performance tuning, and in high-performance Ethernet fabrics.
• Solid understanding of Linux systems, PCIe topology, NUMA alignment, and server-level performance considerations.
• Experience from an OEM, hyperscaler, neo-cloud, or enterprise/research HPC environment — with proven exposure to the complete design-to-deployment lifecycle.
• Ability to confidently engage with customers, vendors, OEMs, and internal engineering teams as a technical expert — capable of translating complex design trade-offs into straightforward decisions.
• Familiarity with Spectrum-X or next-generation Ethernet fabrics (Nice to Have).
• Previous involvement in large-scale cluster deployments (1,000+ GPUs) and performance benchmarking (NCCL, MLPerf) (Nice to Have).
• Exposure to both air-cooled and liquid-cooled HPC environments, and/or automation/infrastructure-as-code (Nice to Have).
• Competitive salary and annual discretionary bonus scheme.
• Employee wellbeing benefits.
• 25 days of holiday, in addition to public holidays.
• Flexible working arrangements (remote or hybrid, depending on role and location).
• Genuine ownership and autonomy, with the freedom to take initiative and experiment.
• The chance to make a visible and meaningful impact as we scale.
• Clear career progression and growth opportunities in a rapidly growing company.
• A collaborative, international culture founded on trust, transparency, and ownership.
• The opportunity to help shape NexGen Cloud’s team, culture, and future alongside ambitious, mission-driven colleagues.