Remotery

HPC Cluster Architect

Posted 10 hours ago

This is a fully remote position, open to applicants in the United Kingdom.

📋 Description

• Take responsibility for the entire cluster architecture for large-scale NVIDIA GPU implementations, from understanding customer requirements through rack layouts, bills of materials, and power and cooling designs to production handover.

• Develop high-performance network fabrics for compute (InfiniBand, RDMA, NVLink/NVSwitch), storage, and WAN — defining topologies, oversubscription models, and strategies for scaling.

• Collaborate directly with OEMs and vendors — verifying hardware configurations, assessing quotes, and ensuring that designs are both technically robust and commercially viable.

• Offer technical guidance during deployment and bring-up — assisting with hardware validation, performance testing, and serving as an escalation point for intricate integration challenges.

• Function as a senior technical authority across Solutions Architecture, Cloud Engineering, and data center partners — contributing to standardized reference designs and enhancing the HPC engineering function.


⛳️ Requirements

• Demonstrated experience in designing and delivering GPU-based HPC or AI clusters at scale — encompassing the entire lifecycle from design through procurement, deployment, and validation.

• Extensive hands-on expertise with NVIDIA GPU platforms (H100/H200/B-series) and NVIDIA reference architectures.

• Strong experience in InfiniBand/RDMA design, including topology and performance tuning, as well as high-performance Ethernet fabrics.

• Solid understanding of Linux systems, PCIe topology, NUMA alignment, and server-level performance considerations.

• Experience from an OEM, hyperscaler, neo-cloud, or enterprise/research HPC environment — with proven exposure to the complete design-to-deployment lifecycle.

• Ability to confidently engage with customers, vendors, OEMs, and internal engineering teams as a technical expert — capable of translating complex design trade-offs into straightforward decisions.

• Familiarity with Spectrum-X or next-generation Ethernet fabrics (Nice to Have).

• Previous involvement in large-scale cluster deployments (1,000+ GPUs) and performance benchmarking (NCCL, MLPerf) (Nice to Have).

• Exposure to both air-cooled and liquid-cooled HPC environments, and/or automation/infrastructure-as-code (Nice to Have).


🏝️ Benefits

• Competitive salary and annual discretionary bonus scheme.

• Employee wellbeing benefits.

• 25 days of holiday, in addition to public holidays.

• Flexible working arrangements (remote or hybrid, depending on role and location).

• Genuine ownership and autonomy, with the freedom to take initiative and experiment.

• The chance to make a visible and meaningful impact as we scale.

• Clear career progression and growth opportunities in a rapidly growing company.

• A collaborative, international culture founded on trust, transparency, and ownership.

• The opportunity to help shape NexGen Cloud’s team, culture, and future alongside ambitious, mission-driven colleagues.
