📋 Description

• Optimizing the performance of GPU clusters and InfiniBand networks to guarantee peak operation in HPC and GPU-centric settings.

• Diagnosing the root causes of GPU and InfiniBand network problems and recommending corrective measures.

• Incorporating new hardware into the current infrastructure, which includes enabling support for new GPU hardware via software stacks such as Kubernetes, QEMU, and KVM.

• Advancing automation systems that proactively monitor, identify, and resolve issues in GPU and InfiniBand environments.

• Configuring and overseeing GPU devices and InfiniBand fabrics to ensure effective and dependable operation.

⛳️ Requirements

• Over 5 years of professional experience in system-level software development, emphasizing performance optimization and low-level programming.

• More than 3 years of practical experience with Linux systems, including administration, troubleshooting, and performance enhancement.

• Comprehensive knowledge of server architecture, encompassing PCIe devices, NICs, Linux OS/Kernel, and high-performance computing (HPC) systems.

• Strong expertise in one or more performance-driven programming languages (C/C++, Go, Python).

🏝️ Benefits

• Competitive salary and an extensive benefits package.

• Opportunities for professional advancement within Nebius.

• Flexible working options.

• A vibrant and collaborative work atmosphere that appreciates initiative and innovation.

Senior HPC Cluster Engineer
