
Senior HPC Cluster Engineer
Posted May 25

Posted May 25
This is a fully remote position, open to applicants in Czechia.
• Optimizing the performance of GPU clusters and InfiniBand networks to guarantee peak efficiency in HPC and GPU-centric settings.
• Investigating and diagnosing the root causes of issues associated with GPUs and InfiniBand networks, while recommending corrective measures.
• Incorporating new hardware into the current infrastructure, which includes supporting new GPU hardware via software frameworks like Kubernetes, QEMU, and KVM.
• Improving automation systems for proactive monitoring, identifying, and resolving issues within GPU and InfiniBand environments.
• Configuring and overseeing GPU devices and InfiniBand fabrics to ensure efficient and dependable performance.
• Over 5 years of professional experience in system-level software development with a focus on performance optimization and low-level programming.
• More than 3 years of practical experience with Linux systems, including administration, troubleshooting, and performance enhancement.
• Comprehensive understanding of server architecture, particularly PCIe devices, NICs, the Linux OS/Kernel, and high-performance computing (HPC) systems.
• Strong expertise in one or more performance-centric programming languages such as C/C++, Go, or Python.
• Competitive salary along with a comprehensive benefits package.
• Opportunities for professional advancement within Nebius.
• Flexible working arrangements available.
• A vibrant and collaborative work atmosphere that encourages initiative and innovation.
Akka (formerly Lightbend)
Swimlane
Get handpicked remote jobs straight to your inbox weekly.