
Senior HPC Cluster Engineer
Posted 1 hour ago

Posted 1 hour ago
• Optimizing the performance of GPU clusters and InfiniBand networks to guarantee peak operation in HPC and GPU-centric settings.
• Investigating and diagnosing the root causes of problems associated with GPUs and InfiniBand networks, along with recommending corrective measures.
• Incorporating new hardware into the current infrastructure, which includes enabling support for new GPU hardware via software stacks such as Kubernetes, QEMU, and KVM.
• Advancing automation systems for proactive monitoring, identifying, and resolving challenges in GPU and InfiniBand environments.
• Configuring and overseeing GPU devices and InfiniBand fabrics to ensure effective and dependable operation.
• Over 5 years of professional experience in system-level software development, emphasizing performance optimization and low-level programming.
• More than 3 years of practical experience with Linux systems, including administration, troubleshooting, and performance enhancement.
• Comprehensive knowledge of server architecture, encompassing PCIe devices, NICs, Linux OS/Kernel, and high-performance computing (HPC) systems.
• Strong expertise in one or more performance-driven programming languages (C/C++, Go, Python).
• Competitive salary and an extensive benefits package.
• Opportunities for professional advancement within Nebius.
• Flexible working options.
• A vibrant and collaborative work atmosphere that appreciates initiative and innovation.
SANEARES
Deep Digital Solutions Group
Get handpicked remote jobs straight to your inbox weekly.