
Senior Linux Kernel Engineer – High-Performance Computing
Posted May 25

This is a fully remote position, open to applicants in the Netherlands.
• Optimizing the performance of clusters and InfiniBand networks to guarantee peak functionality in HPC and GPU-centric environments.
• Investigating and diagnosing the underlying causes of issues pertaining to GPUs and InfiniBand networks, and recommending corrective measures.
• Incorporating new hardware into the current infrastructure, including enabling support for new GPU hardware via software stacks such as Kubernetes, QEMU, and KVM.
• Advancing automation systems that proactively monitor, identify, and resolve issues in GPU and InfiniBand environments.
• Setting up and overseeing GPU devices and InfiniBand fabrics to ensure effective and dependable operation.
• Over 5 years of professional experience in system-level software development, emphasizing performance optimization and low-level programming.
• More than 3 years of practical experience with Linux systems, including administration, troubleshooting, and/or performance tuning.
• Proficient with essential tools for kernel profiling and tuning, including perf, ftrace, and (e)BPF.
• Comprehensive knowledge of server architecture, encompassing PCIe devices, NICs, Linux OS/Kernel, etc.
• Strong command of one or more performance-focused programming languages such as C/C++, Go, or Python.
It would be advantageous (though not essential) if you possess:
• Experience in GPU end-to-end testing within a cluster setup utilizing InfiniBand networking.
• A proven history of analyzing and enhancing the performance of HPC workloads, including simulations, data analysis, and AI/ML tasks.
• Familiarity with RDMA, RoCE, and InfiniBand protocols for high-performance communication.
• Background in Software-Defined Networking (SDN) along with experience in HPC cluster networking.
• Understanding of QEMU/KVM virtualization and management of virtualized environments.
• Experience with deep learning frameworks like PyTorch and TensorFlow, and their integration into HPC systems.
• Knowledge of collective communication libraries such as MPI and NCCL for distributed computing.
• Flexible working arrangements
• A dynamic and collaborative work environment that encourages initiative and innovation.