
HPC System Engineer
Posted 23 hours ago

Posted 23 hours ago
This is a fully remote position, open to applicants in Netherlands.
• Collaborate extensively with hardware and development teams to evaluate and analyze GPU performance at both the system and kernel levels.
• Assess and compare GPU performance across various platforms, architectures, and software stacks, including CUDA and ROCm.
• Conduct acceptance testing for new GPU clusters, ensuring that both hardware and software align with performance, stability, and compatibility standards for AI workloads.
• Execute experiments on various GPU system configurations to determine the effects of different interconnect strategies and system-level optimizations on performance and scalability.
• Strong proficiency in Unix/Linux, along with skills in Python and Bash for automation tasks.
• Solid understanding of the GPU stack, including CUDA, NCCL, drivers, and associated libraries.
• Demonstrated capability to troubleshoot complex system challenges encompassing hardware, software, and networking issues.
• Experience with containerized environments, such as Docker and Kubernetes.
• Competitive compensation package.
• Opportunities for career growth and professional development.
• Flexibility that promotes work-life balance.
• A culture that fosters collaboration and innovation.
• The chance to contribute to meaningful AI projects.
• An international environment with skilled teams.
Jellyfish
ScalableOS
Pragmatike
Get handpicked remote jobs straight to your inbox weekly.