
ML Infrastructure Engineer
Posted 1 day ago

Responsibilities
• Collaborate closely with hardware and development teams to assess and analyze GPU performance at both the system and kernel levels.
• Assess and compare GPU performance across various platforms, architectures, and software stacks (such as CUDA and ROCm).
• Debug and optimize machine learning workloads to ensure efficient execution on GPU hardware, identifying and addressing performance bottlenecks.
• Conduct acceptance testing for new GPU clusters, verifying that hardware and software meet the performance, stability, and compatibility criteria for AI workloads.
• Execute experiments across different GPU system configurations to evaluate the effects of varying interconnect strategies and system-level optimizations on performance and scalability.
• Create tools and dashboards for visualizing performance metrics, identifying bottlenecks, and tracking trends.
• Contribute to the development of internal tools, frameworks, and best practices.
Requirements
• A strong grasp of the theoretical foundations underlying machine learning.
• In-depth knowledge of performance considerations for training and inference in large neural networks (including data/tensor/context/expert parallelism, offloading, custom kernels, hardware features, attention optimizations, and dynamic batching).
• Extensive experience with contemporary deep learning frameworks (such as PyTorch, JAX, Megatron-LM, and TensorRT-LLM).
• Solid understanding of the GPU stack, including CUDA, NCCL, drivers, and pertinent libraries.
• Familiarity with containerized environments (e.g., Docker and Kubernetes).
• Excellent communication skills and the ability to work independently.
Benefits
• Competitive salary.
• Opportunities for career advancement and professional development.
• Flexibility and emphasis on work-life balance.
• A collaborative and innovative workplace culture.
• Chance to engage in impactful AI projects.
• An international environment with skilled teams.