
ML Infrastructure Engineer

at Nebius Group · California, US · Full-time · Infrastructure Engineer · Mid-level, Senior

Posted 1 day ago

📋 Description

• Collaborate closely with hardware and development teams to assess and analyze GPU performance at both the system and kernel levels.

• Assess and compare GPU performance across various platforms, architectures, and software stacks (such as CUDA and ROCm).

• Debug and optimize machine learning workloads to ensure efficient execution on GPU hardware, identifying and addressing performance bottlenecks.

• Conduct acceptance testing for new GPU clusters, verifying that hardware and software meet the performance, stability, and compatibility criteria for AI workloads.

• Execute experiments across different GPU system configurations to evaluate the effects of varying interconnect strategies and system-level optimizations on performance and scalability.

• Create tools and dashboards for visualizing performance metrics, identifying bottlenecks, and tracking trends.

• Contribute to the development of internal tools, frameworks, and best practices.


⛳️ Requirements

• A strong grasp of the theoretical foundations underlying machine learning.

• In-depth knowledge of performance considerations for training and inference in large neural networks (including data/tensor/context/expert parallelism, offloading, custom kernels, hardware features, attention optimizations, and dynamic batching).

• Extensive experience with contemporary deep learning frameworks (such as PyTorch, JAX, Megatron-LM, and TensorRT-LLM).

• Solid understanding of the GPU stack, including CUDA, NCCL, drivers, and pertinent libraries.

• Familiarity with containerized environments (e.g., Docker and Kubernetes).

• Excellent communication skills and the ability to work independently.


🏝️ Benefits

• Competitive salary.

• Opportunities for career advancement and professional development.

• Flexibility and emphasis on work-life balance.

• A collaborative and innovative workplace culture.

• Chance to engage in impactful AI projects.

• An international environment with skilled teams.
