Remotery

Senior HPC Cluster Administrator – Deep Learning Frameworks Infrastructure

Posted May 24

This is a fully remote position, open to applicants in Poland.

📋 Description

• Oversee the entire lifecycle of GPU compute clusters, from procurement, provisioning, configuration management, and monitoring through decommissioning, across diverse Linux environments (DGX, HGX, embedded systems).

• Develop and scale storage solutions (NFS, Lustre, WekaFS, or similar) with a well-defined roadmap for capacity and performance improvements.

• Drive infrastructure automation using modern Infrastructure as Code (IaC) tools (Ansible, Terraform) and CI/CD pipelines (GitLab).

• Manage and improve job scheduling with Slurm, including fair-share policies, reservation management, and MIG/GPU partitioning strategies.

• Maintain and improve observability stacks (Prometheus, Grafana, DCGM) while proactively resolving hardware and software incidents.

• Collaborate with ML engineers and software teams to optimize cluster configurations for large-scale distributed training workloads.

• Assess and implement new technologies, including networking fabrics (InfiniBand, NVLink, EFA/RDMA), storage tiers, and container runtimes, to boost performance and reliability.

• Mentor junior engineers and help establish team-wide engineering standards.


⛳️ Requirements

• BS/MS in Computer Science, Electrical Engineering, Computer Engineering, or equivalent practical experience.

• 5+ years of experience deploying and managing large-scale HPC or ML training clusters.

• Deep expertise in Linux systems administration at scale.

• Strong scripting and automation skills in Python and/or Bash.

• Practical experience with Slurm (scheduling, accounting, cgroup configuration).

• Proficiency in configuration management and IaC (Ansible is required; Terraform is a plus).

• Familiarity with container technologies (Docker, Apptainer/Singularity, Kubernetes).

• Solid understanding of high-speed networking (InfiniBand, RoCE, RDMA, EFA).

• Experience with distributed/parallel filesystems and storage architectures.

• Ability to own problems independently from start to finish and communicate effectively with engineering and management stakeholders.


🏝️ Benefits

• Health insurance

• Professional development opportunities
