
Senior HPC Cluster Administrator - Deep Learning Frameworks Infrastructure
Posted May 24

This is a fully remote position, open to applicants in Poland.
• Oversee the entire lifecycle of GPU compute clusters, including procurement, provisioning, configuration management, monitoring, and decommissioning, within diverse Linux environments (DGX, HGX, embedded systems).
• Develop and scale storage solutions (NFS, Lustre, WekaFS, or similar) with a well-defined roadmap for capacity and performance improvements.
• Automate infrastructure using modern Infrastructure as Code (IaC) tools (Ansible, Terraform) and CI/CD pipelines (GitLab).
• Manage and improve job scheduling with Slurm, including fair-share policies, reservation management, and MIG/GPU partitioning strategies.
• Maintain and improve observability stacks (Prometheus, Grafana, DCGM) while proactively resolving hardware and software incidents.
• Collaborate with ML engineers and software teams to optimize cluster configurations for large-scale distributed training workloads.
• Assess and implement new technologies, such as networking fabrics (InfiniBand, NVLink, EFA/RDMA), storage tiers, and container runtimes, to boost performance and reliability.
• Mentor junior engineers and help establish team-wide engineering standards.
• BS/MS in Computer Science, Electrical Engineering, Computer Engineering, or equivalent practical experience.
• 5+ years of experience deploying and managing large-scale HPC or ML training clusters.
• Deep expertise in Linux systems administration at scale.
• Strong scripting and automation skills in Python and/or Bash.
• Practical experience with Slurm (scheduling, accounting, cgroup configuration).
• Proficiency in configuration management and IaC (Ansible is required; Terraform is a plus).
• Familiarity with container technologies (Docker, Apptainer/Singularity, Kubernetes).
• Solid understanding of high-speed networking (InfiniBand, RoCE, RDMA, EFA).
• Experience with distributed/parallel filesystems and storage architectures.
• Ability to own problems independently from start to finish and communicate effectively with engineering and management stakeholders.
• Health insurance
• Professional development opportunities