This is a fully remote position, open to applicants in Poland.

📋 Description

• Oversee the entire lifecycle of GPU compute clusters — including procurement, provisioning, configuration management, monitoring, and deprecation — within diverse Linux environments (DGX, HGX, embedded systems).

• Develop and scale storage solutions (NFS, Lustre, WekaFS, or similar) with a well-defined roadmap for capacity and performance enhancement.

• Drive the automation of infrastructure utilizing modern Infrastructure as Code (IaC) tools (Ansible, Terraform) and CI/CD pipelines (GitLab).

• Manage and enhance job scheduling using Slurm, which includes fair-share policies, reservation management, and MIG/GPU partitioning strategies.

• Sustain and enhance observability stacks (Prometheus, Grafana, DCGM) while proactively resolving hardware and software incidents.

• Collaborate with ML engineers and software teams to optimize cluster configurations for extensive distributed training workloads.

• Assess and implement new technologies — networking fabrics (InfiniBand, NVLink, EFA/RDMA), storage tiers, container runtimes — to boost performance and reliability.

• Mentor junior engineers and contribute to the establishment of team-wide engineering standards.

⛳️ Requirements

• BS/MS in Computer Science, Electrical Engineering, Computer Engineering, or equivalent practical experience.

• Over 5 years of experience in deploying and managing large-scale HPC or ML training clusters.

• Profound expertise in Linux systems administration at scale.

• Strong scripting and automation capabilities in Python and/or Bash.

• Practical experience with Slurm (scheduling, accounting, cgroup configuration).

• Proficient in configuration management and IaC (Ansible is required; Terraform is a plus).

• Familiarity with container technologies (Docker, Apptainer/Singularity, Kubernetes).

• Solid understanding of high-speed networking (InfiniBand, RoCE, RDMA, EFA).

• Experience with distributed/parallel filesystems and storage architectures.

• Ability to independently manage problems from start to finish and communicate effectively with engineering and management stakeholders.

🏝️ Benefits

• Health insurance

• Professional development opportunities

Senior HPC Cluster Administrator – Deep Learning Frameworks Infrastructure

📋 Description

⛳️ Requirements

🏝️ Benefits

People also viewed

Senior NetSuite Technical Administrator

Administrador/a de Sistemas IMS, z/OS

Data Administrator

Lead OS / Infrastructure Administrator

Talent Administrator, Coordinator

Anwendungsbetreuer Dokumentenmanagementsystem, DMS Administrator

Never miss a great job!