Remotery

Senior HPC AI Cluster Engineer

Posted May 24

This is a fully remote position, open to applicants in Germany.

📋 Description

• Design, implement, and oversee large-scale HPC/AI clusters, ensuring effective monitoring, logging, and alerting mechanisms.

• Manage job and workload scheduling on Linux systems, along with the associated orchestration tools.

• Develop and maintain pipelines for continuous integration and delivery.

• Create tools to automate the deployment and management of extensive infrastructure environments, facilitate operational monitoring and alerting, and support self-service resource consumption.

• Implement monitoring solutions for servers, networks, and storage systems.

• Conduct troubleshooting from the ground up, addressing issues from bare metal through the operating system, software stack, and application levels.

• Serve as a technical resource by developing, refining, and documenting best practices to share with internal teams.

• Assist in Research & Development efforts and participate in proofs of concept (POCs) and proofs of value (POVs) for future advancements.


⛳️ Requirements

• A degree in Computer Science, Engineering, or a related field with 8+ years of relevant experience.

• Understanding of HPC and AI solution technologies, including CPUs, GPUs, high-speed interconnects, and supporting software.

• Experience with workload scheduling and orchestration tools such as Slurm and Kubernetes.

• Strong expertise in Windows and Linux (Red Hat/CentOS and Ubuntu) networking, including sockets, firewalld, iptables, Wireshark, and OS-level security protections, as well as common protocols like TCP, DHCP, and DNS.

• Familiarity with various storage solutions, including Lustre, GPFS, and Weka.io, along with knowledge of emerging storage technologies.

• Proficiency in Python programming and Bash scripting.

• Comfortable using automation and configuration management tools such as Jenkins, Ansible, and Puppet/Chef.

• In-depth knowledge of networking protocols including InfiniBand and Ethernet.

• Strong understanding of and experience with virtualization technologies (e.g., VMware, Hyper-V, KVM, or Citrix).

• Familiarity with cloud computing platforms such as AWS, Azure, or Google Cloud.


🏝️ Benefits

• We are an equal opportunity employer and value diversity at our company. We do not discriminate based on race, religion, color, national origin, sex, gender, gender expression, sexual orientation, age, marital status, veteran status, or disability status. We ensure that individuals with disabilities are provided reasonable accommodations to participate in the job application or interview process, to perform essential job functions, and to receive other benefits and privileges of employment. Please contact us to request accommodation.
