
Senior HPC AI Cluster Engineer
Posted May 24

This is a fully remote position, open to applicants in Germany.
• Design, implement, and oversee large-scale HPC/AI clusters, ensuring effective monitoring, logging, and alerting mechanisms.
• Manage job and workload scheduling and orchestration tools on Linux systems.
• Develop and maintain pipelines for continuous integration and delivery.
• Create tools to automate the deployment and management of extensive infrastructure environments, facilitate operational monitoring and alerting, and support self-service resource consumption.
• Implement monitoring solutions for servers, networks, and storage systems.
• Conduct troubleshooting from the ground up, addressing issues from bare metal through the operating system, software stack, and application levels.
• Serve as a technical resource by developing, refining, and documenting best practices to share with internal teams.
• Assist in research and development efforts and participate in proofs of concept (POCs) and proofs of value (POVs) for future advancements.
• A degree in Computer Science, Engineering, or a related field with 8+ years of relevant experience.
• Understanding of HPC and AI solution technologies, including CPUs, GPUs, high-speed interconnects, and supporting software.
• Experience with workload scheduling and orchestration tools such as Slurm and Kubernetes.
• Strong expertise in Windows and Linux (Red Hat/CentOS and Ubuntu) networking, including sockets, firewalld, iptables, Wireshark, and OS-level security protections, as well as common protocols such as TCP, DHCP, and DNS.
• Familiarity with various storage solutions, including Lustre, GPFS, and Weka.io, along with knowledge of emerging storage technologies.
• Proficiency in Python programming and Bash scripting.
• Comfortable using automation and configuration management tools such as Jenkins, Ansible, and Puppet/Chef.
• In-depth knowledge of networking technologies, including InfiniBand and Ethernet.
• Strong understanding of and experience with virtualization technologies (e.g., VMware, Hyper-V, KVM, or Citrix).
• Familiarity with cloud computing platforms such as AWS, Azure, or Google Cloud.
We are an equal opportunity employer and value diversity at our company. We do not discriminate based on race, religion, color, national origin, sex, gender, gender expression, sexual orientation, age, marital status, veteran status, or disability status. We ensure that individuals with disabilities are provided reasonable accommodations to participate in the job application or interview process, to perform essential job functions, and to receive other benefits and privileges of employment. Please contact us to request accommodation.