
Senior HPC Software Engineer
Posted 2 days ago

This is a fully remote position, open to applicants in Michigan.
• Manage, troubleshoot, and enhance RHEL-based high-performance computing environments that support CPU and GPU workloads.
• Design and sustain HPC services encompassing compute, storage, networking, scheduling, Kubernetes, and observability.
• Create tools, scripts, APIs, integrations, and automation utilizing Python, Go, Bash, or similar programming languages.
• Implement software engineering best practices, including Git workflows, code reviews, testing, modular design, and CI/CD methodologies.
• Assist in updating HPC scheduling environments; Slurm experience is preferred.
• Enhance monitoring, alerting, dashboards, and operational visibility through tools such as Grafana, Prometheus, Dynatrace, and others.
• Collaborate with users, customers, and internal engineering teams to understand requirements, resolve issues, and improve platform usability.
• Develop and maintain documentation, architecture notes, user manuals, and operational procedures.
• Drive platform modernization with a focus on reliability, scalability, automation, security, and maintainability.
• Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent experience.
• Over 10 years of experience in systems engineering, infrastructure engineering, platform engineering, or a comparable technical role.
• Extensive experience in Linux systems administration, preferably with RHEL.
• Familiarity with Slurm, PBS, or other HPC workload managers.
• Proven experience in developing APIs, applications, and services that facilitate platform operations and user workflows.
• Experience in supporting production compute, infrastructure, and large-scale technical environments.
• Practical experience in scripting and software development using Python, Go, Bash, or similar languages.
• Knowledge of CI/CD concepts, GitHub, and contemporary software delivery practices.
• Strong troubleshooting abilities across operating systems, services, networking, storage, and application layers.
• Ability to produce clear documentation and communicate effectively with both technical and non-technical audiences.
• Strong sense of ownership with the capability to drive issues to resolution.
• Ability to exercise independent judgment to make informed technical decisions.
• Immediate medical, dental, and prescription drug coverage.
• Flexible family care, parental leave, new parent ramp-up programs, subsidized back-up child care, and more.
• Vehicle discount program available for employees and family members, along with management leases.
• Tuition assistance offered.
• Established and active employee resource groups.
• Paid time off for individual and team community service activities.
• A generous schedule of paid holidays, including the week between Christmas and New Year’s Day.
• Paid time off with the option to purchase additional vacation time.