
Senior Solutions Architect, Cloud Infrastructure, DevOps
Posted Jun 20

This is a fully remote position, open to applicants in Japan.
• Oversee large-scale HPC/AI clusters with effective monitoring, logging, and alerting systems.
• Administer Linux job/workload schedulers and orchestration tools.
• Design and uphold continuous integration and delivery pipelines.
• Create tools to streamline the deployment and management of extensive infrastructure environments, automate operational monitoring and alerting, and facilitate self-service resource consumption.
• Implement monitoring solutions for servers, networks, and storage systems.
• Conduct troubleshooting from the ground up, addressing bare metal, operating system, software stack, and application levels.
• As a technical expert, develop, refine, and document standardized methodologies to share with internal teams.
• Assist in Research & Development efforts and participate in POCs/POVs for future enhancements.
• BS/MS/PhD or equivalent experience in Computer Science, Electrical/Computer Engineering, Physics, Mathematics, or related disciplines.
• A minimum of 8 years of professional experience with networking principles, the TCP/IP stack, and data center architecture.
• Familiarity with HPC and AI solution technologies, including CPUs, GPUs, high-speed interconnects, and related software.
• Comprehensive knowledge and practical experience with Kubernetes, focusing on container orchestration for AI/ML workloads, resource scheduling, scaling, and integration with HPC environments.
• Experience in managing and setting up HPC clusters, covering aspects of deployment, optimization, and troubleshooting.
• Proficient in job scheduling and workload orchestration technologies such as Slurm and Kubernetes, and in container runtimes such as Singularity.
• Strong understanding of Windows and Linux systems (Red Hat/CentOS and Ubuntu), including internals, ACLs, OS-level security measures, and common protocols such as TCP, DHCP, and DNS.
• Experience with various storage solutions, including Lustre, GPFS, ZFS, and XFS.
• Familiarity with new and emerging storage technologies is advantageous.
• Proficient in Python programming and bash scripting.
• Understanding of CI/CD pipelines for software deployment and automation.
• Comfortable using automation and configuration management tools such as Jenkins, Ansible, Puppet, and Chef.
• Ability to convey technical concepts and work collaboratively with Japanese-speaking clients.
• Opportunities for professional development.
• Flexible work arrangements.