This is a fully remote position, open to applicants in California and two other states.

• Configure, adjust, and validate AI factory environments within multi-GPU and multi-node Linux clusters.

• Ensure that configurations adhere to best practices for NCCL, collectives, and distributed training frameworks.

• Take ownership of executing crucial AI/LLM benchmarks, which includes setup, orchestration, result gathering, and analysis.

• Troubleshoot and resolve issues that arise when training jobs or benchmarks fail, hang, or do not perform as expected.

• Enhance observability for AI factories (metrics, logs, traces, dashboards) to gain insights into workload behavior and system health.

• Create automation (Python, Shell) for running benchmarks, collecting results, and conducting regression checks.

• Analyze communication patterns and NCCL utilization for AI/LLM workloads, focusing on collective operations such as AllReduce and AllToAll.

• Recommend changes to job configurations, parallelism strategies, and cluster settings to increase throughput, reduce latency, and improve scaling efficiency.

• Collaborate closely with hardware, software, networking, datacenter, and product teams to prepare AI factories for customer deployment.

• Contribute to the development of documentation, guidelines, and readiness materials that assist internal colleagues and customer-facing teams.

• A Bachelor’s degree or equivalent experience in Computer Science, Mathematics, Engineering, Physics, or a related discipline.

• More than 6 years of experience managing Linux-based systems in HPC, distributed systems, or large-scale AI/ML environments.

• Practical experience running AI/ML workloads on multi-GPU and/or multi-node clusters, with a solid understanding of NCCL.

• Strong understanding of collective communication patterns, especially AllReduce and AllToAll, and their application in modern ML/LLM training.

• Familiarity with LLM training and/or inference workflows using frameworks such as PyTorch or TensorFlow.

• Proficiency in Python and Shell/Bash for scripting, automation, and tooling.

• Experience in benchmarking (designing, executing, and analyzing performance benchmarks).

• Comfortable working with observability data (metrics, logs, dashboards) to troubleshoot and enhance complex distributed workloads.

• Excellent communication skills and the ability to work effectively within cross-functional teams.

• Eligible for equity and benefits

Senior Solutions Architect – AI Factory Deployment
