
Senior Solutions Architect – AI Factory Deployment
Posted 1 hour ago

This is a fully remote position, open to applicants in California, +2 more states.
• Configure, adjust, and validate AI factory environments within multi-GPU and multi-node Linux clusters.
• Ensure that configurations adhere to best practices for NCCL, collectives, and distributed training frameworks.
• Take ownership of executing critical AI/LLM benchmarks, including setup, orchestration, result gathering, and analysis.
• Troubleshoot and resolve issues that arise when training jobs or benchmarks fail, hang, or do not perform as expected.
• Enhance observability for AI factories (metrics, logs, traces, dashboards) to gain insights into workload behavior and system health.
• Create automation (Python, Shell) for running benchmarks, collecting results, and conducting regression checks.
• Analyze communication patterns and NCCL utilization for AI/LLM workloads, focusing on collective operations such as AllReduce and AllToAll.
• Suggest modifications to job configurations, parallelism strategies, and cluster settings to improve throughput, latency, and scaling efficiency.
• Collaborate closely with hardware, software, networking, datacenter, and product teams to prepare AI factories for customer deployment.
• Contribute to the development of documentation, guidelines, and readiness materials that assist internal colleagues and customer-facing teams.
• A Bachelor’s degree or equivalent experience in Computer Science, Mathematics, Engineering, Physics, or a related discipline.
• Over 6 years of experience managing Linux-based systems in HPC, distributed systems, or large-scale AI/ML environments.
• Practical experience running AI/ML workloads on multi-GPU and/or multi-node clusters, with a solid understanding of NCCL.
• Strong understanding of collective communication patterns, especially AllReduce and AllToAll, and their application in modern ML/LLM training.
• Familiarity with LLM training and/or inference workflows using frameworks such as PyTorch or TensorFlow.
• Proficient in Python and Shell/Bash for scripting, automation, and tooling purposes.
• Experience in benchmarking (designing, executing, and analyzing performance benchmarks).
• Comfortable working with observability data (metrics, logs, dashboards) to troubleshoot and enhance complex distributed workloads.
• Excellent communication skills and the ability to work effectively within cross-functional teams.
• Eligible for equity and benefits