Remotery

Senior Solutions Architect – AI Factory Deployment

Posted 1 hour ago

This is a fully remote position, open to applicants in California and 2 more states.

📋 Description

• Configure, adjust, and validate AI factory environments within multi-GPU and multi-node Linux clusters.

• Ensure that configurations adhere to best practices for NCCL, collectives, and distributed training frameworks.

• Take ownership of executing critical AI/LLM benchmarks, including setup, orchestration, result gathering, and analysis.

• Troubleshoot and resolve issues that arise when training jobs or benchmarks fail, hang, or do not perform as expected.

• Enhance observability for AI factories (metrics, logs, traces, dashboards) to gain insights into workload behavior and system health.

• Create automation (Python, Shell) for running benchmarks, collecting results, and conducting regression checks.

• Analyze communication patterns and NCCL utilization for AI/LLM workloads, focusing on collective operations such as AllReduce and AllToAll.

• Suggest modifications to job configurations, parallelism strategies, and cluster settings to improve throughput, latency, and scaling efficiency.

• Collaborate closely with hardware, software, networking, datacenter, and product teams to prepare AI factories for customer deployment.

• Contribute to the development of documentation, guidelines, and readiness materials that assist internal colleagues and customer-facing teams.


⛳️ Requirements

• A Bachelor’s degree or equivalent experience in Computer Science, Mathematics, Engineering, Physics, or a related discipline.

• Over 6 years of experience managing Linux-based systems in HPC, distributed systems, or large-scale AI/ML environments.

• Practical experience running AI/ML workloads on multi-GPU and/or multi-node clusters, with a solid understanding of NCCL.

• Strong understanding of collective communication patterns, especially AllReduce and AllToAll, and their application in modern ML/LLM training.

• Familiarity with LLM training and/or inference workflows using frameworks such as PyTorch or TensorFlow.

• Proficiency in Python and Shell/Bash for scripting, automation, and tooling.

• Experience in benchmarking (designing, executing, and analyzing performance benchmarks).

• Comfortable working with observability data (metrics, logs, dashboards) to troubleshoot and enhance complex distributed workloads.

• Excellent communication skills and the ability to work effectively within cross-functional teams.


🏝️ Benefits

• Eligible for equity and benefits
