
Senior Distinguished Engineer, AI Compute
Posted May 10

This is a fully remote position, open to applicants in California and three other states.
• Design and implement control and data plane solutions necessary for establishing a highly available, multi-tenant, large-scale, and secure machine learning platform.
• Create distributed compute engine solutions using Ray and Spark to enhance various workloads, including LLM pre-training, reinforcement learning, and extensive data processing, while optimizing compute unit economics.
• Drive systemic enhancements for operational excellence by automating KTLO (Keep The Lights On) processes.
• Oversee the technical execution of a varied project portfolio, working alongside developers who specialize in various areas, from distributed microservices to managing large foundation models.
• Collaborate across product and program management teams, as well as with stakeholders and partners throughout Capital One, to enhance business outcomes while advancing robust technology solutions.
• Share your enthusiasm for staying informed about technological advancements, experimenting with new technologies, engaging in internal and external tech communities, and leading system design and code review sessions.
• Contribute to the growth of the Capital One Distinguished Engineering community and position yourself as a key resource in specific technologies and technology-enabled capabilities.
• Take the lead in nurturing the next generation of talent by mentoring internal staff and actively recruiting external candidates to enhance the Capital One tech talent pool.
• A Bachelor's degree in Computer Science, AI, Electrical Engineering, Computer Engineering, or related fields with a minimum of 10 years of experience in developing AI and ML algorithms or technologies; alternatively, a Master's degree in the same fields with at least 8 years of relevant experience.
• A minimum of 10 years of programming experience with Python, Go, Scala, or Java.
• A Master's degree in Computer Science or Software Engineering is preferred.
• Practical experience with the internals of Ray (Actors/GCS/Scheduling) or Spark (Query Optimizer/Memory Management) is preferred.
• Experience in building platforms that facilitate LLM training, fine-tuning, or high-throughput inference is preferred.
• Hands-on experience with AWS-specific compute services (EKS, EC2 UltraClusters, Graviton) and cost-optimization approaches is preferred.
• A track record of upstream contributions to significant distributed systems projects is preferred.
• A comprehensive, competitive, and inclusive array of health, financial, and additional benefits that support your overall well-being.