
Senior Distinguished Engineer, AI Compute
Posted 11 hours ago

This is a fully remote position, open to applicants in California and 3 more states.
• Design and construct control and data plane implementations necessary for creating a highly available, multi-tenant, large-scale, and secure machine learning platform.
• Create solutions using Ray and Spark distributed compute engines to enhance various workloads, including LLM pre-training, reinforcement learning, and large-scale data processing, while maximizing compute unit economics.
• Implement systemic enhancements for operational excellence, such as automating KTLO (Keep The Lights On) workflows.
• Oversee the technical execution of a diverse project portfolio, working alongside developers who specialize in areas ranging from distributed microservices to large foundation models.
• Collaborate cross-functionally with product and program management teams, as well as stakeholders and partners across Capital One, to optimize business results while driving robust technology solutions.
• Share your enthusiasm for keeping up with tech trends, experimenting with and learning new technologies, participating in internal and external technology communities, and leading system design and code review sessions.
• Contribute to enhancing the Capital One Distinguished Engineering community and establish yourself as a reliable resource on specific technologies and technology-enabled capabilities.
• Take the initiative in developing the next generation of talent by mentoring internal staff and actively recruiting external candidates to strengthen the Capital One tech talent pool.
• Bachelor's degree in Computer Science, AI, Electrical Engineering, Computer Engineering, or related fields with a minimum of 10 years of experience in developing AI and ML algorithms or technologies, or a Master's degree in the same fields with at least 8 years of relevant experience.
• A minimum of 10 years of programming experience in Python, Go, Scala, or Java.
• A Master’s Degree in Computer Science or a Master’s Degree in Software Engineering is preferred.
• Practical experience with the internals of Ray (Actors/GCS/Scheduling) or Spark (Query Optimizer/Memory Management) is preferred.
• Experience in building platforms that facilitate LLM training, fine-tuning, or high-throughput inference is preferred.
• Hands-on experience with AWS-specific compute primitives (EKS, EC2 UltraClusters, Graviton) and strategies for cost optimization is preferred.
• A proven track record of upstream contributions to significant distributed systems projects is preferred.
• A comprehensive, competitive, and inclusive array of health, financial, and other benefits that support your overall well-being.