This is a fully remote position, open to applicants in Canada.

📋 Description

• Design, develop, and maintain essential AI platform components used for training, deploying, and serving machine learning models in production.

• Take ownership of model serving and inference workflows from start to finish, enhancing reliability, scalability, performance, and operational excellence.

• Lead initiatives to optimize inference systems focusing on throughput, latency, and cost-efficiency across both CPU and GPU workloads.

• Design and oversee GPU-based inference and training tasks, including performance tuning, capacity planning, and optimizing resource utilization.

• Manage and enhance critical elements of the model lifecycle, such as packaging, versioning, testing strategies, validation, and automation of deployment.

• Implement and advance observability practices (metrics, logging, tracing, alerting) to enhance visibility and operational resilience of ML services and pipelines.

• Collaborate closely with product, infrastructure, security, and data teams to design scalable platform capabilities that support AI-driven features.

• Contribute to technical design discussions, suggest architectural enhancements, and mentor junior engineers through code reviews and knowledge transfer.

• Participate in and help refine operational processes, including incident response, on-call rotations, and post-incident reviews.

⛳️ Requirements

• Bachelor’s degree with 4–6 years of relevant industry experience, or a Master’s degree with substantial hands-on experience in building and operating production ML systems, or equivalent work experience.

• Strong proficiency in Python for developing machine learning systems, backend services, or distributed data processing.

• Proven track record of deploying and managing ML workloads in cloud environments, including production-grade infrastructure.

• Comprehensive understanding of model serving architectures, inference pipelines, and performance trade-offs (latency, throughput, cost, scaling strategies).

• Practical experience working with GPU-based workloads and accelerated computing in live production environments.

• Experience in designing CI/CD pipelines and development workflows that facilitate reliable deployment of ML systems.

• Ability to independently scope and lead technical projects while balancing product and operational priorities.

• Strong problem-solving skills and the ability to debug performance and reliability issues in distributed systems.

• Excellent communication skills, with a background in collaborating across engineering, product, and infrastructure teams.

🏝️ Benefits

• Generous performance-based bonus plans available to all eligible employees; we celebrate our success as one team.

• Comprehensive medical, dental, and vision insurance coverage.

• Significant retirement contributions with 100% immediate vesting (regardless of your contributions).

• Quarterly company-wide wellness days where everyone takes a collective break.

• Country-specific holidays plus an additional day off for your birthday.

• One-time stipend for home office setup.

• Annual budget for professional development.

• Quarterly well-being stipend.

• Generous paid parental leave.

• Employee referral bonus program.

• Additional benefits such as life/AD&D insurance, disability coverage, EAP, etc. (varies by country).

Senior Machine Learning Engineer, AI Platform
