
Senior Machine Learning Engineer, AI Platform
Posted Jun 20

This is a fully remote position, open to applicants in Canada.
• Design, develop, and maintain core AI platform components used to train, deploy, and serve machine learning models in production.
• Take ownership of model serving and inference workflows from start to finish, enhancing reliability, scalability, performance, and operational excellence.
• Lead initiatives to optimize inference systems focusing on throughput, latency, and cost-efficiency across both CPU and GPU workloads.
• Design and oversee GPU-based inference and training tasks, including performance tuning, capacity planning, and optimizing resource utilization.
• Manage and enhance critical elements of the model lifecycle, such as packaging, versioning, testing strategies, validation, and automation of deployment.
• Implement and advance observability practices (metrics, logging, tracing, alerting) to enhance visibility and operational resilience of ML services and pipelines.
• Collaborate closely with product, infrastructure, security, and data teams to design scalable platform capabilities that support AI-driven features.
• Contribute to technical design discussions, suggest architectural enhancements, and mentor junior engineers through code reviews and knowledge transfer.
• Engage in and help refine operational processes, including incident response, on-call rotations, and post-incident evaluations.
• Bachelor’s degree with 4–6 years of relevant industry experience, or a Master’s degree with substantial hands-on experience in building and operating production ML systems, or equivalent work experience.
• Strong proficiency in Python for developing machine learning systems, backend services, or distributed data processing.
• Proven track record of deploying and managing ML workloads in cloud environments, including production-grade infrastructure.
• Comprehensive understanding of model serving architectures, inference pipelines, and performance trade-offs (latency, throughput, cost, scaling strategies).
• Practical experience working with GPU-based workloads and accelerated computing in live production environments.
• Experience in designing CI/CD pipelines and development workflows that facilitate reliable deployment of ML systems.
• Ability to independently scope and lead technical projects while balancing product and operational priorities.
• Strong problem-solving skills and the ability to debug performance and reliability issues in distributed systems.
• Excellent communication skills, with a background in collaborating across engineering, product, and infrastructure teams.
• Generous performance-based bonus plans available to all eligible employees; we celebrate our success as one team.
• Comprehensive medical, dental, and vision insurance coverage.
• Significant retirement contributions with 100% immediate vesting (regardless of your contributions).
• Quarterly company-wide wellness days where everyone takes a collective break.
• Country-specific holidays plus an additional day off for your birthday.
• One-time stipend for home office setup.
• Annual budget for professional development.
• Quarterly well-being stipend.
• Generous paid parental leave.
• Employee referral bonus program.
• Additional benefits such as life/AD&D insurance, disability coverage, EAP, etc. (varies by country).