
Lead Machine Learning Operations Engineer
Posted 23 hours ago

This is a fully remote position, open to applicants in New York.
• Take ownership of the reliability strategy for ML production.
• Develop and guide the operational framework for production ML systems, encompassing monitoring, traceability, safe deployment, incident response, and validation after deployment.
• Establish the benchmarks for ML teams to evaluate model health, performance, and trustworthiness within production environments.
• Manage model traceability and governance.
• Ensure that every production model has a clear lineage (including data, features, code, artifacts, validation, and deployment history) and promote the use of model registry and metadata tools across ML teams.
• Create comprehensive ML observability.
• Design and execute monitoring for the entire ML signal path: data arrival, feature freshness, distribution stability, candidate generation, ranking behavior, model metrics, serving latency, and SLA performance.
• Define metrics for production health.
• Collaborate with ML, data, product, and business stakeholders to outline post-deployment metrics that encompass model quality, system reliability, business guardrails, and indicators of degradation.
• Proactively identify drift and degradation.
• Detect data drift, feature drift, changes in model behavior, and silent failures before they affect customers through thresholding, alerting, anomaly detection, and monitoring across releases.
• Lead the development of diagnostic tools and root-cause analysis.
• Construct dashboards, logs, and diagnostic workflows that enable swift transitions from 'recommendations appear off' to identifying the root cause, capturing context across candidates, features, scores, ranking decisions, and downstream outcomes.
• Ensure safety in ML deployment.
• Define and manage automated gates to prevent the promotion of faulty models or data to production.
• Collaborate with MLEs to establish validation checks, rollback criteria, canary strategies, shadow testing, and reviews of release health.
• Oversee ML incident response.
• Manage incident response protocols for ML systems, including rollback playbooks, hotfix strategies, severity definitions, tradeoff frameworks, communications, and post-mortem analyses.
• Focus on resolving underlying systemic issues following incidents rather than solely addressing immediate problems.
• Collaborate across the ML Platform, Data, and ML teams: work with DevOps/Platform on infrastructure and observability requirements; work with Data Engineering on data quality, drift detection, and freshness; and partner with ML Engineering to integrate operational requirements into development and deployment workflows.
• Establish standards and mentor others.
• Serve as the technical lead for ML operations: create reusable patterns, playbooks, and standards, while mentoring engineers in reliability, observability, and operational rigor.
• A minimum of 5 years of experience in machine learning engineering, ML platforms, applied ML, MLOps, data platforms, reliability engineering, or a similar technical role.
• Proven experience in managing production ML systems, including monitoring, deployment, incident response, model validation, data quality, or ownership of reliability.
• Experience leading technical initiatives across various engineering teams, particularly where success necessitated influencing architecture, tools, standards, or adoption.
• Practical experience with model registries, feature stores, ML metadata systems, production monitoring, model deployment pipelines, or ML observability platforms.
• Comprehensive understanding of end-to-end ML systems, including training data, features, model artifacts, offline validation, online serving, post-deployment metrics, and measuring business outcomes.
• Capability to analyze ML operational failure modes: stale features, distribution shifts, training-serving discrepancies, delayed labels, and gaps between offline and online metrics.
• Strong SQL skills and comfort in investigating data quality, feature distributions, model outputs, pipeline behavior, and production anomalies.
• Proven record of cross-functional collaboration with Platform, Data, and ML Engineering to deliver operational capabilities suitable for production.
• Excellent written and verbal communication skills, including the ability to articulate ML system health, risks, incidents, and tradeoffs to both technical and non-technical audiences.
• Medical coverage
• Dental coverage
• Vision coverage
• 401(k) plan
• Life insurance
• Disability benefits
• Tuition assistance program
• Paid Time Off (PTO)