This is a fully remote position, open to applicants in New York.

📋 Description

• Take ownership of the reliability strategy for ML production.

• Develop and guide the operational framework for production ML systems, encompassing monitoring, traceability, safe deployment, incident response, and validation after deployment.

• Establish the benchmarks for ML teams to evaluate model health, performance, and trustworthiness within production environments.

• Manage model traceability and governance.

• Ensure that every production model has a clear lineage (including data, features, code, artifacts, validation, and deployment history) and promote the use of model registry and metadata tools across ML teams.

• Create comprehensive ML observability.

• Design and execute monitoring for the entire ML signal path: data arrival, feature freshness, distribution stability, candidate generation, ranking behavior, model metrics, serving latency, and SLA performance.

• Define metrics for production health.

• Collaborate with ML, data, product, and business stakeholders to outline post-deployment metrics that encompass model quality, system reliability, business guardrails, and indicators of degradation.

• Proactively identify drift and degradation.

• Use thresholding, alerting, anomaly detection, and cross-release monitoring to detect data drift, feature drift, changes in model behavior, and silent failures before they affect customers.

• Lead the development of diagnostic tools and root-cause analysis.

• Build dashboards, logs, and diagnostic workflows that let teams move quickly from 'recommendations appear off' to an identified root cause, capturing context across candidates, features, scores, ranking decisions, and downstream outcomes.

• Ensure safety in ML deployment.

• Define and manage automated gates to prevent the promotion of faulty models or data to production.

• Collaborate with MLEs to establish validation checks, rollback criteria, canary strategies, shadow testing, and reviews of release health.

• Oversee ML incident response.

• Manage incident response protocols for ML systems, including rollback playbooks, hotfix strategies, severity definitions, tradeoff frameworks, communications, and post-mortem analyses.

• After incidents, fix the underlying systemic causes rather than only the immediate symptoms.

• Collaborate across the ML Platform, Data, and ML teams: work with DevOps/Platform on infrastructure and observability requirements; work with Data Engineering on data quality, drift detection, and freshness; and partner with ML Engineering to integrate operational requirements into development and deployment workflows.

• Establish standards and mentor others.

• Serve as the technical lead for ML operations: create reusable patterns, playbooks, and standards, while mentoring engineers in reliability, observability, and operational rigor.

⛳️ Requirements

• A minimum of 5 years of experience in machine learning engineering, ML platforms, applied ML, MLOps, data platforms, reliability engineering, or a similar technical role.

• Proven experience in managing production ML systems, including monitoring, deployment, incident response, model validation, data quality, or ownership of reliability.

• Experience leading technical initiatives across multiple engineering teams, particularly where success required influencing architecture, tools, standards, or adoption.

• Practical experience with model registries, feature stores, ML metadata systems, production monitoring, model deployment pipelines, or ML observability platforms.

• Comprehensive understanding of end-to-end ML systems, including training data, features, model artifacts, offline validation, online serving, post-deployment metrics, and measuring business outcomes.

• Ability to analyze ML operational failure modes: stale features, distribution shifts, training-serving skew, delayed labels, and gaps between offline and online metrics.

• Strong SQL skills and comfort in investigating data quality, feature distributions, model outputs, pipeline behavior, and production anomalies.

• Proven record of cross-functional collaboration with Platform, Data, and ML Engineering to deliver operational capabilities suitable for production.

• Excellent written and verbal communication skills, including the ability to articulate ML system health, risks, incidents, and tradeoffs to both technical and non-technical audiences.

🏝️ Benefits

• Medical coverage

• Dental coverage

• Vision coverage

• 401(k) plan

• Life insurance

• Disability benefits

• Tuition assistance program

• Paid Time Off (PTO)

Lead Machine Learning Operations Engineer
