Remotery

Lead Machine Learning Operations Engineer

Posted 23 hours ago

This is a fully remote position, open to applicants in New York.

📋 Description

• Take ownership of the reliability strategy for ML production.

• Develop and guide the operational framework for production ML systems, encompassing monitoring, traceability, safe deployment, incident response, and validation after deployment.

• Establish the benchmarks for ML teams to evaluate model health, performance, and trustworthiness within production environments.

• Manage model traceability and governance.

• Ensure that every production model has a clear lineage (including data, features, code, artifacts, validation, and deployment history) and promote the use of model registry and metadata tools across ML teams.

• Create comprehensive ML observability.

• Design and execute monitoring for the entire ML signal path: data arrival, feature freshness, distribution stability, candidate generation, ranking behavior, model metrics, serving latency, and SLA performance.

• Define metrics for production health.

• Collaborate with ML, data, product, and business stakeholders to outline post-deployment metrics that encompass model quality, system reliability, business guardrails, and indicators of degradation.

• Proactively identify drift and degradation.

• Use thresholding, alerting, anomaly detection, and monitoring across releases to detect data drift, feature drift, changes in model behavior, and silent failures before they affect customers.

• Lead the development of diagnostic tools and root-cause analysis.

• Construct dashboards, logs, and diagnostic workflows that let engineers move quickly from 'recommendations appear off' to an identified root cause, capturing context across candidates, features, scores, ranking decisions, and downstream outcomes.

• Ensure safety in ML deployment.

• Define and manage automated gates to prevent the promotion of faulty models or data to production.

• Collaborate with MLEs to establish validation checks, rollback criteria, canary strategies, shadow testing, and reviews of release health.

• Oversee ML incident response.

• Manage incident response protocols for ML systems, including rollback playbooks, hotfix strategies, severity definitions, tradeoff frameworks, communications, and post-mortem analyses.

• Focus on resolving underlying systemic issues following incidents rather than solely addressing immediate problems.

• Collaborate across the ML Platform, Data, and ML teams: work with DevOps/Platform on infrastructure and observability requirements; with Data Engineering on data quality, drift, and freshness; and with ML Engineering to integrate operational requirements into development and deployment workflows.

• Establish standards and mentor others.

• Serve as the technical lead for ML operations: create reusable patterns, playbooks, and standards, while mentoring engineers in reliability, observability, and operational rigor.


⛳️ Requirements

• A minimum of 5 years of experience in machine learning engineering, ML platforms, applied ML, MLOps, data platforms, reliability engineering, or a similar technical role.

• Proven experience in managing production ML systems, including monitoring, deployment, incident response, model validation, data quality, or ownership of reliability.

• Experience leading technical initiatives across multiple engineering teams, particularly where success depended on influencing architecture, tools, standards, or adoption.

• Practical experience with model registries, feature stores, ML metadata systems, production monitoring, model deployment pipelines, or ML observability platforms.

• Comprehensive understanding of end-to-end ML systems, including training data, features, model artifacts, offline validation, online serving, post-deployment metrics, and measuring business outcomes.

• Ability to analyze ML operational failure modes: stale features, distribution shifts, training-serving discrepancies, delayed labels, and gaps between offline and online metrics.

• Strong SQL skills and comfort in investigating data quality, feature distributions, model outputs, pipeline behavior, and production anomalies.

• Proven record of cross-functional collaboration with Platform, Data, and ML Engineering to deliver operational capabilities suitable for production.

• Excellent written and verbal communication skills, including the ability to articulate ML system health, risks, incidents, and tradeoffs to both technical and non-technical audiences.


🏝️ Benefits

• Medical coverage

• Dental coverage

• Vision coverage

• 401(k) plan

• Life insurance

• Disability benefits

• Tuition assistance program

• Paid Time Off (PTO)
