Remotery

Staff Machine Learning Systems Engineer – MLOps

at hims & hers | United States | Full-time | Lead | $210k – $250k/year

Posted 1 hour ago

This is a fully remote position, open to applicants in the United States.

📋 Description

• Take ownership and expand the AI compute and deployment platform.

• Manage and enhance our containerized application deployment platform and associated systems for AI workloads, which includes general process and job orchestration (e.g., Kubernetes), covering cluster operations, node lifecycle, autoscaling (Karpenter), storage (EBS CSI), and workload isolation across staging and production.

• Develop and sustain GitOps-based deployment pipelines (Helm/Kustomize overlays, environment promotion) that enable teams to deploy AI services safely and consistently.

• Create ephemeral/preview environments, feature-branched deployments, and nightly release pipelines, allowing teams to validate AI modifications in environments resembling production prior to release.

• Enhance efficiency and manage costs related to compute, autoscaling, and inference infrastructure.

• Operate and scale inference infrastructure and a multi-provider LLM AI gateway (e.g., Bedrock, Vertex, and other providers), including management of credentials, rate limits, and failover mechanisms.

• Establish reliable serving patterns for LLM-powered workflows: routing, grounding, tool execution, and context assembly at the platform level.

• Develop reusable infrastructure abstractions and contracts to standardize the deployment, configuration, and consumption of AI services across the organization.

• Manage the LLM/AI observability and tracing stack, provisioning and scaling systems like Langfuse, Datadog (dd-trace), OpenTelemetry tracing (OTLP), and the underlying datastores (e.g., ClickHouse), ensuring AI behavior is auditable and debuggable in production.

• Create analytics and monitoring pipelines that highlight latency, error rates, quality, and regression indicators to engineering and clinical stakeholders.

• Define SLOs, alerting protocols, on-call runbooks, and incident response strategies for AI infrastructure; spearhead troubleshooting efforts and continuously enhance platform reliability.

• Oversee and refine the monorepo build system and CI/CD pipelines for AI workloads, encompassing evaluation workflows, Docker image creation, automated PR checks and convention enforcement, as well as cross-platform test execution.

• Manage shared infrastructure tools, CLIs, and IaC modules (Terraform, Scalr) utilized daily by AI and product engineers.

• Identify and eliminate platform bottlenecks, decreasing CI/CD cycle times, build latency, and deployment friction, to boost developer productivity across the Applied AI organization.

• Construct IAM, OIDC, and secrets management as essential infrastructure, implementing scoped, least-privilege roles, write-only secret rotation, and cross-account access audits.

• Integrate security-by-default, scope boundaries, and access controls into the platform to ensure that AI services are HIPAA-compliant and prioritize privacy.

• Collaborate with clinical, legal, security, and data platform teams (including Databricks/Unity Catalog access governance) to enforce compliant and auditable data access.

• Lead multi-quarter infrastructure initiatives, from cluster and deployment architecture to inference platform, GPU compute strategy, and the evolution of observability.

• Write and guide technical design documents and design reviews, establish infrastructure standards and development workflow conventions, and contribute to technical governance across AI engineering.

• Mentor engineers on reliability engineering, infrastructure-as-code, and MLOps best practices, bridging the gap between prototypes and production-ready systems.


⛳️ Requirements

• 8+ years of professional experience in infrastructure, platform, DevOps, or SRE engineering, with at least 3 years concentrated on ML/AI systems in production.

• Extensive, hands-on experience with Kubernetes (preferably EKS) and the cloud-native ecosystem: autoscaling, GitOps, Helm/Kustomize, operating clusters at scale, and general process/job orchestration.

• Strong skills in infrastructure-as-code (Terraform) and experience in designing secure cloud architectures: IAM, OIDC, secrets management, and least-privilege access.

• Proficient in Python, with experience in building production infrastructure tools, CLIs, and data/observability pipelines.

• 2+ years of experience managing LLM-based systems in production (LLMOps), including inference routing, serving, tracing, and the reliability patterns required to operate them at scale.

• Practical experience with observability/tracing stacks (Datadog, OpenTelemetry, Langfuse, or similar) and metrics/log/trace pipelines.

• Experience in designing and maintaining CI/CD pipelines, build systems, and developer tools for agile engineering teams.

• A systems-and-operations mindset: you consider failure modes, SLOs, observability, security, and long-term maintainability before deployment.

• Experience in writing and leading technical design documents (TDDs/RFCs) for large-scale infrastructure initiatives.

• Strong collaboration skills across engineering, ML, product, security, and clinical teams.

• A deep understanding of safety, privacy, and security, ideally with experience in a regulated sector such as healthcare, fintech, or life sciences.


🏝️ Benefits

• Competitive salary & equity compensation for full-time positions.

• Unlimited PTO, company holidays, and quarterly mental health days.

• Comprehensive health benefits including medical, dental & vision, and parental leave.

• Employee Stock Purchase Program (ESPP).

• 401k benefits with employer matching contribution.

• Offsite team retreats.
