
Staff Machine Learning Systems Engineer - MLOps
Posted 1 hour ago

This is a fully remote position, open to applicants in the United States.
• Take ownership of and expand the AI compute and deployment platform.
• Manage and enhance our containerized application deployment platform and associated systems for AI workloads, including general process and job orchestration (e.g., Kubernetes): cluster operations, node lifecycle, autoscaling (Karpenter), storage (EBS CSI), and workload isolation across staging and production.
• Develop and sustain GitOps-based deployment pipelines (Helm/Kustomize overlays, environment promotion) that enable teams to deploy AI services safely and consistently.
• Create ephemeral/preview environments, feature-branched deployments, and nightly release pipelines, allowing teams to validate AI modifications in environments resembling production prior to release.
• Enhance efficiency and manage costs related to compute, autoscaling, and inference infrastructure.
• Operate and scale inference infrastructure and a multi-provider LLM AI gateway (e.g., Bedrock, Vertex, and other providers), including management of credentials, rate limits, and failover mechanisms.
• Establish reliable serving patterns for LLM-powered workflows: routing, grounding, tool execution, and context assembly at the platform level.
• Develop reusable infrastructure abstractions and contracts to standardize the deployment, configuration, and consumption of AI services across the organization.
• Manage the LLM/AI observability and tracing stack, provisioning and scaling systems like Langfuse, Datadog (dd-trace), OpenTelemetry tracing (OTLP), and the underlying datastores (e.g., ClickHouse), so that AI behavior is auditable and debuggable in production.
• Create analytics and monitoring pipelines that surface latency, error rates, quality, and regression indicators to engineering and clinical stakeholders.
• Define SLOs, alerting protocols, on-call runbooks, and incident response strategies for AI infrastructure; spearhead troubleshooting efforts and continuously enhance platform reliability.
• Oversee and refine the monorepo build system and CI/CD pipelines for AI workloads, encompassing evaluation workflows, Docker image creation, automated PR checks and convention enforcement, as well as cross-platform test execution.
• Manage shared infrastructure tools, CLIs, and IaC modules (Terraform, Scalr) utilized daily by AI and product engineers.
• Identify and eliminate platform bottlenecks, decreasing CI/CD cycle times, build latency, and deployment friction, to boost developer productivity across the Applied AI organization.
• Construct IAM, OIDC, and secrets management as essential infrastructure, implementing scoped, least-privilege roles, write-only secret rotation, and cross-account access audits.
• Integrate security-by-default, scope boundaries, and access controls into the platform to ensure that AI services are HIPAA-compliant and prioritize privacy.
• Collaborate with clinical, legal, security, and data platform teams (including Databricks/Unity Catalog access governance) to enforce compliant and auditable data access.
• Lead multi-quarter infrastructure initiatives, from cluster and deployment architecture to inference platform, GPU compute strategy, and the evolution of observability.
• Write and guide technical design documents and design reviews, establish infrastructure standards and development workflow conventions, and contribute to technical governance across AI engineering.
• Mentor engineers on reliability engineering, infrastructure-as-code, and MLOps best practices, bridging the gap between prototypes and production-ready systems.
• 8+ years of professional experience in infrastructure, platform, DevOps, or SRE engineering, with at least 3 years concentrated on ML/AI systems in production.
• Extensive, hands-on experience with Kubernetes (preferably EKS) and the cloud-native ecosystem: autoscaling, GitOps, Helm/Kustomize, operating clusters at scale, and general process/job orchestration.
• Strong skills in infrastructure-as-code (Terraform) and experience in designing secure cloud architectures: IAM, OIDC, secrets management, and least-privilege access.
• Proficient in Python, with experience in building production infrastructure tools, CLIs, and data/observability pipelines.
• 2+ years of experience managing LLM-based systems in production (LLMOps), including inference routing, serving, tracing, and the reliability patterns required to operate them at scale.
• Practical experience with observability/tracing stacks (Datadog, OpenTelemetry, Langfuse, or similar) and metrics/log/trace pipelines.
• Experience in designing and maintaining CI/CD pipelines, build systems, and developer tools for agile engineering teams.
• A systems-and-operations mindset: you consider failure modes, SLOs, observability, security, and long-term maintainability before deployment.
• Experience in writing and leading technical design documents (TDDs/RFCs) for large-scale infrastructure initiatives.
• Strong collaboration skills across engineering, ML, product, security, and clinical teams.
• A deep understanding of safety, privacy, and security, ideally with experience in a regulated sector such as healthcare, fintech, or life sciences.
• Competitive salary & equity compensation for full-time positions.
• Unlimited PTO, company holidays, and quarterly mental health days.
• Comprehensive health benefits including medical, dental & vision, and parental leave.
• Employee Stock Purchase Program (ESPP).
• 401k benefits with employer matching contribution.
• Offsite team retreats.