This is a fully remote position, open to applicants in the United States.

📋 Description

• Take ownership and expand the AI compute and deployment platform.

• Manage and enhance our containerized application deployment platform and associated systems for AI workloads, which includes general process and job orchestration (e.g., Kubernetes) — covering cluster operations, node lifecycle, autoscaling (Karpenter), storage (EBS CSI), and workload isolation across staging and production.

• Develop and sustain GitOps-based deployment pipelines (Helm/Kustomize overlays, environment promotion) that enable teams to deploy AI services safely and consistently.

• Create ephemeral/preview environments, feature-branched deployments, and nightly release pipelines, allowing teams to validate AI modifications in environments resembling production prior to release.

• Enhance efficiency and manage costs related to compute, autoscaling, and inference infrastructure.

• Operate and scale inference infrastructure and a multi-provider LLM/AI gateway (e.g., Bedrock, Vertex, and other providers), including management of credentials, rate limits, and failover mechanisms.

• Establish reliable serving patterns for LLM-powered workflows: routing, grounding, tool execution, and context assembly at the platform level.

• Develop reusable infrastructure abstractions and contracts to standardize the deployment, configuration, and consumption of AI services across the organization.

• Manage the LLM/AI observability and tracing stack — provisioning and scaling systems like Langfuse, Datadog (dd-trace), OpenTelemetry tracing (OTLP), and the underlying datastores (e.g., ClickHouse) — ensuring AI behavior is auditable and debuggable in production.

• Create analytics and monitoring pipelines that highlight latency, error rates, quality, and regression indicators to engineering and clinical stakeholders.

• Define SLOs, alerting protocols, on-call runbooks, and incident response strategies for AI infrastructure; spearhead troubleshooting efforts and continuously enhance platform reliability.

• Oversee and refine the monorepo build system and CI/CD pipelines for AI workloads — encompassing evaluation workflows, Docker image creation, automated PR checks and convention enforcement, as well as cross-platform test execution.

• Manage shared infrastructure tools, CLIs, and IaC modules (Terraform, Scalr) utilized daily by AI and product engineers.

• Identify and eliminate platform bottlenecks — decreasing CI/CD cycle times, build latency, and deployment friction — to boost developer productivity across the Applied AI organization.

• Construct IAM, OIDC, and secrets management as essential infrastructure — implementing scoped, least-privilege roles, write-only secret rotation, and cross-account access audits.

• Integrate security-by-default, scope boundaries, and access controls into the platform to ensure that AI services are HIPAA-compliant and prioritize privacy.

• Collaborate with clinical, legal, security, and data platform teams (including Databricks/Unity Catalog access governance) to enforce compliant and auditable data access.

• Lead multi-quarter infrastructure initiatives, from cluster and deployment architecture to inference platform, GPU compute strategy, and the evolution of observability.

• Write and guide technical design documents and design reviews, establish infrastructure standards and development workflow conventions, and contribute to technical governance across AI engineering.

• Mentor engineers on reliability engineering, infrastructure-as-code, and MLOps best practices, bridging the gap between prototypes and production-ready systems.

⛳️ Requirements

• 8+ years of professional experience in infrastructure, platform, DevOps, or SRE engineering — with at least 3 years concentrated on ML/AI systems in production.

• Extensive, hands-on experience with Kubernetes (preferably EKS) and the cloud-native ecosystem — autoscaling, GitOps, Helm/Kustomize, operating clusters at scale, and general process/job orchestration.

• Strong skills in infrastructure-as-code (Terraform) and experience in designing secure cloud architectures: IAM, OIDC, secrets management, and least-privilege access.

• Proficient in Python, with experience in building production infrastructure tools, CLIs, and data/observability pipelines.

• 2+ years of experience managing LLM-based systems in production (LLMOps) — including inference routing, serving, tracing, and the reliability patterns required to operate them at scale.

• Practical experience with observability/tracing stacks (Datadog, OpenTelemetry, Langfuse, or similar) and metrics/log/trace pipelines.

• Experience in designing and maintaining CI/CD pipelines, build systems, and developer tools for agile engineering teams.

• A systems-and-operations mindset: you consider failure modes, SLOs, observability, security, and long-term maintainability before deployment.

• Experience in writing and leading technical design documents (TDDs/RFCs) for large-scale infrastructure initiatives.

• Strong collaboration skills across engineering, ML, product, security, and clinical teams.

• A deep understanding of safety, privacy, and security — ideally with experience in a regulated sector such as healthcare, fintech, or life sciences.

🏝️ Benefits

• Competitive salary & equity compensation for full-time positions.

• Unlimited PTO, company holidays, and quarterly mental health days.

• Comprehensive health benefits including medical, dental & vision, and parental leave.

• Employee Stock Purchase Program (ESPP).

• 401(k) plan with employer matching contributions.

• Offsite team retreats.

Staff Machine Learning Systems Engineer – MLOps
