This is a fully remote position, open to applicants in the United States.

📋 Description

• Design, develop, and manage the infrastructure layer that supports AI agent workflows in a production setting.

• Ensure the reliability, scalability, and observability of agentic systems across both internal and external products.

• Create and implement platform services, APIs, SDKs, and self-service features that enable engineering teams to effectively utilize AI infrastructure and agent platform services.

• Oversee and maintain the compute, orchestration, and serving infrastructure that powers model inference and agent execution.

• Establish robust monitoring, alerting, and incident response protocols tailored specifically for AI/ML workloads.

• Use Infrastructure as Code (IaC) tools such as Terraform to provision and manage cloud infrastructure on AWS.

• Develop and sustain CI/CD pipelines that facilitate the rapid and reliable deployment of AI services and agent workflows.

• Define and implement guardrails, failure handling, and recovery strategies specific to agentic and LLM-powered systems.

• Work collaboratively with AI and Data Engineering teams to evolve experimental agent prototypes into robust production systems.

• Manage containerized workloads with Kubernetes, ensuring efficient deployment, scaling, and orchestration of AI services.

• Enforce access controls and security best practices across AI infrastructure environments.

• Document architecture, runbooks, and best practices to promote knowledge sharing within the team.

⛳️ Requirements

• A minimum of 5 years of experience as a Site Reliability Engineer, Infrastructure Engineer, Platform Engineer, or a similar role in a production environment.

• Practical experience in supporting ML infrastructure, model serving, or MLOps workflows in a production context.

• Proven experience in building developer platforms, internal tools, APIs, or SDKs utilized by engineering teams at scale.

• Strong grasp of platform engineering principles, focusing on developer experience, self-service infrastructure, and API-driven platform design.

• Proficient in Infrastructure as Code tools, especially Terraform.

• Familiarity with containerization and orchestration, particularly with Kubernetes and Docker.

• Solid understanding of cloud infrastructure, preferably AWS.

• Strong scripting skills (Bash/shell) and proficiency in at least one programming language (Python preferred).

• Experience in designing and managing observability, monitoring, and alerting systems.

• Background in implementing incident response procedures and participating in on-call rotations.

• Excellent collaboration skills when working with data, AI, and engineering teams.

• A high ownership mindset in a dynamic, high-stakes production environment.

🏝️ Benefits

• Equity offerings.

• Bonus opportunities.

• Wellness allowance.

• Comprehensive health insurance (medical, dental, vision).

• 401(k) plan.

Site Reliability Engineer – AI Agents

