
Site Reliability Engineer – AI Agents
Posted 6 days ago

This is a fully remote position, open to applicants in the United States.
Responsibilities
• Design, develop, and manage the infrastructure layer that supports AI agent workflows in a production setting.
• Ensure the reliability, scalability, and observability of agentic systems across both internal and external products.
• Build platform services, APIs, SDKs, and self-service features that let engineering teams use the AI infrastructure and agent platform effectively.
• Oversee and maintain the compute, orchestration, and serving infrastructure that powers model inference and agent execution.
• Establish robust monitoring, alerting, and incident response protocols tailored specifically for AI/ML workloads.
• Use Infrastructure as Code (IaC) tools such as Terraform to provision and manage cloud infrastructure on AWS.
• Build and maintain CI/CD pipelines that enable rapid, reliable deployment of AI services and agent workflows.
• Define and implement guardrails, failure handling, and recovery strategies specific to agentic and LLM-powered systems.
• Work collaboratively with AI and Data Engineering teams to evolve experimental agent prototypes into robust production systems.
• Manage containerized workloads with Kubernetes, ensuring efficient deployment, scaling, and orchestration of AI services.
• Enforce access controls and security best practices across AI infrastructure environments.
• Document architecture, runbooks, and best practices to promote knowledge sharing within the team.
Requirements
• A minimum of 5 years of experience as a Site Reliability Engineer, Infrastructure Engineer, or Platform Engineer, or in a similar role, in a production environment.
• Practical experience in supporting ML infrastructure, model serving, or MLOps workflows in a production context.
• Proven experience building developer platforms, internal tools, APIs, or SDKs used by engineering teams at scale.
• Strong grasp of platform engineering principles, focusing on developer experience, self-service infrastructure, and API-driven platform design.
• Proficiency in Infrastructure as Code tools, especially Terraform.
• Familiarity with containerization and orchestration, particularly with Kubernetes and Docker.
• Solid understanding of cloud infrastructure, preferably AWS.
• Strong scripting capabilities (bash/shell) and proficiency in at least one programming language (Python preferred).
• Experience in designing and managing observability, monitoring, and alerting systems.
• Background in implementing incident response procedures and participating in on-call rotations.
• Excellent collaboration skills when working with data, AI, and engineering teams.
• A high-ownership mindset in a dynamic, high-stakes production environment.
Benefits
• Equity offerings.
• Bonus opportunities.
• Wellness allowance.
• Comprehensive health insurance (medical, dental, vision).
• 401(k) plan.
Innovative Solutions