Remotery

Lead Site Reliability Engineer

Posted May 30

This is a fully remote position, open to applicants in India.

📋 Description

• Develop, deploy, and troubleshoot microservices within Kubernetes and Amazon EKS, ensuring both scalability and reliability.

• Architect secure, highly available web applications with an emphasis on capacity planning and performance enhancement.

• Oversee deployment and lifecycle management for LLMs and embedding models, establishing KPIs to assess and improve AI application performance.

• Assess and incorporate emerging technologies such as RAG systems, MCP servers, AI Agents, and agentic workflows into our platform.

• Manage core AWS services and GenAI offerings (S3, IAM, EKS, Bedrock, etc.) using infrastructure-as-code tools like Terraform and Chef, while maintaining observability and alerting through platforms like New Relic or PagerDuty.

• Collaborate with product, platform, and engineering teams on architecture design, security updates, incident response, and release management to maintain the reliability of our ML and GenAI infrastructure.


⛳️ Requirements

• Bachelor’s degree and over 8 years of experience in managing large-scale cloud applications, with a solid foundation in Linux administration and troubleshooting.

• More than 5 years of hands-on experience managing cloud infrastructure across AWS, GCP, and Azure environments.

• A deep understanding of the current generative AI landscape, supplemented by practical experience with LLMs and embedding models (OpenAI, AWS Bedrock, SageMaker); familiarity with vector databases like LanceDB is advantageous.

• Strong scripting capabilities in Bash or Python, and experience with container orchestration platforms such as Amazon EKS or Azure AKS.

• Expertise in DevOps and automation tools like Chef, GitHub Actions, Rundeck, and IaC frameworks such as Terraform, Spacelift, and Helm.

• Working knowledge of DNS, load balancers, and MySQL, along with a solid understanding of source control and branching strategies in Git.


🏝️ Benefits

• Pioneering Technology: At Coupa, we are leading the way in innovation, utilizing the latest technology to provide our customers with enhanced efficiency and visibility in their expenditures.

• Collaborative Culture: We cherish collaboration and teamwork; our culture is rooted in transparency, openness, and a collective commitment to excellence.

• Global Impact: Become part of a company where your contributions have a worldwide, measurable influence on our clients, the business, and one another.
