
Lead Site Reliability Engineer
Posted May 30

This is a fully remote position, open to applicants in India.
• Develop, deploy, and troubleshoot microservices within Kubernetes and Amazon EKS, ensuring both scalability and reliability.
• Architect secure, highly available web applications with an emphasis on capacity planning and performance enhancement.
• Oversee the deployment and lifecycle management of LLMs and embedding models, establishing KPIs to assess and improve AI application performance.
• Assess and incorporate emerging technologies such as RAG systems, MCP servers, AI Agents, and agentic workflows into our platform.
• Manage core AWS services and GenAI offerings (S3, IAM, EKS, Bedrock, etc.) using infrastructure-as-code and configuration management tools like Terraform and Chef, while ensuring observability and alerting through platforms like New Relic and PagerDuty.
• Collaborate with product, platform, and engineering teams on architecture design, security updates, incident response, and release management to maintain the reliability of our ML and GenAI infrastructure.
• Bachelor’s degree and over 8 years of experience in managing large-scale cloud applications, with a solid foundation in Linux administration and troubleshooting.
• More than 5 years of hands-on experience managing cloud infrastructure across AWS, GCP, and Azure environments.
• A deep understanding of the current generative AI landscape, supplemented by practical experience with LLMs and embedding models (OpenAI, AWS Bedrock, SageMaker); familiarity with vector databases like LanceDB is advantageous.
• Strong scripting capabilities in Bash or Python, and experience with container orchestration platforms such as Amazon EKS or Azure AKS.
• Expertise in DevOps and automation tools like Chef, GitHub Actions, and Rundeck, and in IaC and deployment tooling such as Terraform, Spacelift, and Helm.
• Working knowledge of DNS, load balancers, and MySQL, along with a solid understanding of source control and branching strategies in Git.
• Pioneering Technology: At Coupa, we are leading the way in innovation, using the latest technology to give our customers greater efficiency and visibility into their spending.
• Collaborative Culture: We cherish collaboration and teamwork; our culture is rooted in transparency, openness, and a collective commitment to excellence.
• Global Impact: Become part of a company where your contributions have a worldwide, measurable influence on our clients, the business, and one another.