This is a fully remote position, open to applicants in India.

• Oversee the entire incident lifecycle, including detection, triage, escalation, resolution, and postmortems.

• Serve as the primary command center during significant incidents, coordinating war rooms and updating stakeholders.

• Establish and uphold SLAs/SLOs, incident severity frameworks, and operational runbooks.

• Collaborate effectively with Engineering, ML, and Integrations teams to resolve issues swiftly.

• Monitor system health across various integrations, including agent desks, LLMs, and ASR/TTS pipelines.

• Lead root cause analysis (RCA) efforts and implement preventive measures.

• Enhance observability, alerting mechanisms, and incident management tools.

• Ensure clear communication with both internal teams and customers throughout incidents.

• 3–6 years of experience in Incident Management, Site Reliability Engineering (SRE), or Production Support roles.

• Strong understanding of distributed systems, APIs, and cloud environments, particularly AWS.

• Proficiency with observability tools such as Datadog.

• Familiarity with AI/ML systems, specifically LLM integrations and voice technology stacks (ASR/TTS), is advantageous.

• Experience with monitoring and tracing tools such as Langfuse or comparable platforms.

• Exceptional communication and stakeholder management abilities.

• Ability to stay composed under pressure and guide incidents to a structured resolution.

• Equal opportunity employer dedicated to fostering diversity.

Incident Engineer
