
Incident Engineer
Posted May 22

This is a fully remote position, open to applicants in India.
• Oversee the entire incident lifecycle, including detection, triage, escalation, resolution, and postmortems.
• Serve as the primary point of command during significant incidents, coordinating war rooms and keeping stakeholders updated.
• Establish and uphold SLAs/SLOs, incident severity frameworks, and operational runbooks.
• Collaborate effectively with Engineering, ML, and Integrations teams to resolve issues swiftly.
• Monitor system health across various integrations, including agent desks, LLMs, and ASR/TTS pipelines.
• Lead root cause analysis (RCA) efforts and implement preventive measures.
• Enhance observability, alerting mechanisms, and incident management tools.
• Ensure clear communication with both internal teams and customers throughout incidents.
• 3–6 years of experience in Incident Management, Site Reliability Engineering (SRE), or Production Support roles.
• Strong understanding of distributed systems, APIs, and cloud environments, particularly AWS.
• Proficiency with observability tools such as Datadog.
• Familiarity with AI/ML systems, specifically LLM integrations and voice technology stacks (ASR/TTS), is advantageous.
• Experience with monitoring and tracing tools such as Langfuse or similar.
• Exceptional communication and stakeholder management abilities.
• Capability to maintain composure under pressure and guide structured resolutions.
• Equal opportunity employer dedicated to fostering diversity.