
Senior Incident Manager
Posted 1 day ago

This is a fully remote position, open to applicants in California.
• Lead the response to critical incidents (SEV-1 / SEV-2) that affect AI infrastructure, GPU clusters, networking, storage, and data center operations.
• Act as the Incident Commander during significant outages, orchestrating efforts among engineering, networking, facilities, and vendor teams.
• Serve as the communication bridge between leadership and external teams during incidents and post-incident scenarios, delivering updates and status reports.
• Take ownership of the entire incident response lifecycle, which includes:
  - Assisting with technical triage
  - Escalation processes
  - Coordination of resources
  - Resolution of incidents
• Ensure that communication with internal stakeholders and leadership is timely and accurate.
• Maintain documentation for incident response and operational playbooks.
• Analyze incidents to recognize patterns and trends that can enhance response strategies and system reliability.
• Participate in an on-call rotation to lead and coordinate incident responses.
• Foster collaboration during outages that span multiple layers of infrastructure.
• Conduct post-incident reviews (PIRs) and root cause analyses, identifying systemic reliability issues and implementing corrective measures.
• Over 8 years of experience in incident management, site reliability engineering, or infrastructure operations.
• Experience managing incidents within large-scale distributed infrastructure environments.
• Strong knowledge of:
  - Data center operations
  - GPU compute clusters
  - Networking and storage infrastructure
  - Cloud or hybrid infrastructure platforms
• Demonstrated capability to lead in high-pressure incident response scenarios.
• Familiarity with incident management frameworks such as ITIL, SRE, or equivalent.
• Exceptional communication and stakeholder management abilities.
• Experience with incident tracking and monitoring tools including:
  - PagerDuty
  - ServiceNow
  - Jira
  - Datadog
  - Prometheus / Grafana
• Comprehensive health, dental, and vision coverage for you and your dependents.
• Wellness and commuter stipends available for select roles.
• 401(k) plan with a 2% company match (for U.S. employees).
• Flexible paid time off plan that is actively utilized by all employees.