Remotery

Senior Incident Manager

at Lambda | US, California | Full-time | Uncategorized | Senior | $125k – $195k/year

Posted 1 day ago

This is a fully remote position, open to applicants in California.

📋 Description

• Lead the response to critical incidents (SEV-1 / SEV-2) that affect AI infrastructure, GPU clusters, networking, storage, and data center operations.

• Act as the Incident Commander during significant outages, orchestrating efforts among engineering, networking, facilities, and vendor teams.

• Serve as the communication bridge between leadership and external teams during and after incidents, delivering updates and status reports.

• Take ownership of the entire incident response lifecycle, which includes:

  - Assisting with technical triage

  - Escalation processes

  - Coordination of resources

  - Resolution of incidents

• Ensure that communication with internal stakeholders and leadership is timely and accurate.

• Maintain documentation for incident response and operational playbooks.

• Analyze incidents to recognize patterns and trends that can enhance response strategies and system reliability.

• Participate in an on-call rotation to lead and coordinate incident responses.

• Foster collaboration during outages that span multiple layers of infrastructure.

• Conduct post-incident reviews (PIRs) and root cause analyses, identifying systemic reliability issues and implementing corrective measures.


⛳️ Requirements

• Over 8 years of experience in incident management, site reliability engineering, or infrastructure operations.

• Experience managing incidents within large-scale distributed infrastructure environments.

• Strong knowledge of:

  - Data center operations

  - GPU compute clusters

  - Networking and storage infrastructure

  - Cloud or hybrid infrastructure platforms

• Demonstrated capability to lead in high-pressure incident response scenarios.

• Familiarity with incident management frameworks such as ITIL, SRE, or equivalent.

• Exceptional communication and stakeholder management abilities.

• Experience with incident tracking and monitoring tools including:

  - PagerDuty

  - ServiceNow

  - Jira

  - Datadog

  - Prometheus / Grafana


🏝️ Benefits

• Comprehensive health, dental, and vision coverage for you and your dependents.

• Wellness and commuter stipends available for select roles.

• 401(k) plan with a 2% company match (for US employees).

• Flexible paid time off plan that is actively utilized by all employees.
