This is a fully remote position, open to applicants in California.

📋 Description

• Lead the Site Reliability Operations team, including the Network Operations Center (NOC), which is responsible for observability, real-time monitoring, incident response, and ensuring operational excellence for vital enterprise services; establish direction, priorities, and success metrics for the team.

• Collaborate with Product Management, Engineering, SRE, and the broader infrastructure team to integrate CI/CD and release best practices into operations, including automated build/test/deploy, health checks, rollbacks, release monitoring via the NOC, and change-management protocols.

• Supervise service reliability monitoring and incident management: guarantee adequate observability (metrics, logs, traces, dashboards), well-calibrated alerting thresholds, escalation pathways, and effective communication with stakeholders and leadership during incidents.

• Take ownership of and enhance the Problem Management function for the team: drive root cause analysis (RCA) for recurring or high-severity incidents, standardize post-incident reviews, and ensure implementation and verification of corrective actions and follow-ups.

• Define, monitor, and report on operational and reliability metrics (e.g., availability, MTTR, incident volume, change failure rate, deployment frequency, problem resolution time); provide regular insights and recommendations to Technology Operations leadership.

• Promote automation and “operations as code” (infrastructure as code, configuration as code, automated runbooks), collaborating with engineering teams to minimize manual toil and enhance consistency, speed, and safety of operations and releases.

• Recruit, mentor, coach, and assess team members; offer performance feedback, make salary and promotion recommendations, and cultivate a high-performing, collaborative culture aligned with Mercury’s core values.

• Provide leadership coverage for 24x7 mission-critical support through the NOC and on-call rotations; ensure sustainable on-call practices, high-quality runbooks, and ongoing enhancement of tools and processes.

⛳️ Requirements

• Minimum: Bachelor’s degree in Computer Science, Information Systems, Engineering, or a related field, or an equivalent combination of education and work experience.

• Minimum: 7 years of experience in IT operations, SRE, DevOps, or similar roles supporting mission-critical systems.

• 3+ years of experience in a lead or management role overseeing technical teams in a 24x7 environment.

• Preferred: Advanced coursework, certifications, or experience in Site Reliability Engineering, DevOps, Cloud platforms, or ITIL.

• Strong understanding of CI/CD pipelines (build, test, security scanning, deployment, rollback) and their role in supporting reliable operations.

• Solid knowledge of observability practices and tools (metrics, logs, traces, dashboards, alerts) and the ability to design actionable monitoring and alerting for production systems.

• Extensive familiarity with incident and problem management processes, including root cause analysis methods and facilitating post-incident reviews.

• Working knowledge of DevOps/SRE concepts such as SLOs/SLIs, error budgets, resilience patterns, automation to minimize toil, and fostering a blameless culture.

• Proven ability to lead and influence cross-functional teams, establish relationships, and effectively collaborate with engineering, InfoSec, infrastructure, and business stakeholders.

• Excellent communication skills, both written and verbal; capable of clearly conveying technical issues, risks, and recommendations to both technical and non-technical audiences, including senior leadership.

• Strong analytical and problem-solving skills; able to assess operational data and trends to identify risks, drive decisions, and prioritize improvements.

• Self-motivated, adaptable, and able to function with minimal supervision in a rapidly changing environment.

• Willingness to work extended hours, nights, or weekends as necessary to support critical releases or address major incidents.

🏝️ Benefits

• Competitive compensation

• Flexibility to work from anywhere in the United States for most positions

• Paid time off (vacation time, sick time, 9 paid company holidays, volunteer hours)

• Incentive bonus programs (potential for holiday bonus, referral bonus, and performance-based bonus)

• Medical, dental, vision, life, and pet insurance

• 401(k) retirement savings plan with company match

• Engaging work environment

• Promotional opportunities

• Education assistance

• Professional and personal development opportunities

• Company recognition program

• Health and wellbeing resources, including free mental wellbeing therapy/coaching sessions, child and eldercare resources, and more

Manager – Site Reliability Operations
