This is a fully remote position, open to applicants in Canada.

📋 Description

• Manage the daily activities of the SRE practice, including team planning, shift assignments, escalation routing, and workload distribution.

• Maintain a robust on-call program by defining rotation rules, monitoring fatigue, ensuring coverage, and continuously enhancing response capabilities.

• Oversee incident management procedures to guarantee consistent triage, high-quality postmortems, and effective follow-through on remediation efforts.

• Establish operational KPIs for the team (MTTA, MTTR, on-call load, ticket aging, toil reduction) and promote accountability.

• Mentor and develop SREs at all levels through individual meetings, technical advice, and structured development plans.

• Ensure that the team’s processes, documentation, and runbooks are up to date and regularly audited.

• Provide architectural guidance on resilience, observability, and reliability patterns; step in directly when the team is blocked or when customer-impacting work requires senior technical insight.

• Validate SLIs/SLOs and error budgets across services, ensuring consistent application and reporting.

• Review and authorize reliability design initiatives, including monitoring strategies, automation projects, CI/CD modifications, deployment safety measures, and cloud cost/performance optimizations.

• Engage in high-severity incidents, serving as an escalation point and technical lead when necessary.

• Ensure engineering excellence in IaC, CI/CD, observability instrumentation, and Kubernetes platform operations.

• Act as the primary liaison for internal stakeholders (Dev, Product, Architecture, Cloud) regarding reliability strategy and prioritization.

• Translate business objectives into reliability roadmaps, staffing strategies, and operational enhancements.

• Align teams around shared reliability goals, ensuring that corrective actions, automation priorities, and capacity planning are effectively executed.

• Support customer-facing discussions when reliability posture, operational processes, or technical improvements necessitate leadership representation.

⛳️ Requirements

• 6–10 years of experience in SRE/Operations/Platform roles, with a minimum of 2 years in a leadership or management position.

• Hands-on technical expertise across cloud platforms (AWS/Azure/GCP) and Kubernetes.

• Proven experience in defining and operating SLIs/SLOs, incident response, and postmortem initiatives.

• Strong foundation in Terraform or similar Infrastructure as Code (IaC), CI/CD systems, and observability tools (Prometheus, Grafana, OpenTelemetry, ELK).

• Ability to evaluate technical work, mentor engineers through complex challenges, and make informed trade-offs under pressure.

• Excellent operational judgment in triage, prioritization, team load balancing, and process design.

• Professional-level cloud provider certification: AWS (Solutions Architect), Azure (Solutions Architect Expert), GCP (Professional Cloud Architect), or Oracle Cloud (Architect Professional).

🏝️ Benefits

• Opportunity to work in a dynamic and innovative environment.

• Competitive salary and comprehensive benefits package.

• Professional development and growth opportunities.

• Collaborative and supportive team culture.

Technical Manager
