
Technical Manager
Posted 1 day ago

• Manage the daily activities of the SRE practice, including team planning, shift assignments, escalation routing, and workload distribution.
• Maintain a robust on-call program by defining rotation rules, monitoring fatigue, ensuring coverage, and continuously enhancing response capabilities.
• Oversee incident management procedures to guarantee consistent triage, high-quality postmortems, and effective follow-through on remediation efforts.
• Establish operational KPIs for the team (MTTA, MTTR, on-call load, ticket aging, toil reduction) and promote accountability.
• Mentor and develop SREs at all levels through one-on-one meetings, technical advice, and structured development plans.
• Ensure that the team’s processes, documentation, and runbooks are up to date and properly audited.
• Provide architectural guidance on resilience, observability, and reliability patterns; directly intervene when the team faces obstacles or when customer-impacting work requires senior technical insight.
• Validate SLIs/SLOs and error budgets across services, ensuring consistent application and reporting.
• Review and authorize reliability design initiatives, including monitoring strategies, automation projects, CI/CD modifications, deployment safety measures, and cloud cost/performance optimizations.
• Engage in high-severity incidents, serving as an escalation point and technical lead when necessary.
• Ensure engineering excellence in IaC, CI/CD, observability instrumentation, and Kubernetes platform operations.
• Act as the primary liaison for internal stakeholders (Dev, Product, Architecture, Cloud) regarding reliability strategy and prioritization.
• Translate business objectives into reliability roadmaps, staffing strategies, and operational enhancements.
• Align teams around shared reliability goals, ensuring that corrective actions, automation priorities, and capacity planning are effectively executed.
• Support customer-facing discussions when reliability posture, operational processes, or technical improvements necessitate leadership representation.
• 6–10 years of experience in SRE/Operations/Platform roles, with a minimum of 2 years in a leadership or management position.
• Hands-on technical expertise across cloud platforms (AWS/Azure/GCP) and Kubernetes.
• Proven experience in defining and operating SLIs/SLOs, incident response, and postmortem initiatives.
• Strong foundation in Terraform or similar Infrastructure as Code (IaC), CI/CD systems, and observability tools (Prometheus, Grafana, OpenTelemetry, ELK).
• Ability to evaluate technical work, mentor engineers through complex challenges, and make informed trade-offs under pressure.
• Excellent operational judgment in triage, prioritization, team load balancing, and process design.
• Cloud provider certification: Professional-level certification in AWS (Solutions Architect), Azure (Solutions Architect Expert), GCP (Professional Cloud Architect), or Oracle Cloud (Architect Professional).
• Opportunity to work in a dynamic and innovative environment.
• Competitive salary and comprehensive benefits package.
• Professional development and growth opportunities.
• Collaborative and supportive team culture.