This is a fully remote position, open to applicants in the United States.

📋 Description

• Take ownership of SLOs, SLIs, and error budgets for all production services; enforce error budget discipline across engineering teams.

• Create reliability patterns for AI agent pipelines, including LLM observability, tool-use tracking, failure detection, and graceful degradation strategies.

• Design architecture for blast radius containment so that agent failures have limited customer impact, using isolation, circuit breaking, and rapid recovery protocols.

• Enhance our Canada Central/West active-active architecture towards achieving a 24-hour RTO with complete regional failover capabilities.

• Lead incident response efforts and conduct post-incident reviews to implement lasting fixes; maintain disaster recovery procedures through consistent testing.

• Act as the main reliability liaison for Software and AI Engineering, converting requirements into actionable standards.

• Collaborate with AI Engineering on compute provisioning, model serving, inference latency, and workload isolation strategies.

• Oversee CI/CD pipeline strategy using Bitbucket Pipelines and GitHub Actions; establish standards, optimize deployment frequency, and ensure teams can deploy with confidence.

• Promote IDP adoption and empower teams in SRE practices, including on-call readiness, SLO definition, runbook development, and self-service tooling.

• Represent reliability considerations during architectural discussions; identify risks before they become part of the design.

• Maintain the service catalog, which serves as a dynamic inventory of all services, AI agents, dependencies, ownership, and SLOs.

• Use Datadog as a comprehensive view of service health, infrastructure, and agent pipeline telemetry.

• Extend observability to AI workloads, focusing on LLM latency, token consumption, agent completion rates, and pipeline throughput.

• Develop golden path templates in Backstage and/or Atlassian Compass to enable teams to deploy reliably without routine SRE involvement.

• Implement AIOps in Datadog to automate anomaly detection and incident triage and to provide remediation recommendations.

• Manage infrastructure as code utilizing Terraform and GitOps; enforce IaC policies in collaboration with Trust Assurance.

• Oversee FinOps visibility into AWS cost segments; model cloud cost impacts as AI/ML workloads expand.

• Provide formal mentorship to junior and intermediate SREs, taking accountability for their technical development and career advancement.

• Create AI-assisted automation to progressively minimize toil and enhance the operational capacity of the team.

⛳️ Requirements

• Bachelor's degree in Computer Science, Engineering, or a related field, or an equivalent combination of education and experience.

• 6–8 years of progressively responsible experience in site reliability engineering, platform engineering, or DevOps, showcasing technical leadership at the senior individual contributor level.

• In-depth knowledge of AWS (EKS, Lambda, CloudWatch, AWS Config) and multi-region architecture patterns.

• Proficient in Terraform and GitOps, with experience in policy-as-code (Sentinel, OPA/Rego, or similar).

• Hands-on experience with Datadog at an operational level, including dashboards, SLO tracking, alerting, log management, and distributed tracing.

• Strong expertise in containerization technologies, specifically Docker and Kubernetes (EKS preferred).

• Proficiency in Python and/or Bash, with experience in developing operational tooling; a solid enough understanding of Java and Spring Boot microservice architecture to make reliability and deployment decisions for EKS-hosted services.

• Extensive experience in designing and optimizing CI/CD pipelines using Bitbucket Pipelines and GitHub Actions.

• Familiarity with IDP tooling (Backstage, Atlassian Compass, or equivalent) is highly preferred.

• Experience with AI/ML workload infrastructure, LLM API integration, or agentic system operations is considered a significant advantage.

🏝️ Benefits

• Company-sponsored training and development opportunities.

• Comprehensive benefits package including health, dental, vision, wellness, 401K matching, and annual fitness reimbursement.

• Flexible vacation policy.

• Opportunities for community involvement through charitable alliances: https://www.techinsights.com/community-involvement.

• Wellness resources and support available.

• An inclusive environment that prioritizes diversity, equity, and accessibility.

• A high-growth company driven by high performance.

Senior Site Reliability Engineer
