This is a fully remote position, open to applicants in the United Kingdom.

📋 Description

• Take ownership of SLOs, SLIs, and error budgets for all production services; promote error budget discipline throughout the engineering team.

• Create reliability patterns for AI agent pipelines, including LLM observability, tool-use tracking, failure detection, and graceful degradation strategies.

• Design systems to contain blast radius; ensure that agent failures have limited customer impact through isolation, circuit breaking, and swift recovery.

• Advance our Canada Central/West active-active architecture to achieve a 24-hour RTO with complete regional failover capabilities.

• Lead incident response efforts and post-incident reviews that yield lasting solutions; uphold disaster recovery procedures through routine testing.

• Act as the primary reliability contact for Software and AI Engineering, translating requirements into practical standards.

• Collaborate with AI Engineering on compute provisioning, model serving, inference latency, and workload isolation.

• Oversee the CI/CD pipeline strategy (Bitbucket Pipelines, GitHub Actions), establish standards, enhance deployment frequency, and ensure teams can deploy confidently.

• Promote IDP adoption and empower teams in SRE practices, including on-call readiness, SLO definition, runbook development, and self-service tooling.

• Represent reliability considerations in architectural discussions; identify risks before they are committed to design.

• Manage the service catalog as a dynamic inventory of all services, AI agents, dependencies, ownership, and SLOs.

• Use Datadog as the unified view of service health, infrastructure, and agentic pipeline telemetry.

• Develop golden path templates in Backstage and/or Atlassian Compass, enabling teams to deploy reliably without frequent SRE intervention.

• Implement AIOps in Datadog to automate anomaly detection, incident triage, and remediation suggestions.

• Manage infrastructure as code using Terraform and GitOps; enforce IaC policies in collaboration with Trust Assurance.

• Ensure FinOps visibility into AWS cost segments; model how cloud costs change as AI/ML workloads grow.

• Provide formal mentorship to junior and intermediate SREs, overseeing their technical development and career advancement.

• Create AI-assisted automation to systematically reduce toil and enhance the operational capacity of the team.

⛳️ Requirements

• Bachelor's degree in Computer Science, Engineering, or a comparable combination of education and experience.

• 6–8 years of progressive experience in site reliability engineering, platform engineering, or DevOps, with demonstrated technical leadership at the senior individual contributor level.

• Extensive expertise in AWS (EKS, Lambda, CloudWatch, AWS Config) and multi-region architecture patterns.

• Proficiency in Terraform and GitOps; experience with policy-as-code tools (Sentinel, OPA/Rego, or equivalent).

• Practical experience with Datadog at an operational level: dashboards, SLO tracking, alerting, log management, and distributed tracing.

• Strong expertise in containerization technologies: Docker and Kubernetes (EKS preferred).

• Proficiency in Python and/or Bash; experience developing operational tools; solid understanding of Java and Spring Boot microservice architecture to make reliability and deployment decisions for EKS-hosted services.

• Deep knowledge of CI/CD pipeline design and optimization using Bitbucket Pipelines and GitHub Actions.

• Familiarity with IDP tooling (Backstage, Atlassian Compass, or equivalent) is highly preferred.

• Experience with AI/ML workload infrastructure, LLM API integration, or agentic system operations is considered a significant asset.

🏝️ Benefits

• Company-sponsored training and development opportunities.

• Comprehensive benefits package (health, dental, vision, wellness, retirement, annual fitness reimbursement).

• Flexible vacation policy.

• Opportunities for community involvement through charitable alliances.

• Wellness resources and support.

• An inclusive environment that prioritizes diversity, equity, and accessibility.

Senior Site Reliability Engineer – Remote UK

