Remotery

Senior Site Reliability Engineer

Posted 5 days ago

📋 Description

• Take ownership of SLOs, SLIs, and error budgets for all production services; promote error budget discipline throughout engineering.

• Create reliability patterns for AI agent pipelines, including LLM observability, tool-use tracking, failure detection, and graceful degradation.

• Design architectures that contain blast radius — ensure agent failures have limited customer impact through isolation, circuit breaking, and rapid recovery mechanisms.

• Advance our Canada Central/West active-active architecture towards a 24-hour RTO with complete regional failover capabilities.

• Lead incident response efforts and post-incident evaluations to implement lasting solutions; uphold disaster recovery procedures through consistent testing.

• Act as the main reliability liaison for Software and AI Engineering, converting requirements into practical standards.

• Collaborate with AI Engineering on compute provisioning, model serving, inference latency, and workload isolation strategies.

• Manage CI/CD pipeline strategy (Bitbucket Pipelines, GitHub Actions) — establish standards, enhance deployment frequency, and ensure teams can deploy with confidence.

• Promote IDP adoption and empower teams in SRE practices, including on-call readiness, SLO definition, runbook development, and self-service tooling.

• Represent reliability in architectural discussions, identifying risks before they are integrated into design.

• Utilize Datadog as the central tool for monitoring service health, infrastructure, and agentic pipeline telemetry.

• Expand observability for AI workloads, focusing on LLM latency, token consumption, agent completion rates, and pipeline throughput.

• Create golden path templates in Backstage and/or Atlassian Compass to ensure teams can deliver reliably without regular SRE involvement.

• Oversee infrastructure as code using Terraform and GitOps; enforce IaC policies in collaboration with Trust Assurance.

• Manage FinOps visibility into AWS cost segments; analyze cloud cost implications as AI/ML workloads grow.

• Provide formal mentorship to junior and intermediate SRE engineers, overseeing their technical development and career advancement.

• Develop AI-assisted automation to progressively decrease manual toil and enhance the team's operational capacity.


⛳️ Requirements

• Bachelor's degree in Computer Science, Engineering, or a comparable combination of education and experience.

• 6–8 years of extensive experience in site reliability engineering, platform engineering, or DevOps, demonstrating technical leadership at the senior individual contributor level.

• In-depth knowledge of AWS (EKS, Lambda, CloudWatch, AWS Config) and multi-region architecture patterns.

• Proficiency in Terraform and GitOps; experience with policy-as-code (Sentinel, OPA/Rego, or equivalent).

• Practical experience with Datadog at an operational level: dashboards, SLO tracking, alerting, log management, and distributed tracing.

• Strong expertise in containerization technologies: Docker, Kubernetes (EKS preferred).

• Proficient in Python and/or Bash, with experience building operational tools; solid understanding of Java and Spring Boot microservice architecture, sufficient to make reliability and deployment decisions for EKS-hosted services.

• Extensive expertise in designing and optimizing CI/CD pipelines using Bitbucket Pipelines and GitHub Actions.

• Familiarity with IDP tooling (Backstage, Atlassian Compass, or equivalent) is strongly preferred.

• Experience with AI/ML workload infrastructure, LLM API integration, or agentic system operations is considered a significant asset.


🏝️ Benefits

• Company-sponsored training and development opportunities.

• Comprehensive benefits package (health, wellness, life insurance, fitness, English classes).

• Flexible vacation policy.

• Opportunities for community involvement through charitable partnerships.

• Wellness resources and support available.

• An inclusive environment that prioritizes diversity, equity, and accessibility.

• A high-growth company driven by high performance.

