Remotery

Site Reliability Engineer

Posted Jun 20

This is a fully remote position, open to applicants in the United Kingdom.

📋 Description

• Serve as a primary or escalation responder in a 24/7 on-call rotation.

• Lead or assist in Major Incident (MI) response, covering triage, mitigation, and resolution.

• Collaborate with Engineering, Infrastructure, Security, and Product teams.

• Execute and enhance runbooks, playbooks, and escalation protocols.

• Conduct blameless post-incident reviews (PIRs) and track corrective actions to completion.

• Oversee service health monitoring across infrastructure, applications, and dependencies.

• Design and maintain alerting strategies in alignment with SLIs/SLOs.

• Reduce alert fatigue by improving the signal-to-noise ratio of alerts.

• Create dashboards using tools such as Grafana, Prometheus, Datadog, Splunk, and CloudWatch.

• Automate repetitive operational tasks to minimize manual effort.

• Reduce mean time to detect (MTTD) and mean time to resolve (MTTR).

• Develop scripts and tools (in Python, Bash, Go, etc.) to facilitate NOC/SRE workflows.

• Implement self-healing and auto-remediation solutions wherever feasible.

• Collaborate with engineering teams to optimize system design for reliability.

• Provide support and troubleshooting for Linux-based systems, cloud platforms, and Kubernetes/containerized environments.

• Assist in capacity planning and availability assessments.

• Ensure operational readiness for production releases.


⛳️ Requirements

• Proficient in Linux systems administration.

• Experience in incident management and production support.

• Familiarity with cloud infrastructure, preferably AWS.

• Knowledge of containers and orchestration tools (Docker, Kubernetes).

• Experience with monitoring and alerting platforms.

• Scripting or programming skills in Python, Bash, Go, or similar languages.

• Understanding of networking fundamentals (DNS, TCP/IP, load balancing).

• Experience in 24/7 NOC or production operations settings.

• Ability to manage high-pressure incidents with composure and effectiveness.

• Strong written and verbal communication skills for incident coordination.

• Comfortable following runbooks and improving them when necessary.

• Experience in defining or working towards SLOs/SLIs.

• Experience transitioning from a traditional NOC to an SRE model.

• Experience with Infrastructure as Code (Terraform, Ansible, etc.).

• Exposure to security, compliance, or regulated environments.


🏝️ Benefits

• Opportunities for professional development.

• Flexible working hours.

• Option to work from home.
