This is a fully remote position, open to applicants in the United Kingdom.

📋 Description

• Serve as a primary or escalation responder in a 24/7 on-call rotation.

• Lead or assist in Major Incident (MI) response, covering triage, mitigation, and resolution.

• Collaborate with Engineering, Infrastructure, Security, and Product teams.

• Execute and enhance runbooks, playbooks, and escalation protocols.

• Conduct blameless post-incident reviews (PIRs) and track corrective actions through to completion.

• Oversee service health monitoring across infrastructure, applications, and dependencies.

• Design and maintain alerting strategies in alignment with SLIs/SLOs.

• Reduce alert fatigue by improving the signal-to-noise ratio of alerts.

• Create dashboards using tools such as Grafana, Prometheus, Datadog, Splunk, and CloudWatch.

• Automate repetitive operational tasks to minimize manual effort.

• Reduce mean time to detect (MTTD) and mean time to resolve (MTTR).

• Develop scripts and tools (in Python, Bash, Go, etc.) to facilitate NOC/SRE workflows.

• Implement self-healing and auto-remediation solutions wherever feasible.

• Collaborate with engineering teams to optimize system design for reliability.

• Provide support and troubleshooting for Linux-based systems, cloud platforms, and Kubernetes/containerized environments.

• Assist in capacity planning and availability assessments.

• Ensure operational readiness for production releases.

⛳️ Requirements

• Proficient in Linux systems administration.

• Experience in incident management and production support.

• Familiarity with cloud infrastructure, preferably AWS.

• Knowledge of containers and orchestration tools (Docker, Kubernetes).

• Experience with monitoring and alerting platforms.

• Scripting or programming skills in Python, Bash, Go, or similar languages.

• Understanding of networking fundamentals (DNS, TCP/IP, load balancing).

• Experience in 24/7 NOC or production operations settings.

• Ability to manage high-pressure incidents with composure and effectiveness.

• Strong written and verbal communication skills for incident coordination.

• Comfortable following runbooks and improving them when necessary.

• Experience in defining or working towards SLOs/SLIs.

• Experience transitioning from a traditional NOC to an SRE model.

• Experience with Infrastructure as Code (Terraform, Ansible, etc.).

• Exposure to security, compliance, or regulated environments.

🏝️ Benefits

• Opportunities for professional development.

• Flexible working hours.

• Option to work from home.

Site Reliability Engineer
