
Site Reliability Engineer
Posted Jun 20

This is a fully remote position, open to applicants in the United Kingdom.
• Serve as a primary or escalation responder in a 24/7 on-call rotation.
• Lead or assist in Major Incident (MI) response, covering triage, mitigation, and resolution.
• Collaborate with Engineering, Infrastructure, Security, and Product teams.
• Execute and enhance runbooks, playbooks, and escalation protocols.
• Conduct blameless post-incident reviews (PIRs) and track corrective actions to completion.
• Oversee service health monitoring across infrastructure, applications, and dependencies.
• Design and maintain alerting strategies in alignment with SLIs/SLOs.
• Reduce alert fatigue by improving the signal-to-noise ratio of alerts.
• Create dashboards using tools such as Grafana, Prometheus, Datadog, Splunk, and CloudWatch.
• Automate repetitive operational tasks to minimize manual effort.
• Reduce mean time to detect (MTTD) and mean time to resolve (MTTR).
• Develop scripts and tools (in Python, Bash, Go, etc.) to facilitate NOC/SRE workflows.
• Implement self-healing and auto-remediation solutions wherever feasible.
• Collaborate with engineering teams to optimize system design for reliability.
• Provide support and troubleshooting for Linux-based systems, cloud platforms, and Kubernetes/containerized environments.
• Assist in capacity planning and availability assessments.
• Ensure operational readiness for production releases.
• Proficiency in Linux systems administration.
• Experience in incident management and production support.
• Familiarity with cloud infrastructure, preferably AWS.
• Knowledge of containers and orchestration tools (Docker, Kubernetes).
• Experience with monitoring and alerting platforms.
• Scripting or programming skills in Python, Bash, Go, or similar languages.
• Understanding of networking fundamentals (DNS, TCP/IP, load balancing).
• Experience in 24/7 NOC or production operations settings.
• Ability to manage high-pressure incidents with composure and effectiveness.
• Strong written and verbal communication skills for incident coordination.
• Comfort working from runbooks and improving them when necessary.
• Experience in defining or working towards SLOs/SLIs.
• Previous experience transitioning from a traditional NOC to an SRE model.
• Experience with Infrastructure as Code (Terraform, Ansible, etc.).
• Exposure to security, compliance, or regulated environments.
• Opportunities for professional development.
• Flexible working hours.
• Option to work from home.