This is a fully remote position, open to applicants in the United Kingdom.

📋 Description

• Oversee Reliability Engineering for User Experience.

• Enhance reliability, scalability, and operational efficiency for essential user-facing systems and services. Boost performance and resilience across APIs, content delivery, feed generation, search, messaging, and real-time experiences.

• Collaborate with product and infrastructure engineering teams to design systems that maintain high availability and performance under significant global demand. Guide architectural decisions regarding failover, redundancy, graceful degradation, traffic management, and capacity planning.

• Identify systemic risks and reliability bottlenecks across services, dependencies, deployments, and infrastructure. Develop proactive mitigation strategies and promote engineering improvements that reduce incidents and improve service health.

• Minimize repetitive operational tasks through automation and tooling. Create systems that enhance deployment safety, incident response, remediation workflows, and reliability safeguards.

• Lead complex incident response efforts across engineering teams. Facilitate blameless postmortems, identify root causes, and ensure that sustainable long-term fixes are implemented.

• Establish and advocate for best practices in reliability engineering, SLIs/SLOs, capacity management, release engineering, and operational maturity throughout the company.

• Offer technical leadership and mentorship to engineers within SRE and software engineering teams. Contribute to shaping a reliability culture and elevate operational excellence across the organization.

⛳️ Requirements

• A minimum of 8 years of experience in Site Reliability Engineering, Infrastructure Engineering, or similar roles managing large-scale distributed systems.

• Exceptional collaboration and communication skills with the capacity to influence technical direction across teams.

• Extensive experience in supporting high-traffic, user-facing production environments.

• Deep knowledge of one or more of the following areas: distributed systems, networking, Linux systems, cloud-native architectures.

• Experience in designing highly available systems with robust operational and reliability practices.

• Strong programming skills in languages such as Go, Python, or similar.

• Strong understanding of observability systems, including metrics, logging, tracing, and alerting.

• Proven experience in enhancing reliability through SLOs, automation, incident management, and performance optimization.

• Demonstrated capability to troubleshoot complex issues across applications, infrastructure, networking, and services.

🏝️ Benefits

• Global benefit programs tailored to your lifestyle, including workspace, professional development, and caregiving support.

• Family planning assistance.

• Gender-affirming care.

• Mental health and coaching benefits.

• Group personal pension scheme with employer match.

• Private medical and dental scheme.

• Income replacement programs.

• Bike to work scheme.

• Flexible vacation and paid volunteer time off.

• Generous paid parental leave.

Staff Site Reliability Engineer – Site Experience
