This is a fully remote position, open to applicants in the United Kingdom.

📋 Description

• Design, implement, and maintain robust, scalable, and secure infrastructure that underpins Orion Health's products and services.

• Define and track Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) to ensure platform reliability and enhance customer satisfaction.

• Develop and sustain observability solutions, encompassing monitoring, logging, alerting, and tracing capabilities across cloud environments.

• Engage in incident response activities, which include troubleshooting, root cause analysis, remediation planning, and conducting post-incident reviews.

• Spearhead initiatives to reduce operational toil through automation, Infrastructure as Code (IaC), and self-service capabilities.

• Collaborate closely with software engineering teams to enhance application reliability, performance, and operational preparedness.

• Identify and address reliability bottlenecks through performance tuning, capacity planning, and system optimization.

• Support infrastructure and platform upgrades while ensuring minimal disruption and sustained service availability.

• Conduct capacity forecasting and scalability planning to align with future business and customer requirements.

• Create operational runbooks, standards, and best practices that bolster system resilience and operational efficiency.

• Advocate for reliability engineering principles and cultivate a culture of continuous improvement across teams.

• Contribute to initiatives related to disaster recovery, business continuity, and platform resilience.

⛳️ Requirements

• A minimum of 3 years of experience in Site Reliability Engineering, Platform Engineering, DevOps, Cloud Operations, or Infrastructure Engineering roles.

• Proven experience in supporting and managing production cloud environments.

• Strong background in cloud platforms such as AWS, Azure, or Google Cloud Platform.

• Experience implementing Infrastructure as Code (IaC) using tools such as Terraform, Bicep, ARM templates, or CloudFormation.

• Familiarity with containerization and orchestration technologies, including Docker and Kubernetes.

• Proven track record in building and maintaining monitoring, logging, and observability solutions.

• Experience in managing production incidents and performing root cause analysis.

• Knowledge of CI/CD pipelines and modern software delivery practices.

• Proficiency in automation and scripting with tools such as PowerShell, Bash, Python, or similar.

• Understanding of networking, security, high availability, and disaster recovery principles.

• Experience in supporting highly available, customer-facing applications and services.

🏝️ Benefits

• Comprehensive health and wellness programs.

• Opportunities for professional development and career growth.

• Flexible work environment with remote work options.

• Collaborative and innovative team culture.

• Competitive salary and performance-based bonuses.

Site Reliability Engineer
