
Site Reliability Engineer
Posted 22 hours ago

This is a fully remote position, open to applicants in the United Kingdom.
• Design, implement, and maintain robust, scalable, and secure infrastructure that underpins Orion Health's products and services.
• Define and track Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) to ensure platform reliability and enhance customer satisfaction.
• Develop and sustain observability solutions, encompassing monitoring, logging, alerting, and tracing capabilities across cloud environments.
• Engage in incident response activities, which include troubleshooting, root cause analysis, remediation planning, and conducting post-incident reviews.
• Spearhead initiatives aimed at minimizing operational toil through automation, Infrastructure as Code (IaC), and self-service functionalities.
• Collaborate closely with software engineering teams to enhance application reliability, performance, and operational preparedness.
• Identify and address reliability bottlenecks through performance tuning, capacity planning, and system optimization.
• Support infrastructure and platform upgrades while ensuring minimal disruption and sustained service availability.
• Conduct capacity forecasting and scalability planning to align with future business and customer requirements.
• Create operational runbooks, standards, and best practices that bolster system resilience and operational efficiency.
• Advocate for reliability engineering principles and cultivate a culture of continuous improvement across teams.
• Contribute to initiatives related to disaster recovery, business continuity, and platform resilience.
• A minimum of 3 years of experience in Site Reliability Engineering, Platform Engineering, DevOps, Cloud Operations, or Infrastructure Engineering roles.
• Proven experience in supporting and managing production cloud environments.
• Strong background with cloud platforms such as AWS, Azure, or Google Cloud Platform.
• Experience in implementing Infrastructure as Code (IaC) utilizing tools like Terraform, Bicep, ARM, or CloudFormation.
• Familiarity with containerization and orchestration technologies, including Docker and Kubernetes.
• Proven track record in building and maintaining monitoring, logging, and observability solutions.
• Experience in managing production incidents and performing root cause analysis.
• Knowledge of CI/CD pipelines and contemporary software delivery methodologies.
• Proficiency in automation and scripting with tools such as PowerShell, Bash, Python, or similar.
• Understanding of networking, security, high availability, and disaster recovery principles.
• Experience in supporting highly available, customer-facing applications and services.
• Comprehensive health and wellness programs.
• Opportunities for professional development and career growth.
• Flexible work environment with remote work options.
• Collaborative and innovative team culture.
• Competitive salary and performance-based bonuses.