This is a fully remote position, open to applicants in Saudi Arabia.

📋 Description

• You will design and maintain infrastructure that is highly available, fault-tolerant, and scalable.

• You will proactively identify and eliminate single points of failure before they escalate into incidents.

• You will ensure our production systems remain stable, even as scale and load increase.

• You will manage and continually enhance workloads across AWS, GCP, or Azure.

• You will use Infrastructure as Code (Terraform) to standardize and scale infrastructure.

• You will optimize resource usage to achieve a balance between performance and cost.

• You will operate and scale Kubernetes clusters (EKS, GKE, etc.) with confidence.

• You will quickly troubleshoot issues to ensure smooth deployments and upgrades.

• You will guarantee that our containerized workloads perform reliably at scale.

• You will implement and refine monitoring systems using tools such as Prometheus, Grafana, Datadog, or ELK.

• You will define alerting that is meaningful rather than excessive.

• You will respond to incidents, lead root cause analyses, and ensure lessons are learned from every failure.

• You will write scripts and build tools to eliminate repetitive operational tasks.

• You will continuously enhance infrastructure efficiency through automation.

• You will foster a culture where manual tasks are viewed as a temporary state, not the standard.

• You will collaborate closely with DevOps and engineering teams to address performance bottlenecks.

• You will contribute to improvements in CI/CD and deployment reliability.

• You will help establish reliability best practices across the organization.

⛳️ Requirements

• You have approximately 3 years of experience in SRE, DevOps, or infrastructure engineering, and you understand what can fail at scale.

• You are comfortable working in cloud environments such as AWS, GCP, or Azure, and you understand how distributed systems work.

• You have hands-on experience with Kubernetes in production and know how to troubleshoot it when issues arise.

• You do not just resolve issues; you investigate why they occurred and ensure they are not repeated.

• You use Terraform (or a similar Infrastructure as Code tool) to manage infrastructure.

• You work confidently with Docker and Kubernetes.

• You write scripts in Python, Bash, or similar languages to automate workflows.

• You have a solid understanding of CI/CD pipelines (Jenkins, GitHub Actions, Bitbucket Pipelines, etc.).

• You possess a strong grasp of networking, load balancing, and high-availability design.

• You have implemented tools like Prometheus, Grafana, Datadog, or ELK.

• You distinguish between useful alerts and noise.

• You focus on signals that genuinely drive action.

• You take ownership and do not wait to be told that something is broken.

• You remain calm under pressure and methodical during incidents.

• You simplify complexity rather than adding to it.

• You communicate clearly, even when discussing complex technical issues.

• You are committed to building systems that enhance the effectiveness of other engineers.

Nice to have (but not required):

• Experience with RabbitMQ or Redis in production.

• Familiarity with Ansible or AWX.

• Exposure to multi-cloud or hybrid environments.

• Cloud certifications (AWS, GCP) or Linux certifications.

• Background from ITI (Information Technology Institute).

🏝️ Benefits

• Competitive salary and performance-based bonuses.

• Comprehensive health, dental, and vision insurance.

• Opportunities for professional development and training.

• Flexible working hours and remote work options.

• A supportive and collaborative work environment.

Site Reliability Engineer
