
Site Reliability Engineer
Posted May 23

This is a fully remote position, open to applicants in Saudi Arabia.
• You will design and uphold infrastructure that is highly available, fault-tolerant, and scalable.
• You will proactively identify and eliminate single points of failure before they escalate into incidents.
• You will ensure our production systems remain stable, even as scale and load increase.
• You will manage and continually enhance workloads across AWS, GCP, or Azure.
• You will utilize Infrastructure as Code (Terraform) to standardize and scale infrastructure.
• You will optimize resource usage to achieve a balance between performance and cost.
• You will operate and scale Kubernetes clusters (EKS, GKE, etc.) with confidence.
• You will quickly troubleshoot issues to ensure smooth deployments and upgrades.
• You will guarantee that our containerized workloads perform reliably at scale.
• You will implement and refine monitoring systems using tools such as Prometheus, Grafana, Datadog, or ELK.
• You will define alerting that is meaningful rather than excessive.
• You will respond to incidents, lead root cause analyses, and ensure lessons are learned from every failure.
• You will write scripts and build tools to eliminate repetitive operational tasks.
• You will continuously enhance infrastructure efficiency through automation.
• You will foster a culture where manual tasks are viewed as a temporary state, not the standard.
• You will collaborate closely with DevOps and engineering teams to address performance bottlenecks.
• You will contribute to improvements in CI/CD and deployment reliability.
• You will help establish reliability best practices across the organization.
• You have approximately 3 years of experience in SRE, DevOps, or infrastructure engineering, and you understand what can fail at scale.
• You are comfortable working in cloud environments such as AWS, GCP, or Azure, and you understand how distributed systems work.
• You have hands-on experience with Kubernetes in production and know how to troubleshoot it when issues arise.
• You do not just resolve issues; you investigate why they occurred and ensure they are not repeated.
• You utilize Terraform (or similar Infrastructure as Code tools) to manage infrastructure.
• You work confidently with Docker and Kubernetes.
• You write scripts in Python, Bash, or similar languages to automate workflows.
• You have a solid understanding of CI/CD pipelines (Jenkins, GitHub Actions, Bitbucket Pipelines, etc.).
• You possess a strong grasp of networking, load balancing, and high-availability design.
• You have implemented tools like Prometheus, Grafana, Datadog, or ELK.
• You distinguish between useful alerts and noise.
• You focus on signals that genuinely drive action.
• You take ownership and do not wait to be informed when something is broken.
• You remain calm under pressure and methodical during incidents.
• You simplify complexity rather than adding to it.
• You communicate clearly, even when discussing complex technical issues.
• You are committed to building systems that enhance the effectiveness of other engineers.
Nice to have (but not required):
• Experience with RabbitMQ or Redis in production.
• Familiarity with Ansible or AWX.
• Exposure to multi-cloud or hybrid environments.
• Cloud certifications (AWS, GCP) or Linux certifications.
• Background from ITI (Information Technology Institute).
Benefits:
• Competitive salary and performance-based bonuses.
• Comprehensive health, dental, and vision insurance.
• Opportunities for professional development and training.
• Flexible working hours and remote work options.
• A supportive and collaborative work environment.