This is a fully remote position, open to applicants in Egypt.

📋 Description

• Owning the reliability, uptime, and scalability of essential production services around the clock.

• Participating in the on-call rotation to respond to incidents, troubleshoot live production issues, and conduct post-incident reviews.

• Developing comprehensive operational playbooks and escalation processes, and reducing Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR).

• Ensuring operational excellence by proactively identifying and mitigating reliability risks through SLO monitoring, chaos testing, and capacity planning.

• Automating operational tasks to reduce the need for manual intervention.

• Designing, implementing, and managing infrastructure across AWS, Oracle Cloud Infrastructure (OCI), and OpenStack environments.

• Optimizing cloud resources to achieve a balance of performance, security, and cost-effectiveness.

• Overseeing Kubernetes clusters (EKS, OKE, Rancher RKE2) to ensure scalability, availability, and performance.

• Managing and optimizing high-performance messaging and caching systems such as Kafka, RabbitMQ, and Redis.

• Administering and optimizing production-grade MySQL and PostgreSQL databases.

• Leading the planning and execution of comprehensive disaster recovery.

• Implementing sophisticated observability solutions (Prometheus, Grafana, CloudWatch).

• Driving automation efforts utilizing Terraform, Helm, Jenkins, Tekton, or GitLab CI/CD.

• Incorporating security best practices into both infrastructure and applications.

• Collaborating with cross-functional teams to promote SRE culture and mentor junior engineers.

⛳️ Requirements

• A bachelor's or master's degree in computer science, engineering, or a related technical discipline.

• More than 8 years of hands-on experience in production SRE, DevOps, or cloud engineering roles.

• Strong expertise in AWS, OCI, and OpenStack environments.

• In-depth knowledge of Kubernetes ecosystems (EKS, OKE, Rancher RKE2).

• Proven background with Kafka, RabbitMQ, Redis, and distributed messaging and caching systems.

• Solid experience in managing MySQL and PostgreSQL within production settings.

• Advanced scripting and automation capabilities (Python, Bash, Go).

• High proficiency with Helm, Terraform, and contemporary CI/CD toolchains.

• Demonstrable experience in Linux system administration and troubleshooting.

• Must be available during nighttime hours as part of the on-call schedule.

🏝️ Benefits

• Competitive salary and bonus structure.

• Unifonic share scheme (we are all owners!).

• 30 days of holiday after the first anniversary.

• Your birthday off!

• The opportunity to work from anywhere in the world for up to 25 days each year!

• Paid leave and support for new parents.

• LinkedIn Learning license.

Senior Site Reliability Engineer
