
Senior Site Reliability Engineer
Posted 6 days ago

This is a fully remote position, open to applicants in Egypt.
• Taking responsibility for the reliability, uptime, and scalability of essential production services 24/7.
• Engaging in the on-call rotation to address incidents, troubleshoot live production challenges, and conduct post-incident reviews.
• Developing comprehensive operational playbooks and escalation processes, and improving Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR).
• Ensuring operational excellence by proactively identifying and mitigating reliability risks through SLO monitoring, chaos testing, and capacity planning.
• Automating operational tasks to reduce the need for manual intervention.
• Designing, implementing, and managing infrastructure across AWS, Oracle Cloud Infrastructure (OCI), and OpenStack environments.
• Optimizing cloud resources to achieve a balance of performance, security, and cost-effectiveness.
• Overseeing Kubernetes clusters (EKS, OKE, Rancher RKE2) to ensure scalability, availability, and performance.
• Managing and optimizing high-performance messaging and caching systems such as Kafka, RabbitMQ, and Redis.
• Administering and optimizing production-grade MySQL and PostgreSQL databases.
• Leading the planning and execution of comprehensive disaster recovery plans.
• Implementing sophisticated observability solutions (Prometheus, Grafana, CloudWatch).
• Driving automation efforts utilizing Terraform, Helm, Jenkins, Tekton, or GitLab CI/CD.
• Incorporating security best practices into both infrastructure and applications.
• Collaborating with cross-functional teams to promote SRE culture and mentor junior engineers.
• A bachelor's or master's degree in computer science, engineering, or a related technical discipline.
• Over 8 years of direct experience in production roles related to SRE, DevOps, or cloud engineering.
• Strong expertise in AWS, OCI, and OpenStack environments.
• In-depth knowledge of Kubernetes ecosystems (EKS, OKE, Rancher RKE2).
• Proven background with Kafka, RabbitMQ, Redis, and distributed messaging and caching systems.
• Solid experience in managing MySQL and PostgreSQL within production settings.
• Advanced scripting and automation capabilities (Python, Bash, Go).
• High proficiency with Helm, Terraform, and contemporary CI/CD toolchains.
• Demonstrable experience in Linux system administration and troubleshooting.
• Must be available during nighttime hours as part of the on-call schedule.
• Competitive salary and bonus structure.
• Unifonic share scheme (we are all owners!).
• 30 days of holiday after the first anniversary.
• Your birthday off!
• The opportunity to work from anywhere in the world for up to 25 days each year!
• Paid leave and support for new parents.
• LinkedIn Learning license.