
SysOps Engineer – Monitoring, Cloud Operations
Posted Jun 12

This is a fully remote position, open to applicants in Egypt.
• Ensure the uptime, performance, and resilience of infrastructure through proactive monitoring, incident management, disaster recovery, and cloud operations in mission-critical environments.
• Utilize tools such as New Relic, Prometheus, and Grafana for infrastructure monitoring.
• Set up and maintain alerts, dashboards, and service health checks.
• Conduct incident management, troubleshooting, and root cause analysis (RCA).
• Guarantee uptime and compliance with Service Level Agreements (SLAs) for all systems.
• Oversee monitoring of CPU, memory, disk, and system processes.
• Manage OS-level operations (Linux/Windows), including patching and tuning.
• Handle system backups and perform routine restoration validations.
• Execute and verify disaster recovery (DR) plans across various environments.
• Conduct failover and failback testing for critical services (on-prem, cloud, and multi-region).
• Organize DR drills and simulate outage scenarios.
• Ensure the health of replication and data consistency in collaboration with DataOps.
• Update and maintain DR runbooks and incident playbooks.
• Perform capacity planning and optimize performance.
• Keep logs, metrics, and operational documentation up to date.
• A bachelor's degree in Computer Science, Information Technology, Engineering, or a related field, or equivalent practical experience.
• Demonstrated experience in Systems Operations, Cloud Operations, Infrastructure Support, Site Reliability Engineering (SRE), or a similar role.
• Strong hands-on experience administering Linux and Windows operating systems.
• Familiarity with enterprise monitoring and observability platforms such as New Relic, Prometheus, Grafana, Datadog, or comparable tools.
• Solid comprehension of incident management, problem management, and root cause analysis methodologies.
• Experience with cloud platforms like AWS, Azure, or Google Cloud Platform.
• Strong knowledge of backup, disaster recovery, business continuity, and failover processes.
• Experience in managing compute infrastructure, including virtual machines, cloud instances, and physical servers.
• Understanding of system services and web servers such as systemd, Nginx, and IIS.
• Knowledge of capacity planning, performance tuning, and infrastructure optimization practices.
• Excellent troubleshooting and analytical abilities to resolve complex operational issues.
• Strong communication, documentation, and cross-functional collaboration skills.
• Experience in high-availability, mission-critical production environments is highly preferred.
• Fully Remote
• Full-time