This is a fully remote position, open to applicants in Egypt.

📋 Description

• Ensure the uptime, performance, and resilience of infrastructure through proactive monitoring, incident management, disaster recovery, and cloud operations in mission-critical environments.

• Utilize tools such as New Relic, Prometheus, and Grafana for infrastructure monitoring.

• Set up and maintain alerts, dashboards, and service health checks.

• Conduct incident management, troubleshooting, and root cause analysis (RCA).

• Guarantee uptime and compliance with Service Level Agreements (SLAs) for all systems.

• Oversee monitoring of CPU, memory, disk, and system processes.

• Manage OS-level operations (Linux/Windows), including patching and tuning.

• Handle system backups and perform routine restoration validations.

• Execute and verify disaster recovery (DR) plans across various environments.

• Conduct failover and failback testing for critical services across on-premises, cloud, and multi-region environments.

• Organize DR drills and simulate outage scenarios.

• Ensure healthy replication and data consistency in collaboration with the DataOps team.

• Update and maintain DR runbooks and incident playbooks.

• Perform capacity planning and optimize performance.

• Keep logs, metrics, and operational documentation up to date.

⛳️ Requirements

• A bachelor's degree in Computer Science, Information Technology, Engineering, or a related field, or equivalent practical experience.

• Demonstrated experience in Systems Operations, Cloud Operations, Infrastructure Support, Site Reliability Engineering (SRE), or a similar role.

• Strong hands-on experience administering Linux and Windows operating systems.

• Familiarity with enterprise monitoring and observability platforms such as New Relic, Prometheus, Grafana, Datadog, or comparable tools.

• Solid understanding of incident management, problem management, and root cause analysis methodologies.

• Experience with cloud platforms like AWS, Azure, or Google Cloud Platform.

• Strong knowledge of backup, disaster recovery, business continuity, and failover processes.

• Experience in managing compute infrastructure, including virtual machines, cloud instances, and physical servers.

• Understanding of system service management (e.g., systemd) and web servers such as Nginx and IIS.

• Knowledge of capacity planning, performance tuning, and infrastructure optimization practices.

• Excellent troubleshooting and analytical skills for resolving complex operational issues.

• Strong communication, documentation, and cross-functional collaboration skills.

• Experience in high-availability, mission-critical production environments is highly preferred.

🏝️ Benefits

• Fully Remote

• Full-time

SysOps Engineer – Monitoring, Cloud Operations
People also viewed

IT Operations Analyst II

Deal Operations

Cloud Operations Manager

Deal Lead – Commercial Strategy & Operations

Operations Analyst – Contractor Role

Sales Analytics and Data Operations Analyst