This is a fully remote position, open to applicants in the Philippines.

📋 Description

• Oversee Production Systems: Use monitoring tools (e.g., Cloud Monitoring) to maintain the health and performance of cloud-based production systems on Google Cloud Platform (GCP).

• Incident Management: Address production incidents, prioritize issues, and ensure prompt resolution. Conduct root cause analysis (RCA) and document findings.

• Performance Optimization: Evaluate system performance, identify constraints, and provide recommendations for enhancements to improve service reliability, scalability, and speed.

• System Alerts and Incident Escalation: Establish and manage system alerts to proactively identify issues. Escalate critical problems to the appropriate teams and ensure rapid resolution.

• Collaboration with Engineering: Work closely with development and operations teams to ensure seamless production releases, offer feedback on system performance, and implement monitoring solutions for new services.

• System Documentation: Keep documentation updated regarding system configurations, monitoring setups, and incident resolutions to foster knowledge-sharing practices across teams.

• Service Level Agreements (SLAs): Monitor and report on SLA performance, ensuring that production services align with established availability and reliability standards.

• Proactive System Health Checks: Perform routine system health evaluations, reviewing logs and performance metrics, to ensure system uptime.

• Disaster Recovery and Backup: Oversee backup systems and confirm that disaster recovery procedures are established and tested.

⛳️ Requirements

• Minimum of 3 years of experience in cloud production support, Site Reliability Engineering, or System Reliability roles.

• At least 3 years of hands-on experience with Google Cloud Platform (GCP), encompassing Compute Engine, GKE, Cloud Monitoring, Logging, and Storage.

• A minimum of 3 years of experience utilizing monitoring and observability tools to assess system health and performance.

• Over 3 years of experience with system performance metrics (CPU, memory, disk, network) and issue diagnosis.

• At least 3 years of experience managing incidents and troubleshooting live production systems.

• A minimum of 3 years of experience in scripting or automation using Bash, Python, or similar languages.

• Strong knowledge of VoIP and UC technologies, including SIP, RTP/SRTP, WebRTC, SBCs (Ribbon, Oracle, AudioCodes), SIP trunks, gateways, and voice codecs (G.711, G.729).

• Demonstrated ability to troubleshoot IP telephony and real-time communications using tools such as Wireshark and network analyzers.

• Solid grasp of network fundamentals (TCP/IP, VLANs, routing, switching, QoS) and voice security best practices (TLS, SRTP, firewalls).

• Experience in integrating voice, contact center (ACD/IVR), and UC platforms within cloud-native and hybrid environments.

• Proficiency in automation and scripting for voice and system management (Python, Bash, PowerShell).

• Familiarity with observability and monitoring tools (Prometheus, Grafana, Zabbix, Elastic Stack).

• Hands-on experience with network and VoIP analysis tools such as Netscout NG1 and Wireshark.

• Knowledge of automation and CI/CD tools (Ansible, n8n, Jenkins, GitLab CI/CD).

• Exposure to multi-cloud environments (AWS, Azure).

• Certifications (Preferred): CCNA (Collaboration) or CompTIA Network+; cloud certifications (GCP, AWS, or Azure).

🏝️ Benefits

• Work From Home Set-up

• Night Shift (8PM to 5AM), rotating weekend shifts

Site Reliability Engineer, GCP