
Site Reliability Engineer, GCP
Posted May 19

This is a fully remote position, open to applicants in the Philippines.
• Oversee Production Systems: Utilize monitoring tools (e.g., Cloud Monitoring) to guarantee the health and performance of cloud-based production systems on Google Cloud Platform (GCP).
• Incident Management: Address production incidents, prioritize issues, and ensure prompt resolution. Conduct root cause analysis (RCA) and document findings.
• Performance Optimization: Evaluate system performance, identify constraints, and provide recommendations for enhancements to improve service reliability, scalability, and speed.
• System Alerts and Incident Escalation: Establish and manage system alerts to proactively identify issues. Escalate critical problems to the appropriate teams and ensure rapid resolution.
• Collaboration with Engineering: Collaborate closely with development and operations teams to ensure seamless production releases, offer feedback on system performance, and implement monitoring solutions for new services.
• System Documentation: Keep documentation updated regarding system configurations, monitoring setups, and incident resolutions to foster knowledge-sharing practices across teams.
• Service Level Agreements (SLAs): Monitor and report on SLA performance, ensuring that production services align with established availability and reliability standards.
• Proactive System Health Checks: Perform routine system health evaluations, reviewing logs and performance metrics, to ensure system uptime.
• Disaster Recovery and Backup: Oversee backup systems and confirm that disaster recovery procedures are established and tested.
• Minimum of 3 years of experience in cloud production support, Site Reliability Engineering, or System Reliability roles.
• At least 3 years of hands-on experience with Google Cloud Platform (GCP), encompassing Compute Engine, GKE, Cloud Monitoring, Logging, and Storage.
• A minimum of 3 years of experience utilizing monitoring and observability tools to assess system health and performance.
• Over 3 years of experience with system performance metrics (CPU, memory, disk, network) and issue diagnosis.
• At least 3 years of experience managing incidents and troubleshooting live production systems.
• A minimum of 3 years of experience in scripting or automation using Bash, Python, or similar languages.
• Strong knowledge of VoIP and UC technologies, including SIP, RTP/SRTP, WebRTC, SBCs (Ribbon, Oracle, AudioCodes), SIP trunks, gateways, and voice codecs (G.711, G.729).
• Demonstrated ability to troubleshoot IP telephony and real-time communications using tools such as Wireshark and network analyzers.
• Solid grasp of network fundamentals (TCP/IP, VLANs, routing, switching, QoS) and voice security best practices (TLS, SRTP, firewalls).
• Experience in integrating voice, contact center (ACD/IVR), and UC platforms within cloud-native and hybrid environments.
• Proficiency in automation and scripting for voice and system management (Python, Bash, PowerShell).
• Familiarity with observability and monitoring tools (Prometheus, Grafana, Zabbix, Elastic Stack).
• Hands-on experience with network and VoIP analysis tools such as Netscout NG1 and Wireshark.
• Knowledge of automation and CI/CD tools (Ansible, N8N, Jenkins, GitLab CI/CD).
• Exposure to multi-cloud environments (AWS, Azure).
• Certifications (Preferred): CCNA (Collaboration) or CompTIA Network+; cloud certifications (GCP, AWS, or Azure).
• Work From Home Set-up
• Night Shift (8PM to 5AM), rotating weekend shifts
Advanced Solutions International, Inc.