
Senior Reliability Operations Engineer – Mexico
Posted May 23

This is a fully remote position, open to applicants in Mexico.
• Act as the primary incident lead during daytime hours in your region, facilitating technical investigations, centralizing communication, and engaging relevant engineering and SRE teams when escalations are necessary.
• Address escalations from Tier 1 support by utilizing runbooks, metrics, logs, and system diagnostics to investigate and resolve issues or decide when to escalate to Tier 3.
• Create and update runbooks, workflows, and operational documentation to ensure consistent and reliable responses to recurring issues, collaborating with product teams to gradually expand coverage.
• Develop, maintain, and improve automation scripts and tools that streamline common remediation steps, shorten response times, and reduce manual operational workload.
• Employ metrics, logs, and tracing tools (Grafana/Prometheus, GCP Monitoring, OpenTelemetry) to proactively identify issues, validate system behavior, and support continuous improvement of detection methods.
• Serve as the central communication point during active incidents, ensuring timely updates and proper routing to the appropriate product engineering and SRE stakeholders.
• Work alongside reliability and product teams to share insights, suggest improvements, and help refine processes that boost the stability and operability of our systems.
• Engage in a shared weekend on-call rotation to maintain operational coverage for production systems, responding to incidents and escalations as required and coordinating with engineering teams when issues arise.
• Assist in establishing operational best practices, refining workflows, and laying the groundwork for a more extensive reliability operations function.
• Bachelor’s degree in Computer Science, Information Technology, Engineering, or equivalent practical experience.
• Over 5 years of professional experience in Reliability Operations, Site Reliability Engineering, DevOps, IT Operations, or a related technical support role.
• Proven experience in leading or participating in Tier 2 or Tier 3 technical investigations, including triage, log analysis, and structured escalation.
• Experience in supporting distributed systems, cloud-hosted services, or production operational environments.
• Practical experience in incident response processes.
• Strong proficiency in Linux, including system navigation, log reviews, and diagnostics.
• Experience in writing, executing, and maintaining runbooks, automation scripts, and operational workflows.
• Ability to interpret metrics, logs, and traces using tools like Grafana/Prometheus, Google Cloud Monitoring, and OpenTelemetry.
• Familiarity with modern cloud environments, preferably Google Cloud Platform (GCP), including basic debugging, permissions, and service-level triage.
• Ability to investigate and resolve issues by following documented procedures, escalating effectively when necessary.
• Understanding of CI/CD pipelines, deployed application behavior, and operational dependencies across microservices.
• Proficiency with Jira or similar platforms for ticketing and structured incident tracking.
• Exceptional communication skills, especially during high-pressure incidents where clear and concise updates are essential.
• A calm and methodical approach to troubleshooting, prioritization, and decision-making.
• Strong collaboration skills when working with product engineering, SRE, and global support teams.
• High level of ownership, reliability, and accountability when managing operational tasks.
• Competitive salary and performance-based incentives.
• Comprehensive health, dental, and vision insurance.
• Generous paid time off and holiday leave.
• Opportunities for professional development and career advancement.
• Flexible working hours and remote work options.