This is a fully remote position, open to applicants in Mexico.

📋 Description

• Act as the primary incident lead during daytime hours in your region, facilitating technical investigations, centralizing communication, and engaging relevant engineering and SRE teams when escalations are necessary.

• Address escalations from Tier 1 support by using runbooks, metrics, logs, and system diagnostics to investigate and resolve issues, or to decide when to escalate to Tier 3.

• Create and update runbooks, workflows, and operational documentation to ensure consistent and reliable responses to recurring issues, collaborating with product teams to gradually expand coverage.

• Develop, maintain, and enhance automation scripts and tools that streamline common remediation steps, improve response times, and reduce manual operational workload.

• Employ metrics, logs, and tracing tools (Grafana/Prometheus, GCP Monitoring, OpenTelemetry) to proactively identify issues, validate system behavior, and support continuous improvement of detection methods.

• Serve as the central communication point during active incidents, ensuring timely updates and proper routing to the appropriate product engineering and SRE stakeholders.

• Work alongside reliability and product teams to share insights, suggest improvements, and help refine processes that boost the stability and operability of our systems.

• Participate in a shared weekend on-call rotation to maintain operational coverage for production systems, responding to incidents and escalations and coordinating with engineering teams as required.

• Assist in establishing operational best practices, refining workflows, and laying the groundwork for a more extensive reliability operations function.

⛳️ Requirements

• Bachelor’s degree in Computer Science, Information Technology, Engineering, or equivalent practical experience.

• More than 5 years of professional experience in Reliability Operations, Site Reliability Engineering, DevOps, IT Operations, or a related technical support role.

• Proven experience leading or participating in Tier 2 or Tier 3 technical investigations, including triage, log analysis, and structured escalation.

• Experience in supporting distributed systems, cloud-hosted services, or production operational environments.

• Practical experience in incident response processes.

• Strong proficiency in Linux, including system navigation, log review, and diagnostics.

• Experience in writing, executing, and maintaining runbooks, automation scripts, and operational workflows.

• Ability to interpret metrics, logs, and traces using tools like Grafana/Prometheus, Google Cloud Monitoring, and OpenTelemetry.

• Familiarity with modern cloud environments, preferably Google Cloud Platform (GCP), including basic debugging, permissions, and service-level triage.

• Ability to investigate and resolve issues by following documented procedures, escalating effectively when necessary.

• Understanding of CI/CD pipelines, deployed application behavior, and operational dependencies across microservices.

• Proficiency with Jira or similar platforms for ticketing and structured incident tracking.

• Exceptional communication skills, especially during high-pressure incidents where clear and concise updates are essential.

• A calm and methodical approach to troubleshooting, prioritization, and decision-making.

• Strong collaboration skills when working with product engineering, SRE, and global support teams.

• High level of ownership, reliability, and accountability when managing operational tasks.

🏝️ Benefits

• Competitive salary and performance-based incentives.

• Comprehensive health, dental, and vision insurance.

• Generous paid time off and holiday leave.

• Opportunities for professional development and career advancement.

• Flexible working hours and remote work options.

Senior Reliability Operations Engineer – Mexico
