
Senior Reliability Operations Engineer
Posted May 23

This is a fully remote position, open to applicants in Sweden.
• Act as the primary incident lead during daytime hours in your region, coordinating technical investigations, centralizing communication, and engaging the relevant engineering and SRE teams when escalations are necessary.
• Handle escalations from Tier 1 support, using runbooks, metrics, logs, and system diagnostics to investigate and resolve issues or to determine when to escalate to Tier 3.
• Create and update runbooks, workflows, and operational documentation to ensure consistent and reliable responses to recurring issues, collaborating with product teams to enhance coverage over time.
• Develop, maintain, and improve automation scripts and tools that streamline common remediation tasks, enhance response times, and minimize manual operational overhead.
• Use metrics, logs, and tracing tools (Grafana/Prometheus, Google Cloud Monitoring, OpenTelemetry) to proactively identify issues, validate system behavior, and support the continuous improvement of detection mechanisms.
• Serve as the central communication point during active incidents, ensuring prompt updates and proper routing to the appropriate product engineering and SRE stakeholders.
• Work alongside reliability and product teams to share insights, suggest improvements, and refine processes that enhance the stability and operability of our systems.
• Participate in a shared weekend on-call rotation to maintain operational coverage for production systems, responding to incidents and escalations and coordinating with engineering teams as needed.
• Help establish operational best practices, refine workflows, and lay the groundwork for a broader reliability operations function.
• Bachelor’s degree in Computer Science, Information Technology, Engineering, or equivalent practical experience.
• More than 5 years of professional experience in Reliability Operations, Site Reliability Engineering, DevOps, IT Operations, or a related technical support role.
• Proven experience owning or participating in Tier 2 or Tier 3 technical investigations, including triage, log analysis, and structured escalation.
• Experience supporting distributed systems, cloud-hosted services, or production operational environments.
• Hands-on experience with incident response processes.
• Strong proficiency with Linux, including system navigation, log review, and diagnostics.
• Experience in writing, executing, and maintaining runbooks, automations, and operational workflows.
• Ability to interpret metrics, logs, and traces using tools such as Grafana/Prometheus, Google Cloud Monitoring, and OpenTelemetry.
• Familiarity with modern cloud environments, preferably Google Cloud Platform (GCP), including basic debugging, permissions, and service-level triage.
• Ability to investigate and resolve issues by following documented procedures, escalating effectively when necessary.
• Understanding of CI/CD pipelines, deployed application behavior, and operational dependencies across microservices.
• Proficiency in Jira or similar platforms for ticketing and structured incident tracking.
• Outstanding communication skills, particularly during high-pressure incidents where clear and concise updates are essential.
• Calm and methodical approach to troubleshooting, prioritization, and decision-making.
• Strong collaboration skills when working with product engineering, SRE, and global support teams.
• High level of ownership, reliability, and accountability in managing operational responsibilities and incident leadership.
• Offers Equity
Remote