This is a fully remote position, open to applicants in Sweden.

📋 Description

• Act as the primary incident lead during daytime hours in your region, coordinating technical investigations, centralizing communication, and engaging the relevant engineering and SRE teams when escalations are necessary.

• Handle escalations from Tier 1 support, using runbooks, metrics, logs, and system diagnostics to investigate and resolve issues or to determine when to escalate to Tier 3.

• Create and update runbooks, workflows, and operational documentation to ensure consistent and reliable responses to recurring issues, collaborating with product teams to enhance coverage over time.

• Develop, maintain, and improve automation scripts and tools that streamline common remediation tasks, enhance response times, and minimize manual operational overhead.

• Leverage metrics, logs, and tracing tools (Grafana/Prometheus, GCP Monitoring, OpenTelemetry) to proactively identify issues, validate system behavior, and support the continuous enhancement of detection mechanisms.

• Serve as the central communication point during active incidents, ensuring prompt updates and proper routing to the appropriate product engineering and SRE stakeholders.

• Work alongside reliability and product teams to share insights, suggest improvements, and refine processes that enhance the stability and operability of our systems.

• Participate in a shared weekend on-call rotation to maintain operational coverage for production systems, responding to incidents and escalations and coordinating with engineering teams as needed.

• Contribute to establishing operational best practices, refining workflows, and laying the groundwork for a broader reliability operations function.

⛳️ Requirements

• Bachelor’s degree in Computer Science, Information Technology, Engineering, or equivalent practical experience.

• More than 5 years of professional experience in Reliability Operations, Site Reliability Engineering, DevOps, IT Operations, or a related technical support role.

• Proven experience owning or participating in Tier 2 or Tier 3 technical investigations, including triage, log analysis, and structured escalation.

• Experience in supporting distributed systems, cloud-hosted services, or production operational environments.

• Hands-on experience with incident response processes.

• Strong proficiency with Linux, including system navigation, log review, and diagnostics.

• Experience in writing, executing, and maintaining runbooks, automations, and operational workflows.

• Ability to interpret metrics, logs, and traces using tools such as Grafana/Prometheus, Google Cloud Monitoring, and OpenTelemetry.

• Familiarity with modern cloud environments, preferably Google Cloud Platform (GCP), including basic debugging, permissions, and service-level triage.

• Ability to investigate and resolve issues by following documented procedures, and to escalate effectively when necessary.

• Understanding of CI/CD pipelines, deployed application behavior, and operational dependencies across microservices.

• Proficiency in Jira or similar platforms for ticketing and structured incident tracking.

• Outstanding communication skills, particularly during high-pressure incidents where clear and concise updates are essential.

• Calm and methodical approach to troubleshooting, prioritization, and decision-making.

• Strong collaboration skills when working with product engineering, SRE, and global support teams.

• High level of ownership, reliability, and accountability in managing operational responsibilities and incident leadership.

🏝️ Benefits

• Offers Equity

Senior Reliability Operations Engineer
