This is a fully remote position, open to applicants in Sweden.

📋 Description

• Oversee incident investigations during daytime hours in your region, delivering timely updates, escalating as necessary, and assisting senior engineers in managing the response.

• Address escalations from Tier 1 support, using established runbooks, metrics, logs, and diagnostics to resolve issues or escalating to Tier 3 when required.

• Revise runbooks and operational documentation based on new issues, findings, and feedback to ensure clarity and consistency across all procedures.

• Execute existing automations and work with senior team members to improve tools and scripts that facilitate troubleshooting and remediation tasks.

• Use observability tools such as Grafana/Prometheus, GCP Monitoring, and OpenTelemetry to analyze metrics, logs, and traces, helping to identify anomalies and validate system performance.

• Deliver clear and precise updates during incidents, ensuring that information is communicated to the appropriate engineering and SRE contacts while supporting organized incident coordination.

• Engage in discussions regarding root causes, share operational insights, and contribute to enhancements that improve system stability and supportability.

• Participate in a shared weekend on-call rotation to maintain operational coverage for production systems, responding to incidents and escalations as needed, and coordinating with engineering teams when issues arise.

• Proactively enhance workflows, implement best practices, and help establish the foundation of the Reliability Operations function as it develops.

⛳️ Requirements

• Bachelor’s degree in Computer Science, Information Technology, Engineering, or equivalent practical experience.

• 2–4 years of experience in Reliability Operations, Site Reliability Engineering, DevOps, IT Operations, or a similar technical support role.

• Experience in Tier 1 or Tier 2 investigations, including log reviews, basic triage, and structured escalation.

• Familiarity with operational environments that support distributed or cloud-based systems.

• Experience participating in incident response workflows and/or on-call rotations.

• Proficiency in Linux, including system navigation, log reviews, and basic diagnostics.

• Experience using and contributing to runbooks and operational workflows.

• Ability to analyze metrics, logs, and traces using tools such as Grafana/Prometheus, Google Cloud Monitoring, and OpenTelemetry.

• Knowledge of cloud platforms, preferably Google Cloud Platform (GCP).

• Ability to follow documented remediation procedures, with sound judgment about when to escalate issues.

• Understanding of CI/CD pipelines and the impact of application deployments on runtime behavior.

• Experience with Jira or similar ticketing systems.

• Strong communication skills, particularly during time-sensitive operational situations.

• Calm and organized approach to troubleshooting and prioritization.

• Collaborative mindset and the ability to work effectively with senior operations engineers, product teams, and SREs.

• Strong sense of ownership and accountability for operational duties.

🏝️ Benefits

• Equity offered

Reliability Operations Engineer
