
Reliability Operations Engineer
Posted May 30

This is a fully remote position, open to applicants in Sweden.
• Oversee incident investigations during daytime hours in your region, delivering timely updates, escalating as necessary, and assisting senior engineers in managing the response.
• Address escalations from Tier 1 support by utilizing established runbooks, metrics, logs, and diagnostics to resolve issues or escalate to Tier 3 when required.
• Revise runbooks and operational documentation based on new issues, findings, and feedback to ensure clarity and consistency across all procedures.
• Execute existing automations and work with senior team members to improve tools and scripts that facilitate troubleshooting and remediation tasks.
• Leverage observability tools such as Grafana/Prometheus, GCP Monitoring, and OpenTelemetry to analyze metrics, logs, and traces, assisting in the identification of anomalies and validating system performance.
• Deliver clear and precise updates during incidents, ensuring that information is communicated to the appropriate engineering and SRE contacts while supporting organized incident coordination.
• Engage in discussions regarding root causes, share operational insights, and contribute to enhancements that improve system stability and supportability.
• Participate in a shared weekend on-call rotation to maintain operational coverage for production systems, responding to incidents and escalations as needed, and coordinating with engineering teams when issues arise.
• Proactively enhance workflows, implement best practices, and help establish the foundation of the Reliability Operations function as it develops.
• Bachelor’s degree in Computer Science, Information Technology, Engineering, or equivalent practical experience.
• 2–4 years of experience in Reliability Operations, Site Reliability Engineering, DevOps, IT Operations, or a similar technical support role.
• Experience in Tier 1 or Tier 2 investigations, including log reviews, basic triage, and structured escalation.
• Familiarity with operational environments that support distributed or cloud-based systems.
• Experience participating in incident response workflows and/or on-call rotations.
• Proficiency in Linux, including system navigation, log reviews, and basic diagnostics.
• Experience in using and contributing to runbooks and operational workflows.
• Ability to analyze metrics, logs, and traces using tools such as Grafana/Prometheus, Google Cloud Monitoring, and OpenTelemetry.
• Knowledge of cloud platforms, preferably Google Cloud Platform (GCP).
• Ability to follow documented remediation procedures, with sound judgment about when to escalate issues.
• Understanding of CI/CD pipelines and the impact of application deployments on runtime behavior.
• Experience with Jira or similar ticketing systems.
• Strong communication skills, particularly during time-sensitive operational situations.
• Calm and organized approach to troubleshooting and prioritization.
• Collaborative mindset, effectively working with senior operations engineers, product teams, and SREs.
• Strong sense of ownership and accountability for operational duties.
• Offers Equity
Remote