This is a fully remote position, open to applicants in Malaysia.

📋 Description

• Act as the Incident Commander for significant incidents — coordinating cross-functional response teams, steering investigations, making escalation choices, and ensuring incidents are resolved within SLA targets.

• Manage all incident communications: compose and distribute clear, timely updates to senior leadership, Customer Success, and partner/customer contacts throughout the incident lifecycle, and oversee customer-facing status page updates (status.xsolla.com).

• Conduct blameless Post-Incident Reviews (PIRs) for major incidents: guiding root cause analysis, assigning corrective actions with clear owners and deadlines, and tracking them to completion.

• During periods without incidents, proactively assess incident trends, recurring issues, and production bugs — identify patterns, create Problem tickets, and regularly report findings and recommendations to product and engineering teams.

• Implement the incident management framework across the organization, including the severity model, priority matrix, SLA targets, escalation protocols, and deployment readiness checkpoints.

• Supervise and mentor the Operations Engineer during your shift — providing guidance on triage, investigation, runbook execution, and documentation quality while conducting periodic knowledge transfer sessions to deepen service portfolio expertise.

• Generate shift handoff reports and provide regular operational metrics: incident trends, KPI performance (MTTD, MTTA, MTTR), SLA adherence, proactive detection rate, and repeat incident analysis.

• Regularly evaluate service catalogue completeness and oversee Jira Service Management workflows for incident, PIR, and problem management.

• Cover the Operations Engineer role during absences, breaks, or surge incidents. Participate in the weekend on-call rotation for major incidents.

⛳️ Requirements

• Over 6 years of experience in incident management, SRE, NOC leadership, or technical operations within a production environment supporting high-availability, high-transaction systems (experience in payments, e-commerce, SaaS, or gaming platforms preferred).

• Demonstrated incident management experience — coordinating multi-team responses, making real-time escalation decisions, and effectively communicating with executive stakeholders under pressure.

• Exceptional written and verbal communication skills in English — capable of drafting clear, concise executive updates at 3 AM under pressure, facilitating blameless PIRs, presenting operational metrics to senior leadership, and conveying incident status to customers and partners with clarity and professionalism.

• Strong ITIL foundation — comprehensive understanding of incident, problem, and change management lifecycles with hands-on experience implementing or operating ITIL-aligned workflows.

• Technical expertise across the observability stack — ability to read and interpret logs, traces, and metrics in Datadog (or equivalent tools like Grafana, Splunk, New Relic). Familiarity with APM, SLOs, error budgets, burn-rate alerting, and synthetic monitoring.

• Practical experience with incident management tools: Datadog, PagerDuty or OpsGenie, Jira or Jira Service Management, Slack, and Confluence.

• Analytical mindset — skilled at identifying trends, patterns, and recurring issues from incident data and transforming them into actionable recommendations for product and engineering teams.

• Experience with SLA/SLO-driven operations where MTTD, MTTA, and MTTR are tracked, reported, and enhanced.

• Familiarity with or strong interest in AI/ML-assisted operations: anomaly detection, alert correlation, predictive alerting, automated remediation, or self-healing automation.

• Willingness to work in 24x7 shift-based operations as part of a follow-the-sun model with handoff overlaps. Weekend on-call (rotating) for critical severities is a requirement.

🏝️ Benefits

• Convenient work tools

• Latest Mac workstations plus additional hardware to improve your work efficiency

• Access to Google Chat, Gmail, Google Drive, Confluence, Jira, GitLab

• Professional growth opportunities

• Complimentary training sessions and participation in specialized conferences

• Rich knowledge exchange within the company

• Health insurance (medical, dental, and optical) for employees and dependents

• Flexible working hours: organize your day according to your needs and team demands

• No dress code

• Comfortable and modern office environment

Technical Service Operations Lead
