
Technical Service Operations Lead
Posted May 22

This is a fully remote position, open to applicants in Malaysia.
• Act as the Incident Commander for significant incidents — coordinating cross-functional response teams, steering investigations, making escalation choices, and ensuring incidents are resolved within SLA targets.
• Manage all incident communications: compose and distribute clear, timely updates to senior leadership, Customer Success, and partner/customer contacts throughout the incident lifecycle, and oversee customer-facing status page updates (status.xsolla.com).
• Conduct blameless Post-Incident Reviews (PIRs) for major incidents: guiding root cause analysis, assigning corrective actions with clear owners and deadlines, and tracking them to completion.
• During periods without incidents, proactively assess incident trends, recurring issues, and production bugs — identify patterns, create Problem tickets, and regularly report findings and recommendations to product and engineering teams.
• Implement the incident management framework across the organization, including the severity model, priority matrix, SLA targets, escalation protocols, and deployment readiness checkpoints.
• Supervise and mentor the Operations Engineer during your shift — providing guidance on triage, investigation, runbook execution, and documentation quality while conducting periodic knowledge transfer sessions to deepen service portfolio expertise.
• Generate shift handoff reports and provide regular operational metrics: incident trends, KPI performance (MTTD, MTTA, MTTR), SLA adherence, proactive detection rate, and repeat incident analysis.
• Regularly evaluate service catalogue completeness and oversee JIRA Service Management workflows for incident, PIR, and problem management.
• Cover the Operations Engineer role during absences, breaks, or incident surges. Join the weekend on-call rotation for major incidents.
• Over 6 years of experience in incident management, SRE, NOC leadership, or technical operations in a production environment supporting high-availability, high-transaction systems (experience with payments, e-commerce, SaaS, or gaming platforms preferred).
• Demonstrated incident management experience — coordinating multi-team responses, making real-time escalation decisions, and effectively communicating with executive stakeholders under pressure.
• Exceptional written and verbal communication skills in English — capable of drafting clear, concise executive updates at 3 AM under pressure, facilitating blameless PIRs, presenting operational metrics to senior leadership, and conveying incident status to customers and partners with clarity and professionalism.
• Strong ITIL foundation — comprehensive understanding of incident, problem, and change management lifecycles with hands-on experience implementing or operating ITIL-aligned workflows.
• Technical expertise across the observability stack — ability to read and interpret logs, traces, and metrics in Datadog (or equivalent tools like Grafana, Splunk, New Relic). Familiarity with APM, SLOs, error budgets, burn-rate alerting, and synthetic monitoring.
• Practical experience with incident management tools: Datadog, PagerDuty or OpsGenie, JIRA or JIRA Service Management, Slack, and Confluence.
• Analytical mindset — skilled at identifying trends, patterns, and recurring issues from incident data and transforming them into actionable recommendations for product and engineering teams.
• Experience with SLA/SLO-driven operations where MTTD, MTTA, and MTTR are tracked, reported, and enhanced.
• Familiarity with or strong interest in AI/ML-assisted operations: anomaly detection, alert correlation, predictive alerting, automated remediation, or self-healing automation.
• Willingness to work in 24x7 shift-based operations as part of a follow-the-sun model with handoff overlaps. Weekend on-call (rotating) for critical severities is a requirement.
• Convenient work tools
• Latest Mac workstations + additional hardware to enhance your work efficiency
• Access to Google Chat, Gmail, Google Drive, Confluence, Jira, GitLab
• Professional growth opportunities
• Complimentary training sessions and participation in specialized conferences
• Rich knowledge exchange within the company
• Health insurance (medical, dental, and optical) for employees and dependents
• Flexible working hours: organize your day according to your needs and team demands
• No dress code
• Comfortable and modern office environment