This is a fully remote position, open to applicants in Malaysia.

📋 Description

• Define and implement SLIs, SLOs, and error budgets for critical CloudBlue services to ensure reliability and performance.

• Influence system architecture with an emphasis on reliability, scalability, and operability, designing systems that ensure fault tolerance, graceful degradation, and self-healing capabilities.

• Minimize operational toil by identifying areas for automation and process enhancement.

• Design and manage CloudBlue’s observability stack encompassing metrics, logs, and traces using tools like Datadog, Grafana, and Elastic Stack.

• Create actionable alerting strategies and dashboards that deliver clear visibility into platform and business health.

• Design and uphold high-availability architectures, implementing redundancy, failover, and disaster recovery plans across various regions and availability zones.

• Conduct capacity planning, load testing, and performance tuning to ensure platform stability and scalability.

• Serve as a senior responder during production incidents, overseeing incident coordination, communication, and service restoration efforts.

• Take ownership of blameless postmortems and drive improvements that reduce incident frequency, MTTR, and customer impact.

• Enhance the reliability of Kubernetes-based platforms through health checks, autoscaling strategies, rollout safety, and resilience testing.

• Collaborate with engineering and DevOps teams to improve deployment safety, rollback strategies, and platform reliability.

• Maintain runbooks and operational documentation while promoting SRE best practices throughout engineering teams.

• Assist with additional tasks or projects as assigned to fulfill team and business requirements.

⛳️ Requirements

• 3+ years of experience as an SRE, DevOps Engineer, or Production Engineer, demonstrating strong ownership of production systems.

• Proven experience managing highly available, enterprise-grade, multi-tenant SaaS platforms.

• Hands-on experience with observability and monitoring tools such as Datadog, Grafana, and Elasticsearch/Kibana.

• Solid understanding of Linux, networking, and the fundamentals of distributed systems.

• Experience working with containerized environments, particularly Docker and Kubernetes.

• Strong scripting and automation capabilities using Python and/or Bash.

• Experience with on-call rotations and incident response in production environments.

• Proficient in written and spoken English.

• Experience in defining SLIs/SLOs and managing error budgets at scale will be considered an advantage.

• Exposure to hyperscale or service-provider-grade platforms is beneficial.

• Cloud experience, preferably with Azure; experience with AWS and/or GCP is also valued.

• Experience working with hybrid or on-premises integrations is a plus.

• Familiarity with chaos engineering and resilience testing will be regarded as an asset.

🏝️ Benefits

• Competitive salary that recognizes your unique skills and contributions.

• Career advancement and professional development opportunities to help you achieve your full potential.

• Flexible work arrangements to support a healthy work/life balance.

Site Reliability Engineer
