
Site Reliability Engineer
Posted May 23

This is a fully remote position, open to applicants in Malaysia.
• Define and implement SLIs, SLOs, and error budgets for essential CloudBlue services to ensure reliability and performance.
• Influence system architecture with an emphasis on reliability, scalability, and operability, designing systems that ensure fault tolerance, graceful degradation, and self-healing capabilities.
• Minimize operational toil by identifying areas for automation and process enhancement.
• Design and manage CloudBlue’s observability stack encompassing metrics, logs, and traces using tools like Datadog, Grafana, and Elastic Stack.
• Create actionable alerting strategies and dashboards that deliver clear visibility into platform and business health.
• Design and uphold high-availability architectures, implementing redundancy, failover, and disaster recovery plans across various regions and availability zones.
• Conduct capacity planning, load testing, and performance tuning to ensure platform stability and scalability.
• Serve as a senior responder during production incidents, overseeing incident coordination, communication, and service restoration efforts.
• Take ownership of blameless postmortems and drive enhancements that decrease incident frequency, MTTR, and customer impact.
• Enhance the reliability of Kubernetes-based platforms through health checks, autoscaling strategies, rollout safety, and resilience testing.
• Collaborate with engineering and DevOps teams to improve deployment safety, rollback strategies, and platform reliability.
• Maintain runbooks and operational documentation while promoting SRE best practices throughout engineering teams.
• Assist with additional tasks or projects as assigned to fulfill team and business requirements.
• 3+ years of experience as an SRE, DevOps Engineer, or Production Engineer, demonstrating strong ownership of production systems.
• Proven experience managing highly available, enterprise-grade, multi-tenant SaaS platforms.
• Hands-on experience with observability and monitoring tools such as Datadog, Grafana, and Elasticsearch/Kibana.
• Solid understanding of Linux, networking, and the fundamentals of distributed systems.
• Experience working with containerized environments, particularly Docker and Kubernetes.
• Strong scripting and automation capabilities using Python and/or Bash.
• Experience in on-call rotations and incident response in production settings.
• Proficient in written and spoken English.
• Experience in defining SLIs/SLOs and managing error budgets at scale will be considered an advantage.
• Exposure to hyperscale or service-provider-grade platforms is beneficial.
• Cloud experience, preferably with Azure; experience with AWS and/or GCP is also valued.
• Experience working with hybrid or on-premises integrations is a plus.
• Familiarity with chaos engineering and resilience testing will be regarded as an asset.
• Competitive salary that recognizes your unique skills and contributions.
• Career advancement and professional development opportunities to help you achieve your full potential.
• Flexible work arrangements to support a healthy work/life balance.