This is a fully remote position, open to applicants in the Philippines.

📋 Description

• Design, implement, and continuously enhance highly available, scalable, secure, and resilient cloud infrastructure and platform services.

• Define and refine Service Level Indicators (SLIs), Service Level Objectives (SLOs), and operational metrics to achieve measurable reliability outcomes.

• Lead incident response efforts, manage major incidents, perform root cause analysis, and conduct post-incident reviews with a focus on systemic improvements.

• Promote the reduction of operational toil through automation, standardization, and the development of self-healing platform capabilities.

• Develop and uphold disaster recovery, backup, failover, and resilience strategies to fulfill defined Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO).

• Conduct capacity planning, performance analysis, and proactive optimization of infrastructure and application environments.

• Architect, build, and maintain scalable cloud-native infrastructure primarily within AWS environments.

• Develop and manage infrastructure-as-code utilizing tools such as Terraform and CloudFormation.

• Create reusable platform components and shared services that enhance developer productivity and operational consistency.

• Design and maintain comprehensive observability solutions encompassing metrics, logging, tracing, alerting, and dashboards.

• Collaborate with engineering teams to integrate reliability, scalability, performance, and security considerations into the software development lifecycle (SDLC).

⛳️ Requirements

• 5+ years of experience in Site Reliability Engineering, DevOps Engineering, Platform Engineering, or similar infrastructure roles.

• Strong hands-on experience managing production workloads within AWS cloud environments.

• In-depth experience with infrastructure-as-code tools such as Terraform and/or CloudFormation.

• Significant experience in designing and supporting CI/CD pipelines and modern software delivery practices.

• Solid understanding of distributed systems, microservices architecture, networking, and cloud-native technologies.

• Experience in implementing observability and monitoring solutions across complex environments.

• Proficient in scripting and automation using Python, Bash, or comparable languages.

• Experience in managing production incidents and conducting structured root cause analyses.

• Strong grasp of system reliability, scalability, security, and operational best practices.

• Excellent analytical, troubleshooting, and problem-solving skills.

• Strong communication and stakeholder engagement abilities.

• Ability to thrive in fast-paced, agile, and collaborative engineering environments.

🏝️ Benefits

• Paid time off.

• Remote work options.

• Professional development opportunities.

Senior Site Reliability Engineer
