
Senior Site Reliability Engineer
Posted 13 hours ago

This is a fully remote position, open to applicants in the Philippines.
• Design, implement, and continuously enhance highly available, scalable, secure, and resilient cloud infrastructure and platform services.
• Define and refine Service Level Indicators (SLIs), Service Level Objectives (SLOs), and operational metrics to achieve measurable reliability outcomes.
• Lead incident response efforts, manage major incidents, perform root cause analysis, and conduct post-incident reviews with a focus on systemic improvements.
• Reduce operational toil through automation, standardization, and the development of self-healing platform capabilities.
• Develop and maintain disaster recovery, backup, failover, and resilience strategies that meet defined Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO).
• Conduct capacity planning, performance analysis, and proactive optimization of infrastructure and application environments.
• Architect, build, and maintain scalable cloud-native infrastructure primarily within AWS environments.
• Develop and manage infrastructure-as-code utilizing tools such as Terraform and CloudFormation.
• Create reusable platform components and shared services that enhance developer productivity and operational consistency.
• Design and maintain comprehensive observability solutions encompassing metrics, logging, tracing, alerting, and dashboards.
• Collaborate with engineering teams to integrate reliability, scalability, performance, and security considerations into the software development lifecycle (SDLC).
• 5+ years of experience in Site Reliability Engineering, DevOps Engineering, Platform Engineering, or similar infrastructure roles.
• Strong hands-on experience managing production workloads within AWS cloud environments.
• In-depth experience with infrastructure-as-code tools such as Terraform and/or CloudFormation.
• Significant experience in designing and supporting CI/CD pipelines and modern software delivery practices.
• Solid understanding of distributed systems, microservices architecture, networking, and cloud-native technologies.
• Experience in implementing observability and monitoring solutions across complex environments.
• Proficient in scripting and automation using Python, Bash, or comparable languages.
• Experience in managing production incidents and conducting structured root cause analyses.
• Strong grasp of system reliability, scalability, security, and operational best practices.
• Excellent analytical, troubleshooting, and problem-solving skills.
• Strong communication and stakeholder engagement abilities.
• Ability to thrive in fast-paced, agile, and collaborative engineering environments.
• Paid time off.
• Remote work options.
• Professional development opportunities.
Innovative Solutions