
Cloud Reliability Engineer – Recovery
Posted May 22

This is a fully remote position, open to applicants in India.
• Design and implement AWS architectures spanning multiple regions and availability zones that meet RTO/RPO objectives.
• Engineer failover patterns, both active-active and active-passive, utilizing Route 53, Global Accelerator, and CloudFront.
• Create automated disaster recovery runbooks and playbooks via AWS Systems Manager Automation and Step Functions.
• Implement chaos engineering techniques using AWS Fault Injection Simulator (FIS) to assess and validate system resiliency.
• Architect strategies for cross-region data replication across S3, DynamoDB Global Tables, RDS, and Aurora Global Database.
• Evaluate containerized workloads within Kubernetes, ensuring resilience through self-healing mechanisms, auto-scaling, and deployments across multiple clusters or regions.
• Manage AWS Backup across supported services (EC2, EBS, RDS, EFS, FSx, DynamoDB, Aurora) through policy-based automation.
• Design immutable backup vaults and pipelines for backup replication across accounts and regions.
• Develop and automate procedures for data recovery testing, ensuring data integrity and compliance with defined service level agreements (SLAs).
• Implement point-in-time recovery (PITR) for databases and storage, validating through regular restore drills.
• Maintain Business Continuity Plans (BCP) and Disaster Recovery (DR) strategies, including the monitoring of RTO (Recovery Time Objective) and RPO (Recovery Point Objective).
• Minimum of 5 years in cloud infrastructure, Site Reliability Engineering (SRE), or IT disaster recovery engineering roles.
• At least 3 years of practical AWS experience in large-scale production environments.
• Demonstrated success in delivering multi-region disaster recovery architectures with clearly defined and tested RTO/RPO targets.
• Expert-level knowledge of core AWS resilience services (refer to the skills matrix below).
• Strong proficiency in scripting languages such as Python, Bash, or PowerShell for automation and orchestration tasks.
• Experience with Infrastructure as Code tools: Terraform and/or AWS CloudFormation.
• Comprehensive understanding of networking fundamentals, including VPC, Transit Gateway (TGW), Direct Connect, VPN, and DNS failover mechanisms.
• Exceptional written and verbal communication skills, capable of creating executive-level disaster recovery reports.
• Health insurance
• Retirement plans
• Paid time off
• Flexible work arrangements
• Professional development opportunities