This is a fully remote position, open to applicants in India.

📋 Description

• Design and execute AWS architectures spanning multiple regions and availability zones that fulfill RTO/RPO objectives.

• Engineer failover patterns, both active-active and active-passive, utilizing Route 53, Global Accelerator, and CloudFront.

• Create automated disaster recovery runbooks and playbooks via AWS Systems Manager Automation and Step Functions.

• Implement chaos engineering techniques using AWS Fault Injection Simulator (FIS) to assess and validate system resiliency.

• Architect strategies for cross-region data replication across S3, DynamoDB Global Tables, RDS, and Aurora Global Database.

• Evaluate containerized workloads within Kubernetes, ensuring resilience through self-healing mechanisms, auto-scaling, and deployments across multiple clusters or regions.

• Manage AWS Backup for all services (EC2, EBS, RDS, EFS, FSx, DynamoDB, Aurora) through policy-based automation.

• Design immutable backup vaults and pipelines for backup replication across accounts and regions.

• Develop and automate procedures for data recovery testing, ensuring data integrity and compliance with defined service level agreements (SLAs).

• Implement point-in-time recovery (PITR) for databases and storage, validating through regular restore drills.

• Maintain Business Continuity Plans (BCP) and Disaster Recovery (DR) strategies, including the monitoring of RTO (Recovery Time Objective) and RPO (Recovery Point Objective).

⛳️ Requirements

• Minimum of 5 years of experience in cloud infrastructure, Site Reliability Engineering (SRE), or IT disaster recovery engineering roles.

• At least 3 years of practical AWS experience in large-scale production environments.

• Demonstrated success in delivering multi-region disaster recovery architectures with clearly defined and tested RTO/RPO targets.

• Expert-level knowledge of core AWS resilience services (refer to the skills matrix below).

• Strong proficiency in scripting languages such as Python, Bash, or PowerShell for automation and orchestration tasks.

• Experience with Infrastructure as Code tools: Terraform and/or AWS CloudFormation.

• Comprehensive understanding of networking fundamentals, including VPC, Transit Gateway (TGW), Direct Connect, VPN, and DNS failover mechanisms.

• Exceptional written and verbal communication skills, including the ability to produce executive-level disaster recovery reports.

🏝️ Benefits

• Health insurance

• Retirement plans

• Paid time off

• Flexible work arrangements

• Professional development opportunities

Cloud Reliability Engineer – Recovery