
Senior Cloud Resilience Architect
Posted May 7

This is a fully remote position, open to applicants in New York.
• Assess and enhance the organization's disaster recovery capabilities, including recovery time and recovery point objectives (RTO/RPO), dependency mapping, and failure domain analysis across applications, data, and infrastructure.
• Create, document, and implement disaster recovery standards and best practices for cloud infrastructure, platforms, and application architectures.
• Collaborate with SRE, platform, security, and product engineering teams to design and build resilient, fault-tolerant systems, transitioning from backup-based recovery to multi-region and active-active architectures.
• Direct the disaster recovery roadmap, weighing technical feasibility, cost, risk, and business priorities.
• Design and propose reference architectures for various disaster recovery patterns, including pilot-light, warm standby, hot standby, and active-active configurations.
• Promote the adoption of active-active disaster recovery for key systems, including traffic management, data replication, consistency models, and automated failover.
• Define and implement testing strategies for disaster recovery, such as game days, chaos testing, and routine recovery exercises.
• Ensure comprehensive documentation, runbooks, and escalation procedures are in place to guarantee that recoverability is well understood and not reliant on specific individuals.
• Assess and recommend platform upgrades, cloud services, and tools that enhance resilience, recovery speed, and reliability.
• Act as a technical authority and advisor on disaster recovery and resilience for leadership and engineering teams.
• Provide architectural guidance, conduct design reviews, and mentor engineers executing disaster recovery-related changes.
• Collaborate with security and compliance teams to confirm that disaster recovery strategies adhere to regulatory, audit, and data protection standards.
• Bachelor’s or Master’s degree in Computer Science or equivalent practical experience.
• Over 8 years of experience in cloud infrastructure, platform engineering, SRE, or reliability-focused architecture roles.
• In-depth knowledge of disaster recovery principles, including RTO/RPO, blast radius reduction, failure domains, and dependency isolation.
• Demonstrated experience in designing and implementing multi-region and multi-availability zone architectures.
• Practical experience transitioning systems to active-active or highly available architectures.
• Strong understanding of data replication strategies, consistency trade-offs, and recovery patterns for databases and stateful systems.
• Extensive experience with major cloud providers (AWS preferred, GCP/Azure acceptable).
• Solid understanding of managed cloud services and their disaster recovery characteristics and limitations.
• Familiarity with Kubernetes-based platforms, including regional failover, workload portability, and cluster recovery strategies.
• Knowledge of global traffic management, DNS, load balancing, and service mesh patterns.
• Experience in designing and maintaining Infrastructure as Code using tools like Terraform, Pulumi, CloudFormation, or Ansible.
• Strong emphasis on automating recovery workflows, failover testing, and environment provisioning.
• Ability to eliminate manual recovery processes and minimize recovery time through software solutions.
• Experience in defining and conducting disaster recovery tests, game days, and failure simulations.
• Comfortable collaborating across organizational boundaries to influence priorities and standards.
• Excellent documentation and communication skills, with the ability to translate complex technical risks into business impacts.
• Health insurance
• Remote work flexibility
• Professional development
• Paid time off