
Manager, Software Engineering – Resilience Engineering
Posted Jun 19

This is a fully remote position, open to applicants in Canada.
• Define and lead the vision for resilience engineering at Affirm, emphasizing production load testing and chaos engineering as essential engineering practices.
• Mentor and guide a team of engineers responsible for developing platforms and tools for secure production experimentation.
• Collaborate with infrastructure, product, and security leadership to integrate resilience validation into the software development lifecycle.
• Set best practices for safely testing system limits and failure scenarios in a production environment.
• Oversee the design and advancement of platforms that facilitate safe, controlled production load testing and fault injection.
• Ensure robust safeguards are established, including isolation boundaries, approval workflows, and automated rollback mechanisms to protect end-users.
• Create systems that deliver comprehensive observability, traceability, and auditability for all resilience experiments.
• Drive reliability improvements by systematically identifying vulnerabilities through load testing and chaos experiments.
• Develop monitoring, alerting, and incident response strategies specifically designed for proactive resilience validation.
• Collaborate closely with engineering teams to safely design and implement production load tests and chaos experiments.
• Partner with infrastructure teams to establish guardrails around testing and experimentation processes.
• Enable teams to embrace resilience practices by offering reusable tools, frameworks, and standardized workflows.
• Identify systemic vulnerabilities and spearhead cross-functional initiatives to enhance reliability and fault tolerance.
• Champion a culture of “test failure before failure tests you” throughout the organization.
• Demonstrated experience in leading engineering teams focused on reliability, infrastructure, or distributed systems.
• Practical experience with production load testing, chaos engineering, or validation of large-scale systems.
• Familiarity with a chaos engineering platform such as Gremlin, Harness, or a similar tool.
• Strong comprehension of failure modes in distributed systems, including latency, partial failures, and cascading outages.
• Experience in building or managing systems with robust safety guarantees (isolation, rate limiting, guardrails, auditability).
• Knowledge of cloud-native environments (AWS, Kubernetes) and observability tools.
• Solid programming background (e.g., Python, Kotlin, Java, or similar languages).
• Exceptional problem-solving abilities and the capacity to balance long-term resilience investments with immediate business priorities.
• Strong communication and leadership skills, with a proven record of influencing engineering practices across various teams.
• Health care coverage - Affirm pays all premiums for all coverage levels for you and your dependents.
• Flexible Spending Wallets - generous stipends for expenses related to Technology, Food, various Lifestyle needs, and family formation costs.
• Time off - competitive vacation and holiday policies allowing you to take time off to rest and rejuvenate.
• ESPP - An employee stock purchase plan that enables you to purchase shares of Affirm at a discounted rate.