
Senior Site Reliability Engineer II – Infrastructure, AI Native
Posted 5 days ago

This is a fully remote position, open to applicants in Canada.
• Enhancing and managing our infrastructure and services with AI (Claude Code) as an integral partner in your daily development processes.
• Providing clear guidance on technical direction and strategy, while documenting these insights for team alignment.
• Mentoring and guiding fellow engineers within the team.
• Taking ownership of and resolving intricate infrastructure challenges — including Kubernetes scheduling nuances, networking issues, cross-service cascading failures, and AWS platform concerns escalated by other engineers.
• Engaging in a shared on-call schedule (approximately one week every six to eight weeks).
• Estimating timelines and breaking down tasks into manageable 1-3 day segments.
• Promoting cloud cost efficiency by pinpointing over-provisioned resources, optimizing EC2 and container workloads, and developing tools to identify cost anomalies before they escalate.
• Bachelor’s degree in Computer Science, Engineering, a related field, or equivalent practical experience.
• Extensive experience (5+ years) managing medium to large-scale deployments on AWS (~5000 instances, 50+ accounts), or a comparable environment.
• 3+ years of programming experience in Java, Python, or another general-purpose programming language.
• Significant Kubernetes experience (3+ years) deploying and managing clusters at scale (hundreds of Deployments, over 10k containers, 20k+ cores).
• Proficient understanding of container orchestration and microservices.
• Familiarity with service discovery/service mesh technologies.
• Strong Linux administration skills, along with shell/bash scripting expertise.
• Advanced experience with Infrastructure as Code tools (Terraform, CloudFormation) and configuration management/provisioning tools (Ansible, Chef, etc.).
• Solid experience in Build/Automation/CI/CD practices.
• In-depth knowledge and experience with networking and load-balancer technologies.
• Familiarity with open-source projects such as Consul, Docker, ArgoCD, Nexus, and Jenkins.
• Experience with large-scale Kafka implementations.
• Database knowledge is an advantage.
• Exceptional troubleshooting abilities, proficiency with monitoring tools, and meticulous attention to detail.
• Outstanding interpersonal skills and a highly collaborative working approach.
• Practical experience using AI coding tools (Claude Code, Cursor, or similar) for infrastructure scripting, incident response automation, or tooling development.
• Competitive salary and comprehensive benefits package.
• Medical, dental, vision, life, and disability insurance options.
• RRSP plan featuring a DPSP company matching program.
• Employee Assistance Program (EAP) focused on mental well-being.
• Flexible paid time off along with several company-wide holidays throughout the year.
• Week-long synchronized company shutdowns during Winter and Summer.
• Opportunities for Learning & Development programs.
• Provision of equipment, tools, and reimbursement support to foster a productive remote working environment.
• Complimentary Life360 Platinum Membership for your chosen circle.
• Free Tile products.