This is a fully remote position, open to applicants in Arizona and 9 other states.

📋 Description

• Design, provision, and manage AWS infrastructure using Terraform.

• Operate, maintain, and scale production workloads that run on Kubernetes.

• Package, deploy, and manage applications with Helm and infrastructure automation tools.

• Construct, operate, and enhance distributed and event-driven systems.

• Define, monitor, and uphold Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.

• Develop automation for deployment, scaling, monitoring, incident response, and operational workflows.

• Take ownership of platform observability by implementing and maintaining metrics, logging, tracing, monitoring, and alerting solutions.

• Lead incident response, facilitate blameless postmortems, and drive long-term corrective actions.

• Collaborate with Product and Engineering teams on capacity planning, performance enhancement, and resilient system architecture.

• Implement and uphold security best practices to support HIPAA, SOC 2, and other compliance mandates.

• Participate in an on-call rotation and offer operational support for production systems.

⛳️ Requirements

• 3–5 years of experience in Site Reliability Engineering, DevOps Engineering, Platform Engineering, Cloud Infrastructure Engineering, or similar infrastructure-oriented roles.

• Bachelor's degree in Computer Science, Information Systems, Software Engineering, or a related technical discipline; equivalent professional experience will also be considered.

• Strong practical experience in managing production workloads within AWS environments.

• Demonstrated experience in managing infrastructure as code using Terraform.

• Experience in operating and supporting production Kubernetes environments.

• Hands-on experience in deploying and managing applications using Helm.

• Experience with distributed systems, event-driven architectures, or event-sourcing platforms.

• Experience in establishing and managing observability practices including monitoring, logging, tracing, alerting, and incident response.

• Strong understanding of Linux systems administration, networking, cloud architecture, and the fundamentals of distributed systems.

• Experience in designing, implementing, and maintaining CI/CD pipelines and deployment automation.

• Strong problem-solving skills and the ability to troubleshoot complex infrastructure and application issues.

• Excellent written and verbal communication skills, with the ability to collaborate effectively across technical and non-technical teams.

• High level of ownership, accountability, and initiative.

• Willingness and ability to participate in an on-call rotation supporting production systems.

🏝️ Benefits

• Medical, dental, and vision insurance.

• Income protection benefits.

• Flexible PTO.

• Company holidays.

• 401(k).

• Access to additional wellness benefits.

Site Reliability Engineer