
Site Reliability Engineer
Posted Jun 20
This is a fully remote position, open to applicants in Arizona and 9 other states.
Responsibilities
• Design, provision, and manage AWS infrastructure using Terraform.
• Operate, maintain, and scale production workloads that run on Kubernetes.
• Package, deploy, and manage applications with Helm and infrastructure automation tools.
• Build, operate, and improve distributed and event-driven systems.
• Define, monitor, and uphold Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.
• Develop automation for deployment, scaling, monitoring, incident response, and operational workflows.
• Own platform observability by implementing and maintaining metrics, logging, tracing, monitoring, and alerting.
• Lead incident response, facilitate blameless postmortems, and drive long-term corrective actions.
• Collaborate with Product and Engineering teams on capacity planning, performance optimization, and resilient system architecture.
• Implement and maintain security best practices that support HIPAA, SOC 2, and other compliance requirements.
• Participate in an on-call rotation and offer operational support for production systems.
Qualifications
• 3–5 years of experience in Site Reliability Engineering, DevOps Engineering, Platform Engineering, Cloud Infrastructure Engineering, or a similar infrastructure-focused role.
• Bachelor's degree in Computer Science, Information Systems, Software Engineering, or a related technical field; equivalent professional experience will also be considered.
• Strong hands-on experience managing production workloads in AWS.
• Demonstrated experience managing infrastructure as code with Terraform.
• Experience operating and supporting production Kubernetes environments.
• Hands-on experience deploying and managing applications with Helm.
• Experience with distributed systems, event-driven architectures, or event-sourcing platforms.
• Experience establishing and managing observability practices, including monitoring, logging, tracing, alerting, and incident response.
• Strong understanding of Linux systems administration, networking, cloud architecture, and distributed systems fundamentals.
• Experience designing, implementing, and maintaining CI/CD pipelines and deployment automation.
• Strong problem-solving skills and the ability to troubleshoot complex infrastructure and application issues.
• Excellent written and verbal communication skills, with the ability to collaborate effectively across technical and non-technical teams.
• High level of ownership, accountability, and initiative.
• Willingness and ability to participate in an on-call rotation supporting production systems.
Benefits
• Medical, dental, and vision insurance.
• Income protection benefits.
• Flexible PTO.
• Company holidays.
• 401(k).
• Access to additional wellness benefits.