This is a fully remote position, open to applicants in Colorado, +3 more states.

📋 Description

• Take charge of designing, implementing, and supporting the operational and reliability elements of large-scale Kubernetes clusters, emphasizing performance, real-time monitoring, logging, and alerting.

• Participate in and enhance the entire service lifecycle—from conception and design to deployment, operation, and ongoing refinement.

• Support services before launch through system design consulting, software tool and platform development, capacity management, and launch reviews.

• Maintain live services by measuring and monitoring their availability, latency, and overall system health.

• Sustainably scale systems through automation and promote system evolution by advocating for changes that enhance reliability and speed.

• Engage in sustainable incident response practices and conduct blameless postmortems.

• Join an on-call rotation to assist with production system support.

⛳️ Requirements

• Bachelor’s degree in Computer Science or a related technical discipline involving coding (such as physics or mathematics), or equivalent professional experience.

• Over 16 years of experience in infrastructure automation, distributed systems design, and developing tools for managing large-scale private or public cloud systems in production environments.

• Proficiency in one or more programming languages, such as Python, Go, Perl, or Ruby.

• Extensive knowledge of Linux, networking, and containers.

🏝️ Benefits

• Equity

• Benefits

Distinguished Site Reliability Engineer – Cloud
