
Distinguished Site Reliability Engineer – Cloud
Posted 13 hours ago

This is a fully remote position, open to applicants in Colorado and 3 more states.
• Take charge of designing, implementing, and supporting the operational and reliability elements of large-scale Kubernetes clusters, emphasizing performance, real-time monitoring, logging, and alerting.
• Participate in and improve the entire service lifecycle, from conception and design through deployment, operation, and ongoing refinement.
• Provide support to services prior to their launch through activities such as system design consulting, software tool and platform development, capacity management, and launch reviews.
• Maintain live services by measuring and monitoring their availability, latency, and overall system health.
• Sustainably scale systems through automation and promote system evolution by advocating for changes that enhance reliability and speed.
• Engage in sustainable incident response practices and conduct blameless postmortems.
• Join an on-call rotation to assist with production system support.
• Bachelor’s degree in Computer Science or a related technical discipline involving coding (such as physics or mathematics), or equivalent professional experience.
• Over 16 years of experience in infrastructure automation, distributed systems design, and developing tools for managing large-scale private or public cloud systems in production environments.
• Proficiency in one or more programming languages, such as Python, Go, Perl, or Ruby.
• Extensive knowledge of Linux, networking, and containers.
• Equity
• Benefits