
Senior Site Reliability Engineer – Observability, Telemetry Platform
Posted May 9

This is a fully remote position, open to applicants in California.
• Design, implement, and support the operational and reliability components of a large-scale Observability & Telemetry collection platform, emphasizing performance at scale, real-time monitoring, logging, and alerting.
• Engage in and enhance the entire lifecycle of services—from initial conception and design to deployment, operation, and refinement.
• Provide support for services prior to their launch through activities such as system design consulting, developing software tools, platforms, and frameworks, capacity management, and launch reviews.
• Maintain live services by measuring and monitoring their availability, latency, and overall system health.
• Sustainably scale systems through automation and advocate for changes that enhance reliability and velocity.
• Implement sustainable incident response practices and conduct blameless postmortems.
• Participate in an on-call rotation to support production systems.
• Bachelor’s degree in Computer Science or a related technical field involving coding (such as physics or mathematics), or equivalent experience.
• Over 8 years of experience with infrastructure automation.
• Expertise in distributed systems design.
• Experience in designing and developing tools for operating large-scale private or public cloud systems in production.
• More than 5 years of experience delivering foundational infrastructure and observability platforms.
• Proficiency in one or more of the following programming languages: Python, Go, Perl, or Ruby.
• Extensive knowledge of Linux, networking, and containers.
• Equity.
• Comprehensive benefits.