
Senior Site Reliability Engineer
Posted Jun 20

This is a fully remote position, open to applicants in the United States.
• Contribute to system observability by implementing and enhancing metrics, alerting, and dashboards to gain better insights and achieve quicker recovery.
• Develop automation, tools, and monitoring solutions aimed at ensuring high service availability.
• Collaborate with application and quality engineering teams to adopt best practices in reliability, release automation, and testing.
• Promote operational excellence through proactive incident prevention, conducting blameless postmortems, and engaging in capacity planning.
• Take part in on-call rotations to support critical services and guarantee a swift response to incidents.
• Solid experience in Python, particularly for automation, tooling, and data-driven operational tasks.
• Proficiency in at least one programming language (Java, C++, or Go).
• Strong understanding of Linux systems, cloud infrastructure (AWS, GCP, or Azure), and contemporary deployment practices (Docker, Kubernetes, Terraform).
• Experience with CI/CD pipelines, version control, and automated testing frameworks.
• Familiarity with observability tools (e.g., Prometheus, Grafana, ELK, Datadog) and log/metric analysis for troubleshooting issues.
• Proven experience in facilitating and documenting Critical User Journeys and translating them into actionable SLAs/SLOs for automation.
• Demonstrated capability to work with cross-functional teams and communicate effectively in high-stakes situations.
• A problem-solver who views reliability as a collective responsibility within engineering.
• Familiarity with AI-augmented development tools (Claude, Codex) as part of a modern engineering workflow.
**Nice to Have**
• Experience in writing or maintaining end-to-end or integration tests for distributed systems.
• Background in performance testing, capacity planning, or chaos engineering.
• Contributions to internal developer tools or reliability-focused frameworks.
• Exposure to security, compliance, or change management processes in production environments.
• Relevant certifications.
• Multiple medical insurance plans to select from.
• Dental, vision, life, and disability insurance.
• Employee Emergency Fund.
• Company equity (stock options).
• Open PTO policy.
• 401(k) plan with company matching.
• Hybrid/flexible work environment.