
Senior Site Reliability Engineer
Posted Jun 21

This is a fully remote position, open to applicants in New York.
• Ensure stability and scalability across our worldwide compute platform, which encompasses numerous data centers, various public clouds, and on-premise environments, serving as the essential foundation for all products.
• Manage and enhance our GitOps delivery model, utilizing Rancher Fleet and Flux with Helm to deploy core cluster services and application workloads in a declarative and repeatable manner.
• Create self-healing, fault-tolerant infrastructure and internal tools that eliminate repetitive operational tasks and minimize toil for both platform and application teams.
• Take ownership of cluster autoscaling and capacity strategy, including Karpenter, HPA, KEDA, and predictive scaling guided by event and calendar data.
• Establish SLOs and reliability metrics for platform components, leveraging Datadog and our logging pipeline to highlight cluster and workload health.
• Foster technical growth by sharing knowledge, engaging in design discussions, and promoting a collaborative team culture; participate in on-call rotations.
• Bachelor's degree in Computer Science or equivalent education, experience, and training.
• A minimum of 4 years of experience managing distributed cloud and on-premise environments at scale, with substantial hands-on experience in AWS.
• Familiarity with GCP, vSphere, or Nutanix is advantageous.
• Extensive expertise in container orchestration with Kubernetes, including the capability to design, scale, and troubleshoot intricate workloads.
• Strong background in developing software for automation and infrastructure tooling, particularly using Go and Python.
• Strong knowledge of networking and Linux-based systems, including container runtimes like Docker and containerd, as well as packet-level debugging and kernel troubleshooting.
• Experience with Infrastructure as Code (IaC) and configuration management tools to ensure scalable and repeatable infrastructure provisioning.
• Bonus
• Equity
• Benefits as applicable