This is a fully remote position, open to applicants in New York.

📋 Description

• Ensure stability and scalability across our worldwide compute platform, which encompasses numerous data centers, various public clouds, and on-premise environments, serving as the essential foundation for all products.

• Manage and enhance our GitOps delivery model, utilizing Rancher Fleet and Flux with Helm to deploy core cluster services and application workloads in a declarative and repeatable manner.

• Create self-healing, fault-tolerant infrastructure and internal tools that eliminate repetitive operational tasks and minimize toil for both platform and application teams.

• Take ownership of cluster autoscaling and capacity strategy, including Karpenter, HPA, KEDA, and predictive scaling guided by event and calendar data.

• Establish SLOs and reliability metrics for platform components, leveraging Datadog and our logging pipeline to highlight cluster and workload health.

• Foster technical growth by sharing knowledge, engaging in design discussions, and promoting a collaborative team culture. Participate in on-call rotations.

⛳️ Requirements

• Bachelor's degree in Computer Science or equivalent education, experience, and training.

• A minimum of 4 years of experience managing distributed cloud and on-premise environments at scale, with substantial hands-on experience in AWS.

• Familiarity with GCP, vSphere, or Nutanix is advantageous.

• Extensive expertise in container orchestration with Kubernetes, including the ability to design, scale, and troubleshoot complex workloads.

• Strong background in developing software for automation and infrastructure tooling, particularly using Go and Python.

• Strong knowledge of networking and Linux-based systems, including container runtimes such as Docker and containerd, packet-level debugging, and kernel troubleshooting.

• Experience with Infrastructure as Code (IaC) and configuration management tools to ensure scalable and repeatable infrastructure provisioning.

🏝️ Benefits

• Bonus

• Equity

• Benefits as applicable

Senior Site Reliability Engineer

