This is a fully remote position, open to applicants in New York.

📋 Description

• Operate and maintain Kubernetes clusters and containerized applications.

• Manage cloud infrastructure across providers such as OCI, AWS, GCP, or Azure.

• Design and maintain CI/CD pipelines to ensure reliable application deployments.

• Implement and manage Infrastructure as Code (IaC) using Terraform and Helm.

• Create automation tools and operational processes using Python, Go, or Bash.

• Drive observability initiatives, including improvements to monitoring, logging, tracing, and alerting.

• Monitor, troubleshoot, and resolve production incidents, and participate in on-call rotations.

• Support and enhance distributed data platforms including Kafka, Elasticsearch, Spark, Redis, and MongoDB.

• Improve platform reliability, scalability, and operational efficiency through SRE best practices.

• Collaborate with cross-functional teams across various time zones.

• Perform Linux system administration and network troubleshooting.

• Participate in incident response processes, postmortems, and reliability enhancements.

• Support GitOps and deployment workflows with tools like ArgoCD and GitHub Actions.

• Evaluate and adopt AI-assisted operational tools for auto-remediation, alert correlation, and operational intelligence.

⛳️ Requirements

• Over 5 years of experience in DevOps, SRE, or Platform Engineering roles.

• Strong proficiency in Kubernetes, Docker, and container orchestration.

• Hands-on experience managing production cloud environments.

• Solid knowledge of Infrastructure as Code with Terraform and Helm.

• Experience with CI/CD tools and deployment automation practices.

• Advanced troubleshooting skills across Linux systems, networking, and distributed systems.

• Familiarity with observability platforms such as Prometheus, Grafana, Loki, Alertmanager, and Elastic Stack.

• Strong programming and scripting skills in Python, Bash, or Go.

• Background in supporting high-availability production systems and on-call operations.

• Knowledge of incident management and reliability engineering methodologies.

• Understanding of data platform technologies like Kafka, Spark, Elasticsearch, Redis, or MongoDB.

• Awareness of AI-driven operational tools and automated remediation strategies.

• Excellent communication, collaboration, and problem-solving abilities.

• Must reside on the East Coast.

🏝️ Benefits

• Pre-IPO Stock Options

• Medical, Dental & Vision care

• 401(k)

• Employee Assistance Program

• Employee Discount Program

• Life Insurance

• Paid time off

• Referral Program

• Rewards and Recognition Program

Senior DevOps Engineer/Site Reliability Engineer

