
Senior DevOps Engineer/Site Reliability Engineer
Posted Jun 2

This is a fully remote position, open to applicants in New York.
• Operate and maintain Kubernetes clusters and containerized applications.
• Manage cloud infrastructure across providers such as OCI, AWS, GCP, or Azure.
• Design and maintain CI/CD pipelines to ensure reliable application deployments.
• Implement and manage Infrastructure as Code (IaC) using Terraform and Helm.
• Create automation tools and operational processes using Python, Go, or Bash.
• Drive observability initiatives, including improvements to monitoring, logging, tracing, and alerting.
• Monitor, troubleshoot, and resolve production incidents, and participate in on-call rotations.
• Support and enhance distributed data platforms including Kafka, Elasticsearch, Spark, Redis, and MongoDB.
• Improve platform reliability, scalability, and operational efficiency through SRE best practices.
• Collaborate with cross-functional teams across various time zones.
• Perform Linux system administration and network troubleshooting.
• Participate in incident response processes, postmortems, and reliability enhancements.
• Support GitOps and deployment workflows with tools like ArgoCD and GitHub Actions.
• Evaluate and adopt AI-assisted operational tools for auto-remediation, alert correlation, and operational intelligence.
• Over 5 years of experience in DevOps, SRE, or Platform Engineering roles.
• Strong proficiency in Kubernetes, Docker, and container orchestration.
• Practical experience in managing production cloud environments.
• Solid knowledge of Infrastructure as Code with Terraform and Helm.
• Experience with CI/CD tools and deployment automation practices.
• Advanced troubleshooting skills in Linux systems, networking, and distributed systems.
• Familiarity with observability platforms such as Prometheus, Grafana, Loki, Alertmanager, and Elastic Stack.
• Strong programming and scripting skills in Python, Bash, or Go.
• Background in supporting high-availability production systems and on-call operations.
• Knowledge of incident management and reliability engineering methodologies.
• Understanding of data platform technologies like Kafka, Spark, Elasticsearch, Redis, or MongoDB.
• Awareness of AI-driven operational tools and automated remediation strategies.
• Excellent communication, collaboration, and problem-solving abilities.
• Must reside on the East Coast.
• Pre-IPO Stock Options
• Medical, Dental & Vision care
• 401(k)
• Employee Assistance Program
• Employee Discount Program
• Life Insurance
• Paid time off
• Referral Program
• Rewards and Recognition Program