
Senior DevOps – Platform Reliability Engineer
Posted Jun 20

This is a fully remote position, open to applicants in New York.
Responsibilities
• Take ownership of and enhance CI/CD pipelines using GitHub Actions and OIDC-based authentication for microservices and agentic workloads, ensuring safe, rapid, and reversible deployments.
• Automate the provisioning of infrastructure through Infrastructure as Code (IaC) tools like Terraform and CloudFormation.
• Manage and scale our Kubernetes platform (EKS + Argo CD), which includes autoscaling, ingress, external-dns, cert-manager, External Secrets Operator, backups, runtime guardrails, and multi-tenant isolation for enterprise clients.
• Oversee the edge and network perimeter, which encompasses Cloudflare (CDN, WAF, Bot Management, DDoS protection, Zero Trust / Access), CloudFront, API Gateway, ALB/NLB, Route 53, and network security measures.
• Handle the data and event tier, including Aurora MySQL, ElastiCache/Redis, S3, and MSK (Kafka), with accountability for backups, point-in-time recovery (PITR), and multi-AZ disaster recovery aligned with defined RTO/RPO targets.
• Develop and maintain Lambda workloads where event-driven or serverless architectures are applicable.
• Create observability as a product using Prometheus, Grafana, and OpenTelemetry, including telemetry for LLM and agentic systems such as token costs, tool-call latency, evaluation signals, and prompt/version tracking.
• Enhance our security and compliance posture for SOC 2 and HIPAA, incorporating least-privilege IAM, SCPs, secrets management, SAST/DAST, dependency and container scanning, image signing, AWS Config, Security Hub, GuardDuty, Inspector, and evidence automation.
• Lead FinOps initiatives, including tagging standards, Savings Plans and Reserved Instances, cost attribution per tenant and workload, and LLM cost management.
• Develop and advance our AI-native DevOps capabilities.
Requirements
• Over 5 years of experience in DevOps, SRE, or Platform Engineering managing production systems on AWS.
• Extensive experience with CI/CD pipelines and tools such as GitHub Actions, GitLab CI, Jenkins, or CircleCI.
• Practical experience in operating production EKS environments, covering autoscaling, ingress, secrets management, and cluster upgrades.
• Strong AWS networking knowledge, including multi-account VPC design, subnets, routing, security groups, NACLs, Route 53, ACM, and load balancers.
• In-depth experience with Terraform and GitHub Actions, preferably using OIDC-based cloud authentication.
• Familiarity with Aurora/RDS MySQL, Redis (ElastiCache), and S3, including backups, PITR, migrations, and lifecycle management.
• Solid observability experience with Prometheus, Grafana, and OpenTelemetry.
• Experience in operating Argo CD at scale.
• Proficiency with Infrastructure as Code tools such as Terraform, CloudFormation, or Ansible.
• Experience managing Cloudflare services, including WAF, Bot Management, Rate Limiting, and Zero Trust / Access, along with CloudFront.
• Experience in operating Kafka/MSK at scale, including topics, consumer groups, and schema registries.
• Familiarity with Lambda and event-driven architectures.
• Proficiency in Python, Bash, and Linux systems.
• Strong grasp of security best practices across IAM, KMS, secrets management, networking, and software supply chain security.
• Knowledge of vulnerability scanning and compliance tools.
Benefits
• Competitive compensation packages
• Comprehensive health benefits:
  • 100% of employee premiums covered
  • 75%–80% of dependent premiums covered for most health, dental, and vision plans
• 401(k) plans to assist with retirement planning (no employer matching currently)
• Paid parental leave
• Unlimited PTO
• Flexible remote work from any location
• Up to $200/month co-working reimbursement
• Home office stipend:
  • Up to $500 for home office setup
  • $100/month for internet, phone, and related expenses