
Senior Cloud Platform Engineer
Posted 1 day ago

• Develop platform infrastructure: Create and implement self-service tools that let product teams deploy services without infrastructure tickets or manual provisioning.
• Minimize operational toil: Identify repetitive manual tasks and build automation to eliminate them.
• Enhance visibility and observability: Establish monitoring, alerting, and dashboards that give teams confidence in the health of their services. Design systems that surface issues before users notice them and simplify debugging of production problems, whether it's 3pm or 3am.
• Engage in on-call rotation: Participate in the on-call rotation to address infrastructure incidents. Your efforts will focus on decreasing incident frequency through improved automation and resilience strategies.
• Scale infrastructure: Plan capacity, improve performance, and ensure our platform handles increasing traffic without degradation. You will tackle challenges such as minimizing deployment times, optimizing resource utilization, and maintaining sub-100ms p99 latencies.
• Collaborate with various teams: Work closely with security, product engineering, and SRE teams to understand their needs and develop solutions that work for everyone.
• 5+ years of experience with distributed systems and microservices in production settings.
• Strong expertise in AWS: Proficient with EC2, ECS/EKS, VPC networking, IAM, and capable of architecting resilient systems across multiple availability zones.
• Proficient in Infrastructure as Code: Daily experience with Terraform or CloudFormation, thinking in code rather than through graphical interfaces.
• Programming capabilities for automation: Skilled in writing Go, Python, or similar languages to create tools and automate processes.
• Production experience with Kubernetes multi-tenancy: You have deployed, scaled, and troubleshot containerized workloads in production clusters with multiple tenants.
• Expertise in observability: Practical experience with tools such as Prometheus, Grafana, Datadog, or similar. You understand what to monitor and how to set effective alerts.
• Incident response experience: You have participated in on-call duties, resolved outages, and authored postmortems that led to systemic enhancements.
• Security-focused mindset: You adhere to least-privilege principles, ensure encryption both at rest and in transit, and consider threat models.
• Health, dental, 401k, and numerous other benefits.
• Generous paid time off.
• Equity grant.