
Senior Cloud Operations Engineer
Posted 3 days ago

This is a fully remote position, open to applicants in the United Kingdom.
Responsibilities
• Design, implement, and manage scalable, secure, and highly available AWS cloud infrastructure utilizing services such as EC2, EKS, ECS, RDS, S3, VPC, Lambda, and IAM.
• Enhance the reliability and performance of containerized applications by overseeing Amazon EKS and ECS environments, which includes cluster operations, networking, scaling, and troubleshooting.
• Ensure the stability, security, and efficiency of production Linux environments through system administration, performance tuning, storage management, networking, and incident resolution.
• Maintain and optimize both relational databases (PostgreSQL, MySQL, Aurora) and NoSQL platforms (DynamoDB, Redis), ensuring they are highly available, performant, and ready for disaster recovery.
• Strengthen the organization's cloud security posture by effectively managing IAM, network security controls, secrets management, and adhering to compliance best practices.
• Improve platform observability and operational excellence by implementing and enhancing monitoring, logging, alerting, and performance analytics using tools like CloudWatch, Prometheus, and Grafana.
• Take charge of production incidents by engaging in on-call rotations, leading troubleshooting efforts, conducting root cause analysis, and fostering continuous improvement initiatives.
• Collaborate closely with software engineering, DevOps, and platform teams to enhance deployment processes, application reliability, and operational efficiency.
• Identify and execute cloud cost optimization opportunities through resource right-sizing, capacity planning, automation, and governance best practices.
Requirements
• 4–5 years of experience in a cloud operations, infrastructure engineering, or SRE role with a strong hands-on technical emphasis.
• Extensive hands-on experience with core AWS services: EC2, EKS, ECS, RDS/Aurora, S3, VPC, IAM, Lambda, CloudWatch, Route 53, and ALB/NLB.
• Demonstrated ability to design and troubleshoot complex AWS networking architectures (VPCs, subnets, transit gateways, security groups).
• Strong understanding of AWS IAM, including roles, policies, permission boundaries, and cross-account access.
• Hands-on production experience managing workloads on Amazon EKS and ECS, including cluster lifecycle, node group management, networking (CNI, service mesh basics), and autoscaling.
• Fundamental knowledge of Docker: image builds, registries (ECR), multi-stage builds, and container security.
• Strong Linux administration skills, including Bash/Python scripting, process and memory management, filesystem and storage operations, kernel parameters, and network diagnostics.
• Experience in managing and hardening Linux servers in production environments (RHEL, Ubuntu, or Amazon Linux).
• Proficient in Terraform, including module design, state management, remote backends, and workspace strategies.
• Practical experience with Puppet for configuration management, node classification, and enforcing system state at scale.
• Hands-on experience with relational databases such as PostgreSQL, MySQL, or AWS RDS/Aurora, including schema management, query optimization, replication, backups, and failover.
• Familiarity with NoSQL databases like DynamoDB, Redis, or MongoDB, including data modeling, performance tuning, and operational monitoring.
• Understanding of CI/CD pipelines using tools such as GitHub Actions, Jenkins, or AWS CodePipeline.
• Experience with observability tools, including CloudWatch, Datadog, Prometheus, or Grafana.
Benefits
• Flexible working arrangements.
• Professional development opportunities.