This is a fully remote position, open to applicants in Spain.

• Oversee and maintain container orchestration platforms and containerized workloads.

• Monitor and resolve issues in production systems, participating in on-call rotations to ensure system reliability.

• Enhance observability by improving monitoring, logging, and alerting processes across various systems and data platforms.

• Administer and optimize cloud environments across multiple service providers.

• Manage and provide support for distributed data platforms and real-time processing systems.

• Develop and uphold continuous integration and delivery pipelines for efficient and dependable deployments.

• Lead the implementation of Infrastructure as Code (IaC) practices to guarantee consistency and scalability.

• Automate and orchestrate infrastructure using programming and scripting languages.

• Conduct system administration and networking tasks to support both internal and external environments.

• Collaborate effectively with engineers and stakeholders across various time zones.

• A minimum of 5 years of experience in Site Reliability Engineering, DevOps, or Platform Engineering roles.

• Proven track record in managing large-scale production systems within cloud environments (AWS, GCP, Azure, or OCI).

• Demonstrated leadership in incident response, best practices for on-call duties, and fostering a reliability-oriented culture.

• Strong experience with production on-call operations and incident management procedures.

• Advanced skills in administering and troubleshooting Kubernetes.

• Practical experience with observability tools such as Prometheus, Grafana, Loki, and Alertmanager.

• Knowledge of chat-based operational interfaces and/or auto-remediation controllers leveraging AI frameworks.

• Understanding of AI agents for auto-triaging alerts, correlating signals, and suggesting root-cause hypotheses.

• Expertise in managing data platforms like Elasticsearch, MongoDB, Spark, Kafka, and Redis.

• Proficiency in public cloud services (AWS, Azure, GCP, or OCI).

• Strong programming and automation expertise in Python and Bash.

• In-depth knowledge of Infrastructure as Code tooling (Terraform, Helm).

• Experience with CI/CD pipelines (GitHub Actions, Bitbucket, ArgoCD).

• Strong technical foundation in distributed systems, databases, networking, and Linux administration.

• Excellent problem-solving, communication, and leadership skills.

• Bachelor's degree in Computer Science, Engineering, or a related technical discipline.

• Certifications in AWS, GCP, Observability, Linux, or Kubernetes are advantageous.

• Competitive salary and performance-based incentives.

• Comprehensive health, dental, and vision insurance.

• Flexible working hours and remote work options.

• Professional development opportunities and training programs.

• A collaborative and inclusive company culture.

Staff SRE Engineer
