This is a fully remote position, open to applicants in Hungary.

• Manage and oversee container orchestration platforms along with containerized workloads.

• Monitor and diagnose production systems, taking part in on-call rotations to guarantee reliability.

• Enhance observability by improving monitoring, logging, and alerting across systems and data platforms.

• Administer and optimize cloud environments across various providers.

• Manage and support distributed data platforms as well as real-time processing systems.

• Develop and sustain continuous integration and delivery pipelines for efficient and dependable deployments.

• Own and implement Infrastructure as Code (IaC) practices to ensure consistency and scalability.

• Automate and orchestrate infrastructure using programming and scripting languages.

• Conduct system administration and networking tasks to support both internal and external environments.

• Collaborate effectively with engineers and stakeholders across different time zones.

• Over 5 years of experience in Site Reliability Engineering, DevOps, or Platform Engineering roles.

• Proven track record of running large-scale production systems in cloud environments (AWS, GCP, Azure, or OCI).

• Demonstrated leadership in incident response, on-call best practices, and fostering a reliability-focused culture.

• Strong experience in production on-call operations and incident management.

• Advanced skills in Kubernetes administration and troubleshooting.

• Practical experience with observability tools: Prometheus, Grafana, Loki, and Alertmanager.

• Knowledge of chat-based operations interfaces and/or auto-remediation controllers leveraging AI agentic frameworks.

• Understanding of AI agents for auto-triaging alerts, correlating signals, and suggesting root-cause hypotheses.

• Expertise in managing data platforms (Elasticsearch, MongoDB, Spark, Kafka, Redis).

• Proficiency in public cloud services (AWS, Azure, GCP, or OCI).

• Strong programming and automation abilities in Python and Bash.

• In-depth knowledge of Infrastructure as Code (Terraform, Helm).

• Experience with CI/CD pipelines (GitHub Actions, Bitbucket, ArgoCD).

• Solid technical foundation in distributed systems, databases, networking, and Linux administration.

• Exceptional problem-solving, communication, and leadership skills.

• Bachelor's degree in Computer Science, Engineering, or a related technical field.

• Certifications in AWS, GCP, Observability, Linux, or Kubernetes are a plus.

• Competitive salary and performance-based bonuses.

• Comprehensive health insurance plans.

• Opportunities for professional development and career advancement.

• Flexible working hours and remote work options.

• A vibrant and inclusive company culture.

Senior SRE Engineer
