This is a fully remote position, open to applicants in Poland.

📋 Description

• Ensure the availability, reliability, and performance of high-traffic Java applications in a distributed environment.

• Diagnose and resolve complex issues in both production and non-production environments.

• Conduct pre- and post-deployment performance testing and monitoring to continually improve application performance.

• Design, develop, and manage agentic AI workflows that automate operational tasks such as alert triage and root cause analysis.

⛳️ Requirements

• Bachelor’s degree in Computer Science or a related field, or equivalent professional experience.

• Over 5 years of experience in SRE, DevOps, or similar infrastructure roles, with a background in managing large-scale, high-availability production systems.

• At least 3 years of hands-on experience in managing production Kubernetes clusters, with a comprehensive understanding of architecture, networking, storage, and security.

• Advanced proficiency with the Grafana observability stack, including dashboards, alerting, visualization, and Grafana Alloy for telemetry collection.

• Strong scripting skills in Python, Bash, or Go, with a background in building CI/CD pipelines and deployment automation.

• Minimum of 1 year of practical experience in developing or operating AI/LLM-powered tools, agents, or workflows.