This is a fully remote position, open to applicants in Poland.

📋 Description

• Offering guidance and mentorship to fellow engineers within the department.

• Creating and maintaining automated tools and scripts to improve system reliability, deployment processes, and the efficiency of incident response.

• Enhancing our system monitoring to accelerate error detection and resolution, thus boosting the performance and reliability of the virtualization platform.

• Participating in on-call rotations to help restore service and resolve issues that affect it.

• Developing automation and tools to reduce operational toil, make deployments safer, and speed up incident response.

• Assisting in capacity planning, autoscaling configurations, and workload scheduling for AI compute infrastructure.

⛳️ Requirements

• Have expert-level experience in a SysAdmin (Linux/Unix Administration), DevOps, or SRE position, specifically with large-scale distributed systems.

• Exhibit proficiency in Kubernetes and large-scale containerization technologies.

• Be skilled in at least one programming language (such as Python or Go) and in infrastructure-as-code or configuration management tools such as Terraform, SaltStack, or Ansible.

• Be able to establish Service Level Objectives (SLOs) and use observability tools such as Prometheus, Grafana, and distributed tracing to improve system monitoring.

• Possess experience in architecting software and infrastructure on a large scale.

• Take ownership of reliability, build automation and monitoring solutions, and work collaboratively with an engineering team that may be unfamiliar with SRE practices.

🏝️ Benefits

• Your health

• Your finances

• Your family

• Your time at work

• Your time pursuing other endeavors

Senior Site Reliability Engineer
