
Senior Site Reliability Engineer
Posted May 11

This is a fully remote position, open to applicants in Poland.
• Offering guidance and mentorship to fellow engineers within the department.
• Creating and maintaining automated tools and scripts to improve system reliability, deployment processes, and the efficiency of incident response.
• Enhancing our system monitoring to accelerate error detection and resolution, thus boosting the performance and reliability of the virtualization platform.
• Participating in on-call rotations to help restore affected services and resolve the issues behind them.
• Developing automation and tools to minimize operational labor, enhance deployment safety, and expedite incident response.
• Assisting in capacity planning, autoscaling configurations, and workload scheduling for AI compute infrastructure.
• Have expert-level experience in a SysAdmin (Linux/Unix Administration), DevOps, or SRE position, specifically with large-scale distributed systems.
• Exhibit proficiency in Kubernetes and large-scale containerization technologies.
• Be skilled in at least one programming language (Python or Golang) and in configuration management with Terraform, SaltStack, or Ansible.
• Establish Service Level Objectives (SLOs) and utilize observability tools like Prometheus, Grafana, and distributed tracing to improve system monitoring.
• Possess experience in architecting software and infrastructure on a large scale.
• Show accountability for reliability, develop automation and monitoring solutions, and work collaboratively with an engineering team that may be unfamiliar with SRE practices.
• Your health
• Your finances
• Your family
• Your time at work
• Your time pursuing other endeavors