
Site Reliability Engineer
Posted 2 hours ago

Responsibilities
• Take ownership of reliability, availability, and performance for production systems operating in cloud environments.
• Define and monitor SLIs and SLOs, and help manage error budgets across the platform.
• Lead incident response, from detection and triage through mitigation and post-incident review.
• Improve observability through logging, monitoring, alerting, and dashboards.
• Automate operational processes and minimize manual tasks wherever feasible.
• Collaborate closely with engineering teams to bolster system resilience and scalability.
• Assist in capacity planning, infrastructure optimization, and performance enhancement.
• Develop internal tools, runbooks, and best practices for operations.
• Provide support for Kubernetes-based infrastructure and large-scale distributed systems.
• Serve as an escalation point for complex production and platform issues.
Requirements
• 5+ years of experience in Site Reliability Engineering, DevOps, Platform Engineering, or comparable infrastructure roles.
• Extensive experience with cloud platforms such as AWS, GCP, or Azure.
• Practical experience with Kubernetes and containerized environments.
• Strong grasp of distributed systems and microservices architecture.
• Familiarity with observability tools such as Prometheus, Grafana, Datadog, ELK, or OpenTelemetry.
• Skilled in infrastructure automation and scripting (Terraform, Python, Bash, etc.).
• Experience managing CI/CD pipelines and automating deployments.
• Excellent troubleshooting and incident management skills.
• Ability to collaborate across functions and communicate effectively in high-pressure scenarios.
Benefits
• Comprehensive health coverage including medical, dental, and vision.
• Flexible paid time off (PTO).
• Support for personal development.