• Take ownership of reliability, availability, and performance for production systems operating in cloud environments.

• Define and monitor SLIs and SLOs, and help manage error budgets across the platform.

• Lead incident response, including detection, triage, mitigation, and post-incident reviews.

• Enhance observability through effective logging, monitoring, alerting, and dashboard implementations.

• Automate operational processes and minimize manual tasks wherever feasible.

• Collaborate closely with engineering teams to bolster system resilience and scalability.

• Assist with capacity planning, infrastructure optimization, and performance tuning.

• Develop internal tools, runbooks, and best practices for operations.

• Provide support for Kubernetes-based infrastructure and large-scale distributed systems.

• Serve as an escalation point for complex production and platform issues.

• Over 5 years of experience in Site Reliability Engineering, DevOps, Platform Engineering, or comparable infrastructure positions.

• Extensive experience with cloud platforms such as AWS, GCP, or Azure.

• Practical experience with Kubernetes and containerized environments.

• Strong grasp of distributed systems and microservices architecture.

• Familiarity with observability tools such as Prometheus, Grafana, Datadog, ELK, or OpenTelemetry.

• Skilled in infrastructure automation and scripting (Terraform, Python, Bash, etc.).

• Experience managing CI/CD pipelines and automating deployments.

• Excellent troubleshooting and incident management skills.

• Ability to collaborate across functions and communicate effectively in high-pressure scenarios.

• Comprehensive health coverage including medical, dental, and vision.

• Flexible paid time off (PTO).

• Support for personal development.

Site Reliability Engineer
