This is a fully remote position, open to applicants in the United States.

• Ensure the availability and resilience of essential services within production environments.

• Track service health using SLIs, SLOs, and error budgets, escalating issues when thresholds are at risk.

• Engage in on-call rotations, incident management, and post-incident evaluations to enhance service quality.

• Adhere to established ITIL/OSS methodologies (incident, change, problem, and capacity management).

• Create automation for routine operational tasks to minimize manual effort and reduce toil.

• Contribute to monitoring, logging, and alerting infrastructures (e.g., Prometheus, Grafana, Catchpoint, ELK).

• Work with CI/CD pipelines, configuration management, and infrastructure-as-code tools (Terraform, Ansible, Jenkins).

• Develop scripts (Bash, Python, Go, etc.) to enhance system reliability and operational efficiency.

• Team up with engineering, product, and operations departments to foster resilient system design and operations.

• Support capacity planning and participate in disaster recovery drills.

• Work alongside vendors and service providers to diagnose service issues and monitor SLA performance.

• Document systems, share insights, and contribute to cultivating a reliability-focused engineering culture.

• Assist in the development of playbooks, runbooks, and operational documentation.

• Identify recurring challenges and suggest long-term enhancements.

• Advocate for reliability-centric practices within development and operations teams.

• Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent experience).

• 2–4 years of experience in site reliability, systems engineering, or operations.

• Familiarity with large-scale, production-grade systems.

• Strong Linux systems administration and troubleshooting capabilities.

• Knowledge of service reliability principles: monitoring, alerting, incident response, and root cause analysis.

• Proficient in at least one scripting language (Python, Bash, or Go).

• Understanding of containerization (Kubernetes, Docker) and microservices architecture.

• Awareness of incident response and operational best practices.

• Flexible working hours

• Professional development opportunities

• Remote work options

Site Reliability Engineer II
