
Site Reliability Engineer II
Posted 22 hours ago

This is a fully remote position, open to applicants in the United States.
• Ensure the availability and resilience of essential services within production environments.
• Track service health utilizing SLIs, SLOs, and error budgets, escalating issues when thresholds are jeopardized.
• Engage in on-call rotations, incident management, and post-incident evaluations to enhance service quality.
• Adhere to established ITIL/OSS methodologies (incident, change, problem, and capacity management).
• Create automation for routine operational tasks to reduce manual effort and toil.
• Contribute to monitoring, logging, and alerting infrastructures (e.g., Prometheus, Grafana, Catchpoint, ELK).
• Work with CI/CD pipelines, configuration management, and infrastructure as code tools (Terraform, Ansible, Jenkins).
• Develop scripts (Bash, Python, Go, etc.) to enhance system reliability and operational efficiency.
• Team up with engineering, product, and operations departments to foster resilient system design and operations.
• Support capacity planning and participate in disaster recovery drills.
• Work alongside vendors and service providers to diagnose service issues and monitor SLA performance.
• Document systems, share insights, and contribute to cultivating a reliability-focused engineering culture.
• Assist in the development of playbooks, runbooks, and operational documentation.
• Identify recurring challenges and suggest long-term enhancements.
• Advocate for reliability-centric practices within development and operations teams.
• Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent experience).
• 2–4 years of experience in site reliability, systems engineering, or operations.
• Familiarity with large-scale, production-grade systems.
• Strong Linux systems administration and troubleshooting capabilities.
• Knowledge of service reliability principles: monitoring, alerting, incident response, and root cause analysis.
• Proficient in at least one scripting language (Python, Bash, or Go).
• Understanding of containerization (Kubernetes, Docker) and microservices architecture.
• Awareness of incident response and operational best practices.
• Flexible working hours
• Professional development opportunities
• Remote work options