Remotery

Site Reliability Engineer II

Posted 22 hours ago

This is a fully remote position, open to applicants in the United States.

📋 Description

• Ensure the availability and resilience of essential services within production environments.

• Track service health using SLIs, SLOs, and error budgets, escalating issues when thresholds are at risk.

• Engage in on-call rotations, incident management, and post-incident evaluations to enhance service quality.

• Adhere to established ITIL/OSS methodologies (incident, change, problem, and capacity management).

• Automate routine operational tasks to reduce manual effort and toil.

• Contribute to monitoring, logging, and alerting infrastructures (e.g., Prometheus, Grafana, Catchpoint, ELK).

• Work with CI/CD pipelines, configuration management, and infrastructure-as-code tools (Terraform, Ansible, Jenkins).

• Develop scripts (Bash, Python, Go, etc.) to enhance system reliability and operational efficiency.

• Team up with engineering, product, and operations departments to foster resilient system design and operations.

• Support capacity planning and participate in disaster recovery drills.

• Work alongside vendors and service providers to diagnose service issues and monitor SLA performance.

• Document systems, share insights, and contribute to cultivating a reliability-focused engineering culture.

• Assist in the development of playbooks, runbooks, and operational documentation.

• Identify recurring challenges and suggest long-term enhancements.

• Advocate for reliability-centric practices within development and operations teams.


⛳️ Requirements

• Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent experience).

• 2–4 years of experience in site reliability, systems engineering, or operations.

• Familiarity with large-scale, production-grade systems.

• Strong Linux systems administration and troubleshooting capabilities.

• Knowledge of service reliability principles: monitoring, alerting, incident response, and root cause analysis.

• Proficiency in at least one scripting language (Python, Bash, or Go).

• Understanding of containerization (Kubernetes, Docker) and microservices architecture.

• Awareness of incident response and operational best practices.


🏝️ Benefits

• Flexible working hours

• Professional development opportunities

• Remote work options
