
Site Reliability Engineer II
Posted 1 day ago

This is a fully remote position, open to applicants in Argentina.
• Ensure the availability and resilience of essential services across production environments.
• Track service health through SLIs, SLOs, and error budgets, escalating issues when thresholds are in jeopardy.
• Engage in on-call rotations, incident response, and post-incident reviews to enhance service quality.
• Adhere to established ITIL/OSS methodologies (incident, change, problem, and capacity management).
• Create automation for routine operational tasks, minimizing manual intervention and toil.
• Contribute to monitoring, logging, and alerting frameworks (e.g., Prometheus, Grafana, Catchpoint, ELK).
• Work with CI/CD pipelines, configuration management, and infrastructure-as-code tools (Terraform, Ansible, Jenkins).
• Develop scripts (Bash, Python, Go, etc.) to enhance system reliability and efficiency.
• Partner with engineering, product, and operations teams to ensure robust system design and operations.
• Assist with capacity planning and conduct disaster recovery exercises.
• Work alongside vendors and service providers to troubleshoot service issues and monitor SLA performance.
• Document systems, share insights, and help nurture a reliability-focused engineering culture.
• Contribute to playbooks, runbooks, and operational documentation.
• Identify recurring issues and suggest long-term solutions.
• Advocate for reliability-centered practices within development and operations teams.
• Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience).
• 2–4 years of experience in site reliability, systems engineering, or operations.
• Familiarity with large-scale, production-grade systems.
• Strong Linux systems administration and troubleshooting capabilities.
• Knowledge of service reliability concepts, including monitoring, alerting, incident response, and root cause analysis.
• Proficient in at least one scripting language (Python, Bash, or Go).
• Understanding of containers (Kubernetes, Docker) and microservices principles.
• Awareness of incident response and operational best practices.
• Excellent problem-solving abilities and eagerness to learn new technologies.
• Experience in a SaaS, service provider, or distributed systems environment is preferred.
• We prioritize fairness and integrity in our relationships with customers, partners, and employees.
• Diversity, equity, and inclusion are fundamental to our values.
• We are dedicated to creating a workplace where all employees feel a sense of belonging, regardless of race, ethnicity, nationality, gender, sexual orientation, age, religion, socio-economic status, ability, veteran status, or education.
• We believe that our commitment to fostering a diverse work environment enables us to better serve our customers.