This is a fully remote position, open to applicants in Argentina.

📋 Description

• Ensure the availability and resilience of essential services across production environments.

• Track service health through SLIs, SLOs, and error budgets, escalating issues when thresholds are in jeopardy.

• Engage in on-call rotations, incident response, and post-incident reviews to enhance service quality.

• Adhere to established ITIL/OSS methodologies (incident, change, problem, and capacity management).

• Create automation for routine operational tasks, minimizing manual intervention and toil.

• Contribute to monitoring, logging, and alerting frameworks (e.g., Prometheus, Grafana, Catchpoint, ELK).

• Work with CI/CD pipelines, configuration management, and infrastructure-as-code tools (Terraform, Ansible, Jenkins).

• Develop scripts (Bash, Python, Go, etc.) to enhance system reliability and efficiency.

• Partner with engineering, product, and operations teams to ensure robust system design and operations.

• Assist with capacity planning and conduct disaster recovery exercises.

• Work alongside vendors and service providers to troubleshoot service issues and monitor SLA performance.

• Document systems, share insights, and help nurture a reliability-focused engineering culture.

• Contribute to playbooks, runbooks, and operational documentation.

• Identify recurring issues and suggest long-term solutions.

• Advocate for reliability-centered practices within development and operations teams.

⛳️ Requirements

• Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience).

• 2–4 years of experience in site reliability, systems engineering, or operations.

• Familiarity with large-scale, production-grade systems.

• Strong Linux systems administration and troubleshooting capabilities.

• Knowledge of service reliability concepts, including monitoring, alerting, incident response, and root cause analysis.

• Proficiency in at least one scripting language (Python, Bash, or Go).

• Understanding of containers (Kubernetes, Docker) and microservices principles.

• Awareness of incident response and operational best practices.

• Excellent problem-solving abilities and eagerness to learn new technologies.

• Experience in a SaaS, service provider, or distributed systems environment is preferred.

🏝️ Benefits

• We prioritize fairness and integrity in our relationships with customers, partners, and employees.

• Diversity, equity, and inclusion are fundamental to our values.

• We are dedicated to creating a workplace where all employees feel a sense of belonging, regardless of race, ethnicity, nationality, gender, sexual orientation, age, religion, socio-economic status, ability, veteran status, or education.

• We believe that our commitment to fostering a diverse work environment enables us to better serve our customers.

Site Reliability Engineer II
