This is a fully remote position, open to applicants in the Philippines.

📋 Description

• Ensure the reliability and availability of the platform across both production and pre-production environments through proactive monitoring, alerting, and automation.

• Act as the first responder for incidents and contribute to problem management and root cause analysis.

• Support the development team's initiatives towards reliability, fostering a strong reliability culture throughout the development lifecycle.

• Create troubleshooting documentation for production support resources.

• Collaborate with engineering teams to produce effective runbooks and operational documentation, and to automate operational tasks.

• Work alongside development and cloud engineering teams to integrate reliability and performance into the software delivery lifecycle.

• Design, implement, and enhance observability solutions (metrics, logs, traces, dashboards) utilizing tools such as Prometheus, Grafana, and ELK.

• Participate in on-call rotations and continuously refine alert quality and response processes.

• Promote a culture of reliability, performance, and continuous improvement across teams.

⛳️ Requirements

• Bachelor's or Master's degree in Engineering or a related field.

• Experience in managing at least one container orchestration cluster (Kubernetes, Docker Swarm).

• Proven experience in developing or maintaining software for production services at scale.

• Familiarity with ELK.

• Experience with AWS.

• Knowledge of the Grafana/Prometheus stack.

• Strong scripting abilities (Bash, Python, or Go).

• Excellent communication skills.

• Ability to think creatively and anticipate challenges. It is crucial to be proactive rather than reactive; we must foresee challenges and critically evaluate existing technologies, procedures, and mindsets. Continuous review and questioning at all levels are expected.

• Versatility is essential. We employ agile/lean methodologies and prefer to iterate and learn rather than assume we have all the answers.

• A team player mentality is vital. You will not always work in isolation and should be enthusiastic about collaborating with product, experience design, engineering, and more.

Considered a plus:

• Telephony knowledge (SIP, VoIP);

• Experience in Linux administration (RedHat, CentOS, AL);

• Working knowledge of configuration management tools (Terraform, Ansible);

• Understanding of TCP/IP and general networking concepts;

• RDBMS knowledge (MySQL, Postgres);

• NoSQL knowledge (Redis).

🏝️ Benefits

• Competitive fixed compensation;

• Long-term employment with vacation days;

• Opportunities for professional development (courses, training, etc.);

• Be part of innovative technology products that have a global impact on the service industry;

• Work alongside skilled and enjoyable colleagues;

• Access to Apple gear.

Senior Site Reliability Engineer
