This is a fully remote position, open to applicants in Australia.

📋 Description

• Ensure the reliability and availability of the platform across both production and pre-production environments through proactive monitoring, alerting, and automation.

• Serve as the initial point of contact for incidents, contributing to problem management and conducting root cause analysis.

• Help the development team build a culture of reliability throughout the development lifecycle.

• Create troubleshooting documentation for production support resources.

• Work collaboratively with Engineering teams to develop effective runbooks and operational documentation, and to automate operational tasks.

• Partner with development and cloud engineering teams to integrate reliability and performance into the software delivery lifecycle.

• Design, implement, and enhance observability solutions (metrics, logs, traces, dashboards) utilizing tools such as Prometheus, Grafana, and ELK.

• Participate in on-call rotations and continuously improve alert quality and response processes.

• Advocate for a culture of reliability, performance, and continuous improvement across various teams.

⛳️ Requirements

• Bachelor's or Master's degree in Engineering or a related field.

• Experience managing at least one container orchestration cluster (e.g., Kubernetes or Docker Swarm).

• Proven experience in developing or maintaining software for production services at scale.

• Familiarity with ELK.

• Experience with AWS.

• Proficiency in the Grafana/Prometheus stack.

• Strong scripting abilities (Bash, Python, or Go).

• Excellent communication skills.

• Ability to think creatively and anticipate challenges. It is crucial to be proactive rather than merely reactive; we must foresee challenges and critically evaluate existing technologies, procedures, and mindsets. Constant review and questioning at all levels are expected.

• Versatility. We adopt agile/lean methodologies and prefer to iterate and learn rather than assume we have all the answers.

• Team player mentality. You thrive on collaboration and are enthusiastic about involving product, experience design, engineering, and more in the process.

The following will be considered a plus:

• Knowledge of telephony (SIP, VoIP);

• Experience in Linux Administration (RedHat, CentOS, AL);

• Working knowledge of infrastructure-as-code and configuration management tools (Terraform, Ansible);

• Familiarity with TCP/IP and general networking concepts;

• Knowledge of RDBMS (MySQL, Postgres);

• Understanding of NoSQL (Redis).

🏝️ Benefits

• Fixed compensation;

• Long-term employment with vacation days;

• Opportunities for professional growth (courses, training, etc.);

• Being part of successful, cutting-edge technology products that are making a global impact in the service industry;

• Engaging and enjoyable colleagues;

• Apple gear.

Senior Site Reliability Engineer