
Senior Site Reliability Engineer
Posted 6 days ago

This is a fully remote position, open to applicants in Australia.
Responsibilities:
• Ensure the reliability and availability of the platform across both production and pre-production environments through proactive monitoring, alerting, and automation.
• Serve as the initial point of contact for incidents, contributing to problem management and conducting root cause analysis.
• Help the development team build a culture of reliability throughout the development lifecycle.
• Create troubleshooting documentation for production support resources.
• Work with Engineering teams to develop effective runbooks and operational documentation, and to automate operational tasks.
• Partner with development and cloud engineering teams to integrate reliability and performance into the software delivery lifecycle.
• Design, implement, and enhance observability solutions (metrics, logs, traces, dashboards) using tools such as Prometheus, Grafana, and ELK.
• Participate in on-call rotations and continuously improve alert quality and response processes.
• Advocate for a culture of reliability, performance, and continuous improvement across various teams.
Requirements:
• Bachelor's or Master's degree in Engineering or a related field.
• Experience managing at least one container orchestration cluster (Kubernetes, Docker Swarm).
• Proven experience in developing or maintaining software for production services at scale.
• Familiarity with ELK.
• Experience with AWS.
• Proficiency in the Grafana/Prometheus stack.
• Strong scripting abilities (Bash, Python, or Go).
• Excellent communication skills.
• Ability to think creatively and anticipate challenges. Being proactive rather than merely reactive is crucial: you are expected to critically evaluate existing technologies, procedures, and mindsets, and to keep reviewing and questioning them at all levels.
• Versatility. We adopt agile/lean methodologies and prefer to iterate and learn rather than assume we have all the answers.
• Team player mentality. You thrive on collaboration and are enthusiastic about involving product, experience design, engineering, and more in the process.
Will be considered as a plus:
• Knowledge of telephony (SIP, VoIP);
• Experience in Linux Administration (RedHat, CentOS, AL);
• Working knowledge of Infrastructure-as-Code and Configuration Management tools (Terraform, Ansible);
• Familiarity with TCP/IP and general networking concepts;
• Knowledge of RDBMS (MySQL, Postgres);
• Understanding of NoSQL (Redis).
We offer:
• Fixed compensation;
• Long-term employment with vacation days;
• Opportunities for professional growth (courses, training, etc.);
• Being part of successful, cutting-edge technology products that are making a global impact in the service industry;
• Engaging and enjoyable colleagues;
• Apple gear.