
Senior Site Reliability Engineer
Posted Jun 1

This is a fully remote position, open to applicants in Arizona and 31 other states.
• Executing daily operational and DevOps duties on Wikimedia’s publicly accessible infrastructure, including deployment, maintenance, configuration, and troubleshooting.
• Utilizing and implementing configuration management and deployment tools such as Puppet and Kubernetes.
• Driving continuous improvements by automating the installation, configuration, and upkeep of services on our platform.
• Collaborating closely with product teams to deliver functionality to users by contributing to the architectural design of new services and ensuring they scale.
• Engaging in a 24/7 on-call rotation shared with the wider SRE team, which involves participating in incident response, diagnosing issues, and following up on system outages or alerts within Wikimedia’s production infrastructure.
• Working in partnership with a global, cross-functional team in an asynchronous communication environment.
• Providing mentorship to peers in your areas of technical and operational expertise.
• Over 6 years of experience in an SRE, Operations, or DevOps role within a team setting.
• Proficiency in shell scripting and at least one programming language commonly used in SRE work (such as Python, Go, Bash, or Ruby; we primarily use Python), plus familiarity with configuration management tools (such as Puppet or Ansible; we primarily use Puppet).
• Knowledge of distributed caching systems, including their underlying algorithms and performance optimization techniques.
• Experience with package management on Linux systems, specifically Debian.
• Strong troubleshooting skills at the Linux system level.
• Proven track record in automating tasks and processes, identifying process inefficiencies, and discovering opportunities for automation.
• Excellent English language skills, both verbal and written, along with the ability to work independently as an effective member of a globally distributed team across multiple time zones.
• Experience in leading and participating in incident response and post-incident review processes, focusing on conducting root cause analysis and implementing preventive measures.
• Competitive salary.
• Comprehensive health insurance.
• Flexible work arrangements.
• Generous paid time off.
• Opportunities for professional development.