
Senior Site Reliability Engineer, Infrastructure Foundations
Posted Jun 20

This is a fully remote position, open to applicants in Arizona and 31 other states.
• Conducting daily operational and DevOps activities on Wikimedia’s public-facing infrastructure, including deployment, maintenance, configuration, and troubleshooting.
• Utilizing and implementing configuration management and deployment tools such as Puppet and Kubernetes.
• Driving continuous improvement by automating the installation, configuration, and maintenance of services on our platform.
• Collaborating closely with product teams to enhance scalable functionality for our users by aiding in the architectural design of new services and ensuring they operate at scale.
• Engaging in a 24/7 on-call rotation shared among the broader Site Reliability Engineering (SRE) team, which includes participating in incident response, diagnosing issues, and following up on system outages or alerts across Wikimedia’s production infrastructure.
• Working with a global, cross-functional team in an asynchronous communication environment.
• Mentoring colleagues in your areas of technical and operational expertise.
• Willingness and ability to travel 1-2 times a year for in-person events and team meetings.
• A minimum of 6 years of experience in an SRE, Operations, or DevOps role as part of a team.
• Proficiency in shell and various scripting languages relevant to the SRE field (Python, Go, Bash, Ruby; with a primary focus on Python) and in configuration management tools (Puppet, Ansible; we primarily use Puppet).
• Experience in designing and managing infrastructure security for extensive fleets of diverse services.
• Background in technical incident response during security events.
• Familiarity with package management on Linux systems, specifically Debian.
• Strong troubleshooting skills at the Linux system level.
• Proven track record of automating tasks and processes, identifying gaps in processes, and recognizing automation opportunities.
• Excellent English language skills, both verbal and written, along with the ability to work independently as an effective member of a globally distributed team across multiple time zones.
• Experience in leading and participating in incident responses and post-incident review rituals to conduct root cause analysis and implement preventive measures.
• Competitive salary.
• Health insurance.
• Flexible working hours.
• Opportunities for professional development.