Senior Site Reliability Engineer, Infrastructure Foundations

at Wikimedia Foundation

Arizona, California, Colorado, Connecticut, District of Columbia

Full-time | DevOps & Site Reliability Engineer (SRE) | Senior | $113.1k – $175.7k/year

Posted May 14

This is a fully remote position, open to applicants in Arizona and 31 other states.

📋 Description

• Executing daily operational and DevOps duties on Wikimedia’s public-facing infrastructure, including deployment, maintenance, configuration, and troubleshooting.

• Utilizing and implementing configuration management and deployment tools such as Puppet and Kubernetes.

• Driving continuous enhancements by automating the installation, configuration, and upkeep of services on our platform.

• Collaborating closely with product teams to help deliver scalable functionalities to users by assisting in the architectural design of new services and ensuring they operate effectively at scale.

• Engaging in a 24/7 on-call rotation shared among the broader SRE team, which involves participating in incident response, diagnosing issues, and following up on system outages or alerts across Wikimedia’s production infrastructure.

• Working with a global, cross-functional team in an asynchronous communication setting.

• Guiding peers in your areas of technical expertise and operational strengths.

• Traveling 1-2 times a year for in-person events and team gatherings.

⛳️ Requirements

• Over 6 years of experience in an SRE, Operations, or DevOps role as part of a team.

• Proficiency with shell and various scripting languages relevant to an SRE context (Python, Go, Bash, Ruby; with a primary focus on Python) and configuration management tools (Puppet, Ansible; we use Puppet).

• Experience in designing and managing infrastructure security for a large array of diverse services.

• Involvement in technical responses during security incidents.

• Familiarity with package management on Linux systems, particularly Debian.

• Strong troubleshooting skills at the Linux system level.

• Proven track record of automating tasks and processes, identifying process gaps, and discovering automation opportunities.

• Excellent English language proficiency (both verbal and written) and capacity to work independently as an effective member of a globally distributed team across multiple time zones.

• Experience in leading and participating in incident response and post-incident review processes, aiming for root cause analysis and implementing preventive measures.