
Senior Site Reliability Engineer, Infrastructure Foundations
Posted 23 hours ago

• Executing daily operational and DevOps duties on Wikimedia’s public-facing infrastructure, including deployment, maintenance, configuration, and troubleshooting.
• Utilizing and implementing configuration management and deployment tools such as Puppet and Kubernetes.
• Driving continuous enhancements by automating the installation, configuration, and upkeep of services on our platform.
• Collaborating closely with product teams to deliver scalable features to users by assisting in the architectural design of new services and ensuring they operate effectively at scale.
• Engaging in a 24/7 on-call rotation shared among the broader SRE team, which involves participating in incident response, diagnosing issues, and following up on system outages or alerts across Wikimedia’s production infrastructure.
• Working with a global, cross-functional team in an asynchronous communication setting.
• Guiding peers in your areas of technical expertise and operational strengths.
• Willingness and ability to travel 1-2 times a year for in-person events and team gatherings.
• Over 6 years of experience in an SRE, Operations, or DevOps role as part of a team.
• Proficiency with shell and scripting languages used in an SRE context (Python, Go, Bash, Ruby; Python is the primary focus) and with configuration management tools (Puppet, Ansible; we use Puppet).
• Experience in designing and managing infrastructure security for a large array of diverse services.
• Involvement in technical responses during security incidents.
• Familiarity with package management on Linux systems, particularly Debian.
• Strong troubleshooting skills at the Linux system level.
• Proven track record of automating tasks and processes, identifying process gaps, and discovering automation opportunities.
• Excellent English language proficiency (both verbal and written) and capacity to work independently as an effective member of a globally distributed team across multiple time zones.
• Experience in leading and participating in incident response and post-incident review processes, aiming for root cause analysis and implementing preventive measures.
• Competitive salary.
• Health insurance.
• Flexible working hours.
• Opportunities for professional development.