This is a fully remote position, open to applicants in Italy.

📋 Description

• Engage in hands-on Reliability & System Engineering by designing, constructing, and managing reliable and scalable systems, defining and monitoring SLOs/SLIs, directly working on production infrastructure, and collaborating closely with software engineers to enhance system design and reliability.

• Focus on Automation, Operations & Incident Response by developing automation for infrastructure and operational workflows to minimize toil and reduce MTTR. Participate in and lead incident responses, as well as conduct blameless post-incident reviews with clear follow-ups implemented in code and tooling.

• Work on Performance, Capacity & Security by analyzing and optimizing system performance and cost, providing data, insights, and recommendations for capacity planning, and supporting security best practices through direct involvement in vulnerability remediation and threat mitigation.

⛳️ Requirements

• Possess hands-on experience with SRE practices in production environments, with strong expertise in AWS, Kubernetes, networking, DNS, and Infrastructure as Code (Pulumi preferred; Terraform knowledge is a plus).

• Exhibit a strong foundation in Automation & Software Engineering, including proficiency in Python and in-depth knowledge of the Python ecosystem (testing, debugging, packaging), along with a consistent focus on writing clean, well-structured, and maintainable code.

• Demonstrate skills in Reliability, Data & Operations by engaging stakeholders, mentoring others, leading incident responses and root cause analyses (RCAs), enhancing system reliability, and proposing solutions while sharing insights.

• Nice-to-Have: Experience in highly regulated industries (such as Insurance, Banking, Healthcare), managing sensitive data, and supporting secure networking configurations, along with familiarity with security technologies like Cloudflare.

• Have a solid understanding of microservices architectures, including their principles and trade-offs.

• Have hands-on experience with Datadog for platform and application monitoring and performance optimization, along with a strong foundation in database structures.

🏝️ Benefits

• Work Your Way: Enjoy full flexibility – work from home, the office, or a combination of both. Additionally, work from anywhere for up to 30 days each year.

• Grow with us: Access learning resources, mentorship, and a personalized growth plan tailored to your development.

• Thrive and perform: Benefit from private healthcare, gym discounts, wellbeing programs, and mental health support.

Senior Site Reliability Engineer
