• Implement the Observability Ladder and establish SLAs, SLIs, and SLOs.

• Create deployment tools that enable teams to automate rollbacks when error budgets are exhausted.

• Foster a blameless post-mortem culture centered on actionable insights and measurable metrics.

• Continuously enhance alerting and on-call frameworks to minimize alert fatigue.

• Develop pre-deployment and post-deployment verification systems.

• Lead the initiative to manage the reliability suite as Infrastructure as Code (IaC) using Terraform.

• Bachelor's degree in Computer Science, Information Technology, or a related discipline.

• Over 5 years of experience in Software Engineering, Site Reliability Engineering (SRE), DevOps, or Platform Engineering.

• Strong coding skills: proficient in Python (or a similar programming language).

• Practical experience with AWS and a robust understanding of Infrastructure as Code (Terraform or CloudFormation).

• Proven experience with monitoring tools such as Datadog, Prometheus, or the ELK stack.

• Solid understanding of SRE principles, including Golden Signals and error budget calculations.

• Demonstrated ability to define and enforce reliability standards across multiple teams.

• Flexibility and the option to work remotely.

• Work-life balance: you are not expected to work weekends or outside regular hours.

• A progressive remote company that offers virtual social platforms for employee engagement.

• A monthly allowance for working from home.

• A MacBook or Windows laptop to enable you to perform at your best.

• Support for your professional development, along with recognition of your achievements and career advancement.

Senior Site Reliability Engineer
