This is a fully remote position, open to applicants in Serbia.

• Take ownership of and enhance on-call procedures, incident response documentation, and post-mortem practices.

• Establish, monitor, and manage Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets for essential services.

• Facilitate blameless post-mortems and promote systematic improvements in reliability.

• Respond to production incidents and lead cross-functional resolution efforts.

• Design, build, and maintain scalable AWS infrastructure using Infrastructure as Code (IaC) tools such as Terraform or Pulumi.

• Oversee Kubernetes clusters and manage containerized applications in a production environment.

• Create and maintain Continuous Integration/Continuous Deployment (CI/CD) pipelines to enhance deployment efficiency and reliability.

• Assess and adopt tools to boost developer productivity and system stability.

• Implement monitoring, alerting, and distributed tracing solutions (Prometheus, Grafana, Datadog, Jaeger).

• Detect and address performance bottlenecks across services, networks, and databases.

• Develop dashboards and runbooks for self-service operational insights.

• Collaborate with engineering teams to integrate reliability practices such as load testing, capacity planning, and chaos engineering.

• Conduct architecture reviews emphasizing reliability and operability.

• Minimum of 5 years of experience in Site Reliability Engineering (SRE), DevOps, or infrastructure engineering.

• Extensive knowledge of AWS and cloud-native architectures.

• Significant experience with Kubernetes and large-scale container orchestration.

• Practical experience with Infrastructure as Code tools (Terraform or Pulumi).

• Proficiency in at least one programming or scripting language, such as Python, Go, or Bash.

• Familiarity with observability tools (Prometheus, Grafana, Datadog, or similar).

• Strong grasp of SLOs, SLIs, and error budgets.

• Experience with service mesh technologies (Istio, Linkerd).

• Knowledge of chaos engineering tools (Chaos Monkey, Gremlin, LitmusChaos).

• Background in Oracle database reliability and administration.

• Contributions to open-source infrastructure projects.

• Experience in a high-growth SaaS or product-led setting.

• Exceptional English communication skills, both written and verbal.

• A pivotal role within a growing SaaS company that prioritizes personal development, accountability, and teamwork.

• An environment that fosters open collaboration and effective problem-solving.

• Fully remote work opportunity.

• Competitive salary.

Senior Site Reliability Engineer
