Remotery

Senior Site Reliability Engineer

Posted May 6

This is a fully remote position, open to applicants in Serbia.

📋 Description

• Take ownership of and enhance on-call procedures, incident response documentation, and post-mortem practices.

• Establish, monitor, and manage Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets for essential services.

• Facilitate blameless post-mortems and promote systematic improvements in reliability.

• Respond to production incidents and lead cross-functional resolution efforts.

• Design, build, and maintain scalable AWS infrastructure using Infrastructure as Code (IaC) tools such as Terraform and Pulumi.

• Oversee Kubernetes clusters and manage containerized applications in a production environment.

• Create and maintain Continuous Integration/Continuous Deployment (CI/CD) pipelines to enhance deployment efficiency and reliability.

• Assess and adopt tools to boost developer productivity and system stability.

• Implement monitoring, alerting, and distributed tracing solutions (Prometheus, Grafana, Datadog, Jaeger).

• Detect and address performance bottlenecks across services, networks, and databases.

• Develop dashboards and runbooks for self-service operational insights.

• Collaborate with engineering teams to integrate reliability practices such as load testing, capacity planning, and chaos engineering.

• Conduct architecture reviews emphasizing reliability and operability.


⛳️ Requirements

• Minimum of 5 years of experience in Site Reliability Engineering (SRE), DevOps, or infrastructure engineering.

• Extensive knowledge of AWS and cloud-native architectures.

• Significant experience with Kubernetes and large-scale container orchestration.

• Practical experience with Infrastructure as Code tools (Terraform or Pulumi).

• Proficiency in programming or scripting languages such as Python, Go, or Bash.

• Familiarity with observability tools (Prometheus, Grafana, Datadog, or similar).

• Strong grasp of SLOs, SLIs, and error budgets.

• Experience with service mesh technologies (Istio, Linkerd).

• Knowledge of chaos engineering tools (Chaos Monkey, Gremlin, LitmusChaos).

• Background in Oracle database reliability and administration.

• Contributions to open-source infrastructure projects.

• Experience in a high-growth SaaS or product-led setting.

• Exceptional English communication skills, both written and verbal.


🏝️ Benefits

• A pivotal role within a developing SaaS company that prioritizes personal development, accountability, and teamwork.

• An environment that fosters open collaboration and effective problem-solving.

• Fully remote work opportunity.

• Competitive salary.
