
Senior Site Reliability Engineer
Posted May 6

This is a fully remote position, open to applicants in Serbia.
• Take ownership of and enhance on-call procedures, incident response documentation, and post-mortem practices.
• Establish, monitor, and manage Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets for essential services.
• Facilitate blameless post-mortems and promote systematic improvements in reliability.
• Respond to production incidents and lead their cross-functional resolution.
• Design, develop, and maintain scalable AWS infrastructure using Infrastructure as Code (IaC) tools such as Terraform or Pulumi.
• Oversee Kubernetes clusters and manage containerized applications in a production environment.
• Create and maintain Continuous Integration/Continuous Deployment (CI/CD) pipelines to enhance deployment efficiency and reliability.
• Assess and adopt tools to boost developer productivity and system stability.
• Implement monitoring, alerting, and distributed tracing solutions (Prometheus, Grafana, Datadog, Jaeger).
• Detect and address performance bottlenecks across services, networks, and databases.
• Develop dashboards and runbooks for self-service operational insights.
• Collaborate with engineering teams to integrate reliability practices such as load testing, capacity planning, and chaos engineering.
• Conduct architecture reviews emphasizing reliability and operability.
• Minimum of 5 years of experience in Site Reliability Engineering (SRE), DevOps, or infrastructure engineering.
• Extensive knowledge of AWS and cloud-native architectures.
• Significant experience with Kubernetes and large-scale container orchestration.
• Practical experience with Infrastructure as Code tools (Terraform or Pulumi).
• Proficiency in a programming or scripting language such as Python, Go, or Bash.
• Familiarity with observability tools (Prometheus, Grafana, Datadog, or similar).
• Strong grasp of SLOs, SLIs, and error budgets.
• Experience with service mesh technologies (Istio, Linkerd).
• Knowledge of chaos engineering tools (Chaos Monkey, Gremlin, LitmusChaos).
• Background in Oracle database reliability and administration.
• Contributions to open-source infrastructure projects.
• Experience in a high-growth SaaS or product-led setting.
• Exceptional English communication skills, both written and verbal.
• A pivotal role within a developing SaaS company that prioritizes personal development, accountability, and teamwork.
• An environment that fosters open collaboration and effective problem-solving.
• Fully remote work opportunity.
• Competitive salary.