Remotery

Senior Site Reliability Engineer – Cloud and Networking

Posted Jun 12

This is a fully remote position, open to applicants in Poland.

📋 Description

• Taking ownership of the SRE lifecycle for NodeBalancer and Network Load Balancer — encompassing design reviews, pre-rollout readiness evaluations, production sign-off, and ongoing reliability management.

• Creating and implementing SLO/SLI frameworks that accurately represent customer experiences for L4 and L7 load balancing services, and initiating actions when error budgets are at risk.

• Developing and maintaining observability pipelines for NB/NLB infrastructure, which includes Prometheus metrics from load balancing components and system-level sources, as well as Grafana dashboards that facilitate quick incident triage.

• Leading technical incident responses for intricate NB/NLB failures — such as BGP/VIP issues, failover failures, data plane degradations, and configuration challenges — acting as the technical commander and driving root cause analysis along with preventive measures.

• Creating and automating safe deployment workflows for phased NB/NLB releases, which involve monitoring bake periods, managing feature flags, and validating GO/NO-GO decisions across global datacenter rollouts.

• Reviewing design documents and product requirement documents while providing actionable SRE insights related to operational risks, capacity implications, Day-2 concerns, and product strategy gaps.

• Building automation and tools using Python or Go that minimize operational toil and enhance team-wide operational capabilities.

• Mentoring SRE II engineers on the NB team by offering hands-on technical guidance, conducting code/config reviews, and elevating the team's SRE practices.

• Participating in a scheduled, daytime-only on-call rotation for NB/NLB production systems, leading technical incident response and driving resolution of complex failures in customer-facing load balancing infrastructure.


⛳️ Requirements

• Possess extensive experience in SRE, platform engineering, or infrastructure engineering, particularly with large-scale distributed systems.

• Exhibit deep expertise in Linux networking fundamentals — including routing, BGP, nftables/iptables, ARP, and VXLAN — and be adept at diagnosing issues at the packet level using tools like tcpdump and netstat.

• Have practical experience with L4/L7 load balancing technologies — both proxy-based and kernel-level load balancers — covering aspects such as configuration, health checking, high availability, and failure modes at scale.

• Demonstrate a history of defining SLO/SLI frameworks, constructing observability platforms from the ground up, and managing incident response processes at scale.

• Show expertise in Kubernetes and containerization at scale — including workload scheduling, networking (CNI, Services, ingress), resource management, and managing stateful or network-intensive workloads within a cluster environment.

• Be proficient in building automation and tools using Python or Go, with experience in infrastructure-as-code (SaltStack, Ansible, or Terraform) and a strong focus on deployment safety.

• Hold 4+ years of experience in SRE or infrastructure engineering, with a minimum of 2 years at cloud scale.


🏝️ Benefits

• Your health

• Your finances

• Your family

• Your time at work

• Your time pursuing other endeavors
