This is a fully remote position, open to applicants in Poland.

📋 Description

• Taking ownership of the SRE lifecycle for NodeBalancer and Network Load Balancer — encompassing design reviews, pre-rollout readiness evaluations, production sign-off, and ongoing reliability management.

• Creating and implementing SLO/SLI frameworks that accurately represent customer experiences for L4 and L7 load balancing services, and initiating actions when error budgets are at risk.

• Developing and maintaining observability pipelines for NB/NLB infrastructure, which includes Prometheus metrics from load balancing components and system-level sources, as well as Grafana dashboards that facilitate quick incident triage.

• Leading technical incident responses for intricate NB/NLB failures — such as BGP/VIP issues, failover failures, data plane degradations, and configuration challenges — acting as the technical commander and driving root cause analysis along with preventive measures.

• Creating and automating safe deployment workflows for phased NB/NLB releases, which involve monitoring bake periods, managing feature flags, and validating GO/NO-GO decisions across global datacenter rollouts.

• Reviewing design documents and product requirement documents while providing actionable SRE insights related to operational risks, capacity implications, Day-2 concerns, and product strategy gaps.

• Building automation and tools using Python or Go that minimize operational toil and enhance team-wide operational capabilities.

• Mentoring SRE II engineers on the NB team by offering hands-on technical guidance, conducting code/config reviews, and elevating the team's SRE practices.

• Participating in a scheduled, daytime-only on-call rotation for NB/NLB production systems, leading technical incident response and driving resolution of complex failures in customer-facing load balancing infrastructure.

⛳️ Requirements

• Possess extensive experience in SRE, platform engineering, or infrastructure engineering, particularly with large-scale distributed systems.

• Exhibit deep expertise in Linux networking fundamentals — including routing, BGP, nftables/iptables, ARP, and VXLAN — and be adept at diagnosing issues at the packet level using tools like tcpdump and netstat.

• Have practical experience with L4/L7 load balancing technologies — both proxy-based and kernel-level load balancers — covering aspects such as configuration, health checking, high availability, and failure modes at scale.

• Demonstrate a history of defining SLO/SLI frameworks, constructing observability platforms from the ground up, and managing incident response processes at scale.

• Show expertise in Kubernetes and containerization at scale — including workload scheduling, networking (CNI, Services, ingress), resource management, and managing stateful or network-intensive workloads within a cluster environment.

• Be proficient in building automation and tools using Python or Go, with experience in infrastructure-as-code (SaltStack, Ansible, or Terraform) and a strong focus on deployment safety.

• Hold 4+ years of experience in SRE or infrastructure engineering, with a minimum of 2 years at cloud scale.

🏝️ Benefits

• Your health

• Your finances

• Your family

• Your time at work

• Your time pursuing other endeavors

Senior Site Reliability Engineer – Cloud and Networking
