
Senior Site Reliability Engineer – Cloud and Networking
Posted Jun 12

This is a fully remote position, open to applicants in Poland.
• Taking ownership of the SRE lifecycle for NodeBalancer and Network Load Balancer — encompassing design reviews, pre-rollout readiness evaluations, production sign-off, and ongoing reliability management.
• Creating and implementing SLO/SLI frameworks that accurately represent customer experiences for L4 and L7 load balancing services, and initiating actions when error budgets are at risk.
• Developing and maintaining observability pipelines for NB/NLB infrastructure, which includes Prometheus metrics from load balancing components and system-level sources, as well as Grafana dashboards that facilitate quick incident triage.
• Leading technical incident responses for intricate NB/NLB failures — such as BGP/VIP issues, failover failures, data plane degradations, and configuration challenges — acting as the technical commander and driving root cause analysis along with preventive measures.
• Creating and automating safe deployment workflows for phased NB/NLB releases, which involve monitoring bake periods, managing feature flags, and validating GO/NO-GO decisions across global datacenter rollouts.
• Reviewing design documents and product requirement documents while providing actionable SRE insights related to operational risks, capacity implications, Day-2 concerns, and product strategy gaps.
• Building automation and tools using Python or Go that minimize operational toil and enhance team-wide operational capabilities.
• Mentoring SRE II engineers on the NB team by offering hands-on technical guidance, conducting code/config reviews, and elevating the team's SRE practices.
• Participating in a scheduled, daytime-only on-call rotation for NB/NLB production systems, leading technical incident response and driving resolution of complex failures in customer-facing load balancing infrastructure.
• Possess extensive experience in SRE, platform engineering, or infrastructure engineering, particularly with large-scale distributed systems.
• Exhibit deep expertise in Linux networking fundamentals — including routing, BGP, nftables/iptables, ARP, and VXLAN — and be adept at diagnosing issues at the packet level using tools like tcpdump and netstat.
• Have practical experience with L4/L7 load balancing technologies — both proxy-based and kernel-level load balancers — covering aspects such as configuration, health checking, high availability, and failure modes at scale.
• Demonstrate a history of defining SLO/SLI frameworks, constructing observability platforms from the ground up, and managing incident response processes at scale.
• Show expertise in Kubernetes and containerization at scale — including workload scheduling, networking (CNI, Services, ingress), resource management, and managing stateful or network-intensive workloads within a cluster environment.
• Be proficient in building automation and tools using Python or Go, with experience in infrastructure-as-code (SaltStack, Ansible, or Terraform) and a strong focus on deployment safety.
• Hold 4+ years of experience in SRE or infrastructure engineering, with a minimum of 2 years at cloud scale.
• Your health
• Your finances
• Your family
• Your time at work
• Your time pursuing other endeavors