Remotery

Senior Principal Site Reliability Engineer

Posted May 22

This is a fully remote position, open to applicants in Poland.

📋 Description

• Establishing the reliability architecture for Akamai's AI compute and platform services, including SLO frameworks, fault tolerance patterns, and capacity planning models.

• Actively developing automation and tooling that minimizes operational toil and amplifies the SRE team's effectiveness.

• Crafting an observability strategy that uses Akamai's existing platform to build the telemetry, dashboards, alerts, and GPU-specific monitoring that AI workloads require.

• Designing deployment safety practices that include progressive rollouts, canary analysis, rollback automation, and change safety processes.

• Shaping product engineering architecture and design choices, integrating reliability into the development lifecycle at the system level.

• Guiding and uplifting fellow SREs through design reviews, code reviews, and hands-on troubleshooting, setting the technical standards for the team.


⛳️ Requirements

• Possess extensive experience in SRE, platform engineering, and/or infrastructure engineering, with proven impact at a principal or staff level.

• Demonstrate deep Kubernetes proficiency, including autoscaling, resource scheduling, and container orchestration for compute-intensive workloads.

• Demonstrate programming expertise in Python and/or Go, with experience building production-grade automation, tooling, and platform services.

• Influence cross-team technical decisions, mentor engineers, raise technical standards, and collaborate effectively with product engineering teams.

• Have experience with AI/ML infrastructure, model deployment, or GPU workloads.

• Design reliability into innovative platforms at the system level while building influence with product engineering teams through technical acumen.


🏝️ Benefits

• Your health

• Your finances

• Your family

• Your time at work

• Your time pursuing other endeavors
