
Senior Principal Site Reliability Engineer
Posted May 22

This is a fully remote position, open to applicants in Poland.
• Establishing the reliability architecture for Akamai's AI compute and platform services, which encompasses SLO frameworks, fault tolerance patterns, and capacity planning models.
• Actively developing automation and tooling that minimizes operational toil and amplifies the SRE team's effectiveness.
• Crafting an observability strategy that uses Akamai's existing platform to build the telemetry, dashboards, alerts, and GPU-specific monitoring that AI workloads require.
• Designing deployment safety practices that include progressive rollouts, canary analysis, rollback automation, and change safety processes.
• Shaping product engineering architecture and design choices, integrating reliability into the development lifecycle at the system level.
• Guiding and uplifting fellow SREs through design reviews, code assessments, and hands-on troubleshooting, thereby establishing the technical standards for the team.
• Possess extensive experience in SRE, platform engineering, and/or infrastructure engineering, with proven impact at a principal or staff level.
• Exhibit comprehensive Kubernetes proficiency, managing autoscaling, resource scheduling, and container orchestration to effectively handle compute-intensive workloads.
• Demonstrate programming expertise in Python and/or Go, with experience building production-grade automation, tooling, and platform services.
• Influence cross-team technical decisions, mentor engineers, raise technical standards, and collaborate effectively with product engineering teams.
• Bring experience in AI/ML infrastructure, model deployment, or GPU workloads.
• Design reliability into innovative platforms at the system level while building influence with product engineering teams through technical acumen.
• Your health
• Your finances
• Your family
• Your time at work
• Your time pursuing other endeavors