
Site Reliability Engineer II
Posted 1 hour ago

Responsibilities:
• Developing and maintaining dashboards, alerts, and monitoring systems for inference workloads utilizing Akamai's existing observability framework.
• Creating automation and tools in Python or Go to minimize operational burdens and enhance system reliability.
• Engaging in on-call rotations, addressing production incidents, and participating in post-incident analysis.
• Constructing and refining runbooks for inference-related operational tasks, integrating them into Akamai's current incident management workflows.
• Assisting in SLO tracking and reporting, identifying patterns and opportunities for improvement.
• Aiding in the maintenance of CI/CD pipelines, ensuring deployment safety checks, and managing rollback procedures.
• Collaborating with product engineering teams to resolve complex issues throughout the technology stack.
Requirements:
• Possess commercial experience in Site Reliability Engineering.
• Demonstrate proficiency in a programming language like Python or Go, with experience in developing automation solutions.
• Have experience with Linux systems administration and the capability to troubleshoot intricate infrastructure challenges.
• Be familiar with Kubernetes and containerization principles.
• Have experience using monitoring and observability tools such as Prometheus, Grafana, or equivalent.
• Have exposure to CI/CD pipelines and infrastructure-as-code tools (Terraform, SaltStack, or comparable).
• Exhibit a desire to learn and grow, with a genuine interest in AI infrastructure and distributed systems.
Benefits cover:
• Your health
• Your finances
• Your family
• Your time at work
• Your time pursuing other endeavors