• Developing and maintaining dashboards, alerts, and monitoring systems for inference workloads using Akamai's existing observability framework.

• Creating automation and tools in Python or Go to minimize operational burdens and enhance system reliability.

• Engaging in on-call rotations, addressing production incidents, and participating in post-incident analysis.

• Constructing and refining runbooks for inference-related operational tasks, integrating them into Akamai's current incident management workflows.

• Assisting with SLO tracking and reporting, and identifying patterns and opportunities for improvement.

• Helping maintain CI/CD pipelines, deployment safety checks, and rollback procedures.

• Collaborating with product engineering teams to resolve complex issues throughout the technology stack.

• Possess commercial experience in Site Reliability Engineering.

• Demonstrate proficiency in a programming language such as Python or Go, with experience developing automation solutions.

• Have experience with Linux systems administration and the ability to troubleshoot complex infrastructure issues.

• Be familiar with Kubernetes and containerization principles.

• Have experience using monitoring and observability tools such as Prometheus, Grafana, or equivalent.

• Have exposure to CI/CD pipelines and infrastructure-as-code tools (Terraform, SaltStack, or comparable).

• Exhibit a desire to learn and grow, with a genuine interest in AI infrastructure and distributed systems.

• Your health

• Your finances

• Your family

• Your time at work

• Your time pursuing other endeavors

Site Reliability Engineer II
