This is a fully remote position, open to applicants in Poland.

📋 Description

• Oversee the investigation and resolution of intricate incidents related to infrastructure, networking, and platforms.

• Serve as a senior escalation contact for operational teams during critical events that impact services.

• Support large-scale NVIDIA GPU infrastructure and high-performance networking environments.

• Diagnose complex issues involving Linux, Kubernetes, networking, storage, and hardware.

• Assess platform performance, capacity, stability, and reliability trends to proactively pinpoint risks.

• Lead root cause analysis initiatives and implement long-term corrective measures.

• Collaborate with engineering teams, hardware vendors, and datacenter staff to tackle complex technical challenges.

• Engage in major incident management and service restoration efforts.

• Offer technical leadership for Kubernetes platform operations and associated infrastructure services.

• Enhance platform reliability, observability, monitoring, and operational processes.

• Identify and implement automation opportunities for repetitive operational tasks to boost efficiency.

• Contribute to operational readiness reviews, infrastructure modifications, upgrades, and service implementations.

• Support the adoption and operation of AI-powered infrastructure services and operational capabilities via k0rdent AI.

• Evaluate emerging technologies and operational methodologies to improve service delivery and platform resilience.

• Mentor and assist AI Infrastructure & Platform Operations Engineers.

• Disseminate technical knowledge through documentation, training sessions, and operational assessments.

• Create and maintain operational standards, runbooks, troubleshooting manuals, and best practices.

• Help establish operational procedures, escalation protocols, and service reliability benchmarks.

• Act as a trusted technical advisor during operational planning and service enhancement projects.

⛳️ Requirements

• 7+ years of experience in infrastructure operations, platform operations, site reliability engineering, network operations, cloud operations, datacenter operations, or similar technical positions.

• Advanced Linux administration and troubleshooting capabilities.

• Extensive networking knowledge, including the ability to diagnose complex performance, connectivity, and reliability challenges.

• Significant experience operating Kubernetes in production environments.

• Background in supporting large-scale production infrastructure and distributed systems.

• Demonstrated experience leading technical investigations and managing complex incidents.

• Experience conducting root cause analysis and promoting long-term operational enhancements.

• Strong grasp of observability, monitoring, and service reliability practices.

• Excellent troubleshooting and analytical abilities across various infrastructure domains.

• Exceptional communication, collaboration, and stakeholder management skills.

🏝️ Benefits

• Work in some of the most advanced AI infrastructure environments currently in production.

• Work with cutting-edge NVIDIA GPU technologies, Kubernetes platforms, and high-performance networking environments.

• Help shape operational standards and reliability practices for next-generation AI infrastructure services.

• Influence the integration of AI-powered operational capabilities through k0rdent AI.

• Collaborate with a team of highly skilled engineers addressing complex infrastructure and platform challenges at scale.

• Join a growing organization investing heavily in AI infrastructure, platform services, and operational innovation.

Senior AI Infrastructure, Platform Operations Engineer
