
Senior AI Infrastructure & Platform Operations Engineer
Posted 6 days ago

This is a fully remote position, open to applicants in Poland.
• Oversee the investigation and resolution of intricate incidents related to infrastructure, networking, and platforms.
• Serve as a senior escalation contact for operational teams during critical events that impact services.
• Provide support for large-scale NVIDIA GPU infrastructure and high-performance networking environments.
• Diagnose complex issues involving Linux, Kubernetes, networking, storage, and hardware.
• Assess platform performance, capacity, stability, and reliability trends to proactively pinpoint risks.
• Lead root cause analysis initiatives and implement long-term corrective measures.
• Collaborate with engineering teams, hardware vendors, and datacenter staff to tackle complex technical challenges.
• Engage in major incident management and service restoration efforts.
• Offer technical leadership for Kubernetes platform operations and associated infrastructure services.
• Enhance platform reliability, observability, monitoring, and operational processes.
• Identify and implement automation opportunities for repetitive operational tasks to boost efficiency.
• Contribute to operational readiness reviews, infrastructure modifications, upgrades, and service implementations.
• Facilitate the adoption and operation of AI-powered infrastructure services and operational capabilities via k0rdent AI.
• Examine emerging technologies and operational methodologies to enhance service delivery and platform resilience.
• Mentor and assist AI Infrastructure & Platform Operations Engineers.
• Disseminate technical knowledge through documentation, training sessions, and operational assessments.
• Create and maintain operational standards, runbooks, troubleshooting manuals, and best practices.
• Help establish operational procedures, escalation protocols, and service reliability benchmarks.
• Act as a trusted technical advisor during operational planning and service enhancement projects.
• 7+ years of experience in infrastructure operations, platform operations, site reliability engineering, network operations, cloud operations, datacenter operations, or similar technical roles.
• Advanced Linux administration and troubleshooting capabilities.
• Extensive networking knowledge, including the ability to diagnose complex performance, connectivity, and reliability challenges.
• Significant experience operating Kubernetes in production settings.
• Background in supporting large-scale production infrastructure and distributed systems.
• Demonstrated experience leading technical investigations and managing complex incidents.
• Experience conducting root cause analysis and promoting long-term operational enhancements.
• Strong grasp of observability, monitoring, and service reliability practices.
• Excellent troubleshooting and analytical abilities across various infrastructure domains.
• Exceptional communication, collaboration, and stakeholder management skills.
• Operate within some of the most advanced AI infrastructure environments currently in production.
• Work with cutting-edge NVIDIA GPU technologies, Kubernetes platforms, and high-performance networking environments.
• Help shape operational standards and reliability practices for next-generation AI infrastructure services.
• Influence the integration of AI-powered operational capabilities through k0rdent AI.
• Collaborate with a team of highly skilled engineers addressing complex infrastructure and platform challenges at scale.
• Join a growing organization heavily investing in AI infrastructure, platform services, and operational innovation.
TechBiz Global