This is a fully remote position, open to applicants in Ukraine.

📋 Description

• Build and operate production-grade model serving infrastructure using frameworks such as vLLM, TGI, or Triton.

• Create and implement reliable deployment pipelines featuring blue/green and canary rollout strategies for machine learning models.

• Build and maintain auto-scaling systems, multi-model serving architectures, and intelligent request routing layers.

• Optimize GPU utilization, memory efficiency, network throughput, and model artifact storage performance.

• Design observability systems for monitoring inference latency, throughput, GPU usage, cost metrics, and overall system health.

• Oversee model registries and CI/CD pipelines to facilitate automated and reproducible model deployments.

• Manage the complete lifecycle of machine learning systems from development to production, including operational support and on-call duties.

• Establish engineering best practices and contribute to the scalability of the platform within a dynamic startup setting.

⛳️ Requirements

• At least 4 years of experience in MLOps, Platform Engineering, SRE, or comparable infrastructure roles focused on machine learning systems.

• Hands-on experience with model serving frameworks such as vLLM, TGI, or Triton.

• Strong expertise in container orchestration and in running GPU-based workloads in production.

• Familiarity with MLOps tools, including model registries, experiment tracking, and automated deployment pipelines.

• Proficiency in Python and infrastructure-as-code tools (e.g., Terraform or Helm).

• Solid understanding of distributed systems, performance optimization, and production reliability engineering.

• Ability to use AI coding assistants effectively to speed up development and debugging.

• Ownership mindset and the ability to work independently in a remote-first setting.

🏝️ Benefits

• Own critical infrastructure supporting a rapidly expanding AI-native cloud platform.

• Build foundational ML inference systems from the ground up in a high-growth, well-funded startup environment.

• Work at the intersection of distributed systems, GPU computing, and sustainable cloud architecture.

• Gain deep expertise in next-generation AI infrastructure and large-scale model serving systems.

• Influence key engineering decisions and establish best practices that will scale alongside the company.

ML Ops Engineer

