This is a fully remote position, open to applicants in the United States.

📋 Description

• Take ownership of and enhance our multi-engine inference platform, accommodating various model types and serving needs.

• Develop and optimize production ML pipelines — transitioning models from experimentation to dependable, high-throughput serving.

• Establish and execute strategies for model versioning, rollout, rollback, and lifecycle management to ensure reproducibility and operational dependability.

• Set and uphold serving-layer SLAs, covering latency, availability, GPU utilization, Time-to-First-Token (TTFT), and Inter-Token Latency (ITL).

• Create observability, monitoring, alerting, and operational tools for production inference systems.

• Implement software engineering best practices, including testing, CI/CD integration, and reproducibility across ML workflows.

• Enhance inference performance through effective resource utilization, hardware-aware serving strategies, and cost-efficient infrastructure design.

• Ensure that ML serving systems are secure, scalable, and operationally resilient.

• Collaborate with ML, Data, Product, and DevOps teams to transform concepts into production systems, influencing technical decisions related to serving and scaling.

⛳️ Requirements

• Bachelor’s or Master’s degree in Computer Science, Data Science, Engineering, or a related field, or equivalent practical experience.

• 5–8+ years of experience in Software Engineering, ML Engineering, Platform Engineering, or Infrastructure Engineering, with direct responsibility for production ML serving systems.

• Practical experience managing an LLM serving engine (vLLM, TGI, TensorRT-LLM, or SGLang) in production under real load — not merely managed or hosted endpoints.

• Proficiency in Python, strong software engineering fundamentals, and extensive knowledge of systems and infrastructure.

• Familiarity with cloud platforms such as AWS, GCP, or Azure, and experience with ML lifecycle tools, experimentation platforms, and model registries.

• Strong understanding of inference performance — including continuous batching, KV-cache and GPU-memory behavior, quantization, and CPU versus GPU bottlenecks — with a tendency to profile before optimizing.

• Experience managing heterogeneous workloads, including LLMs, embedding models, and extraction models, each with unique latency, throughput, and scaling demands.

• Proven ability to balance latency, throughput, reliability, and infrastructure costs while managing production-scale ML systems.

• Experience in high-growth startup settings and ability to thrive in rapidly changing technical environments.

🏝️ Benefits

• Health insurance

• Flexible work arrangements

• Professional development opportunities

Senior Machine Learning Engineer – Inference Platform
