This is a fully remote position, open to applicants in France.

📋 Description

• Manage and enhance our self-hosted inference infrastructure.

• Operate the inference serving layer on our proprietary GPU hardware: select and tune the serving stack (vLLM, SGLang, TensorRT-LLM) for high throughput and low latency.

• Optimize extensively: employ tensor parallelism, quantization (FP8, AWQ, GPTQ), KV-cache and prefix caching, continuous batching, speculative decoding, and concurrency tuning.

• Serve multiple models and features using shared hardware: implement multi-LoRA, routing, and request scheduling to balance internal workloads with latency-sensitive product traffic.

• Improve the efficiency of our AI workloads: reduce latency, increase throughput, and maximize GPU utilization.

• Build visibility: instrument performance and usage metrics across our AI surfaces so the team has clear data on how the system behaves.

• Surface technical trade-offs (performance, latency, efficiency) so decision-makers can make informed choices.

• Deliver the in-app agent layer that helps families coordinate: proactive nudges, intelligent suggestions, and agents that summarize, draft, schedule, and act on behalf of busy parents.

• Build the foundational elements: tools, memory management, orchestration, guardrails, and evaluation harnesses, integrated with production APIs in collaboration with our architecture team.

• Pair with feature owners to prototype and test ideas quickly, including building a vibe-coded UI when that gets an idea in front of real customers sooner. Ship rough prototypes, learn quickly, and refine what works.

⛳️ Requirements

• More than 5 years of experience delivering production software, including substantial applied AI or ML work.

• Proven experience in running and optimizing self-hosted LLMs on dedicated multi-GPU hardware: familiarity with a serving stack (vLLM, SGLang, or TensorRT-LLM) and associated optimizations (tensor parallelism, quantization, batching, KV cache).

• A demonstrated history of improving inference performance and efficiency (latency, throughput, GPU utilization).

• Strong skills in Python and engineering fundamentals, with the versatility to quickly create a UI, and a genuine interest in developing app-layer features rather than just infrastructure.

• Practical experience with agent frameworks (Claude Agent SDK, LangGraph, or similar), LLM APIs, embeddings, and RAG.

• Familiarity with AWS and the DevOps responsibilities associated with this role: Docker, CI/CD, monitoring, and observability.

• Experience building internal tools or platforms that others rely on. Knowledge of Slack apps, MCP, or agent orchestration at team scale is a bonus.

🏝️ Benefits

• Medical: In Tandem covers 100% of the premium for employees and 99% for all additional family members.

• 401k: Up to a 4% match with immediate vesting.

• Paid leave for all new parents.

• Learning & Development stipend for employees.

• Paid Time Off: 11 Holidays + Winter Break (3 Days) + Volunteer Time Off (1 Day) + Floating Holiday (1 Day).

• Personal Time Off: 15 days for 0-1 year of employment, 20 days for 1-3 years of employment.

• Supportive and flexible working environment – work from anywhere!

AI Engineer
