
AI Engineer
Posted Jun 21

This is a fully remote position, open to applicants in France.
• Manage and enhance our self-hosted inference infrastructure.
• Operate the inference serving layer on our proprietary GPU hardware: select and tune the serving stack (vLLM, SGLang, TensorRT-LLM) to achieve high throughput and low latency.
• Optimize extensively: employ tensor parallelism, quantization (FP8, AWQ, GPTQ), KV-cache and prefix caching, continuous batching, speculative decoding, and concurrency tuning.
• Serve multiple models and features using shared hardware: implement multi-LoRA, routing, and request scheduling to balance internal workloads with latency-sensitive product traffic.
• Improve the efficiency of our AI workloads: reduce latency, increase throughput, and maximize GPU utilization.
• Build visibility: instrument performance and usage metrics across our AI surfaces so there is clear data on how the system performs.
• Surface technical trade-offs (performance, latency, efficiency) so decision-makers can make informed choices.
• Deliver the in-app agent layer that helps families coordinate: proactive nudges, intelligent suggestions, and agents that summarize, draft, schedule, and act on behalf of busy parents.
• Build the foundational elements (tools, memory management, orchestration, guardrails, and evaluation harnesses) and integrate them with production APIs in collaboration with our architecture team.
• Pair with feature owners to quickly prototype and test ideas, including building a vibe-coded UI when that gets an idea in front of real customers faster. Deploy rough prototypes, learn quickly, and refine what works.
• Over 5 years of experience in delivering production software, including substantial applied AI or ML work.
• Proven experience in running and optimizing self-hosted LLMs on dedicated multi-GPU hardware: familiarity with a serving stack (vLLM, SGLang, or TensorRT-LLM) and associated optimizations (tensor parallelism, quantization, batching, KV cache).
• A demonstrated history of improving inference performance and efficiency (latency, throughput, GPU utilization).
• Strong skills in Python and engineering fundamentals, with the versatility to quickly create a UI, and a genuine interest in developing app-layer features rather than just infrastructure.
• Practical experience with agent frameworks (Claude Agent SDK, LangGraph, or similar), LLM APIs, embeddings, and RAG.
• Familiarity with AWS and the DevOps responsibilities associated with this role: Docker, CI/CD, monitoring, and observability.
• Experience building internal tools or platforms that others rely on. Bonus: knowledge of Slack apps, MCP, or agent orchestration at team scale.
• Medical: In Tandem covers 100% of the premium for employees and 99% for all additional family members.
• 401k: Up to a 4% match with immediate vesting.
• Paid leave for all new parents.
• Learning & Development stipend for employees.
• Paid Time Off: 11 Holidays + Winter Break (3 Days) + Volunteer Time Off (1 Day) + Floating Holiday (1 Day).
• Personal Time Off: 15 days for 0-1 year of employment, 20 days for 1-3 years of employment.
• Supportive and flexible working environment – work from anywhere!