This is a fully remote position, open to applicants in Minnesota.

📋 Description

• Manage and enhance our self-hosted inference infrastructure.

• Operate the inference serving layer on our dedicated GPU hardware: select and tune the serving stack (vLLM, SGLang, TensorRT-LLM) to achieve high throughput and low latency.

• Optimize rigorously: implement tensor parallelism, quantization (FP8, AWQ, GPTQ), KV-cache and prefix caching, continuous batching, and concurrency tuning.

• Serve various models and features from shared hardware: multi-LoRA, routing, and request scheduling to balance internal workloads with latency-sensitive product traffic.

• Maintain the speed, efficiency, and observability of our AI systems.

• Enhance the efficiency of our AI workloads: reduce latency, increase throughput, and raise GPU utilization.

• Establish visibility: instrument performance and usage metrics across our AI platforms to provide clear insights into operational performance.

• Highlight technical trade-offs (performance, latency, efficiency) to equip decision-makers with the necessary information.

• Develop AI features and proactive agents.

• Deliver the in-app agent layer designed to assist families with coordination: offering proactive nudges, smart suggestions, and agents that summarize, draft, schedule, and act on behalf of busy parents.

• Create the underlying infrastructure: tools, memory management, orchestration, guardrails, and evaluation harnesses, seamlessly integrated with production APIs in collaboration with our architecture team.

• Collaborate closely with feature owners, quickly building whatever is necessary to test ideas, including a vibe-coded UI when it provides the fastest route to customer feedback. Embrace rapid iteration: ship rough drafts, learn swiftly, and refine what proves effective.

⛳️ Requirements

• 5+ years of experience in delivering production software, with significant applied AI or ML expertise.

• Proven experience in running and optimizing self-hosted LLMs on dedicated multi-GPU hardware: familiarity with a serving stack (vLLM, SGLang, or TensorRT-LLM) and associated optimization techniques (tensor parallelism, quantization, batching, KV cache).

• A solid history of enhancing inference performance and efficiency (latency, throughput, GPU utilization).

• Strong proficiency in Python and engineering principles, with the capability to quickly develop a UI and a genuine interest in app-layer features, not just infrastructure.

• Practical experience with agent frameworks (Claude Agent SDK, LangGraph, or equivalents), LLM APIs, embeddings, and RAG.

• Familiarity with AWS and the associated DevOps responsibilities of this role: Docker, CI/CD, monitoring, and observability.

• Experience building internal tools or platforms that others rely on. Bonus: experience with Slack apps, MCP, or agent orchestration at team scale.

🏝️ Benefits

• Medical: In Tandem covers 100% of the premium for employees and 99% for additional family members.

• 401k: Offers up to a 4% match with immediate vesting.

• Paid leave for all new parents.

• Learning & Development stipend available for employees.

• Paid Time Off: 11 Holidays + Winter Break (3 Days) + Volunteer Time Off (1 Day) + Floating Holiday (1 Day).

• Personal Time Off: 15 days for employees with 0-1 years of service, increasing to 20 days for those with 1-3 years of service.

• Supportive and flexible work environment – work from anywhere!

AI Engineer
