
AI Engineer
Posted 5 days ago

This is a fully remote position, open to applicants in Minnesota.
• Manage and enhance our self-hosted inference infrastructure.
• Operate the inference serving layer on our dedicated GPU hardware: select and fine-tune the serving stack (vLLM, SGLang, TensorRT-LLM) to achieve high throughput and minimal latency.
• Optimize rigorously: implement tensor parallelism, quantization (FP8, AWQ, GPTQ), KV-cache and prefix caching, continuous batching, and concurrency tuning.
• Serve various models and features from shared hardware: multi-LoRA, routing, and request scheduling to balance internal workloads with latency-sensitive product traffic.
• Maintain the speed, efficiency, and observability of our AI systems.
• Enhance the efficiency of our AI workloads: reduce latency, increase throughput, and improve GPU utilization so we get the most out of our hardware.
• Establish visibility: instrument performance and usage metrics across our AI platforms to provide clear insights into operational performance.
• Highlight technical trade-offs (performance, latency, efficiency) to equip decision-makers with the necessary information.
• Develop AI features and proactive agents.
• Deliver the in-app agent layer designed to assist families with coordination: offering proactive nudges, smart suggestions, and agents that summarize, draft, schedule, and act on behalf of busy parents.
• Create the underlying infrastructure: tools, memory management, orchestration, guardrails, and evaluation harnesses, seamlessly integrated with production APIs in collaboration with our architecture team.
• Collaborate closely with feature owners, quickly building whatever is necessary to test ideas, including a vibe-coded UI when it provides the fastest route to customer feedback. Embrace rapid iteration: ship rough drafts, learn swiftly, and refine what proves effective.
• 5+ years of experience in delivering production software, with significant applied AI or ML expertise.
• Proven experience in running and optimizing self-hosted LLMs on dedicated multi-GPU hardware: familiarity with a serving stack (vLLM, SGLang, or TensorRT-LLM) and associated optimization techniques (tensor parallelism, quantization, batching, KV cache).
• A solid history of enhancing inference performance and efficiency (latency, throughput, GPU utilization).
• Strong proficiency in Python and engineering principles, with the capability to quickly develop a UI and a genuine interest in app-layer features, not just infrastructure.
• Practical experience with agent frameworks (Claude Agent SDK, LangGraph, or equivalents), LLM APIs, embeddings, and RAG.
• Familiarity with AWS and the associated DevOps responsibilities of this role: Docker, CI/CD, monitoring, and observability.
• Experience in building internal tools or platforms relied upon by others, with a bonus for experience in Slack apps, MCP, or agent orchestration at team scale.
• Medical: In Tandem covers 100% of the premium for employees and 99% for additional family members.
• 401k: Offers up to a 4% match with immediate vesting.
• Paid leave for all new parents.
• Learning & Development stipend available for employees.
• Paid Time Off: 11 Holidays + Winter Break (3 Days) + Volunteer Time Off (1 Day) + Floating Holiday (1 Day).
• Personal Time Off: 15 days for employees with 0-1 years of service, increasing to 20 days for those with 1-3 years of service.
• Supportive and flexible work environment – work from anywhere!