
Staff Applied Scientist – AI Evaluation, Trust
Posted 2 hours ago

This is a fully remote position, open to applicants in the United States.
• Spearhead the creation of specialized "judge models," transitioning from general-purpose frontier models to tailored architectures designed for evaluation and failure mode detection.
• Develop and implement rigorous scoring pipelines and empirical threshold calibrations for agentic systems, which include multi-turn conversations and Graph RAG reasoning.
• Create domain-specific evaluation frameworks that assess whether a system can replicate the work of human experts, rather than merely succeeding in general capability benchmarks.
• Manage the entire lifecycle of evaluation data, from designing annotation infrastructure and protocols to deploying evaluation services in a production environment.
• Investigate and apply advanced techniques in Mixture-of-Experts (MoE) routing, expert specialization evaluation, and ensemble calibration.
• Collaborate across functions with Product, Data Engineering, and the SVP of AI to convert complex statistical uncertainties into clear, actionable product insights.
• Serve as a technical leader and "Scientific Conscience" within the AI team, ensuring that every AI-driven risk signal is supported by an empirical derivation narrative.
• Over 10 years of experience in machine learning, with a focus on deep neural networks and on assessing model performance and trust.
• 1-2 years of experience concentrating on post-training activities.
• A minimum of 1 year of experience in developing benchmarks for evaluating LLMs.
• Technical Mastery: In-depth expertise in LLM-as-judge architectures, multi-turn evaluation, and Reinforcement Learning (RL/RLHF/RLAIF).
• Statistical Rigor: Proficiency in statistics and experimental design, encompassing significance testing, distribution analysis, and inter-rater reliability.
• Architectural Depth: Experience with Mixture-of-Experts (MoE) systems, routing behavior, and expert specialization.
• Builder Mindset: Demonstrated capability to oversee the journey from data collection to production deployment; our team is small and every role is "hands-on."
• Domain Fluency: Familiarity with Graph RAG and the specific challenges associated with evaluating non-deterministic, agentic workflows.
• Medical, vision, and dental insurance covered at 100% for employees and their dependents.
• Generous time-off policy; we observe all US federal holidays, close the office for a winter break (12/24-12/31), and provide 18 PTO days along with 10 sick days.
• Exceptional compensation package, featuring competitive commissions for revenue roles and bonuses for non-revenue positions.
• A strong commitment to diversity, equity, and inclusion.
• Eligibility for additional benefits, including a 401(k) match up to 5%, fully paid life insurance (up to $100,000 coverage), and parental leave.
• A positive and collaborative culture - your team will be as intelligent and driven as you are.
• Unlimited growth and learning opportunities.