This is a fully remote position, open to applicants in the United States.

📋 Description

• Spearhead the creation of specialized "judge models," transitioning from general-purpose frontier models to tailored architectures designed for evaluation and failure mode detection.

• Develop and implement rigorous scoring pipelines and empirical threshold calibrations for agentic systems, which include multi-turn conversations and Graph RAG reasoning.

• Create domain-specific evaluation frameworks that assess whether a system can replicate the work of human experts, rather than merely succeeding in general capability benchmarks.

• Manage the entire lifecycle of evaluation data, from designing annotation infrastructure and protocols to deploying evaluation services in a production environment.

• Investigate and apply advanced techniques in Mixture-of-Experts (MoE) routing, expert specialization evaluation, and ensemble calibration.

• Collaborate across functions with Product, Data Engineering, and the SVP of AI to convert complex statistical uncertainties into clear, actionable product insights.

• Serve as a technical leader and "Scientific Conscience" within the AI team, ensuring that every AI-driven risk signal is supported by an empirical derivation narrative.

⛳️ Requirements

• Over 10 years of experience in Machine Learning, with a focus on deep neural networks and on assessing model performance and trust.

• 1-2 years of experience concentrating on post-training activities.

• A minimum of 1 year of experience in developing benchmarks for evaluating LLMs.

• Technical Mastery: In-depth expertise in LLM-as-judge architectures, multi-turn evaluation, and Reinforcement Learning (RL/RLHF/RLAIF).

• Statistical Rigor: Proficiency in statistics and experimental design, encompassing significance testing, distribution analysis, and inter-rater reliability.

• Architectural Depth: Experience with Mixture-of-Experts (MoE) systems, routing behavior, and expert specialization.

• Builder Mindset: Demonstrated capability to oversee the journey from data collection to production deployment; our team is small and every role is "hands-on."

• Domain Fluency: Familiarity with Graph RAG and the specific challenges associated with evaluating non-deterministic, agentic workflows.

🏝️ Benefits

• Fully covered (100%) medical, vision, and dental insurance for employees and their dependents.

• Generous time-off policy; we observe all US federal holidays, close the office for a winter break (12/24-12/31), and provide 18 PTO days along with 10 sick days.

• Exceptional compensation package, featuring competitive commissions for revenue roles and bonuses for non-revenue positions.

• A strong commitment to diversity, equity, and inclusion.

• Eligibility to participate in additional benefits, including a 401(k) match of up to 5%, fully paid life insurance (up to $100,000 coverage), and parental leave.

• A positive and collaborative culture - your team will be as intelligent and driven as you are.

• Unlimited growth and learning opportunities.

Staff Applied Scientist – AI Evaluation, Trust
