Staff Applied Scientist – AI Evaluation, Trust

at Sayari | United States | Full-time | Lead | $195k – $205k/year

Posted 2 hours ago

This is a fully remote position, open to applicants in the United States.

📋 Description

• Spearhead the creation of specialized "judge models," transitioning from general-purpose frontier models to tailored architectures designed for evaluation and failure mode detection.

• Develop and implement rigorous scoring pipelines and empirical threshold calibrations for agentic systems, which include multi-turn conversations and Graph RAG reasoning.

• Create domain-specific evaluation frameworks that assess whether a system can replicate the work of human experts, rather than merely succeeding in general capability benchmarks.

• Manage the entire lifecycle of evaluation data, from designing annotation infrastructure and protocols to deploying evaluation services in a production environment.

• Investigate and apply advanced techniques in Mixture-of-Experts (MoE) routing, expert specialization evaluation, and ensemble calibration.

• Collaborate across functions with Product, Data Engineering, and the SVP of AI to convert complex statistical uncertainties into clear, actionable product insights.

• Serve as a technical leader and "Scientific Conscience" within the AI team, ensuring that every AI-driven risk signal is supported by an empirical derivation narrative.


⛳️ Requirements

• Over 10 years of experience in Machine Learning, with a focus on deep neural networks and on assessing model performance and trust.

• 1-2 years of experience concentrating on post-training activities.

• A minimum of 1 year of experience in developing benchmarks for evaluating LLMs.

• Technical Mastery: In-depth expertise in LLM-as-judge architectures, multi-turn evaluation, and Reinforcement Learning (RL/RLHF/RLAIF).

• Statistical Rigor: Proficiency in statistics and experimental design, encompassing significance testing, distribution analysis, and inter-rater reliability.

• Architectural Depth: Experience with Mixture-of-Experts (MoE) systems, routing behavior, and expert specialization.

• Builder Mindset: Demonstrated capability to oversee the journey from data collection to production deployment; our team is small and every role is "hands-on."

• Domain Fluency: Familiarity with Graph RAG and the specific challenges associated with evaluating non-deterministic, agentic workflows.


🏝️ Benefits

• Fully covered (100%) medical, vision, and dental insurance for employees and their dependents.

• Generous time-off policy; we observe all US federal holidays, close the office for a winter break (12/24-12/31), and provide 18 PTO days along with 10 sick days.

• Exceptional compensation package, featuring competitive commissions for revenue roles and bonuses for non-revenue positions.

• A strong commitment to diversity, equity, and inclusion.

• Eligibility to participate in additional benefits, including a 401k match up to 5%, fully paid life insurance (up to $100,000 coverage), and parental leave.

• A positive and collaborative culture - your team will be as intelligent and driven as you are.

• Unlimited growth and learning opportunities.
