This is a fully remote position, open to applicants in Brazil.

📋 Description

• Create tasks and benchmarks that differentiate capability levels across cutting-edge models — encompassing agentic, reasoning-intensive, and domain-specific (healthcare, finance, scientific) environments.

• Rigorously validate evaluations: run human baselines, assess inter-rater reliability, investigate how elicitation and scaffolding influence outcomes, and measure how much of a result is signal versus noise.

• Advance the “science of evaluations” at Protege — incorporating item response theory, contamination analysis, predictive validity studies, and statistical frameworks for comparing models with suitable uncertainty.

• Conduct evaluations on existing frontier models, occasionally in collaboration with partners from AI labs, enterprises, and governmental bodies.

• Publish research that positions Protege as the benchmark for evaluation data, while also contributing to the wider AI community’s grasp of what constitutes effective evaluations.

• Transform findings into products, working closely with data and engineering teams to convert research into evaluation datasets that customers can use.

• Collaborate with outsourced annotation vendors. Evaluation data is only as good as the people producing it. A significant part of this role involves managing the statistical machinery that determines which annotators we can trust, for which tasks, and to what extent, and translating that into trustworthiness scores that Protege’s customers can depend on.

⛳️ Requirements

• Advanced degree (PhD preferred, or MS/BS with equivalent industry experience) in a quantitative field — applied econometrics with AI experience, quantitative finance, computer science, engineering, statistics/mathematics, or any applied research discipline.

• Practical experience in evaluating LLMs, agents, or other machine learning systems — including prompting, scaffolding, and proficiency with the tools researchers utilize to conduct evaluations at scale.

• Familiarity with annotator quality and inter-rater reliability — designing labeling protocols, calculating agreement statistics, and understanding annotator bias and calibration.

• Exceptional scientific writing and communication skills — capable of synthesizing technical findings into narratives that frontier labs, enterprise customers, and policymakers can act upon.

• A proactive approach to speed. You recognize which pipelines require production-grade quality and which can be more flexible, ensuring you achieve reliable results swiftly.

🏝️ Benefits

• Health insurance

• Flexible work hours

• Professional development opportunities

Research Scientist, Benchmarks & Evaluations
