
Research Scientist, Benchmarks & Evaluations
Posted May 24

This is a fully remote position, open to applicants in Brazil.
• Create tasks and benchmarks that differentiate capability levels across cutting-edge models — encompassing agentic, reasoning-intensive, and domain-specific (healthcare, finance, scientific) environments.
• Rigorously validate evaluations: conduct human baselines, assess inter-rater reliability, investigate how elicitation and scaffolding influence outcomes, and measure what constitutes signal versus noise.
• Advance the “science of evaluations” at Protege — incorporating item response theory, contamination analysis, predictive validity studies, and statistical frameworks for comparing models with suitable uncertainty.
• Conduct evaluations on existing frontier models, occasionally in collaboration with partners from AI labs, enterprises, and governmental bodies.
• Publish research that positions Protege as the benchmark for evaluation data, while also contributing to the wider AI community’s grasp of what constitutes effective evaluations.
• Transform findings into products, working closely with data and engineering teams to convert research into evaluation datasets that customers can utilize.
• Collaborate with outsourced annotation vendors. Evaluation data is only as good as the people producing it. A significant part of this role involves managing the statistical machinery that determines which annotators we can trust, for which tasks, and to what extent, and translating that into trustworthiness scores that Protege’s customers can depend on.
• Advanced degree (PhD preferred, or MS/BS with equivalent industry experience) in a quantitative field — applied econometrics with AI experience, quantitative finance, computer science, engineering, statistics/mathematics, or any applied research discipline.
• Practical experience in evaluating LLMs, agents, or other machine learning systems — including prompting, scaffolding, and proficiency with the tools researchers utilize to conduct evaluations at scale.
• Familiarity with annotator quality and inter-rater reliability — designing labeling protocols, calculating agreement statistics, and understanding annotator bias and calibration.
• Exceptional scientific writing and communication skills — capable of synthesizing technical findings into narratives that frontier labs, enterprise customers, and policymakers can act upon.
• A bias toward speed. You recognize which pipelines require production-grade quality and which can be more flexible, so you reach reliable results quickly.
• Health insurance
• Flexible work hours
• Professional development opportunities