This is a fully remote position, open to applicants in Colombia.

📋 Description

• Develop benchmark tasks for multi-agent systems that require reading, analyzing, and synthesizing large document collections.

• Assemble real-world research datasets (academic papers, case studies, and technical reports) and formulate questions that demand thorough analysis.

• Write structured ground-truth oracles (JSON) containing specific, verifiable answers that confirm the agent actually engaged with the source material.

• Create prompts for LLM judges to assess agent outputs on a field-by-field basis against the oracle.

• Write decomposition guides that split research tasks across multiple parallel sub-agents (one per document, one per domain, followed by a synthesis step).

⛳️ Requirements

• Over 5 years of experience in **research (academic or industry)** within a scientific, technical, or analytical field.

• Strong ability to **read, analyze, and extract structured information from unstructured documents**.

• Experience in designing or handling **structured data formats (JSON, schemas, validation)**.

• Proficiency in **Python scripting** (for data processing, validation, or evaluation scripts).

• Background in **AI evaluation, coding benchmarks, or structured reasoning tasks** (such as SWE-bench, Terminal-Bench, or similar).

• Familiarity with **Docker** (including image building and container debugging).

• Exceptional attention to detail, particularly when specifying **exact, verifiable outputs**.

• Ability to design **complex, multi-step problem-solving workflows**.

🏝️ Benefits

• Opportunity to work on cutting-edge projects in a collaborative environment.

• Access to continuous learning and professional development resources.

• Flexible work hours in a fully remote role.

• Competitive salary and comprehensive benefits package.

AI Evaluation Engineer, Knowledge and Research
