
AI Evaluation Engineer, Knowledge and Research
Posted May 21

This is a fully remote position, open to applicants in Colombia.
• Develop benchmark tasks for multi-agent systems that require reading, analyzing, and synthesizing large document collections.
• Assemble real-world research datasets (academic papers, case studies, and technical reports) and write questions that demand thorough analysis.
• Compose structured ground-truth oracles (JSON) containing specific, verifiable answers that confirm the agent actually engaged with the source material.
• Create prompts for LLM judges to assess agent outputs on a field-by-field basis against the oracle.
• Establish decomposition guides that distribute research tasks among multiple parallel sub-agents (one for each document, one for each domain, followed by synthesis).
• Over 5 years of experience in **research (academic or industry)** within a scientific, technical, or analytical field.
• Strong capability to **read, analyze, and extract structured information from unstructured documents**.
• Experience in designing or handling **structured data formats (JSON, schemas, validation)**.
• Proficient in **Python scripting** (for data processing, validation, or evaluation scripts).
• Background in **AI evaluation, coding benchmarks, or structured reasoning tasks** (such as SWE-bench, Terminal-Bench, or similar).
• Familiarity with **Docker** (including image building and container debugging).
• Exceptional attention to detail, particularly when specifying **exact, verifiable outputs**.
• Capability to design **complex, multi-step problem-solving workflows**.
• Opportunity to work on cutting-edge projects in a collaborative environment.
• Access to continuous learning and professional development resources.
• Flexible work hours and potential remote work options.
• Competitive salary and comprehensive benefits package.