This is a fully remote position, open to applicants in Pakistan.

📋 Description

• Design and build **benchmark tasks for multi-agent systems** that require multi-step mathematical reasoning and algorithmic problem-solving.

• Generate **complex and decomposable problems** across various fields, including:

  - Competitive mathematics

  - Numerical analysis

  - Combinatorial optimization

  - Statistical inference

• Create **verification scripts** to ensure the accuracy of:

  - Numerical results (within specified tolerance levels)

  - Proof correctness and logical progression

  - Algorithmic outputs and their constraints

• Compose **clear and organized problem statements** utilizing precise notation and defined outcomes.

• Formulate **task decomposition strategies** suitable for parallel or multi-agent execution.

• Implement computational solutions and validation routines in Python.

• Work with containerized environments (Docker) so that tasks and their evaluation are reproducible.
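
For illustration only, a verification script of the kind described above might resemble the following minimal Python sketch. The function names (`check_numeric`, `check_permutation`) and the default tolerances are hypothetical, chosen for this example, and are not taken from any actual task specification.

```python
import math


def check_numeric(submitted, expected, rel_tol=1e-9, abs_tol=1e-12):
    """Return True if a submitted number matches the reference value
    within the given relative/absolute tolerance (assumed defaults)."""
    return math.isclose(submitted, expected, rel_tol=rel_tol, abs_tol=abs_tol)


def check_permutation(submitted, n):
    """Return True if the submitted sequence is a permutation of 0..n-1,
    a simple example of checking an algorithmic output constraint."""
    return sorted(submitted) == list(range(n))


if __name__ == "__main__":
    # Floating-point round-off should not fail a correct answer.
    print(check_numeric(0.1 + 0.2, 0.3))
    # A genuinely different value should be rejected.
    print(check_numeric(0.3001, 0.3))
    # Constraint check on an algorithmic output.
    print(check_permutation([2, 0, 1], 3))
```

Using an explicit tolerance rather than exact equality keeps a checker from rejecting correct answers that differ only by floating-point round-off, while the constraint check confirms that an output is structurally valid before it is scored.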

⛳️ Requirements

• Over 5 years of experience in mathematics, quantitative research, or computational science — with a background in competition mathematics, university-level mathematics, or quantitative research.

• Proficiency in Python, including numerical libraries such as NumPy or SciPy, or symbolic computation with SymPy.

• Experience writing mathematical proofs or formal derivations (essential).

• Capability to design problems that yield precise, verifiable answers — avoiding subjective or open-ended questions.

• Familiarity with AI coding benchmarks (SWE-bench, Terminal-bench).

• Proficient with Docker — including writing Dockerfiles, building images, and troubleshooting container-related issues.

• Strong understanding of numerical methods — including floating point tolerance, convergence criteria, and error bounds.

**Nice to Have**

• Experience in crafting competition math problems (AMC, AIME, Putnam, IMO).

• A background in **theoretical computer science or advanced mathematics research**.

• Exposure to **automated theorem proving or formal verification** techniques.

• Familiarity with AI reasoning benchmarks (GSM8K, MATH, AIME, GPQA, ARC-AGI).

• Experience in **large-scale numerical or scientific computing**.

🏝️ Benefits

• Opportunity to work on cutting-edge mathematical and computational challenges.

• Collaborative environment with a focus on innovation and problem-solving.

• Continuous learning and development opportunities in advanced mathematics and algorithm design.

• Flexibility in work arrangements, including options for remote work.

AI Evaluation Engineer – Mathematics & Algorithms
