This is a fully remote position, open to applicants in Egypt.

• Create realistic terminal-based benchmark tasks for evaluating AI systems.

• Develop in-depth debugging scenarios and investigation tasks.

• Formulate task specifications that encompass infrastructure, workflows, pipelines, or operational issues.

• Articulate clear solution strategies and definitive evaluation standards.

• Identify plausible edge cases, failure modes, and system limitations.

• Craft multi-step reasoning challenges within intricate technical environments.

• Offer expertise in one or more engineering or operational fields.

• Assess and enhance the quality, difficulty, and validation logic of benchmarks.

• Partner with reviewers and researchers on workflows for AI evaluation.

• 3–10 years of experience in software engineering or similar technical areas.

• Strong skills in debugging, analysis, and systems reasoning.

• Solid understanding of system architecture, dependencies, and operational workflows.

• Familiarity with terminal, CLI, automation, or developer tooling processes.

• Experience with AI systems, large language models, benchmarking, or evaluation frameworks is advantageous.

• Ability to design technically robust and realistic engineering scenarios.

• Competitive salary and performance-based bonuses.

• Opportunities for professional development and training.

• Flexible working hours and remote work options.

• Comprehensive health and wellness benefits.

AI Evaluation Engineer – Software Engineering Domain
