
Senior Software Engineer – AI Evaluation & Benchmarks, Python
Posted 22 hours ago

• Build coding benchmarks and evaluation pipelines used to assess frontier AI models on real software engineering tasks:
• Design coding benchmarks that test frontier models on practical programming challenges — including reasoning, debugging, and production-level code.
• Develop and maintain scalable data pipelines for evaluation.
• Evaluate model-generated code for accuracy, reliability, and edge-case failures.
• Create structured evaluation scenarios across large repositories and multi-language environments.
• Provide detailed technical feedback on model performance and failure patterns.
• Contribute to evaluation frameworks that establish standards for measuring coding capabilities.
• The ultimate objective: benchmarks that clearly separate what frontier models can and cannot do, influencing how the next generation is trained and refined.
• AI coding evaluation summarized: Design task → build harness → execute model → analyze failures → integrate findings back into the benchmark → evaluations that truly differentiate robust models from weaker ones.
• A minimum of 4 years of professional software engineering experience (mandatory).
• Proficient in Python — producing clean, efficient, and thoroughly tested code.
• Practical experience with large, complex codebases.
• Demonstrated experience in designing and implementing LLM coding benchmarks and evaluation data pipelines.
• Strong proficiency in Git and modern development workflows.
• Proven track record at a high-growth tech company or a prestigious software organization.
• Excellent written communication skills in English.
• Identity verification: applicants must verify their identity and hold valid documentation to work as an independent contractor in their country of residence.
• Weekly payments via PayPal or Stripe.
Rox Partner