
AI QA Engineer – Generative AI Quality and Evaluation
Posted 1 day ago

Responsibilities
• Design, validate, and enhance evaluation frameworks for AI agents.
• Implement automated and regression testing suites for generative models.
• Define and monitor quality metrics such as relevance, fidelity, consistency, accuracy, and hallucination rate.
• Build “LLM-as-a-Judge” evaluation systems (see the sketch after this list).
• Establish performance benchmarks for new models and existing agents.
• Validate updates for prompts, models, and RAG pipelines.
• Collaborate with AI and development teams to define acceptance criteria (pass/fail).
• Analyze evaluation results and propose continuous improvements.
• Produce metric reports and maintain traceability for agent quality.
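
As context for the “LLM-as-a-Judge” item above, here is a minimal sketch of the pattern: a judge model grades an answer against a rubric, and a threshold turns the grade into a pass/fail acceptance decision. Everything here (JUDGE_PROMPT, judge_answer, Verdict, the call_llm callable) is a hypothetical illustration, not any particular framework’s API.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Hypothetical rubric; real judge prompts are tuned and versioned per use case.
JUDGE_PROMPT = """You are a strict evaluator. Given a question, retrieved context,
and a candidate answer, rate the answer's faithfulness to the context on a
scale of 1-5. Respond with only the number.

Question: {question}
Context: {context}
Answer: {answer}
Score:"""

@dataclass
class Verdict:
    score: int
    passed: bool

def judge_answer(
    question: str,
    context: str,
    answer: str,
    call_llm: Callable[[str], str],  # stand-in for your model API client
    threshold: int = 4,              # pass/fail acceptance criterion
) -> Verdict:
    """Ask a judge model for a 1-5 faithfulness score and apply a threshold."""
    raw = call_llm(JUDGE_PROMPT.format(question=question, context=context, answer=answer))
    match = re.search(r"[1-5]", raw)
    if match is None:
        raise ValueError(f"Judge returned no parsable score: {raw!r}")
    score = int(match.group())
    return Verdict(score=score, passed=score >= threshold)
```

Because the judge is itself non-deterministic, production versions typically pin the judge model version, request low temperature where the API allows it, and periodically check judge scores against human labels.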

Requirements
• Minimum of 3 years of experience in QA automation, Data/AI Quality, or AI system evaluation.
• Advanced proficiency in Python.
• Experience with AI evaluation frameworks such as RAGAS, DeepEval, or Vertex Gen AI Evaluation Service.
• Experience evaluating RAG systems and LLMs.
• Ability to design “LLM-as-a-Judge” systems.
• Experience in test automation and validation.
• Knowledge of prompt evaluation, response quality, model benchmarking, and generative AI testing.
• Familiarity with metrics such as groundedness, faithfulness, context relevance, and answer relevance (see the example after this list).
• Experience working with non-deterministic systems.
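
For the evaluation frameworks and metrics named above, a regression test in DeepEval typically looks like the sketch below: an LLMTestCase built from one RAG run, scored by faithfulness and answer-relevancy metrics whose thresholds act as pass/fail acceptance criteria. The class and function names follow DeepEval’s documented pytest-style API, but releases change, so verify them against the version you install; the question, answer, and context strings are invented.

```python
# Sketch of a DeepEval regression test (API names per DeepEval's docs; verify
# against your installed version). The metrics use a judge LLM under the hood,
# so credentials for the configured judge model are required.
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

def test_rag_answer_quality():
    # In a real suite, input / actual_output / retrieval_context come from
    # running the RAG pipeline under test.
    test_case = LLMTestCase(
        input="What is the refund window?",
        actual_output="Refunds are accepted within 30 days of purchase.",
        retrieval_context=["Our policy allows refunds within 30 days of purchase."],
    )
    metrics = [
        FaithfulnessMetric(threshold=0.8),     # is the answer grounded in the context?
        AnswerRelevancyMetric(threshold=0.8),  # does the answer address the question?
    ]
    # Fails the test if any metric scores below its threshold, turning
    # metric thresholds into pass/fail gates in CI.
    assert_test(test_case, metrics)
```

Run it with pytest (DeepEval also ships a CLI wrapper around pytest); wiring it into CI gives the automated regression gate described in the responsibilities.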

Desirable
• Experience with conversational AI platforms.
• Knowledge of RAG pipelines.
• Experience with generative model APIs.
• Proficiency with observability and monitoring tools (see the sketch after this list).
• Knowledge of MLOps or LLMOps.
• Experience in cloud environments (GCP, AWS, or Azure).
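
On the observability side, much of the day-to-day work reduces to recording evaluation scores as structured events so that quality trends in a non-deterministic system are visible over time. A stdlib-only sketch; the file path, field names, and helper functions are illustrative, since real setups ship these records to a metrics backend instead of a local file.

```python
import json
import time
from pathlib import Path

# Hypothetical log location, chosen for illustration only.
EVAL_LOG = Path("eval_scores.jsonl")

def record_eval(model: str, metric: str, score: float, run_id: str) -> None:
    """Append one evaluation result as a structured JSON line."""
    record = {
        "ts": time.time(),
        "run_id": run_id,
        "model": model,
        "metric": metric,
        "score": score,
    }
    with EVAL_LOG.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")

def mean_score(metric: str, last_n: int = 50) -> float:
    """Average the most recent scores for one metric, e.g. to alert on drift."""
    rows = [json.loads(line) for line in EVAL_LOG.read_text().splitlines()]
    scores = [r["score"] for r in rows if r["metric"] == metric][-last_n:]
    if not scores:
        raise ValueError(f"no recorded scores for metric {metric!r}")
    return sum(scores) / len(scores)
```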

Benefits
• Work mode: 100% Remote.
• Excellent work environment.
• Opportunities for growth and participation in innovative projects.