
Senior SQA Engineer – LLM
Posted 6 days ago

This is a fully remote position, open to applicants in Pakistan.
• Develop and oversee the comprehensive QA strategy for the Conversational Banking Platform, which includes functional, regression, performance, security, and AI-specific assessments.
• Create and uphold golden datasets, evaluation suites, and frameworks utilizing LLM-as-judge to ensure conversational quality across various intents, languages, and tenants.
• Establish the QA gate for tenant onboarding, along with the certification checklist that each new business unit must complete prior to going live.
• Formulate regression strategies for modifications to prompts, upgrades to models, updates to retrieval indexes, and alterations in guardrail policies.
• Leverage Langfuse traces for evaluations: analyze production failures, transform them into test cases, and provide feedback to engineering.
• Assess NeMo Guardrails configurations against jailbreaks, prompt injections, and off-topic drift, and check for false-positive over-blocking.
• Validate governance and compliance measures, including data residency, handling of PII, disclosures for regulated products, and off-limits topics.
• Develop automated testing harnesses for Spring AI services, which encompass tool-calling validation, RAG groundedness, and integration with Cosmos DB and MongoDB data layers.
• Collaborate with the Platform team to establish quality metrics, SLOs, and the platform evaluation scorecard.
• Mentor feature engineers and tenant teams on crafting their own evaluations, promoting self-service quality at the platform level over time.
• A minimum of 6 years of experience in software QA, with at least 1–2 years dedicated to testing LLM-based, RAG, or conversational AI systems in a production environment.
• Practical experience with LLM observability and evaluation tools such as Langfuse, LangSmith, Arize, or Phoenix.
• Familiarity with evaluation frameworks like Ragas, DeepEval, Promptfoo, or TruLens — including metrics such as faithfulness, groundedness, answer relevance, and context precision.
• A solid understanding of how to test non-deterministic systems: golden datasets, semantic similarity, LLM-as-judge, and statistical regression detection.
• Experience with testing guardrail or policy frameworks (like NeMo Guardrails, Guardrails AI, or similar solutions).
• Strong foundation in API testing, automation frameworks (e.g., pytest, JUnit, Karate, RestAssured), and CI/CD integration.
• Familiarity with Spring and Spring Boot applications as well as JVM-based services.
• Proficiency in writing queries against NoSQL databases (MongoDB, Cosmos DB) for setting up test data and inspecting traces.
• Excellent written communication skills: capable of producing clear test plans, defect reports, and tenant readiness assessments.
• Preferred experience in banking, financial services, or other regulated industries.
• Exposure to multi-tenant platforms, including an understanding of how shared infrastructure complicates testing.
• Familiarity with red-teaming, adversarial prompt testing, and defenses against prompt injection.
• Working knowledge of vector databases, embedding models, and retrieval evaluation methodologies.
• Experience with multi-language conversational systems.
• Performance and load testing experience specifically for AI workloads (including token throughput, latency percentiles, and cost per conversation).
• Contributions to open-source evaluation or AI testing tools.
• Experience collaborating with compliance, risk, or audit teams on AI assurance initiatives.
• Comprehensive health, dental, and vision insurance.
• Flexible working hours and remote work options.
• Opportunities for professional development and career advancement.
• Engaging work environment and collaborative team culture.