This is a fully remote position, open to applicants in Poland.

• Analyze requirements and establish the testing strategy for new features and product modifications.

• Automate test scenarios utilizing the current framework built on Python and PyTest.

• Develop automated quality assessment pipelines for AI systems using metrics and LLM-as-judge methodologies.

• Conduct testing on MCP servers, tool schemas, and tool-call behaviors, including edge cases and invalid inputs.

• Assess agentic workflows, focusing on tool selection, multi-step reasoning, error management, loop recovery, and state accuracy.

• Sustain and enhance the test automation framework and aid in the development of internal testing tools, including mocks.

• Create and uphold test documentation, which encompasses checklists, test cases, and quality reports.

• Engage in test design, estimations, release testing, and product quality evaluations.

• Contribute to improvements in CI/CD and QA processes.

• Design and manage evaluation suites and golden datasets for RAG and agentic workflows.

• Execute adversarial testing for AI systems, addressing prompt injection, jailbreaks, tool misuse, and data leakage concerns.

• Establish regression checks for alterations in prompts, models, retrieval settings, and chunking strategies.

• Monitor the quality of AI systems alongside cost, latency, and token usage.

• Utilize tracing and observability tools to debug, assess, and enhance LLM application performance.

• Over 5 years of experience in Quality Assurance, encompassing both manual and automated testing.

• Strong grasp of QA principles, test design, test coverage, test pyramid, and Software Development Life Cycle (SDLC).

• Proficient with Python-based test automation frameworks, such as PyTest, Behave, or comparable tools.

• Familiarity with CI/CD and with monitoring or alerting tools such as Datadog, ELK, Sentry, or similar.

• Passion for testing AI/LLM-based systems; hands-on experience is preferred, though quick learners eager to develop in this field are also welcome.

• Knowledge of RAG, LLM evaluation, and quality metrics such as groundedness, faithfulness, answer relevance, and retrieval quality.

• Experience with or interest in AI evaluation tools, including RAGAS, DeepEval, promptfoo, LangSmith Eval, TruLens, Arize Phoenix, or similar tools.

• Understanding of how to test non-deterministic systems, where multiple correct outputs may exist.

• Familiarity with LangChain, LangGraph, MCP, vector databases, semantic search, or LLM observability tools is a significant advantage.

• Proficient in spoken and written English (B2 level or higher).

• Full-time employment.

• Private health insurance.

• One additional day off each calendar year.

• Compensation for sports programs.

• Comprehensive mental health program.

• Free online English classes with native speakers.

• Generous referral program.

• Training, internal workshops, and opportunities to participate in international professional conferences and corporate events.

Senior Software Developer in Test, Python
