
Applied Data Scientist, LLM Evaluation
Posted 1 day ago

This is a fully remote position, open to applicants in Texas.
• Take ownership of the LLM evaluation strategy at Driver, guiding it from foundational principles to production-level infrastructure.
• Establish quality metrics and create evaluation datasets.
  • Define the criteria for what constitutes 'good' for each content type throughout the pipeline.
  • Develop and maintain gold-standard evaluation datasets across various languages and repository types (monorepos, microservices, libraries, applications).
  • Create rubrics that assess accuracy, completeness, usefulness, and readability.
• Construct benchmarking and experimentation infrastructure.
  • Develop automated evaluation pipelines that measure output against reference datasets.
  • Instrument the content generation pipeline to support A/B comparisons: running the same codebase through two different strategies and analyzing the results.
  • Create tools for LLM-as-judge evaluation and regression detection.
  • Incorporate evaluation into CI so that pipeline modifications are accompanied by quality evidence.
• Generate automated quality signals at scale.
  • Implement quality checks that identify degraded output without requiring human review of every document.
  • Track content quality trends over time.
  • Design sampling strategies for human review that optimize signal detection with minimal annotation effort.
• Measure trade-offs and inform decision-making.
  • Conduct experiments on model selection, context strategies, and changes to pipeline architecture.
  • Analyze cost, quality, and latency trade-offs.
  • Collaborate with the engineering team to translate evaluation insights into tangible improvements.
• Bachelor's, Master's, or PhD in Statistics, Machine Learning, Data Science, Computational Linguistics, or a related quantitative discipline.
• At least 3 to 5 years of experience in applied science, ML engineering, or data science roles with a focus on evaluation, NLP, or generative AI; 7+ years of experience is preferred.
• Strong foundation in statistics: experimental design, hypothesis testing, confidence intervals, effect sizes, and power analysis.
• Experience designing and executing evaluations for LLM or NLP systems; you have thought carefully about what 'better' means for open-ended text outputs.
• Proficient in Python and the scientific/data stack (pandas, NumPy, scipy, sklearn).
• Comfortable using Jupyter notebooks for exploration and prototyping, and turning that work into automated pipelines.
• Familiar with LLM-as-judge methodologies, inter-annotator agreement, and rubric design for subjective quality evaluation.
• Knowledgeable about the practical challenges associated with non-deterministic systems: variance decomposition, multi-run methodology, and differentiating signal from noise at scale.
• Strong data storytelling skills: able to convert experimental results into clear recommendations that inform engineering and product strategies.
• Competitive Compensation Packages - Cash & Equity
• Flexible Work Culture
• Unlimited Time Off + 12 Paid Company Holidays
• Insurance - Health, Dental, & Vision
• Life Insurance & FSA Accounts
• 401(k) Retirement Accounts - Traditional, Roth, or Both
• Quarterly Team Offsites