This is a fully remote position, open to applicants in Texas.

📋 Description

• Take ownership of the LLM evaluation strategy at Driver, guiding it from foundational principles to production-level infrastructure.

• Establish quality metrics and create evaluation datasets.

• Define the criteria for what constitutes 'good' for each content type throughout the pipeline.

• Develop and maintain gold-standard evaluation datasets across various languages and repository types (monorepos, microservices, libraries, applications).

• Create rubrics that assess accuracy, completeness, usefulness, and readability.

• Construct benchmarking and experimentation infrastructure.

• Develop automated evaluation pipelines that measure output against reference datasets.

• Instrument the content generation pipeline to support A/B comparisons: run the same codebase through two different strategies and compare the results.

• Create tools for LLM-as-judge evaluation and regression detection.

• Incorporate evaluation into CI so that pipeline modifications are accompanied by quality evidence.

• Generate automated quality signals at scale.

• Implement quality checks that flag degraded output without requiring human review of every document.

• Track content quality trends over time.

• Design sampling strategies for human review that optimize signal detection with minimal annotation effort.

• Measure trade-offs and inform decision-making.

• Conduct experiments on model selection, context strategies, and changes to pipeline architecture.

• Analyze cost, quality, and latency trade-offs.

• Collaborate with the engineering team to translate evaluation insights into tangible improvements.

⛳️ Requirements

• Bachelor's, Master's, or PhD in Statistics, Machine Learning, Data Science, Computational Linguistics, or a related quantitative discipline.

• 3 to 5 years of experience in applied science, ML engineering, or data science roles with a focus on evaluation, NLP, or generative AI; 7+ years preferred.

• Strong foundation in statistics: experimental design, hypothesis testing, confidence intervals, effect sizes, and power analysis.

• Experience in designing and executing evaluations for LLM or NLP systems — you have carefully considered what 'better' signifies when dealing with open-ended text outputs.

• Proficient in Python and the scientific/data stack (pandas, NumPy, scipy, sklearn).

• Comfortable using Jupyter notebooks for exploration and prototyping, and turning that work into automated pipelines.

• Familiar with LLM-as-judge methodologies, inter-annotator agreement, and rubric design for subjective quality evaluation.

• Knowledgeable about the practical challenges associated with non-deterministic systems: variance decomposition, multi-run methodology, and differentiating signal from noise at scale.

• Strong data storytelling skills — able to convert experimental results into clear recommendations that inform engineering and product strategies.

🏝️ Benefits

• Competitive Compensation Packages - Cash & Equity

• Flexible Work Culture

• Unlimited Time Off + 12 Paid Company Holidays

• Insurance - Health, Dental, & Vision

• Life Insurance & FSA Accounts

• 401(k) Retirement Accounts - Traditional, Roth, or Both

• Quarterly Team Offsites

Applied Data Scientist, LLM Evaluation
