This is a fully remote position, open to applicants in Europe.

📋 Description

• Join our data team, where your primary focus will be ensuring the quality of the datasets used to train our models.

• This is a hands-on position: your main objective will be to improve the quality of pretraining datasets, drawing on your prior experience, intuition, and experimental training methods.

• Responsibilities will include generating synthetic data and optimizing data mixes.

• You will collaborate closely with teams such as Pretraining, Post-training, Evals, and Product to identify the high-quality data needed to address unmet model capabilities and downstream applications.

• Keeping abreast of the latest research in dataset design and pretraining is crucial for success in this position.

• You will lead innovative research projects through short-term, time-constrained experiments while implementing highly technical engineering solutions in production.

• Given the vast amounts of data to process, you will have access to a high-performance distributed data pipeline alongside a large GPU cluster.

⛳️ Requirements

• A solid background in machine learning and engineering.

• Proficiency with Large Language Models (LLMs), including:

    • Knowledge of transformer architectures and the learning mechanisms of LLMs.

    • Familiarity with data ablations and scaling laws.

    • Understanding of mid-training and post-training techniques.

    • Experience in training reasoning and agentic models.

    • Familiarity with evaluation processes that track model capabilities (general knowledge, reasoning, mathematics, coding, long-context, etc.).

• Experience in constructing trillion-scale pretraining datasets, with an understanding of concepts such as data curation, deduplication, data mixing, tokenization, curriculum, and the impact of data repetition.

• Excellent programming skills in Python.

• Strong capabilities in prompt engineering.

• Experience with large-scale GPU clusters and distributed data pipelines.

• A strong commitment to data quality.

• Research experience:

    • Having authored scientific papers on topics such as applied deep learning, LLMs, source code generation, etc., is a plus.

    • Ability to discuss the latest research papers and delve into intricate details.

    • Possessing well-informed opinions on the subject matter.

🏝️ Benefits

• Fully remote work and flexible hours.

• 37 days of vacation and holidays per year.

• Health insurance allowance for you and your dependents.

• Equipment provided by the company.

• Allowances for well-being, continuous learning, and home office setup.

• Regular team gatherings.

• A diverse and inclusive, people-first culture.

Engineering Member – Pre-training, Data Research
