
Engineering Member – Pre-training, Data Research
Posted May 20

This is a fully remote position, open to applicants in Europe.
• Join our data team, where your primary focus will be on ensuring the quality of datasets utilized for training our models.
• This is a hands-on position in which your main objective will be to enhance the quality of pretraining datasets by drawing on your prior experience, intuition, and training experiments.
• Responsibilities will include generating synthetic data and optimizing data mixes.
• You will collaborate closely with teams such as Pretraining, Post-training, Evals, and Product to identify the high-quality data requirements that correspond to unmet model capabilities and downstream applications.
• Keeping abreast of the latest research in dataset design and pretraining is crucial for success in this position.
• You will lead innovative research projects through short-term, time-constrained experiments while implementing highly technical engineering solutions in production.
• Given the vast amounts of data to process, you will have access to a high-performance distributed data pipeline alongside a large GPU cluster.
• A solid background in machine learning and engineering.
• Proficiency with Large Language Models (LLMs), including:
• Knowledge of transformer architectures and the learning mechanisms of LLMs.
• Familiarity with data ablations and scaling laws.
• Understanding of mid-training and post-training techniques.
• Experience in training reasoning and agentic models.
• Familiarity with evaluation processes that track model capabilities (general knowledge, reasoning, mathematics, coding, long-context, etc.).
• Experience in constructing trillion-scale pretraining datasets, with an understanding of concepts such as data curation, deduplication, data mixing, tokenization, curriculum, and the impact of data repetition.
• Excellent programming skills in Python.
• Strong capabilities in prompt engineering.
• Experience with large-scale GPU clusters and distributed data pipelines.
• A strong commitment to data quality.
• Research experience:
• Authorship of scientific papers on topics such as applied deep learning, LLMs, or source code generation is a plus.
• Ability to discuss the latest research papers and delve into intricate details.
• Well-informed opinions on the subject matter.
• Fully remote work and flexible hours.
• 37 days per year of vacation and holidays.
• Health insurance allowance for you and your dependents.
• Equipment provided by the company.
• Allowances for well-being, continuous learning, and home office setup.
• Regular team gatherings.
• A diverse and inclusive, people-first culture.