
Senior Member of Technical Staff, Web Data
Posted 1 day ago

This is a fully remote position, open to applicants in Canada.
• Oversee extensive pipelines designed for the processing of web corpora.
• Work on filtering and quality assessment systems that identify valuable web documents.
• Examine the composition of web data across various domains, languages, and timeframes.
• Create and sustain highly efficient deduplication pipelines.
• Work collaboratively with interdisciplinary teams, including researchers and engineers, to ensure that data pipelines align with the requirements of advanced language models.
• Strong software engineering skills, particularly in Python, along with experience building data pipelines.
• Knowledge of data processing frameworks such as Apache Spark, Apache Beam, Pandas, or comparable tools.
• Experience with large-scale web datasets.
• Understanding of data quality evaluation methods and experimentation with data mixtures.
• A strong interest in merging research and engineering to address intricate data-related issues in AI model training.
• Bonus: published work in prestigious venues (e.g., NeurIPS, ICML, ICLR, AISTATS, MLSys, JMLR, AAAI, Nature, COLING, ACL, EMNLP).
• An open and inclusive culture and work environment
• Work closely with a team on the cutting edge of AI research
• Weekly lunch stipend, in-office lunches & snacks
• Full health and dental benefits, including a separate budget to take care of your mental health
• 100% Parental Leave top-up for up to 6 months
• Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
• Remote-flexible, with offices in Toronto, New York, San Francisco, London, and Paris, plus a co-working stipend
• 6 weeks of vacation (30 working days!)