
Member of Engineering – Pre-training, Data Acquisition
Posted 1 day ago
This is a fully remote position, open to applicants in the United States.
• Design, construct, and manage a large-scale web crawler tasked with gathering all publicly available data on the internet.
• Create specialized deep crawlers aimed at high-value sources to enhance data recall and coverage.
• Collaborate with data researchers to establish a long-term strategy for data acquisition.
• Develop observability, monitoring, and debugging tools to ensure reliability and transparency within the crawling infrastructure.
• Work alongside pre-training, post-training, and evaluation teams to align data acquisition priorities with the requirements of model training.
• Construct high-throughput ingestion pipelines for swiftly integrating partner data and assessing its quality.
• Strong background in distributed systems, with demonstrated experience building and managing large-scale infrastructure such as data pipelines or web crawlers.
• Proficient in Python, with the ability to optimize performance and troubleshoot complex systems in production environments.
• Practical experience with web crawling or large-scale data extraction, including knowledge of HTTP protocols, distributed job queues, and data parsing at scale.
• Familiarity with cloud platforms (AWS) and container orchestration technologies (Kubernetes, Docker) for deploying and overseeing high-throughput workloads.
• Understanding of the non-technical aspects of internet-scale crawling, including data privacy, compliance with robots.txt, and ethical crawling practices.
Nice to have:
• Previous experience with pre-training large language models (LLMs).
• Experience in creating trillion-scale state-of-the-art pre-training datasets.
• Proven track record of translating research into scalable production implementations.
• Fully remote work & flexible hours.
• 37 days/year of vacation & holidays.
• 16 weeks of flexible, full-pay parental leave.
• Health insurance allowance for you & your dependents.
• Company-provided equipment.
• Well-being, continuous learning, & home office allowances.
• Frequent team get-togethers.
• Diverse & inclusive, people-first culture.