Remotery

Member of Engineering – Pre-training, Data Acquisition

Posted 1 day ago

This is a fully remote position, open to applicants in the United States.

📋 Description

• Design, construct, and manage a large-scale web crawler tasked with gathering all publicly available data on the internet.

• Create specialized deep crawlers aimed at high-value sources to enhance data recall and coverage.

• Collaborate with data researchers to establish a long-term strategy for data acquisition.

• Develop observability, monitoring, and debugging tools to ensure reliability and transparency within the crawling infrastructure.

• Work alongside pre-training, post-training, and evaluation teams to align data acquisition priorities with the requirements of model training.

• Construct high-throughput ingestion pipelines for swiftly integrating partner data and assessing its quality.


⛳️ Requirements

• Strong background in distributed systems, with demonstrated experience building and managing large-scale infrastructure such as data pipelines or web crawlers.

• Proficiency in Python, with the ability to optimize performance and troubleshoot complex systems in production environments.

• Practical experience with web crawling or large-scale data extraction, including knowledge of HTTP protocols, distributed job queues, and data parsing at scale.

• Familiarity with cloud platforms (AWS) and container orchestration technologies (Kubernetes, Docker) for deploying and overseeing high-throughput workloads.

• Understanding of the non-technical aspects of internet-scale crawling, including data privacy, compliance with robots.txt, and ethical crawling practices.

Nice to have:

• Previous experience with pre-training large language models (LLMs).

• Experience in creating trillion-scale state-of-the-art pre-training datasets.

• Proven track record of translating research into scalable production implementations.


🏝️ Benefits

• Fully remote work & flexible hours.

• 37 days/year of vacation & holidays.

• 16 weeks of flexible, full-pay parental leave.

• Health insurance allowance for you & your dependents.

• Company-provided equipment.

• Well-being, continuous learning, & home office allowances.

• Frequent team get-togethers.

• Diverse & inclusive, people-first culture.
