This is a fully remote position, open to applicants in the United States.

📋 Description

• Collaborate with model researchers to establish the definition of “good data” for our models, which includes quality metrics, validation checks, and acceptance thresholds.

• Investigate open source datasets and develop internal datasets that are most appropriate for constructing fundamental World Models.

• Create algorithms for automated evaluation of data quality, data domain mixtures, and the adaptation of synthetic data to real data.

• Track datasets, metadata, provenance, and versions so that experiments are reproducible and it is clear which data was used in each training and evaluation run.

• Oversee CI/CD and development tools for the data stack (GitHub, Python, PyTorch), and automate repetitive tasks to streamline workflows.

• Evaluate and enhance throughput, storage, and compute utilization across pipelines and associated assets.

⛳️ Requirements

• Strong foundational knowledge in ML and deep learning, with experience in building and managing large-scale data and/or computing systems.

• Comfortable moving between research questions and production engineering: able to explore data, run analyses, and deploy reliable systems.

• Proven research experience related to data composition, data quality, and dataset releases.

• Skill in designing and executing experiments that yield convincing and unbiased results.

• Practical experience with distributed processing and orchestration tools (such as Spark, Ray, Airflow, or similar alternatives).

• Proficient in Python and familiar with the tools used in contemporary model training workflows (datasets, checkpoints, experiment tracking).

• Strong understanding of data quality: how to measure it, monitor it, and prevent regressions as systems scale.

• Capable of thriving in a dynamic environment, prioritizing effectively, and communicating clearly with both researchers and engineers.

• Bonus: experience with large video datasets, dataset curation for training purposes, or developing internal tools for evaluation/analysis in ML environments.

🏝️ Benefits

• Flexible work arrangements

Technical Staff Member – Data Intelligence

