Remotery

Senior Machine Learning Engineer – Multimodal Data

Posted May 19

This is a fully remote position, open to applicants in Austria.

📋 Description

• Design and develop data pipelines for agent training, including collection, filtering, deduplication, formatting, and versioning across text, image, and multimodal data sources.

• Construct and oversee infrastructure to ensure scalable data loading, storage, and retrieval (S3, distributed systems, streaming pipelines).

• Work collaboratively with research scientists to convert research needs into specific data specifications, adapting as experiments uncover new requirements.

• Generate evaluation datasets and benchmarks in partnership with researchers, curating task distributions that highlight actual failure modes.

• Create tools for dataset construction, which includes human annotation workflows, synthetic data generation, and preference data collection for RLHF/DPO-style training.

• Take ownership of data quality by developing validation frameworks, monitoring for drift and contamination, and setting standards that ensure datasets are reliable and reproducible.

• Thoroughly document datasets, including their provenance, known limitations, intended use cases, and versioning history.

• Implement extensive test coverage for data pipelines and ML workflows to guarantee reliability and identify regressions early.

• Enhance codebase quality through code reviews, refactoring, and establishing engineering best practices that promote sustainable research velocity.

• Contribute to team roadmaps by identifying data bottlenecks and suggesting solutions that facilitate research progress.


⛳️ Requirements

• Strong software engineering skills in Python, with experience in building production-grade data pipelines and ML DevOps.

• Practical expertise in prompt engineering, including designing, testing, and refining prompts for consistent LLM/VLM outputs.

• Experience with ML data workflows, including large-scale data processing and loading (Ray or similar), data versioning, and training format considerations (tokenization, batching, sharding).

• Hands-on experience with data pipelines for large-scale distributed ML training runs.

• Familiarity with annotation tools and human-in-the-loop data collection (such as Label Studio or internal systems).

• Strong understanding of ML training requirements, recognizing what constitutes "good data" for LLM/VLM fine-tuning and anticipating potential downstream issues.

• Experience reading and writing large datasets to and from cloud infrastructure (AWS) and distributed storage systems.

• Excellent communication skills, enabling effective collaboration with researchers to define ambiguous problems and translate needs into actionable plans.

• A collaborative mindset, with comfort taking ownership and iterating quickly.


🏝️ Benefits

• Equity packages - we want our success to be yours too.

• An inclusive parental leave policy that supports all parents and caregivers.

• An annual Vibe & Thrive allowance to enhance your wellbeing, social connections, office setup, and more.

• Flexible leave options that empower you to contribute positively, take time to recharge, and support your personal needs.
