This is a fully remote position, open to applicants in Poland.

📋 Description

• Take full ownership of the data architecture for the Training Environment: design datasets and schemas for all ML training pipelines, which include dialog corpora for LLM training, conversational steps for NLU models, annotated evaluation sets, and complete call recordings for speech-to-speech model development.

• Establish and manage data selection and sampling strategies: create criteria that identify which production conversations offer the greatest training value, incorporating diversity-optimized sampling, confidence-based filtering, prioritization of edge cases, and deduplication methods.

• Develop and sustain the data catalog and dataset discovery framework: enable ML engineers from LLM, NLU, Speech, and Agentic teams to easily locate, comprehend, and utilize training data.

• Design the annotation pipeline architecture: establish data labeling requirements, including intent annotation, entity tagging, dialog act classification, task completion scoring, and evaluation of agentic reasoning, for both internal and external annotators.

• Create the data flywheel architecture: a closed-loop system in which real customer conversations feed back into training data collection, curation, annotation, model retraining, and evaluation.

• Manage and maintain data pipelines and infrastructure that encompass Snowflake, AWS S3, ETL/ELT pipelines (Airflow), and their integration with ML training workflows on AWS SageMaker.

• Collaborate closely with LLM, NLU, and Agentic systems teams to grasp training data needs — understanding what conversational patterns enhance zero-shot routing accuracy, what dialog structures improve task planners, and which edge cases stress-test agentic reasoning — and convert these insights into specific dataset specifications and pipeline configurations.

• Define and uphold the data architecture for Omilia's Training Environment: design schemas, manage data flow from production (OCP) to centralized training infrastructure, develop a storage strategy (Snowflake + S3), ensure cross-pipeline consistency, and maintain clear, auditable data lineage, including anonymization requirements for compliance.

• Create data quality frameworks that directly enhance model outcomes: strategies for content-based deduplication, diversity-maximizing sampling, confidence-based filtering using NLU scores and behavioral signals, and focused extraction of NLU improvement corpora from low-confidence and no-match production data.

• Establish annotation requirements for ML model development — including intent labeling guidelines, entity tagging schemas, dialog act classification, task completion scoring, and reasoning quality assessments — while designing annotation workflows that yield consistent, high-quality labels at scale; oversee and evaluate external data annotation vendors.

• Build and maintain a data catalog that facilitates cross-team dataset discovery: document dataset contents, schemas, lineage, quality metrics, intended use cases, and known limitations; define taxonomy for organizing training datasets across model types (LLM, S2S, NLU, ASR, TTS, agentic).

• Architect the closed-loop data flywheel: production conversations → data selection → anonymization → curation → annotation → model training → evaluation → safe redeployment → back to production; establish feedback mechanisms to route model failure cases into targeted training data collection.

• Identify gaps in production training data and specify requirements for external data acquisition (public datasets, synthetic data generation, vendor-sourced corpora); devise data augmentation strategies for underrepresented languages, domains, or conversational patterns.

• Collaborate closely with LLM/NLU/S2S/ASR/TTS/VB Tech Leads and Senior Engineers to ensure alignment of data architecture with model training needs; work in conjunction with Platform Engineering, Security & Compliance, and Product Management stakeholders.

• Maintain thorough documentation of data architecture, dataset specifications, pipeline configurations, and the data catalog; produce data architecture RFCs for significant changes and share best practices with ML teams.

⛳️ Requirements

• 5+ years of experience in data architecture, data engineering, or LLM/ML data infrastructure, with proven accountability for production data systems supporting ML/AI model development.

• In-depth understanding of ML training data requirements — recognizing what constitutes high-quality, diverse, and beneficial training data for LLM and NLU model development, beyond mere cleanliness and structure.

• Extensive experience in data modeling, schema design, and data pipeline architecture.

• High proficiency with Snowflake, AWS S3, and ETL/ELT orchestration tools (Airflow, dbt, or similar).

• Experience in defining annotation requirements and managing data annotation workflows — including intent labeling, entity tagging, dialog classification, or similar NLP annotation tasks.

• Familiarity with data cataloging, metadata management, and dataset discovery at scale.

• Strong SQL and Python skills for developing data pipelines and conducting data quality analysis.

• Experience with data quality frameworks: deduplication, sampling strategies, diversity optimization.

• Desirable: hands-on experience with LLM training data preparation — including instruction tuning datasets, preference data, RLHF/DPO annotation, and synthetic data generation.

• Desirable: experience with data anonymization and PII/PCI redaction within ML data pipelines.

• Desirable: familiarity with AWS SageMaker ML pipeline integration and active learning/data selection strategies.

• Desirable: knowledge of voice/audio data handling, storage, and processing at scale.

• Excellent communication skills — capable of translating ML team data needs into specific pipeline specifications and clearly articulating data architecture decisions to both technical and compliance audiences.

• Strong cross-functional collaboration skills: a proven track record of effective work with ML engineers, platform teams, and product stakeholders.

• Analytical mindset with the ability to make informed trade-off decisions regarding data quality, diversity, and scale.

• Self-motivated ownership mentality: comfortable acting as the accountable technical owner of a crucial platform domain.

• Master's degree or PhD in Computer Science, Data Engineering, Information Systems, or a relevant field.

• Experience with conversational AI data (dialog transcripts, ASR outputs, NLU annotations) is a significant advantage.

• Experience with data governance in regulated industries (financial services, healthcare) is a plus.

• Familiarity with NER/NLU-based data processing approaches (spaCy, HuggingFace, custom entity recognition) is desirable.

🏝️ Benefits

• Fixed compensation;

• Long-term employment with vacation days;

• Opportunities for professional development (courses, training, etc.);

• Participation in the development of innovative technology products that make a global impact in the service industry;

• A team of skilled and friendly colleagues;

• Apple gear.

Senior Data Architect
