
Senior Data Architect
Posted 6 days ago

This is a fully remote position, open to applicants in Poland.
• Take full ownership of the data architecture for the Training Environment: design datasets and schemas for all ML training pipelines, which include dialog corpora for LLM training, conversational steps for NLU models, annotated evaluation sets, and complete call recordings for speech-to-speech model development.
• Establish and manage data selection and sampling strategies: create criteria that identify which production conversations offer the greatest training value, incorporating diversity-optimized sampling, confidence-based filtering, prioritization of edge cases, and deduplication methods.
• Develop and sustain the data catalog and dataset discovery framework: enable ML engineers from LLM, NLU, Speech, and Agentic teams to easily locate, comprehend, and utilize training data.
• Design the annotation pipeline architecture: establish data labeling requirements including intent annotation, entity tagging, dialog act classification, task completion scoring, and evaluations of agentic reasoning across internal and external annotators.
• Create the data flywheel architecture: a closed-loop system where actual customer conversations are reintegrated into the training data collection, curation, annotation, model retraining, and evaluation processes.
• Manage and maintain data pipelines and infrastructure that encompass Snowflake, AWS S3, ETL/ELT pipelines (Airflow), and their integration with ML training workflows on AWS SageMaker.
• Collaborate closely with LLM, NLU, and Agentic systems teams to understand their training data needs (which conversational patterns improve zero-shot routing accuracy, which dialog structures improve task planners, and which edge cases stress-test agentic reasoning) and convert these insights into specific dataset specifications and pipeline configurations.
• Define and uphold the data architecture for Omilia's Training Environment: design schemas, manage data flow from production (OCP) to centralized training infrastructure, develop a storage strategy (Snowflake + S3), ensure cross-pipeline consistency, and maintain clear, auditable data lineage, including anonymization requirements for compliance.
• Create data quality frameworks that directly enhance model outcomes: strategies for content-based deduplication, diversity-maximizing sampling, confidence-based filtering using NLU scores and behavioral signals, and focused extraction of NLU improvement corpora from low-confidence and no-match production data.
• Establish annotation requirements for ML model development — including intent labeling guidelines, entity tagging schemas, dialog act classification, task completion scoring, and reasoning quality assessments — while designing annotation workflows that yield consistent, high-quality labels at scale; oversee and evaluate external data annotation vendors.
• Build and maintain a data catalog that facilitates cross-team dataset discovery: document dataset contents, schemas, lineage, quality metrics, intended use cases, and known limitations; define taxonomy for organizing training datasets across model types (LLM, S2S, NLU, ASR, TTS, agentic).
• Architect the closed-loop data flywheel: production conversations → data selection → anonymization → curation → annotation → model training → evaluation → safe redeployment → back to production; establish feedback mechanisms to route model failure cases into targeted training data collection.
• Identify gaps in production training data and specify requirements for external data acquisition (public datasets, synthetic data generation, vendor-sourced corpora); devise data augmentation strategies for underrepresented languages, domains, or conversational patterns.
• Collaborate closely with LLM/NLU/S2S/ASR/TTS/VB Tech Leads and Senior Engineers to ensure alignment of data architecture with model training needs; work in conjunction with Platform Engineering, Security & Compliance, and Product Management stakeholders.
• Maintain thorough documentation of data architecture, dataset specifications, pipeline configurations, and the data catalog; produce data architecture RFCs for significant changes and share best practices with ML teams.
• 5+ years of experience in data architecture, data engineering, or LLM/ML data infrastructure, with proven accountability for production data systems supporting ML/AI model development.
• In-depth understanding of ML training data requirements — recognizing what constitutes high-quality, diverse, and beneficial training data for LLM and NLU model development, beyond mere cleanliness and structure.
• Extensive experience in data modeling, schema design, and data pipeline architecture.
• High proficiency with Snowflake, AWS S3, and ETL/ELT orchestration tools (Airflow, dbt, or similar).
• Experience in defining annotation requirements and managing data annotation workflows — including intent labeling, entity tagging, dialog classification, or similar NLP annotation tasks.
• Familiarity with data cataloging, metadata management, and dataset discovery at scale.
• Strong SQL and Python skills for developing data pipelines and conducting data quality analysis.
• Experience with data quality frameworks: deduplication, sampling strategies, diversity optimization.
• Desirable: hands-on experience with LLM training data preparation — including instruction tuning datasets, preference data, RLHF/DPO annotation, and synthetic data generation.
• Desirable: experience with data anonymization and PII/PCI redaction within ML data pipelines.
• Desirable: familiarity with AWS SageMaker ML pipeline integration and active learning/data selection strategies.
• Desirable: knowledge of voice/audio data handling, storage, and processing at scale.
• Excellent communication skills — capable of translating ML team data needs into specific pipeline specifications and clearly articulating data architecture decisions to both technical and compliance audiences.
• Strong cross-functional collaboration skills: a proven track record of effective work with ML engineers, platform teams, and product stakeholders.
• Analytical mindset with the ability to make informed trade-off decisions regarding data quality, diversity, and scale.
• Self-motivated ownership mentality: comfortable acting as the accountable technical owner of a crucial platform domain.
• Master's degree or PhD in Computer Science, Data Engineering, Information Systems, or a relevant field.
• Experience with conversational AI data (dialog transcripts, ASR outputs, NLU annotations) is a significant advantage.
• Experience with data governance in regulated industries (financial services, healthcare) is a plus.
• Familiarity with NER/NLU-based data processing approaches (spaCy, HuggingFace, custom entity recognition) is desirable.
• Fixed compensation;
• Long-term employment with vacation days;
• Opportunities for professional development (courses, training, etc.);
• Participation in the development of innovative technology products making a global impact in the service industry;
• A team of skilled, friendly colleagues;
• Apple gear.