This is a fully remote position, open to applicants in India.

• Oversee the design and execution of data pipelines that generate high-quality training data for AI models.

• Develop data curation workflows that convert raw enterprise data into labeled and validated datasets.

• Create frameworks for data quality including validation, profiling, anomaly detection, and lineage tracking.

• Enhance current anonymized data export pipelines to accommodate AI training workloads.

• Establish pipelines for synthetic data generation.

• Design schema mappings across more than 197 enterprise tables for feature extraction.

• Work closely with ML engineers to clarify training data format requirements.

• Set up a data catalog and manage metadata for AI training artifacts.

• Over 10 years of experience in software engineering, with at least 5 years focused on data engineering.

• Extensive experience with Apache Spark / PySpark and large-scale data processing.

• Proven track record in building ETL/ELT pipelines in cloud environments (managed Spark, object storage, managed ETL, or similar).

• Familiarity with data quality frameworks and data governance practices.

• Experience in data anonymization and privacy-preserving data processing techniques.

• Solid understanding of ML training data requirements.

• Proficient in Python and SQL.

• Experience with data cataloging tools and metadata management systems.

• Bachelor’s or Master’s degree in Computer Science or equivalent experience.

• Experience in B2B SaaS environments with multi-tenant data preferred.

• Cutting-edge Technology

• Supportive and Collaborative Work Culture

• Opportunity for Global Impact

Senior Lead AI Engineer, Data
