
Lead Data Engineer, AI
Posted 6 days ago

This is a fully remote position, open to applicants in India.
• Develop, test, and maintain production pipelines (both batch and real-time) using Snowflake, PySpark, Delta Lake, and Kafka.
• Establish data quality verification, schema validation, and alerting mechanisms at each stage of the pipeline.
• Transition legacy ETL/DWH systems to cloud-native AWS/Azure architectures while achieving measurable reductions in latency and costs.
• Oversee CI/CD pipelines, including automated testing, deployment, rollback, and Infrastructure as Code (IaC) using Terraform and GitHub Actions.
• Build end-to-end retrieval infrastructure: document ingestion, embedding pipelines, vector store management (Pinecone, FAISS, ChromaDB, OpenSearch), and hybrid retrieval layers.
• Implement chunking, metadata filtering, and re-ranking, optimizing for precision, recall, and latency.
• Ensure data freshness and index consistency; instrument with metrics for context relevance and faithfulness.
• Develop and maintain business entity mappings, ontologies, and knowledge graphs (Neo4j) following Architect design specifications.
• Construct and version the feature store and semantic data contracts that support both ML models and LLM applications.
• Oversee metadata management, data lineage, and audit trail instrumentation throughout the platform.
• Build ML data infrastructure encompassing training curation, feature engineering, MLflow experiment tracking, and dataset versioning.
• Facilitate LLM fine-tuning workflows that include corpus curation, quality filtering, and dataset formatting.
• Establish automated evaluation pipelines for factual accuracy, hallucination detection, and regression tracking.
• Maintain production monitoring dashboards to oversee pipeline health, model metrics, and alerting systems.
• Develop and maintain data APIs, tool schemas, and memory/state stores that autonomous agents rely on.
• Implement agent observability: capturing inputs, retrieved context, tool calls, reasoning traces, and outputs.
• Maintain text-to-SQL layers, semantic query interfaces, and context APIs tailored for conversational AI consumers.
• Enforce RBAC, attribute-based access, PII detection/masking, data classification, and audit logging protocols.
• Uphold data contracts and schema governance with automated detection of breaking changes and versioned migrations.
• Establish data quality monitoring (completeness, freshness, consistency) with automated alerts and root-cause analysis tools.
• Support compliance readiness through audit trails, data provenance, and regulatory documentation.
• More than 7 years of data engineering experience using cloud services.
• At least 2 years in production AI/ML or LLM-era data infrastructure, with proven experience building production pipelines at scale (both batch and streaming) on Snowflake and AWS/Azure.
• Deep expertise in Python, PySpark, Snowflake, Delta Lake, Kafka, and Spark Structured Streaming.
• Practical experience with vector stores, embedding pipelines, and retrieval infrastructure within production RAG environments.
• Familiarity with MLOps practices: MLflow, CI/CD for AI, automated evaluation, and production monitoring.
• Strong foundation in data governance, quality frameworks, and compliance-aligned engineering practices.
• Health insurance
• 401(k) matching
• Flexible work hours
• Paid time off
• Remote work options
Aimpoint Digital