This is a fully remote position, open to applicants in India.

📋 Description

• Develop, test, and maintain production pipelines (both batch and real-time) using Snowflake, PySpark, Delta Lake, and Kafka.

• Establish data quality verification, schema validation, and alerting mechanisms at each stage of the pipeline.

• Transition legacy ETL/DWH systems to cloud-native AWS/Azure architectures while achieving measurable reductions in latency and costs.

• Oversee CI/CD pipelines, including automated testing, deployment, rollback, and Infrastructure as Code (IaC) using Terraform and GitHub Actions.

• Build end-to-end retrieval infrastructure: document ingestion, embedding pipelines, vector store management (Pinecone, FAISS, ChromaDB, OpenSearch), and hybrid retrieval layers.

• Implement chunking, metadata filtering, and re-ranking, optimizing for precision, recall, and latency.

• Ensure data freshness and index consistency; instrument with metrics for context relevance and faithfulness.

• Develop and maintain business entity mappings, ontologies, and knowledge graphs (Neo4j) following the Architect's design specifications.

• Construct and version the feature store and semantic data contracts that support both ML models and LLM applications.

• Oversee metadata management, data lineage, and audit trail instrumentation throughout the platform.

• Build ML data infrastructure encompassing training curation, feature engineering, MLflow experiment tracking, and dataset versioning.

• Facilitate LLM fine-tuning workflows that include corpus curation, quality filtering, and dataset formatting.

• Establish automated evaluation pipelines for factual accuracy, hallucination detection, and regression tracking.

• Maintain production monitoring dashboards to oversee pipeline health, model metrics, and alerting systems.

• Develop and maintain the data APIs, tool schemas, and memory/state stores that autonomous agents rely on.

• Implement agent observability: capturing inputs, retrieved context, tool calls, reasoning traces, and outputs.

• Maintain text-to-SQL layers, semantic query interfaces, and context APIs tailored for conversational AI consumers.

• Enforce RBAC, attribute-based access, PII detection/masking, data classification, and audit logging protocols.

• Uphold data contracts and schema governance with automated detection of breaking changes and versioned migrations.

• Establish data quality monitoring (completeness, freshness, consistency) with automated alerts and root-cause analysis tools.

• Support compliance readiness through audit trails, data provenance, and regulatory documentation.

⛳️ Requirements

• More than 7 years of data engineering experience using cloud services.

• A minimum of 2 years in production AI/ML or LLM-era data infrastructure, with proven experience building production pipelines at scale (both batch and streaming), specifically with Snowflake and AWS/Azure.

• Deep expertise in Python, PySpark, Snowflake, Delta Lake, Kafka, and Spark Structured Streaming.

• Practical experience with vector stores, embedding pipelines, and retrieval infrastructure within production RAG environments.

• Familiarity with MLOps practices: MLflow, CI/CD for AI, automated evaluation, and production monitoring.

• Strong foundation in data governance, quality frameworks, and compliance-aligned engineering practices.

🏝️ Benefits

• Health insurance

• 401(k) matching

• Flexible work hours

• Paid time off

• Remote work options

Lead Data Engineer, AI
