This is a fully remote position, open to applicants in Brazil.

📋 Description

• Design, construct, and manage the data infrastructure that supports AI and analytics projects.

• Develop the essential data layer for LLM applications, RAG systems, and AI-driven products, in addition to traditional data pipelines and analytics frameworks.

• Oversee the entire data lifecycle: from ingestion and transformation to quality, governance, and serving, with a focus on modern data patterns necessary for contemporary AI systems.

• Build and maintain vector databases and RAG infrastructure, design high-performance ETL/ELT pipelines, and ensure data quality at every stage.

• Empower AI engineers, data scientists, and business analysts to develop and implement AI-driven solutions with confidence in the data framework.

• Design and construct scalable, fault-tolerant data pipelines for both batch and real-time/streaming workloads;

• Implement modern ELT methodologies using dbt, Spark, or Dataflow for transformations within cloud data warehouses;

• Develop data ingestion pipelines from various sources: APIs, databases, SaaS platforms, file systems, event streams, and document repositories;

• Implement incremental processing, CDC (Change Data Capture), and event-driven pipeline architectures to ensure near-real-time data accessibility;

• Design pipeline orchestration utilizing Apache Airflow, Prefect, Dagster, or cloud-native workflow services;

• Build and maintain data contracts between producers and consumers to guarantee schema stability and backward compatibility.

• Design, deploy, and optimize vector database infrastructure for AI applications: Pinecone, Weaviate, ChromaDB, pgvector, Qdrant, or Milvus;

• Create document ingestion and processing pipelines for RAG: document parsing (PDF, DOCX, HTML, images), chunking strategies (semantic, recursive, sentence-window), and metadata enrichment;

• Implement and enhance embedding generation pipelines using models from OpenAI, Cohere, Voyage AI, or open-source alternatives (BAAI/bge, Nomic);

• Design hybrid search architectures that combine dense vector search with sparse retrieval (BM25) and metadata filtering for optimal RAG performance;

• Build and maintain knowledge base management systems: versioned document corpora, incremental indexing, and stale content detection;

• Implement RAG evaluation frameworks: retrieval accuracy metrics (MRR, NDCG, Hit Rate), context relevance scoring, and comprehensive RAG benchmarks.

• Design and implement thorough data quality frameworks: validation rules, anomaly detection, freshness monitoring, and schema enforcement;

• Construct data quality pipelines using Great Expectations, Soda, dbt tests, or Monte Carlo for automated data validation at every stage of the pipeline;

• Implement data lineage tracking and impact assessments throughout the data platform;

• Design and enforce data governance policies: access control, data classification, PII detection and masking, and retention strategies;

• Develop data catalogs and discovery tools that facilitate self-service data access for AI engineers and analysts;

• Monitor and alert on data quality SLAs: completeness, accuracy, timeliness, and consistency.

• Design and maintain the core data platform architecture using cloud-native services (AWS, Azure, GCP) — optimizing for cost, performance, and reliability;

• Create and manage data lake/data lakehouse architectures using Delta Lake, Apache Iceberg, or Apache Hudi on cloud object storage;

• Implement data warehouse solutions with Snowflake, Databricks, BigQuery, or Redshift — ensuring proper partitioning, clustering, and materialization strategies;

• Design data serving layers for a variety of consumers: low-latency APIs (feature stores), analytical dashboards, AI model training, and RAG retrieval;

• Implement observability for the data platform: pipeline monitoring, cost tracking, performance dashboards, and capacity planning;

• Develop self-service data infrastructure patterns that allow other teams to create and manage their own data pipelines with appropriate guardrails.

• Build and maintain feature stores for ML model training and serving: offline (batch) and online (real-time) feature computation and storage;

• Design data pipelines for ML workflows: training data preparation, validation sets, evaluation datasets, and model monitoring data;

• Implement data versioning and reproducibility for ML experiments using DVC, LakeFS, or Delta Lake time travel;

• Create feedback loop infrastructure: capturing AI model predictions, user interactions, and ground truth labels for ongoing model enhancement;

• Design and implement data infrastructure for monitoring AI models: input drift detection, output quality monitoring, and population stability metrics.

⛳️ Requirements

• 6+ years of experience in data engineering, including at least 2 years focused on data infrastructure for AI/ML systems;

• Advanced Python skills and strong SQL expertise across various database engines;

• Production experience with the modern data stack: dbt, Spark (PySpark), Airflow/Prefect/Dagster, and cloud data warehouses (Snowflake, Databricks, BigQuery);

• Practical experience with vector databases (Pinecone, Weaviate, ChromaDB, pgvector) and the development of RAG data pipelines;

• Experience in creating data pipelines on at least one major cloud platform: AWS (S3, Glue, Redshift, EMR), Azure (ADLS, Synapse, Data Factory), or GCP (BigQuery, Dataflow, Dataproc);

• Strong grasp of data modeling: dimensional modeling (Kimball), data vault, and modern analytical modeling patterns;

• Experience with data quality frameworks and tools: Great Expectations, Soda, dbt tests, or equivalent;

• Comprehensive understanding of data governance: access control, PII handling, encryption both at rest and in transit, and compliance requirements;

• Familiarity with version control (Git), CI/CD for data pipelines, and infrastructure-as-code;

• Proficiency in English, both written and spoken;

• Demonstrated experience in international projects, including collaboration with global and multicultural teams;

• Previous experience mentoring engineers or serving as a technical lead is strongly preferred;

• Excellent communication, stakeholder management, and problem-solving abilities.

• Experience building feature stores for ML: Feast, Tecton, Hopsworks, or custom implementations;

• Knowledge of data lakehouse architectures: Delta Lake, Apache Iceberg, Apache Hudi;

• Experience with streaming data infrastructure: Apache Kafka, Flink, Spark Structured Streaming, or Kinesis;

• Understanding of embedding models and vector search optimization: index types (HNSW, IVF), quantization, and hybrid search techniques;

• Experience in insurance, financial services, or healthcare data — including knowledge of regulatory compliance (GDPR, CCPA, SOX, HIPAA);

• Familiarity with data observability platforms: Monte Carlo, Bigeye, Metaplane, or custom observability solutions;

• Experience with graph databases (Neo4j, Amazon Neptune) for knowledge graph applications in AI;

• Knowledge of document processing pipelines: PDF parsing (PyPDF, Unstructured.io), OCR, and layout analysis;

• Familiarity with LLM-specific data patterns: prompt/completion logging, token usage analytics, and AI cost attribution.

• DevOps Experience | All team members are expected to have hands-on experience with CI/CD pipelines, containerization (Docker/Kubernetes), cloud platforms, and deployment automation.

• Infrastructure as Code | Proficiency with at least one IaC toolchain (Terraform, Pulumi, CloudFormation/Bicep) is required across all roles — not just DevOps.

• Cloud Platforms | Basic knowledge of at least one major cloud provider (AWS, Azure, or GCP).

• Version Control & Collaboration | Git-based workflows, code review practices, and collaborative development are expected of every team member.

🏝️ Benefits

• 100% Remote

• Flexible working hours

Senior Data Engineer
