This is a fully remote position, open to applicants in Colombia.

• Design and manage the cloud data infrastructure that supports AI projects.

• Create production-grade data lakes on AWS.

• Develop real-time data ingestion and monitoring pipelines.

• Take ownership of the vector search and embedding layers that support RAG systems and autonomous agents.

• Overall Experience: 7+ years in Data Engineering, Distributed Systems, or Data Architecture.

• AWS & Infrastructure: 4+ years in designing production-scale data lakes, storage tiers, and event streaming.

• AI/LLM Pipelines: 2+ years of experience in building RAG systems, managing embeddings, and orchestrating foundation models.

• Proficiency in AWS Data Lake Architecture & Storage.

• Proficiency in Real-Time Observability & Log Analytics.

• Proficiency in Elasticsearch & OpenSearch Optimization, Vectorization, and Embeddings.

• Proficiency in Amazon Bedrock & Generative AI Pipelines.

• Proficiency in Software Engineering & API Ingestion.

• Production-level proficiency in one or more of the following: C# (.NET Core), Java, Python, or Node.js.

• Familiarity with Amazon S3 partitioning strategies, lifecycle policies, and columnar file and table formats (Parquet, Iceberg).

• Experience with AWS Glue Data Catalog and Lake Formation for fine-grained, multi-tenant access control.

• Expertise in query optimization over petabyte-scale datasets using Amazon Athena and Redshift Spectrum.

• Configuration of distributed OpenTelemetry (OTel) collectors for log, trace, and metric capture and routing into S3.

• High-volume streaming of system logs, Datadog captures, and raw server events into S3.

• Real-time Change Data Capture (CDC) from PostgreSQL using Debezium or AWS DMS.

• Management of Amazon OpenSearch clusters that enable simultaneous lexical and high-dimensional vector search.

• Knowledge of OpenSearch index lifecycle management, sharding strategies, and dynamic mappings at scale.

• Familiarity with Amazon Bedrock foundation model APIs (Claude, Titan) for tasks such as data enrichment, classification, and semantic parsing.

• Understanding of Knowledge Bases for Amazon Bedrock for automatic chunking, metadata extraction, and syncing vector indexes from S3.

• Experience with ETL/ELT pipelines for ingesting unstructured event data from SaaS APIs (e.g., Pendo, Hotjar, Google Analytics).

• Development of MCP servers to provide data lake context and utilities for AI agents.

• Flexible remote work options.

• 13 floating holidays.

• 15 vacation days per year upon completion.

• Positive working environment.

Senior Data Engineer – AWS, RAG Pipelines
