This is a fully remote position, open to applicants in Brazil.

📋 Description

• You will design and evolve the data lake, the company's data backbone: the core system that supports, in real time, the dynamic pricing engine, machine learning models, and the group's business intelligence.

• This is an ownership role: you will define the multi-tenant Lakehouse architecture, from streaming to the semantic layer, and keep it reliable, governed, and cost-effective.

• Develop and improve the data lake using Apache Iceberg over S3, with well-defined layers, partitioning and compaction, time travel, and DELETE/UPDATE support for compliance with LGPD (the Brazilian data protection law).

• Build real-time ingestion pipelines (Kafka, Flink, CDC with Debezium) with managed schema evolution (Schema Registry) and delivery guarantees.

• Design the transformation layer in dbt and coordinate batch and quality workflows in Airflow, spanning from crawler to backfill.

• Maintain metric definitions in Cube.js, the single source that powers BI and AI agents, keeping metrics consistent across the organization.

• Run federated, low-latency OLAP queries over the lake while keeping cost and access isolated per tenant.

• Guarantee data testing, lineage tracking, and cost efficiency, ensuring the platform remains reliable as it scales.

⛳️ Requirements

• Proficient in SQL with expertise in query optimization within distributed environments (Minimum 5 years).

• Experience in Python, particularly with PySpark or distributed processing.

• Knowledge of orchestration (Airflow), ELT processes, and dbt implemented at scale (Minimum 4 years).

• Familiarity with streaming technologies (Kafka, Flink) and Lakehouse architectures built on Apache Iceberg (Minimum 3 years).

• Strong grasp of data governance, quality assurance, and data modeling practices.

• Comfortable working with AI-assisted development tools (e.g., Claude Code).

• Experience with CDC (Debezium) and low-latency OLAP systems (ClickHouse, Pinot, Trino/Athena).

• Knowledge of semantic layers (Cube.js, dbt) and Data Mesh architectures.

• Familiarity with governance and cataloging tools (OpenMetadata, Lake Formation).

• Experience with vector databases (Qdrant) and data pipelines for machine learning.

🏝️ Benefits

• Remote work

• Project duration: 6 months, with the potential for extension or conversion to permanent employment.

Senior Data Engineer
