
Data Engineer
Posted 22 hours ago

This is a fully remote position, open to applicants in Poland.
• Recreate a comprehensive descriptive-statistics report from start to finish, ensuring that every figure can be traced back to its original source, addressing the gaps acknowledged by the client (data points they currently cannot validate).
• Analyze and reconcile varying source schemas across acquired organizations: align different field names, types, encodings, and business definitions for the same concept into a unified model.
• Develop dbt models for staging, intermediate, and mart layers with testing; codify the harmonized definitions as specified by the Data Science Lead.
• Create Great Expectations suites (null, range, uniqueness, referential checks) and integrate them into the pipeline to ensure that erroneous data fails loudly, preventing silent corruption of analysis.
• Execute entity and identity resolution (both deterministic and fuzzy matching) in cases where there is no clean shared key for the same customer or account across different sources.
• Implement and validate anonymization and pseudonymization techniques (hashing, tokenization, k-anonymity) and provide evidence that re-identification risk is managed for the client's IT and compliance teams.
• Optimize Spark and Glue jobs handling tens of millions of rows, focusing on partitioning, file formats (Parquet), incremental loads, and cost management.
• Coordinate with Airflow and Step Functions; establish repeatable, scheduled pipelines instead of one-off scripts.
• Prepare clean, documented, and feature-ready datasets for the PD and delinquency models.
• Document runbooks to enable the offshore team to manage the pipelines, ensuring that handover processes take days rather than weeks; assist in scoping the onboarding of remaining sources (Ireland and additional sources).
• Over 4 years of experience in data engineering, with a strong focus on AWS and Spark/SQL at scale.
• Proven track record in harmonizing and integrating data across multiple source systems.
• Experience in building validated, reproducible pipelines within regulated environments (BFSI, healthcare, government) is a significant advantage.
• Comfortable working within a complex, partially constructed data landscape and enhancing it to meet standards.
• Able to operate as the sole or lead data engineer within a small delivery team (3-4 members).
• Preference for full-time engagement.