
Senior Data Engineer – Databricks SME
Posted May 9

This is a fully remote position, open to applicants in North Carolina.
• Design, create, and sustain scalable data ingestion pipelines to integrate structured, semi-structured, and unstructured data from various sources (e.g., APIs, databases, flat files, message queues) into the Azure/Databricks environment.
• Implement de-duplication strategies for large-scale datasets utilizing both deterministic and probabilistic matching methods to maintain data integrity and minimize redundancy within the Data Lake.
• Develop and enforce data tagging frameworks to classify, label, and annotate datasets with the correct metadata (e.g., sensitivity, source, domain, lineage) to meet data governance, discoverability, and compliance obligations.
• Assist in operationalizing deployments and supporting Cloud services for ETL Operations. This involves standardizing and automating processes and workflows, creating documentation/knowledge articles, and supporting Operations staff who may have limited Cloud experience.
• Deliver written and oral presentations to senior CIO management regarding the status of current initiatives.
• Possess skills and experience related to business management, systems engineering, operations research, and management engineering.
• Typically specialize in a specific technology or business application.
• Stay informed about technological advancements and industry trends.
• Assist with the deployment, configuration, and management of the Azure Cloud environment.
• Aid in the migration of existing ETL jobs into the Azure/Databricks cloud environment.
• Capable of sharing optimization strategies and efficiencies with the broader team and management.
• Able to automate solutions for repetitive challenges/tasks.
• A degree from an accredited College/University in a relevant field is required.
• 13+ years of overall IT experience.
• 5+ years of proven experience designing and implementing data ingestion pipelines using tools such as Azure Data Factory, Apache Kafka, Apache NiFi, Spark Structured Streaming, or equivalent technologies.
• 5+ years of experience in applying de-duplication techniques at scale, including record linkage, fuzzy matching, and entity resolution across both structured and unstructured datasets.
• 5+ years of practical experience with data tagging and metadata management, encompassing the use of tagging schemas, data catalogs (e.g., Azure Purview, Apache Atlas), and automated classification tools to support data governance and lineage tracking.
• 5+ years of demonstrated experience working with unstructured data.
• 2+ years of experience using Databricks or other Spark-based platforms.
• Proficiency in at least one scripting language (Python, Perl, Ruby, or equivalent).
• Experience with one or more of the following is a plus: SAS, C++, Hadoop, SQL Database/Coding, Teradata, Oracle, Amazon S3, Apache Spark, Machine Learning, Natural Language Processing, and visualization tools such as Tableau, Strategy, and Qlik.
• Knowledge of Git integration in continuous deployment and experience with DevOps monitoring tools is a plus.
• Familiarity with Cloud Operations support in Azure is advantageous.
• Excellent communication skills are essential.
• Must be able to obtain a Position of Public Trust Clearance.
• Must be a US Citizen or hold US Permanent Residence status (Green Card).
• Must have resided in the US for the past 5 years and not have traveled outside the US for a cumulative total of 6 months or more during the last 5 years.
• Health insurance
• Retirement plans
• Paid time off