This is a fully remote position, open to applicants in the United States.

📋 Description

• Collaborate with senior engineers to create new ETL pipelines and data ingestion processes utilizing AWS Glue (Spark-based, PySpark), MWAA (Airflow), Lambda, and SNS.

• Incorporate the agency's ETL Common Library into Glue jobs to standardize orchestration, manage error handling, record metadata, and send SNS notifications for all successful and failed job events.

• Ingest structured and semi-structured datasets (CSV, XML, JSON, Avro, pipe-delimited) into S3 landing, raw, and curated zones using Apache Iceberg tables.

• Set up static ETL metadata in the centralized PostgreSQL metadata store; ensure that dynamic metadata captures job status and timestamps for all crucial execution steps.

• Oversee assigned production jobs and engage in operations support rotations.

• Ensure that ETL Load Reports are updated in real time and ETL Gap Reports are refreshed weekly.

• Create and sustain materialized views and semantic layer objects in Trino and Athena to enhance query performance and maintain consistent business logic.

• Generate and keep up-to-date required documentation for each assigned dataset: Business Requirements, ETL Design Documents, Data Models, Data Dictionaries, Mapping Documents, Deployment Documents, O&M Guides, and ETL Test Plans.

• Develop unit and integration tests to meet a minimum code coverage threshold of 90%; conduct security scans at least once per sprint.

• Deploy ETL resources using CloudFormation templates via the agency's CI/CD pipeline.

• Assist in the transition of ETL jobs from other agency teams and participate in disaster recovery exercises.

⛳️ Requirements

• US Citizenship is mandatory.

• A Bachelor's Degree is required.

• Three to five years of relevant experience is required.

• Practical experience with Python (PEP 8), PySpark, and SQL for ETL pipeline development.

• Familiarity with AWS services, including Glue, S3, MWAA (Airflow), Lambda, SNS, and SQS.

• Knowledge of Apache Iceberg, Parquet, and ORC file formats, as well as S3 data lake zone concepts.

• Experience with PostgreSQL and basic knowledge of Redshift or Oracle.

• Understanding of Trino or Athena for query and semantic layer development.

• Experience with CloudFormation, GitHub branching workflows, and CI/CD-integrated deployments.

• Ability to create comprehensive ETL documentation, including data models (in Mermaid format) and data dictionaries.

• Understanding of ETL metadata concepts, including static and dynamic metadata, load reports, and gap reports.

• Experience in agile development settings with sprint-based delivery.

• Familiarity with IV&V and/or User Acceptance Testing (UAT) processes in a federal or technical program environment.

• Experience with automated testing frameworks; capability to write unit and integration tests that meet defined code coverage thresholds.

• Knowledge of FISMA, NIST 800-53, and OWASP ASVS Level 2 is a plus.

• Availability to work from 8 am to 5 pm Eastern Time, regardless of home location.

• An active federal public trust suitability determination or the ability to obtain one is required.

🏝️ Benefits

• Flexible work arrangements.

• Continuous learning opportunities.

• Professional development support.

• Special incentives for team members residing in qualified HUBZones.

Mid-Level Data Engineer
