This is a fully remote position, open to applicants in Brazil.

📋 Description

• Design, develop, and manage the ingestion systems that handle substantial amounts of multimodal data, transforming it into structured, usable datasets.

• Take ownership of the entire ingestion process, from data acquisition to validation, processing, tracking, and making it accessible for downstream use.

• Create specific processing steps tailored for real-world data sources, including medical imaging processing, audio and video metadata extraction, quality validation, and notes processing.

• Develop parsers, validators, and normalization logic capable of systematically addressing messy, non-standard, and highly variable source formats.

• Turn recurring one-off data handling tasks into reusable processing patterns, internal tools, and platform functionality.

• Design systems for high volume and throughput, ensuring optimization for reliability, cost-effectiveness, and speed.

• Work across distributed and parallel computing systems to handle workloads that do not fit on a single machine.

• Select the appropriate execution model for each workload, including batch processing, distributed execution, and modern compute patterns for unstructured data and inference-heavy processing.

• Identify and resolve bottlenecks within the ingestion and processing systems, maintaining performance as volume and complexity of modalities increase.

• Implement validation and quality checks that detect poor, incomplete, or incorrectly formatted data before it propagates downstream.

• Manage sensitive and regulated data, including PHI, with the necessary security and diligence, including de-identification when required.

• Track provenance, metadata, and usage constraints throughout the ingestion process so that downstream usage remains compliant and auditable.

• Improve observability, debuggability, and operational reliability across the ingestion layer.

• Collaborate with product and Data Lab teams to accommodate new modalities, partner requirements, and non-standard data sources.

• Work directly with partner engineering teams as needed to translate source-system realities into effective ingestion and processing designs.

• Identify recurring patterns that can be standardized into reusable transforms, validators, and internal tools.

• Help shape how Protege handles new data types as the platform expands into more complex data environments.

⛳️ Requirements

• 5+ years of experience in building and operating production backend or data systems, with hands-on experience in large-scale data processing.

• Proven expertise in designing and managing large-scale data pipelines.

• Strong programming capabilities in Python.

• Experience with distributed data processing systems.

• High proficiency with AWS services.

• Ability to navigate messy, diverse, high-volume data and ambiguity, with a talent for identifying patterns in complex scenarios.

• Meticulous attention to detail without sacrificing speed, and a bias toward action.

• Enthusiastic about working on products focused on managing and processing large data volumes.

• Inquisitive, persistent, and self-motivated.

🏝️ Benefits

• Health insurance

• Professional development opportunities

• Flexible working hours

Senior Software Engineer, Data Processing
