
Data Scientist II – Big Data R&D, Identity Graph, KYC
Posted Jun 21

This is a fully remote position, open to applicants in California.
• Assist in the design and execution of machine learning, data mining, statistical, and graph-based algorithms to analyze extensive datasets for identity verification and anomaly detection.
• Examine large datasets to aid in the development and enhancement of entity-resolution and identity-matching algorithms that power Socure’s KYC and compliance solutions.
• Build and maintain components of data-processing pipelines (ETL, feature generation, normalization) using tools such as Spark/PySpark and AWS (e.g., EMR, S3).
• Provide support to senior data scientists with feature engineering, data exploration, error analysis, and A/B test configuration for new models and signals.
• Assist in assessing new third-party and internal data sources: evaluate data quality, design offline experiments, and summarize impacts on coverage and model performance.
• Develop and maintain SQL and Python/R scripts for data extraction, transformation, and validation; participate in code reviews and basic testing.
• Offer analytical support to compliance and regulatory product teams, including ad hoc investigations, simple dashboards, and in-depth data analyses.
• Present findings in a clear, structured way to colleagues and cross-functional partners (Product, Engineering, Client Analysis), emphasizing key insights and trade-offs.
• Thrive in a dynamic, cross-functional setting; take ownership of well-defined tasks and see them through to completion.
• Master’s degree with 2+ years of experience, or Ph.D. with 1+ years of experience in a data science or analytics role, or equivalent practical experience.
• Proficient in at least one general-purpose programming language utilized in data science (Python or Scala).
• Strong experience in writing and optimizing SQL for large datasets; comfortable working in data lake/warehouse environments.
• Hands-on experience with Spark or PySpark and common ML libraries (e.g., scikit-learn, XGBoost); experience with TensorFlow/PyTorch is an advantage.
• Familiar with UNIX environments and the AWS ecosystem (e.g., EMR, S3); experience with Databricks is a plus.
• Working knowledge of supervised/unsupervised machine learning and basic statistics (similarity measures, clustering, evaluation metrics).
• Exposure to graph techniques or graph databases (Neo4j, AWS Neptune, GraphFrames) is a significant advantage.
• Bonus: experience with Elasticsearch or DynamoDB; workflow tools such as Airflow for automating data pipelines.
• Capable of breaking down loosely defined problems, asking insightful clarifying questions, and iterating quickly on feedback.
• Offers Equity
• Offers Bonus