Data Engineering · Abacus Insights
Healthcare Data Standardization
Overview
Designed and maintained large-scale batch data pipelines in Databricks on AWS to standardize US healthcare datasets received from multiple payers. Data arrived in heterogeneous formats (HL7, X12 837/835, flat files) and required extensive cleaning, normalization, and schema enforcement before it could be used for analytics.
Implemented a Medallion architecture (Bronze / Silver / Gold) on Delta Lake, including SCD Type-2 change tracking for member and provider records, a data validation framework built on Great Expectations, and automated quality reporting. Orchestrated jobs using AWS Glue and Apache Airflow, with EventBridge triggers for near-real-time ingestion windows.
Published curated Gold tables to Snowflake for downstream analytics and reporting teams. Reduced data quality issues by 85% through layered validation and introduced a schema registry to prevent breaking changes from upstream feeds.
Technical Implementation
Bronze Layer — Raw Ingest
All source files land in S3 and are ingested as-is into Delta Lake Bronze tables with full audit columns (ingestion timestamp, source system, file hash). No transformations are applied at this stage, so raw fidelity is preserved.
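The audit columns can be illustrated with a minimal sketch in plain Python rather than the PySpark used in the pipeline. The function name `build_audit_record` and the field names are hypothetical, and SHA-256 is an assumed choice of hash; the source does not say which algorithm was used.

```python
import hashlib
from datetime import datetime, timezone


def build_audit_record(raw_bytes: bytes, source_system: str, file_path: str) -> dict:
    """Wrap an untouched file payload with Bronze-style audit columns.

    Hypothetical helper: field names and the SHA-256 choice are assumptions.
    """
    return {
        "file_path": file_path,
        "source_system": source_system,
        # The hash is computed over the raw bytes, so re-delivered
        # duplicates of the same file produce the same value.
        "file_hash": hashlib.sha256(raw_bytes).hexdigest(),
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        # The payload is stored exactly as received: no parsing, no cleanup.
        "raw_payload": raw_bytes,
    }


record = build_audit_record(b"ISA*00*...", "payer_a", "s3://bucket/claims/file.edi")
```

Because the hash depends only on the file contents, it can double as a deduplication key when the same file is dropped twice.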
Silver Layer — Cleanse & Normalise
PySpark jobs standardize member IDs, procedure/diagnosis codes (ICD-10, CPT, HCPCS), dates, and provider NPIs. SCD Type-2 tracking preserves the full change history of member and provider records. Great Expectations validation gates block data that fails its checks from being promoted to the next layer.
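The SCD Type-2 pattern can be sketched in plain Python as follows. This is an illustration of the technique, not the pipeline's Delta Lake merge logic; `MemberVersion`, `apply_scd2`, and the tracked attribute `plan_code` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class MemberVersion:
    member_id: str
    plan_code: str              # hypothetical tracked attribute
    valid_from: date
    valid_to: Optional[date]    # None means the version is still open
    is_current: bool


def apply_scd2(history: List[MemberVersion], member_id: str,
               plan_code: str, effective: date) -> List[MemberVersion]:
    """Return a new history with the incoming value applied as SCD Type-2."""
    result = []
    for row in history:
        if row.member_id == member_id and row.is_current:
            if row.plan_code == plan_code:
                # Nothing changed: keep the history exactly as it was.
                return list(history)
            # Close the old version instead of overwriting it.
            result.append(replace(row, valid_to=effective, is_current=False))
        else:
            result.append(row)
    # Open a new current version starting on the effective date.
    result.append(MemberVersion(member_id, plan_code, effective, None, True))
    return result


history = apply_scd2([], "M1", "GOLD", date(2024, 1, 1))
history = apply_scd2(history, "M1", "SILVER", date(2024, 6, 1))
```

After the second call the history holds two rows for `M1`: the closed `GOLD` version ending on 2024-06-01 and the open `SILVER` version, so point-in-time queries can still recover the earlier state.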
Gold Layer — Business Aggregates
Business-level aggregates (member months, claims roll-ups, risk scores) are pre-computed and written as optimized Delta tables. Z-ordering on common filter keys reduces query scan times.
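As an example of one such aggregate, the sketch below computes member months per plan in plain Python. It assumes that any calendar month touched by an inclusive coverage span counts as one member month; definitions vary between organizations, and the source does not state which one was used. The function names and the `(plan, start, end)` tuple layout are hypothetical.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, Tuple


def months_covered(start: date, end: date) -> int:
    """Count calendar months touched by an inclusive coverage span."""
    if end < start:
        return 0
    return (end.year - start.year) * 12 + (end.month - start.month) + 1


def member_months_by_plan(spans: Iterable[Tuple[str, date, date]]) -> Dict[str, int]:
    """Roll coverage spans up to total member months per plan."""
    totals: Dict[str, int] = defaultdict(int)
    for plan, start, end in spans:
        totals[plan] += months_covered(start, end)
    return dict(totals)


spans = [
    ("HMO", date(2024, 1, 15), date(2024, 3, 10)),   # Jan, Feb, Mar
    ("HMO", date(2023, 11, 1), date(2024, 2, 29)),   # Nov, Dec, Jan, Feb
    ("PPO", date(2024, 5, 1), date(2024, 5, 31)),    # May
]
totals = member_months_by_plan(spans)
# → {"HMO": 7, "PPO": 1}
```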
Snowflake Publishing & Orchestration
Gold tables sync to Snowflake via Delta Sharing or COPY INTO. Airflow DAGs orchestrate the end-to-end pipeline; EventBridge triggers intra-day reloads when upstream S3 drops arrive.
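The trigger side can be sketched as a small handler that inspects an S3 "Object Created" event delivered through EventBridge and decides which feed to reload. The event fields used here (`source`, `detail-type`, `detail.bucket.name`, `detail.object.key`) follow the shape of S3 notifications in EventBridge; the `payer=<id>/feed=<name>/...` key layout and the function name `reload_target` are assumptions made for illustration.

```python
from typing import Optional


def reload_target(event: dict) -> Optional[dict]:
    """Map an S3 'Object Created' event to the feed that should be reloaded.

    Returns None for events that should not trigger a reload.
    """
    if event.get("source") != "aws.s3" or event.get("detail-type") != "Object Created":
        return None
    detail = event.get("detail", {})
    bucket = detail.get("bucket", {}).get("name")
    key = detail.get("object", {}).get("key", "")
    # Hypothetical key layout: payer=<id>/feed=<name>/<date>/<file>
    parts = dict(p.split("=", 1) for p in key.split("/") if "=" in p)
    if not bucket or "payer" not in parts or "feed" not in parts:
        return None
    return {"bucket": bucket, "key": key,
            "payer": parts["payer"], "feed": parts["feed"]}


event = {
    "source": "aws.s3",
    "detail-type": "Object Created",
    "detail": {
        "bucket": {"name": "landing-bucket"},
        "object": {"key": "payer=acme/feed=claims_837/2024-06-01/file.edi"},
    },
}
target = reload_target(event)
```

The resulting dictionary could then be passed as run configuration to the Airflow DAG that handles the reload; how the pipeline actually wires the two together is not described in the source.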