Medallion Architecture
Medallion architecture is a data design pattern for logically organizing data in a lakehouse. It is also commonly applied in modern data lakes and data warehouses, particularly in cloud-based environments.
The pattern aims to incrementally improve the structure and quality of data as it flows through each layer of the architecture (Bronze → Silver → Gold layer tables).
The Three Tiers
Bronze (Raw)
- Contains raw, unprocessed data.
- Typically a 1:1 copy of source system data.
- Preserves the original data for auditability and reprocessing if needed.
- Often stored in formats like JSON, CSV, or Avro.
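The Bronze rules above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name ingest_to_bronze, the sample order payloads, and the source and ingested_at metadata fields are all assumptions made for the example. The point it shows is that the payload is stored untouched (duplicates and blanks included), so it can be audited or reprocessed later.

```python
from datetime import datetime, timezone


def ingest_to_bronze(raw_lines, source_name):
    """Store each raw line unchanged, adding only ingestion metadata.

    The metadata fields are a hypothetical convention for this sketch.
    """
    ingested_at = datetime.now(timezone.utc).isoformat()
    return [
        {"raw": line, "source": source_name, "ingested_at": ingested_at}
        for line in raw_lines
    ]


# Hypothetical source data: note the duplicate and the empty amount.
raw_lines = [
    '{"order_id": "1", "amount": "19.99", "country": "us"}',
    '{"order_id": "1", "amount": "19.99", "country": "us"}',
    '{"order_id": "2", "amount": "", "country": "DE"}',
]

bronze = ingest_to_bronze(raw_lines, "orders_api")
```

Nothing is parsed, filtered, or corrected at this stage; the cleanup is deliberately deferred to the Silver layer.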
Silver (Cleaned and Conformed)
- Cleansed and conformed version of bronze data.
- Applies data quality rules, handles missing values, and removes duplicates.
- Often includes parsed and enriched data.
- Typically stored in a columnar format such as Parquet, or in a table format such as Delta Lake, which stores its data as Parquet files.
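As a rough sketch of the Silver step, the following Python parses raw Bronze payloads, casts types, conforms values, and deduplicates. The function to_silver, the order fields, and the specific rules (deduplicate on order_id, treat an empty amount as missing, upper-case country codes) are assumptions chosen for illustration; real quality rules depend on the source system.

```python
import json


def to_silver(bronze_rows):
    """Parse, conform, and deduplicate bronze rows (illustrative rules only)."""
    seen_ids = set()
    silver = []
    for row in bronze_rows:
        record = json.loads(row["raw"])        # parse the raw payload
        order_id = int(record["order_id"])     # cast to a proper type
        if order_id in seen_ids:               # deduplicate on the key
            continue
        seen_ids.add(order_id)
        amount_text = record.get("amount") or ""
        silver.append({
            "order_id": order_id,
            # An empty or absent amount becomes an explicit None.
            "amount": float(amount_text) if amount_text else None,
            # Conform country codes to upper case.
            "country": record["country"].upper(),
        })
    return silver


# Hypothetical bronze rows holding raw JSON strings.
bronze_rows = [
    {"raw": '{"order_id": "1", "amount": "19.99", "country": "us"}'},
    {"raw": '{"order_id": "1", "amount": "19.99", "country": "us"}'},
    {"raw": '{"order_id": "2", "amount": "", "country": "DE"}'},
]

silver = to_silver(bronze_rows)
```

Three raw rows come in and two typed, conformed rows come out; the Bronze rows themselves are left unmodified.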
Gold (Business-Level)
- Contains highly refined, query-ready data sets.
- Often aggregated and joined from multiple silver tables.
- Optimized for specific business domains or use cases.
- Can include star schemas, data marts, or wide denormalized tables.
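A Gold table is often the result of joining and aggregating Silver tables. The sketch below assumes two hypothetical Silver tables (orders and customers) and a made-up business question, revenue per customer segment; the function name and field names are illustrative only.

```python
from collections import defaultdict


def build_revenue_by_segment(orders, customers):
    """Join orders to customers and aggregate revenue per segment."""
    segment_by_customer = {c["customer_id"]: c["segment"] for c in customers}
    totals = defaultdict(float)
    for order in orders:
        if order["amount"] is None:
            continue  # skip orders whose amount is still missing
        segment = segment_by_customer.get(order["customer_id"], "unknown")
        totals[segment] += order["amount"]
    # Sort by segment name so the output order is stable.
    return [
        {"segment": segment, "revenue": round(total, 2)}
        for segment, total in sorted(totals.items())
    ]


# Hypothetical silver tables.
silver_orders = [
    {"order_id": 1, "customer_id": 10, "amount": 20.0},
    {"order_id": 2, "customer_id": 11, "amount": 5.5},
    {"order_id": 3, "customer_id": 10, "amount": 4.5},
    {"order_id": 4, "customer_id": 12, "amount": None},
]
silver_customers = [
    {"customer_id": 10, "segment": "enterprise"},
    {"customer_id": 11, "segment": "smb"},
    {"customer_id": 12, "segment": "smb"},
]

gold = build_revenue_by_segment(silver_orders, silver_customers)
```

The result is a small, query-ready table shaped for one use case, which is the trade-off Gold makes: less general than Silver, but cheaper to query for the question it was built for.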
Key Principles
- Data flows from Bronze → Silver → Gold.
- Each tier adds value and improves data quality.
- Promotes data governance and lineage tracking.
- Enables self-service analytics at different levels of refinement.
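To make the flow and lineage principles concrete, here is a minimal sketch that chains three stage functions and records row counts at each layer. The run_pipeline helper, the stage functions, and the lineage record shape are all hypothetical; the Silver rule used here (drop rows with a missing value, keep the first row per id) is just one possible policy.

```python
def run_pipeline(rows, stages):
    """Apply each stage in order and record a simple lineage entry per layer."""
    lineage = []
    for layer_name, transform in stages:
        result = transform(rows)
        lineage.append(
            {"layer": layer_name, "rows_in": len(rows), "rows_out": len(result)}
        )
        rows = result
    return rows, lineage


def bronze_stage(rows):
    # Keep everything exactly as received.
    return list(rows)


def silver_stage(rows):
    # Drop rows with a missing value and keep the first row per id.
    seen_ids = set()
    cleaned = []
    for row in rows:
        if row["value"] is None or row["id"] in seen_ids:
            continue
        seen_ids.add(row["id"])
        cleaned.append(row)
    return cleaned


def gold_stage(rows):
    # Aggregate to a single business-level figure.
    return [{"total": sum(row["value"] for row in rows)}]


source_rows = [
    {"id": 1, "value": 10},
    {"id": 1, "value": 10},
    {"id": 2, "value": None},
    {"id": 3, "value": 5},
]

result, lineage = run_pipeline(
    source_rows,
    [("bronze", bronze_stage), ("silver", silver_stage), ("gold", gold_stage)],
)
```

Here lineage records 4 rows in and 4 out for Bronze, 4 in and 2 out for Silver, and 2 in and 1 out for Gold, which is the kind of per-layer trail that makes it possible to explain how a Gold figure was derived.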
Benefits
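- Flexibility: Supports different processing needs, from reprocessing raw Bronze data to serving curated Gold data sets.
- Scalability: Accommodates growing data volumes, since refinement happens incrementally layer by layer.
- Governance: Improves data lineage and auditability, because the original data is preserved in Bronze and each layer builds on the one before it.
- Performance: Improves query performance by directing queries to refined, query-ready data sets instead of raw data.
- Reusability: Allows multiple downstream applications to consume data at the level of refinement they need.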