Thoughts on data
Don't remove NULL columns or Bad data from the Source. Learn to handle that in processing.
Sample 1
| Serial_Number | List_Year | Code |
|---|---|---|
| 20093 | 2020 | 10 - A Will |
| 200192 | 2020 | 14 - Foreclosure |
| 190871 | 2019 | 18 - In Lieu Of Foreclosure |
Create a separate Dimension table for Code and split the Code No and Code Name
| dim_code |
|---|
| id |
| code_no |
| code_name |
| fact |
|---|
| Serial_Number |
| List_Year |
| dim_id |
Sample 2
| TripID | Source Lat | Source Long | Des Lat | Des Long |
|---|---|---|---|---|
| 1 | -73.9903717 | 40.73469543 | -73.98184204 | 40.73240662 |
| 2 | -73.98078156 | 40.7299118 | -73.94447327 | 40.71667862 |
| 3 | -73.98455048 | 40.67956543 | -73.95027161 | 40.78892517 |
Separate Dimension for Lat and Long
| dim_location |
|---|
| location_id |
| lat |
| long |
| location_name (If exists) |
Updated Fact table
| fact |
|---|
| trip_id |
| source_location_id |
| destination_location_id |
Another Variation
| Location |
|---|
| POINT (-72.98492 41.64753) |
| POINT (-72.96445 41.25722) |
Notes:
DRY Principle: You're not repeating the lat-long info, adhering to the "Don't Repeat Yourself" principle.
Ease of Update: If you need to update a location's details, you do it in one place.
Flexibility: Easier to add more attributes to locations in the future.
5 decimal places: Accurate to ~1.1 meters, usually good enough for most applications including vehicle navigation.
4 decimal places: Accurate to ~11 meters, may be suitable for some applications but not ideal for vehicle-level precision.
3 decimal places: Accurate to ~111 meters, generally too coarse for vehicle navigation but might be okay for city-level analytics.
Sample 3
| Fiscal Year | Disbursement Date | Vendor Invoice Date | Vendor Invoice Week | Check Clearance Date |
|---|---|---|---|---|
| 2023 | 06-Oct-23 | 08-Aug-23 | 08-06-2023 | |
| 2023 | 06-Oct-23 | 16-Aug-23 | 08/13/2023 | |
| 2023 | 06-Oct-23 | 22-Sep-23 | 09/17/2023 | 10-08-2023 |
Date Dimension
| DateKey | FullDate | Year | Month | Day | Weekday | WeekOfYear | Quarter | IsWeekend | IsHoliday |
|---|---|---|---|---|---|---|---|---|---|
| 20230808 | 2023-08-08 | 2023 | 8 | 8 | 2 | 32 | 3 | FALSE | FALSE |
| 20230816 | 2023-08-16 | 2023 | 8 | 16 | 3 | 33 | 3 | FALSE | FALSE |
| 20230922 | 2023-09-22 | 2023 | 9 | 22 | 5 | 38 | 3 | FALSE | FALSE |
| 20231006 | 2023-10-06 | 2023 | 10 | 6 | 5 | 40 | 4 | FALSE | FALSE |
| 20231008 | 2023-10-08 | 2023 | 10 | 8 | 7 | 40 | 4 | TRUE | FALSE |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Fact Table
| Fiscal Year | Disbursement Date Key | Vendor Invoice Date Key | Vendor Invoice Week Date Key | Check Clearance Date Key |
|---|---|---|---|---|
| 2023 | 20231006 | 20230808 | 20230806 | |
| 2023 | 20231006 | 20230816 | 20230813 | |
| 2023 | 20231006 | 20230922 | 20230917 | 20231008 |
Sample 4
Another DateTime
| date_attending | ip_location |
|---|---|
| 2017-12-23 12:00:00 | Reseda, CA, United States |
| 2017-12-23 12:00:00 | Los Angeles, CA, United States |
| 2018-01-05 14:00:00 | Mission Viejo, CA, United States |
Date Time Dimension
| DateKey | FullDate | Year | Month | Day | Weekday | WeekOfYear | Quarter | IsWeekend | IsHoliday |
|---|---|---|---|---|---|---|---|---|---|
| 20171223 | 2017-12-23 | 2017 | 12 | 23 | 6 | 51 | 4 | TRUE | FALSE |
| 20180105 | 2018-01-05 | 2018 | 1 | 5 | 5 | 1 | 1 | FALSE | FALSE |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Fact Table
| FactID | DateKey | ip_location |
|---|---|---|
| 1 | 20171223 | Reseda, CA, United States |
| 2 | 20171223 | Los Angeles, CA, United States |
| 3 | 20180105 | Mission Viejo, CA, United States |
Sample 5
| Job Title | Experience | Qualifications | Salary Range | Age_Group |
|---|---|---|---|---|
| Digital Marketing Specialist | 5 to 15 Years | M.Tech | $59K-$99K | Youth (<25) |
| Web Developer | 2 to 12 Years | BCA | $56K-$116K | Adults (35-64) |
| Operations Manager | 0 to 12 Years | PhD | $61K-$104K | Young Adults (25-34) |
| Network Engineer | 4 to 11 Years | PhD | $65K-$91K | Young Adults (25-34) |
| Event Manager | 1 to 12 Years | MBA | $64K-$87K | Adults (35-64) |
Experience Dimension
| ExperienceID | ExperienceRange | MinExperience | MaxExperience |
|---|---|---|---|
| 1 | 5 to 15 Years | 5 | 15 |
| 2 | 2 to 12 Years | 2 | 12 |
| 3 | 0 to 12 Years | 0 | 12 |
| 4 | 4 to 11 Years | 4 | 11 |
| 5 | 1 to 12 Years | 1 | 12 |
Job Dimension Table
| JobID | Job Title | Qualifications |
|---|---|---|
| 1 | Digital Marketing Specialist | M.Tech |
| 2 | Web Developer | BCA |
| 3 | Operations Manager | PhD |
| 4 | Network Engineer | PhD |
| 5 | Event Manager | MBA |
Age Group Dimension Table
| AgeGroupID | Age_Group |
|---|---|
| 1 | Youth (<25) |
| 2 | Adults (35-64) |
| 3 | Young Adults (25-34) |
Fact Table
| JobID | ExperienceID | Salary Range | AgeGroupID |
|---|---|---|---|
| 1 | 1 | $59K-$99K | 1 |
| 2 | 2 | $56K-$116K | 2 |
| 3 | 3 | $61K-$104K | 3 |
| 4 | 4 | $65K-$91K | 3 |
| 5 | 5 | $64K-$87K | 2 |
Sample 6
| ITEM CODE | ITEM DESCRIPTION |
|---|---|
| 100293 | SANTORINI GAVALA WHITE - 750ML |
| 100641 | CORTENOVA VENETO P/GRIG - 750ML |
| 100749 | SANTA MARGHERITA P/GRIG ALTO - 375ML |
The Fact will turn out to be like this
| ItemID | ItemCode | Item Description | Quantity |
|---|---|---|---|
| 1 | 100293 | SANTORINI GAVALA WHITE | 750ML |
| 2 | 100641 | CORTENOVA VENETO P/GRIG | 750ML |
| 3 | 100749 | SANTA MARGHERITA P/GRIG ALTO | 375ML |
Note: If the same Item Descrition repeats for various quantities, then create a separate table for Quantity.
Quantity Dimension
| QuantityID | Quantity |
|---|---|
| 1 | 750ML |
| 2 | 375ML |
| 3 | 500ML |
| ... | ... |
Item Dimension
| ItemID | Item Name | QuantityID |
|---|---|---|
| 1 | SANTORINI GAVALA WHITE | 1 |
| 2 | CORTENOVA VENETO P/GRIG | 1 |
| 3 | SANTA MARGHERITA P/GRIG ALTO | 2 |
| ... | ... | ... |
Fact Table
| FactID | ItemID | QuantityID |
|---|---|---|
| 1 | 1 | 1 |
Sample 7
| production_countries | spoken_languages |
|---|---|
| United Kingdom, United States of America | English, French, Japanese, Swahili |
| United Kingdom, United States of America | English |
| United Kingdom, United States of America | English, Mandarin |
Country Dimension
| CountryID | CountryName |
|---|---|
| 1 | United Kingdom |
| 2 | United States of America |
Language Dimension
| LanguageID | LanguageName |
|---|---|
| 1 | English |
| 2 | French |
| 3 | Japanese |
| 4 | Swahili |
| 5 | Mandarin |
Approach 1 : Creating Many to Many
| FactID | CountryID | LanguageID |
|---|---|---|
| 1 | 1 | 1 |
| 1 | 2 | 1 |
| 1 | 1 | 2 |
| 1 | 2 | 2 |
Pros
Normalization: Easier to update and maintain data.
Cons
Storage: May require more storage for the additional tables and keys.
Approach 2 : Fact Table with Array Datatypes
| FactID | ProductionCountryIDs | SpokenLanguageIDs |
|---|---|---|
| 1 | [1, 2] | [1, 2, 3, 4] |
| 2 | [1, 2] | [1] |
| 3 | [1, 2] | [1, 5] |
With newer Systems like Spark SQL which supports Arrays
- Simplicity: Easier to understand and less complex to set up.
- Performance: Could be faster for certain types of queries, especially those that don't require unpacking the array.