Thoughts on data

Don't remove NULL columns or Bad data from the Source. Learn to handle that in processing.

Sample 1

Serial_NumberList_YearCode
20093202010 - A Will
200192202014 - Foreclosure
190871201918 - In Lieu Of Foreclosure

Create a separate Dimension table for Code and split the Code No and Code Name

dim_code
id
code_no
code_name
fact
Serial_Number
List_Year
dim_id

Sample 2

TripIDSource LatSource LongDes LatDes Long
1-73.990371740.73469543-73.9818420440.73240662
2-73.9807815640.7299118-73.9444732740.71667862
3-73.9845504840.67956543-73.9502716140.78892517

Separate Dimension for Lat and Long

dim_location
location_id
lat
long
location_name (If exists)

Updated Fact table

fact
trip_id
source_location_id
destination_location_id

Another Variation

Location
POINT (-72.98492 41.64753)
POINT (-72.96445 41.25722)

Notes:

DRY Principle: You're not repeating the lat-long info, adhering to the "Don't Repeat Yourself" principle.

Ease of Update: If you need to update a location's details, you do it in one place.

Flexibility: Easier to add more attributes to locations in the future.

5 decimal places: Accurate to ~1.1 meters, usually good enough for most applications including vehicle navigation.

4 decimal places: Accurate to ~11 meters, may be suitable for some applications but not ideal for vehicle-level precision.

3 decimal places: Accurate to ~111 meters, generally too coarse for vehicle navigation but might be okay for city-level analytics.


Sample 3

Fiscal YearDisbursement DateVendor Invoice DateVendor Invoice WeekCheck Clearance Date
202306-Oct-2308-Aug-2308-06-2023
202306-Oct-2316-Aug-2308/13/2023
202306-Oct-2322-Sep-2309/17/202310-08-2023

Date Dimension

DateKeyFullDateYearMonthDayWeekdayWeekOfYearQuarterIsWeekendIsHoliday
202308082023-08-082023882323FALSEFALSE
202308162023-08-1620238163333FALSEFALSE
202309222023-09-2220239225383FALSEFALSE
202310062023-10-0620231065404FALSEFALSE
202310082023-10-0820231087404TRUEFALSE
..............................

Fact Table

Fiscal YearDisbursement Date KeyVendor Invoice Date KeyVendor Invoice Week Date KeyCheck Clearance Date Key
2023202310062023080820230806
2023202310062023081620230813
202320231006202309222023091720231008

Date Dimension Sample


Sample 4

Another DateTime

date_attendingip_location
2017-12-23 12:00:00Reseda, CA, United States
2017-12-23 12:00:00Los Angeles, CA, United States
2018-01-05 14:00:00Mission Viejo, CA, United States

Date Time Dimension

DateKeyFullDateYearMonthDayWeekdayWeekOfYearQuarterIsWeekendIsHoliday
201712232017-12-23201712236514TRUEFALSE
201801052018-01-05201815511FALSEFALSE
..............................

Fact Table

FactIDDateKeyip_location
120171223Reseda, CA, United States
220171223Los Angeles, CA, United States
320180105Mission Viejo, CA, United States

Sample 5

Job TitleExperienceQualificationsSalary RangeAge_Group
Digital Marketing Specialist5 to 15 YearsM.Tech$59K-$99KYouth (<25)
Web Developer2 to 12 YearsBCA$56K-$116KAdults (35-64)
Operations Manager0 to 12 YearsPhD$61K-$104KYoung Adults (25-34)
Network Engineer4 to 11 YearsPhD$65K-$91KYoung Adults (25-34)
Event Manager1 to 12 YearsMBA$64K-$87KAdults (35-64)

Experience Dimension

ExperienceIDExperienceRangeMinExperienceMaxExperience
15 to 15 Years515
22 to 12 Years212
30 to 12 Years012
44 to 11 Years411
51 to 12 Years112

Job Dimension Table

JobIDJob TitleQualifications
1Digital Marketing SpecialistM.Tech
2Web DeveloperBCA
3Operations ManagerPhD
4Network EngineerPhD
5Event ManagerMBA

Age Group Dimension Table

AgeGroupIDAge_Group
1Youth (<25)
2Adults (35-64)
3Young Adults (25-34)

Fact Table

JobIDExperienceIDSalary RangeAgeGroupID
11$59K-$99K1
22$56K-$116K2
33$61K-$104K3
44$65K-$91K3
55$64K-$87K2

Sample 6

ITEM CODEITEM DESCRIPTION
100293SANTORINI GAVALA WHITE - 750ML
100641CORTENOVA VENETO P/GRIG - 750ML
100749SANTA MARGHERITA P/GRIG ALTO - 375ML

The Fact will turn out to be like this

ItemIDItemCodeItem DescriptionQuantity
1100293SANTORINI GAVALA WHITE750ML
2100641CORTENOVA VENETO P/GRIG750ML
3100749SANTA MARGHERITA P/GRIG ALTO375ML

Note: If the same Item Descrition repeats for various quantities, then create a separate table for Quantity.

Quantity Dimension

QuantityIDQuantity
1750ML
2375ML
3500ML
......

Item Dimension

ItemIDItem NameQuantityID
1SANTORINI GAVALA WHITE1
2CORTENOVA VENETO P/GRIG1
3SANTA MARGHERITA P/GRIG ALTO2
.........

Fact Table

FactIDItemIDQuantityID
111

Sample 7

production_countriesspoken_languages
United Kingdom, United States of AmericaEnglish, French, Japanese, Swahili
United Kingdom, United States of AmericaEnglish
United Kingdom, United States of AmericaEnglish, Mandarin

Country Dimension

CountryIDCountryName
1United Kingdom
2United States of America

Language Dimension

LanguageIDLanguageName
1English
2French
3Japanese
4Swahili
5Mandarin

Approach 1 : Creating Many to Many

FactIDCountryIDLanguageID
111
121
112
122

Pros

Normalization: Easier to update and maintain data.

Cons

Storage: May require more storage for the additional tables and keys.

Approach 2 : Fact Table with Array Datatypes

FactIDProductionCountryIDsSpokenLanguageIDs
1[1, 2][1, 2, 3, 4]
2[1, 2][1]
3[1, 2][1, 5]

With newer Systems like Spark SQL which supports Arrays

  • Simplicity: Easier to understand and less complex to set up.
  • Performance: Could be faster for certain types of queries, especially those that don't require unpacking the array.