Disclaimer

This is not an official Textbook for Data Warehouse.

This is only reference material that accompanies the lectures presented in class.

If you find any errors, please do email me at chandr34 @ rowan.edu

Fundamentals

Terms to Know

These are some basic terms to know. We will learn a lot more going forward.

DBMS - Database Management System

RDBMS - Relational Database Management System

ETL - Extract Transform Load - Back office process for loading data to Data Warehouse.

Bill Inmon - Considered the Father of Data Warehousing.

Ralph Kimball - He’s the father of Dimensional modeling and the Star Schema.

OLTP - Online Transaction Processing. The classic operational database for taking orders, reservations, etc.

OLAP - Online Analytical Processing. Big database providers (IBM, Teradata, Microsoft, etc.) started integrating OLAP into their systems.

MetaData - Data about Data.

Data Pipeline - A set of processes that move data from one system to another.

ETL - Extract Transform Load

ELT - Extract Load Transform

Jobs

  • Data Analyst: Plays a crucial role in business decision-making by analyzing data and providing valuable insights, often sourced from a data warehouse.

  • Data Scientist: Employs data warehousing as a powerful tool for modeling, statistical analysis, and predictive analytics, enabling the resolution of complex problems.

  • Data Engineer: Demonstrates precision and dedication in their work by focusing on the design, construction, and maintenance of data pipelines that facilitate the movement of data into and out of a data warehouse.

  • Analyst Engineer: A hybrid role combining the skills of a data analyst and data engineer, often involved in analyzing data and developing the infrastructure to support that analysis.

  • Data Architect: Designs and oversees the implementation of the overall data infrastructure, including data warehouses, to ensure scalability and efficiency.

  • Database Administrator (DBA): This person manages the performance, integrity, and security of databases, including those used in data warehouses.

  • Database Security Analyst: This position focuses on ensuring the security of databases and data warehouses and protecting against threats and vulnerabilities.

  • Database Manager: Oversees the overall management and administration of databases, including those used for data warehousing.

  • Business Intelligence (BI) Analyst: Utilizes data from a data warehouse to generate reports, dashboards, and visualizations that aid business decision-making.

  • AI/ML Engineer: Uses warehouse data to build and deploy machine learning models, particularly in enterprise environments where historical data is crucial.

  • Compliance Analyst: Ensures that data warehousing solutions meet regulatory requirements, especially in industries like finance, healthcare, or insurance.

  • Chief Data Officer (CDO): An executive role responsible for an organization’s data strategy, often overseeing data warehousing as a critical component.

  • Data Doctor: Typically diagnoses and "fixes" data-related issues within an organization. This role might involve data cleansing, ensuring data quality, and resolving inconsistencies or errors in datasets.

  • Data Advocate: Champions the use of data within an organization. They promote data-driven decision-making and ensure that the value of data is recognized and utilized effectively across different departments.

Prefix any of these titles with Cloud or Big Data for even more job variations.

Skills needed

A data warehouse developer is responsible for designing, developing, and maintaining data warehouse systems. To be qualified as a data warehouse developer, a person should possess a combination of technical skills and knowledge in the following areas:

Must-have skills

  1. Database Management Systems (DBMS): A strong understanding of relational and analytical database management systems such as Oracle, SQL Server, PostgreSQL, or Teradata.

  2. SQL: Proficiency in SQL (Structured Query Language) for creating, querying, and manipulating database objects.

  3. Data Modeling: Knowledge of data modeling techniques, including dimensional modeling (star schema, snowflake schema), normalization, and denormalization. Familiarity with tools such as Vertabelo or ERwin, or PowerDesigner is a plus.

  4. ETL (Extract, Transform, Load): Experience with ETL processes and tools like Microsoft SQL Server Integration Services (SSIS), Talend, or Informatica PowerCenter for extracting, transforming, and loading data from various sources into the data warehouse.

  5. Data Integration: Understanding of data integration concepts and techniques, such as data mapping, data cleansing, and data transformation.

  6. Data Quality: Knowledge of data quality management and techniques to ensure data accuracy, consistency, and integrity in the data warehouse.

  7. Performance Tuning: Familiarity with performance optimization techniques for data warehouses, such as indexing, partitioning, and materialized views.

  8. Reporting and Data Visualization: Experience with reporting and data visualization tools like Tableau, Power BI, or QlikView for creating dashboards, reports, and visualizations to analyze and present data.

  9. Big Data Technologies: Familiarity with big data platforms such as Spark and NoSQL databases like MongoDB or Cassandra can be beneficial, as some organizations incorporate these technologies into their data warehousing solutions.

  10. Programming Languages: Knowledge of programming languages like Python, Java, or C# can help implement custom data processing logic or integrate with external systems.

  11. Cloud Platforms: Experience with cloud-based data warehousing solutions such as Databricks can be a plus as more organizations move their data warehouses to the cloud.

  12. Version Control: Familiarity with version control systems like Git or SVN for managing code and collaborating with other developers.

Nice to have skills

While Linux skills are not a core requirement for a data warehouse developer, they can be valuable for managing, optimizing, and troubleshooting a data warehousing environment:

  • Server Management
  • Scripting and Automation (AWK, Bash)
  • File System and Storage Management
  • Networking and Security
  • Performance Tuning
  • Working with Cloud Platforms
  • Deploying and Managing Containers (Docker, Podman, Kubernetes)

Application Tiers

Where does Database fit in?

  • Database Tier: Actual data
  • Application Tier: Business logic
  • Presentation Tier: Front end (Web, Client, Mobile App)

https://bluzelle.com/blog/things-you-should-know-about-database-caching

  • Robotics
  • AI
  • IoT
  • Blockchain
  • 3D Printing
  • Internet / Mobile Apps
  • Autonomous Cars - VANET Routing
  • VR / AR - Virtual Reality / Augmented Reality
  • Wireless Services
  • Quantum Computing
  • 5G
  • Voice Assistant (Siri, Alexa, Google Home)
  • Cyber Security
  • Big Data Analytics
  • Machine Learning
  • DevOps
  • NoSQL Databases
  • Microservices Architecture
  • Fintech
  • Smart Cities
  • E-commerce Platforms
  • HealthTech

Do you see any pattern or anything common in these?

Visual Interface & Data

Operational Database

An operational database management system is software that allows users to quickly define, modify, retrieve, and manage data in real time.

While conventional databases rely on batch processing, operational database systems are oriented toward real-time, transactional operations.

Let's take a Retail company that uses several systems for its day-to-day operations.

They buy software from various vendors and manage their business.

  • Sales Transactions
  • Inventory Management
  • Customer Relationship Management
  • HR Systems

and so on.

Other Examples:

  • Banner Database (Registration)
  • eCommerce Database
  • Blog Database
  • Banking Transactions

What is a Data Warehouse

  • Is it a Database?
  • Is it Big Data?
  • Is it the backend for Visualization?

Yes! Yes! & Yes!

Typical Data Architecture

Data Source Layer

  • Operational System: This includes data from various operational systems like transactional databases.
  • CRM System: Data from customer relationship management systems.
  • ERP System: Data from enterprise resource planning systems.
  • External Data: Data that might come from external sources outside the organization.

These sources feed data into the Data Warehouse system. Depending on the source, the data might be structured or unstructured.

DW Staging

  • Staging Area: This is an intermediate storage area used for data processing during the ETL (Extract, Transform, Load) process.
  • ETL Process:
    • Extract: Data is extracted from various data sources.
    • Transform: The extracted data is transformed into a format suitable for analysis and reporting. This might include cleaning, normalizing, and aggregating the data.
    • Load: The transformed data is then loaded into the Data Warehouse.
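A minimal sketch of the transform-and-load step in SQL, assuming hypothetical stg_sales (staging) and dw_sales (warehouse) tables:

-- Extract: raw data has already been landed in the staging table stg_sales.
-- Transform + Load: clean and standardize while copying into the warehouse table.
INSERT INTO dw_sales (sale_id, sale_date, store_code, amount)
SELECT
    CAST(sale_id AS INTEGER),
    CAST(sale_date AS DATE),
    upper(trim(store_code)),              -- standardize store codes
    CAST(amount AS DECIMAL(10, 2))
FROM stg_sales
WHERE sale_id IS NOT NULL;                -- drop incomplete rows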

Data Warehouse

  • Raw Data: The unprocessed data loaded directly from the staging area.
  • Metadata: Data about the data, which includes information on data definitions, structures, and rules.
  • Aggregated Data: Summarizing or aggregating data for efficient querying and analysis.

Presentation Layer

  • OLAP (Online Analytical Processing): This tool is used for multidimensional data analysis in the warehouse, enabling users to analyze data from various perspectives.
  • Reporting: Involves generating reports from the data warehouse, often for decision-making and business intelligence purposes.
  • Data Mining: This involves analyzing large datasets to identify patterns, trends, and insights, often used for predictive analytics.

Flow of Data

  • Data Source Layer → Staging Area: Data is extracted from multiple sources and brought into the staging area.
  • Staging Area → Data Warehouse: The data is transformed and loaded into the data warehouse.
  • Data Warehouse → Presentation Layer: The data is then used for various purposes, such as OLAP, Reporting, and Data Mining.

This architecture ensures that data is collected, processed, and made available for analysis in a structured and efficient manner, facilitating business intelligence and decision-making processes.

Problem Statement

RetailWorld uses different systems for sales transactions, inventory management, customer relationship management (CRM), and human resources (HR). Each system generates a vast amount of data daily.

The company's management wants to make data-driven decisions to improve its operations, optimize its supply chain, and enhance customer satisfaction. However, they face the following challenges:

  1. Data Silos: Data is stored in separate systems, making gathering and analyzing information from multiple sources challenging.

  2. Inconsistent Data: Different systems use varying data formats, making it hard to consolidate and standardize the data for analysis.

  3. Slow Query Performance: As the volume of data grows, querying the operational databases directly becomes slower and impacts the performance of the transactional systems.

  4. Limited Historical Data: Operational databases are optimized for current transactions, making storing and analyzing historical data challenging.

Solution

  1. Centralized Data Repository: The Data Warehouse consolidates data from multiple sources, breaking down data silos and enabling a unified view of the company's information.

  2. Consistent Data Format: Data is cleaned, transformed, and standardized to ensure consistency and accuracy across the organization.

  3. Improved Query Performance: The Data Warehouse is optimized for analytical processing, allowing faster query performance without impacting the operational systems.

  4. Historical Data Storage: The Data Warehouse can store and manage large volumes of historical data, enabling trend analysis and long-term decision-making.

  5. Enhanced Reporting and Analysis: The Data Warehouse simplifies the process of generating reports and conducting in-depth analyses, providing insights into sales trends, customer preferences, inventory levels, and employee performance.

Key Features

  • They are used for storing historical data.
  • Low Latency / Response time is fast.
  • Data consistency and quality of data.
  • Used with Business Intelligence, Data Analysis & Data Science.

By doing all of the above, the data warehouse answers some of the questions the business needs answered.

  • Sales of particular items this month compared to last month?
  • Top 3 best-selling products of the quarter?
  • How is the internet traffic before the pandemic / during the pandemic?

It's read-only for end-users and upper management.
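For instance, the "top 3 best-selling products of the quarter" question might look like the following SQL, assuming a hypothetical dw_sales table with product_name, quantity_sold, and sale_date columns:

-- Top 3 best-selling products for Q3 2024.
SELECT product_name, SUM(quantity_sold) AS total_sold
FROM dw_sales
WHERE sale_date >= DATE '2024-07-01'
  AND sale_date <  DATE '2024-10-01'
GROUP BY product_name
ORDER BY total_sold DESC
LIMIT 3;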

Who are the end users?

  • Data Analysts
  • Data Scientists.

Data Size

A query can range from a few thousand rows to billions of rows.

Need for Data Warehouse

Amazon's CEO wants to know, within the next 30 minutes, how sales of the new Kindle Scribe reader launched this year compare to other eReaders.

Where do we look for this info?

  • Transactional databases use every resource to serve customers. Querying them for analytics may slow down a customer's request.
  • Also, they are live databases, so they may or may not have historical data.
  • Chances are the data format differs for each region/country in which Amazon does business.

It's not just Amazon; it's everywhere.

  • Airlines
  • Bank Transactions / ATM transactions
  • Restaurant chains, and so on.

Companies will have multiple databases that are not linked with each other for specific reasons.

  • Sales Database
  • Customer Service Database
  • Marketing Database
  • Inventory Database
  • Human Resources Database

There is no reason for linking HR data to Sales data, but the CFO might need this info for budgeting.

Fun Qn: How many times have you received Marketing mail/email from the same company you have an account with?

Current State of the Art

The business world decided as follows.

  • A Database should exist just for doing BI & Strategic reports.

  • It should be separated from the operational / transaction database for the day-to-day running of the business.

  • It should encompass all aspects of the business (sales, inventory, hr, customer service…)

  • An enterprise-wide standard definition for every field name in every table.

    • Example: the employee number should be identical across databases; a mix of empNo, eNo, EmployeeNum, and empID is not acceptable.
  • Metadata database (data about data) defining assumptions about each field, describing transformations performed and cleansing operations, etc.

    • Example: If US telephone, it should be nnn-nnn-nnnn or (nnn) nnn-nnnn
  • Data Warehouse is read-only to its end users so that everyone will use the same data, and there will be no mismatch between teams.

  • Fast access, even if it's big data.

How is it done?

  • Operational databases for tracking sales, inventory, support calls, chat, and email. (Relational / NoSQL)

  • The Back Office team (ETL team) gathers data from multiple sources, cleans it, transforms it, fills in missing values, and stores it in the Staging database.

    • If the phone number is not in the standard format, format it (see the sketch after this list).
    • If the email address is not linked to the chat/phone record, read it from the Customer record and update it.
  • Staging database: the working database where all the work is done on the data. The result is then loaded into the data warehouse, which is visible as "read-only" to end users.

  • Data Analysts then build reports using Data Warehouse.
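A hedged sketch of the phone-number cleanup step mentioned above, assuming a hypothetical stg_customers staging table with a raw_phone column (regexp_replace and substr as in DuckDB/PostgreSQL):

-- Keep only the digits of the raw phone number, then rebuild nnn-nnn-nnnn.
WITH cleaned AS (
    SELECT customer_id,
           regexp_replace(raw_phone, '[^0-9]', '', 'g') AS digits
    FROM stg_customers
)
SELECT customer_id,
       substr(digits, 1, 3) || '-' ||
       substr(digits, 4, 3) || '-' ||
       substr(digits, 7, 4) AS phone_std
FROM cleaned;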

Back to the original question.

If all of these things are done right, Amazon's CEO can get the report in less than 30 minutes without interfering with business operations. 👍

Types of Data

  • Structured Data (rows/columns CSV, Excel)
  • Semi-Structured Data (JSON / XML)
  • Unstructured Data (Video, Audio, Document, Email)

Structured Data

| ID | Name | Join Date |
|---|---|---|
| 101 | Rachel Green | 2020-05-01 |
| 201 | Joey Tribianni | 1998-07-05 |
| 301 | Monica Geller | 1999-12-14 |
| 401 | Cosmo Kramer | 2001-06-05 |

Semi-Structured Data

JSON

[
   {
      "id":1,
      "name":"Rachel Green",
      "gender":"F",
      "series":"Friends"
   },
   {
      "id":"2",
      "name":"Sheldon Cooper",
      "gender":"M",
      "series":"BBT"
   }
]

XML

<?xml version="1.0" encoding="UTF-8"?>
<actors>
   <actor>
      <id>1</id>
      <name>Rachel Green</name>
      <gender>F</gender>
      <series>Friends</series>
   </actor>

   <actor>
      <id>2</id>
      <name>Sheldon Cooper</name>
      <gender>M</gender>
      <series>BBT</series>
   </actor>
</actors>

Unstructured Data

  1. Text Logs: Server logs, application logs.
  2. Social Media Posts: Tweets, Facebook comments.
  3. Emails: Customer support interactions.
  4. Audio/Video: Customer call recordings and marketing videos.
  5. Customer Reviews: Free-form text reviews.
  6. Images: Product images, user profile pictures.
  7. Documents: PDFs, Word files.
  8. Sensor Data: IoT data streams.

These can be ingested into modern data warehouses for analytics, often after some preprocessing. For instance, text can be analyzed with NLP before storing, or images can be processed into feature vectors.

Data Storage Systems

Data Lake

A place where you dump all forms of data from your business.

Structured / Unstructured / Semi-Structured.

Example

  • Customer service chat logs, voice recordings, email, website comments, social media.
  • Need a cheap way to store different types of data in large quantities.
  • Data is not needed right now but is kept for planned later use.
  • Larger organizations need all kinds of data to analyze and improve business.

Data Warehouse

  • Data Warehouse - Stores data that is already modeled/structured and ready for use.
  • Data from the Warehouse can be used for analyzing its operational data.
  • There will be developers to support the data.
  • It’s multi-purpose storage for different use cases.

Data Mart

A subset of Data Warehouse for a specific use case.

A specific group of users uses it, so it is more secure and performs better.

Example: Pandemic Analysis

Dependent Data Marts - constructed from an existing data warehouse.

Example: Grocery / School Supplies

Independent Data Marts - built from scratch and operated in silos.

Example: Mask / Glove Sales

Hybrid Data Marts - Mix and match both.

Data Warehouse 1980 - Current

Data Warehouses (1980 - 2000):

Pros

  • High Quality Data.
  • Standard modeling technique (star schema/Kimball).
  • Reliability through ACID transactions.
  • Very good fit for business intelligence.

Cons

  • Closed Formats.
  • Support only SQL.
  • No support for Machine Learning.
  • No streaming support.
  • Limited scaling support.

Data Lakes (2010 - 2020)

Pros

  • Support for open formats.
  • Can support all data types & their use cases.
  • Scalability through underlying cloud storage.
  • Support for Machine Learning & AI.

Cons

  • Weak schema support.
  • No ACID transaction support.
  • Low data quality.
  • Leads to "Data Swamps".

Lakehouses (2020 and beyond):

Pros

  • Support for both BI and ML/AI workloads.
  • Standard Storage Format.
  • Reliability through ACID transactions.
  • Scalability through underlying cloud storage.

Cons

  • Cost Considerations.
  • Data Governance and Security.
  • Performance Overhead. (Due to ACID transactions)

Data Warehouse vs Data Mart

| Data Warehouse | Data Mart |
|---|---|
| Independent application / system | Specific to support one system. |
| Contains detailed data. | Mostly aggregated data. |
| Involves top-down/bottom-up approach. | Involves bottom-up approach. |
| Adjustable and exists for an extended period of time. | Restricted for a project / shorter duration of time. |

Data Warehouse Architecture

Top-Down Approach

This method begins with developing a comprehensive enterprise data warehouse (EDW) consolidating all organizational data. Data marts are then created from this central warehouse to serve specific business units or functions. The top-down approach ensures a unified, consistent data model across the enterprise but typically requires more upfront investment in time and resources.

External Sources (ETL) >> Data Warehouse >> Data Mining
                                         >> Data Mart 1
                                         >> Data Mart 2

In the words of Inmon

"Data Warehouse as a central repository for the complete organization and data marts are created from it after the complete data warehouse has been created."

src: https://www.geeksforgeeks.org/data-warehouse-architecture/

Bottom-Up Approach

This approach starts by creating small, specific data marts for individual business units. These data marts are designed to meet the immediate analytical needs of departments. Over time, these marts are integrated into a comprehensive enterprise data warehouse (EDW). The bottom-up approach is agile and allows quick wins but can lead to challenges integrating data marts into a cohesive system.

External Sources (ETL)     >> Data Mart 1  >> Data Warehouse >> Data Mining
                           >> Data Mart 2

Kimball gives this approach: data marts are created first and provide a thin view for analysis, and a data warehouse is created after complete data marts have been created.

src: https://www.geeksforgeeks.org/data-warehouse-architecture/

Summary

Each approach has its advantages and trade-offs, with the bottom-up being more iterative and flexible, while the top-down offers a more structured and holistic view of the organization’s data.

Examples:

Top-Down - Popular big retail stores are likely to follow this architecture, as they build a centralized data warehouse that feeds their stores. Similarly, financial organizations like banks may take a top-down approach.

Bottom-Up - Popular OTT platforms follow such models: initially they bring in movies, then add their own productions and third-party providers.

Data Warehouse Characteristic

Characteristics

  • Subject Oriented
  • Integrated
  • Time-Variant
  • Non-Volatile
    • Data Loading (ETL)
    • Data Access (Reporting / BI)

Functions

  • Data Consolidation
  • Data Cleaning
  • Data Integration

Subject Oriented

src: https://unstop.com/blog/characteristics-of-data-warehouse

Data analysis for a business's decision-makers can be done quickly by restricting attention to a particular subject area of the data warehouse.

Do not add information that is unrelated to the subject under analysis.

When analyzing customer information, it's crucial to focus on the relevant data and avoid unnecessary details, such as food habits, which can distract from the main task.

Integrated

Multiple Source Systems:

In most organizations, data is stored across various systems. For example, a bank might have separate systems for savings accounts, checking accounts, and loans. Each system is designed to serve a specific purpose and might have its own database, schema, and data formats.

Unified Subject Areas:

The data warehouse is a centralized repository where data from these different source systems is brought together. This integration is not just about storing the data in one place; it involves transforming and aligning the data to be analyzed.

Consistency and Standardization:

During integration, the data is often standardized to ensure consistency. For example, account numbers might be formatted differently in the source systems, but they are unified in a standard format in the data warehouse. This standardization is crucial for accurate reporting and analysis.

Benefits:

Holistic View: The data warehouse provides a comprehensive view of a subject by integrating data from different sources. For example, a bank can now analyze a customer's relationship across all accounts rather than looking at each in isolation.

Improved Decision-Making: With integrated data, organizations can perform more sophisticated analyses, leading to better decision-making. For example, they can understand a customer's total exposure by analyzing their savings, checking, and loan accounts together.

Efficiency: Analysts and business users can access all the relevant data in one place without needing to query multiple systems.

Time Variant

Data warehouses, unlike operational databases, are designed with a unique ability to maintain a comprehensive historical record of data. This feature not only allows for trend analysis and historical reporting but also ensures the reliability of the system for comparison over time.

For example, if a customer changes their address, a data warehouse will keep old and new addresses, along with timestamps indicating when the change occurred.

Data warehouses often include a time dimension, allowing users to analyze data across different periods. This could include daily, monthly, quarterly, or yearly trends, providing insights into how data changes over time.
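A hedged sketch of how such an address history might be kept, with hypothetical table and column names:

-- Each address change adds a new row instead of overwriting the old one.
CREATE TABLE customer_address_history (
    customer_id INTEGER,
    address     VARCHAR,
    valid_from  DATE,
    valid_to    DATE,        -- NULL for the current address
    is_current  BOOLEAN
);

-- Old and new addresses coexist, each with its own time range.
INSERT INTO customer_address_history VALUES
    (101, '1 Main St, NY',   DATE '2020-01-01', DATE '2023-06-30', FALSE),
    (101, '4 John Blvd, NJ', DATE '2023-07-01', NULL,              TRUE);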

Non Volatile

Non-Volatile Nature: The data warehouse does not allow modifications to the data once it is loaded. This characteristic ensures that historical data is preserved for long-term analysis.

The cylinder on the left side of the diagram represents OLTP databases, which are typically used in operational systems. These databases handle day-to-day transactions, such as reading, adding, changing, or deleting data. OLTP systems are optimized for fast transaction processing.

The cube represents the data warehouse, a dedicated repository designed for analysis and reporting. Unlike OLTP systems, the data warehouse is non-volatile, meaning that once data is loaded into the warehouse, it remains stable and is not updated or deleted. This stability is a key feature, ensuring that historical data is preserved intact for analysis over time.

Tools

Traditional Solutions

  • SQL Server
  • Oracle
  • PostgreSQL
  • IBM DB2
  • Teradata
  • Informatica
  • SAP HANA

Cloud Solutions

  • Databricks
  • Snowflake
  • Microsoft Fabric
  • Google BigQuery
  • Amazon Redshift

ETL/ELT Tools

  • Talend
  • Apache NiFi
  • Fivetran
  • Apache Airflow (orchestrator)

Data Integration and Data Prep

  • Alteryx
  • Trifacta
  • dbt (data build tool)

BI and Analytics Tools

  • Tableau
  • QlikView
  • Power BI
  • Looker

Data Lakes

  • AWS Lake Formation
  • Azure Data Lake
  • Google Cloud Storage
  • Apache Hadoop (HDFS)

Data Cataloging and Governance

  • Unity Catalog
  • Apache Atlas
  • Collibra
  • Alation
  • Informatica Data Catalog

Big Data Technologies

  • Apache Spark
  • Apache Hive
  • Apache Impala
  • Apache HBase
  • Presto

Cloud vs On-Premise

| Feature | Cloud Data Warehouse | On-Premise |
|---|---|---|
| Scalability | Instant up / down, scale in / out | Reconfiguring / purchasing hardware, software, etc. |
| Availability | Up to 99.99% | Depends on infrastructure. |
| Security | Provided by cloud provider | Depends on the competence of the in-house IT team. |
| Performance | Serves multiple geo locations, which helps query performance | Scalability challenge |
| Cost-effectiveness | No hardware / initial cost. Pay only for usage. *If not managed carefully, it could cost a fortune. | Requires significant initial investment plus salaries. |

src: https://www.scnsoft.com/analytics/data-warehouse/cloud

Steps to design a Data Warehouse

  • Gather Requirements
  • Environment (Physical / Cloud)
    • Dev
    • Test
    • Prod
  • Data Modeling
    • Star Schema
    • Snowflake Schema
    • Galaxy Schema
  • Choose ETL - ELT Solution
  • OLAP Cubes or Not
  • Visualization Tool
  • Query Performance

Gather Requirements

  • DW is subject-oriented.

  • Needs data from all related sources. A DW is only as valuable as the data contained within it.

  • Talk to groups and align goals with the overall project.

  • Determine the scope of the project and how it helps the business.

  • Discover future needs with the data and technology solution.

  • Disaster Recovery model.

  • Security (threat detection, mitigation, monitoring)

  • Anticipate compliance needs and mitigate regulatory risks.

Environment

  • Need separate environments for Development, Testing, and Production.

  • Development & Testing will have some % of sample data from Production.

  • Testing and Production will have a similar HW environment. (Cloud / In house)

  • Nice to have a similar environment for development.

  • Track changes; index and query changes are made in the lower environments first.

  • DR environment is part of the Production release.

  • If it's in-house, deploy it in different data centers; if it's cloud, deploy it in different regions.

Data Modeling

  • Data Modeling is a process to visualize the data warehouse.

  • It helps to set standards in naming conventions, creating relationships between datasets, and establishing compliance and security.

  • Most Complex phase in data warehouse design.

  • Recall the Top-Down vs. Bottom-Up discussion; that decision plays a vital role here.

  • Data modeling typically starts at the data mart level and then branches out to the data warehouse.

  • Three popular data models for data warehouses

    • Star Schema
    • Galaxy Schema
    • Snowflake Schema

ETL / ELT Solution

ETL

  • Extract
  • Transform
  • Load

ETL plays a vital part in moving data across systems. ETL can be implemented in many ways.

GUI tools such as:

  • SSIS
  • Pentaho
  • Talend

Scripting tools such as Bash and Python can also be used.

ELT

  • Extract
  • Load
  • Transform

In big data platforms such as Hadoop and Spark, you can load a JSON or CSV file and start using it as is. These platforms can even parse compressed .gz / .bz2 files.

Extensively used when dealing with semi-structured data.
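A minimal ELT sketch, assuming a DuckDB-style engine that can query raw files directly (read_csv_auto); the file and column names are hypothetical:

-- Load: ingest the raw file as-is, including a compressed CSV.
CREATE TABLE raw_orders AS
    SELECT * FROM read_csv_auto('raw/orders_2024.csv.gz');

-- Transform: clean and reshape inside the warehouse after loading.
CREATE TABLE orders_clean AS
    SELECT
        CAST(order_id AS INTEGER)           AS order_id,
        upper(trim(country_code))           AS country_code,
        CAST(order_total AS DECIMAL(10, 2)) AS order_total
    FROM raw_orders
    WHERE order_id IS NOT NULL;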

Register with Databricks community edition

Online Analytic Processing

In an OLAP cube, data can be pre-calculated and pre-aggregated, making analysis faster.

Usually, data is organized in row and column format.

OLAP contains multi-dimensional data, with data from different data sources.

Src:https://www.holistics.io

There are 4 types of analytical operations in OLAP

Roll-up: Consolidation, aggregation. Data from different cities can be rolled up to the state/country level.

Drill-down: Opposite of roll-up. If you have data by year, you can analyze monthly, weekly, and daily trends.

Slice-dice: Take one dimension of the data from the cube and create a sub-cube.

If data from various products and quarters is available, take one quarter alone and work with it.

Pivot: Rotating the data axes. Basically swapping the x and y-axis of the data.
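A hedged sketch of roll-up and drill-down expressed as plain SQL aggregations over a hypothetical sales_by_city table:

-- Roll-up: consolidate city-level sales up to the state level.
SELECT state, SUM(sales_amount) AS total_sales
FROM sales_by_city
GROUP BY state;

-- Drill-down: break yearly figures down to the month level.
SELECT year, month, SUM(sales_amount) AS total_sales
FROM sales_by_city
GROUP BY year, month
ORDER BY year, month;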

Front End

Visualization - the primary reason for creating data warehouses.

  • Tableau
  • PowerBI
  • Looker

Check Data Warehouse Tools for more details.

Query Optimization

These are some basic best practices. There is a lot more to discuss in future sessions.

  • Retrieve only necessary rows.

  • Do not use SELECT *; instead, specify the columns you need.

  • Create views to control what data users can pull and to enforce security.

  • Filter first and Join later.

  • Filter first and Group later.

  • Monitor queries regularly for performance.

  • Index only when needed. Please don't index unwanted columns.
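A small sketch of the "filter first, join later" practice, using hypothetical orders and customers tables:

-- Reduce the fact table with a filter before joining, instead of
-- joining full tables and filtering afterwards.
WITH recent_orders AS (
    SELECT order_id, customer_id, order_total
    FROM orders
    WHERE order_date >= DATE '2024-01-01'            -- filter first
)
SELECT c.customer_name, SUM(o.order_total) AS total_spent
FROM recent_orders o
JOIN customers c ON c.customer_id = o.customer_id    -- join later
GROUP BY c.customer_name;                            -- group after filtering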

RDBMS

Data Model

Data Model: Defines how the logical structure of a database is modeled.

Entity: A Database entity is a thing, person, place, unit, object, or any item about which the data should be captured and stored in properties, workflow, and tables.

Attributes: Properties of Entity.

Example:

Entity: Student

Attributes: id, name, join date

Student

| student id | name | join date |
|---|---|---|
| 101 | Rachel Green | 2000-05-01 |
| 201 | Joey Tribianni | 1998-07-05 |
| 301 | Monica Geller | 1999-12-14 |
| 401 | Cosmo Kramer | 2001-06-05 |

Entity Relationship Model

Student

| student id | name | join date |
|---|---|---|
| 101 | Rachel Green | 2000-05-01 |
| 201 | Joey Tribianni | 1998-07-05 |
| 301 | Monica Geller | 1999-12-14 |
| 401 | Cosmo Kramer | 2001-06-05 |

Courses

| student id | semester | course |
|---|---|---|
| 101 | Semester 1 | DBMS |
| 101 | Semester 1 | Calculus |
| 201 | Semester 1 | Algebra |
| 201 | Semester 1 | Web |

One to One Mapping

erDiagram
    Student {
        int student_id PK
        string name
        date join_date
    }
    
    Studentdetails {
        int student_id PK
        string SSN
        date DOB
    }
    
    Student ||--|| Studentdetails : "1 to 1"

Student

| student id | name | join date |
|---|---|---|
| 101 | Rachel Green | 2000-05-01 |
| 201 | Joey Tribianni | 1998-07-05 |
| 301 | Monica Geller | 1999-12-14 |
| 401 | Cosmo Kramer | 2001-06-05 |

Studentdetails

| student id | SSN | DOB |
|---|---|---|
| 101 | 123-56-7890 | 1980-05-01 |
| 201 | 236-56-4586 | 1979-07-05 |
| 301 | 365-45-9875 | 1980-12-14 |
| 401 | 148-89-4758 | 1978-06-05 |

For every row on the left-hand side, there will be only one matching entry on the right-hand side.

For student id 101, you will find one SSN and one DOB.

One to Many Mapping

erDiagram
    Student {
        int student_id PK
        string name
        date join_date
    }
    
    Address {
        int address_id PK
        int student_id FK
        string address
        string address_type
    }
    
    Student ||--o{ Address : "has"

Student

| student id | name | join date |
|---|---|---|
| 101 | Rachel Green | 2000-05-01 |
| 201 | Joey Tribianni | 1998-07-05 |
| 301 | Monica Geller | 1999-12-14 |
| 401 | Cosmo Kramer | 2001-06-05 |

Address

| student id | address id | address | address type |
|---|---|---|---|
| 101 | 1 | 1 main st, NY | Home |
| 101 | 2 | 4 john blvd, NJ | Dorm |
| 301 | 3 | 3 main st, NY | Home |
| 301 | 4 | 5 john blvd, NJ | Dorm |
| 201 | 5 | 12 center st, NY | Home |
| 401 | 6 | 11 pint st, NY | Home |

What do you notice here?

Every row on the left-hand side has one or more rows on the right-hand side.

For student id 101, you will notice the home address and Dorm address.

Many to Many Mapping

erDiagram
    Student {
        int student_id PK
        string name
        date join_date
    }
    
    Course {
        string course_id PK
        string course_name
    }
    
    StudentCourses {
        int student_id FK
        string course_id FK
    }
    
    Student ||--o{ StudentCourses : enrolls
    Course ||--o{ StudentCourses : offered_in

Student

| student id | name | join date |
|---|---|---|
| 101 | Rachel Green | 2000-05-01 |
| 201 | Joey Tribianni | 1998-07-05 |
| 301 | Monica Geller | 1999-12-14 |
| 401 | Cosmo Kramer | 2001-06-05 |

Student Courses

| student id | course id |
|---|---|
| 101 | c1 |
| 101 | c2 |
| 301 | c1 |
| 301 | c3 |
| 201 | c3 |
| 401 | c4 |

Courses

| course id | course name |
|---|---|
| c1 | DataBase |
| c2 | Web Programming |
| c3 | Big Data |
| c4 | Data Warehouse |

What do you notice here?

Students can take more than one course, and courses can have more than one student.
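A minimal sketch of resolving this many-to-many relationship through the bridge table, using the entity and column names from the diagram above:

-- Each student is paired with every course they enrolled in.
SELECT s.name, c.course_name
FROM Student s
JOIN StudentCourses sc ON sc.student_id = s.student_id
JOIN Course c          ON c.course_id   = sc.course_id
ORDER BY s.name;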

Attributes

Types of Attributes

Simple Attribute - Atomic Values.

An attribute that cannot be divided further.

Example: ssn

Composite Attribute

Made up of more than one simple attribute.

Example: firstname + mi + lastname

Derived Attribute

Calculated from existing attributes.

Age: Derived from DOB
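A small sketch of computing a derived attribute, assuming a hypothetical students table with a dob column (date_diff as in DuckDB):

-- Age is not stored; it is derived from DOB at query time.
SELECT name,
       dob,
       date_diff('year', dob, current_date) AS age   -- rough age in years
FROM students;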

Multivalued Attribute

An attribute that can have multiple values for a single entity (e.g., PhoneNumbers for a person).

src: https://www.tutorialspoint.com/dbms/er_diagram_representation.htm

Student: Entity
Name: Composite Attribute
Student ID: Simple Attribute
Age: Derived Attribute
PhoneNo: MultiValued Attribute
(The Student can have more than one phone number)

Keys

Primary Key

The Attribute helps to identify a row in an entity uniquely.

It cannot be empty and cannot have duplicates.

| student id | name | join date |
|---|---|---|
| 101 | Rachel Green | 2020-05-01 |
| 201 | Joey Tribianni | 1998-07-05 |
| 301 | Monica Geller | 1999-12-14 |
| 401 | Cosmo Kramer | 2001-06-05 |

Based on the above data, which attribute can be the PK?

Any column (as it's just four rows).

If we extend the dataset to 10,000 rows, which column can be the PK?

The student id.

Composite Key

A Primary key that consists of two or more attributes is known as a composite key.

| student id | course id | name |
|---|---|---|
| 101 | C1 | Rachel Green |
| 101 | C2 | Rachel Green |
| 201 | C2 | Monica Geller |
| 201 | C3 | Cosmo Kramer |

Do you know how to find the unique row?

The combination of StudentID + CourseID makes it unique.
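A minimal sketch of declaring such a composite key, with a hypothetical student_courses table:

CREATE TABLE student_courses (
    student_id INTEGER,
    course_id  VARCHAR,
    name       VARCHAR,
    PRIMARY KEY (student_id, course_id)   -- the two columns together identify a row
);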

Unique Key

Unique keys are similar to a primary key, except that they can allow a NULL (the exact number of NULLs allowed varies by DBMS).

Transaction

What is a Transaction?

A transaction can be defined as a group of tasks.

A simple task is the minimum processing unit that cannot be divided further.

Example: Transfer $500 from X account to Y account.

Open_Account(X) 
Old_Balance = X.balance 
New_Balance = Old_Balance - 500 
X.balance = New_Balance 
Close_Account(X)

Open_Account(Y) 
Old_Balance = Y.balance 
New_Balance = Old_Balance + 500 
Y.balance = New_Balance 
Close_Account(Y)
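The same transfer expressed as one atomic SQL transaction; the accounts table and column names are hypothetical:

BEGIN TRANSACTION;
UPDATE accounts SET balance = balance - 500 WHERE account_id = 'X';
UPDATE accounts SET balance = balance + 500 WHERE account_id = 'Y';
COMMIT;   -- both updates succeed together; a ROLLBACK would undo both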

States of Transaction

src: https://www.tutorialspoint.com/dbms/dbms_transaction.htm

src: www.guru99.com

  • Active: When the transaction's instructions are running, the transaction is active.
  • Partially Committed: After completing all the read and write operations, the changes are made in the main memory or local buffer.
  • Committed: It is the state in which the changes are made permanent on the DataBase.
  • Failed: When any instruction of the transaction fails.
  • Aborted: After a failure, the changes (which were made only to the local buffer or main memory) are discarded and the transaction is rolled back.

ACID

ACID Compliant

Atomicity: All or none. Changes all or none.

Consistency: Never leaves the database in a half-finished state. Deleting a customer is not possible before deleting related invoices.

Isolation: Separates a transaction from another. Transactions are isolated; they occur independently without interference. One change will not be visible to another transaction.

Durability: Recovery from an abnormal termination. Once the transaction is completed, data is written to disk and persists even if the system fails. The DB returns to a consistent state when restarted after an abnormal termination.

Example

Atomicity

Example: Transferring money between two bank accounts. If you transfer $100 from Account A to Account B, the transaction must ensure that either the debit from Account A and credit to Account B occur or neither occurs. If any error happens during the transaction (e.g., system crash), the transaction will be rolled back, ensuring that no partial transaction is completed.

Consistency

Example: Enforcing database constraints, such as ensuring that a field that stores a percentage can only hold values between 0 and 100. If an operation tries to insert a value of 110, it will fail because it violates the consistency rule of the database schema.

Isolation:

Example: Two transactions that concurrently update the same set of rows in a database. Transaction A updates a row; before it commits, Transaction B also tries to update the same row. Depending on the isolation level (e.g., Serializable), Transaction B may be required to wait until Transaction A commits or rolls back to avoid data inconsistencies, ensuring that transactions do not interfere with each other.

Durability:

Example: After a transaction is committed, such as a user adding an item to a shopping cart in an e-commerce application, the data must be permanently saved to the database. Even if the server crashes immediately after the commit, the added item must remain in the shopping cart once the system is back online.

Online/Realtime vs Batch

Online Processing

An online system handles transactions when they occur and provides output directly to users. It is interactive.

Use Cases

E-commerce Websites:

Use Case: Processing customer orders. Example: When a customer places an order, the system immediately updates inventory, processes the payment, and provides an order confirmation in real time. Any delay could lead to issues like overselling stock or customer dissatisfaction.

Online Banking:

Use Case: Fund transfers and account balance updates. Example: When a user transfers money between accounts, the transaction is processed immediately, and the balance is updated in real-time. Real-time processing is crucial to ensure the funds are available immediately and that the account balance reflects the latest transactions.

Social Media Platforms:

Use Case: Posting updates and notifications. Example: When a user posts a new status or comment, it should be instantly visible to their followers. Notifications about likes, comments, or messages are also delivered in real time to maintain user engagement.

Ride-Sharing Services (e.g., Uber, Lyft):

Use Case: Matching drivers with passengers. Example: When a user requests a ride, the system matches them with a nearby driver in real time, providing immediate feedback on the driver's location and estimated arrival time.

Fraud Detection Systems:

Use Case: Monitoring transactions for fraudulent activities. Example: Credit card transactions are monitored in real time to detect unusual patterns and prevent fraud before they are completed.

Batch Processing

Data is processed in groups or batches. Batch processing is typically used for large amounts of data that must be processed on a routine schedule, such as paychecks or credit card transactions.

A batch processing system has several main characteristics: collect, group, and process transactions periodically.

Batch programs require no user involvement and require significantly fewer network resources than online systems.

Use Cases

End-of-Day Financial Processing:

Use Case: Reconciling daily transactions. Example: Banks often batch process all the day's transactions at the end of the business day to reconcile accounts, generate statements, and update records. This processing doesn't need to be real-time but must be accurate and comprehensive.

Data Warehousing and ETL Processes:

Use Case: Extracting, transforming, and loading data into a data warehouse. Example: A retail company may extract sales data from various stores, transform it to match the warehouse schema, and load it into a centralized data warehouse. This process is typically done in batches overnight to prepare the data for reporting and analysis the next day.

Payroll Processing:

Use Case: Calculating employee salaries. Example: Payroll systems typically calculate and process salaries in batches, often once per pay period. Employee data (hours worked, overtime, deductions) is collected over the period and processed in a single batch job.

Inventory Updates:

Use Case: Updating inventory levels across multiple locations. Example: A chain of retail stores might batch-process inventory updates at the end of each day. Each store's sales data is sent to a central system, where inventory levels are adjusted in batches to reflect the day's sales.

Billing Systems:

Use Case: Generating customer bills. Example: Utility companies often generate customer bills at the end of each month. Usage data is collected throughout the month, and bills are generated in batch processing jobs, which are then sent to customers.

DSL vs GPL

GPL - General-Purpose Programming Language. Python / Java / C++

One tool can be used to do many things. 

DSL - Domain-Specific Language. HTML / SQL / JQ / AWK /...

Specific tools to do a specific job.

src: https://tomassetti.me/domain-specific-languages/

Storage Formats

| Account number | Last name | First name | Purchase (in dollars) |
|---|---|---|---|
| 1001 | Green | Rachel | 20.12 |
| 1002 | Geller | Ross | 12.25 |
| 1003 | Bing | Chandler | 45.25 |

Row Oriented Storage

In a row-oriented DBMS, the data would be stored as

1001,Green,Rachel,20.12;1002,Geller,Ross,12.25;1003,Bing,Chandler,45.25

Best suited for OLTP - Transaction data.

Columnar Oriented Storage

1001,1002,1003;Green,Geller,Bing;Rachel,Ross,Chandler;20.12,12.25,45.25

Best suited for OLAP - Analytical data.

  1. Compression: Since the data in a column tends to be of the same type (e.g., all integers, all strings), and often similar values, it can be compressed much more effectively than row-based data.

  2. Query Performance: Queries that only access a subset of columns can read just the data they need, reducing disk I/O and significantly speeding up query execution.

  3. Analytic Processing: Columnar storage is well-suited for analytical queries and data warehousing, which often involve complex calculations over large amounts of data. Since these queries often only affect a subset of the columns in a table, columnar storage can lead to significant performance improvements.

src: https://mariadb.com/resources/blog/why-is-columnstore-important/
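A hedged DuckDB-style sketch of this benefit, assuming the table above has been saved as a hypothetical sales.parquet file: only the referenced columns are read from disk, not every full row.

-- Only the last_name and purchase columns are scanned.
SELECT last_name, SUM(purchase) AS total_purchases
FROM read_parquet('sales.parquet')
GROUP BY last_name;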

File Formats

CSV/TSV

Pros

  • Tabular Row storage.
  • Human-readable is easy to edit manually.
  • Simple schema.
  • Easy to implement and parse the file(s).

Cons

  • No standard way to present binary data.
  • No complex data types.
  • Large in size.

JSON

Pros

  • Supports hierarchical structure.
  • Most languages support them.
  • Widely used in Web

Cons

  • More memory usage due to repeatable column names.
  • Not very splittable.
  • Lacks indexing.

Parquet

Parquet is a columnar storage file format optimized for use with Apache Hadoop and related big data processing frameworks. Twitter and Cloudera developed it to provide a compact and efficient way of storing large, flat datasets.

Best for WORM (Write Once Read Many).

The key features of Parquet are:

  1. Columnar Storage: Parquet is optimized for columnar storage, unlike row-based files like CSV or TSV. This allows it to efficiently compress and encode data, which makes it a good fit for storing data frames.
  2. Schema Evolution: Parquet supports complex nested data structures, and the schema can be modified over time. This provides much flexibility when dealing with data that may evolve.
  3. Compression and Encoding: Parquet allows for highly efficient compression and encoding schemes. This is because columnar storage makes better compression and encoding schemes possible, which can lead to significant storage savings.
  4. Language Agnostic: Parquet is built from the ground up for use in many languages. Official libraries are available for reading and writing Parquet files in many languages, including Java, C++, Python, and more.
  5. Integration: Parquet is designed to integrate well with various big data frameworks. It has deep support in Apache Hadoop, Apache Spark, and Apache Hive and works well with other data processing frameworks.

In short, Parquet is a powerful tool in the big data ecosystem due to its efficiency, flexibility, and compatibility with a wide range of tools and languages.

Difference between CSV and Parquet

| Aspect | CSV (Comma-Separated Values) | Parquet |
|---|---|---|
| Data Format | Text-based, plain text | Columnar, binary format |
| Compression | Usually uncompressed (or lightly compressed) | Highly compressed |
| Schema | None, schema-less | Strong schema enforcement |
| Read/Write Efficiency | Row-based, less efficient for column operations | Column-based, efficient for analytics |
| File Size | Generally larger | Typically smaller due to compression |
| Storage | More storage space required | Less storage space required |
| Data Access | Good for sequential access | Efficient for accessing specific columns |
| Example Size (1 GB) | Could be around 1 GB or more depending on compression | Could be 200-300 MB (due to compression) |
| Use Cases | Simple data exchange, compatibility | Big data analytics, data warehousing |
| Support for Data Types | Limited to text, numbers | Rich data types (int, float, string, etc.) |
| Processing Speed | Slower for large datasets, particularly for queries on specific columns | Faster, especially for column-based queries |
| Tool Compatibility | Supported by most tools, databases, and programming languages | Supported by big data tools like Apache Spark, Hadoop, etc. |

Parquet Compression

  • Snappy (default)
  • Gzip

Snappy

  • Low CPU Util
  • Low Compression Rate
  • Splittable
  • Use Case: Hot Layer
  • Compute Intensive

GZip

  • High CPU Util
  • High Compression Rate
  • Splittable
  • Use Case: Cold Layer
  • Storage Intensive
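A hedged DuckDB sketch of choosing a Parquet codec at write time; the table and file names are hypothetical, and the COMPRESSION option follows DuckDB's COPY syntax:

-- Snappy: faster, lighter compression (hot data).
COPY sales TO 'sales_snappy.parquet' (FORMAT PARQUET, COMPRESSION 'snappy');

-- Gzip: slower, tighter compression (cold data).
COPY sales TO 'sales_gzip.parquet' (FORMAT PARQUET, COMPRESSION 'gzip');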

DuckDB

DuckDB is an in-process SQL database management system designed for efficient analytical query processing.

It's a single-file DB; there is no need for a client/server setup.

It's like SQLite on steroids.

Install Duck DB

https://duckdb.org/#quickinstall

Key Features

In-Process Execution: DuckDB operates entirely within your application process, eliminating the need for a client-server setup. This makes it highly efficient for embedded analytics.

Columnar Storage: DuckDB uses a columnar storage format, unlike traditional row-based databases. This allows it to handle analytical queries more efficiently, as operations can simultaneously be performed on entire columns.

SQL Compatibility: DuckDB supports a wide range of SQL features, so it is easy to pick up for anyone familiar with SQL. It can handle complex queries, joins, and aggregations without requiring specialized query syntax.

Integration and Ease of Use: DuckDB is designed to integrate easily into various environments. It supports multiple programming languages, including Python, R, and C++, and can be used directly within data science workflows.

Scalability: While optimized for in-process execution, DuckDB can scale to handle large datasets, making it a solid choice for lightweight analytics and exploratory data analysis (EDA).

Extensibility: The database supports user-defined functions (UDFs) and extensions, allowing you to extend its capabilities to fit your needs.

Limitations

  • Limited Transaction Support
  • Concurrency Limitations
  • No Client-Server Model
  • Lack of Advanced RDBMS Features
  • Not Designed for Massive Data Warehousing

Duckdb Sample - 01

Launch Duckdb

duckdb
.open newdbname.duckdb
CREATE TABLE employees (
    id INTEGER PRIMARY KEY,
    name VARCHAR NOT NULL,
    email VARCHAR UNIQUE,
    department VARCHAR,
    salary DECIMAL(10, 2) CHECK (salary > 0)
);

INSERT INTO employees (id,name, email, department, salary) VALUES
(1,'Rachel Green', 'rachel.green@friends.com', 'Fashion', 55000.00);

INSERT INTO employees (id,name, email, department, salary) VALUES
(2,'Monica Geller', 'monica.geller@friends.com', 'Culinary', 62000.50);

INSERT INTO employees (id,name, email, department, salary) VALUES
(3,'Phoebe Buffay', 'phoebe.buffay@friends.com', 'Massage Therapy', 48000.00);

INSERT INTO employees (id,name, email, department, salary) VALUES
(4,'Joey Tribbiani', 'joey.tribbiani@friends.com', 'Acting', 40000.75);

INSERT INTO employees (id,name, email, department, salary) VALUES
(5,'Chandler Bing', 'chandler.bing@friends.com', 'Data Analysis', 70000.25);

INSERT INTO employees (id,name, email, department, salary) VALUES
(6,'Ross Geller', 'ross.geller@friends.com', 'Paleontology', 65000.00);
select * from employees;
  • .databases or show databases;
  • .tables or show tables;
  • .mode list
  • .mode table
  • .mode csv
  • summarize employees;
  • summarize select name,email from employees;
  • .exit

CREATE SEQUENCE dept_id_seq;

drop table if exists departments;

CREATE TABLE departments (
    id INTEGER PRIMARY KEY DEFAULT nextval('dept_id_seq'),
    dept_name VARCHAR NOT NULL
);

insert into departments(dept_name) values('Fashion');
insert into departments(dept_name) values('Culinary');
insert into departments(dept_name) values('Massage');
insert into departments(dept_name) values('Acting');
insert into departments(dept_name) values('Data Analysis');
insert into departments(dept_name) values('Paleontology');


CREATE SEQUENCE emp_id_seq;

DROP TABLE if exists employees;

CREATE TABLE if not exists  employees (
    id INTEGER PRIMARY KEY DEFAULT nextval('emp_id_seq'),
    name VARCHAR NOT NULL,
    email VARCHAR UNIQUE,
    department VARCHAR,
    salary DECIMAL(10, 2) CHECK (salary > 0),
    hire_date DATE DEFAULT CURRENT_DATE,
);


INSERT INTO employees (name, email, department, salary) VALUES
('Rachel Green', 'rachel.green@friends.com', 'Fashion', 55000.00);

INSERT INTO employees (name, email, department, salary) VALUES
('Monica Geller', 'monica.geller@friends.com', 'Culinary', 62000.50);

INSERT INTO employees (name, email, department, salary) VALUES
('Phoebe Buffay', 'phoebe.buffay@friends.com', 'Massage Therapy', 48000.00);

INSERT INTO employees (name, email, department, salary) VALUES
('Joey Tribbiani', 'joey.tribbiani@friends.com', 'Acting', 40000.75);

INSERT INTO employees (name, email, department, salary) VALUES
('Chandler Bing', 'chandler.bing@friends.com', 'Data Analysis', 70000.25);

INSERT INTO employees (name, email, department, salary) VALUES
('Ross Geller', 'ross.geller@friends.com', 'Paleontology', 65000.00);


-- Note: inserting an explicit id (8) bypasses the sequence; when the sequence
-- later generates 8, that insert will fail with a primary key conflict.
INSERT INTO employees (id,name, email, department, salary) VALUES
(8,'Ben Geller', 'ben.geller@friends.com', 'Student', 1.00);

INSERT INTO employees (name, email, department, salary) VALUES
('Emma Green', 'emma.green@friends.com', 'Kid', 1.00);


CREATE SEQUENCE emp_dept_id_seq;

CREATE TABLE emp_dept (
	id INTEGER PRIMARY KEY DEFAULT nextval('emp_dept_id_seq'),
    dept_id INTEGER NOT NULL,
    emp_id INTEGER NOT NULL,
    FOREIGN KEY (dept_id) REFERENCES departments(id),
    FOREIGN KEY (emp_id) REFERENCES employees(id)
);

INSERT INTO emp_dept(emp_id,dept_id) VALUES (1,1),(2,2),(3,3),(4,4),(5,5),(6,6);

-- Expected to fail: there is no employee with id 9 and no department with id 7,
-- so the FOREIGN KEY constraints reject this row.
INSERT INTO emp_dept(emp_id,dept_id) VALUES (9,7);


CREATE SEQUENCE product_id_seq;

CREATE TABLE products (
    id INTEGER PRIMARY KEY DEFAULT nextval('product_id_seq'),
    name VARCHAR NOT NULL,
    price DECIMAL(10, 2),
    discounted_price DECIMAL(10, 2) GENERATED ALWAYS AS (price * 0.9) VIRTUAL
);


CREATE SEQUENCE orders_id_seq;

CREATE TABLE orders (
    id INTEGER PRIMARY KEY DEFAULT nextval('orders_id_seq'),
    customer_name VARCHAR,
    items INTEGER[],
    shipping_address STRUCT(
        street VARCHAR,
        city VARCHAR,
        zip VARCHAR
    )
);


INSERT INTO orders (customer_name, items, shipping_address) VALUES
('John Doe', [1, 2, 3], {'street': '123 Elm St', 'city': 'Springfield', 'zip': '11111'}),
('Jane Smith', [3, 4, 5], {'street': '456 Oak St', 'city': 'Greenville', 'zip': '22222'}),
('Emily Johnson', [6, 7, 8, 9], {'street': '789 Pine St', 'city': 'Fairview', 'zip': '33333'});

Query Orders


select * from orders;

select id,customer_name,items,shipping_address.city from orders;

select id,customer_name,items,shipping_address['city'] from orders;

select id,customer_name,shipping_address.* from orders;

-- Array

select id,customer_name,len(items) from orders;

select id,customer_name,list_contains(items,3) from orders;

select id,customer_name,list_distinct(items) from orders;

select id,customer_name,unnest(items),shipping_address.* from orders;

More Functions

https://duckdb.org/docs/sql/functions/list


select current_catalog(), current_schema();

-- current_catalog(): returns the name of the currently active catalog. Default is memory.
-- current_schema(): returns the name of the currently active schema. Default is main.

Query Github Directly

select * from read_parquet('https://github.com/duckdb/duckdb-data/releases/download/v1.0/userdata1.parquet');
select * from read_parquet('https://github.com/duckdb/duckdb-data/releases/download/v1.0/orders.parquet') limit 5;
select * from read_parquet('https://github.com/duckdb/duckdb-data/releases/download/v1.0/city_temperature.parquet') limit 5;

Create DuckDB table

create table orders as select * from read_parquet('https://github.com/duckdb/duckdb-data/releases/download/v1.0/orders.parquet');

describe table

describe orders;
select * from read_json('https://github.com/duckdb/duckdb-data/releases/download/v1.0/canada.json') limit 5;

DuckDB Arrays, CTE & DATE Dimension

select generate_series(1,100,12) as num;
select generate_series(DATE '2024-09-01', DATE '2024-09-10', INTERVAL 1 DAY) as dt;
with date_cte as(
	select strftime(unnest(generate_series(DATE '2024-09-01', DATE '2024-09-10', INTERVAL 1 DAY)),'%Y-%m-%d') as dt
)
SELECT dt FROM date_cte;
WITH date_cte AS (
    SELECT unnest(generate_series(DATE '2024-09-01', DATE '2024-09-10', INTERVAL 1 DAY)) AS dt
)
SELECT 
  REPLACE(CAST(CAST(dt AS DATE) AS VARCHAR), '-', '') AS DateKey
  ,CAST(dt AS DATE) AS ShortDate
  ,DAYNAME(dt) AS DayOfWeek
  ,CASE 
    WHEN EXTRACT(DOW FROM dt) IN (0, 6) THEN 'Weekend'
    ELSE 'Weekday'
  END AS IsWeekend
  ,CAST(CEIL(EXTRACT(DAY FROM dt) / 7.0) AS INTEGER) AS WeekOfMonth
  ,EXTRACT(WEEK FROM dt) AS WeekOfYear
  ,'Q' || CAST(EXTRACT(QUARTER FROM dt) AS VARCHAR) AS Quarter
  ,CASE 
        WHEN strftime(dt, '%m') IN ('01', '02', '03') THEN 'Q1'
        WHEN strftime(dt, '%m') IN ('04', '05', '06') THEN 'Q2'
        WHEN strftime(dt, '%m') IN ('07', '08', '09') THEN 'Q3'
        ELSE 'Q4'
    END AS Quarter_alternate
  ,EXTRACT(YEAR FROM dt) AS Year
  ,EXTRACT(MONTH FROM dt) AS Month
  ,EXTRACT(Day FROM dt) AS Day

FROM date_cte
ORDER BY dt;

Practice using SQLBolt

https://sqlbolt.com/lesson/select_queries_introduction

The answer Key (if needed)

SQL Bolt Solutions

Cloud

Overview

Definitions

Hardware: physical computers / equipment / devices

Software: programs such as operating systems, Word, Excel

Web Site: read-only web pages such as company pages, portfolios, newspapers

Web Application: read-write pages such as online forms, Google Docs, email, Google apps

Cloud Plays a significant role in the Big Data world.

In today's market, Cloud helps companies to accommodate the ever-increasing volume, variety, and velocity of data.

Cloud computing is the on-demand delivery of IT resources over the Internet with pay-per-use pricing.

Src : https://thinkingispower.com/the-blind-men-and-the-elephant-is-perception-reality/

Without cloud knowledge, your understanding of Big Data will be like the blind men and the elephant in the picture above.

  1. Volume: Size of the data.
  2. Velocity: Speed at which new data is generated.
  3. Variety: Different types of data.
  4. Veracity: Trustworthiness of the data.
  5. Value: Usefulness of the data.
  6. Vulnerability: Security and privacy aspects.

When people focus on only one aspect without the help of cloud technologies, they miss out on the comprehensive picture. Cloud solutions offer ways to manage all these dimensions in an integrated manner, thus providing a fuller understanding and utilization of Big Data.

Advantages of Cloud Computing for Big Data

  • Cost Savings
  • Security
  • Flexibility
  • Mobility
  • Insight
  • Increased Collaboration
  • Quality Control
  • Disaster Recovery
  • Loss Prevention
  • Automatic Software Updates
  • Competitive Edge
  • Sustainability

Types of Cloud Computing

Public Cloud

Owned and operated by third-party providers. (AWS, Azure, GCP, Heroku, and a few more)

Private Cloud

Cloud computing resources are used exclusively by a single business or organization.

Hybrid

Public + Private: By allowing data and applications to move between private and public clouds, a hybrid cloud gives your business greater flexibility and more deployment options, and helps optimize your existing infrastructure, security, and compliance.

Types of Cloud Services

SaaS

Software as a Service

Cloud-based service providers offer end-user applications. Google Apps, DropBox, Slack, etc.

  • Web access to software (primarily commercial).
  • Software is managed from a central location.
  • One-to-many delivery model.
  • No patches or upgrades for users to manage.

When not to use

  • Hardware integration is needed (e.g., a price scanner).
  • Faster processing is required.
  • Data cannot be hosted off-premises.

PaaS

Platform as a Service

Software tools are available over the internet. AWS RDS, Heroku, Salesforce

  • Scalable.
  • Built on virtualization technology.
  • No software maintenance needed by users (DB upgrades and patches are handled by the cloud team).

When not to use PaaS

  • Proprietary tools make it hard to move to different providers (e.g., AWS-specific tools).
  • Using new software that is not part of the PaaS toolset.

IaaS

Infrastructure as a Service

Cloud-based hardware services. Pay-as-you-go services for Storage, Networking, and Servers.

Amazon EC2, Google Compute Engine, S3.

  • Highly flexible and scalable.
  • Accessible by more than one user.
  • Cost-effective (if used right).

Serverless computing

Focuses on building apps without spending time managing servers/infrastructure.

Features automatic scaling, built-in high availability, and pay-per-use pricing.

Resources are used only when a specific function or event occurs.

Cloud providers handle deployment, capacity, and server management.

Example: AWS Lambda, AWS Step Functions.

Easy way to remember SaaS, PaaS, IaaS

bigcommerce.com

Challenges of Cloud Computing

Privacy: "Both traditional and Big Data sets often contain sensitive information, such as addresses, credit card details, or social security numbers."

So, it's the responsibility of users to ensure proper security methods are followed.

Compliance: Cloud providers replicate data across regions to ensure safety. This can become a problem if a company's regulations require that data not be stored outside the organization or not be stored in a specific part of the world.

Data Availability: Everything is dependent on the Internet and speed. It is also dependent on the choice of the cloud provider. Big companies like AWS / GCP / Azure have more data centers and backup facilities.

Connectivity: Internet availability + speed.

Vendor lock-in: Once an organization has migrated its data and applications to the cloud, switching to a different provider can be difficult and expensive. This is known as vendor lock-in. Some cloud-agnostic tools, like Databricks, help enterprises mitigate this problem, but it is still a challenge.

Cost: Cloud computing can be a cost-effective way to deploy and manage IT resources. However, it is essential to carefully consider your needs and budget before choosing a cloud provider.

Continuous Training: Employees may need to be trained to use cloud-based applications. This can be a cost and time investment.

Constant Change in Technology: Cloud providers constantly improve or change their technology. Recently, Microsoft decided to decommission Synapse and launch a new tool called Fabric.

AWS

Terms to Know

Elasticity: The ability to acquire resources as you need them and release resources when you no longer need them.

Scale Up vs. Scale Down (vertical scaling: a bigger or smaller machine)

Scale Out vs. Scale In (horizontal scaling: more or fewer machines)

Latency

Typically, latency is a measure of the round-trip time between two systems, i.e., how long it takes data to travel between them and back.

Root User

Owner of the AWS account.

IAM

Identity and Access Management

ARN

Amazon Resource Name

For example

arn:aws:iam::123456789012:user/Development/product_1234/*

Policy

Rules

Amazon EC2

Allows you to deploy virtual servers within your AWS environment.

Amazon VPC

An isolated segment of the AWS cloud accessible by your own AWS account.

Amazon S3

A fully managed, object-based storage service that is highly available, highly durable, cost-effective, and widely accessible.

AWS IAM (Identity and Access Management)

Used to manage permissions to your AWS resources

AWS Management Services

Amazon EC2 Auto Scaling

Automatically increases or decreases your EC2 resources to meet the demand based on custom-defined metrics and thresholds.

Amazon CloudWatch

A comprehensive monitoring tool that allows you to monitor your services and applications in the cloud.

Elastic Load Balancing

Used to manage and control the flow of inbound requests to a group of targets by distributing these requests evenly across a targeted resource group.

Billing & Budgeting

Helps control the cost.

AWS Global Infrastructure

The two primary items are given below.

  • Availability Zones
  • Regions

Availability Zones (AZs)

AZs are the physical data centers of AWS.

This is where the actual computing, storage, network, and database resources are hosted that we as consumers, provision within our Virtual Private Clouds (VPCs).

A common misconception is that a single availability zone equals a single data center. Multiple data centers located closely form a single availability zone.

Each AZ has at least one other AZ in the same geographical area. Each AZ is isolated from the others with separate power and networking, much like a DR (disaster recovery) setup.

Many AWS services use low-latency links between AZs to replicate data for high availability and resilience purposes.

A group of multiple AZs makes up an AWS Region. (Example: Virginia)

Regions

Every Region will act independently of the others, containing at least two Availability Zones.

Note that not all AWS services are available in every region.

  • US East (N. Virginia) us-east-1
  • US East (Ohio) us-east-2
  • EU (Ireland) eu-west-1
  • EU (Frankfurt) eu-central-1

Note: At the time of writing, AWS was available in 27 regions with 87 AZs; check the link below for current numbers.

https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Concepts.RegionsAndAvailabilityZones.html

EC2

(Elastic Compute Cloud)

Compute: Closely related to CPU/RAM

Elastic Compute Cloud (EC2): AWS EC2 provides resizable compute capacity in the cloud, allowing you to run virtual servers as per your needs.

Instance Types: EC2 offers various instance types optimized for different use cases, such as general purpose, compute-optimized, memory-optimized, and GPU instances.

Pricing Models

On-Demand: Pay for computing capacity by the hour or second.

Reserved: Commit to a one- or three-year term and get a discount.

Spot: Bid for unused EC2 capacity at a reduced cost.

Savings Plans: Commit to consistent compute usage for lower prices.

AMI (Amazon Machine Image): Pre-configured templates for your EC2 instances, including the operating system, application server, and applications.

Security

Security Groups: Act as a virtual firewall for your instances to control inbound and outbound traffic.

Key Pairs: These are used to access your EC2 instances via SSH or RDP securely.

Elastic IPs: These are static IP addresses that can be associated with EC2 instances. They are useful for hosting services that require a consistent IP.

Auto Scaling: Automatically adjusts the number of EC2 instances in response to changing demand, ensuring you only pay for what you need.

Elastic Load Balancing (ELB): Distributes incoming traffic across multiple EC2 instances, improving fault tolerance and availability.

EBS (Elastic Block Store): Provides persistent block storage volumes for EC2 instances, allowing data to be stored even after an instance is terminated.

Regions and Availability Zones: EC2 instances can be deployed in various geographic regions, each with multiple availability zones for high availability and fault tolerance.

Storage

Persistent Storage

  • Elastic Block Store (EBS) volumes, logically attached over the AWS network.
  • Automatically replicated.
  • Encryption is available.

Ephemeral Storage - Local storage

  • Physically attached to the underlying host.
  • When the instance is stopped or terminated, all the data is lost.
  • Rebooting will keep the data intact.

DEMO - Deploy EC2

S3

(Simple Storage Service)

It's an IaaS service.

  • Highly Available
  • Durable
  • Cost Effective
  • Widely Accessible
  • Uptime of 99.99%
  1. Objects and Buckets: The fundamental elements of Amazon S3 are objects and buckets. Objects are the individual data pieces stored in Amazon S3, while buckets are containers for these objects. An object consists of a file and, optionally, any metadata that describes that file.

    • It's also a regional service, meaning that when you create a bucket, you specify a region, and all objects are stored there.

    • Globally Unique: The name of an Amazon S3 bucket must be unique across all of Amazon S3, that is, across all AWS customers. It's like a domain name.

    • Globally Accessible: Even though you specify a particular region when you create a bucket, once the bucket is created, you can access it from anywhere in the world using the appropriate URL.

  2. Scalability: Amazon S3 can scale in terms of storage, request rate, and users to support unlimited web-scale applications.

  3. Security: Amazon S3 includes several robust security features, such as encryption for data at rest and in transit, access controls like Identity and Access Management (IAM) policies, bucket policies, and Access Control Lists (ACLs), and features for monitoring and logging activity, like AWS CloudTrail.

  4. Data transfer: Amazon S3 supports transfer acceleration, which speeds up uploads and downloads of large objects.

  5. Event Notification: S3 can notify you of specific events in your bucket. For instance, you could set up a notification to alert you when an object is deleted from your bucket.

  6. Management Features: S3 has a suite of features to help manage your data, including lifecycle management, which allows you to define rules for moving or expiring objects, versioning to keep multiple versions of an object in the same bucket, and analytics for understanding and optimizing storage costs.

  7. Consistency: Amazon S3 now provides strong read-after-write consistency for all PUT and DELETE operations (since December 2020). The original model, described below, offered read-after-write consistency for PUTS of new objects and eventual consistency for overwrite PUTS and DELETES.

    • Read-after-write Consistency for PUTS of New Objects: When a new object is uploaded (PUT) into an Amazon S3 bucket, it's immediately accessible for read (GET) operations. This is known as read-after-write consistency. You can immediately retrieve a new object as soon as you create it. This applies across all regions in AWS, and it's crucial when immediate, accurate data retrieval is required.

    • Eventual Consistency for Overwrite PUTS and DELETES: Overwrite PUTS and DELETES refer to operations where an existing object is updated (an overwrite PUT) or removed (a DELETE). For these operations, Amazon S3 provides eventual consistency. If you update or delete an object and immediately attempt to read or delete it, you might still get the old version or find it there (in the case of a DELETE) for a short period. This state of affairs is temporary, and shortly after the update or deletion, you'll see the new version or find the object gone, as expected.

Src: Mailbox

Src: USPS

Notes

Data is stored as an "Object."

Object storage, also known as object-based storage, manages data as objects. Each object includes the data, associated metadata, and a globally unique identifier.

Unlike file storage, there are no folders or directories in object storage. Instead, objects are organized into a flat address space, called a bucket in Amazon S3's terminology.

The unique identifier allows an object to be retrieved without needing to know the physical location of the data. Metadata can be customized, making object storage incredibly flexible.

Every object gets a UID (unique identifier) and associated metadata.

No Folders / SubFolders

For example, if you have an object with the key images/summer/beach.png in your bucket, Amazon S3 has no internal concept of the images or summer as separate entities—it simply sees the entire string images/summer/beach.png as the key for that object.
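A minimal boto3 sketch illustrating the flat key space (the bucket name is hypothetical, and credentials are assumed to be configured already):

import boto3

s3 = boto3.client("s3")
bucket = "chandr34-demo-bucket"  # hypothetical; must already exist and be globally unique

# The whole string is the key; "images/summer/" is not a real folder.
s3.put_object(Bucket=bucket, Key="images/summer/beach.png", Body=b"fake image bytes")

# Listing by prefix simply filters on the leading characters of the key.
resp = s3.list_objects_v2(Bucket=bucket, Prefix="images/summer/")
for obj in resp.get("Contents", []):
    print(obj["Key"])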

To store objects in S3, you must first define and create a bucket.

You can think of a bucket as a container for your data.

This bucket name must be unique, not just within the region you specify, but globally against all other S3 buckets, of which there are many millions.

Any object uploaded to your buckets is given a unique object key to identify it.

  • S3 bucket ownership is not transferable.
  • S3 bucket names must start with a lowercase letter or a number; hyphens (-) are allowed in between.
  • By default, an AWS account can have up to 100 buckets (a soft limit that can be raised).

More details

https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-keys.html

Other types of Storage are

File Storage

Block Storage


IAM

Identity and Access Management

src: Aws

ARN: Amazon Resource Name

Users - Individual Person / Application

Groups - Collection of IAM Users

Policies - Policy sets permission/control access to AWS resources. Policies are stored in AWS as JSON documents.

A Policy can be attached to multiple entities (users, groups, and roles) in your AWS account.

Multiple Policies can be created and attached to the user.

Roles - A set of permissions that defines which actions are allowed and denied in AWS. Similar to a user, but a role can be assumed by any type of entity (a user, an application, or an AWS service).

// Examples of ARNs

arn:aws:s3:::my_corporate_bucket/*

arn:aws:s3:::my_corporate_bucket/Development/*

arn:aws:iam::123456789012:user/chandr34

arn:aws:iam::123456789012:group/bigdataclass

arn:aws:iam::123456789012:group/*

Types of Policies

Identity-based policies: Identity-based policies are attached to an IAM user, group, or role (identities). These policies control what actions an identity can perform, on which resources, and under what conditions.

Resource-based policies: Resource-based policies are attached to a resource such as an Amazon S3 bucket. These policies control what actions a specified principal can perform on that resource and under what conditions.

Permission Boundary: You can use an AWS-managed policy or a customer-managed policy to set the boundary for an IAM entity (user or role). A permissions boundary is an advanced feature for using a managed policy to set the maximum permissions that an identity-based policy can grant to an IAM entity.

Inline Policies: Policies that are embedded in an IAM identity. Inline policies maintain a strict one-to-one relationship between a policy and an identity. They are deleted when you delete the identity.
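As an illustration of the above, here is a minimal boto3 sketch that attaches an identity-based inline policy (stored as a JSON document) to an IAM user. The user name, policy name, and permissions are assumptions made up for this example:

import json
import boto3

iam = boto3.client("iam")

# Identity-based policy written as a JSON document (hypothetical permissions).
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::my_corporate_bucket",
                "arn:aws:s3:::my_corporate_bucket/*",
            ],
        }
    ],
}

# Inline policy: embedded in the identity and deleted along with it.
iam.put_user_policy(
    UserName="chandr34",
    PolicyName="ReadCorporateBucket",
    PolicyDocument=json.dumps(policy),
)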

AWS CloudShell

AWS CloudShell is a browser-based shell environment available directly through the AWS Management Console. It provides a command-line interface (CLI) to manage and interact with AWS resources securely without needing to install any software or set up credentials on your local machine.

Use Cases

Quick Access to AWS CLI

Allows you to run AWS CLI commands directly without configuring your local machine. It's perfect for quick tasks like managing AWS resources (e.g., EC2 instances, S3 buckets, or Lambda functions).

Development and Automation

You can write and execute scripts using common programming languages like Python and Shell. It’s great for testing and automating tasks directly within your AWS environment.

Secure and Pre-Configured Environment

AWS CloudShell comes pre-configured with AWS CLI, Python, Node.js, and other essential tools. It uses your IAM permissions, so you don’t need to handle keys or credentials directly, making it secure and convenient.

Access to Filesystem and Persistent Storage

You get a persistent 1 GB home directory per region to store scripts, logs, or other files between sessions, which can be used to manage files related to your AWS resources.

Cross-Region Management

You can access and manage resources across different AWS regions directly from CloudShell, making it useful for multi-region setups.


Basic Commands

    aws s3 ls
    aws ec2 describe-instances

    sudo yum install -y jq    # CloudShell runs Amazon Linux, so use yum/dnf instead of apt

list_buckets.sh

#!/bin/bash
echo "Listing all S3 buckets:"
aws s3 ls

Run it:

    bash list_buckets.sh
# get account details

aws sts get-caller-identity

# list available regions

aws ec2 describe-regions --query "Regions[].RegionName" --output table

# create a bucket

aws s3 mb s3://chandr34-newbucket

# upload a file to a bucket 

echo "Hello, CloudShell!" > hello.txt
aws s3 cp hello.txt s3://chandr34-newbucket

# List files in bucket 

aws s3 ls s3://chandr34-newbucket/

# Delete bucket  with files 

aws s3 rb s3://chandr34-newbucket --force

# List AMIs

aws ec2 describe-images --owners amazon --query 'Images[*].{ID:ImageId,Name:Name}' --output table

# quickly launch an EC2 instance: first create a key pair

aws ec2 create-key-pair --key-name gcnewkeypair --query 'KeyMaterial' --output text > myNewKeyPair.pem

# Change Permission

chmod 0400 myNewKeyPair.pem

# Launch new EC2

aws ec2 run-instances --image-id ami-0866a3c8686eaeeba --count 1 --instance-type t2.micro --key-name gcnewkeypair --security-groups default

# Get Public IP

aws ec2 describe-instances --query "Reservations[].Instances[].PublicIpAddress" --output text

# Login to server

ssh -i myNewKeyPair.pem ubuntu@<getthehostip>

# terminate the instance

aws ec2 terminate-instances --instance-ids <>

Cloud Formation

my-webserver.yml

AWSTemplateFormatVersion: '2010-09-09'
Description: CloudFormation template to launch an Ubuntu EC2 instance with Nginx installed.

Resources:
  MyEC2Instance:
    Type: AWS::EC2::Instance
    Properties:
      InstanceType: t2.micro
      ImageId: ami-0866a3c8686eaeeba
      KeyName: gcnewkeypair
      SecurityGroupIds:
        - !Ref InstanceSecurityGroup
      UserData:
        Fn::Base64: 
          !Sub |
            #!/bin/bash
            apt update -y
            apt install -y nginx
            systemctl start nginx
            systemctl enable nginx
      Tags:
        - Key: Name
          Value: MyNginxServer

  InstanceSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Enable SSH, HTTP, and HTTPS access
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 22
          ToPort: 22
          CidrIp: 0.0.0.0/0  # SSH access, restrict this to your IP range for security
        - IpProtocol: tcp
          FromPort: 80
          ToPort: 80
          CidrIp: 0.0.0.0/0  # HTTP access for Nginx
        - IpProtocol: tcp
          FromPort: 443
          ToPort: 443
          CidrIp: 0.0.0.0/0  # HTTPS access for Nginx

Outputs:
  InstanceId:
    Description: The Instance ID of the EC2 instance
    Value: !Ref MyEC2Instance
  PublicIP:
    Description: The Public IP address of the EC2 instance
    Value: !GetAtt MyEC2Instance.PublicIp
  WebURL:
    Description: URL to access the Nginx web server
    Value: !Sub "http://${MyEC2Instance.PublicIp}"

Launch the Stack via CloudShell

# Create the stack
aws cloudformation create-stack --stack-name gc-stack --template-body file://my-webserver.yml --capabilities CAPABILITY_NAMED_IAM


# Check the status

aws cloudformation describe-stacks --stack-name gc-stack --query "Stacks[0].StackStatus"


aws cloudformation describe-stacks --stack-name gc-stack --query "Stacks[0].Outputs"

# delete the stack

aws cloudformation delete-stack --stack-name gc-stack


aws cloudformation describe-stacks --stack-name gc-stack --query "Stacks[0].StackStatus"

# confirm the deletion status

aws cloudformation list-stacks --query "StackSummaries[?StackName=='gc-stack'].StackStatus"

Terraform

Features of Terraform

Infrastructure as Code: Terraform allows you to write, plan, and create infrastructure using configuration files. This makes infrastructure management automated, consistent, and easy to collaborate on.

Multi-Cloud Support: Terraform supports many cloud providers and on-premises environments, allowing you to manage resources across different platforms seamlessly.

State Management: Terraform keeps track of the current state of your infrastructure in a state file. This enables you to manage changes, plan updates, and maintain consistency in your infrastructure.

Resource Graph: Terraform builds a resource dependency graph that helps in efficiently creating or modifying resources in parallel, speeding up the provisioning process and ensuring dependencies are handled correctly.

Immutable Infrastructure: Terraform promotes the practice of immutable infrastructure, meaning that resources are replaced rather than updated directly. This ensures consistency and reduces configuration drift.

Execution Plan: Terraform provides an execution plan (terraform plan) that previews changes before they are applied, allowing you to understand and validate the impact of changes before implementing them.

Modules: Terraform supports reusability through modules, which are self-contained, reusable pieces of configuration that help you maintain best practices and reduce redundancy in your infrastructure code.

Community and Ecosystem: Terraform has a large open-source community and many providers and modules available through the Terraform Registry, which makes it easier to get started and integrate with various services.

Use Cases

  • Multi-Cloud Provisioning
  • Infrastructure Scaling
  • Disaster Recovery
  • Environment Management
  • Compliance & Standardization
  • CI/CD Pipelines
  • Speed and Simplicity
  • Team Collaboration
  • Error Reduction
  • Enhanced Security

Install Terraform CLI

Terraform Download

Terraform Structure

Provider Block: Specifies the cloud provider or service (e.g., AWS, Azure, Google Cloud) that Terraform will interact with.

provider "aws" {
  region = "us-east-1"
}

Resource Block: Defines the resources to be created or managed. A resource can be a server, network, or other infrastructure component.

resource "aws_instance" "example" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t2.micro"
}

Data Block: Fetches information about existing resources, often for referencing in resource blocks.

data "aws_ami" "latest" {
  most_recent = true
  owners      = ["amazon"]
}

Variable Block: Declares input variables to make the script flexible and reusable.

variable "instance_type" {
  description = "Type of instance to use"
  type        = string
  default     = "t2.micro"
}

Output Block: Specifies values to be output after the infrastructure is applied, like resource IDs or connection strings.

output "instance_ip" {
  value = aws_instance.example.public_ip
}

Module Block: Used to encapsulate and reuse sets of Terraform resources.

module "vpc" {
  source = "./modules/vpc"
  cidr_block = "10.0.0.0/16"
}

Locals Block: Defines local values that can be reused in the configuration.

locals {
  environment = "production"
  instance_count = 3
}

SET these environment variables.

export AWS_ACCESS_KEY_ID="your-access-key-id"
export AWS_SECRET_ACCESS_KEY="your-secret-access-key"

Simple S3 Bucket

simple_s3_bucket.tf


terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.70.0"
    }
  }

  required_version = ">= 1.2.0"
}

provider "aws" {
  region = "us-east-1"
  profile = "chandr34"
}

resource "aws_s3_bucket" "demo" {
  bucket = "chandr34-my-new-tf-bucket"

  tags = {
    Createdusing = "tf"
    Environment  = "classdemo"
  }
}

output "bucket_name" {
  value = aws_s3_bucket.demo.bucket
}
  • Create a new folder.
  • Copy the .tf file into it.
  • terraform init
  • terraform validate
  • terraform plan
  • terraform apply
  • terraform destroy

Variable S3 Bucket

variable_bucket.tf


terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.70.0"
    }
  }

  required_version = ">= 1.2.0"
}

provider "aws" {
  region  = "us-east-1"
  profile = "chandr34"
}

variable "bucket_name" {
  description = "The name of the S3 bucket to create"
  type        = string
}

resource "aws_s3_bucket" "demo" {
  bucket = var.bucket_name

  tags = {
    Createdusing = "tf"
    Environment  = "classdemo"
  }
}

output "bucket_name" {
  value = aws_s3_bucket.demo.bucket
}
  • Create a new folder.
  • Copy the .tf file into it.
  • terraform init
  • terraform validate
  • terraform plan
  • terraform apply -var="bucket_name=chandr34-variable-bucket"
  • terraform destroy -var="bucket_name=chandr34-variable-bucket"

Variable file

Any filename with extension .tfvars

terraform.tfvars

bucket_name = "chandr34-variable-bucket1"

Then run:

terraform apply -auto-approve

AWS Resource Types


Please make sure AWS Profile is created.

Create Public and Private Keys

Linux / Mac Users

// create private/public key

ssh-keygen -b 2048 -t rsa -f ec2_tf_demo

Windows Users

Open PuttyGen and create a Key

Terraform

  • mkdir simple_ec2
  • cd simple_ec2
  • Create main.tf
// main.tf
#https://registry.terraform.io/providers/hashicorp/aws/latest

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.70.0"
    }
  }

  required_version = ">= 1.2.0"
}

provider "aws" {
  region  = "us-east-1"
  profile = "chandr34"
}

resource "aws_key_pair" "generated_key" {
  key_name   = "generated-key-pair"
  public_key = tls_private_key.generated_key.public_key_openssh
}

resource "tls_private_key" "generated_key" {
  algorithm = "RSA"
  rsa_bits  = 2048
}

resource "local_file" "private_key_file" {
  content  = tls_private_key.generated_key.private_key_pem
  filename = "${path.module}/generated-key.pem"
}

resource "aws_instance" "ubuntu_ec2" {
  ami           = "ami-00874d747dde814fa"
  instance_type = "t2.micro"
  key_name      = aws_key_pair.generated_key.key_name
  vpc_security_group_ids = [aws_security_group.ec2_security_group.id]

  tags = {
    Name        = "UbuntuInstance"
    Environment = "classdemo"
  }
}

resource "aws_security_group" "ec2_security_group" {
  name        = "ec2_security_group"
  description = "Allow SSH and HTTP access"

  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]  # Allow SSH from anywhere (use cautiously)
  }

  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]  # Allow HTTP from anywhere
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]  # Allow all outbound traffic
  }

  tags = {
    Name = "EC2SecurityGroup"
  }
}

output "ec2_instance_public_ip" {
  value = aws_instance.ubuntu_ec2.public_ip
}

output "private_key_pem" {
  value     = tls_private_key.generated_key.private_key_pem
  sensitive = true
}

Go to the terminal

  • terraform init
  • terraform fmt
  • terraform validate
  • terraform apply
  • terraform show

Finally

  • terraform destroy

Data Architecture

Medallion Architecture

Medallion architecture is a data design pattern commonly used in modern data lakes and data warehouses, particularly in cloud-based environments.

A medallion architecture is a data design pattern used to logically organize data in a lakehouse, aiming to incrementally and progressively improve the structure and quality of data as it flows through each layer of the architecture (from Bronze ⇒ Silver ⇒ Gold layer tables).


The Three Tiers

Bronze (Raw)

  • Contains raw, unprocessed data.
  • Typically a 1:1 copy of source system data.
  • Preserves the original data for auditability and reprocessing if needed.
  • Often stored in formats like JSON, CSV, or Avro.

Silver (Cleaned and Conformed)

  • Cleansed and conformed version of bronze data.
  • Applies data quality rules, handles missing values, deduplication.
  • Often includes parsed and enriched data.
  • Typically stored in a more optimized format like Parquet or Delta.

Gold (Business-Level)

  • Contains highly refined, query-ready data sets.
  • Often aggregated and joined from multiple silver tables.
  • Optimized for specific business domains or use cases.
  • Can include star schemas, data marts, or wide denormalized tables.

Key Principles

  • Data flows from Bronze → Silver → Gold
  • Each tier adds value and improves data quality
  • Promotes data governance and lineage tracking
  • Enables self-service analytics at different levels of refinement

Benefits

Flexibility: Supports various data processing needs

Scalability: Easily accommodates growing data volumes

Governance: Improves data lineage and auditability

Performance: Optimizes query performance on refined data sets

Reusability: Allows multiple downstream applications to use appropriately refined data

Bronze to Silver

These are the basic best practices followed when moving data from Bronze to Silver. (A small PySpark sketch follows the DO's list below.)

DO's

Case standardization

  • Convert column names to lowercase or uppercase.
  • Standardize text fields (e.g., convert all First Name, Last Name, Product names to title case).

Column renaming

  • Adopt consistent naming conventions across tables

Whitespace

  • Trim leading and trailing spaces from string fields.
  • Remove extra spaces between words.

Data type conversions

  • Convert string dates to proper date format
  • Change numeric strings to appropriate integer or decimal types

Null handling

  • Replace empty strings with NULL values
  • Set default values for NULL fields where appropriate

Deduplication

  • Remove exact duplicate records
  • Handle near-duplicates based on business rules

Format standardization

  • Normalize phone numbers to a consistent format
  • Standardize address formats

Value normalization

  • Convert units of measurement to a standard unit (e.g., all weights to pounds or kgs)
  • Standardize currency / lat-long decimal positions.

Character encoding

  • Convert all text to UTF-8 or another standard encoding.

Special character handling

  • Remove or replace non-printable characters.
  • Handle escape characters in text fields.

Data validation

  • Check for values within expected ranges
  • Validate against reference data (e.g., valid product codes)

Date and time standardization

  • Convert all timestamps to UTC or a standard time zone
  • Ensure consistent date formats across all fields

Calculated fields

  • Add derived columns based on raw data (e.g., age from birth date)

Data enrichment

  • Add geographic information based on zip codes
  • Categorize products based on attributes

Error correction

  • Fix common misspellings
  • Correct known data entry errors

Structural changes:

  • Split combined fields (e.g., full name into first and last name).
  • Merge related fields for easier analysis.

Metadata addition:

  • Add source system identifiers.
  • Include data lineage information.

Sensitive data handling

  • Mask or encrypt personally identifiable information (PII)
  • Apply data governance rules

Outlier detection and handling

  • Identify statistical outliers
  • Apply business rules to handle extreme values

Column dropping

  • Remove unnecessary or redundant columns
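Putting a few of these DO's together, here is a minimal PySpark sketch of a Bronze-to-Silver step. The input path, column names, and date format are assumptions for illustration, not the exact class dataset:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("BronzeToSilver").getOrCreate()

# Bronze: raw, unmodified data (hypothetical path).
bronze = spark.read.option("header", "true").csv("bronze/sales.csv")

silver = (
    bronze
    # Column renaming / case standardization: lowercase, underscores instead of spaces.
    .select([F.col(c).alias(c.strip().lower().replace(" ", "_")) for c in bronze.columns])
    # Whitespace: trim leading and trailing spaces.
    .withColumn("country", F.trim(F.col("country")))
    # Null handling: replace empty strings with NULL.
    .withColumn("country", F.when(F.col("country") == "", None).otherwise(F.col("country")))
    # Data type conversion: string date -> DATE (assumed M/d/yyyy format).
    .withColumn("order_date", F.to_date("order_date", "M/d/yyyy"))
    # Deduplication: remove exact duplicate records.
    .dropDuplicates()
)

silver.write.mode("overwrite").parquet("silver/sales")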

DON'Ts

Don't Lose Raw Data

  • Avoid modifying or transforming data in a way that the original information is lost. Always keep the Bronze layer intact and unmodified as it serves as the source of truth.
  • Maintain backups of critical data and metadata in case you need to revert transformations.

Don't Over-optimize Early

  • Avoid excessive transformations or optimizations before ensuring data accuracy.
  • Focus on clarity and correctness before performance tuning.

Don't Hard-code Business Logic

  • Avoid hard-coding business rules that could change frequently. Instead, make them configurable to ensure flexibility for future updates.

Don't Skip Data Validation

  • Don’t blindly trust incoming data from the Bronze layer. Always validate the quality, type, and range of data before moving it to Silver.
  • Implement rules for handling invalid data, such as setting defaults for missing values or discarding outliers.

Don't Ignore Metadata

  • Always retain important metadata, such as source system identifiers, data lineage, or event timestamps, to maintain traceability.

Don't Introduce Ambiguity

  • Avoid renaming columns or values without proper documentation. Ensure that any renaming or transformation is clear, consistent, and well-documented to prevent confusion.

Don't Remove Critical Columns Without Backup

  • Avoid dropping columns that might be required later for audits or debugging. Always ensure you have a backup or metadata for columns you drop.

Don't Assume Data Integrity

  • Never assume that data from the Bronze layer is clean or consistent. Always apply deduplication, null handling, and data cleaning rules.

Don't Overload the Silver Layer with Business Logic

  • Avoid implementing complex business rules in the Silver layer. Keep it focused on data cleansing and preparation. Reserve detailed business logic for the Gold layer.

Don't Over-transform Data

  • Avoid excessive or unnecessary transformations in the Silver layer, which could make the data less flexible for future use cases.

Don't Change Granularity

  • Don’t aggregate or change the granularity of data too early. Keep data at the same granularity as the Bronze layer to maintain flexibility for future aggregations in the Gold layer.

Don't Overlook Data Encoding and Formats

  • Don’t assume uniform character encoding across all systems. Always validate encoding, such as converting text to UTF-8 where needed.
  • Standardize date formats and ensure consistent time zones (preferably UTC) across all fields.

Don't Disregard Null and Missing Data

  • Don't ignore patterns of missing data. Treat NULL values carefully, ensuring that they’re handled based on the specific business logic.

Don't Apply Unnecessary Joins

  • Avoid applying joins that aren’t required in the Silver layer. Unnecessary joins can increase complexity and lead to performance degradation.

Don't Remove Contextual Information

  • Don’t drop fields that provide necessary context, such as event timestamps, source system details, or audit trails.

Don't Assume All Data Sources Are Consistent

  • Don’t assume that data from different sources (even within the Bronze layer) will have consistent formats, structures, or units of measurement. Always validate and normalize data as necessary.

Don't Ignore Data Governance

  • Avoid handling sensitive data like PII carelessly. Ensure that you apply data masking, encryption, or other governance rules to stay compliant with regulations.

Don't Forget Documentation

  • Avoid performing transformations or renaming columns without thoroughly documenting them. Clear documentation ensures data transformations are transparent and easily understood across teams.

Silver to Gold

These are the basic best practices followed when moving data from Silver to Gold. (A small PySpark sketch follows this list.)

Business alignment

  • Tailor Gold datasets to specific business needs and use cases
  • Collaborate closely with business stakeholders to understand requirements

Aggregation and summarization

  • Create pre-aggregated tables for common metrics (e.g., daily sales totals)
  • Implement various levels of granularity to support different analysis needs

Dimensional modeling

  • Develop star or snowflake schemas for analytical queries
  • Create conformed dimensions for consistent reporting across the organization

Denormalization

  • Create wide, denormalized tables for specific reporting needs
  • Balance performance gains against data redundancy

Metric standardization

  • Implement agreed-upon business logic for key performance indicators (KPIs)
  • Ensure consistent calculation of metrics across different Gold datasets

Data mart creation

  • Develop subject-area specific data marts (e.g., Sales, HR, Finance)
  • Optimize each mart for its intended use case

Advanced transformations

  • Apply complex business rules and calculations
  • Implement time-based analyses (e.g., year-over-year comparisons)

Data quality assurance

  • Implement rigorous testing of Gold datasets
  • Set up automated data quality checks and alerts

Performance optimization

  • Use appropriate indexing and partitioning strategies
  • Implement materialized views for frequently accessed data

Metadata management

  • Maintain detailed business glossaries and data dictionaries
  • Document data lineage and transformation logic

Access control

  • Implement fine-grained access controls for sensitive data
  • Ensure compliance with data governance policies

Versioning and historization

  • Implement slowly changing dimensions (SCDs) where appropriate
  • Maintain historical versions of key business entities

Data freshness

  • Define and implement appropriate refresh schedules
  • Balance data currency against processing costs

Self-service enablement

  • Create views or semantic layers for business users
  • Provide clear documentation and training for end-users

Caching strategies

  • Implement intelligent caching for frequently accessed data
  • Balance cache freshness against query performance

Query optimization

  • Tune common queries for optimal performance
  • Create aggregated tables or materialized views for complex calculations

Data exploration support:

  • Provide sample queries or analysis templates
  • Create dashboards or reports showcasing the value of Gold datasets

Scalability considerations

  • Design Gold datasets to handle growing data volumes
  • Implement appropriate archiving strategies for historical data

Documentation

  • Maintain comprehensive documentation of Gold dataset structures and uses
  • Provide clear guidelines on how to use and interpret the data
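A minimal PySpark sketch of one Silver-to-Gold step, building a pre-aggregated daily sales table. The paths, column names, and chosen metrics are assumptions for illustration:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("SilverToGold").getOrCreate()

# Silver: cleansed, conformed data (hypothetical path).
silver = spark.read.parquet("silver/sales")

# Gold: business-level, pre-aggregated, query-ready dataset.
gold_daily_sales = (
    silver
    .groupBy("order_date", "region", "country")
    .agg(
        F.sum("total_revenue").alias("daily_revenue"),
        F.sum("total_profit").alias("daily_profit"),
        F.countDistinct("order_id").alias("order_count"),
    )
)

gold_daily_sales.write.mode("overwrite").partitionBy("order_date").parquet("gold/daily_sales")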

Sample Data from Raw to Silver

Using DUCKDB

Raw CSV : Sales_100

Creating Bronze

create table bronze_sales as select * from read_csv('https://raw.githubusercontent.com/gchandra10/filestorage/refs/heads/main/sales_100.csv');

Creating Silver

CREATE SEQUENCE sales_id_seq;

CREATE TABLE silver_sales (
    id INTEGER PRIMARY KEY DEFAULT nextval('sales_id_seq'),
    region VARCHAR,
    country VARCHAR,
    item_type VARCHAR,
    sales_channel VARCHAR,
    order_priority VARCHAR,
    order_date DATE,
    order_id BIGINT,
    ship_date DATE,
    units_sold BIGINT,
    unit_price DOUBLE,
    unit_cost DOUBLE,
    total_revenue DOUBLE,
    total_cost DOUBLE,
    total_profit DOUBLE
);

Bronze Data


select distinct("Order Priority") from bronze_sales;
select distinct("Sales Channel") from bronze_sales;

insert into silver_sales(region,
country,
item_type,
sales_channel,
order_priority,
order_date,
order_id,
ship_date,
units_sold,
unit_price,
unit_cost,
total_revenue,
total_cost,
total_profit
)
select distinct * from bronze_sales where "order id" is not null;

Apache Spark

Apache Spark is an in-memory, distributed data analytics engine. It is a unified engine for big data processing - unified meaning it offers both batch and stream processing.

Apache Spark overcame the challenges of Hadoop with in-memory parallelization, delivering high performance for distributed processing.

Its libraries increase developer productivity and can be seamlessly combined to create complex workflows.

Features of Apache Spark

  • Analytics and Big Data processing.
  • Machine learning capabilities.
  • Huge open source community and contributors.
  • A distributed framework for general-purpose data processing.
  • Support for Java, Scala, Python, R, and SQL.
  • Integration with libraries for streaming and graph operations.

Benefits of Apache Spark

  • Speed (In memory cluster computing)
  • Scalable (Cluster can be added/removed)
  • Powerful Caching
  • Real-time
  • Supports Delta
  • Polyglot (Knowing/Using several languages. Spark provides high-level APIs in SQL/Scala/Python/R. Spark code can be written in any of these 4 languages.)
  • Spark SQL : SQL queries in Spark.
  • Streaming : Process realtime data.
  • MLlib : MLlib allows for preprocessing, munging, training of models, and making predictions at scale on data.

It also supports Scala, Python, Java, and R.

Spark Architecture

src: www.databricks.com

Cluster Manager: Allocates Students/Faculty for a course. Who is in and who is out.

Driver: Faculty point of entry. Gives instructions and collects the result.

Executor: Table of Students.

Core (Thread): Student

Cores share the same JVM, Memory, Diskspace (like students sharing table resources, power outlets)

Partition

Technique of distributing data across multiple files or nodes in order to improve the query processing performance.

By partitioning the data, you can write each partition to a different executor simultaneously, which can improve the performance of the write operation.

Spark can read each partition in parallel on a different executor, which can improve the performance of the read operation.

A bowl of candies: how do you sort and count them by color?

If only one person does it, the others are wasting their time. Instead, if the candy is packaged into small packs (partitions), everyone can work in parallel.

The default partition size is 128 MB, and it is configurable.

  • Data is ready to be processed.
  • Each core is assigned a task.
  • A few cores complete their tasks.
  • Free cores pick up the remaining tasks.
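A small PySpark sketch showing how to inspect and change the number of partitions (the file path is hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PartitionDemo").getOrCreate()

df = spark.read.option("header", "true").csv("sales.csv")

# How many partitions did Spark create for this data?
print(df.rdd.getNumPartitions())

# Redistribute the data into 8 partitions so 8 cores can work in parallel (full shuffle).
df8 = df.repartition(8)
print(df8.rdd.getNumPartitions())

# coalesce() reduces the number of partitions without a full shuffle.
df2 = df8.coalesce(2)
print(df2.rdd.getNumPartitions())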

Terms to learn

Spark Context

Used widely in Spark Version 1.x

Spark Context : Used by Driver node to establish communication with Cluster. SparkContext is an object that represents the entry point to the underlying Spark cluster, which is responsible for coordinating the processing of distributed data in a cluster. It serves as a communication channel between the driver program and the Spark cluster.

When a Databricks cluster is started, a SparkContext is automatically created and made available as the variable sc in the notebook or code. The SparkContext provides various methods to interact with the Spark cluster, including creating RDDs, accessing Spark configuration parameters, and managing Spark jobs.

text_rdd = sc.textFile("/FileStore/tables/my_text_file.txt")

Other entry points

SQL Context: Entry point to perform SQL-like operations.

Hive Context: Used if the Spark application needs to communicate with Hive.

In newer versions of Spark, this is not used anymore.

Spark Session

In Spark 2.0, SparkSession was introduced as a new entry point that subsumes SparkContext, SQLContext, StreamingContext, and HiveContext. For backward compatibility, those entry points are preserved.

In short, it is referred to as "spark".

spark.read.format("csv").option("header", "true").load("sales.csv")

RDD

RDDs (Resilient Distributed Datasets) are the fundamental data structure in Apache Spark. Here are the key aspects:

Core Characteristics:

  • Resilient: Fault-tolerant with the ability to rebuild data in case of failures
  • Distributed: Data is distributed across multiple nodes in a cluster
  • Dataset: Collection of partitioned data elements
  • Immutable: Once created, cannot be changed

Key Features:

  • In-memory computing
  • Lazy evaluation (transformations aren't executed until an action is called)
  • Type safety at compile time
  • Ability to handle structured and unstructured data

Basic Operations:

Transformations (create new RDD):

  • map()
  • filter()
  • flatMap()
  • union()
  • intersection()
  • distinct()

Actions (return values):

  • reduce()
  • collect()
  • count()
  • first()
  • take(n)
  • saveAsTextFile()

Benefits

  • Fault tolerance through lineage graphs
  • Parallel processing
  • Caching capability for frequently accessed data
  • Efficient handling of iterative algorithms
  • Supports multiple languages (Python, Scala, Java)

Limitations

  • No built-in optimization engine
  • Manual optimization required
  • Limited structured data handling compared to DataFrames
  • Higher memory usage due to Java serialization

Example

Read a CSV using RDD and group by Region, Country except Region=Australia

from pyspark.sql import SparkSession
from operator import add

# Initialize Spark
spark = SparkSession.builder \
    .appName("Sales Analysis RDD") \
    .getOrCreate()

sc = spark.sparkContext

# Read CSV file
rdd = sc.textFile("sales.csv")

# Extract header and data
header = rdd.first()
data_rdd = rdd.filter(lambda line: line != header)

# Transform and filter data
# Assuming CSV format: Region,Country,Sales,...
result_rdd = data_rdd \
    .map(lambda line: line.split(',')) \
    .filter(lambda x: x[0] != 'Australia') \
    .map(lambda x: ((x[0], x[1]), float(x[2]))) \
    .groupByKey() \
    .mapValues(lambda sales: sum(sales) / len(sales)) \
    .sortByKey()

# Display results
print("Region, Country, Average Sales")
for (region, country), avg_sales in result_rdd.collect():
    print(f"{region}, {country}, {avg_sales:.2f}")

Good news: you don't have to write RDDs directly anymore.

Now, the same logic using DataFrames:

from pyspark.sql import SparkSession
from pyspark.sql.functions import avg

# Initialize Spark
spark = SparkSession.builder \
    .appName("Sales Analysis DataFrame") \
    .getOrCreate()

# Read CSV file
df = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv("sales.csv")

# Group by and calculate average sales
result_df = df.filter(df.Region != 'Australia') \
    .groupBy('Region', 'Country') \
    .agg(avg('Sales').alias('Average_Sales')) \
    .orderBy('Region', 'Country')

# Show results
result_df.show(truncate=False)

Spark Query Execution Sequence

Unresolved Logical Plan

This is the set of instructions the developer logically wants to do.

First, Spark parses the query and creates the Unresolved Logical Plan. This validates the syntax of the query.

It doesn't validate the semantics, that is, whether the column names exist or whether the data types are correct.

Metadata Catalog (Analysis)

This is where the column names and table names are validated against the catalog, producing a (resolved) Logical Plan.

Catalyst Catalog

This is where the first set of optimizations takes place: the logical sequence of calls is rewritten/reordered. From this, we get the Optimized Logical Plan.

Catalyst Optimizer

Determines that there are multiple ways to execute the query. Do we pull 100% of the data over the network, or filter the dataset with a predicate pushdown? From this, one or more physical plans are generated.

Physical Plans

Multiple ways to execute the query.

Physical plans represent what the query engine will actually do. This is different from the optimized logical plan. Each candidate plan is evaluated against a cost model.

Selected Physical Plan

The best-performing plan is selected by the cost model. Finally, the selected physical plan is compiled down to RDDs. The cost model considers:

  • the estimated amount of data needed for processing
  • the amount of shuffling
  • the amount of time to execute the query

RDD

(Whole Stage Code Generation)

This is the same RDD code a developer could have written by hand.

AQE (Adaptive Query Execution)

Checks join strategies and data skews at runtime. This happens repeatedly to find the best plan.

- explain(mode="simple") which will display the physical plan
- explain(mode="extended") which will display physical and logical plans (like “extended” option)
- explain(mode="codegen") which will display the java code planned to be executed
- explain(mode="cost") which will display the optimized logical plan and related statistics (if they exist)
- explain(mode="formatted") which will display a splitted output composed by a nice physical plan outline, and a section with each node details

DAG (Directed Acyclic Graph)

Example of DAG

Stage 1 (Read + Filter):
   [Read CSV] → [Filter]
        |
        ↓
Stage 2 (GroupBy + Shuffle):
   [Shuffle] → [GroupBy]
        |
        ↓
Stage 3 (Order):
   [Sort] → [Display]

Databricks

Databricks is a Unified Analytics Platform, built on top of Apache Spark, that accelerates innovation by unifying data science, engineering, and business. With fully managed Spark clusters in the cloud, you can easily provision clusters with just a few clicks.

This is not a Databricks sales pitch, so we won't get into the roots of the product.

Open

Open standards provide easy integration with other tools plus secure, platform-independent data sharing.

Unified

One platform for your data, consistently governed and available for all your analytics and AI.

Scalable

Scale efficiently with every workload from simple data pipelines to massive LLMs.

Lakehouse

Data Warehouse + Data Lake = Lakehouse.

ELT Data Design Pattern

In ELT (Extract, Load, Transform) data design patterns, the focus is on loading raw data into a data warehouse first, and then transforming it. This is in contrast to ETL, where data is transformed before loading. ELT is often favored in cloud-native architectures.

Batch Load

In a batch load, data is collected over a specific period and then loaded into the data warehouse in one go.

Real-time Example

A retail company collects sales data throughout the day and then runs a batch load every night to update the data warehouse. Analysts use this data the next day for reporting and decision-making.

Stream Load

In stream loading, data is continuously loaded into the data warehouse as it's generated. This is useful in scenarios requiring real-time analytics and decision-making.

Real-time Example

A ride-sharing app collects GPS coordinates of all active rides. This data is streamed in real-time into the data warehouse, where it's immediately available for analytics to optimize ride allocation and pricing dynamically.

Delta

Delta Lake is an open-source storage framework that enables building a Lakehouse architecture with compute engines, including Spark, PrestoDB, Flink, Trino, and Hive, and APIs for Scala, Java, Rust, Ruby, and Python.

ACID Transactions - Protect your data with the strongest level of isolation.

Time Travel - Access earlier versions of data for audits, rollbacks, or reproducing results.

Open Source - Community driven.

DML Operations - SQL, Scala, and Python APIs to merge, update, and delete data.

Audit History - Delta Lake logs all change details, providing a complete audit trail.

Schema Evolution/Enforcement - Prevents bad data from causing data corruption.

Unified Batch/Streaming - Exactly-once ingestion, from backfills to interactive queries.

Delta table is the default data table format in Databricks.
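A minimal PySpark sketch of Delta writes and time travel. It assumes a Databricks cluster or a local Spark session configured with the delta-spark package; the table path is hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DeltaDemo").getOrCreate()

path = "/tmp/delta/sales_demo"  # hypothetical location

# Write a DataFrame as a Delta table (this becomes version 0).
spark.range(0, 5).withColumnRenamed("id", "units_sold") \
    .write.format("delta").mode("overwrite").save(path)

# Append more data (this becomes version 1).
spark.range(5, 10).withColumnRenamed("id", "units_sold") \
    .write.format("delta").mode("append").save(path)

# Time travel: read an earlier version of the table.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
v0.show()

# Audit history: the full log of changes to the table.
spark.sql(f"DESCRIBE HISTORY delta.`{path}`").show(truncate=False)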

Data Warehousing Concepts

Dimensional Modelling

Dimensional Modeling (DM) is a data structure technique optimized for data storage in a data warehouse. Dimensional modeling aims to optimize the database for faster data retrieval. The concept of Dimensional Modeling was developed by Ralph Kimball and consists of "fact" and "dimension" tables.

Dimension Table

Dimensions provide the context surrounding a business process event; they give the who, what, and where of a fact. In the Sales business process, for the fact "quarterly sales number," the dimensions would be:

Who – Customer Names
Where – Location
What – Product Name

They are joined to Fact tables via a foreign key.

Dimensions offer descriptive characteristics of the facts with the help of their attributes.

Src: www.guru99.com

Src: www.guru99.com

Fact Table

It contains measurements, metrics, and facts about a business process and is the primary table in dimensional modeling.

A Fact Table contains

Measurements/facts
Foreign key to the dimension table

Src: www.guru99.com
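A tiny Python sketch using DuckDB (the table and column names are made up for illustration) showing a dimension table joined to a fact table via a foreign key:

import duckdb

con = duckdb.connect()

# Dimension: descriptive context (the who/what/where).
con.sql("""
    CREATE TABLE dim_product (
        product_key INTEGER PRIMARY KEY,
        product_name VARCHAR,
        category VARCHAR
    )
""")

# Fact: measurements plus foreign keys to the dimensions.
con.sql("""
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product (product_key),
        order_date DATE,
        units_sold INTEGER,
        total_revenue DOUBLE
    )
""")

con.sql("INSERT INTO dim_product VALUES (1, 'Laptop', 'Electronics'), (2, 'Desk', 'Furniture')")
con.sql("""
    INSERT INTO fact_sales VALUES
        (1, DATE '2024-09-01', 3, 2400.0),
        (2, DATE '2024-09-01', 1, 350.0)
""")

# Typical query: join the fact to a dimension and aggregate the measures.
con.sql("""
    SELECT d.category, SUM(f.total_revenue) AS revenue
    FROM fact_sales f
    JOIN dim_product d USING (product_key)
    GROUP BY d.category
""").show()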

Star Schema

Star schema is the simplest model used in DWH. It's commonly used in Data Marts.

The star schema is built to streamline querying and reporting.

The ETL process extracts data from the operational database, transforms it into the proper format, and loads it into the warehouse.

src: www.vertabelo.com

dim_employee: info about employee 

dim_product: info about the product 

dim_store: info about store 

dim_sales_type: info about sales 

dim_time: the time dimension 

fact_sales: references to dimension tables plus two facts (price and quantity sold)

This Star schema is intended to store the history of placed orders.

fact_supply_order: contains aggregated data about the order & supplies.
dim_supplier: supplier information.

Advantages

  • The fact table is related to each dimension table by exactly one relation.
  • Faster aggregations.
  • Simpler queries. Relatively fewer joins.

Disadvantages

  • Data Integrity is not enforced.
  • Data redundancy. City > State > Country can be normalized and not repeated.

Galaxy Schema

Essentially, Galaxy schema can be derived as a collection of star schemas interlinked and completely normalized to avoid redundancy and inaccuracy of data. It is also called Fact Constellation Schema.

Let's combine the above two Star Schemas.

Advantages

  • Minimum or no redundancy as a result of Normalization.
  • This is a flexible Schema, considering the complexity of the system.

Disadvantages

  • Working with this schema is tedious; the complexity of the schema and database system makes it intricate.
  • Data retrieval requires multi-level joins combined with conditional expressions.
  • Multiple levels of normalization may be needed, depending on the depth of the given database.

Snowflake Schema

Snowflake is an extension of the Star Schema.

Dimension tables are further normalized and can have their categories.

One or more lookup tables can describe each dimension table. Normalization is recursive until the model is fully normalized.

Check dim_store, dim_product, and dim_time and compare with Star Schema.

Check the Sales Order Schema

Advantages

Better data quality. Less disk space is used compared to Star Schema.

Disadvantages

The major disadvantage is the large number of tables to join. Poorly written queries can significantly decrease performance.

Starflake Schema

Starflake = Star + Snowflake

So how does this work? Some tables are normalized in detail, and some are denormalized.

If used correctly, the Starflake schema can give the best of both worlds.

dim_store is normalized, whereas dim_time is denormalized.

The time dimension is mostly integers, so not much space is wasted, and keeping it denormalized helps reduce query complexity.

Star vs. Starflake

| Feature | Star Schema | Starflake Schema |
| --- | --- | --- |
| Structure | Single fact table connected to multiple denormalized dimension tables. | Hybrid structure with a mix of denormalized and normalized dimension tables. |
| Complexity | Simple and straightforward, easy to understand and navigate. | More complex due to the normalization of certain dimensions. |
| Data Redundancy | Higher redundancy due to denormalization; data may be duplicated in dimension tables. | Reduced redundancy in normalized parts of the schema, leading to potentially less storage use. |
| Query Performance | Generally faster query performance because of fewer joins (denormalized data). | Slightly slower query performance due to additional joins needed for normalized tables. |
| Maintenance | Easier to maintain, as there are fewer tables and less complex relationships. | More challenging to maintain, as normalized tables introduce more relationships and dependencies. |
| Flexibility | Less flexible for handling updates and changes in dimension attributes (due to denormalization). | More flexible for handling changes and updates to dimension attributes, as normalization allows easier updates without affecting the entire table. |
| Use Case | Best for environments where simplicity and query performance are prioritized, such as dashboards and reporting. | Best for environments where data integrity and storage efficiency are critical, and some dimensions require normalization. |

GRAIN

Declaring the grain is the pivotal step in a dimensional design.

The grain establishes precisely what a single fact table row represents.

  • Minute-by-minute weather.
  • Daily sales.
  • A scanner device measures a line item on a customer’s retail sales ticket.
  • An individual transaction against an insurance policy.

What is the lowest level of data? Grain is defined in business terms, not as rows/columns.

Src: www.aroundbi.com

Based on the image, what do you think the data frequency is?

One row per product? 
One row per day per product?
One row per day per store per product? 

It can be anything, but that has to be decided first.

Ralph Kimball says, "The grain must be declared before choosing dimensions or facts".

The grain must be declared before choosing dimensions or facts because every candidate dimension or fact must be consistent with the grain.

This consistency enforces uniformity on all dimensional designs, which is critical to BI application performance and ease of use.

When changes are made to the grain (e.g., adding "per customer" to the lowest level)

src: www.aroundbi.com

Rolled-up summary grains are essential for performance tuning, but they pre-suppose the business’s common questions.

Each proposed fact table grain results in a separate physical table; different grains must not be mixed in the same fact table.

Making one more change

It helps with better reporting, increases query performance, and provides an aggregated / summary view.

Multi-Fact Star Schema

The schema with more than one fact table linked by normalized dimension tables is often called a "Data Mart Schema" or a "Multi-Fact Star Schema."

This schema can be helpful when you have multiple fact tables that share some but not all dimensions and where the dimensions are normalized.

  • Multiple Fact Tables: For different aspects of the business.
  • Normalized Dimension Tables: To avoid redundancy and improve data integrity.

It's a complex, real-world adaptation designed to meet specific business needs.

Vertabelo Tool

  • Easy to model for any database.
  • Model sharing.
  • Model exporting.
  • Import existing database.
  • Easy layout.
  • Generate necessary SQL queries.
  • Live validation
  • SQL Preview.
  • Zoom, Navigation, and Search capability.

Dimension - Fact

Identify the Fact and Dimension Tables.

Sample Exercise

From the data given below, Identify the Dimensions and Facts

Keys

Again, what is a Primary Key? Unique - Not Null - One PK per table.

Super Key - A super key is a set of one or more columns (attributes) that uniquely identifies rows in a table.

Candidate Key - Minimal super keys with no redundant attributes to uniquely identify a row.

Primary Key - Primary key is selected from the sets of candidate keys.

Alternate Key – A candidate key that was not selected as Primary Key.

Natural Key – A key that happens naturally in the table. Example: SSN, TaxID.

Surrogate Key – A system-generated identifier (Meaningless Simple Number).

Unique Key – Unique values; may allow one NULL.

Foreign Key – used to relate other tables.

Super Key

Super Key is a set of one or more columns (attributes) to identify rows in a table uniquely.

For the above table, a whole bag of key combinations can uniquely identify its rows.

{ID}
{EmpNum}
{SSN}
{Email}
{ID,EmpNum}
{ID,SSN}
{ID,Email}
{ID,DOB}
{EmpNum,SSN}
{EmpNum,SSN,Email}
{ID, EmpNum, SSN}
{ID, EmpNum, SSN, DOB}
{ID, EmpNum, SSN, DOB, Name}
{ID, EmpNum, SSN, Name, Email,DOB}
...

Now you get the idea.. we can come up with more & more combinations.

Candidate Key

"Minimal super keys" with no redundant attributes to uniquely identify a row.

Here the essential condition is Minimal. Let's see what minimal keys help us identify a row uniquely.

With just ID, we can uniquely identify a row. Similarly, EmpNum, SSN, and Email can each uniquely identify a row.

{ID}

{EmpNum}

{SSN}

{Email}

Primary Key

The primary key is selected from the sets of candidate keys. Let's choose the best column as Primary Key through the process of elimination.

SSN: Sensitive
Email: Varchar
EmpNum: Good to use when needed.
ID: Auto Increment. Widely used in Data Warehousing.
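
A hedged SQL sketch of the employee table implied by this example (the table name and data types are assumptions): ID is the primary key, while EmpNum, SSN, and Email remain enforced as unique candidate keys.

CREATE TABLE employee (
    ID      INT PRIMARY KEY,              -- chosen primary key (auto-increment in practice)
    EmpNum  VARCHAR(10)  NOT NULL UNIQUE, -- candidate key
    SSN     VARCHAR(11)  NOT NULL UNIQUE, -- candidate key (sensitive)
    Email   VARCHAR(100) UNIQUE,          -- candidate key, later used as the alternate key
    Name    VARCHAR(100),
    DOB     DATE
);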

Alternate Key

A candidate key that was not selected as the Primary Key.

Email is best suited as an Alternate Key. Email can be used to search for a particular employee if the system doesn’t have access to EmpNum, and it can be used to link to external accounts.

Surrogate Key

Surrogate Key: A system-generated identifier (Meaningless Simple Number)

ID column matches the definition.

Surrogate Keys are very handy and useful in Data Warehousing.

Unique Key

A unique key holds unique values and may allow one NULL value.

Under certain circumstances, Email can be a unique key, assuming the email is generated by a different system and can be NULL for a brief period.

Natural Key

A natural key is a column or set of columns that already exist in the table (e.g., they are attributes of the entity within the data model) and uniquely identifies a record in the table. Since these columns are entity attributes, they have business meaning.

Example: SSN

Foreign Key

It is used to refer to other tables.

Why Surrogate Keys are Important

Example of Primary / Foreign Keys

Let's look at some sample data.

| StoreID * | Street | City | State | Country |
| --- | --- | --- | --- | --- |
| S1001 | 24th Blvd | Phoenix | AZ | USA |
| S1002 | 21 Bell Road | Miami | FL | USA |
| S1003 | Main Street | New Port | CA | USA |

StoreID is a Natural Key (PK). It has a meaning: "S" stands for Store, and 1001 means it is the first store.

What is the issue now?

When data changes (Slowly Changing Dimensions, which we will cover in the next chapter), how do we handle the change?

Situation 1: Store 1 moves to a new location, or the store is closed for some time and reopened under a new franchise.

| StoreID * | Street | City | State | Country |
| --- | --- | --- | --- | --- |
| S1001 | 24th Blvd | Phoenix | AZ | USA |
| S1002 | 21 Bell Road | Miami | FL | USA |
| S1003 | Main Street | New Port | CA | USA |
| S1001 | 1st Street | Phoenix | AZ | USA |

Situation 2: When acquiring a competing business (say Target buys KMart), the Natural Keys don't make sense anymore.

| StoreID * | Street | City | State | Country |
| --- | --- | --- | --- | --- |
| S1001 | 24th Blvd | Phoenix | AZ | USA |
| S1002 | 21 Bell Road | Miami | FL | USA |
| S1003 | Main Street | New Port | CA | USA |
| 233 | South Street | New Brunswick | NJ | USA |
| 1233 | JFK Blvd | Charlotte | NC | USA |

These business decisions / changes have nothing to do with the Technology.

How to Overcome this issue?

Add Surrogate Keys (running sequence number)

| Surr_Store * | StoreID | Street | City | State | Country |
| --- | --- | --- | --- | --- | --- |
| 1 | S1001 | 24th Blvd | Phoenix | AZ | USA |
| 2 | S1002 | 21 Bell Road | Miami | FL | USA |
| 3 | S1003 | Main Street | New Port | CA | USA |
| 4 | 233 | South Street | New Brunswick | NJ | USA |
| 5 | 1233 | JFK Blvd | Charlotte | NC | USA |

Properties of Surrogate Keys

- Numerical
- Sequential
- Meaningless Simple Number

Adv of Surrogate Keys

- Constant Behavior (will not change based on Business need)
- Integration is easier.
- Faster Query Performance. (because of Integer values)
- Future records (every other column can be NULL, yet the ID is still available)

It's a good practice to have a Surrogate Key in Data Warehouse Dimension & Fact tables.
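
A minimal SQL sketch of the idea, using a hypothetical dim_store table based on the example above (column names and data types are assumptions; the auto-increment mechanism varies by database):

CREATE TABLE dim_store (
    surr_store INT PRIMARY KEY,   -- surrogate key: meaningless running number (auto-increment / sequence)
    store_id   VARCHAR(10),       -- natural/business key kept as a plain attribute
    street     VARCHAR(100),
    city       VARCHAR(50),
    state      VARCHAR(2),
    country    VARCHAR(50)
);

-- Fact tables reference the surrogate key (surr_store), not the natural key (store_id),
-- so business-driven changes to StoreID never break the joins.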

More Examples

Dimension columns typically contain descriptive attributes that provide context or categorization to the data, while fact columns contain measurable, numerical data that can be analyzed or aggregated.

Identify the Dimension and Fact in the following designs.

Student Example

| FirstName | LastName | DOB | Ht | Wt | Gender | Course | Grade |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Ross | Geller | 1967-10-18 | 72 | 170 | Male | Paleontology | A |
| Rachel | Green | 1969-05-05 | 65 | 125 | Female | Fashion Design | B+ |
| Monica | Geller | 1964-04-22 | 66 | 130 | Female | Culinary Arts | A |
| Chandler | Bing | 1968-04-08 | 73 | 180 | Male | Advertising | B |
| Joey | Tribbiani | 1968-01-09 | 71 | 185 | Male | Acting | C+ |

CAR Sales Example

| Manufacturer | Model | Sales_in_thousands | Vehicle_type | Price_in_thousands | Engine_size | Horsepower | Wheelbase | Width | Length | Curb_weight | Fuel_capacity | Fuel_efficiency | Latest_Launch | Power_perf_factor |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Acura | Integra | 16.919 | Passenger | 21.5 | 1.8 | 140 | 101.2 | 67.3 | 172.4 | 2.639 | 13.2 | 28 | 2/2/2012 | 58.28 |
| Acura | TL | 39.384 | Passenger | 28.4 | 3.2 | 225 | 108.1 | 70.3 | 192.9 | 3.517 | 17.2 | 25 | 6/3/2011 | 91.371 |
| Acura | CL | 14.114 | Passenger | [NULL] | 3.2 | 225 | 106.9 | 70.6 | 192 | 3.47 | 17.2 | 26 | 1/4/2012 | [NULL] |
| Acura | RL | 8.588 | Passenger | 42 | 3.5 | 210 | 114.6 | 71.4 | 196.6 | 3.85 | 18 | 22 | 3/10/2011 | 91.39 |
| Audi | A4 | 20.397 | Passenger | 23.99 | 1.8 | 150 | 102.6 | 68.2 | 178 | 2.998 | 16.4 | 27 | 10/8/2011 | 62.778 |
| Audi | A6 | 18.78 | Passenger | 33.95 | 2.8 | 200 | 108.7 | 76.1 | 192 | 3.561 | 18.5 | 22 | 8/9/2011 | 84.565 |
| Audi | A8 | 1.38 | Passenger | 62 | 4.2 | 310 | 113 | 74 | 198.2 | 3.902 | 23.7 | 21 | 2/27/2012 | 134.657 |
| BMW | 323i | 19.747 | Passenger | 26.99 | 2.5 | 170 | 107.3 | 68.4 | 176 | 3.179 | 16.6 | 26 | 6/28/2011 | 71.191 |

Master Data Management

Master Data Management (MDM) refers to creating and managing data that an organization must have as a single master copy, called the master data.

  • State
  • Customers
  • Vendors
  • Products

It is the single source of the truth.

MDM is not a Data Warehouse, but it is closely related to Data Warehousing.

Different Goals: MDM's goal is to create and maintain a single source of truth, whereas in a DW, the Sales view of a Customer and the Marketing view of a Customer may differ and not follow a single source of truth.

Types of Data: MDM contains data that changes rarely, mainly Dimensions, while the DW holds both Dimensions and Facts.

Reporting Needs: Data Warehousing's priority is to address end-user requirements. MDM's priority is ensuring it follows data governance, quality, and compliance.

Steps of Dimensional Modeling

  • Identify the Business Processes

  • Identify the Facts and Dimensions in your Dimensional Data Model

  • Define the Grain

  • Identify the Attributes / Keys for Dimensions

  • Build the Schema

  • Store Historical Info

Types of Dimensions

Date Dimension Table

In this example, the Date Dimension table contains information about dates, including the date itself, the year, quarter, month, day, and weekday. The "Date Key" column is the table's primary key and is used to join with other tables in the Star Schema.

This Date Dimension table can be joined with other fact tables in the Star Schema, such as a Sales Fact table, which contains measures such as total sales revenue, units sold, and discounts. By joining the Sales Fact table with the Date Dimension table on the Date Key column, it becomes possible to analyze sales data by date, year, quarter, month, and weekday. For example, it becomes possible to answer questions like:

  • What was the total sales revenue for January 2022?
  • What were the total units sold on Mondays in Q1 2022?
  • What was the average discount percentage on weekends in 2022?
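
For instance, a hedged sketch of the first question above, assuming a hypothetical fact_sales table (with a sales_revenue measure and a date_key foreign key) and a dim_date table keyed the same way:

SELECT SUM(f.sales_revenue) AS total_revenue
FROM fact_sales f
JOIN dim_date d ON f.date_key = d.date_key
WHERE d.year = 2022
  AND d.month = 1;   -- January 2022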

Degenerate Dimension

Generally, what is a Dimension table? Something like a master list. Employee, Product, Date, Store.

In a data warehouse, a degenerate dimension is a dimension key in the fact table that does not have its dimension table.

It is a key in the fact table but does not have its own dimension table.

Degenerate dimensions are most common in transaction and accumulating snapshot fact tables.

This means there is no separate Dimension table.

OrderNo is a degenerate dimension, as it has no Dimension table.
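
A hedged sketch of what this looks like, assuming a hypothetical fact_sales table: the order number lives directly in the fact table, with no dimension table behind it.

CREATE TABLE fact_sales (
    date_key    INT,
    product_key INT,
    store_key   INT,
    order_no    VARCHAR(20),   -- degenerate dimension: no dim_order table exists
    quantity    INT,
    unit_price  DECIMAL(10,2)
);

-- Grouping line items of the same order via the degenerate dimension
SELECT order_no, SUM(quantity * unit_price) AS order_total
FROM fact_sales
GROUP BY order_no;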

Uses of a Degenerate Dimension

- Grouping line items
- Getting the average sale
- Tracking

Other Examples of Degenerate Dimensions are

  1. Order number or invoice number: This is a typical example of a degenerate dimension. It is a unique identifier for an order or an invoice but has no meaningful attributes other than its value. This degenerate dimension can be used in a fact table to analyze sales by order or invoice number.

  2. Tracking number: Another example of a degenerate dimension is a tracking number for a shipment. Like an order number, it is a unique identifier but has no additional attributes associated with it. It can be used in a fact table to analyze shipping performance or delivery times.

  3. ATM transaction number: In banking, an ATM transaction number is a unique identifier for a transaction at an ATM. It can be used in a fact table to analyze ATM usage patterns and trends.

  4. Serial number: A serial number is a unique identifier assigned to a product or equipment. It can be used in a fact table to analyze the performance of a particular product or equipment.

  5. Coupon code: A coupon code is a unique identifier to redeem a discount or promotion. It can be used in a fact table to analyze the usage and effectiveness of different marketing campaigns or promotions.

Degenerate dimensions are helpful when we have a unique identifier for an event or transaction. Still, it has no additional attributes that make it sound like a separate dimension table. In these cases, we can include the unique identifier in the fact table as a degenerate dimension.

Junk Dimension

Cardinality: Number of Unique Values.

A junk dimension is a dimension table created by grouping low cardinality and unrelated attributes.

The idea behind a junk dimension is to reduce the number of dimension tables in a data warehouse and simplify queries.

An example of a junk dimension could be a table that includes binary flags such as "is_promotion", "is_return", and "is_discount".

Possible values for is_promotion are Y or N; the same holds for is_return and is_discount.

Example:

fact_sale table

Date
Product 
Store
OrderNumber

PaymentMode
StoreType
CustomerSupport

Quantity
UnitPrice

Possible values for these columns

PaymentMode - Cash/Credit/Check

StoreType - Warehouse/Marketplace

CustomerSupport - Yes/No

So how do we handle this situation?

Option 1: Add them to the Fact Table

The problem is that these columns are not measures, so in the fact table they are not that important and sometimes won't make sense.

src: aroundbi.com

Option 2: Add it as Dimension Table

The problem is that more dimension tables add more joins to your queries.

src: aroundbi.com

Do you know how to take care of this situation?

Let's create a new table with values from all possible combinations.

Junk Dimension with ID

So, the revised Fact Table looks like this

  • Basically to group low cardinality columns.
  • If the values are too uncorrelated.
  • Helps to keep the DW simple by minimizing the dimensions.
  • Helps to improve the performance of SQL queries.

Note: If the resultant dimension has way too many rows, then don’t create a Junk dimension.
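
A minimal sketch of building such a junk dimension, assuming hypothetical dim_junk and column names derived from the flags above; the 3 x 2 x 2 = 12 combinations can be generated with a cross join (the inline VALUES syntax varies by database):

CREATE TABLE dim_junk (
    junk_id          INT PRIMARY KEY,
    payment_mode     VARCHAR(10),   -- Cash / Credit / Check
    store_type       VARCHAR(15),   -- Warehouse / Marketplace
    customer_support VARCHAR(3)     -- Yes / No
);

-- Populate with every possible combination (12 rows)
INSERT INTO dim_junk (junk_id, payment_mode, store_type, customer_support)
SELECT ROW_NUMBER() OVER (ORDER BY p.payment_mode, s.store_type, c.customer_support),
       p.payment_mode, s.store_type, c.customer_support
FROM (VALUES ('Cash'), ('Credit'), ('Check'))       AS p(payment_mode)
CROSS JOIN (VALUES ('Warehouse'), ('Marketplace'))  AS s(store_type)
CROSS JOIN (VALUES ('Yes'), ('No'))                 AS c(customer_support);

-- fact_sale then stores a single junk_id instead of three low-cardinality columns.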

Static Dimension

A static dimension refers to a dimension table in a data warehouse that does not change, or changes only infrequently, over time.

Values Never Change.

Time Dimension: Days, weeks, months

Payment Method Dimension: Credit Card, Debit Card, PayPal, Zelle, etc.

Education Level Dimension: High School, Bachelor, Master, PhD, etc.

Geographic Dimension: Store Locations, Categories, US States, Countries, etc.

Conformed Dimensions

This dimension is consistent and the same across multiple fact tables in a data warehouse. The dimension tables are designed once and used across various areas, maintaining consistency.

List of countries/states where the company is doing business.

An excellent example of a conformed dimension is a "Date" or "Time" dimension. Imagine a data warehouse for a retail business that has two fact tables: "Sales" and "Inventory". These fact tables might have a "Date" dimension that tracks when a sale or inventory count was made.

In this example, Date, Product, and Store dimensions remain constant across multiple fact tables.

Slowly Changing Dimensions

Slowly Changing Dimensions (SCDs) is a concept in data warehousing and business intelligence that deals with managing changes in dimension data over time.

Data Warehousing has

- Time Variant (with history data)
- Non Volatile (no changes)

But in reality

- The customer may change his address.
- A store moves to a new location.
- An employee joins a new company.

Dimension data refers to the descriptive attributes of an entity, such as customer, product, or location. These attributes provide context for analysis and reporting in a data warehouse.

However, dimension data can change over time, and it's essential to maintain a historical record of these changes for accurate reporting and analysis.

There are several types of SCDs, each with different strategies for managing changes in dimension data:

Type 1 - Overwrite: In this approach, when changes occur, the existing values are overwritten with the new values. No historical data is preserved.

Type 2 - Add a New Row: This method maintains a complete history of changes by adding a new row with the updated attribute values in the dimension table. Each row has a start and end date (or a start date and a flag indicating the current record) to show the period during which the attribute values were valid.

Type 3 - Add a New Column: This approach maintains a limited history of changes by adding new columns to the dimension table to store the previous values of changed attributes. It is useful when tracking a small number of changes but can become unwieldy with many changes.

Type 4 - Add a History Table: In this method, a separate history table stores the changes in the dimension attributes. The primary dimension table contains the current attribute values, while the history table stores historical data.

Type 6 - Hybrid: This combines Type 1, Type 2, and Type 3 approaches. It allows for the selective preservation of history for specific attributes and can be used to overwrite certain attribute values.

Choosing the appropriate SCD type depends on the specific requirements of the data warehouse, the importance of historical data for analysis, and the performance implications of each approach.

SCD - Type 0

Type 0 - Retain Original: This approach ignores changes and retains the original values for the dimension attributes. No history is kept for changes.

The Product dimension table might include the following columns:

  1. Product_ID
  2. Product_Name
  3. Category
  4. Price

Now, imagine that the store decides to re-categorize one of its products. For instance, they may change the "Smartphone" product from the "Electronics" category to the "Mobile Devices" category.

Using a Type 0 Slowly Changing Dimension approach, the store will retain the original value for the product category, ignoring the change. In this case, the Product dimension table would still show the "Smartphone" product in the "Electronics" category, even though it has been re-categorized to "Mobile Devices." This approach means no history is kept for changes, and the original values are always preserved.

The Product dimension table would look like this:

| Product_ID | Product_Name | Category | Price |
| --- | --- | --- | --- |
| 1 | Smartphone | Electronics | 1000 |

Use Cases:

  1. When historical data is not relevant or necessary for analysis, and only the original values are needed.
  2. For dimensions with attributes that are fixed and don't change over time, such as unique identifiers or codes.
  3. In cases where the data warehouse is only required to support reporting and analysis on the current state of the business and not the historical trends.

Advantages:

  1. Simplicity: SCD Type 0 is the simplest approach, as it doesn't require any additional mechanisms to handle changes in dimension attributes.
  2. Space Efficiency: Since there is no need to store historical data or multiple versions of records, the dimension tables will be smaller and require less storage space.
  3. Performance: As there are no additional rows or columns for historical data, the querying and processing of the data warehouse will generally be faster.

Disadvantages:

  1. Lack of Historical Data: SCD Type 0 does not store any historical data, which means it cannot support reporting and analysis that requires tracking changes over time. This can be a significant limitation for businesses that need to analyze trends, understand the impact of changes, or perform other historical analyses.
  2. Inaccurate Analysis: Since the dimension table only contains the original values, any changes that have occurred over time are not reflected. This may lead to incorrect analysis results or conclusions based on outdated information.
  3. Inability to Track Changes: With SCD Type 0, it is impossible to determine when or why changes occurred, as there is no record of any changes in the dimension data.

SCD - Type 1

Type 1 - Overwrite: In this approach, when changes occur, the existing values are overwritten with the new values. No historical data is preserved.

Before the change, the Product dimension table looks like this:

| Product_ID | Product_Name | Category | Price |
| --- | --- | --- | --- |
| 1 | Smartphone | Electronics | 800 |

After the change, the Product dimension table will look like this:

| Product_ID | Product_Name | Category | Price |
| --- | --- | --- | --- |
| 1 | Smartphone | Mobile Devices | 800 |

With a Type 1 SCD approach, the table now reflects the updated category for the "Smartphone" product. However, there is no record of the product's previous category, "Electronics," in the table.
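
A minimal SQL sketch of a Type 1 change, assuming the dimension table above is named dim_product: the old value is simply overwritten.

UPDATE dim_product
SET Category = 'Mobile Devices'   -- overwrite; the old value 'Electronics' is lost
WHERE Product_ID = 1;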

Use Cases:

  1. When historical data is not essential for analysis, and only the current values are needed.
  2. For dimensions where tracking historical changes is not required for the business, such as minor corrections or updates to attributes.
  3. In cases where the data warehouse is required to support reporting and analysis on the current state of the business, and not historical trends.

Advantages:

  1. Simplicity: SCD Type 1 is relatively simple to implement, as it only requires overwriting existing records when changes occur.
  2. Space Efficiency: Since there is no need to store historical data or multiple versions of records, the dimension tables will be smaller and require less storage space.
  3. Performance: As there are no additional rows or columns for historical data, the querying and processing of the data warehouse will generally be faster.

Disadvantages:

  1. Lack of Historical Data: SCD Type 1 does not store any historical data, which means it cannot support reporting and analysis that requires tracking changes over time. This can be a significant limitation for businesses that need to analyze trends, understand the impact of changes, or perform other historical analyses.
  2. Loss of Previous Data: Since the dimension table only contains the most recent values, any changes that have occurred over time overwrite the previous values. This may lead to a loss of potentially valuable historical information.
  3. Inability to Track Changes: With SCD Type 1, it is impossible to determine when or why changes occurred, as there is no record of any changes in the dimension data. This can make it challenging to understand the reasons for changes or identify any potential issues or patterns.

SCD - Type 2

Using the retail store example with the "Product" dimension, let's see how a Type 2 Slowly Changing Dimension approach would handle the change in the product category.

Recall that the Product dimension table contains the following columns:

  1. Product_ID
  2. Product_Name
  3. Category
  4. Price

For a Type 2 SCD, we will need to add two additional columns to the table:

  1. Start_Date
  2. End_Date (or a flag indicating the current record, such as "Is_Current")

Now, imagine that the store decides to re-categorize the "Smartphone" product from the "Electronics" category to the "Mobile Devices" category, as in the previous examples.

With a Type 2 SCD approach, the store will add a new row with the updated category information while retaining the old row in the table. Each row will have a start and end date, indicating the period during which the attribute values were valid.

Before the change, the Product dimension table looks like this:

| Product_ID | Product_Name | Category | Price | Start_Date | End_Date |
| --- | --- | --- | --- | --- | --- |
| 1 | Smartphone | Electronics | 800 | 2021-01-01 | NULL |

After the change, the Product dimension table will look like this:

| Sur_ID | Product_ID | Product_Name | Category | Price | Start_Date | End_Date |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 1 | Smartphone | Electronics | 800 | 2021-01-01 | 2023-04-03 |
| 2 | 1 | Smartphone | Mobile Devices | 800 | 2023-04-04 | NULL |

With a Type 2 SCD approach, the table now has a new row reflecting the updated category for the "Smartphone" product.

Additionally, the previous row for the "Smartphone" product with the "Electronics" category is still in the table, with an updated End_Date.

This approach allows for the preservation of historical data and enables accurate reporting and analysis based on the product's category at different points in time.
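
A minimal SQL sketch of a Type 2 change, assuming the dimension table above is named dim_product and carries the surrogate key (Sur_ID) plus Start_Date / End_Date: close out the current row, then insert the new version.

-- 1. Expire the currently active row for the product
UPDATE dim_product
SET End_Date = '2023-04-03'
WHERE Product_ID = 1
  AND End_Date IS NULL;

-- 2. Insert a new row with the updated category
INSERT INTO dim_product (Sur_ID, Product_ID, Product_Name, Category, Price, Start_Date, End_Date)
VALUES (2, 1, 'Smartphone', 'Mobile Devices', 800, '2023-04-04', NULL);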

Use Cases:

  1. When historical data is critical for analysis, and it's essential to track changes over time for reporting and decision-making.
  2. For dimensions where understanding the impact of changes on business performance, such as customer demographics, product attributes, or pricing, is vital.
  3. In cases where the data warehouse is required to support reporting and analysis that includes historical trends, comparisons, and the effect of changes over time.

Advantages:

  1. Historical Data Preservation: SCD Type 2 maintains a complete history of changes in the dimension attributes, enabling more accurate and detailed reporting and analysis.
  2. Accurate Analysis: By preserving historical data, SCD Type 2 allows for accurate analysis and reporting that accounts for changes over time, leading to better insights and decision-making.
  3. Change Tracking: SCD Type 2 enables tracking of when and why changes occurred, making it easier to identify patterns, trends, or potential issues in the dimension data.

Disadvantages:

  1. Complexity: SCD Type 2 is more complex to implement and manage compared to SCD Type 0 and Type 1, as it requires additional mechanisms to handle changes in dimension attributes and maintain historical data.
  2. Space Requirements: Since multiple versions of records are stored to maintain historical data, the dimension tables will be larger and require more storage space.
  3. Performance: As there are additional rows for historical data, the querying and processing of the data warehouse may be slower compared to SCD Type 0 or Type 1, especially when dealing with large amounts of historical data. This may require more robust indexing, partitioning, or query optimization strategies to maintain acceptable performance.

SCD - Type 3

SCD Type 3 involves adding a new column to the dimension table to store the previous value of the changed attribute along with the current value. It allows tracking the current and previous values but does not maintain a complete history of changes. Using the same "Product" dimension example with the mobile product category change:

Recall that the Product dimension table contains the following columns:

  1. Product_ID
  2. Product_Name
  3. Category
  4. Price

For a Type 3 SCD, we will need to add a column to the table:

  1. Previous_Category

Now, imagine that the store decides to re-categorize the "Smartphone" product from the "Electronics" category to the "Mobile Devices" category, as in the previous examples.

With a Type 3 SCD approach, the store will update the existing row by setting the "Previous_Category" column to the old category value and overwriting the "Category" column with the new value.

Before the change, the Product dimension table looks like this:

| Product_ID | Product_Name | Category | Price | Previous_Category |
| --- | --- | --- | --- | --- |
| 1 | Smartphone | Electronics | 800 | NULL |

After the change, the Product dimension table will look like this:

| Product_ID | Product_Name | Category | Price | Previous_Category |
| --- | --- | --- | --- | --- |
| 1 | Smartphone | Mobile Devices | 800 | Electronics |

With a Type 3 SCD approach, the table now reflects the updated category for the "Smartphone" product and also retains the previous category in the "Previous_Category" column. However, it can only track one previous value and does not maintain a complete history of all changes that occurred over time.
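
A minimal SQL sketch of a Type 3 change, again assuming a dim_product table with the columns above: the current value is pushed into Previous_Category before being overwritten.

UPDATE dim_product
SET Previous_Category = Category,      -- keep the immediate previous value
    Category          = 'Mobile Devices'
WHERE Product_ID = 1;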

Use Cases:

  1. When it is essential to track the immediate previous value of an attribute, but a complete history of changes is not required.

  2. For dimensions where the primary focus is on comparing the current value with the previous value, rather than analyzing historical trends.

  3. In cases where the data warehouse is required to support reporting and analysis that involves comparisons between current and previous states, but not a full history of changes.

Advantages:

  1. Limited Historical Data: SCD Type 3 preserves the immediate previous value of an attribute, allowing for some level of historical analysis and comparison with the current value.

  2. Space Efficiency: Since only one previous value is stored, the dimension tables will require less storage space compared to SCD Type 2.

  3. Simplicity: SCD Type 3 is relatively simple to implement, as it only requires adding a new column to the dimension table and updating it when changes occur.

Disadvantages:

  1. Incomplete Historical Data: SCD Type 3 does not maintain a full history of changes, which may limit the depth of historical analysis and reporting that can be performed.

  2. Limited Change Tracking: With SCD Type 3, it is only possible to track the immediate previous value of an attribute, making it difficult to understand trends or patterns in the data over time.

  3. Additional Columns: SCD Type 3 requires adding a new column for each attribute that needs to track the previous value, which can increase the complexity of the dimension table schema.

  4. Scalability: If there are multiple attributes that require tracking of previous values or if the number of changes becomes more frequent, SCD Type 3 may become less practical and harder to manage. In such cases, SCD Type 2 may be a more suitable approach to maintain a complete history of changes.

Suppose the university decides to change the credit hours of a specific course, "Introduction to Data Science," from 3 to 4 credit hours. The university wants to track both the current and the immediately previous credit hours for reporting purposes.

Using SCD Type 3, the university would update the course's record in the Course dimension table by overwriting the "Credit_Hours" column and updating the "Previous_Credit_Hours" column with the old value.

Before the change, the Course dimension table looks like this:

| Course_ID | Course_Name | Credit_Hours | Previous_Credit_Hours |
| --- | --- | --- | --- |
| 1 | Introduction to Data Science | 3 | NULL |

After the change, the Course dimension table will look like this:

| Course_ID | Course_Name | Credit_Hours | Previous_Credit_Hours |
| --- | --- | --- | --- |
| 1 | Introduction to Data Science | 4 | 3 |

With a Type 3 SCD approach, the table now reflects the updated credit hours for the "Introduction to Data Science" course and retains the previous credit hours in the "Previous_Credit_Hours" column. This allows the university to compare the current and previous credit hours, which might be useful for understanding recent changes in course workload. However, it does not maintain a complete history of all credit hour changes over time.

SCD - Type 4

SCD Type 4, the "history table" approach, involves creating a separate table to store historical data, while the main dimension table only retains current information. Using the "Product" dimension example with the mobile product category change:

Recall that the Product dimension table contains the following columns:

  1. Product_ID
  2. Product_Name
  3. Category
  4. Price

For a Type 4 SCD, we will create an additional table called "Product_History":

Product_History Table:

  1. History_ID
  2. Product_ID
  3. Category
  4. Price
  5. Valid_From
  6. Valid_To

Now, imagine that the store decides to re-categorize the "Smartphone" product from the "Electronics" category to the "Mobile Devices" category.

With a Type 4 SCD approach, the store will insert a new row in the "Product_History" table, representing the previous state of the "Smartphone" product, and update the existing row in the main "Product" dimension table with the new category.

Before the change, the tables look like this:

Product Dimension Table:

| Product_ID | Product_Name | Category | Price |
| --- | --- | --- | --- |
| 1 | Smartphone | Electronics | 800 |

Product_History Table:

(empty)

After the change, the tables will look like this:

Product Dimension Table:

| Product_ID | Product_Name | Category | Price |
| --- | --- | --- | --- |
| 1 | Smartphone | Mobile Devices | 800 |

Product_History Table:

| History_ID | Product_ID | Category | Price | Valid_From | Valid_To |
| --- | --- | --- | --- | --- | --- |
| 1 | 1 | Electronics | 800 | 2020-01-01 | 2023-01-01 |

With a Type 4 SCD approach, the main "Product" dimension table contains only the current information, and the "Product_History" table maintains the history of changes. This approach helps maintain a clean dimension table and allows for efficient querying of current data while still preserving historical data for analysis when required.
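
A minimal SQL sketch of a Type 4 change, assuming the main dimension table is named dim_product and the history table product_history (columns as shown above): archive the outgoing state into the history table, then overwrite the main dimension row.

-- 1. Copy the outgoing state into the history table
INSERT INTO product_history (History_ID, Product_ID, Category, Price, Valid_From, Valid_To)
VALUES (1, 1, 'Electronics', 800, '2020-01-01', '2023-01-01');

-- 2. Overwrite the main dimension so it holds only the current state
UPDATE dim_product
SET Category = 'Mobile Devices'
WHERE Product_ID = 1;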

Use Cases:

  1. When it is important to maintain a complete history of changes, but the main dimension table should only contain the current state of the data.

  2. For situations where querying the current data efficiently is a priority, but historical data is still required for more in-depth analysis.

  3. When the dimension table is frequently accessed or queried for current data, and performance is a concern.

Advantages:

  1. Performance: By keeping the current data in the main dimension table and historical data in a separate table, SCD Type 4 allows for efficient querying of the current data without the overhead of historical data.

  2. Historical Data: SCD Type 4 maintains a complete history of changes, enabling in-depth analysis and reporting that requires historical data.

  3. Separation of Concerns: By separating the current and historical data, SCD Type 4 provides a cleaner and more organized data structure, making it easier to manage and maintain.

  4. Scalability: Since historical data is stored separately, SCD Type 4 can scale well with large dimensions and frequent changes.

Disadvantages:

  1. Complexity: SCD Type 4 adds complexity to the data warehouse design and maintenance, as it requires managing two separate tables for the same dimension.

  2. Increased Storage: Storing historical data in a separate table requires additional storage space, which can be a concern for large dimensions with extensive change history.

  3. Maintenance: Implementing and maintaining SCD Type 4 can be more challenging than other SCD types, as it requires managing the relationships between the dimension and history tables and ensuring data integrity between them.

  4. Query Complexity: Analyzing historical data and comparing it with current data can involve more complex queries, as it may require joining the dimension and history tables.

Overall, SCD Type 4 is suitable for scenarios where maintaining a complete history of changes is necessary, but the main dimension table should only contain the current state of the data to optimize query performance.

SCD - Type 6

SCD Type 6 is a hybrid approach that combines the features of SCD Types 1, 2, and 3. It maintains the current attribute value, the previous attribute value, and a full history of changes with effective dates. Using the "Product" dimension example with the mobile product category change:

Recall that the Product dimension table contains the following columns:

  1. Product_ID
  2. Product_Name
  3. Category
  4. Price

For a Type 6 SCD, we will need to add the following columns to the table:

  1. Previous_Category
  2. Valid_From
  3. Valid_To
  4. Is_Current

Now, imagine that the store decides to re-categorize the "Smartphone" product from the "Electronics" category to the "Mobile Devices" category.

With a Type 6 SCD approach, the store will insert a new row in the "Product" dimension table with the updated category and set "Is_Current" to 'Y' (Yes) for the new row and 'N' (No) for the old row. The new row will have the "Previous_Category" column set to the old category value.

Before the change, the Product dimension table looks like this:

| Product_ID | Product_Name | Category | Price | Previous_Category | Valid_From | Valid_To | Is_Current |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Smartphone | Electronics | 800 | NULL | 2020-01-01 | NULL | Y |

After the change, the Product dimension table will look like this:

| Product_ID | Product_Name | Category | Price | Previous_Category | Valid_From | Valid_To | Is_Current |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Smartphone | Electronics | 800 | NULL | 2020-01-01 | 2023-01-01 | N |
| 2 | Smartphone | Mobile Devices | 800 | Electronics | 2023-01-01 | NULL | Y |

With a Type 6 SCD approach, the table now reflects the updated category for the "Smartphone" product, retains the previous category, and maintains a full history of changes with effective dates. This approach provides the benefits of SCD Types 1, 2, and 3 while optimizing the storage and query performance to some extent. However, it still requires managing additional columns and more complex logic for managing the Is_Current flag and Valid_From/Valid_To dates.
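
A minimal SQL sketch of a Type 6 change, assuming a dim_product table with the columns shown above: expire the currently active row, then insert the new version that carries the previous category.

-- 1. Close out the currently active row (Type 2 flavor)
UPDATE dim_product
SET Valid_To   = '2023-01-01',
    Is_Current = 'N'
WHERE Product_Name = 'Smartphone'
  AND Is_Current = 'Y';

-- 2. Insert the new version, keeping the previous category (Type 3 flavor)
INSERT INTO dim_product
    (Product_ID, Product_Name, Category, Price, Previous_Category, Valid_From, Valid_To, Is_Current)
VALUES
    (2, 'Smartphone', 'Mobile Devices', 800, 'Electronics', '2023-01-01', NULL, 'Y');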

SCD - Type 5 - Fun Fact

A data warehousing expert, Ralph Kimball, introduced the concept of Slowly Changing Dimensions and defined Types 0, 1, 2, and 3. These widely adopted types became the standard classification for handling dimension changes.

Later, other types, such as Type 4 and Type 6, were proposed by different practitioners to address additional scenarios and use cases that needed to be covered by Kimball's original classification. However, there was no commonly agreed upon and widely adopted SCD Type 5.

In short, SCD Type 5 does not exist because it was never proposed or documented as a distinct method for handling dimension changes. The existing SCD types (0, 1, 2, 3, 4, and 6) cover most use cases and have been widely adopted in the data warehousing community.

Role Playing Dimension

Role-playing dimension is a term used in data warehousing that refers to a dimension used for multiple purposes within the same database. Essentially, the same dimension table is linked to the fact table multiple times, each playing a different role. This concept is often used when a single physical dimension can have different meanings in different contexts.

Date Dimension: This is the most common example. A single date dimension table can be used to represent different types of dates in a fact table, such as:

  • Order Date: The date when an order was placed.
  • Shipping Date: The date when an order was shipped.
  • Delivery Date: The date when an order was delivered.

Employee Dimension in a Hospital Setting:

  • Attending Physician: The primary doctor responsible for a patient.
  • Referring Physician: The doctor who referred the patient to the hospital.
  • Admitting Physician: The doctor who admitted the patient to the hospital.

Product Dimension in Retail:

  • Ordered Product: The product that a customer ordered.
  • Returned Product: The product that a customer returned.
  • Replacement Product: The product that was sent as a replacement for a returned item.

Why Use Role-Playing Dimensions?

  • Efficiency: It's efficient in terms of storage as you don't need to create multiple dimension tables for each role.
  • Consistency: Ensures consistency across different business processes since the same dimension table is used.
  • Flexibility: Offers flexibility in querying and reporting. You can easily compare and contrast different aspects (like order date vs. delivery date).
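
A hedged sketch of how a role-playing date dimension is typically queried, assuming a hypothetical fact_orders table with separate order/ship/delivery date keys and a dim_date table keyed by date_id: the same dimension table is joined several times under different aliases.

SELECT o.order_id,
       od.shortdate AS order_date,
       sd.shortdate AS ship_date,
       dd.shortdate AS delivery_date
FROM fact_orders o
JOIN dim_date od ON o.order_date_key    = od.date_id   -- role: Order Date
JOIN dim_date sd ON o.ship_date_key     = sd.date_id   -- role: Shipping Date
JOIN dim_date dd ON o.delivery_date_key = dd.date_id;  -- role: Delivery Date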

Conformed vs Role Playing

Conformed and role-playing dimensions are both concepts in data warehousing, but they serve different purposes.

A conformed dimension is a dimension that is shared across multiple fact tables in a data warehouse. A conformed dimension has the same meaning and structure in all fact tables and is used to maintain consistency and integrity in the data warehouse.

For example, a Date dimension could be used in multiple fact tables that record sales, inventory, and customer data. The Date dimension maintains its structure and meaning across all fact tables, ensuring consistent and accurate analysis.

Shrunken Dimension

A shrunken dimension is a subset of a dimension’s attributes that apply to a higher summary level. For example, a Month dimension would be a shrunken dimension of the Date dimension. The Month dimension could be connected to a forecast fact table whose grain is at the monthly level.

(Sales Dimension Example)

Suppose you have a dataset with 100 features (or variables) that describe some phenomenon. However, you suspect that not all of these features are relevant to predicting the outcome you're interested in. You suspect many of these features are noisy, redundant, or irrelevant.


CREATE TABLE dim_patient (
    PatientID INT PRIMARY KEY,
    Name STRING,
    Age INT,
    Gender STRING,
    Race STRING,
    Height DECIMAL,
    Weight DECIMAL,
    EyeColor STRING,
    MaritalStatus STRING,
    Address STRING,
    City STRING,
    State STRING,
    ZipCode STRING,
    PhoneNumber STRING,
    EmailAddress STRING,
    InsuranceProvider STRING,
    InsurancePolicyNumber STRING,
    PrimaryCarePhysician STRING
);

In this shrunken dimension, you're reducing the attribute set to only those necessary for BMI analysis, simplifying the data model for this specific use case.


CREATE TABLE dim_patient_bmi (
    PatientID INT PRIMARY KEY,
    Name STRING,
    Height DECIMAL,
    Weight DECIMAL,
    BMI DECIMAL
);


CREATE TABLE dim_patient_demographics (
    PatientID INT PRIMARY KEY,
    Age INT,
    Gender STRING,
    Race STRING
);

Swappable Dimension

Swappable dimensions are used in data warehousing when you have multiple versions of a dimension and you want to switch between them for different types of analysis. This concept is particularly useful in scenarios where different perspectives or classifications are needed for the same underlying data.

Key Components

Fact Table: SalesFact

Contains measures like QuantitySold and TotalSalesAmount. Has a foreign key (ProductID) that can link to either product dimension.

Multiple Dimension Tables:

  • StandardProductDimension
  • SeasonalProductDimension

Common Identifier:

Both dimension tables share ProductID as their primary key

Different Attributes:

  • StandardProductDimension: Category, SubCategory, Brand
  • SeasonalProductDimension: Season, HolidayTheme, WeatherAppropriate

Functionality:

Allows switching between different product perspectives in analyses. Enables querying sales data using either standard or seasonal product attributes.

Benefits:

  • Flexibility in reporting and analysis
  • Accommodates different business needs or seasonality
  • Optimizes storage by separating rarely-used attributes

Use Case Example:

  • Use StandardProductDimension for regular inventory analysis
  • Switch to SeasonalProductDimension for holiday sales planning

Implementation Note: Requires careful ETL processes to ensure data consistency across dimensions.
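
A hedged sketch of the idea, using the table and column names listed above: the same SalesFact rows can be joined to either product dimension, depending on the analysis.

-- Regular inventory analysis: standard product attributes
SELECT p.Category, SUM(f.TotalSalesAmount) AS total_sales
FROM SalesFact f
JOIN StandardProductDimension p ON f.ProductID = p.ProductID
GROUP BY p.Category;

-- Holiday sales planning: swap in the seasonal product dimension
SELECT s.HolidayTheme, SUM(f.TotalSalesAmount) AS total_sales
FROM SalesFact f
JOIN SeasonalProductDimension s ON f.ProductID = s.ProductID
GROUP BY s.HolidayTheme;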

Step Dimension

A Step Dimension in data warehousing represents a process that involves several steps or stages, each of which might need to be analyzed separately. This type of dimension is beneficial in scenarios where a process progresses through distinct phases, and you want to track or analyze each phase individually.

Step Dimension: OrderStatusDimension

This dimension table represents the different steps in the order processing lifecycle.


CREATE TABLE OrderStatusDimension (
    StatusID INT PRIMARY KEY,
    StatusName VARCHAR(100),
    Description VARCHAR(255)
);

| StatusID | StatusName | Description |
| --- | --- | --- |
| 1 | Order Placed | Order has been placed |
| 2 | Payment Processed | Payment has been received |
| 3 | Shipped | Order has been shipped |
| 4 | Delivered | Order has been delivered |

Fact Table: OrderFact

The fact table tracks each order along with its current status.


CREATE TABLE OrderFact (
    OrderID INT PRIMARY KEY,
    DateKey INT,
    CustomerID INT,
    ProductID INT,
    StatusID INT,  -- Foreign Key to OrderStatusDimension
    Quantity INT,
    TotalAmount DECIMAL
);

SQL Query Example

To analyze the number of orders at each status:

SELECT
    osd.StatusName,
    COUNT(*) AS NumberOfOrders
FROM
    OrderFact ofact
JOIN
    OrderStatusDimension osd ON ofact.StatusID = osd.StatusID
GROUP BY
    osd.StatusName;

Remember the Accumulating Snapshot Fact table?

Step dimensions are closely connected to the Accumulating Snapshot Fact table.


CREATE TABLE OrderProcessFact (
    OrderID INT PRIMARY KEY,
    CustomerID INT,
    ProductID INT,
    OrderDate DATE,
    PaymentDate DATE NULL,
    ShipDate DATE NULL,
    DeliveryDate DATE NULL,
    QuantityOrdered INT,
    TotalAmount DECIMAL,
    CurrentStatus VARCHAR(100)
);
  • When an order is placed, a new record is inserted with the OrderDate and initial CurrentStatus.
  • As the order progresses through payment, shipping, and delivery, the respective date fields and the CurrentStatus are updated.
  • This table allows for analysis of the duration between different stages of the order process, identification of bottlenecks, and overall process efficiency.

Temporal

Temporal refers to anything related to time. In the context of databases, data warehousing, and analytics, "temporal" generally describes aspects that change or evolve over time, or that are connected to the tracking of time.

Examples

Temporal Data: Data that represents a point or duration in time. This can include:

  • Timestamps: A specific point in time (e.g., "2020-10-15 10:30:00").
  • Dates: A particular day, month, or year (e.g., "2020-10-15").
  • Time Ranges: A start and end date (e.g., "2020-10-01 to 2020-10-15").

Temporal Dimension: A dimension (like dim_date or dim_time) that stores and manages information about time. It's used in queries to analyze data based on time intervals (e.g., sales per month, inventory levels over time).

Temporal Relationships: These are relationships between entities that change over time. For example, in customer data, an individual might change their address, and you may want to track how that address evolves over time.

Temporal Analysis: Analyzing how things change over time, such as:

  • Trend Analysis: How sales grow or decline over time.
  • Duration Analysis: How long an order stays in each processing step.
  • Time-Series Analysis: Data points collected or recorded at specific time intervals to understand patterns (e.g., daily stock prices).

Temporal Context is often important when we talk about step dimensions, slowly changing dimensions (SCDs), or any type of analysis that looks at the evolution of data over time.

Types of Facts

Factless Fact Table

Ideally, Fact tables should contain some Measurements. What if there is nothing to measure?

A factless fact table is a type of fact table in a data warehouse that contains only foreign keys and no measures. It represents a many-to-many relationship between dimensions without any associated numerical measures.

Here is an example of a factless fact table for a university enrollment system:

| student_id | course_id | semester_id |
| --- | --- | --- |
| 1 | 101 | 202201 |
| 2 | 102 | 202201 |
| 3 | 101 | 202201 |
| 3 | 103 | 202201 |
| 4 | 104 | 202201 |

In this example, the fact table captures the enrollment of students in courses for a particular semester. It contains only foreign keys to the student, course, and semester dimensions and does not contain any measures such as enrollment count or grade.

This type of fact table is useful in scenarios where we need to analyze the relationships between dimensions without any numerical measures. For example, we might use this factless fact table to answer questions such as:

  • Which students are enrolled in more than one course in a semester?
  • Which courses have no students enrolled in a semester?
  • Which students have not enrolled in any courses in a semester?
  • Which courses are only offered in one semester?

By analyzing the relationships between dimensions in this way, we can gain insights into the behavior and patterns of our data without relying on numerical measures.
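
For example, a hedged sketch of the first question, assuming the factless fact table above is named fact_enrollment:

-- Students enrolled in more than one course in a given semester
SELECT student_id, COUNT(*) AS courses_enrolled
FROM fact_enrollment
WHERE semester_id = 202201
GROUP BY student_id
HAVING COUNT(*) > 1;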

Good use case for building Aggregate Fact Tables.

Transaction Fact

The simplest fact table, capturing data at the lowest level.

What is the GRAIN here?

By date, time, store, product

This is a Transaction Fact table. It captures data at the lowest level of detail.

It helps you to find

What are the sales of Store 001?

What is the sale of a particular product per day?

  • Simple structure.
  • Keeps most detailed level data.
  • Easy to aggregate.

Periodic Fact

Stores current state at regular intervals.

It explains the state of entities at a particular instant in time.

The time interval can be hourly/daily/weekly.

Use Case:

  • Analyze business performance at a fixed interval.
  • Status of all flights.
  • Inventory stock. How much stock is left at the end of the day?

Transaction Fact Table is the source, and this stores a snapshot of info.

GRAIN for this table will be hourly / 2 hrs / daily based on need.

Accumulating Snapshot Fact Table

An Accumulating Snapshot Fact Table is used to track processes or workflows that have a definite beginning and end but take place over time. It allows you to capture the state of a process at different stages and update facts as the process progresses.

A classic example of an accumulating snapshot is in the order processing workflow, where we track the progress of an order from placement through to fulfillment and shipping. The fact table accumulates data as each order moves through various stages.

Amazon Order.. what happened at several time intervals.

As the process moves from one step to another, these get updated.

Accumulating snapshot always has different date fields denoting multiple stages of the process.

Use Cases

Track Returns: You can easily track which orders have been returned by checking the Return Date and filtering by Order Status = "Returned."

Monitor the Order Workflow: You can monitor the progress of orders through the entire workflow by checking how long it takes from Order Date to Payment Date, Shipment Date, Delivery Date, and Return Date (if applicable).

Calculate Performance Metrics: Calculate metrics like average delivery time, return rates, or the percentage of orders that are still pending or awaiting shipment.

Example Queries

Find all returned orders: Query for orders where Return Date is not NULL.

Calculate time from order to delivery: Subtract Order Date from Delivery Date to calculate how long deliveries take.

Track pending orders: Filter where Order Status is "Pending" or Shipment Date is NULL.

Another example: Hospital entry

How the Accumulating Snapshot Works:

Row Creation: A new row is created when the patient is admitted, with basic information like Admission Date, Doctor ID, and Diagnosis.

Row Updates: As the patient's treatment progresses, additional fields like Treatment Start Date, Treatment End Date, Discharge Date, and Follow-up Date are updated.

Transaction vs Periodic vs Accumulating

Difference between 3 Fact Tables

| Transaction | Periodic | Accumulating |
| --- | --- | --- |
| Stores the lowest grain of data. | Stores the current state of data at a regular interval of time. | Stores intermediate steps that have happened over a period of time. |
| One row per transaction | One row per time period | One row per entire lifetime of the event |
| Date dimension is at the lowest level | Date/time is the regular interval of the snapshot frequency | Multiple date dimensions for intermediate steps |
| Easy to aggregate | Minimal aggregation | Not easy to aggregate |
| Largest database size | Medium database size | Smallest database size |
| Only Insert | Only Insert | Insert and Update |
| Fits most business requirements | Business performance is reviewed at a regular interval | The business process has multiple stages |

Additive, Semi-Additive, Non-Additive

In data warehousing, a fact table contains measurements or facts, which can be categorized into three types: additive, semi-additive, and non-additive.

Additive Facts

These can be summed across all dimensions of the fact table. Additive facts are the most straightforward to aggregate and are commonly used in reporting.

Examples:

  • Sales revenue: Can be summed by time, product, region, etc.
  • Quantity sold: Summable across product, time, store dimensions.
  • Profit: Aggregates well across dimensions.
  • Number of website visits: Can be summed up by day, user, or region.
  • Number of clicks on a banner ad: Can be aggregated across time, campaigns, etc.

Semi-Additive Facts

These can be summed across some dimensions but not others. For example, summing across time might not make sense for certain facts like balances or inventory, which are snapshot-based.

Examples:

  • Bank balance: Can be summed across accounts but not over time (as balances are snapshots at specific points in time).
  • Stock price: Cannot be summed across time but might be relevant for a single stock across locations or markets.
  • Inventory levels: Summable across product or location but not over time.
  • Number of employees: Summable by department, but not over time as it's a snapshot metric.
  • Number of students enrolled in a course: Summable by course but not over time.

Non-Additive Facts

These cannot be summed across any dimension. These are usually metrics like ratios, percentages, or averages that do not make sense to sum.

Examples:

  • Profit margin: It's a ratio and cannot be added up across regions or time.
  • Gross margin percentage: Like profit margin, it’s a percentage and non-additive.
  • Average temperature: Averages do not sum well; they need to be recalculated for larger groups.
  • Average customer satisfaction rating: Cannot sum up ratings; they need to be averaged again for larger sets.
  • Percentage of market share: Similar to margins, percentages cannot be summed.

It is essential to identify the type of fact as it determines how the fact table will be aggregated and also impacts the design of the data warehouse.

Example

Additive Facts:

Sales Revenue: Can be summed across all dimensions (e.g., total sales revenue by day, store, product, etc.).

Quantity Sold: Additive across all dimensions (e.g., sum of quantities sold by product, store, or day).

Total Sales Revenue for P001 across all stores on 10/01/2023 = 500 (S001) + 600 (S002) = 1100.

Total Quantity Sold for P001 on 10/01/2023 = 10 (S001) + 12 (S002) = 22.

Semi-Additive Facts:

Inventory Level: This represents a snapshot of inventory at a given time. You can sum inventory across stores but not across time (e.g., you can sum inventory for a product across different stores at a single point in time, but not over multiple days).

Inventory Level for P001 on 10/01/2023 across all stores = 200 (S001) + 180 (S002) = 380.

Inventory Level across time would not be summed (e.g., you cannot add inventory levels on 10/01 and 10/02).

Non-Additive Facts:

Profit Margin (%): Cannot be summed across any dimension, as it is a percentage. For example, adding profit margins across stores or products would not yield a meaningful result. Instead, you'd need to compute a weighted average.

Average Customer Rating: Similar to profit margin, it cannot be summed across stores or time. It would need to be recalculated as an average if you wanted to aggregate it.

Average Profit Margin for P001 on 10/01/2023 would need to be recalculated based on weighted averages of sales revenues or units sold.

Average Customer Rating would need to be recalculated based on customer reviews or feedback.
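
A hedged sketch of that recalculation, assuming a hypothetical fact_sales table with additive revenue and profit columns: the margin is recomputed from the additive facts rather than summed or averaged directly.

-- Weighted average profit margin per product and day,
-- derived from additive facts (profit, revenue) instead of summing margins
SELECT product_id,
       sales_date,
       SUM(profit) / NULLIF(SUM(revenue), 0) * 100 AS profit_margin_pct
FROM fact_sales
GROUP BY product_id, sales_date;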

Periodic Snapshot vs Additive

Periodic snapshot fact and additive fact are different concepts, although they may appear similar at first glance.

A periodic snapshot fact table is designed to capture a point-in-time snapshot of some measure at regular intervals, typically periodically, such as daily, weekly, monthly, or quarterly.

An additive fact represents a measure that can be aggregated across all dimensions in a fact table. For example, sales revenue is an additive fact because it can be summed up across all dimensions such as time, product, region, etc.

| | Additive Fact | Periodic Snapshot Fact |
| --- | --- | --- |
| Definition | Represents a measure that can be aggregated across all dimensions in a fact table | Captures a point-in-time snapshot of some measure at regular intervals |
| Granularity | Can be at any level of granularity | Usually at a higher level of granularity such as daily, weekly, monthly, or quarterly |
| Aggregation | Can be summed up across all dimensions in the fact table | Captures the total value of the measure at a specific point in time |
| Example | Sales revenue, profit, number of clicks on a banner ad | Monthly sales revenue, quarterly website visits, daily customer support tickets |

Conformed Fact

A conformed fact is a measure (attribute) used in more than one fact table.

The conformed Fact table is a fact table that can be used across multiple Data Marts.

Suppose a company has separate fact tables for Online Sales and In-store Sales. Both these fact tables might have a "Total Sales" metric.

To ensure that this metric is a conformed fact, the company would need to make sure that "Total Sales" has the same meaning (i.e., the total revenue from sales), the same scale (e.g., in US dollars), and is calculated in the same way (e.g., quantity sold times unit price) in both fact tables.

Miscellaneous

CSV to Dimension Models Example

Sample Data

https://github.com/gchandra10/filestorage/blob/main/sales_100.csv

Design using https://mermaid.live/

---
title: Sales 100 example
---

erDiagram
  dim_region {
    region_id int
    region_name string
  }
  
  dim_country {
    country_id int
    region_id int
    country_name string
  }
  
  dim_item_type {
    item_type_id int
    item_type_name string
  }
  
  dim_sales_channel {
    sales_channel_id int
    sales_channel_type string
  }

  dim_order_priority{
    order_priority_id int
    order_priority string
  }

  dim_date{
    date_id int
    shortdate date
    day_of_week string
    is_weekend boolean
    day int
    month int
    week_of_month int
    week_of_year int
    quarter string
    year int
    leap_year boolean
  }
    
  fact_order_transaction {
    id int
    order_id int
    country_id int
    item_type_id int
    sales_channel_id int
    order_priority_id int

    order_date_id int
    ship_date_id int
    
    units_sold int
    unit_price float
    unit_cost float
    total_revenue float
    total_cost float
    total_profit float
  }

  dim_month {
    month_id int pk
    month string
    year int
    quarter string
    is_end_of_year boolean
  }

  fact_quarterly_sales {
    id int
    month_id int
    total_units_sold int
    total_revenue float
    total_cost float
    total_profit float
  }

  fact_region_sales{
    id int
    region_id int
    sales_channel_id int
    total_units_sold int
    total_revenue float
    total_cost float
    total_profit float
  }
  
dim_region ||--o{ dim_country : "region_id"
dim_country ||--o{ fact_order_transaction : "country_id"
dim_item_type ||--o{ fact_order_transaction : "item_type_id"
dim_sales_channel ||--o{ fact_order_transaction : "sales_channel_id"
dim_order_priority  ||--o{ fact_order_transaction : "order_priority_id"
dim_date ||--o{ fact_order_transaction : "order_date_id"
dim_date ||--o{ fact_order_transaction : "ship_date_id"

dim_month ||--o{ fact_quarterly_sales : "month_id"

dim_region ||--o{ fact_region_sales : "region_id"
dim_sales_channel ||--o{ fact_region_sales : "sales_channel_id"
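
A rough Python sketch of how the flat CSV could be split into a couple of these dimensions and the transaction fact. The column names assume the usual layout of this 100-sales-records file (Region, Item Type, Sales Channel, Order ID, and so on); verify them against the actual CSV before running.

```python
import pandas as pd

# Sketch: turn the flat sales_100.csv into dimensions with surrogate keys plus a fact table.
# Column names below are assumptions about the file layout; adjust to the real header row.
df = pd.read_csv("sales_100.csv")

def build_dim(frame: pd.DataFrame, column: str, id_name: str, value_name: str) -> pd.DataFrame:
    """Create a dimension from the distinct values of one column, with a surrogate key."""
    dim = frame[[column]].drop_duplicates().reset_index(drop=True)
    dim.insert(0, id_name, dim.index + 1)
    return dim.rename(columns={column: value_name})

dim_region = build_dim(df, "Region", "region_id", "region_name")
dim_item_type = build_dim(df, "Item Type", "item_type_id", "item_type_name")
dim_sales_channel = build_dim(df, "Sales Channel", "sales_channel_id", "sales_channel_type")

# Replace natural values in the fact with surrogate keys by joining back to the dimensions.
fact = (
    df.merge(dim_item_type, left_on="Item Type", right_on="item_type_name")
      .merge(dim_sales_channel, left_on="Sales Channel", right_on="sales_channel_type")
      [["Order ID", "item_type_id", "sales_channel_id",
        "Units Sold", "Unit Price", "Unit Cost",
        "Total Revenue", "Total Cost", "Total Profit"]]
)
print(fact.head())
```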


Sample Data Architecture Diagram

https://learn.microsoft.com/en-us/azure/architecture/example-scenario/analytics/sports-analytics-architecture-azure

How do you draw architectural diagrams?

Data Pipeline Models

Sequence Model

The Sequence Model represents a linear flow where data moves from a Source to a Process stage and finally to a Sink. This model is ideal for straightforward data transformation and processing tasks, where data is ingested, cleaned, and transformed sequentially before being stored in the final destination. This approach aligns with the Bronze → Silver → Gold (medallion) layering strategy common in data engineering.

Funnel Model

The Funnel Model aggregates data from multiple Sources, through a central processing stage, into a single Sink. This model is useful when combining data from various origins, such as multiple databases or external APIs, into a unified repository. The processing stage consolidates and harmonizes data from these different sources before loading it into the destination, ensuring that data from diverse inputs is integrated into one output.

Fan-out/Star Model

The Fan-out/Star Model starts with a single Source, processes the data, and then distributes the results to multiple Sinks. This model is effective when the processed data needs to be utilized in various downstream applications or systems. It allows the same source data to be transformed and delivered to different destinations, each serving a unique purpose or system requirement.

Each model serves distinct data engineering needs, from simple data pipelines to complex ETL processes involving multiple sources or destinations.
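
A toy Python sketch of the three shapes, with a made-up process() step standing in for cleaning and transformation; the function and variable names are illustrative only.

```python
from typing import Iterable, List

Record = dict

def process(records: Iterable[Record]) -> List[Record]:
    """Stand-in transform step: drop rows without a name and normalize the name."""
    return [{**r, "name": r["name"].strip().title()} for r in records if r.get("name")]

# Sequence: one Source -> Process -> one Sink.
def sequence(source: List[Record], sink: List[Record]) -> None:
    sink.extend(process(source))

# Funnel: many Sources -> Process -> one Sink.
def funnel(sources: List[List[Record]], sink: List[Record]) -> None:
    for source in sources:
        sink.extend(process(source))

# Fan-out/Star: one Source -> Process -> many Sinks.
def fan_out(source: List[Record], sinks: List[List[Record]]) -> None:
    cleaned = process(source)
    for sink in sinks:
        sink.extend(cleaned)

raw = [{"name": " alice "}, {"name": ""}, {"name": "bob"}]
warehouse, reporting_mart, ml_feature_store = [], [], []
fan_out(raw, [warehouse, reporting_mart, ml_feature_store])
print(warehouse)   # [{'name': 'Alice'}, {'name': 'Bob'}]
```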

New DW Concepts

  1. Cloud Data Warehousing: With the increasing popularity of cloud computing, cloud data warehousing has become a popular concept. It involves storing data in a cloud-based data warehouse rather than an on-premises one. This allows for greater scalability, flexibility, and cost savings.

Examples: Databricks, Snowflake, Azure Synapse, and so on.

  2. Data Virtualization: Data virtualization is a technique that allows data to be accessed and integrated from multiple sources without the need for physical data movement or replication. This can help reduce data redundancy and improve data consistency.

  3. Self-Service BI: Self-service BI allows business users to access and analyze data without relying on IT or data analysts. This concept has become popular with user-friendly data visualization tools enabling users to create reports and dashboards.

  4. Big Data Analytics: Big data analytics involves using advanced analytics techniques to analyze large and complex datasets. This requires specialized tools and technologies, such as Hadoop and Spark, to process and analyze large volumes of data.

  5. Data Governance: Data governance involves establishing policies, standards, and procedures for managing data assets. This helps ensure data accuracy, consistency, and security, and that data is used in a way that aligns with organizational goals and objectives.

  6. Delta Sharing: With Delta Sharing, organizations can share their data with partners, customers, and other stakeholders without having to move or copy the data. This can help reduce data duplication and improve data governance while allowing for more collaborative and agile data sharing.

Overall, these new data warehousing concepts are focused on improving the speed, flexibility, and accessibility of data and ensuring that data is used in a way that supports organizational objectives.

  7. DataOps: DataOps is a methodology that emphasizes collaboration, automation, and monitoring to improve the speed and quality of data analytics. It combines DevOps and agile methods to create a more efficient and streamlined data pipeline.

  8. Data Mesh: Data Mesh is an architectural approach emphasizing decentralization and domain-driven data architecture design. It involves breaking down data silos and creating a more flexible and scalable data architecture that aligns with business needs.

  9. Augmented Analytics: Augmented analytics is a technique that uses machine learning and artificial intelligence to automate data preparation, insight generation, and insight sharing. It aims to improve the speed and accuracy of data analytics while reducing the reliance on data scientists and analysts.

  10. Real-time Data Warehousing: Real-time data warehousing involves using streaming technologies, such as Apache Kafka, to capture and process data as it arrives. This enables organizations to analyze and act on data immediately rather than waiting for batch processing cycles.

  11. Data Privacy and Ethics: Data privacy and ethics are becoming increasingly important in data warehousing and analytics. Organizations focus on ensuring that data is collected, stored, and used ethically and responsibly and that data privacy regulations, such as GDPR and CCPA, are followed.

These are just a few new data warehousing concepts emerging in response to the changing data landscape. As data volumes continue to grow and technologies continue to evolve, we can expect to see continued innovation in data warehousing and analytics.

Dataset Examples

Retail Sales Transactions: This dataset includes individual retail store or chain sales transactions. It typically contains transaction ID, customer ID, product ID, quantity, price, and timestamp.

| Transaction ID | Customer ID | Product | Quantity | Price | Timestamp |
|---|---|---|---|---|---|
| T001 | C001 | Apple iPhone X | 5 | $10.99 | 2023-05-10 10:30:15 |
| T002 | C002 | Samsung Galaxy S9 | 3 | $24.99 | 2023-05-10 14:45:21 |
| T003 | C003 | HP Printer | 2 | $5.99 | 2023-05-11 09:12:33 |

Financial Transactions: This dataset consists of financial transactions from banks or credit card companies. It includes transaction ID, account number, transaction type, amount, merchant information, and timestamp.

| Transaction ID | Account Number | Transaction Type | Amount | Merchant | Timestamp |
|---|---|---|---|---|---|
| T001 | A123456 | Deposit | $100.00 | XYZ Bank | 2023-05-10 08:15:30 |
| T002 | B987654 | Withdrawal | -$50.00 | ABC Store | 2023-05-10 15:20:45 |
| T003 | C246810 | Transfer | $250.00 | Online Shopping | 2023-05-11 11:05:12 |

Online Marketplace Transactions: This dataset contains transactions from an online marketplace like Amazon, eBay, or Alibaba. It includes transaction ID, buyer/seller ID, product ID, quantity, price, shipping details, and timestamps.

| Transaction ID | Buyer | Seller | Product | Quantity | Price | Shipping Details | Timestamp |
|---|---|---|---|---|---|---|---|
| T001 | John Doe | SellerA | Book | 1 | $19.99 | Address1, City1, State1 | 2023-05-10 13:45:27 |
| T002 | Jane Smith | SellerB | Smartphone | 2 | $49.99 | Address2, City2, State2 | 2023-05-10 16:20:10 |
| T003 | Mark Johnson | SellerC | Headphones | 3 | $9.99 | Address3, City3, State3 | 2023-05-11 10:05:55 |

E-commerce Order Transactions: This dataset focuses on transactions from an e-commerce website. It includes data such as order ID, customer ID, product ID, quantity, price, shipping information, payment details, and timestamps.

| Order ID | Customer | Product | Quantity | Price | Shipping Information | Payment Details | Timestamp |
|---|---|---|---|---|---|---|---|
| O001 | John Doe | Apple iPhone X | 2 | $29.99 | Address1, City1, State1 | Credit Card | 2023-05-10 11:30:45 |
| O002 | Jane Smith | Samsung Galaxy S9 | 1 | $14.99 | Address2, City2, State2 | PayPal | 2023-05-10 17:15:20 |
| O003 | Mark Johnson | HP Printer | 4 | $39.99 | Address3, City3, State3 | Credit Card | 2023-05-11 08:45:10 |

Travel Booking Transactions: This dataset comprises transactions from a travel booking website or agency. It includes booking ID, traveler details, flight/hotel ID, dates, prices, payment information, and timestamps.

| Booking ID | Traveler | Flight | Hotel | Dates | Price | Payment Information | Timestamp |
|---|---|---|---|---|---|---|---|
| B001 | John Doe | Flight 123 | Hotel ABC | 2023-06-15 - 2023-06-20 | $500.00 | Credit Card | 2023-05-10 09:30:15 |
| B002 | Jane Smith | Flight 456 | Hotel XYZ | 2023-07-01 - 2023-07-10 | $750.00 | PayPal | 2023-05-11 14:20:30 |
| B003 | Mark Johnson | Flight 789 | Hotel PQR | 2023-08-10 - 2023-08-15 | $600.00 | Credit Card | 2023-05-12 11:45:55 |

Stock Market Transactions: This dataset involves transactions from stock exchanges. It includes details like trade ID, stock symbol, buy/sell order, quantity, price, trader information, and timestamps.

| Transaction ID | Stock Symbol | Buy/Sell | Quantity | Price | Trader | Timestamp |
|---|---|---|---|---|---|---|
| T001 | AAPL | Buy | 100 | $150.50 | John Doe | 2023-05-10 09:30:15 |
| T002 | GOOGL | Sell | 50 | $2500.00 | Jane Smith | 2023-05-10 11:15:45 |
| T003 | MSFT | Buy | 75 | $180.75 | Mark Johnson | 2023-05-10 14:40:20 |

Datasets

https://cseweb.ucsd.edu/~jmcauley/datasets.html

Thoughts on data

Don't remove NULL columns or bad data from the source. Learn to handle them during processing.

Sample 1

| Serial_Number | List_Year | Code |
|---|---|---|
| 200932 | 2020 | 10 - A Will |
| 200192 | 2020 | 14 - Foreclosure |
| 190871 | 2019 | 18 - In Lieu Of Foreclosure |

Create a separate dimension table for Code, splitting it into a code number (code_no) and a code name (code_name); a small parsing sketch follows the column lists below.

dim_code
  - id
  - code_no
  - code_name

fact
  - Serial_Number
  - List_Year
  - dim_id
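
A minimal parsing sketch for this split, assuming the Code column always follows the "number - name" pattern shown above:

```python
# Illustrative split of the combined Code column ("10 - A Will") into
# a code number and a code name for dim_code.
rows = [
    (200932, 2020, "10 - A Will"),
    (200192, 2020, "14 - Foreclosure"),
    (190871, 2019, "18 - In Lieu Of Foreclosure"),
]

dim_code = {}   # code_no -> (dim_id, code_name)
fact = []       # (Serial_Number, List_Year, dim_id)

for serial_number, list_year, code in rows:
    code_no, _, code_name = (part.strip() for part in code.partition("-"))
    if code_no not in dim_code:
        dim_code[code_no] = (len(dim_code) + 1, code_name)
    fact.append((serial_number, list_year, dim_code[code_no][0]))

print(dim_code)
print(fact)
```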

Sample 2

| TripID | Source Lat | Source Long | Dest Lat | Dest Long |
|---|---|---|---|---|
| 1 | -73.9903717 | 40.73469543 | -73.98184204 | 40.73240662 |
| 2 | -73.98078156 | 40.7299118 | -73.94447327 | 40.71667862 |
| 3 | -73.98455048 | 40.67956543 | -73.95027161 | 40.78892517 |

Separate Dimension for Lat and Long

dim_location
  - location_id
  - lat
  - long
  - location_name (if it exists)

Updated Fact table

fact
  - trip_id
  - source_location_id
  - destination_location_id

Another Variation

Location
POINT (-72.98492 41.64753)
POINT (-72.96445 41.25722)

Notes:

DRY Principle: You're not repeating the lat-long info, adhering to the "Don't Repeat Yourself" principle.

Ease of Update: If you need to update a location's details, you do it in one place.

Flexibility: Easier to add more attributes to locations in the future.

5 decimal places: Accurate to ~1.1 meters, usually good enough for most applications including vehicle navigation.

4 decimal places: Accurate to ~11 meters, may be suitable for some applications but not ideal for vehicle-level precision.

3 decimal places: Accurate to ~111 meters, generally too coarse for vehicle navigation but might be okay for city-level analytics.
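
A small sketch that parses the POINT (long lat) strings, rounds to 5 decimal places, and deduplicates them into dim_location; the regex and the rounding precision are illustrative choices, not requirements:

```python
import re

# Sketch: parse well-known-text points, round to 5 decimals (~1.1 m), and
# assign one surrogate key per distinct location.
points = ["POINT (-72.98492 41.64753)", "POINT (-72.96445 41.25722)", "POINT (-72.98492 41.64753)"]

dim_location = {}   # (lat, long) -> location_id

def location_key(wkt: str) -> tuple:
    # WKT stores longitude first, then latitude.
    match = re.match(r"POINT \((-?\d+\.\d+) (-?\d+\.\d+)\)", wkt)
    longitude, latitude = (round(float(v), 5) for v in match.groups())
    return (latitude, longitude)

trip_location_ids = []
for wkt in points:
    key = location_key(wkt)
    if key not in dim_location:
        dim_location[key] = len(dim_location) + 1
    trip_location_ids.append(dim_location[key])

print(dim_location)        # two distinct locations
print(trip_location_ids)   # [1, 2, 1]
```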


Sample 3

| Fiscal Year | Disbursement Date | Vendor Invoice Date | Vendor Invoice Week | Check Clearance Date |
|---|---|---|---|---|
| 2023 | 06-Oct-23 | 08-Aug-23 | 08-06-2023 | |
| 2023 | 06-Oct-23 | 16-Aug-23 | 08/13/2023 | |
| 2023 | 06-Oct-23 | 22-Sep-23 | 09/17/2023 | 10-08-2023 |

Date Dimension

| DateKey | FullDate | Year | Month | Day | Weekday | WeekOfYear | Quarter | IsWeekend | IsHoliday |
|---|---|---|---|---|---|---|---|---|---|
| 20230808 | 2023-08-08 | 2023 | 8 | 8 | 2 | 32 | 3 | FALSE | FALSE |
| 20230816 | 2023-08-16 | 2023 | 8 | 16 | 3 | 33 | 3 | FALSE | FALSE |
| 20230922 | 2023-09-22 | 2023 | 9 | 22 | 5 | 38 | 3 | FALSE | FALSE |
| 20231006 | 2023-10-06 | 2023 | 10 | 6 | 5 | 40 | 4 | FALSE | FALSE |
| 20231008 | 2023-10-08 | 2023 | 10 | 8 | 7 | 40 | 4 | TRUE | FALSE |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |

Fact Table

| Fiscal Year | Disbursement Date Key | Vendor Invoice Date Key | Vendor Invoice Week Date Key | Check Clearance Date Key |
|---|---|---|---|---|
| 2023 | 20231006 | 20230808 | 20230806 | |
| 2023 | 20231006 | 20230816 | 20230813 | |
| 2023 | 20231006 | 20230922 | 20230917 | 20231008 |

Date Dimension Sample
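
A minimal sketch of how rows like those in the date dimension above might be generated, including normalizing the mixed source formats from Sample 3 (06-Oct-23, 08-06-2023, 08/13/2023) into a single DateKey. The list of accepted formats is an assumption based on the sample values:

```python
from datetime import date, datetime

# Formats observed in the sample data; extend as needed for the real source.
FORMATS = ("%d-%b-%y", "%m-%d-%Y", "%m/%d/%Y", "%Y-%m-%d")

def parse_date(value: str) -> date:
    for fmt in FORMATS:
        try:
            return datetime.strptime(value, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {value}")

def date_dimension_row(d: date) -> dict:
    iso_year, iso_week, iso_weekday = d.isocalendar()
    return {
        "DateKey": int(d.strftime("%Y%m%d")),
        "FullDate": d.isoformat(),
        "Year": d.year,
        "Month": d.month,
        "Day": d.day,
        "Weekday": iso_weekday,            # 1 = Monday ... 7 = Sunday
        "WeekOfYear": iso_week,
        "Quarter": (d.month - 1) // 3 + 1,
        "IsWeekend": iso_weekday >= 6,
        "IsHoliday": False,                # would come from a holiday calendar lookup
    }

for raw in ("06-Oct-23", "08-06-2023", "08/13/2023"):
    print(date_dimension_row(parse_date(raw)))
```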


Sample 4

Another DateTime

| date_attending | ip_location |
|---|---|
| 2017-12-23 12:00:00 | Reseda, CA, United States |
| 2017-12-23 12:00:00 | Los Angeles, CA, United States |
| 2018-01-05 14:00:00 | Mission Viejo, CA, United States |

Date Time Dimension

| DateKey | FullDate | Year | Month | Day | Weekday | WeekOfYear | Quarter | IsWeekend | IsHoliday |
|---|---|---|---|---|---|---|---|---|---|
| 20171223 | 2017-12-23 | 2017 | 12 | 23 | 6 | 51 | 4 | TRUE | FALSE |
| 20180105 | 2018-01-05 | 2018 | 1 | 5 | 5 | 1 | 1 | FALSE | FALSE |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |

Fact Table

| FactID | DateKey | ip_location |
|---|---|---|
| 1 | 20171223 | Reseda, CA, United States |
| 2 | 20171223 | Los Angeles, CA, United States |
| 3 | 20180105 | Mission Viejo, CA, United States |
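
A short sketch of deriving the DateKey used above from the full timestamp; the time-of-day portion is simply truncated here (a separate time dimension would be needed to keep it):

```python
from datetime import datetime

# Sketch: convert the raw date_attending timestamp into a surrogate DateKey.
rows = [
    ("2017-12-23 12:00:00", "Reseda, CA, United States"),
    ("2018-01-05 14:00:00", "Mission Viejo, CA, United States"),
]

fact = []
for i, (date_attending, ip_location) in enumerate(rows, start=1):
    date_key = int(datetime.strptime(date_attending, "%Y-%m-%d %H:%M:%S").strftime("%Y%m%d"))
    fact.append({"FactID": i, "DateKey": date_key, "ip_location": ip_location})

print(fact)   # DateKeys 20171223 and 20180105
```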

Sample 5

| Job Title | Experience | Qualifications | Salary Range | Age_Group |
|---|---|---|---|---|
| Digital Marketing Specialist | 5 to 15 Years | M.Tech | $59K-$99K | Youth (<25) |
| Web Developer | 2 to 12 Years | BCA | $56K-$116K | Adults (35-64) |
| Operations Manager | 0 to 12 Years | PhD | $61K-$104K | Young Adults (25-34) |
| Network Engineer | 4 to 11 Years | PhD | $65K-$91K | Young Adults (25-34) |
| Event Manager | 1 to 12 Years | MBA | $64K-$87K | Adults (35-64) |

Experience Dimension

| ExperienceID | ExperienceRange | MinExperience | MaxExperience |
|---|---|---|---|
| 1 | 5 to 15 Years | 5 | 15 |
| 2 | 2 to 12 Years | 2 | 12 |
| 3 | 0 to 12 Years | 0 | 12 |
| 4 | 4 to 11 Years | 4 | 11 |
| 5 | 1 to 12 Years | 1 | 12 |
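
A small sketch of deriving MinExperience and MaxExperience from the text range, assuming the "N to M Years" pattern shown above:

```python
import re

# Sketch: split an experience range such as "5 to 15 Years" into numeric bounds.
def parse_experience(range_text: str) -> tuple:
    match = re.match(r"(\d+)\s*to\s*(\d+)", range_text)
    low, high = (int(v) for v in match.groups())
    return low, high

experience_dim = []
for i, rng in enumerate(["5 to 15 Years", "2 to 12 Years", "0 to 12 Years"], start=1):
    min_exp, max_exp = parse_experience(rng)
    experience_dim.append({"ExperienceID": i, "ExperienceRange": rng,
                           "MinExperience": min_exp, "MaxExperience": max_exp})

print(experience_dim)
```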

Job Dimension Table

| JobID | Job Title | Qualifications |
|---|---|---|
| 1 | Digital Marketing Specialist | M.Tech |
| 2 | Web Developer | BCA |
| 3 | Operations Manager | PhD |
| 4 | Network Engineer | PhD |
| 5 | Event Manager | MBA |

Age Group Dimension Table

| AgeGroupID | Age_Group |
|---|---|
| 1 | Youth (<25) |
| 2 | Adults (35-64) |
| 3 | Young Adults (25-34) |

Fact Table

| JobID | ExperienceID | Salary Range | AgeGroupID |
|---|---|---|---|
| 1 | 1 | $59K-$99K | 1 |
| 2 | 2 | $56K-$116K | 2 |
| 3 | 3 | $61K-$104K | 3 |
| 4 | 4 | $65K-$91K | 3 |
| 5 | 5 | $64K-$87K | 2 |

Sample 6

| ITEM CODE | ITEM DESCRIPTION |
|---|---|
| 100293 | SANTORINI GAVALA WHITE - 750ML |
| 100641 | CORTENOVA VENETO P/GRIG - 750ML |
| 100749 | SANTA MARGHERITA P/GRIG ALTO - 375ML |

The Fact will turn out to be like this

| ItemID | ItemCode | Item Description | Quantity |
|---|---|---|---|
| 1 | 100293 | SANTORINI GAVALA WHITE | 750ML |
| 2 | 100641 | CORTENOVA VENETO P/GRIG | 750ML |
| 3 | 100749 | SANTA MARGHERITA P/GRIG ALTO | 375ML |

Note: If the same Item Description repeats for various quantities, then create a separate table for Quantity.

Quantity Dimension

| QuantityID | Quantity |
|---|---|
| 1 | 750ML |
| 2 | 375ML |
| 3 | 500ML |
| ... | ... |

Item Dimension

| ItemID | Item Name | QuantityID |
|---|---|---|
| 1 | SANTORINI GAVALA WHITE | 1 |
| 2 | CORTENOVA VENETO P/GRIG | 1 |
| 3 | SANTA MARGHERITA P/GRIG ALTO | 2 |
| ... | ... | ... |

Fact Table

| FactID | ItemID | QuantityID |
|---|---|---|
| 1 | 1 | 1 |
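
A minimal sketch of the split behind Sample 6: the description is divided on its last hyphen into an item name and a quantity, and each gets a surrogate key. The hyphen convention is an assumption based on the sample rows:

```python
# Sketch: split "SANTORINI GAVALA WHITE - 750ML" into an item name and a quantity,
# then assign surrogate keys for the quantity and item dimensions.
items = [
    (100293, "SANTORINI GAVALA WHITE - 750ML"),
    (100641, "CORTENOVA VENETO P/GRIG - 750ML"),
    (100749, "SANTA MARGHERITA P/GRIG ALTO - 375ML"),
]

dim_quantity = {}   # "750ML" -> QuantityID
dim_item = []       # (ItemID, ItemCode, Item Name, QuantityID)

for item_id, (item_code, description) in enumerate(items, start=1):
    name, _, quantity = (part.strip() for part in description.rpartition("-"))
    if quantity not in dim_quantity:
        dim_quantity[quantity] = len(dim_quantity) + 1
    dim_item.append((item_id, item_code, name, dim_quantity[quantity]))

print(dim_quantity)   # {'750ML': 1, '375ML': 2}
print(dim_item)
```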

Sample 7

| production_countries | spoken_languages |
|---|---|
| United Kingdom, United States of America | English, French, Japanese, Swahili |
| United Kingdom, United States of America | English |
| United Kingdom, United States of America | English, Mandarin |

Country Dimension

| CountryID | CountryName |
|---|---|
| 1 | United Kingdom |
| 2 | United States of America |

Language Dimension

| LanguageID | LanguageName |
|---|---|
| 1 | English |
| 2 | French |
| 3 | Japanese |
| 4 | Swahili |
| 5 | Mandarin |

Approach 1 : Creating Many to Many

| FactID | CountryID | LanguageID |
|---|---|---|
| 1 | 1 | 1 |
| 1 | 2 | 1 |
| 1 | 1 | 2 |
| 1 | 2 | 2 |

Pros

Normalization: Easier to update and maintain data.

Cons

Storage: May require more storage for the additional tables and keys.

Approach 2 : Fact Table with Array Datatypes

| FactID | ProductionCountryIDs | SpokenLanguageIDs |
|---|---|---|
| 1 | [1, 2] | [1, 2, 3, 4] |
| 2 | [1, 2] | [1] |
| 3 | [1, 2] | [1, 5] |

With newer systems such as Spark SQL, which support array data types:

  • Simplicity: Easier to understand and less complex to set up.
  • Performance: Could be faster for certain types of queries, especially those that don't require unpacking the array.
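
A small Python sketch contrasting the two approaches on the sample rows above: Approach 1 explodes each fact row into a country-by-language bridge, while Approach 2 keeps arrays of surrogate keys on the fact row (as systems with array types allow). The lookup dictionaries mirror the dimension tables above:

```python
# Sketch: handle the multi-valued production_countries / spoken_languages columns.
movies = [
    {"fact_id": 1,
     "production_countries": "United Kingdom, United States of America",
     "spoken_languages": "English, French, Japanese, Swahili"},
    {"fact_id": 2,
     "production_countries": "United Kingdom, United States of America",
     "spoken_languages": "English"},
]

country_ids = {"United Kingdom": 1, "United States of America": 2}
language_ids = {"English": 1, "French": 2, "Japanese": 3, "Swahili": 4, "Mandarin": 5}

def split_ids(value: str, lookup: dict) -> list:
    """Split a comma-separated list and map each name to its surrogate key."""
    return [lookup[name.strip()] for name in value.split(",")]

# Approach 1: explode into a many-to-many bridge (one row per country/language pair).
bridge = [
    (m["fact_id"], country, language)
    for m in movies
    for country in split_ids(m["production_countries"], country_ids)
    for language in split_ids(m["spoken_languages"], language_ids)
]

# Approach 2: keep arrays of surrogate keys directly on the fact row.
fact_with_arrays = [
    (m["fact_id"],
     split_ids(m["production_countries"], country_ids),
     split_ids(m["spoken_languages"], language_ids))
    for m in movies
]

print(bridge)
print(fact_with_arrays)
```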