Disclaimer

This is not an official Textbook for Data Warehouse.

This is only reference material that accompanies the lectures presented in class.

If you find any errors, please do email me at chandr34 @ rowan.edu

Fundamentals

Terms to Know

These are some basic terms to know. We will learn a lot more going forward.

DBMS - Database Management System

RDBMS - Relational Database Management System

ETL - Extract Transform Load - Back office process for loading data to Data Warehouse.

Bill Inmon - Considered the Father of Data Warehousing.

Ralph Kimball - He’s the father of Dimensional modeling and the Star Schema.

OLTP - Online Transaction Processing. The classic operational database for taking orders, reservations, etc.

OLAP - Online Analytical Processing. Big database providers (IBM, Teradata, Microsoft, etc.) started integrating OLAP into their systems.

MetaData - Data about Data.

Data Pipeline - A set of processes that move data from one system to another.

ETL - Extract Transform Load

ELT - Extract Load Transform

Jobs

  • Data Analyst: Plays a crucial role in business decision-making by analyzing data and providing valuable insights, often sourced from a data warehouse.

  • Data Scientist: Employs data warehousing as a powerful tool for modeling, statistical analysis, and predictive analytics, enabling the resolution of complex problems.

  • Data Engineer: Demonstrates precision and dedication in their work by focusing on the design, construction, and maintenance of data pipelines that facilitate the movement of data into and out of a data warehouse.

  • Analyst Engineer: A hybrid role combining the skills of a data analyst and data engineer, often involved in analyzing data and developing the infrastructure to support that analysis.

  • Data Architect: Designs and oversees the implementation of the overall data infrastructure, including data warehouses, to ensure scalability and efficiency.

  • Database Administrator (DBA): This person manages the performance, integrity, and security of databases, including those used in data warehouses.

  • Database Security Analyst: This position focuses on ensuring the security of databases and data warehouses and protecting against threats and vulnerabilities.

  • Database Manager: Oversees the overall management and administration of databases, including those used for data warehousing.

  • Business Intelligence (BI) Analyst: Utilizes data from a data warehouse to generate reports, dashboards, and visualizations that aid business decision-making.

  • AI/ML Engineer: Uses warehouse data to build and deploy machine learning models, particularly in enterprise environments where historical data is crucial.

  • Compliance Analyst: Ensures that data warehousing solutions meet regulatory requirements, especially in industries like finance, healthcare, or insurance.

  • Chief Data Officer (CDO): An executive role responsible for an organization’s data strategy, often overseeing data warehousing as a critical component.

  • Data Doctor: Typically diagnoses and "fixes" data-related issues within an organization. This role might involve data cleansing, ensuring data quality, and resolving inconsistencies or errors in datasets.

  • Data Advocate: Champions the use of data within an organization. They promote data-driven decision-making and ensure that the value of data is recognized and utilized effectively across different departments.

Prefix any of these titles with Cloud or Big Data for even more job variations.

Skills needed

A data warehouse developer is responsible for designing, developing, and maintaining data warehouse systems. To be qualified as a data warehouse developer, a person should possess a combination of technical skills and knowledge in the following areas:

Must-have skills

  1. Database Management Systems (DBMS): A strong understanding of relational and analytical database management systems such as Oracle, SQL Server, PostgreSQL, or Teradata.

  2. SQL: Proficiency in SQL (Structured Query Language) for creating, querying, and manipulating database objects.

  3. Data Modeling: Knowledge of data modeling techniques, including dimensional modeling (star schema, snowflake schema), normalization, and denormalization. Familiarity with tools such as Vertabelo or ERwin, or PowerDesigner is a plus.

  4. ETL (Extract, Transform, Load): Experience with ETL processes and tools like Microsoft SQL Server Integration Services (SSIS), Talend, or Informatica PowerCenter for extracting, transforming, and loading data from various sources into the data warehouse.

  5. Data Integration: Understanding of data integration concepts and techniques, such as data mapping, data cleansing, and data transformation.

  6. Data Quality: Knowledge of data quality management and techniques to ensure data accuracy, consistency, and integrity in the data warehouse.

  7. Performance Tuning: Familiarity with performance optimization techniques for data warehouses, such as indexing, partitioning, and materialized views.

  8. Reporting and Data Visualization: Experience with reporting and data visualization tools like Tableau, Power BI, or QlikView for creating dashboards, reports, and visualizations to analyze and present data.

  9. Big Data Technologies: Familiarity with big data platforms such as Spark and NoSQL databases like MongoDB or Cassandra can be beneficial, as some organizations incorporate these technologies into their data warehousing solutions.

  10. Programming Languages: Knowledge of programming languages like Python, Java, or C# can help implement custom data processing logic or integrate with external systems.

  11. Cloud Platforms: Experience with cloud-based data warehousing solutions such as Databricks can be a plus as more organizations move their data warehouses to the cloud.

  12. Version Control: Familiarity with version control systems like Git or SVN for managing code and collaborating with other developers.

Nice to have skills

While Linux skills are not a core requirement for a data warehouse developer, they can be valuable for managing, optimizing, and troubleshooting a data warehousing environment:

  • Server Management
  • Scripting and Automation (AWK, Bash)
  • File System and Storage Management
  • Networking and Security
  • Performance Tuning
  • Working with Cloud Platforms
  • Deploying and Managing Containers (Docker, Podman, Kubernetes)

Application Tiers

Where does Database fit in?

  • Database Tier: Actual data
  • Application Tier: Business logic
  • Presentation Tier: Front end (Web, Client, Mobile App)

https://bluzelle.com/blog/things-you-should-know-about-database-caching

  • Robotics
  • AI
  • IoT
  • Blockchain
  • 3D Printing
  • Internet / Mobile Apps
  • Autonomous Cars - VANET Routing
  • VR / AR - Virtual Reality / Augmented Reality
  • Wireless Services
  • Quantum Computing
  • 5G
  • Voice Assistant (Siri, Alexa, Google Home)
  • Cyber Security
  • Big Data Analytics
  • Machine Learning
  • DevOps
  • NoSQL Databases
  • Microservices Architecture
  • Fintech
  • Smart Cities
  • E-commerce Platforms
  • HealthTech

Do you see any pattern or anything common in these?

Visual Interface & Data

Operational Database

An operational database management system is software that allows users to quickly define, modify, retrieve, and manage data in real time.

While conventional databases rely on batch processing, operational database systems are oriented toward real-time, transactional operations.

Let's take a Retail company that uses several systems for its day-to-day operations.

They buy software from various vendors and manage their business.

  • Sales Transactions
  • Inventory Management
  • Customer Relationship Management
  • HR Systems

and so on.

Other Examples:

  • Banner Database (Registration)
  • eCommerce Database
  • Blog Database
  • Banking Transactions

What is a Data Warehouse

  • Is it a Database?
  • Is it Big Data?
  • Is it the backend for Visualization?

Yes! Yes! & Yes!

Typical Data Architecture

Data Source Layer

  • Operational System: This includes data from various operational systems like transactional databases.
  • CRM System: Data from customer relationship management systems.
  • ERP System: Data from enterprise resource planning systems.
  • External Data: Data that might come from external sources outside the organization.

These sources feed data into the Data Warehouse system. Depending on the source, the data might be structured or unstructured.

DW Staging

  • Staging Area: This is an intermediate storage area used for data processing during the ETL (Extract, Transform, Load) process.
  • ETL Process:
    • Extract: Data is extracted from various data sources.
    • Transform: The extracted data is transformed into a format suitable for analysis and reporting. This might include cleaning, normalizing, and aggregating the data.
    • Load: The transformed data is then loaded into the Data Warehouse.
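A minimal sketch of the transform-and-load step in SQL, assuming hypothetical stg_sales (staging) and dw_sales (warehouse) tables:

-- Extract: raw data has already been landed in the staging table stg_sales.
-- Transform + Load: clean and standardize while copying into the warehouse table.
INSERT INTO dw_sales (sale_id, sale_date, store_code, amount)
SELECT
    CAST(sale_id AS INTEGER),
    CAST(sale_date AS DATE),
    upper(trim(store_code)),              -- standardize store codes
    CAST(amount AS DECIMAL(10, 2))
FROM stg_sales
WHERE sale_id IS NOT NULL;                -- drop incomplete rows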

Data Warehouse

  • Raw Data: The unprocessed data loaded directly from the staging area.
  • Metadata: Data about the data, which includes information on data definitions, structures, and rules.
  • Aggregated Data: Summarizing or aggregating data for efficient querying and analysis.

Presentation Layer

  • OLAP (Online Analytical Processing): This tool is used for multidimensional data analysis in the warehouse, enabling users to analyze data from various perspectives.
  • Reporting: Involves generating reports from the data warehouse, often for decision-making and business intelligence purposes.
  • Data Mining: This involves analyzing large datasets to identify patterns, trends, and insights, often used for predictive analytics.

Flow of Data

  • Data Source Layer → Staging Area: Data is extracted from multiple sources and brought into the staging area.
  • Staging Area → Data Warehouse: The data is transformed and loaded into the data warehouse.
  • Data Warehouse → Presentation Layer: The data is then used for various purposes, such as OLAP, Reporting, and Data Mining.

This architecture ensures that data is collected, processed, and made available for analysis in a structured and efficient manner, facilitating business intelligence and decision-making processes.

Problem Statement

RetailWorld uses different systems for sales transactions, inventory management, customer relationship management (CRM), and human resources (HR). Each system generates a vast amount of data daily.

The company's management wants to make data-driven decisions to improve its operations, optimize its supply chain, and enhance customer satisfaction. However, they face the following challenges:

  1. Data Silos: Data is stored in separate systems, making gathering and analyzing information from multiple sources challenging.

  2. Inconsistent Data: Different systems use varying data formats, making it hard to consolidate and standardize the data for analysis.

  3. Slow Query Performance: As the volume of data grows, querying the operational databases directly becomes slower and impacts the performance of the transactional systems.

  4. Limited Historical Data: Operational databases are optimized for current transactions, making storing and analyzing historical data challenging.

Solution

  1. Centralized Data Repository: The Data Warehouse consolidates data from multiple sources, breaking down data silos and enabling a unified view of the company's information.

  2. Consistent Data Format: Data is cleaned, transformed, and standardized to ensure consistency and accuracy across the organization.

  3. Improved Query Performance: The Data Warehouse is optimized for analytical processing, allowing faster query performance without impacting the operational systems.

  4. Historical Data Storage: The Data Warehouse can store and manage large volumes of historical data, enabling trend analysis and long-term decision-making.

  5. Enhanced Reporting and Analysis: The Data Warehouse simplifies the process of generating reports and conducting in-depth analyses, providing insights into sales trends, customer preferences, inventory levels, and employee performance.

Key Features

  • They are used for storing historical data.
  • Low Latency / Response time is fast.
  • Data consistency and quality of data.
  • Used with Business Intelligence, Data Analysis & Data Science.

By doing all of the above, the data warehouse answers some of the questions the business needs answered.

  • Sales of particular items this month compared to last month?
  • Top 3 best-selling products of the quarter?
  • How is the internet traffic before the pandemic / during the pandemic?

It's read-only for end-users and upper management.
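For instance, the "top 3 best-selling products of the quarter" question might look like the following SQL, assuming a hypothetical dw_sales table with product_name, quantity_sold, and sale_date columns:

-- Top 3 best-selling products for Q3 2024.
SELECT product_name, SUM(quantity_sold) AS total_sold
FROM dw_sales
WHERE sale_date >= DATE '2024-07-01'
  AND sale_date <  DATE '2024-10-01'
GROUP BY product_name
ORDER BY total_sold DESC
LIMIT 3;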

Who are the end users?

  • Data Analysts
  • Data Scientists.

Data Size

A query can range from a few thousand rows to billions of rows.

Need for Data Warehouse

Amazon's CEO wants to know, within the next 30 minutes, how sales of the new Kindle Scribe reader launched this year compare to other eReaders.

Where do we look for this info?

  • Transactional databases use every resource to serve customers. Querying them for analytics may slow down a customer's request.
  • Also, they are live databases, so they may or may not have historical data.
  • Chances are the data format differs for each region/country in which Amazon does business.

It's not just Amazon; it's everywhere.

  • Airlines
  • Bank Transactions / ATM transactions
  • Restaurant chains, and so on.

Companies will have multiple databases that are not linked with each other for specific reasons.

  • Sales Database
  • Customer Service Database
  • Marketing Database
  • Inventory Database
  • Human Resources Database

There is no reason for linking HR data to Sales data, but the CFO might need this info for budgeting.

Fun Qn: How many times have you received Marketing mail/email from the same company you have an account with?

Current State of the Art

The business world decided as follows.

  • A Database should exist just for doing BI & Strategic reports.

  • It should be separated from the operational / transaction database for the day-to-day running of the business.

  • It should encompass all aspects of the business (sales, inventory, hr, customer service…)

  • An enterprise-wide standard definition for every field name in every table.

    • Example: the employee number should be identical across databases; a mix of empNo, eNo, EmployeeNum, and empID is not acceptable.
  • Metadata database (data about data) defining assumptions about each field, describing transformations performed and cleansing operations, etc.

    • Example: If US telephone, it should be nnn-nnn-nnnn or (nnn) nnn-nnnn
  • Data Warehouse is read-only to its end users so that everyone will use the same data, and there will be no mismatch between teams.

  • Fast access, even if it's big data.

How is it done?

  • Operational databases for tracking sales, inventory, support calls, chat, and email. (Relational / NoSQL)

  • The Back Office team (ETL team) gathers data from multiple sources, cleans it, transforms it, fills in missing values, and stores it in the Staging database.

    • If the phone number is not in the standard format, format it (see the sketch after this list).
    • If the email address is not linked to the chat/phone record, read it from the Customer record and update it.
  • Staging database: the working database where all the work is done on the data. The result is then loaded into the data warehouse, which is visible as "read-only" to end users.

  • Data Analysts then build reports using Data Warehouse.
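A hedged sketch of the phone-number cleanup step mentioned above, assuming a hypothetical stg_customers staging table with a raw_phone column (regexp_replace and substr as in DuckDB/PostgreSQL):

-- Keep only the digits of the raw phone number, then rebuild nnn-nnn-nnnn.
WITH cleaned AS (
    SELECT customer_id,
           regexp_replace(raw_phone, '[^0-9]', '', 'g') AS digits
    FROM stg_customers
)
SELECT customer_id,
       substr(digits, 1, 3) || '-' ||
       substr(digits, 4, 3) || '-' ||
       substr(digits, 7, 4) AS phone_std
FROM cleaned;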

Back to the original question.

If all of these things are done right, Amazon's CEO can get the report in less than 30 minutes without interfering with business operations. 👍

Types of Data

  • Structured Data (rows/columns CSV, Excel)
  • Semi-Structured Data (JSON / XML)
  • Unstructured Data (Video, Audio, Document, Email)

Structured Data

| ID | Name | Join Date |
|---|---|---|
| 101 | Rachel Green | 2020-05-01 |
| 201 | Joey Tribianni | 1998-07-05 |
| 301 | Monica Geller | 1999-12-14 |
| 401 | Cosmo Kramer | 2001-06-05 |

Semi-Structured Data

JSON

[
   {
      "id":1,
      "name":"Rachel Green",
      "gender":"F",
      "series":"Friends"
   },
   {
      "id":"2",
      "name":"Sheldon Cooper",
      "gender":"M",
      "series":"BBT"
   }
]

XML

<?xml version="1.0" encoding="UTF-8"?>
<actors>
   <actor>
      <id>1</id>
      <name>Rachel Green</name>
      <gender>F</gender>
      <series>Friends</series>
   </actor>

   <actor>
      <id>2</id>
      <name>Sheldon Cooper</name>
      <gender>M</gender>
      <series>BBT</series>
   </actor>
</actors>

Unstructured Data

  1. Text Logs: Server logs, application logs.
  2. Social Media Posts: Tweets, Facebook comments.
  3. Emails: Customer support interactions.
  4. Audio/Video: Customer call recordings and marketing videos.
  5. Customer Reviews: Free-form text reviews.
  6. Images: Product images, user profile pictures.
  7. Documents: PDFs, Word files.
  8. Sensor Data: IoT data streams.

These can be ingested into modern data warehouses for analytics, often after some preprocessing. For instance, text can be analyzed with NLP before storing, or images can be processed into feature vectors.

Data Storage Systems

Data Lake

A place where you dump all forms of data from your business.

Structured / Unstructured / Semi-Structured.

Example

  • Customer service chat logs, voice recordings, email, website comments, social media.
  • Need a cheap way to store different types of data in large quantities.
  • Data is not needed right now but is kept for planned later use.
  • Larger organizations need all kinds of data to analyze and improve business.

Data Warehouse

  • Data Warehouse - Stores data that is already modeled/structured and ready for use.
  • Data from the Warehouse can be used for analyzing its operational data.
  • There will be developers to support the data.
  • It’s multi-purpose storage for different use cases.

Data Mart

A subset of Data Warehouse for a specific use case.

A specific group of users uses it, so it is more secure and performs better.

Example: Pandemic Analysis

Dependent Data Marts - constructed from an existing data warehouse.

Example: Grocery / School Supplies

Independent Data Marts - built from scratch and operated in silos.

Example: Mask / Glove Sales

Hybrid Data Marts - Mix and match both.

Data Warehouse 1980 - Current

Data Warehouses (1980 - 2000):

Pros

  • High Quality Data.
  • Standard modeling technique (star schema/Kimball).
  • Reliability through ACID transactions.
  • Very good fit for business intelligence.

Cons

  • Closed Formats.
  • Support only SQL.
  • No support for Machine Learning.
  • No streaming support.
  • Limited scaling support.

Data Lakes (2010 - 2020)

Pros

  • Support for open formats.
  • Can support all data types & their use cases.
  • Scalability through underlying cloud storage.
  • Support for Machine Learning & AI.

Cons

  • Weak schema support.
  • No ACID transaction support.
  • Low data quality.
  • Leads to "Data Swamps".

Lakehouses (2020 and beyond):

Pros

  • Support for both BI and ML/AI workloads.
  • Standard Storage Format.
  • Reliability through ACID transactions.
  • Scalability through underlying cloud storage.

Cons

  • Cost Considerations.
  • Data Governance and Security.
  • Performance Overhead. (Due to ACID transactions)

Data Warehouse vs Data Mart

| Data Warehouse | Data Mart |
|---|---|
| Independent application / system | Specific to support one system. |
| Contains detailed data. | Mostly aggregated data. |
| Involves top-down/bottom-up approach. | Involves bottom-up approach. |
| Adjustable and exists for an extended period of time. | Restricted for a project / shorter duration of time. |

Data Warehouse Architecture

Top-Down Approach

This method begins with developing a comprehensive enterprise data warehouse (EDW) consolidating all organizational data. Data marts are then created from this central warehouse to serve specific business units or functions. The top-down approach ensures a unified, consistent data model across the enterprise but typically requires more upfront investment in time and resources.

External Sources (ETL) >> Data Warehouse >> Data Mining
                                         >> Data Mart 1
                                         >> Data Mart 2

In the words of Inmon

"Data Warehouse as a central repository for the complete organization and data marts are created from it after the complete data warehouse has been created."

src: https://www.geeksforgeeks.org/data-warehouse-architecture/

Bottom-Up Approach

This approach starts by creating small, specific data marts for individual business units. These data marts are designed to meet the immediate analytical needs of departments. Over time, these marts are integrated into a comprehensive enterprise data warehouse (EDW). The bottom-up approach is agile and allows quick wins but can lead to challenges integrating data marts into a cohesive system.

External Sources (ETL)     >> Data Mart 1  >> Data Warehouse >> Data Mining
                           >> Data Mart 2

Kimball gives this approach: data marts are created first and provide a thin view for analysis, and a data warehouse is created after complete data marts have been created.

src: https://www.geeksforgeeks.org/data-warehouse-architecture/

Summary

Each approach has its advantages and trade-offs, with the bottom-up being more iterative and flexible, while the top-down offers a more structured and holistic view of the organization’s data.

Examples:

Top-Down - Popular big retail stores are likely to follow this architecture, as they build a centralized data warehouse that feeds their stores. Similarly, financial organizations like banks may take a top-down approach.

Bottom-Up - Popular OTT platforms follow such models: initially they bring in movies, then add their own productions and third-party providers.

Data Warehouse Characteristic

Characteristics

  • Subject Oriented
  • Integrated
  • Time-Variant
  • Non-Volatile
    • Data Loading (ETL)
    • Data Access (Reporting / BI)

Functions

  • Data Consolidation
  • Data Cleaning
  • Data Integration

Subject Oriented

src: https://unstop.com/blog/characteristics-of-data-warehouse

Data analysis for a business's decision-makers can be done quickly by restricting attention to a particular subject area of the data warehouse.

Do not add information that is unrelated to the subject under analysis.

When analyzing customer information, it's crucial to focus on the relevant data and avoid unnecessary details, such as food habits, which can distract from the main task.

Integrated

Multiple Source Systems:

In most organizations, data is stored across various systems. For example, a bank might have separate systems for savings accounts, checking accounts, and loans. Each system is designed to serve a specific purpose and might have its own database, schema, and data formats.

Unified Subject Areas:

The data warehouse is a centralized repository where data from these different source systems is brought together. This integration is not just about storing the data in one place; it involves transforming and aligning the data to be analyzed.

Consistency and Standardization:

During integration, the data is often standardized to ensure consistency. For example, account numbers might be formatted differently in the source systems, but they are unified in a standard format in the data warehouse. This standardization is crucial for accurate reporting and analysis.

Benefits:

Holistic View: The data warehouse provides a comprehensive view of a subject by integrating data from different sources. For example, a bank can now analyze a customer's relationship across all accounts rather than looking at each in isolation.

Improved Decision-Making: With integrated data, organizations can perform more sophisticated analyses, leading to better decision-making. For example, they can understand a customer's total exposure by analyzing their savings, checking, and loan accounts together.

Efficiency: Analysts and business users can access all the relevant data in one place without needing to query multiple systems.

Time Variant

Data warehouses, unlike operational databases, are designed with a unique ability to maintain a comprehensive historical record of data. This feature not only allows for trend analysis and historical reporting but also ensures the reliability of the system for comparison over time.

For example, if a customer changes their address, a data warehouse will keep old and new addresses, along with timestamps indicating when the change occurred.

Data warehouses often include a time dimension, allowing users to analyze data across different periods. This could include daily, monthly, quarterly, or yearly trends, providing insights into how data changes over time.
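A hedged sketch of how such an address history might be kept, with hypothetical table and column names:

-- Each address change adds a new row instead of overwriting the old one.
CREATE TABLE customer_address_history (
    customer_id INTEGER,
    address     VARCHAR,
    valid_from  DATE,
    valid_to    DATE,        -- NULL for the current address
    is_current  BOOLEAN
);

-- Old and new addresses coexist, each with its own time range.
INSERT INTO customer_address_history VALUES
    (101, '1 Main St, NY',   DATE '2020-01-01', DATE '2023-06-30', FALSE),
    (101, '4 John Blvd, NJ', DATE '2023-07-01', NULL,              TRUE);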

Non Volatile

Non-Volatile Nature: The data warehouse does not allow modifications to the data once it is loaded. This characteristic ensures that historical data is preserved for long-term analysis.

The cylinder on the left side of the diagram represents OLTP databases, which are typically used in operational systems. These databases handle day-to-day transactions, such as reading, adding, changing, or deleting data. OLTP systems are optimized for fast transaction processing.

The cube represents the data warehouse, a dedicated repository designed for analysis and reporting. Unlike OLTP systems, the data warehouse is non-volatile, meaning that once data is loaded into the warehouse, it remains stable and is not updated or deleted. This stability is a key feature, ensuring that historical data is preserved intact for analysis over time.

Tools

Traditional Solutions

  • SQL Server
  • Oracle
  • PostgreSQL
  • IBM DB2
  • Teradata
  • Informatica
  • SAP HANA

Cloud Solutions

  • Databricks
  • Snowflake
  • Microsoft Fabric
  • Google BigQuery
  • Amazon Redshift

ETL/ELT Tools

  • Talend
  • Apache NiFi
  • Fivetran
  • Apache Airflow (orchestrator)

Data Integration and Data Prep

  • Alteryx
  • Trifacta
  • dbt (data build tool)

BI and Analytics Tools

  • Tableau
  • QlikView
  • Power BI
  • Looker

Data Lakes

  • AWS Lake Formation
  • Azure Data Lake
  • Google Cloud Storage
  • Apache Hadoop (HDFS)

Data Cataloging and Governance

  • Unity Catalog
  • Apache Atlas
  • Collibra
  • Alation
  • Informatica Data Catalog

Big Data Technologies

  • Apache Spark
  • Apache Hive
  • Apache Impala
  • Apache HBase
  • Presto

Cloud vs On-Premise

| Feature | Cloud Data Warehouse | On-Premise |
|---|---|---|
| Scalability | Instant up / down, scale in / out | Reconfiguring / purchasing hardware, software, etc. |
| Availability | Up to 99.99% | Depends on infrastructure. |
| Security | Provided by cloud provider | Depends on the competence of the in-house IT team. |
| Performance | Serves multiple geo locations, which helps query performance | Scalability challenge |
| Cost-effectiveness | No hardware / initial cost. Pay only for usage. *If not managed carefully, it could cost a fortune. | Requires significant initial investment plus salaries. |

src: https://www.scnsoft.com/analytics/data-warehouse/cloud

Steps to design a Data Warehouse

  • Gather Requirements
  • Environment (Physical / Cloud)
    • Dev
    • Test
    • Prod
  • Data Modeling
    • Star Schema
    • Snowflake Schema
    • Galaxy Schema
  • Choose ETL - ELT Solution
  • OLAP Cubes or Not
  • Visualization Tool
  • Query Performance

Gather Requirements

  • DW is subject-oriented.

  • Needs data from all related sources. A DW is only as valuable as the data contained within it.

  • Talk to groups and align goals with the overall project.

  • Determine the scope of the project and how it helps the business.

  • Discover future needs with the data and technology solution.

  • Disaster Recovery model.

  • Security (threat detection, mitigation, monitoring)

  • Anticipate compliance needs and mitigate regulatory risks.

Environment

  • Need separate environments for Development, Testing, and Production.

  • Development & Testing will have some % of sample data from Production.

  • Testing and Production will have a similar HW environment. (Cloud / In house)

  • Nice to have a similar environment for development.

  • Track changes; index and query changes are made in the lower environments first.

  • DR environment is part of the Production release.

  • If it's in-house, deploy it in different data centers; if it's cloud, deploy it in different regions.

Data Modeling

  • Data Modeling is a process to visualize the data warehouse.

  • It helps to set standards in naming conventions, creating relationships between datasets, and establishing compliance and security.

  • Most Complex phase in data warehouse design.

  • Recall the Top-Down vs. Bottom-Up discussion; that decision plays a vital role here.

  • Data modeling typically starts at the data mart level and then branches out to the data warehouse.

  • Three popular data models for data warehouses

    • Star Schema
    • Galaxy Schema
    • Snowflake Schema

ETL / ELT Solution

ETL

  • Extract
  • Transform
  • Load

ETL plays a vital part in moving data across systems. ETL can be implemented in many ways.

GUI tools such as:

  • SSIS
  • Pentaho
  • Talend

Scripting tools such as Bash and Python can also be used.

ELT

  • Extract
  • Load
  • Transform

In big data platforms such as Hadoop and Spark, you can load a JSON or CSV file and start using it as is. These platforms can even parse compressed .gz / .bz2 files.

Extensively used when dealing with semi-structured data.
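A minimal ELT sketch, assuming a DuckDB-style engine that can query raw files directly (read_csv_auto); the file and column names are hypothetical:

-- Load: ingest the raw file as-is, including a compressed CSV.
CREATE TABLE raw_orders AS
    SELECT * FROM read_csv_auto('raw/orders_2024.csv.gz');

-- Transform: clean and reshape inside the warehouse after loading.
CREATE TABLE orders_clean AS
    SELECT
        CAST(order_id AS INTEGER)           AS order_id,
        upper(trim(country_code))           AS country_code,
        CAST(order_total AS DECIMAL(10, 2)) AS order_total
    FROM raw_orders
    WHERE order_id IS NOT NULL;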

Register with Databricks community edition

Online Analytic Processing

In an OLAP cube, data can be pre-calculated and pre-aggregated, making analysis faster.

Usually, data is organized in row and column format.

OLAP contains multi-dimensional data, with data from different data sources.

Src:https://www.holistics.io

There are 4 types of analytical operations in OLAP

Roll-up: Consolidation, aggregation. Data from different cities can be rolled up to the state/country level.

Drill-down: Opposite of roll-up. If you have data by year, you can analyze monthly, weekly, and daily trends.

Slice-dice: Take one dimension of the data from the cube and create a sub-cube.

If data from various products and quarters is available, take one quarter alone and work with it.

Pivot: Rotating the data axes. Basically swapping the x and y-axis of the data.
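A hedged sketch of roll-up and drill-down expressed as plain SQL aggregations over a hypothetical sales_by_city table:

-- Roll-up: consolidate city-level sales up to the state level.
SELECT state, SUM(sales_amount) AS total_sales
FROM sales_by_city
GROUP BY state;

-- Drill-down: break yearly figures down to the month level.
SELECT year, month, SUM(sales_amount) AS total_sales
FROM sales_by_city
GROUP BY year, month
ORDER BY year, month;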

Front End

Visualization - the primary reason for creating data warehouses.

  • Tableau
  • PowerBI
  • Looker

Check Data Warehouse Tools for more details.

Query Optimization

These are some basic best practices. There is a lot more to discuss in future sessions.

  • Retrieve only necessary rows.

  • Do not use SELECT *; instead, specify the columns you need.

  • Create views to control what data users can pull and to enforce security.

  • Filter first and Join later.

  • Filter first and Group later.

  • Monitor queries regularly for performance.

  • Index only when needed. Please don't index unwanted columns.
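A small sketch of the "filter first, join later" practice, using hypothetical orders and customers tables:

-- Reduce the fact table with a filter before joining, instead of
-- joining full tables and filtering afterwards.
WITH recent_orders AS (
    SELECT order_id, customer_id, order_total
    FROM orders
    WHERE order_date >= DATE '2024-01-01'            -- filter first
)
SELECT c.customer_name, SUM(o.order_total) AS total_spent
FROM recent_orders o
JOIN customers c ON c.customer_id = o.customer_id    -- join later
GROUP BY c.customer_name;                            -- group after filtering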

RDBMS

Data Model

Data Model: Defines how the logical structure of a database is modeled.

Entity: A Database entity is a thing, person, place, unit, object, or any item about which the data should be captured and stored in properties, workflow, and tables.

Attributes: Properties of Entity.

Example:

Entity: Student

Attributes: id, name, join date

Student

| student id | name | join date |
|---|---|---|
| 101 | Rachel Green | 2000-05-01 |
| 201 | Joey Tribianni | 1998-07-05 |
| 301 | Monica Geller | 1999-12-14 |
| 401 | Cosmo Kramer | 2001-06-05 |

Entity Relationship Model

Student

| student id | name | join date |
|---|---|---|
| 101 | Rachel Green | 2000-05-01 |
| 201 | Joey Tribianni | 1998-07-05 |
| 301 | Monica Geller | 1999-12-14 |
| 401 | Cosmo Kramer | 2001-06-05 |

Courses

| student id | semester | course |
|---|---|---|
| 101 | Semester 1 | DBMS |
| 101 | Semester 1 | Calculus |
| 201 | Semester 1 | Algebra |
| 201 | Semester 1 | Web |

One to One Mapping

erDiagram
    Student {
        int student_id PK
        string name
        date join_date
    }
    
    Studentdetails {
        int student_id PK
        string SSN
        date DOB
    }
    
    Student ||--|| Studentdetails : "1 to 1"

Student

| student id | name | join date |
|---|---|---|
| 101 | Rachel Green | 2000-05-01 |
| 201 | Joey Tribianni | 1998-07-05 |
| 301 | Monica Geller | 1999-12-14 |
| 401 | Cosmo Kramer | 2001-06-05 |

Studentdetails

| student id | SSN | DOB |
|---|---|---|
| 101 | 123-56-7890 | 1980-05-01 |
| 201 | 236-56-4586 | 1979-07-05 |
| 301 | 365-45-9875 | 1980-12-14 |
| 401 | 148-89-4758 | 1978-06-05 |

For every row on the left-hand side, there will be only one matching entry on the right-hand side.

For student id 101, you will find one SSN and one DOB.

One to Many Mapping

erDiagram
    Student {
        int student_id PK
        string name
        date join_date
    }
    
    Address {
        int address_id PK
        int student_id FK
        string address
        string address_type
    }
    
    Student ||--o{ Address : "has"

Student

| student id | name | join date |
|---|---|---|
| 101 | Rachel Green | 2000-05-01 |
| 201 | Joey Tribianni | 1998-07-05 |
| 301 | Monica Geller | 1999-12-14 |
| 401 | Cosmo Kramer | 2001-06-05 |

Address

| student id | address id | address | address type |
|---|---|---|---|
| 101 | 1 | 1 main st, NY | Home |
| 101 | 2 | 4 john blvd, NJ | Dorm |
| 301 | 3 | 3 main st, NY | Home |
| 301 | 4 | 5 john blvd, NJ | Dorm |
| 201 | 5 | 12 center st, NY | Home |
| 401 | 6 | 11 pint st, NY | Home |

What do you notice here?

Every row on the left-hand side has one or more rows on the right-hand side.

For student id 101, you will notice the home address and Dorm address.

Many to Many Mapping

erDiagram
    Student {
        int student_id PK
        string name
        date join_date
    }
    
    Course {
        string course_id PK
        string course_name
    }
    
    StudentCourses {
        int student_id FK
        string course_id FK
    }
    
    Student ||--o{ StudentCourses : enrolls
    Course ||--o{ StudentCourses : offered_in

Student

| student id | name | join date |
|---|---|---|
| 101 | Rachel Green | 2000-05-01 |
| 201 | Joey Tribianni | 1998-07-05 |
| 301 | Monica Geller | 1999-12-14 |
| 401 | Cosmo Kramer | 2001-06-05 |

Student Courses

| student id | course id |
|---|---|
| 101 | c1 |
| 101 | c2 |
| 301 | c1 |
| 301 | c3 |
| 201 | c3 |
| 401 | c4 |

Courses

| course id | course name |
|---|---|
| c1 | DataBase |
| c2 | Web Programming |
| c3 | Big Data |
| c4 | Data Warehouse |

What do you notice here?

Students can take more than one course, and courses can have more than one student.
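A minimal sketch of resolving this many-to-many relationship through the bridge table, using the entity and column names from the diagram above:

-- Each student is paired with every course they enrolled in.
SELECT s.name, c.course_name
FROM Student s
JOIN StudentCourses sc ON sc.student_id = s.student_id
JOIN Course c          ON c.course_id   = sc.course_id
ORDER BY s.name;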

Attributes

Types of Attributes

Simple Attribute - Atomic Values.

An attribute that cannot be divided further.

Example: ssn

Composite Attribute

Made up of more than one simple attribute.

Example: firstname + mi + lastname

Derived Attribute

Calculated from existing attributes.

Age: Derived from DOB
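A small sketch of computing a derived attribute, assuming a hypothetical students table with a dob column (date_diff as in DuckDB):

-- Age is not stored; it is derived from DOB at query time.
SELECT name,
       dob,
       date_diff('year', dob, current_date) AS age   -- rough age in years
FROM students;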

Multivalued Attribute

An attribute that can have multiple values for a single entity (e.g., PhoneNumbers for a person).

src: https://www.tutorialspoint.com/dbms/er_diagram_representation.htm

Student: Entity
Name: Composite Attribute
Student ID: Simple Attribute
Age: Derived Attribute
PhoneNo: MultiValued Attribute
(The Student can have more than one phone number)

Keys

Primary Key

The Attribute helps to identify a row in an entity uniquely.

It cannot be empty and cannot have duplicates.

| student id | name | join date |
|---|---|---|
| 101 | Rachel Green | 2020-05-01 |
| 201 | Joey Tribianni | 1998-07-05 |
| 301 | Monica Geller | 1999-12-14 |
| 401 | Cosmo Kramer | 2001-06-05 |

Based on the above data, which attribute can be the PK?

Any column (as it's just four rows).

If we extend the dataset to 10,000 rows, which column can be the PK?

The student id.

Composite Key

A Primary key that consists of two or more attributes is known as a composite key.

| student id | course id | name |
|---|---|---|
| 101 | C1 | Rachel Green |
| 101 | C2 | Rachel Green |
| 201 | C2 | Monica Geller |
| 201 | C3 | Cosmo Kramer |

Do you know how to find the unique row?

The combination of StudentID + CourseID makes it unique.
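A minimal sketch of declaring such a composite key, with a hypothetical student_courses table:

CREATE TABLE student_courses (
    student_id INTEGER,
    course_id  VARCHAR,
    name       VARCHAR,
    PRIMARY KEY (student_id, course_id)   -- the two columns together identify a row
);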

Unique Key

Unique keys are similar to a primary key, except that they can allow a NULL (the exact number of NULLs allowed varies by DBMS).

Transaction

What is a Transaction?

A transaction can be defined as a group of tasks.

A simple task is the minimum processing unit that cannot be divided further.

Example: Transfer $500 from X account to Y account.

Open_Account(X) 
Old_Balance = X.balance 
New_Balance = Old_Balance - 500 
X.balance = New_Balance 
Close_Account(X)

Open_Account(Y) 
Old_Balance = Y.balance 
New_Balance = Old_Balance + 500 
Y.balance = New_Balance 
Close_Account(Y)
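The same transfer expressed as one atomic SQL transaction; the accounts table and column names are hypothetical:

BEGIN TRANSACTION;
UPDATE accounts SET balance = balance - 500 WHERE account_id = 'X';
UPDATE accounts SET balance = balance + 500 WHERE account_id = 'Y';
COMMIT;   -- both updates succeed together; a ROLLBACK would undo both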

States of Transaction

src: https://www.tutorialspoint.com/dbms/dbms_transaction.htm

src: www.guru99.com

  • Active: When the transaction's instructions are running, the transaction is active.
  • Partially Committed: After completing all the read and write operations, the changes are made in the main memory or local buffer.
  • Committed: It is the state in which the changes are made permanent on the DataBase.
  • Failed: When any instruction of the transaction fails.
  • Aborted: After a failure, the changes (which were made only to the local buffer or main memory) are discarded and the transaction is rolled back.

ACID

ACID Compliant

Atomicity: All or none. Changes all or none.

Consistency: Never leaves the database in a half-finished state. Deleting a customer is not possible before deleting related invoices.

Isolation: Separates a transaction from another. Transactions are isolated; they occur independently without interference. One change will not be visible to another transaction.

Durability: Recovery from an abnormal termination. Once the transaction is completed, data is written to disk and persists even if the system fails. The DB returns to a consistent state when restarted after an abnormal termination.

Example

Atomicity

Example: Transferring money between two bank accounts. If you transfer $100 from Account A to Account B, the transaction must ensure that either the debit from Account A and credit to Account B occur or neither occurs. If any error happens during the transaction (e.g., system crash), the transaction will be rolled back, ensuring that no partial transaction is completed.

Consistency

Example: Enforcing database constraints, such as ensuring that a field that stores a percentage can only hold values between 0 and 100. If an operation tries to insert a value of 110, it will fail because it violates the consistency rule of the database schema.

Isolation:

Example: Two transactions that concurrently update the same set of rows in a database. Transaction A updates a row; before it commits, Transaction B also tries to update the same row. Depending on the isolation level (e.g., Serializable), Transaction B may be required to wait until Transaction A commits or rolls back to avoid data inconsistencies, ensuring that transactions do not interfere with each other.

Durability:

Example: After a transaction is committed, such as a user adding an item to a shopping cart in an e-commerce application, the data must be permanently saved to the database. Even if the server crashes immediately after the commit, the added item must remain in the shopping cart once the system is back online.

Online/Realtime vs Batch

Online Processing

An online system handles transactions when they occur and provides output directly to users. It is interactive.

Use Cases

E-commerce Websites:

Use Case: Processing customer orders. Example: When a customer places an order, the system immediately updates inventory, processes the payment, and provides an order confirmation in real time. Any delay could lead to issues like overselling stock or customer dissatisfaction.

Online Banking:

Use Case: Fund transfers and account balance updates. Example: When a user transfers money between accounts, the transaction is processed immediately, and the balance is updated in real-time. Real-time processing is crucial to ensure the funds are available immediately and that the account balance reflects the latest transactions.

Social Media Platforms:

Use Case: Posting updates and notifications. Example: When a user posts a new status or comment, it should be instantly visible to their followers. Notifications about likes, comments, or messages are also delivered in real time to maintain user engagement.

Ride-Sharing Services (e.g., Uber, Lyft):

Use Case: Matching drivers with passengers. Example: When a user requests a ride, the system matches them with a nearby driver in real time, providing immediate feedback on the driver's location and estimated arrival time.

Fraud Detection Systems:

Use Case: Monitoring transactions for fraudulent activities. Example: Credit card transactions are monitored in real time to detect unusual patterns and prevent fraud before they are completed.

Batch Processing

Data is processed in groups or batches. Batch processing is typically used for large amounts of data that must be processed on a routine schedule, such as paychecks or credit card transactions.

A batch processing system has several main characteristics: collect, group, and process transactions periodically.

Batch programs require no user involvement and require significantly fewer network resources than online systems.

Use Cases

End-of-Day Financial Processing:

Use Case: Reconciling daily transactions. Example: Banks often batch process all the day's transactions at the end of the business day to reconcile accounts, generate statements, and update records. This processing doesn't need to be real-time but must be accurate and comprehensive.

Data Warehousing and ETL Processes:

Use Case: Extracting, transforming, and loading data into a data warehouse. Example: A retail company may extract sales data from various stores, transform it to match the warehouse schema, and load it into a centralized data warehouse. This process is typically done in batches overnight to prepare the data for reporting and analysis the next day.

Payroll Processing:

Use Case: Calculating employee salaries. Example: Payroll systems typically calculate and process salaries in batches, often once per pay period. Employee data (hours worked, overtime, deductions) is collected over the period and processed in a single batch job.

Inventory Updates:

Use Case: Updating inventory levels across multiple locations. Example: A chain of retail stores might batch-process inventory updates at the end of each day. Each store's sales data is sent to a central system, where inventory levels are adjusted in batches to reflect the day's sales.

Billing Systems:

Use Case: Generating customer bills. Example: Utility companies often generate customer bills at the end of each month. Usage data is collected throughout the month, and bills are generated in batch processing jobs, which are then sent to customers.

DSL vs GPL

GPL - General-Purpose Programming Language. Python / Java / C++

One tool can be used to do many things. 

DSL - Domain-Specific Language. HTML / SQL / JQ / AWK /...

Specific tools to do a specific job.

src: https://tomassetti.me/domain-specific-languages/

Storage Formats

| Account number | Last name | First name | Purchase (in dollars) |
|---|---|---|---|
| 1001 | Green | Rachel | 20.12 |
| 1002 | Geller | Ross | 12.25 |
| 1003 | Bing | Chandler | 45.25 |

Row Oriented Storage

In a row-oriented DBMS, the data would be stored as

1001,Green,Rachel,20.12;1002,Geller,Ross,12.25;1003,Bing,Chandler,45.25

Best suited for OLTP - Transaction data.

Columnar Oriented Storage

1001,1002,1003;Green,Geller,Bing;Rachel,Ross,Chandler;20.12,12.25,45.25

Best suited for OLAP - Analytical data.

  1. Compression: Since the data in a column tends to be of the same type (e.g., all integers, all strings), and often similar values, it can be compressed much more effectively than row-based data.

  2. Query Performance: Queries that only access a subset of columns can read just the data they need, reducing disk I/O and significantly speeding up query execution.

  3. Analytic Processing: Columnar storage is well-suited for analytical queries and data warehousing, which often involve complex calculations over large amounts of data. Since these queries often only affect a subset of the columns in a table, columnar storage can lead to significant performance improvements.

src: https://mariadb.com/resources/blog/why-is-columnstore-important/
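A hedged DuckDB-style sketch of this benefit, assuming the table above has been saved as a hypothetical sales.parquet file: only the referenced columns are read from disk, not every full row.

-- Only the last_name and purchase columns are scanned.
SELECT last_name, SUM(purchase) AS total_purchases
FROM read_parquet('sales.parquet')
GROUP BY last_name;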

File Formats

CSV/TSV

Pros

  • Tabular Row storage.
  • Human-readable is easy to edit manually.
  • Simple schema.
  • Easy to implement and parse the file(s).

Cons

  • No standard way to present binary data.
  • No complex data types.
  • Large in size.

JSON

Pros

  • Supports hierarchical structure.
  • Most languages support them.
  • Widely used in Web

Cons

  • More memory usage due to repeatable column names.
  • Not very splittable.
  • Lacks indexing.

Parquet

Parquet is a columnar storage file format optimized for use with Apache Hadoop and related big data processing frameworks. Twitter and Cloudera developed it to provide a compact and efficient way of storing large, flat datasets.

Best for WORM (Write Once Read Many).

The key features of Parquet are:

  1. Columnar Storage: Parquet is optimized for columnar storage, unlike row-based files like CSV or TSV. This allows it to efficiently compress and encode data, which makes it a good fit for storing data frames.
  2. Schema Evolution: Parquet supports complex nested data structures, and the schema can be modified over time. This provides much flexibility when dealing with data that may evolve.
  3. Compression and Encoding: Parquet allows for highly efficient compression and encoding schemes. This is because columnar storage makes better compression and encoding schemes possible, which can lead to significant storage savings.
  4. Language Agnostic: Parquet is built from the ground up for use in many languages. Official libraries are available for reading and writing Parquet files in many languages, including Java, C++, Python, and more.
  5. Integration: Parquet is designed to integrate well with various big data frameworks. It has deep support in Apache Hadoop, Apache Spark, and Apache Hive and works well with other data processing frameworks.

In short, Parquet is a powerful tool in the big data ecosystem due to its efficiency, flexibility, and compatibility with a wide range of tools and languages.

Difference between CSV and Parquet

| Aspect | CSV (Comma-Separated Values) | Parquet |
|---|---|---|
| Data Format | Text-based, plain text | Columnar, binary format |
| Compression | Usually uncompressed (or lightly compressed) | Highly compressed |
| Schema | None, schema-less | Strong schema enforcement |
| Read/Write Efficiency | Row-based, less efficient for column operations | Column-based, efficient for analytics |
| File Size | Generally larger | Typically smaller due to compression |
| Storage | More storage space required | Less storage space required |
| Data Access | Good for sequential access | Efficient for accessing specific columns |
| Example Size (1 GB) | Could be around 1 GB or more depending on compression | Could be 200-300 MB (due to compression) |
| Use Cases | Simple data exchange, compatibility | Big data analytics, data warehousing |
| Support for Data Types | Limited to text, numbers | Rich data types (int, float, string, etc.) |
| Processing Speed | Slower for large datasets, particularly for queries on specific columns | Faster, especially for column-based queries |
| Tool Compatibility | Supported by most tools, databases, and programming languages | Supported by big data tools like Apache Spark, Hadoop, etc. |

Parquet Compression

  • Snappy (default)
  • Gzip

Snappy

  • Low CPU Util
  • Low Compression Rate
  • Splittable
  • Use Case: Hot Layer
  • Compute Intensive

GZip

  • High CPU Util
  • High Compression Rate
  • Splittable
  • Use Case: Cold Layer
  • Storage Intensive
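A hedged DuckDB sketch of choosing a Parquet codec at write time; the table and file names are hypothetical, and the COMPRESSION option follows DuckDB's COPY syntax:

-- Snappy: faster, lighter compression (hot data).
COPY sales TO 'sales_snappy.parquet' (FORMAT PARQUET, COMPRESSION 'snappy');

-- Gzip: slower, tighter compression (cold data).
COPY sales TO 'sales_gzip.parquet' (FORMAT PARQUET, COMPRESSION 'gzip');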

DuckDB

DuckDB is an in-process SQL database management system designed for efficient analytical query processing.

It's a single-file DB; there is no need for a client/server setup.

It's like SQLite on steroids.

Install Duck DB

https://duckdb.org/#quickinstall

Key Features

In-Process Execution: DuckDB operates entirely within your application process, eliminating the need for a client-server setup. This makes it highly efficient for embedded analytics.

Columnar Storage: DuckDB uses a columnar storage format, unlike traditional row-based databases. This allows it to handle analytical queries more efficiently, as operations can simultaneously be performed on entire columns.

SQL Compatibility: DuckDB supports a wide range of SQL features, so it is easy to pick up for anyone familiar with SQL. It can handle complex queries, joins, and aggregations without requiring specialized query syntax.

Integration and Ease of Use: DuckDB is designed to integrate easily into various environments. It supports multiple programming languages, including Python, R, and C++, and can be used directly within data science workflows.

Scalability: While optimized for in-process execution, DuckDB can scale to handle large datasets, making it a solid choice for lightweight analytics and exploratory data analysis (EDA).

Extensibility: The database supports user-defined functions (UDFs) and extensions, allowing you to extend its capabilities to fit your needs.

Limitations

  • Limited Transaction Support
  • Concurrency Limitations
  • No Client-Server Model
  • Lack of Advanced RDBMS Features
  • Not Designed for Massive Data Warehousing

Duckdb Sample - 01

Launch Duckdb

duckdb
.open newdbname.duckdb
CREATE TABLE employees (
    id INTEGER PRIMARY KEY,
    name VARCHAR NOT NULL,
    email VARCHAR UNIQUE,
    department VARCHAR,
    salary DECIMAL(10, 2) CHECK (salary > 0)
);

INSERT INTO employees (id,name, email, department, salary) VALUES
(1,'Rachel Green', 'rachel.green@friends.com', 'Fashion', 55000.00);

INSERT INTO employees (id,name, email, department, salary) VALUES
(2,'Monica Geller', 'monica.geller@friends.com', 'Culinary', 62000.50);

INSERT INTO employees (id,name, email, department, salary) VALUES
(3,'Phoebe Buffay', 'phoebe.buffay@friends.com', 'Massage Therapy', 48000.00);

INSERT INTO employees (id,name, email, department, salary) VALUES
(4,'Joey Tribbiani', 'joey.tribbiani@friends.com', 'Acting', 40000.75);

INSERT INTO employees (id,name, email, department, salary) VALUES
(5,'Chandler Bing', 'chandler.bing@friends.com', 'Data Analysis', 70000.25);

INSERT INTO employees (id,name, email, department, salary) VALUES
(6,'Ross Geller', 'ross.geller@friends.com', 'Paleontology', 65000.00);
select * from employees;
  • .databases or show databases;
  • .tables or show tables;
  • .mode list
  • .mode table
  • .mode csv
  • summarize employees;
  • summarize select name,email from employees;
  • .exit

CREATE SEQUENCE dept_id_seq;

drop table if exists departments;

CREATE TABLE departments (
    id INTEGER PRIMARY KEY DEFAULT nextval('dept_id_seq'),
    dept_name VARCHAR NOT NULL
);

insert into departments(dept_name) values('Fashion');
insert into departments(dept_name) values('Culinary');
insert into departments(dept_name) values('Massage');
insert into departments(dept_name) values('Acting');
insert into departments(dept_name) values('Data Analysis');
insert into departments(dept_name) values('Paleontology');


CREATE SEQUENCE emp_id_seq;

DROP TABLE if exists employees;

CREATE TABLE if not exists  employees (
    id INTEGER PRIMARY KEY DEFAULT nextval('emp_id_seq'),
    name VARCHAR NOT NULL,
    email VARCHAR UNIQUE,
    department VARCHAR,
    salary DECIMAL(10, 2) CHECK (salary > 0),
    hire_date DATE DEFAULT CURRENT_DATE,
);


INSERT INTO employees (name, email, department, salary) VALUES
('Rachel Green', 'rachel.green@friends.com', 'Fashion', 55000.00);

INSERT INTO employees (name, email, department, salary) VALUES
('Monica Geller', 'monica.geller@friends.com', 'Culinary', 62000.50);

INSERT INTO employees (name, email, department, salary) VALUES
('Phoebe Buffay', 'phoebe.buffay@friends.com', 'Massage Therapy', 48000.00);

INSERT INTO employees (name, email, department, salary) VALUES
('Joey Tribbiani', 'joey.tribbiani@friends.com', 'Acting', 40000.75);

INSERT INTO employees (name, email, department, salary) VALUES
('Chandler Bing', 'chandler.bing@friends.com', 'Data Analysis', 70000.25);

INSERT INTO employees (name, email, department, salary) VALUES
('Ross Geller', 'ross.geller@friends.com', 'Paleontology', 65000.00);


-- Note: inserting an explicit id (8) bypasses the sequence; when the sequence
-- later generates 8, that insert will fail with a primary key conflict.
INSERT INTO employees (id,name, email, department, salary) VALUES
(8,'Ben Geller', 'ben.geller@friends.com', 'Student', 1.00);

INSERT INTO employees (name, email, department, salary) VALUES
('Emma Green', 'emma.green@friends.com', 'Kid', 1.00);


CREATE SEQUENCE emp_dept_id_seq;

CREATE TABLE emp_dept (
	id INTEGER PRIMARY KEY DEFAULT nextval('emp_dept_id_seq'),
    dept_id INTEGER NOT NULL,
    emp_id INTEGER NOT NULL,
    FOREIGN KEY (dept_id) REFERENCES departments(id),
    FOREIGN KEY (emp_id) REFERENCES employees(id)
);

INSERT INTO emp_dept(emp_id,dept_id) VALUES (1,1),(2,2),(3,3),(4,4),(5,5),(6,6);

-- Expected to fail: there is no employee with id 9 and no department with id 7,
-- so the FOREIGN KEY constraints reject this row.
INSERT INTO emp_dept(emp_id,dept_id) VALUES (9,7);


CREATE SEQUENCE product_id_seq;

CREATE TABLE products (
    id INTEGER PRIMARY KEY DEFAULT nextval('product_id_seq'),
    name VARCHAR NOT NULL,
    price DECIMAL(10, 2),
    discounted_price DECIMAL(10, 2) GENERATED ALWAYS AS (price * 0.9) VIRTUAL
);


CREATE SEQUENCE orders_id_seq;

CREATE TABLE orders (
    id INTEGER PRIMARY KEY DEFAULT nextval('orders_id_seq'),
    customer_name VARCHAR,
    items INTEGER[],
    shipping_address STRUCT(
        street VARCHAR,
        city VARCHAR,
        zip VARCHAR
    )
);


INSERT INTO orders (customer_name, items, shipping_address) VALUES
('John Doe', [1, 2, 3], {'street': '123 Elm St', 'city': 'Springfield', 'zip': '11111'}),
('Jane Smith', [3, 4, 5], {'street': '456 Oak St', 'city': 'Greenville', 'zip': '22222'}),
('Emily Johnson', [6, 7, 8, 9], {'street': '789 Pine St', 'city': 'Fairview', 'zip': '33333'});

Query Orders


select * from orders;

select id,customer_name,items,shipping_address.city from orders;

select id,customer_name,items,shipping_address['city'] from orders;

select id,customer_name,shipping_address.* from orders;

-- Array

select id,customer_name,len(items) from orders;

select id,customer_name,list_contains(items,3) from orders;

select id,customer_name,list_distinct(items) from orders;

select id,customer_name,unnest(items),shipping_address.* from orders;

More Functions

https://duckdb.org/docs/sql/functions/list


select current_catalog(), current_schema();

-- current_catalog(): returns the name of the currently active catalog. Default is memory.
-- current_schema(): returns the name of the currently active schema. Default is main.

Query Github Directly

select * from read_parquet('https://github.com/duckdb/duckdb-data/releases/download/v1.0/userdata1.parquet');
select * from read_parquet('https://github.com/duckdb/duckdb-data/releases/download/v1.0/orders.parquet') limit 5;
select * from read_parquet('https://github.com/duckdb/duckdb-data/releases/download/v1.0/city_temperature.parquet') limit 5;

Create DuckDB table

create table orders as select * from read_parquet('https://github.com/duckdb/duckdb-data/releases/download/v1.0/orders.parquet');

describe table

describe orders;
select * from read_json('https://github.com/duckdb/duckdb-data/releases/download/v1.0/canada.json') limit 5;

DuckDB Arrays, CTE & DATE Dimension

select generate_series(1,100,12) as num;
select generate_series(DATE '2024-09-01', DATE '2024-09-10', INTERVAL 1 DAY) as dt;
with date_cte as(
	select strftime(unnest(generate_series(DATE '2024-09-01', DATE '2024-09-10', INTERVAL 1 DAY)),'%Y-%m-%d') as dt
)
SELECT dt FROM date_cte;
WITH date_cte AS (
    SELECT unnest(generate_series(DATE '2024-09-01', DATE '2024-09-10', INTERVAL 1 DAY)) AS dt
)
SELECT 
  REPLACE(CAST(CAST(dt AS DATE) AS VARCHAR), '-', '') AS DateKey
  ,CAST(dt AS DATE) AS ShortDate
  ,DAYNAME(dt) AS DayOfWeek
  ,CASE 
    WHEN EXTRACT(DOW FROM dt) IN (0, 6) THEN 'Weekend'
    ELSE 'Weekday'
  END AS IsWeekend
  ,CAST(CEIL(EXTRACT(DAY FROM dt) / 7.0) AS INTEGER) AS WeekOfMonth
  ,EXTRACT(WEEK FROM dt) AS WeekOfYear
  ,'Q' || CAST(EXTRACT(QUARTER FROM dt) AS VARCHAR) AS Quarter
  ,CASE 
        WHEN strftime(dt, '%m') IN ('01', '02', '03') THEN 'Q1'
        WHEN strftime(dt, '%m') IN ('04', '05', '06') THEN 'Q2'
        WHEN strftime(dt, '%m') IN ('07', '08', '09') THEN 'Q3'
        ELSE 'Q4'
    END AS Quarter_alternate
  ,EXTRACT(YEAR FROM dt) AS Year
  ,EXTRACT(MONTH FROM dt) AS Month
  ,EXTRACT(Day FROM dt) AS Day

FROM date_cte
ORDER BY dt;

Practice using SQLBolt

https://sqlbolt.com/lesson/select_queries_introduction

The answer Key (if needed)

SQL Bolt Solutions

Cloud

Overview

Definitions

Hardware: physical computers / equipment / devices

Software: programs such as operating systems, Word, Excel

Web Site: read-only web pages such as company pages, portfolios, newspapers

Web Application: read-write pages such as online forms, Google Docs, email, Google apps

Cloud Plays a significant role in the Big Data world.

In today's market, Cloud helps companies to accommodate the ever-increasing volume, variety, and velocity of data.

Cloud computing is the on-demand delivery of IT resources over the Internet with pay-per-use pricing.

Src : https://thinkingispower.com/the-blind-men-and-the-elephant-is-perception-reality/

Without cloud knowledge, your understanding of Big Data will be like the blind men and the elephant in the picture above.

  1. Volume: Size of the data.
  2. Velocity: Speed at which new data is generated.
  3. Variety: Different types of data.
  4. Veracity: Trustworthiness of the data.
  5. Value: Usefulness of the data.
  6. Vulnerability: Security and privacy aspects.

When people focus on only one aspect without the help of cloud technologies, they miss out on the comprehensive picture. Cloud solutions offer ways to manage all these dimensions in an integrated manner, thus providing a fuller understanding and utilization of Big Data.

Advantages of Cloud Computing for Big Data

  • Cost Savings
  • Security
  • Flexibility
  • Mobility
  • Insight
  • Increased Collaboration
  • Quality Control
  • Disaster Recovery
  • Loss Prevention
  • Automatic Software Updates
  • Competitive Edge
  • Sustainability

Types of Cloud Computing

Public Cloud

Owned and operated by third-party providers. (AWS, Azure, GCP, Heroku, and a few more)

Private Cloud

Cloud computing resources are used exclusively by a single business or organization.

Hybrid

Public + Private: By allowing data and applications to move between private and public clouds, a hybrid cloud gives your business greater flexibility and more deployment options, and helps optimize your existing infrastructure, security, and compliance.

Types of Cloud Services

SaaS

Software as a Service

Cloud-based service providers offer end-user applications. Google Apps, DropBox, Slack, etc.

  • Web access to software (primarily commercial).
  • Software is managed from a central location.
  • One-to-many delivery model.
  • No patches or upgrades for users to manage.

When not to use

  • Hardware integration is needed (e.g., a price scanner).
  • Faster processing is required.
  • Data cannot be hosted off-premises.

PaaS

Platform as a Service

Software tools are available over the internet. AWS RDS, Heroku, Salesforce

  • Scalable.
  • Built on virtualization technology.
  • No software maintenance needed by users (DB upgrades and patches are handled by the cloud team).

When not to use PaaS

  • Proprietary tools make it hard to move to different providers (e.g., AWS-specific tools).
  • Using new software that is not part of the PaaS toolset.

IaaS

Infrastructure as a Service

Cloud-based hardware services. Pay-as-you-go services for Storage, Networking, and Servers.

Amazon EC2, Google Compute Engine, S3.

  • Highly flexible and scalable.
  • Accessible by more than one user.
  • Cost-effective (if used right).

Serverless computing

Focuses on building apps without spending time managing servers/infrastructure.

Features automatic scaling, built-in high availability, and pay-per-use pricing.

Resources are used only when a specific function or event occurs.

Cloud providers handle deployment, capacity, and server management.

Example: AWS Lambda, AWS Step Functions.

Easy way to remember SaaS, PaaS, IaaS

bigcommerce.com

Challenges of Cloud Computing

Privacy: "Both traditional and Big Data sets often contain sensitive information, such as addresses, credit card details, or social security numbers."

So, it's the responsibility of users to ensure proper security methods are followed.

Compliance: Cloud providers replicate data across regions to ensure safety. This can become a problem if a company's regulations require that data not be stored outside the organization or not be stored in a specific part of the world.

Data Availability: Everything is dependent on the Internet and speed. It is also dependent on the choice of the cloud provider. Big companies like AWS / GCP / Azure have more data centers and backup facilities.

Connectivity: Internet availability + speed.

Vendor lock-in: Once an organization has migrated its data and applications to the cloud, switching to a different provider can be difficult and expensive. This is known as vendor lock-in. Some cloud-agnostic tools, like Databricks, help enterprises mitigate this problem, but it is still a challenge.

Cost: Cloud computing can be a cost-effective way to deploy and manage IT resources. However, it is essential to carefully consider your needs and budget before choosing a cloud provider.

Continuous Training: Employees may need to be trained to use cloud-based applications. This can be a cost and time investment.

Constant Change in Technology: Cloud providers constantly improve or change their technology. Recently, Microsoft decided to decommission Synapse and launch a new tool called Fabric.

AWS

Terms to Know

Elasticity: The ability to acquire resources as you need them and release resources when you no longer need them.

Scale Up vs. Scale Down (vertical scaling: a bigger or smaller machine)

Scale Out vs. Scale In (horizontal scaling: more or fewer machines)

Latency

Typically, latency is a measure of the round-trip time between two systems, i.e., how long it takes data to travel between them and back.

Root User

Owner of the AWS account.

IAM

Identity and Access Management

ARN

Amazon Resource Name

For example

arn:aws:iam::123456789012:user/Development/product_1234/*

Policy

Rules

Amazon EC2

Allows you to deploy virtual servers within your AWS environment.

Amazon VPC

An isolated segment of the AWS cloud accessible by your own AWS account.

Amazon S3

A fully managed, object-based storage service that is highly available, highly durable, cost-effective, and widely accessible.

AWS IAM (Identity and Access Management)

Used to manage permissions to your AWS resources

AWS Management Services

Amazon EC2 Auto Scaling

Automatically increases or decreases your EC2 resources to meet the demand based on custom-defined metrics and thresholds.

Amazon CloudWatch

A comprehensive monitoring tool that allows you to monitor your services and applications in the cloud.

Elastic Load Balancing

Used to manage and control the flow of inbound requests to a group of targets by distributing these requests evenly across a targeted resource group.

Billing & Budgeting

Helps control the cost.

AWS Global Infrastructure

The two primary items are given below.

  • Availability Zones
  • Regions

Availability Zones (AZs)

AZs are the physical data centers of AWS.

This is where the actual computing, storage, network, and database resources are hosted that we as consumers, provision within our Virtual Private Clouds (VPCs).

A common misconception is that a single availability zone equals a single data center. Multiple data centers located closely form a single availability zone.

Each AZ has at least one other AZ in the same geographical area. Each AZ is isolated from the others with separate power and networking, much like a DR (disaster recovery) setup.

Many AWS services use low-latency links between AZs to replicate data for high availability and resilience purposes.

A group of multiple AZs makes up an AWS Region. (Example: Virginia)

Regions

Every Region will act independently of the others, containing at least two Availability Zones.

Note that not all AWS services are available in every region.

  • US East (N. Virginia) us-east-1
  • US East (Ohio) us-east-2
  • EU (Ireland) eu-west-1
  • EU (Frankfurt) eu-central-1

Note: At the time of writing, AWS was available in 27 regions with 87 AZs; check the link below for current numbers.

https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Concepts.RegionsAndAvailabilityZones.html

EC2

(Elastic Compute Cloud)

Compute: Closely related to CPU/RAM

Elastic Compute Cloud (EC2): AWS EC2 provides resizable compute capacity in the cloud, allowing you to run virtual servers as per your needs.

Instance Types: EC2 offers various instance types optimized for different use cases, such as general purpose, compute-optimized, memory-optimized, and GPU instances.

Pricing Models

On-Demand: Pay for computing capacity by the hour or second.

Reserved: Commit to a one- or three-year term and get a discount.

Spot: Bid for unused EC2 capacity at a reduced cost.

Savings Plans: Commit to consistent compute usage for lower prices.

AMI (Amazon Machine Image): Pre-configured templates for your EC2 instances, including the operating system, application server, and applications.

Security

Security Groups: Act as a virtual firewall for your instances to control inbound and outbound traffic.

Key Pairs: These are used to access your EC2 instances via SSH or RDP securely.

Elastic IPs: These are static IP addresses that can be associated with EC2 instances. They are useful for hosting services that require a consistent IP.

Auto Scaling: Automatically adjusts the number of EC2 instances in response to changing demand, ensuring you only pay for what you need.

Elastic Load Balancing (ELB): Distributes incoming traffic across multiple EC2 instances, improving fault tolerance and availability.

EBS (Elastic Block Store): Provides persistent block storage volumes for EC2 instances, allowing data to be stored even after an instance is terminated.

Regions and Availability Zones: EC2 instances can be deployed in various geographic regions, each with multiple availability zones for high availability and fault tolerance.

Storage

Persistent Storage

  • Elastic Block Store (EBS) volumes, logically attached over the AWS network.
  • Automatically replicated.
  • Encryption is available.

Ephemeral Storage - Local storage

  • Physically attached to the underlying host.
  • When the instance is stopped or terminated, all the data is lost.
  • Rebooting will keep the data intact.

DEMO - Deploy EC2

S3

(Simple Storage Service)

It's an IaaS service.

  • Highly Available
  • Durable
  • Cost Effective
  • Widely Accessible
  • Uptime of 99.99%
  1. Objects and Buckets: The fundamental elements of Amazon S3 are objects and buckets. Objects are the individual data pieces stored in Amazon S3, while buckets are containers for these objects. An object consists of a file and, optionally, any metadata that describes that file.

    • It's also a regional service, meaning that when you create a bucket, you specify a region, and all objects are stored there.

    • Globally Unique: The name of an Amazon S3 bucket must be unique across all of Amazon S3, that is, across all AWS customers. It's like a domain name.

    • Globally Accessible: Even though you specify a particular region when you create a bucket, once the bucket is created, you can access it from anywhere in the world using the appropriate URL.

  2. Scalability: Amazon S3 can scale in terms of storage, request rate, and users to support unlimited web-scale applications.

  3. Security: Amazon S3 includes several robust security features, such as encryption for data at rest and in transit, access controls like Identity and Access Management (IAM) policies, bucket policies, and Access Control Lists (ACLs), and features for monitoring and logging activity, like AWS CloudTrail.

  4. Data transfer: Amazon S3 supports transfer acceleration, which speeds up uploads and downloads of large objects.

  5. Event Notification: S3 can notify you of specific events in your bucket. For instance, you could set up a notification to alert you when an object is deleted from your bucket.

  6. Management Features: S3 has a suite of features to help manage your data, including lifecycle management, which allows you to define rules for moving or expiring objects, versioning to keep multiple versions of an object in the same bucket, and analytics for understanding and optimizing storage costs.

  7. Consistency: Amazon S3 now provides strong read-after-write consistency for all PUT and DELETE operations (since December 2020). The original model, described below, offered read-after-write consistency for PUTS of new objects and eventual consistency for overwrite PUTS and DELETES.

    • Read-after-write Consistency for PUTS of New Objects: When a new object is uploaded (PUT) into an Amazon S3 bucket, it's immediately accessible for read (GET) operations. This is known as read-after-write consistency. You can immediately retrieve a new object as soon as you create it. This applies across all regions in AWS, and it's crucial when immediate, accurate data retrieval is required.

    • Eventual Consistency for Overwrite PUTS and DELETES: Overwrite PUTS and DELETES refer to operations where an existing object is updated (an overwrite PUT) or removed (a DELETE). For these operations, Amazon S3 provides eventual consistency. If you update or delete an object and immediately attempt to read or delete it, you might still get the old version or find it there (in the case of a DELETE) for a short period. This state of affairs is temporary, and shortly after the update or deletion, you'll see the new version or find the object gone, as expected.

Src: Mailbox

Src: USPS

Notes

Data is stored as an "Object."

Object storage, also known as object-based storage, manages data as objects. Each object includes the data, associated metadata, and a globally unique identifier.

Unlike file storage, there are no folders or directories in object storage. Instead, objects are organized into a flat address space, called a bucket in Amazon S3's terminology.

The unique identifier allows an object to be retrieved without needing to know the physical location of the data. Metadata can be customized, making object storage incredibly flexible.

Every object gets a UID (unique identifier) and associated metadata.

No Folders / SubFolders

For example, if you have an object with the key images/summer/beach.png in your bucket, Amazon S3 has no internal concept of the images or summer as separate entities—it simply sees the entire string images/summer/beach.png as the key for that object.
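A minimal boto3 sketch illustrating the flat key space (the bucket name is hypothetical, and credentials are assumed to be configured already):

import boto3

s3 = boto3.client("s3")
bucket = "chandr34-demo-bucket"  # hypothetical; must already exist and be globally unique

# The whole string is the key; "images/summer/" is not a real folder.
s3.put_object(Bucket=bucket, Key="images/summer/beach.png", Body=b"fake image bytes")

# Listing by prefix simply filters on the leading characters of the key.
resp = s3.list_objects_v2(Bucket=bucket, Prefix="images/summer/")
for obj in resp.get("Contents", []):
    print(obj["Key"])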

To store objects in S3, you must first define and create a bucket.

You can think of a bucket as a container for your data.

This bucket name must be unique, not just within the region you specify, but globally against all other S3 buckets, of which there are many millions.

Any object uploaded to your buckets is given a unique object key to identify it.

  • S3 bucket ownership is not transferable.
  • S3 bucket names must start with a lowercase letter or a number; hyphens (-) are allowed in between.
  • By default, an AWS account can have up to 100 buckets (a soft limit that can be raised).

More details

https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-keys.html

Other types of Storage are

File Storage

Block Storage


IAM

Identity and Access Management

src: Aws

ARN: Amazon Resource Name

Users - Individual Person / Application

Groups - Collection of IAM Users

Policies - Policy sets permission/control access to AWS resources. Policies are stored in AWS as JSON documents.

A Policy can be attached to multiple entities (users, groups, and roles) in your AWS account.

Multiple Policies can be created and attached to the user.

Roles - A set of permissions that defines which actions are allowed and denied in AWS. Similar to a user, but a role can be assumed by any type of entity (a user, an application, or an AWS service).

// Examples of ARNs

arn:aws:s3:::my_corporate_bucket/*

arn:aws:s3:::my_corporate_bucket/Development/*

arn:aws:iam::123456789012:user/chandr34

arn:aws:iam::123456789012:group/bigdataclass

arn:aws:iam::123456789012:group/*

Types of Policies

Identity-based policies: Identity-based policies are attached to an IAM user, group, or role (identities). These policies control what actions an identity can perform, on which resources, and under what conditions.

Resource-based policies: Resource-based policies are attached to a resource such as an Amazon S3 bucket. These policies control what actions a specified principal can perform on that resource and under what conditions.

Permission Boundary: You can use an AWS-managed policy or a customer-managed policy to set the boundary for an IAM entity (user or role). A permissions boundary is an advanced feature for using a managed policy to set the maximum permissions that an identity-based policy can grant to an IAM entity.

Inline Policies: Policies that are embedded in an IAM identity. Inline policies maintain a strict one-to-one relationship between a policy and an identity. They are deleted when you delete the identity.
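As an illustration of the above, here is a minimal boto3 sketch that attaches an identity-based inline policy (stored as a JSON document) to an IAM user. The user name, policy name, and permissions are assumptions made up for this example:

import json
import boto3

iam = boto3.client("iam")

# Identity-based policy written as a JSON document (hypothetical permissions).
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::my_corporate_bucket",
                "arn:aws:s3:::my_corporate_bucket/*",
            ],
        }
    ],
}

# Inline policy: embedded in the identity and deleted along with it.
iam.put_user_policy(
    UserName="chandr34",
    PolicyName="ReadCorporateBucket",
    PolicyDocument=json.dumps(policy),
)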

AWS CloudShell

AWS CloudShell is a browser-based shell environment available directly through the AWS Management Console. It provides a command-line interface (CLI) to manage and interact with AWS resources securely without needing to install any software or set up credentials on your local machine.

Use Cases

Quick Access to AWS CLI

Allows you to run AWS CLI commands directly without configuring your local machine. It's perfect for quick tasks like managing AWS resources (e.g., EC2 instances, S3 buckets, or Lambda functions).

Development and Automation

You can write and execute scripts using common programming languages like Python and Shell. It’s great for testing and automating tasks directly within your AWS environment.

Secure and Pre-Configured Environment

AWS CloudShell comes pre-configured with AWS CLI, Python, Node.js, and other essential tools. It uses your IAM permissions, so you don’t need to handle keys or credentials directly, making it secure and convenient.

Access to Filesystem and Persistent Storage

You get a persistent 1 GB home directory per region to store scripts, logs, or other files between sessions, which can be used to manage files related to your AWS resources.

Cross-Region Management

You can access and manage resources across different AWS regions directly from CloudShell, making it useful for multi-region setups.


Basic Commands

    aws s3 ls
    aws ec2 describe-instances

    sudo yum install -y jq    # CloudShell runs Amazon Linux, so use yum/dnf instead of apt

list_buckets.sh

#!/bin/bash
echo "Listing all S3 buckets:"
aws s3 ls

Run it:

    bash list_buckets.sh
# get account details

aws sts get-caller-identity

# list available regions

aws ec2 describe-regions --query "Regions[].RegionName" --output table

# create a bucket

aws s3 mb s3://chandr34-newbucket

# upload a file to a bucket 

echo "Hello, CloudShell!" > hello.txt
aws s3 cp hello.txt s3://chandr34-newbucket

# List files in bucket 

aws s3 ls s3://chandr34-newbucket/

# Delete bucket  with files 

aws s3 rb s3://chandr34-newbucket --force

# List AMIs

aws ec2 describe-images --owners amazon --query 'Images[*].{ID:ImageId,Name:Name}' --output table

# quickly launch an EC2 instance: first create a key pair

aws ec2 create-key-pair --key-name gcnewkeypair --query 'KeyMaterial' --output text > myNewKeyPair.pem

# Change Permission

chmod 0400 myNewKeyPair.pem

# Launch new EC2

aws ec2 run-instances --image-id ami-0866a3c8686eaeeba --count 1 --instance-type t2.micro --key-name gcnewkeypair --security-groups default

# Get Public IP

aws ec2 describe-instances --query "Reservations[].Instances[].PublicIpAddress" --output text

# Login to server

ssh -i myNewKeyPair.pem ubuntu@<getthehostip>

# terminate the instance

aws ec2 terminate-instances --instance-ids <>

Cloud Formation

my-webserver.yml

AWSTemplateFormatVersion: '2010-09-09'
Description: CloudFormation template to launch an Ubuntu EC2 instance with Nginx installed.

Resources:
  MyEC2Instance:
    Type: AWS::EC2::Instance
    Properties:
      InstanceType: t2.micro
      ImageId: ami-0866a3c8686eaeeba
      KeyName: gcnewkeypair
      SecurityGroupIds:
        - !Ref InstanceSecurityGroup
      UserData:
        Fn::Base64: 
          !Sub |
            #!/bin/bash
            apt update -y
            apt install -y nginx
            systemctl start nginx
            systemctl enable nginx
      Tags:
        - Key: Name
          Value: MyNginxServer

  InstanceSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Enable SSH, HTTP, and HTTPS access
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 22
          ToPort: 22
          CidrIp: 0.0.0.0/0  # SSH access, restrict this to your IP range for security
        - IpProtocol: tcp
          FromPort: 80
          ToPort: 80
          CidrIp: 0.0.0.0/0  # HTTP access for Nginx
        - IpProtocol: tcp
          FromPort: 443
          ToPort: 443
          CidrIp: 0.0.0.0/0  # HTTPS access for Nginx

Outputs:
  InstanceId:
    Description: The Instance ID of the EC2 instance
    Value: !Ref MyEC2Instance
  PublicIP:
    Description: The Public IP address of the EC2 instance
    Value: !GetAtt MyEC2Instance.PublicIp
  WebURL:
    Description: URL to access the Nginx web server
    Value: !Sub "http://${MyEC2Instance.PublicIp}"

Launch the Stack via CloudShell

# Create the stack
aws cloudformation create-stack --stack-name gc-stack --template-body file://my-webserver.yml --capabilities CAPABILITY_NAMED_IAM


# Check the status

aws cloudformation describe-stacks --stack-name gc-stack --query "Stacks[0].StackStatus"


aws cloudformation describe-stacks --stack-name gc-stack --query "Stacks[0].Outputs"

# delete the stack

aws cloudformation delete-stack --stack-name gc-stack


aws cloudformation describe-stacks --stack-name gc-stack --query "Stacks[0].StackStatus"

# confirm the deletion status

aws cloudformation list-stacks --query "StackSummaries[?StackName=='gc-stack'].StackStatus"

Terraform

Features of Terraform

Infrastructure as Code: Terraform allows you to write, plan, and create infrastructure using configuration files. This makes infrastructure management automated, consistent, and easy to collaborate on.

Multi-Cloud Support: Terraform supports many cloud providers and on-premises environments, allowing you to manage resources across different platforms seamlessly.

State Management: Terraform keeps track of the current state of your infrastructure in a state file. This enables you to manage changes, plan updates, and maintain consistency in your infrastructure.

Resource Graph: Terraform builds a resource dependency graph that helps in efficiently creating or modifying resources in parallel, speeding up the provisioning process and ensuring dependencies are handled correctly.

Immutable Infrastructure: Terraform promotes the practice of immutable infrastructure, meaning that resources are replaced rather than updated directly. This ensures consistency and reduces configuration drift.

Execution Plan: Terraform provides an execution plan (terraform plan) that previews changes before they are applied, allowing you to understand and validate the impact of changes before implementing them.

Modules: Terraform supports reusability through modules, which are self-contained, reusable pieces of configuration that help you maintain best practices and reduce redundancy in your infrastructure code.

Community and Ecosystem: Terraform has a large open-source community and many providers and modules available through the Terraform Registry, which makes it easier to get started and integrate with various services.

Use Cases

  • Multi-Cloud Provisioning
  • Infrastructure Scaling
  • Disaster Recovery
  • Environment Management
  • Compliance & Standardization
  • CI/CD Pipelines
  • Speed and Simplicity
  • Team Collaboration
  • Error Reduction
  • Enhanced Security

Install Terraform CLI

Terraform Download

Terraform Structure

Provider Block: Specifies the cloud provider or service (e.g., AWS, Azure, Google Cloud) that Terraform will interact with.

provider "aws" {
  region = "us-east-1"
}

Resource Block: Defines the resources to be created or managed. A resource can be a server, network, or other infrastructure component.

resource "aws_instance" "example" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t2.micro"
}

Data Block: Fetches information about existing resources, often for referencing in resource blocks.

data "aws_ami" "latest" {
  most_recent = true
  owners      = ["amazon"]
}

Variable Block: Declares input variables to make the script flexible and reusable.

variable "instance_type" {
  description = "Type of instance to use"
  type        = string
  default     = "t2.micro"
}

Output Block: Specifies values to be output after the infrastructure is applied, like resource IDs or connection strings.

output "instance_ip" {
  value = aws_instance.example.public_ip
}

Module Block: Used to encapsulate and reuse sets of Terraform resources.

module "vpc" {
  source = "./modules/vpc"
  cidr_block = "10.0.0.0/16"
}

Locals Block: Defines local values that can be reused in the configuration.

locals {
  environment = "production"
  instance_count = 3
}

SET these environment variables.

export AWS_ACCESS_KEY_ID="your-access-key-id"
export AWS_SECRET_ACCESS_KEY="your-secret-access-key"

Simple S3 Bucket

simple_s3_bucket.tf


terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.70.0"
    }
  }

  required_version = ">= 1.2.0"
}

provider "aws" {
  region = "us-east-1"
  profile = "chandr34"
}

resource "aws_s3_bucket" "demo" {
  bucket = "chandr34-my-new-tf-bucket"

  tags = {
    Createdusing = "tf"
    Environment  = "classdemo"
  }
}

output "bucket_name" {
  value = aws_s3_bucket.demo.bucket
}
  • Create a new folder.
  • Copy the .tf file into it.
  • terraform init
  • terraform validate
  • terraform plan
  • terraform apply
  • terraform destroy

Variable S3 Bucket

variable_bucket.tf


terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.70.0"
    }
  }

  required_version = ">= 1.2.0"
}

provider "aws" {
  region  = "us-east-1"
  profile = "chandr34"
}

variable "bucket_name" {
  description = "The name of the S3 bucket to create"
  type        = string
}

resource "aws_s3_bucket" "demo" {
  bucket = var.bucket_name

  tags = {
    Createdusing = "tf"
    Environment  = "classdemo"
  }
}

output "bucket_name" {
  value = aws_s3_bucket.demo.bucket
}
  • Create a new folder.
  • Copy the .tf file into it.
  • terraform init
  • terraform validate
  • terraform plan
  • terraform apply -var="bucket_name=chandr34-variable-bucket"
  • terraform destroy -var="bucket_name=chandr34-variable-bucket"

Variable file

Any filename with extension .tfvars

terraform.tfvars

bucket_name = "chandr34-variable-bucket1"

Then run:

terraform apply -auto-approve

AWS Resource Types


Please make sure AWS Profile is created.

Create Public and Private Keys

Linux / Mac Users

// create private/public key

ssh-keygen -b 2048 -t rsa -f ec2_tf_demo

Windows Users

Open PuttyGen and create a Key

Terraform

  • mkdir simple_ec2
  • cd simple_ec2
  • Create main.tf
// main.tf
#https://registry.terraform.io/providers/hashicorp/aws/latest

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.70.0"
    }
  }

  required_version = ">= 1.2.0"
}

provider "aws" {
  region  = "us-east-1"
  profile = "chandr34"
}

resource "aws_key_pair" "generated_key" {
  key_name   = "generated-key-pair"
  public_key = tls_private_key.generated_key.public_key_openssh
}

resource "tls_private_key" "generated_key" {
  algorithm = "RSA"
  rsa_bits  = 2048
}

resource "local_file" "private_key_file" {
  content  = tls_private_key.generated_key.private_key_pem
  filename = "${path.module}/generated-key.pem"
}

resource "aws_instance" "ubuntu_ec2" {
  ami           = "ami-00874d747dde814fa"
  instance_type = "t2.micro"
  key_name      = aws_key_pair.generated_key.key_name
  vpc_security_group_ids = [aws_security_group.ec2_security_group.id]

  tags = {
    Name        = "UbuntuInstance"
    Environment = "classdemo"
  }
}

resource "aws_security_group" "ec2_security_group" {
  name        = "ec2_security_group"
  description = "Allow SSH and HTTP access"

  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]  # Allow SSH from anywhere (use cautiously)
  }

  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]  # Allow HTTP from anywhere
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]  # Allow all outbound traffic
  }

  tags = {
    Name = "EC2SecurityGroup"
  }
}

output "ec2_instance_public_ip" {
  value = aws_instance.ubuntu_ec2.public_ip
}

output "private_key_pem" {
  value     = tls_private_key.generated_key.private_key_pem
  sensitive = true
}

Go to the terminal

  • terraform init
  • terraform fmt
  • terraform validate
  • terraform apply
  • terraform show

Finally

  • terraform destroy

Data Architecture

Medallion Architecture

Medallion architecture is a data design pattern commonly used in modern data lakes and data warehouses, particularly in cloud-based environments.

A medallion architecture is a data design pattern used to logically organize data in a lakehouse, aiming to incrementally and progressively improve the structure and quality of data as it flows through each layer of the architecture (from Bronze ⇒ Silver ⇒ Gold layer tables).


The Three Tiers

Bronze (Raw)

  • Contains raw, unprocessed data.
  • Typically a 1:1 copy of source system data.
  • Preserves the original data for auditability and reprocessing if needed.
  • Often stored in formats like JSON, CSV, or Avro.

Silver (Cleaned and Conformed)

  • Cleansed and conformed version of bronze data.
  • Applies data quality rules, handles missing values, deduplication.
  • Often includes parsed and enriched data.
  • Typically stored in a more optimized format like Parquet or Delta.

Gold (Business-Level)

  • Contains highly refined, query-ready data sets.
  • Often aggregated and joined from multiple silver tables.
  • Optimized for specific business domains or use cases.
  • Can include star schemas, data marts, or wide denormalized tables.

Key Principles

  • Data flows from Bronze → Silver → Gold
  • Each tier adds value and improves data quality
  • Promotes data governance and lineage tracking
  • Enables self-service analytics at different levels of refinement

Benefits

Flexibility: Supports various data processing needs

Scalability: Easily accommodates growing data volumes

Governance: Improves data lineage and auditability

Performance: Optimizes query performance on refined data sets

Reusability: Allows multiple downstream applications to use appropriately refined data

Bronze to Silver

These are the basic best practices followed when moving data from Bronze to Silver. (A small PySpark sketch follows the DO's list below.)

DO's

Case standardization

  • Convert column names to lowercase or uppercase.
  • Standardize text fields (e.g., convert all First Name, Last Name, Product names to title case).

Column renaming

  • Adopt consistent naming conventions across tables

Whitespace

  • Trim leading and trailing spaces from string fields.
  • Remove extra spaces between words.

Data type conversions

  • Convert string dates to proper date format
  • Change numeric strings to appropriate integer or decimal types

Null handling

  • Replace empty strings with NULL values
  • Set default values for NULL fields where appropriate

Deduplication

  • Remove exact duplicate records
  • Handle near-duplicates based on business rules

Format standardization

  • Normalize phone numbers to a consistent format
  • Standardize address formats

Value normalization

  • Convert units of measurement to a standard unit (e.g., all weights to pounds or kgs)
  • Standardize currency / lat-long decimal positions.

Character encoding

  • Convert all text to UTF-8 or another standard encoding.

Special character handling

  • Remove or replace non-printable characters.
  • Handle escape characters in text fields.

Data validation

  • Check for values within expected ranges
  • Validate against reference data (e.g., valid product codes)

Date and time standardization

  • Convert all timestamps to UTC or a standard time zone
  • Ensure consistent date formats across all fields

Calculated fields

  • Add derived columns based on raw data (e.g., age from birth date)

Data enrichment

  • Add geographic information based on zip codes
  • Categorize products based on attributes

Error correction

  • Fix common misspellings
  • Correct known data entry errors

Structural changes:

  • Split combined fields (e.g., full name into first and last name).
  • Merge related fields for easier analysis.

Metadata addition:

  • Add source system identifiers.
  • Include data lineage information.

Sensitive data handling

  • Mask or encrypt personally identifiable information (PII)
  • Apply data governance rules

Outlier detection and handling

  • Identify statistical outliers
  • Apply business rules to handle extreme values

Column dropping

  • Remove unnecessary or redundant columns
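Putting a few of these DO's together, here is a minimal PySpark sketch of a Bronze-to-Silver step. The input path, column names, and date format are assumptions for illustration, not the exact class dataset:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("BronzeToSilver").getOrCreate()

# Bronze: raw, unmodified data (hypothetical path).
bronze = spark.read.option("header", "true").csv("bronze/sales.csv")

silver = (
    bronze
    # Column renaming / case standardization: lowercase, underscores instead of spaces.
    .select([F.col(c).alias(c.strip().lower().replace(" ", "_")) for c in bronze.columns])
    # Whitespace: trim leading and trailing spaces.
    .withColumn("country", F.trim(F.col("country")))
    # Null handling: replace empty strings with NULL.
    .withColumn("country", F.when(F.col("country") == "", None).otherwise(F.col("country")))
    # Data type conversion: string date -> DATE (assumed M/d/yyyy format).
    .withColumn("order_date", F.to_date("order_date", "M/d/yyyy"))
    # Deduplication: remove exact duplicate records.
    .dropDuplicates()
)

silver.write.mode("overwrite").parquet("silver/sales")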

DON'Ts

Don't Lose Raw Data

  • Avoid modifying or transforming data in a way that the original information is lost. Always keep the Bronze layer intact and unmodified as it serves as the source of truth.
  • Maintain backups of critical data and metadata in case you need to revert transformations.

Don't Over-optimize Early

  • Avoid excessive transformations or optimizations before ensuring data accuracy.
  • Focus on clarity and correctness before performance tuning.

Don't Hard-code Business Logic

  • Avoid hard-coding business rules that could change frequently. Instead, make them configurable to ensure flexibility for future updates.

Don't Skip Data Validation

  • Don’t blindly trust incoming data from the Bronze layer. Always validate the quality, type, and range of data before moving it to Silver.
  • Implement rules for handling invalid data, such as setting defaults for missing values or discarding outliers.

Don't Ignore Metadata

  • Always retain important metadata, such as source system identifiers, data lineage, or event timestamps, to maintain traceability.

Don't Introduce Ambiguity

  • Avoid renaming columns or values without proper documentation. Ensure that any renaming or transformation is clear, consistent, and well-documented to prevent confusion.

Don't Remove Critical Columns Without Backup

  • Avoid dropping columns that might be required later for audits or debugging. Always ensure you have a backup or metadata for columns you drop.

Don't Assume Data Integrity

  • Never assume that data from the Bronze layer is clean or consistent. Always apply deduplication, null handling, and data cleaning rules.

Don't Overload the Silver Layer with Business Logic

  • Avoid implementing complex business rules in the Silver layer. Keep it focused on data cleansing and preparation. Reserve detailed business logic for the Gold layer.

Don't Over-transform Data

  • Avoid excessive or unnecessary transformations in the Silver layer, which could make the data less flexible for future use cases.

Don't Change Granularity

  • Don’t aggregate or change the granularity of data too early. Keep data at the same granularity as the Bronze layer to maintain flexibility for future aggregations in the Gold layer.

Don't Overlook Data Encoding and Formats

  • Don’t assume uniform character encoding across all systems. Always validate encoding, such as converting text to UTF-8 where needed.
  • Standardize date formats and ensure consistent time zones (preferably UTC) across all fields.

Don't Disregard Null and Missing Data

  • Don't ignore patterns of missing data. Treat NULL values carefully, ensuring that they’re handled based on the specific business logic.

Don't Apply Unnecessary Joins

  • Avoid applying joins that aren’t required in the Silver layer. Unnecessary joins can increase complexity and lead to performance degradation.

Don't Remove Contextual Information

  • Don’t drop fields that provide necessary context, such as event timestamps, source system details, or audit trails.

Don't Assume All Data Sources Are Consistent

  • Don’t assume that data from different sources (even within the Bronze layer) will have consistent formats, structures, or units of measurement. Always validate and normalize data as necessary.

Don't Ignore Data Governance

  • Avoid handling sensitive data like PII carelessly. Ensure that you apply data masking, encryption, or other governance rules to stay compliant with regulations.

Don't Forget Documentation

  • Avoid performing transformations or renaming columns without thoroughly documenting them. Clear documentation ensures data transformations are transparent and easily understood across teams.

Silver to Gold

These are the basic best practices followed when moving data from Silver to Gold. (A small PySpark sketch follows this list.)

Business alignment

  • Tailor Gold datasets to specific business needs and use cases
  • Collaborate closely with business stakeholders to understand requirements

Aggregation and summarization

  • Create pre-aggregated tables for common metrics (e.g., daily sales totals)
  • Implement various levels of granularity to support different analysis needs

Dimensional modeling

  • Develop star or snowflake schemas for analytical queries
  • Create conformed dimensions for consistent reporting across the organization

Denormalization

  • Create wide, denormalized tables for specific reporting needs
  • Balance performance gains against data redundancy

Metric standardization

  • Implement agreed-upon business logic for key performance indicators (KPIs)
  • Ensure consistent calculation of metrics across different Gold datasets

Data mart creation

  • Develop subject-area specific data marts (e.g., Sales, HR, Finance)
  • Optimize each mart for its intended use case

Advanced transformations

  • Apply complex business rules and calculations
  • Implement time-based analyses (e.g., year-over-year comparisons)

Data quality assurance

  • Implement rigorous testing of Gold datasets
  • Set up automated data quality checks and alerts

Performance optimization

  • Use appropriate indexing and partitioning strategies
  • Implement materialized views for frequently accessed data

Metadata management

  • Maintain detailed business glossaries and data dictionaries
  • Document data lineage and transformation logic

Access control

  • Implement fine-grained access controls for sensitive data
  • Ensure compliance with data governance policies

Versioning and historization

  • Implement slowly changing dimensions (SCDs) where appropriate
  • Maintain historical versions of key business entities

Data freshness

  • Define and implement appropriate refresh schedules
  • Balance data currency against processing costs

Self-service enablement

  • Create views or semantic layers for business users
  • Provide clear documentation and training for end-users

Caching strategies

  • Implement intelligent caching for frequently accessed data
  • Balance cache freshness against query performance

Query optimization

  • Tune common queries for optimal performance
  • Create aggregated tables or materialized views for complex calculations

Data exploration support:

  • Provide sample queries or analysis templates
  • Create dashboards or reports showcasing the value of Gold datasets

Scalability considerations

  • Design Gold datasets to handle growing data volumes
  • Implement appropriate archiving strategies for historical data

Documentation

  • Maintain comprehensive documentation of Gold dataset structures and uses
  • Provide clear guidelines on how to use and interpret the data
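A minimal PySpark sketch of one Silver-to-Gold step, building a pre-aggregated daily sales table. The paths, column names, and chosen metrics are assumptions for illustration:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("SilverToGold").getOrCreate()

# Silver: cleansed, conformed data (hypothetical path).
silver = spark.read.parquet("silver/sales")

# Gold: business-level, pre-aggregated, query-ready dataset.
gold_daily_sales = (
    silver
    .groupBy("order_date", "region", "country")
    .agg(
        F.sum("total_revenue").alias("daily_revenue"),
        F.sum("total_profit").alias("daily_profit"),
        F.countDistinct("order_id").alias("order_count"),
    )
)

gold_daily_sales.write.mode("overwrite").partitionBy("order_date").parquet("gold/daily_sales")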

Sample Data from Raw to Silver

Using DUCKDB

Raw CSV : Sales_100

Creating Bronze

create table bronze_sales as select * from read_csv('https://raw.githubusercontent.com/gchandra10/filestorage/refs/heads/main/sales_100.csv');

Creating Silver

CREATE SEQUENCE sales_id_seq;

CREATE TABLE silver_sales (
    id INTEGER PRIMARY KEY DEFAULT nextval('sales_id_seq'),
    region VARCHAR,
    country VARCHAR,
    item_type VARCHAR,
    sales_channel VARCHAR,
    order_priority VARCHAR,
    order_date DATE,
    order_id BIGINT,
    ship_date DATE,
    units_sold BIGINT,
    unit_price DOUBLE,
    unit_cost DOUBLE,
    total_revenue DOUBLE,
    total_cost DOUBLE,
    total_profit DOUBLE
);

Bronze Data


select distinct("Order Priority") from bronze_sales;
select distinct("Sales Channel") from bronze_sales;

insert into silver_sales(region,
country,
item_type,
sales_channel,
order_priority,
order_date,
order_id,
ship_date,
units_sold,
unit_price,
unit_cost,
total_revenue,
total_cost,
total_profit
)
select distinct * from bronze_sales where "order id" is not null;

Apache Spark

Apache Spark is an in-memory, distributed data analytics engine. It is a unified engine for big data processing - unified meaning it offers both batch and stream processing.

Apache Spark overcame the challenges of Hadoop with in-memory parallelization, delivering high performance for distributed processing.

Its libraries increase developer productivity and can be seamlessly combined to create complex workflows.

Features of Apache Spark

  • Analytics and Big Data processing.
  • Machine learning capabilities.
  • Huge open source community and contributors.
  • A distributed framework for general-purpose data processing.
  • Support for Java, Scala, Python, R, and SQL.
  • Integration with libraries for streaming and graph operations.

Benefits of Apache Spark

  • Speed (In memory cluster computing)
  • Scalable (Cluster can be added/removed)
  • Powerful Caching
  • Real-time
  • Supports Delta
  • Polyglot (Knowing/Using several languages. Spark provides high-level APIs in SQL/Scala/Python/R. Spark code can be written in any of these 4 languages.)
  • Spark SQL : SQL queries in Spark.
  • Streaming : Process realtime data.
  • MLlib : MLlib allows for preprocessing, munging, training of models, and making predictions at scale on data.

It also supports Scala, Python, Java, and R.

Spark Architecture

src: www.databricks.com

Cluster Manager: Allocates Students/Faculty for a course. Who is in and who is out.

Driver: Faculty point of entry. Gives instructions and collects the result.

Executor: Table of Students.

Core (Thread): Student

Cores share the same JVM, Memory, Diskspace (like students sharing table resources, power outlets)

Partition

Technique of distributing data across multiple files or nodes in order to improve the query processing performance.

By partitioning the data, you can write each partition to a different executor simultaneously, which can improve the performance of the write operation.

Spark can read each partition in parallel on a different executor, which can improve the performance of the read operation.

A bowl of candies: how do you sort and count them by color?

If only one person does it, the others are wasting their time. Instead, if the candy is packaged into small packs (partitions), everyone can work in parallel.

The default partition size is 128 MB, and it is configurable.

  • Data is ready to be processed.
  • Each core is assigned a task.
  • A few cores complete their tasks.
  • Free cores pick up the remaining tasks.
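A small PySpark sketch showing how to inspect and change the number of partitions (the file path is hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PartitionDemo").getOrCreate()

df = spark.read.option("header", "true").csv("sales.csv")

# How many partitions did Spark create for this data?
print(df.rdd.getNumPartitions())

# Redistribute the data into 8 partitions so 8 cores can work in parallel (full shuffle).
df8 = df.repartition(8)
print(df8.rdd.getNumPartitions())

# coalesce() reduces the number of partitions without a full shuffle.
df2 = df8.coalesce(2)
print(df2.rdd.getNumPartitions())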

Terms to learn

Spark Context

Used widely in Spark Version 1.x

Spark Context : Used by Driver node to establish communication with Cluster. SparkContext is an object that represents the entry point to the underlying Spark cluster, which is responsible for coordinating the processing of distributed data in a cluster. It serves as a communication channel between the driver program and the Spark cluster.

When a Databricks cluster is started, a SparkContext is automatically created and made available as the variable sc in the notebook or code. The SparkContext provides various methods to interact with the Spark cluster, including creating RDDs, accessing Spark configuration parameters, and managing Spark jobs.

text_rdd = sc.textFile("/FileStore/tables/my_text_file.txt")

Other entry points

SQL Context: Entry point to perform SQL-like operations.

Hive Context: Used if the Spark application needs to communicate with Hive.

In newer versions of Spark, this is not used anymore.

Spark Session

In Spark 2.0, SparkSession was introduced as a new entry point that subsumes SparkContext, SQLContext, StreamingContext, and HiveContext. For backward compatibility, those entry points are preserved.

In short, it is referred to as "spark".

spark.read.format("csv").option("header", "true").load("sales.csv")

RDD

RDDs (Resilient Distributed Datasets) are the fundamental data structure in Apache Spark. Here are the key aspects:

Core Characteristics:

  • Resilient: Fault-tolerant with the ability to rebuild data in case of failures
  • Distributed: Data is distributed across multiple nodes in a cluster
  • Dataset: Collection of partitioned data elements
  • Immutable: Once created, cannot be changed

Key Features:

  • In-memory computing
  • Lazy evaluation (transformations aren't executed until an action is called)
  • Type safety at compile time
  • Ability to handle structured and unstructured data

Basic Operations:

Transformations (create new RDD):

  • map()
  • filter()
  • flatMap()
  • union()
  • intersection()
  • distinct()

Actions (return values):

  • reduce()
  • collect()
  • count()
  • first()
  • take(n)
  • saveAsTextFile()

Benefits

  • Fault tolerance through lineage graphs
  • Parallel processing
  • Caching capability for frequently accessed data
  • Efficient handling of iterative algorithms
  • Supports multiple languages (Python, Scala, Java)

Limitations

  • No built-in optimization engine
  • Manual optimization required
  • Limited structured data handling compared to DataFrames
  • Higher memory usage due to Java serialization

Example

Read a CSV using RDD and group by Region, Country except Region=Australia

from pyspark.sql import SparkSession
from operator import add

# Initialize Spark
spark = SparkSession.builder \
    .appName("Sales Analysis RDD") \
    .getOrCreate()

sc = spark.sparkContext

# Read CSV file
rdd = sc.textFile("sales.csv")

# Extract header and data
header = rdd.first()
data_rdd = rdd.filter(lambda line: line != header)

# Transform and filter data
# Assuming CSV format: Region,Country,Sales,...
result_rdd = data_rdd \
    .map(lambda line: line.split(',')) \
    .filter(lambda x: x[0] != 'Australia') \
    .map(lambda x: ((x[0], x[1]), float(x[2]))) \
    .groupByKey() \
    .mapValues(lambda sales: sum(sales) / len(sales)) \
    .sortByKey()

# Display results
print("Region, Country, Average Sales")
for (region, country), avg_sales in result_rdd.collect():
    print(f"{region}, {country}, {avg_sales:.2f}")

Good news: you don't have to write RDDs directly anymore.

Now, the same logic using DataFrames:

from pyspark.sql import SparkSession
from pyspark.sql.functions import avg

# Initialize Spark
spark = SparkSession.builder \
    .appName("Sales Analysis DataFrame") \
    .getOrCreate()

# Read CSV file
df = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv("sales.csv")

# Group by and calculate average sales
result_df = df.filter(df.Region != 'Australia') \
    .groupBy('Region', 'Country') \
    .agg(avg('Sales').alias('Average_Sales')) \
    .orderBy('Region', 'Country')

# Show results
result_df.show(truncate=False)

Spark Query Execution Sequence

Unresolved Logical Plan

This is the set of instructions the developer logically wants to do.

First, Spark parses the query and creates the Unresolved Logical Plan. This validates the syntax of the query.

It doesn't validate the semantics, that is, whether the column names exist or whether the data types are correct.

Metadata Catalog (Analysis)

This is where the column names and table names are validated against the catalog, producing a (resolved) Logical Plan.

Catalyst Catalog

This is where the first set of optimizations takes place: the logical sequence of calls is rewritten/reordered. From this, we get the Optimized Logical Plan.

Catalyst Optimizer

Determines that there are multiple ways to execute the query. Do we pull 100% of the data over the network, or filter the dataset with a predicate pushdown? From this, one or more physical plans are generated.

Physical Plans

Multiple ways to execute the query.

Physical plans represent what the query engine will actually do. This is different from the optimized logical plan. Each candidate plan is evaluated against a cost model.

Selected Physical Plan

The best-performing plan is selected by the cost model. Finally, the selected physical plan is compiled down to RDDs. The cost model considers:

  • the estimated amount of data needed for processing
  • the amount of shuffling
  • the amount of time to execute the query

RDD

(Whole Stage Code Generation)

This is the same RDD code a developer could have written by hand.

AQE (Adaptive Query Execution)

Checks join strategies and data skews at runtime. This happens repeatedly to find the best plan.

- explain(mode="simple") which will display the physical plan
- explain(mode="extended") which will display physical and logical plans (like “extended” option)
- explain(mode="codegen") which will display the java code planned to be executed
- explain(mode="cost") which will display the optimized logical plan and related statistics (if they exist)
- explain(mode="formatted") which will display a splitted output composed by a nice physical plan outline, and a section with each node details

DAG (Directed Acyclic Graph)

Example of DAG

Stage 1 (Read + Filter):
   [Read CSV] → [Filter]
        |
        ↓
Stage 2 (GroupBy + Shuffle):
   [Shuffle] → [GroupBy]
        |
        ↓
Stage 3 (Order):
   [Sort] → [Display]

Databricks

Databricks is a Unified Analytics Platform, built on top of Apache Spark, that accelerates innovation by unifying data science, engineering, and business. With fully managed Spark clusters in the cloud, you can easily provision clusters with just a few clicks.

This is not a Databricks sales pitch, so we won't get into the roots of the product.

Open

Open standards provide easy integration with other tools plus secure, platform-independent data sharing.

Unified

One platform for your data, consistently governed and available for all your analytics and AI.

Scalable

Scale efficiently with every workload from simple data pipelines to massive LLMs.

Lakehouse

Data Warehouse + Data Lake = Lakehouse.

ELT Data Design Pattern

In ELT (Extract, Load, Transform) data design patterns, the focus is on loading raw data into a data warehouse first, and then transforming it. This is in contrast to ETL, where data is transformed before loading. ELT is often favored in cloud-native architectures.

Batch Load

In a batch load, data is collected over a specific period and then loaded into the data warehouse in one go.

Real-time Example

A retail company collects sales data throughout the day and then runs a batch load every night to update the data warehouse. Analysts use this data the next day for reporting and decision-making.

Stream Load

In stream loading, data is continuously loaded into the data warehouse as it's generated. This is useful in scenarios requiring real-time analytics and decision-making.

Real-time Example

A ride-sharing app collects GPS coordinates of all active rides. This data is streamed in real-time into the data warehouse, where it's immediately available for analytics to optimize ride allocation and pricing dynamically.

Delta

Delta Lake is an open-source storage framework that enables building a Lakehouse architecture with compute engines, including Spark, PrestoDB, Flink, Trino, and Hive, and APIs for Scala, Java, Rust, Ruby, and Python.

ACID Transactions - Protect your data with the strongest level of isolation.

Time Travel - Access earlier versions of data for audits, rollbacks, or reproducing results.

Open Source - Community driven.

DML Operations - SQL, Scala, and Python APIs to merge, update, and delete data.

Audit History - Delta Lake logs all change details, providing a complete audit trail.

Schema Evolution/Enforcement - Prevents bad data from causing data corruption.

Unified Batch/Streaming - Exactly-once ingestion, from backfills to interactive queries.

Delta table is the default data table format in Databricks.
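A minimal PySpark sketch of Delta writes and time travel. It assumes a Databricks cluster or a local Spark session configured with the delta-spark package; the table path is hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DeltaDemo").getOrCreate()

path = "/tmp/delta/sales_demo"  # hypothetical location

# Write a DataFrame as a Delta table (this becomes version 0).
spark.range(0, 5).withColumnRenamed("id", "units_sold") \
    .write.format("delta").mode("overwrite").save(path)

# Append more data (this becomes version 1).
spark.range(5, 10).withColumnRenamed("id", "units_sold") \
    .write.format("delta").mode("append").save(path)

# Time travel: read an earlier version of the table.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
v0.show()

# Audit history: the full log of changes to the table.
spark.sql(f"DESCRIBE HISTORY delta.`{path}`").show(truncate=False)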

Data Warehousing Concepts

Dimensional Modelling

Dimensional Modeling (DM) is a data structure technique optimized for data storage in a data warehouse. Dimensional modeling aims to optimize the database for faster data retrieval. The concept of Dimensional Modeling was developed by Ralph Kimball and consists of "fact" and "dimension" tables.

Dimension Table

Dimensions provide the context surrounding a business process event; they give the who, what, and where of a fact. In the Sales business process, for the fact "quarterly sales number," the dimensions would be:

Who – Customer Names
Where – Location
What – Product Name

They are joined to Fact tables via a foreign key.

Dimensions offer descriptive characteristics of the facts with the help of their attributes.

Src: www.guru99.com

Src: www.guru99.com

Fact Table

It contains measurements, metrics, and facts about a business process and is the primary table in dimensional modeling.

A Fact Table contains

Measurements/facts
Foreign key to the dimension table

Src: www.guru99.com
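A tiny Python sketch using DuckDB (the table and column names are made up for illustration) showing a dimension table joined to a fact table via a foreign key:

import duckdb

con = duckdb.connect()

# Dimension: descriptive context (the who/what/where).
con.sql("""
    CREATE TABLE dim_product (
        product_key INTEGER PRIMARY KEY,
        product_name VARCHAR,
        category VARCHAR
    )
""")

# Fact: measurements plus foreign keys to the dimensions.
con.sql("""
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product (product_key),
        order_date DATE,
        units_sold INTEGER,
        total_revenue DOUBLE
    )
""")

con.sql("INSERT INTO dim_product VALUES (1, 'Laptop', 'Electronics'), (2, 'Desk', 'Furniture')")
con.sql("""
    INSERT INTO fact_sales VALUES
        (1, DATE '2024-09-01', 3, 2400.0),
        (2, DATE '2024-09-01', 1, 350.0)
""")

# Typical query: join the fact to a dimension and aggregate the measures.
con.sql("""
    SELECT d.category, SUM(f.total_revenue) AS revenue
    FROM fact_sales f
    JOIN dim_product d USING (product_key)
    GROUP BY d.category
""").show()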

Star Schema

Star schema is the simplest model used in DWH. It's commonly used in Data Marts.

The star schema is built to streamline querying and reporting.

The ETL process extracts data from the operational database, transforms it into the proper format, and loads it into the warehouse.

src: www.vertabelo.com

dim_employee: info about employee 

dim_product: info about the product 

dim_store: info about store 

dim_sales_type: info about sales 

dim_time: the time dimension 

fact_sales: references to dimension tables plus two facts (price and quantity sold)

This Star schema is intended to store the history of placed orders.

fact_supply_order: contains aggregated data about the order & supplies.
dim_supplier: supplier information.

Advantages

  • The fact table is related to each dimension table by exactly one relation.
  • Faster aggregations.
  • Simpler queries. Relatively fewer joins.

Disadvantages

  • Data Integrity is not enforced.
  • Data redundancy. City > State > Country can be normalized and not repeated.

Galaxy Schema

Essentially, Galaxy schema can be derived as a collection of star schemas interlinked and completely normalized to avoid redundancy and inaccuracy of data. It is also called Fact Constellation Schema.

Let's combine the above two Star Schemas.

Advantages

  • Minimum or no redundancy as a result of Normalization.
  • This is a flexible Schema, considering the complexity of the system.

Disadvantages

  • Working with this schema is tedious; the complexity of the schema and database system makes it intricate.
  • Data retrieval requires multi-level joins combined with conditional expressions.
  • Multiple levels of normalization may be needed, depending on the depth of the given database.

Snowflake Schema

Snowflake is an extension of the Star Schema.

Dimension tables are further normalized and can have their categories.

One or more lookup tables can describe each dimension table. Normalization is recursive until the model is fully normalized.

Check dim_store, dim_product, and dim_time and compare with Star Schema.

Check the Sales Order Schema

Advantages

Better data quality. Less disk space is used compared to Star Schema.

Disadvantages

The major disadvantage is the large number of tables to join. Poorly written queries can significantly decrease performance.

Starflake Schema

Starflake = Star + Snowflake

So how does this work? Some tables are normalized in detail, and some are denormalized.

If used correctly, the Starflake schema can give the best of both worlds.

dim_store is normalized, whereas dim_time is denormalized.

The time dimension is mostly integers, so not much space is wasted, and keeping it denormalized helps reduce query complexity.

Star vs. Starflake

| Feature | Star Schema | Starflake Schema |
| --- | --- | --- |
| Structure | Single fact table connected to multiple denormalized dimension tables. | Hybrid structure with a mix of denormalized and normalized dimension tables. |
| Complexity | Simple and straightforward, easy to understand and navigate. | More complex due to the normalization of certain dimensions. |
| Data Redundancy | Higher redundancy due to denormalization; data may be duplicated in dimension tables. | Reduced redundancy in normalized parts of the schema, leading to potentially less storage use. |
| Query Performance | Generally faster query performance because of fewer joins (denormalized data). | Slightly slower query performance due to additional joins needed for normalized tables. |
| Maintenance | Easier to maintain, as there are fewer tables and less complex relationships. | More challenging to maintain, as normalized tables introduce more relationships and dependencies. |
| Flexibility | Less flexible for handling updates and changes in dimension attributes (due to denormalization). | More flexible for handling changes and updates to dimension attributes, as normalization allows easier updates without affecting the entire table. |
| Use Case | Best for environments where simplicity and query performance are prioritized, such as dashboards and reporting. | Best for environments where data integrity and storage efficiency are critical, and some dimensions require normalization. |

GRAIN

Declaring the grain is the pivotal step in a dimensional design.

The grain establishes precisely what a single fact table row represents.

  • Minute-by-minute weather.
  • Daily sales.
  • A scanner device measures a line item on a customer’s retail sales ticket.
  • An individual transaction against an insurance policy.

What is the lowest level of data? Grain is defined in business terms, not as rows/columns.

Src: www.aroundbi.com

Based on the image, what do you think the data frequency is?

One row per product? 
One row per day per product?
One row per day per store per product? 

It can be anything, but that has to be decided first.

Ralph Kimball says, "The grain must be declared before choosing dimensions or facts".

The grain must be declared before choosing dimensions or facts because every candidate dimension or fact must be consistent with the grain.

This consistency enforces uniformity on all dimensional designs, which is critical to BI application performance and ease of use.

When changes are made to the grain (e.g., adding "per customer" to the lowest level)

src: www.aroundbi.com

Rolled-up summary grains are essential for performance tuning, but they pre-suppose the business’s common questions.

Each proposed fact table grain results in a separate physical table; different grains must not be mixed in the same fact table.

Making one more change

It helps with better reporting, increases query performance, and provides an aggregated / summary view.

Multi-Fact Star Schema

The schema with more than one fact table linked by normalized dimension tables is often called a "Data Mart Schema" or a "Multi-Fact Star Schema."

This schema can be helpful when you have multiple fact tables that share some but not all dimensions and where the dimensions are normalized.

  • Multiple Fact Tables: For different aspects of the business.
  • Normalized Dimension Tables: To avoid redundancy and improve data integrity.

It's a complex, real-world adaptation designed to meet specific business needs.

Vertabelo Tool

  • Easy to model for any database.
  • Model sharing.
  • Model exporting.
  • Import existing database.
  • Easy layout.
  • Generate necessary SQL queries.
  • Live validation
  • SQL Preview.
  • Zoom, Navigation, and Search capability.

Dimension - Fact

Identify the Fact and Dimension Tables.

Sample Exercise

From the data given below, Identify the Dimensions and Facts

Keys

Again, what is a Primary Key? Unique - Not Null - One PK per table.

Super Key - A super key is a set of one or more columns (attributes) that uniquely identifies rows in a table.

Candidate Key - Minimal super keys with no redundant attributes to uniquely identify a row.

Primary Key - Primary key is selected from the sets of candidate keys.

Alternate Key – A candidate key that was not selected as Primary Key.

Natural Key – A key that happens naturally in the table. Example: SSN, TaxID.

Surrogate Key – A system-generated identifier (Meaningless Simple Number).

Unique Key – Unique values; may allow one NULL.

Foreign Key – used to relate other tables.

Super Key

Super Key is a set of one or more columns (attributes) to identify rows in a table uniquely.

For the above table, a whole bag of key combinations can uniquely identify its rows.

{ID}
{EmpNum}
{SSN}
{Email}
{ID,EmpNum}
{ID,SSN}
{ID,Email}
{ID,DOB}
{EmpNum,SSN}
{EmpNum,SSN,Email}
{ID, EmpNum, SSN}
{ID, EmpNum, SSN, DOB}
{ID, EmpNum, SSN, DOB, Name}
{ID, EmpNum, SSN, Name, Email,DOB}
...

Now you get the idea.. we can come up with more & more combinations.

Candidate Key

"Minimal super keys" with no redundant attributes to uniquely identify a row.

Here the essential condition is Minimal. Let's see what minimal keys help us identify a row uniquely.

With just ID, we can uniquely identify a row. Similarly, EmpNum, SSN, and Email can each uniquely identify a row.

{ID}

{EmpNum}

{SSN}

{Email}

Primary Key

The primary key is selected from the sets of candidate keys. Let's choose the best column as Primary Key through the process of elimination.

SSN: Sensitive
Email: Varchar
EmpNum: Good to use when needed.
ID: Auto Increment. Widely used in Data Warehousing.
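
A hedged SQL sketch of the employee table implied by this example (the table name and data types are assumptions): ID is the primary key, while EmpNum, SSN, and Email remain enforced as unique candidate keys.

CREATE TABLE employee (
    ID      INT PRIMARY KEY,              -- chosen primary key (auto-increment in practice)
    EmpNum  VARCHAR(10)  NOT NULL UNIQUE, -- candidate key
    SSN     VARCHAR(11)  NOT NULL UNIQUE, -- candidate key (sensitive)
    Email   VARCHAR(100) UNIQUE,          -- candidate key, later used as the alternate key
    Name    VARCHAR(100),
    DOB     DATE
);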

Alternate Key

A candidate key that was not selected as the Primary Key.

Email is best suited as an Alternate Key. Email can be used to search for a particular employee if the system doesn’t have access to EmpNum, and it can be used to link to external accounts.

Surrogate Key

Surrogate Key: A system-generated identifier (Meaningless Simple Number)

ID column matches the definition.

Surrogate Keys are very handy and useful in Data Warehousing.

Unique Key

A unique key holds unique values and may allow one NULL value.

Under certain circumstances, Email can be a unique key, assuming the email is generated by a different system and can be NULL for a brief period.

Natural Key

A natural key is a column or set of columns that already exist in the table (e.g., they are attributes of the entity within the data model) and uniquely identifies a record in the table. Since these columns are entity attributes, they have business meaning.

Example: SSN

Foreign Key

It is used to refer to other tables.

Why Surrogate Keys are Important

Example of Primary / Foreign Keys

Let's look at some sample data.

| StoreID * | Street | City | State | Country |
| --- | --- | --- | --- | --- |
| S1001 | 24th Blvd | Phoenix | AZ | USA |
| S1002 | 21 Bell Road | Miami | FL | USA |
| S1003 | Main Street | New Port | CA | USA |

StoreID is a Natural Key (PK). It has a meaning: "S" stands for Store, and 1001 means it is the first store.

What is the issue now?

When data changes (Slowly Changing Dimensions, which we will cover in the next chapter), how do we handle the change?

Situation 1: Store 1 moves to a new location, or the store is closed for some time and reopened under a new franchise.

| StoreID * | Street | City | State | Country |
| --- | --- | --- | --- | --- |
| S1001 | 24th Blvd | Phoenix | AZ | USA |
| S1002 | 21 Bell Road | Miami | FL | USA |
| S1003 | Main Street | New Port | CA | USA |
| S1001 | 1st Street | Phoenix | AZ | USA |

Situation 2: When acquiring a competing business (say Target buys KMart), the Natural Keys don't make sense anymore.

| StoreID * | Street | City | State | Country |
| --- | --- | --- | --- | --- |
| S1001 | 24th Blvd | Phoenix | AZ | USA |
| S1002 | 21 Bell Road | Miami | FL | USA |
| S1003 | Main Street | New Port | CA | USA |
| 233 | South Street | New Brunswick | NJ | USA |
| 1233 | JFK Blvd | Charlotte | NC | USA |

These business decisions / changes have nothing to do with the Technology.

How to Overcome this issue?

Add Surrogate Keys (running sequence number)

| Surr_Store * | StoreID | Street | City | State | Country |
| --- | --- | --- | --- | --- | --- |
| 1 | S1001 | 24th Blvd | Phoenix | AZ | USA |
| 2 | S1002 | 21 Bell Road | Miami | FL | USA |
| 3 | S1003 | Main Street | New Port | CA | USA |
| 4 | 233 | South Street | New Brunswick | NJ | USA |
| 5 | 1233 | JFK Blvd | Charlotte | NC | USA |

Properties of Surrogate Keys

- Numerical
- Sequential
- Meaningless Simple Number

Adv of Surrogate Keys

- Constant Behavior (will not change based on Business need)
- Integration is easier.
- Faster Query Performance. (because of Integer values)
- Future records (every other column can be NULL, yet the ID is still available)

It's a good practice to have a Surrogate Key in Data Warehouse Dimension & Fact tables.
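
A minimal SQL sketch of the idea, using a hypothetical dim_store table based on the example above (column names and data types are assumptions; the auto-increment mechanism varies by database):

CREATE TABLE dim_store (
    surr_store INT PRIMARY KEY,   -- surrogate key: meaningless running number (auto-increment / sequence)
    store_id   VARCHAR(10),       -- natural/business key kept as a plain attribute
    street     VARCHAR(100),
    city       VARCHAR(50),
    state      VARCHAR(2),
    country    VARCHAR(50)
);

-- Fact tables reference the surrogate key (surr_store), not the natural key (store_id),
-- so business-driven changes to StoreID never break the joins.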

More Examples

Dimension columns typically contain descriptive attributes that provide context or categorization to the data, while fact columns contain measurable, numerical data that can be analyzed or aggregated.

Identify the Dimension and Fact in the following designs.

Student Example

| FirstName | LastName | DOB | Ht | Wt | Gender | Course | Grade |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Ross | Geller | 1967-10-18 | 72 | 170 | Male | Paleontology | A |
| Rachel | Green | 1969-05-05 | 65 | 125 | Female | Fashion Design | B+ |
| Monica | Geller | 1964-04-22 | 66 | 130 | Female | Culinary Arts | A |
| Chandler | Bing | 1968-04-08 | 73 | 180 | Male | Advertising | B |
| Joey | Tribbiani | 1968-01-09 | 71 | 185 | Male | Acting | C+ |

CAR Sales Example

| Manufacturer | Model | Sales_in_thousands | Vehicle_type | Price_in_thousands | Engine_size | Horsepower | Wheelbase | Width | Length | Curb_weight | Fuel_capacity | Fuel_efficiency | Latest_Launch | Power_perf_factor |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Acura | Integra | 16.919 | Passenger | 21.5 | 1.8 | 140 | 101.2 | 67.3 | 172.4 | 2.639 | 13.2 | 28 | 2/2/2012 | 58.28 |
| Acura | TL | 39.384 | Passenger | 28.4 | 3.2 | 225 | 108.1 | 70.3 | 192.9 | 3.517 | 17.2 | 25 | 6/3/2011 | 91.371 |
| Acura | CL | 14.114 | Passenger | [NULL] | 3.2 | 225 | 106.9 | 70.6 | 192 | 3.47 | 17.2 | 26 | 1/4/2012 | [NULL] |
| Acura | RL | 8.588 | Passenger | 42 | 3.5 | 210 | 114.6 | 71.4 | 196.6 | 3.85 | 18 | 22 | 3/10/2011 | 91.39 |
| Audi | A4 | 20.397 | Passenger | 23.99 | 1.8 | 150 | 102.6 | 68.2 | 178 | 2.998 | 16.4 | 27 | 10/8/2011 | 62.778 |
| Audi | A6 | 18.78 | Passenger | 33.95 | 2.8 | 200 | 108.7 | 76.1 | 192 | 3.561 | 18.5 | 22 | 8/9/2011 | 84.565 |
| Audi | A8 | 1.38 | Passenger | 62 | 4.2 | 310 | 113 | 74 | 198.2 | 3.902 | 23.7 | 21 | 2/27/2012 | 134.657 |
| BMW | 323i | 19.747 | Passenger | 26.99 | 2.5 | 170 | 107.3 | 68.4 | 176 | 3.179 | 16.6 | 26 | 6/28/2011 | 71.191 |

Master Data Management

Master Data Management (MDM) refers to creating and managing data that an organization must have as a single master copy, called the master data.

  • State
  • Customers
  • Vendors
  • Products

It is the single source of the truth.

MDM is not a Data Warehouse, but it is closely related to Data Warehousing.

Different Goals: MDM's goal is to create and maintain a single source of truth, whereas in a DW, the Sales view of a Customer and the Marketing view of a Customer may differ and not follow a single source of truth.

Types of Data: MDM contains data that changes rarely, mainly Dimensions, while the DW holds both Dimensions and Facts.

Reporting Needs: Data Warehousing's priority is to address end-user requirements. MDM's priority is ensuring it follows data governance, quality, and compliance.

Steps of Dimensional Modeling

  • Identify the Business Processes

  • Identify the Facts and Dimensions in your Dimensional Data Model

  • Define the Grain

  • Identify the Attributes / Keys for Dimensions

  • Build the Schema

  • Store Historical Info

Types of Dimensions

Date Dimension Table

In this example, the Date Dimension table contains information about dates, including the date itself, the year, quarter, month, day, and weekday. The "Date Key" column is the table's primary key and is used to join with other tables in the Star Schema.

This Date Dimension table can be joined with other fact tables in the Star Schema, such as a Sales Fact table, which contains measures such as total sales revenue, units sold, and discounts. By joining the Sales Fact table with the Date Dimension table on the Date Key column, it becomes possible to analyze sales data by date, year, quarter, month, and weekday. For example, it becomes possible to answer questions like:

  • What was the total sales revenue for January 2022?
  • What were the total units sold on Mondays in Q1 2022?
  • What was the average discount percentage on weekends in 2022?
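
For instance, a hedged sketch of the first question above, assuming a hypothetical fact_sales table (with a sales_revenue measure and a date_key foreign key) and a dim_date table keyed the same way:

SELECT SUM(f.sales_revenue) AS total_revenue
FROM fact_sales f
JOIN dim_date d ON f.date_key = d.date_key
WHERE d.year = 2022
  AND d.month = 1;   -- January 2022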

Degenerate Dimension

Generally, what is a Dimension table? Something like a master list. Employee, Product, Date, Store.

In a data warehouse, a degenerate dimension is a dimension key in the fact table that does not have its dimension table.

It is a key in the fact table but does not have its own dimension table.

Degenerate dimensions are most common in transaction and accumulating snapshot fact tables.

This means there is no separate Dimension table.

OrderNo is a degenerate dimension, as it has no Dimension table.
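
A hedged sketch of what this looks like, assuming a hypothetical fact_sales table: the order number lives directly in the fact table, with no dimension table behind it.

CREATE TABLE fact_sales (
    date_key    INT,
    product_key INT,
    store_key   INT,
    order_no    VARCHAR(20),   -- degenerate dimension: no dim_order table exists
    quantity    INT,
    unit_price  DECIMAL(10,2)
);

-- Grouping line items of the same order via the degenerate dimension
SELECT order_no, SUM(quantity * unit_price) AS order_total
FROM fact_sales
GROUP BY order_no;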

Uses of a Degenerate Dimension

- Grouping line items
- Getting the average sale
- Tracking

Other Examples of Degenerate Dimensions are

  1. Order number or invoice number: This is a typical example of a degenerate dimension. It is a unique identifier for an order or an invoice but has no meaningful attributes other than its value. This degenerate dimension can be used in a fact table to analyze sales by order or invoice number.

  2. Tracking number: Another example of a degenerate dimension is a tracking number for a shipment. Like an order number, it is a unique identifier but has no additional attributes associated with it. It can be used in a fact table to analyze shipping performance or delivery times.

  3. ATM transaction number: In banking, an ATM transaction number is a unique identifier for a transaction at an ATM. It can be used in a fact table to analyze ATM usage patterns and trends.

  4. Serial number: A serial number is a unique identifier assigned to a product or equipment. It can be used in a fact table to analyze the performance of a particular product or equipment.

  5. Coupon code: A coupon code is a unique identifier to redeem a discount or promotion. It can be used in a fact table to analyze the usage and effectiveness of different marketing campaigns or promotions.

Degenerate dimensions are helpful when we have a unique identifier for an event or transaction. Still, it has no additional attributes that make it sound like a separate dimension table. In these cases, we can include the unique identifier in the fact table as a degenerate dimension.

Junk Dimension

Cardinality: Number of Unique Values.

A junk dimension is a dimension table created by grouping low cardinality and unrelated attributes.

The idea behind a junk dimension is to reduce the number of dimension tables in a data warehouse and simplify queries.

An example of a junk dimension could be a table that includes binary flags such as "is_promotion", "is_return", and "is_discount".

Possible values for is_promotion are Y or N; the same holds for is_return and is_discount.

Example:

fact_sale table

Date
Product 
Store
OrderNumber

PaymentMode
StoreType
CustomerSupport

Quantity
UnitPrice

Possible values for these columns

PaymentMode - Cash/Credit/Check

StoreType - Warehouse/Marketplace

CustomerSupport - Yes/No

So how do we handle this situation?

Option 1: Add them to the Fact Table

The problem is that these columns are not measures, so in the fact table they are not that important and sometimes won't make sense.

src: aroundbi.com

Option 2: Add it as Dimension Table

The problem is that more dimension tables add more joins to your queries.

src: aroundbi.com

Do you know how to take care of this situation?

Let's create a new table with values from all possible combinations.

Junk Dimension with ID

So, the revised Fact Table looks like this

  • Basically to group low cardinality columns.
  • If the values are too uncorrelated.
  • Helps to keep the DW simple by minimizing the dimensions.
  • Helps to improve the performance of SQL queries.

Note: If the resultant dimension has way too many rows, then don’t create a Junk dimension.
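
A minimal sketch of building such a junk dimension, assuming hypothetical dim_junk and column names derived from the flags above; the 3 x 2 x 2 = 12 combinations can be generated with a cross join (the inline VALUES syntax varies by database):

CREATE TABLE dim_junk (
    junk_id          INT PRIMARY KEY,
    payment_mode     VARCHAR(10),   -- Cash / Credit / Check
    store_type       VARCHAR(15),   -- Warehouse / Marketplace
    customer_support VARCHAR(3)     -- Yes / No
);

-- Populate with every possible combination (12 rows)
INSERT INTO dim_junk (junk_id, payment_mode, store_type, customer_support)
SELECT ROW_NUMBER() OVER (ORDER BY p.payment_mode, s.store_type, c.customer_support),
       p.payment_mode, s.store_type, c.customer_support
FROM (VALUES ('Cash'), ('Credit'), ('Check'))       AS p(payment_mode)
CROSS JOIN (VALUES ('Warehouse'), ('Marketplace'))  AS s(store_type)
CROSS JOIN (VALUES ('Yes'), ('No'))                 AS c(customer_support);

-- fact_sale then stores a single junk_id instead of three low-cardinality columns.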

Static Dimension

A static dimension refers to a dimension table in a data warehouse that does not change, or changes only infrequently, over time.

Values Never Change.

Time Dimension: Days, weeks, months

Payment Method Dimension: Credit Card, Debit Card, PayPal, Zelle, etc.

Education Level Dimension: High School, Bachelor, Master, PhD, etc.

Geographic Dimension: Store Locations, Categories, US States, Countries, etc.

Conformed Dimensions

This dimension is consistent and the same across multiple fact tables in a data warehouse. The dimension tables are designed once and used across various areas, maintaining consistency.

List of countries/states where the company is doing business.

An excellent example of a conformed dimension is a "Date" or "Time" dimension. Imagine a data warehouse for a retail business that has two fact tables: "Sales" and "Inventory". These fact tables might have a "Date" dimension that tracks when a sale or inventory count was made.

In this example, Date, Product, and Store dimensions remain constant across multiple fact tables.

Slowly Changing Dimensions

Slowly Changing Dimensions (SCDs) is a concept in data warehousing and business intelligence that deals with managing changes in dimension data over time.

Data Warehousing has

- Time Variant (with history data)
- Non Volatile (no changes)

But in reality

- The customer may change his address.
- A store moves to a new location.
- An employee joins a new company.

Dimension data refers to the descriptive attributes of an entity, such as customer, product, or location. These attributes provide context for analysis and reporting in a data warehouse.

However, dimension data can change over time, and it's essential to maintain a historical record of these changes for accurate reporting and analysis.

There are several types of SCDs, each with different strategies for managing changes in dimension data:

Type 1 - Overwrite: In this approach, when changes occur, the existing values are overwritten with the new values. No historical data is preserved.

Type 2 - Add a New Row: This method maintains a complete history of changes by adding a new row with the updated attribute values in the dimension table. Each row has a start and end date (or a start date and a flag indicating the current record) to show the period during which the attribute values were valid.

Type 3 - Add a New Column: This approach maintains a limited history of changes by adding new columns to the dimension table to store the previous values of changed attributes. It is useful when tracking a small number of changes but can become unwieldy with many changes.

Type 4 - Add a History Table: In this method, a separate history table stores the changes in the dimension attributes. The primary dimension table contains the current attribute values, while the history table stores historical data.

Type 6 - Hybrid: This combines Type 1, Type 2, and Type 3 approaches. It allows for the selective preservation of history for specific attributes and can be used to overwrite certain attribute values.

Choosing the appropriate SCD type depends on the specific requirements of the data warehouse, the importance of historical data for analysis, and the performance implications of each approach.

SCD - Type 0

Type 0 - Retain Original: This approach ignores changes and retains the original values for the dimension attributes. No history is kept for changes.

The Product dimension table might include the following columns:

  1. Product_ID
  2. Product_Name
  3. Category
  4. Price

Now, imagine that the store decides to re-categorize one of its products. For instance, they may change the "Smartphone" product from the "Electronics" category to the "Mobile Devices" category.

Using a Type 0 Slowly Changing Dimension approach, the store will retain the original value for the product category, ignoring the change. In this case, the Product dimension table would still show the "Smartphone" product in the "Electronics" category, even though it has been re-categorized to "Mobile Devices." This approach means no history is kept for changes, and the original values are always preserved.

The Product dimension table would look like this:

| Product_ID | Product_Name | Category | Price |
| --- | --- | --- | --- |
| 1 | Smartphone | Electronics | 1000 |

Use Cases:

  1. When historical data is not relevant or necessary for analysis, and only the original values are needed.
  2. For dimensions with attributes that are fixed and don't change over time, such as unique identifiers or codes.
  3. In cases where the data warehouse is only required to support reporting and analysis on the current state of the business and not the historical trends.

Advantages:

  1. Simplicity: SCD Type 0 is the simplest approach, as it doesn't require any additional mechanisms to handle changes in dimension attributes.
  2. Space Efficiency: Since there is no need to store historical data or multiple versions of records, the dimension tables will be smaller and require less storage space.
  3. Performance: As there are no additional rows or columns for historical data, the querying and processing of the data warehouse will generally be faster.

Disadvantages:

  1. Lack of Historical Data: SCD Type 0 does not store any historical data, which means it cannot support reporting and analysis that requires tracking changes over time. This can be a significant limitation for businesses that need to analyze trends, understand the impact of changes, or perform other historical analyses.
  2. Inaccurate Analysis: Since the dimension table only contains the original values, any changes that have occurred over time are not reflected. This may lead to incorrect analysis results or conclusions based on outdated information.
  3. Inability to Track Changes: With SCD Type 0, it is impossible to determine when or why changes occurred, as there is no record of any changes in the dimension data.

SCD - Type 1

Type 1 - Overwrite: In this approach, when changes occur, the existing values are overwritten with the new values. No historical data is preserved.

Before the change, the Product dimension table looks like this:

| Product_ID | Product_Name | Category | Price |
| --- | --- | --- | --- |
| 1 | Smartphone | Electronics | 800 |

After the change, the Product dimension table will look like this:

| Product_ID | Product_Name | Category | Price |
| --- | --- | --- | --- |
| 1 | Smartphone | Mobile Devices | 800 |

With a Type 1 SCD approach, the table now reflects the updated category for the "Smartphone" product. However, there is no record of the product's previous category, "Electronics," in the table.
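
A minimal SQL sketch of a Type 1 change, assuming the dimension table above is named dim_product: the old value is simply overwritten.

UPDATE dim_product
SET Category = 'Mobile Devices'   -- overwrite; the old value 'Electronics' is lost
WHERE Product_ID = 1;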

Use Cases:

  1. When historical data is not essential for analysis, and only the current values are needed.
  2. For dimensions where tracking historical changes is not required for the business, such as minor corrections or updates to attributes.
  3. In cases where the data warehouse is required to support reporting and analysis on the current state of the business, and not historical trends.

Advantages:

  1. Simplicity: SCD Type 1 is relatively simple to implement, as it only requires overwriting existing records when changes occur.
  2. Space Efficiency: Since there is no need to store historical data or multiple versions of records, the dimension tables will be smaller and require less storage space.
  3. Performance: As there are no additional rows or columns for historical data, the querying and processing of the data warehouse will generally be faster.

Disadvantages:

  1. Lack of Historical Data: SCD Type 1 does not store any historical data, which means it cannot support reporting and analysis that requires tracking changes over time. This can be a significant limitation for businesses that need to analyze trends, understand the impact of changes, or perform other historical analyses.
  2. Loss of Previous Data: Since the dimension table only contains the most recent values, any changes that have occurred over time overwrite the previous values. This may lead to a loss of potentially valuable historical information.
  3. Inability to Track Changes: With SCD Type 1, it is impossible to determine when or why changes occurred, as there is no record of any changes in the dimension data. This can make it challenging to understand the reasons for changes or identify any potential issues or patterns.

SCD - Type 2

Using the retail store example with the "Product" dimension, let's see how a Type 2 Slowly Changing Dimension approach would handle the change in the product category.

Recall that the Product dimension table contains the following columns:

  1. Product_ID
  2. Product_Name
  3. Category
  4. Price

For a Type 2 SCD, we will need to add two additional columns to the table:

  1. Start_Date
  2. End_Date (or a flag indicating the current record, such as "Is_Current")

Now, imagine that the store decides to re-categorize the "Smartphone" product from the "Electronics" category to the "Mobile Devices" category, as in the previous examples.

With a Type 2 SCD approach, the store will add a new row with the updated category information while retaining the old row in the table. Each row will have a start and end date, indicating the period during which the attribute values were valid.

Before the change, the Product dimension table looks like this:

| Product_ID | Product_Name | Category | Price | Start_Date | End_Date |
| --- | --- | --- | --- | --- | --- |
| 1 | Smartphone | Electronics | 800 | 2021-01-01 | NULL |

After the change, the Product dimension table will look like this:

| Sur_ID | Product_ID | Product_Name | Category | Price | Start_Date | End_Date |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 1 | Smartphone | Electronics | 800 | 2021-01-01 | 2023-04-03 |
| 2 | 1 | Smartphone | Mobile Devices | 800 | 2023-04-04 | NULL |

With a Type 2 SCD approach, the table now has a new row reflecting the updated category for the "Smartphone" product.

Additionally, the previous row for the "Smartphone" product with the "Electronics" category is still in the table, with an updated End_Date.

This approach allows for the preservation of historical data and enables accurate reporting and analysis based on the product's category at different points in time.
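
A minimal SQL sketch of a Type 2 change, assuming the dimension table above is named dim_product and carries the surrogate key (Sur_ID) plus Start_Date / End_Date: close out the current row, then insert the new version.

-- 1. Expire the currently active row for the product
UPDATE dim_product
SET End_Date = '2023-04-03'
WHERE Product_ID = 1
  AND End_Date IS NULL;

-- 2. Insert a new row with the updated category
INSERT INTO dim_product (Sur_ID, Product_ID, Product_Name, Category, Price, Start_Date, End_Date)
VALUES (2, 1, 'Smartphone', 'Mobile Devices', 800, '2023-04-04', NULL);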

Use Cases:

  1. When historical data is critical for analysis, and it's essential to track changes over time for reporting and decision-making.
  2. For dimensions where understanding the impact of changes on business performance, such as customer demographics, product attributes, or pricing, is vital.
  3. In cases where the data warehouse is required to support reporting and analysis that includes historical trends, comparisons, and the effect of changes over time.

Advantages:

  1. Historical Data Preservation: SCD Type 2 maintains a complete history of changes in the dimension attributes, enabling more accurate and detailed reporting and analysis.
  2. Accurate Analysis: By preserving historical data, SCD Type 2 allows for accurate analysis and reporting that accounts for changes over time, leading to better insights and decision-making.
  3. Change Tracking: SCD Type 2 enables tracking of when and why changes occurred, making it easier to identify patterns, trends, or potential issues in the dimension data.

Disadvantages:

  1. Complexity: SCD Type 2 is more complex to implement and manage compared to SCD Type 0 and Type 1, as it requires additional mechanisms to handle changes in dimension attributes and maintain historical data.
  2. Space Requirements: Since multiple versions of records are stored to maintain historical data, the dimension tables will be larger and require more storage space.
  3. Performance: As there are additional rows for historical data, the querying and processing of the data warehouse may be slower compared to SCD Type 0 or Type 1, especially when dealing with large amounts of historical data. This may require more robust indexing, partitioning, or query optimization strategies to maintain acceptable performance.

SCD - Type 3

SCD Type 3 involves adding a new column to the dimension table to store the previous value of the changed attribute along with the current value. It allows tracking the current and previous values but does not maintain a complete history of changes. Using the same "Product" dimension example with the mobile product category change:

Recall that the Product dimension table contains the following columns:

  1. Product_ID
  2. Product_Name
  3. Category
  4. Price

For a Type 3 SCD, we will need to add a column to the table:

  1. Previous_Category

Now, imagine that the store decides to re-categorize the "Smartphone" product from the "Electronics" category to the "Mobile Devices" category, as in the previous examples.

With a Type 3 SCD approach, the store will update the existing row by setting the "Previous_Category" column to the old category value and overwriting the "Category" column with the new value.

Before the change, the Product dimension table looks like this:

| Product_ID | Product_Name | Category | Price | Previous_Category |
| --- | --- | --- | --- | --- |
| 1 | Smartphone | Electronics | 800 | NULL |

After the change, the Product dimension table will look like this:

| Product_ID | Product_Name | Category | Price | Previous_Category |
| --- | --- | --- | --- | --- |
| 1 | Smartphone | Mobile Devices | 800 | Electronics |

With a Type 3 SCD approach, the table now reflects the updated category for the "Smartphone" product and also retains the previous category in the "Previous_Category" column. However, it can only track one previous value and does not maintain a complete history of all changes that occurred over time.
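
A minimal SQL sketch of a Type 3 change, again assuming a dim_product table with the columns above: the current value is pushed into Previous_Category before being overwritten.

UPDATE dim_product
SET Previous_Category = Category,      -- keep the immediate previous value
    Category          = 'Mobile Devices'
WHERE Product_ID = 1;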

Use Cases:

  1. When it is essential to track the immediate previous value of an attribute, but a complete history of changes is not required.

  2. For dimensions where the primary focus is on comparing the current value with the previous value, rather than analyzing historical trends.

  3. In cases where the data warehouse is required to support reporting and analysis that involves comparisons between current and previous states, but not a full history of changes.

Advantages:

  1. Limited Historical Data: SCD Type 3 preserves the immediate previous value of an attribute, allowing for some level of historical analysis and comparison with the current value.

  2. Space Efficiency: Since only one previous value is stored, the dimension tables will require less storage space compared to SCD Type 2.

  3. Simplicity: SCD Type 3 is relatively simple to implement, as it only requires adding a new column to the dimension table and updating it when changes occur.

Disadvantages:

  1. Incomplete Historical Data: SCD Type 3 does not maintain a full history of changes, which may limit the depth of historical analysis and reporting that can be performed.

  2. Limited Change Tracking: With SCD Type 3, it is only possible to track the immediate previous value of an attribute, making it difficult to understand trends or patterns in the data over time.

  3. Additional Columns: SCD Type 3 requires adding a new column for each attribute that needs to track the previous value, which can increase the complexity of the dimension table schema.

  4. Scalability: If there are multiple attributes that require tracking of previous values or if the number of changes becomes more frequent, SCD Type 3 may become less practical and harder to manage. In such cases, SCD Type 2 may be a more suitable approach to maintain a complete history of changes.

Suppose the university decides to change the credit hours of a specific course, "Introduction to Data Science," from 3 to 4 credit hours. The university wants to track both the current and the immediately previous credit hours for reporting purposes.

Using SCD Type 3, the university would update the course's record in the Course dimension table by overwriting the "Credit_Hours" column and updating the "Previous_Credit_Hours" column with the old value.

Before the change, the Course dimension table looks like this:

| Course_ID | Course_Name | Credit_Hours | Previous_Credit_Hours |
| --- | --- | --- | --- |
| 1 | Introduction to Data Science | 3 | NULL |

After the change, the Course dimension table will look like this:

| Course_ID | Course_Name | Credit_Hours | Previous_Credit_Hours |
| --- | --- | --- | --- |
| 1 | Introduction to Data Science | 4 | 3 |

With a Type 3 SCD approach, the table now reflects the updated credit hours for the "Introduction to Data Science" course and retains the previous credit hours in the "Previous_Credit_Hours" column. This allows the university to compare the current and previous credit hours, which might be useful for understanding recent changes in course workload. However, it does not maintain a complete history of all credit hour changes over time.

SCD - Type 4

SCD Type 4, the "history table" approach, involves creating a separate table to store historical data, while the main dimension table only retains current information. Using the "Product" dimension example with the mobile product category change:

Recall that the Product dimension table contains the following columns:

  1. Product_ID
  2. Product_Name
  3. Category
  4. Price

For a Type 4 SCD, we will create an additional table called "Product_History":

Product_History Table:

  1. History_ID
  2. Product_ID
  3. Category
  4. Price
  5. Valid_From
  6. Valid_To

Now, imagine that the store decides to re-categorize the "Smartphone" product from the "Electronics" category to the "Mobile Devices" category.

With a Type 4 SCD approach, the store will insert a new row in the "Product_History" table, representing the previous state of the "Smartphone" product, and update the existing row in the main "Product" dimension table with the new category.

Before the change, the tables look like this:

Product Dimension Table:

| Product_ID | Product_Name | Category | Price |
| --- | --- | --- | --- |
| 1 | Smartphone | Electronics | 800 |

Product_History Table:

(empty)

After the change, the tables will look like this:

Product Dimension Table:

| Product_ID | Product_Name | Category | Price |
| --- | --- | --- | --- |
| 1 | Smartphone | Mobile Devices | 800 |

Product_History Table:

| History_ID | Product_ID | Category | Price | Valid_From | Valid_To |
| --- | --- | --- | --- | --- | --- |
| 1 | 1 | Electronics | 800 | 2020-01-01 | 2023-01-01 |

With a Type 4 SCD approach, the main "Product" dimension table contains only the current information, and the "Product_History" table maintains the history of changes. This approach helps maintain a clean dimension table and allows for efficient querying of current data while still preserving historical data for analysis when required.
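
A minimal SQL sketch of a Type 4 change, assuming the main dimension table is named dim_product and the history table product_history (columns as shown above): archive the outgoing state into the history table, then overwrite the main dimension row.

-- 1. Copy the outgoing state into the history table
INSERT INTO product_history (History_ID, Product_ID, Category, Price, Valid_From, Valid_To)
VALUES (1, 1, 'Electronics', 800, '2020-01-01', '2023-01-01');

-- 2. Overwrite the main dimension so it holds only the current state
UPDATE dim_product
SET Category = 'Mobile Devices'
WHERE Product_ID = 1;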

Use Cases:

  1. When it is important to maintain a complete history of changes, but the main dimension table should only contain the current state of the data.

  2. For situations where querying the current data efficiently is a priority, but historical data is still required for more in-depth analysis.

  3. When the dimension table is frequently accessed or queried for current data, and performance is a concern.

Advantages:

  1. Performance: By keeping the current data in the main dimension table and historical data in a separate table, SCD Type 4 allows for efficient querying of the current data without the overhead of historical data.

  2. Historical Data: SCD Type 4 maintains a complete history of changes, enabling in-depth analysis and reporting that requires historical data.

  3. Separation of Concerns: By separating the current and historical data, SCD Type 4 provides a cleaner and more organized data structure, making it easier to manage and maintain.

  4. Scalability: Since historical data is stored separately, SCD Type 4 can scale well with large dimensions and frequent changes.

Disadvantages:

  1. Complexity: SCD Type 4 adds complexity to the data warehouse design and maintenance, as it requires managing two separate tables for the same dimension.

  2. Increased Storage: Storing historical data in a separate table requires additional storage space, which can be a concern for large dimensions with extensive change history.

  3. Maintenance: Implementing and maintaining SCD Type 4 can be more challenging than other SCD types, as it requires managing the relationships between the dimension and history tables and ensuring data integrity between them.

  4. Query Complexity: Analyzing historical data and comparing it with current data can involve more complex queries, as it may require joining the dimension and history tables.

Overall, SCD Type 4 is suitable for scenarios where maintaining a complete history of changes is necessary, but the main dimension table should only contain the current state of the data to optimize query performance.

SCD - Type 6

SCD Type 6 is a hybrid approach that combines the features of SCD Types 1, 2, and 3. It maintains the current attribute value, the previous attribute value, and a full history of changes with effective dates. Using the "Product" dimension example with the mobile product category change:

Recall that the Product dimension table contains the following columns:

  1. Product_ID
  2. Product_Name
  3. Category
  4. Price

For a Type 6 SCD, we will need to add the following columns to the table:

  1. Previous_Category
  2. Valid_From
  3. Valid_To
  4. Is_Current

Now, imagine that the store decides to re-categorize the "Smartphone" product from the "Electronics" category to the "Mobile Devices" category.

With a Type 6 SCD approach, the store will insert a new row in the "Product" dimension table with the updated category and set "Is_Current" to 'Y' (Yes) for the new row and 'N' (No) for the old row. The new row will have the "Previous_Category" column set to the old category value.

Before the change, the Product dimension table looks like this:

| Product_ID | Product_Name | Category | Price | Previous_Category | Valid_From | Valid_To | Is_Current |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Smartphone | Electronics | 800 | NULL | 2020-01-01 | NULL | Y |

After the change, the Product dimension table will look like this:

| Product_ID | Product_Name | Category | Price | Previous_Category | Valid_From | Valid_To | Is_Current |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Smartphone | Electronics | 800 | NULL | 2020-01-01 | 2023-01-01 | N |
| 2 | Smartphone | Mobile Devices | 800 | Electronics | 2023-01-01 | NULL | Y |

With a Type 6 SCD approach, the table now reflects the updated category for the "Smartphone" product, retains the previous category, and maintains a full history of changes with effective dates. This approach provides the benefits of SCD Types 1, 2, and 3 while optimizing the storage and query performance to some extent. However, it still requires managing additional columns and more complex logic for managing the Is_Current flag and Valid_From/Valid_To dates.
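
A minimal SQL sketch of a Type 6 change, assuming a dim_product table with the columns shown above: expire the currently active row, then insert the new version that carries the previous category.

-- 1. Close out the currently active row (Type 2 flavor)
UPDATE dim_product
SET Valid_To   = '2023-01-01',
    Is_Current = 'N'
WHERE Product_Name = 'Smartphone'
  AND Is_Current = 'Y';

-- 2. Insert the new version, keeping the previous category (Type 3 flavor)
INSERT INTO dim_product
    (Product_ID, Product_Name, Category, Price, Previous_Category, Valid_From, Valid_To, Is_Current)
VALUES
    (2, 'Smartphone', 'Mobile Devices', 800, 'Electronics', '2023-01-01', NULL, 'Y');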

SCD - Type 5 - Fun Fact

A data warehousing expert, Ralph Kimball, introduced the concept of Slowly Changing Dimensions and defined Types 0, 1, 2, and 3. These widely adopted types became the standard classification for handling dimension changes.

Later, other types, such as Type 4 and Type 6, were proposed by different practitioners to address additional scenarios and use cases that needed to be covered by Kimball's original classification. However, there was no commonly agreed upon and widely adopted SCD Type 5.

In short, SCD Type 5 does not exist because it was never proposed or documented as a distinct method for handling dimension changes. The existing SCD types (0, 1, 2, 3, 4, and 6) cover most use cases and have been widely adopted in the data warehousing community.

Role Playing Dimension

Role-playing dimension is a term used in data warehousing that refers to a dimension used for multiple purposes within the same database. Essentially, the same dimension table is linked to the fact table multiple times, each playing a different role. This concept is often used when a single physical dimension can have different meanings in different contexts.

Date Dimension: This is the most common example. A single date dimension table can be used to represent different types of dates in a fact table, such as:

  • Order Date: The date when an order was placed.
  • Shipping Date: The date when an order was shipped.
  • Delivery Date: The date when an order was delivered.

Employee Dimension in a Hospital Setting:

  • Attending Physician: The primary doctor responsible for a patient.
  • Referring Physician: The doctor who referred the patient to the hospital.
  • Admitting Physician: The doctor who admitted the patient to the hospital.

Product Dimension in Retail:

  • Ordered Product: The product that a customer ordered.
  • Returned Product: The product that a customer returned.
  • Replacement Product: The product that was sent as a replacement for a returned item.

Why Use Role-Playing Dimensions?

  • Efficiency: It's efficient in terms of storage as you don't need to create multiple dimension tables for each role.
  • Consistency: Ensures consistency across different business processes since the same dimension table is used.
  • Flexibility: Offers flexibility in querying and reporting. You can easily compare and contrast different aspects (like order date vs. delivery date).
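
A hedged sketch of how a role-playing date dimension is typically queried, assuming a hypothetical fact_orders table with separate order/ship/delivery date keys and a dim_date table keyed by date_id: the same dimension table is joined several times under different aliases.

SELECT o.order_id,
       od.shortdate AS order_date,
       sd.shortdate AS ship_date,
       dd.shortdate AS delivery_date
FROM fact_orders o
JOIN dim_date od ON o.order_date_key    = od.date_id   -- role: Order Date
JOIN dim_date sd ON o.ship_date_key     = sd.date_id   -- role: Shipping Date
JOIN dim_date dd ON o.delivery_date_key = dd.date_id;  -- role: Delivery Date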

Conformed vs Role Playing

Conformed and role-playing dimensions are both concepts in data warehousing, but they serve different purposes.

A conformed dimension is a dimension that is shared across multiple fact tables in a data warehouse. A conformed dimension has the same meaning and structure in all fact tables and is used to maintain consistency and integrity in the data warehouse.

For example, a Date dimension could be used in multiple fact tables that record sales, inventory, and customer data. The Date dimension maintains its structure and meaning across all fact tables, ensuring consistent and accurate analysis.

Shrunken Dimension

A shrunken dimension is a subset of a dimension’s attributes that apply to a higher summary level. For example, a Month dimension would be a shrunken dimension of the Date dimension. The Month dimension could be connected to a forecast fact table whose grain is at the monthly level.

(Sales Dimension Example)

Suppose you have a dataset with 100 features (or variables) that describe some phenomenon. However, you suspect that not all of these features are relevant to predicting the outcome you're interested in. You suspect many of these features are noisy, redundant, or irrelevant.


CREATE TABLE dim_patient (
    PatientID INT PRIMARY KEY,
    Name STRING,
    Age INT,
    Gender STRING,
    Race STRING,
    Height DECIMAL,
    Weight DECIMAL,
    EyeColor STRING,
    MaritalStatus STRING,
    Address STRING,
    City STRING,
    State STRING,
    ZipCode STRING,
    PhoneNumber STRING,
    EmailAddress STRING,
    InsuranceProvider STRING,
    InsurancePolicyNumber STRING,
    PrimaryCarePhysician STRING
);

In this shrunken dimension, you're reducing the attribute set to only those necessary for BMI analysis, simplifying the data model for this specific use case.


CREATE TABLE dim_patient_bmi (
    PatientID INT PRIMARY KEY,
    Name STRING,
    Height DECIMAL,
    Weight DECIMAL,
    BMI DECIMAL
);


CREATE TABLE dim_patient_demographics (
    PatientID INT PRIMARY KEY,
    Age INT,
    Gender STRING,
    Race STRING
);

Swappable Dimension

Swappable dimensions are used in data warehousing when you have multiple versions of a dimension and you want to switch between them for different types of analysis. This concept is particularly useful in scenarios where different perspectives or classifications are needed for the same underlying data.

Key Components

Fact Table: SalesFact

Contains measures like QuantitySold and TotalSalesAmount. Has a foreign key (ProductID) that can link to either product dimension.

Multiple Dimension Tables:

  • StandardProductDimension
  • SeasonalProductDimension

Common Identifier:

Both dimension tables share ProductID as their primary key

Different Attributes:

  • StandardProductDimension: Category, SubCategory, Brand
  • SeasonalProductDimension: Season, HolidayTheme, WeatherAppropriate

Functionality:

Allows switching between different product perspectives in analyses. Enables querying sales data using either standard or seasonal product attributes.

Benefits:

  • Flexibility in reporting and analysis
  • Accommodates different business needs or seasonality
  • Optimizes storage by separating rarely-used attributes

Use Case Example:

  • Use StandardProductDimension for regular inventory analysis
  • Switch to SeasonalProductDimension for holiday sales planning

Implementation Note: Requires careful ETL processes to ensure data consistency across dimensions.
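
A hedged sketch of the idea, using the table and column names listed above: the same SalesFact rows can be joined to either product dimension, depending on the analysis.

-- Regular inventory analysis: standard product attributes
SELECT p.Category, SUM(f.TotalSalesAmount) AS total_sales
FROM SalesFact f
JOIN StandardProductDimension p ON f.ProductID = p.ProductID
GROUP BY p.Category;

-- Holiday sales planning: swap in the seasonal product dimension
SELECT s.HolidayTheme, SUM(f.TotalSalesAmount) AS total_sales
FROM SalesFact f
JOIN SeasonalProductDimension s ON f.ProductID = s.ProductID
GROUP BY s.HolidayTheme;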

Step Dimension

A Step Dimension in data warehousing represents a process that involves several steps or stages, each of which might need to be analyzed separately. This type of dimension is beneficial in scenarios where a process progresses through distinct phases, and you want to track or analyze each phase individually.

Step Dimension: OrderStatusDimension

This dimension table represents the different steps in the order processing lifecycle.


CREATE TABLE OrderStatusDimension (
    StatusID INT PRIMARY KEY,
    StatusName VARCHAR(100),
    Description VARCHAR(255)
);

| StatusID | StatusName | Description |
| --- | --- | --- |
| 1 | Order Placed | Order has been placed |
| 2 | Payment Processed | Payment has been received |
| 3 | Shipped | Order has been shipped |
| 4 | Delivered | Order has been delivered |

Fact Table: OrderFact

The fact table tracks each order along with its current status.


CREATE TABLE OrderFact (
    OrderID INT PRIMARY KEY,
    DateKey INT,
    CustomerID INT,
    ProductID INT,
    StatusID INT,  -- Foreign Key to OrderStatusDimension
    Quantity INT,
    TotalAmount DECIMAL
);

SQL Query Example

To analyze the number of orders at each status:

SELECT
    osd.StatusName,
    COUNT(*) AS NumberOfOrders
FROM
    OrderFact ofact
JOIN
    OrderStatusDimension osd ON ofact.StatusID = osd.StatusID
GROUP BY
    osd.StatusName;

Remember the Accumulating Snapshot Fact table?

Step dimensions are closely connected to the Accumulating Snapshot Fact table.


CREATE TABLE OrderProcessFact (
    OrderID INT PRIMARY KEY,
    CustomerID INT,
    ProductID INT,
    OrderDate DATE,
    PaymentDate DATE NULL,
    ShipDate DATE NULL,
    DeliveryDate DATE NULL,
    QuantityOrdered INT,
    TotalAmount DECIMAL,
    CurrentStatus VARCHAR(100)
);
  • When an order is placed, a new record is inserted with the OrderDate and initial CurrentStatus.
  • As the order progresses through payment, shipping, and delivery, the respective date fields and the CurrentStatus are updated.
  • This table allows for analysis of the duration between different stages of the order process, identification of bottlenecks, and overall process efficiency.

Temporal

Temporal refers to anything related to time. In the context of databases, data warehousing, and analytics, "temporal" generally describes aspects that change or evolve over time, or that are connected to the tracking of time.

Examples

Temporal Data: Data that represents a point or duration in time. This can include:

  • Timestamps: A specific point in time (e.g., "2020-10-15 10:30:00").
  • Dates: A particular day, month, or year (e.g., "2020-10-15").
  • Time Ranges: A start and end date (e.g., "2020-10-01 to 2020-10-15").

Temporal Dimension: A dimension (like dim_date or dim_time) that stores and manages information about time. It's used in queries to analyze data based on time intervals (e.g., sales per month, inventory levels over time).

Temporal Relationships: These are relationships between entities that change over time. For example, in customer data, an individual might change their address, and you may want to track how that address evolves over time.

Temporal Analysis: Analyzing how things change over time, such as:

  • Trend Analysis: How sales grow or decline over time.
  • Duration Analysis: How long an order stays in each processing step.
  • Time-Series Analysis: Data points collected or recorded at specific time intervals to understand patterns (e.g., daily stock prices).

Temporal Context is often important when we talk about step dimensions, slowly changing dimensions (SCDs), or any type of analysis that looks at the evolution of data over time.

Types of Facts

Factless Fact Table

Ideally, Fact tables should contain some Measurements. What if there is nothing to measure?

A factless fact table is a type of fact table in a data warehouse that contains only foreign keys and no measures. It represents a many-to-many relationship between dimensions without any associated numerical measures.

Here is an example of a factless fact table for a university enrollment system:

| student_id | course_id | semester_id |
| --- | --- | --- |
| 1 | 101 | 202201 |
| 2 | 102 | 202201 |
| 3 | 101 | 202201 |
| 3 | 103 | 202201 |
| 4 | 104 | 202201 |

In this example, the fact table captures the enrollment of students in courses for a particular semester. It contains only foreign keys to the student, course, and semester dimensions and does not contain any measures such as enrollment count or grade.

This type of fact table is useful in scenarios where we need to analyze the relationships between dimensions without any numerical measures. For example, we might use this factless fact table to answer questions such as:

  • Which students are enrolled in more than one course in a semester?
  • Which courses have no students enrolled in a semester?
  • Which students have not enrolled in any courses in a semester?
  • Which courses are only offered in one semester?

By analyzing the relationships between dimensions in this way, we can gain insights into the behavior and patterns of our data without relying on numerical measures.
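
For example, a hedged sketch of the first question, assuming the factless fact table above is named fact_enrollment:

-- Students enrolled in more than one course in a given semester
SELECT student_id, COUNT(*) AS courses_enrolled
FROM fact_enrollment
WHERE semester_id = 202201
GROUP BY student_id
HAVING COUNT(*) > 1;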

Good use case for building Aggregate Fact Tables.

Transaction Fact

The simplest fact table, capturing data at the lowest level.

What is the GRAIN here?

By date, time, store, product

This is a Transaction Fact table. It captures data at the lowest level of detail.

It helps you to find

What are the sales of Store 001?

What is the sale of a particular product per day?

  • Simple structure.
  • Keeps most detailed level data.
  • Easy to aggregate.

Periodic Fact

Stores current state at regular intervals.

It explains the state of entities at a particular instant in time.

The time interval can be hourly/daily/weekly.

Use Case:

  • Analyze business performance at a fixed interval.
  • Status of all flights.
  • Inventory stock. How much stock is left at the end of the day?

Transaction Fact Table is the source, and this stores a snapshot of info.

GRAIN for this table will be hourly / 2 hrs / daily based on need.

Accumulating Snapshot Fact Table

An Accumulating Snapshot Fact Table is used to track processes or workflows that have a definite beginning and end but take place over time. It allows you to capture the state of a process at different stages and update facts as the process progresses.

A classic example of an accumulating snapshot is in the order processing workflow, where we track the progress of an order from placement through to fulfillment and shipping. The fact table accumulates data as each order moves through various stages.

Amazon Order.. what happened at several time intervals.

As the process moves from one step to another, these get updated.

Accumulating snapshot always has different date fields denoting multiple stages of the process.

Use Cases

Track Returns: You can easily track which orders have been returned by checking the Return Date and filtering by Order Status = "Returned."

Monitor the Order Workflow: You can monitor the progress of orders through the entire workflow by checking how long it takes from Order Date to Payment Date, Shipment Date, Delivery Date, and Return Date (if applicable).

Calculate Performance Metrics: Calculate metrics like average delivery time, return rates, or the percentage of orders that are still pending or awaiting shipment.

Example Queries

Find all returned orders: Query for orders where Return Date is not NULL.

Calculate time from order to delivery: Subtract Order Date from Delivery Date to calculate how long deliveries take.

Track pending orders: Filter where Order Status is "Pending" or Shipment Date is NULL.

Another example: Hospital entry

How the Accumulating Snapshot Works:

Row Creation: A new row is created when the patient is admitted, with basic information like Admission Date, Doctor ID, and Diagnosis.

Row Updates: As the patient's treatment progresses, additional fields like Treatment Start Date, Treatment End Date, Discharge Date, and Follow-up Date are updated.

Transaction vs Periodic vs Accumulating

Difference between 3 Fact Tables

| Transaction | Periodic | Accumulating |
| --- | --- | --- |
| Stores the lowest grain of data. | Stores the current state of data at a regular interval of time. | Stores intermediate steps that have happened over a period of time. |
| One row per transaction | One row per time period | One row per entire lifetime of the event |
| Date dimension is at the lowest level | Date/time is the regular interval of the snapshot frequency | Multiple date dimensions for intermediate steps |
| Easy to aggregate | Minimal aggregation | Not easy to aggregate |
| Largest database size | Medium database size | Smallest database size |
| Only Insert | Only Insert | Insert and Update |
| Fits most business requirements | Business performance is reviewed at a regular interval | The business process has multiple stages |

Additive, Semi-Additive, Non-Additive

In data warehousing, a fact table contains measurements or facts, which can be categorized into three types: additive, semi-additive, and non-additive.

Additive Facts

These can be summed across all dimensions of the fact table. Additive facts are the most straightforward to aggregate and are commonly used in reporting.

Examples:

  • Sales revenue: Can be summed by time, product, region, etc.
  • Quantity sold: Summable across product, time, store dimensions.
  • Profit: Aggregates well across dimensions.
  • Number of website visits: Can be summed up by day, user, or region.
  • Number of clicks on a banner ad: Can be aggregated across time, campaigns, etc.

Semi-Additive Facts

These can be summed across some dimensions but not others. For example, summing across time might not make sense for certain facts like balances or inventory, which are snapshot-based.

Examples:

  • Bank balance: Can be summed across accounts but not over time (as balances are snapshots at specific points in time).
  • Stock price: Cannot be summed across time but might be relevant for a single stock across locations or markets.
  • Inventory levels: Summable across product or location but not over time.
  • Number of employees: Summable by department, but not over time as it's a snapshot metric.
  • Number of students enrolled in a course: Summable by course but not over time.

Non-Additive Facts

These cannot be summed across any dimension. These are usually metrics like ratios, percentages, or averages that do not make sense to sum.

Examples:

  • Profit margin: It's a ratio and cannot be added up across regions or time.
  • Gross margin percentage: Like profit margin, it’s a percentage and non-additive.
  • Average temperature: Averages do not sum well; they need to be recalculated for larger groups.
  • Average customer satisfaction rating: Cannot sum up ratings; they need to be averaged again for larger sets.
  • Percentage of market share: Similar to margins, percentages cannot be summed.

It is essential to identify the type of fact as it determines how the fact table will be aggregated and also impacts the design of the data warehouse.

Example

Additive Facts:

Sales Revenue: Can be summed across all dimensions (e.g., total sales revenue by day, store, product, etc.).

Quantity Sold: Additive across all dimensions (e.g., sum of quantities sold by product, store, or day).

Total Sales Revenue for P001 across all stores on 10/01/2023 = 500 (S001) + 600 (S002) = 1100.

Total Quantity Sold for P001 on 10/01/2023 = 10 (S001) + 12 (S002) = 22.

Semi-Additive Facts:

Inventory Level: This represents a snapshot of inventory at a given time. You can sum inventory across stores but not across time (e.g., you can sum inventory for a product across different stores at a single point in time, but not over multiple days).

Inventory Level for P001 on 10/01/2023 across all stores = 200 (S001) + 180 (S002) = 380.

Inventory Level across time would not be summed (e.g., you cannot add inventory levels on 10/01 and 10/02).

Non-Additive Facts:

Profit Margin (%): Cannot be summed across any dimension, as it is a percentage. For example, adding profit margins across stores or products would not yield a meaningful result. Instead, you'd need to compute a weighted average.

Average Customer Rating: Similar to profit margin, it cannot be summed across stores or time. It would need to be recalculated as an average if you wanted to aggregate it.

Average Profit Margin for P001 on 10/01/2023 would need to be recalculated based on weighted averages of sales revenues or units sold.

Average Customer Rating would need to be recalculated based on customer reviews or feedback.
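
A hedged sketch of that recalculation, assuming a hypothetical fact_sales table with additive revenue and profit columns: the margin is recomputed from the additive facts rather than summed or averaged directly.

-- Weighted average profit margin per product and day,
-- derived from additive facts (profit, revenue) instead of summing margins
SELECT product_id,
       sales_date,
       SUM(profit) / NULLIF(SUM(revenue), 0) * 100 AS profit_margin_pct
FROM fact_sales
GROUP BY product_id, sales_date;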

Periodic Snapshot vs Additive

Periodic snapshot fact and additive fact are different concepts, although they may appear similar at first glance.

A periodic snapshot fact table is designed to capture a point-in-time snapshot of some measure at regular intervals, typically periodically, such as daily, weekly, monthly, or quarterly.

An additive fact represents a measure that can be aggregated across all dimensions in a fact table. For example, sales revenue is an additive fact because it can be summed up across all dimensions such as time, product, region, etc.

| | Additive Fact | Periodic Snapshot Fact |
| --- | --- | --- |
| Definition | Represents a measure that can be aggregated across all dimensions in a fact table | Captures a point-in-time snapshot of some measure at regular intervals |
| Granularity | Can be at any level of granularity | Usually at a higher level of granularity such as daily, weekly, monthly, or quarterly |
| Aggregation | Can be summed up across all dimensions in the fact table | Captures the total value of the measure at a specific point in time |
| Example | Sales revenue, profit, number of clicks on a banner ad | Monthly sales revenue, quarterly website visits, daily customer support tickets |

Conformed Fact

A conformed fact is a measure (attribute) used in more than one fact table.

The conformed Fact table is a fact table that can be used across multiple Data Marts.

Suppose a company has separate fact tables for Online Sales and In-store Sales. Both these fact tables might have a "Total Sales" metric.

To ensure that this metric is a conformed fact, the company would need to make sure that "Total Sales" has the same meaning (i.e., the total revenue from sales), the same scale (e.g., in US dollars), and is calculated in the same way (e.g., quantity sold times unit price) in both fact tables.

Miscellaneous

CSV to Dimension Models Example

Sample Data

https://github.com/gchandra10/filestorage/blob/main/sales_100.csv

Design using https://mermaid.live/

---
title: Sales 100 example
---

erDiagram
  dim_region {
    region_id int
    region_name string
  }
  
  dim_country {
    country_id int
    region_id int
    country_name string
  }
  
  dim_item_type {
    item_type_id int
    item_type_name string
  }
  
  dim_sales_channel {
    sales_channel_id int
    sales_channel_type string
  }

  dim_order_priority{
    order_priority_id int
    order_priority string
  }

  dim_date{
    date_id int
    shortdate date
    day_of_week string
    is_weekend boolean
    day int
    month int
    week_of_month int
    week_of_year int
    quarter string
    year int
    leap_year boolean
  }
    
  fact_order_transaction {
    id int
    order_id int
    country_id int
    item_type_id int
    sales_channel_id int
    order_priority_id int

    order_date_id int
    ship_date_id int
    
    units_sold int
    unit_price float
    unit_cost float
    total_revenue float
    total_cost float
    total_profit float
  }

  dim_month {
    month_id int pk
    month string
    year int
    quarter string
    is_end_of_year boolean
  }

  fact_quarterly_sales {
    id int
    month_id int
    total_units_sold int
    total_revenue float
    total_cost float
    total_profit float
  }

  fact_region_sales{
    id int
    region_id int
    sales_channel_id int
    total_units_sold int
    total_revenue float
    total_cost float
    total_profit float
  }
  
dim_region ||--o{ dim_country : "region_id"
dim_country ||--o{ fact_order_transaction : "country_id"
dim_item_type ||--o{ fact_order_transaction : "item_type_id"
dim_sales_channel ||--o{ fact_order_transaction : "sales_channel_id"
dim_order_priority  ||--o{ fact_order_transaction : "order_priority_id"
dim_date ||--o{ fact_order_transaction : "order_date_id"
dim_date ||--o{ fact_order_transaction : "ship_date_id"

dim_month ||--o{ fact_quarterly_sales : "month_id"

dim_region ||--o{ fact_region_sales : "region_id"
dim_sales_channel ||--o{ fact_region_sales : "sales_channel_id"
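
A rough Python sketch of how the flat CSV could be split into a couple of these dimensions and the transaction fact. The column names assume the usual layout of this 100-sales-records file (Region, Item Type, Sales Channel, Order ID, and so on); verify them against the actual CSV before running.

```python
import pandas as pd

# Sketch: turn the flat sales_100.csv into dimensions with surrogate keys plus a fact table.
# Column names below are assumptions about the file layout; adjust to the real header row.
df = pd.read_csv("sales_100.csv")

def build_dim(frame: pd.DataFrame, column: str, id_name: str, value_name: str) -> pd.DataFrame:
    """Create a dimension from the distinct values of one column, with a surrogate key."""
    dim = frame[[column]].drop_duplicates().reset_index(drop=True)
    dim.insert(0, id_name, dim.index + 1)
    return dim.rename(columns={column: value_name})

dim_region = build_dim(df, "Region", "region_id", "region_name")
dim_item_type = build_dim(df, "Item Type", "item_type_id", "item_type_name")
dim_sales_channel = build_dim(df, "Sales Channel", "sales_channel_id", "sales_channel_type")

# Replace natural values in the fact with surrogate keys by joining back to the dimensions.
fact = (
    df.merge(dim_item_type, left_on="Item Type", right_on="item_type_name")
      .merge(dim_sales_channel, left_on="Sales Channel", right_on="sales_channel_type")
      [["Order ID", "item_type_id", "sales_channel_id",
        "Units Sold", "Unit Price", "Unit Cost",
        "Total Revenue", "Total Cost", "Total Profit"]]
)
print(fact.head())
```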


Sample Data Architecture Diagram

https://learn.microsoft.com/en-us/azure/architecture/example-scenario/analytics/sports-analytics-architecture-azure

How do you draw architectural diagrams?

Data Pipeline Models

Sequence Model

The Sequence Model represents a linear flow where data moves from a Source to a Process stage and finally to a Sink. This model is ideal for straightforward data transformation and processing tasks, where data is ingested, cleaned, and transformed sequentially before being stored in the final destination. This approach aligns with the Bronze → Silver → Gold (medallion) layering strategy common in data engineering.

Funnel Model

The Funnel Model aggregates data from multiple Sources, through a central processing stage, into a single Sink. This model is useful when combining data from various origins, such as multiple databases or external APIs, into a unified repository. The processing stage consolidates and harmonizes data from these different sources before loading it into the destination, ensuring that data from diverse inputs is integrated into one output.

Fan-out/Star Model

The Fan-out/Star Model starts with a single Source, processes the data, and then distributes the results to multiple Sinks. This model is effective when the processed data needs to be utilized in various downstream applications or systems. It allows the same source data to be transformed and delivered to different destinations, each serving a unique purpose or system requirement.

Each model serves distinct data engineering needs, from simple data pipelines to complex ETL processes involving multiple sources or destinations.
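
A toy Python sketch of the three shapes, with a made-up process() step standing in for cleaning and transformation; the function and variable names are illustrative only.

```python
from typing import Iterable, List

Record = dict

def process(records: Iterable[Record]) -> List[Record]:
    """Stand-in transform step: drop rows without a name and normalize the name."""
    return [{**r, "name": r["name"].strip().title()} for r in records if r.get("name")]

# Sequence: one Source -> Process -> one Sink.
def sequence(source: List[Record], sink: List[Record]) -> None:
    sink.extend(process(source))

# Funnel: many Sources -> Process -> one Sink.
def funnel(sources: List[List[Record]], sink: List[Record]) -> None:
    for source in sources:
        sink.extend(process(source))

# Fan-out/Star: one Source -> Process -> many Sinks.
def fan_out(source: List[Record], sinks: List[List[Record]]) -> None:
    cleaned = process(source)
    for sink in sinks:
        sink.extend(cleaned)

raw = [{"name": " alice "}, {"name": ""}, {"name": "bob"}]
warehouse, reporting_mart, ml_feature_store = [], [], []
fan_out(raw, [warehouse, reporting_mart, ml_feature_store])
print(warehouse)   # [{'name': 'Alice'}, {'name': 'Bob'}]
```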

New DW Concepts

  1. Cloud Data Warehousing: With the increasing popularity of cloud computing, cloud data warehousing has become a popular concept. It involves storing data in a cloud-based data warehouse rather than an on-premises one. This allows for greater scalability, flexibility, and cost savings.

Examples: Databricks, Snowflake, Azure Synapse, and so on.

  2. Data Virtualization: Data virtualization is a technique that allows data to be accessed and integrated from multiple sources without the need for physical data movement or replication. This can help reduce data redundancy and improve data consistency.

  3. Self-Service BI: Self-service BI allows business users to access and analyze data without relying on IT or data analysts. This concept has become popular with user-friendly data visualization tools enabling users to create reports and dashboards.

  4. Big Data Analytics: Big data analytics involves using advanced analytics techniques to analyze large and complex datasets. This requires specialized tools and technologies, such as Hadoop and Spark, to process and analyze large volumes of data.

  5. Data Governance: Data governance involves establishing policies, standards, and procedures for managing data assets. This helps ensure data accuracy, consistency, and security, and that data is used in a way that aligns with organizational goals and objectives.

  6. Delta Sharing: With Delta Sharing, organizations can share their data with partners, customers, and other stakeholders without having to move or copy the data. This can help reduce data duplication and improve data governance while allowing for more collaborative and agile data sharing.

Overall, these new data warehousing concepts are focused on improving the speed, flexibility, and accessibility of data and ensuring that data is used in a way that supports organizational objectives.

  7. DataOps: DataOps is a methodology that emphasizes collaboration, automation, and monitoring to improve the speed and quality of data analytics. It combines DevOps and agile methods to create a more efficient and streamlined data pipeline.

  8. Data Mesh: Data Mesh is an architectural approach emphasizing decentralization and domain-driven data architecture design. It involves breaking down data silos and creating a more flexible and scalable data architecture that aligns with business needs.

  9. Augmented Analytics: Augmented analytics is a technique that uses machine learning and artificial intelligence to automate data preparation, insight generation, and insight sharing. It aims to improve the speed and accuracy of data analytics while reducing the reliance on data scientists and analysts.

  10. Real-time Data Warehousing: Real-time data warehousing involves using streaming technologies, such as Apache Kafka, to capture and process data as it arrives. This enables organizations to analyze and act on data immediately rather than waiting for batch processing cycles.

  11. Data Privacy and Ethics: Data privacy and ethics are becoming increasingly important in data warehousing and analytics. Organizations focus on ensuring that data is collected, stored, and used ethically and responsibly and that data privacy regulations, such as GDPR and CCPA, are followed.

These are just a few new data warehousing concepts emerging in response to the changing data landscape. As data volumes continue to grow and technologies continue to evolve, we can expect to see continued innovation in data warehousing and analytics.

Dataset Examples

Retail Sales Transactions: This dataset includes individual retail store or chain sales transactions. It typically contains transaction ID, customer ID, product ID, quantity, price, and timestamp.

| Transaction ID | Customer ID | Product | Quantity | Price | Timestamp |
|---|---|---|---|---|---|
| T001 | C001 | Apple iPhone X | 5 | $10.99 | 2023-05-10 10:30:15 |
| T002 | C002 | Samsung Galaxy S9 | 3 | $24.99 | 2023-05-10 14:45:21 |
| T003 | C003 | HP Printer | 2 | $5.99 | 2023-05-11 09:12:33 |

Financial Transactions: This dataset consists of financial transactions from banks or credit card companies. It includes transaction ID, account number, transaction type, amount, merchant information, and timestamp.

| Transaction ID | Account Number | Transaction Type | Amount | Merchant | Timestamp |
|---|---|---|---|---|---|
| T001 | A123456 | Deposit | $100.00 | XYZ Bank | 2023-05-10 08:15:30 |
| T002 | B987654 | Withdrawal | -$50.00 | ABC Store | 2023-05-10 15:20:45 |
| T003 | C246810 | Transfer | $250.00 | Online Shopping | 2023-05-11 11:05:12 |

Online Marketplace Transactions: This dataset contains transactions from an online marketplace like Amazon, eBay, or Alibaba. It includes transaction ID, buyer/seller ID, product ID, quantity, price, shipping details, and timestamps.

| Transaction ID | Buyer | Seller | Product | Quantity | Price | Shipping Details | Timestamp |
|---|---|---|---|---|---|---|---|
| T001 | John Doe | SellerA | Book | 1 | $19.99 | Address1, City1, State1 | 2023-05-10 13:45:27 |
| T002 | Jane Smith | SellerB | Smartphone | 2 | $49.99 | Address2, City2, State2 | 2023-05-10 16:20:10 |
| T003 | Mark Johnson | SellerC | Headphones | 3 | $9.99 | Address3, City3, State3 | 2023-05-11 10:05:55 |

E-commerce Order Transactions: This dataset focuses on transactions from an e-commerce website. It includes data such as order ID, customer ID, product ID, quantity, price, shipping information, payment details, and timestamps.

| Order ID | Customer | Product | Quantity | Price | Shipping Information | Payment Details | Timestamp |
|---|---|---|---|---|---|---|---|
| O001 | John Doe | Apple iPhone X | 2 | $29.99 | Address1, City1, State1 | Credit Card | 2023-05-10 11:30:45 |
| O002 | Jane Smith | Samsung Galaxy S9 | 1 | $14.99 | Address2, City2, State2 | PayPal | 2023-05-10 17:15:20 |
| O003 | Mark Johnson | HP Printer | 4 | $39.99 | Address3, City3, State3 | Credit Card | 2023-05-11 08:45:10 |

Travel Booking Transactions: This dataset comprises transactions from a travel booking website or agency. It includes booking ID, traveler details, flight/hotel ID, dates, prices, payment information, and timestamps.

| Booking ID | Traveler | Flight | Hotel | Dates | Price | Payment Information | Timestamp |
|---|---|---|---|---|---|---|---|
| B001 | John Doe | Flight 123 | Hotel ABC | 2023-06-15 - 2023-06-20 | $500.00 | Credit Card | 2023-05-10 09:30:15 |
| B002 | Jane Smith | Flight 456 | Hotel XYZ | 2023-07-01 - 2023-07-10 | $750.00 | PayPal | 2023-05-11 14:20:30 |
| B003 | Mark Johnson | Flight 789 | Hotel PQR | 2023-08-10 - 2023-08-15 | $600.00 | Credit Card | 2023-05-12 11:45:55 |

Stock Market Transactions: This dataset involves transactions from stock exchanges. It includes details like trade ID, stock symbol, buy/sell order, quantity, price, trader information, and timestamps.

| Transaction ID | Stock Symbol | Buy/Sell | Quantity | Price | Trader | Timestamp |
|---|---|---|---|---|---|---|
| T001 | AAPL | Buy | 100 | $150.50 | John Doe | 2023-05-10 09:30:15 |
| T002 | GOOGL | Sell | 50 | $2500.00 | Jane Smith | 2023-05-10 11:15:45 |
| T003 | MSFT | Buy | 75 | $180.75 | Mark Johnson | 2023-05-10 14:40:20 |

Datasets

https://cseweb.ucsd.edu/~jmcauley/datasets.html

Thoughts on data

Don't remove NULL columns or bad data from the source. Learn to handle them during processing.

Sample 1

| Serial_Number | List_Year | Code |
|---|---|---|
| 200932 | 2020 | 10 - A Will |
| 200192 | 2020 | 14 - Foreclosure |
| 190871 | 2019 | 18 - In Lieu Of Foreclosure |

Create a separate dimension table for Code, splitting it into a code number (code_no) and a code name (code_name); a small parsing sketch follows the column lists below.

dim_code
  - id
  - code_no
  - code_name

fact
  - Serial_Number
  - List_Year
  - dim_id
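
A minimal parsing sketch for this split, assuming the Code column always follows the "number - name" pattern shown above:

```python
# Illustrative split of the combined Code column ("10 - A Will") into
# a code number and a code name for dim_code.
rows = [
    (200932, 2020, "10 - A Will"),
    (200192, 2020, "14 - Foreclosure"),
    (190871, 2019, "18 - In Lieu Of Foreclosure"),
]

dim_code = {}   # code_no -> (dim_id, code_name)
fact = []       # (Serial_Number, List_Year, dim_id)

for serial_number, list_year, code in rows:
    code_no, _, code_name = (part.strip() for part in code.partition("-"))
    if code_no not in dim_code:
        dim_code[code_no] = (len(dim_code) + 1, code_name)
    fact.append((serial_number, list_year, dim_code[code_no][0]))

print(dim_code)
print(fact)
```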

Sample 2

| TripID | Source Lat | Source Long | Dest Lat | Dest Long |
|---|---|---|---|---|
| 1 | -73.9903717 | 40.73469543 | -73.98184204 | 40.73240662 |
| 2 | -73.98078156 | 40.7299118 | -73.94447327 | 40.71667862 |
| 3 | -73.98455048 | 40.67956543 | -73.95027161 | 40.78892517 |

Separate Dimension for Lat and Long

dim_location
  - location_id
  - lat
  - long
  - location_name (if it exists)

Updated Fact table

fact
  - trip_id
  - source_location_id
  - destination_location_id

Another Variation

Location
POINT (-72.98492 41.64753)
POINT (-72.96445 41.25722)

Notes:

DRY Principle: You're not repeating the lat-long info, adhering to the "Don't Repeat Yourself" principle.

Ease of Update: If you need to update a location's details, you do it in one place.

Flexibility: Easier to add more attributes to locations in the future.

5 decimal places: Accurate to ~1.1 meters, usually good enough for most applications including vehicle navigation.

4 decimal places: Accurate to ~11 meters, may be suitable for some applications but not ideal for vehicle-level precision.

3 decimal places: Accurate to ~111 meters, generally too coarse for vehicle navigation but might be okay for city-level analytics.
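
A small sketch that parses the POINT (long lat) strings, rounds to 5 decimal places, and deduplicates them into dim_location; the regex and the rounding precision are illustrative choices, not requirements:

```python
import re

# Sketch: parse well-known-text points, round to 5 decimals (~1.1 m), and
# assign one surrogate key per distinct location.
points = ["POINT (-72.98492 41.64753)", "POINT (-72.96445 41.25722)", "POINT (-72.98492 41.64753)"]

dim_location = {}   # (lat, long) -> location_id

def location_key(wkt: str) -> tuple:
    # WKT stores longitude first, then latitude.
    match = re.match(r"POINT \((-?\d+\.\d+) (-?\d+\.\d+)\)", wkt)
    longitude, latitude = (round(float(v), 5) for v in match.groups())
    return (latitude, longitude)

trip_location_ids = []
for wkt in points:
    key = location_key(wkt)
    if key not in dim_location:
        dim_location[key] = len(dim_location) + 1
    trip_location_ids.append(dim_location[key])

print(dim_location)        # two distinct locations
print(trip_location_ids)   # [1, 2, 1]
```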


Sample 3

| Fiscal Year | Disbursement Date | Vendor Invoice Date | Vendor Invoice Week | Check Clearance Date |
|---|---|---|---|---|
| 2023 | 06-Oct-23 | 08-Aug-23 | 08-06-2023 | |
| 2023 | 06-Oct-23 | 16-Aug-23 | 08/13/2023 | |
| 2023 | 06-Oct-23 | 22-Sep-23 | 09/17/2023 | 10-08-2023 |

Date Dimension

| DateKey | FullDate | Year | Month | Day | Weekday | WeekOfYear | Quarter | IsWeekend | IsHoliday |
|---|---|---|---|---|---|---|---|---|---|
| 20230808 | 2023-08-08 | 2023 | 8 | 8 | 2 | 32 | 3 | FALSE | FALSE |
| 20230816 | 2023-08-16 | 2023 | 8 | 16 | 3 | 33 | 3 | FALSE | FALSE |
| 20230922 | 2023-09-22 | 2023 | 9 | 22 | 5 | 38 | 3 | FALSE | FALSE |
| 20231006 | 2023-10-06 | 2023 | 10 | 6 | 5 | 40 | 4 | FALSE | FALSE |
| 20231008 | 2023-10-08 | 2023 | 10 | 8 | 7 | 40 | 4 | TRUE | FALSE |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |

Fact Table

| Fiscal Year | Disbursement Date Key | Vendor Invoice Date Key | Vendor Invoice Week Date Key | Check Clearance Date Key |
|---|---|---|---|---|
| 2023 | 20231006 | 20230808 | 20230806 | |
| 2023 | 20231006 | 20230816 | 20230813 | |
| 2023 | 20231006 | 20230922 | 20230917 | 20231008 |

Date Dimension Sample
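
A minimal sketch of how rows like those in the date dimension above might be generated, including normalizing the mixed source formats from Sample 3 (06-Oct-23, 08-06-2023, 08/13/2023) into a single DateKey. The list of accepted formats is an assumption based on the sample values:

```python
from datetime import date, datetime

# Formats observed in the sample data; extend as needed for the real source.
FORMATS = ("%d-%b-%y", "%m-%d-%Y", "%m/%d/%Y", "%Y-%m-%d")

def parse_date(value: str) -> date:
    for fmt in FORMATS:
        try:
            return datetime.strptime(value, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {value}")

def date_dimension_row(d: date) -> dict:
    iso_year, iso_week, iso_weekday = d.isocalendar()
    return {
        "DateKey": int(d.strftime("%Y%m%d")),
        "FullDate": d.isoformat(),
        "Year": d.year,
        "Month": d.month,
        "Day": d.day,
        "Weekday": iso_weekday,            # 1 = Monday ... 7 = Sunday
        "WeekOfYear": iso_week,
        "Quarter": (d.month - 1) // 3 + 1,
        "IsWeekend": iso_weekday >= 6,
        "IsHoliday": False,                # would come from a holiday calendar lookup
    }

for raw in ("06-Oct-23", "08-06-2023", "08/13/2023"):
    print(date_dimension_row(parse_date(raw)))
```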


Sample 4

Another DateTime

| date_attending | ip_location |
|---|---|
| 2017-12-23 12:00:00 | Reseda, CA, United States |
| 2017-12-23 12:00:00 | Los Angeles, CA, United States |
| 2018-01-05 14:00:00 | Mission Viejo, CA, United States |

Date Time Dimension

| DateKey | FullDate | Year | Month | Day | Weekday | WeekOfYear | Quarter | IsWeekend | IsHoliday |
|---|---|---|---|---|---|---|---|---|---|
| 20171223 | 2017-12-23 | 2017 | 12 | 23 | 6 | 51 | 4 | TRUE | FALSE |
| 20180105 | 2018-01-05 | 2018 | 1 | 5 | 5 | 1 | 1 | FALSE | FALSE |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |

Fact Table

| FactID | DateKey | ip_location |
|---|---|---|
| 1 | 20171223 | Reseda, CA, United States |
| 2 | 20171223 | Los Angeles, CA, United States |
| 3 | 20180105 | Mission Viejo, CA, United States |
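
A short sketch of deriving the DateKey used above from the full timestamp; the time-of-day portion is simply truncated here (a separate time dimension would be needed to keep it):

```python
from datetime import datetime

# Sketch: convert the raw date_attending timestamp into a surrogate DateKey.
rows = [
    ("2017-12-23 12:00:00", "Reseda, CA, United States"),
    ("2018-01-05 14:00:00", "Mission Viejo, CA, United States"),
]

fact = []
for i, (date_attending, ip_location) in enumerate(rows, start=1):
    date_key = int(datetime.strptime(date_attending, "%Y-%m-%d %H:%M:%S").strftime("%Y%m%d"))
    fact.append({"FactID": i, "DateKey": date_key, "ip_location": ip_location})

print(fact)   # DateKeys 20171223 and 20180105
```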

Sample 5

| Job Title | Experience | Qualifications | Salary Range | Age_Group |
|---|---|---|---|---|
| Digital Marketing Specialist | 5 to 15 Years | M.Tech | $59K-$99K | Youth (<25) |
| Web Developer | 2 to 12 Years | BCA | $56K-$116K | Adults (35-64) |
| Operations Manager | 0 to 12 Years | PhD | $61K-$104K | Young Adults (25-34) |
| Network Engineer | 4 to 11 Years | PhD | $65K-$91K | Young Adults (25-34) |
| Event Manager | 1 to 12 Years | MBA | $64K-$87K | Adults (35-64) |

Experience Dimension

| ExperienceID | ExperienceRange | MinExperience | MaxExperience |
|---|---|---|---|
| 1 | 5 to 15 Years | 5 | 15 |
| 2 | 2 to 12 Years | 2 | 12 |
| 3 | 0 to 12 Years | 0 | 12 |
| 4 | 4 to 11 Years | 4 | 11 |
| 5 | 1 to 12 Years | 1 | 12 |
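
A small sketch of deriving MinExperience and MaxExperience from the text range, assuming the "N to M Years" pattern shown above:

```python
import re

# Sketch: split an experience range such as "5 to 15 Years" into numeric bounds.
def parse_experience(range_text: str) -> tuple:
    match = re.match(r"(\d+)\s*to\s*(\d+)", range_text)
    low, high = (int(v) for v in match.groups())
    return low, high

experience_dim = []
for i, rng in enumerate(["5 to 15 Years", "2 to 12 Years", "0 to 12 Years"], start=1):
    min_exp, max_exp = parse_experience(rng)
    experience_dim.append({"ExperienceID": i, "ExperienceRange": rng,
                           "MinExperience": min_exp, "MaxExperience": max_exp})

print(experience_dim)
```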

Job Dimension Table

| JobID | Job Title | Qualifications |
|---|---|---|
| 1 | Digital Marketing Specialist | M.Tech |
| 2 | Web Developer | BCA |
| 3 | Operations Manager | PhD |
| 4 | Network Engineer | PhD |
| 5 | Event Manager | MBA |

Age Group Dimension Table

| AgeGroupID | Age_Group |
|---|---|
| 1 | Youth (<25) |
| 2 | Adults (35-64) |
| 3 | Young Adults (25-34) |

Fact Table

| JobID | ExperienceID | Salary Range | AgeGroupID |
|---|---|---|---|
| 1 | 1 | $59K-$99K | 1 |
| 2 | 2 | $56K-$116K | 2 |
| 3 | 3 | $61K-$104K | 3 |
| 4 | 4 | $65K-$91K | 3 |
| 5 | 5 | $64K-$87K | 2 |

Sample 6

| ITEM CODE | ITEM DESCRIPTION |
|---|---|
| 100293 | SANTORINI GAVALA WHITE - 750ML |
| 100641 | CORTENOVA VENETO P/GRIG - 750ML |
| 100749 | SANTA MARGHERITA P/GRIG ALTO - 375ML |

The Fact will turn out to be like this

| ItemID | ItemCode | Item Description | Quantity |
|---|---|---|---|
| 1 | 100293 | SANTORINI GAVALA WHITE | 750ML |
| 2 | 100641 | CORTENOVA VENETO P/GRIG | 750ML |
| 3 | 100749 | SANTA MARGHERITA P/GRIG ALTO | 375ML |

Note: If the same Item Description repeats for various quantities, then create a separate table for Quantity.

Quantity Dimension

| QuantityID | Quantity |
|---|---|
| 1 | 750ML |
| 2 | 375ML |
| 3 | 500ML |
| ... | ... |

Item Dimension

| ItemID | Item Name | QuantityID |
|---|---|---|
| 1 | SANTORINI GAVALA WHITE | 1 |
| 2 | CORTENOVA VENETO P/GRIG | 1 |
| 3 | SANTA MARGHERITA P/GRIG ALTO | 2 |
| ... | ... | ... |

Fact Table

| FactID | ItemID | QuantityID |
|---|---|---|
| 1 | 1 | 1 |
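
A minimal sketch of the split behind Sample 6: the description is divided on its last hyphen into an item name and a quantity, and each gets a surrogate key. The hyphen convention is an assumption based on the sample rows:

```python
# Sketch: split "SANTORINI GAVALA WHITE - 750ML" into an item name and a quantity,
# then assign surrogate keys for the quantity and item dimensions.
items = [
    (100293, "SANTORINI GAVALA WHITE - 750ML"),
    (100641, "CORTENOVA VENETO P/GRIG - 750ML"),
    (100749, "SANTA MARGHERITA P/GRIG ALTO - 375ML"),
]

dim_quantity = {}   # "750ML" -> QuantityID
dim_item = []       # (ItemID, ItemCode, Item Name, QuantityID)

for item_id, (item_code, description) in enumerate(items, start=1):
    name, _, quantity = (part.strip() for part in description.rpartition("-"))
    if quantity not in dim_quantity:
        dim_quantity[quantity] = len(dim_quantity) + 1
    dim_item.append((item_id, item_code, name, dim_quantity[quantity]))

print(dim_quantity)   # {'750ML': 1, '375ML': 2}
print(dim_item)
```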

Sample 7

| production_countries | spoken_languages |
|---|---|
| United Kingdom, United States of America | English, French, Japanese, Swahili |
| United Kingdom, United States of America | English |
| United Kingdom, United States of America | English, Mandarin |

Country Dimension

| CountryID | CountryName |
|---|---|
| 1 | United Kingdom |
| 2 | United States of America |

Language Dimension

| LanguageID | LanguageName |
|---|---|
| 1 | English |
| 2 | French |
| 3 | Japanese |
| 4 | Swahili |
| 5 | Mandarin |

Approach 1 : Creating Many to Many

| FactID | CountryID | LanguageID |
|---|---|---|
| 1 | 1 | 1 |
| 1 | 2 | 1 |
| 1 | 1 | 2 |
| 1 | 2 | 2 |

Pros

Normalization: Easier to update and maintain data.

Cons

Storage: May require more storage for the additional tables and keys.

Approach 2 : Fact Table with Array Datatypes

| FactID | ProductionCountryIDs | SpokenLanguageIDs |
|---|---|---|
| 1 | [1, 2] | [1, 2, 3, 4] |
| 2 | [1, 2] | [1] |
| 3 | [1, 2] | [1, 5] |

With newer systems such as Spark SQL, which support array data types:

  • Simplicity: Easier to understand and less complex to set up.
  • Performance: Could be faster for certain types of queries, especially those that don't require unpacking the array.
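
A small Python sketch contrasting the two approaches on the sample rows above: Approach 1 explodes each fact row into a country-by-language bridge, while Approach 2 keeps arrays of surrogate keys on the fact row (as systems with array types allow). The lookup dictionaries mirror the dimension tables above:

```python
# Sketch: handle the multi-valued production_countries / spoken_languages columns.
movies = [
    {"fact_id": 1,
     "production_countries": "United Kingdom, United States of America",
     "spoken_languages": "English, French, Japanese, Swahili"},
    {"fact_id": 2,
     "production_countries": "United Kingdom, United States of America",
     "spoken_languages": "English"},
]

country_ids = {"United Kingdom": 1, "United States of America": 2}
language_ids = {"English": 1, "French": 2, "Japanese": 3, "Swahili": 4, "Mandarin": 5}

def split_ids(value: str, lookup: dict) -> list:
    """Split a comma-separated list and map each name to its surrogate key."""
    return [lookup[name.strip()] for name in value.split(",")]

# Approach 1: explode into a many-to-many bridge (one row per country/language pair).
bridge = [
    (m["fact_id"], country, language)
    for m in movies
    for country in split_ids(m["production_countries"], country_ids)
    for language in split_ids(m["spoken_languages"], language_ids)
]

# Approach 2: keep arrays of surrogate keys directly on the fact row.
fact_with_arrays = [
    (m["fact_id"],
     split_ids(m["production_countries"], country_ids),
     split_ids(m["spoken_languages"], language_ids))
    for m in movies
]

print(bridge)
print(fact_with_arrays)
```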