Apache Spark
Apache Spark is an in-memory, distributed data analytics engine. It is a unified engine for big data processing; unified means it offers both batch and stream processing.
Apache Spark overcame the challenges of Hadoop by parallelizing work in memory and delivering high performance for distributed processing.
Its libraries increase developer productivity and can be seamlessly combined to create complex workflows.
Features of Apache Spark
- Analytics and Big Data processing.
- Machine learning capabilities.
- Huge open source community and contributors.
- A distributed framework for general-purpose data processing.
- Support for Java, Scala, Python, R, and SQL.
- Integration with libraries for streaming and graph operations.
Benefits of Apache Spark
- Speed (In memory cluster computing)
- Scalable (nodes can be added to or removed from the cluster)
- Powerful Caching
- Real-time stream processing
- Supports Delta Lake
- Polyglot (Knowing/Using several languages. Spark provides high-level APIs in SQL/Scala/Python/R. Spark code can be written in any of these 4 languages.)
Popular Spark Ecosystem (Core APIs)
- Spark SQL : SQL queries in Spark (see the sketch after this list).
- Streaming : Process realtime data.
- MLlib : MLlib allows for preprocessing, munging, training of models, and making predictions at scale on data.
It also supports Scala, Python, Java, and R.
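As a quick illustration of Spark SQL (referenced in the list above), here is a minimal sketch; the data, view name, and query are made up for the example:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLDemo").getOrCreate()

# Hypothetical sales data registered as a temporary view
df = spark.createDataFrame(
    [("APAC", "India", 100.0), ("EMEA", "Germany", 250.0)],
    ["Region", "Country", "Sales"],
)
df.createOrReplaceTempView("sales")

# Run a SQL query against the temporary view
spark.sql("SELECT Region, SUM(Sales) AS TotalSales FROM sales GROUP BY Region").show()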
Spark Architecture
(Architecture diagram source: www.databricks.com)
Cluster Manager: Allocates students/faculty for a course; decides who is in and who is out.
Driver: The faculty and the point of entry. Gives instructions and collects the results.
Executor: A table of students.
Core (Thread): A student.
Cores share the same JVM, memory, and disk space (like students at a table sharing resources and power outlets).
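To make the classroom analogy concrete, here is a minimal sketch of requesting executors and cores when creating a session on a real cluster manager; the resource sizes are placeholder values, not recommendations:
from pyspark.sql import SparkSession

# Placeholder resource sizes for illustration only
spark = (
    SparkSession.builder
    .appName("ClusterSizingSketch")
    .config("spark.executor.instances", "4")  # number of executors (tables)
    .config("spark.executor.cores", "4")      # cores (students) per executor
    .config("spark.executor.memory", "8g")    # memory shared by the cores of an executor
    .getOrCreate()
)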
Partition
Partitioning is a technique of distributing data across multiple files or nodes in order to improve query processing performance.
By partitioning the data, you can write each partition to a different executor simultaneously, which can improve the performance of the write operation.
Spark can read each partition in parallel on a different executor, which can improve the performance of the read operation.
How would you sort and count a bowl of candies by color?
If one person does it alone, the others are wasting their time. Instead, if the candies are split into small packs (partitions), everyone can work in parallel.
The default partition size is 128 MB, and it is configurable.
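A minimal sketch of inspecting and influencing partitioning, assuming the 128 MB default refers to spark.sql.files.maxPartitionBytes (the setting that controls how input files are split into partitions when read):
# Maximum bytes packed into a single partition when reading files (typically 128 MB)
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))

# The sales.csv path is reused from the example later in this article
df = spark.read.option("header", "true").csv("sales.csv")
print(df.rdd.getNumPartitions())  # how many partitions Spark created on read

# Explicitly repartition so more executors/cores can work in parallel
df_repart = df.repartition(8)
print(df_repart.rdd.getNumPartitions())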
- Data is ready to be processed.
- Each core is assigned a task.
- A few cores complete their tasks.
- Free cores pick up the remaining tasks.
Terms to learn
Spark Context
Used widely in Spark Version 1.x
Spark Context : Used by the driver node to establish communication with the cluster. SparkContext is an object that represents the entry point to the underlying Spark cluster and is responsible for coordinating the processing of distributed data in the cluster. It serves as a communication channel between the driver program and the Spark cluster.
When a Databricks cluster is started, a SparkContext is automatically created and made available as the variable sc in the notebook or code. The SparkContext provides various methods to interact with the Spark cluster, including creating RDDs, accessing Spark configuration parameters, and managing Spark jobs.
text_rdd = sc.textFile("/FileStore/tables/my_text_file.txt")
Other entry points
SQL Context : Entry point to perform SQL-like operations.
Hive Context : Used if the Spark application needs to communicate with Hive.
In newer versions of Spark, these are not used anymore.
Spark Session
In Spark 2.0, SparkSession was introduced as a new entry point that subsumes SparkContext, SQLContext, StreamingContext, and HiveContext. For backward compatibility, they are preserved.
In short, it is referred to as "spark".
spark.read.format("csv").
RDD
RDDs (Resilient Distributed Datasets) are the fundamental data structure in Apache Spark. Here are the key aspects:
Core Characteristics:
- Resilient: Fault-tolerant with the ability to rebuild data in case of failures
- Distributed: Data is distributed across multiple nodes in a cluster
- Dataset: Collection of partitioned data elements
- Immutable: Once created, cannot be changed
Key Features:
- In-memory computing
- Lazy evaluation (transformations aren't executed until an action is called; see the sketch after the operations lists below)
- Type safety at compile time
- Ability to handle structured and unstructured data
Basic Operations:
Transformations (create new RDD):
- map()
- filter()
- flatMap()
- union()
- intersection()
- distinct()
Actions (return values):
- reduce()
- collect()
- count()
- first()
- take(n)
- saveAsTextFile()
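As an illustration of lazy evaluation with a few of the operations above (a sketch using made-up numbers and the sc variable from the SparkContext section; nothing executes until an action is called):
# Transformations only build up a lineage; no work happens yet
numbers = sc.parallelize([1, 2, 3, 4, 5, 6])
evens = numbers.filter(lambda n: n % 2 == 0)  # transformation
squares = evens.map(lambda n: n * n)          # transformation

# Actions trigger execution of the whole lineage
print(squares.collect())  # [4, 16, 36]
print(squares.count())    # 3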
Benefits
- Fault tolerance through lineage graphs
- Parallel processing
- Caching capability for frequently accessed data
- Efficient handling of iterative algorithms
- Supports multiple languages (Python, Scala, Java)
Limitations
- No built-in optimization engine
- Manual optimization required
- Limited structured data handling compared to DataFrames
- Higher memory usage due to Java serialization
Example
Read a CSV using RDDs, exclude rows where Region = 'Australia', and compute the average Sales grouped by Region and Country.
from pyspark.sql import SparkSession

# Initialize Spark
spark = SparkSession.builder \
    .appName("Sales Analysis RDD") \
    .getOrCreate()
sc = spark.sparkContext

# Read CSV file
rdd = sc.textFile("sales.csv")

# Extract header and data
header = rdd.first()
data_rdd = rdd.filter(lambda line: line != header)

# Transform and filter data
# Assuming CSV format: Region,Country,Sales,...
result_rdd = data_rdd \
    .map(lambda line: line.split(',')) \
    .filter(lambda x: x[0] != 'Australia') \
    .map(lambda x: ((x[0], x[1]), float(x[2]))) \
    .groupByKey() \
    .mapValues(lambda sales: sum(sales) / len(sales)) \
    .sortByKey()

# Display results
print("Region, Country, Average Sales")
for (region, country), avg_sales in result_rdd.collect():
    print(f"{region}, {country}, {avg_sales:.2f}")
Good news: you don't have to write RDDs directly anymore.
Now the same example using DataFrames:
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg

# Initialize Spark
spark = SparkSession.builder \
    .appName("Sales Analysis DataFrame") \
    .getOrCreate()

# Read CSV file
df = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv("sales.csv")

# Group by and calculate average sales
result_df = df.filter(df.Region != 'Australia') \
    .groupBy('Region', 'Country') \
    .agg(avg('Sales').alias('Average_Sales')) \
    .orderBy('Region', 'Country')

# Show results
result_df.show(truncate=False)
Spark Query Execution Sequence
Unresolved Logical Plan
This is the set of instructions the developer logically wants to execute.
First, Spark parses the query and creates the Unresolved Logical Plan. This step validates the syntax of the query.
It doesn't validate the semantics, meaning column name existence and data types.
Metadata Catalog (Analysis)
This is where column names and table names are validated against the catalog, and a resolved Logical Plan is returned.
Catalyst Optimizer (Logical Optimization)
This is where the first set of optimizations takes place: the logical sequence of operations is rewritten and reordered. From this we get the Optimized Logical Plan.
Physical Planning
The optimizer determines that there are multiple ways to execute the query. Do we pull 100% of the data over the network, or filter the dataset with a predicate pushdown? From this, we generate one or more physical plans.
Physical Plans
Multiple ways to execute the query.
Physical plans represent what the query engine will actually do; this is different from the optimized logical plan. Each physical plan is evaluated against a cost model.
Selected Physical Plan
The best performing plan is selected by the cost model, which considers:
- the estimated amount of data to be processed
- the amount of shuffling
- the estimated time to execute the query
Finally, the selected physical plan is compiled down to RDDs.
RDD (Whole-Stage Code Generation)
This is the same RDD code a developer could have written by hand.
AQE (Adaptive Query Execution)
AQE re-optimizes the query at runtime: it checks join strategies and data skew using actual runtime statistics, and this happens repeatedly as stages complete to find the best plan.
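A minimal sketch of the related configuration switches (AQE is enabled by default in recent Spark versions):
# Adaptive Query Execution settings
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")  # merge small shuffle partitions
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")            # split skewed partitions during joins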
- explain(mode="simple") which will display the physical plan
- explain(mode="extended") which will display physical and logical plans (like “extended” option)
- explain(mode="codegen") which will display the java code planned to be executed
- explain(mode="cost") which will display the optimized logical plan and related statistics (if they exist)
- explain(mode="formatted") which will display a splitted output composed by a nice physical plan outline, and a section with each node details
DAG (Directed Acyclic Graph)
Example of DAG
Stage 1 (Read + Filter):
[Read CSV] → [Filter]
|
↓
Stage 2 (GroupBy + Shuffle):
[Shuffle] → [GroupBy]
|
↓
Stage 3 (Order):
[Sort] → [Display]
Databricks
Databricks is a Unified Analytics Platform on top of Apache Spark that accelerates innovation by unifying data science, engineering, and business. With its fully managed Spark clusters in the cloud, you can easily provision clusters with just a few clicks.
This is not a Databricks sales pitch, so I am not getting into the depths of the product.
Open
Open standards provide easy integration with other tools plus secure, platform-independent data sharing.
Unified
One platform for your data, consistently governed and available for all your analytics and AI.
Scalable
Scale efficiently with every workload from simple data pipelines to massive LLMs.
Lakehouse
Data Warehouse + Data Lake = Lakehouse.
ELT Data Design Pattern
In ELT (Extract, Load, Transform) data design patterns, the focus is on loading raw data into a data warehouse first, and then transforming it. This is in contrast to ETL, where data is transformed before loading. ELT is often favored in cloud-native architectures.
Batch Load
In a batch load, data is collected over a specific period and then loaded into the data warehouse in one go.
Real-time Example
A retail company collects sales data throughout the day and then runs a batch load every night to update the data warehouse. Analysts use this data the next day for reporting and decision-making.
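A minimal batch-load sketch in PySpark (the path and table name are hypothetical placeholders):
# Nightly batch job: read the day's raw sales files and append them to a warehouse table
raw_df = spark.read.option("header", "true").csv("/raw/sales/2024-01-01/")  # hypothetical path

raw_df.write.mode("append").saveAsTable("sales_warehouse")  # hypothetical table name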
Stream Load
In stream loading, data is continuously loaded into the data warehouse as it's generated. This is useful in scenarios requiring real-time analytics and decision-making.
Real-time Example
A ride-sharing app collects GPS coordinates of all active rides. This data is streamed in real-time into the data warehouse, where it's immediately available for analytics to optimize ride allocation and pricing dynamically.
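A minimal stream-load sketch using Spark Structured Streaming (the schema, paths, and table name are hypothetical; a real pipeline would typically read from Kafka or cloud storage):
# Continuously ingest newly arriving GPS events and append them to a table
gps_stream = (
    spark.readStream
    .schema("ride_id STRING, lat DOUBLE, lon DOUBLE, event_time TIMESTAMP")  # hypothetical schema
    .json("/landing/gps/")                                                   # hypothetical landing path
)

query = (
    gps_stream.writeStream
    .format("delta")                            # assuming Delta Lake is available
    .option("checkpointLocation", "/chk/gps/")  # hypothetical checkpoint path
    .outputMode("append")
    .toTable("gps_events")                      # hypothetical table name; starts the streaming query
)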