File Formats

CSV/TSV

Pros

Tabular Row storage.
Human-readable is easy to edit manually.
Simple schema.
Easy to implement and parse the file(s).

Cons

No standard way to present binary data.
No complex data types.
Large in size.

JSON

Pros

Supports hierarchical structure.
Most languages support them.
Widely used in Web

Cons

More memory usage due to repeatable column names.
Not very splittable.
Lacks indexing.

Parquet is a columnar storage file format optimized for use with Apache Hadoop and related big data processing frameworks. Twitter and Cloudera developed it to provide a compact and efficient way of storing large, flat datasets.

Best for WORM (Write Once Read Many).

The key features of Parquet are:

Columnar Storage: Parquet is optimized for columnar storage, unlike row-based files like CSV or TSV. This allows it to efficiently compress and encode data, which makes it a good fit for storing data frames.
Schema Evolution: Parquet supports complex nested data structures, and the schema can be modified over time. This provides much flexibility when dealing with data that may evolve.
Compression and Encoding: Parquet allows for highly efficient compression and encoding schemes. This is because columnar storage makes better compression and encoding schemes possible, which can lead to significant storage savings.
Language Agnostic: Parquet is built from the ground up for use in many languages. Official libraries are available for reading and writing Parquet files in many languages, including Java, C++, Python, and more.
Integration: Parquet is designed to integrate well with various big data frameworks. It has deep support in Apache Hadoop, Apache Spark, and Apache Hive and works well with other data processing frameworks.

In short, Parquet is a powerful tool in the big data ecosystem due to its efficiency, flexibility, and compatibility with a wide range of tools and languages.

Difference between CSV and Parquet

Aspect	CSV (Comma-Separated Values)	Parquet
Data Format	Text-based, plain text	Columnar, binary format
Compression	Usually uncompressed (or lightly compressed)	Highly compressed
Schema	None, schema-less	Strong schema enforcement
Read/Write Efficiency	Row-based, less efficient for column operations	Column-based, efficient for analytics
File Size	Generally larger	Typically smaller due to compression
Storage	More storage space required	Less storage space required
Data Access	Good for sequential access	Efficient for accessing specific columns
Example Size (1 GB)	Could be around 1 GB or more depending on compression	Could be 200-300 MB (due to compression)
Use Cases	Simple data exchange, compatibility	Big data analytics, data warehousing
Support for Data Types	Limited to text, numbers	Rich data types (int, float, string, etc.)
Processing Speed	Slower for large datasets, particularly for queries on specific columns	Faster, especially for column-based queries
Tool Compatibility	Supported by most tools, databases, and programming languages	Supported by big data tools like Apache Spark, Hadoop, etc.

Parquet Compression

Snappy (default)
Gzip

Snappy

Low CPU Util
Low Compression Rate
Splittable
Use Case: Hot Layer
Compute Intensive

GZip

High CPU Util
High Compression Rate
Splittable
Use Case: Cold Layer
Storage Intensive

Advance Data Warehousing

File Formats

CSV/TSV

JSON

Parquet

Difference between CSV and Parquet

Parquet Compression