Table format : Wasn’t a file format enough?

Ani
4 min read · Nov 4, 2021


When you can’t see it well, wear a pair of glasses.

The era of big data reached the pinnacle of success with the rise of multiple successful file formats such as Parquet, ORC, and Avro, which gave compute engines like MapReduce and Spark much of their power to optimise search and retrieval.

Apache Parquet : An Overview

For example, let's look at how Parquet files are designed for faster reads without having to read the entire file.

[Figure: Parquet file layout]

Parquet was created to make the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem.

Parquet is built from the ground up with complex nested data structures in mind, and uses the record shredding and assembly algorithm described in the Dremel paper. This approach is superior to simple flattening of nested name spaces.

Parquet is built to support very efficient compression and encoding schemes. Multiple projects have demonstrated the performance impact of applying the right compression and encoding scheme to the data. Parquet allows compression schemes to be specified on a per-column level, and is future-proofed to allow adding more encodings as they are invented and implemented.

It consists of a header followed by one or more blocks, terminated by a footer. The header contains only a 4-byte magic number, PAR1, that identifies the file as being in Parquet format; all the file metadata is stored in the footer. The footer's metadata includes the format version, the schema, any extra key-value pairs, and metadata for every block in the file. The final two fields in the footer are a 4-byte field encoding the length of the footer metadata, and the magic number again.
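
To make the footer concrete, here is a minimal sketch (not from the original article) that inspects this metadata with pyarrow; the file name employees.parquet is just a placeholder.

import pyarrow.parquet as pq

# Opening the file only parses the footer; data pages are not read yet.
pf = pq.ParquetFile("employees.parquet")   # placeholder file name
meta = pf.metadata

print(meta.num_rows, meta.num_row_groups)  # counts stored in the footer
print(pf.schema_arrow)                     # schema stored in the footer
print(meta.metadata)                       # extra key-value pairs, if any

# Per-block (row group) metadata, including min/max statistics per column chunk
rg = meta.row_group(0)
col = rg.column(0)
print(col.path_in_schema, col.statistics)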

How Spark reads Parquet

When Spark reads a directory of Parquet files, the engine doesn't read the entire set of files up front. Let's assume we have a table DEPARTMENT_TABLE, partitioned by DEPARTMENT and SUBDEPARTMENT, and we fire the SQL below in Spark. First, Spark will prune the partitions to DEPARTMENT = software-services and SUBDEPARTMENT = datalake, so that it can avoid looking into other partitions such as CHEMISTRY or MATHEMATICS.

SELECT DEPARTMENT, SUBDEPARTMENT, COUNT(EMP_ID)
FROM DEPARTMENT_TABLE
WHERE DEPARTMENT = 'software-services'
  AND SUBDEPARTMENT = 'datalake'
  AND EMP_ID BETWEEN 10001 AND 200001
GROUP BY DEPARTMENT, SUBDEPARTMENT;
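
For reference, here is a hedged sketch of the same query through Spark's DataFrame API; the warehouse path /warehouse/department_table is hypothetical, and the table is assumed to be written with DEPARTMENT and SUBDEPARTMENT as partition columns.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("partition-pruning-demo").getOrCreate()

df = (
    spark.read.parquet("/warehouse/department_table")   # hypothetical path
    # Partition filters: Spark prunes the DEPARTMENT=.../SUBDEPARTMENT=... directories.
    .filter((F.col("DEPARTMENT") == "software-services") &
            (F.col("SUBDEPARTMENT") == "datalake"))
    # Non-partition filter: pushed down to Parquet and checked against footer statistics.
    .filter(F.col("EMP_ID").between(10001, 200001))
    .groupBy("DEPARTMENT", "SUBDEPARTMENT")
    .agg(F.count("EMP_ID").alias("emp_count"))
)

df.explain()   # the physical plan lists PartitionFilters and PushedFilters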

What happens after that?

After Spark has resolved the partitions, it needs to scan the files under SUBDEPARTMENT = datalake. To do that efficiently, Spark uses the Parquet metadata and reads each file's footer for further filtering, if filters are provided in the SQL. The query above has one more filter, EMP_ID BETWEEN 10001 AND 200001. As described in the Parquet layout above, Spark opens all the files and reads their metadata to validate the range for EMP_ID. This operation in Spark is called the file listing operation, and it is on the order of O(N), where N is the number of files inside the “datalake” directory.
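
A rough sketch of what that per-file footer check amounts to, done here with pyarrow outside of Spark; the partition path is hypothetical and EMP_ID is assumed to carry min/max statistics. The loop over files is exactly the O(N) cost described above.

import glob
import pyarrow.parquet as pq

low, high = 10001, 200001
pattern = ("/warehouse/department_table/DEPARTMENT=software-services/"
           "SUBDEPARTMENT=datalake/*.parquet")           # hypothetical layout

candidates = set()
for path in glob.glob(pattern):                          # one footer read per file: O(N)
    meta = pq.ParquetFile(path).metadata
    for i in range(meta.num_row_groups):
        rg = meta.row_group(i)
        for j in range(rg.num_columns):
            col = rg.column(j)
            if col.path_in_schema == "EMP_ID" and col.statistics is not None \
                    and col.statistics.has_min_max:
                # Keep the file only if its EMP_ID range overlaps the filter.
                if col.statistics.min <= high and col.statistics.max >= low:
                    candidates.add(path)

print(f"{len(candidates)} file(s) may contain EMP_ID between {low} and {high}")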

As the directory structure becomes more complicated and Spark generates numerous small files, the listing and metadata-reading operations become a bottleneck, and record retrieval grows slower over time.

This problem is not caused by the file format alone but by the overall listing process. It is prevalent in Hive, Spark, MapReduce, and any distributed computing built around HDFS.

Table format : The Saviour

When we say file format, we are talking about a single unit of data: a Parquet file, an Avro file, or an ORC file. When we treat a group of files as one collection, that is a table format. When Spark sees the Parquet files under the subdepartment datalake through the pair of glasses that a table format provides, it sees the directory as a table; hence the name table format. At this moment there are four major table formats available in the big data universe:

Apache Iceberg, Apache Hudi, Databricks Delta, and Hive ACID tables.

How does it solve the listing problem?

The pair of glasses in the analogy above is the table format, which lets Spark track not just a directory but each individual file. To achieve that, it relies on catalog systems such as the Hive Metastore (HMS) or the AWS Glue Data Catalog to implement ACID properties and to maintain a strong mapping from a table to its data files. This effectively reduces the search or list operation to the order of O(1).
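
As a hedged illustration, this is roughly how Spark is pointed at a table format through a catalog, using Apache Iceberg backed by the Hive Metastore as one example; the Iceberg Spark runtime jar is assumed to be on the classpath, and the catalog name demo, the database db, and the table name are placeholders (AWS Glue or a Hadoop warehouse would be configured similarly).

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("table-format-demo")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hive")       # backed by HMS
    .getOrCreate()
)

# Planning now goes metastore -> table metadata -> exact data files,
# instead of listing every file under a directory.
spark.sql("""
    SELECT DEPARTMENT, SUBDEPARTMENT, COUNT(EMP_ID)
    FROM demo.db.department_table
    WHERE DEPARTMENT = 'software-services' AND SUBDEPARTMENT = 'datalake'
      AND EMP_ID BETWEEN 10001 AND 200001
    GROUP BY DEPARTMENT, SUBDEPARTMENT
""").show()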

Underneath, it writes files in versions and keeps track of the latest files belonging to a table with the help of the metastore. When we subsequently read the table, Spark talks to the metastore to learn the current state of the table and which files need to be scanned. Using optimistic concurrency control, it guarantees ACID compliance and its benefits. This speeds up the engine and helps prune files to the greatest extent possible.
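
Continuing the hypothetical Iceberg setup above, here is a sketch of how those versions surface to the reader: every commit produces a snapshot tracked in the table metadata, and older snapshots can be read back by id (the snapshot id below is a placeholder).

# List the snapshots (versions) the table format is tracking.
spark.sql("SELECT snapshot_id, committed_at, operation "
          "FROM demo.db.department_table.snapshots").show(truncate=False)

# Time travel: read the table as of an older snapshot; the id is a placeholder.
old = (spark.read
       .format("iceberg")
       .option("snapshot-id", 1234567890123456789)
       .load("demo.db.department_table"))
old.show()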

For more details, reach out to me: anigos

To read more about Apache Iceberg: Apache Iceberg : A Primer

