What could be the best compression codec for your data lake? Parquet is the most popular and optimised file format for data lakes. It is also the recommended format for Spark, and it is highly optimised for two codecs in particular: GZIP and ZSTD.
Benchmarks suggest that ZSTD offers higher compression and decompression speed than GZIP. In a data lake, write speed matters less than read speed, because reads need to be fast for retrieval. On the other hand, saving disk space saves money, and who doesn't want that?
GZIP provides lossless compression that is not splittable. Its compression ratio is around 2.7x-3x. Compression speed is around 100 MB/s and decompression speed is around 440 MB/s.
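To make "lossless" and "compression ratio" concrete, here is a minimal sketch using Python's standard-library zlib module, which implements the DEFLATE algorithm that GZIP is built on. The sample payload and the helper name compression_ratio are made up for illustration. The ratio you get on this repetitive sample will not match the 2.7x-3x figure, which depends on the data.

```python
import zlib


def compression_ratio(raw: bytes, level: int = 6) -> float:
    """Return original size divided by compressed size at the given zlib level."""
    compressed = zlib.compress(raw, level)
    return len(raw) / len(compressed)


# Hypothetical sample payload: repetitive CSV-like rows, which compress well.
sample = b"".join(b"%d,click,2024-01-01,IN\n" % i for i in range(10_000))

# Level 6 is zlib's default trade-off between speed and output size.
packed = zlib.compress(sample, 6)

# Lossless: decompressing returns exactly the original bytes.
restored = zlib.decompress(packed)

print(f"ratio at level 6: {compression_ratio(sample, 6):.1f}x")
print(f"round trip intact: {restored == sample}")
```

Raising the level trades more CPU time for a smaller output, which is the same speed-versus-size trade-off discussed in this post.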
ZSTD provides lossless compression that is splittable. It is not specific to any data type and is designed for real-time compression. Its compression ratio is around 2.8x. Compression speed is around 530 MB/s and decompression speed is around 1360 MB/s.
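A rough back-of-the-envelope comparison can be sketched from the figures quoted in this post. The assumptions here are mine: throughput is measured against uncompressed megabytes, a single thread does the work, disk and network I/O are ignored, the GZIP ratio is taken as 2.7 (the low end of its range), and the 100 GB dataset size is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Codec:
    name: str
    ratio: float            # compressed size = raw size / ratio
    compress_mb_s: float    # assumed to be MB of uncompressed input per second
    decompress_mb_s: float  # assumed to be MB of uncompressed output per second


# Figures quoted in the post; GZIP ratio uses the low end of 2.7x-3x.
GZIP = Codec("gzip", 2.7, 100.0, 440.0)
ZSTD = Codec("zstd", 2.8, 530.0, 1360.0)


def estimate(codec: Codec, raw_mb: float) -> dict:
    """Estimate stored size and single-threaded write/read time for raw_mb of data."""
    return {
        "stored_mb": raw_mb / codec.ratio,
        "write_s": raw_mb / codec.compress_mb_s,
        "read_s": raw_mb / codec.decompress_mb_s,
    }


raw_mb = 100_000  # hypothetical 100 GB of uncompressed data
for codec in (GZIP, ZSTD):
    result = estimate(codec, raw_mb)
    print(
        f"{codec.name}: {result['stored_mb']:.0f} MB stored, "
        f"{result['write_s']:.0f} s to compress, "
        f"{result['read_s']:.0f} s to decompress"
    )
```

Real jobs run in parallel and are often I/O bound, so treat the output as a relative comparison rather than a prediction.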
By default, GZIP runs at Level 6 when used as the compression algorithm inside Parquet. Recent community development on Parquet's support for ZSTD from Facebook caught data engineers' attention. One experiment showed that ZSTD Level 9 and Level 19 reduce Parquet file size by 8% and 12% respectively, compared to GZIP-based Parquet files. Moreover, both ZSTD Level 9 and Level 19 decompress faster than GZIP Level 6.
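Switching a Spark job's Parquet output between the two codecs is a small change. The sketch below assumes PySpark on a recent Spark 3.x build whose Parquet libraries include ZSTD support. It relies on the DataFrameWriter "compression" option, which accepts "gzip" and "zstd" for Parquet. The function name write_parquet and the paths are hypothetical, and the compression level is left at the library default rather than set to 9 or 19.

```python
SUPPORTED_CODECS = {"gzip", "zstd"}  # the two codecs compared in this post


def write_parquet(df, path: str, codec: str = "zstd") -> None:
    """Write a Spark DataFrame as Parquet using the chosen compression codec."""
    if codec not in SUPPORTED_CODECS:
        raise ValueError(f"unsupported codec: {codec!r}")
    # The per-write "compression" option overrides the session-wide
    # spark.sql.parquet.compression.codec setting for this write only.
    df.write.option("compression", codec).parquet(path)


# Usage (needs a running SparkSession; the paths are placeholders):
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.getOrCreate()
# df = spark.range(1_000_000)
# write_parquet(df, "/tmp/events_zstd", codec="zstd")
# write_parquet(df, "/tmp/events_gzip", codec="gzip")
```

Writing the same DataFrame with both codecs and comparing the output directory sizes is a simple way to check the file-size difference on your own data.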
#spark #dataengineering #compression #zstd #gzip #datalake #lakehouse #databricks