Table Evolution in Apache Iceberg

Ani
Nov 1, 2021


Do you want to change the physical layout of a table in your data lake without rebuilding the table at a huge compute cost?

⚫ Did you make a mistake and create a table with the wrong partitioning?
⚫ Did you forget to set the correct compression codec for the data files?
⚫ Do you want to add a column to a data lake table at any position?
⚫ Do you want to change the sort order of a data lake table?

Well, you can still achieve all of this without rebuilding the entire table. With Apache Iceberg's table evolution principles, existing data stays where it is and the change applies to newly written data, so the table keeps supporting your current and future analytics.

Iceberg supports the following schema evolution changes:

⚫ Add — add a new column to the table or to a nested struct
⚫ Drop — remove an existing column from the table or a nested struct
⚫ Rename — rename an existing column or field in a nested struct
⚫ Update — widen the type of a column, struct field, map key, map value, or list element
⚫ Reorder — change the order of columns or fields in a nested struct

Iceberg guarantees that schema evolution changes are independent and free of side-effects, without rewriting files. Iceberg uses unique IDs to track each column in a table. When you add a column, it is assigned a new ID so existing data is never used by mistake.
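To make the ID idea concrete, here is a minimal sketch in plain Python. It is a toy model, not Iceberg's implementation or API: the `ToySchema` class, its methods, and the row layout are all hypothetical and exist only to show why tracking columns by ID keeps renames and re-added columns safe.

```python
from dataclasses import dataclass, field


@dataclass
class ToySchema:
    """Hypothetical toy schema that maps column names to permanent field IDs."""

    name_to_id: dict = field(default_factory=dict)
    last_id: int = 0

    def add(self, name):
        # A new column always gets a fresh ID; old IDs are never reused.
        self.last_id += 1
        self.name_to_id[name] = self.last_id

    def rename(self, old, new):
        # Only the name changes; the ID still points at the same stored data.
        self.name_to_id[new] = self.name_to_id.pop(old)

    def drop(self, name):
        del self.name_to_id[name]

    def read(self, stored_row):
        # Stored rows are keyed by field ID and resolved via the current schema.
        return {name: stored_row.get(fid) for name, fid in self.name_to_id.items()}


schema = ToySchema()
schema.add("id")       # field ID 1
schema.add("region")   # field ID 2

# A row written under the original schema, keyed by field ID.
old_row = {1: 42, 2: "EU"}

schema.rename("region", "area")
after_rename = schema.read(old_row)

schema.drop("area")
schema.add("area")     # same name, but a new field ID (3)
after_readd = schema.read(old_row)

print(after_rename)  # {'id': 42, 'area': 'EU'}
print(after_readd)   # {'id': 42, 'area': None}
```

The rename keeps the old value because the ID is unchanged, while the re-added column reads as empty for old rows because its new ID matches nothing in the stored data.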

My favourite feature is partition evolution: you can change a table's partitioning without rewriting the existing data. It is a metadata operation, so files written under the old partition spec keep their old layout while new data uses the new one. A query that spans both layouts does not fail; Iceberg plans each layout separately (split planning) and combines the results. In the example from the Iceberg documentation linked below, the data for 2008 is partitioned by month. Starting from 2009, the table is updated so that the data is partitioned by day instead. Both partitioning layouts coexist in the same table.

https://iceberg.apache.org/#evolution/
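The month-then-day example can be sketched as a small simulation. This is only an illustration under stated assumptions, not Iceberg's planner: the file list, the `TRANSFORMS` table, and the `plan` function are hypothetical. Each file remembers which partition spec wrote it, and the date filter is translated separately for each spec.

```python
from datetime import date


def month_of(d):
    return (d.year, d.month)


def day_of(d):
    return (d.year, d.month, d.day)


# Hypothetical spec IDs: spec 0 partitions by month, spec 1 by day.
TRANSFORMS = {0: month_of, 1: day_of}

# Each data file records the spec that wrote it and its partition value.
files = [
    {"path": "2008-11.parquet", "spec": 0, "partition": (2008, 11)},
    {"path": "2008-12.parquet", "spec": 0, "partition": (2008, 12)},
    {"path": "2009-01-01.parquet", "spec": 1, "partition": (2009, 1, 1)},
    {"path": "2009-01-02.parquet", "spec": 1, "partition": (2009, 1, 2)},
]


def plan(files, start, end):
    """Return paths of files that may hold rows with start <= date <= end."""
    selected = []
    for f in files:
        transform = TRANSFORMS[f["spec"]]
        # Derive partition-level bounds using this file's own spec.
        lo, hi = transform(start), transform(end)
        if lo <= f["partition"] <= hi:
            selected.append(f["path"])
    return selected


matches = plan(files, date(2008, 12, 15), date(2009, 1, 1))
print(matches)  # ['2008-12.parquet', '2009-01-01.parquet']
```

One query covers both layouts: the monthly file for December 2008 and the daily file for 1 January 2009 are kept, and everything else is skipped.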

Iceberg uses hidden partitioning, so you don’t need to write queries for a specific partition layout to be fast. Instead, you can write queries that select the data you need, and Iceberg automatically prunes out files that don’t contain matching data.
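As a rough illustration of hidden partitioning, the sketch below uses a hypothetical `ToyTable` class (not an Iceberg API). The writer supplies only rows, the table derives the partition value from the `ts` column, and the reader filters on `ts` alone while whole partitions outside the range are skipped.

```python
from collections import defaultdict
from datetime import datetime


def day(ts):
    # Partition transform: derive the partition value from the source column.
    return ts.date()


class ToyTable:
    """Hypothetical table that hides its day-based partitioning from users."""

    def __init__(self):
        self.partitions = defaultdict(list)  # day -> rows (stand-in for files)

    def append(self, row):
        # Writers pass only the row; the partition value is computed here.
        self.partitions[day(row["ts"])].append(row)

    def scan(self, ts_from, ts_to):
        """Filter on ts only; return (matching rows, partitions actually read)."""
        rows, partitions_read = [], 0
        for part_day, part_rows in self.partitions.items():
            if part_day < day(ts_from) or part_day > day(ts_to):
                continue  # pruned without reading any rows
            partitions_read += 1
            rows.extend(r for r in part_rows if ts_from <= r["ts"] <= ts_to)
        return rows, partitions_read


table = ToyTable()
table.append({"ts": datetime(2021, 11, 1, 9, 0), "event": "a"})
table.append({"ts": datetime(2021, 11, 1, 18, 0), "event": "b"})
table.append({"ts": datetime(2021, 11, 2, 9, 0), "event": "c"})
table.append({"ts": datetime(2021, 11, 3, 9, 0), "event": "d"})

rows, partitions_read = table.scan(
    datetime(2021, 11, 1, 12, 0), datetime(2021, 11, 2, 12, 0)
)
events = [r["event"] for r in rows]
print(events, partitions_read)  # ['b', 'c'] 2
```

The query never mentions a partition column, yet the partition for 3 November is never read.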

To learn more, see the Apache Iceberg documentation.

#datalakehouse #iceberg #spark #sql #opendataarchitecture #bigdata #dataengineering #analytics #future #apachespark
