Data Lakehouse Architecture

Ani · 2 min read · Nov 1, 2021

With all the buzz around lakehouses and their benefits, let's take a look at the basic architecture of a working lakehouse.

Image © Amazon Web Services (AWS)

A lakehouse combines the best qualities of a data warehouse and a data lake. But how can one implement the ACID properties of a warehouse on top of plain files? That's where the shared catalog comes in: it tracks all the transactional details. For example:

⚫ Which data files make up the current (latest) version of the table?
⚫ Which files are stale or no longer in use?
⚫ What is the partitioning scheme?
⚫ What are the sort keys?
⚫ Is there an active lock on the table?

All of those answers come from the catalog. This catalog is shared between the lake and the DW; the DW is simply looking at the collection of files as a table through the lens of the catalog/metadata.
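To make this concrete, here is a minimal sketch of asking the catalog those questions using Apache Iceberg's metadata tables on Spark. The catalog and table name (demo.db.sales) are placeholders, and the Iceberg runtime and catalog configuration are assumed to already be set up on the cluster.

```python
# Minimal sketch: query Iceberg's metadata tables to "ask the catalog".
# Assumes the Iceberg Spark runtime jar and a catalog named "demo" are
# already configured (e.g. via spark-defaults); demo.db.sales is a
# placeholder table name.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakehouse-catalog-sketch").getOrCreate()

# Which snapshot (version) is current, and which versions came before it?
spark.sql("SELECT snapshot_id, committed_at, operation FROM demo.db.sales.snapshots").show()

# Which data files back the table, and how are they partitioned?
spark.sql("SELECT file_path, partition, record_count FROM demo.db.sales.files").show()

# What does the current partitioning look like?
spark.sql("SELECT * FROM demo.db.sales.partitions").show()
```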

The rest works just like a functional data lake. In addition to what we can already do in a lake, a fully functional lakehouse offers SQL-based ELT, so DML operations such as INSERT, UPDATE, DELETE, and MERGE can be run directly on files with the help of some very strong pointers, aka the metastore.
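As a rough illustration of that SQL-based ELT, a single MERGE statement can insert, update, and delete rows of a lakehouse table in one shot. The sketch below assumes Spark SQL over an Iceberg or Delta table with MERGE support enabled; the table and column names (demo.db.sales, staging_updates, id, amount, is_deleted) are made up for the example.

```python
# Hedged sketch of DML on files via the metastore-backed table.
# Assumes a Spark session whose SQL extensions support MERGE INTO
# (Iceberg extensions or Delta Lake); all identifiers are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakehouse-dml-sketch").getOrCreate()

spark.sql("""
    MERGE INTO demo.db.sales AS t
    USING staging_updates AS s
      ON t.id = s.id
    WHEN MATCHED AND s.is_deleted = true THEN DELETE
    WHEN MATCHED THEN UPDATE SET t.amount = s.amount
    WHEN NOT MATCHED THEN INSERT (id, amount) VALUES (s.id, s.amount)
""")
```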

You can take the same old Hive Metastore DB schema and install it in a Postgres or MySQL instance, for example, to work with Iceberg. In the case of Databricks Delta, Databricks Runtime uses an external Apache Hive metastore or the AWS Glue Data Catalog as the metastore.
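As one possible setup, Iceberg on Spark can point its catalog at such a Hive metastore with a few configuration properties. The catalog name (hive_cat), the thrift URI, and the warehouse path below are placeholders for whatever your metastore and storage actually use.

```python
# Sketch: wire an Iceberg Spark catalog to an existing Hive metastore
# (whose backing schema could live in Postgres or MySQL).
# hive_cat, the thrift URI and the S3 path are illustrative placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lakehouse-metastore-sketch")
    .config("spark.sql.catalog.hive_cat", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.hive_cat.type", "hive")
    .config("spark.sql.catalog.hive_cat.uri", "thrift://metastore-host:9083")
    .config("spark.sql.catalog.hive_cat.warehouse", "s3://my-bucket/warehouse")
    .getOrCreate()
)

# Tables created through hive_cat are now tracked by the shared metastore.
spark.sql("CREATE NAMESPACE IF NOT EXISTS hive_cat.db")
spark.sql("CREATE TABLE IF NOT EXISTS hive_cat.db.sales (id BIGINT, amount DOUBLE) USING iceberg")
```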

#lakehouse #databricks #iceberg #deltalake #sql #spark #bigdata #dataengineering #opendataarchitecture #warehouse #architecture #mysql #aws #data


Written by Ani

Senior Software Engineer, Big Data — Passionate about designing robust distributed systems