Data Engineering
Lakehouse and the evolution of the Data Lake
Simplifying data infrastructure and accelerating innovation

The history of data storage goes back to the 1950s, when punch cards were used to store data generated by computers. A lot has changed since then, and this article covers one of the latest trends in the industry: the Lakehouse.

Before we jump into what a Lakehouse is and how you can benefit from it, let’s take a quick look at two data management paradigms widely used today.
Data Warehouse
The architecture for Data Warehouses was developed in the 1980s to support companies in their decision-making process. The central concept relies on having historical data processed and stored in two forms: aggregated and granular.
Aggregated data contains high-level information, summarized by groups and exposing measures such as counts, sums, or averages; granular data contains information at the lowest level of detail that is relevant for the business analysis.
This data is then consumed by BI tools, where executives and other staff can visualize and analyze it in the form of reports and charts.
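To make the distinction concrete, here is a minimal sketch in PySpark that derives an aggregated view from granular rows; the dataset, column names, and figures are invented for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("warehouse-example").getOrCreate()

# Granular data: one row per individual sale (hypothetical figures).
sales = spark.createDataFrame(
    [("2020-10-01", "north", 120.0),
     ("2020-10-01", "south", 80.0),
     ("2020-10-02", "north", 95.0)],
    ["sale_date", "region", "amount"],
)

# Aggregated data: summarized by group, exposing measures such as sums and averages.
sales.groupBy("region").agg(
    F.sum("amount").alias("total_amount"),
    F.avg("amount").alias("avg_amount"),
).show()
```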
Data Lake
With the advent of big data, traditional architectures like the data warehouse had to be rethought. With data arriving from different sources, in different formats, and usually in far larger volumes, a new paradigm was needed to fill this gap. In a data lake, the data is stored in its raw format and is only queried when a business question arises, retrieving the relevant data that can then be analyzed to help answer the question. The data typically lives in cloud object storage such as Amazon S3, one of the largest and most cost-effective storage systems in the world, which makes it possible to store practically limitless amounts of data in its native format at a low cost.
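As a rough sketch of that on-demand querying pattern, the example below points Spark at raw JSON files in object storage only when a question comes up; the bucket, paths, and columns are hypothetical, and reading from S3 assumes the usual Hadoop S3 connector and credentials are configured.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-lake-example").getOrCreate()

# Data sits in its raw, native format (JSON here) until a business question arises.
events = spark.read.json("s3a://my-data-lake/raw/events/2020/10/")

# Only then is the relevant slice retrieved and analyzed.
events.filter(events.event_type == "purchase") \
      .groupBy("country") \
      .count() \
      .show()
```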

Lakehouse: The best of both worlds
Data lakes, like any emerging technology, started to reveal some of their weaknesses. Key features from databases are missing: transactions are not supported, data quality is hard to enforce, and data lakes rely on an eventual consistency model rather than a strong consistency model.
Eventual consistency relies on the idea that, if no new updates are made to a given data record, all accesses to that record will eventually return the last updated value, whereas strong consistency always returns up-to-date data but at the cost of higher latency.
Earlier this year, Databricks announced the Lakehouse, a new paradigm that merges the best of the two worlds, including their names.
The Lakehouse’s main goal is to bring the key features of data warehouses into the data lake model, and to achieve that there is a new framework: Delta Lake.
Delta Lake is an open-source framework that serves as a storage layer for Apache Spark and other big data platforms. It was developed by Databricks and has been adopted by companies such as Alibaba, McAfee, Nielsen, Tableau, Talend, and eBay.
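As a minimal sketch of what that storage layer looks like from Spark (assuming a session started with the Delta Lake package available, and a hypothetical local path), writing and reading a Delta table works like any other data source:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-example").getOrCreate()

df = spark.range(0, 5)

# Writing a Delta table produces Parquet data files plus a transaction log.
df.write.format("delta").mode("overwrite").save("/tmp/delta/numbers")

# Reading it back uses the same DataFrame API as any other source.
spark.read.format("delta").load("/tmp/delta/numbers").show()
```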
Delta Lake brings support for ACID transactions to lakehouses, guaranteeing atomicity and durability on top of the storage system.
In computer science, ACID (atomicity, consistency, isolation, durability) is a set of properties of database transactions intended to guarantee data validity despite errors, power failures, and other mishaps. — Wikipedia
Specifically, Delta Lake performs transactional operations through the LogStore API rather than accessing the storage system directly, in order to enforce three properties on the storage system:
- Atomic visibility: Ensures that a file is visible in its entirety or not visible at all.
- Mutual exclusion: Ensures that only one writer is able to create or rename a file at the final destination.
- Consistent listing: Ensures that once a file is written in a directory, all future listings for that directory must return that file.
Supported Storage Systems
Delta Lake has built-in LogStore implementations for HDFS, Amazon S3, and Azure storage services.
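As a hedged sketch of what that configuration looks like for Amazon S3 (the class name follows the Delta Lake 0.x storage documentation and may differ in newer releases; the bucket is hypothetical, and the usual S3 connector and credentials are assumed):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("delta-s3-example")
    # Tell Delta Lake to use its S3 LogStore implementation for transactional writes.
    .config("spark.delta.logStore.class",
            "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore")
    .getOrCreate()
)

# With the LogStore configured, Delta tables can be written directly to S3.
spark.range(0, 5).write.format("delta").save("s3a://my-bucket/delta/numbers")
```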
Integrations
You can access Delta tables from Apache Spark and other data processing systems. Reading from external engines is supported through a manifest file, which lists the data files to read when querying a table; a sketch of generating one follows the list below.
Here is the list of integrations that are currently available:
- Presto
- Athena
- Redshift Spectrum (Experimental)
- Snowflake (Experimental)
- Apache Hive
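As a minimal sketch, generating the manifest for a Delta table from PySpark looks like this (the table path is hypothetical); external engines such as Presto or Athena are then pointed at the generated manifest rather than at the Delta log itself.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.appName("manifest-example").getOrCreate()

delta_table = DeltaTable.forPath(spark, "/tmp/delta/numbers")

# Writes a _symlink_format_manifest directory listing the table's current data files.
delta_table.generate("symlink_format_manifest")
```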
More information on how to set up integrations using manifest files and query Delta tables can be found at https://docs.delta.io/latest/integrations.html
Lakehouse Architecture

The main difference between data lakes and lakehouses is the addition of a structured transactional layer.
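To make that difference concrete, the sketch below inspects that transactional layer: every committed write appears as a version in the table’s transaction log (the table path is hypothetical).

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.appName("history-example").getOrCreate()

delta_table = DeltaTable.forPath(spark, "/tmp/delta/numbers")

# Each committed write shows up as a new version in the _delta_log transaction log.
delta_table.history().select("version", "timestamp", "operation").show()
```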

Final Thoughts
In this article, we saw how much data technology has evolved, and the pace of innovation only keeps accelerating as technology becomes cheaper and companies need to adapt quickly to an always-changing environment. Let’s see how long it takes for the next paradigm to come up.
References:
- Keith D. Foote (April 19, 2018). A Brief History of the Data Warehouse. Retrieved from https://www.dataversity.net/brief-history-data-warehouse/
- Ben Lorica et al. (January 30, 2020). What is a Lakehouse? Retrieved from https://databricks.com/blog/2020/01/30/what-is-a-data-lakehouse.html
- Aggregate and granular information. Retrieved from https://www.cese.nsw.gov.au/effective-practices/using-data-with-confidence-main/aggregate-granular
- Saurabh.v (July 16, 2017). Eventual vs Strong Consistency in Distributed Databases. Retrieved from https://hackernoon.com/eventual-vs-strong-consistency-in-distributed-databases-282fdad37cf7
- ACID (October 13, 2020). Wikipedia. Retrieved from https://en.wikipedia.org/wiki/ACID
- Delta Lake (2020). Delta [Source Code]. https://github.com/delta-io/delta
- Delta Lake (2020). Storage configuration. Retrieved from https://docs.delta.io/latest/delta-storage.html