What is the difference between a data lake and a data warehouse?
- Introduction
- Data lakes and data warehouses
- Criteria to choose lake and warehouse tools
- Conclusion
- Further reading
- References
Introduction
With the data ecosystem growing fast, new terms are coming up every week. Some of the most popular ones include “data lakes” and “data warehouses”. If you are
Trying to understand the differences between a data lake and a data warehouse
Frustrated by vendor marketing content aimed at selling their lake/warehouse product?
Then this post is for you. By the end of this post, you will understand what data lakes and warehouses are, and how to choose the right tools for your data lakes and warehouses.
Data lakes and data warehouses
Data lake
A data lake is a storage location where raw data is dumped and used as the data source for your data warehouse.
Compared to a data warehouse, data lakes are quick to set up due to not needing data cleaning and modeling.
Recommended reading: What is a staging area?
Data warehouse
Data warehousing is the process of understanding data, analyzing end-user usage patterns, curating, cleaning, modeling, and quality testing the data.
The result of data warehousing is ready-to-use data (aka the data warehouse). The data in a warehouse is used to compute critical business KPIs.
Due to the curation and cleaning work required, it is usually slower to set up compared to a data lake.
Recommended reading: What is a data warehouse?
Criteria to choose lake and warehouse tools
Let’s go over the criteria for choosing your tools.
Factor | Cloud storage service (e.g. S3) | OLAP DB (e.g. Snowflake) |
---|---|---|
Data schema |
Data of any schema can be stored. Systems like Spark can infer the schema when reading the data (schema-on-read ). |
Data schema needs to be defined to load it into a table(schema-on-write ). Some OLAP DBs allow unstructured columns letting users store data without defining schema(e.g. storing unstructured data as a column
). |
Data format |
Supports all the popular and custom formats. | Supports all the popular formats (e.g. Avro, Parquet, ORC, etc). Most OLAP DBs have their custom internal format for storage. |
Data size |
Storage costs less than OLAP DB. Typically storage ranges from bytes to Peta and Exabytes. | Storage costs more than cloud storage. Alternatively, data can be stored in cloud storage and schema can be defined as an external table. |
Data ingest |
Data can be ingested in bulk or in streams. | Data can be ingested in bulk or streams.(e.g. snowflake-kafka-connector ) |
Data read |
Data can be read in bulk and streams. There is support for event triggering in response to incoming data. E.g. Trigger AWS job when data lands in S3 | Data can be read in bulk and in streams(e.g. Snowflake streams ). OLAP DBS does not support triggering events based on incoming data. |
Data process |
Needs a separate service to process data. (E.G. Spark, Presto, Python, Scala, ) | OLAP DBs excel at processing large amounts of data. There are certain types of data like audio and video that OLAP DBs are not good at processing. |
End user |
The end users need to be technically savvy to access the data. | The end users can be anyone who knows SQL or a data visualization tool that allows querying data without SQL. (E.G. Metabase, Looker ) |
Batch v streaming |
Generally used as data dumps in batch and streaming data pipelines. | OLAP DBs are better suited for batch processing due to higher latency and processing data on-demand. Materialized views can provide up to date views, but they have their limitations . |
Data management |
Folder structures, paths, metadata, permissions, individual file sizes, etc will need to be managed. | Most OLAP DBS have in-built metadata tables, structured permission schema, etc making management easier. |
Historically, data lakes have been associated with cloud storage services and data warehouses with OLAP DBs. But with new systems like delta lake , Snowflake , and Clickhouse the boundaries are dissolving.
When choosing a tool for your data pipeline use the table above to make a good choice. Note that every system has its nuances, so make sure to read its documentation regarding the above points.
Conclusion
To recap, we saw
The main idea is to think of both data lake and warehouse as concepts and not tools. When you have to choose between cloud storage and OLAP DBs for your data pipeline, use the points here to make the best choice.
If you have any questions, comments, or ideas please leave them in the comment section below.
Further reading
- Read this article for a quick introduction to data warehouses.
- Read this article to understand the need for staging area.
- Curious about ways to load data into a data warehouse? Checkout this article .
- Read this to understand differences between ETL & ELT.
- dbt tutorial
References
If you found this article helpful, share it with a friend or colleague using one of the socials below!