What do Snowflake, Databricks, Redshift, BigQuery actually do?
- 1. Introduction
- 2. Analytical databases aggregate large amounts of data
- 3. Most platforms enable you to do the same thing but have different strengths
- 3.1. Understand how the platforms process data
- 3.1.1. A compute engine is a system that transforms data
- 3.1.2. Metadata catalog stores information about datasets
- 3.1.3. Data platform support for SQL, Dataframe, and Dataset APIs
- 3.1.4. Query planner turns your code into concrete execution steps
- 3.1.5. Switch on machines only when needed (aka Serverless)
- 3.1.6. Scheduling and orchestration
- 3.2. How data is stored will determine how efficiently it can be processed
- 3.3. Ensure seamless data querying for end-users
- 3.4. Data platforms are expensive
- 3.5. Data governance: Know the what is/where is/upstream of datasets
- 3.6. Managing data platform is a lot of work (aka ops)
- 4. Conclusion
- 5. Further Reading
1. Introduction
If you are in the data space (or looking to get into it), you have probably come across platforms like Databricks, Snowflake, Redshift, BigQuery, etc. If you
- Don’t understand their purpose besides just running Spark, SQL, etc.
- Are unclear which platform to choose for your use case
- Wonder if a data warehouse is just a place where you can store all your data
- Wonder if you need to pay big bucks to do analytical data processing
- Are not entirely clear why one platform may be “better” than another
This post is for you! In it, we will discuss analytical databases and the benefits provided by data platforms.
By the end of this post, you will have a clear overview of what you will get by paying a premium.
2. Analytical databases aggregate large amounts of data
Analytical data processing systems (aka OLAP) aggregate large amounts of data. Aggregation of historical data enables stakeholders to find insights or see how your company/organization is performing.
There are multiple open-source analytical data processing systems, such as Apache Spark, Trino, etc. However, these systems are generally distributed (i.e., they use multiple machines) and need monitoring and resource allocation, which takes engineering time.
Snowflake, Databricks, Redshift, and BigQuery are paid solutions that make it easy for you to run and maintain analytical data processing systems. Some of these platforms (e.g., Databricks, AWS Athena, Starburst) provide managed versions of open-source software (e.g., Apache Spark, Trino, etc.).
Ref: What is the difference between OLTP and OLAP
3. Most platforms enable you to do the same thing but have different strengths
If you are trying to understand the capabilities of big data platforms, it’s helpful to think in terms of key concepts. For each concept, we will see how Databricks, Snowflake, and Redshift compare. In addition to the paid platforms, we will also see how/if open-source Apache Spark provides these capabilities.
Let’s go over the key concepts to be mindful of below:
3.1. Understand how the platforms process data
To compare how data platforms process data, we have to look at the individual components involved in processing data; they are:
3.1.1. A compute engine is a system that transforms data
Compute engine refers to the software/code that transforms your data (based on your code).
- Apache Spark: Apache Spark converts your code into a series of data processing steps, which are executed by a system called Tungsten.
- Databricks: Tungsten (see previous point). You can optionally use the Photon execution engine. Photon is a Databricks-only execution engine written in C++ for better memory management and performance.
- Snowflake/Redshift/BigQuery: Custom execution engines. Since these systems are closed-source, we cannot access the execution engine code. However, most data platforms have documentation that provides a high-level architecture overview.
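To make the idea of a compute engine concrete, here is a minimal PySpark sketch. The dataframe code only describes the transformation; the engine (Tungsten, or Photon on Databricks) is what actually executes it when an action such as `show()` runs. The column names and data below are made up for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("compute-engine-demo").getOrCreate()

# Hypothetical orders data; in practice this would come from your tables/files.
orders = spark.createDataFrame(
    [(1, "books", 12.50), (2, "books", 30.00), (3, "games", 55.00)],
    ["order_id", "category", "amount"],
)

# This only *describes* the transformation; nothing runs yet (lazy evaluation).
revenue_per_category = orders.groupBy("category").agg(F.sum("amount").alias("revenue"))

# The compute engine executes the plan when an action is triggered.
revenue_per_category.show()
```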
3.1.2. Metadata catalog stores information about datasets
A metadata catalog stores information about datasets, such as schema groups, table schemas, partition schemas, and row counts.
The Hive metastore is a popular metadata catalog for storing metadata information. Most open source systems (Apache Spark, Trino) have integrations that allow them to use Hive metastore to store information about their tables.
- Apache Spark: You can use an in-memory catalog (no table definitions shared across sessions) or a Hive metastore. If you use a Hive metastore, note that you have to maintain an OLTP database to back it. Alternatively, you can manage metadata via the recently open-sourced Unity Catalog.
- Databricks: Databricks has a managed Unity Catalog. Unity Catalog enables you to store metadata and has additional features (over the open-source version) such as audits, data lineage tracking, etc.
- Snowflake/Redshift/BigQuery: Custom metadata management (ref: Snowflake cloud services, Redshift metadata, etc.).
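As a rough illustration of what a catalog gives you, the PySpark sketch below registers a table and then reads its metadata back. The database and table names are hypothetical; whether the metadata lands in the session-local catalog, a Hive metastore, or Unity Catalog depends on how your session is configured.

```python
from pyspark.sql import SparkSession

# enableHiveSupport() uses a Hive metastore if one is configured;
# otherwise Spark falls back to a local/embedded catalog.
spark = (
    SparkSession.builder.appName("catalog-demo").enableHiveSupport().getOrCreate()
)

spark.sql("CREATE DATABASE IF NOT EXISTS analytics")  # hypothetical database name

df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

# Saving as a table records its schema and location in the metadata catalog.
df.write.mode("overwrite").saveAsTable("analytics.users")

# Other sessions/engines can now discover the table through the catalog.
print(spark.catalog.listTables("analytics"))
print(spark.table("analytics.users").schema)
```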
3.1.3. Data platform support for SQL, Dataframe, and Dataset APIs
The code interface defines how an engineer can instruct the data processing system to process data. Analytical data systems typically support SQL, with some also supporting dataframe and dataset APIs.
Note that most analytical systems have Python libraries that allow you to connect to them, but you will still have to write SQL to process data.
- Apache Spark: Apache Spark enables you to code in SQL, Python (dataframe), R (dataframe), Java (dataframe & dataset), & Scala (dataframe & dataset).
- Databricks: Same language support as Apache Spark: SQL, Python, R, Java, & Scala.
- Snowflake/Redshift/BigQuery: SQL. These data platforms offer limited support for data processing in Python by creating UDFs. Snowflake also has Snowpark, a Spark-like dataframe API that runs inside your Snowflake infrastructure and lets you write Python, Java, or Scala.
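To show the difference between the two main interfaces, the sketch below expresses the same aggregation through the SQL API and the dataframe API in PySpark. The table and column names are made up.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("interface-demo").getOrCreate()

sales = spark.createDataFrame(
    [("2024-01-01", "US", 100.0), ("2024-01-01", "DE", 80.0)],
    ["sale_date", "country", "amount"],
)
sales.createOrReplaceTempView("sales")

# SQL interface: supported (in some form) by every platform discussed here.
sql_result = spark.sql(
    "SELECT country, SUM(amount) AS total FROM sales GROUP BY country"
)

# Dataframe interface: Spark/Databricks; Snowflake offers something similar via Snowpark.
df_result = sales.groupBy("country").agg(F.sum("amount").alias("total"))

sql_result.show()
df_result.show()
```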
3.1.4. Query planner turns your code into concrete execution steps
Most analytical data processing systems follow a declarative programming format, i.e., you tell them what to do, and it decides how to do it.
- Apache Spark: Uses the Catalyst optimizer to plan how to execute your query/dataframe processing. You can use the RDD APIs to control precisely how data is processed.
- Databricks: Same as Apache Spark.
- Snowflake/Redshift/BigQuery: Custom query planners.
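You can see the declarative model in action in Spark by asking for the plan Catalyst produced; the other platforms expose similar information via EXPLAIN statements or query profiles. A minimal sketch with made-up data:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("planner-demo").getOrCreate()

events = spark.createDataFrame(
    [("click", 1), ("view", 2), ("click", 3)], ["event_type", "user_id"]
)

# You describe *what* you want...
clicks_per_user = (
    events.filter(F.col("event_type") == "click")
    .groupBy("user_id")
    .count()
)

# ...and the query planner (Catalyst) decides *how* to execute it.
clicks_per_user.explain(mode="formatted")  # prints the optimized/physical plan
```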
3.1.5. Switch on machines only when needed (aka Serverless)
Most analytical database engines offer separate compute and storage. This separation enables these engines to run compute machines only when needed.
- Apache Spark: You must manage switching clusters on and off yourself.
- Databricks: You can spin up clusters as needed. Note that spin-up times can be high.
- Snowflake/Redshift/BigQuery: You can pre-define the types of clusters to spin up based on who is querying. Spin-up times are very fast.
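For open-source Spark, the closest built-in knob is dynamic allocation, which adds and removes executors based on load; the cluster machines themselves still have to be started and stopped by you (or your cluster manager). A minimal configuration sketch, assuming a cluster manager that supports dynamic allocation:

```python
from pyspark.sql import SparkSession

# Dynamic allocation scales executors up/down with demand; it does not
# start or stop the underlying machines — with open-source Spark that
# part is still on you or your cloud/cluster manager.
spark = (
    SparkSession.builder.appName("elastic-demo")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "0")
    .config("spark.dynamicAllocation.maxExecutors", "10")
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .getOrCreate()
)
```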
3.1.6. Scheduling and orchestration
Scheduling refers to a system that knows when to start a pipeline and starts it. Orchestration refers to a system that runs your data pipeline tasks in the order you need.
- Apache Spark: Apache Spark builds a DAG of tasks when you tell it to write out data, but you have to manage scheduling the Spark job yourself.
- Databricks: With Databricks Jobs, you can schedule and orchestrate your pipeline tasks. Depending on your use case, you can also use Delta Live Tables.
- Snowflake/Redshift/BigQuery: Not available. However, you can use materialized/dynamic tables (with caveats) if they suit your use case.
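With plain Apache Spark, "scheduling" usually means something external (cron, Airflow, etc.) invoking spark-submit. Below is a minimal sketch of a job script you could schedule that way; the paths are placeholders, and the S3 connector and credentials are assumed to be configured.

```python
# pipeline_job.py — a minimal sketch of a job you might schedule externally,
# e.g. via cron:  0 2 * * *  spark-submit pipeline_job.py
# (Databricks Jobs or an orchestrator such as Airflow would replace the cron part.)
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-orders-rollup").getOrCreate()

# Hypothetical input/output locations; assumes hadoop-aws and credentials are set up.
orders = spark.read.parquet("s3a://my-bucket/raw/orders/")

daily_revenue = orders.groupBy("order_date").agg(F.sum("amount").alias("revenue"))

daily_revenue.write.mode("overwrite").parquet("s3a://my-bucket/marts/daily_revenue/")
```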
3.2. How data is stored will determine how efficiently it can be processed
With companies using multiple data processing systems (e.g., Python for DS, SQL for DAs, etc.), your data platform of choice must allow seamless access to any data format stored in any location.
3.2.1. Ensure that data can be stored where you need to
If your company or organization uses multiple storage locations (S3, Blob store, SFTP, FTP, etc.), it’s critical to check that a data platform can read/write to the required locations.
- Apache Spark: Can read from and write to most storage locations and formats.
- Databricks: Can read from and write to most storage locations and formats. Note that if you are using multiple cloud providers, you will need to set up the appropriate permissions.
- Snowflake: Snowflake can run on any cloud provider, and with the appropriate permissions, you can easily read/write to most locations.
- Redshift/BigQuery: Cloud vendor-specific analytical data systems only enable you to read/write to their cloud's storage systems and open protocols like SFTP, FTP, etc.
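As a hedged sketch of what "read from and write to most storage locations" looks like in Spark, here is one job that reads from S3 and writes to Azure Data Lake Storage. Bucket/container names and the account key are placeholders, and the relevant Hadoop connectors (hadoop-aws, hadoop-azure) must be on the classpath.

```python
from pyspark.sql import SparkSession

# Credentials/config values below are placeholders — supply your own.
spark = (
    SparkSession.builder.appName("cross-storage-demo")
    .config(
        "spark.hadoop.fs.azure.account.key.<storage-account>.dfs.core.windows.net",
        "<access-key>",
    )
    .getOrCreate()
)

# Read Parquet from S3...
df = spark.read.parquet("s3a://my-bucket/raw/events/")

# ...and write it to Azure Data Lake Storage (hypothetical paths).
df.write.mode("overwrite").parquet(
    "abfss://my-container@<storage-account>.dfs.core.windows.net/events/"
)
```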
3.2.2. Use platforms that support open file and table formats
Open-source column formats (e.g., Parquet, ORC) and table formats (E.g., Apache Iceberg, Hudi, Delta Lake) enable better query performance and give you access to advanced features (What is open table format and its benefits? ).
- Apache Spark: Supports all open file and table formats.
- Databricks: Native support for the Delta Lake format; Apache Iceberg and Hudi are supported as well. Databricks has strong support for the Delta format (and they recently purchased the company behind Apache Iceberg).
- Snowflake: Has a custom in-house table format (with features like time travel, clone, etc.) and recently added support for Apache Iceberg.
- Redshift/BigQuery: Supported, with some gotchas.
Note that closed data systems (Snowflake, BigQuery, etc.) had implemented the table format features (zero-copy cloning, time travel, etc.) long before they were available in open-source table formats.
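To illustrate the difference, the sketch below writes the same dataframe as plain Parquet (an open file format) and as a Delta Lake table (an open table format). It assumes the delta-spark package is installed and a compatible Spark version; paths are placeholders.

```python
from pyspark.sql import SparkSession

# Assumes `pip install delta-spark`; this helper wires the Delta jars into the session.
from delta import configure_spark_with_delta_pip

builder = (
    SparkSession.builder.appName("formats-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

df = spark.range(5).withColumnRenamed("id", "user_id")

# Open file format: columnar Parquet files, readable by many engines.
df.write.mode("overwrite").parquet("/tmp/users_parquet")

# Open table format: Delta Lake adds a transaction log, time travel, etc.
df.write.format("delta").mode("overwrite").save("/tmp/users_delta")
```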
3.2.3. Share data seamlessly with data sharing
A common use case for data warehouses is sending data to external stakeholders (think your company's clients). Typically, data is sent to stakeholders via a script/data pipeline that reads data from the warehouse and dumps it in the stakeholder's location.
Most stakeholders then reload the data into their own warehouse for querying. To address this use case, most data platforms enable customers to share data directly.
- Apache Spark: Delta Sharing is the data sharing standard for the Spark ecosystem (via Delta Lake). However, you have to manage the Delta Sharing server, which keeps track of who has access to which data.
- Databricks: Delta Sharing with additional features.
- Snowflake: Snowflake has its own version of secure data sharing.
- Redshift: Only shareable with other AWS services (see the data share overview).
Note that most data platforms only enable sharing data with stakeholders using the same data platform.
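On the consumer side, Delta Sharing exposes shared tables through a small client library. A hedged sketch is below; the profile file and the share/schema/table coordinates are placeholders supplied by whoever shares the data with you.

```python
# pip install delta-sharing
import delta_sharing

# The data provider sends you a profile file with the sharing server URL and token.
profile_file = "config.share"  # placeholder path

# "share.schema.table" coordinates are defined by the data provider.
table_url = profile_file + "#my_share.sales.daily_revenue"

# Load the shared table without first copying it into your own warehouse.
df = delta_sharing.load_as_pandas(table_url)
print(df.head())
```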
3.3. Ensure seamless data querying for end-users
Some data platforms were designed to allow end users to query data as needed, while others were developed solely as data processing tools.
- Apache Spark: You have to manage Spark clusters yourself to be able to answer ad-hoc questions.
- Databricks: You can spin up clusters when needed (slow) or keep a cluster running at all times for quick access (costly).
- Snowflake/Redshift/BigQuery: You can spin up clusters quickly as needed.
3.4. Data platforms are expensive
Data platforms are expensive, and the pricing model is complex (especially if you use advanced features). The most significant part of your bill will be the amount you pay for the compute engine, followed by storage and operations costs.
- Apache Spark: Free and open source. However, you must spend engineering hours setting up and maintaining clusters.
- Databricks: You have many compute machine options. Use this pricing calculator tool to create an estimate.
- Snowflake: You have many compute machine options. Use this pricing calculator tool to create an estimate. The pricing model is complex enough that you often end up having to learn how to save costs.
- Redshift/BigQuery: Similar to the above, you will pay for compute, storage, networking, and special features. Use the documentation & pricing calculators to get an estimate.
At a high level, you will end up paying for:
- Compute: Typically the most expensive part of your bill (usually at ~80% of your bill).
- Storage: Typically the second most expensive part of your bill (usually at ~15% of your bill).
- Special services: Specialized services (e.g., Snowpipe, delta live tables, photon data processor, ML) cost extra on top of standard compute (usually at ~5% of your bill).
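To make the breakdown concrete, here is a purely illustrative back-of-the-envelope estimate. Every rate and usage number below is hypothetical; use your vendor's pricing page or calculator for real figures.

```python
# Hypothetical rates — check your vendor's pricing calculator for real numbers.
compute_hours_per_month = 500          # cluster/warehouse hours
compute_rate_per_hour = 4.00           # $ per hour for your chosen size
storage_tb = 15
storage_rate_per_tb_month = 23.00      # $ per TB-month
special_features_cost = 130.00         # e.g. streaming ingestion, ML features

compute_cost = compute_hours_per_month * compute_rate_per_hour
storage_cost = storage_tb * storage_rate_per_tb_month
total = compute_cost + storage_cost + special_features_cost

print(f"compute ${compute_cost:,.0f}, storage ${storage_cost:,.0f}, total ${total:,.0f}")
# With these made-up numbers, compute is ~80%, storage ~14%, special features ~5%,
# roughly matching the split described above.
```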
3.5. Data governance: Know the what is/where is/upstream of datasets
Data governance is an umbrella term used to define
- What a dataset is
- Who has access to which dataset
- What datasets are available
- How to share datasets
- How a specific dataset is created (i.e., what its upstream sources are)
- Defining data quality standards and checking whether a dataset meets them
- Data security and access (e.g., ensuring PII information is only accessible when needed, etc.)
Most data platforms offer features for data governance:
- Apache Spark: If you do not have a dedicated data governance system, you will need to handle governance yourself.
- Databricks: Has a set of features to help with data governance.
- Snowflake/Redshift/BigQuery: Have a set of features to help with data governance.
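As a very small taste of access-control-style governance in plain Spark, the sketch below exposes a view that drops a PII column. Managed platforms (Unity Catalog, Snowflake, etc.) layer proper grants, masking policies, audits, and lineage on top of this; the table and column names here are made up.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("governance-demo").getOrCreate()

customers = spark.createDataFrame(
    [(1, "alice@example.com", "US"), (2, "bob@example.com", "DE")],
    ["customer_id", "email", "country"],  # email is PII
)
customers.createOrReplaceTempView("customers")

# Minimal governance measure: analysts query a view that excludes the PII column.
spark.sql(
    "CREATE OR REPLACE TEMP VIEW customers_no_pii AS "
    "SELECT customer_id, country FROM customers"
)
spark.table("customers_no_pii").show()
```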
3.6. Managing data platform is a lot of work (aka ops)
Operations (aka ops) refers to the tools/techniques for managing data infrastructure: what happens when the system goes down, how to deploy new code to a cluster, what happens when a specific cloud region is down (e.g., if AWS us-east-1 is down, does your Snowflake instance still work?), what happens when infrastructure breaks (aka disaster recovery), etc.
Most enterprise analytical systems have their own recommendations for disaster recovery (e.g., Snowflake disaster recovery).
However, setting up Databricks requires some work to integrate the different systems involved, such as code repos, cloud providers, etc. Operations are “easier” in Snowflake because you can run queries directly on Snowflake without deploying jars to any clusters; you do not have to manage clusters manually, and Snowflake can automatically spin up compute when you run a query.
4. Conclusion
To recap, we saw:
- What analytical data processing systems are
- How data processing works in analytical data processing systems
- Data storage & sharing formats
- Data platform support for data warehousing
- Data governance
- How managing data systems is a full-time job
Data platforms that started out as warehouses (e.g., Snowflake, BigQuery, Redshift, etc.) are generally easier to manage and set up. Data systems that started out as data processing systems (e.g., Spark) require a more hands-on approach to reach the same level of feature availability as the data warehouse systems, but they are also more powerful since they provide a deeper level of control over how data is processed and stored.
Most data platforms are converging in terms of features, e.g., Snowpipe from Snowflake and Warehouse pattern support from Databricks.
Finally, marketing also plays a huge role in how a data platform is perceived (e.g., what platform comes to mind when you hear "data lakehouse"?). When you are deciding between them, think in terms of features and do not base your decision just on marketing material.
The next time you are working to understand what a data platform offers, use the list above to guide you. Please let me know in the comment section below if you have any questions or comments.
5. Further Reading
- What is a data warehouse
- SQL or Python for data transformations?
- How to reduce your snowflake costs
- SQL tips to level up your DE skills
If you found this article helpful, share it with a friend or colleague using one of the socials below!