What do Snowflake, Databricks, Redshift, BigQuery actually do?
- 1. Introduction
- 2. Analytical databases aggregate large amounts of data
- 3. Most platforms enable you to do the same thing but have different strengths
- 3.1. Understand how the platforms process data
- 3.1.1. A compute engine is a system that transforms data
- 3.1.2. Metadata catalog stores information about datasets
- 3.1.3. Data platform support for SQL, Dataframe, and Dataset APIs
- 3.1.4. Query planner turns your code into concrete execution steps
- 3.1.5. Switch on machines only when needed (aka Serverless)
- 3.1.6. Scheduling and orchestration
- 3.2. How data is stored will determine how efficiently it can be processed
- 3.3. Ensure seamless data querying for end-users
- 3.4. Data platforms are expensive
- 3.5. Data governance: Know the what is/where is/upstream of datasets
- 3.6. Managing data platform is a lot of work (aka ops)
- 4. Conclusion
- 5. Further Reading
1. Introduction
If you are in the data space (or looking to get into it), you have probably come across platforms like Databricks, Snowflake, Redshift, BigQuery, etc. If you
- Don’t understand their purpose besides just running Spark, SQL, etc.
- Are unclear which platform to choose for your use case
- Wonder if a data warehouse is just a place where you can store all your data
- Wonder if you need to pay big bucks to do analytical data processing
- Are not entirely clear why one platform may be “better” than another
This post is for you! In it, we will discuss analytical databases and the benefits provided by data platforms.
By the end of this post, you will have a clear overview of what you will get by paying a premium.
2. Analytical databases aggregate large amounts of data
Analytical data processing systems (aka OLAP) aggregate large amounts of data. Aggregation of historical data enables stakeholders to find insights or see how your company/organization is performing.
There are multiple open-source analytical data processing systems, such as Apache Spark, Trino, etc. However, these systems are generally distributed (i.e., they use multiple machines) and need monitoring and resource allocation, which takes engineering time.
Snowflake, Databricks, Redshift, and BigQuery are paid solutions that make it easy for you to run and maintain analytical data processing systems. Some of these platforms (e.g., Databricks, AWS Athena, Starburst) provide managed versions of open-source software (e.g., Apache Spark, Trino, etc.).
Ref: What is the difference between OLTP and OLAP
3. Most platforms enable you to do the same thing but have different strengths
If you are trying to understand the capabilities of big data platforms, it’s helpful to think in terms of key concepts. For each concept, we will see how Databricks, Snowflake, and Redshift compare. In addition to the paid platforms, we will also see how/if open-source Apache Spark provides these capabilities.
Let’s go over the key concepts to be mindful of below:
3.1. Understand how the platforms process data
To compare how data platforms process data, we have to look at the individual components involved in processing data; they are:
3.1.1. A compute engine is a system that transforms data
Compute engine refers to the software/code that transforms your data (based on your code).
- Apache Spark: Apache Spark converts your code into a series of data processing steps, which are executed by a system called Tungsten.
- Databricks: Tungsten (see previous point). You can optionally use the Photon execution engine. Photon is a Databricks-only execution engine written in C++ for better memory management and performance.
- Snowflake/Redshift/BigQuery: Custom execution engines. Since these systems are closed-source, we cannot access the execution engine code. However, most data platforms have documentation that provides a high-level architecture overview.
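To make the idea of a compute engine concrete, here is a minimal PySpark sketch. The dataframe code only describes the transformation; the engine (Tungsten, or Photon on Databricks) is what actually executes it when an action such as `show()` runs. The column names and data below are made up for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("compute-engine-demo").getOrCreate()

# Hypothetical orders data; in practice this would come from your tables/files.
orders = spark.createDataFrame(
    [(1, "books", 12.50), (2, "books", 30.00), (3, "games", 55.00)],
    ["order_id", "category", "amount"],
)

# This only *describes* the transformation; nothing runs yet (lazy evaluation).
revenue_per_category = orders.groupBy("category").agg(F.sum("amount").alias("revenue"))

# The compute engine executes the plan when an action is triggered.
revenue_per_category.show()
```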
3.1.2. Metadata catalog stores information about datasets
A metadata catalog stores information about datasets, such as schema groups, table schemas, partition schemas, and row counts.
The Hive metastore is a popular metadata catalog for storing metadata information. Most open source systems (Apache Spark, Trino) have integrations that allow them to use Hive metastore to store information about their tables.
- Apache Spark: You can use an in-memory catalog (no table definitions shared across sessions) or a Hive metastore. If you use a Hive metastore, note that you have to maintain an OLTP database to back it. Alternatively, you can manage metadata via the recently open-sourced Unity Catalog.
- Databricks: Databricks has a managed Unity Catalog. Unity Catalog enables you to store metadata and has additional features (over the open-source version) such as audits, data lineage tracking, etc.
- Snowflake/Redshift/BigQuery: Custom metadata management (ref: Snowflake cloud services, Redshift metadata, etc.).
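As a rough illustration of what a catalog gives you, the PySpark sketch below registers a table and then reads its metadata back. The database and table names are hypothetical; whether the metadata lands in the session-local catalog, a Hive metastore, or Unity Catalog depends on how your session is configured.

```python
from pyspark.sql import SparkSession

# enableHiveSupport() uses a Hive metastore if one is configured;
# otherwise Spark falls back to a local/embedded catalog.
spark = (
    SparkSession.builder.appName("catalog-demo").enableHiveSupport().getOrCreate()
)

spark.sql("CREATE DATABASE IF NOT EXISTS analytics")  # hypothetical database name

df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

# Saving as a table records its schema and location in the metadata catalog.
df.write.mode("overwrite").saveAsTable("analytics.users")

# Other sessions/engines can now discover the table through the catalog.
print(spark.catalog.listTables("analytics"))
print(spark.table("analytics.users").schema)
```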
3.1.3. Data platform support for SQL, Dataframe, and Dataset APIs
The code interface defines how an engineer can instruct the data processing system to process data. Analytical data systems typically support SQL, with some also supporting dataframe and dataset APIs.
Note that most analytical systems have Python libraries that allow you to connect to them, but you will still have to write SQL to process data.
- Apache Spark: Apache Spark enables you to code in SQL, Python (dataframe), R (dataframe), Java (dataframe & dataset), & Scala (dataframe & dataset).
- Databricks: Same language support as Apache Spark: SQL, Python, R, Java, & Scala.
- Snowflake/Redshift/BigQuery: SQL. These data platforms offer limited support for data processing in Python by creating UDFs. Snowflake also has Snowpark, a Spark-like dataframe API that runs inside your Snowflake infrastructure and lets you write Python, Java, or Scala.
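To show the difference between the two main interfaces, the sketch below expresses the same aggregation through the SQL API and the dataframe API in PySpark. The table and column names are made up.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("interface-demo").getOrCreate()

sales = spark.createDataFrame(
    [("2024-01-01", "US", 100.0), ("2024-01-01", "DE", 80.0)],
    ["sale_date", "country", "amount"],
)
sales.createOrReplaceTempView("sales")

# SQL interface: supported (in some form) by every platform discussed here.
sql_result = spark.sql(
    "SELECT country, SUM(amount) AS total FROM sales GROUP BY country"
)

# Dataframe interface: Spark/Databricks; Snowflake offers something similar via Snowpark.
df_result = sales.groupBy("country").agg(F.sum("amount").alias("total"))

sql_result.show()
df_result.show()
```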
3.1.4. Query planner turns your code into concrete execution steps
Most analytical data processing systems follow a declarative programming format, i.e., you tell them what to do, and it decides how to do it.
- Apache Spark: Uses the Catalyst optimizer to plan how to execute your query/dataframe processing. You can use the RDD APIs to control precisely how data is processed.
- Databricks: Same as Apache Spark.
- Snowflake/Redshift/BigQuery: Custom query planners.
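You can see the declarative model in action in Spark by asking for the plan Catalyst produced; the other platforms expose similar information via EXPLAIN statements or query profiles. A minimal sketch with made-up data:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("planner-demo").getOrCreate()

events = spark.createDataFrame(
    [("click", 1), ("view", 2), ("click", 3)], ["event_type", "user_id"]
)

# You describe *what* you want...
clicks_per_user = (
    events.filter(F.col("event_type") == "click")
    .groupBy("user_id")
    .count()
)

# ...and the query planner (Catalyst) decides *how* to execute it.
clicks_per_user.explain(mode="formatted")  # prints the optimized/physical plan
```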
3.1.5. Switch on machines only when needed (aka Serverless)
Most analytical database engines offer separate compute and storage. This separation enables these engines to run compute machines only when needed.
- Apache Spark: You must manage switching clusters on and off yourself.
- Databricks: You can spin up clusters as needed. Note that spin-up times can be high.
- Snowflake/Redshift/BigQuery: You can pre-define the types of clusters to spin up based on who is querying. Spin-up times are very fast.
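For open-source Spark, the closest built-in knob is dynamic allocation, which adds and removes executors based on load; the cluster machines themselves still have to be started and stopped by you (or your cluster manager). A minimal configuration sketch, assuming a cluster manager that supports dynamic allocation:

```python
from pyspark.sql import SparkSession

# Dynamic allocation scales executors up/down with demand; it does not
# start or stop the underlying machines — with open-source Spark that
# part is still on you or your cloud/cluster manager.
spark = (
    SparkSession.builder.appName("elastic-demo")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "0")
    .config("spark.dynamicAllocation.maxExecutors", "10")
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .getOrCreate()
)
```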
3.1.6. Scheduling and orchestration
Scheduling refers to a system that knows when to start a pipeline and starts it. Orchestration refers to a system that runs your data pipeline tasks in the order you need.
- Apache Spark: Apache Spark builds a DAG of tasks when you tell it to write out data, but you have to manage scheduling the Spark job yourself.
- Databricks: With Databricks Jobs, you can schedule and orchestrate your pipeline tasks. Depending on your use case, you can also use Delta Live Tables.
- Snowflake/Redshift/BigQuery: Not available. However, you can use materialized/dynamic tables (with caveats) if they suit your use case.
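With plain Apache Spark, "scheduling" usually means something external (cron, Airflow, etc.) invoking spark-submit. Below is a minimal sketch of a job script you could schedule that way; the paths are placeholders, and the S3 connector and credentials are assumed to be configured.

```python
# pipeline_job.py — a minimal sketch of a job you might schedule externally,
# e.g. via cron:  0 2 * * *  spark-submit pipeline_job.py
# (Databricks Jobs or an orchestrator such as Airflow would replace the cron part.)
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-orders-rollup").getOrCreate()

# Hypothetical input/output locations; assumes hadoop-aws and credentials are set up.
orders = spark.read.parquet("s3a://my-bucket/raw/orders/")

daily_revenue = orders.groupBy("order_date").agg(F.sum("amount").alias("revenue"))

daily_revenue.write.mode("overwrite").parquet("s3a://my-bucket/marts/daily_revenue/")
```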
3.2. How data is stored will determine how efficiently it can be processed
With companies using multiple data processing systems (e.g., Python for DS, SQL for DAs, etc.), your data platform of choice must allow seamless access to any data format stored in any location.
3.2.1. Ensure that data can be stored where you need to
If your company or organization uses multiple storage locations (S3, Blob store, SFTP, FTP, etc.), it’s critical to check that a data platform can read/write to the required locations.
- Apache Spark: Can read from and write to most storage locations and formats.
- Databricks: Can read from and write to most storage locations and formats. Note that if you are using multiple cloud providers, you will need to set up the appropriate permissions.
- Snowflake: Snowflake can run on any cloud provider, and with the appropriate permissions, you can easily read/write to most locations.
- Redshift/BigQuery: Cloud vendor-specific analytical data systems only enable you to read/write to their cloud's storage systems and open protocols like SFTP, FTP, etc.
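As a hedged sketch of what "read from and write to most storage locations" looks like in Spark, here is one job that reads from S3 and writes to Azure Data Lake Storage. Bucket/container names and the account key are placeholders, and the relevant Hadoop connectors (hadoop-aws, hadoop-azure) must be on the classpath.

```python
from pyspark.sql import SparkSession

# Credentials/config values below are placeholders — supply your own.
spark = (
    SparkSession.builder.appName("cross-storage-demo")
    .config(
        "spark.hadoop.fs.azure.account.key.<storage-account>.dfs.core.windows.net",
        "<access-key>",
    )
    .getOrCreate()
)

# Read Parquet from S3...
df = spark.read.parquet("s3a://my-bucket/raw/events/")

# ...and write it to Azure Data Lake Storage (hypothetical paths).
df.write.mode("overwrite").parquet(
    "abfss://my-container@<storage-account>.dfs.core.windows.net/events/"
)
```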
3.2.2. Use platforms that support open file and table formats
Open-source column formats (e.g., Parquet, ORC) and table formats (E.g., Apache Iceberg, Hudi, Delta Lake) enable better query performance and give you access to advanced features (What is open table format and its benefits? ).
- Apache Spark: Supports all open file and table formats.
- Databricks: Native support for the Delta Lake format; Apache Iceberg and Hudi are supported as well. Databricks has strong support for the Delta format (and they recently purchased the company behind Apache Iceberg).
- Snowflake: Has a custom in-house table format (with features like time travel, clone, etc.) and recently added support for Apache Iceberg.
- Redshift/BigQuery: Supported, with some gotchas.
Note that closed data systems (Snowflake, BigQuery, etc.) had implemented the table format features (zero-copy cloning, time travel, etc.) long before they were available in open-source table formats.
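To illustrate the difference, the sketch below writes the same dataframe as plain Parquet (an open file format) and as a Delta Lake table (an open table format). It assumes the delta-spark package is installed and a compatible Spark version; paths are placeholders.

```python
from pyspark.sql import SparkSession

# Assumes `pip install delta-spark`; this helper wires the Delta jars into the session.
from delta import configure_spark_with_delta_pip

builder = (
    SparkSession.builder.appName("formats-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

df = spark.range(5).withColumnRenamed("id", "user_id")

# Open file format: columnar Parquet files, readable by many engines.
df.write.mode("overwrite").parquet("/tmp/users_parquet")

# Open table format: Delta Lake adds a transaction log, time travel, etc.
df.write.format("delta").mode("overwrite").save("/tmp/users_delta")
```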
3.2.3. Share data seamlessly with data sharing
A common use case for data warehouses is sending data to external stakeholders (think your company's clients). Typically, data is sent to stakeholders via a script/data pipeline that reads data from the warehouse and dumps it in the stakeholder's location.
Most stakeholders then reload the data into their own warehouse for querying. To address this use case, most data platforms enable customers to share data directly.
- Apache Spark: Delta Sharing is the data sharing standard for the Spark ecosystem (via Delta Lake). However, you have to manage the Delta Sharing server, which keeps track of who has access to which data.
- Databricks: Delta Sharing with additional features.
- Snowflake: Snowflake has its own version of secure data sharing.
- Redshift: Only shareable with other AWS services (see the data share overview).
Note that most data platforms only enable sharing data with stakeholders using the same data platform.
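On the consumer side, Delta Sharing exposes shared tables through a small client library. A hedged sketch is below; the profile file and the share/schema/table coordinates are placeholders supplied by whoever shares the data with you.

```python
# pip install delta-sharing
import delta_sharing

# The data provider sends you a profile file with the sharing server URL and token.
profile_file = "config.share"  # placeholder path

# "share.schema.table" coordinates are defined by the data provider.
table_url = profile_file + "#my_share.sales.daily_revenue"

# Load the shared table without first copying it into your own warehouse.
df = delta_sharing.load_as_pandas(table_url)
print(df.head())
```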
3.3. Ensure seamless data querying for end-users
Some data platforms were designed to allow end users to query data as needed, while others were developed solely as data processing tools.
- Apache Spark: You have to manage Spark clusters yourself to be able to answer ad-hoc questions.
- Databricks: You can spin up clusters when needed (slow) or keep a cluster running at all times for quick access (costly).
- Snowflake/Redshift/BigQuery: You can spin up clusters quickly as needed.
3.4. Data platforms are expensive
Data platforms are expensive, and the pricing model is complex (especially if you use advanced features). The most significant part of your bill will be the amount you pay for the compute engine, followed by storage and operations costs.
- Apache Spark: Free and open source. However, you must spend engineering hours setting up and maintaining clusters.
- Databricks: You have many compute machine options. Use this pricing calculator tool to create an estimate.
- Snowflake: You have many compute machine options. Use this pricing calculator tool to create an estimate. The pricing model is complex enough that you often end up having to learn how to save costs.
- Redshift/BigQuery: Similar to the above, you will pay for compute, storage, networking, and special features. Use the documentation & pricing calculators to get an estimate.
At a high level, you will end up paying for:
- Compute: Typically the most expensive part of your bill (usually at ~80% of your bill).
- Storage: Typically the second most expensive part of your bill (usually at ~15% of your bill).
- Special services: Specialized services (e.g., Snowpipe, delta live tables, photon data processor, ML) cost extra on top of standard compute (usually at ~5% of your bill).
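To make the breakdown concrete, here is a purely illustrative back-of-the-envelope estimate. Every rate and usage number below is hypothetical; use your vendor's pricing page or calculator for real figures.

```python
# Hypothetical rates — check your vendor's pricing calculator for real numbers.
compute_hours_per_month = 500          # cluster/warehouse hours
compute_rate_per_hour = 4.00           # $ per hour for your chosen size
storage_tb = 15
storage_rate_per_tb_month = 23.00      # $ per TB-month
special_features_cost = 130.00         # e.g. streaming ingestion, ML features

compute_cost = compute_hours_per_month * compute_rate_per_hour
storage_cost = storage_tb * storage_rate_per_tb_month
total = compute_cost + storage_cost + special_features_cost

print(f"compute ${compute_cost:,.0f}, storage ${storage_cost:,.0f}, total ${total:,.0f}")
# With these made-up numbers, compute is ~80%, storage ~14%, special features ~5%,
# roughly matching the split described above.
```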
3.5. Data governance: Know the what is/where is/upstream of datasets
Data governance is an umbrella term used to define
- What a dataset is
- Who has access to which dataset
- What datasets are available
- How to share datasets
- How a specific dataset is created (i.e., what its upstream sources are)
- Defining data quality standards and checking whether a dataset meets them
- Data security and access (e.g., ensuring PII information is only accessible when needed, etc.)
Most data platforms offer features for data governance:
- Apache Spark: If you do not have a dedicated data governance system, you will need to handle governance yourself.
- Databricks: Has a set of features to help with data governance.
- Snowflake/Redshift/BigQuery: Have a set of features to help with data governance.
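As a very small taste of access-control-style governance in plain Spark, the sketch below exposes a view that drops a PII column. Managed platforms (Unity Catalog, Snowflake, etc.) layer proper grants, masking policies, audits, and lineage on top of this; the table and column names here are made up.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("governance-demo").getOrCreate()

customers = spark.createDataFrame(
    [(1, "alice@example.com", "US"), (2, "bob@example.com", "DE")],
    ["customer_id", "email", "country"],  # email is PII
)
customers.createOrReplaceTempView("customers")

# Minimal governance measure: analysts query a view that excludes the PII column.
spark.sql(
    "CREATE OR REPLACE TEMP VIEW customers_no_pii AS "
    "SELECT customer_id, country FROM customers"
)
spark.table("customers_no_pii").show()
```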
3.6. Managing data platform is a lot of work (aka ops)
Operations (aka ops) refers to the tools/techniques for managing data infrastructure: what happens when the system goes down, how to deploy new code to a cluster, what happens when a specific cloud region is down (e.g., if AWS us-east-1 is down, does your Snowflake instance still work?), what happens when infrastructure breaks (aka disaster recovery), etc.
Most enterprise analytical systems have their own recommendations for disaster recovery (e.g., Snowflake disaster recovery).
However, setting up Databricks requires some work to integrate the different systems involved, such as code repos, cloud providers, etc. Operations are “easier” in Snowflake because you can run queries directly on Snowflake without deploying jars to any clusters; you do not have to manage clusters manually, and Snowflake can automatically spin up compute when you run a query.
4. Conclusion
To recap, we saw:
- What analytical data processing systems are
- How data processing works in analytical data processing systems
- Data storage & sharing formats
- Data platform support for data warehousing
- Data governance
- How managing data systems is a full-time job
Data platforms that started out as warehouses (e.g., Snowflake, BigQuery, Redshift, etc.) are generally easier to manage and set up. Data systems that started out as data processing systems (e.g., Spark) require a more hands-on approach to reach the same level of feature availability as the data warehouse systems, but they are also more powerful since they provide a deeper level of control over how data is processed and stored.
Most data platforms are converging in terms of features, e.g., Snowpipe from Snowflake and Warehouse pattern support from Databricks.
Finally, marketing also plays a huge role in how a data platform is perceived (e.g., what platform comes to mind when you hear "data lakehouse"?). When you are deciding between them, think in terms of features and do not base your decision just on marketing material.
The next time you are working to understand what a data platform offers, use the list above to guide you. Please let me know in the comment section below if you have any questions or comments.
5. Further Reading
- What is a data warehouse
- SQL or Python for data transformations?
- How to reduce your snowflake costs
- SQL tips to level up your DE skills
If you found this article helpful, share it with a friend or colleague using one of the socials below!