How to use nested data types effectively in SQL
- 1. Introduction
- 2. Code & Data
- 3. Using nested data types effectively
- 4. Conclusion
- 5. Continue reading
1. Introduction
If you have worked in the data space, you’d inevitably come across tables with so many columns that it gets difficult to remember the exact names. If you are wondering
Is there a more straightforward way to represent data
How to use complex data types to represent relationships in data efficiently
This post is for you. In it, we will discuss how to use complex data types in SQL to represent relationships efficiently, improve developer ergonomics, and potentially reduce metric calculation issues.
By the end of this post, you will know how to use complex data types in your pipelines, their tradeoffs, and how to conceptually consider using complex data types as part of your table.
2. Code & Data
We will use the tpch
data for data. The TPC-H data represents a car parts seller’s data warehouse, where we record orders, items that make up that order (lineitem), supplier, customer, part (parts sold), region, nation, and partsupp (parts supplier).
You can run the code on GitHub codespaces, by following this link & then going to ./concepts/nested_data_types/using_nested_data_types.ipynb .
You can also run the code locally using the following commands:
git clone https://github.com/josephmachado/adv_data_transformation_in_sql.git
cd adv_data_transformation_in_sql
python -m venv ./env # create a virtual env
source env/bin/activate # use virtual environment
pip install -r requirements.txt
python setup.py
jupyter lab
And going to ./concepts/nested_data_types/using_nested_data_types.ipynb .
How to use nested data types in SQL, YouTube Link (Oct 19th 2024 1PM EST)
3. Using nested data types effectively
When we think of tables, we think of rows and columns. With the advent of nested data types,
we can do the following.
- Represent related columns in a table as a
STRUCT
. - Combine data from multiple rows into one row with
ARRAY[STRUCT]
While there are other data types (DBengine specific), exploring the STRUCT
and ARRAY
data types and their use cases will enable you to explore the other data types.
3.1. Use STRUCT for one-to-one & hierarchical relationships
If you are joining tables with a one-to-one relationship, it is a good use case for using STRUCT
. Let’s consider that you are building an OBT table
called wide_lineitem
.
To create wide_lineitem
, we must enrich the lineitem
fact table with multiple dimensions(customer
and supplier
). Let’s see how we would do this without a nested data structure.
SELECT l.*,
c.*,
s.*
FROM
lineitem l
LEFT JOIN
orders o ON l.l_orderkey = o.o_orderkey
LEFT JOIN
customer c ON o.o_custkey = c.c_custkey
LEFT JOIN
supplier s ON l.l_suppkey = s.s_suppkey
LIMIT 5;
l_orderkey | l_quantity | .. | c_custkey | .. | s_suppkey | .. |
---|---|---|---|---|---|---|
1 | 17.00 | .. | 370 | .. | 93 | .. |
1 | 36.00 | .. | 370 | .. | 75 | .. |
We will end up with a very wide table, and the column names would be hard to keep track of. Now consider the same, but with a STRUCT
data type to store the combined attributes of customer and order entities.
SELECT
l.*,
struct_pack(
id := c.c_custkey,
name := c.c_name,
address := c.c_address,
nationkey := c.c_nationkey,
phone := c.c_phone,
acctbal := c.c_acctbal,
mktsegment := c.c_mktsegment,
comment := c.c_comment
) AS customer,
struct_pack(
id := s.s_suppkey,
name := s.s_name,
address := s.s_address,
nationkey := s.s_nationkey,
phone := s.s_phone,
acctbal := s.s_acctbal,
comment := s.s_comment
) AS supplier
FROM
lineitem l
LEFT JOIN
orders o ON l.l_orderkey = o.o_orderkey
LEFT JOIN
customer c ON o.o_custkey = c.c_custkey
LEFT JOIN
supplier s ON l.l_suppkey = s.s_suppkey
LIMIT 5;
l_orderkey | .. | customer | supplier | |
---|---|---|---|---|
1 | .. | {‘id’: 370, ‘name’: ‘Customer#000000370’} | {‘id’: 93, ‘name’: ‘Supplier#000000093’} | |
1 | .. | {‘id’: 370, ‘name’: ‘Customer#000000370’} | {‘id’: 75, ‘name’: ‘Supplier#000000075’} |
Now, the data for customer and supplier attributes are available as STRUCT
types. We can access them using an expressive notation such as customer.name
, supplier.address
, etc.
Let’s consider a hierarchical relationship. Consider a customer
and supplier
, each with its own nation
attribute. A STRUCT
allows us to represent this hierarchical data easily without having to rename the nation table to customer_nation_name, supplier_nation_name, etc. Let’s see how this is done.
-- Hierarchical data
SELECT
l.*,
struct_pack(
id := c.c_custkey,
name := c.c_name,
address := c.c_address,
nationkey := c.c_nationkey,
phone := c.c_phone,
acctbal := c.c_acctbal,
mktsegment := c.c_mktsegment,
comment := c.c_comment,
nation := struct_pack(
nationkey := n.n_nationkey,
name := n.n_name,
regionkey := n.n_regionkey,
comment := n.n_comment
)
) AS customer,
struct_pack(
id := s.s_suppkey,
name := s.s_name,
address := s.s_address,
nationkey := s.s_nationkey,
phone := s.s_phone,
acctbal := s.s_acctbal,
comment := s.s_comment,
nation := struct_pack(
nationkey := sn.n_nationkey,
name := sn.n_name,
regionkey := sn.n_regionkey,
comment := sn.n_comment
)
) AS supplier
FROM
lineitem l
LEFT JOIN
orders o ON l.l_orderkey = o.o_orderkey
LEFT JOIN
customer c ON o.o_custkey = c.c_custkey
LEFT JOIN
nation n ON c.c_nationkey = n.n_nationkey
LEFT JOIN
supplier s ON l.l_suppkey = s.s_suppkey
LEFT JOIN
nation sn ON s.s_nationkey = sn.n_nationkey
LIMIT 5;
l_orderkey | .. | customer | supplier |
---|---|---|---|
1 | .. | {‘id’: 370, ‘name’: ‘Customer#000000370’, ‘nation’: {‘nationkey’: 12, ‘name’: ‘JAPAN’, ‘regionkey’: 2, ‘comment’: ' quickly final packages. furiously i’}} | {‘id’: 93, ‘name’: ‘Supplier#000000093’, ‘nation’: {‘nationkey’: 16, ‘name’: ‘MOZAMBIQUE’, ‘regionkey’: 0, ‘comment’: ' beans after the carefully regular accounts r’}} |
1 | .. | {‘id’: 370, ‘name’: ‘Customer#000000370’, ‘nation’: {‘nationkey’: 12, ‘name’: ‘JAPAN’, ‘regionkey’: 2, ‘comment’: ' quickly final packages. furiously i’}} | {‘id’: 75, ‘name’: ‘Supplier#000000075’, ‘nation’: {‘nationkey’: 18, ‘name’: ‘CHINA’, ‘regionkey’: 2, ‘comment’: ‘ckly special packages cajole slyly. unusual, unusual theodolites mold furiously. slyly sile’}} |
3.2. Use ARRAY[STRUCT] for one-to-many relationships
When you have a one-to-many relationship(say fact—fact), you’d either aggregate both tables to the same grain (aka what one row represents in that table) or duplicate some facts(numerical data). Duplicated numerical data, if not careful, can lead to incorrect metrics.
To avoid confusion in calculating metrics and have all the necessary data in one place, we can store the lower-grained data as an ARRAY of STRUCTS.
Let’s look at an example where we create a wide table at order grain, but it has lineitems (items that make up an order) data as an ARRAY[STRUCT]
.
WITH line_items as (
SELECT
l_orderkey as orderkey,
array_agg(struct_pack(
lineitemkey := l.l_linenumber,
partkey := l.l_partkey,
suppkey := l.l_suppkey,
quantity := l.l_quantity,
extendedprice := l.l_extendedprice,
discount := l.l_discount,
tax := l.l_tax,
returnflag := l.l_returnflag,
linestatus := l.l_linestatus,
shipdate := l.l_shipdate,
commitdate := l.l_commitdate,
receiptdate := l.l_receiptdate,
shipinstruct := l.l_shipinstruct,
shipmode := l.l_shipmode,
comment := l.l_comment
)) AS lineitems
FROM
lineitem l
GROUP BY
l_orderkey)
SELECT o.*,
len(l.lineitems) as num_lineitems,
l.lineitems
FROM orders o
LEFT JOIN line_items l
on o.o_orderkey = l.orderkey
LIMIT 5;
orderkey | .. | lineitems |
---|---|---|
2 | .. | [{‘lineitemkey’: 1, ‘partkey’: 1062, ‘suppkey’: 33, ‘quantity’: Decimal(‘38.00’), ‘extendedprice’: Decimal(‘36596.28’), ‘discount’: Decimal(‘0.00’), ‘tax’: Decimal(‘0.05’), ‘returnflag’: ‘N’, ‘linestatus’: ‘O’, ‘shipdate’: ‘1997-01-28’, ‘commitdate’: ‘1997-01-14’, ‘receiptdate’: ‘1997-02-02’, ‘shipinstruct’: ‘TAKE BACK RETURN’, ‘shipmode’: ‘RAIL’, ‘comment’: ‘re. enticingly regular instruct’}] |
In the above code, we aggregate lineitems to order grain as ARRAY[STRUCT] and join it to the order table.
NOTE: We must ensure that the size of a single value in a column does not exceed the allowed limits (check your DB documentation).
3.3. Using nested data types in data processing
We saw how we can model 1:1, 1:m, and hierarchical relationships with nested data types. In this section, we will see how we transform nested data types.
To demonstrate the tips and techniques, let’s create an OBT table called wide_orders
, which has orders, lineitems, customer and nation data at order grain.
CREATE TABLE IF NOT EXISTS wide_orders AS
WITH line_items as (
SELECT
l_orderkey as orderkey,
array_agg(struct_pack(
lineitemkey := l.l_linenumber,
partkey := l.l_partkey,
suppkey := l.l_suppkey,
quantity := l.l_quantity,
extendedprice := l.l_extendedprice,
discount := l.l_discount,
tax := l.l_tax,
returnflag := l.l_returnflag,
linestatus := l.l_linestatus,
shipdate := l.l_shipdate,
commitdate := l.l_commitdate,
receiptdate := l.l_receiptdate,
shipinstruct := l.l_shipinstruct,
shipmode := l.l_shipmode,
comment := l.l_comment
)) AS lineitems
FROM
lineitem l
GROUP BY
l_orderkey)
SELECT
o.*,
l.lineitems,
struct_pack(
id := c.c_custkey,
name := c.c_name,
address := c.c_address,
nationkey := c.c_nationkey,
phone := c.c_phone,
acctbal := c.c_acctbal,
mktsegment := c.c_mktsegment,
comment := c.c_comment,
nation := struct_pack(
nationkey := n.n_nationkey,
name := n.n_name,
regionkey := n.n_regionkey,
comment := n.n_comment
)
) AS customer
FROM
orders o
LEFT JOIN
line_items l ON o.o_orderkey = l.orderkey
LEFT JOIN
customer c ON o.o_custkey = c.c_custkey
LEFT JOIN
nation n ON c.c_nationkey = n.n_nationkey;
3.3.1. STRUCT enables more straightforward data schema and data access
While structs may seem different, you can use them usually like any other column. All you have to do is use the .
notation.
SELECT o_orderkey,
customer.name,
customer.address,
lineitems[1] as first_lineitem
FROM wide_orders
limit 2;
o_orderkey | customer_name | customer_address | first_lineitem |
---|---|---|---|
1 | Customer#000000370 | DtqbDO5rid | {‘lineitemkey’: 1, ‘partkey’: 1552, ‘suppkey’: 93} |
2 | Customer#000000781 | maoqGuL5,rHfX0leqZcFqHqpQH | {‘lineitemkey’: 1, ‘partkey’: 1062, ‘suppkey’: 33} |
In the above query, we access data from the customer struct with the .
notation and access the first lineitem struct from an array of lifetime structs using the array notation []
.
Note: In DuckDB, the array index starts at 1.
3.3.2. Nested data types can be sorted
Structs can be ordered as any other data type. The db engine compares each value of the struct to define the order.
Let’s see an example: When we order by our customer struct, the id
field will be compared since that is the first key in the struct field.
SELECT customer
FROM wide_orders
GROUP BY 1
ORDER BY 1
LIMIT 5
We can also order elements in our lineitems ARRAY
with the list_sort
function.
3.3.3. UNNEST ARRAY to rows and GROUP rows to ARRAY
When you want to convert a set of rows into an array of structs or vice versa, use the ARRAY_AGG and UNNEST functions, respectively.
Let’s see how to convert a set of rows into an array.
WITH lineitems as (SELECT
o.o_orderkey,
UNNEST(o.lineitems) as line_item
FROM
wide_orders o),
unnested_line_items AS (
SELECT o_orderkey,
line_item.lineitemkey,
line_item.partkey,
line_item.quantity
FROM lineitems
)
SELECT o_orderkey,
array_agg(struct_pack(
line_item_key := lineitemkey,
part_key := partkey,
quantity := quantity)) as lineitems
FROM unnested_line_items
GROUP BY 1
LIMIT 5;
Let’s see how to convert an array of elements into individual rows.
WITH
lineitems AS (
SELECT
o.o_orderkey,
UNNEST (o.lineitems) AS line_item
FROM
wide_orders o
)
SELECT
o_orderkey,
line_item.lineitemkey,
line_item.partkey,
line_item.quantity
FROM
lineitems
LIMIT
5
3.3.4. Improve OBT usability with nested data types
In wide tables, you often end up with column names like customer_name,
supplier_name,
etc. Using structs, you can enable more straightforward naming conventions with the. `` notation, such as customer.name
& supplier.name
. These are even more applicable with deeply nested structures such as customer.nation.name
.
In addition, nested data types also help us avoid incorrect metrics computation. If we store data with multiple grains in the same table, we must be cautious when aggregating the data.
With nested data structure, we can avoid this, as shown below.
We are getting lineitem level metrics from the orders OBT table.
-- get lineitem metrics
WITH
lineitems AS (
SELECT
o.o_orderkey,
UNNEST (o.lineitems) AS line_item
FROM
wide_orders o
)
SELECT
o_orderkey,
COUNT(line_item.lineitemkey) AS num_line_items,
SUM(line_item.quantity) AS total_line_item_quantity
FROM
lineitems
GROUP BY
1
ORDER BY
1
LIMIT
10;
We are getting order-level metrics from the orders OBT table with lineitem data.
-- Get order metrics
SELECT
o_orderdate,
SUM(o_totalprice) AS order_total_price
FROM
wide_orders
GROUP BY
1
ORDER BY
1
LIMIT
5;
3.2. Ensure your performance meets your expectations
By using structs, we add overhead in terms of maintenance (the db engine will need to store the schema per struct), and the DB engine will have to parse out the struct when performing any operation (difficult with nested structs).
You’d want to check the performance of your DB engine + storage format using the EXPLAIN command. More often than not, the only additional step you’d see is a step to UNNEST
to convert a single row into multiple rows and STRUCT_EXTRACT
to get the required column from the struct data type.
result = %sql EXPLAIN WITH lineitems AS ( SELECT o.o_orderkey, UNNEST (o.lineitems) AS line_item FROM wide_orders o ) SELECT o_orderkey, COUNT(line_item.lineitemkey) AS num_line_items, SUM(line_item.quantity) AS total_line_item_quantity FROM lineitems GROUP BY 1 ORDER BY 1
print(result)
4. Conclusion
To recap we saw
- Using STRUCT for one-to-one & hierarchical relationships
- Using ARRAY of STRUCT for one-to-many relationships
- Using nested data types in data processing
- Improving OBT usability with nested data types
- Checking performance of your queries
If your current data model seem inflexible try out nested data structures. When used correctly it can significanty improve table usability and maintainability.
Please let me know in the comment section below if you have any questions or comments.
5. Continue reading
If you found this article helpful, share it with a friend or colleague using one of the socials below!