How to use nested data types effectively in SQL

Oct 14, 2024 · 11 min read

1. Introduction
2. Code & Data
3. Using nested data types effectively
4. Conclusion
5. Continue reading

1. Introduction

If you have worked in the data space, you’d inevitably come across tables with so many columns that it gets difficult to remember the exact names. If you are wondering

Is there a more straightforward way to represent data

How to use complex data types to represent relationships in data efficiently

This post is for you. In it, we will discuss how to use complex data types in SQL to represent relationships efficiently, improve developer ergonomics, and potentially reduce metric calculation issues.

By the end of this post, you will know how to use complex data types in your pipelines, their tradeoffs, and how to conceptually consider using complex data types as part of your table.

2. Code & Data

We will use the tpch data for data. The TPC-H data represents a car parts seller’s data warehouse, where we record orders, items that make up that order (lineitem), supplier, customer, part (parts sold), region, nation, and partsupp (parts supplier).

TPCH ERD

You can run the code on GitHub codespaces, by following this link & then going to ./concepts/nested_data_types/using_nested_data_types.ipynb .

You can also run the code locally using the following commands:

git clone https://github.com/josephmachado/adv_data_transformation_in_sql.git
cd adv_data_transformation_in_sql
python -m venv ./env # create a virtual env
source env/bin/activate # use virtual environment
pip install -r requirements.txt
python setup.py
jupyter lab

And going to ./concepts/nested_data_types/using_nested_data_types.ipynb .

How to use nested data types in SQL, YouTube Link (Oct 19th 2024 1PM EST)

3. Using nested data types effectively

When we think of tables, we think of rows and columns. With the advent of nested data types, we can do the following.

Represent related columns in a table as a STRUCT.
Combine data from multiple rows into one row with ARRAY[STRUCT]

While there are other data types (DBengine specific), exploring the STRUCT and ARRAY data types and their use cases will enable you to explore the other data types.

3.1. Use STRUCT for one-to-one & hierarchical relationships

If you are joining tables with a one-to-one relationship, it is a good use case for using STRUCT. Let’s consider that you are building an OBT table called wide_lineitem.

To create wide_lineitem, we must enrich the lineitem fact table with multiple dimensions(customer and supplier). Let’s see how we would do this without a nested data structure.

SELECT l.*,
c.*,
s.*
FROM 
    lineitem l
LEFT JOIN 
    orders o ON l.l_orderkey = o.o_orderkey
LEFT JOIN 
    customer c ON o.o_custkey = c.c_custkey
LEFT JOIN 
    supplier s ON l.l_suppkey = s.s_suppkey
LIMIT 5;

l_orderkey	l_quantity	..	c_custkey	..	s_suppkey	..
1	17.00	..	370	..	93	..
1	36.00	..	370	..	75	..

We will end up with a very wide table, and the column names would be hard to keep track of. Now consider the same, but with a STRUCT data type to store the combined attributes of customer and order entities.

Cols to struct

SELECT 
    l.*, 
    struct_pack(
        id := c.c_custkey,
        name := c.c_name,
        address := c.c_address,
        nationkey := c.c_nationkey,
        phone := c.c_phone,
        acctbal := c.c_acctbal,
        mktsegment := c.c_mktsegment,
        comment := c.c_comment
    ) AS customer,
    struct_pack(
        id := s.s_suppkey,
        name := s.s_name,
        address := s.s_address,
        nationkey := s.s_nationkey,
        phone := s.s_phone,
        acctbal := s.s_acctbal,
        comment := s.s_comment
    ) AS supplier
FROM 
    lineitem l
LEFT JOIN 
    orders o ON l.l_orderkey = o.o_orderkey
LEFT JOIN 
    customer c ON o.o_custkey = c.c_custkey
LEFT JOIN 
    supplier s ON l.l_suppkey = s.s_suppkey
LIMIT 5;

l_orderkey	..	customer	supplier
1	..	{‘id’: 370, ‘name’: ‘Customer#000000370’}	{‘id’: 93, ‘name’: ‘Supplier#000000093’}
1	..	{‘id’: 370, ‘name’: ‘Customer#000000370’}	{‘id’: 75, ‘name’: ‘Supplier#000000075’}

Now, the data for customer and supplier attributes are available as STRUCT types. We can access them using an expressive notation such as customer.name, supplier.address, etc.

Let’s consider a hierarchical relationship. Consider a customer and supplier, each with its own nation attribute. A STRUCT allows us to represent this hierarchical data easily without having to rename the nation table to customer_nation_name, supplier_nation_name, etc. Let’s see how this is done.

-- Hierarchical data 
SELECT 
    l.*, 
    struct_pack(
        id := c.c_custkey,
        name := c.c_name,
        address := c.c_address,
        nationkey := c.c_nationkey,
        phone := c.c_phone,
        acctbal := c.c_acctbal,
        mktsegment := c.c_mktsegment,
        comment := c.c_comment,
        nation := struct_pack(
            nationkey := n.n_nationkey,
            name := n.n_name,
            regionkey := n.n_regionkey,
            comment := n.n_comment
        )
    ) AS customer,
    struct_pack(
        id := s.s_suppkey,
        name := s.s_name,
        address := s.s_address,
        nationkey := s.s_nationkey,
        phone := s.s_phone,
        acctbal := s.s_acctbal,
        comment := s.s_comment,
        nation := struct_pack(
            nationkey := sn.n_nationkey,
            name := sn.n_name,
            regionkey := sn.n_regionkey,
            comment := sn.n_comment
            )
    ) AS supplier
FROM 
    lineitem l
LEFT JOIN 
    orders o ON l.l_orderkey = o.o_orderkey
LEFT JOIN 
    customer c ON o.o_custkey = c.c_custkey
LEFT JOIN
    nation n ON c.c_nationkey = n.n_nationkey
LEFT JOIN 
    supplier s ON l.l_suppkey = s.s_suppkey
LEFT JOIN
    nation sn ON s.s_nationkey = sn.n_nationkey
LIMIT 5;

l_orderkey	..	customer	supplier
1	..	{‘id’: 370, ‘name’: ‘Customer#000000370’, ‘nation’: {‘nationkey’: 12, ‘name’: ‘JAPAN’, ‘regionkey’: 2, ‘comment’: ' quickly final packages. furiously i’}}	{‘id’: 93, ‘name’: ‘Supplier#000000093’, ‘nation’: {‘nationkey’: 16, ‘name’: ‘MOZAMBIQUE’, ‘regionkey’: 0, ‘comment’: ' beans after the carefully regular accounts r’}}
1	..	{‘id’: 370, ‘name’: ‘Customer#000000370’, ‘nation’: {‘nationkey’: 12, ‘name’: ‘JAPAN’, ‘regionkey’: 2, ‘comment’: ' quickly final packages. furiously i’}}	{‘id’: 75, ‘name’: ‘Supplier#000000075’, ‘nation’: {‘nationkey’: 18, ‘name’: ‘CHINA’, ‘regionkey’: 2, ‘comment’: ‘ckly special packages cajole slyly. unusual, unusual theodolites mold furiously. slyly sile’}}

3.2. Use ARRAY[STRUCT] for one-to-many relationships

When you have a one-to-many relationship(say fact—fact), you’d either aggregate both tables to the same grain (aka what one row represents in that table) or duplicate some facts(numerical data). Duplicated numerical data, if not careful, can lead to incorrect metrics.

To avoid confusion in calculating metrics and have all the necessary data in one place, we can store the lower-grained data as an ARRAY of STRUCTS.

Let’s look at an example where we create a wide table at order grain, but it has lineitems (items that make up an order) data as an ARRAY[STRUCT].

Rows to List

WITH line_items as (
SELECT 
    l_orderkey as orderkey,
    array_agg(struct_pack(
        lineitemkey := l.l_linenumber,
        partkey := l.l_partkey,
        suppkey := l.l_suppkey,
        quantity := l.l_quantity,
        extendedprice := l.l_extendedprice,
        discount := l.l_discount,
        tax := l.l_tax,
        returnflag := l.l_returnflag,
        linestatus := l.l_linestatus,
        shipdate := l.l_shipdate,
        commitdate := l.l_commitdate,
        receiptdate := l.l_receiptdate,
        shipinstruct := l.l_shipinstruct,
        shipmode := l.l_shipmode,
        comment := l.l_comment
    )) AS lineitems
FROM 
    lineitem l 
GROUP BY 
    l_orderkey)
SELECT o.*,
len(l.lineitems) as num_lineitems,
l.lineitems
FROM orders o
LEFT JOIN line_items l
on o.o_orderkey = l.orderkey
LIMIT 5;

orderkey	..	lineitems
2	..	[{‘lineitemkey’: 1, ‘partkey’: 1062, ‘suppkey’: 33, ‘quantity’: Decimal(‘38.00’), ‘extendedprice’: Decimal(‘36596.28’), ‘discount’: Decimal(‘0.00’), ‘tax’: Decimal(‘0.05’), ‘returnflag’: ‘N’, ‘linestatus’: ‘O’, ‘shipdate’: ‘1997-01-28’, ‘commitdate’: ‘1997-01-14’, ‘receiptdate’: ‘1997-02-02’, ‘shipinstruct’: ‘TAKE BACK RETURN’, ‘shipmode’: ‘RAIL’, ‘comment’: ‘re. enticingly regular instruct’}]

In the above code, we aggregate lineitems to order grain as ARRAY[STRUCT] and join it to the order table.

NOTE: We must ensure that the size of a single value in a column does not exceed the allowed limits (check your DB documentation).

3.3. Using nested data types in data processing

We saw how we can model 1:1, 1:m, and hierarchical relationships with nested data types. In this section, we will see how we transform nested data types.

To demonstrate the tips and techniques, let’s create an OBT table called wide_orders, which has orders, lineitems, customer and nation data at order grain.

CREATE TABLE IF NOT EXISTS wide_orders AS 
WITH line_items as (
SELECT 
    l_orderkey as orderkey,
    array_agg(struct_pack(
        lineitemkey := l.l_linenumber,
        partkey := l.l_partkey,
        suppkey := l.l_suppkey,
        quantity := l.l_quantity,
        extendedprice := l.l_extendedprice,
        discount := l.l_discount,
        tax := l.l_tax,
        returnflag := l.l_returnflag,
        linestatus := l.l_linestatus,
        shipdate := l.l_shipdate,
        commitdate := l.l_commitdate,
        receiptdate := l.l_receiptdate,
        shipinstruct := l.l_shipinstruct,
        shipmode := l.l_shipmode,
        comment := l.l_comment
    )) AS lineitems
FROM 
    lineitem l 
GROUP BY 
    l_orderkey)
SELECT 
    o.*,
    l.lineitems,
    struct_pack(
        id := c.c_custkey,
        name := c.c_name,
        address := c.c_address,
        nationkey := c.c_nationkey,
        phone := c.c_phone,
        acctbal := c.c_acctbal,
        mktsegment := c.c_mktsegment,
        comment := c.c_comment,
        nation := struct_pack(
            nationkey := n.n_nationkey,
            name := n.n_name,
            regionkey := n.n_regionkey,
            comment := n.n_comment
        )
    ) AS customer
FROM 
    orders o
LEFT JOIN 
    line_items l ON o.o_orderkey = l.orderkey
LEFT JOIN 
    customer c ON o.o_custkey = c.c_custkey
LEFT JOIN
    nation n ON c.c_nationkey = n.n_nationkey;

3.3.1. STRUCT enables more straightforward data schema and data access

While structs may seem different, you can use them usually like any other column. All you have to do is use the . notation.

SELECT o_orderkey,
customer.name,
customer.address,
lineitems[1] as first_lineitem
FROM wide_orders
limit 2;

o_orderkey	customer_name	customer_address	first_lineitem
1	Customer#000000370	DtqbDO5rid	{‘lineitemkey’: 1, ‘partkey’: 1552, ‘suppkey’: 93}
2	Customer#000000781	maoqGuL5,rHfX0leqZcFqHqpQH	{‘lineitemkey’: 1, ‘partkey’: 1062, ‘suppkey’: 33}

In the above query, we access data from the customer struct with the . notation and access the first lineitem struct from an array of lifetime structs using the array notation [].

Note: In DuckDB, the array index starts at 1.

3.3.2. Nested data types can be sorted

Structs can be ordered as any other data type. The db engine compares each value of the struct to define the order.

Let’s see an example: When we order by our customer struct, the id field will be compared since that is the first key in the struct field.

SELECT customer
FROM wide_orders 
GROUP BY 1
ORDER BY 1
LIMIT 5

We can also order elements in our lineitems ARRAY with the list_sort function.

3.3.3. UNNEST ARRAY to rows and GROUP rows to ARRAY

When you want to convert a set of rows into an array of structs or vice versa, use the ARRAY_AGG and UNNEST functions, respectively.

Let’s see how to convert a set of rows into an array.

WITH lineitems as (SELECT 
    o.o_orderkey,
    UNNEST(o.lineitems) as line_item
FROM 
    wide_orders o),
unnested_line_items AS (
SELECT o_orderkey,
    line_item.lineitemkey,
    line_item.partkey,
    line_item.quantity
    FROM lineitems
)
SELECT o_orderkey,
array_agg(struct_pack(
        line_item_key := lineitemkey,
        part_key := partkey,
        quantity := quantity)) as lineitems
FROM unnested_line_items
GROUP BY 1
LIMIT 5;

Let’s see how to convert an array of elements into individual rows.

WITH
  lineitems AS (
    SELECT
      o.o_orderkey,
      UNNEST (o.lineitems) AS line_item
    FROM
      wide_orders o
  )
SELECT
  o_orderkey,
  line_item.lineitemkey,
  line_item.partkey,
  line_item.quantity
FROM
  lineitems
LIMIT
  5

3.3.4. Improve OBT usability with nested data types

In wide tables, you often end up with column names like customer_name, supplier_name, etc. Using structs, you can enable more straightforward naming conventions with the. `` notation, such as customer.name & supplier.name. These are even more applicable with deeply nested structures such as customer.nation.name.

In addition, nested data types also help us avoid incorrect metrics computation. If we store data with multiple grains in the same table, we must be cautious when aggregating the data.

With nested data structure, we can avoid this, as shown below.

Nested OBT to pre aggregate

We are getting lineitem level metrics from the orders OBT table.

-- get lineitem metrics
WITH
  lineitems AS (
    SELECT
      o.o_orderkey,
      UNNEST (o.lineitems) AS line_item
    FROM
      wide_orders o
  )
SELECT
  o_orderkey,
  COUNT(line_item.lineitemkey) AS num_line_items,
  SUM(line_item.quantity) AS total_line_item_quantity
FROM
  lineitems
GROUP BY
  1
ORDER BY
  1
LIMIT
  10;

We are getting order-level metrics from the orders OBT table with lineitem data.

-- Get order metrics
SELECT
  o_orderdate,
  SUM(o_totalprice) AS order_total_price
FROM
  wide_orders
GROUP BY
  1
ORDER BY
  1
LIMIT
  5;

3.2. Ensure your performance meets your expectations

By using structs, we add overhead in terms of maintenance (the db engine will need to store the schema per struct), and the DB engine will have to parse out the struct when performing any operation (difficult with nested structs).

You’d want to check the performance of your DB engine + storage format using the EXPLAIN command. More often than not, the only additional step you’d see is a step to UNNEST to convert a single row into multiple rows and STRUCT_EXTRACT to get the required column from the struct data type.

result = %sql EXPLAIN WITH lineitems AS ( SELECT o.o_orderkey, UNNEST (o.lineitems) AS line_item FROM wide_orders o ) SELECT o_orderkey, COUNT(line_item.lineitemkey) AS num_line_items, SUM(line_item.quantity) AS total_line_item_quantity FROM lineitems GROUP BY 1 ORDER BY 1
print(result)

4. Conclusion

To recap we saw

If your current data model seem inflexible try out nested data structures. When used correctly it can significanty improve table usability and maintainability.

Please let me know in the comment section below if you have any questions or comments.

5. Continue reading

If you found this article helpful, share it with a friend or colleague using one of the socials below!