How to join a fact table and a type 2 dimension (SCD2) table

Introduction

If you are using a data warehouse, you have probably heard of fact and dimension tables. Simply put, fact tables record business events, and dimension tables record the attributes of business entities (e.g., the user and item tables in an e-commerce app). This approach is also referred to as a star schema or dimensional modeling.

For example, in an e-commerce website, a fact table would contain information about orders, such as when the order was placed, the items in that order, who placed that order, etc. The dimension tables would be an item table (containing item id, item price, size, etc.) and a user table (containing user id, user name, user address, etc.).

If you are wondering

how can I store my table’s history over time?

how can I join a fact table and an SCD2 dimension table?

Then this post is for you. For a detailed explanation of what a data warehouse is, check out this article.

What is an SCD2 table and why use it?

The dimensional data in a data warehouse is usually derived from an application’s database. There are seven common types of slowly changing dimensions used to model and store dimensional data in a data warehouse. In this post, we will look exclusively at type 2: add new row.

SCD2 stands for slowly changing dimension, type 2. In this type, we create a new row for each change to an existing record in the corresponding transactional (application) table. Each row in an SCD2 dimension table has row effective and row expiration datetime columns that denote the range within which that row represents the state of the data.

Application table

Assume that, in our e-commerce application’s database, we have a user table to store user attributes.

user column definition
user_id unique identifier for a user (primary key)
first_name first name of the user
last_name last name of the user
address address of the user
zipcode zipcode of the user
created_datetime Date and time when the user was created
updated_datetime Date and time when this user data was last modified

Let’s say that a user with user_id=b0cc9fde-a29a-498e-824f-e52399991beb has a zip code of 10027 until 2020-12-31, after which the user changes address and the new zip code is 10012. In our application table, this means that the record with user_id=b0cc9fde-a29a-498e-824f-e52399991beb now has a zip code of 10012.

Transaction update
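The application-side change can be sketched as a plain UPDATE. This is a hedged illustration: the table name users and the exact timestamp are assumptions, since only the column definitions are given above.

```sql
-- Hypothetical application-side update; "users" is an assumed table name
-- for the application's user table described above.
UPDATE users
SET zipcode = '10012',
    updated_datetime = '2021-01-01 18:00:00'
WHERE user_id = 'b0cc9fde-a29a-498e-824f-e52399991beb';
```

Note that the application simply overwrites the row in place; the old zip code of 10027 is lost from the application database.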

Dimension table

In the data warehouse, we need to ensure that we have access to historical data to run historical analyses. For example, if an end user wants to check how many of our users lived in zip code 10027 during December 2020, we need to make sure that the user with user_id=b0cc9fde-a29a-498e-824f-e52399991beb is counted towards 10027 and not 10012, because that is where the user lived as of December 2020.

To store this historically changing data in our data warehouse, we create a new row for each change. Let’s consider the user_dim (user dimension) table in our data warehouse. This corresponds to the user table in our application database.

user_dim column definition
user_id unique identifier for a user (primary key)
first_name first name of the user
last_name last name of the user
address address of the user
zipcode zipcode of the user
created_datetime Date and time when the user was created
updated_datetime Date and time when this user data was last modified
row_effective_datetime The date and time from which this row is the effective data for this user_id
row_expiration_datetime The date and time until which this row is the effective data for this user_id
current_row_indicator Indicator denoting if this row is the most current state of the data for this user_id

SCD2

When a record changes, we do not update the value of the existing row; instead, we add a new row to indicate the new state. We also set the old row’s row_expiration_datetime to the time of the change and its current_row_indicator to expired.

--row_expiration_datetime=9999-12-31 00:00:00
++row_expiration_datetime=2021-01-01 17:59:59

--current_row_indicator=current
++current_row_indicator=expired

For the new record, we set row_effective_datetime to the source row’s updated_datetime, set row_expiration_datetime to the end of time (9999-12-31 00:00:00), and set current_row_indicator to current to reflect the current state.
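Assuming the user_dim schema described above, the expire-and-insert step might look like the following sketch (values are taken from the running example; a production pipeline would typically do this in one transaction or a MERGE):

```sql
-- 1. Expire the currently active row for this user
UPDATE user_dim
SET row_expiration_datetime = '2021-01-01 17:59:59',
    current_row_indicator = 'expired'
WHERE user_id = 'b0cc9fde-a29a-498e-824f-e52399991beb'
    AND current_row_indicator = 'current';

-- 2. Insert a new row representing the new state
INSERT INTO user_dim (user_key, user_id, first_name, last_name, address,
        zipcode, created_datetime, updated_datetime, row_effective_datetime,
        row_expiration_datetime, current_row_indicator)
VALUES (1200, 'b0cc9fde-a29a-498e-824f-e52399991beb', 'john', 'doe', 'world',
        '10012', '2020-01-01 10:00:00', '2021-01-01 18:00:00',
        '2021-01-01 18:00:00', '9999-12-31 00:00:00', 'current');
```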

Setup

We will use PostgreSQL running in a Docker container as our data warehouse, and pgcli to connect to it. You will need:

  1. docker
  2. pgcli

From your terminal, run

docker run --name pg_local -p 5432:5432 -e POSTGRES_USER=sde -e POSTGRES_PASSWORD=password -e POSTGRES_DB=scd2 -d postgres:12.2

Now, log in to the running Postgres instance as shown below. The password is password.

pgcli -h localhost -p 5432 -U sde scd2 # password is password

Let’s create a simple user_dim table and an items_purchased fact table in our data warehouse.

DROP TABLE IF EXISTS user_dim;
CREATE TABLE user_dim (
    user_key BIGINT,
    user_id VARCHAR(40),
    first_name VARCHAR(10),
    last_name VARCHAR(10),
    address VARCHAR(100),
    zipcode VARCHAR(10),
    created_datetime TIMESTAMP,
    updated_datetime TIMESTAMP,
    row_effective_datetime TIMESTAMP,
    row_expiration_datetime TIMESTAMP,
    current_row_indicator VARCHAR(10)
);
INSERT INTO user_dim (
        user_key,
        user_id,
        first_name,
        last_name,
        address,
        zipcode,
        created_datetime,
        updated_datetime,
        row_effective_datetime,
        row_expiration_datetime,
        current_row_indicator
    )
VALUES (
        1000,
        'b0cc9fde-a29a-498e-824f-e52399991beb',
        'john',
        'doe',
        'world',
        '10027',
        '2020-01-01 10:00:00',
        '2020-01-01 10:00:00',
        '2020-01-01 10:00:00',
        '2021-01-01 17:59:59',
        'expired'
    ),
    (
        1200,
        'b0cc9fde-a29a-498e-824f-e52399991beb',
        'john',
        'doe',
        'world',
        '10012',
        '2020-01-01 10:00:00',
        '2021-01-01 18:00:00',
        '2021-01-01 18:00:00',
        '9999-12-31 00:00:00',
        'current'
    );
DROP TABLE IF EXISTS items_purchased;
CREATE TABLE items_purchased (
    item_purchased_id VARCHAR(40),
    order_id VARCHAR(40),
    user_id VARCHAR(40),
    item_id VARCHAR(40),
    item_cost decimal(10, 2),
    purchased_datetime TIMESTAMP -- and other fact information
);
INSERT INTO items_purchased (
        item_purchased_id,
        order_id,
        user_id,
        item_id,
        item_cost,
        purchased_datetime
    )
VALUES (
        'nljbac724bbskd',
        'order_id_1',
        'b0cc9fde-a29a-498e-824f-e52399991beb',
        'item_id_1',
        1500.00,
        '2020-12-28 12:30:00'
    ),
    (
        'ljbkcfvbj6758njh',
        'order_id_23',
        'b0cc9fde-a29a-498e-824f-e52399991beb',
        'item_id_45',
        20.00,
        '2021-01-28 09:30:00'
    ),
    (
        'sjbv09uy7njhbvvj',
        'order_id_100',
        '11111111-2222-3333-44444444444',
        'item_id_12',
        32.00,
        '2021-02-01 11:00:00'
    );

We now have an items_purchased fact table and a user_dim dimension table, with some sample data.

Joining fact and SCD2 tables

Let’s say our data analysts want to answer questions like:

  1. I want to see the distribution of our high-spending users by month, year, and zip code (at the time of purchase)
  2. I want to see the high-spending users’ first name, last name, and zip code, along with how long they lived at each zip code (duration_of_stay). If they are currently living at a zip code, then duration_of_stay should be null.

A high-spending user is defined as any user who has spent more than $1,000 on our website.

We can write a query to answer question 1, as shown below.

WITH high_spenders AS (
    SELECT user_id
    FROM items_purchased
    GROUP BY user_id
    HAVING sum(item_cost) > 1000
),
user_items AS (
    SELECT ip.item_purchased_id,
        ip.user_id,
        ip.item_cost,
        ud.zipcode,
        ip.purchased_datetime
    FROM items_purchased ip
        JOIN user_dim ud ON ip.user_id = ud.user_id
        AND ip.purchased_datetime BETWEEN ud.row_effective_datetime AND ud.row_expiration_datetime
)
SELECT EXTRACT(
        YEAR
        FROM ui.purchased_datetime
    ) yr,
    EXTRACT(
        MONTH
        FROM ui.purchased_datetime
    ) mnth,
    ui.zipcode,
    COUNT(DISTINCT ui.user_id) num_high_spenders
FROM user_items ui
    JOIN high_spenders hs ON ui.user_id = hs.user_id
GROUP BY yr,
    mnth,
    ui.zipcode
ORDER BY yr DESC,
    mnth DESC;

Let’s look at each of the individual CTEs, high_spenders and user_items.

high_spenders


We can see that the user with user_id=b0cc9fde-a29a-498e-824f-e52399991beb is a high spender, because they have spent a total of $1520.00 over their entire time on our application.

user_items


We can see that the user with user_id=b0cc9fde-a29a-498e-824f-e52399991beb is associated with zip code 10027 for purchases made in December 2020 and with zip code 10012 for purchases made in January 2021.

The fact to SCD2 dimension table join happens in the user_items CTE.

Fact SCD2 Query

Fact SCD2 Join

We know that, for any given row in the user_dim table, the row_effective_datetime and row_expiration_datetime define the range during which that row represents the state of the data.

This enables us to uniquely join a row in the fact table to the state of the user at the time the purchase was made.

The distribution of our high-spending users by month, year, and zip code will be as shown below.

yr    mnth  zipcode  num_high_spenders
2021  1     10012    1
2020  12    10027    1

Answering question 2 is left as an exercise for the reader. Post your answer in the comment section below to check your implementation.

To stop and remove the Docker container, run

docker stop pg_local
docker rm pg_local

Educating end users

When using a slowly changing dimension, it is crucial to educate end users on how to use the data. You can try to encapsulate its complexity by creating a view or a function on top of your slowly changing dimension table. Ultimately, however, anyone using the data should be aware of what SCD2 is, how to use it, and how not to use it.
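For example, one common way to hide the complexity is a view that exposes only the current rows. This is a minimal sketch; the view name user_dim_current is an assumption, not part of the original setup:

```sql
-- Expose only the latest state of each user; users who only need
-- current attributes never have to think about effective/expiration ranges.
CREATE VIEW user_dim_current AS
SELECT user_key,
    user_id,
    first_name,
    last_name,
    address,
    zipcode,
    created_datetime,
    updated_datetime
FROM user_dim
WHERE current_row_indicator = 'current';
```

Point-in-time analyses still need to join against the base user_dim table with the datetime range predicate.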

Conclusion

I hope this article gave you a good understanding of how to join fact tables with SCD2 dimension tables to get point-in-time information. Many end-user questions arise from the inability to get point-in-time information, which makes the results of their queries inaccurate and confusing.

The next time you are modeling a dimension table in your data warehouse, consider SCD2 to store the historical changes that will be required for analytical queries later.

Further reading

  1. Common Table Expressions (CTEs)
  2. What is a data warehouse
  3. Mastering window functions
  4. DBT support for SCD2

References:

  1. The Data Warehouse Toolkit


Comments

    Justin Wong · 4 years ago

    Very interesting, just what I was looking for. Looking forward to digging into this one.

    wdesert · 4 years ago

    Hi Joseph,
    Another great post. Please just don't stop writing :)

    Profeta Holístico · 4 years ago

    Excellent article, you guys are the best!

    Justin Wong · 3 years ago

    I didn't use row_effective_date or row_expiration_date, but was still able to solve the problem I think? Can you explain what is the issue with my queries?

    Joseph Machado · 3 years ago

    Ah great catch, Thank you. I have updated the questions to include zip code during time of purchase to demonstrate the need for SCD2 type tables.

    Gagandeep Kaur · 9 months ago

    I'm working with two datasets and need help with some SQL queries to handle aggregating events spanning multiple days.

    Hosts Table (SCD2 dim table): Includes columns host_id, created_at, last_seen, row_valid_from, row_valid_to, and is_current.

    Vulnerabilities Table (fact table): Includes columns vuln_id, host_id, created_time, updated_time, closed_time, and status.

    Use Case 1: I need a query to count the total number of distinct hosts that were last seen within the past 30 days for each day.

    Use Case 2: I need a query to count the number of open vulnerabilities for each day based on the created_time and closed_time.

    Broadly, I'm looking for best practices on how to handle aggregating events that span multiple days. Any ideas or insights?

    Thanks in advance!

    Justin Wong · 3 years ago

    I feel like this question is a better exercise for using the row_effective fields, and something that your query explanation touches upon:

    -- 3: Get the zipcode associated with the purchaser at the time -- the user purchased the item

    select i.*, u.zipcode
    from items_purchased i
        join user_dim u on i.user_id = u.user_id
        and i.purchased_datetime between u.row_effective_datetime and u.row_expiration_datetime;

    Norbert Palinkas · 2 years ago

    Hi @Joseph,

    Could you please review my solution for the query answering question #2? Is it okay to filter the dim data with the high spenders query, and not using the fact table values at all? Thank you! (tweaked the original for MS SQL)

    /* I want to see the high spending users first name, last name, zip code along WITH how long they lived at each zip code (duration_of_stay). If they are currently living at a zip code THEN the duration_of_stay should be Null. */
    WITH high_spenders AS (
        SELECT user_id
        FROM items_purchased
        GROUP BY user_id
        HAVING sum(item_cost) > 1000
    )
    SELECT ud.first_name firstname,
        ud.last_name lastname,
        ud.zipcode,
        CASE
            WHEN current_row_indicator = 'expired' THEN DATEDIFF(DAY, ud.row_effective_datetime, ud.row_expiration_datetime)
            ELSE NULL
        END duration_of_stay
    FROM user_dim ud
        JOIN high_spenders hs ON ud.user_id = hs.user_id
    ORDER BY ud.row_expiration_datetime ASC

    Joseph Kevin Machado · 2 years ago

    Hi @Norbert Palinkas, That's great, solution looks good to me!

    ashok gupta · 2 years ago

    as far as i know, its not recommended to have natural or business key refered in fact table, hence user_key should have been used in fact table instead of user_id. i am looking for solution where surrogate key is used in fact table and dimension table is following SCD type 2.

    Joseph Kevin Machado · 2 years ago

    That is true, but most companies skip having surrogate key in favor of using natural key since it is much easier (& has gotten much cheaper) to join on, than to create a surrogate key and update fact tables.

    Abylaikhan Kaipbayev · 10 months ago

    Hmm, I don't agree here... Suppose that this user bought multiple times different items in the same month, so then in this case fact table will have multiple rows per each transaction. Suppose that the same user changed some attributes like address or something else in the same month and that will be reflected in user_dim (as shown in your example screenshot). So now let's imagine that we need to sum all purchases of this user by within this month (purchased_datetime between 2020-01-01 and 2020-01-31), now you end up with many-to-many join because fact table has several transactions and user dim has several records per this user... That is why we need to use surrogate keys...

    Joseph Machado · 10 months ago

    I'm not sure I understand this clearly. Can you please provide an example?

    The join criteria """FROM items_purchased ip JOIN user_dim ud ON ip.user_id = ud.user_id AND ip.purchased_datetime BETWEEN ud.row_effective_datetime AND ud.row_expiration_datetime""" is such that each transaction will only join with one row from the user_dim. In your example there are 2 transactions and changes in the user_dim; each transaction will join with the user_dim row that was active at the time that particular transaction occurred.

    Dimitry Borochin · 3 years ago

    Hi, I was reviewing this article and I think there is a problem with your solution. You are joining your dim and fact, using the natural key, user_id. This is breaking a fundamental rule of dimensional modeling. https://www.kimballgroup.com/1998/05/surrogate-keys/

    Once surrogate keys are used to join dims and facts, you don't need to leverage the row_effective_date or row_expiration_date, as an end user. The join between SKs will do all the work for you.

    Joseph Kevin Machado · 3 years ago

    hi, yes, that is intentional. If we were to use a surrogate key, we would need to populate the fact table with the corresponding key to the dimension. With the latest partitioning patterns, joining on natural key + time dimensions is very cheap. Hope this helps.

    fung.eth · 3 years ago

    Thanks for the post! I am a data newbie and would like to ask a few questions related.

    1. Do you treat the date dimension table as the SCD2 table?
    2. For joining fact tables with dimension tables, are you joining them by surrogate key?
    3. if yes, how do write ur ETL task to put the surrogate key in fact table efficiently? (it may take a long time to look up different dimensions for each row in fact table)
    4. Do you put foreign keys in dimension tables? if yes, it seems will be annoying when we do delete-write approach for putting new data.

    Thanks for much.

    Joseph Kevin Machado · 3 years ago

    hi @fung.eth

    Thank you for taking the time to read.

    1. I'm not sure I follow how date will be changing. Typically we can use something like this date table as a date dimension.
    2. Yes, the fact table has a surrogate key to join with the dim table.
    3. If the ETL is in batch, a normal join (as shown in this article) with an SCD2 table is very fast. If the ETL is in streaming mode, there are a few ways: 1. Have the dimensions in a fast look-up db, such as Flink state store or AWS Dynamo, and enrich the facts. 2. If your dimensions have not arrived in your warehouse before the fact, you would have to handle that; this reconciliation pattern is a tricky but good approach.
    4. In warehouses, foreign key constraints are generally not enforced (ref docs). So if you do a delete-write (or overwrite, or full snapshot) of dimension tables, it will be fine.

    LMK if this helps and if you have more questions.

    fung.eth · 3 years ago

    Thank you for your reply.

    For me, the biggest confusion part is the usage of the surrogate key. As your example, the user_key is the surrogate key and the user_id is the natural key in the dim_user table. When we do our ETL job to inject data into the dim_user & fact tables, are we just generating the surrogate key randomly (or sequentially) but we don't have to look it up and write it to fact_table for joining. Because we still use the natural key (user_id) to join the fact table with the dimension table?

    So is it correct to say that the fact tables don't need to know the surrogate key of the dimension tables? As we can join them together by those natural keys which we should already know from our data sources. Or, we should put the surrogate keys of the dimension tables into fact tables?

    Joseph Kevin Machado · 3 years ago

    "we just generating the surrogate key randomly (or sequentially)" => It's usually sequentially generated.

    Using either (surrogate key) or (natural key + date range) to join fact to dim tables are both valid options. They both have their pros and cons; in general, I'd think about it like this:

    Surrogate key

    1. Needs to be generated in dim table
    2. The fact table needs to be generated
    3. If your business users only need the most current dimensional values, you will have to join with natural key + date ranges again

    Natural key + date range

    1. Simpler
    2. You can get the most current dimensional values or the dimensional values at the time of fact

    IMO most companies use Natural key + date range.

    Sergio E. Tobal · 3 years ago

    I found this blog searching on how to manage dates in a star schema, and my main problem with them is when you have a lot of them.

    For example, in a subscription model you may have id_sub, id_client, state_sub, is_gift, shop_name, shop_client, sub_type, creation_date, modification_date, start_date, end_date, suspended_date, credit_card_year, credit_card_month. So, my concern is how to divide all of this fields in different tables. I was putting id_sub, id_client, and creation_date in the fact table, and the rest on dimension tables, but then I will need to add an id_shop, id_sub_type to reference the dimension tables, right?

    And then, my main problem, what can I do with all those dates? They could go into the fact table because they are related to the events. But then it would become an SCD Type 2 table because I would need to replicate the rows to update the different dates. Another way I saw it was creating an additional dimensional table, which would be an SCD2, but then how? creation_date_id in fact table, and an SCD2 table where it has creation_date_id, creation_date, start_date, end_date... or it would be better to partition the dates by columns? But then it would be too many columns for only the dates, since in some dates is important the minutes.