erDiagram
direction LR
customer {
UUID customer_id PK
VARCHAR email
VARCHAR full_name
VARCHAR phone
VARCHAR status
TIMESTAMPTZ created_at
TIMESTAMPTZ updated_at
}
orders {
UUID order_id PK
UUID customer_id FK
UUID shipping_addr_id FK
UUID billing_addr_id FK
VARCHAR status
NUMERIC total_amount
TIMESTAMPTZ placed_at
TIMESTAMPTZ created_at
TIMESTAMPTZ updated_at
}
customer ||--o{ orders : "places"
Input Data Relationships Represent How Your Business Operates
To model a warehouse, you need to understand how the input sources are related to one another.
OLTP databases (Postgres, MySQL) enforce constraints (such as foreign keys) that ensure only data satisfying them can be inserted into the tables.
OLAP systems (Snowflake, Iceberg, etc.) generally do not enforce foreign key constraints (with some exceptions), which makes it very likely that incomplete data will end up in your outputs.
If you have faced issues like
- Sudden NULL dimensions showing up in your dashboard
- Blindly trusting schema-on-read, leading to data quality nightmares
- Feeling like a janitor, cleaning up bad data at scale
this post is for you.
Ensuring that input data is correctly related is how you verify that your understanding of the source data is correct.
By the end of this post, you will know:
- Why referential integrity is critical
- How referential integrity is lost in transit (source → warehouse)
- Consequences of referential integrity violations
- Steps you can take to avoid data issues in your output
Follow along with code > Setup.
What Is Referential Integrity and Why It Matters
Assume you work for a company that allows customers to place orders.
Based on how the business operates, we know that an order cannot exist without a customer. Referential integrity is a way of defining this in our database.
We can define this using foreign key constraints, as shown below.
CREATE TABLE customer (
customer_id UUID PRIMARY KEY,
email VARCHAR NOT NULL UNIQUE,
full_name VARCHAR NOT NULL,
phone VARCHAR,
status VARCHAR NOT NULL DEFAULT 'active',
created_at TIMESTAMPTZ NOT NULL,
updated_at TIMESTAMPTZ NOT NULL
);
CREATE TABLE orders (
order_id UUID PRIMARY KEY,
customer_id UUID NOT NULL REFERENCES customer(customer_id),
shipping_addr_id UUID NOT NULL,
billing_addr_id UUID NOT NULL,
status VARCHAR NOT NULL,
total_amount NUMERIC(14,2) NOT NULL,
placed_at TIMESTAMPTZ NOT NULL,
created_at TIMESTAMPTZ NOT NULL,
updated_at TIMESTAMPTZ NOT NULL
);
We say that the orders.customer_id column references customer.customer_id, ensuring that any row inserted into the orders table must have a matching customer_id in the customer table.
For example, if you run the insert below, you will get an error stating that the customer_id does not exist in the customer table.
INSERT INTO
orders (
order_id,
customer_id,
shipping_addr_id,
billing_addr_id,
status,
total_amount,
placed_at,
created_at,
updated_at
)
VALUES
(
'00000000-0000-0000-0000-000000000000',
'00000000-0000-0000-0000-000000000001', -- Non existent customer
'00000000-0000-0000-0000-000000000002',
'00000000-0000-0000-0000-000000000003',
'pending',
99.99,
now(),
now(),
now()
);
Referential integrity is critical to ensuring data accuracy.
While one can argue that the application inserting the data can enforce this rule, application code is susceptible to bugs; database constraints are the last line of defense.
Data Pipelines and Warehouses Usually Ignore Relationships
Ingestion pipelines that pull data from sources into warehouses/lakehouses typically ignore referential integrity.
Assume you have a daily data pull for customer data and an hourly one for order data (Source: Postgres → Warehouse: DuckDB).
# Full code in notebook
import duckdb
from datetime import datetime, timezone

now = datetime.now(timezone.utc)
daily_run_ts = now.replace(hour=0, minute=0, second=0, microsecond=0).strftime("%Y-%m-%d %H:%M:%S")
hourly_run_ts = now.strftime("%Y-%m-%d %H:%M:%S")
con = duckdb.connect()
con.execute("INSTALL postgres; LOAD postgres;")
con.execute("""
ATTACH 'dbname=ecommerce user=dataengineer password=datapipeline host=postgres port=5432'
AS pg (TYPE postgres);
""")
print(f"Pulling customer data until {daily_run_ts}")
con.execute(f"""
CREATE TABLE customer AS
SELECT * FROM pg.customer
WHERE updated_at <= '{daily_run_ts}';
""")
print(f"Pulling orders data until {hourly_run_ts}") # Fact data is typically loaded into the warehouse more frequently
con.execute(f"""
CREATE TABLE orders AS
SELECT * FROM pg.orders
WHERE created_at <= '{hourly_run_ts}';
""")
display(con.execute("select * from customer").fetch_df())
display(con.execute("select * from customer").fetch_df())
display(con.execute("select * from orders").fetch_df())

customer
| # | customer_id | email | full_name | phone | status | created_at | updated_at |
|---|---|---|---|---|---|---|---|
| 0 | 9bbd9c09 | alice@example.com | Alice Johnson | +1-555-0101 | active | 2026-04-12 16:52:15+00 | 2026-04-14 16:52:15+00 |
| 1 | b77c1919 | bob@example.com | Bob Martinez | +1-555-0202 | active | 2026-04-21 16:52:15+00 | 2026-04-21 16:52:15+00 |
orders
| # | order_id | customer_id | shipping_addr_id | billing_addr_id | status | total_amount | placed_at | created_at | updated_at |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 44a7657b | 9bbd9c09 | 1037ebd1 | 1037ebd1 | delivered | 129.99 | 2026-04-19 | 2026-04-22 | 2026-04-22 |
| 1 | 71cfa7ca | 9bbd9c09 | b53f894c | 1037ebd1 | processing | 54.00 | 2026-04-19 | 2026-04-22 | 2026-04-22 |
| 2 | ac624fb6 | b77c1919 | b9b004c4 | b9b004c4 | shipped | 249.50 | 2026-04-21 | 2026-04-22 | 2026-04-22 |
| 3 | 510070c7 | 560aada6 ⚠️ | 7344ddd6 | 7344ddd6 | pending | 75.00 | 2026-04-22 | 2026-04-22 | 2026-04-22 |
You can see that the orders table now has an orphaned customer_id: 560aada6.
RI breaks in transit when fact tables refresh faster than their dimension tables.
The warehouse sees the child before it sees the parent.
Referential Integrity Violations Lead to NULL or Dropped Data
When you transform input whose referential integrity is not fully valid, the result will contain NULLs or missing rows.
Let’s look at an example: an inner join between orders and customer will output only 3 of the 4 order rows, silently dropping the orphaned order.
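The dropped-row behavior is easy to reproduce. The sketch below uses Python's built-in sqlite3 for portability (the same SQL applies in DuckDB); the table and column names mirror the post's schema, and the sample rows are a simplified stand-in for the data above.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customer (customer_id TEXT PRIMARY KEY, full_name TEXT)")
con.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, customer_id TEXT, total_amount REAL)")
con.executemany("INSERT INTO customer VALUES (?, ?)",
                [("9bbd9c09", "Alice Johnson"), ("b77c1919", "Bob Martinez")])
con.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [("44a7657b", "9bbd9c09", 129.99),
                 ("71cfa7ca", "9bbd9c09", 54.00),
                 ("ac624fb6", "b77c1919", 249.50),
                 ("510070c7", "560aada6", 75.00)])  # orphaned customer_id

# The inner join silently drops the orphaned order.
rows = con.execute("""
    SELECT o.order_id, c.full_name
    FROM orders o
    JOIN customer c ON o.customer_id = c.customer_id
""").fetchall()
print(len(rows))  # 3, not 4
```

The lost row raises no error anywhere; the only symptom is a total that quietly excludes the orphaned order.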
Most real pipelines use a left join plus coalesce to join facts and dimensions for this very reason.
However, the UNKNOWN placeholder may not always be acceptable, and even when it is, you will need to rerun the pipelines repeatedly to backfill the missing dimensions once they arrive.
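The left join + coalesce pattern can be sketched as follows (again using stdlib sqlite3 for portability; identical SQL works in DuckDB). The orphaned order is kept instead of dropped, with its missing dimension tagged UNKNOWN.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customer (customer_id TEXT PRIMARY KEY, full_name TEXT)")
con.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, customer_id TEXT)")
con.execute("INSERT INTO customer VALUES ('9bbd9c09', 'Alice Johnson')")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [("44a7657b", "9bbd9c09"),
                 ("510070c7", "560aada6")])  # orphaned customer_id

# LEFT JOIN keeps every order; COALESCE tags the missing dimension.
rows = con.execute("""
    SELECT o.order_id,
           COALESCE(c.full_name, 'UNKNOWN') AS customer_name
    FROM orders o
    LEFT JOIN customer c ON o.customer_id = c.customer_id
    ORDER BY o.order_id
""").fetchall()
print(rows)  # [('44a7657b', 'Alice Johnson'), ('510070c7', 'UNKNOWN')]
```

Note that COALESCE must be applied to every dimension column you surface, which is exactly why this pattern gets error-prone as the join widens.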
Your Options: 100% RI Compliance, UNKNOWN Placeholders, or Something in Between
Given the limitations of pipelines and warehouse-native constraint checks, we can’t expect referential integrity in our inputs.
For most companies, there are 3 options:
- Only process the data if the input data has 100% RI.
  - This approach will significantly impact your ability to get the output to stakeholders as soon as possible.
- Use left join + coalesce and tag missing data as UNKNOWN.
  - This approach can be error-prone, since every column from the joined dimension tables must be handled appropriately.
  - What happens if a column from the joined dimension table is needed in a metric computation?
- Allow data processing if the input data meets X% RI.
  - E.g., allow processing if at least 70% of rows conform to the RI.
failure_threshold = 0.30  # at most 30% of rows may violate RI, i.e., at least 70% must conform
failure_query = f"""
WITH count_metrics AS (
    SELECT count(o.customer_id) AS total_cnt
         , count(c.customer_id) AS matching_cnt
    FROM orders o
    LEFT JOIN customer c ON o.customer_id = c.customer_id
)
SELECT total_cnt
     , matching_cnt
     , ((total_cnt - matching_cnt) * 1.0) / total_cnt AS missing_percentage
     -- DuckDB allows referencing the missing_percentage alias within the same SELECT
     , CASE WHEN missing_percentage > {failure_threshold} THEN 'FAILED' ELSE 'PASSED' END AS ri_check
FROM count_metrics
"""
con.execute(failure_query).fetch_df()

| total_cnt | matching_cnt | missing_percentage | RI_check |
|---|---|---|---|
| 4 | 3 | 0.25 | PASSED |
Alternatively, you can use the outbox pattern, but this requires a significant investment from upstream teams.
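For context, the core of the outbox pattern is that the source service writes the business row and an event row in one transaction, so a downstream relay replaying events in commit order never sees a child before its parent. A minimal illustrative sketch (stdlib sqlite3, simplified schema; not a production implementation):

```python
import json
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customer (customer_id TEXT PRIMARY KEY, full_name TEXT)")
con.execute("""CREATE TABLE outbox (
    event_id INTEGER PRIMARY KEY AUTOINCREMENT,
    event_type TEXT NOT NULL,
    payload TEXT NOT NULL
)""")

# One atomic transaction: the business row and its event commit together or not at all.
with con:
    con.execute("INSERT INTO customer VALUES ('9bbd9c09', 'Alice Johnson')")
    con.execute("INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
                ("customer_created", json.dumps({"customer_id": "9bbd9c09"})))

# A relay process reads the outbox in event_id order and publishes downstream,
# guaranteeing customer_created precedes any order_created for that customer.
events = con.execute("SELECT event_type FROM outbox ORDER BY event_id").fetchall()
print(events)  # [('customer_created',)]
```

The investment lies upstream: every producing service must adopt the transactional write and run a relay, which is why this option needs buy-in beyond the data team.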
Conclusion
To recap, we saw:
- Why referential integrity is critical
- How referential integrity is lost in transit (source → warehouse)
- Consequences of referential integrity violations
- Steps you can take to avoid data issues in your output
Referential integrity encodes how the business operates. We lose this information in transit, and it is up to us, as data engineers, to weigh the tradeoffs and choose the appropriate way to handle data relationships.
The next time you are building a data pipeline, use this post as a reference to consider how your inputs relate and how you want to handle them for your specific use case.