How to Prevent Missing Data With Referential Integrity Checks

A practical guide to detecting and handling RI violations before your stakeholders see NULL data.

Author: Joseph Machado

Published: April 21, 2026

Keywords: referential integrity, data pipeline, data quality, orphaned records, foreign key, duckdb, data engineering, null data, data warehouse, ETL

Input Data Relationship Represents How Your Business Operates

To model a warehouse, you need to understand how the input sources are related to one another.

OLTP databases (Postgres, MySQL) have constraints (Foreign Key) that ensure only data that satisfies them can be inserted into the tables.

OLAP systems (Snowflake, Iceberg, etc.) do not enforce foreign key constraints (with some exceptions), which makes it very likely that incomplete data will end up in your outputs.

If you have faced issues such as:

  - Sudden NULL dimensions showing up in your dashboards
  - Blindly trusting schema-on-read, which leads to data quality nightmares
  - Feeling like a data janitor at scale

this post is for you.

Ensuring that input data is correctly related is also a way of verifying that your understanding of the source data is correct.

By the end of this post, you will know:

  1. Why referential integrity is critical
  2. How referential integrity is lost in transit (source → warehouse)
  3. Consequences of referential integrity violations
  4. Steps you can take to avoid data issues in your output

Follow along with the code: see the Setup section.

What Is Referential Integrity and Why It Matters

Assume you work for a company that allows customers to place orders.

erDiagram
  direction LR
  customer {
    UUID customer_id PK
    VARCHAR email
    VARCHAR full_name
    VARCHAR phone
    VARCHAR status
    TIMESTAMPTZ created_at
    TIMESTAMPTZ updated_at
  }
  orders {
    UUID order_id PK
    UUID customer_id FK
    UUID shipping_addr_id FK
    UUID billing_addr_id FK
    VARCHAR status
    NUMERIC total_amount
    TIMESTAMPTZ placed_at
    TIMESTAMPTZ created_at
    TIMESTAMPTZ updated_at
  }
  customer ||--o{ orders : "places"

Based on how the business operates, we know that an order cannot exist without a customer. Referential integrity is a way of defining this in our database.

We can define this using foreign key constraints, as shown below.


CREATE TABLE customer (
        customer_id   UUID        PRIMARY KEY,
        email         VARCHAR     NOT NULL UNIQUE,
        full_name     VARCHAR     NOT NULL,
        phone         VARCHAR,
        status        VARCHAR     NOT NULL DEFAULT 'active',
        created_at    TIMESTAMPTZ NOT NULL,
        updated_at    TIMESTAMPTZ NOT NULL
    );

CREATE TABLE orders (
      order_id           UUID          PRIMARY KEY,
      customer_id        UUID          NOT NULL REFERENCES customer(customer_id),
      shipping_addr_id   UUID          NOT NULL,
      billing_addr_id    UUID          NOT NULL,
      status             VARCHAR       NOT NULL,
      total_amount       NUMERIC(14,2) NOT NULL,
      placed_at          TIMESTAMPTZ   NOT NULL,
      created_at         TIMESTAMPTZ   NOT NULL,
      updated_at         TIMESTAMPTZ   NOT NULL
  );

We say that the orders.customer_id column references customer.customer_id, ensuring that any row inserted into the orders table must have a corresponding customer_id in the customer table.

For example, if you run the below insert query, you will get an error stating that the customer_id does not exist in the customer table.

INSERT INTO
  orders (
    order_id,
    customer_id,
    shipping_addr_id,
    billing_addr_id,
    status,
    total_amount,
    placed_at,
    created_at,
    updated_at
  )
VALUES
  (
    '00000000-0000-0000-0000-000000000000',
    '00000000-0000-0000-0000-000000000001', -- Non existent customer
    '00000000-0000-0000-0000-000000000002',
    '00000000-0000-0000-0000-000000000003',
    'pending',
    99.99,
    now(),
    now(),
    now()
  );
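You don't need the post's Postgres setup to see this behavior. As a quick, self-contained sketch, Python's built-in sqlite3 (my stand-in here, not part of the article's stack) enforces the same kind of constraint once foreign keys are switched on:

```python
import sqlite3

# In-memory database; SQLite enforces FKs only after this pragma is set
con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON;")

con.execute("CREATE TABLE customer (customer_id TEXT PRIMARY KEY)")
con.execute("""
    CREATE TABLE orders (
        order_id    TEXT PRIMARY KEY,
        customer_id TEXT NOT NULL REFERENCES customer(customer_id)
    )
""")

con.execute("INSERT INTO customer VALUES ('c-1')")
con.execute("INSERT INTO orders VALUES ('o-1', 'c-1')")  # parent exists: accepted

try:
    con.execute("INSERT INTO orders VALUES ('o-2', 'c-999')")  # no such customer
except sqlite3.IntegrityError as exc:
    print(f"rejected: {exc}")  # rejected: FOREIGN KEY constraint failed
```

Postgres surfaces the same class of failure as a foreign-key violation error from the server; only the error message differs.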
Note

Referential integrity is critical to ensuring data accuracy.

While one can argue that the application that inserts data into the tables can do this, the application code is susceptible to bugs.

Data Pipelines and Warehouses Usually Ignore Relationships

Ingestion pipelines that pull data from sources to warehouses/lakehouses/etc ignore referential integrity.

Assume you have a daily data pull for customer data. And an hourly one for order data. (Source: Postgres → Warehouse: DuckDB).

# Full code in notebook
import duckdb
from datetime import datetime, timezone

now = datetime.now(timezone.utc)

daily_run_ts  = now.replace(hour=0, minute=0, second=0, microsecond=0).strftime("%Y-%m-%d %H:%M:%S")
hourly_run_ts = now.strftime("%Y-%m-%d %H:%M:%S")

con = duckdb.connect()
con.execute("INSTALL postgres; LOAD postgres;")
con.execute("""
    ATTACH 'dbname=ecommerce user=dataengineer password=datapipeline host=postgres port=5432' 
    AS pg (TYPE postgres);
""")

print(f"Pulling customer data until {daily_run_ts}")
con.execute(f"""
    CREATE TABLE customer AS 
    SELECT * FROM pg.customer
    WHERE updated_at <= '{daily_run_ts}';
""")

print(f"Pulling orders data until {hourly_run_ts}") # Fact data is typically loaded into the warehouse more frequently
con.execute(f"""
    CREATE TABLE orders AS 
    SELECT * FROM pg.orders
    WHERE created_at <= '{hourly_run_ts}';
""")

display(con.execute("select * from customer").fetch_df())
display(con.execute("select * from orders").fetch_df())

customer

| # | customer_id | email             | full_name     | phone       | status | created_at             | updated_at             |
|---|-------------|-------------------|---------------|-------------|--------|------------------------|------------------------|
| 0 | 9bbd9c09    | alice@example.com | Alice Johnson | +1-555-0101 | active | 2026-04-12 16:52:15+00 | 2026-04-14 16:52:15+00 |
| 1 | b77c1919    | bob@example.com   | Bob Martinez  | +1-555-0202 | active | 2026-04-21 16:52:15+00 | 2026-04-21 16:52:15+00 |

orders

| # | order_id | customer_id | shipping_addr_id | billing_addr_id | status     | total_amount | placed_at  | created_at | updated_at |
|---|----------|-------------|------------------|-----------------|------------|--------------|------------|------------|------------|
| 0 | 44a7657b | 9bbd9c09    | 1037ebd1         | 1037ebd1        | delivered  | 129.99       | 2026-04-19 | 2026-04-22 | 2026-04-22 |
| 1 | 71cfa7ca | 9bbd9c09    | b53f894c         | 1037ebd1        | processing | 54.00        | 2026-04-19 | 2026-04-22 | 2026-04-22 |
| 2 | ac624fb6 | b77c1919    | b9b004c4         | b9b004c4        | shipped    | 249.50       | 2026-04-21 | 2026-04-22 | 2026-04-22 |
| 3 | 510070c7 | 560aada6 ⚠️ | 7344ddd6         | 7344ddd6        | pending    | 75.00        | 2026-04-22 | 2026-04-22 | 2026-04-22 |

You can see that the orders table now has an orphaned customer_id: 560aada6.

Note

RI breaks in transit when fact tables refresh faster than their dimension tables.

The warehouse sees the child before it sees the parent.

Referential Integrity Violations Lead to NULL or Dropped Data

When you transform input whose referential integrity is not fully valid, the result will have NULLs or missing rows.

Let’s look at an example.

con.execute("""
SELECT o.*
, c.full_name
, c.email
FROM orders o
JOIN customer c 
on o.customer_id = c.customer_id
""").fetch_df()

The above query outputs only 3 rows: the inner join silently drops the order with the orphaned customer_id.

Most real pipelines use left join + coalesce to join facts and dimensions for this very reason.

con.execute("""
SELECT COALESCE(c.full_name, 'UNKNOWN') as full_name
, COALESCE(c.email, 'UNKNOWN') as email
, o.*
FROM orders o
LEFT JOIN customer c 
on o.customer_id = c.customer_id
""").fetch_df()

The UNKNOWN placeholder, however, may not always be acceptable.

Even when it is, you will need to rerun the pipeline repeatedly to backfill the missing dimensions once they arrive.

Your Options: 100% RI Compliance, UNKNOWN Columns Ok, or Something in Between

Given the limitations of pipelines and warehouse-native constraint checks, we can’t expect referential integrity in our inputs.

For most companies, there are 3 options:

  1. Only process the data if the input data has 100% RI.
    • This approach will significantly impact your ability to get the output to stakeholders as soon as possible.
  2. Use Left join + Coalesce and tag missing data as UNKNOWN.
    • This approach can be error-prone as you need to ensure that all the columns from the left-joined tables are handled appropriately.
    • What happens if data from the left table is needed in a metric computation?
  3. Allow data processing if the input data meets X% RI.
    • E.g., Allow data processing if 70% of rows conform to the RI.
failure_threshold = 0.30  # allow at most 30% of rows to violate RI, i.e., at least 70% must conform

failure_query = f"""
WITH count_metrics AS (
SELECT count(o.customer_id) AS total_cnt
, count(c.customer_id) AS matching_cnt
FROM orders o
LEFT JOIN customer c ON o.customer_id = c.customer_id)
SELECT total_cnt
, matching_cnt
, ((total_cnt - matching_cnt) * 1.0) / total_cnt AS missing_percentage
, CASE WHEN missing_percentage > {failure_threshold} THEN 'FAILED' ELSE 'PASSED' END AS RI_check
FROM count_metrics
"""

con.execute(failure_query).fetch_df()
| total_cnt | matching_cnt | missing_percentage | RI_check |
|-----------|--------------|--------------------|----------|
| 4         | 3            | 0.25               | PASSED   |
Tip

Alternatively, you can use the outbox pattern, but this requires a significant investment from upstream teams.

Conclusion

To recap, we saw:

  1. Why referential integrity is critical
  2. How referential integrity is lost in transit (source → warehouse)
  3. Consequences of referential integrity violations
  4. Steps you can take to avoid data issues in your output

Referential integrity encodes how the business operates, and that information is lost in transit. It is up to us, the data engineers, to weigh the tradeoffs and choose the appropriate way to handle data relationships.

The next time you are building a data pipeline, use this post as a reference to consider how your inputs relate and how you want to handle them for your specific use case.

Read These

  1. Types of data quality checks
  2. How to implement data quality checks