erDiagram
direction LR
customer {
UUID customer_id PK
VARCHAR email
VARCHAR full_name
VARCHAR phone
VARCHAR status
TIMESTAMPTZ created_at
TIMESTAMPTZ updated_at
}
orders {
UUID order_id PK
UUID customer_id FK
UUID shipping_addr_id FK
UUID billing_addr_id FK
VARCHAR status
NUMERIC total_amount
TIMESTAMPTZ placed_at
TIMESTAMPTZ created_at
TIMESTAMPTZ updated_at
}
customer ||--o{ orders : "places"
Input Data Relationships Represent How Your Business Operates
To model a warehouse, you need to understand how the input sources are related to one another.
OLTP databases (Postgres, MySQL) enforce constraints (such as foreign keys) that ensure only data satisfying them can be inserted into the tables.
OLAP systems (Snowflake, Iceberg, etc.) generally do not enforce foreign key constraints (with some exceptions), which makes it very likely that incomplete data will end up in your outputs.
If you have faced issues like
- Sudden NULL dimensions showing up in your dashboard
- Blindly trusting schema-on-read, leading to data quality nightmares
- Feeling like a janitor, cleaning up bad data at scale
this post is for you.
Ensuring that input data is correctly related is how you verify that your understanding of the source data is correct.
By the end of this post, you will know:
- Why referential integrity is critical
- How referential integrity is lost in transit (source → warehouse)
- Consequences of referential integrity violations
- Steps you can take to avoid data issues in your output
Follow along with code > Setup.
What Is Referential Integrity and Why It Matters
Assume you work for a company that allows customers to place orders.
Based on how the business operates, we know that an order cannot exist without a customer. Referential integrity is a way of defining this in our database.
We can define this using foreign key constraints, as shown below.
CREATE TABLE customer (
customer_id UUID PRIMARY KEY,
email VARCHAR NOT NULL UNIQUE,
full_name VARCHAR NOT NULL,
phone VARCHAR,
status VARCHAR NOT NULL DEFAULT 'active',
created_at TIMESTAMPTZ NOT NULL,
updated_at TIMESTAMPTZ NOT NULL
);
CREATE TABLE orders (
order_id UUID PRIMARY KEY,
customer_id UUID NOT NULL REFERENCES customer(customer_id),
shipping_addr_id UUID NOT NULL,
billing_addr_id UUID NOT NULL,
status VARCHAR NOT NULL,
total_amount NUMERIC(14,2) NOT NULL,
placed_at TIMESTAMPTZ NOT NULL,
created_at TIMESTAMPTZ NOT NULL,
updated_at TIMESTAMPTZ NOT NULL
);
We say that the orders.customer_id column references customer.customer_id, ensuring that any row inserted into the orders table must have a matching customer_id in the customer table.
For example, if you run the insert below, you will get an error stating that the customer_id does not exist in the customer table.
INSERT INTO
orders (
order_id,
customer_id,
shipping_addr_id,
billing_addr_id,
status,
total_amount,
placed_at,
created_at,
updated_at
)
VALUES
(
'00000000-0000-0000-0000-000000000000',
'00000000-0000-0000-0000-000000000001', -- Non existent customer
'00000000-0000-0000-0000-000000000002',
'00000000-0000-0000-0000-000000000003',
'pending',
99.99,
now(),
now(),
now()
);
Referential integrity is critical to ensuring data accuracy.
While one can argue that the application inserting the data can enforce this rule, application code is susceptible to bugs; database constraints are the last line of defense.
Data Pipelines and Warehouses Usually Ignore Relationships
Ingestion pipelines that pull data from sources into warehouses/lakehouses typically ignore referential integrity.
Assume you have a daily data pull for customer data and an hourly one for order data (Source: Postgres → Warehouse: DuckDB).
# Full code in notebook
import duckdb
from datetime import datetime, timezone

now = datetime.now(timezone.utc)
daily_run_ts = now.replace(hour=0, minute=0, second=0, microsecond=0).strftime("%Y-%m-%d %H:%M:%S")
hourly_run_ts = now.strftime("%Y-%m-%d %H:%M:%S")
con = duckdb.connect()
con.execute("INSTALL postgres; LOAD postgres;")
con.execute("""
ATTACH 'dbname=ecommerce user=dataengineer password=datapipeline host=postgres port=5432'
AS pg (TYPE postgres);
""")
print(f"Pulling customer data until {daily_run_ts}")
con.execute(f"""
CREATE TABLE customer AS
SELECT * FROM pg.customer
WHERE updated_at <= '{daily_run_ts}';
""")
print(f"Pulling orders data until {hourly_run_ts}") # Fact data is typically loaded into the warehouse more frequently
con.execute(f"""
CREATE TABLE orders AS
SELECT * FROM pg.orders
WHERE created_at <= '{hourly_run_ts}';
""")
display(con.execute("select * from customer").fetch_df())
display(con.execute("select * from customer").fetch_df())
display(con.execute("select * from orders").fetch_df())

customer
| # | customer_id | email | full_name | phone | status | created_at | updated_at |
|---|---|---|---|---|---|---|---|
| 0 | 9bbd9c09 | alice@example.com | Alice Johnson | +1-555-0101 | active | 2026-04-12 16:52:15+00 | 2026-04-14 16:52:15+00 |
| 1 | b77c1919 | bob@example.com | Bob Martinez | +1-555-0202 | active | 2026-04-21 16:52:15+00 | 2026-04-21 16:52:15+00 |
orders
| # | order_id | customer_id | shipping_addr_id | billing_addr_id | status | total_amount | placed_at | created_at | updated_at |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 44a7657b | 9bbd9c09 | 1037ebd1 | 1037ebd1 | delivered | 129.99 | 2026-04-19 | 2026-04-22 | 2026-04-22 |
| 1 | 71cfa7ca | 9bbd9c09 | b53f894c | 1037ebd1 | processing | 54.00 | 2026-04-19 | 2026-04-22 | 2026-04-22 |
| 2 | ac624fb6 | b77c1919 | b9b004c4 | b9b004c4 | shipped | 249.50 | 2026-04-21 | 2026-04-22 | 2026-04-22 |
| 3 | 510070c7 | 560aada6 ⚠️ | 7344ddd6 | 7344ddd6 | pending | 75.00 | 2026-04-22 | 2026-04-22 | 2026-04-22 |
You can see that the orders table now has an orphaned customer_id: 560aada6.
RI breaks in transit when fact tables refresh faster than their dimension tables.
The warehouse sees the child before it sees the parent.
Referential Integrity Violations Lead to NULL or Dropped Data
When you transform input whose referential integrity is not fully valid, the result will contain NULLs or missing rows.
Let’s look at an example: an inner join between orders and customer will output only 3 of the 4 order rows, silently dropping the orphaned order.
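The dropped-row behavior is easy to reproduce. The sketch below uses Python's built-in sqlite3 for portability (the same SQL applies in DuckDB); the table and column names mirror the post's schema, and the sample rows are a simplified stand-in for the data above.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customer (customer_id TEXT PRIMARY KEY, full_name TEXT)")
con.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, customer_id TEXT, total_amount REAL)")
con.executemany("INSERT INTO customer VALUES (?, ?)",
                [("9bbd9c09", "Alice Johnson"), ("b77c1919", "Bob Martinez")])
con.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [("44a7657b", "9bbd9c09", 129.99),
                 ("71cfa7ca", "9bbd9c09", 54.00),
                 ("ac624fb6", "b77c1919", 249.50),
                 ("510070c7", "560aada6", 75.00)])  # orphaned customer_id

# The inner join silently drops the orphaned order.
rows = con.execute("""
    SELECT o.order_id, c.full_name
    FROM orders o
    JOIN customer c ON o.customer_id = c.customer_id
""").fetchall()
print(len(rows))  # 3, not 4
```

The lost row raises no error anywhere; the only symptom is a total that quietly excludes the orphaned order.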
Most real pipelines use a left join plus coalesce to join facts and dimensions for this very reason.
However, the UNKNOWN placeholder may not always be acceptable, and even when it is, you will need to rerun the pipelines repeatedly to backfill the missing dimensions once they arrive.
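The left join + coalesce pattern can be sketched as follows (again using stdlib sqlite3 for portability; identical SQL works in DuckDB). The orphaned order is kept instead of dropped, with its missing dimension tagged UNKNOWN.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customer (customer_id TEXT PRIMARY KEY, full_name TEXT)")
con.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, customer_id TEXT)")
con.execute("INSERT INTO customer VALUES ('9bbd9c09', 'Alice Johnson')")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [("44a7657b", "9bbd9c09"),
                 ("510070c7", "560aada6")])  # orphaned customer_id

# LEFT JOIN keeps every order; COALESCE tags the missing dimension.
rows = con.execute("""
    SELECT o.order_id,
           COALESCE(c.full_name, 'UNKNOWN') AS customer_name
    FROM orders o
    LEFT JOIN customer c ON o.customer_id = c.customer_id
    ORDER BY o.order_id
""").fetchall()
print(rows)  # [('44a7657b', 'Alice Johnson'), ('510070c7', 'UNKNOWN')]
```

Note that COALESCE must be applied to every dimension column you surface, which is exactly why this pattern gets error-prone as the join widens.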
Your Options: 100% RI Compliance, UNKNOWN Placeholders, or Something in Between
Given the limitations of pipelines and warehouse-native constraint checks, we can’t expect referential integrity in our inputs.
For most companies, there are 3 options:
- Only process the data if the input data has 100% RI.
  - This approach will significantly impact your ability to get the output to stakeholders as soon as possible.
- Use left join + coalesce and tag missing data as UNKNOWN.
  - This approach can be error-prone, since every column from the joined dimension tables must be handled appropriately.
  - What happens if a column from the joined dimension table is needed in a metric computation?
- Allow data processing if the input data meets X% RI.
  - E.g., allow processing if at least 70% of rows conform to the RI.
failure_threshold = 0.30  # at most 30% of rows may violate RI, i.e., at least 70% must conform
failure_query = f"""
WITH count_metrics AS (
    SELECT count(o.customer_id) AS total_cnt
         , count(c.customer_id) AS matching_cnt
    FROM orders o
    LEFT JOIN customer c ON o.customer_id = c.customer_id
)
SELECT total_cnt
     , matching_cnt
     , ((total_cnt - matching_cnt) * 1.0) / total_cnt AS missing_percentage
     -- DuckDB allows referencing the missing_percentage alias within the same SELECT
     , CASE WHEN missing_percentage > {failure_threshold} THEN 'FAILED' ELSE 'PASSED' END AS ri_check
FROM count_metrics
"""
con.execute(failure_query).fetch_df()

| total_cnt | matching_cnt | missing_percentage | RI_check |
|---|---|---|---|
| 4 | 3 | 0.25 | PASSED |
Alternatively, you can use the outbox pattern, but this requires a significant investment from upstream teams.
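For context, the core of the outbox pattern is that the source service writes the business row and an event row in one transaction, so a downstream relay replaying events in commit order never sees a child before its parent. A minimal illustrative sketch (stdlib sqlite3, simplified schema; not a production implementation):

```python
import json
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customer (customer_id TEXT PRIMARY KEY, full_name TEXT)")
con.execute("""CREATE TABLE outbox (
    event_id INTEGER PRIMARY KEY AUTOINCREMENT,
    event_type TEXT NOT NULL,
    payload TEXT NOT NULL
)""")

# One atomic transaction: the business row and its event commit together or not at all.
with con:
    con.execute("INSERT INTO customer VALUES ('9bbd9c09', 'Alice Johnson')")
    con.execute("INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
                ("customer_created", json.dumps({"customer_id": "9bbd9c09"})))

# A relay process reads the outbox in event_id order and publishes downstream,
# guaranteeing customer_created precedes any order_created for that customer.
events = con.execute("SELECT event_type FROM outbox ORDER BY event_id").fetchall()
print(events)  # [('customer_created',)]
```

The investment lies upstream: every producing service must adopt the transactional write and run a relay, which is why this option needs buy-in beyond the data team.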
Conclusion
To recap, we saw:
- Why referential integrity is critical
- How referential integrity is lost in transit (source → warehouse)
- Consequences of referential integrity violations
- Steps you can take to avoid data issues in your output
Referential integrity encodes how the business operates. We lose this information in transit, and it is up to us, as data engineers, to weigh the tradeoffs and choose the appropriate way to handle data relationships.
The next time you are building a data pipeline, use this post as a reference to consider how your inputs relate and how you want to handle them for your specific use case.