How to Manage Upstream Schema Changes in a Fast-Moving, Data-Driven Company
- 1. Introduction
- 2. Strategies for data teams to handle changing schemas
- 3. Conclusion
1. Introduction
If you have worked at a company that moves fast (or claims to), you’ve inevitably had to deal with your pipelines breaking because the upstream team decided to change the data schema!
If you are:
- Frequently in meetings, fixing pipeline issues due to schema changes
- Stressed, unable to deliver quality work, always in a hurry to put out the next fire
- Working with teams who have to prioritize speed over everything
This post is for you. Constantly dealing with broken pipelines due to upstream data changes is detrimental to your career and leads to burnout.
What if you could focus on building great data projects? Imagine pipelines auto-correcting themselves! This post will enable you to do that.
We will discuss the strategies for handling upstream schema changes. These strategies will help you move from constant fire-fighting mode to a stable way of dealing with breaking upstream changes.
Upstream data schema changes fall into three categories (a minimal detection sketch follows the list):
- Column-level changes: new columns, column deletions, data type changes
- Table-level changes: grain changes, table renames, table swaps, primary key changes, join key changes, etc.
- Business process changes: new or removed tables, new relationships, etc.
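To make the column-level category concrete, here is a minimal, hand-rolled Python sketch (not tied to any particular library; the column names are made up) that diffs a table's current schema against the last snapshot your pipeline saw:

```python
def diff_schemas(previous: dict[str, str], current: dict[str, str]) -> dict[str, list]:
    """Compare two {column: data_type} mappings and report column-level changes."""
    added = [c for c in current if c not in previous]
    removed = [c for c in previous if c not in current]
    retyped = [
        (c, previous[c], current[c])
        for c in current
        if c in previous and previous[c] != current[c]
    ]
    return {"added": added, "removed": removed, "retyped": retyped}

# Example: upstream added `discount`, dropped `fax`, and widened `amount`.
previous = {"order_id": "bigint", "amount": "int", "fax": "string"}
current = {"order_id": "bigint", "amount": "decimal(10,2)", "discount": "double"}
print(diff_schemas(previous, current))
# {'added': ['discount'], 'removed': ['fax'], 'retyped': [('amount', 'int', 'decimal(10,2)')]}
```

Running a diff like this at the start of a pipeline turns a cryptic downstream failure into an explicit, actionable message.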
2. Strategies for data teams to handle changing schemas
We will review four strategies for handling schema changes; you can mix and match them to fit your use case.
2.1. Meetings are the most straightforward approach
Upstream and data teams proactively communicate about upcoming changes and agree on timelines and schemas before any change is made to upstream datasets.
✅ Pros:
- Simplest approach
- Documentation in Confluence, Google Docs, etc.
- Handshake between teams on a schema
❌ Cons:
- Error-prone
- Meetings slow down development speed
- Not every nuance in data can be captured
🎯 How-to:
- Meeting notes with clear to-dos and owners
- Notes and decisions logged in Confluence or Google Docs
2.2. Upstream dumps data, the data team deals with it
Upstream teams produce data, and the data team deals with whatever data they get. This is by far the most common method of operation in the industry.
✅ Pros:
- Simplest approach for upstream teams
- Enables upstream teams to move fast
- Good enough for most business use cases
❌ Cons:
- The data team is constantly playing catch-up
- Bad data, breaking pipelines, & tech debt
- The data team loses conceptual knowledge of the data
🎯 How-to:
- Use Apache Iceberg's schema evolution with Spark's `mergeSchema` option (see the sketch below)
- Tools like dbt generally have an option for this on incremental models, e.g., `on_schema_change` (`ignore`, `fail`, `append_new_columns`, or `sync_all_columns`)
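Here is a minimal PySpark sketch of both knobs; the paths and table names are hypothetical. `mergeSchema` is a standard Spark option for Parquet reads, and the Iceberg write path assumes the table property `write.spark.accept-any-schema` is enabled, as described in Iceberg's Spark write documentation:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-evolution").getOrCreate()

# Reading Parquet: reconcile files written with different schemas into one
# DataFrame; columns missing from older files come back as nulls.
orders = (
    spark.read
    .option("mergeSchema", "true")        # merge schemas across all files
    .parquet("s3://lake/raw/orders/")     # hypothetical path
)

# Writing to Iceberg: allow new upstream columns to evolve the table schema.
# The table must be configured to accept writes with a different schema first.
spark.sql(
    "ALTER TABLE prod.db.orders "
    "SET TBLPROPERTIES ('write.spark.accept-any-schema' = 'true')"
)
(
    orders.writeTo("prod.db.orders")      # hypothetical Iceberg table
    .option("mergeSchema", "true")        # add new columns on write
    .append()
)
```

Silently accepting any schema is exactly what makes this the "data team plays catch-up" strategy, so pair it with monitoring so you at least know when the schema moved.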
2.3. The data team as an upstream reviewer prevents issues
In this approach, the data team is involved in the upstream’s data modeling process. Data teams are usually more pedantic about data models.
Data team reviews can prevent bad data from being produced.
✅ Pros:
- Bad data is prevented from being produced
- Well-designed data model
- Shared understanding of data between upstream and data teams
❌ Cons:
- Slows down upstream teams
- Will not catch aggregate issues, e.g., average revenue should be consistent across days
🎯 How-to:
- The process can be sped up using data contracts
- GitHub provides tools like CODEOWNERS, with which upstream teams can automatically include data teams in reviews (example below)
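For example, a CODEOWNERS file like this sketch (the paths and team handles are made up) makes the data team a required reviewer on any pull request that touches schema-bearing files:

```
# .github/CODEOWNERS (hypothetical paths and team handles)
/schemas/            @acme/data-team
*.avsc               @acme/data-team
/db/migrations/      @acme/data-team @acme/backend-team
```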
2.4. Validating input before processing saves debugging time
The data team checks whether input data meets expectations before using it. If you find an issue with the input data, you will have to work with the upstream team to correct and reprocess the data for the period during which it was inaccurate.
✅ Pros:
- Quick detection of issues
- Empowers data teams to automatically route issues to the relevant upstream teams
❌ Cons:
- Upstream & data teams need to agree on data checks
- Multiple input tests increase data processing times
🎯 How-to:
- Any data quality tool can be used to check input data quality (a minimal sketch follows this list)
- With a streaming system, you can use dead-letter queues (DLQs) and reconciliation patterns
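As an illustration, here is a minimal PySpark input gate; the expected schema, path, and error handling are all assumptions, and a dedicated data quality tool would replace the raise with richer checks and alerting:

```python
from pyspark.sql import SparkSession, DataFrame

spark = SparkSession.builder.appName("input-validation").getOrCreate()

# Hand-written expectation: the columns (and Spark types) the pipeline relies on.
EXPECTED_SCHEMA = {
    "order_id": "bigint",
    "customer_id": "bigint",
    "amount": "decimal(10,2)",
    "created_at": "timestamp",
}

def validate_input(df: DataFrame, expected: dict) -> list[str]:
    """Return a list of human-readable schema violations (empty list == pass)."""
    actual = {f.name: f.dataType.simpleString() for f in df.schema.fields}
    errors = [f"missing column: {col}" for col in expected if col not in actual]
    errors += [
        f"type drift on {col}: expected {typ}, got {actual[col]}"
        for col, typ in expected.items()
        if col in actual and actual[col] != typ
    ]
    return errors

orders = spark.read.parquet("s3://lake/raw/orders/")  # hypothetical path
violations = validate_input(orders, EXPECTED_SCHEMA)
if violations:
    # Fail fast with a precise message instead of a cryptic mid-pipeline error.
    raise ValueError("Input schema check failed:\n" + "\n".join(violations))
```

Failing fast here is what shortens debugging: the pipeline reports exactly which column moved instead of erroring somewhere deep in a transformation.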
3. Conclusion
Most companies use a mix of these approaches, depending on the importance of the data assets being produced.
Data teams are the connectors between upstream teams and analytics use cases, so it’s crucial to reduce friction and detect bad data.
Ideally, schema changes are handled (listed in order of increasing cost to fix):
- Before it occurs
- Before it is processed
- After the pipeline breaks during processing
- After processing by the stakeholder
Pick the option that best fits your use case. How does your team deal with changing upstream data schemas? Let me know in the comment section below.