Uplevel your dbt workflow with these tools and techniques
- 1. Introduction
- 2. Setup
- 3. Ways to uplevel your dbt workflow
- 3.1. Reproducible environment
- 3.2. Reduce feedback loop time when developing locally
- 3.3. Reduce the amount of code to write using dbt packages
- 3.4. Validate data before pushing changes to production
- 3.5. Observe data quality with elementary
- 3.6. Streamline common tasks
- 4. Conclusion
- 5. Further Reading
- 6. References
1. Introduction
Data Build Tool (dbt) is a powerful system that helps data engineers and analysts ship code faster, enabling good SWE best practices. dbt has had widespread adoption in the past few years. If you are wondering
How do you take your dbt development experience to the next level?
What dbt hacks do other DEs use that you wish you knew about?
How do you spend less time reviewing trivial stuff in PRs?
How do you enable your under-resourced data team to keep up with new features to improve business impact?
Is testing data only after building data the right approach?
Then this post is for you. In this post, we will go over some approaches you can quickly set up in your dbt project to improve development speed, confidently deploy while ensuring that your changes will not break datasets, enhance code quality, reduce feedback loop time, and ensure data quality.
By the end of this post, you will know about six concepts to speed up your dbt project’s feature delivery speed. You can also adopt these concepts to your dbt projects reasonably quickly.
2. Setup
Prerequisites
Clone the git repo.
git clone https://github.com/josephmachado/simple_dbt_project_dev.git
cd simple_dbt_project_dev
# Set dbt env variables to tell dbt where the profiles.yml and dbt_project.yml files are
export DBT_PROFILES_DIR=$(just profile-dir)
export DBT_PROJECT_DIR=$(just project-dir)
To fully grasp these concepts, you will need to know what dbt is and how to use it.
A common task with most systems is to run specific commands multiple times. In such cases, it’s beneficial to use custom scripts or a command runner like just that enables you to create shortcuts for the most commonly used commands.
Check out our justfile with shortcuts for the most commonly run commands. In the code examples below, we will show the command to run and include the corresponding shortcuts (with the just command runner) for your convenience.
3. Ways to uplevel your dbt workflow
This section will review six concepts to level up your dbt workflow. Each section will describe the problem it aims to solve, how to solve it with a tool/technique/argument, an example, and the tradeoffs you make when using the approach specified.
3.1. Reproducible environment
3.1.1. A virtual environment with Poetry
dbt is a Python project, so managing a Python environment will enable a reproducible codebase. Python suffers from dependency hell where different Python libraries, may depend on various versions of the same underlying library. We will use Poetry to handle dependency.
Poetry will use the pyproject.toml and poetry.lock files to identify the python libraries(along with the correct versions that work for all of our python project dependencies) needed. Let’s see how we can use Poetry to create and activate a virtual environment.
rm -rf .venv && poetry config virtualenvs.in-project true && poetry install --no-root
# This can be run as
# just create-venv
source .venv/bin/activate
You will now be in the virtual environment with all the Python libraries installed. To add/remove Python dependencies from the virtual environment, you will need to use Poetry.
3.1.2. Use Docker to run your warehouse locally
While a virtual environment provides good isolation for Python libraries, Docker allows you to run databases easily. We can use Docker to run a Postgres instance locally; this will serve as a data warehouse for local development and testing.
We will use Docker to run a Postgres instance locally for development; run the following command in your terminal.
docker run -d \
--name postgres \
-e POSTGRES_USER=dbt \
-e POSTGRES_PASSWORD=password1234 \
-e POSTGRES_DB=dbt \
-v $(pwd)/raw_data:/input_data \
-v $(pwd)/warehouse_setup:/docker-entrypoint-initdb.d \
-p 5432:5432 \
postgres:16
# You can run this with just start-db
The above command shows how to set the database name and credentials (with the POSTGRES_* environment variables). We mount our warehouse_setup
volume; this folder contains the setup SQL scripts to run to create the base tables and users and grant them appropriate access.
Note However, if you are using a closed source system like Snowflake, BigQuery, or Redshift, you will need to have a dev environment setup and be able to connect to it. While using Postgres as a local dev alternative to closed source systems may work, there will be a lot of consistency issues with data types, functions available, and query plans.
3.2. Reduce feedback loop time when developing locally
One of the most significant issues when developing data pipelines is the time it takes to run a pipeline and then check/validate its outputs. Feedback loop refers to the repetitive process of changing and testing the output. Ideally, the feedback loop when developing should be as low as possible for an efficient development flow.
We will review some steps you can follow to ensure your feedback loop is as quick as possible.
3.2.1. Run only required dbt objects with selectors
One of the most significant issues with running the dbt run
command is that it runs all the models. In most cases, your dbt project may have multiple unrelated pipelines; it is unnecessary to run all the models while testing a few models.
dbt offers the ability for the user to run only chosen models. One can choose which models to run with selectors (--select
flag), which is available with all the dbt (run, test, etc) commands.
dbt offers four powerful ways to select which models to run.
- Methods : Provides ways of selecting models to run (based on if they are seeds, tags, path, test name/type, file path, etc)
- Graph operators : Used to select models to run based on a given models up/down stream dependencies.
- Set operators : Used to combine (AND / OR) multiple selection criteria.
- Exclude : Used to exclude models to run.
With these selectors, you can choose precisely which models to run. If you are testing locally, it is beneficial to only run the necessary models instead of using dbt run,
which will run all the models in your dbt project.
An example of running only one model(customer_orders) is:
dbt run --select "customer_orders"
# List dbt objects that are tested and belong to source data
dbt ls --select "resource_type:test,source:*"
Another way to run only select models or tests is by using tags. While most of the selection methods specified here
use folder paths or dbt concepts (source, package, etc.), the tag
method enables the selection of arbitrary models/tests provided the appropriate tags are added to them.
For some of our tests (which we will go over in a later section), we tag them with the value of elementary
. Tagging tests/models enables us to tell dbt only to run those models/tests.
For example, one of our tests has this tag:
tests:
- elementary.volume_anomalies:
training_period:
period: day
count: 3000
timestamp_column: order_purchase_timestamp
time_bucket:
period: day
count: 1
tags: [elementary]
config:
severity: error
As you see, this specific test elementary.volume_anomalies
has added a tag called elementary
. Using the following command, we can tell dbt to list only this test (and other tests with the elementary
tag).
# Make sure that you are in the sde_dbt_tutorial directory
dbt ls --select "elementary"
While these selectors are extremely powerful, they require careful examination before running. It is critical to ensure that all your required models are being run based on the selectors used. Fortunately, dbt has a handy ls
command that lets one check the list of models/tests to run based on your chosen selectors.
dbt ls --select "elementary,resource_type:test"
3.2.2. Use prod datasets to build dev models with defer
You may need to use production (or another environment’s) data to build a model. The need to use a different environment to build a model in your development environment can be because the upstream models do not exist in your environment, or you need access to data from a different environment, etc.
In such cases, you can use defer
to tell dbt to use models from another environment to build your chosen model(s).
# make sure your virtual env is activated
just restart #. to clear out existing models in the dev and prod
just prod-run # to run all the dbt models in the prod environment
# We are copying over the manifest JSON file from the prod run to let dbt know which upstream models to use
rm -f ./sde_dbt_tutorial/prod-run-artifacts/manifest.json
cp ./sde_dbt_tutorial/target/manifest.json ./sde_dbt_tutorial/prod-run-artifacts/
# Run defer using pre-build prod tables and create a new model in dev
dbt run --select "customer_orders" --defer --state prod-run-artifacts
# You will see a log showing one model created
# Open warehouse with
just warehouse
On your development warehouse(log in with just warehouse
), you can check to see that only the customer_orders
model exists.
select * from your_name_warehouse.dim_customers; -- table does not exist error
select * from your_name_warehouse.customer_orders limit 10; -- you will see results
set search_path to 'your_name_warehouse';
\d -- you will only see customer_orders tables in your your_name_warehouse schema
\q -- exit pgcli
The manifest.json file contains the dbt lineage graph details, test details, macro details etc doc .
With defer and state:modified
selector, we can ensure that only the modified table built and all other data is from an alternate environment (prod/UAT, etc) or an older run of the current environment. Only creating the modified model is called slim ci.
Cons:
- Managing manifest.json, if your use is complex with multiple schemas, etc., you will need to ensure that the dbt manifest files are copied from the appropriate location.
- Using manifest has some caveats(such as issues with macros rendering the expected model name, variables, etc): https://docs.getdbt.com/reference/node-selection/state-comparison-caveats
3.2.3. Parallelize model building by increasing thread count
dbt builds models in parallel, and you can increase the number of models it can run at a given time by increasing the thread count. While increasing the thread count does increase the number of models built in parallel, it also impacts the database performance.
If you have a dedicated warehouse for development (like we do in our case), you can increase it; however, in production, you will need to ensure that you are not overtaxing the database and thus impacting other users/system data warehouse performance.
We can set thread count in profiles.yml .
3.3. Reduce the amount of code to write using dbt packages
One should ensure the reuse of existing code; this helps develop faster and reduces the chances of bugs.
3.3.1. dbt utils for commonly recreated SQL functionality
Building and maintaining a data warehouse is a pretty common task, and there are well-established patterns and codes for how to do this with dbt. When building a new feature with no native support by dbt, please check for it among the multiple dbt packages available.
A handy dbt package is dbt-utils
; it provides standard functionality often needed for data processing but is missing natively in warehouses and dbt-macros. For example, if you want to pivot a table in a warehouse that does not have a native pivot function, you can use the pivot macro. We use the pivot macro in our order_status_pivot
model.
SELECT
EXTRACT(
YEAR
FROM
order_approved_at
) AS order_year, --noqa: CV03
{{
dbt_utils.pivot('order_status',
dbt_utils.get_column_values(ref('customer_orders'), 'order_status')) }}
FROM
{{ ref('customer_orders') }}
GROUP BY
1
The above query gets compiled to the target folder (sde_dbt_tutorial/target/compiled/sde_dbt_tutorial/models/marts/marketing/order_status_pivot.sql). Check out dbt-utils documentation for a list of available features.
3.3.2. dbt expectations for extensive testing in dbt
Another good example is the dbt-expectations
library, which provides a wide range of tests you can use in your dbt project (inspired by Python’s great expectations package). For example, we use dbt-expectations
to compare column values.
- dbt_expectations.expect_column_pair_values_A_to_be_greater_than_B:
column_A: delivered
column_B: invoiced
or_equal: false
row_condition: order_year > 2018
The con, however, is that you will not be able to control the implementation, which may be inefficient for your warehouse.
Check out dbt-expectations documentation for a list of available features.
3.4. Validate data before pushing changes to production
NOTE: data diff has been archived, do not use!
One of the most common ways of testing that your code changes are valid in a dbt project is by comparing the data generated using your code against the data already in the production database. Comparing development with production data ensures that the granularity remains the same and that the changes made do not have unintended consequences such as corrupting values in a column (e.g., lower casing string), changing units (dollars and cents etc.), etc.
data-diff is a Python library that compares datasets and shows their differences. For example, when changing a dataset, you can use data-diff to compare data in your development environment with data in production to ensure that your changes are not inadvertently affecting other columns.
just restart && just dev-run && just prod-run
rm -f ./sde_dbt_tutorial/prod-run-artifacts/manifest.json
cp ./sde_dbt_tutorial/target/manifest.json ./sde_dbt_tutorial/prod-run-artifacts/
# go to and ./sde_dbt_tutorial/models/marts/marketing/customer_orders.sql
# and add a new col 'som col' as some_new_col,
cd sde_dbt_tutorial
dbt run --select "customer_orders" && data-diff --dbt --state ./prod-run-artifacts/manifest.json --select customer_orders -k order_id,customer_id
cd ..
You will see the diff as shown below.
dbt.warehouse.customer_orders <> dbt.your_name_warehouse.customer_orders
Column(s) added: {'some_new_col'}
No row differences
Note that we can also use data-diff to compare datasets between different environments. In our example, both the datasets are in the same database (different schema); depending on the project, your data may be in the same database or another database. data-diff
has options to compare data within and outside the same database ref docs
.
Cons: While the open-source library offers a good comparison of datasets, the more valuable features are only available if you sign up for their cloud services. dbt audit helper is another option
3.5. Observe data quality with elementary
With data teams creating more models and becoming responsible for them, it’s essential that the data team understands the data quality, what/where/how failures occur, and data quality trends over time. To address the need to monitor data quality over time, a new term called data observability
has emerged.
Multiple tools (paid, semi-free, free) handle observability for the data produced. One of the tools that focuses primarily on dbt is elementary.
We will use the open-source elementary dbt package to add schema checks and anomaly detection and the elementary
python package to create a data quality dashboard.
3.5.1. Schema checks
One of the main reasons for pipeline failures is due to unexpected schema changes. Elementary provides the ability to check schema changes of upstream tables(schema checks
) and detect if your output model (consumed by a BI tool) has a schema change(exposure
), which may lead to BI tool breaking.
Let’s add schema checks to our sources to detect unexpected schema changes(schema checks ). We can add these tests to the source table definitions as shown below.
tables:
- name: customers
config:
tags: [raw_layer]
columns:
- name: customer_id
tests: [not_null]
tests:
- elementary.schema_changes:
tags: [elementary]
config:
severity: error
Let’s change our source schema and see how our tests fail; log into the warehouse with the just warehouse
command. Add a new dummy column called textcol
as shown below.
alter table raw_layer.customers add column textcol varchar;
\q
In your terminal, run the tests for the source tables, and you will see the schema check test fail. The schema check is helpful when another team/process controls data loading into the warehouse.
dbt test --select "source:*"
Elementary provides a schema check mechanism called ' exposure ' to ensure that the output data that our pipeline produces does not change types unexpectedly and break downstream consumers. Exposures
enable us to catch unexpected data type changes in our data sets used by some BI tools/other consumers.
Our exposure for order_status_pivot
model is as follows (exposure.yml
)
---
version: 2
exposures:
- name: order_status_pivot
label: Order Status Pivot
type: dashboard
maturity: high
url: https://your.bi.tool/dashboards/2
description: >
Shows year distribution of order by status
depends_on: [ref('order_status_pivot')]
owner:
name: John Reese
email: john.reese@themachine.com
meta:
referenced_columns:
- column_name: order_year
data_type: numeric
Let’s make a type change in the order_status_pivot
model to check how exposures alert us on a potential issue. Make the following changes to your order_status_pivot.sql file.
- EXTRACT(
- YEAR
- FROM
- order_approved_at
- ) AS order_year,
+ CAST(EXTRACT(
+ YEAR
+ FROM
+ order_approved_at
+ ) AS VARCHAR) AS order_year,
Now run the model and then test it as shown below.
dbt run --select "order_status_pivot"
dbt test --select "order_status_pivot"
You will see the error Failure in test elementary_exposure_schema_validity_order_status_pivot_ (models/marts/marketing/marketing.yml)
.
3.5.2. Anomaly detection
A particularly tricky issue to detect is value changes over time. Elementary provides a suite of anomaly detection tests to ensure
- Metrics along key dimensions remain within the allowed threshold
- Column values remain within the allowed threshold
- Volume and freshness of data are within the allowed threshold
By default, elementary considers any values with a standard score of 3 or above an anomaly(detection algorithm ). Elementary allows you to modify arguments to control the anomaly detection algorithm. There are different types of anomalies that elementary can detect ; the general idea is to monitor a number over multiple dbt runs to see if the value generated in a run is outside the expected range. This number can be:
- row count
- freshness of the datasets(based on some timestamp column)
- metrics from the group by dimension(s)
- column statistics such as null rate, length, min, max
Let’s see an example of monitoring row count over multiple dbt runs. Let’s add an anomaly check based on row count for the customer_orders
model as shown below.
tests:
- elementary.volume_anomalies:
training_period:
period: day
count: 3000
timestamp_column: order_purchase_timestamp
time_bucket:
period: day
count: 1
tags: [elementary]
config:
severity: error
In the above example, elementary will run volume anomaly detection with the given parameters (parameter docs ).
The red points indicate anomalies in the volume count of the customer_orders
model.
3.5.3. Build data observability reports
Elementary also allows generating a report with all the test results. Elementary will gather all the tests that were run as part of your dbt test (available at ./sde_dbt_tutorial/target/compiled/sde_dbt_tutorial/tests/*) and generate a static HTML in the ./edr_target/elementary_report.html
location.
Here, one can see all the tests run and anomaly detection results.
3.6. Streamline common tasks
In this section, we will go over some tips to speed up dbt development and testing.
3.6.1. [VS Code] Use the dbt power user extension
dbt uses jinja2, ref, etc., to create SQL queries(compile to SQL), which run on your warehouse. The common problems associated with this jinja2 -> SQL compilation approach are
- There is no autocomplete of SQL commands
- Not knowing the exact sql query to be executed
- Having to run
dbt docs
to create and view lineage - Not knowing if the referenced database object(table or view exists)
- Needing to run parents/children of a model quickly
While it is possible to check the above by compiling and inspecting the target folder and using dbt selectors
, the dbt power user
vscode extension
provides this via an easy to use vscode extenstion. The dbt power user
extension provides autocomplete, column lineage, complied view, sql validation, checking for presence of referenced database objects, easy to edit documentation & go to definition code exploration.
Note: some features(e.g. column lineage) require sigining up for their services.
3.6.2. Set up auto grants to streamline end-user access
Granting permission to the models created by dbt is a common requirement. Historically done with a post hook, where after the models run, you’d set up a GRANT stakeholder SELECT ACCESS TO some_model;
or similar. But dbt now has a feature dedicated to providing grants to models. You can specify grants for all the models in a path via dbt_project.ym
l, & models via individual config.
We have added a grant config in our dbt_project.yml
to ensure that the user stakeholder
(in actual project this will be a user group) has select access to all the models in marketing
path.
models:
sde_dbt_tutorial:
# Applies to all files under models/marts/core/
marts:
core:
materialized: table
marketing:
materialized: view
+grants:
select: ['stakeholder']
Note that this only grants access to models; you must ensure that the user/role has usage access for your schemas. We have added schema usage permissions via our setup script here .
3.6.3. Reduce time to production by avoiding PR nits
A typical comment among PR reviews is to format the code better or argue over formatting style. It is beneficial to have a standard style guide to ensure the code has a consistent feel. Having style standards automated ensures that it is easy to follow, and any dev new to the team is empowered to focus on the core feature delivery and not spend time fixing formatting issues.
There are two main concepts to understand
- Linter tells us what’s wrong with our code regarding style and code correctness.
- Formatter formats the code to conform to the standard style.
3.6.3.1. Lint and format sql with sqlfluff
sqlfluff is an SQL linter and a formatter. While there are multiple sql linters and formatters, we chose sqlfluff since it has good support for dbt (macros, jinja2, etc). We use the .sqlfluff to provide dbt specific settings (like where the dbt project file is, how to format macros, etc) and the .sqlfluffignore to ignore the folders to format.
We can lint and format sql files with sqlfluff via terminal as shown below
sqlfluff lint sde_dbt_tutorial/models --dialect postgres # just lint-sql
sqlfluff fix sde_dbt_tutorial/models --dialect postgres --show-lint-violations # just format-sql
3.6.3.2. Lint and format yaml with yamllint
We use yamllint and yamlfix to lint and format our yaml files. Note that dbt cloud uses prettier (installed with npm, yarn, etc.) to format; however, to keep our libraries Python focussed, we have opted for yamllint & yamlfix. We can lint and format yaml files as shown below.
yamllint sde_dbt_tutorial/models sde_dbt_tutorial/snapshots sde_dbt_tutorial/dbt_project.yml sde_dbt_tutorial/packages.yml sde_dbt_tutorial/profiles.yml # just lint-yml
yamlfix sde_dbt_tutorial/models # just format-yml
3.6.4. Parse manifest.json to inspect the dbt graph
When we run dbt, it creates the required lineage graph (i.e., order of debt models to run), compiles dbt code to raw SQL, associates tags and tests with the corresponding model, etc., and stores all this information in the target
folder, (./sde_dbt_tutorial/target). Within the target
folder, the manifest.json
file contains the references to which models/tests to run and how they are associated.
Parsing this manifest
file is an excellent way to inspect our models/tests via code. Let’s look at an example; our marketing.yml
file shows a test based on a table that does not exist.
- name: non_existent_table
columns:
- name: non_existent
tests: [unique]
dbt will compile these tests but not run them since the table non_existent_table
does not exist. As shown below, we can identify such cases with a simple Python script.
The above Python script reads the manifest.json file and checks if any tests do not depend on any node if it raises a warning nudging the engineer to remove unused code. The code also checks if any model does not use the ref function (i.e., hardcoded table names).
3.6.5. Autorun linting & checks locally before opening a PR to save on CI costs
Usually, your CI pipeline will run checks and tests to ensure the PR is up to standard. You can reduce the time taken at CI runs by preventing issues by adding the checks as a pre-commit git hook. The pre-commit git hook ensures that checks and tests run before a developer puts up a PR, saving potential CI time (if there are issues with the code).
As shown below, you can add a pre-commit hook to your .git/hooks
folder.
echo -e '
#!/bin/sh
just ci
' > .git/hooks/pre-commit
chmod ug+x .git/hooks/*
Now, the just ci
command will run each time you try to add a commit(just ci will run before the code commit).
4. Conclusion
This article gives you a good idea of how you can improve your dbt development experience. To recap, we went over.
- Reproducible environments
- Reducing feedback loop when developing locally
- Reducing the amount of code to write by leveraging dbt packages
- Validating data and Data Observability
- Streamlining common tasks
If you are working on a dbt project where the development speed(dev start to feature in prod time) is slow, identify where it could be better & fix them using the concepts specified above.
If you have any questions or comments, please leave them in the comment section below. If you have any issues with running the code, please open a GitHub issue here .
5. Further Reading
- dbt tutorial
- CI/CD with dbt
- Data pipeline testing
- CI/CD with Github actions
- Local dev environment with Docker
6. References
If you found this article helpful, share it with a friend or colleague using one of the socials below!