Automating data testing with CI pipelines using GitHub Actions

1. Introduction

Automated testing is crucial for keeping your code bug-free and catching regressions early. If you are wondering

How can data tests be integrated into a CI (Continuous Integration) pipeline?

How does a typical CI system work?

Then this post is for you. By the end of this blog post, you will understand what CI is, why it is important, and how to create a CI pipeline that runs tests automatically when there is a pull request.

2. CI

CI stands for continuous integration. It is a DevOps best practice of continually testing a development branch to make sure it is ready to be merged into the main branch.

When a developer creates a new pull request, the CI platform typically performs the following steps:

  1. Spins up new container(s)
  2. Clones the code into the container
  3. Runs tests (including checks)
  4. If all tests pass, a check passed ✅ will be displayed on your pull request.
  5. If any test fails, a check failed ❌ will be displayed on your pull request. You can prevent the merge of failing branches by setting up protected branches.
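The flow above can be sketched in a few lines of Python. This is only an illustration of the pass/fail semantics, not a real runner: the step names and commands here are stand-ins, and a real CI platform would clone the repository and run its actual test suite.

```python
import subprocess

def run_ci(steps):
    """Run each CI step in order; report ✅ only if every step exits 0."""
    for name, cmd in steps:
        result = subprocess.run(cmd, shell=True)
        if result.returncode != 0:
            print(f"{name}: check failed ❌")
            return "❌"
        print(f"{name}: ok")
    print("check passed ✅")
    return "✅"

# Hypothetical steps; `exit 0` stands in for commands like `git clone` or `pytest`.
status = run_ci([
    ("Clone code", "exit 0"),
    ("Run tests", "exit 0"),
])
```

The key behavior is that a single non-zero exit code anywhere in the chain fails the whole check, which is exactly how the ✅/❌ status on a pull request is decided.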

CI process

The key components of a CI pipeline are:

  1. CI platform: The platform handles spinning up the VMs, alerting on pass/fail, etc. There are many CI/CD platforms available, e.g., Jenkins, GitHub Actions, CircleCI, AWS CodePipeline, etc.
  2. Virtual machines: You can configure the types of virtual machines that you want to run your tests on. These VMs can run on any service that can create VMs, e.g., AWS EC2, AWS ECS, etc.

GitHub Actions provides us with standard VMs we can use, without setting up any infrastructure.

You can see how having tests as part of CI ensures that we do not inadvertently introduce bugs or regressions.

3. Sample project: Data testing with GitHub Actions

3.1. Prerequisites

  1. git
  2. GitHub account
  3. Docker and Docker Compose (v1.27.0 or later)

Set up your repository as shown below.

git clone https://github.com/josephmachado/data_test_ci.git # clone sample code
cd data_test_ci
rm -rf .git
git init
git add .
git commit -m 'Sample project for data tests on CI'
# Create a new repository on github.com
git remote add origin https://github.com/your-github-user-id/your-repo-name.git # replace your-github-user-id with your id and your-repo-name with the repo you created
git branch -M main
git push -u origin main

3.2. Project overview

CI project

This data pipeline pulls data from a table (user), enriches it, and loads it into another table (enriched_data).

The Python process that enriches the data and the database are set up as Docker containers. Use the commands below to set them up.

make up # spins up the Postgres and python containers
make ci # formats the code, checks typing, checks the formatting, and runs the python test suite

The Makefile contains common commands such as formatting the code, running type & lint checks, and running our test suite.

3.3. Automating data tests with GitHub Actions

The workflow should be defined under the path .github/workflows/. Our workflow file, named ci.yml, is shown below:

name: ci
on: [pull_request]
jobs:
  run-ci-tests:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repo
        uses: actions/checkout@v2
      - name: Spin up containers
        run: make up
      - name: Run CI tests
        run: make ci

The on field specifies the events (pull_request in our case) that trigger this workflow. Our workflow has one job, run-ci-tests, which involves:

  1. Creating a virtual machine running Ubuntu. This virtual machine has Docker installed.
  2. Checkout repo: Cloning our repository onto the virtual machine.
  3. Spin up containers: Running make up, which spins up our Postgres and Python containers.
  4. Run CI tests: Running make ci, which formats the code, checks typing, checks the formatting, and runs the Python tests.
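To make the workflow structure concrete, here is the ci.yml above represented as the Python dict a YAML parser would produce, with a tiny check of how the on field gates which events trigger the workflow:

```python
# The ci.yml workflow, as a YAML parser would load it into Python.
workflow = {
    "name": "ci",
    "on": ["pull_request"],
    "jobs": {
        "run-ci-tests": {
            "runs-on": "ubuntu-latest",
            "steps": [
                {"name": "Checkout repo", "uses": "actions/checkout@v2"},
                {"name": "Spin up containers", "run": "make up"},
                {"name": "Run CI tests", "run": "make ci"},
            ],
        }
    },
}

def triggered_by(workflow, event):
    """Would this GitHub event trigger the workflow?"""
    return event in workflow["on"]

print(triggered_by(workflow, "pull_request"))  # True
print(triggered_by(workflow, "push"))          # False
```

Because on lists only pull_request, pushes to main alone do not run the job; only opening or updating a pull request does.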

When you create a pull request, the jobs defined in our workflow file will run. Use the commands below to put up a PR.

git checkout -b sde-20220227-sample-ci-test-branch
echo '' >> src/data_test_ci/data_pipeline.py
git add .
git commit -m 'Fake commit to trigger CI'
git push origin sde-20220227-sample-ci-test-branch

Go to your repository on GitHub, click Pull requests, click Compare & pull request, and then click the Create pull request button. This will trigger the workflow.

Clicking the Details button in the run-ci-tests section of the GitHub UI shows the steps that were run. The Setup job and Complete job steps always run before and after our defined steps.

CI run with our defined steps
CI run details

4. Conclusion

To recap, in this article we saw:

  1. What a CI pipeline is
  2. Why it is important
  3. How to automate data tests with GitHub Actions

Hope this article gives you a good idea of what happens as part of a CI pipeline, the different platforms you can use to set up CI pipelines, and how you can easily set one up using GitHub Actions.

The next time you are building a data pipeline, automate your data tests as part of a CI pipeline; your teammates and future self will thank you.

If you have any questions, comments, or suggestions please leave them in the comment section below.

5. Further reading

  1. Trying to figure out what data tests to create? Read this article.
  2. Struggling with setting up the different components of your data pipeline? Check out this article.
  3. Trying to set up a CI/CD pipeline for dbt? Read this article.
  4. Wondering how to run unit tests on dbt? Check out this article.

Comments

Anonymous (3 years ago):
1. Why, when running make ci locally, do I get an error at test/integration/test_data_pipeline.py:19 ("psycopg2.OperationalError: could not translate host name "warehouse" to address: Name or service not known"), but when running on GitHub there is no problem?
2. What is the recommended workflow to build and test the ci.yml file? Is it to run make ci locally to check if it works? (But question 1 makes me think local test results are not transferable to GitHub somehow.)

Joseph Kevin Machado:
Hi,
1. Yes, if other Docker containers are taking up the ports (see this using docker ps), the new containers will not start.
2. make ci runs the Docker commands; it is used as a shortcut to run all the formatting, lint checking, and testing with one command. This is different from the ci.yml file, which is a GitHub workflow file. In order to test the GitHub workflow locally, you can try act.
Hope this helps. Let me know if you have more questions.

Anonymous:
For question 1, I realized my warehouse container wasn't running. Maybe something was clashing with online_store (from the end-to-end data engineering article: https://www.startdataengineering.com/post/data-engineering-project-e2e/#transform), so I shut down all running containers in online_store, then started the warehouse container in data_test_ci, and make ci (more specifically, pytest) worked. I would still like to find out about question 2 (how to develop the ci.yml file).

Anonymous (3 years ago):
Your links 3 and 4 under Further reading need to be swapped.

Joseph Kevin Machado:
Thank you for pointing this out. I switched it.

Anonymous (11 months ago):
Hi, how come you use Docker and not GitHub Actions to do testing and linting, etc.?

Joseph Kevin Machado:
Hi,
I use Docker to install all the required modules and GitHub Actions to trigger the docker run command. You could spin up a machine, install tools, and then trigger tests/lints, etc., as well; I prefer Docker for keeping environment management simpler.