Build Data Engineering Projects, with Free Template

1. Introduction

Setting up data infrastructure is one of the most complex parts of starting a data engineering project. If you are overwhelmed by

  1. Setting up data infrastructure such as Airflow, Redshift, Snowflake, etc.
  2. Trying to set up your infrastructure with code
  3. Not knowing how to deploy new features/columns to an existing data pipeline
  4. DevOps practices such as CI/CD for data pipelines

Then this post is for you. This post covers the critical concepts of setting up data infrastructure and a development workflow, along with a few sample data projects that follow this pattern. We will also use a data project template that runs Airflow, Postgres, & DuckDB to demonstrate how each concept works.

By the end of this post, you will understand how to set up data infrastructure with code, how developers work together on new features for a data pipeline, & you will have a GitHub template that you can use for your data projects.

2. Run Data Pipeline

The code is available in the data_engineering_project_template repository.

2.1. Run on codespaces

You can run this data pipeline using GitHub Codespaces. Follow the steps below.

  1. Create a codespace by going to the data_engineering_project_template repository, cloning it (or clicking the Use this template button), and then clicking the Create codespaces on main button.
  2. Wait for the codespace to start, then type make up in the terminal.
  3. Wait for make up to complete, and then wait about 30s for Airflow to start.
  4. After 30s, go to the ports tab and click on the link exposing port 8080 to access the Airflow UI (username and password are both airflow).

[Screenshots: codespaces start, codespaces make up, codespaces open url]

2.2. Run locally

To run locally, you need:

  1. git
  2. GitHub account
  3. Docker with at least 4GB of RAM and Docker Compose v1.27.0 or later

Clone the repo and run the following commands to start the data pipeline:

git clone https://github.com/josephmachado/data_engineering_project_template.git
cd data_engineering_project_template
make up
sleep 30 # wait for Airflow to start
make ci # run checks and tests

Go to http://localhost:8080 to see the Airflow UI. The username and password are both airflow.

3. Architecture and services in this template

This data engineering project template includes the following:

  1. Airflow: To schedule and orchestrate DAGs.
  2. Postgres: To store Airflow's metadata (which you can see via the Airflow UI) and to hold a schema that represents upstream databases.
  3. DuckDB: To act as our warehouse.
  4. Quarto with Plotly: To convert code in markdown format to HTML files that can be embedded in your app or served as is.
  5. cuallee: To run data quality checks on the data we extract from the CoinCap API (see the sketch after this list).
  6. minio: To provide an S3-compatible, open source storage system.

For simplicity, services 1-5 above are installed and run in one container defined here.
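
The template uses cuallee (item 5 above) to run data quality checks on the CoinCap data. Here is a rough sketch of how such a check can be written; the dataframe and column names are illustrative, not necessarily the ones used in the template:

import pandas as pd
from cuallee import Check, CheckLevel

# Illustrative stand-in for data extracted from the CoinCap API;
# the template's actual column names may differ.
df = pd.DataFrame({"id": ["bitcoin", "ethereum"], "priceUsd": [65000.0, 3200.0]})

# Declare the rules, then validate the dataframe in one pass.
check = (
    Check(CheckLevel.ERROR, "coincap_quality")
    .is_complete("id")        # no null asset ids
    .is_unique("id")          # no duplicate assets in a single pull
    .is_complete("priceUsd")  # every asset has a price
)
print(check.validate(df))  # returns one row per rule with a pass/fail status

cuallee supports several dataframe backends (pandas, PySpark, and DuckDB among them), which is why it fits a lightweight warehouse setup like this one.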

[Architecture diagram of the template's services]

The coincap_elt DAG in the Airflow UI will look like the image below:

[coincap_elt DAG graph view in the Airflow UI]
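
Under the hood, coincap_elt is a short chain of tasks: pull data from the CoinCap API, load it into the warehouse, run data quality checks, and render the dashboard. A minimal sketch of what such a DAG can look like with Airflow's TaskFlow API follows; the task names and bodies are placeholders, not the template's exact code:

from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def coincap_elt():
    @task
    def extract():
        # Call the CoinCap API and stage the raw response (e.g. in minio).
        ...

    @task
    def load():
        # Copy the staged data into DuckDB warehouse tables.
        ...

    @task
    def quality_check():
        # Run cuallee checks against the loaded tables and fail on violations.
        ...

    @task
    def render_dashboard():
        # Render the Quarto/Plotly dashboard to ./visualizations/dashboard.html.
        ...

    extract() >> load() >> quality_check() >> render_dashboard()


coincap_elt()

Each box in the graph view above corresponds to one such task, and the arrows mirror the >> dependencies.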

You can see the rendered HTML at ./visualizations/dashboard.html.

The file structure of our repo is shown below:

[Repository file structure]

4. CI/CD setup

We set up the development flow to make new feature releases easy and quick. We will use

  1. git for version control,
  2. GitHub for hosting our repository, GitHub flow for developing new features, and
  3. GitHub Actions for CI/CD.

We have the CI and CD workflows commented out; uncomment them if you want to set up CI/CD for your pipelines.

4.1. CI: Automated tests & checks before the merge with GitHub Actions

Note: Read this article that goes over how to use GitHub Actions for CI.

Continuous integration in our repository means automatically testing code before it is merged into the main branch (which runs on the production server). In our template, we have defined formatting (isort, black), type checking (mypy), lint/style checking (flake8), & Python testing (pytest) as part of our CI.

We use GitHub Actions to run the checks automatically when someone creates a pull request. The CI workflow is defined in this ci.yml file.
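
For example, a common pytest check for Airflow projects is a DagBag import test that fails the pull request if any DAG file has a syntax or import error. This is a generic sketch, not necessarily one of the tests shipped with this template:

from airflow.models import DagBag


def test_dags_import_without_errors():
    # Parse every DAG file in the project's dags/ folder.
    dag_bag = DagBag(dag_folder="dags", include_examples=False)
    # Any syntax or import error is collected in dag_bag.import_errors.
    assert dag_bag.import_errors == {}, f"DAG import failures: {dag_bag.import_errors}"

Since the checks run both locally (via make ci) and in GitHub Actions, failures like this surface before the code reaches the main branch.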

4.2. CD: Deploy to production servers with GitHub Actions

Continuous delivery in our repository means deploying our code to the production server. We use an EC2 instance running Docker containers as our production server; after merging into the main branch, our code is copied to the EC2 server using cd.yml.

Note that for our CD to work, we first need to set up the infrastructure with Terraform & define the following repository secrets. You can set up the repository secrets by going to Settings > Secrets > Actions > New repository secret.

  1. SERVER_SSH_KEY: Get this by running terraform -chdir=./terraform output -raw private_key in the project directory and paste the entire output into a new Action secret called SERVER_SSH_KEY.
  2. REMOTE_HOST: Get this by running terraform -chdir=./terraform output -raw ec2_public_dns in the project directory.
  3. REMOTE_USER: The value for this is ubuntu.

[Repository secrets in the GitHub Actions settings]

5. Putting it all together with a Makefile

We use a Makefile to define aliases for the commands used during development and CI/CD.

6. Data projects using other tools and services

While an Airflow + Postgres warehouse setup might be sufficient for most practice projects, here are a few projects that use different tools with managed services.

  1. Beginner DE project
  2. Project to impress HM
  3. dbt DE project
| Component | Beginner DE project | Project to impress HM | dbt DE project |
| --- | --- | --- | --- |
| Scheduler | Airflow | cron | - |
| Executor | Apache Spark, DuckDB | Python process | DuckDB |
| Orchestrator | Airflow | - | dbt |
| Source | Postgres, CSV, S3 | API | flat file |
| Destination | DuckDB | DuckDB | DuckDB |
| Visualization/BI tool | Quarto | Metabase | - |
| Data quality checks | - | - | dbt tests |
| Monitoring & Alerting | - | - | - |

All of the above projects use the same tools for data infrastructure setup.

  1. Local development: Docker & Docker Compose
  2. IaC: Terraform
  3. CI/CD: GitHub Actions
  4. Testing: pytest
  5. Formatting: isort & black
  6. Lint check: flake8
  7. Type check: mypy

7. Conclusion

To recap, we saw

  1. How to set up data infrastructure
  2. How to set up a development workflow

The next time you start a new data project or join an existing data team, look for these components to make developing data pipelines quick and easy.

This article helps you understand how to set up data infrastructure with code, how developers work together on new features for a data pipeline, & how to use the GitHub template for your data projects.

If you have any questions or comments, please leave them in the comment section below.

8. Further reading

  1. Creating local end-to-end tests
  2. Data testing
  3. Beginner DE project
  4. Project to impress HM
  5. End-to-end DE project

9. References

  1. GitHub Actions
  2. CI/CD
  3. Terraform

If you found this article helpful, share it with a friend or colleague using one of the socials below!