How to choose the right tools for your data pipeline

1. Introduction

If you are building data pipelines from the ground up, the number of available data engineering tools can be overwhelming. If you are thinking

Most of these tools seem to do the same (or similar) things, so which one should I choose?

How do I choose the right tool for my data pipeline?

Then this post is for you. In it, we go over the core components of data pipelines and show how you can use your requirements to select the right tools.

We will cover:

  1. Gathering requirements
  2. Identifying components of your data pipeline
  3. Shortlisting tools
  4. Filtering tool choices based on your use case

Note: In this post, we use the term tool broadly to denote a programming language, library, framework, SaaS product, or cloud resource.

2. Requirements

Understanding the requirements of your data pipeline helps you design it and choose the right tools.

| Requirement | Questions to ask | Usual choices |
| --- | --- | --- |
| Real-time or batch | Do you need data processed continuously or on a schedule (usually with a frequency > 10 min)? | {Batch, Stream} |
| Data size | What is the size of the data to be processed per run? | {Batch: {MB, GB, TB, PB}, Stream: {records per second * size per record}} |
| Pipeline frequency | How frequently do you want the pipeline to run? This typically applies to batch pipelines. | {minutes, hours, days, months} |
| Data processing speed | How quickly do you want the data to be processed? This should be lower than the pipeline frequency to prevent clogging your compute resources. | {seconds, minutes, hours} |
| Latency requirements | What is an acceptable wait time for an end user querying your destination system? Typically measured using mean & 95th percentile values. | {5ms, 1s, 10s, 1min, 10min, 30min} |
| Query patterns | What types of queries will be run by the end user? | {analytical, full text search, NoSQL, transactional, graph-based, combination} |
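
As a quick illustration of the latency requirement, here is a minimal sketch that computes the mean and 95th percentile query latency. The latency values are made up for illustration; in practice you would pull them from your destination system’s query logs.

```python
# A minimal sketch: compute mean & 95th percentile latency from query
# response times (in seconds). The sample values below are made up;
# in practice you would pull these from your destination system's query logs.
import statistics

query_latencies_s = [0.8, 1.2, 0.9, 4.5, 1.1, 0.7, 2.3, 1.0, 0.95, 3.1]

mean_latency = statistics.mean(query_latencies_s)
p95_latency = statistics.quantiles(query_latencies_s, n=100)[94]  # 95th percentile

print(f"mean: {mean_latency:.2f}s, p95: {p95_latency:.2f}s")
```

Comparing the p95 value against your latency requirement (e.g. 10s) tells you whether the destination system meets it.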

3. Components

With the requirements defined, you can design your data pipeline and identify the components that make it up.

| Component | Responsibility | Examples |
| --- | --- | --- |
| Scheduler | Starting data pipelines at their scheduled frequency. | Airflow scheduler, cron, dbt Cloud, etc. |
| Executor | Running the data processing code. The executor can also call out to other services to process the data. | Python, data warehouse, Spark, k8s, dbt, etc. |
| Orchestrator | Ensuring that the data pipeline tasks are executed in the right order, retrying on failures, storing metadata, and displaying progress via a UI. | Airflow, Prefect, Dagster, dbt |
| Source | System from which data is read. | OLTP databases, cloud storage, SFTP/FTP servers, REST APIs, etc. |
| Destination | Making data available to the end user. | data warehouses, Elasticsearch, NoSQL databases, CSV files, etc. |
| Visualization/BI tool | Enabling business users to look at data patterns and build shareable dashboards. | Looker, Tableau, Apache Superset, Metabase, etc. |
| Queue | Accepting continuously incoming data (aka streaming) and making it available for consuming systems to read. | Kafka, Pulsar, AWS Kinesis, NATS, RabbitMQ, etc. |
| Event triggers | Triggering an action in response to a defined event occurring. | AWS Lambda triggers, Watchdog, etc. |
| Monitoring & alerting | Continuously monitoring data pipelines and alerting in case of breakage or delay. | Datadog, New Relic, Grafana, etc. |
| Data quality checks | Checking that data conforms to your expectations. | custom scripts checking data constraints & business rules, Great Expectations, dbt tests, etc. |
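
As an illustration of the data quality check component, here is a minimal sketch of a custom script that validates a couple of constraints and business rules before data is made available to end users. The column names are hypothetical; Great Expectations or dbt tests provide richer versions of these checks.

```python
# A minimal sketch of a custom data quality check, assuming the data has
# already been loaded into a pandas DataFrame. Column names are hypothetical.
import pandas as pd


def check_orders(df: pd.DataFrame) -> None:
    # constraint: order_id must be non-null and unique
    assert df["order_id"].notnull().all(), "Found null order_id values"
    assert df["order_id"].is_unique, "Found duplicate order_id values"
    # business rule: every order must have a positive amount
    assert (df["order_amount"] > 0).all(), "Found orders with a non-positive amount"


if __name__ == "__main__":
    sample = pd.DataFrame({"order_id": [1, 2, 3], "order_amount": [10.5, 20.0, 5.25]})
    check_orders(sample)
    print("All data quality checks passed")
```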

4. Choosing tools

After collecting requirements and identifying the components of your data pipeline, you can use the framework below to identify tools that might be a good fit.

4.1 Requirement x Component framework

This involves creating a table with your requirements as rows and your components as columns, then filling in each cell with tools that can satisfy that requirement for that component.

For example, let’s consider a data pipeline where you pull data from two databases, join them, and make the result available to the end user. The end user usually joins this data with a large fact table that is in the data warehouse. The data should be made available every hour.


Let’s note down the components and some tools that can potentially meet our requirements.

Note that a missing field in the table below denotes that a specific requirement does not apply to the corresponding component. For example, a sub-10s latency requirement does not affect the Orchestrator component.

| Requirement | Source | Orchestrator | Scheduler | Executor | Destination | Monitor & alert |
| --- | --- | --- | --- | --- | --- | --- |
| Batch, pipeline frequency: 1h | - | Airflow (MWAA, Astronomer, Cloud Composer), dbt, Dagster, Prefect, custom Python | Airflow, dbt Cloud, Databricks scheduler, Dagster, Prefect, cron | - | - | custom alerts, Datadog, New Relic, AWS CloudWatch |
| Data size: 10 GB | - | - | - | Python, Airflow worker, k8s pod, Spark, Snowflake, Redshift, Dask, Databricks, AWS EMR | - | Papertrail, Datadog, New Relic |
| Data processing speed: <= 10 min | - | - | - | Python, Airflow operator, k8s pod, Spark, Snowflake, Redshift, Databricks, AWS EMR | - | Papertrail, Datadog, New Relic |
| Query pattern: analytical | - | - | - | - | data warehouse (Redshift, Snowflake, BigQuery, ClickHouse), Delta Lake | alerts on query failures |
| Latency requirement: 10s | - | - | - | - | data warehouse (Redshift, Snowflake, BigQuery, ClickHouse), Delta Lake | alerts on query timeouts |

We now have a list of tools that we can use to build the data pipeline.
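
To make this concrete, here is a minimal sketch of the hourly pipeline above as an Airflow DAG. The DAG id, task names, and function bodies are hypothetical; the point is to show the scheduler/orchestrator (Airflow), the executor (Python tasks that push the heavy join to the warehouse), and the hourly frequency wired together.

```python
# A minimal sketch (hypothetical names; assumes Airflow 2.4+) of the hourly
# pipeline described above: pull data from two databases, join them, and make
# the result available to the end user.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_db_1():
    ...  # e.g. read from OLTP database 1 and stage the data in the warehouse


def extract_db_2():
    ...  # e.g. read from OLTP database 2 and stage the data in the warehouse


def join_and_publish():
    ...  # e.g. run the join inside the warehouse and publish the result table


with DAG(
    dag_id="hourly_join_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule="@hourly",  # pipeline frequency: 1h (use schedule_interval on older Airflow versions)
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    extract_1 = PythonOperator(task_id="extract_db_1", python_callable=extract_db_1)
    extract_2 = PythonOperator(task_id="extract_db_2", python_callable=extract_db_2)
    publish = PythonOperator(task_id="join_and_publish", python_callable=join_and_publish)

    # both extracts must finish before the join/publish step runs
    [extract_1, extract_2] >> publish
```

Here Airflow covers the scheduler and orchestrator rows of the table, while the actual join is pushed down to the warehouse, which acts as both executor and destination.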

4.2 Filters

With so many tools, filtering is essential. Shown below are some common constraints you might have; use these to eliminate tools that do not fit your scenario.

  1. Existing infrastructure: If your existing infrastructure can handle your new data pipeline, use it.
  2. Deadlines: Choose a tool that you have the time to set up. If you only have 2 days to build, test & deploy the data pipeline, it might not be a good idea to try to set up Airflow on Kubernetes.
  3. Cost: Most vendor tools/services cost money; make sure the cost is acceptable. Another cost to think about is developer time.
  4. Data strategy: Discuss how the above requirements might evolve in the short, mid, and long term. While choosing a tool for the long term might seem like the best choice, it is not always possible given deadlines and cost.
  5. Managed vs. self-hosted: Do you have the ability/team to manage a data platform? This includes setting up CI/CD, scaling, reducing downtime, patching, handling system failures, etc. Does your company’s security policy allow the use of managed services? Self-managing tools can be a lot of work.
  6. Support: Does the tool have a good support system? Is it open source and popular? Is its code easy to read? Does it have good documentation? Does it have a great community (Slack/Discord/Discourse)? A well-supported or well-documented service can speed up development velocity.
  7. Developer ergonomics: Good developer ergonomics, such as git, a local dev environment, the ability to test locally, and CI/CD, can significantly reduce the number of bugs, increase development speed, and make the data pipeline a joy to work with. Beware of closed-source vendor services; they are very hard to test locally (see the sketch after this list).
  8. Number of tools: Choosing tools that satisfy most of your requirements helps keep pipeline complexity low. This helps with onboarding new engineers, faster development, and simpler management.
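
On the developer ergonomics point, here is a minimal sketch of a local test for a pipeline transformation. join_orders_with_customers is a hypothetical function, standing in for the join step of the earlier example.

```python
# A minimal sketch of a local unit test for a pipeline transformation.
# join_orders_with_customers is hypothetical; the point is that transformations
# written as plain functions can be tested on a laptop, without touching
# production systems.
import pandas as pd


def join_orders_with_customers(orders: pd.DataFrame, customers: pd.DataFrame) -> pd.DataFrame:
    # left join so that orders without a matching customer are kept
    return orders.merge(customers, on="customer_id", how="left")


def test_join_orders_with_customers():
    orders = pd.DataFrame({"order_id": [1, 2], "customer_id": [10, 20]})
    customers = pd.DataFrame({"customer_id": [10, 20], "name": ["a", "b"]})

    result = join_orders_with_customers(orders, customers)

    assert len(result) == 2
    assert result["name"].tolist() == ["a", "b"]
```

Run it with pytest; tools that let you express transformations as plain functions or SQL files make this kind of local testing much easier.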

Asking these questions about the tools you shortlisted will allow you to narrow down your choices.

5. Conclusion

In summary, when building a data pipeline

  1. Start with the requirements
  2. Identify components
  3. Use the requirement x component framework to choose your tools
  4. Use the list of filters to choose the right tool for your use case

I hope this article gives you a good idea of how to choose the right tools for your data pipelines. Whether you are building a practice data pipeline or building out your company’s data infrastructure, start from the requirements and filter out tools that don’t fit your scenario.

If you have any questions or comments, or would like me to add more components/requirements/filters, please leave them in the comments section below.

6. Further reading

  1. 4 ways to load data into a data warehouse
  2. Creating local end-to-end tests
  3. Writing efficient data pipelines
  4. Adding tests to data pipelines

If you found this article helpful, share it with a friend or colleague using one of the socials below!

