Docker can be overwhelming to start with. Most data projects use Docker to set up the data infrastructure locally (and often in production as well). Setting up data tools locally without Docker is (usually)a nightmare! The official Docker documentation, while extremely instructive, does not provide a simple guide covering the basics for setting up data infrastructure.
With a good understanding of data components and their interactions combined with some networking knowledge, you can easily set up a local data infrastructure with Docker. Knowing the core fundamentals of Docker will not only help you set up data infrastructure quickly but also empower you to think about networking, volumes, ports, etc., which are critical parts of most cloud data infrastructure.
I wrote a post that covers the fundamental concepts one will need to set up complex data infra locally. By the end of the post, you will be able to use Docker to run any data tool (that is open source) locally on your laptop. In the post, we set up a Spark cluster, Postgres database, and minio (OSS cloud storage system) that can communicate with each other using Docker.
Imagine this scenario: You are on call when suddenly an obscure alert pops up. It just says that your pipeline failed but has no other information. The pipelines you inherited (or didn't build) seem like impenetrable black boxes. When they break, it's a mystery—why did it happen? Where did it go wrong? The feeling is palpable: frustration and anxiety mount as you scramble to resolve the issue swiftly. It's a common struggle, especially for new team members who have yet to unravel the system's intricacies or data engineers who have to deal with pipelines built without observability. The root cause often lies in systems built without consideration for debugging and quick issue identification. The consequence? Lengthy downtimes, overburdened on-call engineers, and a slowdown in feature delivery. However, the ramifications extend beyond the technical realm, as incorrect data or failure to quickly fix a high-priority pipeline can erode stakeholder trust. Bugs are inevitable, but imagine a system that detects issues and provides the necessary information for an engineer to fix them quickly! A well-setup system that captures pertinent pipeline metadata and logs and exposes them in an easy-to-access UI will significantly reduce engineering time spent on fixing bugs. In the following post, you will learn what metadata is (in the context of data pipelines), how to log and monitor it, and how to design actionable alerts that simplify resolving bugs, even for someone new to the team. You will also set up an end-to-end logging system with Spark, Prometheus, and Grafana.
Are you part of an under-resourced team where adding time-saving dbt (data build tool) features take a back seat to delivering new datasets? Do you want to incorporate time (& money) saving dbt processes but need more time? While focussing on delivery may help in the short term, the delivery speed will suffer without proper workflow!
A good workflow will save time, prevent bad data, and ensure high development speed! Imagine the time (& mental pressure) savings if you didn't have to validate data manually each time you put up a PR? Your development speed will be high, you can be confident that your change does not bring down the pipeline, & you can concentrate on creating value for your end users!
In this post, we will see how to add improvements to an existing dbt project with an example. By the end of the post, you will know the most common enhancements engineers make to their dbt projects, how to do them yourself quickly, and how further to optimize dbt workflow for your specific use case.
Do you need clarification about what Open Table Formats (OTF) are? Is it more than just a pointer to some metadata files that helps you sift through the data quickly? What is the difference between table formats (Apache Iceberg, Apache Hudi, Delta Lake) & file formats (Parquet, ORC)? How do OTFs work?
Then this post is for you. Understanding the underlying principle behind open table formats will enable you to deeply understand what happens behind the scenes and make the right decisions when designing your data systems.
This post will review what open table formats are, their main benefits, and some examples with Apache Iceberg. By the end of this post, you will know what OTFs are, why you use them, and how they work.
Whether you are a new Data Engineer or someone with a few years of experience, you inevitably would have encountered messy data systems that seemed impossible to fix. Working at such a company usually comes with multiple pointless meetings, no clear work expectations, frustration, career stagnation, and ultimately no satisfaction from work!
The reasons can be Managerial: Such as politics, red tape, cluelessness of management, influential people dictating roadmap, etc or Technical: Such as no data strategy at a leadership level, multiple teams using Excel as a warehouse, data/metric duplication across systems (without clear bounded context), lack of data rigor by upstream teams, etc
Imagine if the data systems were seamless and a joy to work with; what would that do for your sanity, happiness & career growth?
While there is no data utopia or a mythical mature organization where the data systems are perfect, there will always be some issues with the data. We, as data engineers, have the ability & responsibility to clean up the mess, build a great data warehouse, and make data accessible for the company.
In this post, we will go over six critical steps to having a data warehouse that gives stakeholders precisely what they want while avoiding messy data.
If you are trying to improve your data engineering skills or are the sole data person in your company, it can be hard to know how well your technical skills are developing. Questions like Am I building pipelines the right way? How do I measure up to DEs at bigger tech companies? How do I get feedback on my pipeline design? It can cause a lot of uncertainty in career development!
Imagine if you know that your code is on par (or even better than) with pipelines at tech-forward companies and that you are using industry best practices. You will be confident with your career progression and can quickly ramp up on any code base.
These industry-standard best practices and concepts required to build resilient data pipelines are what you will learn in this post!
By the end of this post, you will know the underlying concepts behind best practices and when to use them. While there is no perfect code/design, following these concepts will help you build resilient and easy-to-maintain data pipelines.
Are you a data engineer who can't respond quickly to user requests since your self-serve tool is over-complex with a lot of tech debt? Has your team's over-reliance on so-called self-serve tools (vs. focusing on end-user) caused the company to waste a lot of money? Is your work satisfaction suffering due to slow-moving, technical debt-ridden systems meant to enable end-users to use data effectively? Are you tired of vendors trying to sell you their self-serve data platform while not elaborating on what it is and why it may be helpful?
Imagine empowering end-users to analyze data and make impactful decisions with minimal dependence on data engineers. The end-user impact will skyrocket, and your work will enable your company to use data effectively.
Then this post is for you! In this post, we go over what self-serve is, what problems it aims to solve, the core components of a self-serve platform, and an approach you can follow to build a solid self-serve platform.
Are you looking to better yourself as a data engineer? But, when you look at job postings or company tech stack, you are overwhelmed by the sheer amount of tools you have to learn! Do you feel like you are just winging it and need a solid plan? Choosing what to learn among 100s of tools/frameworks can lead to analysis paralysis. The result is feeling overwhelmed, confused, and developing imposter syndrome, which is not helpful!
What if you can have a fun and impactful career? You can be a force multiplier for any team or business you are a part of. You can be confident in providing significant value to any business. Companies will roll out the red carpet to work with you!
If you want to become a valuable data engineer, this post is for you. This post will review what makes a data engineer (or any engineer) valuable. We will also go over a step-by-step method that you can use to choose/work on projects that can provide significant business impact, thus improving your value as a data engineer significantly.
Stream processing differs from batch; one needs to be mindful of the system's memory, event order, and system recovery in case of failures. However, understanding the fundamental concepts of time attributes, cluster memory, time-bounded joins, and system monitoring will enable you to build resilient and efficient streaming pipelines. If you are looking for an end-to-end streaming tutorial or a project to understand the foundational skills required to build streaming pipelines, this post is for you. In this post, we will design & build a streaming pipeline that multiple marketing companies build in-house. We will create a real-time first-click attribution pipeline. By the end of this post, you will know the fundamental concepts to develop your streaming pipelines. We will use Apache Flink and Apache Kafka for stream processing and queuing. However, the ideas in this project apply to all stream processing systems.
Change data capture is a popular technique to copy data from DBs into warehouses. However, it can be tricky to understand at first. Without working with a CDC system, knowing what it does, why it's needed, or how it works can be challenging. However, understanding the what, why, and how of CDC can help you set up pipelines that are resilient and reliable. If you have wondered what CDC does, why it's needed, and how it works, this post is for you. By the end of this post, you will have a good idea of what a CDC system is, where it's used, the different types of CDC, and how a CDC system built on debezium and Kafka works.