Most data engineering job requirements mention one of the big data platforms: Databricks, Snowflake, BigQuery, Redshift, etc. You try to learn about these platforms, but all you can glean is that they are used to store and process data. You also try to understand why these platforms are necessary in the first place, but never find a clear answer.
With the plethora of marketing material out there, it is easy to get overwhelmed trying to pin down exactly what these platforms are and whether they would benefit your use case. The result is confusion and no clear picture of why these platforms matter or why companies spend large amounts of money on them.
Imagine knowing the key requirements of a data processing system and quickly choosing the right platform for your use case. By knowing what these platforms offer and when to use them, you can make recommendations to your team/leadership, making you a key contributor to any data team.
In this post, we will look at these platforms (and the open-source alternative) from the perspective of the features they provide. By the end of this post, you will have a clear idea of what to expect from these platforms. You can also choose which platform to use for your specific use case.
As a data engineer, you always want to uplevel yourself. SQL is the bread and butter of data engineering. Whether you are a seasoned pro or new to data engineering, there is always a way to improve your SQL skills. Do you ever think:
> I wish I had known this SQL feature sooner
> I wish 'learn SQL' online got into more interesting depths of the dialect than the basic shit it always is
> I wish I had known this sooner; it's a much simpler way to use window functions for filtering; no more nested queries
> I wish I didn't have to pull data into Python to do some loops
This post is for you. Imagine being proficient in data processing patterns in SQL in addition to the standard functions. You will be able to write easy-to-maintain, clean, and scalable SQL.
This post will review eight patterns to help you write easy-to-maintain SQL code and uplevel your SQL skills.
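As a taste of one such pattern, here is the window-function filtering mentioned above, sketched with Python's built-in sqlite3 (the table and data are made up for illustration). SQLite still needs the nested query; engines such as Snowflake and DuckDB support QUALIFY, which drops the nesting entirely:

```python
import sqlite3

# Hypothetical data: one row per order; goal: each customer's latest order.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (customer TEXT, order_id INTEGER, amount REAL);
INSERT INTO orders VALUES
  ('alice', 1, 10.0), ('alice', 2, 25.0),
  ('bob',   3, 40.0);
""")

# SQLite (like most engines) requires a nested query to filter on a window
# function. Engines with QUALIFY let you write the filter inline instead:
#   SELECT * FROM orders
#   QUALIFY ROW_NUMBER() OVER (PARTITION BY customer ORDER BY order_id DESC) = 1
rows = conn.execute("""
SELECT customer, order_id, amount
FROM (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY customer ORDER BY order_id DESC
         ) AS rn
  FROM orders
)
WHERE rn = 1
ORDER BY customer
""").fetchall()

print(rows)  # [('alice', 2, 25.0), ('bob', 3, 40.0)]
```

The subquery exists only to make the window-function result filterable; QUALIFY removes exactly that boilerplate.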
If you've worked in the data space, you've likely faced the frustration of dealing with massive tables—so many columns that it’s hard to remember their names or how they relate to each other. This complexity slows you down, increases the risk of errors, and makes your pipelines harder to maintain. You may be left wondering: is there a simpler, more efficient way to handle this?
Imagine a world where your data is organized intuitively, where relationships between entities are clear, and you don’t have to memorize dozens of column names to get your work done. In this world, representing complex relationships in data is straightforward, and your metrics are calculated with accuracy and ease.
In this post, I’ll show you how to use complex data types in SQL to represent relationships more efficiently. You’ll learn how these data types can simplify your pipeline, improve developer experience, and minimize the risk of calculation errors. By the end, you’ll know the tradeoffs and practical applications of complex data types—and how to integrate them into your tables to make your work smoother and more effective.
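To make the idea concrete, here is a small runnable stand-in using Python's built-in sqlite3 and its JSON functions; warehouses such as BigQuery and Snowflake offer first-class STRUCT/ARRAY types for the same pattern, with dot syntax instead of `json_extract`. All table and field names here are hypothetical:

```python
import json
import sqlite3

# A customer and their line items travel with the order row: no joins,
# no remembering which of 40 flat columns belongs to which entity.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT, line_items TEXT)")
conn.execute(
    "INSERT INTO orders VALUES (?, ?, ?)",
    (
        1,
        json.dumps({"name": "alice", "email": "alice@example.com"}),
        json.dumps([{"product": "widget", "quantity": 2},
                    {"product": "gadget", "quantity": 1}]),
    ),
)

# In BigQuery this would be customer.name and line_items[OFFSET(0)].product;
# SQLite approximates it with JSON path expressions.
row = conn.execute("""
SELECT order_id,
       json_extract(customer, '$.name'),
       json_extract(line_items, '$[0].product')
FROM orders
""").fetchone()

print(row)  # (1, 'alice', 'widget')
```

The structure of the data mirrors the structure of the relationship, which is the core benefit complex data types bring.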
Whether you are looking to improve your data skills or building portfolio projects to land a job, you have likely faced the issue of deciding what data projects to build and how to build them. If you are struggling to decide which tools/frameworks to use for your portfolio data projects, or are not sure that what you are building actually serves any purpose,
Then this post is for you! Imagine being able to make a potential referrer or hiring manager quickly understand that you have the expertise they are looking for. By showing them exactly what they are looking for, you improve your chances of landing an interview.
By the end of this post, you will have an algorithm that can help you decide which tools/frameworks to use for your data project so you get the most out of the time you spend on it.
There are a lot of data projects available on the web. While these projects are great, starting from scratch to build your data project can be challenging. If you are
> Wondering how to go from an idea to a production-ready data pipeline
> Feeling overwhelmed by how all the parts of a data system fit together
> Unsure that the pipelines you build are up to industry standard
If so, this post is for you! In it, we will go over how to build a data project step-by-step from scratch.
By the end of this post, you will be able to quickly create data projects for any use case and see how the different parts of data systems work together.
If you are trying to break into (or land a new) data engineering job, you will inevitably encounter a slew of data engineering tools. The list of tools/frameworks to know can be overwhelming. If you are wondering
> What are the parts of data engineering?
> Which parts of data engineering are the most important?
> Which popular tools should you focus your learning on?
> How do you build portfolio projects?
Then this post is for you. This post will review the critical components of data engineering and how you can combine them.
By the end of this post, you will know all the critical components necessary for building a data pipeline.
Preparing for data engineering interviews can be stressful. There are so many things to learn. In this 'Data Engineering Interview Series', you will learn how to crack each section of the data engineering interview.
If you have felt
> That you need to practice 100s of Leetcode questions to crack the data engineering interview
> That you have no idea where/how to start preparing for the data structures and algorithms interview
> That you are not good enough to crack the data structures and algorithms interview.
Then this post is for you!
Data quality checks are critical for any production pipeline. While there are many ways to implement data quality checks, the great_expectations library is one of the most popular. If you have wondered
1. How can you effectively use the great_expectations library?
2. Why is the great_expectations library so complex?
3. Why is the great_expectations library so clunky, with so many moving pieces?
Then this post is for you. In this post, we will go over the key concepts you’ll need to get up and running with the great_expectations library, along with examples of the types of tests you may run.
By the end of this post, you will have a mental model of how the great_expectations library works and be able to quickly set up and run your own data quality checks with great_expectations.
Data quality is a broad topic. There are many ways to check the data quality of a dataset, but knowing which checks to run, and when, can be confusing.
In this post, we will review the main types of data quality checks, where to use them, and what to do if a DQ check fails.
By the end of this post, you will not only have a clear understanding of the different types of DQ checks and when to use them, but you'll also be equipped with the knowledge to prioritize which DQ checks to implement.
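As a back-of-the-envelope illustration (not tied to any specific library), two of the most common check types, not-null and uniqueness, can be sketched in plain Python; the dataset is made up:

```python
# Made-up rows with two deliberate problems: a NULL email and a duplicate id.
rows = [
    {"id": 1, "email": "a@x.com"},
    {"id": 2, "email": None},
    {"id": 2, "email": "b@x.com"},
]

def check_not_null(rows, column):
    """Passes only if no row has a NULL in `column`."""
    return all(r[column] is not None for r in rows)

def check_unique(rows, column):
    """Passes only if `column` has no duplicate values."""
    values = [r[column] for r in rows]
    return len(values) == len(set(values))

results = {
    "email_not_null": check_not_null(rows, "email"),
    "id_unique": check_unique(rows, "id"),
}
print(results)  # {'email_not_null': False, 'id_unique': False}

# On failure, a hard check would typically stop the pipeline here,
# while a soft check would only log/alert and let the run continue.
```

Real implementations run these as SQL against the warehouse rather than pulling rows into Python, but the pass/fail contract is the same.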
Do you use SQL or Python for data processing? Every data engineer has a preference. Some swear by Python, pointing out that it's a Turing-complete language, while the SQL camp points to its performance, ease of understanding, etc. Not using the right tool for the job can lead to hard-to-maintain code and sleepless nights!
Using the right tool for the job can help you climb the career ladder, but most advice online seems to be 'Just use Python' or 'Just use SQL.'
Understanding how your code interacts with the underlying execution engine, and the tradeoffs that follow, will equip you with the mental model to make a calculated, objective decision about which tool to use for your use case.
By the end of this post, you will understand how the underlying execution engine impacts your pipeline performance. You will have a list of criteria to consider when using Python or SQL for a data processing task. With this checklist, you can use each tool to its benefit.
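To make the comparison concrete, here is the same aggregation written both ways, using Python's built-in sqlite3 on made-up data. The point is not that one version is shorter, but that SQL hands the execution plan to the engine's optimizer while Python spells out the iteration explicitly:

```python
import sqlite3

# Made-up sales data: (region, amount).
sales = [("us", 10), ("us", 20), ("eu", 5)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", sales)

# SQL: declare the result you want; the engine picks how to compute it.
sql_totals = dict(conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"
).fetchall())

# Python: you control (and are responsible for) every step of the loop.
py_totals = {}
for region, amount in sales:
    py_totals[region] = py_totals.get(region, 0) + amount

print(sql_totals == py_totals)  # True: same result, different execution paths
```

At this scale the difference is invisible; at warehouse scale, the engine's optimizer, parallelism, and data locality are what make the SQL path win or lose.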