Should Python Data Pipelines Be Function-Based or Object-Oriented (OOP)?

As a data engineer, you have probably spent hours trying to figure out the right place to make a change in your repository; I know I have.

> You think, "Why is it so difficult to make a simple change?"
> You push a simple change (with tests, by the way), and suddenly, production issues start popping up!
> Dealing with on-call issues when your repository is spaghetti code with multiple layers of abstracted logic is a special hell that makes data engineers age in dog years!
> Messy code leads to delayed feature delivery and slow debug cycles, which lowers work satisfaction and delays promotions!

**Bad code leads to a bad life.**

If this resonates with you, know that you are not alone. Every day, thousands of data engineers deal with bad code and, with the best intentions, write messy code. Most data engineers want to write good code, but common SWE patterns don't translate easily to data processing, and there aren't many practical examples that illustrate how to write clean data pipelines.

**Imagine a code base where every engineer knows where to look when something breaks, even if they have never worked on that part of the code base before.** Imagine knowing intuitively where a piece of logic would be and quickly figuring out the source of any issue. That is what this article helps you do!

In this post, I explain how to combine functions and OOP patterns in Python to write pipelines that are easy to maintain and debug; a small sketch of the idea follows below. By the end of this post, **you will have a clear picture of when and how to use functions and OOP effectively to make your (and your colleagues') life easy.**
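To make the idea concrete before the full post, here is a minimal sketch of one way to mix the two styles: plain functions hold the transformation logic, and a small class only handles wiring and configuration. The use of pandas, the step names, and the column names are my illustrative assumptions, not the post's exact pattern.

```python
from dataclasses import dataclass
from typing import Callable

import pandas as pd

# Transformations are plain functions: easy to test in isolation, no hidden
# state, and the signature tells you exactly what goes in and comes out.
def remove_cancelled_orders(orders: pd.DataFrame) -> pd.DataFrame:
    return orders[orders["status"] != "cancelled"]

def add_order_value(orders: pd.DataFrame) -> pd.DataFrame:
    return orders.assign(order_value=orders["quantity"] * orders["unit_price"])

@dataclass
class OrdersPipeline:
    """OOP handles the wiring: which steps run, and in what order.
    The actual logic stays in the functions above."""

    steps: list[Callable[[pd.DataFrame], pd.DataFrame]]

    def run(self, raw: pd.DataFrame) -> pd.DataFrame:
        df = raw
        for step in self.steps:
            df = step(df)
        return df

if __name__ == "__main__":
    raw = pd.DataFrame(
        {
            "status": ["shipped", "cancelled"],
            "quantity": [2, 1],
            "unit_price": [9.99, 4.50],
        }
    )
    pipeline = OrdersPipeline(steps=[remove_cancelled_orders, add_order_value])
    print(pipeline.run(raw))
```

The appeal of this split is that each function can be unit tested with a tiny DataFrame, while the class gives debuggers a single place to look when they need to know what runs and in what order.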

How to turn a 1000-line messy SQL script into a modular, easy-to-maintain data pipeline?

If you've been in the data space long enough, you have come across really long SQL scripts that someone wrote years ago. No one dares to touch them, as they may be powering some important part of the data pipeline, and everyone is scared of accidentally breaking them.

If you feel

> Rough SQL is a good place to start, but it cannot scale beyond a certain limit
> A dogmatic KISS approach leads to unmaintainable systems
> The simplest solution that takes the shortest time is not always the most optimal
> Building the 80% solution and then rebuilding the entire thing when you need the 100% solution later is not better than building the 100% solution once

then this post is for you!

Imagine working with pipelines that are a joy to work with, where any update is quick and straightforward. In this post, we will see how to convert 1000-ish lines of messy SQL into modular code that is easy to test and modify (a small sketch of the approach follows below). By the end of this post, you will have a systematic approach to converting your messy SQL queries into modular, well-scoped, easily testable code.
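As a taste of what "modular" can mean here, the sketch below splits one slice of a monolithic query into small, named SELECTs that compose as CTEs and can be unit tested against tiny in-memory fixtures. DuckDB and all table/column names are my assumptions for illustration, not necessarily what the post uses.

```python
import duckdb

# Each piece of the former monolithic query becomes a small, named SELECT
# that can be tested on its own with a tiny in-memory fixture.
def clean_orders_sql() -> str:
    return """
        SELECT order_id, customer_id, amount
        FROM orders
        WHERE status != 'cancelled'
    """

def customer_revenue_sql() -> str:
    # Compose the smaller piece as a CTE instead of copy-pasting it.
    return f"""
        WITH clean_orders AS ({clean_orders_sql()})
        SELECT customer_id, SUM(amount) AS revenue
        FROM clean_orders
        GROUP BY customer_id
    """

if __name__ == "__main__":
    con = duckdb.connect()
    con.execute(
        "CREATE TABLE orders AS "
        "SELECT * FROM (VALUES "
        "(1, 10, 5, 'shipped'), (2, 10, 3, 'cancelled'), (3, 20, 7, 'shipped')"
        ") AS t(order_id, customer_id, amount, status)"
    )
    # Unit-test the smallest piece in isolation ...
    assert sorted(con.execute(clean_orders_sql()).fetchall()) == [(1, 10, 5), (3, 20, 7)]
    # ... then run the composed query.
    print(con.execute(customer_revenue_sql()).fetchall())  # revenue per customer
```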

How to ensure consistent metrics in your warehouse

If you’ve worked on a data team, you’ve likely encountered situations where multiple teams define metrics in slightly different ways, leaving you to untangle why discrepancies exist. The root cause of these metric deviations is often rapid data utilization without prioritizing long-term maintainability.

Imagine this common scenario: a company hires its first data professional, who writes an ad-hoc SQL query to compute a metric. Over time, multiple teams build their own datasets using this query, each tweaking the metric definition slightly. As the number of downstream consumers grows, so does the volume of ad-hoc requests to the data team to investigate inconsistencies. Before long, the team spends most of its time firefighting data bugs and reconciling metric definitions instead of delivering new insights. This cycle erodes trust, stifles career growth, and lowers team morale.

This post explores two options to reduce ad-hoc data issues and empower consumers to derive insights independently.

Data Engineering Interview Series #2: System Design

System design interviews are usually vague and rely on you (as the interviewee) to guide the interviewer. If you are thinking:

> How do I prepare for data engineering system design interviews?
> I struggle to think of the questions that would come up in a system design interview for data engineering; I don't have enough interview experience to know what companies ask
> Is data engineering "system design" more than choosing between technologies like Spark and Airflow?

This post is for you!

Imagine being able to work through any data systems design interview systematically. You'll be able to showcase your abilities and demonstrate clear thinking to your interviewer. By the end of this post, you will have a list of questions, ordered by concept, that you can use to approach any data systems design interview.

How to reference a seed from a different dbt project?

If your company has multiple dbt projects, you have probably needed to reuse code across projects. Creating cross-project dependencies is not straightforward in a SQL templating system like dbt. If you are wondering:

> How to use seed data defined in one dbt project in another
> How dbt packages work under the hood
> What caveats to be aware of when using assets across projects

This post is for you. In this post, we will go over how to use packaging in dbt to reuse assets and how packaging works under the hood. By the end of this post, you will know how to access seed data across projects.

What do Snowflake, Databricks, Redshift, BigQuery actually do?

Most data engineering job requirements mention one of the big data platforms: Databricks, Snowflake, BigQuery, Redshift, etc. You try to learn about these platforms, but all you can glean is that they are used to store and process data. You try to understand why they are necessary in the first place, but never get a clear answer. With the plethora of marketing material out there, it is easy to get overwhelmed by what these platforms actually are and why they may be beneficial for your use case. The result is confusion and no clear picture of why these platforms matter or why companies spend large amounts of money on them.

Imagine knowing the key requirements of a data processing system and quickly choosing the right platform for your use case. Knowing what these platforms offer and when to use them lets you make recommendations to your team and leadership, making you a key contributor to any data team.

In this post, we will look at these platforms (and the open-source alternatives) from the perspective of the features they provide. By the end of this post, you will have a clear idea of what to expect from these platforms and be able to choose the right one for your specific use case.

25 SQL tips to level up your data engineering skills

As a data engineer, you always want to level up your skills, and SQL is the bread and butter of data engineering. Whether you are a seasoned pro or new to data engineering, there is always a way to improve your SQL skills. Do you ever think:

> I wish I had known this SQL feature sooner
> I wish 'learn SQL' online got into more interesting depths of the dialect than the basic shit it always is
> I wish I had known this sooner; it's a much simpler way to use window functions for filtering; no more nested queries
> I wish I didn't have to pull data into Python to do some loops

This post is for you. Imagine being proficient in data processing patterns in SQL, in addition to the standard functions. You will be able to write easy-to-maintain, clean, and scalable SQL. This post will review eight patterns to help you write easy-to-maintain SQL code and level up your SQL skills; the window-function filtering mentioned above is sketched below.
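On that window-function filtering thought: in dialects that support it (DuckDB, Snowflake, BigQuery), the QUALIFY clause lets you filter on a window function without wrapping the query in a subquery. The sketch below uses DuckDB and made-up tables as an illustration; it is not necessarily one of the post's exact tips.

```python
import duckdb

con = duckdb.connect()
con.execute(
    "CREATE TABLE orders AS "
    "SELECT * FROM (VALUES "
    "(1, 'a', DATE '2024-01-01'), "
    "(2, 'a', DATE '2024-02-01'), "
    "(3, 'b', DATE '2024-01-15')"
    ") AS t(order_id, customer_id, order_date)"
)

# Latest order per customer: QUALIFY filters on the window function directly,
# so there is no need to nest the query just to filter on ROW_NUMBER().
latest_orders = con.execute(
    """
    SELECT order_id, customer_id, order_date
    FROM orders
    QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) = 1
    """
).fetchall()
print(latest_orders)  # one row per customer: their most recent order
```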

How to use nested data types effectively in SQL

If you've worked in the data space, you've likely faced the frustration of dealing with massive tables: so many columns that it's hard to remember their names or how they relate to each other. This complexity slows you down, increases the risk of errors, and makes your pipelines harder to maintain. You may be left wondering, "Is there a simpler, more efficient way to handle this?"

Imagine a world where your data is organized intuitively, where relationships between entities are clear, and you don't have to memorize dozens of column names to get your work done. In this world, representing complex relationships in data is straightforward, and your metrics are calculated with accuracy and ease.

In this post, I'll show you how to use complex data types in SQL to represent relationships more efficiently (a short preview is sketched below). You'll learn how these data types can simplify your pipelines, improve developer experience, and minimize the risk of calculation errors. By the end, you'll know the tradeoffs and practical applications of complex data types, and how to integrate them into your tables to make your work smoother and more effective.
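The preview below uses DuckDB's STRUCT and LIST types to keep each customer's orders inside the customer row, and UNNEST to expand them when order-level metrics are needed. Table and column names are made up for illustration, and other warehouses expose similar but not identical syntax for nested types.

```python
import duckdb

con = duckdb.connect()

# One row per customer; their orders live in a LIST of STRUCTs instead of a
# separate wide table, so the relationship is visible in the schema itself.
con.execute(
    """
    CREATE TABLE customers AS
    SELECT
        1 AS customer_id,
        'alice' AS name,
        [
            {'order_id': 10, 'amount': 25},
            {'order_id': 11, 'amount': 40}
        ] AS orders
    """
)

# UNNEST the nested orders when you need order-level metrics.
result = con.execute(
    """
    SELECT
        customer_id,
        SUM(o.amount) AS total_spend
    FROM (
        SELECT customer_id, UNNEST(orders) AS o
        FROM customers
    )
    GROUP BY customer_id
    """
).fetchall()
print(result)  # [(1, 65)]
```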

How to decide on a data project for your portfolio

Whether you are looking to improve your data skills or build portfolio projects to land a job, you have probably faced the issue of deciding what data projects to build and how to build them. If you are struggling to decide what tools/frameworks to use for your portfolio data projects, or are not sure that what you are building actually serves any purpose, then this post is for you!

Imagine being able to make a potential referrer or hiring manager quickly understand that you have the expertise they are looking for. By showing them exactly what they are looking for, you improve your chances of landing an interview. By the end of this post, you will have an algorithm that helps you decide what tools/frameworks to use for your data project so you get the most out of the time you spend on it.

How to build a data project with step-by-step instructions

There are a lot of data projects available on the web. While these projects are great, starting from scratch to build your own data project can be challenging. If you are

> Wondering how to go from an idea to a production-ready data pipeline
> Feeling overwhelmed by how all the parts of a data system fit together
> Unsure whether the pipelines you build are up to industry standard

then this post is for you! In it, we will go over how to build a data project step by step from scratch. By the end of this post, you will be able to quickly create data projects for any use case and see how the different parts of data systems work together.