Understand & Deliver on Your Data Engineering Task

Aug 29, 2021 · 7 min read

Want to deliver on your data engineering tasks with confidence? Then this post is for you. In this post, we go over a list of steps that you can use to understand what your assigned work is, why it matters and how to deliver great work.

4 Key Patterns to Load Data Into A Data Warehouse

Aug 17, 2021 · 5 min read

Unsure how to load data into a data warehouse? Then this post is for you. In this post, we go over 4 key patterns to load data into a data warehouse. These patterns can help you build resilient and easy-to-use data pipelines. Level up as a data engineer and deliver usable data faster!

How to Validate Datatypes in Python

Jul 21, 2021 · 5 min read

Frustrated with handling data type conversion issues in python? Then this post is for you. In this post, we go over a reusable data type conversion pattern using Pydantic. We will also go over the caveats involved in using this library.

Designing a Data Project to Impress Hiring Managers

Jun 25, 2021 · 9 min read

Frustrated that hiring managers are not reading your Github projects? then this post is for you. In this post, we discuss a way to impress hiring managers by hosting a live dashboard with near real-time data. We will also go over coding best practices such as project structure, automated formatting, and testing to make your code professional. By the end of this post, you will have deployed a live dashboard that you can link to your resume and LinkedIn.

How to make data pipelines idempotent

May 13, 2021 · 4 min read

Unable to find practical examples of idempotent data pipelines? Then, this post is for you. In this post, we go over a technique that you can use to make your data pipelines professional and data reprocessing a breeze.

Writing memory efficient data pipelines in Python

Apr 26, 2021 · 7 min read

Working with a dataset that is too large to fit in memory? Then this post is for you. In this post, we will write memory efficient data pipelines using python generators. We also cover the common generator patterns you will need for your data pipelines.

How to gather requirements to re-engineer a legacy data pipeline

Apr 8, 2021 · 6 min read

If you are overwhelmed with re-engineering a legacy data pipeline, then this post is for you. In this post, we go over 6 key principles to help you figure out the most impactful data features for your end user and how to deliver them.

How to trigger a spark job from AWS Lambda

Mar 27, 2021 · 6 min read

Wondering how to execute a spark job on an AWS EMR cluster, based on a file upload event on S3? Then this post if for you. In this post we go over how to trigger spark jobs on an AWS EMR cluster, using AWS Lambda. The lambda function will execute in response to an S3 upload event. We will go over this event driven pattern with code snippets and set up a fully functioning pipeline.

How to set up a dbt data-ops workflow, using dbt cloud and Snowflake

Feb 28, 2021 · 11 min read

Setting up an ELT data-ops workflow with multiple environments for developers is often extremely time consuming. What if there was a way to speed up this process, so that you could concentrate on modeling your data and delivering value to your end users? The good news is that there is a way. You can leverage dbt cloud to setup an ELT data-ops workflow in a very short time. In this post, we cover how to setup a data-ops workflow for an ELT system. We will go over how to setup dbt, snowflake, CI and schedule jobs. This data-ops workflow can be easily modified and built upon as your data team's needs evolve.

Apache Superset Tutorial

Feb 13, 2021 · 6 min read

Spending hundreds of thousands of dollars on vendor BI tools ? Looking for a clean open source alternative ? Then this post is for you. In this post we go over Apache Superset, which is one of the most popular open source visualization tools. We will go over its architecture and build charts and dashboards to visualize data. We will end with a list of pros and cons with using an open source visualization tool like Apache Superset.