Whenever we update a few records in an OLTP table, we just use the UPDATE command. But what if we have to update millions of records? A single large update locks the affected rows, and concurrent transactions may fail. In this post we look at how a large update can cause lock timeout errors and how running batches of smaller updates eliminates this issue.
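To give a flavor of the batching idea, here is a minimal sketch in Python, assuming a Postgres database reached via psycopg2; the `users` table, its `id`/`status` columns, and the DSN are all hypothetical.

```python
import psycopg2  # assumes a reachable Postgres instance

BATCH_SIZE = 10_000  # rows per transaction; tune to your workload

conn = psycopg2.connect("dbname=app user=app")  # hypothetical DSN
cur = conn.cursor()

while True:
    # Update at most BATCH_SIZE rows per transaction so row locks are
    # held only briefly and concurrent transactions can interleave.
    cur.execute(
        """
        UPDATE users
           SET status = 'inactive'
         WHERE id IN (SELECT id FROM users
                       WHERE status = 'active'
                       LIMIT %s)
        """,
        (BATCH_SIZE,),
    )
    conn.commit()  # releases the locks taken by this batch
    if cur.rowcount < BATCH_SIZE:
        break  # final partial batch processed

cur.close()
conn.close()
```

Committing after every batch is what keeps lock hold times short; the trade-off is that the overall update is no longer atomic.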
Using dbt you can test the output of your SQL transformations. If you have wondered how to "unit test" your SQL transformations in dbt, then this post is for you. We go over how to write unit tests for your SQL transformations with mock inputs/outputs and run them locally. This keeps the development cycle short and lets you follow a TDD approach for your SQL-based data pipelines.
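The core idea, stripped of dbt specifics, is to run the transformation SQL against hand-written mock inputs and assert on the output. A rough sketch using DuckDB as a stand-in warehouse; the model SQL, table, and expected rows are all made up for illustration.

```python
import duckdb  # in-memory SQL engine used here as a stand-in warehouse

# Hypothetical transformation under test: order counts per customer.
MODEL_SQL = """
SELECT customer_id, COUNT(*) AS order_count
FROM orders
GROUP BY customer_id
"""

con = duckdb.connect()  # in-memory database

# Mock input: a tiny, hand-written version of the `orders` table.
con.execute("CREATE TABLE orders (customer_id INTEGER, order_id INTEGER)")
con.execute("INSERT INTO orders VALUES (1, 10), (1, 11), (2, 20)")

# Run the transformation and compare against the expected output.
actual = con.execute(MODEL_SQL).fetchall()
expected = [(1, 2), (2, 1)]
assert sorted(actual) == sorted(expected), f"got {actual}"
print("transformation test passed")
```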
Wondering how to backfill an hourly SQL query in Apache Airflow? Then this post is for you. We go over how to manipulate execution_date to run backfills at any time granularity, using an hourly DAG to explain execution_date and how to manipulate it with Airflow macros.
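As a minimal sketch, assuming Airflow 2.x: an hourly DAG whose templated command derives the interval bounds from execution_date using the built-in `macros` namespace. The DAG and task names are hypothetical.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator  # Airflow 2.x import path

with DAG(
    dag_id="hourly_user_load",  # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",
    catchup=True,  # lets backfills/catchup fill in past hourly runs
) as dag:
    # execution_date marks the START of the hourly interval being
    # processed; templated fields can shift it with Airflow macros.
    load = BashOperator(
        task_id="load_hour",
        bash_command=(
            "echo querying from "
            "{{ execution_date.strftime('%Y-%m-%dT%H:00:00') }} to "
            "{{ (execution_date + macros.timedelta(hours=1)).strftime('%Y-%m-%dT%H:00:00') }}"
        ),
    )
```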
Change data capture (CDC) is a software design pattern in which we track every change (update, insert, delete) to the data in a database and replicate it to other database(s). In this post we see how to do CDC by reading database logs and replicating the changes to other databases, using the popular open source Singer standard.
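Singer taps and targets are standalone programs that exchange JSON records over a pipe, so a sync can be driven from Python with subprocess. A rough sketch, assuming tap-mysql and target-postgres are installed and configured; the config file names are hypothetical, and it is tap-mysql's LOG_BASED replication method (set in the properties/catalog) that reads changes from the binlog rather than re-scanning tables.

```python
import subprocess

# A Singer sync is just a Unix pipe: tap | target.
tap = subprocess.Popen(
    ["tap-mysql", "--config", "tap_config.json",
     "--properties", "properties.json", "--state", "state.json"],
    stdout=subprocess.PIPE,
)
# The target reads the tap's JSON stream from stdin and writes rows out.
subprocess.run(
    ["target-postgres", "--config", "target_config.json"],
    stdin=tap.stdout,
    check=True,
)
tap.stdout.close()
tap.wait()
```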
If you are looking for a simple, easy-to-set-up way to automate, schedule, and monitor a 'small' API data pull in the cloud, serverless functions are a good option. In this post we cover what a serverless function can and cannot do and what its pros and cons are, then walk through a simple API data pull project using AWS Lambda and AWS S3.
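The whole pattern fits in one handler. A minimal sketch, where the API URL and bucket name are hypothetical placeholders; the function assumes an IAM role with write access to the bucket.

```python
import json
import urllib.request

import boto3

s3 = boto3.client("s3")  # created once, reused across warm invocations

API_URL = "https://api.example.com/data"  # hypothetical endpoint
BUCKET = "my-data-bucket"                 # hypothetical bucket


def lambda_handler(event, context):
    # Pull the data; a 'small' pull fits within Lambda's time limits.
    with urllib.request.urlopen(API_URL, timeout=10) as resp:
        payload = resp.read()

    # Land the raw response in S3, keyed by the invocation id.
    key = f"raw/{context.aws_request_id}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=payload)
    return {"statusCode": 200, "body": json.dumps({"s3_key": key})}
```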
There are many ways to submit an Apache Spark job to an AWS EMR cluster from Apache Airflow. In this post we go over how to create a temporary EMR cluster, submit jobs to it, wait for the jobs to complete, and terminate the cluster, the Airflow way.
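Airflow's Amazon provider package ships operators for each of those stages. A skeleton sketch, assuming the `apache-airflow-providers-amazon` package and a default AWS connection; the step definition, bucket path, and cluster config are hypothetical and trimmed to the essentials.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import (
    EmrAddStepsOperator,
    EmrCreateJobFlowOperator,
    EmrTerminateJobFlowOperator,
)
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

SPARK_STEPS = [  # hypothetical Spark job
    {
        "Name": "run_job",
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/jobs/etl.py"],
        },
    }
]

with DAG("emr_job", start_date=datetime(2024, 1, 1), schedule_interval=None) as dag:
    create = EmrCreateJobFlowOperator(
        task_id="create_cluster",
        job_flow_overrides={"Name": "transient-cluster"},  # config trimmed
    )
    add_steps = EmrAddStepsOperator(
        task_id="add_steps",
        job_flow_id=create.output,  # cluster id via XCom
        steps=SPARK_STEPS,
    )
    wait = EmrStepSensor(
        task_id="wait_for_step",
        job_flow_id=create.output,
        step_id="{{ task_instance.xcom_pull(task_ids='add_steps')[0] }}",
    )
    terminate = EmrTerminateJobFlowOperator(
        task_id="terminate_cluster",
        job_flow_id=create.output,
        trigger_rule="all_done",  # tear down even if the job failed
    )
    create >> add_steps >> wait >> terminate
```

The `all_done` trigger rule on the terminate task is what keeps a failed job from leaving an orphaned (and billed) cluster behind.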
Ensure your data meets basic and business-specific data quality constraints. In this post we go over Great Expectations, a data quality testing framework that provides powerful functionality to cover the most common test cases, plus the ability to group tests together and run them as one.
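A minimal taste using the long-standing pandas DataFrame API (newer Great Expectations releases have reorganized this interface, so treat it as a sketch); the data and column names are made up.

```python
import great_expectations as ge
import pandas as pd

# Hypothetical orders data standing in for a real table extract.
df = pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, 25.5, 99.9]})

# Wrap the DataFrame so expectation methods become available.
gdf = ge.from_pandas(df)

# Basic constraint: the primary key must never be null.
gdf.expect_column_values_to_not_be_null("order_id")

# Business-specific constraint: order amounts fall in a sane range.
gdf.expect_column_values_to_be_between("amount", min_value=0, max_value=10_000)

# Run every expectation registered above as one grouped validation.
results = gdf.validate()
print("all checks passed:", results["success"])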
With the advent of powerful data warehouses like Snowflake, BigQuery, and Redshift Spectrum, which separate storage from compute, it has become very economical to store data in the warehouse and transform it as required. This post goes over how to design such an ELT system using Stitch and dbt. The main objective is to keep code complexity and server management low while automating as much as possible.
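In this setup Stitch owns extract and load as a managed service, so the only code you run is the T step. A sketch of what that can reduce to, assuming the dbt CLI is installed and the model selector `marts.orders` is a hypothetical name.

```python
import subprocess

# Stitch lands raw data in the warehouse on its own schedule; we only
# run the transformations. `dbt run` executes the SQL models and
# `dbt test` validates them.
subprocess.run(["dbt", "run", "--select", "marts.orders"], check=True)
subprocess.run(["dbt", "test", "--select", "marts.orders"], check=True)
```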
This post covers key techniques to optimize your Apache Spark code. You will learn exactly what distributed data storage and distributed data processing systems are, how they operate, and how to use them efficiently. Go beyond the basic syntax and learn three powerful strategies to drastically improve the performance of your Apache Spark project.
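Without giving away the post's own three strategies, here is a sketch of three widely used ones in PySpark: broadcast joins, caching, and explicit repartitioning. The S3 paths and column names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("optimizations").getOrCreate()

# Hypothetical inputs: a large fact table and a small dimension table.
orders = spark.read.parquet("s3://my-bucket/orders/")
countries = spark.read.parquet("s3://my-bucket/countries/")

# 1. Broadcast join: ship the small table to every executor instead of
#    shuffling the large table across the cluster.
enriched = orders.join(broadcast(countries), "country_code")

# 2. Cache a DataFrame that is reused, so it is computed only once.
enriched.cache()

# 3. Control partitioning before a wide operation to limit shuffles.
by_country = enriched.repartition("country_code").groupBy("country_code").count()

by_country.show()
```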
Apache Kafka is a distributed streaming platform. This post goes over the basic concepts of Apache Kafka, the common scenarios where it is beneficial, and how to use it.
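To make the producer/consumer concepts concrete, a minimal sketch using the kafka-python client, assuming a broker on localhost:9092; the topic and group names are hypothetical.

```python
from kafka import KafkaConsumer, KafkaProducer  # kafka-python client

# Producer: append a message to the `page_views` topic (a partitioned log).
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("page_views", b'{"user": 1, "page": "/home"}')
producer.flush()  # block until the broker acknowledges

# Consumer: read the topic from the beginning, as part of a consumer group.
consumer = KafkaConsumer(
    "page_views",
    bootstrap_servers="localhost:9092",
    group_id="analytics",          # group members split partitions
    auto_offset_reset="earliest",  # start from the start of the log
)
for message in consumer:
    print(message.offset, message.value)
```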