How to Pull Data from an API, Using AWS Lambda

If you are looking for an easy to setup and simple way to automate, schedule and monitor a 'small' API data pull on the cloud, serverless functions are a good option. In this post we cover what a serverless function can and cannot do, what its pros and cons are and walk through a simple API data pull project. We will be using AWS Lambda and AWS S3 for this project.

How to submit Spark jobs to EMR cluster from Airflow

There are many ways to submit an Apache Spark job to an AWS EMR cluster using Apache Airflow. In this post we go over the steps on how to create a temporary EMR cluster, submit jobs to it, wait for the jobs to complete and terminate the cluster, the Airflow-way.

Ensuring Data Quality, With Great Expectations

Ensure your data meets basic and business specific data quality constraints. In this post we go over a data quality testing framework called great expectations, which provides powerful functionality to cover the most common test cases and the ability to group them together and run them.

Designing a "low-effort" ELT system, using stitch and dbt

With the advent of powerful data warehouses like snowflake, bigquery, redshift spectrum, etc that allow separation of storage and execution, it has become very economical to store data in the data warehouse and then transform them as required. This post goes over how to design such a ELT system using stitch and DBT. The main objective is to keep the code complexity and server management low, while automating as much as possible

3 Key techniques, to optimize your Apache Spark code

This post covers key techniques to optimize your Apache Spark code. You will know exactly what distributed data storage and distributed data processing systems are, how they operate and how to use them efficiently. Go beyond the basic syntax and learn 3 powerful strategies to drastically improve the performance of your Apache Spark project.

Advantages of Using dbt(Data Build Tool)

In this article we aim to go over the reasoning behind why someone might want to use dbt. If you are interested in learning dbt checkout this article . Some common questions from Data Engineers about dbt are

it is not very clear to me why would I use dbt instead of running SQL queries on Airflow