10 Skills to Ace Your Data Engineering Interviews
Introduction
Are you a student, analyst, or engineer preparing for a data engineering interview and overwhelmed by all the tools and concepts? Then this post is for you. We go over the most common tools and concepts you need to know to ace any data engineering interview.
Skills
The following 10 skills are ordered by how frequently they come up in interviews. If you are short on time, skip to the TL;DR section at the end of this post.
1. SQL
SQL is crucial for any data engineering interview. These questions generally involve the interviewer giving you an ERD and asking you to write queries to answer analytical questions. Things to know include
- Select, from, where, like, joins
- Joins: inner, left outer, right outer, full outer, and anti joins, and when to use each type
- Window functions
- Table relationships: one-to-one, one-to-many, many-to-many
- Primary keys and foreign keys
- Subqueries, derived tables, CTEs
- What is an index and why use it
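To practice these, it helps to actually run queries. Below is a minimal sketch using Python's built-in sqlite3 module (window functions require SQLite 3.25+), with hypothetical customers and orders tables, showing a foreign key, an inner join, and a window function that computes a per-customer running total:

```python
import sqlite3

# In-memory database with two toy tables (names and data are made up).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id),  -- foreign key
    amount REAL
);
INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 15.0), (3, 2, 7.5);
""")

# Inner join plus a window function: running total of order amounts
# per customer, ordered by order id.
rows = conn.execute("""
SELECT c.name,
       o.amount,
       SUM(o.amount) OVER (
           PARTITION BY o.customer_id ORDER BY o.id
       ) AS running_total
FROM orders o
INNER JOIN customers c ON c.id = o.customer_id
ORDER BY o.id
""").fetchall()

for row in rows:
    print(row)  # ('Ada', 10.0, 10.0), ('Ada', 15.0, 25.0), ('Grace', 7.5, 7.5)
```

The same query runs largely unchanged on warehouse engines like PostgreSQL or Snowflake, so this is a cheap way to drill joins and window functions before an interview.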
2. Python
Python is the most common language used for building data engineering pipelines. Concepts to know include basic data structures, comprehensions, generators, and writing clean, efficient functions.
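As an illustration of the kind of Python idioms interviewers often probe, here is a small sketch (with made-up event data) using a generator for lazy parsing, a generator expression for filtering, and Counter for aggregation:

```python
from collections import Counter

def read_events(lines):
    """Lazily parse 'user,action' lines; a generator keeps memory flat
    even if the input were millions of lines."""
    for line in lines:
        user, action = line.strip().split(",")
        yield user, action

raw = ["a,click", "b,view", "a,view", "a,click"]

# Generator expression + Counter: count clicks per user in one pass.
clicks = Counter(user for user, action in read_events(raw) if action == "click")
print(clicks)  # Counter({'a': 2})
```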
3. Leetcode: data structures and algorithms
This is a standard section for most software job interviews. Use the links below to practice your leetcode skills.
- The Leetcode “Blind 75” coding problems; I found the NeetCode YouTube channel and their problem-tracking sheet extremely helpful.
- Leetcode SQL questions
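Many of these problems reduce to a single hash-map pass. As a warm-up example, here is the classic two-sum pattern in Python:

```python
def two_sum(nums, target):
    """Return indices of the two numbers that sum to target.
    Hash-map pattern: O(n) time, single pass."""
    seen = {}  # value -> index of values visited so far
    for i, n in enumerate(nums):
        if target - n in seen:
            return [seen[target - n], i]
        seen[n] = i
    return []  # no pair found

print(two_sum([2, 7, 11, 15], 9))  # [0, 1]
```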
4. Data modeling
You will be tested on data modeling concepts (e.g. star schema, facts, and dimensions). Read the concepts below, understand what they are and why they are needed.
4.1 Data warehousing
- What is a data warehouse
- What is dimensional modeling?
- What are slowly changing dimensions, especially the SCD1, SCD2, and SCD3 types?
- Explain the separation of compute and storage
- External tables
- Data partitioning
- Columnar storage formats, such as Parquet and ORC
- Row-oriented vs column-oriented databases
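To make SCD2 concrete, here is a toy Python sketch with a hypothetical customer dimension (plain dicts standing in for warehouse rows): instead of overwriting a changed attribute, the current row is closed out and a new versioned row is inserted:

```python
from datetime import date

# Hypothetical dimension table: one current row per customer.
dim = [
    {"customer_id": 1, "city": "Oslo", "valid_from": date(2020, 1, 1),
     "valid_to": None, "is_current": True},
]

def apply_scd2(dim, customer_id, new_city, change_date):
    """SCD2 update: close the current row, append a new versioned row."""
    for row in dim:
        if row["customer_id"] == customer_id and row["is_current"]:
            if row["city"] == new_city:
                return  # no change, nothing to do
            row["valid_to"] = change_date
            row["is_current"] = False
    dim.append({"customer_id": customer_id, "city": new_city,
                "valid_from": change_date, "valid_to": None,
                "is_current": True})

apply_scd2(dim, 1, "Bergen", date(2021, 6, 1))
print([r["city"] for r in dim if r["is_current"]])  # ['Bergen']
```

The payoff: history is preserved, so a query as-of any past date can recover the attribute value that was valid then.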
4.2 OLTP
5. Data pipelines
Data engineering interviews usually involve a data pipeline design question. Upon designing the data pipeline, you will be asked about testing, backfilling, scaling, handling bad data, resolving dependencies, etc.
- Data pipeline design basics
- Basics of orchestration and transformation tools like Airflow and dbt
- What is backfilling?
- Difference between ETL and ELT, and when to use one over the other
- EL tools such as Stitch or Fivetran
- Data testing
- What is an idempotent data pipeline?
- Efficient data pipelines in python
- How to load data into a data warehouse
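Idempotency in particular is worth internalizing. The sketch below models it with a plain dict standing in for a partitioned warehouse table: the load overwrites its target partition instead of appending, so re-running the same day's load (e.g. during a backfill) produces the same result:

```python
# Hypothetical warehouse table, keyed by partition date.
warehouse = {}

def load_partition(warehouse, run_date, rows):
    """Idempotent load: delete-and-replace the partition, never append."""
    warehouse[run_date] = list(rows)

load_partition(warehouse, "2024-01-01", [{"id": 1}, {"id": 2}])
load_partition(warehouse, "2024-01-01", [{"id": 1}, {"id": 2}])  # re-run

print(len(warehouse["2024-01-01"]))  # still 2, not 4
```

A naive append-based load would double the rows on every retry; partition overwrite (or delete-then-insert) is the standard way to make batch pipelines safely re-runnable.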
6. Distributed system fundamentals
Depending on the job, you might be asked about distributed system fundamentals. It is a good idea to read up on how distributed systems work. Use the links below to understand the basics of distributed systems.
7. Event streaming
Understand what event streams are, why, and how to use them.
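As a toy illustration, the sketch below computes tumbling-window counts over timestamped events, the kind of continuous aggregation engines like Flink or Kafka Streams perform; the event data and window size here are made up:

```python
from collections import defaultdict

def tumbling_counts(events, window_seconds):
    """Count events per non-overlapping (tumbling) time window."""
    counts = defaultdict(int)
    for ts, _payload in events:
        window_start = ts - (ts % window_seconds)  # bucket start time
        counts[window_start] += 1
    return dict(counts)

# (timestamp_seconds, payload) pairs from a hypothetical event stream.
events = [(3, "a"), (7, "b"), (12, "c"), (14, "d")]
print(tumbling_counts(events, 10))  # {0: 2, 10: 2}
```

Real streaming engines add the hard parts this sketch ignores: out-of-order events, watermarks, and state that survives restarts.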
8. System design
This is a standard question for most software jobs. For a data engineering position specifically, you will be asked to design data pipelines: understand the source system, the ETL schedule requirements, and the data usage patterns, then design a pipeline to pull data into a warehouse. Common questions in this section include "How do you design a clickstream data storage system?" and "How would you use the CDC pattern to replicate OLTP tables into a data warehouse?". Some companies may choose to ask a standard system design question like designing Twitter or Netflix.
- Sample Batch ETL
- Sample Streaming ETL
- What is CDC (Change data capture) pattern
- Standard system design concepts such as caches, databases, load balancers, and data modeling, and common questions like designing Twitter, Netflix, or WhatsApp
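To make the CDC pattern concrete, here is a toy sketch that replays an ordered change log (insert/update/delete events from a hypothetical OLTP source) against a warehouse-side replica:

```python
def apply_change_log(replica, changes):
    """Replay ordered CDC events against a replica keyed by primary key."""
    for op, key, row in changes:
        if op in ("insert", "update"):
            replica[key] = row      # upsert the latest row image
        elif op == "delete":
            replica.pop(key, None)  # tombstone: drop the row
    return replica

# Hypothetical change log captured from an OLTP table, in commit order.
changes = [
    ("insert", 1, {"name": "Ada"}),
    ("insert", 2, {"name": "Grace"}),
    ("update", 1, {"name": "Ada L."}),
    ("delete", 2, None),
]
replica = apply_change_log({}, changes)
print(replica)  # {1: {'name': 'Ada L.'}}
```

Real CDC tools (e.g. Debezium) read the database's write-ahead log to produce such a change stream; the replay logic on the warehouse side follows the same upsert/delete shape.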
9. Business questions
Sometimes interviews may involve questions about business metrics. You can be asked to design metrics to track the health of the business, explain how to calculate them, and justify why you chose them.
You may be asked about your impact. You will need to explain what KPI/SLA (requirements) you were trying to optimize and how you achieved it. See the STAR method for how to organize your response to such questions.
10. Cloud computing
Be familiar with what cloud computing is. You do not need to know every service in detail, but you should have a general idea of what they are and the benefits they bring.
- Cloud storage, such as AWS S3
- Cloud compute, such as AWS EC2
- Cloud database, such as AWS RDS
- Cloud managed HDFS, Spark, Flink, such as AWS EMR
- Cloud serverless functions, such as AWS Lambda
- Cloud managed event streaming, such as AWS Kinesis (or AWS MSK for managed Kafka)
You can find corresponding components for GCP or Azure as well.
11. Probabilistic data structures (optional)
Although not frequently asked, these questions do come up. You may be asked something like, “What is an efficient way to count the number of distinct IDs in a distributed system, where some probability of error is fine?” The most common questions of this type can usually be answered using the concepts shown below.
- HyperLogLog: used for approximate distinct counts
- Count-min sketch: used for approximate frequency counts over a stream of data
- Bloom filter: used for approximate membership (is-present) checks
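As an illustration of the idea behind Bloom filters, here is a toy Python implementation (the bit-array size and hash scheme are chosen arbitrarily): k hash positions set k bits, so membership checks may return false positives but never false negatives:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter backed by a Python int as the bit array."""

    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = 0

    def _positions(self, item):
        # Derive k positions by salting a single hash function.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos  # set each bit

    def might_contain(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits >> pos & 1 for pos in self._positions(item))

bf = BloomFilter()
bf.add("user_42")
print(bf.might_contain("user_42"))   # True
print(bf.might_contain("user_99"))   # False with high probability
```

Production systems use tuned variants of this (and HyperLogLog for distinct counts) because the memory footprint is tiny and fixed regardless of how many items pass through.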
Optional: Knowledge of a JVM language (e.g. Scala or Java) is helpful.
Interview prep, the TL;DR version
If you have only a few days to prepare for your interview, make sure to understand the topics shown below.
- SQL: basics, joins, unions, CTEs, window functions, and Leetcode medium/hard problems
- Python: basic data structures, plus the easy and medium (green and yellow, respectively) problems from the “Blind 75” list
- Answering business questions with SQL: You will be given a set of tables and asked a business question. Usually, the tables have the same name as the business process/entity.
- Data pipeline design: You will be asked to design a data pipeline. Make sure to understand the objective of the pipeline. Common questions in this section include "How do you design a clickstream data storage system?" and "How would you use the CDC pattern to replicate OLTP tables into a data warehouse?". The interviewer will want to see that you think about data lineage, schedule (batch or streaming), data duplication, scaling, loading data, testing, and end-user access patterns.
Conclusion
I hope this article gives you a good list of skills to be better prepared for a data engineering interview. The next time you interview for a data engineering job, use the list above to prepare.
As always, if you have any questions or comments, please feel free to leave them in the comments section below.
If you found this article helpful, share it with a friend or colleague using one of the socials below!