10 Skills to Ace Your Data Engineering Interviews
Introduction
Are you a student, analyst, or engineer preparing for a data engineering interview and overwhelmed by all the tools and concepts? Then this post is for you. We go over the most common tools and concepts you need to know to ace any data engineering interview.
Skills
The following 10 skills are ordered by how frequently they come up in interviews. If you are short on time, skip to the TL;DR section at the end of this post.
1. SQL
SQL is crucial for any data engineering interview. These questions generally involve the interviewer giving you an ERD and asking you to write queries to answer analytical questions. Things to know include
- Select, from, where, like, joins
- Joins: inner, left outer, right outer, full outer, and anti joins, and when to use each type
- Window functions
- Table relationships: one-to-one, one-to-many, many-to-many
- Primary keys and foreign keys
- Subqueries, derived tables, CTEs
- What is an index and why use it
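To practice these, it helps to actually run queries. Below is a minimal sketch using Python's built-in sqlite3 module (window functions require SQLite 3.25+), with hypothetical customers and orders tables, showing a foreign key, an inner join, and a window function that computes a per-customer running total:

```python
import sqlite3

# In-memory database with two toy tables (names and data are made up).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id),  -- foreign key
    amount REAL
);
INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 15.0), (3, 2, 7.5);
""")

# Inner join plus a window function: running total of order amounts
# per customer, ordered by order id.
rows = conn.execute("""
SELECT c.name,
       o.amount,
       SUM(o.amount) OVER (
           PARTITION BY o.customer_id ORDER BY o.id
       ) AS running_total
FROM orders o
INNER JOIN customers c ON c.id = o.customer_id
ORDER BY o.id
""").fetchall()

for row in rows:
    print(row)  # ('Ada', 10.0, 10.0), ('Ada', 15.0, 25.0), ('Grace', 7.5, 7.5)
```

The same query runs largely unchanged on warehouse engines like PostgreSQL or Snowflake, so this is a cheap way to drill joins and window functions before an interview.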
2. Python
Python is the most common language used for building data engineering pipelines. Concepts to know include basic data structures, comprehensions, generators, and writing clean, efficient functions.
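As an illustration of the kind of Python idioms interviewers often probe, here is a small sketch (with made-up event data) using a generator for lazy parsing, a generator expression for filtering, and Counter for aggregation:

```python
from collections import Counter

def read_events(lines):
    """Lazily parse 'user,action' lines; a generator keeps memory flat
    even if the input were millions of lines."""
    for line in lines:
        user, action = line.strip().split(",")
        yield user, action

raw = ["a,click", "b,view", "a,view", "a,click"]

# Generator expression + Counter: count clicks per user in one pass.
clicks = Counter(user for user, action in read_events(raw) if action == "click")
print(clicks)  # Counter({'a': 2})
```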
3. Leetcode: data structures and algorithms
This is a standard section for most software job interviews. Use the links below to practice your leetcode skills.
- The Leetcode “Blind 75” coding problems; I found the NeetCode YouTube channel and their problem-tracking sheet extremely helpful.
- Leetcode SQL questions
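Many of these problems reduce to a single hash-map pass. As a warm-up example, here is the classic two-sum pattern in Python:

```python
def two_sum(nums, target):
    """Return indices of the two numbers that sum to target.
    Hash-map pattern: O(n) time, single pass."""
    seen = {}  # value -> index of values visited so far
    for i, n in enumerate(nums):
        if target - n in seen:
            return [seen[target - n], i]
        seen[n] = i
    return []  # no pair found

print(two_sum([2, 7, 11, 15], 9))  # [0, 1]
```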
4. Data modeling
You will be tested on data modeling concepts (e.g. star schema, facts, and dimensions). Read the concepts below, understand what they are and why they are needed.
4.1 Data warehousing
- What is a data warehouse
- What is dimensional modeling?
- What are slowly changing dimensions, especially the SCD1, SCD2, and SCD3 types?
- Explain the separation of compute and storage
- External tables
- Data partitioning
- Columnar storage formats, such as Parquet and ORC
- Row-oriented vs column-oriented databases
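To make SCD2 concrete, here is a toy Python sketch with a hypothetical customer dimension (plain dicts standing in for warehouse rows): instead of overwriting a changed attribute, the current row is closed out and a new versioned row is inserted:

```python
from datetime import date

# Hypothetical dimension table: one current row per customer.
dim = [
    {"customer_id": 1, "city": "Oslo", "valid_from": date(2020, 1, 1),
     "valid_to": None, "is_current": True},
]

def apply_scd2(dim, customer_id, new_city, change_date):
    """SCD2 update: close the current row, append a new versioned row."""
    for row in dim:
        if row["customer_id"] == customer_id and row["is_current"]:
            if row["city"] == new_city:
                return  # no change, nothing to do
            row["valid_to"] = change_date
            row["is_current"] = False
    dim.append({"customer_id": customer_id, "city": new_city,
                "valid_from": change_date, "valid_to": None,
                "is_current": True})

apply_scd2(dim, 1, "Bergen", date(2021, 6, 1))
print([r["city"] for r in dim if r["is_current"]])  # ['Bergen']
```

The payoff: history is preserved, so a query as-of any past date can recover the attribute value that was valid then.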
4.2 OLTP
5. Data pipelines
Data engineering interviews usually involve a data pipeline design question. Upon designing the data pipeline, you will be asked about testing, backfilling, scaling, handling bad data, resolving dependencies, etc.
- Data pipeline design basics
- Basics of orchestration and transformation tools like Airflow and dbt
- What is backfilling?
- Difference between ETL and ELT, and when to use one over the other
- EL tools such as Stitch or Fivetran
- Data testing
- What is an idempotent data pipeline?
- Efficient data pipelines in python
- How to load data into a data warehouse
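Idempotency in particular is worth internalizing. The sketch below models it with a plain dict standing in for a partitioned warehouse table: the load overwrites its target partition instead of appending, so re-running the same day's load (e.g. during a backfill) produces the same result:

```python
# Hypothetical warehouse table, keyed by partition date.
warehouse = {}

def load_partition(warehouse, run_date, rows):
    """Idempotent load: delete-and-replace the partition, never append."""
    warehouse[run_date] = list(rows)

load_partition(warehouse, "2024-01-01", [{"id": 1}, {"id": 2}])
load_partition(warehouse, "2024-01-01", [{"id": 1}, {"id": 2}])  # re-run

print(len(warehouse["2024-01-01"]))  # still 2, not 4
```

A naive append-based load would double the rows on every retry; partition overwrite (or delete-then-insert) is the standard way to make batch pipelines safely re-runnable.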
6. Distributed system fundamentals
Depending on the job, you might be asked about distributed system fundamentals. It is a good idea to read up on how distributed systems work. Use the links below to understand the basics of distributed systems.
7. Event streaming
Understand what event streams are, why, and how to use them.
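As a toy illustration, the sketch below computes tumbling-window counts over timestamped events, the kind of continuous aggregation engines like Flink or Kafka Streams perform; the event data and window size here are made up:

```python
from collections import defaultdict

def tumbling_counts(events, window_seconds):
    """Count events per non-overlapping (tumbling) time window."""
    counts = defaultdict(int)
    for ts, _payload in events:
        window_start = ts - (ts % window_seconds)  # bucket start time
        counts[window_start] += 1
    return dict(counts)

# (timestamp_seconds, payload) pairs from a hypothetical event stream.
events = [(3, "a"), (7, "b"), (12, "c"), (14, "d")]
print(tumbling_counts(events, 10))  # {0: 2, 10: 2}
```

Real streaming engines add the hard parts this sketch ignores: out-of-order events, watermarks, and state that survives restarts.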
8. System design
This is a standard question for most software jobs. For a data engineering position specifically, you will be asked to design data pipelines: understand the source system, the ETL schedule requirements, and the data usage patterns, then design a pipeline to pull data into a warehouse. Common questions in this section include "How do you design a clickstream data storage system?" and "How would you use the CDC pattern to replicate OLTP tables into a data warehouse?". Some companies may choose to ask a standard system design question like designing Twitter or Netflix.
- Sample Batch ETL
- Sample Streaming ETL
- What is CDC (Change data capture) pattern
- Standard system design concepts such as caches, databases, load balancers, and data modeling, and common questions like designing Twitter, Netflix, or WhatsApp
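To make the CDC pattern concrete, here is a toy sketch that replays an ordered change log (insert/update/delete events from a hypothetical OLTP source) against a warehouse-side replica:

```python
def apply_change_log(replica, changes):
    """Replay ordered CDC events against a replica keyed by primary key."""
    for op, key, row in changes:
        if op in ("insert", "update"):
            replica[key] = row      # upsert the latest row image
        elif op == "delete":
            replica.pop(key, None)  # tombstone: drop the row
    return replica

# Hypothetical change log captured from an OLTP table, in commit order.
changes = [
    ("insert", 1, {"name": "Ada"}),
    ("insert", 2, {"name": "Grace"}),
    ("update", 1, {"name": "Ada L."}),
    ("delete", 2, None),
]
replica = apply_change_log({}, changes)
print(replica)  # {1: {'name': 'Ada L.'}}
```

Real CDC tools (e.g. Debezium) read the database's write-ahead log to produce such a change stream; the replay logic on the warehouse side follows the same upsert/delete shape.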
9. Business questions
Sometimes interviews may involve questions about business metrics. You can be asked to design metrics to track the health of the business, explain how to calculate them, and justify why you chose them.
You may be asked about your impact. You will need to explain what KPI/SLA (requirements) you were trying to optimize and how you achieved it. See the STAR method for how to organize your response to such questions.
10. Cloud computing
Be familiar with what cloud computing is. You do not need to know every service in detail, but you should have a general idea of what they are and the benefits they bring.
- Cloud storage, such as AWS S3
- Cloud compute, such as AWS EC2
- Cloud database, such as AWS RDS
- Cloud managed HDFS, Spark, Flink, such as AWS EMR
- Cloud serverless functions, such as AWS Lambda
- Cloud managed event streaming, such as AWS Kinesis (or AWS MSK for managed Kafka)
You can find corresponding components for GCP or Azure as well.
11. Probabilistic data structures (optional)
Although not frequently asked, these questions do come up. You may be asked something like, “What is an efficient way to count the number of distinct IDs in a distributed system, where some probability of error is fine?” The most common questions of this type can usually be answered using the concepts shown below.
- HyperLogLog: used for approximate distinct counts
- Count-min sketch: used for approximate frequency counts over a stream of data
- Bloom filter: used for approximate membership (is-present) checks
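As an illustration of the idea behind Bloom filters, here is a toy Python implementation (the bit-array size and hash scheme are chosen arbitrarily): k hash positions set k bits, so membership checks may return false positives but never false negatives:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter backed by a Python int as the bit array."""

    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = 0

    def _positions(self, item):
        # Derive k positions by salting a single hash function.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos  # set each bit

    def might_contain(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits >> pos & 1 for pos in self._positions(item))

bf = BloomFilter()
bf.add("user_42")
print(bf.might_contain("user_42"))   # True
print(bf.might_contain("user_99"))   # False with high probability
```

Production systems use tuned variants of this (and HyperLogLog for distinct counts) because the memory footprint is tiny and fixed regardless of how many items pass through.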
Optional: Knowledge of a JVM language (e.g. Scala or Java) is helpful.
Interview prep, the TL;DR version
If you have only a few days to prepare for your interview, make sure to understand the topics shown below.
- SQL: basics, joins, unions, CTEs, window functions, and Leetcode medium/hard problems
- Python: basic data structures, plus the easy and medium (green and yellow, respectively) problems from the “Blind 75” list
- Answering business questions with SQL: You will be given a set of tables and asked a business question. Usually, the tables have the same name as the business process/entity.
- Data pipeline design: You will be asked to design a data pipeline. Make sure to understand the objective of the pipeline. Common questions in this section include "How do you design a clickstream data storage system?" and "How would you use the CDC pattern to replicate OLTP tables into a data warehouse?". The interviewer will want to see that you think about data lineage, schedule (batch or streaming), data duplication, scaling, loading data, testing, and end-user access patterns.
Conclusion
I hope this article gives you a good list of skills to be better prepared for a data engineering interview. The next time you interview for a data engineering job, use the list above to prepare.
As always, if you have any questions or comments, please feel free to leave them in the comments section below.
If you found this article helpful, share it with a friend or colleague using one of the socials below!