Data Engineering Interview Series #1: Data Structures and Algorithms

1. Introduction

Preparing for data engineering interviews can be stressful. There are so many things to learn. In this Data Engineering Interview Series, you will learn how to crack each section of the data engineering interview.

If you have felt

That you need to practice 100s of Leetcode questions to crack the data engineering interview

That you have no idea where/how to start preparing for the data structures and algorithms interview

That you are not good enough to crack the data structures and algorithms interview.

Then this post is for you! In this post, we will cover the following:

  1. Data structures & algorithms to know
  2. Common patterns of questions asked
  3. How to do industry-specific research

By the end of this post, you will be able to pass the DSA part of your data engineering interview.

Practice content in this post with code using this repository .

2. Data structures and algorithms to know

Understanding basic data structures is essential to implement answers to DSA questions. You should know the topics in this section very well.

2.1. List

Description: A list is a dynamic array that allows you to store a sequence of elements. It supports indexing, slicing, and various methods for manipulating the data.

Time Complexity:

  1. Read: O(1) - Accessing an element by an index is fast and takes constant time.
  2. Write: O(1) - Appending an element is usually O(1), but inserting or removing can be O(n) due to shifting elements.
my_list = [1, 2, 3, 4, 5]
# Accessing an element
element = my_list[2]  # element = 3

# Adding an element
my_list.append(6)  # my_list = [1, 2, 3, 4, 5, 6]

# Removing an element
my_list.remove(3)  # my_list = [1, 2, 4, 5, 6]
print(my_list)

# sort list in place
my_list.sort()


# Iterate through a list

for idx, elt in enumerate(my_list):
    print(f'Use enumerate to get index: {idx} and element: {elt}')

for elt in my_list:
    print(f'Use loop directly to loop through elements: {elt}')

2.2. Dictionary

Description: A dictionary is an unordered collection of key-value pairs. It allows for fast lookups, insertions, and deletions based on keys.

Time Complexity:

  1. Read: O(1) - Accessing a value by key is fast due to the underlying hash table.
  2. Write: O(1) - Inserting or deleting a key-value pair is also fast, thanks to the hash table.
my_dict = {'a': 1, 'b': 2, 'c': 3}
# Accessing a value
value = my_dict['b']  # value = 2

# Adding a key-value pair
my_dict['d'] = 4  # my_dict = {'a': 1, 'b': 2, 'c': 3, 'd': 4}

# Removing a key-value pair
del my_dict['a']  # my_dict = {'b': 2, 'c': 3, 'd': 4}
print(my_dict)

2.3. Queue

Description: A queue is a linear data structure that follows the First-In-First-Out (FIFO) principle. Elements are added at the rear and removed from the front.

Time Complexity:

  1. Read: O(1) - Accessing the front element is O(1) if using deque.
  2. Write: O(1) - Enqueuing (adding) and dequeuing (removing) operations are O(1) with deque.
from collections import deque

queue = deque([1, 2, 3])
# Adding an element
queue.append(4)  # queue = deque([1, 2, 3, 4])

# Removing an element
element = queue.popleft()  # element = 1, queue = deque([2, 3, 4])
print(f'The popped element is {element} and the remaining queue is {queue}')

2.4. Stack

Description: A stack is a linear data structure that follows the Last-In-First-Out (LIFO) principle. Elements are added and removed from the top.

Time Complexity:

  1. Read: O(1) - Accessing the top element is O(1).
  2. Write: O(1) - Pushing (adding) and popping (removing) elements are O(1).
stack = [1, 2, 3]
# Adding an element (push)
stack.append(4)  # stack = [1, 2, 3, 4]

# Removing an element (pop)
element = stack.pop()  # element = 4, stack = [1, 2, 3]
print(f'The popped element is {element} and the remaining stack is {stack}')

2.5. Set

Description: A set is an unordered collection of unique elements. Sets are helpful for membership testing, removing duplicates from a sequence, and performing mathematical set operations like union, intersection, and difference.

Time Complexity:

  1. Read: O(1) - Checking if an element is in the set (e.g., 4 in my_set) is O(1) on average.
  2. Write:
    • Adding/Removing: O(1) - Adding or removing an element is O(1) on average.
    • Set Operations:
      1. Union: O(len(set1) + len(set2)) - Combining two sets into one.
      2. Intersection: O(min(len(set1), len(set2))) - Finding common elements between two sets.
      3. Difference: O(len(set1)) - Finding elements in one set but not another.

Sets are highly efficient for operations involving unique elements and membership checks, with an average O(1) time complexity for most read-and-write operations.

# Creating a set
my_set = {1, 2, 3, 4, 5}

# Adding an element
my_set.add(6)  # my_set = {1, 2, 3, 4, 5, 6}

# Removing an element
my_set.remove(3)  # my_set = {1, 2, 4, 5, 6}

# Checking membership
is_in_set = 4 in my_set  # is_in_set = True

# Set operations: union, intersection, difference
other_set = {4, 5, 6, 7, 8}
union_set = my_set.union(other_set)  # union_set = {1, 2, 4, 5, 6, 7, 8}
intersection_set = my_set.intersection(other_set)  # intersection_set = {4, 5, 6}
difference_set = my_set.difference(other_set)  # difference_set = {1, 2}

print(f' Here is my_set {my_set} and other_set {other_set}')
print(f' Here is union_set {union_set}')
print(f' Here is intersection_set {intersection_set}')
print(f' Here is difference_set {difference_set}')

2.6. Counter (from collections module)

Description: Counter is a subclass of the dictionary in Python that helps count hashable objects. It is beneficial for counting the frequency of elements in an iterable.

from collections import Counter

# Creating a Counter from a list
my_counter = Counter(['apple,' 'banana,' 'apple,' 'orange,' 'banana,' 'apple'])

print(my_counter)

2.7. Heap

Description: A heap is a specialized binary tree-based data structure that satisfies the heap property. In a max-heap, for any given node i, the value of i is greater than or equal to the values of its children. In a min-heap, the value of i is less than or equal to the values of its children. Heaps are commonly used to implement priority queues.

Time Complexity:

  1. Read: O(1) - Accessing the maximum or minimum element (the root) is fast because it is always at the top of the heap.
  2. Write:
  3. Insertion: O(log n) - Inserting a new element requires maintaining the heap property, which involves comparing and possibly swapping elements up the tree.
  4. Deletion: O(log n) - Deleting the maximum or minimum element requires rebalancing the heap, typically involving a series of swaps down the tree.
import heapq

# by default, this is a min-heap. To get the max heap, multiply the number by -1 before inserting
a = []
heapq.heappush(a, 10)

heapq.heappop(a)

2.8.1 Depth First Search (DFS)


def iterative_dfs(graph, start_node):
    visited = set()
    stack = [start_node]

    while stack:
        node = stack.pop()
        if node not in visited:
            visited.add(node)
            # Add neighbors to the stack
            for neighbor in graph[node]:
                if neighbor not in visited:
                    stack.append(neighbor)
    return visited


def recursive_dfs(graph, node, visited=None):
    if visited is None:
        visited = set()
    visited.add(node)
    
    for neighbor in graph[node]:
        if neighbor not in visited:
            recursive_dfs(graph, neighbor, visited)
    
    return visited

# Define a simple graph using an adjacency list
graph = {
    'A': ['B', 'C'],
    'B': ['A', 'D', 'E'],
    'C': ['A', 'F'],
    'D': ['B'],
    'E': ['B', 'F'],
    'F': ['C', 'E']
}

# Test case 1: Iterative DFS
iterative_result = iterative_dfs(graph, 'A')
print("Iterative DFS:", iterative_result)

# Test case 2: Recursive DFS
recursive_result = recursive_dfs(graph, 'A')
print("Recursive DFS:", recursive_result)

# Test case 3: Both methods should produce the same result for the same graph and start node
assert iterative_result == recursive_result, "Iterative and Recursive DFS results should match"

# Test case 4: Start from a different node
iterative_result_b = iterative_dfs(graph, 'B')
recursive_result_b = recursive_dfs(graph, 'B')
print("Iterative DFS from B:", iterative_result_b)
print("Recursive DFS from B:", recursive_result_b)

assert iterative_result_b == recursive_result_b, "Iterative and Recursive DFS results should match"


2.8.2. Breadth First Search BFS

from collections import deque

def iterative_bfs(graph, start_node):
    visited = set()
    queue = deque([start_node])

    while queue:
        node = queue.popleft()
        if node not in visited:
            visited.add(node)
            # Add neighbors to the queue
            for neighbor in graph[node]:
                if neighbor not in visited:
                    queue.append(neighbor)
    return visited

def recursive_bfs_aux(graph, queue, visited):
    if not queue:
        return visited
    
    node = queue.popleft()
    visited.add(node)
    
    for neighbor in graph[node]:
        if neighbor not in visited and neighbor not in queue:
            queue.append(neighbor)
    
    return recursive_bfs_aux(graph, queue, visited)

def recursive_bfs(graph, start_node):
    visited = set()
    queue = deque([start_node])
    return recursive_bfs_aux(graph, queue, visited)

# Define a simple graph using an adjacency list
graph = {
    'A': ['B', 'C'],
    'B': ['A', 'D', 'E'],
    'C': ['A', 'F'],
    'D': ['B'],
    'E': ['B', 'F'],
    'F': ['C', 'E']
}

# Test case 1: Iterative BFS
iterative_result = iterative_bfs(graph, 'A')
print("Iterative BFS:", iterative_result)

# Test case 2: Recursive BFS
recursive_result = recursive_bfs(graph, 'A')
print("Recursive BFS:", recursive_result)

# Test case 3: Both methods should produce the same result for the same graph and start node
assert iterative_result == recursive_result, "Iterative and Recursive BFS results should match"

# Test case 4: Start from a different node
iterative_result_b = iterative_bfs(graph, 'B')
recursive_result_b = recursive_bfs(graph, 'B')
print("Iterative BFS from B:", iterative_result_b)
print("Recursive BFS from B:", recursive_result_b)

assert iterative_result_b == recursive_result_b, "Iterative and Recursive BFS results should match"

def iterative_binary_search(arr, target):
    left, right = 0, len(arr) - 1

    while left <= right:
        mid = (left + right) // 2

        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1

    return -1  # Target not found

def recursive_binary_search(arr, target, left, right):
    if left > right:
        return -1  # Target not found

    mid = (left + right) // 2

    if arr[mid] == target:
        return mid
    elif arr[mid] < target:
        return recursive_binary_search(arr, target, mid + 1, right)
    else:
        return recursive_binary_search(arr, target, left, mid - 1)

arr = [1, 3, 5, 7, 9, 11, 13, 15]

# Test case 1: Iterative Binary Search
iterative_result = iterative_binary_search(arr, 7)
print("Iterative Binary Search for 7:", iterative_result)  # Should print index 3

# Test case 2: Recursive Binary Search
recursive_result = recursive_binary_search(arr, 7, 0, len(arr) - 1)
print("Recursive Binary Search for 7:", recursive_result)  # Should print index 3

# Test case 3: Target not found
iterative_not_found = iterative_binary_search(arr, 4)
recursive_not_found = recursive_binary_search(arr, 4, 0, len(arr) - 1)
print("Iterative Binary Search for 4:", iterative_not_found)  # Should print -1
print("Recursive Binary Search for 4:", recursive_not_found)  # Should print -1

3. Common DSA questions asked during DE interviews

These problems are taken from Blind 75 and based on common questions during data engineering interviews. The topic segmentation is based on Neetcode .

Do the exercises in the order shown below!

You should be able to complete the problems in this section in under 15 minutes per problem.

⏰ Practice these problems multiple times until you can do them in under 15 minutes without looking at the solutions!!

3.1. Intervals

  1. Insert interval
  2. Merge intervals
  3. Nonoverlapping intervals

3.2. Graph

  1. Number of Islands
  2. Number of Connected Components
  3. Course Schedule

3.3. Two Pointers

  1. Container with most water
  2. 3Sum
  3. Valid Palindrome

3.4. Array & Hashing

  1. Contains Duplicate
  2. Valid Anagram
  3. Two Sum
  4. Top K Frequent Elements
  5. Longest Consecutive Sequence

3.5. Stacks

  1. Validate parenthesis

3.6. Sliding Window

  1. Best time to buy and sell stock

3.7. Linked List

  1. Reverse Linked List
  2. Merge 2 Sorted List
  3. Linked List Cycle
  4. Merge K sorted List

3.8. Tree

  1. Invert a binary tree
  2. Depth of a binary tree
  3. Subtree of a binary tree
  4. LCA in BST
  5. Serialize and deserialize Binary Tree

3.9. Heap

  1. Find median in a datastream

3.10. Dynamic Programming

  1. Coin change
  2. House robber
  3. Decode Ways

3.11. Construct data structures

  1. LRU Cache
  2. Circular queue

4. Company specific research

Search for company-specific problems on

  1. Leetcode
  2. TeamBlind
  3. Glassdoor

Sort by the latest and do as many of them as possible. Like before, you should be able to complete these problems in under 15 minutes.

5. Conclusion

To recap, we saw:

  1. Data structures & algorithms to know
  2. Common patterns of questions asked
  3. How to do industry-specific research

If you have an interview coming up, make sure you can complete the typical patterns section in order.

Please let me know in the comment section below if you have any questions or comments.

6. Further reading

  1. Steps to land a high-paying data engineering job
  2. How to become a valuable data engineer

If you found this article helpful, share it with a friend or colleague using one of the socials below!