The Number Of Tools Listed In Data Engineering Job Requirements Is Insane
The market is tough right now.
Every data engineering job requires multiple tools and multiple years of experience.
It can be overwhelming to try to land a high-paying DE job.
But what if you could make potential employers excited to hire you?
We will see how to do that in this post.
Companies are looking for problem solvers. Let’s go over the list of problems and how to address them.
Better Tools, Same Problems
Despite tremendous improvements in data technology over the past few decades, data teams face the same problems they always have.
These are:
- Getting complete and correct data, on time, to the users
- Making sure the data is easy to use for analytics
- Fixing critical issues quickly
- Keeping costs manageable
Let’s go over the concepts that address these problems. Each concept will include links to further reading and a list of tools you can use to implement it.
Create easy-to-analyze datasets using Data Warehousing Techniques
- Use Kimball data modeling to ensure data is easy to analyze. Kimball Data Modeling.
- Use the 3-hop pattern (medallion, dbt project structure) to create datasets with all the metrics a non-technical user may need. Multi Hop Architecture.
- Know enough SQL to create these datasets. SQL Techniques.
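To make the Kimball idea concrete, here is a minimal sketch of a star schema and the kind of simple join-and-aggregate query it enables. The table and column names are made up for illustration; SQLite stands in for your warehouse.

```python
import sqlite3

# Hypothetical star schema: one fact table plus one dimension table (Kimball-style).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY,
        customer_name TEXT,
        region TEXT
    );
    CREATE TABLE fact_orders (
        order_key INTEGER PRIMARY KEY,
        customer_key INTEGER REFERENCES dim_customer(customer_key),
        order_amount REAL
    );
""")
conn.executemany("INSERT INTO dim_customer VALUES (?, ?, ?)",
                 [(1, "Acme", "NA"), (2, "Globex", "EU")])
conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)",
                 [(100, 1, 250.0), (101, 1, 100.0), (102, 2, 75.0)])

# Analysts only need to join the fact to its dimensions and aggregate --
# no complex business logic buried in the query.
rows = conn.execute("""
    SELECT d.region, SUM(f.order_amount) AS total_amount
    FROM fact_orders f
    JOIN dim_customer d USING (customer_key)
    GROUP BY d.region
    ORDER BY d.region
""").fetchall()
print(rows)  # [('EU', 75.0), ('NA', 350.0)]
```

The point is the shape, not the engine: facts hold measurable events, dimensions hold descriptive context, and every analytical question becomes a predictable join.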
Commonly used tools:
Create Complete, Correct, And Quick-To-Fix Datasets Using Pipeline Design Patterns
- Use idempotency and backfill-ability to rerun pipelines where there are issues.
- Only process necessary data with incremental and snapshot patterns.
- Use lambda architecture to process inputs with late-arriving data.
- Python is the glue that holds your pipeline together.
- Use cloud service providers to enable you to concentrate on your deliverables and not infrastructure management. Snowflake, Databricks, BigQuery, etc.
- Ensure the data you produce is correct, with Data Quality Checks.
Commonly used tools: Snowflake, AWS, Databricks, Python, SQL, Apache Spark, BigQuery, Apache Iceberg
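The idempotency and data-quality bullets above can be sketched in a few lines. This is an assumed toy pipeline, not any specific framework's API: the key move is overwriting a whole partition per run instead of appending, so reruns and backfills produce the same result.

```python
# Toy idempotent daily load. A dict stands in for real partitioned storage;
# all names here are hypothetical.
warehouse = {}  # partition key (run date) -> list of rows

def extract(run_date):
    # Stand-in for reading only the source rows for run_date (incremental pattern).
    return [{"date": run_date, "amount": 10.0}, {"date": run_date, "amount": 5.0}]

def check_quality(rows):
    # Fail fast on bad data instead of publishing it downstream.
    assert rows, "no rows extracted"
    assert all(r["amount"] >= 0 for r in rows), "negative amounts found"

def run_pipeline(run_date):
    rows = extract(run_date)
    check_quality(rows)
    warehouse[run_date] = rows  # overwrite, never append: idempotent

run_pipeline("2024-01-01")
run_pipeline("2024-01-01")  # rerun after a failure: same output, no duplicates
print(len(warehouse["2024-01-01"]))  # 2
```

Because each run fully owns its partition, backfilling is just calling `run_pipeline` for a range of past dates.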
Produce data on time with Scheduler & Orchestrator
- Create data as frequently as needed with a scheduler.
- Coordinate multiple systems in a single script with an orchestrator.
- Handle dependency chains between datasets with data-update-based scheduling.
Commonly used tools: Apache Airflow, dbt Core, dbt Cloud, Dagster, Prefect
Process data cheaply using Data Storage & Processing Patterns
- Data processing optimization comes down to two things:
    - Reducing the amount of data to process
    - Minimizing the movement of data between machines in a distributed system
- Reduce the amount of data to process using partitioning & bucketing.
- Reduce the movement of data between nodes in a distributed system with join strategies, SPJ, broadcast joins, & bucketing.
- Encode your data in formats built for analytical processing, such as columnar formats like Parquet.
- Understand transformation types to design your code to scale effectively.
- Inspect query plan for areas to improve.
Commonly used tools: Apache Spark, Snowflake, Amazon S3, Apache Iceberg, Apache Parquet
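Partition pruning, the workhorse behind "reduce the amount of data to process," can be illustrated with a toy in-memory table. All names here are invented; real engines like Spark or Snowflake do this pruning for you when you filter on the partition key.

```python
# Toy illustration of partition pruning: data is laid out by partition key,
# so a filtered scan reads only the matching partitions.
table = {
    "2024-01-01": [{"amount": 10}, {"amount": 20}],
    "2024-01-02": [{"amount": 30}],
    "2024-01-03": [{"amount": 40}, {"amount": 50}],
}

def scan(table, partition_filter=None):
    scanned = 0
    results = []
    for partition, rows in table.items():
        if partition_filter and partition != partition_filter:
            continue  # pruned: these rows are never read from storage
        scanned += len(rows)
        results.extend(rows)
    return results, scanned

full, full_scanned = scan(table)                                  # reads all 5 rows
pruned, pruned_scanned = scan(table, partition_filter="2024-01-02")  # reads 1 row
print(full_scanned, pruned_scanned)  # 5 1
```

The same intuition extends to bucketing and storage-partitioned joins: if the layout already matches the query, the engine can skip both reads and shuffles.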
Conclusion
To recap, we saw:
- How tools have improved significantly, but problems remain the same.
- How to create easy-to-analyze data with Data Warehousing.
- How to create complete, correct, and quick-to-fix datasets with Pipeline Design Patterns.
- How to produce data on time with Scheduler & Orchestrator.
- How to process data efficiently using Data Storage & Processing Patterns.
When facing any data system design scenario, start with the problem. It will usually fall into one of the above.
Design a solution based on the concepts explained, and implement it using the tools/framework/system you have access to.