Data engineering interview questions that actually come up.
SQL questions (you WILL be asked these)
SQL is the single biggest signal in a data engineering interview. Expect at least one live SQL problem — usually windows or self-joins.
- Find the 2nd highest salary per department. Tests window functions (DENSE_RANK) vs correlated subqueries. Say why you picked one.
- Compute a 7-day rolling average of daily active users. Tests RANGE BETWEEN INTERVAL '6' DAY PRECEDING AND CURRENT ROW and handling of days with zero events.
- Given events, find sessions (30-min inactivity gap). Classic gaps-and-islands with LAG + a running sum.
- Deduplicate a table keeping the latest row per key. QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) = 1.
- Explain the difference between WHERE and HAVING. And where QUALIFY fits.
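The sessions question is the one people most often fumble, so here is a minimal sketch of the LAG + running sum pattern. It runs the SQL through Python's built-in sqlite3 module (window functions need SQLite 3.25 or newer). The events table, its columns, the sample rows, and the choice to treat a gap of strictly more than 1800 seconds as a new session are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical sample data: (user_id, event time in Unix seconds).
EVENTS = [
    ("u1", 0),
    ("u1", 600),    # 10 min after the previous event: same session
    ("u1", 3000),   # 40 min gap: new session
    ("u1", 3300),
    ("u2", 100),
    ("u2", 2000),   # 31 min 40 s gap: new session
]

SESSIONIZE_SQL = """
WITH flagged AS (
    SELECT
        user_id,
        ts,
        -- 1 when this event starts a new session, else 0
        CASE
            WHEN LAG(ts) OVER (PARTITION BY user_id ORDER BY ts) IS NULL THEN 1
            WHEN ts - LAG(ts) OVER (PARTITION BY user_id ORDER BY ts) > 1800 THEN 1
            ELSE 0
        END AS is_new_session
    FROM events
)
SELECT
    user_id,
    ts,
    -- the running sum of the flags numbers the sessions within each user
    SUM(is_new_session) OVER (PARTITION BY user_id ORDER BY ts) AS session_id
FROM flagged
ORDER BY user_id, ts
"""


def sessionize(events):
    """Load events into an in-memory SQLite table and assign session ids."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (user_id TEXT, ts INTEGER)")
    conn.executemany("INSERT INTO events VALUES (?, ?)", events)
    rows = conn.execute(SESSIONIZE_SQL).fetchall()
    conn.close()
    return rows
```

In an interview, say the two steps out loud: flag the row where a new island starts, then turn the flags into ids with a running sum. SQLite has no QUALIFY, so the dedup question would need a subquery there, which is worth mentioning if the interviewer's dialect is unclear.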
Python questions
Not LeetCode. Real "can you write a small ETL script" questions.
- Parse a JSON file with nested arrays and flatten it into rows.
- Write a generator that streams lines from a large file without loading it into memory. They want to see you use yield and understand why.
- Retry an HTTP call with exponential backoff. Bonus points for mentioning tenacity or writing a clean decorator.
- Difference between list, tuple, set, dict, and when you'd use each in a pipeline.
- What does __init__.py do? What is if __name__ == "__main__"? Sanity checks for people who copy-pasted their way in.
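For the retry question, a hand-written decorator could look roughly like the sketch below. It uses only the standard library. The parameter names, the default values, the use of full jitter, and the fetch helper are assumptions for illustration, not a reference implementation (tenacity would be the off-the-shelf alternative).

```python
import functools
import random
import time
import urllib.error
import urllib.request


def retry(max_attempts=5, base_delay=0.5, max_delay=30.0,
          exceptions=(Exception,), sleep=time.sleep):
    """Retry the wrapped function with exponential backoff and jitter."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
                    # Upper bound doubles each attempt: base, 2*base, 4*base, ...
                    delay = min(max_delay, base_delay * 2 ** (attempt - 1))
                    # Full jitter: sleep a random time up to that bound so many
                    # clients do not retry in lockstep.
                    sleep(random.uniform(0, delay))
        return wrapper
    return decorator


# Hypothetical usage: retry only on network-level errors.
@retry(max_attempts=4, exceptions=(urllib.error.URLError, TimeoutError))
def fetch(url):
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read()
```

Passing sleep as a parameter is a small design choice that keeps the decorator testable without real waiting. A production script would likely also avoid retrying most 4xx responses, since repeating a bad request rarely helps.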
Spark & distributed processing
- Narrow vs wide transformations. Which ones cause shuffles?
- What is data skew and how do you fix it? Salting, broadcast joins, AQE.
- Broadcast join vs sort-merge join: when does each win?
- How do you tune the number of partitions? Talk about spark.sql.shuffle.partitions, target partition size (~128MB), and file size after write.
- Explain lazy evaluation and the DAG. Why .count() can be expensive on a chain of transformations.
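The partition-tuning answer is easier to give with the arithmetic in hand. Below is a back-of-the-envelope sketch in plain Python (no Spark required) that turns the ~128MB target into a partition count. The function name, its inputs, and the "at least one partition per core" floor are assumptions and a common heuristic, not a Spark API.

```python
import math

TARGET_PARTITION_BYTES = 128 * 1024 ** 2  # ~128MB rule of thumb


def suggest_shuffle_partitions(shuffle_bytes, total_cores,
                               target_bytes=TARGET_PARTITION_BYTES):
    """Suggest a shuffle partition count for one stage.

    shuffle_bytes: estimated data shuffled by the stage (hypothetical input,
                   for example read off the Spark UI).
    total_cores:   executor cores available to the job.
    """
    by_size = math.ceil(shuffle_bytes / target_bytes)
    # Heuristic floor: with fewer partitions than cores, some cores sit idle.
    return max(by_size, total_cores, 1)


# 100 GiB of shuffle data on 64 cores: 100 * 1024 / 128 = 800 partitions.
print(suggest_shuffle_partitions(100 * 1024 ** 3, total_cores=64))
```

The result is what you would feed into spark.sql.shuffle.partitions as a starting point. With AQE enabled, Spark can coalesce small shuffle partitions at runtime, so treat the number as an initial guess to validate against actual task sizes.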
Warehousing, modeling & dbt
- Star vs snowflake schema — trade-offs.
- SCD Type 1 vs Type 2 — when do you need history?
- How would you model an e-commerce dataset (orders, users, products) for analytics?
- Explain dbt sources, staging, marts. Where do tests live? What's a snapshot?
- Incremental models: when to use merge vs append vs delete+insert.
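The SCD question lands better if you can show what each type does to the rows. Here is a minimal in-memory sketch: the row layout (key, attrs, valid_from, valid_to) and the function names are assumptions for illustration. In dbt, a snapshot is the feature that maintains Type 2 history for you.

```python
from datetime import date

OPEN_END = None  # valid_to of the current (still open) version of a row


def apply_scd1(dim, key, attrs):
    """Type 1: overwrite in place, so no history is kept."""
    dim[key] = dict(attrs)


def apply_scd2(dim_rows, key, attrs, effective):
    """Type 2: close the current version and append a new one."""
    current = next(
        (r for r in dim_rows if r["key"] == key and r["valid_to"] is OPEN_END),
        None,
    )
    if current is not None:
        if current["attrs"] == attrs:
            return  # nothing changed, keep the current version open
        current["valid_to"] = effective  # close the old version
    dim_rows.append(
        {"key": key, "attrs": dict(attrs),
         "valid_from": effective, "valid_to": OPEN_END}
    )


if __name__ == "__main__":
    history = []
    apply_scd2(history, 1, {"city": "Paris"}, date(2024, 1, 1))
    apply_scd2(history, 1, {"city": "Lyon"}, date(2024, 6, 1))
    for row in history:
        print(row)
```

Type 1 answers "what is the customer's city now?". Type 2 also answers "what was it when this order was placed?", which is the usual reason to pay for the extra rows.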
System design (the round that decides the offer)
System design separates junior from mid, and mid from senior. There's rarely one right answer — they're grading how you reason.
- Design a pipeline for near real-time analytics on user events. Kafka → Spark Streaming / Flink → Iceberg → dbt → warehouse. Discuss latency SLAs and exactly-once.
- Design a data warehouse for an e-commerce company. Sources, ingestion pattern (CDC vs snapshot), staging, marts, orchestration, monitoring.
- How do you handle schema evolution? Avro/Protobuf, backward/forward compatibility, contract testing, migrations in the warehouse.
- Batch vs streaming — how do you choose? Latency, cost, complexity, freshness requirement.
- Idempotency and exactly-once — how do you actually guarantee them? Merge keys, dedupe windows, transactional sinks, Kafka + Iceberg transactions.
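For the idempotency question, it helps to show why a merge key makes replays harmless. The sketch below models a sink as a dict keyed by a merge key. The field names id and updated_at and the "newest version wins" rule are assumptions for illustration; a real sink would express the same logic as a MERGE statement or a transactional commit.

```python
def merge_batch(table, batch):
    """Upsert a batch into `table` (a dict keyed by merge key).

    A row only replaces the stored one if it is strictly newer, so replaying
    a batch, or receiving rows out of order, leaves the result unchanged.
    """
    for row in batch:
        existing = table.get(row["id"])
        if existing is None or row["updated_at"] > existing["updated_at"]:
            table[row["id"]] = row
    return table


# Hypothetical batch containing a duplicate key and an out-of-order row.
batch = [
    {"id": 1, "updated_at": 10, "status": "paid"},
    {"id": 2, "updated_at": 5, "status": "new"},
    {"id": 1, "updated_at": 7, "status": "pending"},  # older, ignored
]

table = merge_batch({}, batch)
replayed = merge_batch(dict(table), batch)  # simulate a redelivered batch
print(table == replayed)
```

The point to make in the interview: the source can stay at-least-once as long as the write is idempotent, because duplicates then have no effect on the final state.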
Behavioral questions to expect
- "Tell me about a pipeline that broke in production. What did you do?"
- "How do you handle a stakeholder who wants a dashboard 'by tomorrow' on data that doesn't exist yet?"
- "Describe a technical decision you disagreed with. How did you handle it?"
- "How do you document a pipeline so the next person doesn't page you at 3am?"
Use the STAR framework (Situation, Task, Action, Result). Keep answers under 2 minutes.
How DataForge prepares you
The 14-course roadmap covers every fundamental behind these questions — SQL windows, Spark internals, dbt patterns, Kafka, Iceberg, orchestration. Bug Hunt drills you on real broken pipelines. Once you can complete the roadmap and hunt bugs unassisted, most of these questions become one-line answers instead of stumbling paragraphs.
FAQ
- How should I prepare for a data engineering interview?
- Cover four fronts: SQL (windows, joins, aggregations), Python (data structures, iterators, small ETL scripts), one distributed system (Spark or Kafka fundamentals), and system design (batch vs streaming, warehouse modeling, orchestration trade-offs). Practice out loud — most rejections come from unclear communication, not wrong answers.
- How hard are data engineering interviews compared to software engineering?
- Different, not harder. Less LeetCode grinding, more SQL reasoning, data modeling and pipeline design. If you can explain WHY you chose a partitioning strategy or how you'd handle late-arriving data, you're already ahead of most candidates.
- Do data engineering interviews include LeetCode?
- Sometimes, but usually easy/medium level and often SQL-focused instead of algorithm-focused. FAANG-tier companies still ask coding rounds; most others focus on SQL, Python scripting, and system design.
- What system design questions should I expect?
- Common ones: 'design a pipeline for real-time analytics', 'design a data warehouse for an e-commerce company', 'how would you handle schema evolution', 'batch vs streaming for this use case', 'how do you guarantee idempotency and exactly-once processing'.
- How long should I prepare?
- Two to four weeks of focused prep is enough if your fundamentals are already solid. If you're still shaky on SQL windows or Spark internals, spend the time on fundamentals first — no amount of interview practice compensates for missing basics.
Ready to start?
7 days free. Then less than a coffee per month.
Prep with the courses
- No credit card for the trial
- Cancel anytime
- 300+ exercises
- 14 full courses