Interview prep

Data engineering interview questions that actually come up.

The questions real teams ask — SQL, Python, Spark and system design — with the answers hiring managers are listening for. A natural next step after you've walked the DataForge roadmap.

SQL questions (you WILL be asked these)

SQL is the single biggest signal in a data engineering interview. Expect at least one live SQL problem — usually windows or self-joins.

  • Find the 2nd highest salary per department. Tests window functions (DENSE_RANK) vs correlated subqueries. Say why you picked one.
  • Compute a 7-day rolling average of daily active users. Tests RANGE BETWEEN INTERVAL '6' DAY PRECEDING AND CURRENT ROW and handling of days with zero events.
  • Given events, find sessions (30-min inactivity gap). Classic gaps-and-islands with LAG + a running sum.
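  • Deduplicate a table keeping the latest row per key. QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) = 1 on engines that support QUALIFY (Snowflake, BigQuery and Databricks, among others); elsewhere, filter on the row number in a subquery or CTE.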
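  • Explain the difference between WHERE and HAVING, and where QUALIFY fits. WHERE filters rows before grouping, HAVING filters groups after aggregation, QUALIFY filters on window-function results.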
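
The ranking and dedup questions above can be practiced without a warehouse. Here is a minimal sketch using Python's built-in sqlite3 module, assuming a bundled SQLite recent enough for window functions (3.25 or later). The employees and users tables and their rows are made up for illustration. SQLite has no QUALIFY, so the row number is filtered in a subquery instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, dept TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        ('ana', 'eng', 150), ('bo', 'eng', 150), ('cy', 'eng', 120),
        ('di', 'ops', 90),   ('ed', 'ops', 80);

    CREATE TABLE users (id INTEGER, email TEXT, updated_at TEXT);
    INSERT INTO users VALUES
        (1, 'old@example.com',  '2024-01-01'),
        (1, 'new@example.com',  '2024-03-01'),
        (2, 'solo@example.com', '2024-02-01');
""")

# DENSE_RANK gives tied salaries the same rank with no gaps, so rank 2 is
# the second-highest distinct salary even when two people share the top one.
second_highest = conn.execute("""
    SELECT dept, name, salary
    FROM (
        SELECT dept, name, salary,
               DENSE_RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS rnk
        FROM employees
    )
    WHERE rnk = 2
    ORDER BY dept, name
""").fetchall()

# ROW_NUMBER picks exactly one row per id: the one with the latest updated_at.
latest_rows = conn.execute("""
    SELECT id, email, updated_at
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
        FROM users
    )
    WHERE rn = 1
    ORDER BY id
""").fetchall()

print(second_highest)  # [('eng', 'cy', 120), ('ops', 'ed', 80)]
print(latest_rows)     # [(1, 'new@example.com', '2024-03-01'), (2, 'solo@example.com', '2024-02-01')]
```

Worth saying out loud in the interview: with RANK instead of DENSE_RANK, the tie at the top of 'eng' would push 'cy' to rank 3 and the query would return no second-highest row for that department.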

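The sessionization question (LAG plus a running sum) is easier to explain once you have typed it out. A rough sketch, again on sqlite3 with invented data: the events table, the epoch-second ts column and the rule that a gap of exactly 30 minutes stays in the same session are all assumptions for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (user_id TEXT, ts INTEGER);  -- ts: epoch seconds (assumed)
    INSERT INTO events VALUES
        ('u1', 0), ('u1', 600), ('u1', 3000), ('u1', 3300),
        ('u2', 100), ('u2', 2000);
""")

GAP_SECONDS = 30 * 60  # inactivity threshold

sessions = conn.execute("""
    WITH flagged AS (
        -- Step 1: flag a row as a session start if it is the user's first
        -- event or it arrives more than :gap seconds after the previous one.
        SELECT user_id, ts,
               CASE
                   WHEN LAG(ts) OVER (PARTITION BY user_id ORDER BY ts) IS NULL
                     OR ts - LAG(ts) OVER (PARTITION BY user_id ORDER BY ts) > :gap
                   THEN 1 ELSE 0
               END AS is_new_session
        FROM flagged_source
    ),
    numbered AS (
        -- Step 2: a running sum of the flags numbers the sessions per user.
        SELECT user_id, ts,
               SUM(is_new_session) OVER (PARTITION BY user_id ORDER BY ts) AS session_id
        FROM flagged
    )
    SELECT user_id, session_id, MIN(ts), MAX(ts), COUNT(*)
    FROM numbered
    GROUP BY user_id, session_id
    ORDER BY user_id, session_id
""".replace("flagged_source", "events"), {"gap": GAP_SECONDS}).fetchall()

for row in sessions:
    print(row)
```

Each output row is (user_id, session_id, first_ts, last_ts, event_count). With this data, u1 splits into two sessions of two events each, and u2 into two single-event sessions.
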
Python questions

Not LeetCode. Real "can you write a small ETL script" questions.

  • Parse a JSON file with nested arrays and flatten it into rows.
  • Write a generator that streams lines from a large file without loading it into memory. They want to see yield, and to hear that you understand why it keeps memory flat.
  • Retry an HTTP call with exponential backoff. Bonus points for mentioning tenacity or writing a clean decorator.
  • Difference between list, tuple, set, dict — and when you'd use each in a pipeline.
  • What does __init__.py do? What is if __name__ == "__main__"? Sanity checks for people who copy-pasted their way in.
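
The first two questions in this list often show up as one exercise. A small sketch, assuming the input is JSON Lines (one JSON object per line) rather than a single large JSON document, and using a hypothetical order shape with a nested items array:

```python
import json
import os
import tempfile
from typing import Iterator


def stream_lines(path: str) -> Iterator[str]:
    """Yield one non-empty line at a time; only the current line is in memory."""
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:  # file objects iterate lazily, line by line
            line = line.strip()
            if line:
                yield line


def flatten_order(order: dict) -> Iterator[dict]:
    """Emit one flat row per element of the nested 'items' array."""
    for item in order.get("items", []):
        yield {
            "order_id": order["order_id"],
            "customer": order["customer"]["name"],
            "sku": item["sku"],
            "qty": item["qty"],
        }


# Demo data written to a temp file so the sketch is self-contained.
records = [
    {"order_id": 1, "customer": {"name": "ana"},
     "items": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}]},
    {"order_id": 2, "customer": {"name": "bo"}, "items": []},
]
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False,
                                 encoding="utf-8") as tmp:
    for record in records:
        tmp.write(json.dumps(record) + "\n")
    tmp_path = tmp.name

rows = [row
        for line in stream_lines(tmp_path)
        for row in flatten_order(json.loads(line))]
os.remove(tmp_path)

print(rows)
```

Order 2 has an empty items array, so it produces no rows. Whether to drop such orders or emit a row with empty item fields is a design choice worth raising with the interviewer.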

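For the backoff question, a hand-written decorator is usually what they hope to see. The sketch below is one possible shape, not a replacement for tenacity: the parameter names, the default exception types and the injectable sleep function are all choices made for this example, and the HTTP call is faked so the block runs without a network.

```python
import functools
import random
import time
from typing import Callable, Tuple, Type


def retry(max_attempts: int = 4,
          base_delay: float = 0.5,
          max_delay: float = 8.0,
          exceptions: Tuple[Type[Exception], ...] = (ConnectionError, TimeoutError),
          sleep: Callable[[float], None] = time.sleep):
    """Retry the wrapped function, doubling the wait after each failure."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
                    delay = min(max_delay, base_delay * 2 ** (attempt - 1))
                    # Up to 10% jitter so many clients do not retry in lockstep.
                    sleep(delay + random.uniform(0, delay / 10))
        return wrapper
    return decorator


# Demo: a fake call that fails twice, then succeeds. Delays are recorded
# in a list instead of actually sleeping.
delays = []
calls = {"count": 0}


@retry(max_attempts=4, base_delay=0.5, sleep=delays.append)
def flaky_fetch() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network failure")
    return "ok"


result = flaky_fetch()
print(result, calls["count"], len(delays))  # ok 3 2
```

Only retrying a named set of exceptions is deliberate: retrying on every exception would also retry bugs such as a bad URL or a malformed payload.
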
Spark & distributed processing

  • Narrow vs wide transformations. Which ones cause shuffles?
  • What is data skew and how do you fix it? Salting, broadcast joins, AQE.
  • Broadcast join vs sort-merge join — when does each win?
  • How do you tune the number of partitions? Talk about spark.sql.shuffle.partitions, target partition size (~128 MB), and file size after write.
  • Explain lazy evaluation and the DAG. Why .count() can be expensive: it is an action, so it triggers the whole chain of transformations behind it unless an intermediate result is cached.
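
Salting is easier to explain with numbers. The block below is a plain-Python simulation, not Spark code: crc32 stands in for Spark's hash partitioner, and the key names, row counts and partition count are invented. It only shows why appending a salt to a hot key spreads its rows across partitions.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 8
NUM_SALTS = 8


def partition_for(key: str) -> int:
    """Deterministic stand-in for a hash partitioner (an assumption, not Spark's hash)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


# One hot key with 9,000 rows plus 1,000 keys with one row each.
keys = ["hot_customer"] * 9000 + [f"customer_{i}" for i in range(1000)]

# Without salting, every row of the hot key lands in the same partition.
unsalted = Counter(partition_for(key) for key in keys)

# With salting, each row gets a suffix from 0 to NUM_SALTS - 1, so the hot
# key becomes NUM_SALTS distinct keys that hash independently.
salted = Counter(
    partition_for(f"{key}#{i % NUM_SALTS}") for i, key in enumerate(keys)
)

unsalted_max = max(unsalted.values())
salted_max = max(salted.values())
print("largest partition without salting:", unsalted_max)
print("largest partition with salting:   ", salted_max)
```

In a real join the other side has to be expanded with every salt value so the salted keys still match, which is the cost of the technique. That trade-off is usually the follow-up question.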

Warehousing, modeling & dbt

  • Star vs snowflake schema — trade-offs.
  • SCD Type 1 vs Type 2 — when do you need history?
  • How would you model an e-commerce dataset (orders, users, products) for analytics?
  • Explain dbt sources, staging, marts. Where do tests live? What's a snapshot?
  • Incremental models — when to use merge vs append vs delete+insert.
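
The SCD question is mostly about whether you can say what each type does to a row. A toy sketch in plain Python, with hypothetical column names (key, city, valid_from, valid_to); in a real project this logic would more likely live in a dbt snapshot or a warehouse MERGE.

```python
from datetime import date
from typing import Optional


def apply_scd1(dim: dict, key: int, city: str) -> None:
    """Type 1: overwrite in place. The previous value is gone."""
    dim[key] = {"city": city}


def apply_scd2(dim: list, key: int, city: str, effective: date) -> None:
    """Type 2: close the current version and append a new one."""
    current = next(
        (row for row in dim if row["key"] == key and row["valid_to"] is None),
        None,
    )
    if current is not None:
        if current["city"] == city:
            return  # nothing changed, so no new version
        current["valid_to"] = effective
    dim.append({"key": key, "city": city,
                "valid_from": effective, "valid_to": None})


def city_as_of(dim: list, key: int, day: date) -> Optional[str]:
    """Point-in-time lookup, only possible with Type 2 history."""
    for row in dim:
        if (row["key"] == key and row["valid_from"] <= day
                and (row["valid_to"] is None or day < row["valid_to"])):
            return row["city"]
    return None


type1 = {}
apply_scd1(type1, 42, "Lisbon")
apply_scd1(type1, 42, "Porto")

type2 = []
apply_scd2(type2, 42, "Lisbon", date(2024, 1, 1))
apply_scd2(type2, 42, "Porto", date(2024, 6, 1))
apply_scd2(type2, 42, "Porto", date(2024, 7, 1))  # unchanged value: no-op

print(type1)                                   # {42: {'city': 'Porto'}}
print(len(type2))                              # 2
print(city_as_of(type2, 42, date(2024, 3, 1))) # Lisbon
```

The last line is the answer to "when do you need history?": an order placed in March should be attributed to Lisbon, and only the Type 2 table can still tell you that.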

System design (the round that decides the offer)

System design separates junior from mid, and mid from senior. There's rarely one right answer — they're grading how you reason.

  • Design a pipeline for near real-time analytics on user events. Kafka → Spark Structured Streaming / Flink → Iceberg → dbt → warehouse. Discuss latency SLAs and exactly-once semantics.
  • Design a data warehouse for an e-commerce company. Sources, ingestion pattern (CDC vs snapshot), staging, marts, orchestration, monitoring.
  • How do you handle schema evolution? Avro/Protobuf, backward/forward compatibility, contract testing, migrations in the warehouse.
  • Batch vs streaming — how do you choose? Latency, cost, complexity, freshness requirement.
  • Idempotency and exactly-once — how do you actually guarantee them? Merge keys, dedupe windows, transactional sinks, Kafka + Iceberg transactions.
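
For the idempotency question, "merge keys" is easier to defend with a concrete replay. A minimal sketch on sqlite3, assuming a SQLite version with upsert support (3.24 or later); the orders table and its rows are invented. It illustrates an idempotent sink that makes at-least-once delivery look like exactly-once in the result, not Kafka or Iceberg transactions themselves.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, status TEXT, updated_at TEXT)"
)


def load_batch(batch) -> None:
    """Upsert on the merge key; an older version never overwrites a newer one."""
    with conn:  # one transaction per batch: every row commits, or none do
        conn.executemany(
            """
            INSERT INTO orders (order_id, status, updated_at)
            VALUES (?, ?, ?)
            ON CONFLICT(order_id) DO UPDATE SET
                status = excluded.status,
                updated_at = excluded.updated_at
            WHERE excluded.updated_at > orders.updated_at
            """,
            batch,
        )


batch = [
    (1, "placed",  "2024-05-01T10:00:00"),
    (2, "placed",  "2024-05-01T10:05:00"),
    (1, "shipped", "2024-05-01T12:00:00"),
]

load_batch(batch)
after_first_load = conn.execute("SELECT * FROM orders ORDER BY order_id").fetchall()

load_batch(batch)  # simulated replay, e.g. the job was retried after a timeout
after_replay = conn.execute("SELECT * FROM orders ORDER BY order_id").fetchall()

print(after_first_load)
print(after_first_load == after_replay)  # True
```

The timestamp guard in the WHERE clause is what keeps a replayed "placed" event from undoing "shipped". A plain overwrite on conflict would be idempotent for an identical replay but not for out-of-order delivery.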

Behavioral questions to expect

  • "Tell me about a pipeline that broke in production. What did you do?"
  • "How do you handle a stakeholder who wants a dashboard 'by tomorrow' on data that doesn't exist yet?"
  • "Describe a technical decision you disagreed with. How did you handle it?"
  • "How do you document a pipeline so the next person doesn't page you at 3am?"

Use the STAR framework (Situation, Task, Action, Result). Keep answers under 2 minutes.

How DataForge prepares you

The 14-course roadmap covers every fundamental behind these questions — SQL windows, Spark internals, dbt patterns, Kafka, Iceberg, orchestration. Bug Hunt drills you on real broken pipelines. Once you can complete the roadmap and hunt bugs unassisted, most of these questions become one-line answers instead of stumbling paragraphs.

FAQ

How should I prepare for a data engineering interview?
Cover four fronts: SQL (windows, joins, aggregations), Python (data structures, iterators, small ETL scripts), one distributed system (Spark or Kafka fundamentals), and system design (batch vs streaming, warehouse modeling, orchestration trade-offs). Practice out loud: most rejections come from unclear communication, not wrong answers.

How hard are data engineering interviews compared to software engineering?
Different, not harder. Less LeetCode grinding, more SQL reasoning, data modeling and pipeline design. If you can explain WHY you chose a partitioning strategy or how you'd handle late-arriving data, you're already ahead of most candidates.

Do data engineering interviews include LeetCode?
Sometimes, but usually easy/medium level and often SQL-focused instead of algorithm-focused. FAANG-tier companies still ask coding rounds; most others focus on SQL, Python scripting, and system design.

What system design questions should I expect?
Common ones: 'design a pipeline for real-time analytics', 'design a data warehouse for an e-commerce company', 'how would you handle schema evolution', 'batch vs streaming for this use case', 'how do you guarantee idempotency and exactly-once processing'.

How long should I prepare?
Two to four weeks of focused prep is enough if your fundamentals are already solid. If you're still shaky on SQL windows or Spark internals, spend the time on fundamentals first. No amount of interview practice compensates for missing basics.
