Data Engineering Glossary
The vocabulary every data professional should speak fluently. 160+ terms across ingestion, modeling, streaming, governance, LLMs and more.
Backfill
Re-running a pipeline over historical dates to fill or fix data.
CDC (Change Data Capture)
Streaming row-level INSERT/UPDATE/DELETE from a source DB by tailing its WAL/binlog.
ELT
Extract, Load (raw), then Transform in-warehouse with SQL/dbt.
ETL
Extract, Transform, Load: the older paradigm of transforming data in flight, before loading it.
Watermark
Stored timestamp marking how far an incremental loader has progressed.
Idempotent
Running the job twice gives the same result as running it once.
Upsert / MERGE
Insert if absent, update if present — atomic in one statement.
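A minimal sketch of an idempotent upsert, using SQLite's `INSERT ... ON CONFLICT DO UPDATE` as a stand-in for a warehouse `MERGE` (the syntax differs between engines). The `customers` table and the `upsert` helper are hypothetical names chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT)")

def upsert(conn, row_id, email):
    # Insert if the key is absent, otherwise update the existing row.
    conn.execute(
        "INSERT INTO customers (id, email) VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET email = excluded.email",
        (row_id, email),
    )

upsert(conn, 1, "a@example.com")
upsert(conn, 1, "a@example.com")  # re-run: still one row (idempotent)
upsert(conn, 1, "b@example.com")  # same key, new value: updated in place
rows = conn.execute("SELECT id, email FROM customers").fetchall()
```

Running the same call twice leaves a single row, which is what makes retries and backfills safe.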
Snapshot
Full copy of a table at a point in time.
Incremental load
Loading only rows that changed since the last run.
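One way an incremental load can use a stored watermark, sketched under the assumption that every source row carries an `updated_at` timestamp. The function name and row shape are hypothetical.

```python
from datetime import datetime

def incremental_load(source_rows, watermark):
    """Return rows changed after `watermark` and the advanced watermark."""
    new_rows = [r for r in source_rows if r["updated_at"] > watermark]
    # Advance the watermark only as far as the data actually seen.
    new_watermark = max((r["updated_at"] for r in new_rows), default=watermark)
    return new_rows, new_watermark

rows = [
    {"id": 1, "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "updated_at": datetime(2024, 1, 3)},
]
batch, wm = incremental_load(rows, datetime(2024, 1, 2))
```

A second run with the returned watermark picks up nothing until the source changes again.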
Schema drift
The source schema changes without telling you.
MAR (Monthly Active Rows)
Fivetran's billing unit: each distinct row inserted, updated or deleted during the month counts once, however many times it changes.
Connector
Pre-built integration that extracts from a source (Stripe, Salesforce…).
Tap / Target (Singer)
Singer convention: tap reads from a source, target writes to a sink.
Binlog / WAL
Database transaction log CDC tools tail to capture changes.
Dead Letter Queue (DLQ)
Side topic/table for messages the pipeline couldn't process.
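A small sketch of the idea, assuming messages are JSON strings and the DLQ is just an in-memory list. In a real pipeline it would be a topic or table, and `process_with_dlq` is a hypothetical name.

```python
import json

def process_with_dlq(messages):
    """Parse each message; route failures to a dead letter list."""
    processed, dead_letters = [], []
    for raw in messages:
        try:
            processed.append(json.loads(raw))
        except json.JSONDecodeError as exc:
            # Keep the payload and the error so it can be inspected or replayed.
            dead_letters.append({"payload": raw, "error": str(exc)})
    return processed, dead_letters

ok, dlq = process_with_dlq(['{"id": 1}', "not json", '{"id": 2}'])
```

The bad message does not block the good ones; it is set aside with enough context to debug it later.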
Parquet
Columnar binary file format optimized for analytics.
ORC
Columnar format from the Hive world — similar to Parquet, Hadoop-native.
Avro
Row-based binary format with embedded schema — great for streaming.
Apache Iceberg
Open table format over Parquet (or ORC/Avro) data files, with ACID, time travel and hidden partitioning.
Delta Lake
Open table format created by Databricks: Parquet files plus a transaction log.
Apache Hudi
Open table format optimized for upserts and incremental queries.
Lakehouse
Lake (cheap object storage) with warehouse features (ACID, SQL) on top.
Partition
Splitting a table by a column so queries can skip irrelevant files.
Bucketing
Hash-based split of a partition into a fixed number of files; helps with joins and skew.
Z-ordering
Multi-dimensional clustering that co-locates related rows in the same files so queries can skip more data (file skipping, not column pruning).
VACUUM / Compaction
Removing old/unreferenced files; merging small files into bigger ones.
Small files problem
Thousands of tiny files that degrade query performance and inflate metadata overhead.
Time travel
Querying a table as it existed at a past version or timestamp.
Object storage
Cheap, flat, HTTP-addressable storage — S3, GCS, ADLS.
Columnar storage
Storing values column-by-column so analytic scans only read needed columns.
Schema Registry
Central service that stores and validates evolving schemas (often Avro).
Star schema
Fact table at the center, dimension tables around it — analytics default.
Snowflake schema
Star schema with normalized dimensions (dims linked to sub-dims).
Fact table
Table of measurable events (sales, clicks) with FKs to dimensions.
Dimension table
Descriptive context for facts (who, what, where, when).
SCD (Slowly Changing Dimension)
Patterns for tracking history of dimension changes: Type 1 (overwrite), Type 2 (new row + dates), Type 3 (extra column).
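Type 2 is the pattern that most benefits from a concrete picture. The sketch below keeps the dimension as a list of dicts; the field names `valid_from` and `valid_to` and the helper `apply_scd2` are illustrative assumptions, not a standard.

```python
from datetime import date

def apply_scd2(history, key, new_attrs, effective):
    """Close the current row for `key` and append a new version if attrs changed."""
    current = next(
        (r for r in history if r["key"] == key and r["valid_to"] is None), None
    )
    if current is not None and current["attrs"] == new_attrs:
        return history  # nothing changed, keep the open row
    if current is not None:
        current["valid_to"] = effective  # close the old version
    history.append(
        {"key": key, "attrs": new_attrs, "valid_from": effective, "valid_to": None}
    )
    return history

dim = []
apply_scd2(dim, "cust-1", {"city": "Lisbon"}, date(2024, 1, 1))
apply_scd2(dim, "cust-1", {"city": "Porto"}, date(2024, 6, 1))
```

After the second call the Lisbon row is closed on 2024-06-01 and the Porto row is the open, current version.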
Surrogate key
Synthetic primary key (e.g. an auto-int or hash), independent of business keys.
Natural key / Business key
Real-world identifier from the source system (e.g. order_id, email).
Grain
What one row of a fact table represents — declare it before building.
Data Vault
Hubs + Links + Satellites — modeling style optimized for auditability and source change.
Medallion (Bronze/Silver/Gold)
Layered lake architecture: raw bronze → cleaned silver → consumer gold.
Wide / OBT (One Big Table)
Pre-joined denormalized table for fast read — common in analytics layers.
Normalization
Splitting data into many tables to remove redundancy — OLTP default.
Denormalization
Duplicating data into one table to make reads cheap.
Metrics layer / Semantic layer
Single source of truth for business metric definitions (MRR, ARPU…).
Apache Kafka
Distributed append-only log — the backbone of event-driven data.
Topic
Named stream in Kafka — split into partitions.
Consumer group
Set of consumers that share partitions — each partition goes to one member.
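A toy round-robin assignment illustrates the "one member per partition" rule. Kafka's real assignors are more involved, and `assign_partitions` is a hypothetical helper, not part of any Kafka client.

```python
def assign_partitions(partitions, members):
    """Give each partition to exactly one group member, round-robin."""
    assignment = {m: [] for m in members}
    for i, partition in enumerate(partitions):
        assignment[members[i % len(members)]].append(partition)
    return assignment

assignment = assign_partitions([0, 1, 2, 3, 4], ["c1", "c2"])
```

Adding a member spreads the same partitions over more consumers; members beyond the partition count would sit idle.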
Exactly-once
Each event affects state exactly once even with retries and failures.
At-least-once
Events may be delivered more than once — consumers must be idempotent.
Watermark (streaming)
Time marker telling the engine 'no more events older than this'.
Event time vs processing time
Event time = when it happened. Processing time = when we saw it.
Windowing
Grouping events into bounded chunks (tumbling, sliding, session) for aggregation.
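A tumbling window is the simplest case to sketch: fixed-size, non-overlapping buckets keyed by window start. The example assumes event times are integer seconds; `tumbling_counts` is a hypothetical name.

```python
from collections import defaultdict

def tumbling_counts(event_times, size_s):
    """Count events per fixed, non-overlapping window of `size_s` seconds."""
    counts = defaultdict(int)
    for event_time in event_times:
        # Every event belongs to exactly one window.
        window_start = (event_time // size_s) * size_s
        counts[window_start] += 1
    return dict(counts)

counts = tumbling_counts([1, 4, 61, 65, 119, 120], size_s=60)
```

Sliding and session windows differ in that an event may fall into several windows or the window boundaries depend on gaps between events.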
Stateful processing
The stream operator keeps memory across events (aggregates, joins, sessions).
Apache Flink
True streaming engine — event-by-event, low latency, strong state.
Spark Structured Streaming
Micro-batch streaming on the Spark engine — easier ops than Flink.
Micro-batch
'Streaming' implemented by tiny batches every N seconds.
Kafka Connect
Framework for source/sink connectors that move data in/out of Kafka.
ksqlDB / Kafka Streams
Streaming SQL engine / Java library for processing Kafka data. Kafka Streams runs inside your application with no separate processing cluster; ksqlDB runs as its own server.
Backpressure
Slow consumer signals upstream to slow down, preventing memory blow-up.
DAG
Directed Acyclic Graph — nodes are tasks, edges are dependencies, no cycles.
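The standard library's `graphlib` (Python 3.9+) is enough to show why acyclicity matters: a valid run order exists only when there are no cycles. The task names and the `has_cycle` helper are made up for illustration.

```python
from graphlib import CycleError, TopologicalSorter

# Each task maps to the set of tasks it depends on.
deps = {
    "load": {"transform"},
    "transform": {"extract"},
    "extract": set(),
}
order = list(TopologicalSorter(deps).static_order())

def has_cycle(graph):
    """Return True if the dependency graph cannot be ordered."""
    try:
        list(TopologicalSorter(graph).static_order())
        return False
    except CycleError:
        return True
```

Orchestrators perform a similar ordering step before scheduling tasks.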
Apache Airflow
Python-based scheduler — define DAGs in code, run them on a cluster.
Dagster
Asset-oriented orchestrator — you declare data assets, deps are inferred.
Prefect
Modern Python orchestrator — dynamic flows, hybrid execution.
Scheduler
Component that decides when to trigger a job (cron, sensor, manual).
Sensor
Task that waits for an external event (file arrived, table updated).
SLA / SLO / SLI
SLI=metric, SLO=target, SLA=contract with consequences. Pipelines need them.
Catchup / Backfill (Airflow)
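Catchup: the Airflow scheduler creates runs for every schedule interval missed since the DAG's start date. Backfill: manually running a DAG over a chosen historical date range.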
Retry / Backoff
Auto re-running failed tasks with growing delay (exponential backoff).
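A minimal sketch of exponential backoff, with the delay doubling after each failure. The `sleep` parameter is an assumption added so the example can run without actually waiting; production versions usually add jitter and a delay cap.

```python
import time

def retry(fn, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call `fn`, retrying on exception with exponentially growing delays."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries, surface the error
            sleep(base_delay * 2 ** attempt)

# Usage with a fake sleep that records the delays instead of waiting.
delays = []
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

result = retry(flaky, sleep=delays.append)
```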
Idempotency key
Unique token a client sends so a server can dedupe retried calls.
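Server-side deduplication can be sketched as a lookup keyed by the token. `PaymentServer` is a toy, in-memory stand-in; a real service would persist the keys and expire them.

```python
class PaymentServer:
    """Toy server that dedupes retried requests by idempotency key."""

    def __init__(self):
        self._results = {}
        self.charges = 0

    def charge(self, idempotency_key, amount):
        if idempotency_key in self._results:
            # Retried call: replay the stored response, do not charge again.
            return self._results[idempotency_key]
        self.charges += 1
        result = {"charge_id": self.charges, "amount": amount}
        self._results[idempotency_key] = result
        return result

server = PaymentServer()
first = server.charge("key-123", 50)
retried = server.charge("key-123", 50)
```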
Data quality
How well data fits its purpose: accuracy, completeness, freshness, uniqueness.
Great Expectations
Python library declaring 'expectations' (assertions) over data.
Soda / Soda Core
YAML-based data checks — runs in CI or in your warehouse on a schedule.
Freshness
How recently a table was updated vs SLA expectation.
Volume check
Anomaly check: row count today vs typical — flags drops/spikes.
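One simple way to define "typical" is a z-score against recent history. The threshold of 3 standard deviations and the name `volume_anomaly` are assumptions for this sketch; real tools often account for seasonality as well.

```python
from statistics import mean, stdev

def volume_anomaly(history, today, threshold=3.0):
    """Flag today's row count if it is more than `threshold` std devs from the mean."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu  # flat history: any change is unusual
    return abs(today - mu) / sigma > threshold

history = [1000, 1020, 980, 1010, 990]
dropped = volume_anomaly(history, 400)
normal = volume_anomaly(history, 1005)
```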
Lineage
The dependency graph: which source feeds which model feeds which dashboard.
Data observability
Monitoring data the way SREs monitor services: freshness, volume, schema, distribution, lineage.
Anomaly detection
ML or stats catching unexpected changes (counts, distributions, freshness).
dbt test
Built-in dbt assertions: unique, not_null, accepted_values, relationships.
Data diff
Comparing two table versions row-by-row to see what a change would alter.
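A sketch of a keyed diff, assuming each table version fits in memory as a dict from primary key to row value. `data_diff` is a hypothetical helper; dedicated tools do this at warehouse scale with hashing and sampling.

```python
def data_diff(old, new):
    """Compare two versions of a table keyed by primary key."""
    added = {k: new[k] for k in new.keys() - old.keys()}
    removed = {k: old[k] for k in old.keys() - new.keys()}
    changed = {
        k: (old[k], new[k]) for k in old.keys() & new.keys() if old[k] != new[k]
    }
    return added, removed, changed

old = {1: "a", 2: "b", 3: "c"}
new = {1: "a", 2: "B", 4: "d"}
added, removed, changed = data_diff(old, new)
```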
Data contract
Producer-consumer agreement on schema, semantics and SLAs; versioned, with breaking changes allowed only on a version bump.
Data Mesh
Org pattern: domain teams own and serve their data as products.
Data product
A dataset treated as a product: owner, SLA, docs, contract, discoverability.
Data catalog
Searchable inventory of datasets with owner, schema, docs, lineage.
Unity Catalog
Databricks' unified governance: tables, ML, files, lineage, audit, fine-grained access.
PII
Personally Identifiable Information — must be tagged, masked, access-controlled.
GDPR / LGPD
EU / Brazilian data protection laws — right to erasure, consent, purpose limitation.
RBAC / ABAC
Role-based vs attribute-based access control.
Dynamic data masking
Hiding sensitive values at query time based on the caller's role.
Data steward
Person accountable for a dataset's definitions, quality, access.
OpenLineage
Open standard for emitting lineage events from any tool.
IaC (Infrastructure as Code)
Provisioning infra via versioned code instead of clicks.
Terraform / OpenTofu
Cloud-agnostic IaC tool — declares desired state, plans the diff, applies.
VPC
Virtual Private Cloud — isolated network in AWS/GCP/Azure.
IAM
Identity & Access Management — who can do what on cloud resources.
Kubernetes (K8s)
Container orchestrator — manages pods, services, scaling.
Serverless
Compute that scales to zero and bills per invocation (Lambda, Cloud Run).
Egress cost
Cloud charges for data leaving its network — sneaky bill killer.
BigQuery
GCP's serverless analytics warehouse — pay per scanned byte or slot.
Snowflake
Cloud-native warehouse separating storage from per-second compute (warehouses).
Redshift
AWS's warehouse — classic provisioned or Serverless modes.
Athena
AWS serverless SQL over S3, built on Presto/Trino.
AWS Glue
AWS managed ETL + data catalog on Spark.
Databricks
Managed Spark + Delta + ML, positioned as the 'lakehouse' platform.
Microsoft Fabric
Microsoft's unified data platform: OneLake + Synapse + Power BI under SKU pricing.
LLM
Large Language Model — transformer trained on huge text corpora.
Token
Sub-word unit an LLM reads/writes — billing and context limits use tokens.
Context window
Max tokens the model can consider in one call (input + output).
Embedding
Vector representation of text/image — close vectors = similar meaning.
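"Close" is commonly measured with cosine similarity. The sketch below uses plain Python and made-up 2-dimensional vectors; real embeddings have hundreds or thousands of dimensions, and the function assumes non-zero vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors, invented for illustration only.
cat, kitten, car = [0.9, 0.1], [0.8, 0.2], [0.1, 0.9]
sim_close = cosine_similarity(cat, kitten)
sim_far = cosine_similarity(cat, car)
```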
RAG (Retrieval-Augmented Generation)
Retrieve relevant chunks, stuff them into the prompt, then generate.
Vector database
DB optimized for nearest-neighbor search on embeddings (Pinecone, pgvector, Weaviate).
Chunking
Splitting docs into pieces sized for embedding + retrieval.
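A minimal fixed-size chunker with overlap, working on a list of words rather than tokens for simplicity. The sizes and the name `chunk_text` are illustrative; production chunkers often split on sentence or section boundaries instead.

```python
def chunk_text(words, size, overlap):
    """Split a list of words into chunks of `size` sharing `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    if not words:
        return []
    step = size - overlap
    # Stop once the remaining words are already covered by the previous chunk.
    return [words[i:i + size] for i in range(0, max(len(words) - overlap, 1), step)]

chunks = chunk_text("a b c d e f g".split(), size=3, overlap=1)
```

The overlap keeps context that straddles a boundary retrievable from at least one chunk.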
Reranker
Second-pass model that re-orders retrieved results by true relevance.
Agent
LLM that plans + calls tools in a loop to accomplish a goal.
Tool calling / Function calling
LLM emits structured JSON to invoke a function — your code runs it.
Fine-tuning
Continuing training of a base model on your task-specific data.
Prompt engineering
Crafting LLM inputs (system, role, examples) to get better outputs.
Hallucination
Confidently-wrong LLM output — invents facts not in input/data.
Eval (LLM evaluation)
Systematic measurement of LLM output quality (Ragas, LangSmith, custom).
Guardrails
Filters/validators around LLM input/output (PII redaction, jailbreak detection, JSON schema).
MLOps
DevOps practices for ML: training pipelines, model registry, monitoring drift.
Feature store
Repository serving consistent features to training AND online inference.
Shuffle
Redistributing rows across nodes for a join/group — Spark's #1 cost.
Broadcast join
Replicate the small side of a join to every executor — skips shuffle.
Data skew
One key has way more rows than others — one task takes forever.
Predicate pushdown
Pushing WHERE filters down to the data source or file reader so fewer rows are read.
Column pruning
Reading only the columns the query needs — free with columnar formats.
EXPLAIN / Query plan
Engine's plan to run a query — scan, join, aggregate ordering.
Materialized view
Precomputed query stored as a table; refreshed periodically or incrementally.
Caching (warehouse)
Reusing previous query results when underlying data hasn't changed.
AQE (Adaptive Query Execution)
Spark re-plans during execution based on real runtime stats.
Photon
Databricks' vectorized C++ engine; often cited as 2-3x faster than vanilla Spark on SQL, though the gain depends on the workload.
DuckDB
In-process columnar OLAP DB — SQLite for analytics.
Lambda architecture
Two paths: slow batch (truth) + fast stream (low latency), merged at serve time.
Kappa architecture
One streaming path for everything — reprocess history by replaying.
CQRS
Separate read and write models; writes often go to one store and reads to another.
Event sourcing
Store every state change as an immutable event; current state = replay.
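"Current state = replay" can be shown with a fold over the event log. The event types `deposited` and `withdrawn` are hypothetical, chosen for a toy bank account.

```python
def replay(events):
    """Rebuild an account balance by folding over its immutable event log."""
    balance = 0
    for event in events:
        if event["type"] == "deposited":
            balance += event["amount"]
        elif event["type"] == "withdrawn":
            balance -= event["amount"]
    return balance

log = [
    {"type": "deposited", "amount": 100},
    {"type": "withdrawn", "amount": 30},
    {"type": "deposited", "amount": 5},
]
balance = replay(log)
```

Replaying a prefix of the log gives the state at that earlier point, which is where the audit and debugging value comes from.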
Central platform vs Data Mesh
Central team owns everything vs domains own their own data products.
Fitness function
Automated check that the architecture still meets quality goals (perf, deps, cost).
Data warehouse
Structured, query-optimized store for analytics (Snowflake, BigQuery, Redshift).
Data lake
Cheap object storage holding raw, semi-structured, and structured data.
Data mart
Subset of a warehouse focused on one domain (finance, marketing).
Blue/Green deploy
Run two versions; route traffic to the new only after it's verified.
Canary release
Send a small % of traffic to the new version first.
Rollback
Reverting to the previous good state after a bad deploy.
CI/CD
Auto-build/test on every PR + auto-deploy to envs.
On-call
Engineer responsible for responding to alerts off-hours.
Runbook
Step-by-step doc to handle a known incident.
Postmortem
Blameless writeup of an incident: what happened, why, and what changes.
Blast radius
How much breaks when a single component fails.
DORA metrics
Deploy frequency, lead time, change failure rate, MTTR — DevOps north stars.
MTTR
Mean Time To Recovery — how fast we restore service after an incident.
FinOps
Practice of optimizing cloud cost continuously, owned by eng + finance.