153+ terms

Data Engineering Glossary

The vocabulary every data professional should speak fluently. 153+ terms across ingestion, modeling, streaming, governance, LLMs and more.

🔌Ingestion

Backfill

Re-running a pipeline over historical dates to fill or fix data.

“We're backfilling the last 90 days after the schema bug.”

🔌Ingestion

CDC (Change Data Capture)

Streaming row-level INSERT/UPDATE/DELETE from a source DB by tailing its WAL/binlog.

“We use Debezium CDC instead of nightly snapshots — it captures deletes.”

🔌Ingestion

ELT

Extract, Load (raw), then Transform in-warehouse with SQL/dbt.

“Modern stack is ELT — dbt handles the T.”

🔌Ingestion

ETL

Extract, Transform, Load: the older paradigm, where data is transformed in flight before it is loaded.

“Legacy Informatica ETL jobs are being replaced by ELT + dbt.”

🔌Ingestion

Watermark

Stored timestamp marking how far an incremental loader has progressed.

“Each run pulls WHERE updated_at > last watermark.”
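
A minimal sketch of the watermark pattern, assuming an in-memory list stands in for the source table. The function name `incremental_pull` and the sample rows are hypothetical; a real loader would persist the watermark between runs.

```python
from datetime import datetime

def incremental_pull(rows, watermark):
    """Return rows changed after `watermark`, plus the advanced watermark."""
    new_rows = [r for r in rows if r["updated_at"] > watermark]
    # Advance only to the newest timestamp actually loaded, never to "now",
    # so rows that commit late are not skipped on the next run.
    new_watermark = max((r["updated_at"] for r in new_rows), default=watermark)
    return new_rows, new_watermark

# Hypothetical source rows.
source = [
    {"id": 1, "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "updated_at": datetime(2024, 1, 3)},
    {"id": 3, "updated_at": datetime(2024, 1, 5)},
]

batch, wm = incremental_pull(source, datetime(2024, 1, 2))
next_batch, next_wm = incremental_pull(source, wm)  # nothing new on the second run
```

The second call returns an empty batch and leaves the watermark unchanged, which is what makes re-runs safe.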

🔌Ingestion

Idempotent

Running the job twice gives the same result as running it once.

“MERGE makes the loader idempotent — retries don't duplicate.”

🔌Ingestion

Upsert / MERGE

Insert if absent, update if present — atomic in one statement.

“Use MERGE INTO target USING staging ON target.id = staging.id.”
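
A small sketch of an idempotent upsert, using Python's built-in `sqlite3`. SQLite has no MERGE statement, so this swaps in `INSERT ... ON CONFLICT DO UPDATE` (SQLite 3.24 or newer) as a stand-in; the table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, amount REAL)")

def load(batch):
    """Upsert a batch: insert unseen ids, overwrite existing ones."""
    with conn:  # one transaction per batch
        conn.executemany(
            "INSERT INTO target (id, amount) VALUES (?, ?) "
            "ON CONFLICT(id) DO UPDATE SET amount = excluded.amount",
            batch,
        )

staging = [(1, 10.0), (2, 25.5)]
load(staging)
load(staging)  # simulated retry of the same batch

rows = conn.execute("SELECT id, amount FROM target ORDER BY id").fetchall()
```

Loading the same batch twice leaves two rows, not four, which is the property the loader needs to survive retries.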

🔌Ingestion

Snapshot

Full copy of a table at a point in time.

“Daily snapshots are simple but miss intra-day deletes.”

🔌Ingestion

Incremental load

Loading only rows that changed since the last run.

“Switch from full reload to incremental once volume passes 1M rows.”

🔌Ingestion

Schema drift

The source schema changes without telling you.

“Bronze tolerates drift; silver/gold fail-loud.”

🔌Ingestion

MAR (Monthly Active Rows)

Fivetran's billing unit: each distinct row inserted, updated, or deleted in a month counts once for that month.

“Hot tables blow up MAR — model your top-5 before signing.”

🔌Ingestion

Connector

Pre-built integration that extracts from a source (Stripe, Salesforce…).

“Use the Airbyte Stripe connector instead of building one.”

🔌Ingestion

Tap / Target (Singer)

Singer convention: tap reads from a source, target writes to a sink.

“Meltano composes tap-postgres + target-snowflake.”

🔌Ingestion

Binlog / WAL

The database transaction log that CDC tools tail to capture changes.

“Postgres exposes WAL; MySQL uses binlog.”

🔌Ingestion

Dead Letter Queue (DLQ)

Side topic/table for messages the pipeline couldn't process.

“Bad events go to DLQ for triage instead of blocking the stream.”
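
A rough sketch of DLQ routing, assuming messages arrive as JSON strings and that a plain list stands in for the side topic or table. The `process` function and the `user_id` field are hypothetical.

```python
import json

def process(messages):
    """Parse each message; route failures to a dead letter list."""
    good, dlq = [], []
    for raw in messages:
        try:
            event = json.loads(raw)
            good.append({"user_id": int(event["user_id"])})
        except (ValueError, KeyError, TypeError) as exc:
            # Keep the original payload and the failure reason for triage.
            dlq.append({"raw": raw, "error": type(exc).__name__})
    return good, dlq

good, dlq = process(['{"user_id": "7"}', "not json", '{"user": 1}'])
```

The valid event is processed, while the malformed and the incomplete messages land in `dlq` with their error type, and the loop never stops.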

🗄️Storage & Formats

Parquet

Columnar binary file format optimized for analytics.

“Always store the lake as Parquet, not CSV.”

🗄️Storage & Formats

ORC

Columnar format from the Hive world — similar to Parquet, Hadoop-native.

“Old Hortonworks clusters still write ORC.”

🗄️Storage & Formats

Avro

Row-based binary format with embedded schema — great for streaming.

“Kafka + Avro + Schema Registry is the classic combo.”

🗄️Storage & Formats

Apache Iceberg

Open table format over data files (typically Parquet) with ACID, time travel, and hidden partitioning.

“We picked Iceberg for engine portability across Spark, Trino, Snowflake.”

🗄️Storage & Formats

Delta Lake

Databricks' open table format — Parquet + transaction log.

“Delta is the default on Databricks; Iceberg dominates outside it.”

🗄️Storage & Formats

Apache Hudi

Open table format optimized for upserts and incremental queries.

“Uber built Hudi to handle their massive upsert workloads.”

🗄️Storage & Formats

Lakehouse

Lake (cheap object storage) with warehouse features (ACID, SQL) on top.

“Iceberg + Trino on S3 is a lakehouse without buying Databricks.”

🗄️Storage & Formats

Partition

Splitting a table by a column so queries can skip irrelevant files.

“Partition fact_events by event_date — never by user_id (too many).”

🗄️Storage & Formats

Bucketing

Hash-based split of rows into a fixed number of files inside a partition; helps joins and can ease skew.

“Bucket on user_id to make user-level joins shuffle-free.”

🗄️Storage & Formats

Z-ordering

Multi-dimensional clustering that co-locates related values in the same files, so filters can skip more data.

“ZORDER BY (country, event_type) cuts scan time on filters.”

🗄️Storage & Formats

VACUUM / Compaction

Removing old/unreferenced files; merging small files into bigger ones.

“Run OPTIMIZE + VACUUM weekly on hot Delta tables.”

🗄️Storage & Formats

Small files problem

Thousands of tiny files degrading query performance and inflating metadata overhead.

“Streaming writes create the small-files problem — compact periodically.”

🗄️Storage & Formats

Time travel

Querying a table as it existed at a past version or timestamp.

“SELECT * FROM orders VERSION AS OF 42 to audit a bad deploy.”

🗄️Storage & Formats

Object storage

Cheap, flat, HTTP-addressable storage — S3, GCS, ADLS.

“The lake lives on object storage — compute is separate.”

🗄️Storage & Formats

Columnar storage

Storing values column-by-column so analytic scans only read needed columns.

“Parquet is columnar; CSV is row-based and slow for analytics.”

🗄️Storage & Formats

Schema Registry

Central service that stores and validates evolving schemas (often Avro).

“Confluent Schema Registry prevents producers from breaking consumers.”

🧱Modeling

Star schema

Fact table at the center, dimension tables around it — analytics default.

“fact_orders joined to dim_customer, dim_product, dim_date.”

🧱Modeling

Snowflake schema

Star schema with normalized dimensions (dims linked to sub-dims).

“dim_product → dim_category → dim_department.”

🧱Modeling

Fact table

Table of measurable events (sales, clicks) with FKs to dimensions.

“fact_sales has order_id, dim FKs, quantity, revenue.”

🧱Modeling

Dimension table

Descriptive context for facts (who, what, where, when).

“dim_customer holds name, segment, signup_date.”

🧱Modeling

SCD (Slowly Changing Dimension)

Patterns for tracking history of dimension changes: Type 1 (overwrite), Type 2 (new row + dates), Type 3 (extra column).

“dim_customer is SCD2 — we keep history of plan_tier.”
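
A minimal sketch of the Type 2 pattern, assuming the dimension is a list of dicts with `valid_from` / `valid_to` columns. The function `apply_scd2` and its field names are hypothetical; in a warehouse this would typically be a MERGE or a dbt snapshot.

```python
from datetime import date

def apply_scd2(history, key, attrs, as_of):
    """Close the current row for `key` if attrs changed, then open a new one."""
    current = next(
        (r for r in history if r["key"] == key and r["valid_to"] is None), None
    )
    if current is not None and current["attrs"] == attrs:
        return history  # nothing changed, keep the open row
    if current is not None:
        current["valid_to"] = as_of  # end-date the old version
    history.append(
        {"key": key, "attrs": attrs, "valid_from": as_of, "valid_to": None}
    )
    return history

dim_customer = []
apply_scd2(dim_customer, "c1", {"plan_tier": "free"}, date(2024, 1, 1))
apply_scd2(dim_customer, "c1", {"plan_tier": "pro"}, date(2024, 3, 1))
apply_scd2(dim_customer, "c1", {"plan_tier": "pro"}, date(2024, 4, 1))  # no-op
```

After the three calls the dimension holds two rows: the closed `free` version and the open `pro` version.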

🧱Modeling

Surrogate key

Synthetic primary key (e.g. an auto-int or hash), independent of business keys.

“Use a surrogate key on dim_customer so SCD2 history works.”

🧱Modeling

Natural key / Business key

Real-world identifier from the source system (e.g. order_id, email).

“Keep the natural key as a column even when using a surrogate PK.”

🧱Modeling

Grain

What one row of a fact table represents — declare it before building.

“The grain of fact_orders is one row per order line.”

🧱Modeling

Data Vault

Hubs + Links + Satellites — modeling style optimized for auditability and source change.

“Banks love Data Vault for its lineage and historization.”

🧱Modeling

Medallion (Bronze/Silver/Gold)

Layered lake architecture: raw bronze → cleaned silver → consumer gold.

“BI dashboards only read from the gold layer.”

🧱Modeling

Wide / OBT (One Big Table)

Pre-joined denormalized table for fast read — common in analytics layers.

“One Big Table for the dashboard avoids join cost at query time.”

🧱Modeling

Normalization

Splitting data into many tables to remove redundancy — OLTP default.

“3NF is fine for OLTP; analytics prefer denormalized stars.”

🧱Modeling

Denormalization

Duplicating data into one table to make reads cheap.

“Denormalize dim attributes onto fact when joins get expensive.”

🧱Modeling

Metrics layer / Semantic layer

Single source of truth for business metric definitions (MRR, ARPU…).

“dbt MetricFlow centralizes metric SQL once, used by every BI tool.”

🌊Streaming

Apache Kafka

Distributed append-only log — the backbone of event-driven data.

“Every event hits Kafka first, then fans out to consumers.”

🌊Streaming

Topic

Named stream in Kafka — split into partitions.

“orders.created and orders.shipped are separate topics.”

🌊Streaming

Consumer group

Set of consumers that share partitions — each partition goes to one member.

“Scale consumers up to partition count for parallelism.”

🌊Streaming

Exactly-once

Each event affects state exactly once even with retries and failures.

“Kafka + Flink with checkpoints and a transactional sink gives end-to-end exactly-once.”

🌊Streaming

At-least-once

Events may be delivered more than once — consumers must be idempotent.

“Most systems are at-least-once by default.”

🌊Streaming

Watermark (streaming)

Time marker telling the engine 'no more events older than this'.

“Late events past the watermark are dropped or sent to a side output.”
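
A simplified sketch of one common watermark heuristic: the watermark trails the largest event time seen so far by a fixed allowed lateness. Real engines such as Flink have richer semantics; `split_late` and the integer timestamps are hypothetical.

```python
def split_late(event_times, allowed_lateness):
    """Separate on-time events from events that arrive behind the watermark."""
    max_seen = float("-inf")
    on_time, late = [], []
    for ts in event_times:
        watermark = max_seen - allowed_lateness
        if ts < watermark:
            late.append(ts)      # would be dropped or sent to a side output
        else:
            on_time.append(ts)
        max_seen = max(max_seen, ts)
    return on_time, late

on_time, late = split_late([10, 12, 20, 11, 30, 18], allowed_lateness=5)
```

Here events 11 and 18 arrive after the watermark has already passed them, so they end up in `late`.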

🌊Streaming

Event time vs processing time

Event time = when it happened. Processing time = when we saw it.

“Always window on event time — processing time gives wrong results on backfills.”

🌊Streaming

Windowing

Grouping events into bounded chunks (tumbling, sliding, session) for aggregation.

“Tumbling 5-min windows count clicks per page.”
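
A short sketch of tumbling windows, assuming events are `(epoch_seconds, page)` tuples. Each event falls into exactly one 5-minute bucket, found by rounding its timestamp down; the names are hypothetical.

```python
from collections import Counter

WINDOW_SECONDS = 300  # 5-minute tumbling windows

def tumbling_counts(events):
    """Count clicks per (window_start, page)."""
    counts = Counter()
    for ts, page in events:
        window_start = ts - ts % WINDOW_SECONDS  # round down to window boundary
        counts[(window_start, page)] += 1
    return dict(counts)

events = [(0, "/home"), (120, "/home"), (299, "/pricing"), (300, "/home"), (610, "/home")]
counts = tumbling_counts(events)
```

The event at second 299 lands in the first window and the one at second 300 opens the next, since tumbling windows never overlap.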

🌊Streaming

Stateful processing

The stream operator keeps memory across events (aggregates, joins, sessions).

“Flink's RocksDB state backend handles huge stateful jobs.”

🌊Streaming

Apache Flink

True streaming engine — event-by-event, low latency, strong state.

“Pick Flink for sub-second SLAs and complex stateful joins.”

🌊Streaming

Spark Structured Streaming

Micro-batch streaming on the Spark engine — easier ops than Flink.

“We use Structured Streaming because the team already knows Spark.”

🌊Streaming

Micro-batch

'Streaming' implemented by tiny batches every N seconds.

“Spark Structured Streaming and Auto Loader run as micro-batches by default.”

🌊Streaming

Kafka Connect

Framework for source/sink connectors that move data in/out of Kafka.

“Debezium runs as a Kafka Connect source.”

🌊Streaming

ksqlDB / Kafka Streams

ksqlDB is a streaming SQL server built on Kafka Streams; Kafka Streams is a Java library that processes Kafka data inside your application, without a separate processing cluster.

“ksqlDB joins two topics with a SELECT.”

🌊Streaming

Backpressure

Slow consumer signals upstream to slow down, preventing memory blow-up.

“Flink propagates backpressure across operators automatically.”

🎼Orchestration

DAG

Directed Acyclic Graph — nodes are tasks, edges are dependencies, no cycles.

“Airflow DAGs define the daily ETL.”
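
A minimal sketch of what "no cycles" buys you, using the standard library's `graphlib` (Python 3.9+): a valid run order always exists. The task names are hypothetical and this is not how Airflow itself is implemented.

```python
from graphlib import CycleError, TopologicalSorter

# Each task maps to the set of tasks it depends on.
deps = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

order = list(TopologicalSorter(deps).static_order())

def has_cycle(graph):
    """Return True if the dependency graph cannot be ordered."""
    try:
        TopologicalSorter(graph).prepare()
        return False
    except CycleError:
        return True

cyclic = has_cycle({"a": {"b"}, "b": {"a"}})
```

A graph where `a` waits on `b` and `b` waits on `a` has no valid order, which is why orchestrators reject it.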

🎼Orchestration

Apache Airflow

Python-based scheduler — define DAGs in code, run them on a cluster.

“Airflow is the boring, reliable default for batch orchestration.”

🎼Orchestration

Dagster

Asset-oriented orchestrator — you declare data assets, deps are inferred.

“Dagster's asset graph maps cleanly to your dbt models.”

🎼Orchestration

Prefect

Modern Python orchestrator — dynamic flows, hybrid execution.

“Prefect 2 is lighter than Airflow for small teams.”

🎼Orchestration

Scheduler

Component that decides when to trigger a job (cron, sensor, manual).

“The scheduler triggered the DAG at 02:00 UTC.”

🎼Orchestration

Sensor

Task that waits for an external event (file arrived, table updated).

“S3KeySensor blocks until the file lands.”

🎼Orchestration

SLA / SLO / SLI

SLI=metric, SLO=target, SLA=contract with consequences. Pipelines need them.

“SLO: 99% of daily loads land by 6am.”

🎼Orchestration

Catchup / Backfill (Airflow)

Airflow scheduling a run for every past interval the DAG has not yet executed, for example after a pause or a deploy with an old start_date.

“Set catchup=False on DAGs you don't want re-run for every missed interval.”

🎼Orchestration

Retry / Backoff

Auto re-running failed tasks with growing delay (exponential backoff).

“Set retries=3 with exponential backoff for flaky APIs.”
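
A small sketch of retries with exponential backoff. The `retry` helper is hypothetical, and the `sleep` parameter is injectable only so the example can run without actually waiting.

```python
import time

def retry(fn, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt and try again."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise  # out of retries, surface the error
            sleep(base_delay * 2 ** attempt)

# Hypothetical flaky call: fails twice, then succeeds.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

delays = []
result = retry(flaky, sleep=delays.append)  # record delays instead of sleeping
```

The recorded delays double each time (1s, then 2s); production code often adds random jitter so many clients do not retry in lockstep.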

🎼Orchestration

Idempotency key

Unique token a client sends so a server can dedupe retried calls.

“Stripe accepts an Idempotency-Key header so a retried charge isn't applied twice.”
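
A toy sketch of server-side deduplication with an idempotency key. `PaymentServer` is a hypothetical in-memory stand-in, not Stripe's API; the point is only that a repeated key returns the stored response instead of charging again.

```python
import uuid

class PaymentServer:
    """Hypothetical server that remembers responses by idempotency key."""

    def __init__(self):
        self._responses = {}
        self.charges = []

    def charge(self, idempotency_key, amount):
        if idempotency_key in self._responses:
            return self._responses[idempotency_key]  # replay, no new charge
        self.charges.append(amount)
        response = {"charge_id": len(self.charges), "amount": amount}
        self._responses[idempotency_key] = response
        return response

server = PaymentServer()
key = str(uuid.uuid4())           # client generates one key per logical request
first = server.charge(key, 42)
second = server.charge(key, 42)   # retry after a timeout, same key
```

Both calls return the same response and only one charge is recorded.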

🔍Quality & Observability

Data quality

How well data fits its purpose: accuracy, completeness, freshness, uniqueness.

“dbt tests catch quality issues before BI users do.”

🔍Quality & Observability

Great Expectations

Python library declaring 'expectations' (assertions) over data.

“expect_column_values_to_not_be_null('user_id').”

🔍Quality & Observability

Soda / Soda Core

YAML-based data checks — runs in CI or in your warehouse on a schedule.

“Soda Core checks freshness < 6h on the orders table.”

🔍Quality & Observability

Freshness

How recently a table was updated vs SLA expectation.

“Freshness alert: dim_user not updated in 24h.”

🔍Quality & Observability

Volume check

Anomaly check: row count today vs typical — flags drops/spikes.

“Volume on fact_orders dropped 80% — page on-call.”
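
A rough sketch of a volume check, assuming the baseline is simply the mean of recent daily row counts. The 50% threshold and the sample counts are made up for illustration; real tools usually model seasonality as well.

```python
from statistics import mean

def volume_alert(history, today, max_drop=0.5):
    """Flag today's row count if it fell more than max_drop below the baseline."""
    baseline = mean(history)
    change = (today - baseline) / baseline  # e.g. -0.8 means an 80% drop
    return change <= -max_drop, change

recent_counts = [1000, 1100, 900, 1000]  # hypothetical daily row counts
alert, change = volume_alert(recent_counts, today=200)
quiet, small_change = volume_alert(recent_counts, today=950)
```

A drop from about 1,000 rows to 200 trips the alert, while 950 stays within normal variation.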

🔍Quality & Observability

Lineage

The dependency graph: which source feeds which model feeds which dashboard.

“OpenLineage emits events Marquez visualizes.”

🔍Quality & Observability

Data observability

Monitoring data the way SREs monitor services: freshness, volume, schema, distribution, lineage.

“Monte Carlo and Datafold are observability tools.”

🔍Quality & Observability

Anomaly detection

ML or stats catching unexpected changes (counts, distributions, freshness).

“Monte Carlo flagged a 3σ drop in distinct user_ids.”

🔍Quality & Observability

dbt test

Built-in dbt assertions: unique, not_null, accepted_values, relationships.

“dbt test fails the build if a PK has duplicates.”

🔍Quality & Observability

Data diff

Comparing two table versions row-by-row to see what a change would alter.

“Datafold data-diff in PRs prevents silent breakages.”

📜Governance & Contracts

Data contract

Producer-consumer agreement on schema, semantics, and SLAs; versioned, with breaking changes allowed only on a version bump.

“The data contract forces the backend team to bump version before renaming a column.”

📜Governance & Contracts

Data Mesh

Org pattern: domain teams own and serve their data as products.

“We're moving from central platform to data mesh.”

📜Governance & Contracts

Data product

A dataset treated as a product: owner, SLA, docs, contract, discoverability.

“The 'active_users' data product has a PM and a roadmap.”

📜Governance & Contracts

Data catalog

Searchable inventory of datasets with owner, schema, docs, lineage.

“DataHub, Atlas, Unity Catalog, OpenMetadata.”

📜Governance & Contracts

Unity Catalog

Databricks' unified governance: tables, ML, files, lineage, audit, fine-grained access.

“Unity Catalog replaces table ACLs + Hive metastore.”

📜Governance & Contracts

PII

Personally Identifiable Information — must be tagged, masked, access-controlled.

“Tag email and phone as PII; mask in non-prod.”

📜Governance & Contracts

GDPR / LGPD

EU / Brazilian data protection laws — right to erasure, consent, purpose limitation.

“LGPD requires a 'delete user data' workflow across the lake.”

📜Governance & Contracts

RBAC / ABAC

Role-based vs attribute-based access control.

“Snowflake uses RBAC; row-access policies enable ABAC.”

📜Governance & Contracts

Dynamic data masking

Hiding sensitive values at query time based on the caller's role.

“Analysts see masked email; security sees the real value.”

📜Governance & Contracts

Data steward

Person accountable for a dataset's definitions, quality, access.

“Each domain has a data steward registered in the catalog.”

📜Governance & Contracts

OpenLineage

Open standard for emitting lineage events from any tool.

“Airflow, dbt, Spark all speak OpenLineage now.”

☁️Cloud & Infra

IaC (Infrastructure as Code)

Provisioning infra via versioned code instead of clicks.

“Terraform manages all our buckets and IAM.”

☁️Cloud & Infra

Terraform / OpenTofu

Cloud-agnostic IaC tool — declares desired state, plans the diff, applies.

“terraform plan; terraform apply.”

☁️Cloud & Infra

VPC

Virtual Private Cloud — isolated network in AWS/GCP/Azure.

“The warehouse lives in the prod VPC; only the bastion can SSH.”

☁️Cloud & Infra

IAM

Identity & Access Management — who can do what on cloud resources.

“Give the loader role read-only IAM on the bucket.”

☁️Cloud & Infra

Kubernetes (K8s)

Container orchestrator — manages pods, services, scaling.

“Spark on Kubernetes replaces YARN in many shops.”

☁️Cloud & Infra

Serverless

Compute that scales to zero and bills per invocation (Lambda, Cloud Run).

“Tiny ingestion jobs are perfect for Lambda.”

☁️Cloud & Infra

Egress cost

Cloud charges for data leaving its network — sneaky bill killer.

“Cross-region replication doubled the bill via egress.”

☁️Cloud & Infra

BigQuery

GCP's serverless analytics warehouse — pay per scanned byte or slot.

“Partition + cluster cuts BigQuery scan cost dramatically.”

☁️Cloud & Infra

Snowflake

Cloud-native warehouse separating storage from per-second compute (warehouses).

“Use an XS warehouse for dev; size up for prod loads.”

☁️Cloud & Infra

Redshift

AWS's warehouse — classic provisioned or Serverless modes.

“Redshift Serverless removes node sizing headaches.”

☁️Cloud & Infra

Athena

AWS serverless SQL over S3 — Trino under the hood.

“Query Parquet on S3 directly with Athena, no cluster.”

☁️Cloud & Infra

AWS Glue

AWS managed Spark-based ETL service plus a data catalog.

“Glue Crawlers populate the Data Catalog from S3.”

☁️Cloud & Infra

Databricks

Managed Spark + Delta + ML — turned into the 'lakehouse' platform.

“Databricks Workflows replaces Airflow for many shops.”

☁️Cloud & Infra

Microsoft Fabric

Microsoft's unified data platform: OneLake + Synapse + Power BI under SKU pricing.

“Fabric's OneLake is one logical lake across services.”

🤖LLMs & AI

LLM

Large Language Model — transformer trained on huge text corpora.

“GPT-4, Claude, Llama 3 are LLMs.”

🤖LLMs & AI

Token

Sub-word unit an LLM reads/writes — billing and context limits use tokens.

“1 token ≈ 4 chars in English; pricing is per-million tokens.”

🤖LLMs & AI

Context window

Max tokens the model can consider in one call (input + output).

“Don't stuff 1M tokens just because it fits — cost and latency soar.”

🤖LLMs & AI

Embedding

Vector representation of text/image — close vectors = similar meaning.

“Embed all docs, store in a vector DB, search by cosine similarity.”
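
A minimal sketch of similarity search over embeddings. The three-dimensional vectors below are made-up toy values, not output from any real embedding model, and the dict stands in for a vector DB.

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical document embeddings.
docs = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.2],
    "gdpr request": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # hypothetical embedding of the user's question

best = max(docs, key=lambda name: cosine(query, docs[name]))
```

The query vector points in nearly the same direction as the "refund policy" vector, so that document ranks first.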

🤖LLMs & AI

RAG (Retrieval-Augmented Generation)

Retrieve relevant chunks, stuff them into the prompt, then generate.

“RAG over the docs gives grounded answers with citations.”

🤖LLMs & AI

Vector database

DB optimized for nearest-neighbor search on embeddings (Pinecone, pgvector, Weaviate).

“pgvector lets Postgres double as your vector DB.”

🤖LLMs & AI

Chunking

Splitting docs into pieces sized for embedding + retrieval.

“Chunk by semantic boundary, not fixed token size.”
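
A simple sketch of boundary-aware chunking, assuming blank lines mark paragraph boundaries and using a character budget as a rough proxy for tokens. `chunk_paragraphs` is a hypothetical helper; it never splits inside a paragraph, so an oversized paragraph becomes its own chunk.

```python
def chunk_paragraphs(text, max_chars=200):
    """Pack whole paragraphs into chunks of at most max_chars where possible."""
    chunks, current = [], ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            current = candidate      # still fits, or first paragraph of a chunk
        else:
            chunks.append(current)   # close the chunk at a paragraph boundary
            current = para
    if current:
        chunks.append(current)
    return chunks

text = "A" * 120 + "\n\n" + "B" * 120 + "\n\n" + "C" * 50
chunks = chunk_paragraphs(text)
```

The first two paragraphs do not fit together in 200 characters, so the split falls between them, and the short third paragraph joins the second.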

🤖LLMs & AI

Reranker

Second-pass model that re-orders retrieved results by true relevance.

“Cohere Rerank pushes the best chunks to the top.”

🤖LLMs & AI

Agent

LLM that plans + calls tools in a loop to accomplish a goal.

“The agent searches, reads, then writes a SQL query.”

🤖LLMs & AI

Tool calling / Function calling

LLM emits structured JSON to invoke a function — your code runs it.

“Define get_weather(city); the model decides when to call it.”
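
A bare-bones sketch of the dispatch side of tool calling. The JSON shape in `model_output` is a hypothetical format, not any specific vendor's, and `get_weather` is a stub; the model only names the function and arguments, while your code looks it up and runs it.

```python
import json

def get_weather(city):
    """Stub tool; a real one would call a weather API."""
    return f"Sunny in {city}"

# Registry of functions the model is allowed to invoke.
TOOLS = {"get_weather": get_weather}

# Hypothetical structured output emitted by the model.
model_output = '{"name": "get_weather", "arguments": {"city": "Lisbon"}}'

call = json.loads(model_output)
result = TOOLS[call["name"]](**call["arguments"])
```

Looking the name up in an explicit registry means the model can only trigger functions you have chosen to expose.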

🤖LLMs & AI

Fine-tuning

Continuing training of a base model on your task-specific data.

“Fine-tune Llama 3 on internal tickets to match company tone.”

🤖LLMs & AI

Prompt engineering

Crafting LLM inputs (system, role, examples) to get better outputs.

“Few-shot examples beat zero-shot for tricky formats.”

🤖LLMs & AI

Hallucination

Confidently-wrong LLM output — invents facts not in input/data.

“RAG reduces hallucinations by grounding in real docs.”

🤖LLMs & AI

Eval (LLM evaluation)

Systematic measurement of LLM output quality (Ragas, LangSmith, custom).

“Run evals in CI before shipping a new prompt.”

🤖LLMs & AI

Guardrails

Filters/validators around LLM input/output (PII redaction, jailbreak detection, JSON schema).

“Guardrails reject malformed JSON before it hits production.”

🤖LLMs & AI

MLOps

DevOps practices for ML: training pipelines, model registry, monitoring drift.

“MLflow tracks experiments and registers production models.”

🤖LLMs & AI

Feature store

Repository serving consistent features to training AND online inference.

“Feast unifies offline (warehouse) + online (Redis) feature lookups.”

⚡Performance

Shuffle

Redistributing rows across nodes for a join/group — Spark's #1 cost.

“Reduce shuffle: bucket tables on the join key, use broadcast joins.”

⚡Performance

Broadcast join

Replicate the small side of a join to every executor — skips shuffle.

“Spark auto-broadcasts tables under 10MB by default (spark.sql.autoBroadcastJoinThreshold).”

⚡Performance

Data skew

One key has way more rows than others — one task takes forever.

“Salt the skewed key to distribute load.”
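
A toy sketch of key salting, assuming one hot key dominates the data. A random suffix spreads that key across several sub-keys; the helper `salted_key` and the bucket count of 8 are hypothetical.

```python
import random
from collections import Counter

def salted_key(key, buckets, rng):
    """Append a random salt so one hot key becomes `buckets` smaller keys."""
    return f"{key}#{rng.randrange(buckets)}"

rng = random.Random(0)           # seeded so the example is repeatable
rows = ["big"] * 10_000          # one heavily skewed key

plain = Counter(rows)
salted = Counter(salted_key(k, 8, rng) for k in rows)
```

Without salt a single task handles all 10,000 rows; with salt the work is split across 8 keys of roughly equal size. In a join, the other side has to be replicated once per salt value so every salted key still finds its match.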

⚡Performance

Predicate pushdown

Pushing WHERE filters down into the scan so the reader skips files and row groups instead of reading every row.

“Parquet supports predicate pushdown on min/max stats.”
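
A simplified sketch of how min/max statistics let a reader skip data, assuming per-file stats are available as plain dicts. Real Parquet readers apply the same idea per row group; the file names and the `files_to_read` helper are hypothetical.

```python
# Hypothetical per-file statistics for the event_date column.
files = [
    {"path": "part-0.parquet", "min_date": "2024-01-01", "max_date": "2024-01-31"},
    {"path": "part-1.parquet", "min_date": "2024-02-01", "max_date": "2024-02-29"},
    {"path": "part-2.parquet", "min_date": "2024-03-01", "max_date": "2024-03-31"},
]

def files_to_read(files, date_from):
    """Keep only files whose max value could satisfy event_date >= date_from."""
    # ISO dates compare correctly as strings.
    return [f["path"] for f in files if f["max_date"] >= date_from]

to_read = files_to_read(files, "2024-02-15")
```

The January file is skipped without being opened, because its maximum date is already below the filter.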

⚡Performance

Column pruning

Reading only the columns the query needs — free with columnar formats.

“SELECT * defeats column pruning. List the columns.”

⚡Performance

EXPLAIN / Query plan

Engine's plan to run a query — scan, join, aggregate ordering.

“EXPLAIN ANALYZE shows actual rows vs estimated.”

⚡Performance

Materialized view

Precomputed query stored as a table; refreshed periodically or incrementally.

“Snowflake dynamic tables are incrementally-refreshed MVs.”

⚡Performance

Caching (warehouse)

Reusing previous query results when underlying data hasn't changed.

“Snowflake's result cache returns repeated queries in <1s.”

⚡Performance

AQE (Adaptive Query Execution)

Spark re-plans during execution based on real runtime stats.

“AQE auto-coalesces shuffle partitions and handles skew.”

⚡Performance

Photon

Databricks' vectorized C++ engine — 2-3x faster than vanilla Spark on SQL.

“Enable Photon on SQL warehouses for cheaper $/query.”

⚡Performance

DuckDB

In-process columnar OLAP DB — SQLite for analytics.

“DuckDB queries Parquet on a laptop faster than your cluster.”

🏗️Architecture

Lambda architecture

Two paths: slow batch (truth) + fast stream (low latency), merged at serve time.

“Mostly historical now — Kappa replaced it in most stacks.”

🏗️Architecture

Kappa architecture

One streaming path for everything — reprocess history by replaying.

“Kappa works when your storage can replay events forever.”

🏗️Architecture

CQRS

Separate read and write models — writes go to one store, reads to another.

“Postgres for writes, denormalized search index for reads.”

🏗️Architecture

Event sourcing

Store every state change as an immutable event; current state = replay.

“Event sourcing makes audit and time-travel native.”
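
A tiny sketch of event sourcing, assuming an account balance is the state being tracked. The event names and the `replay` function are hypothetical; the log is the source of truth and state is derived from it.

```python
def replay(events):
    """Rebuild the current balance by folding over the event log."""
    balance = 0
    for kind, amount in events:
        if kind == "deposited":
            balance += amount
        elif kind == "withdrawn":
            balance -= amount
    return balance

# Append-only log of state changes.
events = [("deposited", 100), ("withdrawn", 30), ("deposited", 5)]

current = replay(events)
as_of_second_event = replay(events[:2])  # "time travel": replay a prefix
```

Replaying only a prefix of the log gives the state at that earlier point, which is why audit and time travel come almost for free.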

🏗️Architecture

Central platform vs Data Mesh

Central team owns everything vs domains own their own data products.

“Below ~30 engineers, central wins. Above, mesh starts to pay off.”

🏗️Architecture

Fitness function

Automated check that the architecture still meets quality goals (perf, deps, cost).

“CI fails if a query plan crosses the cost threshold.”

🏗️Architecture

Data warehouse

Structured, query-optimized store for analytics (Snowflake, BigQuery, Redshift).

“Warehouse for BI, lake for raw and ML.”

🏗️Architecture

Data lake

Cheap object storage holding raw, semi-structured, and structured data.

“The lake on S3 stores everything; warehouse only the curated part.”

🏗️Architecture

Data mart

Subset of a warehouse focused on one domain (finance, marketing).

“Marketing has its own data mart with attribution models.”

🛠️Operations

Blue/Green deploy

Run two versions; route traffic to the new only after it's verified.

“Build the new gold table side-by-side, then swap.”

🛠️Operations

Canary release

Send a small % of traffic to the new version first.

“Canary the new dbt model to 10% of dashboards.”

🛠️Operations

Rollback

Reverting to the previous good state after a bad deploy.

“Iceberg lets us roll back to a snapshot in one statement.”

🛠️Operations

CI/CD

Auto-build/test on every PR + auto-deploy to envs.

“GitHub Actions runs dbt build on every PR before merge.”

🛠️Operations

On-call

Engineer responsible for responding to alerts off-hours.

“PagerDuty rotates the data on-call weekly.”

🛠️Operations

Runbook

Step-by-step doc to handle a known incident.

“The runbook for 'pipeline late' explains how to triage and restart.”

🛠️Operations

Postmortem

Blameless writeup of an incident: what happened, why, and what changes.

“Every Sev1 gets a postmortem within 48h.”

🛠️Operations

Blast radius

How much breaks when a single component fails.

“Splitting the monolith reduced blast radius.”

🛠️Operations

DORA metrics

Deploy frequency, lead time, change failure rate, MTTR — DevOps north stars.

“Data teams should measure DORA too.”

🛠️Operations

MTTR

Mean Time To Recovery — how fast we restore service after an incident.

“Better alerts cut MTTR from 2h to 15min.”

🛠️Operations

FinOps

Practice of optimizing cloud cost continuously, owned by eng + finance.

“FinOps reviews flagged the runaway BigQuery slot usage.”