153+ terms

Data Engineering Glossary

The vocabulary every data professional should speak fluently. 153+ terms across ingestion, modeling, streaming, governance, LLMs and more.

🔌Ingestion

Backfill

Re-running a pipeline over historical dates to fill or fix data.

“We're backfilling the last 90 days after the schema bug.”
🔌Ingestion

CDC (Change Data Capture)

Streaming row-level INSERT/UPDATE/DELETE from a source DB by tailing its WAL/binlog.

“We use Debezium CDC instead of nightly snapshots — it captures deletes.”
🔌Ingestion

ELT

Extract, Load (raw), then Transform in-warehouse with SQL/dbt.

“Modern stack is ELT — dbt handles the T.”
🔌Ingestion

ETL

Old paradigm: transform data in-flight before loading.

“Legacy Informatica ETL jobs are being replaced by ELT + dbt.”
🔌Ingestion

Watermark

Stored timestamp marking how far an incremental loader has progressed.

“Each run pulls WHERE updated_at > last watermark.”
🔌Ingestion

Idempotent

Running the job twice gives the same result as running it once.

“MERGE makes the loader idempotent — retries don't duplicate.”
🔌Ingestion

Upsert / MERGE

Insert if absent, update if present — atomic in one statement.

“Use MERGE INTO target USING staging ON id.”
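A minimal sketch of an idempotent upsert, using Python's built-in sqlite3 module. SQLite has no MERGE statement, so this swaps in its INSERT ... ON CONFLICT DO UPDATE form (available from SQLite 3.24); the `target` table and its columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, status TEXT)")

def upsert(rows):
    """Insert unseen ids and update existing ones, one statement per row."""
    conn.executemany(
        """
        INSERT INTO target (id, status) VALUES (?, ?)
        ON CONFLICT(id) DO UPDATE SET status = excluded.status
        """,
        rows,
    )
    conn.commit()

staging = [(1, "paid"), (2, "shipped")]
upsert(staging)
upsert(staging)  # simulated retry: same input, same final table

final_rows = conn.execute("SELECT id, status FROM target ORDER BY id").fetchall()
print(final_rows)  # -> [(1, 'paid'), (2, 'shipped')]
```

Because the second call only rewrites the same values, the loader can be retried without producing duplicates.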
🔌Ingestion

Snapshot

Full copy of a table at a point in time.

“Daily snapshots are simple but miss intra-day deletes.”
🔌Ingestion

Incremental load

Loading only rows that changed since the last run.

“Switch from full reload to incremental once volume passes 1M rows.”
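A sketch of a watermark-driven incremental load, assuming each source row carries an `updated_at` timestamp. The function name and row layout are illustrative, not taken from any particular tool.

```python
from datetime import datetime

def incremental_extract(source_rows, watermark):
    """Return rows changed after the watermark, plus the advanced watermark."""
    changed = [r for r in source_rows if r["updated_at"] > watermark]
    # If nothing changed, the watermark stays where it was.
    new_watermark = max((r["updated_at"] for r in changed), default=watermark)
    return changed, new_watermark

source = [
    {"id": 1, "updated_at": datetime(2024, 1, 1, 9, 0)},
    {"id": 2, "updated_at": datetime(2024, 1, 2, 9, 0)},
    {"id": 3, "updated_at": datetime(2024, 1, 3, 9, 0)},
]

batch, wm = incremental_extract(source, datetime(2024, 1, 1, 12, 0))
print([r["id"] for r in batch], wm)  # -> [2, 3] 2024-01-03 09:00:00
```

A strict `>` comparison can miss rows that share the watermark's exact timestamp; some loaders re-read a small overlap window and rely on an idempotent upsert to absorb the repeats.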
🔌Ingestion

Schema drift

The source schema changes without telling you.

“Bronze tolerates drift; silver/gold fail-loud.”
🔌Ingestion

MAR (Monthly Active Rows)

Fivetran's billing unit: each distinct row inserted, updated, or deleted in a month counts once, however many times it changes.

“Hot tables blow up MAR — model your top-5 before signing.”
🔌Ingestion

Connector

Pre-built integration that extracts from a source (Stripe, Salesforce…).

“Use the Airbyte Stripe connector instead of building one.”
🔌Ingestion

Tap / Target (Singer)

Singer convention: tap reads from a source, target writes to a sink.

“Meltano composes tap-postgres + target-snowflake.”
🔌Ingestion

Binlog / WAL

Database transaction log that CDC tools tail to capture changes.

“Postgres exposes WAL; MySQL uses binlog.”
🔌Ingestion

Dead Letter Queue (DLQ)

Side topic/table for messages the pipeline couldn't process.

“Bad events go to DLQ for triage instead of blocking the stream.”
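A small sketch of the DLQ pattern, assuming messages arrive as JSON strings with a `user_id` field (a hypothetical schema). Failures are set aside with their error instead of stopping the loop.

```python
import json

def process(messages):
    """Parse each message; route failures to a dead letter list instead of raising."""
    good, dlq = [], []
    for raw in messages:
        try:
            event = json.loads(raw)
            good.append({"user_id": int(event["user_id"])})
        except (ValueError, KeyError, TypeError) as err:
            # Keep the original payload and the reason so it can be triaged later.
            dlq.append({"raw": raw, "error": repr(err)})
    return good, dlq

good, dlq = process(['{"user_id": "7"}', "not json", '{"other": 1}'])
print(len(good), len(dlq))  # -> 1 2
```

In a real pipeline the `dlq` list would be a separate topic or table, but the control flow is the same.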
🗄️Storage & Formats

Parquet

Columnar binary file format optimized for analytics.

“Always store the lake as Parquet, not CSV.”
🗄️Storage & Formats

ORC

Columnar format from the Hive world — similar to Parquet, Hadoop-native.

“Old Hortonworks clusters still write ORC.”
🗄️Storage & Formats

Avro

Row-based binary format with embedded schema — great for streaming.

“Kafka + Avro + Schema Registry is the classic combo.”
🗄️Storage & Formats

Apache Iceberg

Open table format on top of Parquet with ACID, time-travel and hidden partitioning.

“We picked Iceberg for engine portability across Spark, Trino, Snowflake.”
🗄️Storage & Formats

Delta Lake

Databricks' open table format — Parquet + transaction log.

“Delta is the default on Databricks; Iceberg dominates outside it.”
🗄️Storage & Formats

Apache Hudi

Open table format optimized for upserts and incremental queries.

“Uber built Hudi to handle their massive upsert workloads.”
🗄️Storage & Formats

Lakehouse

Lake (cheap object storage) with warehouse features (ACID, SQL) on top.

“Iceberg + Trino on S3 is a lakehouse without buying Databricks.”
🗄️Storage & Formats

Partition

Splitting a table by a column so queries can skip irrelevant files.

“Partition fact_events by event_date — never by user_id (too many).”
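To illustrate why partitioning lets a query skip files, here is a sketch of partition pruning over Hive-style paths. The path layout `event_date=YYYY-MM-DD` is an assumption for the example; real engines do this from table metadata rather than string parsing.

```python
def prune_partitions(partitions, start_date, end_date):
    """Keep only partition paths whose event_date falls in the inclusive range."""
    kept = []
    for path in partitions:
        # Paths look like "fact_events/event_date=2024-01-02/"
        value = path.rstrip("/").split("event_date=")[1]
        if start_date <= value <= end_date:  # ISO dates sort lexicographically
            kept.append(path)
    return kept

all_partitions = [f"fact_events/event_date=2024-01-0{d}/" for d in range(1, 8)]
to_scan = prune_partitions(all_partitions, "2024-01-03", "2024-01-04")
print(to_scan)
```

A filter on two days touches two of the seven directories; the other five are never opened.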
🗄️Storage & Formats

Bucketing

Hash-based file split inside a partition; speeds up joins on the bucket key and can spread out skewed data.

“Bucket on user_id to make user-level joins shuffle-free.”
🗄️Storage & Formats

Z-ordering

Multi-dimensional clustering that co-locates related rows in the same data files, so min/max statistics let filters skip more files.

“ZORDER BY (country, event_type) cuts scan time on filters.”
🗄️Storage & Formats

VACUUM / Compaction

Removing old/unreferenced files; merging small files into bigger ones.

“Run OPTIMIZE + VACUUM weekly on hot Delta tables.”
🗄️Storage & Formats

Small files problem

Thousands of tiny files that hurt query performance and inflate metadata overhead.

“Streaming writes create the small-files problem — compact periodically.”
🗄️Storage & Formats

Time travel

Querying a table as it existed at a past version or timestamp.

“SELECT * FROM orders VERSION AS OF 42 to audit a bad deploy.”
🗄️Storage & Formats

Object storage

Cheap, flat, HTTP-addressable storage — S3, GCS, ADLS.

“The lake lives on object storage — compute is separate.”
🗄️Storage & Formats

Columnar storage

Storing values column-by-column so analytic scans only read needed columns.

“Parquet is columnar; CSV is row-based and slow for analytics.”
🗄️Storage & Formats

Schema Registry

Central service that stores and validates evolving schemas (often Avro).

“Confluent Schema Registry prevents producers from breaking consumers.”
🧱Modeling

Star schema

Fact table at the center, dimension tables around it — analytics default.

“fact_orders joined to dim_customer, dim_product, dim_date.”
🧱Modeling

Snowflake schema

Star schema with normalized dimensions (dims linked to sub-dims).

“dim_product → dim_category → dim_department.”
🧱Modeling

Fact table

Table of measurable events (sales, clicks) with FKs to dimensions.

“fact_sales has order_id, dim FKs, quantity, revenue.”
🧱Modeling

Dimension table

Descriptive context for facts (who, what, where, when).

“dim_customer holds name, segment, signup_date.”
🧱Modeling

SCD (Slowly Changing Dimension)

Patterns for tracking history of dimension changes: Type 1 (overwrite), Type 2 (new row + dates), Type 3 (extra column).

“dim_customer is SCD2 — we keep history of plan_tier.”
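A sketch of the Type 2 pattern, assuming a dimension held as a list of dicts with `valid_from` / `valid_to` columns (hypothetical names). A change closes the current row and appends a new one, so history is preserved.

```python
from datetime import date

def apply_scd2(dim_rows, customer_id, new_tier, change_date):
    """Close the customer's current row and append a new current row."""
    for row in dim_rows:
        if row["customer_id"] == customer_id and row["valid_to"] is None:
            if row["plan_tier"] == new_tier:
                return dim_rows  # no change, nothing to record
            row["valid_to"] = change_date
    dim_rows.append({
        "customer_sk": len(dim_rows) + 1,  # surrogate key
        "customer_id": customer_id,        # natural key
        "plan_tier": new_tier,
        "valid_from": change_date,
        "valid_to": None,                  # None marks the current row
    })
    return dim_rows

dim_customer = []
apply_scd2(dim_customer, "C1", "free", date(2024, 1, 1))
apply_scd2(dim_customer, "C1", "pro", date(2024, 3, 1))
for row in dim_customer:
    print(row)
```

Type 1 would instead overwrite `plan_tier` in place, and Type 3 would keep a single row with an extra `previous_plan_tier` column.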
🧱Modeling

Surrogate key

Synthetic primary key (e.g. an auto-int or hash), independent of business keys.

“Use a surrogate key on dim_customer so SCD2 history works.”
🧱Modeling

Natural key / Business key

Real-world identifier from the source system (e.g. order_id, email).

“Keep the natural key as a column even when using a surrogate PK.”
🧱Modeling

Grain

What one row of a fact table represents — declare it before building.

“The grain of fact_orders is one row per order line.”
🧱Modeling

Data Vault

Hubs + Links + Satellites — modeling style optimized for auditability and source change.

“Banks love Data Vault for its lineage and historization.”
🧱Modeling

Medallion (Bronze/Silver/Gold)

Layered lake architecture: raw bronze → cleaned silver → consumer gold.

“BI dashboards only read from the gold layer.”
🧱Modeling

Wide / OBT (One Big Table)

Pre-joined denormalized table for fast read — common in analytics layers.

“One Big Table for the dashboard avoids join cost at query time.”
🧱Modeling

Normalization

Splitting data into many tables to remove redundancy — OLTP default.

“3NF is fine for OLTP; analytics prefer denormalized stars.”
🧱Modeling

Denormalization

Duplicating data into one table to make reads cheap.

“Denormalize dim attributes onto fact when joins get expensive.”
🧱Modeling

Metrics layer / Semantic layer

Single source of truth for business metric definitions (MRR, ARPU…).

“dbt MetricFlow centralizes metric SQL once, used by every BI tool.”
🌊Streaming

Apache Kafka

Distributed append-only log — the backbone of event-driven data.

“Every event hits Kafka first, then fans out to consumers.”
🌊Streaming

Topic

Named stream in Kafka — split into partitions.

“orders.created and orders.shipped are separate topics.”
🌊Streaming

Consumer group

Set of consumers that share partitions — each partition goes to one member.

“Scale consumers up to partition count for parallelism.”
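A simplified illustration of why consumers beyond the partition count add no parallelism. This uses plain round-robin; Kafka's actual assignors (range, sticky, and others) differ in detail, so treat it as a sketch of the principle only.

```python
def assign_partitions(partitions, consumers):
    """Round-robin assignment: each partition goes to exactly one consumer."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

three = assign_partitions([0, 1, 2, 3], ["c1", "c2", "c3"])
six = assign_partitions([0, 1, 2, 3], ["c1", "c2", "c3", "c4", "c5", "c6"])
print(three)  # -> {'c1': [0, 3], 'c2': [1], 'c3': [2]}
print(six)    # c5 and c6 receive no partitions and sit idle
```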
🌊Streaming

Exactly-once

Each event affects state exactly once even with retries and failures.

“Kafka + Flink with checkpoints and a transactional sink gives end-to-end exactly-once.”
🌊Streaming

At-least-once

Events may be delivered more than once — consumers must be idempotent.

“Most systems are at-least-once by default.”
🌊Streaming

Watermark (streaming)

Time marker telling the engine 'no more events older than this'.

“Late events past the watermark are dropped or sent to a side output.”
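A sketch of one common watermark heuristic, assuming the watermark trails the largest event time seen by a fixed allowed lateness. Event times are plain integers here, and real engines typically advance watermarks periodically rather than per event.

```python
def split_late_events(events, allowed_lateness):
    """Track watermark = max event time seen - allowed lateness; divert late events."""
    watermark = float("-inf")
    on_time, late = [], []
    for event_time, payload in events:
        if event_time < watermark:
            late.append((event_time, payload))  # side output
            continue
        on_time.append((event_time, payload))
        watermark = max(watermark, event_time - allowed_lateness)
    return on_time, late

events = [(100, "a"), (105, "b"), (103, "c"), (90, "d"), (110, "e")]
on_time, late = split_late_events(events, allowed_lateness=5)
print([p for _, p in on_time], [p for _, p in late])  # -> ['a', 'b', 'c', 'e'] ['d']
```

Event "c" arrives out of order but within the lateness budget, so it is kept; "d" is older than the watermark and goes to the side output.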
🌊Streaming

Event time vs processing time

Event time = when it happened. Processing time = when we saw it.

“Always window on event time — processing time gives wrong results on backfills.”
🌊Streaming

Windowing

Grouping events into bounded chunks (tumbling, sliding, session) for aggregation.

“Tumbling 5-min windows count clicks per page.”
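A minimal sketch of tumbling windows keyed on event time, assuming events are `(event_time_seconds, page)` tuples. Each event falls into exactly one window, identified by its start time.

```python
from collections import Counter

WINDOW_SECONDS = 300  # 5-minute tumbling windows

def tumbling_counts(events):
    """Count clicks per (window_start, page), bucketing by event time."""
    counts = Counter()
    for event_time, page in events:
        window_start = event_time - (event_time % WINDOW_SECONDS)
        counts[(window_start, page)] += 1
    return counts

clicks = [(0, "/home"), (120, "/home"), (299, "/pricing"), (300, "/home"), (650, "/home")]
counts = tumbling_counts(clicks)
for key in sorted(counts):
    print(key, counts[key])
```

Sliding windows would let one event land in several overlapping windows, and session windows would close after a gap of inactivity instead of at fixed boundaries.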
🌊Streaming

Stateful processing

The stream operator keeps memory across events (aggregates, joins, sessions).

“Flink's RocksDB state backend handles huge stateful jobs.”
🌊Streaming

Apache Flink

True streaming engine — event-by-event, low latency, strong state.

“Pick Flink for sub-second SLAs and complex stateful joins.”
🌊Streaming

Spark Structured Streaming

Micro-batch streaming on the Spark engine — easier ops than Flink.

“We use Structured Streaming because the team already knows Spark.”
🌊Streaming

Micro-batch

'Streaming' implemented by tiny batches every N seconds.

“Snowpipe and Auto Loader are micro-batch under the hood.”
🌊Streaming

Kafka Connect

Framework for source/sink connectors that move data in/out of Kafka.

“Debezium runs as a Kafka Connect source.”
🌊Streaming

ksqlDB / Kafka Streams

ksqlDB: streaming SQL server for Kafka topics. Kafka Streams: Java library that processes Kafka data inside your application, with no separate processing cluster.

“ksqlDB joins two topics with a SELECT.”
🌊Streaming

Backpressure

Slow consumer signals upstream to slow down, preventing memory blow-up.

“Flink propagates backpressure across operators automatically.”
🎼Orchestration

DAG

Directed Acyclic Graph — nodes are tasks, edges are dependencies, no cycles.

“Airflow DAGs define the daily ETL.”
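To make the "no cycles" rule concrete, here is a sketch using the standard library's graphlib module (Python 3.9+). The task names are hypothetical; an orchestrator does the same ordering internally before running anything.

```python
from graphlib import CycleError, TopologicalSorter

# Map each task to the set of tasks it depends on.
dag = {
    "extract": set(),
    "transform": {"extract"},
    "test": {"transform"},
    "publish": {"transform", "test"},
}

order = list(TopologicalSorter(dag).static_order())
print(order)  # -> ['extract', 'transform', 'test', 'publish']

def has_cycle(graph):
    """Return True if the dependency graph cannot be ordered."""
    try:
        list(TopologicalSorter(graph).static_order())
        return False
    except CycleError:
        return True

print(has_cycle({"a": {"b"}, "b": {"a"}}))  # -> True
```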
🎼Orchestration

Apache Airflow

Python-based scheduler — define DAGs in code, run them on a cluster.

“Airflow is the boring, reliable default for batch orchestration.”
🎼Orchestration

Dagster

Asset-oriented orchestrator — you declare data assets, deps are inferred.

“Dagster's asset graph maps cleanly to your dbt models.”
🎼Orchestration

Prefect

Modern Python orchestrator — dynamic flows, hybrid execution.

“Prefect 2 is lighter than Airflow for small teams.”
🎼Orchestration

Scheduler

Component that decides when to trigger a job (cron, sensor, manual).

“The scheduler triggered the DAG at 02:00 UTC.”
🎼Orchestration

Sensor

Task that waits for an external event (file arrived, table updated).

“S3KeySensor blocks until the file lands.”
🎼Orchestration

SLA / SLO / SLI

SLI=metric, SLO=target, SLA=contract with consequences. Pipelines need them.

“SLO: 99% of daily loads land by 6am.”
🎼Orchestration

Catchup / Backfill (Airflow)

Airflow creating a run for every schedule interval missed since the DAG's start_date or its last run.

“Set catchup=False on DAGs you don't want re-run for every missed interval.”
🎼Orchestration

Retry / Backoff

Auto re-running failed tasks with growing delay (exponential backoff).

“Set retries=3 with exponential backoff for flaky APIs.”
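A sketch of retries with exponential backoff. The `retry` helper and `flaky_api` function are hypothetical; the `sleep` parameter is injectable so the example records delays instead of actually waiting.

```python
import time

def retry(func, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call func; on failure wait base_delay * 2**attempt, for up to `retries` retries."""
    for attempt in range(retries + 1):
        try:
            return func()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)

calls = {"n": 0}

def flaky_api():
    """Fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"

delays = []
result = retry(flaky_api, sleep=delays.append)  # record delays instead of sleeping
print(result, delays)  # -> ok [1.0, 2.0]
```

Production code usually adds random jitter to each delay so that many clients do not retry in lockstep.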
🎼Orchestration

Idempotency key

Unique token a client sends so a server can dedupe retried calls.

“Send an idempotency key with every Stripe charge so a retry never bills twice.”
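A toy sketch of the server side of this pattern: responses are remembered by key, so a retried call replays the stored result instead of acting again. The `ChargeService` class is hypothetical and does not mirror any real payment API.

```python
class ChargeService:
    """Toy server that remembers responses by idempotency key."""

    def __init__(self):
        self._responses = {}
        self.charges_made = 0

    def charge(self, idempotency_key, amount_cents):
        if idempotency_key in self._responses:
            return self._responses[idempotency_key]  # replay, no new charge
        self.charges_made += 1
        response = {"charge_id": self.charges_made, "amount_cents": amount_cents}
        self._responses[idempotency_key] = response
        return response

service = ChargeService()
first = service.charge("order-42", 1999)
second = service.charge("order-42", 1999)  # client retry after a timeout
print(first == second, service.charges_made)  # -> True 1
```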
🔍Quality & Observability

Data quality

How well data fits its purpose: accuracy, completeness, freshness, uniqueness.

“dbt tests catch quality issues before BI users do.”
🔍Quality & Observability

Great Expectations

Python library declaring 'expectations' (assertions) over data.

“expect_column_values_to_not_be_null('user_id').”
🔍Quality & Observability

Soda / Soda Core

YAML-based data checks — runs in CI or in your warehouse on a schedule.

“Soda Core checks freshness < 6h on the orders table.”
🔍Quality & Observability

Freshness

How recently a table was updated vs SLA expectation.

“Freshness alert: dim_user not updated in 24h.”
🔍Quality & Observability

Volume check

Anomaly check: row count today vs typical — flags drops/spikes.

“Volume on fact_orders dropped 80% — page on-call.”
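A sketch of a simple volume check, assuming the baseline is the mean of recent daily row counts and the alert threshold is a 50% drop. Both choices are illustrative; real tools often use seasonality-aware models.

```python
from statistics import mean

def volume_alert(history, today, max_drop=0.5):
    """Flag today if the row count fell more than max_drop below the recent mean."""
    baseline = mean(history)
    change = (today - baseline) / baseline
    return change < -max_drop, change

history = [10_000, 10_400, 9_800, 10_200, 9_600]
alert, change = volume_alert(history, today=2_000)
print(alert, f"{change:.0%}")  # -> True -80%
```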
🔍Quality & Observability

Lineage

The dependency graph: which source feeds which model feeds which dashboard.

“OpenLineage emits events Marquez visualizes.”
🔍Quality & Observability

Data observability

Monitoring data the way SREs monitor services: freshness, volume, schema, distribution, lineage.

“Monte Carlo and Datafold are observability tools.”
🔍Quality & Observability

Anomaly detection

ML or stats catching unexpected changes (counts, distributions, freshness).

“Monte Carlo flagged a 3σ drop in distinct user_ids.”
🔍Quality & Observability

dbt test

Built-in dbt assertions: unique, not_null, accepted_values, relationships.

“dbt test fails the build if a PK has duplicates.”
🔍Quality & Observability

Data diff

Comparing two table versions row-by-row to see what a change would alter.

“Datafold data-diff in PRs prevents silent breakages.”
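A minimal sketch of a keyed data diff, assuming each table version is a dict from primary key to row tuple. It reports which keys were added, removed, or changed between the two versions.

```python
def data_diff(before, after):
    """Compare two table versions keyed by primary key."""
    added = sorted(after.keys() - before.keys())
    removed = sorted(before.keys() - after.keys())
    changed = sorted(k for k in before.keys() & after.keys() if before[k] != after[k])
    return {"added": added, "removed": removed, "changed": changed}

before = {1: ("alice", "free"), 2: ("bob", "pro"), 3: ("cara", "free")}
after = {1: ("alice", "free"), 2: ("bob", "team"), 4: ("dan", "free")}

diff = data_diff(before, after)
print(diff)  # -> {'added': [4], 'removed': [3], 'changed': [2]}
```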
📜Governance & Contracts

Data contract

Producer-consumer agreement on schema, semantics, SLAs — versioned, breakable only on bump.

“The data contract forces the backend team to bump version before renaming a column.”
📜Governance & Contracts

Data Mesh

Org pattern: domain teams own and serve their data as products.

“We're moving from central platform to data mesh.”
📜Governance & Contracts

Data product

A dataset treated as a product: owner, SLA, docs, contract, discoverability.

“The 'active_users' data product has a PM and a roadmap.”
📜Governance & Contracts

Data catalog

Searchable inventory of datasets with owner, schema, docs, lineage.

“DataHub, Atlas, Unity Catalog, OpenMetadata.”
📜Governance & Contracts

Unity Catalog

Databricks' unified governance: tables, ML, files, lineage, audit, fine-grained access.

“Unity Catalog replaces table ACLs + Hive metastore.”
📜Governance & Contracts

PII

Personally Identifiable Information — must be tagged, masked, access-controlled.

“Tag email and phone as PII; mask in non-prod.”
📜Governance & Contracts

GDPR / LGPD

EU / Brazilian data protection laws — right to erasure, consent, purpose limitation.

“LGPD requires a 'delete user data' workflow across the lake.”
📜Governance & Contracts

RBAC / ABAC

Role-based vs attribute-based access control.

“Snowflake uses RBAC; row-access policies enable ABAC.”
📜Governance & Contracts

Dynamic data masking

Hiding sensitive values at query time based on the caller's role.

“Analysts see masked email; security sees the real value.”
📜Governance & Contracts

Data steward

Person accountable for a dataset's definitions, quality, access.

“Each domain has a data steward registered in the catalog.”
📜Governance & Contracts

OpenLineage

Open standard for emitting lineage events from any tool.

“Airflow, dbt, Spark all speak OpenLineage now.”
☁️Cloud & Infra

IaC (Infrastructure as Code)

Provisioning infra via versioned code instead of clicks.

“Terraform manages all our buckets and IAM.”
☁️Cloud & Infra

Terraform / OpenTofu

Cloud-agnostic IaC tool — declares desired state, plans the diff, applies.

“terraform plan; terraform apply.”
☁️Cloud & Infra

VPC

Virtual Private Cloud — isolated network in AWS/GCP/Azure.

“The warehouse lives in the prod VPC; only the bastion can SSH.”
☁️Cloud & Infra

IAM

Identity & Access Management — who can do what on cloud resources.

“Give the loader role read-only IAM on the bucket.”
☁️Cloud & Infra

Kubernetes (K8s)

Container orchestrator — manages pods, services, scaling.

“Spark on Kubernetes replaces YARN in many shops.”
☁️Cloud & Infra

Serverless

Compute that scales to zero and bills per invocation (Lambda, Cloud Run).

“Tiny ingestion jobs are perfect for Lambda.”
☁️Cloud & Infra

Egress cost

Cloud charges for data leaving its network — sneaky bill killer.

“Cross-region replication doubled the bill via egress.”
☁️Cloud & Infra

BigQuery

GCP's serverless analytics warehouse — pay per scanned byte or slot.

“Partition + cluster cuts BigQuery scan cost dramatically.”
☁️Cloud & Infra

Snowflake

Cloud-native warehouse separating storage from per-second compute (warehouses).

“Use an XS warehouse for dev; size up for prod loads.”
☁️Cloud & Infra

Redshift

AWS's warehouse — classic provisioned or Serverless modes.

“Redshift Serverless removes node sizing headaches.”
☁️Cloud & Infra

Athena

AWS serverless SQL over S3 — Trino under the hood.

“Query Parquet on S3 directly with Athena, no cluster.”
☁️Cloud & Infra

AWS Glue

AWS managed ETL + data catalog on Spark.

“Glue Crawlers populate the Data Catalog from S3.”
☁️Cloud & Infra

Databricks

Managed Spark + Delta + ML — turned into the 'lakehouse' platform.

“Databricks Workflows replaces Airflow for many shops.”
☁️Cloud & Infra

Microsoft Fabric

Microsoft's unified data platform: OneLake + Synapse + Power BI under SKU pricing.

“Fabric's OneLake is one logical lake across services.”
🤖LLMs & AI

LLM

Large Language Model — transformer trained on huge text corpora.

“GPT-4, Claude, Llama 3 are LLMs.”
🤖LLMs & AI

Token

Sub-word unit an LLM reads/writes — billing and context limits use tokens.

“1 token ≈ 4 chars in English; pricing is per-million tokens.”
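The rule of thumb above turns into simple arithmetic. This sketch assumes roughly 4 characters per token and a hypothetical price of 3 USD per million tokens; actual tokenization and pricing vary by model.

```python
CHARS_PER_TOKEN = 4  # rough English average, not an exact tokenizer

def estimate_cost(text, usd_per_million_tokens):
    """Estimate token count and cost from character length alone."""
    tokens = len(text) / CHARS_PER_TOKEN
    cost = tokens * usd_per_million_tokens / 1_000_000
    return tokens, cost

tokens, cost = estimate_cost("x" * 4_000_000, usd_per_million_tokens=3.0)
print(tokens, cost)  # -> 1000000.0 3.0
```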
🤖LLMs & AI

Context window

Max tokens the model can consider in one call (input + output).

“Don't stuff 1M tokens just because it fits — cost and latency soar.”
🤖LLMs & AI

Embedding

Vector representation of text/image — close vectors = similar meaning.

“Embed all docs, store in a vector DB, search by cosine similarity.”
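A sketch of similarity search by cosine similarity. The 3-dimensional vectors are made up for illustration; real embedding models emit hundreds or thousands of dimensions, and a vector DB replaces the linear scan with an index.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical document embeddings.
docs = {
    "refund policy": [0.9, 0.1, 0.0],
    "kafka tuning": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]

best = max(docs, key=lambda name: cosine_similarity(query, docs[name]))
print(best)  # -> refund policy
```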
🤖LLMs & AI

RAG (Retrieval-Augmented Generation)

Retrieve relevant chunks, stuff them into the prompt, then generate.

“RAG over the docs gives grounded answers with citations.”
🤖LLMs & AI

Vector database

DB optimized for nearest-neighbor search on embeddings (Pinecone, pgvector, Weaviate).

“pgvector lets Postgres double as your vector DB.”
🤖LLMs & AI

Chunking

Splitting docs into pieces sized for embedding + retrieval.

“Chunk by semantic boundary, not fixed token size.”
🤖LLMs & AI

Reranker

Second-pass model that re-orders retrieved results by true relevance.

“Cohere Rerank pushes the best chunks to the top.”
🤖LLMs & AI

Agent

LLM that plans + calls tools in a loop to accomplish a goal.

“The agent searches, reads, then writes a SQL query.”
🤖LLMs & AI

Tool calling / Function calling

LLM emits structured JSON to invoke a function — your code runs it.

“Define get_weather(city); the model decides when to call it.”
🤖LLMs & AI

Fine-tuning

Continuing training of a base model on your task-specific data.

“Fine-tune Llama 3 on internal tickets to match company tone.”
🤖LLMs & AI

Prompt engineering

Crafting LLM inputs (system, role, examples) to get better outputs.

“Few-shot examples beat zero-shot for tricky formats.”
🤖LLMs & AI

Hallucination

Confidently-wrong LLM output — invents facts not in input/data.

“RAG reduces hallucinations by grounding in real docs.”
🤖LLMs & AI

Eval (LLM evaluation)

Systematic measurement of LLM output quality (Ragas, LangSmith, custom).

“Run evals in CI before shipping a new prompt.”
🤖LLMs & AI

Guardrails

Filters/validators around LLM input/output (PII redaction, jailbreak detection, JSON schema).

“Guardrails reject malformed JSON before it hits production.”
🤖LLMs & AI

MLOps

DevOps practices for ML: training pipelines, model registry, monitoring drift.

“MLflow tracks experiments and registers production models.”
🤖LLMs & AI

Feature store

Repository serving consistent features to training AND online inference.

“Feast unifies offline (warehouse) + online (Redis) feature lookups.”
Performance

Shuffle

Redistributing rows across nodes for a join/group — Spark's #1 cost.

“Reduce shuffle: bucket pre-joined tables, use broadcast joins.”
Performance

Broadcast join

Replicate the small side of a join to every executor — skips shuffle.

“Spark auto-broadcasts tables under 10MB by default.”
Performance

Data skew

One key has way more rows than others — one task takes forever.

“Salt the skewed key to distribute load.”
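A sketch of key salting, assuming a rotating salt derived from the row index (real jobs often use a random salt). One hot key is spread over `NUM_SALTS` sub-keys so that no single task receives all of its rows.

```python
from collections import Counter

NUM_SALTS = 4

def salted_key(key, row_index):
    """Append a rotating salt so one hot key spreads over NUM_SALTS sub-keys."""
    return f"{key}#{row_index % NUM_SALTS}"

rows = ["big_customer"] * 8 + ["small_customer"] * 2
before = Counter(rows)
after = Counter(salted_key(k, i) for i, k in enumerate(rows))

print(max(before.values()), max(after.values()))  # -> 8 2
```

For a join, the other side must be replicated once per salt value so every salted key still finds its match; for an aggregation, a second pass combines the partial results per original key.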
Performance

Predicate pushdown

Pushing WHERE filters down to the scan so the reader skips files and row groups that cannot match.

“Parquet supports predicate pushdown on min/max stats.”
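A sketch of how min/max statistics allow skipping, assuming three row groups with recorded bounds for a hypothetical `amount` column. For a filter `amount > threshold`, any group whose maximum is at or below the threshold cannot contain a match.

```python
row_groups = [
    {"id": 0, "min_amount": 1, "max_amount": 50},
    {"id": 1, "min_amount": 40, "max_amount": 120},
    {"id": 2, "min_amount": 500, "max_amount": 900},
]

def groups_to_read(groups, threshold):
    """For WHERE amount > threshold, keep only groups that might contain a match."""
    return [g["id"] for g in groups if g["max_amount"] > threshold]

print(groups_to_read(row_groups, 100))  # -> [1, 2]
```

Group 1 is still read even though most of its rows may fail the filter; statistics can only rule groups out, never confirm every row matches.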
Performance

Column pruning

Reading only the columns the query needs — free with columnar formats.

“SELECT * defeats column pruning. List the columns.”
Performance

EXPLAIN / Query plan

Engine's plan to run a query — scan, join, aggregate ordering.

“EXPLAIN ANALYZE shows actual rows vs estimated.”
Performance

Materialized view

Precomputed query stored as a table; refreshed periodically or incrementally.

“Snowflake dynamic tables are incrementally-refreshed MVs.”
Performance

Caching (warehouse)

Reusing previous query results when underlying data hasn't changed.

“Snowflake's result cache returns repeated queries in <1s.”
Performance

AQE (Adaptive Query Execution)

Spark re-plans during execution based on real runtime stats.

“AQE auto-coalesces shuffle partitions and handles skew.”
Performance

Photon

Databricks' vectorized C++ engine — 2-3x faster than vanilla Spark on SQL.

“Enable Photon on SQL warehouses for cheaper $/query.”
Performance

DuckDB

In-process columnar OLAP DB — SQLite for analytics.

“DuckDB queries Parquet on a laptop faster than your cluster.”
🏗️Architecture

Lambda architecture

Two paths: slow batch (truth) + fast stream (low latency), merged at serve time.

“Mostly historical now — Kappa replaced it in most stacks.”
🏗️Architecture

Kappa architecture

One streaming path for everything — reprocess history by replaying.

“Kappa works when your storage can replay events forever.”
🏗️Architecture

CQRS

Separate read and write models — writes go to one store, reads to another.

“Postgres for writes, denormalized search index for reads.”
🏗️Architecture

Event sourcing

Store every state change as an immutable event; current state = replay.

“Event sourcing makes audit and time-travel native.”
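A minimal sketch of event sourcing, assuming a hypothetical account with `deposited` and `withdrawn` events. The current state is never stored directly; it is derived by replaying the log, and replaying a prefix gives the state at an earlier point.

```python
def replay(events):
    """Fold an immutable event log into the current balance."""
    balance = 0
    for kind, amount in events:
        if kind == "deposited":
            balance += amount
        elif kind == "withdrawn":
            balance -= amount
    return balance

events = [("deposited", 100), ("withdrawn", 30), ("deposited", 5)]

current = replay(events)            # state now
as_of_second = replay(events[:2])   # state after the first two events
print(current, as_of_second)  # -> 75 70
```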
🏗️Architecture

Central platform vs Data Mesh

Central team owns everything vs domains own their own data products.

“Below ~30 engineers, central wins. Above, mesh starts to pay off.”
🏗️Architecture

Fitness function

Automated check that the architecture still meets quality goals (perf, deps, cost).

“CI fails if a query plan crosses cost threshold.”
🏗️Architecture

Data warehouse

Structured, query-optimized store for analytics (Snowflake, BigQuery, Redshift).

“Warehouse for BI, lake for raw and ML.”
🏗️Architecture

Data lake

Cheap object storage holding raw, semi-structured, and structured data.

“The lake on S3 stores everything; warehouse only the curated part.”
🏗️Architecture

Data mart

Subset of a warehouse focused on one domain (finance, marketing).

“Marketing has its own data mart with attribution models.”
🛠️Operations

Blue/Green deploy

Run two versions; route traffic to the new only after it's verified.

“Build the new gold table side-by-side, then swap.”
🛠️Operations

Canary release

Send a small % of traffic to the new version first.

“Canary the new dbt model to 10% of dashboards.”
🛠️Operations

Rollback

Reverting to the previous good state after a bad deploy.

“Iceberg lets us rollback to a snapshot in one statement.”
🛠️Operations

CI/CD

Auto-build/test on every PR + auto-deploy to envs.

“GitHub Actions runs dbt build on every PR before merge.”
🛠️Operations

On-call

Engineer responsible for responding to alerts off-hours.

“PagerDuty rotates the data on-call weekly.”
🛠️Operations

Runbook

Step-by-step doc to handle a known incident.

“The runbook for 'pipeline late' explains how to triage and restart.”
🛠️Operations

Postmortem

Blameless writeup of an incident: what happened, why, and what changes.

“Every Sev1 gets a postmortem within 48h.”
🛠️Operations

Blast radius

How much breaks when a single component fails.

“Splitting the monolith reduced blast radius.”
🛠️Operations

DORA metrics

Deploy frequency, lead time, change failure rate, MTTR — DevOps north stars.

“Data teams should measure DORA too.”
🛠️Operations

MTTR

Mean Time To Recovery — how fast we restore service after an incident.

“Better alerts cut MTTR from 2h to 15min.”
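MTTR is a plain average of recovery durations. This sketch assumes each incident is recorded as a `(detected, restored)` pair of timestamps; the sample incidents are made up.

```python
from datetime import datetime

def mttr_minutes(incidents):
    """Average minutes between detection and recovery across incidents."""
    durations = [
        (restored - detected).total_seconds() / 60
        for detected, restored in incidents
    ]
    return sum(durations) / len(durations)

incidents = [
    (datetime(2024, 5, 1, 2, 0), datetime(2024, 5, 1, 2, 10)),
    (datetime(2024, 5, 9, 14, 0), datetime(2024, 5, 9, 14, 20)),
]

mttr = mttr_minutes(incidents)
print(mttr)  # -> 15.0
```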
🛠️Operations

FinOps

Practice of optimizing cloud cost continuously, owned by eng + finance.

“FinOps reviews flagged the runaway BigQuery slot usage.”