MODULE 1

Data Architecture Layers

Source → ingestion → storage → transformation → serving → consumption.

Modern Data Stack

Layer | Purpose | Tools
Sources | OLTP DBs, SaaS APIs, events, files | Postgres, Stripe, Salesforce, S3
Ingestion | Load raw data into warehouse / lake | Fivetran, Airbyte, Debezium, custom
Storage | Durable persistent layer | S3, ADLS, GCS; Snowflake, BigQuery, Redshift
Transform | SQL / Python on raw → models | dbt, SQLMesh, Spark, Flink
Orchestrate | Schedule + dependency mgmt | Airflow, Dagster, Prefect, Argo
Quality | Validation, freshness, anomaly detection | Great Expectations, dbt tests, Soda, Monte Carlo
Catalog / lineage | Discoverability + impact analysis | DataHub, OpenLineage, Atlan, Unity Catalog
Serving | Reverse ETL, BI, ML features | Hightouch, Looker, Tableau, Feast

Architectural Patterns

  • Lambda — batch + speed layers + serving merge. Two codepaths to maintain.
  • Kappa — single streaming pipeline reprocesses log on change.
  • Lakehouse — open table format on object storage (Iceberg / Delta / Hudi). One layer for batch + stream + ML.
  • Data Mesh — domain-owned data products + federated governance.
  • Medallion (Bronze/Silver/Gold) — raw → cleaned → business-ready zoning convention.
MODULE 2

Ingestion: ETL vs ELT & CDC

Move bytes reliably. The modern default: extract + load now, transform later.

ELT vs ETL

  • ETL — transform before load. Pre-warehouse era; brittle, slow.
  • ELT — load raw, transform in warehouse with SQL. Cheap compute + replay-friendly. Standard now.
  • Hybrid: light parsing / PII redaction in flight; semantic transforms in warehouse.

Change Data Capture

  • Replicate row-level changes from OLTP source. Avoids full-table scans + staleness.
  • Log-based: read DB redo log (Postgres WAL, MySQL binlog). Lowest impact, exact ordering. Tools: Debezium, AWS DMS, Fivetran log mode.
  • Trigger-based: DB triggers write to audit table. Adds OLTP load.
  • Query-based: poll an updated_at column (see the polling sketch after this list). Misses hard deletes; capturing deletions requires a soft-delete flag in the source.
  • Schema evolution: ALTER TABLE in source must propagate. Plan handler.
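
A minimal sketch of the query-based approach, assuming a hypothetical orders table with an updated_at column; :last_watermark stands for the high-water mark the extraction job persisted after its previous run:

-- query-based CDC sketch (names illustrative)
select *
from orders
where updated_at > :last_watermark
order by updated_at;
-- hard-deleted rows never appear in this result; pair with a soft-delete flag in the source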

File Formats

Format | Type | Use
JSON / NDJSON | Row, text | Logs, semi-structured ingest
CSV | Row, text | Legacy interchange
Avro | Row, binary, schema | Streaming (Kafka), schema evolution
Parquet | Columnar | Analytics, lakehouse default
ORC | Columnar | Hive ecosystem
Arrow | Columnar, in-memory | Cross-engine zero-copy
MODULE 3

Data Warehouses

Columnar MPP for analytics. Snowflake / BigQuery / Redshift / Databricks SQL.

Columnar MPP Internals

  • Columnar storage: same column packed; compresses well; only read needed cols.
  • Encoding: dictionary, run-length, delta, bit-packing.
  • Vectorized execution: operators process batches of column values at once, exploiting CPU SIMD.
  • MPP: data partitioned across nodes; each node runs same op locally; results merged.
  • Pruning: zone maps / min-max + bloom filters skip blocks. Partition + cluster by high-selectivity cols.

Snowflake / BigQuery / Redshift

System | Compute | Pricing model | Notes
Snowflake | Virtual warehouses (clusters); decoupled from storage | Per-second compute + storage | Multi-cluster; autosuspend; time travel; zero-copy clone
BigQuery | Serverless; slot-based | On-demand $/TB scanned, or flat-rate slots | Native streaming inserts; partition + cluster only
Redshift (RA3) | Managed cluster; managed storage | Per-hour cluster + storage | Spectrum to S3; concurrency scaling; AQUA
Databricks SQL | Photon engine on lakehouse | DBU / hour | Native to Delta + Iceberg; ML + BI shared

Performance Levers

  • Partition by date / tenant; cluster (sort) by high-selectivity filter columns (sketch after this list).
  • Prune via WHERE on partition column. Always.
  • Avoid SELECT * — kills column pruning.
  • Pre-aggregate hot dashboards into materialized views / cached tables.
  • Stale or skewed stats hurt joins → ANALYZE / refresh.
  • Warehouse sizing: scale up for one-shot heavy query; multi-cluster for concurrency.
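
A BigQuery-flavored sketch of the first two levers (table and column names illustrative):

-- declare partitioning + clustering at DDL time
create table analytics.events
partition by date(event_ts)
cluster by tenant_id
as select * from raw.events;

-- pruned query: WHERE on the partition column lets the engine skip whole blocks
select tenant_id, count(*) as n
from analytics.events
where date(event_ts) >= date '2024-01-01'  -- partition pruning
  and tenant_id = 'acme'                   -- cluster pruning
group by tenant_id;
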
MODULE 4

Data Modeling

Dimensional + denorm patterns. SCDs. Layered lineage.

Kimball Star Schema

  • Fact table — measurements (orders, clicks). Foreign keys to dims + numeric measures.
  • Dimension table — descriptive attributes (customer, product, date).
  • Star = facts → dims directly. Snowflake = dims normalized further.
  • Grain = "what does one row mean?" Define explicitly per fact table (see the sketch below).
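
A sketch of the resulting shape, with an illustrative grain of one row per order line:

-- star query (names illustrative; fct_order_lines grain = one order line)
select d.region, p.category, sum(f.amount) as revenue
from fct_order_lines f
join dim_customer d on f.customer_key = d.customer_key
join dim_product p on f.product_key = p.product_key
group by d.region, p.category;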

Slowly Changing Dimensions

Type | Behavior
Type 0 | No change tracking; ignore updates
Type 1 | Overwrite — current truth only
Type 2 | New row per change with valid_from / valid_to + is_current
Type 3 | Add prior_value column
Type 4 | Separate history table
Type 6 | Hybrid of 1+2+3 (current + history + previous)
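
A Type 2 upsert sketch (Postgres-flavored; table and attribute names illustrative). Many teams get this pattern from a dbt snapshot instead of hand-writing it:

-- step 1: expire the current row when a tracked attribute changed
update dim_customer t
set valid_to = s.updated_at, is_current = false
from staging_customer s
where t.customer_id = s.customer_id
  and t.is_current
  and (t.email <> s.email or t.segment <> s.segment);

-- step 2: insert the new current version (new customers included)
insert into dim_customer (customer_id, email, segment, valid_from, valid_to, is_current)
select customer_id, email, segment, updated_at, null, true
from staging_customer s
where not exists (
  select 1 from dim_customer t
  where t.customer_id = s.customer_id
    and t.is_current
    and t.email = s.email
    and t.segment = s.segment
);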

Medallion Layering

  • Bronze (raw) — exact source byte-for-byte. Append-only. PII as-is, encrypted.
  • Silver (cleaned) — typed, deduplicated, joined with conformed dims. Useful but not business-shaped (sketch after this list).
  • Gold (mart) — denormalized, business metrics, aggregates, fit-for-BI.
  • Reprocess Silver/Gold from Bronze; never edit Bronze.
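
A Bronze → Silver step might look like this (names illustrative): cast types and keep the latest version of each row.

-- bronze → silver sketch: type, dedup, latest record per id wins
create or replace table silver.orders as
with ranked as (
  select cast(id as bigint)            as id,
         cast(total as decimal(18, 2)) as total,
         cast(updated_at as timestamp) as updated_at,
         row_number() over (partition by id order by updated_at desc) as rn
  from bronze.orders
)
select id, total, updated_at
from ranked
where rn = 1;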

One Big Table (OBT) vs Star

Modern columnar engines tolerate wide denorm tables. OBT = simpler queries, faster on dashboards; star = better for ad-hoc + smaller writes. Hybrid: star at Silver, OBT at Gold for hot dashboards.

MODULE 5

dbt & Transformation

SQL + Jinja + DAG. Tests + docs + lineage built in.

Core Concepts

  • Models — SELECT in .sql files; dbt wraps as table / view / incremental.
  • Sources — declare external tables; freshness checks.
  • Tests — schema (unique, not_null, accepted_values, relationships) + custom singular (sketch after this list).
  • Macros — Jinja functions, reused SQL.
  • Snapshots — built-in SCD2 capture.
  • Seeds — small CSVs version-controlled.
  • Exposures — declare downstream BI / app usage.
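
A custom singular test is just a SELECT that must return zero rows; a sketch against a hypothetical fct_orders model:

-- tests/assert_no_negative_totals.sql: any rows returned = test failure
select order_id, total
from {{ ref('fct_orders') }}
where total < 0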

Materializations

Type | When
view | Cheap, small, pass-through
table | Heavy compute reused; full rebuild OK
incremental | Big tables; only new/changed rows
ephemeral | CTE-inlined; no DB object
snapshot | SCD2 history

-- incremental example
{{ config(materialized='incremental', unique_key='id', incremental_strategy='merge') }}
select *
from {{ source('app', 'orders') }}
{% if is_incremental() %}
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}

Best Practices

  • Layer naming: stg_ (raw rename + types), int_ (joins, intermediate), fct_ + dim_ (mart).
  • 1 model = 1 logical entity. Long SQL → split with intermediate models, not subquery soup.
  • Tests on PKs (unique + not_null) on every model.
  • CI: dbt build on PR; Slim CI runs only modified + downstream.
  • Document at the model + column level. Docs site auto-generated.
MODULE 6

Orchestration

Schedule, retry, recover, observe pipelines.

Tooling

Tool | Model | Strength
Airflow | Task DAG, Python | Ubiquitous; rich operators + connectors; large ecosystem
Dagster | Asset-oriented | Lineage + types + dev loop; software-defined assets
Prefect | Pythonic flows | Hybrid execution; dynamic DAGs
Argo Workflows | K8s CRD | Container-native; scales horizontally
Step Functions | AWS managed state machine | Serverless; tight AWS integration

Principles

  • Idempotent tasks — same partition rerun = same output. Required for safe retry.
  • Backfill safe — task accepts logical date; doesn't depend on wall clock.
  • Atomic writes — write to staging, then COMMIT / RENAME / SWAP (sketch after this list); partially written data is never visible.
  • SLA + alerts — freshness threshold; page on miss.
  • No business logic in DAG — DAGs orchestrate; logic lives in SQL/dbt/jobs.
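
One way to get atomic writes in a warehouse, sketched in Snowflake syntax (table names illustrative; :logical_date stands for the orchestrator-supplied run date):

-- build into staging for the logical date, validate, then swap atomically
create or replace table analytics.orders_staging as
select * from silver.orders where order_date = :logical_date;  -- logical date, not wall clock

-- ... run quality checks against orders_staging here ...

alter table analytics.orders swap with analytics.orders_staging;  -- atomic metadata swap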

Dependencies

  • Time-based scheduling: simple but brittle (upstream late → downstream still runs).
  • Sensor / signal-based: wait for upstream completion or data availability.
  • Asset-based (Dagster, dbt Cloud): "this asset is materialized; downstream can run".
  • Data contracts: producer + consumer schema agreement. Break in CI, not in prod.
MODULE 7

Streaming & Real-Time

Kafka + Flink / Spark Structured Streaming / Materialize. Bounded-latency processing.

Kafka Fundamentals

  • Topic = ordered log; partitions = unit of parallelism + ordering.
  • Producer key → partition (hash). Same key → same partition → ordered.
  • Consumer group: each partition assigned to one consumer in group. Scale by adding partitions + consumers.
  • Retention: time / size based. Compact topics retain latest value per key (DB-like).
  • Offsets stored in __consumer_offsets; auto-commit dangerous, prefer at-least-once + idempotent sinks.

Delivery Semantics

Mode | Mechanism
At-most-once | Commit before process; lose on crash
At-least-once | Process before commit; duplicates on crash
Exactly-once | Transactional producer + idempotent + read-committed; or sink dedup
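
At-least-once plus an idempotent sink is the pragmatic default; a MERGE keyed on the event id makes redelivered duplicates harmless (names illustrative):

-- idempotent sink sketch: replayed events collapse on event_id
merge into events t
using staged_batch s
  on t.event_id = s.event_id
when not matched then
  insert (event_id, user_id, event_ts, payload)
  values (s.event_id, s.user_id, s.event_ts, s.payload);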

Windows

  • Tumbling — fixed, non-overlapping (every 5 min).
  • Hopping / sliding — fixed size, slide step (5 min size, 1 min step).
  • Session — close after gap of inactivity (user activity sessions).
  • Event time vs processing time — event time tolerates late arrivals via watermarks.
  • Watermark — bound on "no event older than this is expected". Closes the window (Flink SQL sketch after this list).
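
In Flink SQL, event time, watermarks, and windows map directly onto DDL plus a window TVF; a sketch using the built-in datagen connector so it runs self-contained (names illustrative):

-- event-time column + watermark declared in the DDL (30s lateness tolerance)
create table clicks (
  user_id string,
  event_ts timestamp(3),
  watermark for event_ts as event_ts - interval '30' second
) with ('connector' = 'datagen');

-- 5-minute tumbling window on event time
select window_start, window_end, count(*) as clicks
from table(tumble(table clicks, descriptor(event_ts), interval '5' minutes))
group by window_start, window_end;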

Stream Engines

  • Flink — true streaming, exactly-once semantics, savepoints, stateful ops at scale. Industry standard for complex pipelines.
  • Spark Structured Streaming — micro-batch (or continuous experimental); reuses Spark SQL.
  • Kafka Streams / ksqlDB — JVM lib; simple in-Kafka topology.
  • Materialize / RisingWave — streaming SQL; incrementally maintained materialized views.
  • Beam — unified batch + stream API; runs on Flink / Dataflow.

Streaming CDC

Debezium tails DB log → Kafka topic per table. Downstream consumers (Flink, Materialize) maintain near-real-time replicas, enriched joins, denormalized views.

MODULE 8

Lakehouse: Iceberg / Delta / Hudi

ACID + schema + time travel on object storage.

Why Open Table Formats

  • Plain Parquet on S3: no atomic writes, no schema evolution, no row-level update.
  • Open table format adds metadata layer: snapshots, manifests, partition evolution, ACID via optimistic concurrency.
  • Multi-engine: Spark + Trino + Flink + Snowflake + DuckDB read same table.

Iceberg / Delta / Hudi

Format | Origin | Strength
Iceberg | Netflix → Apache | Hidden partitioning, partition evolution, broad engine support, REST catalog
Delta Lake | Databricks → Linux Foundation | Tight Spark integration; UniForm for cross-format interop
Hudi | Uber → Apache | Streaming upserts (MoR table type); incremental queries

Common Features

  • ACID via snapshot + commit log.
  • MERGE / UPDATE / DELETE — row-level mutations on Parquet.
  • Time travel — query as-of a snapshot or timestamp (sketch after this list).
  • Schema evolution — add / drop / rename columns.
  • Partition evolution (Iceberg) — change partition spec without rewrite.
  • File compaction + tombstone cleanup — required maintenance jobs.
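
Spark SQL sketches of the mutation and time-travel features on an Iceberg table (names and snapshot id illustrative):

-- row-level mutation on Parquet-backed files
merge into lake.orders t
using updates s
  on t.id = s.id
when matched then update set *
when not matched then insert *;

-- time travel: query an earlier snapshot or point in time
select * from lake.orders version as of 4348412412;
select * from lake.orders timestamp as of '2024-01-01 00:00:00';
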
MODULE 9

Data Quality & Contracts

Catch broken pipelines before consumers do.

Test Categories

Category | Examples
Schema | Required cols present; types match; nullability
Uniqueness / referential | PK unique; FK present in dim
Range / accepted values | 0 ≤ rate ≤ 1; status ∈ {…}
Volume | Row count within ±X% of 7-day avg
Freshness | max(updated_at) within SLA
Distribution | Mean / null rate within control limits
Custom business rule | Σ(line_items.amount) = orders.total
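
Freshness and volume checks reduce to small SQL assertions where returned rows mean failure (dbt-style convention; names and thresholds illustrative, interval syntax varies by dialect):

-- freshness: fail when the newest row breaches a 2-hour SLA
select max(updated_at) as newest
from silver.orders
having max(updated_at) < current_timestamp - interval '2' hour;

-- volume: fail when today's count is outside ±30% of the trailing 7-day average
with daily as (
  select cast(loaded_at as date) as d, count(*) as n
  from silver.orders
  group by 1
),
baseline as (
  select avg(n) as n_avg
  from daily
  where d >= current_date - 7 and d < current_date
)
select t.n, b.n_avg
from daily t cross join baseline b
where t.d = current_date
  and abs(t.n - b.n_avg) > 0.3 * b.n_avg;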

Tools

  • dbt tests — fast, in-pipeline. Unique / not_null / relationships built in.
  • Great Expectations — Python-driven expectations + docs.
  • Soda — declarative YAML checks.
  • Monte Carlo / Bigeye / Anomalo — ML-based anomaly + lineage.
  • Open standards: OpenLineage, OpenMetadata.

Data Contracts

  • Producer publishes typed schema + SLA to consumers.
  • Schema registry (Confluent SR, Apicurio) enforces compat (BACKWARD / FORWARD / FULL).
  • Breaking change = new version (v2 topic / table) + deprecation window.
  • CI test on producer side: schema diff against last published; block breaking changes.
MODULE 10

Spark / Distributed Compute

Big-data engine, for when SQL alone won't scale or the task isn't tabular.

Spark Internals

  • Driver builds DAG; cluster manager (YARN / Kubernetes / standalone) allocates executors.
  • Job → stages (split at shuffle boundaries) → tasks (one per partition).
  • Lazy evaluation — transformations build plan; action (count, write) triggers execution.
  • DataFrame / SQL → Catalyst optimizer → physical plan → Tungsten codegen.
  • Adaptive Query Execution (AQE) — replan at runtime: coalesce shuffle, skew handle, broadcast switch.

Shuffle

  • Wide deps (groupBy, join, repartition) move data across nodes — expensive.
  • Minimize shuffles: filter + project early, broadcast small side of join (< 10 MB default), avoid groupBy when reduceByKey suffices.
  • Skew: a hot key concentrates work on one task. Salt the key (sketch after this list), use AQE skew join, or split the hot key out.
  • Spill to disk when memory exceeded — ensure shuffle dirs sized + fast disks.
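
A salted-join sketch in Spark SQL (16 salts; names illustrative): the skewed fact side gets a random salt, the small dimension side is replicated across every salt, and the join key becomes (key, salt) so the hot key spreads over 16 tasks.

-- salt the skewed side; explode the small side across all salts
with salted_facts as (
  select f.*, cast(floor(rand() * 16) as int) as salt
  from facts f
),
salted_dim as (
  select d.*, s.salt
  from dim d
  lateral view explode(sequence(0, 15)) s as salt
)
select count(*) as joined_rows  -- aggregate as needed
from salted_facts f
join salted_dim d
  on f.key = d.key and f.salt = d.salt;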

Tuning

  • spark.sql.shuffle.partitions — default 200; tune to ≈ data_size / 128 MB.
  • Cache strategically; uncache when done. Storage levels MEMORY_AND_DISK_SER common.
  • Broadcast joins for < 10–100 MB right side.
  • Avoid UDFs in Python (serialization overhead) — use built-in / pandas UDFs / SQL.
  • Watch GC + executor OOM — bump spark.executor.memoryOverhead.
MODULE 11

Governance, Privacy & Cost

Catalog, access, retention, FinOps.

Catalog & Lineage

  • Catalog = searchable inventory of datasets + columns + owners + business glossary.
  • Lineage = upstream / downstream graph at table + column level. OpenLineage standard.
  • Tools: DataHub, OpenMetadata, Atlan, Collibra, Unity Catalog (Databricks), Snowflake Horizon.

Access Control

  • RBAC at warehouse + lake.
  • Row-level security (RLS) — filter rows per role / tenant.
  • Column masking — hash / redact PII for non-privileged readers (Snowflake sketch after this list).
  • ABAC / tag-based policies — apply based on column tag (e.g., pii).
  • Audit logs — query history retained per compliance window.
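
Column masking and RLS exist as first-class policy objects in most warehouses; a Snowflake-flavored sketch (role, table, and column names illustrative):

-- masking policy: non-privileged roles see a hash instead of the raw email
create masking policy mask_email as (val string) returns string ->
  case when current_role() in ('PII_READER') then val else sha2(val) end;

alter table customers modify column email set masking policy mask_email;

-- row access policy: each role sees only the tenants it is granted
create row access policy tenant_rls as (tenant_id string) returns boolean ->
  exists (
    select 1 from tenant_grants g
    where g.role_name = current_role() and g.tenant_id = tenant_id
  );

alter table customers add row access policy tenant_rls on (tenant_id);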

Privacy

  • PII inventory — know which columns / files contain PII.
  • Minimization — collect only what's needed.
  • Right to erasure (GDPR Art 17, CCPA) — index user_id → all locations; delete or tombstone.
  • Data residency — region-pinned datasets; cross-border replication policies.
  • Retention — tier + delete on schedule (lifecycle rules).

FinOps for Data

  • Tag every workload (team / cost center / use case).
  • Snowflake: autosuspend warehouses fast (60–300s); right-size XS → 4XL per workload.
  • BigQuery: prefer slot reservations for predictable, on-demand for spiky; partition + cluster to cut TB scanned.
  • Cache hot tables; pre-aggregate dashboards.
  • Watch SELECT * in Looker / dashboards — biggest budget bleeder.
  • S3 lifecycle: hot → IA → Glacier; expire orphan multipart uploads.
MODULE 12

ML / Feature Engineering Interface

DE work that lands in models. Feature stores, training/serving parity.

Feature Store

  • Offline store (warehouse / lakehouse) for training; online store (Redis / Dynamo / Cassandra) for low-latency serving.
  • Single feature definition produces both — no train/serve skew.
  • Point-in-time correctness: training join uses feature values as of label time, not now.
  • Tools: Feast, Tecton, Hopsworks, Vertex / SageMaker Feature Store, Databricks FS.

Point-in-Time Joins

-- naive (leakage: joins current feature values onto historical labels)
select l.*, f.*
from labels l
join features f
  on f.user_id = l.user_id

-- correct (point-in-time: feature values as of l.event_ts)
select l.*, f.*
from labels l
join features f
  on f.user_id = l.user_id
 and f.valid_from <= l.event_ts
 and (f.valid_to is null or f.valid_to > l.event_ts)

Embeddings as Data

DE owns the pipeline: text/image → embedding → indexed vector store. Tracking + versioning + reindex on model change. Integrates with RAG (see AI Engineer notebook).

MODULE 13

Cheat Sheet

Default stack + decision rules.

Default Stack

  • Storage: S3 + Iceberg
  • Warehouse: Snowflake / BigQuery / Databricks SQL
  • Ingest: Fivetran / Airbyte + Debezium CDC
  • Transform: dbt
  • Orchestrate: Airflow or Dagster
  • Stream: Kafka + Flink
  • Quality: dbt tests + GE / Soda
  • Catalog: DataHub / OpenMetadata
  • BI: Looker / Lightdash / Metabase

Pick Storage

  • Pure analytics + SQL → warehouse (Snowflake / BQ)
  • Multi-engine + ML + cheap → lakehouse (Iceberg + Spark)
  • OLTP-style updates → Postgres / Aurora; CDC out
  • K-V serving → DynamoDB / Cassandra
  • Full-text + vector → OpenSearch / Vespa

Streaming Defaults

  • Partition key = entity needing order
  • Avro + schema registry
  • At-least-once + idempotent sink
  • Watermark on event time
  • Compact topic for "current state"
  • Dead-letter queue for poison events

Modeling Defaults

  • Bronze (raw) → Silver (clean) → Gold (mart)
  • Star schema at Silver; OBT at Gold for hot dashboards
  • SCD2 for dims that need history
  • Surrogate keys (hashed natural)
  • Grain documented per fact

Cost Quick Wins

  • Partition + cluster mandatory
  • Materialize hot dashboards
  • Autosuspend warehouses 60s
  • Kill SELECT * in BI tools
  • S3 lifecycle to IA / Glacier
  • Compact small files in lake

Numbers

  • Parquet 5–20× smaller than CSV
  • Lake target file size 128–512 MB
  • Spark shuffle ≈ data_size / 128 MB partitions
  • Kafka partition count = max parallel consumers
  • Watermark lag < 1× window size
  • dbt model PK test on every model