MODULE 1

Data Architecture Layers

Source → ingestion → storage → transformation → serving → consumption.

Modern Data Stack

Layer | Purpose | Tools
Sources | OLTP DBs, SaaS APIs, events, files | Postgres, Stripe, Salesforce, S3
Ingestion | Load raw data into warehouse / lake | Fivetran, Airbyte, Debezium, custom
Storage | Durable persistent layer | S3, ADLS, GCS; Snowflake, BigQuery, Redshift
Transform | SQL / Python on raw → models | dbt, SQLMesh, Spark, Flink
Orchestrate | Schedule + dependency mgmt | Airflow, Dagster, Prefect, Argo
Quality | Validation, freshness, anomaly detection | Great Expectations, dbt tests, Soda, Monte Carlo
Catalog / lineage | Discoverability + impact analysis | DataHub, OpenLineage, Atlan, Unity Catalog
Serving | Reverse ETL, BI, ML features | Hightouch, Looker, Tableau, Feast

Architectural Patterns

  • Lambda — batch + speed layers + serving merge. Two codepaths to maintain.
  • Kappa — single streaming pipeline reprocesses log on change.
  • Lakehouse — open table format on object storage (Iceberg / Delta / Hudi). One layer for batch + stream + ML.
  • Data Mesh — domain-owned data products + federated governance.
  • Medallion (Bronze/Silver/Gold) — raw → cleaned → business-ready zoning convention.
MODULE 2

Ingestion: ETL vs ELT & CDC

Move bytes reliably. The modern default: extract + load now, transform later.

ELT vs ETL

  • ETL — transform before load. Pre-warehouse era; brittle, slow.
  • ELT — load raw, transform in warehouse with SQL. Cheap compute + replay-friendly. Standard now.
  • Hybrid: light parsing / PII redaction in flight; semantic transforms in warehouse.

Change Data Capture

  • Replicate row-level changes from OLTP source. Avoids full-table scans + staleness.
  • Log-based: read DB redo log (Postgres WAL, MySQL binlog). Lowest impact, exact ordering. Tools: Debezium, AWS DMS, Fivetran log mode.
  • Trigger-based: DB triggers write to audit table. Adds OLTP load.
  • Query-based: poll an updated_at column (see the polling sketch after this list). Misses hard deletes; capturing deletions requires a soft-delete flag in the source.
  • Schema evolution: ALTER TABLE in source must propagate. Plan handler.
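
A minimal sketch of the query-based approach, assuming a hypothetical orders table with an updated_at column; :last_watermark stands for the high-water mark the extraction job persisted after its previous run:

-- query-based CDC sketch (names illustrative)
select *
from orders
where updated_at > :last_watermark
order by updated_at;
-- hard-deleted rows never appear in this result; pair with a soft-delete flag in the source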

File Formats

Format | Type | Use
JSON / NDJSON | Row, text | Logs, semi-structured ingest
CSV | Row, text | Legacy interchange
Avro | Row, binary, schema | Streaming (Kafka), schema evolution
Parquet | Columnar | Analytics, lakehouse default
ORC | Columnar | Hive ecosystem
Arrow | Columnar, in-memory | Cross-engine zero-copy
MODULE 3

Data Warehouses

Columnar MPP for analytics. Snowflake / BigQuery / Redshift / Databricks SQL.

Columnar MPP Internals

  • Columnar storage: same column packed; compresses well; only read needed cols.
  • Encoding: dictionary, run-length, delta, bit-packing.
  • Vectorized execution: operators process batches of column values at once, exploiting CPU SIMD.
  • MPP: data partitioned across nodes; each node runs same op locally; results merged.
  • Pruning: zone maps / min-max + bloom filters skip blocks. Partition + cluster by high-selectivity cols.

Snowflake / BigQuery / Redshift

System | Compute | Pricing model | Notes
Snowflake | Virtual warehouses (clusters); decoupled from storage | Per-second compute + storage | Multi-cluster; autosuspend; time travel; zero-copy clone
BigQuery | Serverless; slot-based | On-demand $/TB scanned, or flat-rate slots | Native streaming inserts; partition + cluster only
Redshift (RA3) | Managed cluster; managed storage | Per-hour cluster + storage | Spectrum to S3; concurrency scaling; AQUA
Databricks SQL | Photon engine on lakehouse | DBU / hour | Native to Delta + Iceberg; ML + BI shared

Performance Levers

  • Partition by date / tenant; cluster (sort) by high-selectivity filter columns (sketch after this list).
  • Prune via WHERE on partition column. Always.
  • Avoid SELECT * — kills column pruning.
  • Pre-aggregate hot dashboards into materialized views / cached tables.
  • Stale or skewed stats hurt joins → ANALYZE / refresh.
  • Warehouse sizing: scale up for one-shot heavy query; multi-cluster for concurrency.
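
A BigQuery-flavored sketch of the first two levers (table and column names illustrative):

-- declare partitioning + clustering at DDL time
create table analytics.events
partition by date(event_ts)
cluster by tenant_id
as select * from raw.events;

-- pruned query: WHERE on the partition column lets the engine skip whole blocks
select tenant_id, count(*) as n
from analytics.events
where date(event_ts) >= date '2024-01-01'  -- partition pruning
  and tenant_id = 'acme'                   -- cluster pruning
group by tenant_id;
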
MODULE 4

Data Modeling

Dimensional + denorm patterns. SCDs. Layered lineage.

Kimball Star Schema

  • Fact table — measurements (orders, clicks). Foreign keys to dims + numeric measures.
  • Dimension table — descriptive attributes (customer, product, date).
  • Star = facts → dims directly. Snowflake = dims normalized further.
  • Grain = "what does one row mean?" Define explicitly per fact table (see the sketch below).
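
A sketch of the resulting shape, with an illustrative grain of one row per order line:

-- star query (names illustrative; fct_order_lines grain = one order line)
select d.region, p.category, sum(f.amount) as revenue
from fct_order_lines f
join dim_customer d on f.customer_key = d.customer_key
join dim_product p on f.product_key = p.product_key
group by d.region, p.category;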

Slowly Changing Dimensions

Type | Behavior
Type 0 | No change tracking; ignore updates
Type 1 | Overwrite — current truth only
Type 2 | New row per change with valid_from / valid_to + is_current
Type 3 | Add prior_value column
Type 4 | Separate history table
Type 6 | Hybrid of 1+2+3 (current + history + previous)
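
A Type 2 upsert sketch (Postgres-flavored; table and attribute names illustrative). Many teams get this pattern from a dbt snapshot instead of hand-writing it:

-- step 1: expire the current row when a tracked attribute changed
update dim_customer t
set valid_to = s.updated_at, is_current = false
from staging_customer s
where t.customer_id = s.customer_id
  and t.is_current
  and (t.email <> s.email or t.segment <> s.segment);

-- step 2: insert the new current version (new customers included)
insert into dim_customer (customer_id, email, segment, valid_from, valid_to, is_current)
select customer_id, email, segment, updated_at, null, true
from staging_customer s
where not exists (
  select 1 from dim_customer t
  where t.customer_id = s.customer_id
    and t.is_current
    and t.email = s.email
    and t.segment = s.segment
);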

Medallion Layering

  • Bronze (raw) — exact source byte-for-byte. Append-only. PII as-is, encrypted.
  • Silver (cleaned) — typed, deduplicated, joined with conformed dims. Useful but not business-shaped (sketch after this list).
  • Gold (mart) — denormalized, business metrics, aggregates, fit-for-BI.
  • Reprocess Silver/Gold from Bronze; never edit Bronze.
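
A Bronze → Silver step might look like this (names illustrative): cast types and keep the latest version of each row.

-- bronze → silver sketch: type, dedup, latest record per id wins
create or replace table silver.orders as
with ranked as (
  select cast(id as bigint)            as id,
         cast(total as decimal(18, 2)) as total,
         cast(updated_at as timestamp) as updated_at,
         row_number() over (partition by id order by updated_at desc) as rn
  from bronze.orders
)
select id, total, updated_at
from ranked
where rn = 1;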

One Big Table (OBT) vs Star

Modern columnar engines tolerate wide denorm tables. OBT = simpler queries, faster on dashboards; star = better for ad-hoc + smaller writes. Hybrid: star at Silver, OBT at Gold for hot dashboards.

MODULE 5

dbt & Transformation

SQL + Jinja + DAG. Tests + docs + lineage built in.

Core Concepts

  • Models — SELECT in .sql files; dbt wraps as table / view / incremental.
  • Sources — declare external tables; freshness checks.
  • Tests — schema (unique, not_null, accepted_values, relationships) + custom singular (sketch after this list).
  • Macros — Jinja functions, reused SQL.
  • Snapshots — built-in SCD2 capture.
  • Seeds — small CSVs version-controlled.
  • Exposures — declare downstream BI / app usage.
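
A custom singular test is just a SELECT that must return zero rows; a sketch against a hypothetical fct_orders model:

-- tests/assert_no_negative_totals.sql: any rows returned = test failure
select order_id, total
from {{ ref('fct_orders') }}
where total < 0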

Materializations

Type | When
view | Cheap, small, pass-through
table | Heavy compute reused; full rebuild OK
incremental | Big tables; only new/changed rows
ephemeral | CTE-inlined; no DB object
snapshot | SCD2 history

-- incremental example
{{ config(materialized='incremental', unique_key='id', incremental_strategy='merge') }}
select *
from {{ source('app', 'orders') }}
{% if is_incremental() %}
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}

Best Practices

  • Layer naming: stg_ (raw rename + types), int_ (joins, intermediate), fct_ + dim_ (mart).
  • 1 model = 1 logical entity. Long SQL → split with intermediate models, not subquery soup.
  • Tests on PKs (unique + not_null) on every model.
  • CI: dbt build on PR; Slim CI runs only modified + downstream.
  • Document at the model + column level. Docs site auto-generated.
MODULE 6

Orchestration

Schedule, retry, recover, observe pipelines.

Tooling

Tool | Model | Strength
Airflow | Task DAG, Python | Ubiquitous; rich operators + connectors; large ecosystem
Dagster | Asset-oriented | Lineage + types + dev loop; software-defined assets
Prefect | Pythonic flows | Hybrid execution; dynamic DAGs
Argo Workflows | K8s CRD | Container-native; scales horizontally
Step Functions | AWS managed state machine | Serverless; tight AWS integration

Principles

  • Idempotent tasks — same partition rerun = same output. Required for safe retry.
  • Backfill safe — task accepts logical date; doesn't depend on wall clock.
  • Atomic writes — write to staging, then COMMIT / RENAME / SWAP (sketch after this list); partially written data is never visible.
  • SLA + alerts — freshness threshold; page on miss.
  • No business logic in DAG — DAGs orchestrate; logic lives in SQL/dbt/jobs.
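
One way to get atomic writes in a warehouse, sketched in Snowflake syntax (table names illustrative; :logical_date stands for the orchestrator-supplied run date):

-- build into staging for the logical date, validate, then swap atomically
create or replace table analytics.orders_staging as
select * from silver.orders where order_date = :logical_date;  -- logical date, not wall clock

-- ... run quality checks against orders_staging here ...

alter table analytics.orders swap with analytics.orders_staging;  -- atomic metadata swap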

Dependencies

  • Time-based scheduling: simple but brittle (upstream late → downstream still runs).
  • Sensor / signal-based: wait for upstream completion or data availability.
  • Asset-based (Dagster, dbt Cloud): "this asset is materialized; downstream can run".
  • Data contracts: producer + consumer schema agreement. Break in CI, not in prod.
MODULE 7

Streaming & Real-Time

Kafka + Flink / Spark Structured Streaming / Materialize. Bounded-latency processing.

Kafka Fundamentals

  • Topic = ordered log; partitions = unit of parallelism + ordering.
  • Producer key → partition (hash). Same key → same partition → ordered.
  • Consumer group: each partition assigned to one consumer in group. Scale by adding partitions + consumers.
  • Retention: time / size based. Compact topics retain latest value per key (DB-like).
  • Offsets stored in __consumer_offsets; auto-commit dangerous, prefer at-least-once + idempotent sinks.

Delivery Semantics

Mode | Mechanism
At-most-once | Commit before process; lose on crash
At-least-once | Process before commit; duplicates on crash
Exactly-once | Transactional producer + idempotent + read-committed; or sink dedup
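
At-least-once plus an idempotent sink is the pragmatic default; a MERGE keyed on the event id makes redelivered duplicates harmless (names illustrative):

-- idempotent sink sketch: replayed events collapse on event_id
merge into events t
using staged_batch s
  on t.event_id = s.event_id
when not matched then
  insert (event_id, user_id, event_ts, payload)
  values (s.event_id, s.user_id, s.event_ts, s.payload);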

Windows

  • Tumbling — fixed, non-overlapping (every 5 min).
  • Hopping / sliding — fixed size, slide step (5 min size, 1 min step).
  • Session — close after gap of inactivity (user activity sessions).
  • Event time vs processing time — event time tolerates late arrivals via watermarks.
  • Watermark — bound on "no event older than this is expected". Closes the window (Flink SQL sketch after this list).
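
In Flink SQL, event time, watermarks, and windows map directly onto DDL plus a window TVF; a sketch using the built-in datagen connector so it runs self-contained (names illustrative):

-- event-time column + watermark declared in the DDL (30s lateness tolerance)
create table clicks (
  user_id string,
  event_ts timestamp(3),
  watermark for event_ts as event_ts - interval '30' second
) with ('connector' = 'datagen');

-- 5-minute tumbling window on event time
select window_start, window_end, count(*) as clicks
from table(tumble(table clicks, descriptor(event_ts), interval '5' minutes))
group by window_start, window_end;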

Stream Engines

  • Flink — true streaming, exactly-once semantics, savepoints, stateful ops at scale. Industry standard for complex pipelines.
  • Spark Structured Streaming — micro-batch (or continuous experimental); reuses Spark SQL.
  • Kafka Streams / ksqlDB — JVM lib; simple in-Kafka topology.
  • Materialize / RisingWave — streaming SQL; incrementally maintained materialized views.
  • Beam — unified batch + stream API; runs on Flink / Dataflow.

Streaming CDC

Debezium tails DB log → Kafka topic per table. Downstream consumers (Flink, Materialize) maintain near-real-time replicas, enriched joins, denormalized views.

MODULE 8

Lakehouse: Iceberg / Delta / Hudi

ACID + schema + time travel on object storage.

Why Open Table Formats

  • Plain Parquet on S3: no atomic writes, no schema evolution, no row-level update.
  • Open table format adds metadata layer: snapshots, manifests, partition evolution, ACID via optimistic concurrency.
  • Multi-engine: Spark + Trino + Flink + Snowflake + DuckDB read same table.

Iceberg / Delta / Hudi

Format | Origin | Strength
Iceberg | Netflix → Apache | Hidden partitioning, partition evolution, broad engine support, REST catalog
Delta Lake | Databricks → Linux Foundation | Tight Spark integration; UniForm for cross-format interop
Hudi | Uber → Apache | Streaming upserts (MoR table type); incremental queries

Common Features

  • ACID via snapshot + commit log.
  • MERGE / UPDATE / DELETE — row-level mutations on Parquet.
  • Time travel — query as-of a snapshot or timestamp (sketch after this list).
  • Schema evolution — add / drop / rename columns.
  • Partition evolution (Iceberg) — change partition spec without rewrite.
  • File compaction + tombstone cleanup — required maintenance jobs.
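
Spark SQL sketches of the mutation and time-travel features on an Iceberg table (names and snapshot id illustrative):

-- row-level mutation on Parquet-backed files
merge into lake.orders t
using updates s
  on t.id = s.id
when matched then update set *
when not matched then insert *;

-- time travel: query an earlier snapshot or point in time
select * from lake.orders version as of 4348412412;
select * from lake.orders timestamp as of '2024-01-01 00:00:00';
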
MODULE 9

Data Quality & Contracts

Catch broken pipelines before consumers do.

Test Categories

Category | Examples
Schema | Required cols present; types match; nullability
Uniqueness / referential | PK unique; FK present in dim
Range / accepted values | 0 ≤ rate ≤ 1; status ∈ {…}
Volume | Row count within ±X% of 7-day avg
Freshness | max(updated_at) within SLA
Distribution | Mean / null rate within control limits
Custom business rule | Σ(line_items.amount) = orders.total
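
Freshness and volume checks reduce to small SQL assertions where returned rows mean failure (dbt-style convention; names and thresholds illustrative, interval syntax varies by dialect):

-- freshness: fail when the newest row breaches a 2-hour SLA
select max(updated_at) as newest
from silver.orders
having max(updated_at) < current_timestamp - interval '2' hour;

-- volume: fail when today's count is outside ±30% of the trailing 7-day average
with daily as (
  select cast(loaded_at as date) as d, count(*) as n
  from silver.orders
  group by 1
),
baseline as (
  select avg(n) as n_avg
  from daily
  where d >= current_date - 7 and d < current_date
)
select t.n, b.n_avg
from daily t cross join baseline b
where t.d = current_date
  and abs(t.n - b.n_avg) > 0.3 * b.n_avg;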

Tools

  • dbt tests — fast, in-pipeline. Unique / not_null / relationships built in.
  • Great Expectations — Python-driven expectations + docs.
  • Soda — declarative YAML checks.
  • Monte Carlo / Bigeye / Anomalo — ML-based anomaly + lineage.
  • Open standards: OpenLineage, OpenMetadata.

Data Contracts

  • Producer publishes typed schema + SLA to consumers.
  • Schema registry (Confluent SR, Apicurio) enforces compat (BACKWARD / FORWARD / FULL).
  • Breaking change = new version (v2 topic / table) + deprecation window.
  • CI test on producer side: schema diff against last published; block breaking changes.
MODULE 10

Spark / Distributed Compute

Big-data engine, for when SQL alone won't scale or the task isn't tabular.

Spark Internals

  • Driver builds DAG; cluster manager (YARN / Kubernetes / standalone) allocates executors.
  • Job → stages (split at shuffle boundaries) → tasks (one per partition).
  • Lazy evaluation — transformations build plan; action (count, write) triggers execution.
  • DataFrame / SQL → Catalyst optimizer → physical plan → Tungsten codegen.
  • Adaptive Query Execution (AQE) — replan at runtime: coalesce shuffle, skew handle, broadcast switch.

Shuffle

  • Wide deps (groupBy, join, repartition) move data across nodes — expensive.
  • Minimize shuffles: filter + project early, broadcast small side of join (< 10 MB default), avoid groupBy when reduceByKey suffices.
  • Skew: a hot key concentrates work on one task. Salt the key (sketch after this list), use AQE skew join, or split the hot key out.
  • Spill to disk when memory exceeded — ensure shuffle dirs sized + fast disks.
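
A salted-join sketch in Spark SQL (16 salts; names illustrative): the skewed fact side gets a random salt, the small dimension side is replicated across every salt, and the join key becomes (key, salt) so the hot key spreads over 16 tasks.

-- salt the skewed side; explode the small side across all salts
with salted_facts as (
  select f.*, cast(floor(rand() * 16) as int) as salt
  from facts f
),
salted_dim as (
  select d.*, s.salt
  from dim d
  lateral view explode(sequence(0, 15)) s as salt
)
select count(*) as joined_rows  -- aggregate as needed
from salted_facts f
join salted_dim d
  on f.key = d.key and f.salt = d.salt;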

Tuning

  • spark.sql.shuffle.partitions — default 200; tune to ≈ data_size / 128 MB.
  • Cache strategically; uncache when done. Storage levels MEMORY_AND_DISK_SER common.
  • Broadcast joins for < 10–100 MB right side.
  • Avoid UDFs in Python (serialization overhead) — use built-in / pandas UDFs / SQL.
  • Watch GC + executor OOM — bump spark.executor.memoryOverhead.
MODULE 11

Governance, Privacy & Cost

Catalog, access, retention, FinOps.

Catalog & Lineage

  • Catalog = searchable inventory of datasets + columns + owners + business glossary.
  • Lineage = upstream / downstream graph at table + column level. OpenLineage standard.
  • Tools: DataHub, OpenMetadata, Atlan, Collibra, Unity Catalog (Databricks), Snowflake Horizon.

Access Control

  • RBAC at warehouse + lake.
  • Row-level security (RLS) — filter rows per role / tenant.
  • Column masking — hash / redact PII for non-privileged readers (Snowflake sketch after this list).
  • ABAC / tag-based policies — apply based on column tag (e.g., pii).
  • Audit logs — query history retained per compliance window.
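
Column masking and RLS exist as first-class policy objects in most warehouses; a Snowflake-flavored sketch (role, table, and column names illustrative):

-- masking policy: non-privileged roles see a hash instead of the raw email
create masking policy mask_email as (val string) returns string ->
  case when current_role() in ('PII_READER') then val else sha2(val) end;

alter table customers modify column email set masking policy mask_email;

-- row access policy: each role sees only the tenants it is granted
create row access policy tenant_rls as (tenant_id string) returns boolean ->
  exists (
    select 1 from tenant_grants g
    where g.role_name = current_role() and g.tenant_id = tenant_id
  );

alter table customers add row access policy tenant_rls on (tenant_id);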

Privacy

  • PII inventory — know which columns / files contain PII.
  • Minimization — collect only what's needed.
  • Right to erasure (GDPR Art 17, CCPA) — index user_id → all locations; delete or tombstone.
  • Data residency — region-pinned datasets; cross-border replication policies.
  • Retention — tier + delete on schedule (lifecycle rules).

FinOps for Data

  • Tag every workload (team / cost center / use case).
  • Snowflake: autosuspend warehouses fast (60–300s); right-size XS → 4XL per workload.
  • BigQuery: prefer slot reservations for predictable, on-demand for spiky; partition + cluster to cut TB scanned.
  • Cache hot tables; pre-aggregate dashboards.
  • Watch SELECT * in Looker / dashboards — biggest budget bleeder.
  • S3 lifecycle: hot → IA → Glacier; expire orphan multipart uploads.
MODULE 12

ML / Feature Engineering Interface

DE work that lands in models. Feature stores, training/serving parity.

Feature Store

  • Offline store (warehouse / lakehouse) for training; online store (Redis / Dynamo / Cassandra) for low-latency serving.
  • Single feature definition produces both — no train/serve skew.
  • Point-in-time correctness: training join uses feature values as of label time, not now.
  • Tools: Feast, Tecton, Hopsworks, Vertex / SageMaker Feature Store, Databricks FS.

Point-in-Time Joins

-- naive (leakage: joins current feature values onto historical labels)
select l.*, f.*
from labels l
join features f
  on f.user_id = l.user_id

-- correct (point-in-time: feature values as of l.event_ts)
select l.*, f.*
from labels l
join features f
  on f.user_id = l.user_id
 and f.valid_from <= l.event_ts
 and (f.valid_to is null or f.valid_to > l.event_ts)

Embeddings as Data

DE owns the pipeline: text/image → embedding → indexed vector store. Tracking + versioning + reindex on model change. Integrates with RAG (see AI Engineer notebook).

MODULE 13

Cheat Sheet

Default stack + decision rules.

Default Stack

  • Storage: S3 + Iceberg
  • Warehouse: Snowflake / BigQuery / Databricks SQL
  • Ingest: Fivetran / Airbyte + Debezium CDC
  • Transform: dbt
  • Orchestrate: Airflow or Dagster
  • Stream: Kafka + Flink
  • Quality: dbt tests + GE / Soda
  • Catalog: DataHub / OpenMetadata
  • BI: Looker / Lightdash / Metabase

Pick Storage

  • Pure analytics + SQL → warehouse (Snowflake / BQ)
  • Multi-engine + ML + cheap → lakehouse (Iceberg + Spark)
  • OLTP-style updates → Postgres / Aurora; CDC out
  • K-V serving → DynamoDB / Cassandra
  • Full-text + vector → OpenSearch / Vespa

Streaming Defaults

  • Partition key = entity needing order
  • Avro + schema registry
  • At-least-once + idempotent sink
  • Watermark on event time
  • Compact topic for "current state"
  • Dead-letter queue for poison events

Modeling Defaults

  • Bronze (raw) → Silver (clean) → Gold (mart)
  • Star schema at Silver; OBT at Gold for hot dashboards
  • SCD2 for dims that need history
  • Surrogate keys (hashed natural)
  • Grain documented per fact

Cost Quick Wins

  • Partition + cluster mandatory
  • Materialize hot dashboards
  • Autosuspend warehouses 60s
  • Kill SELECT * in BI tools
  • S3 lifecycle to IA / Glacier
  • Compact small files in lake

Numbers

  • Parquet 5–20× smaller than CSV
  • Lake target file size 128–512 MB
  • Spark shuffle ≈ data_size / 128 MB partitions
  • Kafka partition count = max parallel consumers
  • Watermark lag < 1× window size
  • dbt model PK test on every model