MODULE 1

Service Models

IaaS / PaaS / SaaS / FaaS — what the cloud manages, what you manage.

Stack Layers

Model | You manage | Cloud manages | Example
On-prem | Everything | Nothing | Bare metal in your DC
IaaS | OS, runtime, app, data | Virtualization, networking, storage, hardware | EC2, GCE, Azure VM
CaaS | Container image, app config | Container runtime, scheduler, infra | ECS, GKE, AKS, Fargate
PaaS | App code, data | Runtime, OS, scaling, infra | App Engine, Heroku, Elastic Beanstalk
FaaS | Function code | Everything else; pay per invocation | Lambda, Cloud Functions
SaaS | Configuration, users | Everything | Gmail, Salesforce

MODULE 2

AWS Core Services

The 20 services that cover 80% of architectures.

Compute

  • EC2 — VMs. Instance families: m (general), c (compute), r (memory), i (storage), g/p (GPU). Spot = up to 90% off On-Demand; instances can be reclaimed with a 2-minute warning.
  • Lambda — FaaS. Max 15 min, 10 GB memory. Cold start mitigations: provisioned concurrency, smaller bundle, SnapStart for Java.
  • ECS / Fargate — container orchestration. Fargate = serverless containers (no EC2 to manage).
  • EKS — managed Kubernetes control plane.
  • Batch — managed batch job scheduling on EC2 / Fargate; suits long-running compute jobs.

Storage

  • S3 — object store. Classes: Standard, IA, One-Zone-IA, Intelligent-Tiering, Glacier Instant / Flexible / Deep Archive. Strong read-after-write since 2020.
  • EBS — block storage for EC2. gp3 (general SSD, 3000–16000 IOPS), io2 (provisioned), st1 (HDD throughput), sc1 (cold HDD).
  • EFS — NFS. Multi-AZ, scales automatically. Slower than EBS.
  • FSx — Lustre / Windows / OpenZFS / NetApp managed file systems.

Databases & Data

  • RDS — managed Postgres / MySQL / MariaDB / Oracle / SQL Server. Multi-AZ for HA, read replicas for scale.
  • Aurora — MySQL- / PostgreSQL-compatible engine in the RDS family, storage decoupled from compute. 6 copies across 3 AZs. Faster failover than standard RDS.
  • DynamoDB — managed k-v / document. Single-digit ms typical latency. PK + optional SK. GSI / LSI for secondary access. On-demand vs provisioned RCU/WCU.
  • ElastiCache — managed Redis / Memcached.
  • Redshift — columnar warehouse, OLAP.
  • Athena — serverless SQL on S3 (Presto).
  • Kinesis — managed streaming. Data Streams (Kafka-like), Firehose (load to S3/Redshift), Analytics (managed Flink, since renamed Managed Service for Apache Flink).
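
The DynamoDB key model above (PK + optional SK) can be sketched in memory. This is a toy model, not the DynamoDB API: the class ToyTable and the key values such as "USER#42" are hypothetical names chosen for illustration.

```python
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for a table keyed by partition key (PK) + sort key (SK)."""

    def __init__(self):
        # PK -> list of (SK, item), kept sorted by SK like items within one partition
        self._partitions = defaultdict(list)

    def put(self, pk, sk, item):
        rows = self._partitions[pk]
        # Overwrite if the (PK, SK) pair already exists
        rows[:] = [(s, i) for s, i in rows if s != sk]
        rows.append((sk, item))
        rows.sort(key=lambda row: row[0])

    def query(self, pk, sk_prefix=""):
        # A query targets exactly one partition and may narrow by a sort-key condition
        return [item for sk, item in self._partitions.get(pk, []) if sk.startswith(sk_prefix)]


table = ToyTable()
table.put("USER#42", "ORDER#2024-02-11", {"total": 12})
table.put("USER#42", "ORDER#2024-01-03", {"total": 30})
table.put("USER#42", "PROFILE", {"name": "Ada"})
orders = table.query("USER#42", sk_prefix="ORDER#")
```

Storing several entity types under one PK and narrowing by SK prefix is one common way to serve an access pattern with a single query; a GSI plays the same role for a second access pattern.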

Networking

  • VPC — private virtual network. Subnets per AZ, public/private split.
  • ALB — L7 LB, HTTP-aware, target groups, path/host routing, WebSockets.
  • NLB — L4 LB, TCP/UDP, static IP, ultra-low latency, preserves source IP.
  • API Gateway — managed API frontend, auth, rate limit, transformation.
  • CloudFront — CDN with edge compute (Lambda@Edge / Functions).
  • Route 53 — DNS with health checks + weighted / latency / geo routing.
  • VPN / Direct Connect / Transit Gateway — hybrid + multi-VPC connectivity.

Messaging & Integration

  • SQS — managed queue. Standard (best-effort order, at-least-once delivery) or FIFO (ordered within a message group, exactly-once processing via deduplication). DLQ for poison messages.
  • SNS — pub/sub fanout. SNS → SQS / Lambda / HTTP / email / SMS.
  • EventBridge — event bus with rules + scheduler.
  • Step Functions — managed state machines, sagas.
  • MSK — managed Kafka.
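
The at-least-once and DLQ behaviour described for SQS can be sketched with an in-memory queue. This is a simplified model, not the SQS API: drain, handler, and max_receive_count are hypothetical names, and visibility timeouts are not modelled.

```python
from collections import deque


def drain(queue, dlq, handler, max_receive_count=3):
    """Deliver each message until the handler succeeds or the receive limit is hit."""
    processed = []
    while queue:
        msg = queue.popleft()
        msg["receive_count"] = msg.get("receive_count", 0) + 1
        try:
            handler(msg["body"])
            processed.append(msg["body"])  # success: the message is deleted
        except Exception:
            if msg["receive_count"] >= max_receive_count:
                dlq.append(msg)    # poison message: park it in the dead-letter queue
            else:
                queue.append(msg)  # failure: the message becomes visible again
    return processed


def handler(body):
    if body == "bad":
        raise ValueError("cannot parse message")


queue = deque({"body": b} for b in ["a", "bad", "b"])
dlq = deque()
processed = drain(queue, dlq, handler)
```

Because a message can be delivered more than once, handlers are usually written to be idempotent.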

IAM & Security

  • IAM — users, roles, policies. Trust + permission policies.
  • STS — short-lived credentials via AssumeRole.
  • KMS — managed keys, envelope encryption. CMK + data keys.
  • Secrets Manager — secret storage with rotation; Parameter Store for non-secret config.
  • Cognito — user pools (user directory + sign-in) + identity pools (temporary AWS credentials for federated identities).
  • WAF / Shield — L7 firewall + DDoS.
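
The split between trust and permission policies can be illustrated with two policy documents built as Python dicts. This is a minimal sketch: the bucket name example-bucket is a hypothetical placeholder, and the choice of EC2 as the trusted service is an assumption for the example.

```python
import json

# Trust policy: WHO may assume the role (here, the EC2 service).
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "ec2.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# Permission policy: WHAT the role may do once assumed.
# The bucket name is a hypothetical placeholder.
permission_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-bucket/*",
        }
    ],
}

trust_json = json.dumps(trust_policy, indent=2)
permission_json = json.dumps(permission_policy, indent=2)
```

The trust policy names a Principal and no Resource; the permission policy names a Resource and no Principal. STS issues short-lived credentials only when the caller matches the trust policy.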

Observability

  • CloudWatch — metrics, logs, alarms. Custom metrics via the PutMetricData API or Embedded Metric Format (EMF) log lines.
  • X-Ray — distributed tracing.
  • CloudTrail — API audit log.
  • Config — resource state + compliance.
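
An EMF log line is JSON with an _aws metadata block that tells CloudWatch which top-level fields are metrics and which are dimensions. The sketch below builds one such line; the namespace "ExampleApp" and the field names are hypothetical.

```python
import json
import time


def emf_record(namespace, dimensions, metric_name, value, unit="Milliseconds"):
    """Build one log line in CloudWatch Embedded Metric Format (EMF)."""
    record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # epoch milliseconds
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    "Dimensions": [list(dimensions)],  # one dimension set (key names)
                    "Metrics": [{"Name": metric_name, "Unit": unit}],
                }
            ],
        },
        metric_name: value,  # the metric value lives at the top level
    }
    record.update(dimensions)  # dimension values also live at the top level
    return json.dumps(record)


line = emf_record("ExampleApp", {"Service": "checkout"}, "Latency", 42)
```

Printing such a line from a Lambda function, or from any process whose logs reach CloudWatch Logs through an EMF-aware agent, should produce a metric without a separate API call.
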
MODULE 3

GCP / Azure Mappings

Same primitives, different names.

Cross-Cloud Equivalents

AWS | GCP | Azure
EC2 | Compute Engine | Virtual Machines
Lambda | Cloud Functions / Cloud Run | Azure Functions
S3 | Cloud Storage | Blob Storage
EBS | Persistent Disk | Managed Disks
RDS | Cloud SQL | Azure SQL / Postgres / MySQL
Aurora | AlloyDB / Spanner (global) | Cosmos DB (multi-model)
DynamoDB | Firestore / Bigtable | Cosmos DB
SQS | Pub/Sub (queue mode) | Service Bus / Queue Storage
SNS | Pub/Sub | Event Grid / Service Bus topics
Kinesis | Pub/Sub + Dataflow | Event Hubs
ALB | Cloud Load Balancing (HTTPS) | App Gateway / Front Door
CloudFront | Cloud CDN | Front Door / CDN
Route 53 | Cloud DNS | Azure DNS / Traffic Manager
EKS | GKE | AKS
Redshift | BigQuery | Synapse Analytics
IAM | Cloud IAM | Entra ID + RBAC
KMS | Cloud KMS | Key Vault
CloudWatch | Cloud Monitoring + Logging | Monitor + Log Analytics

MODULE 4

Containers

Image format, runtime, registry.

Image Anatomy

  • Layered filesystem (overlayfs). Each filesystem-changing Dockerfile instruction (RUN, COPY, ADD) = layer; other instructions only add image metadata.
  • Layer cache: identical layer hashes are reused across builds. Order Dockerfile from least → most changing.
  • Multi-stage builds: build in heavy image, copy artifacts to slim runtime.
  • OCI image spec — vendor-neutral standard. Distroless / scratch / alpine for minimal base.
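
# Multi-stage example: build in a full Go image, ship a minimal runtime image.
FROM golang:1.22 AS build
WORKDIR /src
# Copy module files first so the dependency layer stays cached until they change.
COPY go.mod go.sum ./
RUN go mod download
COPY . .
# CGO_ENABLED=0 yields a static binary that runs on the distroless static base.
RUN CGO_ENABLED=0 go build -o /out/app

FROM gcr.io/distroless/static
COPY --from=build /out/app /app
# Run as the unprivileged "nonroot" user (UID/GID 65532) that distroless images define.
USER 65532:65532
ENTRYPOINT ["/app"]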
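
Why instruction order matters for the layer cache can be shown with a simplified model in which each layer's cache key hashes its parent key, the instruction text, and a digest of any copied files. This is a sketch of the idea, not the algorithm of any particular builder; the digest strings such as "mods-v1" are hypothetical stand-ins for file checksums.

```python
import hashlib


def layer_keys(steps):
    """Return one cache key per step; each key depends on every earlier step."""
    keys, parent = [], ""
    for instruction, content_digest in steps:
        parent = hashlib.sha256(f"{parent}|{instruction}|{content_digest}".encode()).hexdigest()
        keys.append(parent)
    return keys


def reusable_layers(old, new):
    """Count leading layers whose cache keys match between two builds."""
    count = 0
    for a, b in zip(layer_keys(old), layer_keys(new)):
        if a != b:
            break
        count += 1
    return count


base = [
    ("FROM golang:1.22", ""),
    ("COPY go.mod go.sum ./", "mods-v1"),
    ("RUN go mod download", ""),
    ("COPY . .", "src-v1"),
]
# Only application source changed: the dependency layers are still reusable.
source_changed = base[:3] + [("COPY . .", "src-v2")]
# go.mod changed: everything after the first COPY must be rebuilt.
deps_changed = [base[0], ("COPY go.mod go.sum ./", "mods-v2")] + base[2:]
```

In this model a source-only change reuses three of four layers, while a dependency change reuses only the base image layer, which is the reason the Dockerfile above copies go.mod and go.sum before the rest of the source.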

Runtime & Linux Primitives

  • Namespaces — pid, net, mnt, uts, ipc, user, cgroup. Isolate process view.
  • cgroups v2 — limit + account CPU, memory, IO, pids.
  • seccomp / AppArmor / SELinux — syscall filtering, MAC.
  • Container runtimes — containerd, CRI-O, runc (OCI runtime).
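
cgroups v2 exposes limits as small text files, so a process can discover its own CPU and memory caps by parsing them. The sketch below parses the contents of cpu.max and memory.max; the function names are hypothetical, and the file location (commonly under /sys/fs/cgroup inside a container) is an assumption that depends on the host setup.

```python
def parse_cpu_max(text):
    """Convert the contents of a cgroup v2 cpu.max file into a CPU count.

    The file holds "<quota> <period>" in microseconds, or "max <period>"
    when no limit is set. Returns None for the unlimited case.
    """
    quota, period = text.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


def parse_memory_max(text):
    """Convert the contents of a cgroup v2 memory.max file into bytes (None = unlimited)."""
    value = text.strip()
    return None if value == "max" else int(value)


half_core = parse_cpu_max("50000 100000\n")     # 50 ms of CPU per 100 ms period
unlimited = parse_cpu_max("max 100000\n")
mem_limit = parse_memory_max("536870912\n")     # 512 MiB
```
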
MODULE 5

Kubernetes

Declarative orchestration. Reconciliation loops drive desired state.

Architecture

  • Control plane: API server (kube-apiserver), etcd (state), scheduler, controller-manager, cloud-controller-manager.
  • Node: kubelet (agent), kube-proxy (network), container runtime (containerd).
  • Everything is a CRUD operation on the API server; controllers watch + reconcile.
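
The watch + reconcile pattern can be sketched as a function that compares desired state with observed state and returns the actions needed to close the gap. This is a toy ReplicaSet-style controller; reconcile, run_until_converged, and the pod names are hypothetical, and there is no API server involved.

```python
def reconcile(desired_replicas, running_pods):
    """One reconcile pass: compare desired vs observed state and emit actions."""
    diff = desired_replicas - len(running_pods)
    if diff > 0:
        return [("create", None)] * diff
    if diff < 0:
        # Scale down by deleting surplus pods (the victim choice here is arbitrary)
        return [("delete", name) for name in sorted(running_pods)[:-diff]]
    return []  # observed state already matches desired state


def run_until_converged(desired_replicas, running_pods, max_passes=10):
    """Apply reconcile() repeatedly, as a controller would on each watch event."""
    pods = set(running_pods)
    created = 0
    for _ in range(max_passes):
        actions = reconcile(desired_replicas, pods)
        if not actions:
            break
        for verb, name in actions:
            if verb == "create":
                created += 1
                pods.add(f"pod-new-{created}")
            else:
                pods.discard(name)
    return pods


scaled_up = run_until_converged(3, {"pod-a"})
scaled_down = run_until_converged(1, {"pod-a", "pod-b", "pod-c"})
```

The loop is level-triggered: it acts on the current difference rather than on individual events, so a missed event is corrected on the next pass.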

Core Objects

ObjectPurpose
PodSmallest unit; 1+ containers sharing net + IPC
ReplicaSetMaintains N pod replicas
DeploymentDeclarative rolling update over ReplicaSets
StatefulSetStable identity + ordered rollout (DBs)
DaemonSetOne pod per node (logs, monitoring agents)
Job / CronJobRun-to-completion / scheduled work
Service (ClusterIP / NodePort / LB)Stable virtual IP + endpoints
Ingress / Gateway APIL7 routing into cluster
ConfigMap / SecretConfig + sensitive data
PersistentVolume / PVC / StorageClassStorage provisioning
HPA / VPA / Cluster Autoscaler / KarpenterScaling pods + nodes
NetworkPolicyPod-to-pod L3/L4 firewall

Scheduling

  • Filters (node selector, taints/tolerations, affinity, resource fit) → scoring → bind.
  • Requests = scheduler reservation; limits = runtime cap. Set CPU request always; CPU limit cautiously (throttling).
  • QoS classes: Guaranteed (requests = limits for CPU and memory in every container), Burstable, BestEffort (no requests or limits set). Eviction order: BestEffort → Burstable → Guaranteed.
  • Pod disruption budgets (PDB) protect availability during voluntary disruption.
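
The filter → score → bind pipeline can be sketched over plain dicts. This is a simplified illustration, not the kube-scheduler: only resource fit and taints are filtered, the scoring rule (most CPU left after placement) is an assumed policy, and the node and field names are hypothetical.

```python
def schedule(pod, nodes):
    """Pick a node for the pod: filter infeasible nodes, score the rest, bind to the best."""
    feasible = [
        n for n in nodes
        # Resource fit: the pod's requests must fit in the node's unreserved capacity
        if n["free_cpu"] >= pod["cpu_request"] and n["free_mem"] >= pod["mem_request"]
        # Taints: every taint on the node must be tolerated by the pod
        and set(n.get("taints", [])) <= set(pod.get("tolerations", []))
    ]
    if not feasible:
        return None  # the pod stays Pending

    # Scoring (assumed policy): prefer the node with the most CPU left after placement
    best = max(feasible, key=lambda n: n["free_cpu"] - pod["cpu_request"])
    # Bind: reserve the requested resources on the chosen node
    best["free_cpu"] -= pod["cpu_request"]
    best["free_mem"] -= pod["mem_request"]
    return best["name"]


nodes = [
    {"name": "node-a", "free_cpu": 2.0, "free_mem": 4096},
    {"name": "node-b", "free_cpu": 4.0, "free_mem": 8192, "taints": ["gpu"]},
    {"name": "node-c", "free_cpu": 3.0, "free_mem": 2048},
]
first = schedule({"cpu_request": 1.0, "mem_request": 1024}, nodes)
```

Note that the sketch reserves requests, not limits, which mirrors the point above that requests are what the scheduler accounts for.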

Operators & CRDs

Custom controllers reconcile custom resources. Patterns: state machine for app lifecycle (e.g., Postgres operator handles backups, failover, scaling). Tools: kubebuilder, Operator SDK.

Rollouts

  • RollingUpdate: maxSurge, maxUnavailable (both default to 25%). Default Deployment strategy.
  • Recreate: kill all then start. Brief downtime.
  • Blue/Green: two deployments, switch service selector.
  • Canary: weighted routing via Service Mesh (Istio/Linkerd) or Argo Rollouts.
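
The effect of maxSurge and maxUnavailable on a RollingUpdate can be worked out with a little arithmetic: percentages of the replica count round up for maxSurge and down for maxUnavailable. The helper names below are hypothetical.

```python
import math


def resolve(value, replicas, round_up):
    """Resolve an absolute count or a percentage string such as "25%" against replicas."""
    if isinstance(value, str) and value.endswith("%"):
        raw = replicas * int(value[:-1]) / 100
        return math.ceil(raw) if round_up else math.floor(raw)
    return int(value)


def rollout_bounds(replicas, max_surge="25%", max_unavailable="25%"):
    """Return (max total pods, min available pods) during a RollingUpdate."""
    surge = resolve(max_surge, replicas, round_up=True)               # rounds up
    unavailable = resolve(max_unavailable, replicas, round_up=False)  # rounds down
    return replicas + surge, replicas - unavailable


defaults = rollout_bounds(10)                                       # (13, 8)
zero_downtime = rollout_bounds(4, max_surge=1, max_unavailable=0)   # (5, 4)
```

With 10 replicas and the defaults, the rollout may run up to 13 pods at once and never drops below 8 available.
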
MODULE 6

Infrastructure as Code

Terraform / Pulumi / CloudFormation / CDK.

Terraform Concepts

  • Providers (aws, gcp, k8s) → resources → state file.
  • State: source of truth for what TF created. Remote backend (S3 + DynamoDB lock, Terraform Cloud).
  • Plan / Apply: dry-run shows diff. Apply executes.
  • Modules: reusable groups of resources. Inputs / outputs / variables.
  • Workspaces or directories for env separation (dev / staging / prod).
  • Drift = real infra ≠ state. Detect via plan; reconcile or import.
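
Plan and drift detection are both diffs, against different baselines: plan compares configuration with state, while drift compares state with real infrastructure. The sketch below models each as a dict comparison; it is not how Terraform is implemented, and the resource addresses and attributes are hypothetical.

```python
def plan(config, state):
    """Compare desired config with recorded state and list the actions to apply."""
    actions = []
    for address, attrs in config.items():
        if address not in state:
            actions.append(("create", address))
        elif state[address] != attrs:
            actions.append(("update", address))
    for address in state:
        if address not in config:
            actions.append(("destroy", address))
    return actions


def detect_drift(state, real):
    """Drift: resources whose real attributes no longer match the recorded state."""
    return sorted(addr for addr, attrs in state.items() if real.get(addr) != attrs)


config = {"aws_s3_bucket.logs": {"versioning": True}, "aws_sqs_queue.jobs": {"fifo": False}}
state = {"aws_s3_bucket.logs": {"versioning": False}, "aws_instance.old": {"type": "t3.micro"}}
real = {"aws_s3_bucket.logs": {"versioning": False}, "aws_instance.old": {"type": "t3.large"}}

actions = plan(config, state)
drifted = detect_drift(state, real)
```

Here the plan updates the bucket, creates the queue, and destroys the instance, while drift detection flags only the instance that was resized outside of the tool.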

Tooling Comparison

Tool | Lang | Pros | Cons
Terraform | HCL | Multi-cloud, large ecosystem, mature | HCL limited; state mgmt headaches
Pulumi | Python/TS/Go | Real lang, tests, conditionals | Smaller community
CloudFormation | YAML/JSON | Native AWS, no state to manage | AWS-only, slow rollback
AWS CDK | TS/Python/Java/Go | Synth to CFN; constructs | AWS-only; opinionated abstractions
Crossplane | K8s CRDs | GitOps-native; multi-cloud as K8s | K8s-shaped solutions only

MODULE 7

Cloud Networking

VPC anatomy + connectivity patterns.

VPC Topology

  • VPC = isolated address space (e.g., 10.0.0.0/16).
  • Subnets per AZ (typically 2–3 AZs). Public subnet = route to IGW; private = no IGW route.
  • NAT Gateway in public subnet → outbound internet from private subnet.
  • Route tables associate with subnets. Security groups (stateful, instance-level) + NACLs (stateless, subnet-level).
  • VPC endpoints: Interface (ENI in your subnet for an AWS service) or Gateway (S3 / DynamoDB only, no additional charge).
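
Carving a VPC range into per-AZ public and private subnets is plain CIDR arithmetic. The sketch below uses Python's ipaddress module; the /20 subnet size and the AZ labels are assumptions for the example.

```python
import ipaddress


def carve_subnets(vpc_cidr, azs, new_prefix=20):
    """Allocate one public and one private subnet per AZ from the VPC range."""
    blocks = ipaddress.ip_network(vpc_cidr).subnets(new_prefix=new_prefix)
    layout = {}
    for az in azs:
        # Consecutive blocks: first public, then private, for each AZ
        layout[az] = {"public": str(next(blocks)), "private": str(next(blocks))}
    return layout


layout = carve_subnets("10.0.0.0/16", ["az-a", "az-b", "az-c"])
```

A /16 split into /20 blocks yields 16 subnets, so three AZs with two subnets each leave 10 blocks spare for later tiers.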

Cross-VPC / Hybrid

  • VPC peering — 1:1, non-transitive, same/different account/region.
  • Transit Gateway — hub-and-spoke, transitive, scales to thousands of VPCs.
  • PrivateLink — expose service via interface endpoint without VPC peering.
  • VPN — IPsec over internet, up to ~1.25 Gbps per tunnel.
  • Direct Connect — dedicated fiber, 1–100 Gbps, lower latency + predictable.
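
The difference between non-transitive peering and a transitive hub can be sketched as two reachability checks. This is a simplified model that ignores route tables and security groups; the VPC names are hypothetical.

```python
def peering_reachable(peerings, src, dst):
    """VPC peering is non-transitive: traffic flows only over a direct peering link."""
    return (src, dst) in peerings or (dst, src) in peerings


def tgw_reachable(attachments, src, dst):
    """A Transit Gateway can route between any two attached VPCs."""
    return src in attachments and dst in attachments


def full_mesh_peerings(vpc_count):
    """Number of peering links needed so every VPC can reach every other one."""
    return vpc_count * (vpc_count - 1) // 2


peerings = {("vpc-a", "vpc-b"), ("vpc-b", "vpc-c")}
attachments = {"vpc-a", "vpc-b", "vpc-c"}

a_to_c_via_peering = peering_reachable(peerings, "vpc-a", "vpc-c")  # False: no direct link
a_to_c_via_tgw = tgw_reachable(attachments, "vpc-a", "vpc-c")       # True
```

A full mesh of 10 VPCs needs 45 peering links versus 10 Transit Gateway attachments, which is why the hub model tends to win as the VPC count grows.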

Egress Cost

MODULE 8

Storage Tiers & Patterns

Match access pattern to tier; lifecycle policies move automatically.

S3 Classes

Class | Cost (USD / GB-month) | Retrieval | Use
Standard | $0.023 | Instant | Hot data
Intelligent-Tiering | $0.023 → $0.0036 | Instant | Unknown access pattern
Standard-IA | $0.0125 | Instant, plus per-GB retrieval fee | Backup, infrequent access
One-Zone-IA | $0.01 | Instant | Re-creatable, single AZ
Glacier Instant Retrieval | $0.004 | Instant | Archive accessed rarely
Glacier Flexible | $0.0036 | Minutes–hours | Backup
Glacier Deep Archive | $0.00099 | 12 hours | Compliance, long retain
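
The per-GB prices in the table above make the saving from tiering easy to estimate. The sketch below covers storage cost only; retrieval, request, and minimum-duration fees are ignored, and the data volumes are hypothetical.

```python
# USD per GB-month, copied from the table above
PRICE_PER_GB_MONTH = {
    "Standard": 0.023,
    "Standard-IA": 0.0125,
    "One-Zone-IA": 0.01,
    "Glacier Instant Retrieval": 0.004,
    "Glacier Flexible": 0.0036,
    "Glacier Deep Archive": 0.00099,
}


def monthly_storage_cost(gb_by_class):
    """Sum storage-only cost; retrieval, request, and minimum-duration fees are ignored."""
    return sum(PRICE_PER_GB_MONTH[cls] * gb for cls, gb in gb_by_class.items())


# 10 TB kept entirely in Standard
all_hot = monthly_storage_cost({"Standard": 10_000})
# The same 10 TB split by access frequency
tiered = monthly_storage_cost(
    {"Standard": 1_000, "Standard-IA": 3_000, "Glacier Deep Archive": 6_000}
)
```

Under these assumptions the tiered layout costs about $66 per month against $230 for all-Standard; retrieval fees can erase that gap if the cold data turns out to be read often.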

EBS Volume Types

  • gp3 — default. 3000 IOPS / 125 MB/s baseline; provision more.
  • io2 Block Express — up to 256k IOPS, <1 ms; expensive.
  • st1 / sc1 — HDD, sequential workloads, cheap.
  • Snapshots = incremental, stored in S3 internally.
MODULE 9

Cost Optimization

FinOps basics. Engineering decisions = cost decisions at scale.

Pricing Levers

  • On-Demand — full price, no commit.
  • Reserved Instances / Savings Plans — 1 or 3 yr commit; up to ~72% off. Compute SP = flexible across EC2, Fargate, and Lambda, regardless of instance family or region.
  • Spot — up to 90% off; can be reclaimed with a 2-minute warning. Pair with checkpointing or Karpenter.
  • Right-sizing — match instance to workload. Compute Optimizer surfaces recommendations.
  • Auto-scaling — scale to zero off-hours where possible.
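
How the pricing levers combine can be estimated with a blended-rate calculation. This is a rough sketch: the discounts default to the upper bounds quoted above (real discounts are usually lower), and the $0.10 hourly rate and fleet mix are hypothetical.

```python
def blended_hourly_cost(on_demand_rate, instances, committed_share, spot_share,
                        commit_discount=0.72, spot_discount=0.90):
    """Estimate fleet cost per hour given the share covered by commitment and by Spot.

    Discounts default to the upper bounds quoted in the list above; real
    discounts vary by instance type, region, and term.
    """
    on_demand_share = 1.0 - committed_share - spot_share
    if on_demand_share < -1e-9:
        raise ValueError("committed_share + spot_share must not exceed 1.0")
    per_instance = on_demand_rate * (
        on_demand_share
        + committed_share * (1 - commit_discount)
        + spot_share * (1 - spot_discount)
    )
    return per_instance * instances


# Hypothetical fleet: 100 instances at $0.10/hour On-Demand
baseline = blended_hourly_cost(0.10, 100, committed_share=0.0, spot_share=0.0)
mixed = blended_hourly_cost(0.10, 100, committed_share=0.6, spot_share=0.3)
```

With 60% of the fleet on commitment and 30% on Spot, the best-case hourly cost falls from $10.00 to about $2.98.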

Watch For

  • Idle EBS volumes / snapshots / Elastic IPs.
  • NAT Gateway egress for internal traffic — use VPC endpoints.
  • Cross-AZ chatter — colocate replicas / cache.
  • S3 unfinished multipart uploads (lifecycle rule to expire).
  • CloudWatch Logs retention default = never expire; set a retention period.
  • RDS over-provisioned IOPS; switch to gp3.
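
The lifecycle rule for unfinished multipart uploads mentioned above can be written as an S3 lifecycle configuration document. A minimal sketch as a Python dict: the rule ID and the 7-day window are hypothetical values chosen for illustration.

```python
import json

# Shape follows the S3 lifecycle configuration document; the rule ID and
# the day count are hypothetical values chosen for illustration.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "expire-incomplete-multipart-uploads",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # empty prefix = whole bucket
            "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
        }
    ]
}

lifecycle_json = json.dumps(lifecycle_configuration, indent=2)
```

Parts of an abandoned multipart upload are billed as storage but do not show up in a normal object listing, so a rule like this is one of the cheaper cleanups to automate.
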
MODULE 10

Cheat Sheet

Pick services fast. Defaults for new projects.

Default Stack (AWS)

  • Compute: ECS Fargate or EKS
  • Data: Aurora Postgres + DynamoDB
  • Cache: ElastiCache Redis
  • Queue: SQS + EventBridge
  • Object: S3 + CloudFront
  • Auth: Cognito or external IdP

Pick LB

  • HTTP(S) + path/host routing → ALB
  • TCP/UDP, static IP, ultra-low latency → NLB
  • Global anycast → Global Accelerator / CloudFront
  • Cross-region failover → Route 53 health checks

Pick DB

  • Relational + transactions → Aurora / RDS Postgres
  • K-V, ms latency, scale-out → DynamoDB
  • Search → OpenSearch
  • Time-series → Timestream / Influx on EC2
  • Graph → Neptune
  • Wide-column → Keyspaces (Cassandra)

K8s Sanity

  • Always set requests; limits with care
  • Liveness ≠ readiness; startup probe for slow boot
  • PDB on every Deployment in prod
  • NetworkPolicy default-deny ingress
  • Pin image digest, not tag
  • Use HPA on custom metrics, not just CPU

Cost Quick Wins

  • S3 lifecycle: Standard → IA → Glacier
  • EBS gp2 → gp3 (10–20% cheaper)
  • VPC endpoint for S3/Dynamo to skip NAT
  • Compute Savings Plan covering ~60% of baseline usage
  • Spot for batch/CI/dev
  • Set log retention < 30 days unless required

IaC Hygiene

  • Remote state + state lock
  • One module per logical unit
  • Plan in PR; apply on merge
  • No hardcoded secrets — use SM / Vault
  • Drift detection scheduled
  • Tag everything: env, owner, cost-center