Operating Systems & Concurrency
A FAANG interview consolidation covering processes, virtual memory, scheduling, synchronization, deadlock, file systems, I/O, concurrency patterns, and memory models.
Processes & Threads
What the kernel tracks per process and per thread, how fork/exec work, and exactly what a context switch costs.
Process Anatomy & the PCB
A process is an instance of an executing program with its own virtual address space, open-file table, credentials, and at least one thread of control. The kernel tracks each process via a Process Control Block (PCB) — in Linux this is struct task_struct (several KiB per task on x86_64, varying with kernel config).
Virtual address-space layout (x86_64 Linux, typical): text, then data/BSS, then the heap growing up; a mmap arena for shared libraries and large allocations; the stack growing down from the top of the user half; the kernel half above the canonical gap.
PCB fields — what the kernel tracks per process
| Field | Linux field name | Purpose |
|---|---|---|
| PID | pid_t pid | Unique process identifier in PID namespace |
| Parent PID | struct task_struct *parent | Backpointer for SIGCHLD / wait() |
| State | long state | RUNNING / INTERRUPTIBLE / UNINTERRUPTIBLE / STOPPED / ZOMBIE |
| Scheduling | struct sched_entity se | vruntime, policy (CFS / FIFO / RR), nice value |
| Address space | struct mm_struct *mm | Page tables, VMA list, brk, stack start |
| File table | struct files_struct *files | FD → struct file mapping |
| Signals | struct sighand_struct *sighand | Signal handlers + pending masks |
| Credentials | struct cred *cred | UID, GID, capabilities |
| Namespaces | struct nsproxy *nsproxy | pid/net/mount/uts/ipc/cgroup namespaces |
| Saved context | struct thread_struct thread | Register file, FPU state, kernel stack pointer |
| Memory stats | rss, vsize | Resident / virtual size, for top/ps |
Process states — the five-state machine: NEW → READY ⇄ RUNNING → TERMINATED, with RUNNING → WAITING (blocked on I/O or a lock) → READY closing the loop.
A zombie is a terminated process whose PCB is still retained so the parent can read its exit code via wait(). If the parent never waits, zombies pile up — each consumes a PID slot. Orphaned children are re-parented to init (PID 1, or a subreaper), which reaps them — this is why init always loops on wait().
Threads & the TCB
A thread is an independently schedulable flow of execution that shares its address space, file table, and signal handlers with sibling threads. On Linux, threads are just task_structs created with clone(CLONE_VM|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|…) — the PCB is the TCB. Shared resources are stored once in mm_struct / files_struct and pointed to by every thread.
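A minimal pthreads sketch (illustrative, not from the original text) makes the sharing concrete: both threads see the same shared global through the common mm, while each local lives on a private per-thread stack.

```c
/* threads_share.c — sketch: sibling threads share globals, keep private stacks */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

_Atomic int shared = 0;            /* one copy, visible via the shared mm */

static void *worker(void *arg) {
    int local = *(int *)arg;       /* lives on this thread's private stack */
    atomic_fetch_add(&shared, local);
    printf("stack addr %p\n", (void *)&local);   /* distinct per thread */
    return NULL;
}

int main(void) {
    pthread_t t[2];
    int args[2] = { 1, 2 };
    for (int i = 0; i < 2; ++i) pthread_create(&t[i], NULL, worker, &args[i]);
    for (int i = 0; i < 2; ++i) pthread_join(t[i], NULL);
    printf("shared = %d\n", atomic_load(&shared)); /* 3 — same variable */
    return 0;
}
```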
Per-thread (TCB) vs per-process state
| Kind of state | Where it lives | Per-thread? |
|---|---|---|
| Register file / PC / RSP | thread_struct | Yes — the essence of a thread |
| Kernel stack (8–16 KiB) | task_struct::stack | Yes |
| User stack | VMA in shared mm | Yes (distinct ranges) |
| TLS block (fs/gs base) | Per-thread arena | Yes |
| Pending signals | task_struct::pending | Yes — thread-directed signals |
| Signal handlers | sighand_struct | No — shared |
| Address space / page tables | mm_struct | No |
| Open file descriptors | files_struct | No |
| Heap / BSS / Text | VMAs in shared mm | No |
| PID (thread group leader) | tgid | No — TID is per-thread, PID is shared |
User vs kernel threads
- 1:1 (kernel threads) — every userspace thread maps to one kernel task_struct. Linux NPTL, Windows threads, and modern macOS all do this. Blocking a thread only blocks that thread; multiple threads can run on multiple cores simultaneously.
- N:1 (green threads) — many user threads multiplexed onto one kernel thread. A blocking syscall freezes every sibling. Historic examples: early Java on Solaris green threads, GNU Pth. Simple, but no multicore.
- M:N (hybrid) — a user scheduler maps M user threads onto N kernel threads. Go's goroutines, Rust's long-removed green threads, Erlang's BEAM. Cheap context switches in userspace; requires a runtime to trap blocking syscalls.
When asked “process vs thread”, the sharp answer distinguishes the unit of resource ownership (process: address space, FDs) from the unit of scheduling (thread: registers + stack). Go's goroutines are not threads — they are a userspace M:N runtime where a go-scheduler multiplexes 100K goroutines onto GOMAXPROCS kernel threads.
fork / exec / wait
POSIX fork() creates a duplicate of the calling process. The child gets a new PID, a copy of the parent's memory (copy-on-write), a duplicate FD table, and picks up execution on the instruction after fork returns.
```c
/* fork_demo.c — classic fork/exec/wait pipeline */
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void) {
    pid_t pid = fork();
    if (pid < 0) { perror("fork"); return 1; }
    if (pid == 0) {
        /* child: replace our image with /bin/ls */
        char *argv[] = { "ls", "-la", NULL };
        execvp(argv[0], argv);
        perror("execvp");   /* only reached if exec fails */
        _exit(127);
    }
    /* parent: wait for the child and inspect its exit status */
    int status;
    waitpid(pid, &status, 0);
    if (WIFEXITED(status))
        printf("child exited with %d\n", WEXITSTATUS(status));
    return 0;
}
```
What actually happens on fork
- Kernel copies task_struct — trivial bytes.
- Clones mm_struct by copying page tables, not physical pages. All writable pages are marked read-only. This is copy-on-write (COW).
- Duplicates the FD table (refcount++ on each file*).
- Copies signal dispositions (but not pending signals).
- Sets the child's return value to 0 and the parent's to the child PID.
- Schedules the child; the first page written triggers a fault, and the kernel copies that page only.
fork cost breakdown (x86_64 Linux, 2024)
| Resident size | Pages | fork() time (approx) | Why |
|---|---|---|---|
| 10 MB | 2,500 | ~90 µs | PT copy dominated |
| 100 MB | 25,000 | ~450 µs | Linear in page-table entries |
| 1 GB | 250,000 | ~3.5 ms | Multi-level PT walk + copy |
| 10 GB | 2.5M | ~35 ms | Why Redis uses fork for BGSAVE and sees pauses |
fork after pthread_create is dangerous. Only the calling thread survives in the child; others vanish mid-lock. If a killed thread held malloc's arena lock, the child's first malloc will deadlock. Use posix_spawn or only call async-signal-safe functions between fork and exec.
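A minimal sketch of the posix_spawn alternative (error handling trimmed; illustrative only):

```c
/* spawn_demo.c — posix_spawn avoids the fork-after-threads trap */
#include <spawn.h>
#include <stdio.h>
#include <sys/wait.h>

extern char **environ;

int main(void) {
    pid_t pid;
    char *argv[] = { "ls", "-la", NULL };
    /* NULL file_actions/attrp: child inherits FDs and signal dispositions */
    int rc = posix_spawnp(&pid, "ls", NULL, NULL, argv, environ);
    if (rc != 0) { fprintf(stderr, "posix_spawnp failed: %d\n", rc); return 1; }
    int status;
    waitpid(pid, &status, 0);
    if (WIFEXITED(status))
        printf("child exited with %d\n", WEXITSTATUS(status));
    return 0;
}
```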
Context Switch Cost
A context switch is the kernel replacing the executing thread's saved state with another thread's. The direct cost is saving and restoring registers. The indirect cost — cache and TLB pollution — often dwarfs it.
Direct cost breakdown
| Component | Cycles (approx) | Time @ 3 GHz | Notes |
|---|---|---|---|
| Trap into kernel (syscall/IRQ) | ~100 | 33 ns | swapgs, push rsp, re-target IDT |
| Save GPRs + FPU/AVX | ~200 | 66 ns | AVX-512 alone: 512 bytes/thread |
| scheduler_tick + pick_next_task | ~500–2000 | 0.2–0.7 µs | CFS RB-tree O(log n) |
| switch_mm (different address space) | ~300 | 100 ns | CR3 write + possible TLB flush |
| Restore registers + iret | ~200 | 66 ns | Mirror of save |
| Direct total | ~1,500–5,000 | 1–2 µs | |
Indirect cost: what you pay after the switch
- TLB flush: a cross-address-space switch on non-PCID CPUs flushes the TLB entirely; the first few thousand memory references miss. PCID (CR4.PCIDE) tags entries by ASID so the cost is near-zero.
- Cache pollution: T2 evicts T1's working set from L1/L2. When T1 resumes, cold misses re-fetch from L3 or RAM. Measured indirect cost for unrelated workloads: 10–100 µs.
- Branch-predictor state: Spectre mitigations (IBPB, STIBP) on cross-trust-domain switches can add ~5 µs on older CPUs.
On modern hardware a same-process thread-to-thread switch costs ~1 µs direct. An epoll-based event loop on one core is cheap because it never switches. A thread-per-request server with 10K threads doing 50K switches/sec burns ~5% of one core just on kernel bookkeeping, plus cache pollution.
IPC Mechanisms
Processes do not share memory by default, so they communicate through kernel-mediated channels:
| Mechanism | Direction | Latency / throughput | Best for |
|---|---|---|---|
| Anonymous pipe | Parent→child, 1-way | ~1 µs / 2 GB/s | Shell command pipelines |
| Named pipe (FIFO) | Any-any, 1-way | ~1 µs / 2 GB/s | Unrelated process rendezvous |
| UNIX-domain socket | Bidirectional, datagram or stream | ~2 µs / 1–3 GB/s | Local RPC, FD passing |
| Shared memory (mmap / SysV) | Any-any | memory speed / 20+ GB/s | Zero-copy bulk data (needs your own sync) |
| Message queue (POSIX) | Any-any | ~2 µs | Structured bounded-priority messages |
| Signal | Any-any, 1 bit | <1 µs | Notify only — never carry data |
| eventfd / pipe2 | Self-wake | <1 µs | Cross-thread wake via epoll |
Shared memory is the fastest IPC, but offers no synchronization. You still need a futex / semaphore / atomic in that shared region to coordinate writes. Pipes and sockets trade throughput for built-in flow control and blocking semantics.
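As a sketch of that pairing, an anonymous MAP_SHARED region plus a C11 atomic lets a parent and child count without lost updates (illustrative, not a benchmark):

```c
/* shm_counter.c — shared mapping + atomic as the in-region sync primitive */
#include <stdatomic.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    /* MAP_SHARED|MAP_ANONYMOUS: one physical page visible to parent and child */
    atomic_int *ctr = mmap(NULL, sizeof(atomic_int), PROT_READ|PROT_WRITE,
                           MAP_SHARED|MAP_ANONYMOUS, -1, 0);
    atomic_init(ctr, 0);
    pid_t pid = fork();
    for (int i = 0; i < 100000; ++i)
        atomic_fetch_add(ctr, 1);          /* the atomic the text calls for */
    if (pid == 0) _exit(0);
    wait(NULL);
    printf("%d\n", atomic_load(ctr));      /* 200000 — no lost updates */
    munmap(ctr, sizeof(atomic_int));
    return 0;
}
```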
Memory & Virtual Memory
How an MMU translates virtual to physical, what a page-table entry actually contains, and the TLB path that keeps it fast.
Virtual Address Spaces
Each process sees a private, contiguous virtual address space. x86_64 uses 48-bit canonical addresses (256 TiB) split into a user half (0x0000_0000_0000_0000 .. 0x0000_7fff_ffff_ffff) and a kernel half (0xffff_8000_0000_0000 and up). Addresses in between are non-canonical and fault on use. 5-level paging raises the limit to 57 bits (128 PiB), gated by CR4.LA57.
Why bother with virtual memory? Three wins:
- Isolation — process A cannot read/write process B's pages because their page tables point to different physical frames.
- Sparseness — a process claims 256 TiB of logical space and only pays for pages it touches (the kernel populates page-table entries lazily on fault).
- Relocation & sharing — shared libraries and COW pages live in physical memory once but are mapped into many virtual spaces.
Candidates love to say “virtual memory lets you use more than RAM.” True, but swap-based paging to disk is the least important win today — servers rarely swap. The isolation and zero-copy (mmap, shared libraries, fork COW) wins matter every millisecond.
Paging & Page-Table Entries
Memory is divided into fixed-size pages (4 KiB default; 2 MiB and 1 GiB “huge pages” available). The MMU translates a virtual page number (VPN) to a physical frame number (PFN) by walking the multi-level page table anchored at CR3.
x86_64 4-level paging (48-bit virtual address): 9-bit PML4 index → 9-bit PDPT index → 9-bit PD index → 9-bit PT index → 12-bit page offset.
Page-Table Entry (4 KiB page, x86_64) — the bits that matter
| Bit | Name | Meaning |
|---|---|---|
| 0 | P (Present) | 1 = mapped to a frame. 0 = fault on access (paged out, lazy alloc, COW). |
| 1 | R/W | 0 = read-only (fault on write — used for COW and .text). |
| 2 | U/S | 0 = supervisor-only (kernel pages). |
| 3 | PWT | Page Write-Through caching. |
| 4 | PCD | Page Cache Disable (MMIO / DMA regions). |
| 5 | A (Accessed) | Set by hardware on any reference; cleared by kernel for clock replacement. |
| 6 | D (Dirty) | Set by hardware on write; drives page-cache writeback. |
| 7 | PS (Page Size) | 1 at the PD level = 2 MiB huge page; at PDPT level = 1 GiB. |
| 8 | G (Global) | Survives CR3 reload (kernel mappings). |
| 12-51 | PFN | Physical frame number (40 bits × 4 KiB = 4 PiB). |
| 63 | XD (NX) | 1 = no-execute; kills buffer-overflow shellcode on the stack. |
A 4-level walk reads 4 page-table entries. Each read is a cache line (64 B). Without a TLB a user-space load would cost ~5 memory references. The TLB collapses that to 1 when the PTE is cached.
Virtual → Physical Translation (with TLB miss walk)
The hot path checks the TLB; on miss, the hardware page-table walker (the PMH on Intel) walks all four levels. As a worked example, consider a load of address 0x7FFE_1234_5ABC — the sketch below splits it into its four indices plus offset.
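A small C sketch of the split (index values verified by hand):

```c
/* va_decompose.c — split a 48-bit virtual address into page-table indices */
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint64_t va = 0x7FFE12345ABCULL;
    printf("PML4 %llu\n", (unsigned long long)((va >> 39) & 0x1FF)); /* 255 */
    printf("PDPT %llu\n", (unsigned long long)((va >> 30) & 0x1FF)); /* 504 */
    printf("PD   %llu\n", (unsigned long long)((va >> 21) & 0x1FF)); /* 145 */
    printf("PT   %llu\n", (unsigned long long)((va >> 12) & 0x1FF)); /* 325 */
    printf("off  0x%llx\n", (unsigned long long)(va & 0xFFF));       /* 0xabc */
    return 0;
}
```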
Cost of a miss
- Best case — all four PTEs in L1d: ~4 cycles × 4 + transit = ~20 ns.
- Typical case — PTEs in L2/L3: ~100–200 ns for the walk.
- Worst case — PTEs in RAM or not present: a full page fault is tens of microseconds; swapping from disk is milliseconds.
TLB & Miss Handling
The Translation Lookaside Buffer is a small, fully or set-associative cache of recent VPN → PFN mappings. Modern CPUs have a split L1 TLB (dTLB + iTLB), unified L2 TLB, and separate large-page TLBs:
| TLB | Intel Skylake | AMD Zen 4 | Latency |
|---|---|---|---|
| L1 dTLB (4 KiB) | 64 entries | 72 entries | 1 cycle |
| L1 iTLB (4 KiB) | 128 entries | 64 entries | 1 cycle |
| L1 dTLB (2 MiB) | 32 entries | 64 entries | 1 cycle |
| L1 dTLB (1 GiB) | 4 entries | 24 entries | 1 cycle |
| L2 STLB (unified) | 1,536 entries | 3,072 entries | 7 cycles |
Who flushes the TLB?
- Writing CR3 (cross-address-space switch) flushes all non-Global entries — unless PCID (CR4.PCIDE=1) tags entries by ASID.
- INVLPG addr flushes one VPN — issued after a PTE edit. INVPCID invalidates by PCID (kernel path).
- Spectre-style mitigations (IBPB) invalidate the branch predictor, not the TLB.
- A multi-core TLB flush is a “shootdown” — one core IPIs the others to run their own INVLPG; this is why large unmap operations on big multicore machines are expensive.
A random-access working set larger than TLB-reach murders throughput even if it fits in L1d cache. Reach = entries × page size. 64 × 4 KiB = 256 KiB reach. Going to 2 MiB huge pages turns 64 entries into 128 MiB of reach — often a 2–3× speedup on graph / OLAP workloads.
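A sketch of the huge-page hint (Linux-specific; MADV_HUGEPAGE asks for transparent 2 MiB pages with no guarantee):

```c
/* thp_hint.c — widen TLB reach by backing a big range with 2 MiB pages */
#include <sys/mman.h>

int main(void) {
    size_t len = 1UL << 30;   /* 1 GiB working set */
    void *p = mmap(NULL, len, PROT_READ|PROT_WRITE,
                   MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return 1;
    madvise(p, len, MADV_HUGEPAGE);   /* hint: one dTLB entry now maps 2 MiB,
                                         not 4 KiB, where the kernel obliges */
    /* ... touch the range; random access now misses the TLB far less ... */
    munmap(p, len);
    return 0;
}
```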
Segmentation & mmap
Historic segmentation (Intel 80286/386) sliced the address space into length-limited segments with distinct CS/DS/SS/ES bases. x86_64 reduced it to a relic: all segments have base=0, limit=infinity. Only fs and gs bases survive as pointers to per-thread / per-CPU data (TLS).
Instead, Linux exposes virtual memory areas (VMAs) through mmap. Each VMA has a base, length, protection (r/w/x), and backing object (file, dev/zero, or anonymous).
```c
/* memory-map a 64 MiB file, read lazily by page fault */
#include <sys/mman.h>
#include <fcntl.h>

int fd = open("data.bin", O_RDONLY);
void *base = mmap(NULL, 64*1024*1024, PROT_READ, MAP_PRIVATE, fd, 0);

/* touching base[0] triggers a page fault; the kernel populates the page
   from the page cache (or reads it in). zero-copy for reads. */
madvise(base, 64*1024*1024, MADV_RANDOM);   /* access-pattern hint */
```
Anonymous vs file-backed
- Anonymous (MAP_ANONYMOUS): backed by swap, zero-filled on first touch. Used by malloc's large allocations and per-thread stacks.
- File-backed: pages are populated from the file on fault; dirty pages are flushed back on writeback or msync. Shared mappings (MAP_SHARED) propagate writes to the file and to other mappers.
Page Fault Handling
A #PF is raised when the MMU sees P=0, a permission violation, or a non-canonical address. The kernel's do_page_fault classifies the fault by looking up the faulting address in the VMA list:
| Classification | What happened | Kernel action |
|---|---|---|
| Minor (lazy-alloc) | VMA exists; no physical frame yet | Allocate a zeroed frame; install PTE; return — a few µs |
| Minor (file cache) | File page already in page cache | Install PTE pointing at cached frame — a few µs |
| Minor (COW) | Write to r/o COW page (post-fork) | Copy page to new frame, mark PTE writable |
| Major | Page must be read from disk/swap | I/O; context switches; ~ms on HDD, ~10–100 µs on NVMe |
| Invalid (SIGSEGV) | Address not in any VMA, or W to r/o non-COW | Deliver SIGSEGV to the thread |
“What is a page fault?” is a trick question if you only answer “page not in RAM.” The interviewer wants: most page faults are minor — the page is either already in the buffer cache or needs to be zero-allocated. Only a tiny fraction are major (swap-in or file read), and those dominate the cost.
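A quick way to see the minor/major split on Linux is getrusage (a sketch; counts will vary by system):

```c
/* fault_count.c — observe minor vs major fault counters */
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <sys/resource.h>

int main(void) {
    struct rusage ru;
    char *p = malloc(64 << 20);   /* 64 MiB, mapped lazily */
    memset(p, 1, 64 << 20);       /* first touch: one minor fault per page */
    getrusage(RUSAGE_SELF, &ru);
    printf("minor %ld major %ld\n", ru.ru_minflt, ru.ru_majflt);
    free(p);
    return 0;
}
```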
CPU Scheduling
FCFS, SJF/SRTF, Round Robin, MLFQ, CFS — each run on the same five-process workload so the results compare directly.
Goals & Metrics
A scheduler picks which ready thread gets the CPU next. It is pulled in four directions that conflict:
- Throughput — jobs completed per unit time. Maximized by minimizing context switches.
- Turnaround time — finish − arrival. Shorter is better for batch jobs.
- Waiting time — time spent in the ready queue. A proxy for perceived latency.
- Response time — time until the job first runs after arrival. Critical for interactive workloads.
- Fairness — every thread of equal priority makes comparable progress. Violated by starvation.
Two orthogonal knobs:
- Preemptive vs non-preemptive — can the scheduler forcibly take the CPU back?
- Priority vs time-slice — who gets to run, or for how long?
Throughput and response time are enemies. Long quanta amortize switch cost (high throughput) but make interactive tasks wait. Short quanta keep latency low but waste cycles on context switches. Every modern scheduler blends both.
Benchmark Workload
To compare schedulers apples-to-apples, we run the same five processes through each algorithm:
| Process | Arrival time | CPU burst | Priority (low = high) |
|---|---|---|---|
| P1 | 0 | 7 | 3 |
| P2 | 2 | 4 | 1 |
| P3 | 4 | 1 | 4 |
| P4 | 5 | 4 | 2 |
| P5 | 6 | 6 | 5 |
Total work = 22 units. Below we derive waiting time (how long each process sat in the ready queue) and turnaround time (completion − arrival) for each scheduler.
FCFS (First-Come First-Served)
Non-preemptive. Runs jobs in arrival order. Simple to implement, but prone to the convoy effect: a 10-second burst arriving at time 0 delays every subsequent 1-ms job behind it.
Derivation of averages
| Process | Start | Finish | Waiting (start − arrival) | Turnaround (finish − arrival) |
|---|---|---|---|---|
| P1 | 0 | 7 | 0 | 7 |
| P2 | 7 | 11 | 5 | 9 |
| P3 | 11 | 12 | 7 | 8 |
| P4 | 12 | 16 | 7 | 11 |
| P5 | 16 | 22 | 10 | 16 |
| avg | | | 5.8 | 10.2 |
```c
/* FCFS dispatcher (kernel-style pseudocode in C) */
struct task *fcfs_pick_next(struct runqueue *rq) {
    if (list_empty(&rq->tasks))
        return NULL;
    return list_first_entry(&rq->tasks, struct task, run_list);
}

void fcfs_enqueue(struct runqueue *rq, struct task *t) {
    list_add_tail(&t->run_list, &rq->tasks);  /* append in arrival order */
}
```
FCFS is terrible for interactive workloads but excellent for throughput-only batch: minimal bookkeeping, no switches while a job runs. Real systems use FCFS only as a deep-background idle class.
SJF / SRTF (Shortest Job First)
Non-preemptive SJF picks the ready job with the smallest CPU burst. Preemptive SJF = SRTF (Shortest Remaining Time First): if a shorter job arrives, kick the current one out. Provably optimal for mean waiting time — but requires knowing burst length, which production systems must estimate via exponential averaging.
Derivation of averages
| Process | Finish | Waiting | Turnaround |
|---|---|---|---|
| P1 | 16 | 9 | 16 |
| P2 | 7 | 1 | 5 |
| P3 | 5 | 0 | 1 |
| P4 | 11 | 2 | 6 |
| P5 | 22 | 10 | 16 |
| avg | | 4.4 | 8.8 |
SJF/SRTF can starve long jobs indefinitely if short jobs keep arriving. Production schedulers (MLFQ, CFS) add aging: a long-waiting job's effective priority rises until it has to run.
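The burst estimator mentioned above is exponential averaging: τ′ = α·t + (1−α)·τ, where t is the last observed burst and τ the previous prediction. A sketch with invented sample bursts:

```c
/* burst_estimate.c — classic SJF burst predictor */
#include <stdio.h>

int main(void) {
    double tau = 10.0, alpha = 0.5;           /* initial guess, smoothing */
    double bursts[] = { 6, 4, 6, 4, 13, 13, 13 };
    for (int i = 0; i < 7; ++i) {
        printf("predicted %.2f, observed %.0f\n", tau, bursts[i]);
        tau = alpha * bursts[i] + (1 - alpha) * tau;   /* blend in new burst */
    }
    return 0;
}
```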
Round Robin (RR)
Preemptive. Every ready process gets a fixed time quantum q in FIFO order. When q expires the job goes to the tail of the queue. q too small → high switch overhead; too large → degenerates into FCFS.
Below we simulate RR with q = 2. Arrivals enter the queue tail as they appear; when an arrival coincides with a quantum expiry, the arrival is enqueued ahead of the preempted task.
Derivation (q=2)
| Process | Finish | Waiting | Turnaround |
|---|---|---|---|
| P1 | 20 | 13 | 20 |
| P2 | 9 | 3 | 7 |
| P3 | 7 | 2 | 3 |
| P4 | 17 | 8 | 12 |
| P5 | 22 | 10 | 16 |
| avg | | 7.2 | 11.6 |
```c
/* Round-robin scheduler skeleton */
struct task *rr_pick_next(struct runqueue *rq) {
    return list_first_entry(&rq->tasks, struct task, run_list);
}

void rr_tick(struct task *t, struct runqueue *rq) {
    if (--t->quantum == 0) {
        t->quantum = RR_QUANTUM;                  /* e.g. 4 ms */
        list_move_tail(&t->run_list, &rq->tasks);
        schedule();                               /* force a resched */
    }
}
```
Classic rule of thumb: pick q so that ~80% of CPU bursts are shorter than q. Linux's CFS dynamically sizes its equivalent via sched_latency_ns (default 24 ms) divided across active tasks, bounded below by sched_min_granularity_ns (3 ms).
Multi-Level Feedback Queue (MLFQ)
Several priority queues (Q0 highest .. Qn lowest). A job starts in Q0. If it uses its full quantum it drops to Q(i+1). If it voluntarily yields (I/O) before the quantum ends, it stays — interactive jobs float near the top, CPU-bound ones sink. Periodic priority boost at interval S raises everyone back to Q0 to prevent starvation.
Rules (Arpaci-Dusseau):
- If Priority(A) > Priority(B): A runs.
- If Priority(A) == Priority(B): round-robin.
- New job enters at the highest priority.
- If a job uses its entire time allotment at a level, it drops.
- After time S, move all jobs to the topmost queue (the boost).
Without priority boosting, a CPU-hog can game MLFQ by issuing a fake I/O right before its quantum expires — it stays at the top forever. Fix: account total time at a level (allotment), not per-quantum.
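A toy sketch of rules 4 and 5 (allotment-based demotion plus a periodic boost; the constants and task layout are invented):

```c
/* mlfq_tick.c — MLFQ demotion with per-level allotment, plus boost */
#include <stdio.h>

#define LEVELS 3
#define ALLOT  4    /* ticks a task may consume per level */
#define BOOST  50   /* rule 5: boost interval S, in ticks */

struct task { const char *name; int level; int used; };

static void demote_on_tick(struct task *t) {
    if (++t->used >= ALLOT && t->level < LEVELS - 1) {
        t->level++;          /* rule 4: burned the full allotment, drop a level */
        t->used = 0;
    }
}

static void boost(struct task *tasks, int n) {
    for (int i = 0; i < n; ++i) {   /* rule 5: everyone back to Q0 */
        tasks[i].level = 0;
        tasks[i].used  = 0;
    }
}

int main(void) {
    struct task cpu_hog = { "hog", 0, 0 };
    for (int tick = 1; tick <= 100; ++tick) {
        demote_on_tick(&cpu_hog);           /* hog never yields, so it sinks */
        if (tick % BOOST == 0) boost(&cpu_hog, 1);
        if (tick % 10 == 0)
            printf("tick %3d: level %d\n", tick, cpu_hog.level);
    }
    return 0;
}
```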
CFS — Linux Completely Fair Scheduler
CFS picks the runnable task with the smallest vruntime (virtual runtime). Tasks sit in a red-black tree indexed by vruntime; insertion and removal are O(log n), and the leftmost (minimum) node is cached so picking it is O(1). “Fair” = every task makes the same vruntime progress over sched_latency.
Nice values map to weights: nice 0 → 1024, each nice step multiplies by ~1.25. Effective time-slice = sched_latency × (weight / total_weight), so a higher-weighted task runs longer slices and accumulates vruntime slower:
```c
/* Simplified CFS vruntime update (kernel/sched/fair.c) */
static void update_curr(struct cfs_rq *cfs_rq)
{
    struct sched_entity *curr = cfs_rq->curr;
    u64 now = rq_clock_task(rq_of(cfs_rq));
    u64 delta_exec = now - curr->exec_start;

    if ((s64)delta_exec <= 0)
        return;

    curr->exec_start = now;
    curr->vruntime += calc_delta_fair(delta_exec, curr);  /* weighted */
    update_min_vruntime(cfs_rq);
}
```
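Back-of-envelope slice math under those defaults (a sketch of the formula, not kernel code — the kernel uses a precomputed weight table rather than pow):

```c
/* slice_calc.c — CFS slice sizes from nice weights (weight ~ 1024 / 1.25^nice) */
#include <stdio.h>
#include <math.h>

int main(void) {
    double latency_ms = 24.0;               /* sched_latency */
    double w0 = 1024.0;                     /* nice 0 */
    double w5 = 1024.0 / pow(1.25, 5);      /* nice +5 -> ~335 */
    double total = w0 + w5;
    printf("nice 0: %.1f ms, nice +5: %.1f ms\n",
           latency_ms * w0 / total,         /* ~18 ms */
           latency_ms * w5 / total);        /* ~6 ms */
    return 0;
}
```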
CFS knobs
| Tunable | Default | Effect |
|---|---|---|
| sched_latency_ns | 24,000,000 | Target period in which every runnable task runs once |
| sched_min_granularity_ns | 3,000,000 | Floor on per-task slice (caps nr_tasks / latency) |
| sched_wakeup_granularity_ns | 4,000,000 | vruntime gap required before a wakee preempts current |
| sched_migration_cost_ns | 500,000 | Penalty for cross-CPU task migration; prevents thrashing |
Linux 6.6+ replaces CFS with EEVDF (Earliest Eligible Virtual Deadline First) — still a weighted-fair queue, but with a proper deadline-based ordering that reduces latency outliers.
You don't need to memorize every line of CFS. The idea to convey: vruntime is a normalized progress clock; CFS always picks the task that looks most “behind.” Nice values are weights, not priorities; the scheduler still runs every task, just at different rates.
Head-to-Head Summary
| Algorithm | Preempt | Pick | Avg wait | Avg TT | Starvation | Best for |
|---|---|---|---|---|---|---|
| FCFS | No | FIFO | 5.8 | 10.2 | No | Throughput batch |
| SJF/SRTF | Yes (SRTF) | min burst | 4.4 | 8.8 | Yes | Known workloads |
| Round Robin (q=2) | Yes | FIFO cycle | 7.2 | 11.6 | No | Time-sharing TTY |
| MLFQ | Yes | hi-prio Q | 5.4 | 9.8 | With boost | Mixed interactive+batch |
| CFS | Yes | min vruntime | 6.8 | 11.2 | No | Linux desktop+server |
SJF has the best average waiting time by construction. Real schedulers (MLFQ, CFS) sacrifice a point or two of wait for fairness and starvation-freedom — losing 2% of average waiting for a 100% guarantee of eventual progress is a bargain.
Synchronization
Mutexes, semaphores, monitors, and condition variables — with the bounded-buffer producer-consumer worked step-by-step.
Race Conditions & Atomicity
A race condition arises when two threads access the same memory location concurrently, at least one is a write, and there is no happens-before ordering between them. The canonical example is an unprotected counter:
```c
/* counter.c — the classic race */
int counter = 0;

void *work(void *arg) {
    for (int i = 0; i < 1000000; ++i)
        counter++;            /* load, inc, store — 3 instructions */
    return NULL;
}
/* 4 threads: expected 4,000,000; observed 3,127,841 (lost updates). */
```
Each counter++ compiles to mov, inc, mov. Interleave two threads and a read can miss an update from another core. The fix is either a mutex or an atomic (hardware lock xadd).
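The atomic fix as a sketch — the same loop, but each increment is one indivisible read-modify-write:

```c
/* counter_fixed.c — compiles to a lock xadd on x86 */
#include <stdatomic.h>

atomic_int counter = 0;

void *work(void *arg) {
    for (int i = 0; i < 1000000; ++i)
        atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
    return NULL;
}
/* 4 threads now reliably total 4,000,000 — no interleaving can split the RMW. */
```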
Critical-section requirements (Dijkstra / Lamport)
- Mutual exclusion — at most one thread inside.
- Progress — no thread outside the section blocks entry.
- Bounded waiting — once you request entry, you wait for at most k entries by others.
On most architectures a long write is atomic if aligned, but a mis-aligned or > 8-byte store can tear. Use _Atomic/std::atomic; do not assume volatile buys you atomicity (it only suppresses compiler reordering of that variable).
Mutexes & Spinlocks
A mutex is a binary, owned lock: only the holder can unlock. Under contention the loser either spins (CPU-busy) or sleeps (kernel-parked).
| Lock kind | Uncontended cost | Contended cost | When to use |
|---|---|---|---|
| Spinlock | ~20 ns (atomic CAS) | Wastes CPU cycles waiting | Short critical sections, kernel IRQ paths |
| Futex-backed mutex (Linux) | ~30 ns CAS, no syscall | ~1 µs wake via FUTEX_WAKE | General user-space lock |
| Adaptive (pthread) | ~30 ns | Spins briefly, then parks | Mix of short and long holds |
| Reader-writer (RW) | ~50 ns read | Writer starvation risk | Read-heavy workloads |
```c
/* Test-and-set spinlock (toy, no back-off) */
typedef struct { _Atomic int flag; } spin_t;

static inline void spin_lock(spin_t *s) {
    while (atomic_exchange_explicit(&s->flag, 1, memory_order_acquire)) {
        while (atomic_load_explicit(&s->flag, memory_order_relaxed))
            __builtin_ia32_pause();   /* spin-hint */
    }
}

static inline void spin_unlock(spin_t *s) {
    atomic_store_explicit(&s->flag, 0, memory_order_release);
}
```
Why not just spin forever?
- Priority inversion — a high-priority spinner burns CPU while the low-priority holder is waiting for CPU; neither makes progress.
- Cache-line ping-pong — every spinner issues cache-line-exclusive requests; the lock line bounces between cores.
- Power — spinning defeats C-states; phones and laptops hate it.
Linux userspace uses futexes: the fast path is an atomic CAS entirely in userspace; only on contention does the loser call FUTEX_WAIT into the kernel.
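A simplified sketch of that fast/slow path, adapted from Drepper's “Futexes Are Tricky” (not production-ready; states: 0 = unlocked, 1 = locked, 2 = locked with waiters):

```c
/* futex_mutex.c — CAS fast path, FUTEX_WAIT slow path */
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

static atomic_int lockword = 0;

static void futex_lock(void) {
    int c = 0;
    atomic_compare_exchange_strong(&lockword, &c, 1);
    if (c == 0) return;                        /* fast path: one CAS, no syscall */
    if (c != 2)
        c = atomic_exchange(&lockword, 2);     /* mark contended */
    while (c != 0) {                           /* slow path: sleep in the kernel */
        syscall(SYS_futex, &lockword, FUTEX_WAIT, 2, NULL, NULL, 0);
        c = atomic_exchange(&lockword, 2);     /* 0 means we now own the lock */
    }
}

static void futex_unlock(void) {
    if (atomic_exchange(&lockword, 0) == 2)    /* waiters present? */
        syscall(SYS_futex, &lockword, FUTEX_WAKE, 1, NULL, NULL, 0);
}
```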
An uncontended pthread_mutex_lock + unlock is ~30 ns (one atomic CAS to lock, one atomic store to unlock). A contended lock with a single waiter costs ~2 µs (syscall round-trip). A ticket lock under heavy contention can cost 10+ µs per handoff due to cache-line bouncing.
Semaphores — Counting & Binary
A semaphore holds an integer. wait/P(s) decrements (blocking if it would go negative); signal/V(s) increments (waking one waiter). Dijkstra's original design is the seed of most sync primitives.
- Binary semaphore — values in {0,1}. Looks like a mutex but anyone can signal it (no ownership). Good for signalling.
- Counting semaphore — models a pool of N resources (connection pool, bounded buffer slots).
```c
/* POSIX sem_t — counting semaphore bounding a connection pool */
#include <semaphore.h>

static sem_t slots;

void pool_setup(void) {
    sem_init(&slots, 0, 16);   /* 16 connections available */
}

void *handler(void *arg) {
    sem_wait(&slots);          /* P: block if 0 */
    conn_t c = pool_take();
    serve(c);
    pool_return(c);
    sem_post(&slots);          /* V: wake one waiter */
    return NULL;
}
```
Semaphores solve the “signal-before-wait” race that bare condition variables have — the count is preserved. But using a binary semaphore as a mutex loses ownership tracking: any thread can unlock, error detection (unlock-by-non-owner, recursive unlock) is lost.
Monitors & Condition Variables
A monitor is a higher-level construct: an object whose methods are mutually exclusive (implicit mutex) and whose threads rendezvous via condition variables. Java's synchronized + wait/notify is a direct descendant. POSIX models it as a pthread_mutex_t paired with a pthread_cond_t.
CV operations:
- wait(mutex) — atomically release the mutex and block. On wake, reacquire the mutex before returning.
- signal() / notify() — wake one waiter (no-op if none are waiting).
- broadcast() / notifyAll() — wake all waiters (they contend for the mutex).
Wait semantics: Hoare vs Mesa
| | Hoare | Mesa (Java, POSIX) |
|---|---|---|
| Signal hands off mutex directly to signalled thread | Yes — signaller is suspended | No — signaller keeps running |
| Must re-check predicate on wake? | No | Yes — always loop on while |
| Implementation burden | Heavy (complex scheduler) | Light |
On Mesa-semantics systems a spurious wakeup is legal, and even without that, another thread may have grabbed the resource between the signal and your wake-up. Always re-test the predicate inside a loop: while (!ready) pthread_cond_wait(&cv, &m);
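The canonical Mesa-style pairing as a sketch:

```c
/* cv_wait.c — predicate loop on the waiter side, signal under the mutex */
#include <pthread.h>

static pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
static int ready = 0;

void *waiter(void *arg) {
    pthread_mutex_lock(&m);
    while (!ready)                   /* loop: spurious wakeups + lost races */
        pthread_cond_wait(&cv, &m);  /* atomically releases m and sleeps */
    /* ... consume the now-true condition, holding m ... */
    pthread_mutex_unlock(&m);
    return NULL;
}

void *setter(void *arg) {
    pthread_mutex_lock(&m);
    ready = 1;
    pthread_cond_signal(&cv);        /* wake one waiter; it re-acquires m */
    pthread_mutex_unlock(&m);
    return NULL;
}
```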
Bounded-Buffer Producer-Consumer (2P + 2C, size 4)
Two producers, two consumers, a circular buffer of 4 slots. Three semaphores coordinate: empty counts free slots (init 4), full counts produced items (init 0), and mutex protects buffer indices (binary, init 1).
```c
/* bounded_buffer.c — classic Dijkstra pattern */
#define N 4
item_t buf[N];
int in = 0, out = 0;
sem_t empty, full, mutex;

void buffer_setup(void) {
    sem_init(&empty, 0, N);   /* N free slots */
    sem_init(&full, 0, 0);    /* no items yet */
    sem_init(&mutex, 0, 1);   /* protects in/out */
}

void *producer(void *_) {
    for (;;) {
        item_t it = produce_next();
        sem_wait(&empty);     /* wait for at least one free slot */
        sem_wait(&mutex);     /* critical section */
        buf[in] = it;
        in = (in + 1) % N;
        sem_post(&mutex);
        sem_post(&full);      /* signal consumers */
    }
}

void *consumer(void *_) {
    for (;;) {
        sem_wait(&full);
        sem_wait(&mutex);
        item_t it = buf[out];
        out = (out + 1) % N;
        sem_post(&mutex);
        sem_post(&empty);
        consume(it);
    }
}
```
Order of acquiring semaphores matters
Both producer and consumer acquire their counting semaphore before the mutex. Reverse it and you deadlock: producer holds mutex while blocked on empty, consumer cannot acquire mutex to release a slot.
The key question isn't “can you code prod-cons?” but “why are there three semaphores?” Two counting semaphores give you flow control in both directions (full and empty). The mutex protects the shared indices. Collapsing any of the three creates either a race or a deadlock.
Deadlock
The four Coffman conditions, detection via resource-allocation graphs, prevention and avoidance, and the Banker's algorithm worked through a real allocation table.
The Four Coffman Conditions (1971)
All four must hold simultaneously for deadlock. Break any one and deadlock is impossible.
| # | Condition | Meaning | How you'd break it |
|---|---|---|---|
| 1 | Mutual exclusion | Resources cannot be shared | Make the resource virtual/shared (e.g., immutable data, optimistic reads) |
| 2 | Hold & wait | A process holds one resource and waits for another | Require all-at-once acquisition, or release before re-request |
| 3 | No preemption | Resources are only released voluntarily | Steal resources (e.g., DB abort-on-conflict with MVCC) |
| 4 | Circular wait | Cycle in the wait-for graph | Impose a global lock order; always acquire in that order |
The cheap prevention answer is always: sort lock IDs and acquire in order. This breaks circular wait. For dining philosophers, number the chopsticks 1–5 and have every philosopher pick up the lower-numbered first. The asymmetry breaks the cycle.
Canonical example — the opposite-direction lock order
```c
/* Thread A */                        /* Thread B */
pthread_mutex_lock(&m1);              pthread_mutex_lock(&m2);
pthread_mutex_lock(&m2); /* waits */  pthread_mutex_lock(&m1); /* waits — deadlock */
/* work */                            /* work */
pthread_mutex_unlock(&m2);            pthread_mutex_unlock(&m1);
pthread_mutex_unlock(&m1);            pthread_mutex_unlock(&m2);

/* Fix: both sides acquire in m1-then-m2 order. */
```
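A common shape for that fix is a helper that sorts locks into a canonical order before acquiring (a sketch; lock_pair is an invented name, and address order stands in for any global lock-ID order):

```c
/* lock_pair.c — break circular wait by acquiring in address order */
#include <pthread.h>
#include <stdint.h>

void lock_pair(pthread_mutex_t *a, pthread_mutex_t *b) {
    if ((uintptr_t)a > (uintptr_t)b) {   /* canonicalize the order */
        pthread_mutex_t *tmp = a; a = b; b = tmp;
    }
    pthread_mutex_lock(a);   /* every thread takes the lower address first, */
    pthread_mutex_lock(b);   /* so no cycle of waits can ever form */
}

void unlock_pair(pthread_mutex_t *a, pthread_mutex_t *b) {
    pthread_mutex_unlock(a); /* unlock order is irrelevant to deadlock */
    pthread_mutex_unlock(b);
}
```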
Resource Allocation Graph (RAG)
Model processes as circles, resources as squares. An edge from process → resource means “requests”; resource → process means “currently allocated to.” Deadlock in a single-instance system iff there is a cycle. With multiple instances, a cycle is necessary but not sufficient — check actual availability.
Detection via Cycle Finding
With a RAG in memory, detect deadlock by running DFS and looking for a back-edge in the directed graph. Kernel deadlock detectors (lockdep in Linux) keep an in-memory chain of lock-acquire-order pairs and flag any new edge that would complete a cycle, before the deadlock actually happens.
```python
# Simple DFS cycle-detection on a wait-for graph
def has_cycle(adj):
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {v: WHITE for v in adj}

    def dfs(u):
        color[u] = GRAY
        for v in adj[u]:
            if color[v] == GRAY:              # back-edge = cycle
                return True
            if color[v] == WHITE and dfs(v):
                return True
        color[u] = BLACK
        return False

    return any(dfs(v) for v in adj if color[v] == WHITE)
```
Recovery strategies once deadlock is detected
- Process termination — kill all deadlocked processes, or pick a victim by cost (total CPU spent, priority, resources held).
- Resource preemption — force-release a resource; requires rollback to a safe checkpoint (common in DBs via transaction abort).
- Operator intervention — log, alert, manual cleanup. Used when correctness trumps availability.
Prevention vs Avoidance
| | Prevention | Avoidance | Detection |
|---|---|---|---|
| When it acts | Statically: break one Coffman condition | Dynamically: refuse grants that could deadlock | After the fact: find cycles, recover |
| Needs runtime info | No | Max-resource declarations per process | Just the wait-for graph |
| Overhead | Low (design-time) | High (every grant runs the safety check) | Medium (periodic scan) |
| Downside | Conservative: may forbid safe patterns | Must know max claims up front | Deadlock happens; recovery cost non-zero |
| Example | Global lock order | Banker's algorithm | Linux lockdep, MySQL InnoDB wait-for graph |
Most production systems use a mix: static prevention where cheap (lock ordering) + post-hoc detection for dynamic allocators (DB row locks). Full avoidance (Banker's) is rare outside of embedded safety-critical code because nobody knows their maxima.
Banker's Algorithm — Avoidance by Safety Check
Dijkstra 1965. Given each process's maximum claim, grant a request only if the post-grant state is safe — i.e. there exists an ordering of processes that can all complete with current available resources.
Inputs (for n processes, m resource types):
- Max[n][m] — max units process i will ever need of resource j.
- Alloc[n][m] — units currently held.
- Need[i][j] = Max[i][j] − Alloc[i][j] — remaining demand.
- Available[m] — units of each resource currently free.
Worked example: 5 processes, 3 resource types (A, B, C)
Total resources: A=10, B=5, C=7. Initial state:
| Process | Alloc (A,B,C) | Max (A,B,C) | Need (A,B,C) |
|---|---|---|---|
| P0 | 0,1,0 | 7,5,3 | 7,4,3 |
| P1 | 2,0,0 | 3,2,2 | 1,2,2 |
| P2 | 3,0,2 | 9,0,2 | 6,0,0 |
| P3 | 2,1,1 | 2,2,2 | 0,1,1 |
| P4 | 0,0,2 | 4,3,3 | 4,3,1 |
| Sum alloc | 7,2,5 | | |
Available = Total − Sum alloc = (3, 3, 2). This state is safe: P1's need (1,2,2) fits in (3,3,2); once it finishes and returns its allocation, P3, P4, P0, and P2 can finish in turn — the safe sequence ⟨P1, P3, P4, P0, P2⟩.
Reference implementation
```python
# bankers.py — classic safety algorithm
def is_safe(avail, max_need, alloc):
    n, m = len(alloc), len(avail)
    need = [[max_need[i][j] - alloc[i][j] for j in range(m)] for i in range(n)]
    work = avail[:]
    finish = [False] * n
    order = []
    while True:
        progressed = False
        for i in range(n):
            if finish[i]:
                continue
            if all(need[i][j] <= work[j] for j in range(m)):
                for j in range(m):
                    work[j] += alloc[i][j]   # i finishes, returns its resources
                finish[i] = True
                order.append(i)
                progressed = True
        if not progressed:
            break
    return (all(finish), order)
```
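Feeding the worked table into is_safe (values copied from above) confirms the safe sequence:

```python
alloc = [[0,1,0], [2,0,0], [3,0,2], [2,1,1], [0,0,2]]
maxn  = [[7,5,3], [3,2,2], [9,0,2], [2,2,2], [4,3,3]]
avail = [3, 3, 2]

safe, order = is_safe(avail, maxn, alloc)
print(safe, order)   # True [1, 3, 4, 0, 2]  ->  P1, P3, P4, P0, P2
```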
Banker's algorithm is O(m·n²) per request. It also requires that every process declare its maximum claim up front — rarely true in real software. Consequence: Banker's is mostly a textbook topic; real kernels use prevention + detection, not avoidance.
File Systems
How disks are organized into inodes and directories, what a journal actually logs, and why LFS / COW file systems look so different.
Inodes & On-Disk Layout
UNIX file systems separate metadata (an inode) from the data blocks. An inode carries: mode, uid/gid, timestamps, size, refcount, and pointers to data blocks. Directories are just special files whose data is a table of name → inode.
On ext4 (4 KiB blocks, 4-byte pointers): 12 direct = 48 KiB, single indirect = 4 MiB, double = 4 GiB, triple = 4 TiB. ext4 also supports extents (base+length) so a 1 GiB contiguous file is 1 extent, not 262,144 pointers.
Directories, Hard Links & Path Resolution
A directory is a file whose data blocks contain (name, inode_number) records. ls reads that; stat then resolves each inode. The filesystem tree is a directed graph with hard links (more than one name per inode) and soft links (a file whose content is the path to traverse).
Hard vs soft links
| | Hard link | Soft / symbolic link |
|---|---|---|
| Refers to | Same inode | Path string |
| Cross filesystem? | No | Yes |
| Survives target rename? | Yes (unaffected) | Breaks (dangling) |
| Refcount bump? | Yes (inode nlink++) | No |
| Can target a directory? | No (only root can; forbidden in practice) | Yes |
```
# Observe hard vs soft links
$ echo hello > a
$ ln a hard          # hard link — same inode, nlink = 2
$ ln -s a soft       # symlink — separate inode
$ stat -c '%i %h %n' a hard soft
12345 2 a
12345 2 hard
12346 1 soft
```
Journaling — Crash Consistency without fsck
Without a journal, a crash during a multi-block write can leave the filesystem inconsistent (e.g., inode says 1 MiB but only 512 KiB of data blocks allocated). The journal is a circular log of transactions: group metadata edits, write them to the log first, then the main filesystem.
Phases of a journalled write (ext4 in data=ordered mode)
- Begin txn — gather all dirty inodes, bitmaps, and directory blocks for this syscall.
- Data write — write user data blocks to their final location. Wait for them to be durable.
- Journal log — write the metadata edits (inodes, bitmaps) to the journal with a descriptor block. Wait.
- Commit record — write a single commit block with checksum. This is the atomic point: either the commit is durable (txn replayable) or it isn't (txn will be discarded).
- Checkpoint — later, write the metadata to the main filesystem and mark the journal space reusable.
Journal modes
| Mode | What's logged | Data consistency | Perf |
|---|---|---|---|
| journal | Both metadata and data | Strongest — data never appears from an uncommitted txn | Lowest (2× writes) |
| ordered (default) | Metadata only, but data is forced out before commit | Post-crash: no stale bytes | Good balance |
| writeback | Metadata only; data can land any time | Post-crash: files may contain garbage | Fastest |
A journal guarantees consistency of the filesystem, not of your application. A database on ext4 still needs fsync + barriers + a WAL of its own, because the page cache may hold your latest writes even after the syscall returns.
LFS vs COW
Traditional (ext4, XFS) update data blocks in-place and use a journal for metadata. Two alternative designs eliminate in-place updates entirely:
Log-Structured FS (LFS / F2FS / NILFS)
- All writes (data + metadata) are appended to a log.
- An inode map (imap) tracks “where is the latest version of inode i?”
- Reads follow imap → inode → data blocks (same as regular FS).
- A segment cleaner compacts partially-dead segments to reclaim space.
- Ideal for flash: write-heavy, sequential, no RMW amplification.
Copy-on-Write (ZFS, btrfs, APFS)
- Never overwrite a live block. Every update writes to a new location and replaces pointers up the tree.
- A superblock atomically swings to the new root — all edits atomic.
- Cheap snapshots: retain an old root, its subtree is still reachable.
- End-to-end checksums on every block (optional) — catches silent bit rot.
- Write amplification: editing a byte may rewrite the path from root to leaf.
Journaled in-place: small writes are cheap, fragmentation low, recovery fast. COW: snapshots cheap, integrity strong, but metadata writes amplify. LFS: flash-friendly write pattern, cleaner overhead eats some of that win.
Crash Consistency & fsync
Even with a journal you must invoke fsync(fd) to tell the kernel “flush my data to durable storage.” Without it the page cache may hold your write for seconds. fsync on a directory ensures directory entries are durable (critical for rename-based atomic replace).
```c
/* Durable atomic file replace (POSIX) */
int fd = open("data.new", O_WRONLY|O_CREAT|O_TRUNC, 0644);
write(fd, payload, len);
fsync(fd);                      /* flush data */
close(fd);

rename("data.new", "data");     /* atomic on the same FS */

int dfd = open(".", O_RDONLY|O_DIRECTORY);
fsync(dfd);                     /* flush the dir entry, else the rename can be lost */
close(dfd);
```
I/O Models
Blocking, non-blocking, multiplexed (select/poll/epoll/kqueue), and async (io_uring) — plus the subtle epoll bugs that trip even senior engineers.
Blocking / Non-blocking / Sync / Async Matrix
Two orthogonal properties govern any read or write:
- Blocking vs non-blocking — does the call wait when data isn't ready?
- Synchronous vs asynchronous — does the caller do the I/O itself, or ask the kernel to do it and notify when done?
| | Sync | Async |
|---|---|---|
| Blocking | Classic read(): thread sleeps until data ready, then copies it | POSIX AIO (aio_read): rarely used, kernel threads behind it |
| Non-blocking | fcntl(O_NONBLOCK)+select/epoll: app polls for readiness, then copies itself | io_uring: app submits, kernel performs the I/O, app reaps completions |
The C10K problem (thread-per-connection collapses around 10K connections) was solved by readiness notification — epoll/kqueue. The C10M problem (10 million connections) is attacked by completion notification — io_uring — which avoids per-I/O syscall overhead entirely.
select / poll / epoll / kqueue
All four let a single thread monitor many FDs. They differ in how the FD set is conveyed and how scalable they are:
| API | Portability | Per-call cost | Max fds | Semantics |
|---|---|---|---|---|
| select | POSIX | O(FD_SETSIZE) bit scan | 1024 by default | Pass a bitset; the kernel copies it each call |
| poll | POSIX | O(n) array scan | ulimit | Pass pollfd[]; the kernel iterates all |
| epoll | Linux | O(ready) — returns only active fds | Tens of millions | Register once; the kernel queues ready events |
| kqueue | *BSD / macOS | O(ready) | Unbounded | Event-typed filter API (files, signals, timers) |
| IOCP | Windows | O(1) completion reap | Unbounded | Completion-based, not readiness-based |
```c
/* epoll event loop skeleton */
int ep = epoll_create1(0);
struct epoll_event ev = { .events = EPOLLIN|EPOLLET, .data.fd = sock };
epoll_ctl(ep, EPOLL_CTL_ADD, sock, &ev);

char buf[4096];
for (;;) {
    struct epoll_event evs[64];
    int n = epoll_wait(ep, evs, 64, -1);
    for (int i = 0; i < n; ++i) {
        int fd = evs[i].data.fd;
        if (evs[i].events & EPOLLIN) {
            /* ET mode: drain until EAGAIN */
            for (;;) {
                ssize_t r = read(fd, buf, sizeof(buf));
                if (r < 0 && errno == EAGAIN) break;
                if (r <= 0) { close(fd); break; }
                handle(fd, buf, r);
            }
        }
    }
}
```
Edge vs Level Triggered
Edge-triggered (EPOLLET) notifies once per readiness transition. You must loop and drain the fd until read returns EAGAIN. If you read exactly what epoll said was there and stop, you'll miss the rest — and epoll will not notify you again until another edge. The same rule applies to accept loops: pull all pending connections out of the listen queue.
io_uring — True Async I/O on Linux
Introduced in Linux 5.1 (2019). Two lock-free shared-memory ring buffers (SQ: submission, CQ: completion) between userspace and kernel. The app writes SQEs; the kernel consumes them, does the I/O, writes CQEs. Zero syscalls per I/O in the busy case (IORING_SETUP_SQPOLL polls the SQ from a kernel thread).
What it subsumes
- Every read/write/send/recv/accept/connect/openat/stat/close/fsync — submittable and batchable.
- Buffered, direct, and zero-copy (IORING_OP_SEND_ZC) I/O.
- Chained / linked SQEs: “open, then read, then close” as a single submission.
- Registered buffers + files: skip per-I/O reference counting.
For 4 KiB random reads on NVMe, epoll + pread tops out around ~600 K IOPS per core. io_uring with polled mode reaches ~3 M IOPS per core — ~5× the throughput at ~1/3 the CPU, mostly by eliminating syscall entry/exit and the copy_user cost.
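A minimal liburing sketch of the submit/reap cycle (assumes liburing is installed; link with -luring, error handling trimmed):

```c
/* uring_read.c — submit one read SQE, reap its CQE */
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>

int main(void) {
    struct io_uring ring;
    io_uring_queue_init(8, &ring, 0);                 /* SQ/CQ rings, depth 8 */

    int fd = open("data.bin", O_RDONLY);
    char buf[4096];
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0); /* read 4 KiB at offset 0 */
    io_uring_submit(&ring);                           /* one syscall, batchable */

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);                   /* reap the completion */
    printf("read %d bytes\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);
    io_uring_queue_exit(&ring);
    return 0;
}
```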
Common epoll / Event-Loop Bugs
Thundering herd — many worker threads all epoll_wait on the same listen fd. A connection arrives; all wake. Only one gets the socket; the rest find EAGAIN and go back to sleep. Fix: EPOLLEXCLUSIVE (kernel 4.5+) wakes only one waiter, or use SO_REUSEPORT with one accepter per thread.
Stale fd reuse — you close fd 7, the kernel immediately reuses it for a new socket, then your epoll event for “old fd 7” fires and you process bytes for the wrong connection. Fix: always epoll_ctl(EPOLL_CTL_DEL) before close, or associate a generation counter in epoll_event.data.u64.
ET on a blocking fd — with EPOLLET on a blocking fd, the drain loop's final read blocks forever instead of returning EAGAIN. Always pair ET with O_NONBLOCK.
EPOLLOUT discipline — the TCP send buffer fills and write returns EAGAIN. You must arm EPOLLOUT, queue the unsent bytes, and flush when writable — then disarm EPOLLOUT. Leaving EPOLLOUT armed causes a busy-loop of wakeups; see the sketch below.
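A sketch of that arm/flush/disarm discipline — queue_pending and flush_queue are hypothetical helpers for a per-fd outbound buffer:

```c
/* epollout.c — short-write handling with EPOLLOUT arm/disarm */
#include <sys/epoll.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

void queue_pending(int fd, const char *p, size_t n);  /* hypothetical */
int  flush_queue(int fd);   /* hypothetical: returns bytes still queued */

void try_send(int ep, int fd, const char *p, size_t n) {
    ssize_t w = write(fd, p, n);
    if (w < 0 && errno != EAGAIN) return;   /* real error: close + cleanup */
    size_t sent = (w < 0) ? 0 : (size_t)w;
    if (sent < n) {                         /* short write: buffer + arm */
        queue_pending(fd, p + sent, n - sent);
        struct epoll_event ev = { .events = EPOLLIN|EPOLLOUT|EPOLLET,
                                  .data.fd = fd };
        epoll_ctl(ep, EPOLL_CTL_MOD, fd, &ev);
    }
}

void on_writable(int ep, int fd) {
    if (flush_queue(fd) == 0) {             /* fully drained: disarm EPOLLOUT */
        struct epoll_event ev = { .events = EPOLLIN|EPOLLET, .data.fd = fd };
        epoll_ctl(ep, EPOLL_CTL_MOD, fd, &ev);
    }
}
```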
Concurrency Patterns
Futures, actors, CSP channels, and the event loop — the building blocks every modern async runtime picks from.
Futures & Promises
A future is a handle to a value that will arrive later. A promise is the write side of that pair — whoever computes the value completes the promise, which fulfils all futures derived from it. Semantically identical to a continuation-passing callback, but composable.
```python
# Python futures/async composition
import asyncio, aiohttp

async def fetch(session, url):
    async with session.get(url) as r:
        return await r.text()

async def main(urls):
    async with aiohttp.ClientSession() as s:
        # All fetches run concurrently; gather awaits them all.
        pages = await asyncio.gather(*(fetch(s, u) for u in urls))
        return [len(p) for p in pages]

asyncio.run(main(["https://a.example/", "https://b.example/"]))
```
Combinators you must know
- map — transform the eventual value.
- flatMap / then — chain another future that depends on this one's result.
- all / gather — wait for N futures, return a list of N results.
- race / any — return the first to succeed (or fail).
- timeout — cancel after a deadline.
If you call an async function and don't await it, its body never runs (Python coroutines) or runs detached (JS Promises with unhandled rejections). Both languages warn; treat these warnings as errors.
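Sketches of the race and timeout combinators in asyncio (the slow coroutine is an invented stand-in for real work):

```python
import asyncio

async def slow(tag: str, secs: float) -> str:
    await asyncio.sleep(secs)
    return tag

async def main() -> None:
    # race / any: first completion wins; cancel the stragglers
    done, pending = await asyncio.wait(
        [asyncio.create_task(slow("a", 0.1)), asyncio.create_task(slow("b", 5))],
        return_when=asyncio.FIRST_COMPLETED)
    for t in pending:
        t.cancel()
    print(done.pop().result())            # "a"

    # timeout: cancel after a deadline
    try:
        await asyncio.wait_for(slow("c", 5), timeout=0.2)
    except asyncio.TimeoutError:
        print("timed out")

asyncio.run(main())
```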
Actor Model — Isolation by Mailbox
Each actor is a lightweight process with private state and a mailbox. Communication is one-way messages: actor A sends to B's mailbox, B processes messages one at a time. Because state is private and access is serial, no locks required. Erlang/OTP, Akka, Elixir popularized it.
```python
# Actor-style counter with a single mailbox coroutine
import asyncio

async def counter_actor(inbox):
    n = 0
    while True:
        msg = await inbox.get()
        if msg == "inc":
            n += 1
        elif isinstance(msg, tuple) and msg[0] == "get":
            msg[1].set_result(n)      # reply through a future
        elif msg == "stop":
            return
```
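Driving it (a usage sketch): messages go into the mailbox, and reads come back through a future:

```python
async def main():
    inbox = asyncio.Queue()
    task = asyncio.create_task(counter_actor(inbox))
    await inbox.put("inc")
    await inbox.put("inc")
    reply = asyncio.get_running_loop().create_future()
    await inbox.put(("get", reply))
    print(await reply)          # 2
    await inbox.put("stop")
    await task

asyncio.run(main())
```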
Strengths / weaknesses
- + No shared state, no locks, compositional supervision trees.
- + “Let it crash” recovery — a supervisor restarts a failing actor.
- − Ordering only within a mailbox; cross-actor invariants require careful protocol design.
- − Bounded mailboxes or you risk unbounded memory growth under back-pressure.
CSP Channels (Hoare, 1978 → Go)
Communicating Sequential Processes: goroutines synchronize by sending values through typed channels, not shared memory. Unbuffered channels are rendezvous (sender blocks until receiver arrives); buffered channels act as a bounded queue.
```go
// Go: pipeline of goroutines via channels
package main

import "fmt"

func producer(ch chan<- int) {
	for i := 0; i < 5; i++ {
		ch <- i
	}
	close(ch)
}

func consumer(ch <-chan int, done chan<- bool) {
	for v := range ch {
		fmt.Println(v)
	}
	done <- true
}

func main() {
	ch := make(chan int, 2)
	done := make(chan bool)
	go producer(ch)
	go consumer(ch, done)
	<-done
}
```
Actor identity is who (mailbox address); CSP identity is what (channel). Actors model independent agents, CSP models typed streams. They are duals: an actor is a goroutine looping on its channel.
Event Loop — Single-threaded Concurrency
One thread running a FIFO of callbacks. Non-CPU work (network, disk, timers) is delegated to the kernel (epoll/kqueue) or a small threadpool (libuv); the loop only dispatches ready tasks. JS engines (Node, browsers) and Python's asyncio both work this way.
A single sync JSON.parse on a 50 MB blob blocks the loop for ~500 ms; every other connection stalls. Offload to a worker thread (worker_threads) or a child process. Rule: if any handler takes >10 ms, profile and move it off the loop.
async / await Mechanics
Syntactic sugar over state machines. The compiler translates each async def into a class with methods send, throw, close; each await becomes a suspension point. The runtime drives the state machine from an event loop.
```python
# Roughly what "async def f()" becomes at the bytecode level (a sketch)
class AwaitableF:
    def __init__(self):
        self.state = 0

    def send(self, val):
        if self.state == 0:
            self.state = 1
            return some_future            # suspend here
        elif self.state == 1:
            result = val                  # the future's value
            raise StopIteration(result)   # done
```
“Why can a single thread handle 100K connections?” Because every await parks the coroutine, freeing the thread to run another. The cost of a context switch in userspace (saving a few pointers) is ~20 ns, vs 1–5 µs for a kernel thread switch.
Memory Model & Atomics
Why the hardware reorders your loads and stores, how happens-before rules let you reason about it, and the ABA problem that ruins naive lock-free code.
Reordering & Visibility
Two threads on two cores see a different order of the same stores. This isn't a bug; it's the architecture. Both the compiler and the CPU freely reorder instructions that are independent within a single thread. Cross-thread visibility requires explicit synchronization.
Who reorders what?
| Layer | Can do | Defeat with |
|---|---|---|
| Compiler | Reorder, hoist, combine, reuse registers | volatile / std::atomic / inline-asm "memory" clobber |
| CPU (out-of-order) | Reorder independent loads/stores within the pipeline | Memory barriers / fences |
| Store buffer | Delay stores so loads go first (x86 allows store→load reordering) | mfence, lock-prefixed ops |
| Cache coherence | Propagation delay between cores | Acquire/release ordering and cache line protocol |
Hardware model strengths
- x86/x86_64 — TSO (Total Store Ordering). Stores don't reorder with earlier stores; loads don't reorder with earlier loads or stores. Only reordering allowed: a store followed by a load to a different address.
- ARM64 — relaxed. Any pair of independent memory ops can reorder. Requires explicit dmb / ldar / stlr.
- POWER — weaker still; must fence often.
Code that “just works” on x86 often breaks on ARM. Java (JMM), C++11 (std::memory_order), and Go provide portable memory models — if you use them correctly you're safe on any architecture.
Happens-Before — The Reasoning Primitive
Define a partial order “hb” over events in a program. If a hb b, then a's writes are visible when b executes. The rules are few and compose transitively:
- Program order — within a thread, earlier statements hb later ones.
- Unlock / Lock — an unlock of m hb every subsequent lock of m.
- Release / Acquire — an atomic store-release hb every atomic load-acquire that reads it.
- Volatile (Java) — a write to a volatile hb every subsequent read of that volatile.
- Thread start / join — t.start() hb everything in t; everything in t hb t.join() returning.
- Transitivity — a hb b and b hb c implies a hb c.
The JMM / C++ memory model is intimidating on paper, but “do the writes I did before the flag become visible to someone who reads the flag?” is the only question. Release on write, acquire on read, and you're home.
Acquire / Release / SeqCst
| Order | Meaning | Typical use |
|---|---|---|
| memory_order_relaxed | Atomicity only — no ordering | Perf counter increments |
| memory_order_acquire | No later loads/stores move before this load | Lock acquire, flag read |
| memory_order_release | No earlier loads/stores move after this store | Lock release, flag write |
| memory_order_acq_rel | Both — used on RMW (CAS, fetch_add) | Lock-free data structures |
| memory_order_seq_cst | Single total order across all SC ops | Default — safest but slowest |
```c
/* Dekker-style publication with release/acquire (C11) */
#include <stdatomic.h>
#include <stdio.h>

_Atomic int flag = 0;
int payload;

void producer(void) {
    payload = 42;   /* non-atomic write */
    atomic_store_explicit(&flag, 1, memory_order_release);
}

void consumer(void) {
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;           /* spin until published */
    printf("%d\n", payload);   /* guaranteed to see 42 */
}
```
On x86, memory_order_acquire/release cost zero additional instructions vs relaxed (x86's TSO already provides the ordering). seq_cst on a store requires an mfence, ~15–30 cycles. On ARM the difference is real: relaxed = ldr, acquire = ldar, seq_cst = dmb + op.
Fences & Barriers
A fence is a standalone instruction that restricts reordering around it without itself being a read or write. Use when you need ordering without an atomic op:
- Full fence (mfence / dmb ish): no load or store moves across.
- Load fence (lfence): earlier loads complete before later loads.
- Store fence (sfence): earlier stores complete before later stores.
- Compiler fence only: asm volatile("" ::: "memory") — prevents the compiler from reordering, but emits no hardware barrier.
The ABA Problem on a Lock-Free Stack
The canonical CAS-based pop is: read head, read head->next, CAS head from old to next. If between your read and CAS someone pops A, pops B, then pushes A back — head is still A, CAS succeeds, but head->next now points at freed memory.
Fixes
- Versioned pointers (double-word CAS) — store (ptr, version) in 128 bits; every pop increments the version. cmpxchg16b on x86.
- Hazard pointers (Michael, 2004) — before dereferencing a shared pointer, publish it as “in use”; reclaimers skip nodes announced by any thread.
- Epoch-based reclamation (crossbeam in Rust) — a node is freed only when no thread is in an older epoch.
- RCU (Linux kernel) — reclaim after a grace period in which every CPU has context-switched.
```c
/* Treiber stack pop with DWCAS versioned head (x86_64: needs -mcx16) */
#include <stdatomic.h>
#include <stdint.h>
#include <stddef.h>

struct node { int v; struct node *next; };
struct vhead { struct node *p; uint64_t ver; };

_Atomic(struct vhead) top;

struct node *pop(void) {
    struct vhead old, neu;
    do {
        old = atomic_load(&top);
        if (!old.p) return NULL;
        neu.p = old.p->next;
        neu.ver = old.ver + 1;     /* version bump defeats ABA */
    } while (!atomic_compare_exchange_weak(&top, &old, neu));
    return old.p;
}
```
When asked “what's wrong with this lock-free stack,” the expected answer is ABA, followed by one of the four remediations above. Bonus: mention that GC languages (Java, Go) don't see ABA on reference-typed slots because the old node can't be reclaimed while a thread still holds a reference.
Cheat Sheet & Quick Reference
One-page summaries, must-memorize latency numbers, rapid-fire Q&A, and a POSIX command matrix.
Module Summaries
Processes & Threads
- Process = address space + FDs; thread = registers + stack
- Linux: both are task_struct; threads share mm / files / sighand
- fork = COW address space; exec replaces the image
- Context switch: ~1 µs direct + 10–100 µs indirect (cache/TLB)
Memory / VM
- 4-level paging: 9+9+9+9+12 = 48-bit VA, 4 KiB pages
- PTE bits: P, R/W, U/S, A, D, G, PS, XD, PFN
- TLB miss = hardware walker reads 4 cache lines
- Huge pages extend TLB reach from 256 KiB to 128 MiB
Scheduling
- FCFS: simple, convoy effect
- SJF/SRTF: optimal mean wait, starvation risk
- RR: fair, q tuning = latency vs throughput
- MLFQ: adaptive priority + aging boost
- CFS: RB-tree of vruntime, weighted fair; EEVDF in 6.6+
Synchronization
- Mutex: owned lock; pthread = futex-backed
- Counting semaphore: pool / flow control
- CV: always loop on predicate (Mesa semantics)
- Prod-cons needs 3 sems: empty, full, mutex
Deadlock
- Coffman: mutex, hold&wait, no-preempt, circular-wait
- Break circular-wait with a global lock order
- Detect via DFS cycle in wait-for graph
- Banker's: O(mn²) safety check per request
File Systems
- Inode = metadata; directory = name → inode table
- Journal modes: journal > ordered (default) > writeback
- COW FS (ZFS/btrfs): cheap snapshots, write amplification
- Atomic file replace: write temp, fsync, rename, fsync dir
I/O Models
- select O(n), poll O(n), epoll O(ready), kqueue O(ready)
- ET requires non-blocking + drain-until-EAGAIN
- io_uring: shared rings, ~0 syscalls per I/O
- Watch out: fd reuse, thundering herd, EPOLLOUT busy-loop
Concurrency Patterns
- Futures: map / flatMap / all / race / timeout
- Actors: private state + mailbox; let-it-crash
- CSP: channels as typed queues (Go)
- Event loop: macrotask ↔ microtask drain per tick
Memory Model
- x86 = TSO (strong); ARM = relaxed
- Happens-before: PO, unlock/lock, release/acquire, start/join
- Release on publisher's write, acquire on consumer's read
- ABA defeated by versioned CAS, hazard ptrs, RCU, epochs
Latency Numbers Every OS Candidate Should Know
| Operation | Time (2024) | Relative |
|---|---|---|
| L1 cache reference | 0.5 ns | 1x |
| Branch mispredict | 3 ns | 6x |
| L2 cache reference | 4 ns | 8x |
| Mutex lock/unlock (uncontended) | 25 ns | 50x |
| Main memory reference | 100 ns | 200x |
| System call (simple) | 100 ns | 200x |
| TLB miss (full 4-level walk from RAM) | 200 ns | 400x |
| Page fault (minor, zero-fill) | 1 µs | 2,000x |
| Context switch | 1–5 µs | 2,000–10,000x |
| Compress 1 KB with Snappy | 2 µs | 4,000x |
| NVMe SSD random 4K read | 20 µs | 40,000x |
| Page fault (major, NVMe swap-in) | 20 µs | 40,000x |
| fsync on NVMe | 30 µs | 60,000x |
| fork() on 100 MB RSS | 450 µs | 900,000x |
| Round-trip in same DC | 0.5 ms | 1,000,000x |
| HDD seek | 10 ms | 20,000,000x |
| Page fault (major, HDD swap-in) | 10 ms | 20,000,000x |
Rapid-Fire Q&A
| # | Question | Answer |
|---|---|---|
| 1 | Process vs thread? | Process owns resources (AS, FDs); thread is a unit of scheduling (registers, stack). |
| 2 | What is copy-on-write? | Page tables point at shared read-only pages; first write faults and copies that page only. |
| 3 | Why is fork fast? | It copies page tables, not physical pages. COW defers actual copies until writes. |
| 4 | What's in a PCB? | PID, state, register context, memory map, FD table, signals, creds, scheduler state. |
| 5 | What does a context switch cost? | ~1 µs direct + 10–100 µs indirect (TLB/cache cold). |
| 6 | Why 4-level paging? | Supports sparse 48-bit virtual space without allocating 2^48 / 4 KiB = 2^36 PTEs upfront. |
| 7 | What happens on TLB miss? | Hardware walks PML4 → PDPT → PD → PT, fetching up to 4 cache lines; loads PTE into TLB. |
| 8 | How does COW detect the write? | PTE marked read-only; write triggers #PF; kernel copies page, updates PTE. |
| 9 | What is thrashing? | Working set exceeds RAM; CPU spends >50% on page faults instead of useful work. |
| 10 | Difference between a mutex and a semaphore? | Mutex owned; unlock only by holder. Counting semaphore: multi-resource; any thread can V. |
| 11 | Why use a CV and not a busy-wait? | Busy-wait burns CPU + cache. CV parks the thread in the kernel until signalled. |
| 12 | What are the Coffman conditions? | Mutual exclusion, hold & wait, no preemption, circular wait. |
| 13 | Best prevention technique in practice? | Global lock ordering — breaks circular wait. |
| 14 | When does SJF starve? | When short jobs keep arriving faster than a long job finishes; aging fixes it. |
| 15 | What does CFS pick? | The runnable task with the smallest weighted vruntime (leftmost RB-tree node). |
| 16 | Why is RR quantum tuning hard? | Too small: all overhead. Too large: degenerates to FCFS. |
| 17 | What is the convoy effect? | Many short jobs queue behind one long job under FCFS. |
| 18 | What is priority inversion? | Low-pri holds a lock needed by high-pri; medium-pri preempts low-pri indefinitely. Fix: priority inheritance. |
| 19 | Hoare vs Mesa CVs? | Hoare: signal hands off mutex directly. Mesa: signalled thread re-contends; always loop on predicate. |
| 20 | What is a spurious wakeup? | A thread wakes from cond_wait without a corresponding signal. Re-test predicate. |
| 21 | What is livelock? | Threads aren't blocked but make no progress; e.g., both backing off and retrying in lockstep. |
| 22 | Inode vs directory entry? | Inode = metadata + block pointers. Dirent = (name, inode) in a directory file. |
| 23 | Hard link limits? | Same filesystem only; inode link count is bounded; hard links to directories are disallowed on Linux (historically superuser-only on some Unixes). |
| 24 | What does fsync do? | Flushes dirty pages of the fd to durable storage and issues a FLUSH cache command. |
| 25 | Why fsync a directory? | Renames/creates are metadata; the directory file's updated block must be durable too. |
| 26 | What does a journal guarantee? | Filesystem metadata consistency after crash; not application data unless data=journal. |
| 27 | LFS vs COW? | LFS: log-append everything + cleaner. COW: copy on modify and swing root atomically. |
| 28 | What is select's main limit? | O(n) with FD_SETSIZE (1024). poll is O(n) unbounded. epoll is O(ready). |
| 29 | Why non-blocking for epoll ET? | ET fires once per transition; you must read until EAGAIN, which requires non-blocking. |
| 30 | What is io_uring's advantage? | Zero-copy shared rings; zero syscalls per I/O in polled mode; batching. |
| 31 | What are futures good at? | Composing async ops with map/flatMap/all/race without callback pyramids. |
| 32 | What is back-pressure? | Flow control so fast producers don't overwhelm slow consumers (bounded queues, pausing streams). |
| 33 | Actors vs CSP? | Actor identity = mailbox owner. CSP identity = channel. Dual formulations. |
| 34 | What is a microtask? | Promise callback / process.nextTick; drained fully between macrotasks. |
| 35 | Why is x86 easier to program? | TSO: only store→load reorderable. ARM/POWER: any independent pair may reorder. |
| 36 | Release / acquire pair meaning? | Release on publisher's write; acquire on reader's read — synchronizes-with for HB. |
| 37 | What is a data race? | Concurrent accesses to the same memory, at least one write, no HB ordering between them. |
| 38 | What is volatile NOT? | NOT atomic (only suppresses compiler reorder/elide). Use std::atomic for thread safety. |
| 39 | ABA — one-line summary? | CAS sees old value, but it was pop-pop-push-back; your "next" pointer is now dangling. |
| 40 | RCU in one sentence? | Readers lock-free; updater publishes a new version, frees old after a grace period. |
| 41 | What is a futex? | User-space fast-path atomic + kernel FUTEX_WAIT/WAKE syscall for the contended slow path (sketch after this table). |
| 42 | Why do threads share the signal table? | POSIX: signals are per-process, delivered to any thread whose mask permits it. |
| 43 | Zombie vs orphan? | Zombie: terminated, not reaped. Orphan: parent died, init adopts and reaps. |
| 44 | What does EINTR mean? | A syscall was interrupted by a signal; the caller must retry (unless SA_RESTART auto-restarts it). |
| 45 | mmap vs read for files? | mmap: zero-copy, page-fault driven, lazy. read: explicit, always a syscall + copy. |
| 46 | Why is pthread_mutex_trylock useful? | Avoids blocking; lets you fall back (try a different lock order, back off, do other work). |
| 47 | Time vs priority in SCHED_FIFO? | Highest fixed priority runs until it yields; no time-slicing — risk of starving others. |
| 48 | Huge pages — cost? | Require contiguous physical frames; hard to allocate on a fragmented host; transparent HP (THP) mitigates. |
| 49 | copy_file_range vs sendfile? | copy_file_range: file→file zero-copy. sendfile: file→socket zero-copy (older API). |
| 50 | What does a page cache buy you? | In-memory cache of file data; turns the second read into memory-speed; absorbs writes until flush. |
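A sketch of Q41's futex idea — user-space CAS fast path, kernel sleep only under contention. Deliberately simplified (a production lock tracks a "contended" state so the uncontended unlock can skip the wake syscall):

```c
/* Minimal futex-backed lock (the idea behind pthread_mutex on Linux). */
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

static atomic_int lock_word;           /* 0 = free, 1 = held */

static long futex(atomic_int *addr, int op, int val) {
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

void lock(void) {
    int expected = 0;
    while (!atomic_compare_exchange_weak(&lock_word, &expected, 1)) {
        /* kernel re-checks word == 1 before sleeping, so no lost wakeups */
        futex(&lock_word, FUTEX_WAIT, 1);
        expected = 0;                  /* retry the CAS after waking */
    }
}

void unlock(void) {
    atomic_store(&lock_word, 0);
    futex(&lock_word, FUTEX_WAKE, 1);  /* wake one waiter, if any */
}
```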
POSIX / Linux Command Matrix
| Task | Syscall | Header | Notes |
|---|---|---|---|
| Create process | fork, vfork, clone | unistd.h / sched.h | clone with flags builds threads or containers |
| Replace image | execve family | unistd.h | Only returns on failure |
| Reap child | wait, waitpid, waitid | sys/wait.h | Clears zombie; returns exit status |
| Allocate memory | brk, mmap | unistd.h / sys/mman.h | Anonymous mapping = malloc backing for big allocs |
| Map file | mmap | sys/mman.h | Shared or private; zero-copy access |
| Create thread | pthread_create | pthread.h | Wrapper around clone(CLONE_THREAD|CLONE_VM|…) |
| Mutex | pthread_mutex_* | pthread.h | Futex-backed on Linux |
| Condition variable | pthread_cond_* | pthread.h | Always used with a mutex, loop on predicate |
| Semaphore | sem_init/wait/post | semaphore.h | Counting; POSIX named or unnamed |
| Event-driven I/O | epoll_create1/ctl/wait | sys/epoll.h | Linux only; use kqueue on BSD/macOS |
| Async I/O (modern) | io_uring_setup + user lib liburing | liburing.h | Submission + completion rings |
| Open / close fd | open, close, openat | fcntl.h / unistd.h | O_NONBLOCK, O_CLOEXEC, O_DIRECT |
| Durability | fsync, fdatasync, sync_file_range | unistd.h | fsync on dir for rename atomicity |
| Rename / atomic swap | rename, renameat2 RENAME_EXCHANGE | stdio.h / fcntl.h | Atomic on same filesystem |
| Signals | sigaction, kill, pthread_sigmask | signal.h | Prefer sigaction; async-signal-safe funcs only in handlers |
| Memory protection | mprotect, madvise | sys/mman.h | Change perms, hint access pattern |
| IPC: shared mem | shm_open + mmap | sys/mman.h | POSIX SHM; prefer over SysV shmget |
| IPC: unix sockets | socket(AF_UNIX) | sys/socket.h | SOCK_STREAM or SOCK_SEQPACKET; FD passing via SCM_RIGHTS |
| Scheduling tuning | sched_setscheduler, nice, setpriority | sched.h | SCHED_OTHER (CFS), SCHED_FIFO, SCHED_RR, SCHED_DEADLINE |
| CPU pinning | sched_setaffinity | sched.h | Reduces migration cost; aligns with NUMA topology (sketch below) |
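A sketch of the last row — pinning the calling thread to one core (Linux-only; pin_to_cpu is a hypothetical helper):

```c
/* Pin the calling thread to a single CPU: no migrations, and the working
   set stays warm in that core's caches (pair with NUMA-local allocation). */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (sched_setaffinity(0, sizeof set, &set) != 0) {  /* pid 0 = this thread */
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}
```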