Operating Systems & Concurrency

A FAANG interview consolidation covering processes, virtual memory, scheduling, synchronization, deadlock, file systems, I/O, concurrency patterns, and memory models.

10 Modules • 25+ Visualizations • POSIX / Linux depth • Interview-grade

Module 01

Processes & Threads

What the kernel tracks per process and per thread, how fork/exec work, and exactly what a context switch costs.

Process Anatomy & the PCB

A process is an instance of an executing program with its own virtual address space, open-file table, credentials, and at least one thread of control. The kernel tracks each process via a Process Control Block (PCB) — in Linux this is struct task_struct (~1.7 KB per task on x86_64, 2024 vintage).

Virtual address-space layout (x86_64 Linux, typical)

Top of address space (0xffff…): kernel space (upper half, shared). User half, top to bottom: stack near 0x7fff… (grows ↓) · unmapped gap · mmap region (shared libs, anonymous mappings) · heap (grows ↑ via brk) · BSS + data (rw-) · text / code (r-x) near 0x0804… · reserved NULL page at 0x0000 (unmapped). ASLR randomizes the stack/mmap/heap bases.
Figure 1.1 — Process virtual memory regions. Each region has independent permissions enforced by the page table.

PCB fields — what the kernel tracks per process

Field | Linux field name | Purpose
PID | pid_t pid | Unique process identifier in its PID namespace
Parent PID | struct task_struct *parent | Backpointer for SIGCHLD / wait()
State | long state | RUNNING / INTERRUPTIBLE / UNINTERRUPTIBLE / STOPPED / ZOMBIE
Scheduling | struct sched_entity se | vruntime, policy (CFS / FIFO / RR), nice value
Address space | struct mm_struct *mm | Page tables, VMA list, brk, stack start
File table | struct files_struct *files | FD → struct file mapping
Signals | struct sighand_struct *sighand | Signal handlers + pending masks
Credentials | struct cred *cred | UID, GID, capabilities
Namespaces | struct nsproxy *nsproxy | pid/net/mount/uts/ipc/cgroup namespaces
Saved context | struct thread_struct thread | Register file, FPU state, kernel stack pointer
Memory stats | rss, vsize | Resident / virtual size, for top/ps

Process states — the five-state machine

New →(admit)→ Ready →(dispatch)→ Running; Running →(preempt)→ Ready; Running →(I/O wait)→ Blocked →(I/O done)→ Ready; Running →(exit)→ Terminated.
Figure 1.2 — The classic five-state process lifecycle. Linux collapses Ready/Running into TASK_RUNNING and splits Blocked into INTERRUPTIBLE vs UNINTERRUPTIBLE.
Gotcha

A zombie is a terminated process whose PCB is still retained so the parent can read its exit code via wait(). If the parent never waits, zombies pile up — each consumes a PID slot. Orphaned children are reparented to init (PID 1, or a subreaper), which reaps them when they exit — this is why init always loops on wait().

Threads & the TCB

A thread is an independently schedulable flow of execution that shares its address space, file table, and signal handlers with sibling threads. On Linux, threads are just task_structs created with clone(CLONE_VM|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|…) — the PCB is the TCB. Shared resources are stored once in mm_struct / files_struct and pointed to by every thread.

Per-thread (TCB) vs per-process state

Kind of state | Where it lives | Per-thread?
Register file / PC / RSP | thread_struct | Yes — the essence of a thread
Kernel stack (8–16 KiB) | task_struct::stack | Yes
User stack | VMA in shared mm | Yes (distinct ranges)
TLS block (fs/gs base) | Per-thread arena | Yes
Pending signals | task_struct::pending | Yes — thread-directed signals
Signal handlers | sighand_struct | No — shared
Address space / page tables | mm_struct | No
Open file descriptors | files_struct | No
Heap / BSS / Text | VMAs in shared mm | No
PID (thread group leader) | tgid | No — TID is per-thread, PID is shared

User vs kernel threads

  • 1:1 (kernel threads) — every userspace thread maps to one kernel task_struct. Linux NPTL, Windows threads, modern macOS all do this. Blocking a thread only blocks that thread; multiple threads can run on multiple cores simultaneously.
  • N:1 (green threads) — many user threads multiplexed onto one kernel thread. A blocking syscall freezes every sibling. Historic early Java (Solaris green threads, GNU Pth). Simple, but no multicore.
  • M:N (hybrid) — a user-level scheduler maps M user threads onto N kernel threads. Go's goroutines, Rust's pre-1.0 green threads, Erlang's BEAM. Cheap context switches in userspace; requires a runtime to intercept blocking syscalls.
Interview Lens

When asked “process vs thread”, the sharp answer distinguishes the unit of resource ownership (process: address space, FDs) from the unit of scheduling (thread: registers + stack). Go's goroutines are not threads — they are a userspace M:N runtime where a go-scheduler multiplexes 100K goroutines onto GOMAXPROCS kernel threads.

fork / exec / wait

POSIX fork() creates a duplicate of the calling process. The child gets a new PID, a copy of the parent's memory (copy-on-write), a duplicate FD table, and picks up execution on the instruction after fork returns.

/* fork_demo.c — classic fork/exec/wait pipeline */
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void) {
    pid_t pid = fork();
    if (pid < 0) { perror("fork"); return 1; }

    if (pid == 0) {
        /* child: replace our image with /bin/ls */
        char *argv[] = { "ls", "-la", NULL };
        execvp(argv[0], argv);
        perror("execvp");      /* only reached if exec fails */
        _exit(127);
    }

    /* parent: wait for the child and inspect its exit status */
    int status;
    waitpid(pid, &status, 0);
    if (WIFEXITED(status))
        printf("child exited with %d\n", WEXITSTATUS(status));
    return 0;
}

What actually happens on fork

  1. Kernel copies task_struct — trivial bytes.
  2. Clones mm_struct by copying page tables, not physical pages. All writable pages are marked read-only. This is copy-on-write (COW).
  3. Duplicates the FD table (refcount++ on each file*).
  4. Copies signal dispositions (but not pending signals).
  5. Sets the child's return value to 0 and the parent's to the child PID.
  6. Schedules the child; the first page written triggers a fault, the kernel copies that page only.

fork cost breakdown (x86_64 Linux, 2024)

Resident size | Pages | fork() time (approx) | Why
10 MB | 2,500 | ~90 µs | Page-table copy dominated
100 MB | 25,000 | ~450 µs | Linear in page-table entries
1 GB | 250,000 | ~3.5 ms | Multi-level PT walk + copy
10 GB | 2.5M | ~35 ms | Why Redis uses fork for BGSAVE and sees pauses
Gotcha

fork after pthread_create is dangerous. Only the calling thread survives in the child; others vanish mid-lock. If a killed thread held malloc's arena lock, the child's first malloc will deadlock. Use posix_spawn or only call async-signal-safe functions between fork and exec.

Context Switch Cost

A context switch is the kernel replacing the executing thread's saved state with another thread's. The direct cost is saving and restoring registers. The indirect cost — cache and TLB pollution — often dwarfs it.

Anatomy of a context switch (from T1 to T2)

Direct cost breakdown

Component | Cycles (approx) | Time @ 3 GHz | Notes
Trap into kernel (syscall/IRQ) | ~100 | 33 ns | swapgs, stack switch, IDT dispatch
Save GPRs + FPU/AVX | ~200 | 66 ns | AVX-512 register state alone ≈ 2 KB/thread
scheduler_tick + pick_next_task | ~500–2,000 | 0.2–0.7 µs | CFS RB-tree O(log n)
switch_mm (different address space) | ~300 | 100 ns | CR3 write + possible TLB flush
Restore registers + iret | ~200 | 66 ns | Mirror of save
Direct total | ~1,500–5,000 | 1–2 µs |

Indirect cost: what you pay after the switch

  • TLB flush: cross-address-space switch on non-PCID CPUs flushes the TLB entirely; the first few thousand memory references miss. PCID (CR4.PCIDE) tags entries by ASID so the cost is near-zero.
  • Cache pollution: T2 evicts T1's working set from L1/L2. When T1 resumes, cold misses re-fetch from L3 or RAM. Measured indirect cost for unrelated workloads: 10–100 µs.
  • Branch-predictor state: Spectre mitigations (IBPB, STIBP) on cross-trust-domain switches can add ~5 µs on older CPUs.
Numbers

On modern hardware a same-process thread-to-thread switch costs ~1 µs direct. An epoll-based event loop on one core is cheap because it never switches. A thread-per-request server with 10K threads doing 50K switches/sec burns ~5% of one core just on kernel bookkeeping, plus cache pollution.

IPC Mechanisms

Processes do not share memory by default, so they communicate through kernel-mediated channels:

Mechanism | Direction | Latency / throughput | Best for
Anonymous pipe | Parent→child, 1-way | ~1 µs / ~2 GB/s | Shell command pipelines
Named pipe (FIFO) | Any↔any, 1-way | ~1 µs / ~2 GB/s | Rendezvous between unrelated processes
UNIX-domain socket | Bidirectional, datagram or stream | ~2 µs / 1–3 GB/s | Local RPC, FD passing
Shared memory (mmap / SysV) | Any↔any | ~memory speed / 20+ GB/s | Zero-copy bulk data (needs your own sync)
POSIX message queue | Any↔any | ~2 µs | Structured, bounded, priority messages
Signal | Any→any, 1 bit | <1 µs | Notify only — never carries data
eventfd / pipe2 | Self-wake | <1 µs | Cross-thread wake via epoll
Trade-off

Shared memory is the fastest IPC, but offers no synchronization. You still need a futex / semaphore / atomic in that shared region to coordinate writes. Pipes and sockets trade throughput for built-in flow control and blocking semantics.

Module 02

Memory & Virtual Memory

How an MMU translates virtual to physical, what a page-table entry actually contains, and the TLB path that keeps it fast.

Virtual Address Spaces

Each process sees a private, contiguous virtual address space. x86_64 uses 48-bit canonical addresses (256 TiB) split into a user half (0x0000_0000_0000_0000 .. 0x0000_7fff_ffff_ffff) and a kernel half (0xffff_8000_0000_0000 and up). Addresses between are non-canonical and fault on use. 5-level paging raises the limit to 57 bits (128 PiB), gated by CR4.LA57.

Why bother with virtual memory? Three wins:

  • Isolation — process A cannot read/write process B's pages because their page tables point to different physical frames.
  • Sparseness — a process claims 256 TiB of logical space and only pays for pages it touches (the kernel populates page-table entries lazily on fault).
  • Relocation & sharing — shared libraries and COW pages live in physical memory once but are mapped into many virtual spaces.
Interview Lens

Candidates love to say “virtual memory lets you use more than RAM.” True, but swap-based paging to disk is the least important win today — servers rarely swap. The isolation and zero-copy (mmap, shared libraries, fork COW) wins matter every millisecond.

Paging & Page-Table Entries

Memory is divided into fixed-size pages (4 KiB default; 2 MiB and 1 GiB “huge pages” available). The MMU translates a virtual page number (VPN) to a physical frame number (PFN) by walking the multi-level page table anchored at CR3.

x86_64 4-level paging (48-bit virtual address)

A 48-bit virtual address splits into PML4 index [47:39] · PDPT index [38:30] · PD index [29:21] · PT index [20:12] · page offset [11:0] — four 9-bit indices plus a 12-bit offset (4 KiB). Translation: CR3 → PML4 → PDPT → PD → PT entry → physical frame. Each level indexes index × 8 bytes into a 4 KiB table of 512 entries; a full walk is 4 memory reads.
Figure 2.1 — 48-bit virtual address split. Walking the hierarchy costs 4 memory references plus the final data access.

Page-Table Entry (4 KiB page, x86_64) — the bits that matter

Bit | Name | Meaning
0 | P (Present) | 1 = mapped to a frame; 0 = fault on access (paged out, lazy alloc, COW)
1 | R/W | 0 = read-only (fault on write — used for COW and .text)
2 | U/S | 0 = supervisor-only (kernel pages)
3 | PWT | Page write-through caching
4 | PCD | Page cache disable (MMIO / DMA regions)
5 | A (Accessed) | Set by hardware on any reference; cleared by the kernel for clock replacement
6 | D (Dirty) | Set by hardware on write; drives page-cache writeback
7 | PS (Page Size) | 1 at the PD level = 2 MiB huge page; at the PDPT level = 1 GiB
8 | G (Global) | Survives CR3 reload (kernel mappings)
12–51 | PFN | Physical frame number (40 bits × 4 KiB = 4 PiB addressable)
63 | XD (NX) | 1 = no-execute; kills buffer-overflow shellcode on the stack
Numbers

A 4-level walk reads 4 page-table entries. Each read is a cache line (64 B). Without a TLB a user-space load would cost ~5 memory references. The TLB collapses that to 1 when the PTE is cached.

Virtual → Physical Translation (with TLB miss walk)

The hot path checks the TLB; on miss, the hardware page-table walker (PMH on Intel) walks all four levels. The visualization below steps through a load of address 0x7FFE_1234_5ABC.

Address translation: TLB hit, TLB miss with 4-level walk, and page fault

Cost of a miss

  • Best case — all four PTEs in L1d: ~4 cycles × 4 + transit = ~20 ns.
  • Typical case — PTEs in L2/L3: ~100–200 ns for the walk.
  • Worst case — PTEs in RAM or not present: a full page fault is tens of microseconds; swapping from disk is milliseconds.

TLB & Miss Handling

The Translation Lookaside Buffer is a small, fully or set-associative cache of recent VPN → PFN mappings. Modern CPUs have a split L1 TLB (dTLB + iTLB), unified L2 TLB, and separate large-page TLBs:

TLB | Intel Skylake | AMD Zen 4 | Latency
L1 dTLB (4 KiB) | 64 entries | 72 | 1 cycle
L1 iTLB (4 KiB) | 128 | 64 | 1 cycle
L1 dTLB (2 MiB) | 32 | 64 | 1 cycle
L1 dTLB (1 GiB) | 4 | 24 | 1 cycle
L2 STLB (unified) | 1,536 | 3,072 | 7 cycles

Who flushes the TLB?

  • Writing CR3 (cross-address-space switch) flushes all non-Global entries — unless PCID (CR4.PCIDE=1) tags entries by ASID.
  • INVLPG addr flushes one VPN — issued after a PTE edit.
  • INVPCID invalidates by PCID (kernel path).
  • Spectre-style mitigations (IBPB) may invalidate the branch predictor but not the TLB.
  • A multi-core TLB flush is a “shootdown” — one core IPIs the others to run their INVLPG; this is why large unmap operations on big multicore machines are expensive.
Gotcha

A random-access working set larger than TLB-reach murders throughput even if it fits in L1d cache. Reach = entries × page size. 64 × 4 KiB = 256 KiB reach. Going to 2 MiB huge pages turns 64 entries into 128 MiB of reach — often a 2–3× speedup on graph / OLAP workloads.

Segmentation & mmap

Historic segmentation (Intel 80286/386) sliced the address space into length-limited segments with distinct CS/DS/SS/ES bases. x86_64 reduced it to a relic: all segments have base=0, limit=infinity. Only fs and gs bases survive as pointers to per-thread / per-CPU data (TLS).

Instead, Linux exposes virtual memory areas (VMAs) through mmap. Each VMA has a base, length, protection (r/w/x), and backing object (file, dev/zero, or anonymous).

/* memory-map a 64 MiB file, read lazily by page fault */
#include <sys/mman.h>
#include <fcntl.h>

void map_file(void) {
    int fd = open("data.bin", O_RDONLY);
    char *base = mmap(NULL, 64*1024*1024, PROT_READ,
                      MAP_PRIVATE, fd, 0);
    /* touching base[0] triggers a page fault; the kernel populates the
       page from the page cache (or reads it in) — zero-copy for reads */
    madvise(base, 64*1024*1024, MADV_RANDOM);   /* access-pattern hint */
}

Anonymous vs file-backed

  • Anonymous (MAP_ANONYMOUS): backed by swap, zero-filled on first touch. Used by malloc's large allocations and per-thread stacks.
  • File-backed: pages are populated from the file on fault; dirty pages flushed back on writeback or msync. Shared mappings (MAP_SHARED) propagate writes to the file and to other mappers.

Page Fault Handling

A #PF is raised when the MMU sees P=0, a permission violation, or a non-canonical address. The kernel's do_page_fault classifies the fault by looking up the faulting address in the VMA list:

Classification | What happened | Kernel action
Minor (lazy-alloc) | VMA exists; no physical frame yet | Allocate a zeroed frame; install PTE; return — a few µs
Minor (file cache) | File page already in page cache | Install PTE pointing at the cached frame — a few µs
Minor (COW) | Write to read-only COW page (post-fork) | Copy page to a new frame; mark PTE writable
Major | Page must be read from disk/swap | I/O; context switches; ~ms on HDD, ~10–100 µs on NVMe
Invalid (SIGSEGV) | Address not in any VMA, or write to read-only non-COW page | Deliver SIGSEGV to the thread
Interview Lens

“What is a page fault?” is a trick question if you only answer “page not in RAM.” The interviewer wants: most page faults are minor — the page is either already in the buffer cache or needs to be zero-allocated. Only a tiny fraction are major (swap-in or file read), and those dominate the cost.

Module 03

CPU Scheduling

FCFS, SJF/SRTF, Round Robin, MLFQ, CFS — each running the same five-process workload so you can compare Gantt charts directly.

Goals & Metrics

A scheduler picks which ready thread gets the CPU next. It is pulled in four directions that conflict:

  • Throughput — jobs completed per unit time. Maximized by minimizing context switches.
  • Turnaround time — finish − arrival. Shorter = better responsiveness for batch jobs.
  • Waiting time — time spent in the ready queue. A proxy for perceived latency.
  • Response time — first time the job runs after arrival. Critical for interactive workloads.
  • Fairness — every thread of equal priority makes comparable progress. Violated by starvation.

Two orthogonal knobs:

  • Preemptive vs non-preemptive — can the scheduler forcibly take the CPU back?
  • Priority vs time-slice — who gets to run, or for how long?
Trade-off

Throughput and response time are enemies. Long quanta amortize switch cost (high throughput) but make interactive tasks wait. Short quanta keep latency low but waste cycles on context switches. Every modern scheduler blends both.

Benchmark Workload

To compare schedulers apples-to-apples, we run the same five processes through each algorithm:

Process | Arrival time | CPU burst | Priority (low = high)
P1 | 0 | 7 | 3
P2 | 2 | 4 | 1
P3 | 4 | 1 | 4
P4 | 5 | 4 | 2
P5 | 6 | 6 | 5

Total work = 22 units. Below we derive waiting time (how long each process sat in the ready queue) and turnaround time (completion − arrival) for each scheduler.

FCFS (First-Come First-Served)

Non-preemptive. Runs jobs in arrival order. Simple to implement, easy to starve: a 10-second burst arriving at time 0 delays every subsequent 1-ms job behind it — the convoy effect.

FCFS Gantt chart (same 5 jobs)

Derivation of averages

Process | Start | Finish | Waiting (start − arrival) | Turnaround (finish − arrival)
P1 | 0 | 7 | 0 | 7
P2 | 7 | 11 | 5 | 9
P3 | 11 | 12 | 7 | 8
P4 | 12 | 16 | 7 | 11
P5 | 16 | 22 | 10 | 16
avg | | | 5.8 | 10.2
/* FCFS dispatcher (kernel-style pseudocode in C) */
struct task *fcfs_pick_next(struct runqueue *rq) {
    if (list_empty(&rq->tasks)) return NULL;
    return list_first_entry(&rq->tasks, struct task, run_list);
}
void fcfs_enqueue(struct runqueue *rq, struct task *t) {
    list_add_tail(&t->run_list, &rq->tasks);   /* append in arrival order */
}
Gotcha

FCFS is terrible for interactive workloads but excellent for throughput-only batch: minimal bookkeeping, no switches while a job runs. Real systems use FCFS only as a deep-background idle class.

SJF / SRTF (Shortest Job First)

Non-preemptive SJF picks the ready job with the smallest CPU burst. Preemptive SJF = SRTF (Shortest Remaining Time First): if a shorter job arrives, kick the current one out. Provably optimal for mean waiting time — but requires knowing burst length, which production systems must estimate via exponential averaging.

SRTF Gantt chart (preemptive SJF)

Derivation of averages

Process | Finish | Waiting | Turnaround
P1 | 16 | 9 | 16
P2 | 7 | 1 | 5
P3 | 5 | 0 | 1
P4 | 11 | 2 | 6
P5 | 22 | 10 | 16
avg | | 4.4 | 8.8
Gotcha: starvation

SJF/SRTF can starve long jobs indefinitely if short jobs keep arriving. Production schedulers (MLFQ, CFS) add aging: a long-waiting job's effective priority rises until it has to run.

Round Robin (RR)

Preemptive. Every ready process gets a fixed time quantum q in FIFO order. When q expires the job goes to the tail of the queue. q too small → high switch overhead; too large → degenerates into FCFS.

Below we simulate RR with q = 2. Arrivals enter the queue tail as they appear.

Round Robin Gantt chart (quantum q=2)

Derivation (q=2)

Process | Finish | Waiting | Turnaround
P1 | 20 | 13 | 20
P2 | 13 | 7 | 11
P3 | 5 | 0 | 1
P4 | 17 | 8 | 12
P5 | 22 | 10 | 16
avg | | 7.6 | 12.0
/* Round-robin scheduler skeleton */
struct task *rr_pick_next(struct runqueue *rq) {
    return list_first_entry(&rq->tasks, struct task, run_list);
}
void rr_tick(struct task *t, struct runqueue *rq) {
    if (--t->quantum == 0) {
        t->quantum = RR_QUANTUM;              /* e.g. 4 ms */
        list_move_tail(&t->run_list, &rq->tasks);
        schedule();                           /* forces resched */
    }
}
Numbers

Classic rule of thumb: choose q so that ~80% of CPU bursts are shorter than q. Linux's CFS dynamically sizes its equivalent via sched_latency_ns (default 24 ms) divided across active tasks, bounded below by sched_min_granularity_ns (3 ms).

Multi-Level Feedback Queue (MLFQ)

Several priority queues (Q0 highest .. Qn lowest). A job starts in Q0. If it uses its full quantum it drops to Q(i+1). If it voluntarily yields (I/O) before the quantum ends, it stays — interactive jobs float near the top, CPU-bound ones sink. Periodic priority boost at interval S raises everyone back to Q0 to prevent starvation.

Rules (Arpaci-Dusseau):

  1. If Priority(A) > Priority(B): A runs.
  2. If Priority(A) == Priority(B): round-robin.
  3. New job enters at the highest priority.
  4. If a job uses its entire time allotment at a level, it drops.
  5. After time S, move all jobs to the topmost queue (boost).
MLFQ Gantt (3 queues, q0=2, q1=4, q2=8)
Gotcha: the I/O trick

Without priority boosting, a CPU-hog can game MLFQ by issuing a fake I/O right before its quantum expires — it stays at the top forever. Fix: account total time at a level (allotment), not per-quantum.

CFS — Linux Completely Fair Scheduler

CFS picks the runnable task with the smallest vruntime (virtual runtime). The tasks sit in a red-black tree indexed by vruntime; insertion is O(log n), and the leftmost (minimum-vruntime) node is cached so picking it is O(1). “Fair” = every task makes the same vruntime progress over sched_latency.

Nice values map to weights: nice 0 → 1024, each nice step multiplies by ~1.25. Effective time-slice = sched_latency × (weight / total_weight), so a higher-weighted task runs longer slices and accumulates vruntime slower:

/* Simplified CFS vruntime update (kernel/sched/fair.c) */
static void update_curr(struct cfs_rq *cfs_rq) {
    struct sched_entity *curr = cfs_rq->curr;
    u64 now = rq_clock_task(rq_of(cfs_rq));
    u64 delta_exec = now - curr->exec_start;
    if ((s64)delta_exec <= 0) return;
    curr->exec_start = now;
    curr->vruntime  += calc_delta_fair(delta_exec, curr);  /* weighted */
    update_min_vruntime(cfs_rq);
}
CFS Gantt chart (simplified, equal nice values)

CFS knobs

Tunable | Default | Effect
sched_latency_ns | 24,000,000 | Target period in which every runnable task runs once
sched_min_granularity_ns | 3,000,000 | Floor on the per-task slice (caps the number of tasks sharing one latency period)
sched_wakeup_granularity_ns | 4,000,000 | vruntime gap required before a wakee preempts current
sched_migration_cost_ns | 500,000 | Penalty for cross-CPU task migration; prevents thrashing

Linux 6.6+ replaces CFS with EEVDF (Earliest Eligible Virtual Deadline First) — still a weighted-fair queue, but with a proper deadline-based ordering that reduces latency outliers.

Interview Lens

You don't need to memorize every line of CFS. The idea to convey: vruntime is a normalized progress clock; CFS always picks the task that looks most “behind.” Nice values are weights, not priorities; the scheduler still runs every task, just at different rates.

Head-to-Head Summary

Algorithm | Preempt | Pick | Avg wait | Avg TT | Starvation | Best for
FCFS | No | FIFO | 5.8 | 10.2 | No | Throughput batch
SJF/SRTF | Yes (SRTF) | min burst | 4.4 | 8.8 | Yes | Known workloads
Round Robin (q=2) | Yes | FIFO cycle | 7.6 | 12.0 | No | Time-sharing TTY
MLFQ | Yes | highest-priority queue | 5.4 | 9.8 | With boost | Mixed interactive+batch
CFS | Yes | min vruntime | 6.8 | 11.2 | No | Linux desktop+server
Trade-off

SJF has the best average waiting time by construction. Real schedulers (MLFQ, CFS) sacrifice a point or two of wait for fairness and starvation-freedom — losing 2% of average waiting for a 100% guarantee of eventual progress is a bargain.

Module 04

Synchronization

Mutexes, semaphores, monitors, and condition variables — with the bounded-buffer producer-consumer animated step-by-step.

Race Conditions & Atomicity

A race condition arises when two threads access the same memory location concurrently, at least one is a write, and there is no happens-before ordering between them. The canonical example is an unprotected counter:

/* counter.c — the classic race */
int counter = 0;
void *work(void *arg) {
    for (int i = 0; i < 1000000; ++i)
        counter++;           /* load, inc, store — 3 instructions */
    return NULL;
}
/* 4 threads: expected 4,000,000; observed 3,127,841 (lost updates). */

Each counter++ compiles to mov, inc, mov. Interleave two threads and a read can miss an update from another core. The fix is either a mutex or an atomic (hardware lock xadd).

Critical-section requirements (Dijkstra / Lamport)

  • Mutual exclusion — at most one thread inside.
  • Progress — no thread outside the section blocks entry.
  • Bounded waiting — once you request entry you wait at most k entries by others.
Gotcha: word-tearing

On most architectures a long write is atomic if aligned, but a mis-aligned or > 8-byte store can tear. Use _Atomic/std::atomic; do not assume volatile buys you atomicity (it only suppresses compiler reordering of that variable).

Mutexes & Spinlocks

A mutex is a binary, owned lock: only the holder can unlock. Under contention the loser either spins (CPU-busy) or sleeps (kernel-parked).

Lock kind | Uncontended cost | Contended cost | When to use
Spinlock | ~20 ns (atomic CAS) | Wastes CPU cycles waiting | Short critical sections, kernel IRQ paths
Futex-backed mutex (Linux) | ~30 ns CAS, no syscall | ~1 µs wake via FUTEX_WAKE | General user-space lock
Adaptive (pthread) | ~30 ns | Spins briefly, then parks | Mix of short and long holds
Reader-writer (RW) | ~50 ns read | Writer-starvation risk | Read-heavy workloads
/* Test-and-set spinlock (toy, no back-off) */
#include <stdatomic.h>

typedef struct { _Atomic int flag; } spin_t;

static inline void spin_lock(spin_t *s) {
    while (atomic_exchange_explicit(&s->flag, 1,
                                   memory_order_acquire)) {
        while (atomic_load_explicit(&s->flag,
                                   memory_order_relaxed))
            __builtin_ia32_pause(); /* spin-hint */
    }
}
static inline void spin_unlock(spin_t *s) {
    atomic_store_explicit(&s->flag, 0, memory_order_release);
}

Why not just spin forever?

  • Priority inversion — a high-priority spinner burns CPU while the low-priority holder is waiting for CPU; neither makes progress.
  • Cache-line ping-pong — every spinner issues cache-line-exclusive requests; the lock line bounces between cores.
  • Power — spinning defeats C-states; phones and laptops hate it.

Linux userspace uses futexes: the fast path is an atomic CAS entirely in userspace; only on contention does the loser call FUTEX_WAIT into the kernel.

Numbers

An uncontended pthread_mutex_lock + unlock is ~30 ns (one atomic + one atomic). A contended lock with a single waiter costs ~2 µs (syscall round-trip). A ticket lock under heavy contention can cost 10+ µs per cycle due to cache-line bouncing.

Semaphores — Counting & Binary

A semaphore holds an integer. wait/P(s) decrements (blocking if it would go negative); signal/V(s) increments (waking one waiter). Dijkstra's original design is the seed of most sync primitives.

  • Binary semaphore — values in {0,1}. Looks like a mutex but anyone can signal it (no ownership). Good for signalling.
  • Counting semaphore — models a pool of N resources (connection pool, bounded buffer slots).
/* POSIX sem_t — counting semaphore bounding a connection pool */
#include <semaphore.h>

static sem_t slots;

void pool_init(void) {           /* call once, before spawning handlers */
    sem_init(&slots, 0, 16);     /* 16 connections available */
}

void *handler(void *arg) {
    sem_wait(&slots);            /* P: block if 0 */
    conn_t c = pool_take();
    serve(c);
    pool_return(c);
    sem_post(&slots);            /* V: wake one waiter */
    return NULL;
}
Gotcha: missed wakeups

Semaphores solve the “signal-before-wait” race that bare condition variables have — the count is preserved. But using a binary semaphore as a mutex loses ownership tracking: any thread can unlock, error detection (unlock-by-non-owner, recursive unlock) is lost.

Monitors & Condition Variables

A monitor is a higher-level construct: an object whose methods are mutually exclusive (implicit mutex) and whose threads rendezvous via condition variables. Java's synchronized + wait/notify is a direct descendant. POSIX models it as a pthread_mutex_t paired with a pthread_cond_t.

CV operations:

  • wait(mutex) — atomically release the mutex and block. On wake, reacquire the mutex before returning.
  • signal() / notify() — wake one waiter (no-op if empty).
  • broadcast() / notifyAll() — wake all waiters (they contend for the mutex).

Wait semantics: Hoare vs Mesa

 | Hoare | Mesa (Java, POSIX)
Signal hands off mutex directly to signalled thread | Yes — signaller is suspended | No — signaller keeps running
Must re-check predicate on wake? | No | Yes — always loop on while
Implementation burden | Heavy (complex scheduler) | Light
Gotcha: always use a while loop

On Mesa-semantics systems a spurious wakeup is legal, and even without that, another thread may have grabbed the resource between the signal and your wake-up. Always re-test the predicate inside a loop: while (!ready) pthread_cond_wait(&cv, &m);

Bounded-Buffer Producer-Consumer (2P + 2C, size 4)

Two producers, two consumers, a circular buffer of 4 slots. Three semaphores coordinate: empty counts free slots (init 4), full counts produced items (init 0), and mutex protects buffer indices (binary, init 1).

/* bounded_buffer.c — classic Dijkstra bounded-buffer pattern */
#define N 4
item_t buf[N];
int   in = 0, out = 0;
sem_t empty, full, mutex;

void buffer_init(void) {          /* call once, before starting threads */
    sem_init(&empty, 0, N);       /* N free slots */
    sem_init(&full,  0, 0);       /* 0 produced items */
    sem_init(&mutex, 0, 1);       /* binary: protects in/out */
}

void *producer(void *_) {
    for (;;) {
        item_t it = produce_next();
        sem_wait(&empty);            /* at least one slot */
        sem_wait(&mutex);            /* critical section */
        buf[in] = it; in = (in + 1) % N;
        sem_post(&mutex);
        sem_post(&full);             /* signal consumers */
    }
}
void *consumer(void *_) {
    for (;;) {
        sem_wait(&full);
        sem_wait(&mutex);
        item_t it = buf[out]; out = (out + 1) % N;
        sem_post(&mutex);
        sem_post(&empty);
        consume(it);
    }
}
Bounded-buffer prod/cons: 2 producers, 2 consumers, 4 slots

Order of acquiring semaphores matters

Both producer and consumer acquire their counting semaphore before the mutex. Reverse it and you deadlock: producer holds mutex while blocked on empty, consumer cannot acquire mutex to release a slot.

Interview Lens

The key question isn't “can you code prod-cons?” but “why are there three semaphores?” Two counting semaphores give you flow control in both directions (full and empty). The mutex protects the shared indices. Collapsing any of the three creates either a race or a deadlock.

Module 05

Deadlock

The four Coffman conditions, detection via resource-allocation graphs, prevention and avoidance, and the Banker's algorithm worked through a real allocation table.

The Four Coffman Conditions (1971)

All four must hold simultaneously for deadlock. Break any one and deadlock is impossible.

# | Condition | Meaning | How you'd break it
1 | Mutual exclusion | Resources cannot be shared | Make the resource virtual/shared (e.g., immutable data, optimistic reads)
2 | Hold & wait | A process holds one resource and waits for another | Require all-at-once acquisition, or release before re-requesting
3 | No preemption | Resources are only released voluntarily | Steal resources (e.g., DB abort-on-conflict with MVCC)
4 | Circular wait | Cycle in the wait-for graph | Impose a global lock order; always acquire in that order
Interview Lens

The cheap prevention answer is always: sort lock IDs and acquire in order. This breaks circular wait. For dining philosophers, number the chopsticks 1–5 and have every philosopher pick up the lower-numbered first. The asymmetry breaks the cycle.

Canonical example — the opposite-direction lock order

/* Thread A */          /* Thread B */
pthread_mutex_lock(&m1);      pthread_mutex_lock(&m2);
pthread_mutex_lock(&m2);      pthread_mutex_lock(&m1);  /* wait — deadlock */
/* work */                    /* work */
pthread_mutex_unlock(&m2);    pthread_mutex_unlock(&m1);
pthread_mutex_unlock(&m1);    pthread_mutex_unlock(&m2);
/* Fix: both sides acquire in m1-then-m2 order. */

Resource Allocation Graph (RAG)

Model processes as circles, resources as squares. An edge from process → resource means “requests”; resource → process means “currently allocated to.” Deadlock in a single-instance system iff there is a cycle. With multiple instances, a cycle is necessary but not sufficient — check actual availability.

Resource allocation graph with a cycle P1 → R2 → P2 → R1 → P1

Detection via Cycle Finding

With a RAG in memory, detect deadlock by running DFS and looking for a back-edge in the directed graph. Kernel deadlock detectors (lockdep in Linux) keep an in-memory chain of lock-acquire-order pairs and flag any new edge that would complete a cycle, before the deadlock actually happens.

# Simple DFS cycle-detection on a wait-for graph (Python)
def has_cycle(adj):
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {v: WHITE for v in adj}
    def dfs(u):
        color[u] = GRAY
        for v in adj[u]:
            if color[v] == GRAY:       # back-edge = cycle
                return True
            if color[v] == WHITE and dfs(v):
                return True
        color[u] = BLACK
        return False
    return any(dfs(v) for v in adj if color[v] == WHITE)

Recovery strategies once deadlock is detected

  • Process termination — kill all deadlocked processes, or pick a victim by cost (total CPU spent, priority, resources held).
  • Resource preemption — force-release a resource; requires rollback to a safe checkpoint (common in DBs via transaction abort).
  • Operator intervention — log, alert, manual cleanup. Used when correctness trumps availability.

Prevention vs Avoidance

Aspect | Prevention | Avoidance | Detection
When it acts | Statically: break one Coffman condition | Dynamically: refuse grants that could deadlock | After the fact: find cycles, recover
Needs runtime info | No | Max-resource declarations per process | Just the wait-for graph
Overhead | Low (design-time) | High (every grant runs the safety check) | Medium (periodic scan)
Downside | Conservative: may forbid safe patterns | Must know max claims up front | Deadlock happens; recovery cost non-zero
Example | Global lock order | Banker's algorithm | Linux lockdep, MySQL InnoDB wait-for graph
Trade-off

Most production systems use a mix: static prevention where cheap (lock ordering) + post-hoc detection for dynamic allocators (DB row locks). Full avoidance (Banker's) is rare outside of embedded safety-critical code because nobody knows their maxima.

Banker's Algorithm — Avoidance by Safety Check

Dijkstra 1965. Given each process's maximum claim, grant a request only if the post-grant state is safe — i.e. there exists an ordering of processes that can all complete with current available resources.

Inputs (for n processes, m resource types):

  • Max[n][m] — max units process i will ever need of resource j.
  • Alloc[n][m] — units currently held.
  • Need[i][j] = Max[i][j] - Alloc[i][j] — remaining demand.
  • Available[m] — pool.

Worked example: 5 processes, 3 resource types (A, B, C)

Total resources: A=10, B=5, C=7. Initial state:

Process | Alloc (A,B,C) | Max (A,B,C) | Need (A,B,C)
P0 | 0,1,0 | 7,5,3 | 7,4,3
P1 | 2,0,0 | 3,2,2 | 1,2,2
P2 | 3,0,2 | 9,0,2 | 6,0,0
P3 | 2,1,1 | 2,2,2 | 0,1,1
P4 | 0,0,2 | 4,3,3 | 4,3,1
Sum alloc | 7,2,5 | |

Available = Total - Sum alloc = (3, 3, 2).

Banker's safety algorithm walk (Available starts at 3,3,2)

Reference implementation

# bankers.py — classic safety algorithm
def is_safe(avail, max_need, alloc):
    n, m = len(alloc), len(avail)
    need = [[max_need[i][j] - alloc[i][j] for j in range(m)]
            for i in range(n)]
    work = avail[:]
    finish = [False] * n
    order = []
    while True:
        progressed = False
        for i in range(n):
            if finish[i]: continue
            if all(need[i][j] <= work[j] for j in range(m)):
                for j in range(m): work[j] += alloc[i][j]
                finish[i] = True
                order.append(i)
                progressed = True
        if not progressed: break
    return (all(finish), order)
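The safety check only classifies a state; the full avoidance step tentatively applies a request and keeps it only if the result is safe. A sketch of that request step, exercised on the worked example above (`is_safe` is restated inside for self-containment):

```python
def is_safe(avail, max_need, alloc):
    # Safety check, restated from the listing above.
    n, m = len(alloc), len(avail)
    need = [[max_need[i][j] - alloc[i][j] for j in range(m)] for i in range(n)]
    work, finish, order = avail[:], [False] * n, []
    progressed = True
    while progressed:
        progressed = False
        for i in range(n):
            if not finish[i] and all(need[i][j] <= work[j] for j in range(m)):
                for j in range(m):
                    work[j] += alloc[i][j]   # i finishes, returns its units
                finish[i], progressed = True, True
                order.append(i)
    return all(finish), order

def request_resources(pid, req, avail, max_need, alloc):
    """Grant req to pid only if the post-grant state is safe.
    Mutates avail/alloc on success; leaves them untouched otherwise."""
    m = len(avail)
    need = [max_need[pid][j] - alloc[pid][j] for j in range(m)]
    if any(req[j] > need[j] for j in range(m)):
        raise ValueError("process exceeded its declared maximum")
    if any(req[j] > avail[j] for j in range(m)):
        return False                 # not enough free units: caller must wait
    for j in range(m):               # tentatively grant...
        avail[j] -= req[j]
        alloc[pid][j] += req[j]
    safe, _ = is_safe(avail, max_need, alloc)
    if not safe:                     # ...and roll back if the state is unsafe
        for j in range(m):
            avail[j] += req[j]
            alloc[pid][j] -= req[j]
    return safe

# Worked example's state; P1 requests (1,0,2) — classically granted.
avail = [3, 3, 2]
max_need = [[7,5,3], [3,2,2], [9,0,2], [2,2,2], [4,3,3]]
alloc    = [[0,1,0], [2,0,0], [3,0,2], [2,1,1], [0,0,2]]
granted = request_resources(1, [1, 0, 2], avail, max_need, alloc)
```

After the grant, Available drops to (2,3,0) and a safe completion order still exists (P1, P3, P4, P0, P2), so the request is approved.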
Gotcha

Banker's algorithm is O(m·n²) per request. It also requires that every process declare its maximum claim up front — rarely true in real software. Consequence: Banker's is mostly a textbook topic; real kernels use prevention + detection, not avoidance.

Module 06

File Systems

How disks are organized into inodes and directories, what a journal actually logs, and why LFS / COW file systems look so different.

Inodes & On-Disk Layout

UNIX file systems separate metadata (an inode) from the data blocks. An inode carries: mode, uid/gid, timestamps, size, refcount, and pointers to data blocks. Directories are just special files whose data is a table of name → inode.

Inode diagram: mode (-rw-r--r--), uid/gid, size, atime/mtime/ctime, refcount, then block pointers — direct[0..11], single indirect (a block of 1024 pointers), double indirect (pointer → pointer block), triple indirect. With 4 KiB blocks and 4-byte pointers: direct 48 KiB, indirect 4 MiB, double 4 GiB, triple 4 TiB.
Figure 6.1 — ext2/3-style inode. Small files use only direct pointers; large files fan out through indirect trees.
Numbers

On ext4 (4 KiB blocks, 4-byte pointers): 12 direct = 48 KiB, single indirect = 4 MiB, double = 4 GiB, triple = 4 TiB. ext4 also supports extents (base+length) so a 1 GiB contiguous file is 1 extent, not 262,144 pointers.

Directories, Hard Links & Path Resolution

A directory is a file whose data blocks contain (name, inode_number) records. ls reads that; stat then resolves each inode. The filesystem tree is a directed graph with hard links (more than one name per inode) and soft links (a file whose content is the path to traverse).

Hard vs soft links

Property | Hard link | Soft / symbolic link
Refers to | Same inode | Path string
Cross filesystem? | No | Yes
Survives target rename? | Yes (unaffected) | Breaks (dangling)
Refcount bump? | Yes (inode nlink++) | No
Can target a directory? | No (only root can; forbidden in practice) | Yes
# Observe hard vs soft links
$ echo hello > a
$ ln   a hard      # hard link — same inode, nlink = 2
$ ln -s a soft     # symlink — separate inode
$ stat -c '%i %h %n' a hard soft
12345 2 a
12345 2 hard
12346 1 soft

Journaling — Crash Consistency without fsck

Without a journal, a crash during a multi-block write can leave the filesystem inconsistent (e.g., inode says 1 MiB but only 512 KiB of data blocks allocated). The journal is a circular log of transactions: group metadata edits, write them to the log first, then the main filesystem.

Phases of a journalled write (ext4 in data=ordered mode)

  1. Begin txn — gather all dirty inodes, bitmaps, and directory blocks for this syscall.
  2. Data write — write user data blocks to their final location. Wait for them to be durable.
  3. Journal log — write the metadata edits (inodes, bitmaps) to the journal with a descriptor block. Wait.
  4. Commit record — write a single commit block with checksum. This is the atomic point: either the commit is durable (txn replayable) or it isn't (txn will be discarded).
  5. Checkpoint — later, write the metadata to the main filesystem and mark the journal space reusable.
ext4 journal timeline (data=ordered)
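The commit-record discipline in steps 3–4 can be mimicked with a toy write-ahead log (Python sketch; the record layout and field names are invented, checksum via `zlib.crc32`): recovery replays only transactions whose commit record is present and checksums clean, so a torn tail is simply discarded.

```python
import json
import zlib

def log_txn(journal, edits):
    """Append a transaction: descriptor + edits first, then a commit
    record carrying a checksum. The commit write is the atomic point."""
    body = json.dumps(edits).encode()
    journal.append(("DESC", body))
    journal.append(("COMMIT", zlib.crc32(body)))

def replay(journal, fs):
    """Recovery: apply only transactions with a valid commit record;
    stop at the first torn or uncommitted tail."""
    i = 0
    while i < len(journal):
        tag, body = journal[i]
        if tag != "DESC" or i + 1 >= len(journal):
            break                          # torn transaction: discard
        ctag, csum = journal[i + 1]
        if ctag != "COMMIT" or csum != zlib.crc32(body):
            break                          # no valid commit: discard
        fs.update(json.loads(body))        # checkpoint into the "main FS"
        i += 2

fs = {}
j = []
log_txn(j, {"inode7": "size=4096"})
log_txn(j, {"inode9": "nlink=2"})
j.pop()          # simulate a crash before the second commit record lands
replay(j, fs)    # only the first, committed transaction is applied
```

The second transaction's edits are in the log but have no commit record, so recovery ignores them — exactly the "either replayable or discarded" guarantee of step 4.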

Journal modes

Mode | What's logged | Data consistency | Perf
journal | Both metadata and data | Strongest — data never appears from an uncommitted txn | Lowest (2x writes)
ordered (default) | Metadata only, but data is forced before commit | Post-crash: no stale bytes | Good balance
writeback | Metadata only; data can land any time | Post-crash: files may contain garbage | Fastest
Gotcha

A journal guarantees consistency of the filesystem, not of your application. A database on ext4 still needs fsync + barriers + a WAL of its own, because the page cache may hold your latest writes even after the syscall returns.

LFS vs COW

Traditional (ext4, XFS) update data blocks in-place and use a journal for metadata. Two alternative designs eliminate in-place updates entirely:

Log-Structured FS (LFS / F2FS / NILFS)

  • All writes (data + metadata) are appended to a log.
  • An inode map (imap) tracks “where is the latest version of inode i?”
  • Reads follow imap → inode → data blocks (same as regular FS).
  • A segment cleaner compacts partially-dead segments to reclaim space.
  • Ideal for flash: write-heavy, sequential, no RMW amplification.

Copy-on-Write (ZFS, btrfs, APFS)

  • Never overwrite a live block. Every update writes to a new location and replaces pointers up the tree.
  • A superblock atomically swings to the new root — all edits atomic.
  • Cheap snapshots: retain an old root, its subtree is still reachable.
  • End-to-end checksums on every block (optional) — catches silent bit rot.
  • Write amplification: editing a byte may rewrite the path from root to leaf.
Trade-off

Journaled in-place: small writes are cheap, fragmentation low, recovery fast. COW: snapshots cheap, integrity strong, but metadata writes amplify. LFS: flash-friendly write pattern, cleaner overhead eats some of that win.
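The COW pointer-swinging and cheap-snapshot claims can be sketched with a persistent tree (Python, illustrative): an update copies only the path from root to the changed leaf, the old root remains a fully consistent snapshot, and untouched subtrees are shared between versions.

```python
class Node:
    __slots__ = ("key", "val", "left", "right")
    def __init__(self, key, val, left=None, right=None):
        self.key, self.val, self.left, self.right = key, val, left, right

def insert(root, key, val):
    """Copy-on-write insert: never mutates an existing node; returns a
    new root whose path down to the inserted key is freshly copied."""
    if root is None:
        return Node(key, val)
    if key < root.key:
        return Node(root.key, root.val, insert(root.left, key, val), root.right)
    if key > root.key:
        return Node(root.key, root.val, root.left, insert(root.right, key, val))
    return Node(key, val, root.left, root.right)   # overwrite = new node

def lookup(root, key):
    while root is not None:
        if key == root.key:
            return root.val
        root = root.left if key < root.key else root.right
    return None

v1 = insert(insert(None, "a", 1), "b", 2)   # version 1 ("superblock" root)
v2 = insert(v1, "a", 99)                    # version 2: "a" updated
# v1 is an intact snapshot; v2 shares the untouched "b" subtree with it.
```

"Swinging the superblock" is just publishing `v2` as the new root; keeping `v1` around *is* the snapshot. The cost is visible too: updating one key allocated a new node for every level on the path.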

Crash Consistency & fsync

Even with a journal you must invoke fsync(fd) to tell the kernel “flush my data to durable storage.” Without it the page cache may hold your write for seconds. fsync on a directory ensures directory entries are durable (critical for rename-based atomic replace).

/* Durable atomic file replace (POSIX) */
int fd = open("data.new", O_WRONLY|O_CREAT|O_TRUNC, 0644);
write(fd, payload, len);
fsync(fd);                    /* flush data */
close(fd);
rename("data.new", "data"); /* atomic on same FS */
int dfd = open(".", O_RDONLY|O_DIRECTORY);
fsync(dfd);                   /* flush dir entry, else rename can be lost */
close(dfd);
Module 07

I/O Models

Blocking, non-blocking, multiplexed (select/poll/epoll/kqueue), and async (io_uring) — plus the subtle epoll bugs that trip even senior engineers.

Blocking / Non-blocking / Sync / Async Matrix

Two orthogonal properties govern any read or write:

  • Blocking vs non-blocking — does the call wait when data isn't ready?
  • Synchronous vs asynchronous — does the caller do the I/O itself, or ask the kernel to do it and notify when done?
Call style | Sync | Async
Blocking | Classic read(): thread sleeps until data ready, then copies it | POSIX AIO (aio_read): rarely used, kernel threads behind it
Non-blocking | fcntl(O_NONBLOCK) + select/epoll: app polls for readiness, then copies itself | io_uring: app submits, kernel performs the I/O, app reaps completions
Interview Lens

The C10K problem (thread-per-connection stops scaling around 10K connections) was solved by readiness notification — epoll/kqueue. The C10M problem (10 million connections) is attacked by completion notification — io_uring — which avoids per-I/O syscall overhead entirely.

select / poll / epoll / kqueue

All four let a single thread monitor many FDs. They differ in how the FD set is conveyed and how scalable they are:

API | Portability | Per-call cost | Max fds | Semantics
select | POSIX | O(FD_SETSIZE) bit scan | 1024 by default | Pass bitset; kernel copies it each call
poll | POSIX | O(n) array scan | ulimit | Pass pollfd[]; kernel iterates all
epoll | Linux | O(ready) — returns only active fds | Tens of millions | Register once; kernel queues ready events
kqueue | *BSD / macOS | O(ready) | Unbounded | Event-typed filter API (files, signals, timers)
IOCP | Windows | O(1) completion reap | Unbounded | Completion-based, not readiness-based
/* epoll event loop skeleton — sock must already be O_NONBLOCK for ET */
int ep = epoll_create1(0);
char buf[4096];
struct epoll_event ev = { .events = EPOLLIN|EPOLLET, .data.fd = sock };
epoll_ctl(ep, EPOLL_CTL_ADD, sock, &ev);

for (;;) {
    struct epoll_event evs[64];
    int n = epoll_wait(ep, evs, 64, -1);
    for (int i = 0; i < n; ++i) {
        int fd = evs[i].data.fd;
        if (evs[i].events & EPOLLIN) {
            /* ET mode: drain until EAGAIN */
            for (;;) {
                ssize_t r = read(fd, buf, sizeof(buf));
                if (r < 0 && errno == EAGAIN) break;
                if (r <= 0) { close(fd); break; }
                handle(fd, buf, r);
            }
        }
    }
}

Edge vs Level Triggered

Level Triggered (LT, default): "the fd is readable" — the kernel keeps notifying on every epoll_wait while the buffer is non-empty. Edge Triggered (EPOLLET): "the fd just became readable" — one notification per empty→non-empty transition; you must drain fully each time or starve.
Figure 7.1 — LT re-wakes every poll while data exists; ET fires once per empty→non-empty transition.
Gotcha: the classic ET bug

You must loop and drain the fd until read returns EAGAIN. If you read exactly what epoll said was there and stop, you'll miss the rest — and epoll will not notify you again until another edge. Same rule for accept loops: pull all pending connections out of the listen queue.

io_uring — True Async I/O on Linux

Introduced in Linux 5.1 (2019). Two lock-free shared-memory ring buffers (SQ: submission, CQ: completion) between userspace and kernel. The app writes SQEs; the kernel consumes them, does the I/O, writes CQEs. Zero syscalls per I/O in the busy case (IORING_SETUP_SQPOLL polls the SQ from a kernel thread).

What it subsumes

  • Every read/write/send/recv/accept/connect/openat/stat/close/fsync — submittable, batchable.
  • Buffered, direct, and zero-copy (IORING_OP_SEND_ZC).
  • Chained / linked SQEs: “open, then read, then close” as a single submission.
  • Registered buffers + files: skip reference counting per I/O.
Numbers

For 4 KiB random reads on NVMe, epoll + pread tops out around ~600 K IOPS per core. io_uring with polled mode reaches ~3 M IOPS per core — ~5× the throughput at ~1/3 the CPU, mostly by eliminating syscall entry/exit and the copy_user cost.

Common epoll / Event-Loop Bugs

Gotcha 1 — Thundering herd on accept

Many worker threads all epoll_wait on the same listen fd. A connection arrives; all wake. Only one gets the socket; the rest find EAGAIN and go back to sleep. Fix: EPOLLEXCLUSIVE (kernel 4.5+) wakes only one waiter. Or use SO_REUSEPORT and one accepter per thread.

Gotcha 2 — fd reuse

You close fd 7, the kernel immediately reuses it for a new socket, then your epoll event for “old fd 7” fires and you process bytes for the wrong connection. Fix: always epoll_ctl(EPOLL_CTL_DEL) before close, or associate a generation counter in epoll_event.data.u64.

Gotcha 3 — ET without non-blocking

Using EPOLLET on a blocking fd: the drain loop blocks forever after the first read fills the buffer. Always pair ET with O_NONBLOCK.

Gotcha 4 — Write buffering for sends

TCP send buffer fills and write returns EAGAIN. You must re-arm EPOLLOUT, queue the unsent bytes, and flush when writable — then disarm EPOLLOUT. Leaving EPOLLOUT armed causes a busy-loop of wakeups.
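The queue-then-arm-then-disarm discipline can be sketched independent of epoll (Python; `try_send` and `want_writable` are invented stand-ins for a non-blocking `send` and for EPOLLOUT being armed):

```python
class BufferedWriter:
    """Sketch of the EPOLLOUT discipline: queue what the kernel would
    not take, keep write-interest armed while the queue is non-empty,
    and disarm once it drains (leaving it armed would busy-wake)."""
    def __init__(self, try_send):
        self.try_send = try_send    # callable(bytes) -> number of bytes taken
        self.pending = b""
        self.want_writable = False  # stands in for EPOLLOUT being armed

    def write(self, data):
        self.pending += data
        self._flush()

    def on_writable(self):          # event loop reports the fd is writable
        self._flush()

    def _flush(self):
        sent = self.try_send(self.pending)
        self.pending = self.pending[sent:]
        self.want_writable = bool(self.pending)  # arm iff bytes remain

# Simulate a full-ish socket buffer that accepts at most 4 bytes per call.
taken = []
def fake_send(buf):
    n = min(4, len(buf))
    taken.append(buf[:n])
    return n

w = BufferedWriter(fake_send)
w.write(b"0123456789")        # kernel takes 4; 6 bytes queued
armed = w.want_writable        # EPOLLOUT armed while bytes remain
w.on_writable()                # writable wakeup: flush 4 more
w.on_writable()                # final 2 bytes; queue empty, disarmed
```

The key line is the last one in `_flush`: interest tracks queue occupancy exactly, which is what prevents both lost bytes and the EPOLLOUT busy-loop.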

Module 08

Concurrency Patterns

Futures, actors, CSP channels, and the event loop — the building blocks every modern async runtime picks from.

Futures & Promises

A future is a handle to a value that will arrive later. A promise is the write side of that pair — whoever computes the value completes the promise, which fulfils all futures derived from it. Semantically identical to a continuation-passing callback, but composable.

# Python futures/async composition
import asyncio, aiohttp

async def fetch(session, url):
    async with session.get(url) as r:
        return await r.text()

async def main(urls):
    async with aiohttp.ClientSession() as s:
        # All fetches run concurrently; gather awaits all.
        pages = await asyncio.gather(*(fetch(s, u) for u in urls))
        return [len(p) for p in pages]

asyncio.run(main(["https://a.example/", "https://b.example/"]))

Combinators you must know

  • map — transform the eventual value.
  • flatMap / then — chain another future that depends on this one's result.
  • all / gather — wait for N futures, return a list of N results.
  • race / any — return the first to succeed (or fail).
  • timeout — cancel after a deadline.
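Three of these combinators in asyncio terms (a sketch; `fetch`, its names, and the delays are made up):

```python
import asyncio

async def fetch(name, delay):
    # Stand-in for an async I/O call.
    await asyncio.sleep(delay)
    return name

async def main():
    # all / gather — run concurrently, collect every result in order.
    all_results = await asyncio.gather(fetch("a", 0.02), fetch("b", 0.01))

    # race / any — first completion wins; cancel the losers.
    tasks = [asyncio.ensure_future(fetch("fast", 0.01)),
             asyncio.ensure_future(fetch("slow", 0.5))]
    done, pending = await asyncio.wait(tasks,
                                       return_when=asyncio.FIRST_COMPLETED)
    for p in pending:
        p.cancel()
    winner = done.pop().result()

    # timeout — cancel after a deadline.
    try:
        await asyncio.wait_for(fetch("late", 5.0), timeout=0.05)
        timed_out = False
    except asyncio.TimeoutError:
        timed_out = True
    return all_results, winner, timed_out

results = asyncio.run(main())
```

Note that `gather` preserves argument order even though "b" finishes first — a property interviewers often probe.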
Gotcha: unawaited futures

If you call an async function and don't await it, its body never runs (Python coroutines) or runs detached (JS Promises with unhandled rejections). Both languages warn; treat these warnings as errors.

Actor Model — Isolation by Mailbox

Each actor is a lightweight process with private state and a mailbox. Communication is one-way messages: actor A sends to B's mailbox, B processes messages one at a time. Because state is private and access is serial, no locks required. Erlang/OTP, Akka, Elixir popularized it.

# Actor-style counter with a single mailbox coroutine
import asyncio

async def counter_actor(inbox):
    n = 0
    while True:
        msg = await inbox.get()
        if msg == "inc": n += 1
        elif isinstance(msg, tuple) and msg[0] == "get":
            msg[1].set_result(n)
        elif msg == "stop":
            return
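A minimal driver for that sketch, showing the "ask" pattern (the actor is restated for self-containment; mailbox = `asyncio.Queue`, reply = a future):

```python
import asyncio

async def counter_actor(inbox):
    # Restated from the listing above: messages are processed serially
    # against private state, so no lock is needed.
    n = 0
    while True:
        msg = await inbox.get()
        if msg == "inc":
            n += 1
        elif isinstance(msg, tuple) and msg[0] == "get":
            msg[1].set_result(n)       # reply through the caller's future
        elif msg == "stop":
            return

async def main():
    inbox = asyncio.Queue()
    actor = asyncio.ensure_future(counter_actor(inbox))
    for _ in range(3):
        await inbox.put("inc")
    reply = asyncio.get_running_loop().create_future()
    await inbox.put(("get", reply))    # "ask": send a message + reply slot
    value = await reply
    await inbox.put("stop")
    await actor
    return value

value = asyncio.run(main())
```

Because the mailbox serializes all messages, the `get` reply is guaranteed to see all three increments — ordering within a mailbox is exactly the guarantee actors give you.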

Strengths / weaknesses

  • + No shared state, no locks, compositional supervision trees.
  • + “Let it crash” recovery — a supervisor restarts a failing actor.
  • − Ordering only within a mailbox; cross-actor invariants require careful protocol design.
  • − Bounded mailboxes or you risk unbounded memory growth under back-pressure.

CSP Channels (Hoare, 1978 → Go)

Communicating Sequential Processes: goroutines synchronize by sending values through typed channels, not shared memory. Unbuffered channels are rendezvous (sender blocks until receiver arrives); buffered channels act as a bounded queue.

// Go: pipeline of goroutines via channels
package main
import "fmt"

func producer(ch chan<- int) {
    for i := 0; i < 5; i++ { ch <- i }
    close(ch)
}
func consumer(ch <-chan int, done chan<- bool) {
    for v := range ch { fmt.Println(v) }
    done <- true
}
func main() {
    ch := make(chan int, 2)
    done := make(chan bool)
    go producer(ch)
    go consumer(ch, done)
    <-done
}
Trade-off: actors vs CSP

Actor identity is who (mailbox address); CSP identity is what (channel). Actors model independent agents, CSP models typed streams. They are duals: an actor is a goroutine looping on its channel.

Event Loop — Single-threaded Concurrency

One thread running a FIFO of callbacks. Non-CPU work (network, disk, timers) is delegated to the kernel (epoll/kqueue) or a small threadpool (libuv); the loop only dispatches ready tasks. JS engines (Node, browsers) and Python's asyncio both work this way.

Node.js event loop phases for one tick
Gotcha: CPU-bound work kills latency

A single sync JSON.parse on a 50 MB blob blocks the loop for ~500 ms; every other connection stalls. Offload to a worker thread (worker_threads) or a child process. Rule: if any handler takes >10 ms, profile and move it off the loop.
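The asyncio version of that offload (a sketch; the blob and `parse_blob` are invented). `run_in_executor` with `None` uses the default thread pool; for pure-Python CPU work a `ProcessPoolExecutor` is the real fix, since threads share the GIL:

```python
import asyncio
import json

def parse_blob(blob):
    # CPU-bound stand-in: a big json.loads would stall the loop if run inline.
    return len(json.loads(blob))

async def main():
    loop = asyncio.get_running_loop()
    blob = json.dumps(list(range(10_000)))
    # Offload to the default executor; the loop keeps dispatching other
    # callbacks while the worker parses.
    return await loop.run_in_executor(None, parse_blob, blob)

result = asyncio.run(main())
```

The same shape works in Node: replace `run_in_executor` with posting to a `worker_threads` pool and awaiting the message back.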

async / await Mechanics

Syntactic sugar over state machines. The compiler translates each async def into a class with methods send, throw, close; each await becomes a suspension point. The runtime drives the state machine from an event loop.

# Roughly what "async def f()" becomes at the bytecode level
class AwaitableF:
    def __init__(self): self.state = 0
    def send(self, val):
        if self.state == 0:
            self.state = 1
            return some_future           # suspend
        elif self.state == 1:
            result = val                   # future's value
            raise StopIteration(result)    # done
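You can watch the same protocol by driving a generator by hand — a generator is the state machine the sketch above describes, with `yield` as the suspension point and the return value travelling inside `StopIteration` (names here are illustrative):

```python
def async_f():
    # Each yield is a suspension point; the value sent back in resumes us.
    fut_value = yield "some_future"     # state 0 -> suspend on a "future"
    return fut_value * 2                # state 1 -> done

coro = async_f()
suspended_on = coro.send(None)          # start: run to the first suspension
try:
    coro.send(21)                       # "event loop" delivers the result
except StopIteration as done:
    result = done.value                 # the coroutine's return value
```

This is literally how pre-async/await libraries (Tornado, old Twisted) built coroutines out of generators; `async def` bakes the same machinery into the language.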
Interview Lens

“Why can a single thread handle 100K connections?” Because every await parks the coroutine, freeing the thread to run another. The cost of a context switch in userspace (saving a few pointers) is ~20 ns, vs 1–5 µs for a kernel thread switch.

Module 09

Memory Model & Atomics

Why the hardware reorders your loads and stores, how happens-before rules let you reason about it, and the ABA problem that ruins naive lock-free code.

Reordering & Visibility

Two threads on two cores see a different order of the same stores. This isn't a bug; it's the architecture. Both the compiler and the CPU freely reorder instructions that are independent within a single thread. Cross-thread visibility requires explicit synchronization.

Who reorders what?

Layer | Can do | Defeat with
Compiler | Reorder, hoist, combine, reuse registers | volatile / std::atomic / inline-asm "memory" clobber
CPU (out-of-order) | Reorder independent loads/stores within the pipeline | Memory barriers / fences
Store buffer | Delay stores so loads go first (x86 allows store→load reordering) | mfence, lock-prefixed ops
Cache coherence | Propagation delay between cores | Acquire/release ordering and cache line protocol

Hardware model strengths

  • x86/x86_64 — TSO (Total Store Ordering). Stores don't reorder with earlier stores; loads don't reorder with earlier loads or stores. Only reordering allowed: a store followed by a load to a different address.
  • ARM64 — Relaxed. Any pair of independent memory ops can reorder. Requires explicit dmb/ldar/stlr.
  • POWER — Even weaker than ARM. Must fence often.
Gotcha

Code that “just works” on x86 often breaks on ARM. Java (JMM), C++11 (std::memory_order), and Go provide portable memory models — if you use them correctly you're safe on any architecture.

Happens-Before — The Reasoning Primitive

Define a partial order “hb” over events in a program. If a hb b, then a's writes are visible when b executes. The rules are few and compose transitively:

  1. Program order — within a thread, earlier statements hb later ones.
  2. Unlock / Lock — an unlock of m hb every subsequent lock of m.
  3. Release / Acquire — an atomic store-release hb every atomic load-acquire that reads it.
  4. Volatile (Java) — a write to a volatile hb every subsequent read of that volatile.
  5. Thread start / join — t.start() hb everything in t; everything in t hb t.join() returning.
  6. Transitivity — a hb b and b hb c implies a hb c.
Happens-before: producer writes data then releases; consumer acquires then reads
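The producer/consumer figure in miniature (Python sketch): `threading.Event` gives the release/acquire pairing at the language level, so the plain write to `payload` is guaranteed visible after `wait()` returns. The point is the pattern, not Python's internals — CPython's runtime supplies the ordering under the hood.

```python
import threading

payload = 0
ready = threading.Event()
seen = []

def producer():
    global payload
    payload = 42          # plain write, published by...
    ready.set()           # ...the release on the flag

def consumer():
    ready.wait()          # acquire: synchronizes-with the set()
    seen.append(payload)  # hb chain guarantees this observes 42

t1 = threading.Thread(target=consumer)
t2 = threading.Thread(target=producer)
t1.start(); t2.start()
t1.join(); t2.join()
```

The hb chain is: write payload → (program order) → set() → (release/acquire) → wait() returns → (program order) → read payload.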
Interview Lens

The JMM / C++ memory model is intimidating on paper, but “do the writes I did before the flag become visible to someone who reads the flag?” is the only question. Release on write, acquire on read, and you're home.

Acquire / Release / SeqCst

Order | Meaning | Typical use
memory_order_relaxed | Atomicity only — no ordering | Perf counter increments
memory_order_acquire | No later loads/stores move before this load | Lock acquire, flag read
memory_order_release | No earlier loads/stores move after this store | Lock release, flag write
memory_order_acq_rel | Both — used on RMW (CAS, fetch_add) | Lock-free data structures
memory_order_seq_cst | Single total order across all SC ops | Default — safest but slowest
/* Message-passing publication with release/acquire (C11) */
#include <stdatomic.h>
#include <stdio.h>

_Atomic int  flag = 0;
int          payload;

void producer(void) {
    payload = 42;                                   /* non-atomic write */
    atomic_store_explicit(&flag, 1,
                          memory_order_release);
}
void consumer(void) {
    while (atomic_load_explicit(&flag,
                                 memory_order_acquire) == 0) { }
    printf("%d\n", payload); /* guaranteed to see 42 */
}
Numbers

On x86, memory_order_acquire/release cost zero additional instructions vs relaxed (x86's TSO already provides the ordering). seq_cst on a store requires an mfence, ~15–30 cycles. On ARM the difference is real: relaxed = ldr, acquire = ldar, seq_cst = dmb + op.

Fences & Barriers

A fence is a standalone instruction that restricts reordering around it without itself being a read or write. Use when you need ordering without an atomic op:

  • Full fence (mfence / dmb ish): no load or store moves across.
  • Load fence (lfence): earlier loads complete before later loads.
  • Store fence (sfence): earlier stores complete before later stores.
  • Compiler fence only: asm volatile("" ::: "memory") — prevents the compiler from reordering, but no hardware barrier.
store A · store B — mfence — load C · load D. Stores A and B become globally visible before loads C and D execute; no memory op on either side moves across the fence.
Figure 9.1 — A full fence as a reorder barrier between store and load streams.

The ABA Problem on a Lock-Free Stack

The canonical CAS-based pop is: read head, read head->next, CAS head from old to next. If between your read and CAS someone pops A, pops B, then pushes A back — head is still A, CAS succeeds, but head->next now points at freed memory.

ABA on a lock-free stack (Treiber stack)

Fixes

  • Versioned pointers (double-word CAS) — store (ptr, version) in 128 bits; every pop increments version. cmpxchg16b on x86.
  • Hazard pointers (Michael, 2004) — before dereferencing a shared pointer, publish it as “in use”; reclaimers skip nodes announced by any thread.
  • Epoch-based reclamation (crossbeam in Rust) — a node is freed only when no thread is in an older epoch.
  • RCU (Linux kernel) — reclaim after a grace period in which every CPU has context-switched.
/* Treiber stack pop with a double-word-CAS versioned head.
   x86_64: cmpxchg16b — the Head struct must be 16-byte aligned. */
struct Node { int v; struct Node *next; };
struct Head { struct Node *p; uint64_t ver; };
_Atomic(struct Head) head;

struct Node* pop(void) {
    struct Head old, neu;
    do {
        old = atomic_load(&head);
        if (!old.p) return NULL;
        neu.p   = old.p->next;
        neu.ver = old.ver + 1;       /* defeats ABA */
    } while (!atomic_compare_exchange_weak(&head, &old, neu));
    return old.p;
}
Interview Lens

When asked “what's wrong with this lock-free stack,” the expected answer is ABA, followed by one of the four remediations above. Bonus: mention that GC languages (Java, Go) don't see ABA on reference-typed slots because the old node can't be reclaimed while a thread still holds a reference.

Module 10

Cheat Sheet & Quick Reference

One-page summaries, must-memorize latency numbers, rapid-fire Q&A, and a POSIX command matrix.

Module Summaries

Processes & Threads

  • Process = address space + FDs; thread = registers + stack
  • Linux: both are task_struct; threads share mm/files/sighand
  • fork = COW address space; exec replaces image
  • Context switch: ~1 µs direct + 10–100 µs indirect (cache/TLB)

Memory / VM

  • 4-level paging: 9+9+9+9+12 = 48-bit VA, 4 KiB pages
  • PTE bits: P, R/W, U/S, A, D, G, PS, XD, PFN
  • TLB miss = hardware walker reads 4 cache lines
  • Huge pages extend TLB reach from 256 KiB to 128 MiB

Scheduling

  • FCFS: simple, convoy effect
  • SJF/SRTF: optimal mean wait, starvation risk
  • RR: fair, q tuning = latency vs throughput
  • MLFQ: adaptive priority + aging boost
  • CFS: RB-tree of vruntime, weighted fair; EEVDF in 6.6+

Synchronization

  • Mutex: owned lock; pthread = futex-backed
  • Counting semaphore: pool / flow control
  • CV: always loop on predicate (Mesa semantics)
  • Prod-cons needs 3 sems: empty, full, mutex

Deadlock

  • Coffman: mutex, hold&wait, no-preempt, circular-wait
  • Break circular-wait with a global lock order
  • Detect via DFS cycle in wait-for graph
  • Banker's: O(mn²) safety check per request

File Systems

  • Inode = metadata; directory = name → inode table
  • Journal modes: journal > ordered (default) > writeback
  • COW FS (ZFS/btrfs): cheap snapshots, write amplification
  • Atomic file replace: write temp, fsync, rename, fsync dir

I/O Models

  • select O(n), poll O(n), epoll O(ready), kqueue O(ready)
  • ET requires non-blocking + drain-until-EAGAIN
  • io_uring: shared rings, ~0 syscalls per I/O
  • Watch out: fd reuse, thundering herd, EPOLLOUT busy-loop

Concurrency Patterns

  • Futures: map / flatMap / all / race / timeout
  • Actors: private state + mailbox; let-it-crash
  • CSP: channels as typed queues (Go)
  • Event loop: macrotask ↔ microtask drain per tick

Memory Model

  • x86 = TSO (strong); ARM = relaxed
  • Happens-before: PO, unlock/lock, release/acquire, start/join
  • Release on publisher's write, acquire on consumer's read
  • ABA defeated by versioned CAS, hazard ptrs, RCU, epochs

Latency Numbers Every OS Candidate Should Know

Operation | Time (2024) | Relative
L1 cache reference | 0.5 ns | 1x
Branch mispredict | 3 ns | 6x
L2 cache reference | 4 ns | 8x
Mutex lock/unlock (uncontended) | 25 ns | 50x
Main memory reference | 100 ns | 200x
Compress 1 KB with Snappy | 2 µs | 4,000x
TLB miss (full 4-level walk from RAM) | 200 ns | 400x
Context switch | 1–5 µs | 2,000–10,000x
System call (simple syscall) | 100 ns | 200x
NVMe SSD random 4K read | 20 µs | 40,000x
Round-trip in same DC | 0.5 ms | 1,000,000x
HDD seek | 10 ms | 20,000,000x
Page fault (minor, zero-fill) | 1 µs | 2,000x
Page fault (major, NVMe swap-in) | 20 µs | 40,000x
Page fault (major, HDD swap-in) | 10 ms | 20,000,000x
fork() on 100 MB RSS | 450 µs | 900,000x
fsync on NVMe | 30 µs | 60,000x

Rapid-Fire Q&A

# | Question | Answer
1 | Process vs thread? | Process owns resources (AS, FDs); thread is a unit of scheduling (registers, stack).
2 | What is copy-on-write? | Page tables point at shared read-only pages; first write faults and copies that page only.
3 | Why is fork fast? | It copies page tables, not physical pages. COW defers actual copies until writes.
4 | What's in a PCB? | PID, state, register context, memory map, FD table, signals, creds, scheduler state.
5 | What does a context switch cost? | ~1 µs direct + 10–100 µs indirect (TLB/cache cold).
6 | Why 4-level paging? | Supports sparse 48-bit virtual space without allocating 2^48 / 4 KiB = 2^36 PTEs upfront.
7 | What happens on TLB miss? | Hardware walks PML4 → PDPT → PD → PT, fetching up to 4 cache lines; loads PTE into TLB.
8 | How does COW detect the write? | PTE marked read-only; write triggers #PF; kernel copies page, updates PTE.
9 | What is thrashing? | Working set exceeds RAM; CPU spends >50% on page faults instead of useful work.
10 | Difference between a mutex and a semaphore? | Mutex owned; unlock only by holder. Counting semaphore: multi-resource; any thread can V.
11 | Why use a CV and not a busy-wait? | Busy-wait burns CPU + cache. CV parks the thread in the kernel until signalled.
12 | What are the Coffman conditions? | Mutual exclusion, hold & wait, no preemption, circular wait.
13 | Best prevention technique in practice? | Global lock ordering — breaks circular wait.
14 | When does SJF starve? | When short jobs keep arriving faster than a long job finishes; aging fixes it.
15 | What does CFS pick? | The runnable task with the smallest weighted vruntime (leftmost RB-tree node).
16 | Why is RR quantum tuning hard? | Too small: all overhead. Too large: degenerates to FCFS.
17 | What is the convoy effect? | Many short jobs queue behind one long job under FCFS.
18 | What is priority inversion? | Low-pri holds a lock needed by high-pri; medium-pri preempts low-pri indefinitely. Fix: priority inheritance.
19 | Hoare vs Mesa CVs? | Hoare: signal hands off mutex directly. Mesa: signalled thread re-contends; always loop on predicate.
20 | What is a spurious wakeup? | A thread wakes from cond_wait without a corresponding signal. Re-test predicate.
21 | What is livelock? | Threads aren't blocked but make no progress; e.g., both backing off and retrying in lockstep.
22 | Inode vs directory entry? | Inode = metadata + block pointers. Dirent = (name, inode) in a directory file.
23 | Hard link limits? | Same filesystem only; inode nlink must fit; root-only for directories.
24 | What does fsync do? | Flushes dirty pages of the fd to durable storage and issues a FLUSH cache command.
25 | Why fsync a directory? | Renames/creates are metadata; the directory file's updated block must be durable too.
26 | What does a journal guarantee? | Filesystem metadata consistency after crash; not application data unless data=journal.
27 | LFS vs COW? | LFS: log-append everything + cleaner. COW: copy on modify and swing root atomically.
28 | What is select's main limit? | O(n) with FD_SETSIZE (1024). poll is O(n) unbounded. epoll is O(ready).
29 | Why non-blocking for epoll ET? | ET fires once per transition; you must read until EAGAIN, which requires non-blocking.
30 | What is io_uring's advantage? | Zero-copy shared rings; zero syscalls per I/O in polled mode; batching.
31 | What are futures good at? | Composing async ops with map/flatMap/all/race without callback pyramids.
32 | What is back-pressure? | Flow control so fast producers don't overwhelm slow consumers (bounded queues, pausing streams).
33 | Actors vs CSP? | Actor identity = mailbox owner. CSP identity = channel. Dual formulations.
34 | What is a microtask? | Promise callback / process.nextTick; drained fully between macrotasks.
35 | Why is x86 easier to program? | TSO: only store→load reorderable. ARM/POWER: any independent pair may reorder.
36 | Release / acquire pair meaning? | Release on publisher's write; acquire on reader's read — synchronizes-with for HB.
37 | What is a data race? | Concurrent accesses to the same memory, at least one write, no HB ordering between them.
38 | What is volatile NOT? | NOT atomic (only suppresses compiler reorder/elide). Use std::atomic for thread safety.
39 | ABA — one-line summary? | CAS sees old value, but it was pop-pop-push-back; your "next" pointer is now dangling.
40 | RCU in one sentence? | Readers lock-free; updater publishes a new version, frees old after a grace period.
41 | What is a futex? | User-space fast-path atomic + kernel FUTEX_WAIT/WAKE syscall for contended slow-path.
42 | Why do threads share the signal table? | POSIX: signals are per-process, delivered to any thread whose mask permits it.
43 | Zombie vs orphan? | Zombie: terminated, not reaped. Orphan: parent died, init adopts and reaps.
44 | What does EINTR mean? | A syscall was interrupted by a signal; must retry (unless SA_RESTART auto-restart).
45 | mmap vs read for files? | mmap: zero-copy, page-fault driven, lazy. read: explicit, always a syscall + copy.
46 | Why is pthread_mutex_trylock useful? | Avoids blocking; lets you fall back (try a different lock order, back off, do other work).
47 | Time vs priority in SCHED_FIFO? | Highest fixed priority runs until it yields; no time-slicing — risk of starving others.
48 | Huge pages — cost? | Require contiguous physical frames; hard to allocate on a fragmented host; transparent HP (THP) mitigates.
49 | copy_file_range vs sendfile? | copy_file_range: file→file zero-copy. sendfile: file→socket zero-copy (older API).
50 | What does a page cache buy you? | In-memory cache of file data; turns the second read into memory-speed; absorbs writes until flush.

POSIX / Linux Command Matrix

| Task | Syscall | Header | Notes |
| --- | --- | --- | --- |
| Create process | fork, vfork, clone | unistd.h / sched.h | clone with flags builds threads or containers |
| Replace image | execve family | unistd.h | Only returns on failure |
| Reap child | wait, waitpid, waitid | sys/wait.h | Clears zombie; returns exit status |
| Allocate memory | brk, mmap | unistd.h / sys/mman.h | Anonymous mapping = malloc backing for big allocs |
| Map file | mmap | sys/mman.h | Shared or private; zero-copy access |
| Create thread | pthread_create | pthread.h | Wrapper around clone(CLONE_THREAD \| CLONE_VM \| …) |
| Mutex | pthread_mutex_* | pthread.h | Futex-backed on Linux |
| Condition variable | pthread_cond_* | pthread.h | Always used with a mutex; loop on predicate |
| Semaphore | sem_init/wait/post | semaphore.h | Counting; POSIX named or unnamed |
| Event-driven I/O | epoll_create1/ctl/wait | sys/epoll.h | Linux only; use kqueue on BSD/macOS |
| Async I/O (modern) | io_uring_setup + liburing (user library) | liburing.h | Submission + completion rings |
| Open / close fd | open, close, openat | fcntl.h / unistd.h | O_NONBLOCK, O_CLOEXEC, O_DIRECT |
| Durability | fsync, fdatasync, sync_file_range | unistd.h | fsync on dir for rename atomicity |
| Rename / atomic swap | rename, renameat2 RENAME_EXCHANGE | stdio.h / fcntl.h | Atomic on same filesystem |
| Signals | sigaction, kill, pthread_sigmask | signal.h | Prefer sigaction; async-signal-safe funcs only in handlers |
| Memory protection | mprotect, madvise | sys/mman.h | Change perms; hint access pattern |
| IPC: shared mem | shm_open + mmap | sys/mman.h | POSIX SHM; prefer over SysV shmget |
| IPC: unix sockets | socket(AF_UNIX) | sys/socket.h | SOCK_STREAM or SOCK_SEQPACKET; FD passing via SCM_RIGHTS |
| Scheduling tuning | sched_setscheduler, nice, setpriority | sched.h | SCHED_OTHER (CFS), SCHED_FIFO, SCHED_RR, SCHED_DEADLINE |
| CPU pinning | sched_setaffinity | sched.h | Reduces migration cost; aligns with NUMA topology |