Operating Systems & Concurrency
A FAANG interview consolidation covering processes, virtual memory, scheduling, synchronization, deadlock, file systems, I/O, concurrency patterns, and memory models.
Processes & Threads
What the kernel tracks per process and per thread, how fork/exec work, and exactly what a context switch costs.
Process Anatomy & the PCB
A process is an instance of an executing program with its own virtual address space, open-file table, credentials, and at least one thread of control. The kernel tracks each process via a Process Control Block (PCB) — in Linux this is struct task_struct (several KiB per task on x86_64, varying with kernel config).
Virtual address-space layout (x86_64 Linux, typical): text, then data/BSS, then the heap growing up; a mmap arena for shared libraries and large allocations; the stack growing down from the top of the user half; the kernel half above the canonical gap.
PCB fields — what the kernel tracks per process
| Field | Linux field name | Purpose |
|---|---|---|
| PID | pid_t pid | Unique process identifier in PID namespace |
| Parent PID | struct task_struct *parent | Backpointer for SIGCHLD / wait() |
| State | long state | RUNNING / INTERRUPTIBLE / UNINTERRUPTIBLE / STOPPED / ZOMBIE |
| Scheduling | struct sched_entity se | vruntime, policy (CFS / FIFO / RR), nice value |
| Address space | struct mm_struct *mm | Page tables, VMA list, brk, stack start |
| File table | struct files_struct *files | FD → struct file mapping |
| Signals | struct sighand_struct *sighand | Signal handlers + pending masks |
| Credentials | struct cred *cred | UID, GID, capabilities |
| Namespaces | struct nsproxy *nsproxy | pid/net/mount/uts/ipc/cgroup namespaces |
| Saved context | struct thread_struct thread | Register file, FPU state, kernel stack pointer |
| Memory stats | rss, vsize | Resident / virtual size, for top/ps |
Process states — the five-state machine: NEW → READY ⇄ RUNNING → TERMINATED, with RUNNING → WAITING (blocked on I/O or a lock) → READY closing the loop.
A zombie is a terminated process whose PCB is still retained so the parent can read its exit code via wait(). If the parent never waits, zombies pile up — each consumes a PID slot. Orphaned children are re-parented to init (PID 1, or a subreaper), which reaps them — this is why init always loops on wait().
Threads & the TCB
A thread is an independently schedulable flow of execution that shares its address space, file table, and signal handlers with sibling threads. On Linux, threads are just task_structs created with clone(CLONE_VM|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|…) — the PCB is the TCB. Shared resources are stored once in mm_struct / files_struct and pointed to by every thread.
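A minimal pthreads sketch (illustrative, not from the original text) makes the sharing concrete: both threads see the same shared global through the common mm, while each local lives on a private per-thread stack.

```c
/* threads_share.c — sketch: sibling threads share globals, keep private stacks */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

_Atomic int shared = 0;            /* one copy, visible via the shared mm */

static void *worker(void *arg) {
    int local = *(int *)arg;       /* lives on this thread's private stack */
    atomic_fetch_add(&shared, local);
    printf("stack addr %p\n", (void *)&local);   /* distinct per thread */
    return NULL;
}

int main(void) {
    pthread_t t[2];
    int args[2] = { 1, 2 };
    for (int i = 0; i < 2; ++i) pthread_create(&t[i], NULL, worker, &args[i]);
    for (int i = 0; i < 2; ++i) pthread_join(t[i], NULL);
    printf("shared = %d\n", atomic_load(&shared)); /* 3 — same variable */
    return 0;
}
```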
Per-thread (TCB) vs per-process state
| Kind of state | Where it lives | Per-thread? |
|---|---|---|
| Register file / PC / RSP | thread_struct | Yes — the essence of a thread |
| Kernel stack (8–16 KiB) | task_struct::stack | Yes |
| User stack | VMA in shared mm | Yes (distinct ranges) |
| TLS block (fs/gs base) | Per-thread arena | Yes |
| Pending signals | task_struct::pending | Yes — thread-directed signals |
| Signal handlers | sighand_struct | No — shared |
| Address space / page tables | mm_struct | No |
| Open file descriptors | files_struct | No |
| Heap / BSS / Text | VMAs in shared mm | No |
| PID (thread group leader) | tgid | No — TID is per-thread, PID is shared |
User vs kernel threads
- 1:1 (kernel threads) — every userspace thread maps to one kernel task_struct. Linux NPTL, Windows threads, and modern macOS all do this. Blocking a thread only blocks that thread; multiple threads can run on multiple cores simultaneously.
- N:1 (green threads) — many user threads multiplexed onto one kernel thread. A blocking syscall freezes every sibling. Historic examples: early Java on Solaris green threads, GNU Pth. Simple, but no multicore.
- M:N (hybrid) — a user scheduler maps M user threads onto N kernel threads. Go's goroutines, Rust's long-removed green threads, Erlang's BEAM. Cheap context switches in userspace; requires a runtime to trap blocking syscalls.
When asked “process vs thread”, the sharp answer distinguishes the unit of resource ownership (process: address space, FDs) from the unit of scheduling (thread: registers + stack). Go's goroutines are not threads — they are a userspace M:N runtime where a go-scheduler multiplexes 100K goroutines onto GOMAXPROCS kernel threads.
fork / exec / wait
POSIX fork() creates a duplicate of the calling process. The child gets a new PID, a copy of the parent's memory (copy-on-write), a duplicate FD table, and picks up execution on the instruction after fork returns.
```c
/* fork_demo.c — classic fork/exec/wait pipeline */
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void) {
    pid_t pid = fork();
    if (pid < 0) { perror("fork"); return 1; }
    if (pid == 0) {
        /* child: replace our image with /bin/ls */
        char *argv[] = { "ls", "-la", NULL };
        execvp(argv[0], argv);
        perror("execvp");   /* only reached if exec fails */
        _exit(127);
    }
    /* parent: wait for the child and inspect its exit status */
    int status;
    waitpid(pid, &status, 0);
    if (WIFEXITED(status))
        printf("child exited with %d\n", WEXITSTATUS(status));
    return 0;
}
```
What actually happens on fork
- Kernel copies task_struct — trivial bytes.
- Clones mm_struct by copying page tables, not physical pages. All writable pages are marked read-only. This is copy-on-write (COW).
- Duplicates the FD table (refcount++ on each file*).
- Copies signal dispositions (but not pending signals).
- Sets the child's return value to 0 and the parent's to the child PID.
- Schedules the child; the first page written triggers a fault, and the kernel copies that page only.
fork cost breakdown (x86_64 Linux, 2024)
| Resident size | Pages | fork() time (approx) | Why |
|---|---|---|---|
| 10 MB | 2,500 | ~90 µs | PT copy dominated |
| 100 MB | 25,000 | ~450 µs | Linear in page-table entries |
| 1 GB | 250,000 | ~3.5 ms | Multi-level PT walk + copy |
| 10 GB | 2.5M | ~35 ms | Why Redis uses fork for BGSAVE and sees pauses |
fork after pthread_create is dangerous. Only the calling thread survives in the child; others vanish mid-lock. If a killed thread held malloc's arena lock, the child's first malloc will deadlock. Use posix_spawn or only call async-signal-safe functions between fork and exec.
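A minimal sketch of the posix_spawn alternative (error handling trimmed; illustrative only):

```c
/* spawn_demo.c — posix_spawn avoids the fork-after-threads trap */
#include <spawn.h>
#include <stdio.h>
#include <sys/wait.h>

extern char **environ;

int main(void) {
    pid_t pid;
    char *argv[] = { "ls", "-la", NULL };
    /* NULL file_actions/attrp: child inherits FDs and signal dispositions */
    int rc = posix_spawnp(&pid, "ls", NULL, NULL, argv, environ);
    if (rc != 0) { fprintf(stderr, "posix_spawnp failed: %d\n", rc); return 1; }
    int status;
    waitpid(pid, &status, 0);
    if (WIFEXITED(status))
        printf("child exited with %d\n", WEXITSTATUS(status));
    return 0;
}
```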
Context Switch Cost
A context switch is the kernel replacing the executing thread's saved state with another thread's. The direct cost is saving and restoring registers. The indirect cost — cache and TLB pollution — often dwarfs it.
Direct cost breakdown
| Component | Cycles (approx) | Time @ 3 GHz | Notes |
|---|---|---|---|
| Trap into kernel (syscall/IRQ) | ~100 | 33 ns | swapgs, push rsp, re-target IDT |
| Save GPRs + FPU/AVX | ~200 | 66 ns | AVX-512 alone: 512 bytes/thread |
| scheduler_tick + pick_next_task | ~500–2000 | 0.2–0.7 µs | CFS RB-tree O(log n) |
| switch_mm (different address space) | ~300 | 100 ns | CR3 write + possible TLB flush |
| Restore registers + iret | ~200 | 66 ns | Mirror of save |
| Direct total | ~1,500–5,000 | 1–2 µs | |
Indirect cost: what you pay after the switch
- TLB flush: a cross-address-space switch on non-PCID CPUs flushes the TLB entirely; the first few thousand memory references miss. PCID (CR4.PCIDE) tags entries by ASID so the cost is near-zero.
- Cache pollution: T2 evicts T1's working set from L1/L2. When T1 resumes, cold misses re-fetch from L3 or RAM. Measured indirect cost for unrelated workloads: 10–100 µs.
- Branch-predictor state: Spectre mitigations (IBPB, STIBP) on cross-trust-domain switches can add ~5 µs on older CPUs.
On modern hardware a same-process thread-to-thread switch costs ~1 µs direct. An epoll-based event loop on one core is cheap because it never switches. A thread-per-request server with 10K threads doing 50K switches/sec burns ~5% of one core just on kernel bookkeeping, plus cache pollution.
IPC Mechanisms
Processes do not share memory by default, so they communicate through kernel-mediated channels:
| Mechanism | Direction | Latency / throughput | Best for |
|---|---|---|---|
| Anonymous pipe | Parent→child, 1-way | ~1 µs / 2 GB/s | Shell command pipelines |
| Named pipe (FIFO) | Any-any, 1-way | ~1 µs / 2 GB/s | Unrelated process rendezvous |
| UNIX-domain socket | Bidirectional, datagram or stream | ~2 µs / 1–3 GB/s | Local RPC, FD passing |
| Shared memory (mmap / SysV) | Any-any | memory speed / 20+ GB/s | Zero-copy bulk data (needs your own sync) |
| Message queue (POSIX) | Any-any | ~2 µs | Structured bounded-priority messages |
| Signal | Any-any, 1 bit | <1 µs | Notify only — never carry data |
| eventfd / pipe2 | Self-wake | <1 µs | Cross-thread wake via epoll |
Shared memory is the fastest IPC, but offers no synchronization. You still need a futex / semaphore / atomic in that shared region to coordinate writes. Pipes and sockets trade throughput for built-in flow control and blocking semantics.
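As a sketch of that pairing, an anonymous MAP_SHARED region plus a C11 atomic lets a parent and child count without lost updates (illustrative, not a benchmark):

```c
/* shm_counter.c — shared mapping + atomic as the in-region sync primitive */
#include <stdatomic.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    /* MAP_SHARED|MAP_ANONYMOUS: one physical page visible to parent and child */
    atomic_int *ctr = mmap(NULL, sizeof(atomic_int), PROT_READ|PROT_WRITE,
                           MAP_SHARED|MAP_ANONYMOUS, -1, 0);
    atomic_init(ctr, 0);
    pid_t pid = fork();
    for (int i = 0; i < 100000; ++i)
        atomic_fetch_add(ctr, 1);          /* the atomic the text calls for */
    if (pid == 0) _exit(0);
    wait(NULL);
    printf("%d\n", atomic_load(ctr));      /* 200000 — no lost updates */
    munmap(ctr, sizeof(atomic_int));
    return 0;
}
```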
Memory & Virtual Memory
How an MMU translates virtual to physical, what a page-table entry actually contains, and the TLB path that keeps it fast.
Virtual Address Spaces
Each process sees a private, contiguous virtual address space. x86_64 uses 48-bit canonical addresses (256 TiB) split into a user half (0x0000_0000_0000_0000 .. 0x0000_7fff_ffff_ffff) and a kernel half (0xffff_8000_0000_0000 and up). Addresses in between are non-canonical and fault on use. 5-level paging raises the limit to 57 bits (128 PiB), gated by CR4.LA57.
Why bother with virtual memory? Three wins:
- Isolation — process A cannot read/write process B's pages because their page tables point to different physical frames.
- Sparseness — a process claims 256 TiB of logical space and only pays for pages it touches (the kernel populates page-table entries lazily on fault).
- Relocation & sharing — shared libraries and COW pages live in physical memory once but are mapped into many virtual spaces.
Candidates love to say “virtual memory lets you use more than RAM.” True, but swap-based paging to disk is the least important win today — servers rarely swap. The isolation and zero-copy (mmap, shared libraries, fork COW) wins matter every millisecond.
Paging & Page-Table Entries
Memory is divided into fixed-size pages (4 KiB default; 2 MiB and 1 GiB “huge pages” available). The MMU translates a virtual page number (VPN) to a physical frame number (PFN) by walking the multi-level page table anchored at CR3.
x86_64 4-level paging (48-bit virtual address): 9-bit PML4 index → 9-bit PDPT index → 9-bit PD index → 9-bit PT index → 12-bit page offset.
Page-Table Entry (4 KiB page, x86_64) — the bits that matter
| Bit | Name | Meaning |
|---|---|---|
| 0 | P (Present) | 1 = mapped to a frame. 0 = fault on access (paged out, lazy alloc, COW). |
| 1 | R/W | 0 = read-only (fault on write — used for COW and .text). |
| 2 | U/S | 0 = supervisor-only (kernel pages). |
| 3 | PWT | Page Write-Through caching. |
| 4 | PCD | Page Cache Disable (MMIO / DMA regions). |
| 5 | A (Accessed) | Set by hardware on any reference; cleared by kernel for clock replacement. |
| 6 | D (Dirty) | Set by hardware on write; drives page-cache writeback. |
| 7 | PS (Page Size) | 1 at the PD level = 2 MiB huge page; at PDPT level = 1 GiB. |
| 8 | G (Global) | Survives CR3 reload (kernel mappings). |
| 12-51 | PFN | Physical frame number (40 bits × 4 KiB = 4 PiB). |
| 63 | XD (NX) | 1 = no-execute; kills buffer-overflow shellcode on the stack. |
A 4-level walk reads 4 page-table entries. Each read is a cache line (64 B). Without a TLB a user-space load would cost ~5 memory references. The TLB collapses that to 1 when the PTE is cached.
Virtual → Physical Translation (with TLB miss walk)
The hot path checks the TLB; on miss, the hardware page-table walker (the PMH on Intel) walks all four levels. As a worked example, consider a load of address 0x7FFE_1234_5ABC — the sketch below splits it into its four indices plus offset.
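A small C sketch of the split (index values verified by hand):

```c
/* va_decompose.c — split a 48-bit virtual address into page-table indices */
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint64_t va = 0x7FFE12345ABCULL;
    printf("PML4 %llu\n", (unsigned long long)((va >> 39) & 0x1FF)); /* 255 */
    printf("PDPT %llu\n", (unsigned long long)((va >> 30) & 0x1FF)); /* 504 */
    printf("PD   %llu\n", (unsigned long long)((va >> 21) & 0x1FF)); /* 145 */
    printf("PT   %llu\n", (unsigned long long)((va >> 12) & 0x1FF)); /* 325 */
    printf("off  0x%llx\n", (unsigned long long)(va & 0xFFF));       /* 0xabc */
    return 0;
}
```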
Cost of a miss
- Best case — all four PTEs in L1d: ~4 cycles × 4 + transit = ~20 ns.
- Typical case — PTEs in L2/L3: ~100–200 ns for the walk.
- Worst case — PTEs in RAM or not present: a full page fault is tens of microseconds; swapping from disk is milliseconds.
TLB & Miss Handling
The Translation Lookaside Buffer is a small, fully or set-associative cache of recent VPN → PFN mappings. Modern CPUs have a split L1 TLB (dTLB + iTLB), unified L2 TLB, and separate large-page TLBs:
| TLB | Intel Skylake | AMD Zen 4 | Latency |
|---|---|---|---|
| L1 dTLB (4 KiB) | 64 entries | 72 entries | 1 cycle |
| L1 iTLB (4 KiB) | 128 entries | 64 entries | 1 cycle |
| L1 dTLB (2 MiB) | 32 entries | 64 entries | 1 cycle |
| L1 dTLB (1 GiB) | 4 entries | 24 entries | 1 cycle |
| L2 STLB (unified) | 1,536 entries | 3,072 entries | 7 cycles |
Who flushes the TLB?
- Writing CR3 (cross-address-space switch) flushes all non-Global entries — unless PCID (CR4.PCIDE=1) tags entries by ASID.
- INVLPG addr flushes one VPN — issued after a PTE edit. INVPCID invalidates by PCID (kernel path).
- Spectre-style mitigations (IBPB) invalidate the branch predictor, not the TLB.
- A multi-core TLB flush is a “shootdown” — one core IPIs the others to run their own INVLPG; this is why large unmap operations on big multicore machines are expensive.
A random-access working set larger than TLB-reach murders throughput even if it fits in L1d cache. Reach = entries × page size. 64 × 4 KiB = 256 KiB reach. Going to 2 MiB huge pages turns 64 entries into 128 MiB of reach — often a 2–3× speedup on graph / OLAP workloads.
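A sketch of the huge-page hint (Linux-specific; MADV_HUGEPAGE asks for transparent 2 MiB pages with no guarantee):

```c
/* thp_hint.c — widen TLB reach by backing a big range with 2 MiB pages */
#include <sys/mman.h>

int main(void) {
    size_t len = 1UL << 30;   /* 1 GiB working set */
    void *p = mmap(NULL, len, PROT_READ|PROT_WRITE,
                   MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return 1;
    madvise(p, len, MADV_HUGEPAGE);   /* hint: one dTLB entry now maps 2 MiB,
                                         not 4 KiB, where the kernel obliges */
    /* ... touch the range; random access now misses the TLB far less ... */
    munmap(p, len);
    return 0;
}
```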
Segmentation & mmap
Historic segmentation (Intel 80286/386) sliced the address space into length-limited segments with distinct CS/DS/SS/ES bases. x86_64 reduced it to a relic: all segments have base=0, limit=infinity. Only fs and gs bases survive as pointers to per-thread / per-CPU data (TLS).
Instead, Linux exposes virtual memory areas (VMAs) through mmap. Each VMA has a base, length, protection (r/w/x), and backing object (file, dev/zero, or anonymous).
```c
/* memory-map a 64 MiB file, read lazily by page fault */
#include <sys/mman.h>
#include <fcntl.h>

int fd = open("data.bin", O_RDONLY);
void *base = mmap(NULL, 64*1024*1024, PROT_READ, MAP_PRIVATE, fd, 0);

/* touching base[0] triggers a page fault; the kernel populates the page
   from the page cache (or reads it in). zero-copy for reads. */
madvise(base, 64*1024*1024, MADV_RANDOM);   /* access-pattern hint */
```
Anonymous vs file-backed
- Anonymous (MAP_ANONYMOUS): backed by swap, zero-filled on first touch. Used by malloc's large allocations and per-thread stacks.
- File-backed: pages are populated from the file on fault; dirty pages are flushed back on writeback or msync. Shared mappings (MAP_SHARED) propagate writes to the file and to other mappers.
Page Fault Handling
A #PF is raised when the MMU sees P=0, a permission violation, or a non-canonical address. The kernel's do_page_fault classifies the fault by looking up the faulting address in the VMA list:
| Classification | What happened | Kernel action |
|---|---|---|
| Minor (lazy-alloc) | VMA exists; no physical frame yet | Allocate a zeroed frame; install PTE; return — a few µs |
| Minor (file cache) | File page already in page cache | Install PTE pointing at cached frame — a few µs |
| Minor (COW) | Write to r/o COW page (post-fork) | Copy page to new frame, mark PTE writable |
| Major | Page must be read from disk/swap | I/O; context switches; ~ms on HDD, ~10–100 µs on NVMe |
| Invalid (SIGSEGV) | Address not in any VMA, or W to r/o non-COW | Deliver SIGSEGV to the thread |
“What is a page fault?” is a trick question if you only answer “page not in RAM.” The interviewer wants: most page faults are minor — the page is either already in the buffer cache or needs to be zero-allocated. Only a tiny fraction are major (swap-in or file read), and those dominate the cost.
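A quick way to see the minor/major split on Linux is getrusage (a sketch; counts will vary by system):

```c
/* fault_count.c — observe minor vs major fault counters */
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <sys/resource.h>

int main(void) {
    struct rusage ru;
    char *p = malloc(64 << 20);   /* 64 MiB, mapped lazily */
    memset(p, 1, 64 << 20);       /* first touch: one minor fault per page */
    getrusage(RUSAGE_SELF, &ru);
    printf("minor %ld major %ld\n", ru.ru_minflt, ru.ru_majflt);
    free(p);
    return 0;
}
```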
CPU Scheduling
FCFS, SJF/SRTF, Round Robin, MLFQ, CFS — each run on the same five-process workload so the results compare directly.
Goals & Metrics
A scheduler picks which ready thread gets the CPU next. It is pulled in four directions that conflict:
- Throughput — jobs completed per unit time. Maximized by minimizing context switches.
- Turnaround time — finish − arrival. Shorter is better for batch jobs.
- Waiting time — time spent in the ready queue. A proxy for perceived latency.
- Response time — time until the job first runs after arrival. Critical for interactive workloads.
- Fairness — every thread of equal priority makes comparable progress. Violated by starvation.
Two orthogonal knobs:
- Preemptive vs non-preemptive — can the scheduler forcibly take the CPU back?
- Priority vs time-slice — who gets to run, or for how long?
Throughput and response time are enemies. Long quanta amortize switch cost (high throughput) but make interactive tasks wait. Short quanta keep latency low but waste cycles on context switches. Every modern scheduler blends both.
Benchmark Workload
To compare schedulers apples-to-apples, we run the same five processes through each algorithm:
| Process | Arrival time | CPU burst | Priority (low = high) |
|---|---|---|---|
| P1 | 0 | 7 | 3 |
| P2 | 2 | 4 | 1 |
| P3 | 4 | 1 | 4 |
| P4 | 5 | 4 | 2 |
| P5 | 6 | 6 | 5 |
Total work = 22 units. Below we derive waiting time (how long each process sat in the ready queue) and turnaround time (completion − arrival) for each scheduler.
FCFS (First-Come First-Served)
Non-preemptive. Runs jobs in arrival order. Simple to implement, but prone to the convoy effect: a 10-second burst arriving at time 0 delays every subsequent 1-ms job behind it.
Derivation of averages
| Process | Start | Finish | Waiting (start − arrival) | Turnaround (finish − arrival) |
|---|---|---|---|---|
| P1 | 0 | 7 | 0 | 7 |
| P2 | 7 | 11 | 5 | 9 |
| P3 | 11 | 12 | 7 | 8 |
| P4 | 12 | 16 | 7 | 11 |
| P5 | 16 | 22 | 10 | 16 |
| avg | | | 5.8 | 10.2 |
```c
/* FCFS dispatcher (kernel-style pseudocode in C) */
struct task *fcfs_pick_next(struct runqueue *rq) {
    if (list_empty(&rq->tasks))
        return NULL;
    return list_first_entry(&rq->tasks, struct task, run_list);
}

void fcfs_enqueue(struct runqueue *rq, struct task *t) {
    list_add_tail(&t->run_list, &rq->tasks);  /* append in arrival order */
}
```
FCFS is terrible for interactive workloads but excellent for throughput-only batch: minimal bookkeeping, no switches while a job runs. Real systems use FCFS only as a deep-background idle class.
SJF / SRTF (Shortest Job First)
Non-preemptive SJF picks the ready job with the smallest CPU burst. Preemptive SJF = SRTF (Shortest Remaining Time First): if a shorter job arrives, kick the current one out. Provably optimal for mean waiting time — but requires knowing burst length, which production systems must estimate via exponential averaging.
Derivation of averages
| Process | Finish | Waiting | Turnaround |
|---|---|---|---|
| P1 | 16 | 9 | 16 |
| P2 | 7 | 1 | 5 |
| P3 | 5 | 0 | 1 |
| P4 | 11 | 2 | 6 |
| P5 | 22 | 10 | 16 |
| avg | | 4.4 | 8.8 |
SJF/SRTF can starve long jobs indefinitely if short jobs keep arriving. Production schedulers (MLFQ, CFS) add aging: a long-waiting job's effective priority rises until it has to run.
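The burst estimator mentioned above is exponential averaging: τ′ = α·t + (1−α)·τ, where t is the last observed burst and τ the previous prediction. A sketch with invented sample bursts:

```c
/* burst_estimate.c — classic SJF burst predictor */
#include <stdio.h>

int main(void) {
    double tau = 10.0, alpha = 0.5;           /* initial guess, smoothing */
    double bursts[] = { 6, 4, 6, 4, 13, 13, 13 };
    for (int i = 0; i < 7; ++i) {
        printf("predicted %.2f, observed %.0f\n", tau, bursts[i]);
        tau = alpha * bursts[i] + (1 - alpha) * tau;   /* blend in new burst */
    }
    return 0;
}
```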
Round Robin (RR)
Preemptive. Every ready process gets a fixed time quantum q in FIFO order. When q expires the job goes to the tail of the queue. q too small → high switch overhead; too large → degenerates into FCFS.
Below we simulate RR with q = 2. Arrivals enter the queue tail as they appear; when an arrival coincides with a quantum expiry, the arrival is enqueued ahead of the preempted task.
Derivation (q=2)
| Process | Finish | Waiting | Turnaround |
|---|---|---|---|
| P1 | 20 | 13 | 20 |
| P2 | 9 | 3 | 7 |
| P3 | 7 | 2 | 3 |
| P4 | 17 | 8 | 12 |
| P5 | 22 | 10 | 16 |
| avg | | 7.2 | 11.6 |
```c
/* Round-robin scheduler skeleton */
struct task *rr_pick_next(struct runqueue *rq) {
    return list_first_entry(&rq->tasks, struct task, run_list);
}

void rr_tick(struct task *t, struct runqueue *rq) {
    if (--t->quantum == 0) {
        t->quantum = RR_QUANTUM;                  /* e.g. 4 ms */
        list_move_tail(&t->run_list, &rq->tasks);
        schedule();                               /* force a resched */
    }
}
```
Classic rule of thumb: pick q so that ~80% of CPU bursts are shorter than q. Linux's CFS dynamically sizes its equivalent via sched_latency_ns (default 24 ms) divided across active tasks, bounded below by sched_min_granularity_ns (3 ms).
Multi-Level Feedback Queue (MLFQ)
Several priority queues (Q0 highest .. Qn lowest). A job starts in Q0. If it uses its full quantum it drops to Q(i+1). If it voluntarily yields (I/O) before the quantum ends, it stays — interactive jobs float near the top, CPU-bound ones sink. Periodic priority boost at interval S raises everyone back to Q0 to prevent starvation.
Rules (Arpaci-Dusseau):
- If Priority(A) > Priority(B): A runs.
- If Priority(A) == Priority(B): round-robin.
- New job enters at the highest priority.
- If a job uses its entire time allotment at a level, it drops.
- After time S, move all jobs to the topmost queue (the boost).
Without priority boosting, a CPU-hog can game MLFQ by issuing a fake I/O right before its quantum expires — it stays at the top forever. Fix: account total time at a level (allotment), not per-quantum.
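A toy sketch of rules 4 and 5 (allotment-based demotion plus a periodic boost; the constants and task layout are invented):

```c
/* mlfq_tick.c — MLFQ demotion with per-level allotment, plus boost */
#include <stdio.h>

#define LEVELS 3
#define ALLOT  4    /* ticks a task may consume per level */
#define BOOST  50   /* rule 5: boost interval S, in ticks */

struct task { const char *name; int level; int used; };

static void demote_on_tick(struct task *t) {
    if (++t->used >= ALLOT && t->level < LEVELS - 1) {
        t->level++;          /* rule 4: burned the full allotment, drop a level */
        t->used = 0;
    }
}

static void boost(struct task *tasks, int n) {
    for (int i = 0; i < n; ++i) {   /* rule 5: everyone back to Q0 */
        tasks[i].level = 0;
        tasks[i].used  = 0;
    }
}

int main(void) {
    struct task cpu_hog = { "hog", 0, 0 };
    for (int tick = 1; tick <= 100; ++tick) {
        demote_on_tick(&cpu_hog);           /* hog never yields, so it sinks */
        if (tick % BOOST == 0) boost(&cpu_hog, 1);
        if (tick % 10 == 0)
            printf("tick %3d: level %d\n", tick, cpu_hog.level);
    }
    return 0;
}
```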
CFS — Linux Completely Fair Scheduler
CFS picks the runnable task with the smallest vruntime (virtual runtime). Tasks sit in a red-black tree indexed by vruntime; insertion and removal are O(log n), and the leftmost (minimum) node is cached so picking it is O(1). “Fair” = every task makes the same vruntime progress over sched_latency.
Nice values map to weights: nice 0 → 1024, each nice step multiplies by ~1.25. Effective time-slice = sched_latency × (weight / total_weight), so a higher-weighted task runs longer slices and accumulates vruntime slower:
```c
/* Simplified CFS vruntime update (kernel/sched/fair.c) */
static void update_curr(struct cfs_rq *cfs_rq)
{
    struct sched_entity *curr = cfs_rq->curr;
    u64 now = rq_clock_task(rq_of(cfs_rq));
    u64 delta_exec = now - curr->exec_start;

    if ((s64)delta_exec <= 0)
        return;

    curr->exec_start = now;
    curr->vruntime += calc_delta_fair(delta_exec, curr);  /* weighted */
    update_min_vruntime(cfs_rq);
}
```
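Back-of-envelope slice math under those defaults (a sketch of the formula, not kernel code — the kernel uses a precomputed weight table rather than pow):

```c
/* slice_calc.c — CFS slice sizes from nice weights (weight ~ 1024 / 1.25^nice) */
#include <stdio.h>
#include <math.h>

int main(void) {
    double latency_ms = 24.0;               /* sched_latency */
    double w0 = 1024.0;                     /* nice 0 */
    double w5 = 1024.0 / pow(1.25, 5);      /* nice +5 -> ~335 */
    double total = w0 + w5;
    printf("nice 0: %.1f ms, nice +5: %.1f ms\n",
           latency_ms * w0 / total,         /* ~18 ms */
           latency_ms * w5 / total);        /* ~6 ms */
    return 0;
}
```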
CFS knobs
| Tunable | Default | Effect |
|---|---|---|
| sched_latency_ns | 24,000,000 | Target period in which every runnable task runs once |
| sched_min_granularity_ns | 3,000,000 | Floor on per-task slice (caps nr_tasks / latency) |
| sched_wakeup_granularity_ns | 4,000,000 | vruntime gap required before a wakee preempts current |
| sched_migration_cost_ns | 500,000 | Penalty for cross-CPU task migration; prevents thrashing |
Linux 6.6+ replaces CFS with EEVDF (Earliest Eligible Virtual Deadline First) — still a weighted-fair queue, but with a proper deadline-based ordering that reduces latency outliers.
You don't need to memorize every line of CFS. The idea to convey: vruntime is a normalized progress clock; CFS always picks the task that looks most “behind.” Nice values are weights, not priorities; the scheduler still runs every task, just at different rates.
Head-to-Head Summary
| Algorithm | Preempt | Pick | Avg wait | Avg TT | Starvation | Best for |
|---|---|---|---|---|---|---|
| FCFS | No | FIFO | 5.8 | 10.2 | No | Throughput batch |
| SJF/SRTF | Yes (SRTF) | min burst | 4.4 | 8.8 | Yes | Known workloads |
| Round Robin (q=2) | Yes | FIFO cycle | 7.2 | 11.6 | No | Time-sharing TTY |
| MLFQ | Yes | hi-prio Q | 5.4 | 9.8 | With boost | Mixed interactive+batch |
| CFS | Yes | min vruntime | 6.8 | 11.2 | No | Linux desktop+server |
SJF has the best average waiting time by construction. Real schedulers (MLFQ, CFS) sacrifice a point or two of wait for fairness and starvation-freedom — losing 2% of average waiting for a 100% guarantee of eventual progress is a bargain.
Synchronization
Mutexes, semaphores, monitors, and condition variables — with the bounded-buffer producer-consumer worked step-by-step.
Race Conditions & Atomicity
A race condition arises when two threads access the same memory location concurrently, at least one is a write, and there is no happens-before ordering between them. The canonical example is an unprotected counter:
```c
/* counter.c — the classic race */
int counter = 0;

void *work(void *arg) {
    for (int i = 0; i < 1000000; ++i)
        counter++;            /* load, inc, store — 3 instructions */
    return NULL;
}
/* 4 threads: expected 4,000,000; observed 3,127,841 (lost updates). */
```
Each counter++ compiles to mov, inc, mov. Interleave two threads and a read can miss an update from another core. The fix is either a mutex or an atomic (hardware lock xadd).
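The atomic fix as a sketch — the same loop, but each increment is one indivisible read-modify-write:

```c
/* counter_fixed.c — compiles to a lock xadd on x86 */
#include <stdatomic.h>

atomic_int counter = 0;

void *work(void *arg) {
    for (int i = 0; i < 1000000; ++i)
        atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
    return NULL;
}
/* 4 threads now reliably total 4,000,000 — no interleaving can split the RMW. */
```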
Critical-section requirements (Dijkstra / Lamport)
- Mutual exclusion — at most one thread inside.
- Progress — no thread outside the section blocks entry.
- Bounded waiting — once you request entry, you wait for at most k entries by others.
On most architectures a long write is atomic if aligned, but a mis-aligned or > 8-byte store can tear. Use _Atomic/std::atomic; do not assume volatile buys you atomicity (it only suppresses compiler reordering of that variable).
Mutexes & Spinlocks
A mutex is a binary, owned lock: only the holder can unlock. Under contention the loser either spins (CPU-busy) or sleeps (kernel-parked).
| Lock kind | Uncontended cost | Contended cost | When to use |
|---|---|---|---|
| Spinlock | ~20 ns (atomic CAS) | Wastes CPU cycles waiting | Short critical sections, kernel IRQ paths |
| Futex-backed mutex (Linux) | ~30 ns CAS, no syscall | ~1 µs wake via FUTEX_WAKE | General user-space lock |
| Adaptive (pthread) | ~30 ns | Spins briefly, then parks | Mix of short and long holds |
| Reader-writer (RW) | ~50 ns read | Writer starvation risk | Read-heavy workloads |
```c
/* Test-and-set spinlock (toy, no back-off) */
typedef struct { _Atomic int flag; } spin_t;

static inline void spin_lock(spin_t *s) {
    while (atomic_exchange_explicit(&s->flag, 1, memory_order_acquire)) {
        while (atomic_load_explicit(&s->flag, memory_order_relaxed))
            __builtin_ia32_pause();   /* spin-hint */
    }
}

static inline void spin_unlock(spin_t *s) {
    atomic_store_explicit(&s->flag, 0, memory_order_release);
}
```
Why not just spin forever?
- Priority inversion — a high-priority spinner burns CPU while the low-priority holder is waiting for CPU; neither makes progress.
- Cache-line ping-pong — every spinner issues cache-line-exclusive requests; the lock line bounces between cores.
- Power — spinning defeats C-states; phones and laptops hate it.
Linux userspace uses futexes: the fast path is an atomic CAS entirely in userspace; only on contention does the loser call FUTEX_WAIT into the kernel.
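A simplified sketch of that fast/slow path, adapted from Drepper's “Futexes Are Tricky” (not production-ready; states: 0 = unlocked, 1 = locked, 2 = locked with waiters):

```c
/* futex_mutex.c — CAS fast path, FUTEX_WAIT slow path */
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

static atomic_int lockword = 0;

static void futex_lock(void) {
    int c = 0;
    atomic_compare_exchange_strong(&lockword, &c, 1);
    if (c == 0) return;                        /* fast path: one CAS, no syscall */
    if (c != 2)
        c = atomic_exchange(&lockword, 2);     /* mark contended */
    while (c != 0) {                           /* slow path: sleep in the kernel */
        syscall(SYS_futex, &lockword, FUTEX_WAIT, 2, NULL, NULL, 0);
        c = atomic_exchange(&lockword, 2);     /* 0 means we now own the lock */
    }
}

static void futex_unlock(void) {
    if (atomic_exchange(&lockword, 0) == 2)    /* waiters present? */
        syscall(SYS_futex, &lockword, FUTEX_WAKE, 1, NULL, NULL, 0);
}
```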
An uncontended pthread_mutex_lock + unlock is ~30 ns (one atomic CAS to lock, one atomic store to unlock). A contended lock with a single waiter costs ~2 µs (syscall round-trip). A ticket lock under heavy contention can cost 10+ µs per handoff due to cache-line bouncing.
Semaphores — Counting & Binary
A semaphore holds an integer. wait/P(s) decrements (blocking if it would go negative); signal/V(s) increments (waking one waiter). Dijkstra's original design is the seed of most sync primitives.
- Binary semaphore — values in {0,1}. Looks like a mutex but anyone can signal it (no ownership). Good for signalling.
- Counting semaphore — models a pool of N resources (connection pool, bounded buffer slots).
```c
/* POSIX sem_t — counting semaphore bounding a connection pool */
#include <semaphore.h>

static sem_t slots;

void pool_setup(void) {
    sem_init(&slots, 0, 16);   /* 16 connections available */
}

void *handler(void *arg) {
    sem_wait(&slots);          /* P: block if 0 */
    conn_t c = pool_take();
    serve(c);
    pool_return(c);
    sem_post(&slots);          /* V: wake one waiter */
    return NULL;
}
```
Semaphores solve the “signal-before-wait” race that bare condition variables have — the count is preserved. But using a binary semaphore as a mutex loses ownership tracking: any thread can unlock, error detection (unlock-by-non-owner, recursive unlock) is lost.
Monitors & Condition Variables
A monitor is a higher-level construct: an object whose methods are mutually exclusive (implicit mutex) and whose threads rendezvous via condition variables. Java's synchronized + wait/notify is a direct descendant. POSIX models it as a pthread_mutex_t paired with a pthread_cond_t.
CV operations:
- wait(mutex) — atomically release the mutex and block. On wake, reacquire the mutex before returning.
- signal() / notify() — wake one waiter (no-op if none are waiting).
- broadcast() / notifyAll() — wake all waiters (they contend for the mutex).
Wait semantics: Hoare vs Mesa
| | Hoare | Mesa (Java, POSIX) |
|---|---|---|
| Signal hands off mutex directly to signalled thread | Yes — signaller is suspended | No — signaller keeps running |
| Must re-check predicate on wake? | No | Yes — always loop on while |
| Implementation burden | Heavy (complex scheduler) | Light |
On Mesa-semantics systems a spurious wakeup is legal, and even without that, another thread may have grabbed the resource between the signal and your wake-up. Always re-test the predicate inside a loop: while (!ready) pthread_cond_wait(&cv, &m);
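The canonical Mesa-style pairing as a sketch:

```c
/* cv_wait.c — predicate loop on the waiter side, signal under the mutex */
#include <pthread.h>

static pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
static int ready = 0;

void *waiter(void *arg) {
    pthread_mutex_lock(&m);
    while (!ready)                   /* loop: spurious wakeups + lost races */
        pthread_cond_wait(&cv, &m);  /* atomically releases m and sleeps */
    /* ... consume the now-true condition, holding m ... */
    pthread_mutex_unlock(&m);
    return NULL;
}

void *setter(void *arg) {
    pthread_mutex_lock(&m);
    ready = 1;
    pthread_cond_signal(&cv);        /* wake one waiter; it re-acquires m */
    pthread_mutex_unlock(&m);
    return NULL;
}
```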
Bounded-Buffer Producer-Consumer (2P + 2C, size 4)
Two producers, two consumers, a circular buffer of 4 slots. Three semaphores coordinate: empty counts free slots (init 4), full counts produced items (init 0), and mutex protects buffer indices (binary, init 1).
```c
/* bounded_buffer.c — classic Dijkstra pattern */
#define N 4
item_t buf[N];
int in = 0, out = 0;
sem_t empty, full, mutex;

void buffer_setup(void) {
    sem_init(&empty, 0, N);   /* N free slots */
    sem_init(&full, 0, 0);    /* no items yet */
    sem_init(&mutex, 0, 1);   /* protects in/out */
}

void *producer(void *_) {
    for (;;) {
        item_t it = produce_next();
        sem_wait(&empty);     /* wait for at least one free slot */
        sem_wait(&mutex);     /* critical section */
        buf[in] = it;
        in = (in + 1) % N;
        sem_post(&mutex);
        sem_post(&full);      /* signal consumers */
    }
}

void *consumer(void *_) {
    for (;;) {
        sem_wait(&full);
        sem_wait(&mutex);
        item_t it = buf[out];
        out = (out + 1) % N;
        sem_post(&mutex);
        sem_post(&empty);
        consume(it);
    }
}
```
Order of acquiring semaphores matters
Both producer and consumer acquire their counting semaphore before the mutex. Reverse it and you deadlock: producer holds mutex while blocked on empty, consumer cannot acquire mutex to release a slot.
The key question isn't “can you code prod-cons?” but “why are there three semaphores?” Two counting semaphores give you flow control in both directions (full and empty). The mutex protects the shared indices. Collapsing any of the three creates either a race or a deadlock.
Deadlock
The four Coffman conditions, detection via resource-allocation graphs, prevention and avoidance, and the Banker's algorithm worked through a real allocation table.
The Four Coffman Conditions (1971)
All four must hold simultaneously for deadlock. Break any one and deadlock is impossible.
| # | Condition | Meaning | How you'd break it |
|---|---|---|---|
| 1 | Mutual exclusion | Resources cannot be shared | Make the resource virtual/shared (e.g., immutable data, optimistic reads) |
| 2 | Hold & wait | A process holds one resource and waits for another | Require all-at-once acquisition, or release before re-request |
| 3 | No preemption | Resources are only released voluntarily | Steal resources (e.g., DB abort-on-conflict with MVCC) |
| 4 | Circular wait | Cycle in the wait-for graph | Impose a global lock order; always acquire in that order |
The cheap prevention answer is always: sort lock IDs and acquire in order. This breaks circular wait. For dining philosophers, number the chopsticks 1–5 and have every philosopher pick up the lower-numbered first. The asymmetry breaks the cycle.
Canonical example — the opposite-direction lock order
```c
/* Thread A */                        /* Thread B */
pthread_mutex_lock(&m1);              pthread_mutex_lock(&m2);
pthread_mutex_lock(&m2); /* waits */  pthread_mutex_lock(&m1); /* waits — deadlock */
/* work */                            /* work */
pthread_mutex_unlock(&m2);            pthread_mutex_unlock(&m1);
pthread_mutex_unlock(&m1);            pthread_mutex_unlock(&m2);

/* Fix: both sides acquire in m1-then-m2 order. */
```
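A common shape for that fix is a helper that sorts locks into a canonical order before acquiring (a sketch; lock_pair is an invented name, and address order stands in for any global lock-ID order):

```c
/* lock_pair.c — break circular wait by acquiring in address order */
#include <pthread.h>
#include <stdint.h>

void lock_pair(pthread_mutex_t *a, pthread_mutex_t *b) {
    if ((uintptr_t)a > (uintptr_t)b) {   /* canonicalize the order */
        pthread_mutex_t *tmp = a; a = b; b = tmp;
    }
    pthread_mutex_lock(a);   /* every thread takes the lower address first, */
    pthread_mutex_lock(b);   /* so no cycle of waits can ever form */
}

void unlock_pair(pthread_mutex_t *a, pthread_mutex_t *b) {
    pthread_mutex_unlock(a); /* unlock order is irrelevant to deadlock */
    pthread_mutex_unlock(b);
}
```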
Resource Allocation Graph (RAG)
Model processes as circles, resources as squares. An edge from process → resource means “requests”; resource → process means “currently allocated to.” Deadlock in a single-instance system iff there is a cycle. With multiple instances, a cycle is necessary but not sufficient — check actual availability.
Detection via Cycle Finding
With a RAG in memory, detect deadlock by running DFS and looking for a back-edge in the directed graph. Kernel deadlock detectors (lockdep in Linux) keep an in-memory chain of lock-acquire-order pairs and flag any new edge that would complete a cycle, before the deadlock actually happens.
```python
# Simple DFS cycle-detection on a wait-for graph
def has_cycle(adj):
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {v: WHITE for v in adj}

    def dfs(u):
        color[u] = GRAY
        for v in adj[u]:
            if color[v] == GRAY:              # back-edge = cycle
                return True
            if color[v] == WHITE and dfs(v):
                return True
        color[u] = BLACK
        return False

    return any(dfs(v) for v in adj if color[v] == WHITE)
```
Recovery strategies once deadlock is detected
- Process termination — kill all deadlocked processes, or pick a victim by cost (total CPU spent, priority, resources held).
- Resource preemption — force-release a resource; requires rollback to a safe checkpoint (common in DBs via transaction abort).
- Operator intervention — log, alert, manual cleanup. Used when correctness trumps availability.
Prevention vs Avoidance
| | Prevention | Avoidance | Detection |
|---|---|---|---|
| When it acts | Statically: break one Coffman condition | Dynamically: refuse grants that could deadlock | After the fact: find cycles, recover |
| Needs runtime info | No | Max-resource declarations per process | Just the wait-for graph |
| Overhead | Low (design-time) | High (every grant runs the safety check) | Medium (periodic scan) |
| Downside | Conservative: may forbid safe patterns | Must know max claims up front | Deadlock happens; recovery cost non-zero |
| Example | Global lock order | Banker's algorithm | Linux lockdep, MySQL InnoDB wait-for graph |
Most production systems use a mix: static prevention where cheap (lock ordering) + post-hoc detection for dynamic allocators (DB row locks). Full avoidance (Banker's) is rare outside of embedded safety-critical code because nobody knows their maxima.
Banker's Algorithm — Avoidance by Safety Check
Dijkstra 1965. Given each process's maximum claim, grant a request only if the post-grant state is safe — i.e. there exists an ordering of processes that can all complete with current available resources.
Inputs (for n processes, m resource types):
- Max[n][m] — max units process i will ever need of resource j.
- Alloc[n][m] — units currently held.
- Need[i][j] = Max[i][j] − Alloc[i][j] — remaining demand.
- Available[m] — units of each resource currently free.
Worked example: 5 processes, 3 resource types (A, B, C)
Total resources: A=10, B=5, C=7. Initial state:
| Process | Alloc (A,B,C) | Max (A,B,C) | Need (A,B,C) |
|---|---|---|---|
| P0 | 0,1,0 | 7,5,3 | 7,4,3 |
| P1 | 2,0,0 | 3,2,2 | 1,2,2 |
| P2 | 3,0,2 | 9,0,2 | 6,0,0 |
| P3 | 2,1,1 | 2,2,2 | 0,1,1 |
| P4 | 0,0,2 | 4,3,3 | 4,3,1 |
| Sum alloc | 7,2,5 | | |
Available = Total − Sum alloc = (3, 3, 2). This state is safe: P1's need (1,2,2) fits in (3,3,2); once it finishes and returns its allocation, P3, P4, P0, and P2 can finish in turn — the safe sequence ⟨P1, P3, P4, P0, P2⟩.
Reference implementation
```python
# bankers.py — classic safety algorithm
def is_safe(avail, max_need, alloc):
    n, m = len(alloc), len(avail)
    need = [[max_need[i][j] - alloc[i][j] for j in range(m)] for i in range(n)]
    work = avail[:]
    finish = [False] * n
    order = []
    while True:
        progressed = False
        for i in range(n):
            if finish[i]:
                continue
            if all(need[i][j] <= work[j] for j in range(m)):
                for j in range(m):
                    work[j] += alloc[i][j]   # i finishes, returns its resources
                finish[i] = True
                order.append(i)
                progressed = True
        if not progressed:
            break
    return (all(finish), order)
```
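Feeding the worked table into is_safe (values copied from above) confirms the safe sequence:

```python
alloc = [[0,1,0], [2,0,0], [3,0,2], [2,1,1], [0,0,2]]
maxn  = [[7,5,3], [3,2,2], [9,0,2], [2,2,2], [4,3,3]]
avail = [3, 3, 2]

safe, order = is_safe(avail, maxn, alloc)
print(safe, order)   # True [1, 3, 4, 0, 2]  ->  P1, P3, P4, P0, P2
```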
Banker's algorithm is O(m·n²) per request. It also requires that every process declare its maximum claim up front — rarely true in real software. Consequence: Banker's is mostly a textbook topic; real kernels use prevention + detection, not avoidance.
File Systems
How disks are organized into inodes and directories, what a journal actually logs, and why LFS / COW file systems look so different.
Inodes & On-Disk Layout
UNIX file systems separate metadata (an inode) from the data blocks. An inode carries: mode, uid/gid, timestamps, size, refcount, and pointers to data blocks. Directories are just special files whose data is a table of name → inode.
On ext4 (4 KiB blocks, 4-byte pointers): 12 direct = 48 KiB, single indirect = 4 MiB, double = 4 GiB, triple = 4 TiB. ext4 also supports extents (base+length) so a 1 GiB contiguous file is 1 extent, not 262,144 pointers.
Directories, Hard Links & Path Resolution
A directory is a file whose data blocks contain (name, inode_number) records. ls reads that; stat then resolves each inode. The filesystem tree is a directed graph with hard links (more than one name per inode) and soft links (a file whose content is the path to traverse).
Hard vs soft links
| | Hard link | Soft / symbolic link |
|---|---|---|
| Refers to | Same inode | Path string |
| Cross filesystem? | No | Yes |
| Survives target rename? | Yes (unaffected) | Breaks (dangling) |
| Refcount bump? | Yes (inode nlink++) | No |
| Can target a directory? | No (only root can; forbidden in practice) | Yes |
```
# Observe hard vs soft links
$ echo hello > a
$ ln a hard          # hard link — same inode, nlink = 2
$ ln -s a soft       # symlink — separate inode
$ stat -c '%i %h %n' a hard soft
12345 2 a
12345 2 hard
12346 1 soft
```
Journaling — Crash Consistency without fsck
Without a journal, a crash during a multi-block write can leave the filesystem inconsistent (e.g., inode says 1 MiB but only 512 KiB of data blocks allocated). The journal is a circular log of transactions: group metadata edits, write them to the log first, then the main filesystem.
Phases of a journalled write (ext4 in data=ordered mode)
- Begin txn — gather all dirty inodes, bitmaps, and directory blocks for this syscall.
- Data write — write user data blocks to their final location. Wait for them to be durable.
- Journal log — write the metadata edits (inodes, bitmaps) to the journal with a descriptor block. Wait.
- Commit record — write a single commit block with checksum. This is the atomic point: either the commit is durable (txn replayable) or it isn't (txn will be discarded).
- Checkpoint — later, write the metadata to the main filesystem and mark the journal space reusable.
Journal modes
| Mode | What's logged | Data consistency | Perf |
|---|---|---|---|
| journal | Both metadata and data | Strongest — data never appears from an uncommitted txn | Lowest (2× writes) |
| ordered (default) | Metadata only, but data is forced out before commit | Post-crash: no stale bytes | Good balance |
| writeback | Metadata only; data can land any time | Post-crash: files may contain garbage | Fastest |
A journal guarantees consistency of the filesystem, not of your application. A database on ext4 still needs fsync + barriers + a WAL of its own, because the page cache may hold your latest writes even after the syscall returns.
LFS vs COW
Traditional (ext4, XFS) update data blocks in-place and use a journal for metadata. Two alternative designs eliminate in-place updates entirely:
Log-Structured FS (LFS / F2FS / NILFS)
- All writes (data + metadata) are appended to a log.
- An inode map (imap) tracks “where is the latest version of inode i?”
- Reads follow imap → inode → data blocks (same as regular FS).
- A segment cleaner compacts partially-dead segments to reclaim space.
- Ideal for flash: write-heavy, sequential, no RMW amplification.
Copy-on-Write (ZFS, btrfs, APFS)
- Never overwrite a live block. Every update writes to a new location and replaces pointers up the tree.
- A superblock atomically swings to the new root — all edits atomic.
- Cheap snapshots: retain an old root, its subtree is still reachable.
- End-to-end checksums on every block (optional) — catches silent bit rot.
- Write amplification: editing a byte may rewrite the path from root to leaf.
Journaled in-place: small writes are cheap, fragmentation low, recovery fast. COW: snapshots cheap, integrity strong, but metadata writes amplify. LFS: flash-friendly write pattern, cleaner overhead eats some of that win.
Crash Consistency & fsync
Even with a journal you must invoke fsync(fd) to tell the kernel “flush my data to durable storage.” Without it the page cache may hold your write for seconds. fsync on a directory ensures directory entries are durable (critical for rename-based atomic replace).
```c
/* Durable atomic file replace (POSIX) */
int fd = open("data.new", O_WRONLY|O_CREAT|O_TRUNC, 0644);
write(fd, payload, len);
fsync(fd);                      /* flush data */
close(fd);

rename("data.new", "data");     /* atomic on the same FS */

int dfd = open(".", O_RDONLY|O_DIRECTORY);
fsync(dfd);                     /* flush the dir entry, else the rename can be lost */
close(dfd);
```
I/O Models
Blocking, non-blocking, multiplexed (select/poll/epoll/kqueue), and async (io_uring) — plus the subtle epoll bugs that trip even senior engineers.
Blocking / Non-blocking / Sync / Async Matrix
Two orthogonal properties govern any read or write:
- Blocking vs non-blocking — does the call wait when data isn't ready?
- Synchronous vs asynchronous — does the caller do the I/O itself, or ask the kernel to do it and notify when done?
| | Sync | Async |
|---|---|---|
| Blocking | Classic read(): thread sleeps until data ready, then copies it | POSIX AIO (aio_read): rarely used, kernel threads behind it |
| Non-blocking | fcntl(O_NONBLOCK)+select/epoll: app polls for readiness, then copies itself | io_uring: app submits, kernel performs the I/O, app reaps completions |
The C10K problem (thread-per-connection collapses around 10K connections) was solved by readiness notification — epoll/kqueue. The C10M problem (10 million connections) is attacked by completion notification — io_uring — which avoids per-I/O syscall overhead entirely.
select / poll / epoll / kqueue
All four let a single thread monitor many FDs. They differ in how the FD set is conveyed and how scalable they are:
| API | Portability | Per-call cost | Max fds | Semantics |
|---|---|---|---|---|
| select | POSIX | O(FD_SETSIZE) bit scan | 1024 by default | Pass a bitset; the kernel copies it each call |
| poll | POSIX | O(n) array scan | ulimit | Pass pollfd[]; the kernel iterates all |
| epoll | Linux | O(ready) — returns only active fds | Tens of millions | Register once; the kernel queues ready events |
| kqueue | *BSD / macOS | O(ready) | Unbounded | Event-typed filter API (files, signals, timers) |
| IOCP | Windows | O(1) completion reap | Unbounded | Completion-based, not readiness-based |
```c
/* epoll event loop skeleton */
int ep = epoll_create1(0);
struct epoll_event ev = { .events = EPOLLIN|EPOLLET, .data.fd = sock };
epoll_ctl(ep, EPOLL_CTL_ADD, sock, &ev);

char buf[4096];
for (;;) {
    struct epoll_event evs[64];
    int n = epoll_wait(ep, evs, 64, -1);
    for (int i = 0; i < n; ++i) {
        int fd = evs[i].data.fd;
        if (evs[i].events & EPOLLIN) {
            /* ET mode: drain until EAGAIN */
            for (;;) {
                ssize_t r = read(fd, buf, sizeof(buf));
                if (r < 0 && errno == EAGAIN) break;
                if (r <= 0) { close(fd); break; }
                handle(fd, buf, r);
            }
        }
    }
}
```
Edge vs Level Triggered
Edge-triggered (EPOLLET) notifies once per readiness transition. You must loop and drain the fd until read returns EAGAIN. If you read exactly what epoll said was there and stop, you'll miss the rest — and epoll will not notify you again until another edge. The same rule applies to accept loops: pull all pending connections out of the listen queue.
io_uring — True Async I/O on Linux
Introduced in Linux 5.1 (2019). Two lock-free shared-memory ring buffers (SQ: submission, CQ: completion) between userspace and kernel. The app writes SQEs; the kernel consumes them, does the I/O, writes CQEs. Zero syscalls per I/O in the busy case (IORING_SETUP_SQPOLL polls the SQ from a kernel thread).
What it subsumes
- Every read/write/send/recv/accept/connect/openat/stat/close/fsync — submittable and batchable.
- Buffered, direct, and zero-copy (IORING_OP_SEND_ZC) I/O.
- Chained / linked SQEs: “open, then read, then close” as a single submission.
- Registered buffers + files: skip per-I/O reference counting.
For 4 KiB random reads on NVMe, epoll + pread tops out around ~600 K IOPS per core. io_uring with polled mode reaches ~3 M IOPS per core — ~5× the throughput at ~1/3 the CPU, mostly by eliminating syscall entry/exit and the copy_user cost.
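A minimal liburing sketch of the submit/reap cycle (assumes liburing is installed; link with -luring, error handling trimmed):

```c
/* uring_read.c — submit one read SQE, reap its CQE */
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>

int main(void) {
    struct io_uring ring;
    io_uring_queue_init(8, &ring, 0);                 /* SQ/CQ rings, depth 8 */

    int fd = open("data.bin", O_RDONLY);
    char buf[4096];
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0); /* read 4 KiB at offset 0 */
    io_uring_submit(&ring);                           /* one syscall, batchable */

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);                   /* reap the completion */
    printf("read %d bytes\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);
    io_uring_queue_exit(&ring);
    return 0;
}
```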
Common epoll / Event-Loop Bugs
Thundering herd — many worker threads all epoll_wait on the same listen fd. A connection arrives; all wake. Only one gets the socket; the rest find EAGAIN and go back to sleep. Fix: EPOLLEXCLUSIVE (kernel 4.5+) wakes only one waiter, or use SO_REUSEPORT with one accepter per thread.
Stale fd reuse — you close fd 7, the kernel immediately reuses it for a new socket, then your epoll event for “old fd 7” fires and you process bytes for the wrong connection. Fix: always epoll_ctl(EPOLL_CTL_DEL) before close, or associate a generation counter in epoll_event.data.u64.
ET on a blocking fd — with EPOLLET on a blocking fd, the drain loop's final read blocks forever instead of returning EAGAIN. Always pair ET with O_NONBLOCK.
EPOLLOUT discipline — the TCP send buffer fills and write returns EAGAIN. You must arm EPOLLOUT, queue the unsent bytes, and flush when writable — then disarm EPOLLOUT. Leaving EPOLLOUT armed causes a busy-loop of wakeups; see the sketch below.
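A sketch of that arm/flush/disarm discipline — queue_pending and flush_queue are hypothetical helpers for a per-fd outbound buffer:

```c
/* epollout.c — short-write handling with EPOLLOUT arm/disarm */
#include <sys/epoll.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

void queue_pending(int fd, const char *p, size_t n);  /* hypothetical */
int  flush_queue(int fd);   /* hypothetical: returns bytes still queued */

void try_send(int ep, int fd, const char *p, size_t n) {
    ssize_t w = write(fd, p, n);
    if (w < 0 && errno != EAGAIN) return;   /* real error: close + cleanup */
    size_t sent = (w < 0) ? 0 : (size_t)w;
    if (sent < n) {                         /* short write: buffer + arm */
        queue_pending(fd, p + sent, n - sent);
        struct epoll_event ev = { .events = EPOLLIN|EPOLLOUT|EPOLLET,
                                  .data.fd = fd };
        epoll_ctl(ep, EPOLL_CTL_MOD, fd, &ev);
    }
}

void on_writable(int ep, int fd) {
    if (flush_queue(fd) == 0) {             /* fully drained: disarm EPOLLOUT */
        struct epoll_event ev = { .events = EPOLLIN|EPOLLET, .data.fd = fd };
        epoll_ctl(ep, EPOLL_CTL_MOD, fd, &ev);
    }
}
```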
Concurrency Patterns
Futures, actors, CSP channels, and the event loop — the building blocks every modern async runtime picks from.
Futures & Promises
A future is a handle to a value that will arrive later. A promise is the write side of that pair — whoever computes the value completes the promise, which fulfils all futures derived from it. Semantically identical to a continuation-passing callback, but composable.
```python
# Python futures/async composition
import asyncio, aiohttp

async def fetch(session, url):
    async with session.get(url) as r:
        return await r.text()

async def main(urls):
    async with aiohttp.ClientSession() as s:
        # All fetches run concurrently; gather awaits them all.
        pages = await asyncio.gather(*(fetch(s, u) for u in urls))
        return [len(p) for p in pages]

asyncio.run(main(["https://a.example/", "https://b.example/"]))
```
Combinators you must know
- map — transform the eventual value.
- flatMap / then — chain another future that depends on this one's result.
- all / gather — wait for N futures, return a list of N results.
- race / any — return the first to succeed (or fail).
- timeout — cancel after a deadline.
If you call an async function and don't await it, its body never runs (Python coroutines) or runs detached (JS Promises with unhandled rejections). Both languages warn; treat these warnings as errors.
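Sketches of the race and timeout combinators in asyncio (the slow coroutine is an invented stand-in for real work):

```python
import asyncio

async def slow(tag: str, secs: float) -> str:
    await asyncio.sleep(secs)
    return tag

async def main() -> None:
    # race / any: first completion wins; cancel the stragglers
    done, pending = await asyncio.wait(
        [asyncio.create_task(slow("a", 0.1)), asyncio.create_task(slow("b", 5))],
        return_when=asyncio.FIRST_COMPLETED)
    for t in pending:
        t.cancel()
    print(done.pop().result())            # "a"

    # timeout: cancel after a deadline
    try:
        await asyncio.wait_for(slow("c", 5), timeout=0.2)
    except asyncio.TimeoutError:
        print("timed out")

asyncio.run(main())
```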
Actor Model — Isolation by Mailbox
Each actor is a lightweight process with private state and a mailbox. Communication is one-way messages: actor A sends to B's mailbox, B processes messages one at a time. Because state is private and access is serial, no locks required. Erlang/OTP, Akka, Elixir popularized it.
```python
# Actor-style counter with a single mailbox coroutine
import asyncio

async def counter_actor(inbox):
    n = 0
    while True:
        msg = await inbox.get()
        if msg == "inc":
            n += 1
        elif isinstance(msg, tuple) and msg[0] == "get":
            msg[1].set_result(n)      # reply through a future
        elif msg == "stop":
            return
```
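Driving it (a usage sketch): messages go into the mailbox, and reads come back through a future:

```python
async def main():
    inbox = asyncio.Queue()
    task = asyncio.create_task(counter_actor(inbox))
    await inbox.put("inc")
    await inbox.put("inc")
    reply = asyncio.get_running_loop().create_future()
    await inbox.put(("get", reply))
    print(await reply)          # 2
    await inbox.put("stop")
    await task

asyncio.run(main())
```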
Strengths / weaknesses
- + No shared state, no locks, compositional supervision trees.
- + “Let it crash” recovery — a supervisor restarts a failing actor.
- − Ordering only within a mailbox; cross-actor invariants require careful protocol design.
- − Bounded mailboxes or you risk unbounded memory growth under back-pressure.
CSP Channels (Hoare, 1978 → Go)
Communicating Sequential Processes: goroutines synchronize by sending values through typed channels, not shared memory. Unbuffered channels are rendezvous (sender blocks until receiver arrives); buffered channels act as a bounded queue.
```go
// Go: pipeline of goroutines via channels
package main

import "fmt"

func producer(ch chan<- int) {
	for i := 0; i < 5; i++ {
		ch <- i
	}
	close(ch)
}

func consumer(ch <-chan int, done chan<- bool) {
	for v := range ch {
		fmt.Println(v)
	}
	done <- true
}

func main() {
	ch := make(chan int, 2)
	done := make(chan bool)
	go producer(ch)
	go consumer(ch, done)
	<-done
}
```
Actor identity is who (mailbox address); CSP identity is what (channel). Actors model independent agents, CSP models typed streams. They are duals: an actor is a goroutine looping on its channel.
Event Loop — Single-threaded Concurrency
One thread running a FIFO of callbacks. Non-CPU work (network, disk, timers) is delegated to the kernel (epoll/kqueue) or a small threadpool (libuv); the loop only dispatches ready tasks. JS engines (Node, browsers) and Python's asyncio both work this way.
A single sync JSON.parse on a 50 MB blob blocks the loop for ~500 ms; every other connection stalls. Offload to a worker thread (worker_threads) or a child process. Rule: if any handler takes >10 ms, profile and move it off the loop.
async / await Mechanics
Syntactic sugar over state machines. The compiler translates each async def into a class with methods send, throw, close; each await becomes a suspension point. The runtime drives the state machine from an event loop.
```python
# Roughly what "async def f()" becomes at the bytecode level (a sketch)
class AwaitableF:
    def __init__(self):
        self.state = 0

    def send(self, val):
        if self.state == 0:
            self.state = 1
            return some_future            # suspend here
        elif self.state == 1:
            result = val                  # the future's value
            raise StopIteration(result)   # done
```
“Why can a single thread handle 100K connections?” Because every await parks the coroutine, freeing the thread to run another. The cost of a context switch in userspace (saving a few pointers) is ~20 ns, vs 1–5 µs for a kernel thread switch.
Memory Model & Atomics
Why the hardware reorders your loads and stores, how happens-before rules let you reason about it, and the ABA problem that ruins naive lock-free code.
Reordering & Visibility
Two threads on two cores see a different order of the same stores. This isn't a bug; it's the architecture. Both the compiler and the CPU freely reorder instructions that are independent within a single thread. Cross-thread visibility requires explicit synchronization.
Who reorders what?
| Layer | Can do | Defeat with |
|---|---|---|
| Compiler | Reorder, hoist, combine, reuse registers | volatile / std::atomic / inline-asm "memory" clobber |
| CPU (out-of-order) | Reorder independent loads/stores within the pipeline | Memory barriers / fences |
| Store buffer | Delay stores so loads go first (x86 allows store→load reordering) | mfence, lock-prefixed ops |
| Cache coherence | Propagation delay between cores | Acquire/release ordering and cache line protocol |
Hardware model strengths
- x86/x86_64 — TSO (Total Store Ordering). Stores don't reorder with earlier stores; loads don't reorder with earlier loads or stores. Only reordering allowed: a store followed by a load to a different address.
- ARM64 — relaxed. Any pair of independent memory ops can reorder. Requires explicit dmb / ldar / stlr.
- POWER — weaker still; must fence often.
Code that “just works” on x86 often breaks on ARM. Java (JMM), C++11 (std::memory_order), and Go provide portable memory models — if you use them correctly you're safe on any architecture.
Happens-Before — The Reasoning Primitive
Define a partial order “hb” over events in a program. If a hb b, then a's writes are visible when b executes. The rules are few and compose transitively:
- Program order — within a thread, earlier statements hb later ones.
- Unlock / Lock — an unlock of m hb every subsequent lock of m.
- Release / Acquire — an atomic store-release hb every atomic load-acquire that reads it.
- Volatile (Java) — a write to a volatile hb every subsequent read of that volatile.
- Thread start / join — t.start() hb everything in t; everything in t hb t.join() returning.
- Transitivity — a hb b and b hb c implies a hb c.
The JMM / C++ memory model is intimidating on paper, but “do the writes I did before the flag become visible to someone who reads the flag?” is the only question. Release on write, acquire on read, and you're home.
Acquire / Release / SeqCst
| Order | Meaning | Typical use |
|---|---|---|
| memory_order_relaxed | Atomicity only — no ordering | Perf counter increments |
| memory_order_acquire | No later loads/stores move before this load | Lock acquire, flag read |
| memory_order_release | No earlier loads/stores move after this store | Lock release, flag write |
| memory_order_acq_rel | Both — used on RMW (CAS, fetch_add) | Lock-free data structures |
| memory_order_seq_cst | Single total order across all SC ops | Default — safest but slowest |
```c
/* Dekker-style publication with release/acquire (C11) */
#include <stdatomic.h>
#include <stdio.h>

_Atomic int flag = 0;
int payload;

void producer(void) {
    payload = 42;   /* non-atomic write */
    atomic_store_explicit(&flag, 1, memory_order_release);
}

void consumer(void) {
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;           /* spin until published */
    printf("%d\n", payload);   /* guaranteed to see 42 */
}
```
On x86, memory_order_acquire/release cost zero additional instructions vs relaxed (x86's TSO already provides the ordering). seq_cst on a store requires an mfence, ~15–30 cycles. On ARM the difference is real: relaxed = ldr, acquire = ldar, seq_cst = dmb + op.
Fences & Barriers
A fence is a standalone instruction that restricts reordering around it without itself being a read or write. Use when you need ordering without an atomic op:
- Full fence (mfence / dmb ish): no load or store moves across.
- Load fence (lfence): earlier loads complete before later loads.
- Store fence (sfence): earlier stores complete before later stores.
- Compiler fence only: asm volatile("" ::: "memory") — prevents the compiler from reordering, but emits no hardware barrier.
The ABA Problem on a Lock-Free Stack
The canonical CAS-based pop is: read head, read head->next, CAS head from old to next. If between your read and CAS someone pops A, pops B, then pushes A back — head is still A, CAS succeeds, but head->next now points at freed memory.
Fixes
- Versioned pointers (double-word CAS) — store (ptr, version) in 128 bits; every pop increments the version. cmpxchg16b on x86.
- Hazard pointers (Michael, 2004) — before dereferencing a shared pointer, publish it as “in use”; reclaimers skip nodes announced by any thread.
- Epoch-based reclamation (crossbeam in Rust) — a node is freed only when no thread is in an older epoch.
- RCU (Linux kernel) — reclaim after a grace period in which every CPU has context-switched.
```c
/* Treiber stack pop with DWCAS versioned head (x86_64: needs -mcx16) */
#include <stdatomic.h>
#include <stdint.h>
#include <stddef.h>

struct node { int v; struct node *next; };
struct vhead { struct node *p; uint64_t ver; };

_Atomic(struct vhead) top;

struct node *pop(void) {
    struct vhead old, neu;
    do {
        old = atomic_load(&top);
        if (!old.p) return NULL;
        neu.p = old.p->next;
        neu.ver = old.ver + 1;     /* version bump defeats ABA */
    } while (!atomic_compare_exchange_weak(&top, &old, neu));
    return old.p;
}
```
When asked “what's wrong with this lock-free stack,” the expected answer is ABA, followed by one of the four remediations above. Bonus: mention that GC languages (Java, Go) don't see ABA on reference-typed slots because the old node can't be reclaimed while a thread still holds a reference.
Cheat Sheet & Quick Reference
One-page summaries, must-memorize latency numbers, rapid-fire Q&A, and a POSIX command matrix.
Module Summaries
Processes & Threads
- Process = address space + FDs; thread = registers + stack
- Linux: both are task_struct; threads share mm / files / sighand
- fork = COW address space; exec replaces the image
- Context switch: ~1 µs direct + 10–100 µs indirect (cache/TLB)
Memory / VM
- 4-level paging: 9+9+9+9+12 = 48-bit VA, 4 KiB pages
- PTE bits: P, R/W, U/S, A, D, G, PS, XD, PFN
- TLB miss = hardware walker reads 4 cache lines
- Huge pages extend TLB reach from 256 KiB to 128 MiB
Scheduling
- FCFS: simple, convoy effect
- SJF/SRTF: optimal mean wait, starvation risk
- RR: fair, q tuning = latency vs throughput
- MLFQ: adaptive priority + aging boost
- CFS: RB-tree of vruntime, weighted fair; EEVDF in 6.6+
Synchronization
- Mutex: owned lock; pthread = futex-backed
- Counting semaphore: pool / flow control
- CV: always loop on predicate (Mesa semantics)
- Prod-cons needs 3 sems: empty, full, mutex
Deadlock
- Coffman: mutex, hold&wait, no-preempt, circular-wait
- Break circular-wait with a global lock order
- Detect via DFS cycle in wait-for graph
- Banker's: O(mn²) safety check per request
File Systems
- Inode = metadata; directory = name → inode table
- Journal modes: journal > ordered (default) > writeback
- COW FS (ZFS/btrfs): cheap snapshots, write amplification
- Atomic file replace: write temp, fsync, rename, fsync dir
I/O Models
- select O(n), poll O(n), epoll O(ready), kqueue O(ready)
- ET requires non-blocking + drain-until-EAGAIN
- io_uring: shared rings, ~0 syscalls per I/O
- Watch out: fd reuse, thundering herd, EPOLLOUT busy-loop
Concurrency Patterns
- Futures: map / flatMap / all / race / timeout
- Actors: private state + mailbox; let-it-crash
- CSP: channels as typed queues (Go)
- Event loop: macrotask ↔ microtask drain per tick
Memory Model
- x86 = TSO (strong); ARM = relaxed
- Happens-before: PO, unlock/lock, release/acquire, start/join
- Release on publisher's write, acquire on consumer's read
- ABA defeated by versioned CAS, hazard ptrs, RCU, epochs
Latency Numbers Every OS Candidate Should Know
| Operation | Time (2024) | Relative |
|---|---|---|
| L1 cache reference | 0.5 ns | 1x |
| Branch mispredict | 3 ns | 6x |
| L2 cache reference | 4 ns | 8x |
| Mutex lock/unlock (uncontended) | 25 ns | 50x |
| Main memory reference | 100 ns | 200x |
| System call (simple) | 100 ns | 200x |
| TLB miss (full 4-level walk from RAM) | 200 ns | 400x |
| Page fault (minor, zero-fill) | 1 µs | 2,000x |
| Context switch | 1–5 µs | 2,000–10,000x |
| Compress 1 KB with Snappy | 2 µs | 4,000x |
| NVMe SSD random 4K read | 20 µs | 40,000x |
| Page fault (major, NVMe swap-in) | 20 µs | 40,000x |
| fsync on NVMe | 30 µs | 60,000x |
| fork() on 100 MB RSS | 450 µs | 900,000x |
| Round-trip in same DC | 0.5 ms | 1,000,000x |
| HDD seek | 10 ms | 20,000,000x |
| Page fault (major, HDD swap-in) | 10 ms | 20,000,000x |
Rapid-Fire Q&A
| # | Question | Answer |
|---|---|---|
| 1 | Process vs thread? | Process owns resources (AS, FDs); thread is a unit of scheduling (registers, stack). |
| 2 | What is copy-on-write? | Page tables point at shared read-only pages; first write faults and copies that page only. |
| 3 | Why is fork fast? | It copies page tables, not physical pages. COW defers actual copies until writes. |
| 4 | What's in a PCB? | PID, state, register context, memory map, FD table, signals, creds, scheduler state. |
| 5 | What does a context switch cost? | ~1 µs direct + 10–100 µs indirect (TLB/cache cold). |
| 6 | Why 4-level paging? | Supports sparse 48-bit virtual space without allocating 2^48 / 4 KiB = 2^36 PTEs upfront. |
| 7 | What happens on TLB miss? | Hardware walks PML4 → PDPT → PD → PT, fetching up to 4 cache lines; loads PTE into TLB. |
| 8 | How does COW detect the write? | PTE marked read-only; write triggers #PF; kernel copies page, updates PTE. |
| 9 | What is thrashing? | Working set exceeds RAM; CPU spends >50% on page faults instead of useful work. |
| 10 | Difference between a mutex and a semaphore? | Mutex owned; unlock only by holder. Counting semaphore: multi-resource; any thread can V. |
| 11 | Why use a CV and not a busy-wait? | Busy-wait burns CPU + cache. CV parks the thread in the kernel until signalled. |
| 12 | What are the Coffman conditions? | Mutual exclusion, hold & wait, no preemption, circular wait. |
| 13 | Best prevention technique in practice? | Global lock ordering — breaks circular wait. |
| 14 | When does SJF starve? | When short jobs keep arriving faster than a long job finishes; aging fixes it. |
| 15 | What does CFS pick? | The runnable task with the smallest weighted vruntime (leftmost RB-tree node). |
| 16 | Why is RR quantum tuning hard? | Too small: all overhead. Too large: degenerates to FCFS. |
| 17 | What is the convoy effect? | Many short jobs queue behind one long job under FCFS. |
| 18 | What is priority inversion? | Low-pri holds a lock needed by high-pri; medium-pri preempts low-pri indefinitely. Fix: priority inheritance. |
| 19 | Hoare vs Mesa CVs? | Hoare: signal hands off mutex directly. Mesa: signalled thread re-contends; always loop on predicate. |
| 20 | What is a spurious wakeup? | A thread wakes from cond_wait without a corresponding signal. Re-test predicate. |
| 21 | What is livelock? | Threads aren't blocked but make no progress; e.g., both backing off and retrying in lockstep. |
| 22 | Inode vs directory entry? | Inode = metadata + block pointers. Dirent = (name, inode) in a directory file. |
| 23 | Hard link limits? | Same filesystem only; inode link count is bounded; hard links to directories are disallowed on Linux (historically superuser-only on some Unixes). |
| 24 | What does fsync do? | Flushes dirty pages of the fd to durable storage and issues a FLUSH cache command. |
| 25 | Why fsync a directory? | Renames/creates are metadata; the directory file's updated block must be durable too. |
| 26 | What does a journal guarantee? | Filesystem metadata consistency after crash; not application data unless data=journal. |
| 27 | LFS vs COW? | LFS: log-append everything + cleaner. COW: copy on modify and swing root atomically. |
| 28 | What is select's main limit? | O(n) with FD_SETSIZE (1024). poll is O(n) unbounded. epoll is O(ready). |
| 29 | Why non-blocking for epoll ET? | ET fires once per transition; you must read until EAGAIN, which requires non-blocking. |
| 30 | What is io_uring's advantage? | Zero-copy shared rings; zero syscalls per I/O in polled mode; batching. |
| 31 | What are futures good at? | Composing async ops with map/flatMap/all/race without callback pyramids. |
| 32 | What is back-pressure? | Flow control so fast producers don't overwhelm slow consumers (bounded queues, pausing streams). |
| 33 | Actors vs CSP? | Actor identity = mailbox owner. CSP identity = channel. Dual formulations. |
| 34 | What is a microtask? | Promise callback / process.nextTick; drained fully between macrotasks. |
| 35 | Why is x86 easier to program? | TSO: only store→load reorderable. ARM/POWER: any independent pair may reorder. |
| 36 | Release / acquire pair meaning? | Release on publisher's write; acquire on reader's read — synchronizes-with for HB. |
| 37 | What is a data race? | Concurrent accesses to the same memory, at least one write, no HB ordering between them. |
| 38 | What is volatile NOT? | NOT atomic (only suppresses compiler reorder/elide). Use std::atomic for thread safety. |
| 39 | ABA — one-line summary? | CAS sees old value, but it was pop-pop-push-back; your "next" pointer is now dangling. |
| 40 | RCU in one sentence? | Readers lock-free; updater publishes a new version, frees old after a grace period. |
| 41 | What is a futex? | User-space fast-path atomic + kernel FUTEX_WAIT/WAKE syscall for the contended slow path (sketch after this table). |
| 42 | Why do threads share the signal table? | POSIX: signals are per-process, delivered to any thread whose mask permits it. |
| 43 | Zombie vs orphan? | Zombie: terminated, not reaped. Orphan: parent died, init adopts and reaps. |
| 44 | What does EINTR mean? | A syscall was interrupted by a signal; the caller must retry (unless SA_RESTART auto-restarts it). |
| 45 | mmap vs read for files? | mmap: zero-copy, page-fault driven, lazy. read: explicit, always a syscall + copy. |
| 46 | Why is pthread_mutex_trylock useful? | Avoids blocking; lets you fall back (try a different lock order, back off, do other work). |
| 47 | Time vs priority in SCHED_FIFO? | Highest fixed priority runs until it yields; no time-slicing — risk of starving others. |
| 48 | Huge pages — cost? | Require contiguous physical frames; hard to allocate on a fragmented host; transparent HP (THP) mitigates. |
| 49 | copy_file_range vs sendfile? | copy_file_range: file→file zero-copy. sendfile: file→socket zero-copy (older API). |
| 50 | What does a page cache buy you? | In-memory cache of file data; turns the second read into memory-speed; absorbs writes until flush. |
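A sketch of Q41's futex idea — user-space CAS fast path, kernel sleep only under contention. Deliberately simplified (a production lock tracks a "contended" state so the uncontended unlock can skip the wake syscall):

```c
/* Minimal futex-backed lock (the idea behind pthread_mutex on Linux). */
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

static atomic_int lock_word;           /* 0 = free, 1 = held */

static long futex(atomic_int *addr, int op, int val) {
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

void lock(void) {
    int expected = 0;
    while (!atomic_compare_exchange_weak(&lock_word, &expected, 1)) {
        /* kernel re-checks word == 1 before sleeping, so no lost wakeups */
        futex(&lock_word, FUTEX_WAIT, 1);
        expected = 0;                  /* retry the CAS after waking */
    }
}

void unlock(void) {
    atomic_store(&lock_word, 0);
    futex(&lock_word, FUTEX_WAKE, 1);  /* wake one waiter, if any */
}
```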
POSIX / Linux Command Matrix
| Task | Syscall | Header | Notes |
|---|---|---|---|
| Create process | fork, vfork, clone | unistd.h / sched.h | clone with flags builds threads or containers |
| Replace image | execve family | unistd.h | Only returns on failure |
| Reap child | wait, waitpid, waitid | sys/wait.h | Clears zombie; returns exit status |
| Allocate memory | brk, mmap | unistd.h / sys/mman.h | Anonymous mapping = malloc backing for big allocs |
| Map file | mmap | sys/mman.h | Shared or private; zero-copy access |
| Create thread | pthread_create | pthread.h | Wrapper around clone(CLONE_THREAD|CLONE_VM|…) |
| Mutex | pthread_mutex_* | pthread.h | Futex-backed on Linux |
| Condition variable | pthread_cond_* | pthread.h | Always used with a mutex, loop on predicate |
| Semaphore | sem_init/wait/post | semaphore.h | Counting; POSIX named or unnamed |
| Event-driven I/O | epoll_create1/ctl/wait | sys/epoll.h | Linux only; use kqueue on BSD/macOS |
| Async I/O (modern) | io_uring_setup + user lib liburing | liburing.h | Submission + completion rings |
| Open / close fd | open, close, openat | fcntl.h / unistd.h | O_NONBLOCK, O_CLOEXEC, O_DIRECT |
| Durability | fsync, fdatasync, sync_file_range | unistd.h | fsync on dir for rename atomicity |
| Rename / atomic swap | rename, renameat2 RENAME_EXCHANGE | stdio.h / fcntl.h | Atomic on same filesystem |
| Signals | sigaction, kill, pthread_sigmask | signal.h | Prefer sigaction; async-signal-safe funcs only in handlers |
| Memory protection | mprotect, madvise | sys/mman.h | Change perms, hint access pattern |
| IPC: shared mem | shm_open + mmap | sys/mman.h | POSIX SHM; prefer over SysV shmget |
| IPC: unix sockets | socket(AF_UNIX) | sys/socket.h | SOCK_STREAM or SOCK_SEQPACKET; FD passing via SCM_RIGHTS |
| Scheduling tuning | sched_setscheduler, nice, setpriority | sched.h | SCHED_OTHER (CFS), SCHED_FIFO, SCHED_RR, SCHED_DEADLINE |
| CPU pinning | sched_setaffinity | sched.h | Reduces migration cost; aligns with NUMA topology (sketch below) |
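A sketch of the last row — pinning the calling thread to one core (Linux-only; pin_to_cpu is a hypothetical helper):

```c
/* Pin the calling thread to a single CPU: no migrations, and the working
   set stays warm in that core's caches (pair with NUMA-local allocation). */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (sched_setaffinity(0, sizeof set, &set) != 0) {  /* pid 0 = this thread */
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}
```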