LLM Foundations
Decoder transformer mechanics. Tokens, context, sampling, KV cache.
Decoder-Only Transformer
- Tokens → embeddings + positional information (RoPE or ALiBi, applied inside attention rather than added to the embeddings) → N decoder blocks → unembed → softmax over vocab.
- Block = causal multi-head self-attention + FFN + residuals + RMSNorm/LayerNorm (minimal sketch after this list).
- Causal mask: token t attends only to ≤ t. Enables autoregressive next-token prediction.
- FFN often gated (SwiGLU) + 4× hidden dim. MoE = sparse FFN, k of N experts active per token.
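A minimal single-head sketch of one decoder block in NumPy, with untrained random weights and no RoPE, just to make the data flow concrete (pre-norm, causal mask, gated FFN, residuals). Dimensions and parameter names are illustrative, not taken from any specific model:

```python
# One pre-norm decoder block: RMSNorm -> causal attention -> residual ->
# RMSNorm -> SwiGLU FFN -> residual. Single head, random weights, no RoPE.
import numpy as np

d_model, d_ff, seq_len = 64, 256, 8
rng = np.random.default_rng(0)

def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def causal_attention(x, Wq, Wk, Wv, Wo):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    mask = np.triu(np.ones((len(x), len(x)), dtype=bool), k=1)  # True above diagonal
    scores[mask] = -np.inf                  # token t attends only to positions <= t
    return softmax(scores) @ v @ Wo

def swiglu_ffn(x, W_gate, W_up, W_down):
    gate = x @ W_gate
    silu = gate / (1 + np.exp(-gate))       # SiLU on the gate branch
    return (silu * (x @ W_up)) @ W_down

def decoder_block(x, p):
    x = x + causal_attention(rms_norm(x), p["Wq"], p["Wk"], p["Wv"], p["Wo"])
    x = x + swiglu_ffn(rms_norm(x), p["Wg"], p["Wu"], p["Wd"])
    return x

params = {name: rng.normal(size=shape) * 0.02 for name, shape in {
    "Wq": (d_model, d_model), "Wk": (d_model, d_model),
    "Wv": (d_model, d_model), "Wo": (d_model, d_model),
    "Wg": (d_model, d_ff), "Wu": (d_model, d_ff), "Wd": (d_ff, d_model),
}.items()}
hidden = rng.normal(size=(seq_len, d_model))
print(decoder_block(hidden, params).shape)  # (8, 64)
```

Real models stack dozens of such blocks with multi-head attention and learned weights; the pre-norm residual structure and the causal mask are the parts that carry over.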
Tokens & Tokenizers
- BPE / Unigram / SentencePiece. Vocab 32k–256k.
- Cost rule of thumb: 1 token ≈ 4 chars of English, ≈ 1 short word. Ratios vary by content: code and CJK text usually take more tokens per character (see the counting sketch after this list).
- Tokenizer mismatch = silent corruption. Always tokenize with model's exact tokenizer.
- Special tokens: BOS, EOS, system, tool-use markers. Don't include in user content.
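To sanity-check the rule of thumb against real text, a small counting sketch using tiktoken's `cl100k_base` encoding (an example choice; always count with the exact tokenizer of the model you call):

```python
# Token counts differ across vocabularies; these numbers only hold for cl100k_base.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["The quick brown fox.", "def add(a, b): return a + b", "東京タワー"]:
    tokens = enc.encode(text)
    print(f"{len(text):>3} chars -> {len(tokens):>2} tokens : {text!r}")
```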
Sampling Parameters
| Param | Effect | Default zone |
|---|---|---|
| temperature | Logit scaling pre-softmax. 0 = greedy, 2 = chaotic | 0.0 deterministic, 0.7 creative |
| top_p (nucleus) | Sample from the smallest token set whose cumulative prob ≥ p | 0.9–0.95 |
| top_k | Sample from k highest-prob tokens | 40–100 |
| frequency_penalty | Penalize tokens proportionally to how often they've already appeared | 0–0.5 |
| presence_penalty | Flat penalty on any token that has appeared at all | 0–0.5 |
| max_tokens | Output cap | set always; default unbounded burns $ |
| stop sequences | Halt generation on substring | structured output boundary |
For deterministic eval / extraction use temperature=0. For brainstorming / synthesis use 0.7–1.0. Don't stack a high temperature with a low top_p: the truncation largely cancels the temperature and the behavior becomes hard to predict; tune one knob at a time.
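A sketch of how these knobs combine on raw logits, in NumPy. The order shown (temperature, then top-k, then top-p) is typical, but exact order and defaults vary by inference stack:

```python
# Temperature scales logits; top-k and top-p then truncate the distribution
# before sampling. temperature=0 short-circuits to greedy argmax.
import numpy as np

def sample_next_token(logits, temperature=0.7, top_k=50, top_p=0.95, rng=None):
    rng = rng or np.random.default_rng()
    if temperature == 0:                        # greedy decoding: always the argmax
        return int(np.argmax(logits))
    scaled = logits / temperature               # temperature rescales the logits
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]             # token ids sorted by descending prob
    p_sorted = probs[order]
    p_sorted[top_k:] = 0.0                      # top-k: keep only the k most likely
    keep = np.searchsorted(np.cumsum(p_sorted), top_p) + 1
    p_sorted[keep:] = 0.0                       # top-p: smallest set with cum prob >= p
    p_sorted /= p_sorted.sum()                  # renormalize, then sample
    return int(order[rng.choice(len(order), p=p_sorted)])

vocab_logits = np.random.default_rng(0).normal(size=32_000)
print(sample_next_token(vocab_logits, temperature=0.0))   # deterministic pick
print(sample_next_token(vocab_logits, temperature=0.9))   # stochastic pick
```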
KV Cache
- During generation, cache the K and V tensors per layer for past tokens; each step then processes only the new token instead of recomputing the whole prefix from scratch.
- Memory is the dominant cost. Per token ≈ 2 × layers × KV heads × head_dim × bytes per element. Llama-70B: ~1 MB / token (exact figure depends on KV head count and precision; see the sizing sketch after this list).
- Optimizations: PagedAttention (vLLM), GQA / MQA (fewer KV heads), quantized KV cache (FP8/INT8).
- Prefix caching = reuse KV across requests sharing prefix. Huge win for system prompts, RAG.
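A back-of-envelope sizing sketch for the formula above; the layer and head counts are illustrative 70B-class numbers, not official model specs:

```python
# KV cache per token = 2 (K and V) x layers x kv_heads x head_dim x bytes/elem.
# GQA/MQA shrink kv_heads; quantized caches shrink bytes/elem.
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem=2):
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

# Hypothetical 70B-class config: 80 layers, head_dim 128, FP16 cache.
full_mha = kv_bytes_per_token(layers=80, kv_heads=64, head_dim=128)   # no GQA
gqa      = kv_bytes_per_token(layers=80, kv_heads=8,  head_dim=128)   # 8 KV heads
print(f"MHA: {full_mha / 2**20:.2f} MiB/token, GQA: {gqa / 2**20:.2f} MiB/token")
print(f"GQA cache for a 32k-token context: {gqa * 32_768 / 2**30:.1f} GiB")
```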
Context Window
- Max tokens model attends to. 8k → 200k → 1M+ ranges.
- "Lost in the middle" — retrieval recall drops at middle of long context. Place key info at start or end.
- Long-context cost: attention compute grows roughly quadratically with sequence length (without sparse / linear attention variants); rough estimate below.
- Prefer retrieval + smaller window over stuffing 1M tokens for cost + accuracy.
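To make the quadratic claim concrete, a rough estimate that counts only the two n²·d matmuls in attention, for a hypothetical 80-layer model with d_model 8192; real FLOP counts vary by architecture:

```python
# Attention score (QK^T) and value (attn @ V) matmuls each cost ~2 * n^2 * d FLOPs
# per layer, so total attention compute scales with seq_len squared.
def attention_flops(seq_len, d_model=8192, layers=80):
    return 2 * 2 * seq_len**2 * d_model * layers

for n in (8_000, 200_000, 1_000_000):
    print(f"{n:>9} tokens -> {attention_flops(n):.2e} attention FLOPs")
# Going 8k -> 1M multiplies attention compute by (1_000_000 / 8_000)**2 = 15,625x.
```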