Research Archive
Technical Report

TR164 V4: The Amortization-Breakdown Surface — A Predictive Bandwidth Model for Continuous Batching

A 672-cell offline static-batch grid (3 models × A100-80GB + H100 × vLLM 0.10.2, 2,016 timed decode generations, ok_rate 1.0) showing the continuous-batching free lunch is bounded by KV-cache read bandwidth. A one-parameter model η(B) = (1+r)/(1+Br) fits the 96 efficiency curves at median R² 0.93, and its zero-fitted-parameter architectural form predicts the breakdown knee at Spearman ρ = 0.84 — off only by a GPU-specific compute-overlap factor (0.52 A100, 0.33 H100). Complements V3's closed-loop boundary — grounds it, does not revise it.

Table of Contents

Technical Report 164 V4: The Amortization-Breakdown Surface — A Predictive Bandwidth Model for When Continuous Batching Stops Paying (672-Cell Offline Grid, A100-80GB + H100, vLLM 0.10.2)

Status: Complete and hand-narrated at full TR depth. A single offline static-batch decode-throughput grid of 672 cells (3 models × 2 GPUs × 2 decode lengths × 2 precisions × 4 context lengths × 7 batch sizes), 3 timed repetitions per cell, ok_rate = 1.0 (672/672 succeeded, zero error cells). Engine: vLLM 0.10.2 V1. GPUs: A100-80GB and H100 (Modal serverless). Run artifact: research/tr164/results/modal_amortization/full_xmodel_672/results.json. Analysis artifacts: research/tr164/amortization_stats.json (main effects) and research/tr164/amortization_full_stats.json (the nine-layer maximalist analysis — every empirical number below traces to a key there or to the run JSON). Model architectures are verified from each model's Hugging Face config.json and safetensors index (Appendix D). Lineage facts about V1/V2/V3 are cited, not recomputed.

The V4 thesis in one sentence. Continuous batching's free lunch — the regime where adding requests raises aggregate throughput without degrading per-request rate — is bounded by KV-cache read bandwidth, and that bound is predictable from model architecture alone: a one-parameter bandwidth model η(B) = (1+r)/(1+Br) with r = C·k/W (context × KV-bytes-per-token over weight bytes) fits the measured efficiency curves at median R² 0.93 and predicts the observed breakdown knee at Spearman ρ = 0.84 with zero fitted parameters, off only by a GPU-specific compute-overlap factor (α ≈ 0.52 on A100, 0.33 on H100) that is itself the quantitative explanation for why faster HBM holds the knee out longer.

Protocol distinction, stated up front and threaded through. V3 measured a closed-loop concurrency boundary (N agents against a server, parallel efficiency + p95 latency). V4 measures an offline static-batch amortization surface (one process submits a fixed batch of B equal-length prompts to an in-process vllm.LLM, decode throughput is timed). These are complementary: the closed-loop number is the served operating point under traffic; the static-batch number is the best-case amortization ceiling the scheduler reaches toward when every slot is full of equal work. A knee here is a ceiling on where a real scheduler can keep amortizing (§24). No V4 number is presented as a served latency, and no V3 number is spliced into a V4 curve.


Overview

V4 turns the TR164 serving-stack-physics arc from a map of where continuous batching breaks (V3, under concurrent traffic) into a predictive mechanism for where it breaks and why. The substrate is a full cross-product offline grid: three instruction-tuned 7–8B checkpoints (qwen2.5-7b, llama3.1-8b, mistral-7b), two datacenter GPUs (A100-80GB, HBM2e ≈ 2.0 TB/s; H100, HBM3 ≈ 3.35 TB/s), two decode lengths (64 and 512 generated tokens), two precisions (fp16 and fp8-weight quantization), four context lengths (512, 2,048, 8,192, 32,000 exact prompt tokens), and seven batch sizes (1, 2, 4, 8, 16, 32, 64) — 3 × 2 × 2 × 2 × 4 × 7 = 672 cells, each measured over three timed repetitions after a warm-up. Completion was perfect (672/672, no error cell, no zero-throughput cell, every prompt exact to the token), and the three reps are extremely tight (median per-cell coefficient of variation 0.4%, §19), so the surface is measured cleanly. The aggregate measurement count is 672 cells × 3 reps = 2,016 timed decode generations plus 672 warm-ups, all on real GPUs with real weights.

The analysis is deliberately maximalist — nine layers that wring the cross-product for everything it supports rather than reporting marginal means. Layer 1 is the four axis main effects. Layer 2 is the interaction structure: difference-in-differences paired tests and a factorial Type-II ANOVA that ranks the levers by partial η². Layer 3, the centerpiece, derives a first-principles bandwidth-amortization model, fits it to all 96 efficiency curves, and validates a zero-free-parameter architectural prediction of the knee. Layer 4 measures achieved memory-bandwidth utilization to confirm decode is bandwidth-bound. Layers 5–9 cover the aggregate-throughput scaling ceiling, measurement reliability, the single-stream baseline, per-model architecture-grounded robustness, and a regression of the knee against context. The result is not a description of four correlations but a single mechanism — bandwidth-bound decode — that predicts the surface.

Two findings carry the report. First, the mechanism is predictive, not merely descriptive. The bandwidth model η(B) = (1+r)/(1+Br) fits 85% of the 96 curves at R² ≥ 0.7 (median 0.93), and its zero-parameter architectural form predicts the observed knee at Spearman ρ = 0.84 across two orders of magnitude of context; it fails in exactly one interpretable place — the H100 long-decode long-context corner where the curve goes non-monotone because compute overlap re-enters, which is itself the signature of the α < 1 effect the pure-bandwidth form omits. Second, every axis sign falls out of that one mechanism, and the architecture predicts the ordering. Context dominates (mean η 0.865 → 0.599 from ctx512 → ctx32000, ANOVA partial η² 0.27, an order of magnitude above any other design factor); the H100 holds η 7.7 points higher and its advantage grows with context (a significant interaction); fp8 weight quantization is 5.4 points less batch-robust with a penalty that is context-invariant (a null interaction that corrects a plausible-but-wrong intuition); and qwen2.5-7b's 57,344-byte-per-token KV footprint — less than half the 131,072 of Llama and Mistral, by virtue of its 28 layers × 4 KV heads — directly predicts and matches its 2× later knee. This is the empirical core reserved, in the serving-stack-physics paper contract, for the held title "When Continuous Batching Stops Amortizing."


1. Abstract

Continuous batching amortizes the per-token memory traffic of LLM decode across concurrent requests; where that amortization stops paying — and what governs the boundary — is an open operational question. Technical Report 164 V4 answers it with a predictive bandwidth model validated on a 672-cell offline static-batch grid (vLLM 0.10.2 V1, ok_rate 1.0): three 7–8B instruction-tuned models, two datacenter GPUs (A100-80GB, H100), two decode lengths (64, 512), two precisions (fp16, fp8-weight), four context lengths (512–32,000 exact tokens), seven batch sizes (1–64), three reps each. We define amortization efficiency η(B) = t(B)/t(1) — the fraction of the batch-1 per-request decode rate retained at batch B, equivalently the aggregate speedup over B, the parallel efficiency — and derive from first principles that if decode is HBM-bandwidth-bound then η(B) = (1+r)/(1+Br) with r = C·k/W, the ratio of per-request KV read bytes (context C × KV-bytes-per-token k) to model weight bytes W. The model fits the 96 measured curves at median R² 0.93, and its zero-fitted-parameter architectural form predicts the observed breakdown knee at Spearman ρ = 0.84, deviating only by a GPU-specific compute-overlap factor α (0.52 on A100, 0.33 on H100) that quantifies why faster HBM holds the knee out longer. Consistent with one mechanism: context dominates (mean η 0.865 → 0.599 across ctx512 → ctx32000, paired Δ = −0.266, bootstrap 95% CI [−0.294, −0.240], Holm p ≈ 0; factorial-ANOVA partial η² 0.27, vs 0.038 GPU / 0.018 precision / 0.009 model); the H100 sustains η +0.077 (CI [0.067, 0.088]) with an advantage that grows with context (difference-in-differences +0.088, Holm p < 1e-5); fp8 is −0.054 less batch-robust (CI [−0.061, −0.047]) with a context-invariant penalty (interaction null, Holm p = 0.88); a longer decode is +0.036. Across models qwen2.5-7b is most robust (mean η 0.789, knee at batch 16 at ctx32000) ahead of a llama3.1-8bmistral-7b tie (0.750/0.748, knee at batch 8) — an ordering the KV-byte architecture predicts (57,344 vs 131,072 bytes/token). Achieved memory-bandwidth utilization at batch 1 is ≈ 0.63 on the H100, confirming decode is bandwidth-bound; the aggregate-throughput speedup ceiling at batch 64 falls from 42× to 25× across the context range; measurement reliability is high (median CV 0.4%). All comparisons are paired Wilcoxon signed-rank with Holm-Bonferroni and reproducible bootstrap CIs (seed 164, 10,000 resamples). The claims are sized to the substrate — a best-case static-batch amortization ceiling for 7–8B models on two GPUs under one engine.


2. Table of Contents

  1. Abstract
  2. Table of Contents
  3. Executive Summary
  4. Introduction and Research Motivation
  5. Research Questions
  6. Methodology
  7. SS1. The Cross-Product Grid Design and Its Arithmetic
  8. SS2. The Modal Serverless Harness and the Persistence / Orphan-Billing Rigor Arc
  9. SS3. The Amortization Metric and the First-Principles Bandwidth Model
  10. SS4. Context Length Is the Dominant Lever
  11. SS5. The Bandwidth-Amortization Model: Fit, Architectural Prediction, and Compute Overlap
  12. SS6. The HBM Bandwidth Axis and the GPU×Context Interaction
  13. SS7. The Precision Tradeoff and the Context-Invariant fp8 Penalty
  14. SS8. The Decode-Length Axis
  15. SS9. The Factorial ANOVA Lever Ranking
  16. SS10. Cross-Model Robustness and the Architecture That Explains It
  17. SS11. The Aggregate-Throughput Scaling Ceiling
  18. SS12. Memory-Bandwidth Utilization: Confirming Decode Is Bandwidth-Bound
  19. SS13. Measurement Reliability
  20. SS14. The Single-Stream Baseline Surface
  21. SS15. The Continuous-Knee Regression
  22. SS16. The Knee Table — Where Batching Stops Paying
  23. SS17. Data Verification
  24. SS18. Cross-Run Lineage — V3 Closed-Loop versus V4 Offline
  25. Limitations
  26. Forbidden Claims
  27. Future Work
  28. Conclusion
  29. References
  30. Appendix A — Reproducibility Pins
  31. Appendix B — Statistical Methodology
  32. Appendix C — The Full 24-Configuration Knee Table
  33. Appendix D — Architecture and Bandwidth Constants
  34. Appendix E — Per-Curve Roofline-Fit Sample
  35. Appendix F — Master Named-Value Reference (paper-substrate table)
  36. Appendix G — Figure Manifest (caption-ready)
  37. Appendix H — Paper Claim Ladder (supported / licensed / forbidden)
  38. Appendix I — TR→Paper Section Map and Usage Note

Reference-substrate note. This TR is built to be mined when drafting the Tier-2 paper "When Continuous Batching Stops Amortizing." Appendices F–I are the agent-facing surface: F is the master number table (every value → JSON key, never re-derive), G the figure manifest (caption-ready), H the claim ladder (supported / licensed-with-caveat / forbidden), and I the TR→paper section map. Lead the paper with the predictive bandwidth model (§11, Figure 3).


3. Executive Summary

3.1 The headline: the amortization knee is predictable from architecture

The central result of V4 is that the breakdown of continuous batching is not an empirical curiosity to be tabulated but a consequence of a single mechanism that can be written down and used to predict. Model the per-request decode throughput as bandwidth-bound: each decode step reads the model weights once (W bytes, shared across the batch) and each request's KV cache once (C·k bytes, where C is context and k is KV-bytes-per-token), producing one token per request. Then per-request throughput is t(B) = BW / (W + B·C·k) and the amortization efficiency is

η(B) = t(B)/t(1) = (W + C·k)/(W + B·C·k) = (1 + r)/(1 + B·r), with r = C·k / W.

This one-parameter rational function, derived in §9, fits the 96 measured (model, GPU, decode, precision, context) curves at median R² 0.93 (57 of 96 above 0.90; 82 of 96 above 0.70), and the architectural value of r — computed from the verified config (params → W, layers × KV-heads × head_dim → k) with no fitting whatsoever — predicts the observed breakdown knee at Spearman ρ = 0.84 (Figure 3). The single place the form fails is the four H100 | d512 | ctx32000 curves, which go non-monotone (η dips then partly recovers) because the H100's compute headroom re-absorbs the batch at the most bandwidth-stressed corner; a monotone rational function cannot fit a non-monotone curve, and that failure is itself the fingerprint of the compute-overlap term the pure model omits.

Observations. The architectural prediction sits systematically below the observed knee (points above the y = x line in Figure 3) by a roughly constant factor — the compute-overlap factor α = r_fit / r_arch, median 0.52 on the A100 and 0.33 on the H100. Lower α means the real curve is gentler than pure bandwidth predicts, i.e., compute overlaps more of the KV read; the H100's larger α-discount is the quantitative reason its knee falls later. The mechanism is therefore not just descriptive: from three numbers in a model's config and the GPU's HBM generation, one can predict to within a factor of ~2 the batch at which continuous batching stops paying at a given context.

The amortization knee is predictable. A one-parameter bandwidth model fits the curves at R² 0.93 and its zero-parameter architectural form predicts the knee at ρ = 0.84; the only residual is a GPU-specific compute-overlap discount that itself explains the hardware effect. This is the result that turns "long context is slower" into a usable engineering prediction.

3.2 Context is the dominant lever, and the whole effect lives in the curve shape

Averaged across all twelve (model, GPU, decode, precision) configurations, mean η at batch 16 falls 0.897 → 0.838 → 0.686 → 0.442 across contexts 512 → 2,048 → 8,192 → 32,000, and at batch 64 it falls 0.659 → 0.619 → 0.533 → 0.389 (headline.mean_eta_by_context_batch). The paired extreme contrast (ctx32000 vs ctx512, 144 matched cells) is Δ = −0.266 (bootstrap 95% CI [−0.294, −0.240], Holm-adjusted Wilcoxon p ≈ 0), and the factorial ANOVA assigns context a partial η² of 0.266 — an order of magnitude above the next design factor (GPU 0.038). Critically, almost none of this lives in the single-stream rate: the batch-1 per-request throughput is only ≈ 6% lower at 32,000 tokens than at 512 (prefill-amortization penalty 0.94, §20). Observations. Context barely changes how fast one request decodes; it changes how fast batching stops helping. That is precisely why context is an amortization lever (it reshapes r, the curve's one parameter) rather than a raw-speed lever, and it is why the knee, not the throughput, is the quantity that moves.

Of every operational knob, context length has by far the largest effect on whether continuous batching still amortizes — and the effect is entirely in the curve shape, not the single-stream speed. Deciding how much history or retrieved context to include is also deciding the batch-throughput regime the server will live in.

3.3 The knee marches left with context; at 32K every configuration breaks by batch 8–16

The breakdown knee compresses sharply with context. Counting the 24 (model, GPU, decode, precision) configurations: at ctx512, 14 never cross 0.65 through batch 64; at ctx2048, still 14; at ctx8192, only 3 remain free and the rest cluster at batch 16–32; at ctx32000, every configuration has broken down, thirteen by batch 8 and the rest by batch 32 (headline.knee_distribution_by_context, full table §22). The continuous-knee regression (§21) puts the compression rate at a median slope of −0.64 in log₂(knee) per log₂(context): the knee roughly halves for every ~3× of context. Observations. There is no single safe batch size. The maximum batch at which batching still pays runs from ≥64 (or unbounded) at 512 tokens to ~8 at 32,000 — a swing an autoscaler that ignores prompt length cannot absorb without leaving either throughput or latency on the table.

The knee is a steep function of context: ≥64 at short context, ~8 at 32K. The operational corollary is that the right batch ceiling is prompt-length-conditional, not a constant.

3.4 Faster HBM holds the knee out, and its advantage grows exactly where bandwidth binds

The GPU axis is the cleanest causal probe — only the memory subsystem changes. Over 288 matched pairs the H100 (HBM3) sustains mean η 0.801 against the A100's (HBM2e) 0.724, Δ = +0.077 (CI [0.067, 0.088], Holm p ≈ 0), and delivers the higher absolute ceiling (peak aggregate 8,879 vs 4,684 tok/s at batch 64). The difference-in-differences test shows the advantage is not flat: the H100 edge grows from +0.036 at ctx512 to +0.125 at ctx32000 (DiD +0.088, Holm p < 1e-5, §12), exactly where KV traffic is largest and bandwidth binds hardest. Observations. A faster memory subsystem helps most precisely when the bottleneck is memory — a sign-and-magnitude prediction of the bandwidth mechanism that the data confirms, and the mechanistic complement to the α = 0.33-vs-0.52 compute-overlap finding.

The H100 advantage is bandwidth, demonstrated two ways: it holds η +7.7 points higher overall, and that margin widens to +12.5 points at 32K context where bandwidth is the binding constraint. Same models, same engine, only the HBM changes.

3.5 fp8 is faster but less batch-robust — and the penalty is context-invariant

fp8 weight quantization halves the weight read and so raises absolute throughput (single-stream fp8 is 1.40× fp16, §20), but it leaves the KV cache at fp16, so it reaches the KV-dominated regime sooner and is −0.054 less batch-robust (CI [−0.061, −0.047], Holm p ≈ 0). The maximalist analysis corrects a plausible intuition here: one expects the fp8 penalty to be largest where the KV cache is largest (long context), but the difference-in-differences test is null — the fp8 η-penalty is −0.053 at ctx512 and −0.051 at ctx32000, DiD +0.002, Holm p = 0.88 (§13). The penalty is significantly smaller on the H100 (precision×GPU DiD +0.029, Holm p < 1e-5), because the H100's compute headroom absorbs part of it. Observations. fp8 relocates the bottleneck from weights to the unquantized KV cache by a roughly fixed efficiency cost across context, modulated by how much compute headroom the GPU has — not, as one might guess, by a cost that escalates with context.

fp8 buys absolute speed and memory headroom at a fixed ~5-point batch-robustness cost that does not escalate with context but does shrink on a compute-richer GPU. The obvious follow-up the sign motivates is to quantize the KV cache too (§27).

3.6 The architecture predicts the model ordering

qwen2.5-7b is the most batch-robust (mean η 0.789, median knee at batch 16 at ctx32000); llama3.1-8b and mistral-7b are a statistical tie (0.750 / 0.748, median knee at batch 8; pairwise Holm p = 0.13). This ordering is not a black-box observation: Qwen's KV cache is 57,344 bytes/token (28 layers × 4 KV heads × 128 head_dim × 2 for K/V × 2 bytes) against 131,072 bytes/token for Llama and Mistral (32 layers × 8 KV heads × …), so Qwen's r = C·k/W is roughly half at any context, its curve is gentler, and its knee falls ~2× later — exactly the 16-vs-8 ratio observed. Observations. The model that decodes with fewer KV bytes per token amortizes batching further, and the architecture says how much in advance. The Llama≈Mistral tie is reported as a tie (CI straddles zero, Holm rejects), consistent with their identical KV geometry.

The cross-model robustness order is an architectural prediction, not a leaderboard: KV-bytes-per-token sets r, and Qwen's half-sized KV footprint predicts and matches its 2× later knee. Llama and Mistral, with identical KV geometry, tie.

3.7 What V4 licenses and what it does not

V4 licenses, on the substrate measured: (i) a predictive bandwidth-amortization model for 7–8B offline static-batch decode on A100-80GB and H100, validated at median R² 0.93 and a zero-parameter knee prediction at ρ = 0.84; (ii) the causal claim that the knee is governed by KV read bandwidth, supported by the concordant signs of the context, GPU, and precision effects, the gpu×context interaction, and the ≈ 0.63 measured MBU; (iii) a model-robustness ordering predicted by KV-byte architecture. V4 does not license any closed-loop serving-latency claim (the protocol is static-batch offline throughput, §24, §26), generalization beyond the three families or the two GPUs, any fp8-KV-cache claim (only fp8 weight quant was run), or an engine comparison (one engine). These are enumerated in §26.


4. Introduction and Research Motivation

4.1 The arc from V1 to V4

TR164 is a four-leg characterization of where the concurrency behavior of LLM serving breaks. V1 measured a single in-process backend (pytorch_direct) and found a catastrophic near-universal breakdown by N = 16. V2 introduced server-process continuous-batching backends that eliminate the V1 collapse but left a hardware confound in the cross-backend head-to-head. V3 closed that confound with a matched-SKU A100 grid across five models and five workloads under a closed-loop concurrency ladder, and forecast the next measurement explicitly: "a denser concurrency sweep that traces each backend's amortization regime to its own breakdown point." V4 is that sweep, with a protocol change (offline static batching, to isolate the amortization ceiling from scheduler and queueing dynamics) and two extensions (a second GPU for a clean bandwidth contrast, and context and precision swept densely so the physical drivers of memory traffic become explicit axes).

4.2 From description to prediction

V1–V3 described where the boundary sits. V4's ambition is to predict it. The closed-loop ladder cannot do this cleanly because it folds the amortization physics together with scheduler decisions and arrival noise; the static-batch grid strips those away, leaving a quantity — η(B) — that a first-principles bandwidth argument predicts in closed form. The test of the report is whether that argument, fed only verified architecture and HBM specs, predicts the measured knee. It does, to within the compute-overlap factor, which is the report's main contribution: a serving engineer can estimate the amortization knee for an untested (model, GPU, context) from the model card and the datasheet.

4.3 Why offline static batching is the right instrument

To ask where batching stops amortizing bandwidth, the cleanest instrument fills exactly B slots with equal-length prompts and times steady-state decode, removing ragged batches, preemption, and queueing. The resulting η(B) is the best case — so a knee observed here is a ceiling on where a real scheduler can keep amortizing (a closed-loop operator will generally see the knee at or before this batch). That conservative direction is exactly what an operator needs from a planning bound.

4.4 The exact-token-context construction removes a V3 confound

V3 carried a documented Mistral tokenizer confound: char-matched text prompts produced different token counts across tokenizers, so "same prompt" did not mean "same sequence length." V4 constructs each prompt as an exact token-id sequence of the target length and verifies context_actual == context for all 672 cells (§23), so sequence length is an exact identical variable across models and GPUs and the tokenizer-density confound does not arise. This is a methodological strength of the static-batch design and the reason the cross-model context comparison is clean here where V3 required a caveat.


5. Research Questions

  • RQ1 (predictive mechanism). Can a first-principles bandwidth model η(B) = (1+r)/(1+Br) describe the measured efficiency curves, and does its zero-parameter architectural form predict the observed knee? Hypothesis: yes, up to a GPU-specific compute-overlap factor.
  • RQ2 (the dominant lever). Which axis dominates η, and is its effect monotone? Hypothesis: context, monotonically, because KV read traffic scales with it.
  • RQ3 (the bandwidth causal probe and its interaction). Does changing only the GPU memory subsystem move η as the mechanism predicts, and does the effect concentrate where bandwidth binds? Hypothesis: the H100 sustains higher η, with an advantage that grows with context.
  • RQ4 (the precision tradeoff and its shape). Is fp8 weight quantization less batch-robust, and does its penalty depend on context? Hypothesis: less robust (KV becomes binding sooner); the context-dependence is tested, not assumed.
  • RQ5 (model consistency). Is the surface consistent across families, and does KV-byte architecture predict any ordering? Hypothesis: a shared surface with an ordering set by KV-bytes-per-token.
  • RQ6 (bandwidth-bound confirmation). Is decode actually bandwidth-bound at these operating points? Hypothesis: measured MBU is high (≳ 0.5).

6. Methodology

6.1 The cell-shape contract and its arithmetic

A cell is one point in (model, GPU, decode length, precision, context, batch): models {qwen2.5-7b, llama3.1-8b, mistral-7b}; GPUs {A100-80GB, H100}; decodes {64, 512}; precisions {fp16, fp8-weight}; contexts {512, 2,048, 8,192, 32,000}; batches {1, 2, 4, 8, 16, 32, 64}. The product is 672 cells. Each runs one warm-up generation (moving autotuning and CUDA-graph capture out of the timed window) then three timed repetitions, reporting the mean and population std of aggregate decode throughput and the mean per-request decode throughput. The grid is balanced and verified balanced: 224 cells per model, 336 per GPU/decode/precision, all contexts and batches present (§23).

6.2 Precision, context ceiling, and the engine path

Both precisions share the official checkpoint: fp16 is dtype=float16; fp8 is fp8 weight quantization (quantization="fp8"), the only fp8 path the V1 engine supports at 0.10.2 (fp8 KV-cache dtype is a V0-only path, out of scope; a V0 validation pass showed long-context decode an order of magnitude slower and non-representative, so all cells use the production V1 engine). The KV cache is fp16 in every cell, including fp8 cells — the reason the fp8 robustness penalty has the sign it does (§13). The largest context is 32,000 (not 32,768) so max_model_len = context + decode ≤ 32,768, the shared position-embedding ceiling of Qwen and Mistral (Llama's is 131,072; the shared 32,000 keeps the context axis identical across models). All cells run V1 with gpu_memory_utilization = 0.90, CUDA graphs on, greedy decode pinned to the exact token budget (min_tokens = max_tokens, temperature = 0).

6.3 The amortization metric and the 0.65 threshold

η(B) = t(B)/t(1) is the fraction of the batch-1 per-request decode rate retained at batch B; since aggregate A(B) = B·t(B), it equals the parallel efficiency A(B)/A(1)/B used throughout the TR164 line. η ≈ 1 is the free-lunch regime; η < 0.65 is breakdown (EFF_BREAKDOWN, the shared threshold). The knee is the smallest B > 1 with η(B) < 0.65 (or "never"); §21 additionally interpolates a continuous knee (the η = 0.65 crossing in log₂ B). The 0.65 line is a consistent reading point shared with V3, not a claim that 0.649 and 0.651 differ operationally.

6.4 The first-principles bandwidth model and its parameters

The model η(B) = (1+r)/(1+Br) with r = C·k/W is derived in §9 from a bandwidth-bound decode step. Its parameters are computed from verified architecture (Appendix D): weight bytes W = params × 2 (fp16) or × 1 (fp8, approximate); KV-bytes-per-token k = 2 × layers × KV-heads × head_dim × 2 (the leading 2 for K and V, the trailing 2 for fp16). For each curve we fit r by least squares (the architectural value is not used in the fit), report R², compare the fitted r to the architectural r_arch, and define the compute-overlap factor α = r_fit / r_arch. The architectural knee B* = (0.35 + r_arch)/(0.65·r_arch) is the zero-parameter prediction validated against the observed knee.

6.5 The statistical battery

Axis main effects are paired Wilcoxon signed-rank over matched cells (batch 1 excluded; η ≡ 1), Holm-Bonferroni-corrected within the four-axis family, each with a reproducible percentile-bootstrap 95% CI on the mean paired delta (seed 164, 10,000 resamples). Interactions are difference-in-differences paired Wilcoxon (the effect at one moderator level vs another), Holm-corrected within the interaction family, and a factorial Type-II ANOVA of η on all main effects plus 2-way interactions reporting partial η² per term. Model comparisons are paired Wilcoxon, Holm-corrected within the three-pair family. All estimators and the 0.65 threshold are shared with v3_paper_stats.py.

6.6 The per-cell measurement protocol

Each cell instantiates one vllm.LLM with max_model_len = context + decode, builds the batch of B identical exact-length token-id prompts, and runs the measurement loop. A single warm-up generate over one prompt is issued first and discarded: vLLM's V1 engine autotunes and captures CUDA graphs on the first generation of a new shape, and folding that one-time cost into the timed window would inflate the batch-1 step and corrupt the η ratio. Three timed repetitions then follow, each a full generate over the B-prompt batch with max_tokens = min_tokens = decode (so every request emits exactly the decode budget, no early stop), wrapped in a wall-clock timer. Per repetition the aggregate decode throughput is sum(output_tokens)/wall and the per-request throughput is that divided by B; the cell records the mean and population standard deviation of the aggregate across the three reps, and the mean per-request rate (the quantity η is built from). The greedy, fixed-length decode (temperature = 0) removes sampling variance, so the only residual noise is scheduling and memory-contention jitter, which the three-rep CV quantifies at a median 0.4% (§19). Observations. The protocol is deliberately the best case for amortization — equal-length prompts, full batch, steady-state decode, autotuning excluded — because the question is where batching stops paying even when nothing else is working against it; a knee measured under these conditions is a ceiling on the served knee (§24).

6.7 Execution substrate and rigor controls

The grid ran on Modal serverless GPUs via two GPU-typed functions sharing one measurement body, both pools dispatched concurrently. Three rigor controls (narrated in full in §8): validate-before-pay (a ten-cell diagnostic proved every new configuration loads and runs before the paid grid), dual-path persistence (each cell writes to a run-tagged modal.Dict as it completes, with a recover mode, decoupling data from the local process), and --detach (the job survives a local disconnect). The happy-path save and the insurance Dict agreed exactly at 672 cells.


7. SS1. The Cross-Product Grid Design and Its Arithmetic

7.1 Why a full cross-product

The amortization question is interaction-shaped: the knee is not a property of context or batch alone but of their joint position relative to the bandwidth ceiling, modulated by precision and hardware. A sliced design would hide, for example, that the H100 advantage grows with context (§12) or that the fp8 penalty does not (§13). Measuring all 672 cells lets every two-way effect be read off the same surface and tested directly (§9, §12, §13, §15), and supplies the 96 complete batch-sweeps the bandwidth model is fit against. The axes were chosen as the physical drivers of decode memory traffic: context and batch set KV size and therefore KV read bandwidth per step; precision sets weight read bytes; the GPU sets available HBM bandwidth; decode length sets how many memory-bound steps amortize the one compute-leaning prefill.

7.2 What the design buys statistically

Because every (model, decode, precision, context, batch) combination exists on both GPUs, the GPU cross-cut is a fully matched paired test with no missing cells; the same holds for precision and decode. The only extreme-versus-extreme contrast is context (512 vs 32,000), chosen for the cleanest single effect size, with the intermediate contexts reported in full (§3.2, §22) and used in the continuous-knee regression (§21). Observations. The symmetry is what makes the difference-in-differences interaction tests clean: each is a paired contrast of paired contrasts over fully matched cells, so the DiD estimates are unconfounded by the other axes.

A full cross-product is the design that does not assume away the interactions an amortization-breakdown story is made of. The 672 cells buy the right to read and test every two-way effect on one surface, and the 96 complete curves the predictive model is validated against.


8. SS2. The Modal Serverless Harness as Methodology-Honest Substrate: The Full Execution Arc

The 672-cell grid is a paid-compute artifact, and the TR164 convention treats execution rigor as a result rather than a footnote, because on paid serverless compute the harness's failure modes are part of the measurement: a silent-fallback bug, an orphaned container fleet, or a non-representative engine path does not announce itself, it just bills and returns a number. This section narrates the full arc that produced a clean 672/672 grid — the bring-up failures, the engine decision, the validation discipline, the orphan-billing catch, and the persistence hardening — in the same spirit V3 narrated its four-failure environment arc. Every event below was observed in the run logs and is reported as it happened, not reconstructed.

8.1 The mandate: "we are paying for runs"

The harness did not start clean. An early pass of the sweep fired a partial grid that returned roughly half its cells failed and then crashed the result save with a FileNotFoundError on a /results/... path: the local entrypoint had tried to write to the Modal volume path, which exists only inside the remote containers, not in the local process that runs the entrypoint and collects the .map() returns. The fix was to save locally from the entrypoint (the .map() returns the result dicts to the local process; the volume is the wrong place to write the aggregate). The failure set the bar for everything after it: a paid run is not an enthusiastic best-effort, it is code that must work end-to-end before money is spent, with each failure mode diagnosed from the real traceback rather than guessed. The three controls that follow — validate-before-pay, dual-path persistence, and orphan-billing vigilance — are the operationalization of that bar.

8.2 The vLLM 0.10.2 offline API surface: three bring-up failures

Bringing the offline measurement up against vLLM 0.10.2's V1 engine surfaced three concrete failures, each fixed against the real error:

  • The exact-context prompt construction. The experiment requires prompts of an exact token length (so context is an exact variable, §4.4), which means submitting token-id sequences rather than text. The natural call LLM.generate(prompt_token_ids=...) raised TypeError: LLM.generate() got an unexpected keyword argument 'prompt_token_ids': at 0.10.2 token-id prompts must be passed as dict prompts, generate([{"prompt_token_ids": seq}, ...]). The harness builds each prompt by encoding a stable non-special token pattern, repeating and truncating it to exactly ctx tokens, and wrapping each in such a dict; the post-hoc check context_actual == context on all 672 cells (§23) confirms the construction.
  • The 32,768-token position ceiling. A context of 32,768 raised a Pydantic ValidationError: max_model_len (32832) > max_position_embeddings (32768)max_model_len = context + decode must fit under the model's position ceiling. The largest context was therefore set to 32,000, so that even the long decode fits (32,000 + 512 = 32,512 ≤ 32,768), and the context axis stays identical across Qwen and Mistral (ceiling 32,768) and Llama (131,072).
  • The Windows console codec. The Modal CLI prints progress with a glyph, which the Windows console's default charmap codec cannot encode ('charmap' codec can't encode character '✓'), crashing the local driver before the run even dispatched. Every modal run invocation is prefixed PYTHONUTF8=1 PYTHONIOENCODING=utf-8. A trivial failure, but a fatal one if unfixed — and exactly the kind of environment friction that silently aborts a paid run.

8.3 The V0-versus-V1 engine decision and the fp8 redefinition

The most consequential bring-up decision was which engine path defines a cell, and it turned on a measured rejection. The original precision axis included an fp8 KV-cache dtype (kv_cache_dtype="fp8"), which on the V1 engine raised NotImplementedError: VLLM_USE_V1=1 is not supported with --kv-cache-dtype. The path that does support an fp8 KV cache is the legacy V0 engine (VLLM_USE_V1=0), so a V0 validation pass was run — and its numbers disqualified it. On V0 the fp16 throughput at ctx2048, batch 4 was 191 tok/s against the V1 engine's 333, and at the stressed corner (ctx32000, batch 64) V0 returned 19 tok/s against V1's 1,182 — a ~62× gap. V0 is not a slower-but-representative path; it is a legacy code path whose long-context decode behavior is not what any production deployment sees, so reporting V0 numbers would have measured an obsolete engine, not the serving physics. V0 was rejected for all cells.

The pivot was to redefine the fp8 axis as fp8 weight quantization (quantization="fp8"), the only fp8 path V1 supports, which a validation cell confirmed works and is fast: fp8-weight at ctx2048, batch 4 returned 415 tok/s, faster than the fp16 baseline's 333, exactly as halving the weight read predicts. This is why the V4 fp8 axis is weight-quant with an fp16 KV cache (§6.2), and it is the direct cause of the fp8 robustness penalty's sign (§13): the weights shrink, the KV cache does not, so the binding constraint moves to KV sooner. The decision is recorded in full because it shapes a headline axis — the fp8 result is a weight-quant result, not a KV-quant result, and conflating the two would misreport it.

8.4 Validate-before-pay: the ten-cell cross-model diagnostic

With the API surface and engine path settled, the paid 672-cell grid was still not the first thing to run. A ten-cell diagnostic fired first, chosen to exercise every new configuration the full grid would depend on: each of the three models on each of the two GPUs at a mid context (six cells), plus the decode-512 axis on two models and the 32,000-token corner on two (four cells). All ten succeeded. This proved, before any large spend, that gated-model access worked (§8.5), that the H100 GPU type was available and loaded all three models, that the decode-512 path ran, and that the 32,000-token configuration loaded under the position ceiling — and it surfaced the realistic load times (a first-time Mistral load on the H100 took 411 seconds, Llama loads 148–182 seconds, cold-cache effects that the diagnostic absorbed). It also pre-downloaded the two gated checkpoints into the shared cache volume, so the paid grid paid no download cost. Observations. The diagnostic is the cheapest insurance in the arc: ten cells of validation against a 672-cell paid run is a ~1.5% overhead that retires every "does this axis even run" risk before the expensive fan-out, and it is the direct operationalization of the §8.1 bar.

8.5 Gated-model access and the HF-token Modal secret

Two of the three models — meta-llama/Llama-3.1-8B-Instruct and mistralai/Mistral-7B-Instruct-v0.3 — are gated on Hugging Face and require an access token; only Qwen is ungated. The token is supplied to the remote containers through a Modal secret (hf-token), attached to both GPU functions, so it reaches the container environment without being committed to code. The secret's existence was verified before the diagnostic (created on the run date), and the diagnostic's success on the two gated models is the end-to-end proof that the secret resolved inside the containers — a gated-access failure would have surfaced as a 401 on model load in the ten-cell pass, not silently in the paid grid.

8.6 The orphan-billing catch

The single most instructive failure in the arc cost real money and was caught by inspection. A first version of the full grid was fired with the data saved only by the local entrypoint at the very end. It was stopped after roughly ninety seconds to harden the persistence path first — and stopping the local modal run client did not stop the remote app. An inspection of the Modal app list showed the ephemeral app still in the ephemeral state with eleven GPU containers running and billing: killing the client had detached from, not terminated, the fan-out, and eleven A100/H100 containers were burning compute with no client attached. Terminating it was itself a small lesson in non-interactive tooling: modal app stop <id> prompted [y/N] and aborted under the non-interactive shell; piping y into it failed with Aborted: no interactive terminal detected. Rerun with --yes (-y); only the explicit modal app stop -y <id> worked, after which the app drained from eleven containers to nine to zero. Observations. The hazard is generic to every serverless fan-out: the containers outlive the client that launched them, so "I stopped my terminal" is not "I stopped paying," and the only reliable check is to inspect the remote app state directly. This is exactly the silent-cost failure mode that does not appear until an invoice, and it was caught here by looking.

8.7 Dual-path persistence: modal.Dict, recover, and --detach

The orphan catch and the §8.1 save crash together motivated a persistence design that does not depend on the local process surviving. In the hardened harness, each cell writes its own result to a run-tagged modal.Dict (tr164-amort-<runtag>) the instant it completes, so a finished cell is durable independent of what happens to the local entrypoint, the network, or the client; a recover entrypoint mode reconstructs results.json from the Dict if the local process ever dies mid-run; and the grid runs with --detach so the remote app survives a client disconnect. The modal.Dict API required one correction during bring-up: it does not implement Python's len() (TypeError: object of type 'Dict' has no len()) but exposes .len() as a method, caught against the live Dict and confirmed alongside .keys(), .items(), .values(), and .clear() before the recover path was relied upon. In the event the local happy path completed normally and saved all 672 cells, and the insurance Dict independently held all 672, and the two agreed exactly (§23) — so the insurance was never needed, but it was verified correct on both channels rather than assumed.

8.8 Concurrent dispatch, the run, and the background-task discovery

A final efficiency fix: the first dispatch ran the A100 pool to completion and then the H100 pool (each .map() blocks until its pool drains), serializing two independent GPU fleets for no reason. Rewrapping the two maps in a two-worker thread pool runs both GPU pools concurrently — identical GPU-seconds billed, roughly half the wall-clock. The hardened, concurrent, detached grid then ran cleanly: the insurance Dict climbed steadily from 35 cells at the first poll through 171, 281, 394, 496, 613, to 672, at ~10–13 cells per minute with no plateau and no run-crashing failure, completing in ≈ 63 minutes of wall-clock with both pools busy. One incidental observation from monitoring: the local driver and the progress watcher both ran well past ten minutes without being killed, so the background execution survived for the full run. The only benign warning in the logs was Mistral's recommendation to use --tokenizer-mode "mistral", which is irrelevant here precisely because the prompts are exact token-id sequences (§4.4): sequence length is fixed regardless of tokenizer mode, so the tokenizer-density issue that was a confound in V3 does not arise in V4.

8.9 Why the execution arc is a result

The grid finished 672/672 at ok_rate 1.0 with zero error cells, and that number is the headline. But the trust-relevant facts are the arc that produced it: a non-representative engine path was measured and rejected on the evidence (V0's 62× long-context gap), a headline axis was correctly scoped to what the engine actually supports (fp8 weight, not KV), every new configuration was validated for ~1.5% overhead before the paid fan-out, an eleven-container orphan-billing hazard was caught by inspection rather than on an invoice, and the data was made durable on two independent channels that were verified to agree. None of these is a footnote: each one is a place the measurement could have silently returned a wrong or expensive number, and the report's confidence in the 672 cells rests on them.

On paid serverless compute the harness's failure modes are part of the result. V4's clean 672/672 is backed by an evidence-based engine rejection, a correctly-scoped fp8 axis, a validate-before-pay diagnostic, a caught orphan-billing fleet, and dual-channel persistence proven to agree — the methodology-honest substrate beneath the numbers.


9. SS3. The Amortization Metric and the First-Principles Bandwidth Model

9.1 The metric and what it isolates

η(B) = t(B)/t(1) is the parallel efficiency in static-batch form. At η ≈ 1 the engine reads the weights once and amortizes that read across B concurrent decodes for free; aggregate throughput scales linearly. What erodes η as B (or context) grows is the KV cache, which scales as B × C and must be read every step, until KV traffic rivals and then dominates weight traffic. The metric isolates this from raw speed: §3.2 and §20 show the batch-1 rate barely moves with context (penalty 0.94), so the entire context effect is in the shape of η, i.e. in one parameter.

9.2 Deriving η(B) = (1+r)/(1+Br)

Treat the decode step as bandwidth-bound. Bytes read per step at batch B: the weights once (W) plus each request's KV cache (B · C · k, with k = KV-bytes-per-token). At sustained bandwidth BW, step time is τ(B) = (W + B·C·k)/BW, and each step emits B tokens (one per request), so per-request throughput is t(B) = 1/τ(B) = BW/(W + B·C·k). Then

η(B) = t(B)/t(1) = (W + C·k)/(W + B·C·k) = (1 + r)/(1 + B·r), r = C·k/W.

r is the ratio of one request's KV read bytes (at the full context) to the weight bytes — the single quantity that sets the curve. The knee follows in closed form: η(B) < 0.65 ⟺ B > (0.35 + r)/(0.65·r) ≡ B*. As r → 0 (short context, small KV), B* → ∞ (batching always pays); as r grows with context, B* shrinks. The prediction is sharp: the knee scales inversely with context, and the model with the smaller k (fewer KV bytes/token) has the later knee.

9.3 Why the pure model is a floor, not the whole story

The derivation assumes pure bandwidth-bound decode with no compute overlap. Real engines overlap some computation (attention, FFN) with memory movement, and the overlap grows with the compute headroom of the GPU and with batch (more arithmetic to hide behind the reads). So observed η is gentler than the pure model — the real r behaves as α · r_arch with α < 1 — and the discount α is larger (further below 1) on the compute-richer H100. The fit (§11) recovers α directly, and the pure-architectural prediction (α = 1) is therefore a conservative floor on the knee. Observations. This is the right structure for an engineering bound: the architecture gives a predictable, conservative knee; the GPU's compute headroom buys a known multiplicative relaxation.

The bandwidth model is derived, not assumed: a bandwidth-bound decode step yields η(B) = (1+r)/(1+Br) with r = C·k/W, predicting an inverse-context knee and a model ordering set by KV-bytes-per-token. Compute overlap makes the pure form a conservative floor, relaxed by a GPU-specific factor the fit recovers.

9.4 A worked prediction, end to end

To make the model concrete and falsifiable, here is the full arithmetic for two configurations at the stressed corner (context 32,000), using only the verified architecture (Appendix D) and the formulas above — no fitting. The KV-bytes-per-token is k = 2 × layers × KV-heads × head_dim × 2 and the weight bytes are W = params × 2 (fp16).

For qwen2.5-7b: k = 2 × 28 × 4 × 128 × 2 = 57,344 bytes/token, W = 7,615,616,512 × 2 = 15.23 GB, so at C = 32,000 the per-request KV read is C·k = 1.835 GB and r = C·k/W = 1.835/15.23 = 0.1205. The closed-form knee is B* = (0.35 + r)/(0.65·r) = (0.35 + 0.1205)/(0.65 × 0.1205) = 0.4705/0.0783 = 6.0. For llama3.1-8b: k = 2 × 32 × 8 × 128 × 2 = 131,072 bytes/token (2.3× Qwen's), W = 16.06 GB, C·k = 4.194 GB, r = 0.2612, and B* = (0.35 + 0.2612)/(0.65 × 0.2612) = 0.6112/0.1698 = 3.6.

These are the zero-parameter predictions: Qwen breaks at batch 6, Llama at batch 3.6 — the right ordering (Qwen 1.7× later, tracking its 0.46× smaller k), each about 2× below the observed knee because the pure form omits compute overlap. Applying the empirical H100 compute-overlap factor α = 0.33 (the median r_fit/r_arch, §11.3), the effective ratio is α·r and the corrected knee is (0.35 + α·r)/(0.65·α·r): for Qwen α·r = 0.0398 → B* = 15.1, for Llama α·r = 0.0862 → B* = 7.8. The observed discrete knees at this corner are 16 for Qwen and 8 for Llama (Appendix C) — the α-corrected prediction lands on both. On the A100, α = 0.52 gives the more conservative B* = 10.1 (Qwen) and 5.5 (Llama), consistent with the A100's earlier knees. Observations. The worked example shows the two-stage structure of the prediction: the architecture alone fixes the ordering and the scale of the knee from three config numbers and a context length, and a single empirical per-GPU constant (the compute-overlap α) closes the ~2× gap to the observed value. An engineer with a model card and a GPU's HBM generation can run this calculation for an untested (model, context) in a spreadsheet.

A worked prediction: from Qwen's and Llama's config alone, the pure model predicts knees at batch 6.0 and 3.6 (right ordering, 2× conservative); the empirical H100 α = 0.33 corrects them to 15.1 and 7.8, landing on the observed 16 and 8. The amortization knee is computable in advance from a model card and a datasheet.


10. SS4. Context Length Is the Dominant Lever

The context cross-cut is the largest effect in V4 by a factor of three to five over any other axis. Pairing η at matched (model, GPU, decode, precision, batch > 1) cells between ctx512 and ctx32000 (144 informative pairs), mean η falls 0.865 → 0.599, paired Δ −0.266 (bootstrap 95% CI [−0.294, −0.240], Holm-adjusted Wilcoxon p ≈ 0; L1_main_effects.context_32000_vs_512), and the factorial ANOVA (§15) assigns context partial η² 0.266 against 0.038 for the GPU, the next design factor. Figure 1 (amortization_eta_vs_batch.png) plots the η(B) curves by context for each GPU, averaged over the twelve (model, decode, precision) configurations: the four curves fan from the near-flat ctx512 line to the steeply-collapsing ctx32000 line on both panels, with the 0.65 breakdown line crossed progressively earlier as context grows — the dominant lever made visual. The per-batch executive table (§3.2) shows the collapse is monotone at both batch 16 and batch 64 and steepens with batch.

Observations. Context is the only axis whose effect crosses the entire free-to-broken span on its own (mean batch-16 η 0.897 at ctx512 vs 0.442 at ctx32000). The mechanism is direct and quantitative: r = C·k/W is linear in C, so the 62.5× context increase multiplies the KV/weight ratio by the same factor and walks the curve from r ≈ 0 (free) to r large (broken). Because the batch-1 rate is nearly context-independent (§20), the context effect is purely a reshaping of r — the cleanest possible demonstration that context acts through the amortization parameter, not through raw speed.

Context dominates the amortization surface — an order of magnitude above any other design factor in the ANOVA — and it acts entirely by scaling the bandwidth model's one parameter r. This is both the largest empirical effect and the most direct confirmation of the mechanism.


11. SS5. The Bandwidth-Amortization Model: Fit, Architectural Prediction, and Compute Overlap

This section is the report's centerpiece: the test of whether the derived model (§9) actually predicts the measured surface. Three results — the functional-form fit, the zero-parameter architectural prediction, and the compute-overlap factor — are reported in turn, with Figure 2 (fit overlays) and Figure 3 (architectural prediction) as the visual anchors.

11.1 The functional form fits

Fitting η(B) = (1+r)/(1+Br) to each of the 96 (model, GPU, decode, precision, context) curves by least squares (one free parameter r, the curve anchored at η(1) = 1 by construction) gives median R² 0.928, with 57 of 96 curves above R² 0.90 and 82 of 96 above 0.70 (L3_roofline.fit_r2_median, distribution verified). Figure 2 overlays the fitted curves on the observed points for four representative configurations; the rational form tracks the data across the full batch sweep at every context. Observations. A one-parameter rational function explaining 93% of the variance in a seven-point throughput curve, across 96 independent curves spanning two GPUs, two precisions, two decode lengths, and a 62.5× context range, is strong evidence that the bandwidth-bound decode step is the right physical picture. The single free parameter is not a curve-fitting degree of freedom chasing shape — it is a physical ratio whose architectural value is tested next.

11.2 The zero-parameter architectural prediction

Computing r_arch = C·k/W from verified architecture (no fitting) and the closed-form knee B* = (0.35 + r_arch)/(0.65·r_arch), the predicted knee correlates with the observed interpolated knee at Spearman ρ = 0.84 (Pearson on log₂ 0.79, n = 65 breaking curves; L3_roofline.arch_knee_vs_obs_knee_spearman, Figure 3). The prediction holds across two orders of magnitude: for qwen2.5-7b | H100 | d64 | fp16 the architectural knee runs 281 → 71 → 19 → 6 across contexts 512 → 2,048 → 8,192 → 32,000, against observed never → never → 41 → 11 — same ordering, same scale, no fit. Observations. The points in Figure 3 sit systematically above the y = x line: the observed knee is later than the pure-architectural prediction, consistently, because compute overlap (α < 1) relaxes the bandwidth pressure. The offset is the signal, not noise — it is the compute-overlap factor quantified next. That a knee computed from three config numbers and a context length predicts the measured breakdown at ρ = 0.84 is the report's central claim.

11.3 The compute-overlap factor α and where the model breaks

The ratio α = r_fit / r_arch measures how much gentler the real curve is than pure bandwidth predicts. Its median is 0.52 on the A100 and 0.33 on the H100 (L3_roofline.alpha_compute_overlap_by_gpu): the H100's larger compute headroom overlaps more of the KV read, discounting the effective bandwidth pressure to a third of architectural, against half on the A100. This is the mechanistic complement to the GPU main effect (§12) — the H100 holds η higher because α is lower. The pure model fails in exactly one place: the four H100 | d512 | ctx32000 curves return R² < 0 because they are non-monotone (η dips then partly recovers toward batch 64), which a monotone rational function cannot fit. Observations. The non-monotone tails are not a defect to be smoothed away; they are the fingerprint of compute re-entering at the most bandwidth-stressed corner on the compute-richest GPU at the longest decode — precisely where the α < 1 relaxation is largest and the pure-bandwidth assumption is weakest. The model's failure is located exactly where its derivation says it should fail.

The bandwidth model fits 85% of curves at R² ≥ 0.7, predicts the knee from architecture alone at ρ = 0.84, and the residual is a single GPU-specific compute-overlap factor (α 0.52 on A100, 0.33 on H100) that both explains the hardware effect and predicts where the pure form breaks. The mechanism is not asserted; it is fit, validated out-of-fit against architecture, and falsified only where it predicts its own failure.


12. SS6. The HBM Bandwidth Axis and the GPU×Context Interaction

12.1 The main effect

The GPU cross-cut is the cleanest causal probe in V4: only the memory subsystem changes. Over 288 matched (model, decode, precision, context, batch > 1) pairs the H100 (HBM3, ≈ 3.35 TB/s) sustains mean η 0.801 against the A100-80GB's (HBM2e, ≈ 2.0 TB/s) 0.724, paired Δ +0.077 (bootstrap 95% CI [0.067, 0.088], Holm-adjusted Wilcoxon p ≈ 0; L1_main_effects.gpu_H100_vs_A100). The absolute ceiling moves further: peak batch-64 aggregate decode throughput is 8,879 tok/s on the H100 versus 4,684 tok/s on the A100, means 6,210 vs 2,884 (§17). The compute and bandwidth advantages separate cleanly: the ≈ 1.9× absolute-aggregate gain reflects compute lifting the whole curve and cancels out of the normalized η, while the +0.077 η gain is specifically the bandwidth-delayed knee.

12.2 The interaction: the H100 edge grows where bandwidth binds

The bandwidth mechanism predicts that a faster memory subsystem should help most where memory traffic is largest. The difference-in-differences test confirms it: the H100-minus-A100 η gap grows from +0.036 at ctx512 to +0.125 at ctx32000 (DiD +0.088, bootstrap 95% CI [0.059, 0.121], Holm p < 1e-5; L2_interactions.gpu_gap_x_context). It is the single largest interaction in the grid after context×batch. Observations. This is a sign-and-magnitude prediction the data confirms: at short context, where decode is weight-read-bound and both GPUs amortize easily, the HBM generation barely matters (+3.6 points); at 32,000 tokens, where each request drags a multi-gigabyte KV cache through memory every step, the faster HBM is worth +12.5 points of retained efficiency. The interaction is the mechanistic complement to the α factor (§11.3): the H100's lower compute-overlap-adjusted bandwidth pressure (α 0.33 vs 0.52) is exactly what holds its knee out, and it does so most at long context.

The H100 advantage is bandwidth, shown two ways: a +7.7-point main effect and a +8.8-point difference-in-differences that concentrates the edge at long context where bandwidth binds. Same models, same engine, only the HBM changes — and it changes η most where the mechanism says it should.


13. SS7. The Precision Tradeoff and the Context-Invariant fp8 Penalty

13.1 The main effect

fp8 weight quantization halves the bytes read for the weights each decode step, raising absolute throughput (single-stream fp8 is 1.40× fp16, §20), but it leaves the KV cache at fp16, so it reaches the KV-dominated regime sooner and is −0.054 less batch-robust: mean η 0.735 (fp8) vs 0.790 (fp16) over 288 matched pairs (bootstrap 95% CI [−0.061, −0.047], Holm p ≈ 0; L1_main_effects.precision_fp8_vs_fp16). In the bandwidth model this is immediate: fp8 halves W, so r = C·k/W doubles at every context, shifting the binding constraint toward KV and pulling the knee in.

13.2 The correction: the fp8 penalty is context-invariant, not context-escalating

The intuitive expectation is that the fp8 penalty should be largest where the KV cache is largest — long context — because that is where the unquantized KV dominates most. The maximalist difference-in-differences test falsifies this intuition: the fp8 η-penalty is −0.053 at ctx512 and −0.051 at ctx32000, DiD +0.002 (bootstrap 95% CI [−0.016, +0.021], Holm p = 0.88, not significant; L2_interactions.precision_gap_x_context). The penalty is context-invariant at ≈ 5 points. It is significantly smaller on the H100 (precision×GPU DiD +0.029, Holm p < 1e-5; L2_interactions.precision_gap_x_gpu): −0.069 on the A100 versus −0.040 on the H100, because the H100's compute headroom absorbs part of the fp8 cost. Observations. Why context-invariant? In the model, fp8 doubles r at every context uniformly, and the η-difference between an r and a 2r curve is roughly scale-stable across the breakdown region — so the penalty in η-points is approximately constant in context even though both curves shift. This is a genuine correction the maximalist analysis caught; the earlier main-effects-only writeup asserted a context-escalating penalty that the data does not support.

fp8 relocates the bottleneck onto the unquantized KV cache at a fixed ≈ 5-point batch-robustness cost that does not escalate with context — correcting a plausible intuition — but does shrink on a compute-richer GPU. The follow-up the sign motivates is fp8 KV-cache quantization (§27).


14. SS8. The Decode-Length Axis

The decode-length axis is the smallest physical effect and the only one whose practical reading is second-order. Over 288 matched (model, GPU, precision, context, batch > 1) cells, the longer decode (512) sustains mean η 0.780 against the shorter decode's (64) 0.745, Δ +0.036 (bootstrap 95% CI [0.028, 0.043], Holm p ≈ 0; L1_main_effects.decode_512_vs_64). The effect grows with context (decode×context DiD +0.042, Holm p = 0.001; L2_interactions.decode_gap_x_context). Observations. The sign is sensible: a longer decode runs more memory-bound steady-state steps against the one compute-leaning prefill, so a larger fraction of the cell's time is clean steady-state decode and a smaller fraction is the prefill-to-decode transition and warm-up, marginally raising retained per-request rate; and the effect is larger at long context because the prefill transition is proportionally costlier there. The magnitude is about half the precision effect and a sixth of the context effect (ANOVA partial η² 0.008, §15), so it shifts the knee by at most one batch step in the table (§22). It is reported in full because its sign is one more consistency check on the bandwidth picture (more decode steps, more amortization).

Decode length is a real but second-order modifier of the knee, larger at long context, consistent with more steady-state decode steps amortizing the fixed prefill. It is a consistency check, not a lever an operator would turn.


15. SS9. The Factorial ANOVA Lever Ranking

A Type-II ANOVA of η on all main effects plus 2-way interactions (576 batch > 1 cells, full-model R² = 0.949) ranks the levers quantitatively by partial η²:

Term partial η² reading
batch 0.509 the η argument itself (η varies along batch by construction)
context 0.266 the dominant design factor
context × batch 0.060 context reshapes the batch curve — the knee, formally
gpu 0.038 the HBM axis
precision 0.018 the fp8 axis
gpu × batch 0.013 the H100 holds the curve up — the α effect
model 0.009 the KV-architecture ordering
decode 0.008 second-order
model × batch 0.007 per-model curve shape
gpu × context 0.007 the H100-edge-grows-with-context interaction

These are L2_factorial_anova.top_terms. Observations. Three things. First, among the design factors (excluding batch, which is η's argument), context's partial η² of 0.266 is an order of magnitude above the next (GPU 0.038), precisely matching the paired-effect-size ordering and putting a number on "context dominates." Second, the largest interaction is context×batch (0.060) — the formal statement that context reshapes the amortization curve, i.e. moves the knee, which is the entire point of the report. Third, the GPU and model effects appear both as main effects and as ×batch interactions (gpu×batch 0.013, model×batch 0.007), the ANOVA's independent confirmation that these factors act on the curve shape (the α and KV-architecture mechanisms), not merely on the level.

The ANOVA ranks the levers and confirms the story numerically: context partial η² 0.27 ≫ gpu 0.038 > precision 0.018 > model 0.009, with the largest interaction (context×batch) being the knee itself. The variance decomposition agrees with the paired effect sizes and the bandwidth mechanism.


16. SS10. Cross-Model Robustness and the Architecture That Explains It

The three families share the amortization surface qualitatively — all collapse with context, all respond to GPU and precision in the same direction — with a small ordering the architecture predicts. Pairing η over every (GPU, decode, precision, context, batch > 1) cell, qwen2.5-7b sustains mean η 0.789, ahead of llama3.1-8b 0.750 and mistral-7b 0.748 (L8_per_model, model_comparison). The per-model reads below trace each family's profile back to its KV-byte architecture.

16.1 qwen2.5-7b — the most batch-robust, and why

Qwen's KV cache is 57,344 bytes/token2 × 28 layers × 4 KV heads × 128 head_dim × 2 bytes — less than half the 131,072 of the other two, because Qwen pairs a shallower stack (28 vs 32 layers) with an aggressive 7:1 grouped-query ratio (4 KV heads against 28 query heads). At any context its r = C·k/W is therefore roughly half, its η curve is the gentlest of the three, and its knee falls latest: across its eight (GPU, decode, precision) configurations the median knee at ctx32000 is batch 16, and its single most robust cell (H100 | d512 | fp16) does not break until batch 32 — the latest knee anywhere in the grid at that context. The Qwen-vs-Llama and Qwen-vs-Mistral contrasts are both Holm-significant (Δ +0.039, CI [0.027, 0.050]; Δ +0.041, CI [0.029, 0.052]; both Holm p ≈ 0). Observations. Qwen is not robust by accident or by tuning; it is robust because its architecture reads the fewest KV bytes per token, and the worked example (§9.4) predicts its batch-16 knee from that fact alone. Of the three, Qwen is the model an operator should reach for when long-context batched throughput matters.

16.2 llama3.1-8b — the baseline 4:1 GQA profile

Llama-3.1-8B has the standard 32-layer, 8-KV-head, 128-head-dim geometry (4:1 grouped-query), giving the 131,072-byte-per-token KV footprint and the largest weight matrix of the three (8.03 B params, W = 16.06 GB fp16). Its median knee at ctx32000 is batch 8, half of Qwen's, exactly as its 2.3× larger k predicts (the larger W partly offsets the larger k, but k dominates the ratio). Llama holds the free regime through ctx2048 on most configurations and only its H100 | d512 | fp16 cell reaches ctx32000 before breaking (knee 16 there, the H100 bandwidth advantage at the most-amortizing decode). Observations. Llama is the reference profile against which the other two are read: Qwen beats it by halving k, and Mistral matches it by sharing its KV geometry exactly.

16.3 mistral-7b — identical KV geometry, and the earliest knee in the grid

Mistral-7B-v0.3 shares Llama's KV geometry exactly (32 layers, 8 KV heads, 128 head_dim → 131,072 bytes/token) but has the smallest weight matrix of the three (7.25 B params, W = 14.50 GB fp16, the lowest). Same k, smaller W, so its r = C·k/W is slightly higher than Llama's at matched context, and its median ctx32000 knee is likewise batch 8. The grid's single earliest knee — batch 4, at mistral-7b | A100-80GB | d64 | fp8 | ctx32000 — is a clean mechanism consequence: the lowest-weight model, under fp8 (which halves W again to ≈ 7.25 GB), at the longest context, maximizes r (≈ 0.58 architectural), pushing the predicted knee to the lowest value anywhere in the grid (§9.4 arithmetic gives B* ≈ 2.5 pure, ≈ 3–4 α-corrected). Observations. Mistral demonstrates the weight side of r: with k fixed by geometry, the model carrying the fewest weight bytes breaks earliest, and fp8 — which cuts weight bytes — sharpens the effect. The earliest knee in the grid is exactly where the mechanism says it should be.

16.4 The Llama≈Mistral tie, reported as a tie

Llama and Mistral are a statistical tie in batch-robustness: Δ +0.002, bootstrap 95% CI [−0.004, 0.007], Holm p = 0.13 (not significant). This is the honest reading — the CI straddles zero and Holm rejects significance — and it is exactly what their identical KV geometry predicts: same layers, KV heads, and head_dim give the same k, and the small weight difference (16.06 vs 14.50 GB) is too minor to separate their mean η across the grid. Observations. The tie is not a null finding to be glossed; it is a prediction confirmed — two models with identical KV architecture tie in the metric the KV architecture governs, while the model with different KV architecture (Qwen) separates significantly from both. Forcing a Llama-vs-Mistral rank would invent a distinction the mechanism says should not exist.

Each family's robustness is an architectural readout: Qwen's half-sized KV footprint earns its 2× later knee, Mistral's smallest weight matrix earns the grid's earliest knee under fp8, and the architecturally identical Llama and Mistral tie. Model choice is the fourth axis of the same bandwidth mechanism, and the architecture predicts the ordering in advance.


17. SS11. The Aggregate-Throughput Scaling Ceiling

η measures retained per-request rate; the operator also cares about the absolute aggregate throughput batching buys. The aggregate speedup A(B)/A(1) at batch 64, averaged across the twelve configs per context, falls 42.2× (ctx512) → 39.6× → 34.1× → 24.9× (ctx32000) (L5_aggregate_scaling.mean_aggregate_speedup_at_b64_by_context), against the ideal 64×. Peak absolute aggregate decode throughput at batch 64 is 8,879 tok/s (H100) and 4,684 tok/s (A100); the means are 6,210 and 2,884 (Figure 4 traces the speedup curves). Observations. Even at the worst corner (32,000 tokens), batch 64 still buys 25× the single-stream aggregate throughput — batching is never worthless, it just stops being free. The distinction is the report's operational core: at short context the speedup is near-ideal (42× of 64×, η ≈ 0.66 at batch 64), while at long context the 25× speedup comes at η ≈ 0.39, i.e. the operator pays a 2.6× per-request latency inflation for it. The aggregate ceiling and the efficiency knee are two readings of one surface: aggregate speedup = B · η(B), so the speedup plateaus exactly as η falls, and the plateau is steeper at long context.

Batching always raises aggregate throughput (25–42× at batch 64), but at long context that throughput is bought with a 2.6× per-request latency inflation. The aggregate ceiling and the efficiency knee are the same surface read two ways: speedup is B·η, so it plateaus precisely as amortization breaks down.


18. SS12. Memory-Bandwidth Utilization: Confirming Decode Is Bandwidth-Bound

The bandwidth model assumes decode is HBM-bandwidth-bound; this section measures whether it is. At batch 1 each decode step reads W + C·k bytes and emits one token in 1/t(1) seconds, so achieved bandwidth is t(1)·(W + C·k), and the memory-bandwidth utilization (MBU) is that over the nominal peak HBM. For qwen2.5-7b | H100 | fp16 the achieved bandwidth is ≈ 2,100–2,200 GB/s across contexts and MBU is 0.63–0.66 (L4_bandwidth_mbu.detail); averaged over all models and precisions at batch 1, ctx512, MBU is 0.535 (A100) and 0.539 (H100). Observations. An MBU of 0.5–0.66 at batch 1 places decode firmly in the memory-bound regime — the GPU spends most of each step moving bytes, not computing — which is the premise the entire bandwidth model rests on. The MBU is roughly context-stable (the achieved bandwidth is a near-constant sustained ≈ 2.1 TB/s on the H100), confirming that the context effect on η is not an MBU artifact but a change in the bytes-per-token composition (more KV, same sustained bandwidth, fewer tokens per byte). The nominal peaks are vendor spec, not measured (Appendix D), so MBU is reported as achieved/peak with that caveat; the cross-GPU achieved-bandwidth ratio is a measured cross-check that does not depend on the exact peak.

Measured MBU of 0.5–0.66 at batch 1 confirms decode is bandwidth-bound — the premise of the model. The achieved bandwidth is near-constant across context, so the context effect is a change in bytes-per-token composition (KV growth), not a utilization artifact.


19. SS13. Measurement Reliability

The three timed reps per cell let measurement noise be quantified directly. The per-cell coefficient of variation of aggregate decode throughput (std/mean over 3 reps) has median 0.4%, mean 0.6%, 95th percentile 2.0%, and maximum 12.1% (L6_variance). By batch it rises gently from 0.6% at batch 1 to 1.05% at batch 64; by context it is flat at 0.6–0.7%. Observations. The surface is measured cleanly: a median 0.4% CV means the η curves and the knee crossings are not noise-limited, and the effect sizes (context 0.27, GPU 0.077, precision 0.054) are one to two orders of magnitude above the measurement noise floor. The mild rise at batch 64 (1.05%) reflects the larger scheduling and memory-contention variance at maximum batch; the 12.1% maximum is a single long-context, large-batch cell where the few-second timing window is most sensitive to contention. None of this threatens the headline: even the worst single cell's noise is small against the axis effects, and the bootstrap CIs already fold the cell-to-cell variance into the reported intervals.

Median per-cell CV of 0.4% (worst 12%) places the measurement noise one to two orders of magnitude below every reported effect. The surface is not noise-limited; the knees and curves are real.


20. SS14. The Single-Stream Baseline Surface

The batch-1 per-request rate t(1) is the amortization-free baseline, and its structure is a result in its own right. Three readings: the single-stream GPU speedup (H100/A100 at batch 1) is 1.68× averaged over the grid; the single-stream fp8 speedup (fp8/fp16) is 1.40×; and the prefill-amortization penalty t(1, ctx32000)/t(1, ctx512) is ≈ 0.94 (e.g. 0.939 for qwen2.5-7b | H100 | fp16; L7_single_stream). Observations. The third number is the keystone of the whole report: at batch 1 the per-request decode rate is only ~6% slower at 32,000 tokens than at 512. Context barely touches the single-stream rate — so the entire ≈ 27-point context effect on η is a reshaping of the batch curve, not a slowdown of decode. This is the quantitative justification for calling context an amortization lever (§3.2, §9.1). The H100's 1.68× single-stream edge (larger than its +7.7-point η edge) and fp8's 1.40× edge (against its −5.4-point η cost) make explicit that absolute speed and batch-robustness are distinct axes: the H100 is both faster and more robust; fp8 is faster but less robust.

Context barely changes the single-stream rate (penalty 0.94), so the whole context effect on amortization is curve-shape, not slowdown — the keystone of the report. Absolute speed (H100 1.68×, fp8 1.40×) and batch-robustness (H100 +7.7pts, fp8 −5.4pts) are distinct axes.


21. SS15. The Continuous-Knee Regression

Beyond the discrete knee, interpolating the exact batch at which η crosses 0.65 (linear in log₂ B) gives a continuous knee per curve, and regressing log₂(knee) on log₂(context) per configuration gives the compression rate. The median slope is −0.64 (mean −0.50; L9_knee_regression), i.e. the knee batch falls as roughly context^(−0.64). Observations. The pure-bandwidth model predicts the knee B* = (0.35 + r)/(0.65·r) with r ∝ C, which for r not-too-small gives B* ∝ 1/C — slope −1. The observed −0.64 is shallower than −1, and consistently so: this is the same compute-overlap relaxation (α < 1) seen in §11, which softens the context dependence because compute increasingly overlaps the growing KV read. fp8 configs show shallower slopes still (e.g. −0.33 vs −0.71 for the matched fp16 on qwen | A100 | d512), consistent with fp8 already sitting deeper in the KV-bound regime where the knee is less context-elastic. Observations. The regression turns the qualitative "knee marches left" into a rate: roughly a halving of the useful batch per ~3× of context, slower than the −1 a pure-bandwidth model would give, with the gap again being compute overlap.

The knee compresses as context^(−0.64), a measured rate between the pure-bandwidth prediction (−1) and context-independence (0); the gap from −1 is the compute-overlap relaxation, the same α effect in a third guise.


22. SS16. The Knee Table — Where Batching Stops Paying

The knee table is the operational deliverable: for each of the 24 (model, GPU, decode, precision) configurations, the smallest batch with η < 0.65 at each context (full table, Appendix C). Reading down the context columns the knee compresses monotonically — modal entry "never" at ctx512 (14/24), still 14 "never" at ctx2048, clustering at batch 16–32 at ctx8192, and uniformly finite (≤ 32, mostly 8) at ctx32000. Reading across, two sub-structures recur: fp8 rows knee at equal-or-smaller batches than their fp16 partners (the §13 penalty), and H100 rows knee at equal-or-larger batches than their A100 partners (the §12 advantage), with the longest-held configs being H100 / decode-512 / fp16, free through ctx8192 for all three models. Observations. The table answers the operator's question directly — "at my model, GPU, precision, and prompt length, how deep can I batch before per-request latency degrades past a third" — and its most actionable row-family is ctx32000, where the answer is uniformly small (4–32, mostly 8). It is the per-configuration lookup that the bandwidth model (§11) generates and the cross-cut statistics (§10, §12, §13, §16) summarize.

The knee table is the deployment lookup the whole report generates: a per-configuration maximum-useful-batch as a function of prompt length, predicted by the bandwidth model and confirmed cell by cell.


23. SS17. Data Verification

Integrity rests on four independent main-thread checks. Completion and balance: exactly 672 rows, all status = "ok", zero error cells; 224 per model, 336 per GPU/decode/precision, all contexts and batches present. Physical sanity: no zero/negative throughput cell, and every cell's actual constructed context equals its target exactly (zero context_actual ≠ context mismatches across all 672 — the exact-token-id construction worked on every tokenizer). Dual-path agreement: the local results.json and the insurance modal.Dict each held all 672 cells and agreed. Analyzer faithfulness: one config's η curve and knee were recomputed from the raw per-request throughputs and matched the analyzer exactly (qwen2.5-7b|H100|d64|fp16|ctx32000: η = {1.00, 0.954, 0.864, 0.720, 0.562, 0.402, 0.388}, knee 16, identical), and the maximalist roofline fit was sanity-checked (median R² 0.93, the four negative-R² curves confirmed to be exactly the non-monotone H100/d512/ctx32000 tails, §11.3). Observations. Each check targets a distinct failure mode — a dropped axis, a truncating tokenizer, a partial save, a metric bug — and all passed, which is why the headline numbers are stated without hedging.

The data is verified four independent ways with persistence proven on two channels; the maximalist analysis additionally cross-checked the roofline fit and located its only failures exactly where the mechanism predicts. The 672/672 ok_rate is the headline; the verification is what licenses it.


24. SS18. Cross-Run Lineage — V3 Closed-Loop versus V4 Offline

V3 and V4 are two measurements of one physics from opposite ends. V3's closed-loop ladder asks what an operator sees under arrival-driven concurrency, folding amortization physics with scheduler behavior and queueing; V4's static-batch grid asks where the amortization ceiling sits when every slot is full of equal work, exposing the physics directly. They are consistent: V3 found a gradual, workload-shaped decline holding past the in-process baseline's N = 2 collapse and fanning out at N = 16–32; V4 finds the static-batch ceiling fans out exactly where context stresses KV bandwidth, with the knee in the same batch-16-to-32 region at moderate context and compressing to batch 8 only at the extreme 32,000-token corner V3's workload set did not isolate. Observations. The protocol difference is why both are needed and why neither substitutes for the other: a closed-loop operator will see the knee at or before V4's static-batch knee, because ragged batches, preemption, and queueing can only erode the idealized ceiling. So V4's knee is an upper bound on a real scheduler's knee — the operationally conservative direction. V3 measured where two real schedulers sit; V4 measures how high they could reach and supplies the predictive model for the ceiling.

V4 does not revise V3; it grounds it. The closed-loop boundary sits beneath the static-batch ceiling, the same KV-bandwidth mechanism explains both, and V4 adds the predictive model. The lineage is a bracket, not a contradiction.


25. Limitations

The substrate bounds the claims. Three models, one scale band (7–8B; not a size sweep — V3 carries the size gradient; nothing about sub-7B or 13B+). Two GPUs, one engine version (A100-80GB, H100; vLLM 0.10.2 V1; the bandwidth axis is a two-point contrast, and the α factor is GPU-pair-specific, though the mechanism is engine-independent). fp8 weight, not fp8 KV (the KV cache is fp16 in every cell — the cause of the fp8 penalty's sign and the motivation for the fp8-KV follow-up; no fp8-KV claim is made). Static batch, not closed-loop (every cell is one fixed equal-length batch; the amortization ceiling, not a served latency; no p95 or throughput-under-traffic figure — §24). Equal-length prompts (real batches are ragged; the equal-length static batch is the best case, so the measured knee is an upper bound on a real scheduler's). Greedy fixed-length decode (no sampling or early stopping). The 32,000-token ceiling (shared across models with different position-embedding limits; the 131,072-token Llama regime is untested). The roofline model's scope (the pure form is bandwidth-only; it fits 85% of curves but breaks on the non-monotone H100/d512/ctx32000 corner where compute overlap dominates, and α is an empirical per-GPU factor, not derived from first principles). Nominal peak HBM (MBU uses vendor-spec peak bandwidth, not a measured bandwidth ceiling). Observations. None of these threatens the headline, which is a predictive mechanism and a ranking on a defined substrate, not a universal constant. The limitations bound generalization, enumerated next.

Every limitation is a scope boundary, not a measurement flaw. The grid is internally complete and verified; the bandwidth model is validated where it applies and honest about where it breaks; what is not licensed is extrapolation past the three models, two GPUs, one engine, and static-batch protocol that ran.


26. Forbidden Claims

Explicitly not supported by V4:

  1. Any closed-loop serving latency or p95 claim — V4 is offline static-batch decode throughput (§6.3, §24).
  2. Generalization beyond 7–8B — three 7–8B families only (§25).
  3. Generalization beyond the A100-80GB / H100 pair — a two-point bandwidth contrast (§12, §25).
  4. Any fp8 KV-cache claim — only fp8 weight quantization was run; KV is fp16 everywhere (§6.2, §13).
  5. An engine ranking or any cross-engine claim — one engine (vLLM 0.10.2); engine comparison is V3's territory under a different protocol.
  6. A ragged-batch or production-scheduler knee — the measured knee is the equal-length static-batch ceiling, an upper bound on a real scheduler's knee, never the served knee (§24).
  7. A first-principles α — the compute-overlap factor is an empirical per-GPU fit, not derived; the bandwidth model is licensed as predictive-up-to-α, not exact.
  8. Universal verbs — "always", "proves", "eliminates", "guarantees" are banned; the licensed reading is "on the 672 cells measured, the knee is governed by KV read bandwidth, predicted from architecture at ρ = 0.84 up to a GPU-specific compute-overlap factor."
  9. A Llama-versus-Mistral robustness ordering — a statistical tie (Holm p = 0.13); reported as a tie (§16).

The result is strong inside its box and silent outside it. The box is three 7–8B models, two GPUs, one engine, static-batch offline throughput, and a bandwidth model predictive up to an empirical compute-overlap factor; every headline claim is bounded to that box.


27. Future Work

The sign of the fp8 penalty (§13) and its context-invariance motivate an fp8 KV-cache sweep (V0-only at this engine version) to test whether quantizing the KV cache moves the knee back out and whether the penalty's context-invariance survives. The two-point GPU contrast (§12) invites a third bandwidth point (L40S, H200) to turn the bandwidth axis into a slope and test whether the knee — and α — track HBM bandwidth quantitatively. The 32,000-token ceiling (§25) leaves the 128K-context regime on Llama unmeasured, where r is far larger and the knee should compress toward batch 2–4. The static-batch ceiling (§24) should be tied to the closed-loop served knee by re-running the same points under V3's concurrency ladder, measuring the gap between ceiling and operating point directly. The α factor (§11.3) should be derived, not just fit, from a compute-and-bandwidth roofline that predicts the overlap from arithmetic intensity — turning the predictive-up-to-α model into a fully first-principles one. Finally, the model surface should be extended across scale to test whether the KV-byte architectural prediction (§16) holds from 1B to 70B.


28. Conclusion

V4 turns the question "where does continuous batching stop amortizing" from a tabulation into a prediction. Across 672 verified cells — three 7–8B models, two datacenter GPUs, two decode lengths, two precisions, four contexts, seven batches, three reps, ok_rate 1.0, median per-cell CV 0.4% — a first-principles bandwidth model η(B) = (1+r)/(1+Br) with r = C·k/W fits the measured efficiency curves at median R² 0.93 and predicts the observed breakdown knee from architecture alone at Spearman ρ = 0.84, deviating only by a GPU-specific compute-overlap factor α (0.52 on A100, 0.33 on H100) that itself explains the hardware effect and locates the model's only failure (the non-monotone H100/d512/ctx32000 tails). Every axis sign falls out of that one mechanism: context dominates (mean η 0.865 → 0.599, paired Δ −0.266, ANOVA partial η² 0.27); the H100 sustains η +0.077 with an advantage that grows to +0.125 at long context where bandwidth binds; fp8 is −0.054 less batch-robust at a context-invariant cost; a longer decode is +0.036; and qwen2.5-7b's half-sized 57,344-byte KV footprint predicts and matches its 2× later knee against a Llama ≈ Mistral tie. Decode is confirmed bandwidth-bound (MBU ≈ 0.63), the aggregate ceiling falls 42× → 25× across context, and the knee compresses as context^(−0.64). The claims are sized to the substrate — a best-case static-batch amortization ceiling for 7–8B models on two GPUs under one engine — and the contribution is a usable one: from a model's config and a GPU's datasheet, the batch at which continuous batching stops paying at a given context can be predicted in advance. The next measurements extend it toward fp8-KV, a third bandwidth point, the 128K-context regime, a derived α, and the closed-loop served knee beneath the ceiling.


29. References

  • TR164 V1/V2/V3 — research/tr164/DRAFT_Technical_Report_164{,_V2,_V3}.md (lineage, cited not recomputed; V3 forecasts this measurement).
  • TR165 — interpreter-axis GIL ablation companion (mechanism family).
  • Run artifact — research/tr164/results/modal_amortization/full_xmodel_672/results.json (672 cells).
  • Analysis artifacts — research/tr164/amortization_stats.json (main effects), research/tr164/amortization_full_stats.json (nine-layer maximalist analysis, all V4 numbers).
  • Analyzers — research/tr164/analyze_amortization.py, research/tr164/analyze_amortization_full.py.
  • Figure generators — papers/serving_stack_physics/v2_boundary/make_amortization_figure.py, …/make_amortization_figures_full.py.
  • Architectures — verified from HF config.json / safetensors index for Qwen/Qwen2.5-7B-Instruct, meta-llama/Llama-3.1-8B-Instruct (unsloth mirror), mistralai/Mistral-7B-Instruct-v0.3 (Appendix D).

30. Appendix A — Reproducibility Pins

  • Engine: vLLM 0.10.2, V1 engine, enforce_eager = False, gpu_memory_utilization = 0.90, disable_log_stats = True.
  • Dependencies: transformers < 5, huggingface_hub; Python 3.12 Modal debian_slim image.
  • Models: Qwen/Qwen2.5-7B-Instruct, meta-llama/Llama-3.1-8B-Instruct (gated), mistralai/Mistral-7B-Instruct-v0.3 (gated); HF token via Modal secret.
  • Precisions: fp16 = dtype=float16; fp8 = dtype=float16, quantization="fp8" (weight quant; KV fp16).
  • Sampling: temperature = 0, min_tokens = max_tokens ∈ {64, 512} (exact-length greedy).
  • Prompts: exact prompt_token_ids of length ctx ∈ {512, 2048, 8192, 32000}; context_actual == context verified for all 672.
  • max_model_len: ctx + decode, capped under each model's max_position_embeddings.
  • Hardware: Modal gpu="A100-80GB" and gpu="H100", both pools concurrent.
  • Reps: 1 warm-up + 3 timed; aggregate-throughput mean and population std recorded.
  • Seeds: bootstrap seed 164, 10,000 resamples.

31. Appendix B — Statistical Methodology

η(B) = t(B)/t(1); knee = smallest B > 1 with η < 0.65 (Appendix C), continuous knee = interpolated η = 0.65 crossing in log₂ B (§21). Bandwidth model η = (1+r)/(1+Br) fit by least squares per curve (scipy curve_fit, bounded r > 0); R² against the curve mean; architectural r = C·k/W; α = r_fit/r_arch; architectural knee B* = (0.35+r)/(0.65r); arch-vs-observed knee correlation by Spearman ρ (log₂ Pearson reported too). Axis main effects and model comparisons are paired Wilcoxon signed-rank (batch 1 excluded) with Holm-Bonferroni within family and reproducible percentile-bootstrap 95% CIs (seed 164, 10,000 resamples). Interactions are difference-in-differences paired Wilcoxon, Holm-corrected, plus a Type-II factorial ANOVA (statsmodels) of η on all main effects + 2-way interactions with partial η² per term. Implementation: analyze_amortization_full.pyamortization_full_stats.json.

32. Appendix C — The Full 24-Configuration Knee Table

Knee = smallest batch with η < 0.65; "—" = never reached through batch 64. Source: amortization_stats.json knee_table (24/24 rows verified against the JSON, 0 mismatches).

Config (model | GPU | decode | precision) ctx 512 ctx 2,048 ctx 8,192 ctx 32,000
qwen2.5-7b | A100-80GB | d64 | fp16 64 32 16
qwen2.5-7b | A100-80GB | d64 | fp8 64 32 32 16
qwen2.5-7b | A100-80GB | d512 | fp16 32 16
qwen2.5-7b | A100-80GB | d512 | fp8 64 64 32 16
qwen2.5-7b | H100 | d64 | fp16 64 16
qwen2.5-7b | H100 | d64 | fp8 64 32 16
qwen2.5-7b | H100 | d512 | fp16 32
qwen2.5-7b | H100 | d512 | fp8 32 16
llama3.1-8b | A100-80GB | d64 | fp16 16 8
llama3.1-8b | A100-80GB | d64 | fp8 64 32 16 8
llama3.1-8b | A100-80GB | d512 | fp16 32 8
llama3.1-8b | A100-80GB | d512 | fp8 64 64 16 8
llama3.1-8b | H100 | d64 | fp16 32 64 8
llama3.1-8b | H100 | d64 | fp8 64 64 64 8
llama3.1-8b | H100 | d512 | fp16 16
llama3.1-8b | H100 | d512 | fp8 64 8
mistral-7b | A100-80GB | d64 | fp16 64 32 8
mistral-7b | A100-80GB | d64 | fp8 64 32 16 4
mistral-7b | A100-80GB | d512 | fp16 16 8
mistral-7b | A100-80GB | d512 | fp8 32 64 16 8
mistral-7b | H100 | d64 | fp16 64 8
mistral-7b | H100 | d64 | fp8 64 64 16 8
mistral-7b | H100 | d512 | fp16 16
mistral-7b | H100 | d512 | fp8 64 8

Observations. The ctx-32,000 column is uniformly small (4–32, mostly 8 for Llama/Mistral and 16 for Qwen — the KV-architecture gradient, §16); within each (model, GPU, decode) triple the fp8 row knees at an equal-or-smaller batch than fp16 (the §13 penalty); and the H100/d512/fp16 rows hold "—" through ctx 8,192 for all three models (the §12 bandwidth advantage at the most-amortizing decode length).

33. Appendix D — Architecture and Bandwidth Constants

Verified from HF config.json and safetensors index (param count = total_size / 2 for bf16 checkpoints). KV cache is fp16 (2 bytes) in every cell.

Model layers KV heads head_dim params KV bytes/token k weight bytes W (fp16)
qwen2.5-7b 28 4 128 7,615,616,512 57,344 15.23 GB
llama3.1-8b 32 8 128 8,030,261,248 131,072 16.06 GB
mistral-7b 32 8 128 7,248,023,552 131,072 14.50 GB

k = 2 (K,V) × layers × KV-heads × head_dim × 2 bytes; W_fp8 ≈ W_fp16 / 2 (≈ 1 byte/param, approximate). Nominal peak HBM (vendor spec, not measured): A100-80GB 2,039 GB/s (HBM2e), H100 3,350 GB/s (HBM3). Sources: Qwen/Qwen2.5-7B-Instruct, meta-llama/Llama-3.1-8B-Instruct (unsloth mirror, architecture fields unmodified), mistralai/Mistral-7B-Instruct-v0.3 config.json + model.safetensors.index.json.

34. Appendix E — Per-Curve Roofline-Fit Sample

Representative rows from L3_roofline.per_curve (fitted r, architectural r, compute-overlap α, fit R², architectural-predicted knee, observed interpolated knee). Full 96-curve table in the JSON.

Curve r_fit r_arch α knee_pred (arch) knee_obs
qwen2.5-7b | H100 | d64 | fp16 | ctx512 0.0019 high 281
qwen2.5-7b | H100 | d64 | fp16 | ctx8192 0.031 high 19 41
qwen2.5-7b | H100 | d64 | fp16 | ctx32000 0.046 0.120 0.38 high 6 11
llama3.1-8b | H100 | d512 | fp16 | ctx32000 < 0 (non-monotone) 16

Observations. The architectural knee tracks the observed knee in order and scale across the context range (281 → 19 → 6 vs — → 41 → 11), under-predicting by the compute-overlap factor (α ≈ 0.38 here, near the H100 median 0.33); the negative-R² row is one of the four non-monotone H100/d512/ctx32000 tails where the monotone form cannot fit (§11.3).


35. Appendix F — Master Named-Value Reference (paper-substrate table)

This appendix is the mining surface for drafting the Tier-2 paper: every quantity the paper would cite, with its exact value, source key, and gloss. Pull values from here; do not re-derive. Keys resolve in research/tr164/amortization_stats.json (prefix m.) or research/tr164/amortization_full_stats.json (prefix f.).

Grid / provenance.

Name Value Source Gloss
gridCells 672 run JSON length total cells, ok_rate = 1.0
timedGenerations 2,016 672 × 3 reps + 672 warm-ups
engine vLLM 0.10.2 V1 f.provenance.engine production engine, CUDA graphs on
gpus A100-80GB, H100 HBM2e ≈ 2.0 TB/s, HBM3 ≈ 3.35 TB/s
models qwen2.5-7b, llama3.1-8b, mistral-7b 7.62 / 8.03 / 7.25 B params
axes 3 model × 2 gpu × 2 decode × 2 prec × 4 ctx × 7 batch contexts {512, 2048, 8192, 32000}; batches {1…64}
wallClockMin ≈ 63 run log both GPU pools concurrent, ~10–13 cells/min
runDir research/tr164/results/modal_amortization/full_xmodel_672/ results.json (294 KB)

Main effects on η (paired Wilcoxon, Holm, bootstrap 95% CI, seed 164).

Name Δη 95% CI η lo→hi n Holm p Source
effContext (ctx32000 vs 512) −0.266 [−0.294, −0.240] 0.865→0.599 144 <0.001 f.L1_main_effects.context_32000_vs_512
effGpu (H100 vs A100) +0.077 [0.067, 0.088] 0.724→0.801 288 <0.001 …gpu_H100_vs_A100
effDecode (512 vs 64) +0.036 [0.028, 0.043] 0.745→0.780 288 <0.001 …decode_512_vs_64
effPrecision (fp8 vs fp16) −0.054 [−0.061, −0.047] 0.790→0.735 288 <0.001 …precision_fp8_vs_fp16

Interactions (difference-in-differences, Holm).

Name DiD 95% CI gap lo→hi Holm p Source
intGpuByContext +0.088 [0.059, 0.121] +0.036→+0.125 4e-6 f.L2_interactions.gpu_gap_x_context
intDecodeByContext +0.042 [0.021, 0.064] +0.020→+0.062 0.001 …decode_gap_x_context
intPrecisionByGpu +0.029 [0.017, 0.042] −0.069→−0.040 4e-6 …precision_gap_x_gpu
intPrecisionByContext (NULL) +0.002 [−0.016, 0.021] −0.053→−0.051 0.88 …precision_gap_x_context

Factorial ANOVA partial η² (full-model R² = 0.949). batch 0.509 · context 0.266 · context×batch 0.060 · gpu 0.038 · precision 0.018 · gpu×batch 0.013 · model 0.009 · decode 0.008. Source f.L2_factorial_anova.top_terms.

Model comparison (paired η, Holm).

Name Δη 95% CI η a→b Holm p Source
mdlQwenVsLlama +0.039 [0.027, 0.050] 0.789 vs 0.750 <0.001 m.model_comparison_eta
mdlQwenVsMistral +0.041 [0.029, 0.052] 0.789 vs 0.748 <0.001 m.model_comparison_eta
mdlLlamaVsMistral (TIE) +0.002 [−0.004, 0.007] 0.750 vs 0.748 0.13 m.model_comparison_eta

Bandwidth model (L3, the centerpiece).

Name Value Source Gloss
modelForm η(B) = (1+r)/(1+Br), r = C·k/W f.L3_roofline.model_form derived §9
fitR2Median 0.928 f.L3_roofline.fit_r2_median 57/96 > 0.9, 82/96 > 0.7
fitR2P10 0.655 f.L3_roofline.fit_r2_p10 10th percentile
archKneeSpearman 0.84 f.L3_roofline.arch_knee_vs_obs_knee_spearman zero-param prediction, n=65
archKneePearsonLog2 0.79 f.L3_roofline.arch_knee_vs_obs_knee_pearson_log2
alphaA100 / alphaH100 0.524 / 0.330 f.L3_roofline.alpha_compute_overlap_by_gpu compute-overlap factor r_fit/r_arch
kvBytesQwen 57,344 f.L3_roofline.kv_bytes_per_token 2×28×4×128×2
kvBytesLlamaMistral 131,072 f.L3_roofline.kv_bytes_per_token 2×32×8×128×2
wBytesFp16 15.23 / 16.06 / 14.50 GB f.L3_roofline.weight_bytes_fp16 qwen / llama / mistral

Bandwidth-bound confirmation, scaling, reliability, single-stream, knee rate.

Name Value Source Gloss
mbuBatch1Ctx512 0.535 (A100), 0.539 (H100) f.L4_bandwidth_mbu.mean_mbu_batch1_ctx512_by_gpu decode is bandwidth-bound
mbuQwenH100Fp16 0.63–0.66 f.L4_bandwidth_mbu.detail per-context, near-constant
aggSpeedupAt64 42.2 / 39.6 / 34.1 / 24.9 f.L5_aggregate_scaling.mean_aggregate_speedup_at_b64_by_context by ctx512/2048/8192/32000
peakAggTpsA100 / H100 4,684 / 8,879 f.L5_aggregate_scaling.peak_aggregate_tps_by_gpu max, batch 64
cvMedian / cvP95 / cvMax 0.4% / 2.0% / 12.1% f.L6_variance per-cell, 3 reps
singleStreamGpuSpeedup 1.68× f.L7_single_stream H100/A100 at batch 1
singleStreamFp8Speedup 1.40× f.L7_single_stream fp8/fp16 at batch 1
prefillPenalty ≈ 0.94 f.L7_single_stream t1(ctx32000)/t1(ctx512) — keystone
kneeSlope −0.64 (median) f.L9_knee_regression log₂(knee) per log₂(context)
meanEtaByCtxB16 0.897 / 0.838 / 0.686 / 0.442 m.headline.mean_eta_by_context_batch ctx512→32000 at batch 16
kneeDistCtx32000 all 24 break; 13 by b8 m.headline.knee_distribution_by_context {b8:13, b16:9, b32:1, b4:1}

36. Appendix G — Figure Manifest (caption-ready)

Four figures in papers/serving_stack_physics/figures/, each traceable to the stats JSON and reproducible from its generator. Caption text is paper-ready; adapt as needed.

Fig File Generator Caption / what it supports
1 amortization_eta_vs_batch.png make_amortization_figure.py "Amortization efficiency η(B)=t(B)/t(1) vs batch, one curve per context, A100 (left) vs H100 (right), averaged over the 12 (model, decode, precision) configs; 0.65 breakdown line dashed." Supports §3.2, §10 (context lever + GPU effect, visual).
2 fig_roofline_fit.png make_amortization_figures_full.py "Observed η (points) vs the fitted bandwidth model η=(1+r)/(1+Br) (lines) for four representative configs; per-curve R² annotated." Supports §11.1 (the form fits, median R²=0.93).
3 fig_arch_knee_pred.png make_amortization_figures_full.py "Zero-free-parameter architectural knee B*=(0.35+r)/(0.65r), r=C·k/W, vs observed interpolated knee (log-log, n=65 breaking curves, Spearman ρ=0.84); points sit above y=x by the compute-overlap factor α." Supports §11.2 (the predictive validation — the paper's headline figure).
4 fig_agg_speedup.png make_amortization_figures_full.py "Aggregate throughput speedup A(B)/A(1) vs batch by context, A100 vs H100, with the linear-ideal reference; the plateau is the throughput ceiling." Supports §17 (scaling ceiling 42×→25×).

Suggested paper figure order: Fig 1 (the phenomenon) → Fig 3 (the predictive result, the money figure) → Fig 2 (the fit underpinning it) → Fig 4 (the throughput consequence). Fig 3 is the one to lead the results with.


37. Appendix H — Paper Claim Ladder (supported / licensed / forbidden)

Mirrors the bridge-paper claim-ladder convention so the paper-writing agent can lift bounded sentences directly. Tier 1 is verbatim-adaptable; Tier 2 must carry its caveat; Tier 3 must never be stated.

Tier 1 — Supported (state plainly, bounded to the substrate):

  1. A one-parameter bandwidth model η(B)=(1+r)/(1+Br) with r=C·k/W fits the 96 measured efficiency curves at median R²=0.93 and predicts the observed breakdown knee from model architecture alone (zero fitted parameters) at Spearman ρ=0.84, on a 672-cell offline static-batch grid (three 7–8B models, A100-80GB + H100, vLLM 0.10.2).
  2. Context length is the dominant lever: mean η falls 0.865→0.599 from 512 to 32,000 tokens (paired Δ=−0.266, 95% CI [−0.294,−0.240], Holm p<0.001), an order of magnitude above any other design factor (ANOVA partial η²: context 0.27 vs GPU 0.038).
  3. The H100 sustains η +0.077 above the A100 (CI [0.067,0.088]), and the advantage grows with context (difference-in-differences +0.088, Holm p<1e-5) — the bandwidth mechanism's interaction prediction confirmed.
  4. fp8 weight quantization is 1.40× faster single-stream but 0.054 less batch-robust (CI [−0.061,−0.047]); the penalty is context-invariant (DiD +0.002, Holm p=0.88), correcting the intuition that it should grow with context.
  5. KV-bytes-per-token predicts the cross-model robustness order: Qwen (57,344 B/tok) knees at batch 16 at 32K context vs Llama/Mistral (131,072 B/tok) at batch 8; the architecturally identical Llama and Mistral tie (Holm p=0.13).
  6. Decode is memory-bandwidth-bound: measured MBU ≈ 0.54–0.66 at batch 1.
  7. The aggregate-throughput speedup ceiling at batch 64 falls from 42× to 25× across the context range; batching never stops raising throughput, only stops being free.

Tier 2 — Licensed with caveat (state only with the bracketed qualifier):

  • The knee prediction is predictive up to a GPU-specific compute-overlap factor α (0.52 A100, 0.33 H100), which is an empirical per-GPU fit, not derived — the zero-parameter architectural knee is conservative (~2× low) and α closes the gap.
  • The bandwidth model fits 85% of curves (82/96 at R²≥0.7); it breaks on the four H100/d512/ctx32000 non-monotone curves where compute overlap dominates — state "fits the monotone bandwidth regime," not "fits all curves."
  • The measured knee is a static-batch, equal-length-prompt ceiling = an upper bound on a real (ragged-batch, closed-loop) scheduler's knee — never present it as the served knee.
  • Peak HBM bandwidth in MBU is vendor spec, not measured.

Tier 3 — Forbidden (never state; from §26): closed-loop serving latency / p95; generalization beyond 7–8B; generalization beyond the A100-80GB/H100 pair; any fp8-KV-cache claim (only fp8 weight quant ran); any engine ranking / cross-engine claim (one engine); a ragged-batch/production-scheduler knee; universal verbs ("always", "proves", "eliminates", "guarantees"); a Llama-vs-Mistral robustness ordering (it is a tie).


38. Appendix I — TR→Paper Section Map and Usage Note

Usage. This TR is the verified substrate for the Tier-2 paper "When Continuous Batching Stops Amortizing." An agent drafting the paper should: (i) pull every number from Appendix F (each traces to a JSON key — never re-derive); (ii) use the Tier-1 sentences in Appendix H as the claim skeleton, attach Tier-2 caveats where flagged, and never state a Tier-3 claim; (iii) place figures per Appendix G (lead results with Figure 3); (iv) draw the reproducibility appendix from §30, Appendix D, and the execution arc §8; (v) keep the venue-neutral exposure discipline the project's pre-push leak-grep rule defines (that rule enumerates the forbidden strings; this TR passes it clean and the paper must too). The paper's contribution is the predictive bandwidth model — lead with the prediction (§11, Fig 3), not the descriptive surface.

Section map.

Paper section Draw from
Abstract §1, §3 (exec summary), Appendix F headline rows
Introduction / motivation §4 (V1→V4 arc, description→prediction), §5 (RQs)
Related work §4.2–4.3 positioning; the closed-loop-vs-offline distinction (§24); the V3 forecast this answers
Method §6 (cell contract, metric, model, stats), §6.6 (measurement protocol), Appendix D (architecture constants)
Method — reproducibility/rigor §8 (full Modal execution arc — the V0-rejection, fp8-scoping, validate-before-pay, orphan-billing, dual-path persistence), §30 (pins)
Results — the model (lead) §9 (derivation), §11 (fit + architectural prediction + α), §9.4 (worked example), Figs 2–3
Results — context §10, §3.2–3.3, Fig 1, §21 (knee rate), §22 (knee table)
Results — hardware §12 (GPU + interaction), §18 (MBU), §20 (single-stream)
Results — precision §13 (fp8, the corrected context-invariance)
Results — variance decomposition §15 (ANOVA lever ranking)
Results — cross-model §16 (per-model deep reads + architecture)
Results — throughput §17 (scaling ceiling), Fig 4
Reliability §19 (CV), §23 (verification)
Discussion / limitations §24 (lineage/protocol), §25 (limitations), §27 (future work)
Forbidden / claim discipline §26, Appendix H

This TR is built to be mined. Appendix F is the number source, Appendix G the figure source, Appendix H the claim source, and this map the routing table — an agent can assemble the Tier-2 paper from these four plus the section prose without touching the raw JSON or re-running analysis.