Technical Report

TR164 V4: The Amortization-Breakdown Surface — A Predictive Bandwidth Model for Continuous Batching

A 672-cell offline static-batch grid (3 models × A100-80GB + H100 × vLLM 0.10.2, 2,016 timed decode generations, ok_rate 1.0) showing the continuous-batching free lunch is bounded by KV-cache read bandwidth. A one-parameter model η(B) = (1+r)/(1+Br) fits the 96 efficiency curves at median R² 0.93, and its zero-fitted-parameter architectural form predicts the breakdown knee at Spearman ρ = 0.84 — off only by a GPU-specific batch-saturation factor α (0.52 A100, 0.33 H100; a roofline check rules out compute-overlap and VRAM-capacity). Complements V3's closed-loop boundary — grounds it, does not revise it.

Table of Contents

Overview
1. Abstract
2. Table of Contents
3. Executive Summary
3.1 The headline: the amortization knee is predictable from architecture
3.2 Context is the dominant lever, and the whole effect lives in the curve shape
3.3 The knee marches left with context; at 32K every configuration breaks by batch 8–16
3.4 Faster HBM holds the knee out, and its advantage grows exactly where bandwidth binds
3.5 fp8 is faster but less batch-robust — and the penalty is context-invariant
3.6 The architecture predicts the model ordering
3.7 What V4 licenses and what it does not
3.8 The offline knee is a measured upper bound on the served knee
4. Introduction and Research Motivation
4.1 The arc from V1 to V4
4.2 From description to prediction
4.3 Why offline static batching is the right instrument
4.4 The exact-token-context construction removes a V3 confound
5. Research Questions
6. Methodology
6.1 The cell-shape contract and its arithmetic
6.2 Precision, context ceiling, and the engine path
6.3 The amortization metric and the 0.65 threshold
6.4 The first-principles bandwidth model and its parameters
6.5 The statistical battery
6.6 The per-cell measurement protocol
6.7 Execution substrate and rigor controls
7. SS1. The Cross-Product Grid Design and Its Arithmetic
7.1 Why a full cross-product
7.2 What the design buys statistically
8. SS2. The Modal Serverless Harness as Methodology-Honest Substrate: The Full Execution Arc
8.1 The mandate: "we are paying for runs"
8.2 The vLLM 0.10.2 offline API surface: three bring-up failures
8.3 The V0-versus-V1 engine decision and the fp8 redefinition
8.4 Validate-before-pay: the ten-cell cross-model diagnostic
8.5 Gated-model access and the HF-token Modal secret
8.6 The orphan-billing catch
8.7 Dual-path persistence: modal.Dict, recover, and --detach
8.8 Concurrent dispatch, the run, and the background-task discovery
8.9 Why the execution arc is a result
9. SS3. The Amortization Metric and the First-Principles Bandwidth Model
9.1 The metric and what it isolates
9.2 Deriving η(B) = (1+r)/(1+Br)
9.3 Why the pure model is a floor, not the whole story
9.4 A worked prediction, end to end
10. SS4. Context Length Is the Dominant Lever
10.1 The 128K extension: the knee compresses to batch 4 at the max-r corner
11. SS5. The Bandwidth-Amortization Model: Fit, Architectural Prediction, and the VRAM-Ceiling Source of α
11.1 The functional form fits
11.2 The zero-parameter architectural prediction
11.3 The batch-saturation factor α: what it is, what it is not, and where the model breaks
12. SS6. The HBM Bandwidth Axis and the GPU×Context Interaction
12.1 The main effect
12.2 The interaction: the H100 edge grows where bandwidth binds
12.3 The third bandwidth point: α does *not* track HBM bandwidth (a tested, refuted extrapolation)
12.4 Why: the H200 underutilizes its bandwidth (the MBU mechanism, measured)
13. SS7. The Precision Tradeoff and the Context-Invariant fp8 Penalty
13.1 The main effect
13.2 The correction: the fp8 penalty is context-invariant, not context-escalating
13.3 fp8 KV-cache: halving the KV footprint does *not* push the knee out
14. SS8. The Decode-Length Axis
15. SS9. The Factorial ANOVA Lever Ranking
16. SS10. Cross-Model Robustness and the Architecture That Explains It
16.1 qwen2.5-7b — the most batch-robust, and why
16.2 llama3.1-8b — the baseline 4:1 GQA profile
16.3 mistral-7b — identical KV geometry, and the earliest knee in the grid
16.4 The Llama≈Mistral tie, reported as a tie
17. SS11. The Aggregate-Throughput Scaling Ceiling
18. SS12. Memory-Bandwidth Utilization: Confirming Decode Is Bandwidth-Bound
19. SS13. Measurement Reliability
20. SS14. The Single-Stream Baseline Surface
21. SS15. The Continuous-Knee Regression
22. SS16. The Knee Table — Where Batching Stops Paying
23. SS17. Data Verification
24. SS18. Cross-Run Lineage and the Measured Served Knee — Closed-Loop versus Offline
24.1 The closed-loop validation harness
24.2 Distinct-prompt traffic: the static knee upper-bounds the served knee
24.3 Shared-prefix traffic: the decode knee transfers within a ladder step
24.4 What the served grid measures, and what it does not
25. Limitations
26. Forbidden Claims
27. Future Work
28. Conclusion
29. References
30. Appendix A — Reproducibility Pins
31. Appendix B — Statistical Methodology
32. Appendix C — The Full 24-Configuration Knee Table
32.1 Appendix C.2 — Served Knee versus Offline Knee (closed-loop validation, §24)
33. Appendix D — Architecture and Bandwidth Constants
34. Appendix E — Per-Curve Roofline-Fit Sample
35. Appendix F — Master Named-Value Reference (paper-substrate table)
36. Appendix G — Figure Manifest (caption-ready)
37. Appendix H — Paper Claim Ladder (supported / licensed / forbidden)
38. Appendix I — TR→Paper Section Map and Usage Note
Data Reconciliation & Cross-Version Lineage (2026-06-24)
Addendum to §29 References — post-V4 substrates
Addendum to Appendix F — post-V4 substrate number table

Technical Report 164 V4: The Amortization-Breakdown Surface — A Predictive Bandwidth Model for When Continuous Batching Stops Paying (672-Cell Offline Grid, A100-80GB + H100, vLLM 0.10.2)

Status: Complete and hand-narrated at full TR depth. A single offline static-batch decode-throughput grid of 672 cells (3 models × 2 GPUs × 2 decode lengths × 2 precisions × 4 context lengths × 7 batch sizes), 3 timed repetitions per cell, ok_rate = 1.0 (672/672 succeeded, zero error cells). Engine: vLLM 0.10.2 V1. GPUs: A100-80GB and H100 (Modal serverless). Run artifact: research/tr164/results/modal_amortization/full_xmodel_672/results.json. Analysis artifacts: research/tr164/amortization_stats.json (main effects) and research/tr164/amortization_full_stats.json (the nine-layer maximalist analysis — every empirical number below traces to a key there or to the run JSON). Model architectures are verified from each model's Hugging Face config.json and safetensors index (Appendix D). Lineage facts about V1/V2/V3 are cited, not recomputed.

The V4 thesis in one sentence. Continuous batching's free lunch — the regime where adding requests raises aggregate throughput without degrading per-request rate — is bounded by the KV-cache footprint per token, and that bound is predictable from model architecture alone: a one-parameter bandwidth model η(B) = (1+r)/(1+Br) with r = C·k/W (context × KV-bytes-per-token over weight bytes) fits the measured efficiency curves at median R² 0.93 and predicts the observed breakdown knee at Spearman ρ = 0.84 with zero fitted parameters, off only by a GPU-specific batch-saturation factor (α ≈ 0.52 on A100, 0.33 on H100) that a first-principles roofline check shows is not compute overlap (decode is memory-bound throughout) but an empirical, hardware-specific effect of achieved memory-bandwidth utilization rising with batch.

Protocol distinction, stated up front and threaded through. V3 measured a closed-loop concurrency boundary (N agents against a server, parallel efficiency + p95 latency). V4 measures an offline static-batch amortization surface (one process submits a fixed batch of B equal-length prompts to an in-process vllm.LLM, decode throughput is timed). These are complementary: the closed-loop number is the served operating point under traffic; the static-batch number is the best-case amortization ceiling the scheduler reaches toward when every slot is full of equal work. A knee here is a ceiling on where a real scheduler can keep amortizing (§24). No V4 number is presented as a served latency, and no V3 number is spliced into a V4 curve.

Overview

V4 turns the TR164 serving-stack-physics arc from a map of where continuous batching breaks (V3, under concurrent traffic) into a predictive mechanism for where it breaks and why. The substrate is a full cross-product offline grid: three instruction-tuned 7–8B checkpoints (qwen2.5-7b, llama3.1-8b, mistral-7b), two datacenter GPUs (A100-80GB, HBM2e ≈ 2.0 TB/s; H100, HBM3 ≈ 3.35 TB/s), two decode lengths (64 and 512 generated tokens), two precisions (fp16 and fp8-weight quantization), four context lengths (512, 2,048, 8,192, 32,000 exact prompt tokens), and seven batch sizes (1, 2, 4, 8, 16, 32, 64) — 3 × 2 × 2 × 2 × 4 × 7 = 672 cells, each measured over three timed repetitions after a warm-up. Completion was perfect (672/672, no error cell, no zero-throughput cell, every prompt exact to the token), and the three reps are extremely tight (median per-cell coefficient of variation 0.4%, §19), so the surface is measured cleanly. The aggregate measurement count is 672 cells × 3 reps = 2,016 timed decode generations plus 672 warm-ups, all on real GPUs with real weights.

The analysis is deliberately maximalist — nine layers that wring the cross-product for everything it supports rather than reporting marginal means. Layer 1 is the four axis main effects. Layer 2 is the interaction structure: difference-in-differences paired tests and a factorial Type-II ANOVA that ranks the levers by partial η². Layer 3, the centerpiece, derives a first-principles bandwidth-amortization model, fits it to all 96 efficiency curves, and validates a zero-free-parameter architectural prediction of the knee. Layer 4 measures achieved memory-bandwidth utilization to confirm decode is bandwidth-bound. Layers 5–9 cover the aggregate-throughput scaling ceiling, measurement reliability, the single-stream baseline, per-model architecture-grounded robustness, and a regression of the knee against context. The result is not a description of four correlations but a single mechanism — bandwidth-bound decode — that predicts the surface.

Two findings carry the report. First, the mechanism is predictive, not merely descriptive. The bandwidth model η(B) = (1+r)/(1+Br) fits 85% of the 96 curves at R² ≥ 0.7 (median 0.93), and its zero-parameter architectural form predicts the observed knee at Spearman ρ = 0.84 across two orders of magnitude of context; a first-principles roofline check (§9.3, §11.3) shows the α < 1 residual is not compute overlap (decode is 4.6%-compute, memory-bound throughout) and not a VRAM-capacity effect at the knee (95% of knees sit at ~¼ of the VRAM concurrency ceiling, where the KV still fits) but an empirical, hardware-specific effect of achieved memory-bandwidth utilization rising with batch. The model fails in exactly one interpretable place: the H100 long-decode long-context corner where the curve goes non-monotone because the scheduler wave-batches once the KV cache exceeds VRAM — beyond the operational knee. Second, every axis sign falls out of that one KV-footprint mechanism, and the architecture predicts the ordering. Context dominates (mean η 0.865 → 0.599 from ctx512 → ctx32000, ANOVA partial η² 0.27, an order of magnitude above any other design factor); the H100 holds η 7.7 points higher and its advantage grows with context (a significant interaction); fp8 weight quantization is 5.4 points less batch-robust with a penalty that is context-invariant (a null interaction that corrects a plausible-but-wrong intuition); and qwen2.5-7b's 57,344-byte-per-token KV footprint — less than half the 131,072 of Llama and Mistral, by virtue of its 28 layers × 4 KV heads — directly predicts and matches its 2× later knee. This is the empirical core reserved, in the serving-stack-physics paper contract, for the held title "When Continuous Batching Stops Amortizing."

1. Abstract

Continuous batching amortizes the per-token memory traffic of LLM decode across concurrent requests; where that amortization stops paying — and what governs the boundary — is an open operational question. Technical Report 164 V4 answers it with a predictive bandwidth model validated on a 672-cell offline static-batch grid (vLLM 0.10.2 V1, ok_rate 1.0): three 7–8B instruction-tuned models, two datacenter GPUs (A100-80GB, H100), two decode lengths (64, 512), two precisions (fp16, fp8-weight), four context lengths (512–32,000 exact tokens), seven batch sizes (1–64), three reps each. We define amortization efficiency η(B) = t(B)/t(1) — the fraction of the batch-1 per-request decode rate retained at batch B, equivalently the aggregate speedup over B, the parallel efficiency — and derive from first principles that if decode is HBM-bandwidth-bound then η(B) = (1+r)/(1+Br) with r = C·k/W, the ratio of per-request KV read bytes (context C × KV-bytes-per-token k) to model weight bytes W. The model fits the 96 measured curves at median R² 0.93, and its zero-fitted-parameter architectural form predicts the observed breakdown knee at Spearman ρ = 0.84, deviating only by a GPU-specific batch-saturation factor α (0.52 on A100, 0.33 on H100); a first-principles roofline check shows α is not compute overlap (decode is 4.6%-compute, memory-bound throughout) and not a VRAM-capacity effect at the knee (95% of knees sit at ~¼ of the VRAM concurrency ceiling, where the KV still fits) but an empirical, hardware-specific effect of achieved memory-bandwidth utilization rising with batch (MBU 0.55 → saturation). Consistent with one KV-footprint scaling: context dominates (mean η 0.865 → 0.599 across ctx512 → ctx32000, paired Δ = −0.266, bootstrap 95% CI [−0.294, −0.240], Holm p ≈ 0; factorial-ANOVA partial η² 0.27, vs 0.038 GPU / 0.018 precision / 0.009 model); the H100 sustains η +0.077 (CI [0.067, 0.088]) with an advantage that grows with context (difference-in-differences +0.088, Holm p < 1e-5); fp8 is −0.054 less batch-robust (CI [−0.061, −0.047]) with a context-invariant penalty (interaction null, Holm p = 0.88); a longer decode is +0.036. Across models qwen2.5-7b is most robust (mean η 0.789, knee at batch 16 at ctx32000) ahead of a llama3.1-8b ≈ mistral-7b tie (0.750/0.748, knee at batch 8) — an ordering the KV-byte architecture predicts (57,344 vs 131,072 bytes/token). Achieved memory-bandwidth utilization at batch 1 is ≈ 0.63 on the H100, confirming decode is bandwidth-bound; the aggregate-throughput speedup ceiling at batch 64 falls from 42× to 25× across the context range; measurement reliability is high (median CV 0.4%). All comparisons are paired Wilcoxon signed-rank with Holm-Bonferroni and reproducible bootstrap CIs (seed 164, 10,000 resamples). The claims are sized to the substrate — a best-case static-batch amortization ceiling for 7–8B models on two GPUs under one engine.

2. Table of Contents

Abstract
Table of Contents
Executive Summary
Introduction and Research Motivation
Research Questions
Methodology
SS1. The Cross-Product Grid Design and Its Arithmetic
SS2. The Modal Serverless Harness and the Persistence / Orphan-Billing Rigor Arc
SS3. The Amortization Metric and the First-Principles Bandwidth Model
SS4. Context Length Is the Dominant Lever
SS5. The Bandwidth-Amortization Model: Fit, Architectural Prediction, and the VRAM-Ceiling Source of α
SS6. The HBM Bandwidth Axis and the GPU×Context Interaction
SS7. The Precision Tradeoff and the Context-Invariant fp8 Penalty
SS8. The Decode-Length Axis
SS9. The Factorial ANOVA Lever Ranking
SS10. Cross-Model Robustness and the Architecture That Explains It
SS11. The Aggregate-Throughput Scaling Ceiling
SS12. Memory-Bandwidth Utilization: Confirming Decode Is Bandwidth-Bound
SS13. Measurement Reliability
SS14. The Single-Stream Baseline Surface
SS15. The Continuous-Knee Regression
SS16. The Knee Table — Where Batching Stops Paying
SS17. Data Verification
SS18. Cross-Run Lineage and the Measured Served Knee — Closed-Loop versus Offline
Limitations
Forbidden Claims
Future Work
Conclusion
References
Appendix A — Reproducibility Pins
Appendix B — Statistical Methodology
Appendix C — The Full 24-Configuration Knee Table
Appendix D — Architecture and Bandwidth Constants
Appendix E — Per-Curve Roofline-Fit Sample
Appendix F — Master Named-Value Reference (paper-substrate table)
Appendix G — Figure Manifest (caption-ready)
Appendix H — Paper Claim Ladder (supported / licensed / forbidden)
Appendix I — TR→Paper Section Map and Usage Note

Reference-substrate note. This TR is built to be mined when drafting the Tier-2 paper "When Continuous Batching Stops Amortizing." Appendices F–I are the agent-facing surface: F is the master number table (every value → JSON key, never re-derive), G the figure manifest (caption-ready), H the claim ladder (supported / licensed-with-caveat / forbidden), and I the TR→paper section map. Lead the paper with the predictive bandwidth model (§11, Figure 3).

3. Executive Summary

3.1 The headline: the amortization knee is predictable from architecture

The central result of V4 is that the breakdown of continuous batching is not an empirical curiosity to be tabulated but a consequence of a single mechanism that can be written down and used to predict. Model the per-request decode throughput as bandwidth-bound: each decode step reads the model weights once (W bytes, shared across the batch) and each request's KV cache once (C·k bytes, where C is context and k is KV-bytes-per-token), producing one token per request. Then per-request throughput is t(B) = BW / (W + B·C·k) and the amortization efficiency is

η(B) = t(B)/t(1) = (W + C·k)/(W + B·C·k) = (1 + r)/(1 + B·r), with r = C·k / W.

This one-parameter rational function, derived in §9, fits the 96 measured (model, GPU, decode, precision, context) curves at median R² 0.93 (57 of 96 above 0.90; 82 of 96 above 0.70), and the architectural value of r — computed from the verified config (params → W, layers × KV-heads × head_dim → k) with no fitting whatsoever — predicts the observed breakdown knee at Spearman ρ = 0.84 (Figure 3). The single place the form fails is the four H100 | d512 | ctx32000 curves, which go non-monotone (η dips then partly recovers) at the most VRAM-stressed corner where the scheduler wave-batches; a monotone rational function cannot fit a non-monotone curve.

Observations. The architectural prediction sits systematically below the observed knee (points above the y = x line in Figure 3) by a roughly constant factor — the batch-saturation factor α = r_fit / r_arch, median 0.52 on the A100 and 0.33 on the H100. A first-principles roofline check (§9.3, §11.3) shows this is not compute overlap (decode is 4.6%-compute, memory-bound throughout) and not a VRAM-capacity effect at the knee (95% of knees sit at ~¼ of the VRAM concurrency ceiling, where the KV still fits) but an empirical, hardware-specific effect of achieved memory-bandwidth utilization rising with batch. The operational payoff stands regardless: the knee scales cleanly as 1/(C·k), so from three numbers in a model's config and the GPU's HBM one can predict to within a factor of ~2 the batch at which continuous batching stops paying at a given context.

The amortization knee is predictable. A one-parameter bandwidth model fits the curves at R² 0.93 and its zero-parameter architectural form predicts the knee — which scales cleanly as 1/(C·k) — at ρ = 0.84; the residual α is an empirical, hardware-specific correction the derivation attributes to achieved bandwidth rising with batch, not compute overlap and not VRAM capacity at the knee. This turns "long context is slower" into a usable engineering prediction.

3.2 Context is the dominant lever, and the whole effect lives in the curve shape

Averaged across all twelve (model, GPU, decode, precision) configurations, mean η at batch 16 falls 0.897 → 0.838 → 0.686 → 0.442 across contexts 512 → 2,048 → 8,192 → 32,000, and at batch 64 it falls 0.659 → 0.619 → 0.533 → 0.389 (headline.mean_eta_by_context_batch). The paired extreme contrast (ctx32000 vs ctx512, 144 matched cells) is Δ = −0.266 (bootstrap 95% CI [−0.294, −0.240], Holm-adjusted Wilcoxon p ≈ 0), and the factorial ANOVA assigns context a partial η² of 0.266 — an order of magnitude above the next design factor (GPU 0.038). Critically, almost none of this lives in the single-stream rate: the batch-1 per-request throughput is only ≈ 6% lower at 32,000 tokens than at 512 (prefill-amortization penalty 0.94, §20). Observations. Context barely changes how fast one request decodes; it changes how fast batching stops helping. That is precisely why context is an amortization lever (it reshapes r, the curve's one parameter) rather than a raw-speed lever, and it is why the knee, not the throughput, is the quantity that moves.

Of every operational knob, context length has by far the largest effect on whether continuous batching still amortizes — and the effect is entirely in the curve shape, not the single-stream speed. Deciding how much history or retrieved context to include is also deciding the batch-throughput regime the server will live in.

3.3 The knee marches left with context; at 32K every configuration breaks by batch 8–16

The breakdown knee compresses sharply with context. Counting the 24 (model, GPU, decode, precision) configurations: at ctx512, 14 never cross 0.65 through batch 64; at ctx2048, still 14; at ctx8192, only 3 remain free and the rest cluster at batch 16–32; at ctx32000, every configuration has broken down, thirteen by batch 8 and the rest by batch 32 (headline.knee_distribution_by_context, full table §22). The continuous-knee regression (§21) puts the compression rate at a median slope of −0.64 in log₂(knee) per log₂(context): the knee roughly halves for every ~3× of context. Observations. There is no single safe batch size. The maximum batch at which batching still pays runs from ≥64 (or unbounded) at 512 tokens to ~8 at 32,000 — a swing an autoscaler that ignores prompt length cannot absorb without leaving either throughput or latency on the table.

The knee is a steep function of context: ≥64 at short context, ~8 at 32K. The operational corollary is that the right batch ceiling is prompt-length-conditional, not a constant.

3.4 Faster HBM holds the knee out, and its advantage grows exactly where bandwidth binds

The GPU axis is the cleanest causal probe — only the memory subsystem changes. Over 288 matched pairs the H100 (HBM3) sustains mean η 0.801 against the A100's (HBM2e) 0.724, Δ = +0.077 (CI [0.067, 0.088], Holm p ≈ 0), and delivers the higher absolute ceiling (peak aggregate 8,879 vs 4,684 tok/s at batch 64). The difference-in-differences test shows the advantage is not flat: the H100 edge grows from +0.036 at ctx512 to +0.125 at ctx32000 (DiD +0.088, Holm p < 1e-5, §12), exactly where KV traffic is largest and bandwidth binds hardest. Observations. A faster memory subsystem helps most precisely when the bottleneck is memory — a sign-and-magnitude prediction of the bandwidth mechanism that the data confirms, and the mechanistic complement to the α = 0.33-vs-0.52 batch-saturation finding (§11.3).

The H100 advantage is bandwidth, demonstrated two ways: it holds η +7.7 points higher overall, and that margin widens to +12.5 points at 32K context where bandwidth is the binding constraint. Same models, same engine, only the HBM changes.

3.5 fp8 is faster but less batch-robust — and the penalty is context-invariant

fp8 weight quantization halves the weight read and so raises absolute throughput (single-stream fp8 is 1.40× fp16, §20), but it leaves the KV cache at fp16, so it reaches the KV-dominated regime sooner and is −0.054 less batch-robust (CI [−0.061, −0.047], Holm p ≈ 0). The maximalist analysis corrects a plausible intuition here: one expects the fp8 penalty to be largest where the KV cache is largest (long context), but the difference-in-differences test is null — the fp8 η-penalty is −0.053 at ctx512 and −0.051 at ctx32000, DiD +0.002, Holm p = 0.88 (§13). The penalty is significantly smaller on the H100 (precision×GPU DiD +0.029, Holm p < 1e-5), consistent with the H100's larger batch-saturation relaxation (lower α, §11.3) softening every bandwidth-side effect including the fp8 one. Observations. fp8 relocates the bottleneck from weights to the unquantized KV cache by a roughly fixed efficiency cost across context, modulated by the GPU's effective-traffic relaxation — not, as one might guess, by a cost that escalates with context.

fp8 buys absolute speed and memory headroom at a fixed ~5-point batch-robustness cost that does not escalate with context but does shrink on a compute-richer GPU. The obvious follow-up the sign motivates is to quantize the KV cache too (§27).

3.6 The architecture predicts the model ordering

qwen2.5-7b is the most batch-robust (mean η 0.789, median knee at batch 16 at ctx32000); llama3.1-8b and mistral-7b are a statistical tie (0.750 / 0.748, median knee at batch 8; pairwise Holm p = 0.13). This ordering is not a black-box observation: Qwen's KV cache is 57,344 bytes/token (28 layers × 4 KV heads × 128 head_dim × 2 for K/V × 2 bytes) against 131,072 bytes/token for Llama and Mistral (32 layers × 8 KV heads × …), so Qwen's r = C·k/W is roughly half at any context, its curve is gentler, and its knee falls ~2× later — exactly the 16-vs-8 ratio observed. Observations. The model that decodes with fewer KV bytes per token amortizes batching further, and the architecture says how much in advance. The Llama≈Mistral tie is reported as a tie (CI straddles zero, Holm rejects), consistent with their identical KV geometry.

The cross-model robustness order is an architectural prediction, not a leaderboard: KV-bytes-per-token sets r, and Qwen's half-sized KV footprint predicts and matches its 2× later knee. Llama and Mistral, with identical KV geometry, tie.

3.7 What V4 licenses and what it does not

V4 licenses, on the substrate measured: (i) a predictive bandwidth-amortization model for 7–8B offline static-batch decode on A100-80GB and H100, validated at median R² 0.93 and a zero-parameter knee prediction at ρ = 0.84; (ii) the causal claim that the knee is governed by KV read bandwidth, supported by the concordant signs of the context, GPU, and precision effects, the gpu×context interaction, and the ≈ 0.63 measured MBU; (iii) a model-robustness ordering predicted by KV-byte architecture. V4 does not license any closed-loop serving-latency claim (the protocol is static-batch offline throughput, §24, §26), generalization beyond the three families or the two GPUs, any fp8-KV-cache claim (only fp8 weight quant was run), or an engine comparison (one engine). These are enumerated in §26.

3.8 The offline knee is a measured upper bound on the served knee

The one place V4's offline protocol most invites a "but does it hold under real serving?" objection — its static-batch knee versus a live scheduler's served knee — is now answered by measurement rather than argument (§24). A dedicated closed-loop grid (vllm serve + N concurrent closed-loop agents, same parallel_efficiency < 0.65 knee definition) re-runs the V4 cells under two traffic regimes. Under distinct prompts (realistic, no KV reuse) the served knee is at or below the offline static knee on 26 of 26 configurations, the gap widening with context because distinct long prompts force prefill serialization the decode-only offline metric excludes (served parallel_efficiency collapses as ≈ 1/N from N = 2 at ctx8192–32000). Under shared-prefix prompts (prefill amortized by the cache) the served knee lands on the offline decode knee to within one ladder step on 32 of 33 configurations. Observations. The two regimes bracket the served knee from both sides: the decode-amortization ceiling V4 predicts is what governs serving when prefill is removed as a confound, and it strictly upper-bounds the served knee when prefill is present. An operator who provisions to V4's offline knee will not over-provision — the conservative direction is confirmed, not asserted (§24.2–§24.4).

4. Introduction and Research Motivation

4.1 The arc from V1 to V4

TR164 is a four-leg characterization of where the concurrency behavior of LLM serving breaks. V1 measured a single in-process backend (pytorch_direct) and found a catastrophic near-universal breakdown by N = 16. V2 introduced server-process continuous-batching backends that eliminate the V1 collapse but left a hardware confound in the cross-backend head-to-head. V3 closed that confound with a matched-SKU A100 grid across five models and five workloads under a closed-loop concurrency ladder, and forecast the next measurement explicitly: "a denser concurrency sweep that traces each backend's amortization regime to its own breakdown point." V4 is that sweep, with a protocol change (offline static batching, to isolate the amortization ceiling from scheduler and queueing dynamics) and two extensions (a second GPU for a clean bandwidth contrast, and context and precision swept densely so the physical drivers of memory traffic become explicit axes).

4.2 From description to prediction

V1–V3 described where the boundary sits. V4's ambition is to predict it. The closed-loop ladder cannot do this cleanly because it folds the amortization physics together with scheduler decisions and arrival noise; the static-batch grid strips those away, leaving a quantity — η(B) — that a first-principles bandwidth argument predicts in closed form. The test of the report is whether that argument, fed only verified architecture and hardware specs, predicts the measured knee. It does, to within an empirical batch-saturation factor (§11.3 rules out compute overlap and VRAM-capacity as its cause, attributing it to achieved bandwidth rising with batch), which is the report's main contribution: a serving engineer can estimate the amortization knee for an untested (model, GPU, context) from the model card and the datasheet.

4.3 Why offline static batching is the right instrument

To ask where batching stops amortizing bandwidth, the cleanest instrument fills exactly B slots with equal-length prompts and times steady-state decode, removing ragged batches, preemption, and queueing. The resulting η(B) is the best case — so a knee observed here is a ceiling on where a real scheduler can keep amortizing (a closed-loop operator will generally see the knee at or before this batch). That conservative direction is exactly what an operator needs from a planning bound.

4.4 The exact-token-context construction removes a V3 confound

V3 carried a documented Mistral tokenizer confound: char-matched text prompts produced different token counts across tokenizers, so "same prompt" did not mean "same sequence length." V4 constructs each prompt as an exact token-id sequence of the target length and verifies context_actual == context for all 672 cells (§23), so sequence length is an exact identical variable across models and GPUs and the tokenizer-density confound does not arise. This is a methodological strength of the static-batch design and the reason the cross-model context comparison is clean here where V3 required a caveat.

5. Research Questions

RQ1 (predictive mechanism). Can a first-principles bandwidth model η(B) = (1+r)/(1+Br) describe the measured efficiency curves, and does its zero-parameter architectural form predict the observed knee? Hypothesis: yes, up to a GPU-specific batch-saturation factor — which the derivation (§9.3, §11.3) finds is neither compute overlap nor a VRAM-capacity effect at the knee, but achieved bandwidth rising with batch.
RQ2 (the dominant lever). Which axis dominates η, and is its effect monotone? Hypothesis: context, monotonically, because KV read traffic scales with it.
RQ3 (the bandwidth causal probe and its interaction). Does changing only the GPU memory subsystem move η as the mechanism predicts, and does the effect concentrate where bandwidth binds? Hypothesis: the H100 sustains higher η, with an advantage that grows with context.
RQ4 (the precision tradeoff and its shape). Is fp8 weight quantization less batch-robust, and does its penalty depend on context? Hypothesis: less robust (KV becomes binding sooner); the context-dependence is tested, not assumed.
RQ5 (model consistency). Is the surface consistent across families, and does KV-byte architecture predict any ordering? Hypothesis: a shared surface with an ordering set by KV-bytes-per-token.
RQ6 (bandwidth-bound confirmation). Is decode actually bandwidth-bound at these operating points? Hypothesis: measured MBU is high (≳ 0.5).

6. Methodology

6.1 The cell-shape contract and its arithmetic

A cell is one point in (model, GPU, decode length, precision, context, batch): models {qwen2.5-7b, llama3.1-8b, mistral-7b}; GPUs {A100-80GB, H100}; decodes {64, 512}; precisions {fp16, fp8-weight}; contexts {512, 2,048, 8,192, 32,000}; batches {1, 2, 4, 8, 16, 32, 64}. The product is 672 cells. Each runs one warm-up generation (moving autotuning and CUDA-graph capture out of the timed window) then three timed repetitions, reporting the mean and population std of aggregate decode throughput and the mean per-request decode throughput. The grid is balanced and verified balanced: 224 cells per model, 336 per GPU/decode/precision, all contexts and batches present (§23).

6.2 Precision, context ceiling, and the engine path

Both precisions share the official checkpoint: fp16 is dtype=float16; fp8 is fp8 weight quantization (quantization="fp8"), the only fp8 path the V1 engine supports at 0.10.2 (fp8 KV-cache dtype is a V0-only path, out of scope; a V0 validation pass showed long-context decode an order of magnitude slower and non-representative, so all cells use the production V1 engine). The KV cache is fp16 in every cell, including fp8 cells — the reason the fp8 robustness penalty has the sign it does (§13). The largest context is 32,000 (not 32,768) so max_model_len = context + decode ≤ 32,768, the shared position-embedding ceiling of Qwen and Mistral (Llama's is 131,072; the shared 32,000 keeps the context axis identical across models). All cells run V1 with gpu_memory_utilization = 0.90, CUDA graphs on, greedy decode pinned to the exact token budget (min_tokens = max_tokens, temperature = 0).

6.3 The amortization metric and the 0.65 threshold

η(B) = t(B)/t(1) is the fraction of the batch-1 per-request decode rate retained at batch B; since aggregate A(B) = B·t(B), it equals the parallel efficiency A(B)/A(1)/B used throughout the TR164 line. η ≈ 1 is the free-lunch regime; η < 0.65 is breakdown (EFF_BREAKDOWN, the shared threshold). The knee is the smallest B > 1 with η(B) < 0.65 (or "never"); §21 additionally interpolates a continuous knee (the η = 0.65 crossing in log₂ B). The 0.65 line is a consistent reading point shared with V3, not a claim that 0.649 and 0.651 differ operationally.

6.4 The first-principles bandwidth model and its parameters

The model η(B) = (1+r)/(1+Br) with r = C·k/W is derived in §9 from a bandwidth-bound decode step. Its parameters are computed from verified architecture (Appendix D): weight bytes W = params × 2 (fp16) or × 1 (fp8, approximate); KV-bytes-per-token k = 2 × layers × KV-heads × head_dim × 2 (the leading 2 for K and V, the trailing 2 for fp16). For each curve we fit r by least squares (the architectural value is not used in the fit), report R², compare the fitted r to the architectural r_arch, and define the batch-saturation factor α = r_fit / r_arch (§11.3 shows it is neither compute overlap nor VRAM-capacity at the knee, but achieved bandwidth rising with batch). The architectural knee B* = (0.35 + r_arch)/(0.65·r_arch) is the zero-parameter prediction validated against the observed knee; the VRAM concurrency ceiling n_max = (0.9·VRAM − W − overhead)/(C·k) is a correlated co-predictor of the same knee sharing the 1/(C·k) scaling (§11.3).

6.5 The statistical battery

Axis main effects are paired Wilcoxon signed-rank over matched cells (batch 1 excluded; η ≡ 1), Holm-Bonferroni-corrected within the four-axis family, each with a reproducible percentile-bootstrap 95% CI on the mean paired delta (seed 164, 10,000 resamples). Interactions are difference-in-differences paired Wilcoxon (the effect at one moderator level vs another), Holm-corrected within the interaction family, and a factorial Type-II ANOVA of η on all main effects plus 2-way interactions reporting partial η² per term. Model comparisons are paired Wilcoxon, Holm-corrected within the three-pair family. All estimators and the 0.65 threshold are shared with v3_paper_stats.py.

6.6 The per-cell measurement protocol

Each cell instantiates one vllm.LLM with max_model_len = context + decode, builds the batch of B identical exact-length token-id prompts, and runs the measurement loop. A single warm-up generate over one prompt is issued first and discarded: vLLM's V1 engine autotunes and captures CUDA graphs on the first generation of a new shape, and folding that one-time cost into the timed window would inflate the batch-1 step and corrupt the η ratio. Three timed repetitions then follow, each a full generate over the B-prompt batch with max_tokens = min_tokens = decode (so every request emits exactly the decode budget, no early stop), wrapped in a wall-clock timer. Per repetition the aggregate decode throughput is sum(output_tokens)/wall and the per-request throughput is that divided by B; the cell records the mean and population standard deviation of the aggregate across the three reps, and the mean per-request rate (the quantity η is built from). The greedy, fixed-length decode (temperature = 0) removes sampling variance, so the only residual noise is scheduling and memory-contention jitter, which the three-rep CV quantifies at a median 0.4% (§19). Observations. The protocol is deliberately the best case for amortization — equal-length prompts, full batch, steady-state decode, autotuning excluded — because the question is where batching stops paying even when nothing else is working against it; a knee measured under these conditions is a ceiling on the served knee (§24).

6.7 Execution substrate and rigor controls

The grid ran on Modal serverless GPUs via two GPU-typed functions sharing one measurement body, both pools dispatched concurrently. Three rigor controls (narrated in full in §8): validate-before-pay (a ten-cell diagnostic proved every new configuration loads and runs before the paid grid), dual-path persistence (each cell writes to a run-tagged modal.Dict as it completes, with a recover mode, decoupling data from the local process), and --detach (the job survives a local disconnect). The happy-path save and the insurance Dict agreed exactly at 672 cells.

7. SS1. The Cross-Product Grid Design and Its Arithmetic

7.1 Why a full cross-product

The amortization question is interaction-shaped: the knee is not a property of context or batch alone but of their joint position relative to the bandwidth ceiling, modulated by precision and hardware. A sliced design would hide, for example, that the H100 advantage grows with context (§12) or that the fp8 penalty does not (§13). Measuring all 672 cells lets every two-way effect be read off the same surface and tested directly (§9, §12, §13, §15), and supplies the 96 complete batch-sweeps the bandwidth model is fit against. The axes were chosen as the physical drivers of decode memory traffic: context and batch set KV size and therefore KV read bandwidth per step; precision sets weight read bytes; the GPU sets available HBM bandwidth; decode length sets how many memory-bound steps amortize the one compute-leaning prefill.

7.2 What the design buys statistically

Because every (model, decode, precision, context, batch) combination exists on both GPUs, the GPU cross-cut is a fully matched paired test with no missing cells; the same holds for precision and decode. The only extreme-versus-extreme contrast is context (512 vs 32,000), chosen for the cleanest single effect size, with the intermediate contexts reported in full (§3.2, §22) and used in the continuous-knee regression (§21). Observations. The symmetry is what makes the difference-in-differences interaction tests clean: each is a paired contrast of paired contrasts over fully matched cells, so the DiD estimates are unconfounded by the other axes.

A full cross-product is the design that does not assume away the interactions an amortization-breakdown story is made of. The 672 cells buy the right to read and test every two-way effect on one surface, and the 96 complete curves the predictive model is validated against.

The 672-cell grid is a paid-compute artifact, and the TR164 convention treats execution rigor as a result rather than a footnote, because on paid serverless compute the harness's failure modes are part of the measurement: a silent-fallback bug, an orphaned container fleet, or a non-representative engine path does not announce itself, it just bills and returns a number. This section narrates the full arc that produced a clean 672/672 grid — the bring-up failures, the engine decision, the validation discipline, the orphan-billing catch, and the persistence hardening — in the same spirit V3 narrated its four-failure environment arc. Every event below was observed in the run logs and is reported as it happened, not reconstructed.

8.1 The mandate: "we are paying for runs"

The harness did not start clean. An early pass of the sweep fired a partial grid that returned roughly half its cells failed and then crashed the result save with a FileNotFoundError on a /results/... path: the local entrypoint had tried to write to the Modal volume path, which exists only inside the remote containers, not in the local process that runs the entrypoint and collects the .map() returns. The fix was to save locally from the entrypoint (the .map() returns the result dicts to the local process; the volume is the wrong place to write the aggregate). The failure set the bar for everything after it: a paid run is not an enthusiastic best-effort, it is code that must work end-to-end before money is spent, with each failure mode diagnosed from the real traceback rather than guessed. The three controls that follow — validate-before-pay, dual-path persistence, and orphan-billing vigilance — are the operationalization of that bar.

8.2 The vLLM 0.10.2 offline API surface: three bring-up failures

Bringing the offline measurement up against vLLM 0.10.2's V1 engine surfaced three concrete failures, each fixed against the real error:

The exact-context prompt construction. The experiment requires prompts of an exact token length (so context is an exact variable, §4.4), which means submitting token-id sequences rather than text. The natural call LLM.generate(prompt_token_ids=...) raised TypeError: LLM.generate() got an unexpected keyword argument 'prompt_token_ids': at 0.10.2 token-id prompts must be passed as dict prompts, generate([{"prompt_token_ids": seq}, ...]). The harness builds each prompt by encoding a stable non-special token pattern, repeating and truncating it to exactly ctx tokens, and wrapping each in such a dict; the post-hoc check context_actual == context on all 672 cells (§23) confirms the construction.
The 32,768-token position ceiling. A context of 32,768 raised a Pydantic ValidationError: max_model_len (32832) > max_position_embeddings (32768) — max_model_len = context + decode must fit under the model's position ceiling. The largest context was therefore set to 32,000, so that even the long decode fits (32,000 + 512 = 32,512 ≤ 32,768), and the context axis stays identical across Qwen and Mistral (ceiling 32,768) and Llama (131,072).
The Windows console codec. The Modal CLI prints progress with a ✓ glyph, which the Windows console's default charmap codec cannot encode ('charmap' codec can't encode character '✓'), crashing the local driver before the run even dispatched. Every modal run invocation is prefixed PYTHONUTF8=1 PYTHONIOENCODING=utf-8. A trivial failure, but a fatal one if unfixed — and exactly the kind of environment friction that silently aborts a paid run.

8.3 The V0-versus-V1 engine decision and the fp8 redefinition

The most consequential bring-up decision was which engine path defines a cell, and it turned on a measured rejection. The original precision axis included an fp8 KV-cache dtype (kv_cache_dtype="fp8"), which on the V1 engine raised NotImplementedError: VLLM_USE_V1=1 is not supported with --kv-cache-dtype. The path that does support an fp8 KV cache is the legacy V0 engine (VLLM_USE_V1=0), so a V0 validation pass was run — and its numbers disqualified it. On V0 the fp16 throughput at ctx2048, batch 4 was 191 tok/s against the V1 engine's 333, and at the stressed corner (ctx32000, batch 64) V0 returned 19 tok/s against V1's 1,182 — a ~62× gap. V0 is not a slower-but-representative path; it is a legacy code path whose long-context decode behavior is not what any production deployment sees, so reporting V0 numbers would have measured an obsolete engine, not the serving physics. V0 was rejected for all cells.

The pivot was to redefine the fp8 axis as fp8 weight quantization (quantization="fp8"), the only fp8 path V1 supports, which a validation cell confirmed works and is fast: fp8-weight at ctx2048, batch 4 returned 415 tok/s, faster than the fp16 baseline's 333, exactly as halving the weight read predicts. This is why the V4 fp8 axis is weight-quant with an fp16 KV cache (§6.2), and it is the direct cause of the fp8 robustness penalty's sign (§13): the weights shrink, the KV cache does not, so the binding constraint moves to KV sooner. The decision is recorded in full because it shapes a headline axis — the fp8 result is a weight-quant result, not a KV-quant result, and conflating the two would misreport it.

8.4 Validate-before-pay: the ten-cell cross-model diagnostic

With the API surface and engine path settled, the paid 672-cell grid was still not the first thing to run. A ten-cell diagnostic fired first, chosen to exercise every new configuration the full grid would depend on: each of the three models on each of the two GPUs at a mid context (six cells), plus the decode-512 axis on two models and the 32,000-token corner on two (four cells). All ten succeeded. This proved, before any large spend, that gated-model access worked (§8.5), that the H100 GPU type was available and loaded all three models, that the decode-512 path ran, and that the 32,000-token configuration loaded under the position ceiling — and it surfaced the realistic load times (a first-time Mistral load on the H100 took 411 seconds, Llama loads 148–182 seconds, cold-cache effects that the diagnostic absorbed). It also pre-downloaded the two gated checkpoints into the shared cache volume, so the paid grid paid no download cost. Observations. The diagnostic is the cheapest insurance in the arc: ten cells of validation against a 672-cell paid run is a ~1.5% overhead that retires every "does this axis even run" risk before the expensive fan-out, and it is the direct operationalization of the §8.1 bar.

Two of the three models — meta-llama/Llama-3.1-8B-Instruct and mistralai/Mistral-7B-Instruct-v0.3 — are gated on Hugging Face and require an access token; only Qwen is ungated. The token is supplied to the remote containers through a Modal secret (hf-token), attached to both GPU functions, so it reaches the container environment without being committed to code. The secret's existence was verified before the diagnostic (created on the run date), and the diagnostic's success on the two gated models is the end-to-end proof that the secret resolved inside the containers — a gated-access failure would have surfaced as a 401 on model load in the ten-cell pass, not silently in the paid grid.

8.6 The orphan-billing catch

The single most instructive failure in the arc cost real money and was caught by inspection. A first version of the full grid was fired with the data saved only by the local entrypoint at the very end. It was stopped after roughly ninety seconds to harden the persistence path first — and stopping the local modal run client did not stop the remote app. An inspection of the Modal app list showed the ephemeral app still in the ephemeral state with eleven GPU containers running and billing: killing the client had detached from, not terminated, the fan-out, and eleven A100/H100 containers were burning compute with no client attached. Terminating it was itself a small lesson in non-interactive tooling: modal app stop <id> prompted [y/N] and aborted under the non-interactive shell; piping y into it failed with Aborted: no interactive terminal detected. Rerun with --yes (-y); only the explicit modal app stop -y <id> worked, after which the app drained from eleven containers to nine to zero. Observations. The hazard is generic to every serverless fan-out: the containers outlive the client that launched them, so "I stopped my terminal" is not "I stopped paying," and the only reliable check is to inspect the remote app state directly. This is exactly the silent-cost failure mode that does not appear until an invoice, and it was caught here by looking.

8.7 Dual-path persistence: modal.Dict, recover, and --detach

The orphan catch and the §8.1 save crash together motivated a persistence design that does not depend on the local process surviving. In the hardened harness, each cell writes its own result to a run-tagged modal.Dict (tr164-amort-<runtag>) the instant it completes, so a finished cell is durable independent of what happens to the local entrypoint, the network, or the client; a recover entrypoint mode reconstructs results.json from the Dict if the local process ever dies mid-run; and the grid runs with --detach so the remote app survives a client disconnect. The modal.Dict API required one correction during bring-up: it does not implement Python's len() (TypeError: object of type 'Dict' has no len()) but exposes .len() as a method, caught against the live Dict and confirmed alongside .keys(), .items(), .values(), and .clear() before the recover path was relied upon. In the event the local happy path completed normally and saved all 672 cells, and the insurance Dict independently held all 672, and the two agreed exactly (§23) — so the insurance was never needed, but it was verified correct on both channels rather than assumed.

8.8 Concurrent dispatch, the run, and the background-task discovery

A final efficiency fix: the first dispatch ran the A100 pool to completion and then the H100 pool (each .map() blocks until its pool drains), serializing two independent GPU fleets for no reason. Rewrapping the two maps in a two-worker thread pool runs both GPU pools concurrently — identical GPU-seconds billed, roughly half the wall-clock. The hardened, concurrent, detached grid then ran cleanly: the insurance Dict climbed steadily from 35 cells at the first poll through 171, 281, 394, 496, 613, to 672, at ~10–13 cells per minute with no plateau and no run-crashing failure, completing in ≈ 63 minutes of wall-clock with both pools busy. One incidental observation from monitoring: the local driver and the progress watcher both ran well past ten minutes without being killed, so the background execution survived for the full run. The only benign warning in the logs was Mistral's recommendation to use --tokenizer-mode "mistral", which is irrelevant here precisely because the prompts are exact token-id sequences (§4.4): sequence length is fixed regardless of tokenizer mode, so the tokenizer-density issue that was a confound in V3 does not arise in V4.

8.9 Why the execution arc is a result

The grid finished 672/672 at ok_rate 1.0 with zero error cells, and that number is the headline. But the trust-relevant facts are the arc that produced it: a non-representative engine path was measured and rejected on the evidence (V0's 62× long-context gap), a headline axis was correctly scoped to what the engine actually supports (fp8 weight, not KV), every new configuration was validated for ~1.5% overhead before the paid fan-out, an eleven-container orphan-billing hazard was caught by inspection rather than on an invoice, and the data was made durable on two independent channels that were verified to agree. None of these is a footnote: each one is a place the measurement could have silently returned a wrong or expensive number, and the report's confidence in the 672 cells rests on them.

On paid serverless compute the harness's failure modes are part of the result. V4's clean 672/672 is backed by an evidence-based engine rejection, a correctly-scoped fp8 axis, a validate-before-pay diagnostic, a caught orphan-billing fleet, and dual-channel persistence proven to agree — the methodology-honest substrate beneath the numbers.

9. SS3. The Amortization Metric and the First-Principles Bandwidth Model

9.1 The metric and what it isolates

η(B) = t(B)/t(1) is the parallel efficiency in static-batch form. At η ≈ 1 the engine reads the weights once and amortizes that read across B concurrent decodes for free; aggregate throughput scales linearly. What erodes η as B (or context) grows is the KV cache, which scales as B × C and must be read every step, until KV traffic rivals and then dominates weight traffic. The metric isolates this from raw speed: §3.2 and §20 show the batch-1 rate barely moves with context (penalty 0.94), so the entire context effect is in the shape of η, i.e. in one parameter.

9.2 Deriving η(B) = (1+r)/(1+Br)

Treat the decode step as bandwidth-bound. Bytes read per step at batch B: the weights once (W) plus each request's KV cache (B · C · k, with k = KV-bytes-per-token). At sustained bandwidth BW, step time is τ(B) = (W + B·C·k)/BW, and each step emits B tokens (one per request), so per-request throughput is t(B) = 1/τ(B) = BW/(W + B·C·k). Then

η(B) = t(B)/t(1) = (W + C·k)/(W + B·C·k) = (1 + r)/(1 + B·r), r = C·k/W.

r is the ratio of one request's KV read bytes (at the full context) to the weight bytes — the single quantity that sets the curve. The knee follows in closed form: η(B) < 0.65 ⟺ B > (0.35 + r)/(0.65·r) ≡ B*. As r → 0 (short context, small KV), B* → ∞ (batching always pays); as r grows with context, B* shrinks. The prediction is sharp: the knee scales inversely with context, and the model with the smaller k (fewer KV bytes/token) has the later knee.

9.3 Why the pure model is a floor, not the whole story

The derivation assumes the per-step memory traffic moves at a constant effective bandwidth, so per-request rate falls as 1/(W + B·C·k). Observed η is consistently gentler than this — the real curve behaves as if r were α · r_arch with α < 1 (median 0.52 on A100, 0.33 on H100, §11.3) — so the pure prediction (α = 1) is a conservative floor on the knee. A first-principles roofline check on all 672 cells (derive_alpha.py) rules out two tempting explanations and leaves one. Not compute overlap: decode is memory-bound essentially everywhere — the compute-to-memory time ratio is a median of 0.046 (4.6%), exceeding 0.5 in only 6 of 576 cells and reaching 0.36 only at the short-context, high-batch corner. There is no compute to "hide behind the reads" in the long-context regime that sets the knee. Not VRAM capacity either, at the knee: 95% of the breakdown knees fall at a median 0.27 × the VRAM concurrency ceiling n_max = (0.9·VRAM−W)/(C·k) (§11.3, Q5), i.e. at the knee the KV cache still fits VRAM and the scheduler is not yet wave-batching — so capacity is not the trigger. What remains is the mechanism the data does support: achieved memory-bandwidth utilization rises with batch. Measured MBU climbs from ≈ 0.55 at batch 1 toward and past saturation (the model-implied MBU even crosses 1.0 at the largest batches — 1.2 on A100, 1.95 on H100 — which is physically the signature of the constant-bandwidth assumption failing, dominated there by the KV cache exceeding VRAM and vLLM wave-batching). Because effective bandwidth improves as the batch grows (GEMV→GEMM weight reads, better coalescing), per-request rate decays slower than the constant-bandwidth model predicts — and that is the gentling α < 1. Observations. α is therefore an empirical, hardware-specific property (the H100's MBU rises further/faster, so its α is lower), not a quantity derivable from architecture; the VRAM ceiling is a useful correlate (it shares the 1/(C·k) scaling and co-predicts the knee at ρ ≈ 0.83) but not the cause, since the knee sits well below it.

The bandwidth model is derived, not assumed: a bandwidth-bound decode step yields η(B) = (1+r)/(1+Br) with r = C·k/W, predicting a knee that scales inversely with the KV footprint C·k and a model ordering set by KV-bytes-per-token. The ~2× gentling α < 1 is not compute overlap (decode is 4.6%-compute) and not VRAM capacity at the knee (95% of knees sit at ~¼ of the VRAM ceiling, where KV still fits) — it is achieved bandwidth rising with batch (MBU 0.55 → saturation), which makes the per-request rate decay slower than a constant-bandwidth model assumes. α is an empirical, hardware-specific correction; the pure form is a conservative floor.

9.4 A worked prediction, end to end

To make the model concrete and falsifiable, here is the full arithmetic for two configurations at the stressed corner (context 32,000), using only the verified architecture (Appendix D) and the formulas above — no fitting. The KV-bytes-per-token is k = 2 × layers × KV-heads × head_dim × 2 and the weight bytes are W = params × 2 (fp16).

For qwen2.5-7b: k = 2 × 28 × 4 × 128 × 2 = 57,344 bytes/token, W = 7,615,616,512 × 2 = 15.23 GB, so at C = 32,000 the per-request KV read is C·k = 1.835 GB and r = C·k/W = 1.835/15.23 = 0.1205. The closed-form knee is B* = (0.35 + r)/(0.65·r) = (0.35 + 0.1205)/(0.65 × 0.1205) = 0.4705/0.0783 = 6.0. For llama3.1-8b: k = 2 × 32 × 8 × 128 × 2 = 131,072 bytes/token (2.3× Qwen's), W = 16.06 GB, C·k = 4.194 GB, r = 0.2612, and B* = (0.35 + 0.2612)/(0.65 × 0.2612) = 0.6112/0.1698 = 3.6.

These are the zero-parameter predictions: Qwen breaks at batch 6, Llama at batch 3.6 — the right ordering (Qwen 1.7× later, tracking its 0.46× smaller k), each about 2× below the observed knee because the pure form omits the high-batch saturation correction. Applying the empirical H100 batch-saturation factor α = 0.33 (the median r_fit/r_arch, §11.3), the effective ratio is α·r and the corrected knee is (0.35 + α·r)/(0.65·α·r): for Qwen α·r = 0.0398 → B* = 15.1, for Llama α·r = 0.0862 → B* = 7.8. The observed discrete knees at this corner are 16 for Qwen and 8 for Llama (Appendix C) — the α-corrected prediction lands on both. On the A100, α = 0.52 gives the more conservative B* = 10.1 (Qwen) and 5.5 (Llama), consistent with the A100's earlier knees. Observations. The worked example shows the two-stage structure of the prediction: the architecture alone fixes the ordering and the scale of the knee from three config numbers and a context length, and a single empirical per-GPU constant (the batch-saturation α, whose source is the VRAM ceiling, §11.3) closes the ~2× gap to the observed value. An engineer with a model card and a GPU's HBM/VRAM can run this calculation for an untested (model, context) in a spreadsheet.

A worked prediction: from Qwen's and Llama's config alone, the pure model predicts knees at batch 6.0 and 3.6 (right ordering, 2× conservative); the empirical H100 α = 0.33 corrects them to 15.1 and 7.8, landing on the observed 16 and 8. The amortization knee is computable in advance from a model card and a datasheet.

10. SS4. Context Length Is the Dominant Lever

The context cross-cut is the largest effect in V4 by a factor of three to five over any other axis. Pairing η at matched (model, GPU, decode, precision, batch > 1) cells between ctx512 and ctx32000 (144 informative pairs), mean η falls 0.865 → 0.599, paired Δ −0.266 (bootstrap 95% CI [−0.294, −0.240], Holm-adjusted Wilcoxon p ≈ 0; L1_main_effects.context_32000_vs_512), and the factorial ANOVA (§15) assigns context partial η² 0.266 against 0.038 for the GPU, the next design factor. Figure 1 (amortization_eta_vs_batch.png) plots the η(B) curves by context for each GPU, averaged over the twelve (model, decode, precision) configurations: the four curves fan from the near-flat ctx512 line to the steeply-collapsing ctx32000 line on both panels, with the 0.65 breakdown line crossed progressively earlier as context grows — the dominant lever made visual. The per-batch executive table (§3.2) shows the collapse is monotone at both batch 16 and batch 64 and steepens with batch.

Observations. Context is the only axis whose effect crosses the entire free-to-broken span on its own (mean batch-16 η 0.897 at ctx512 vs 0.442 at ctx32000). The mechanism is direct and quantitative: r = C·k/W is linear in C, so the 62.5× context increase multiplies the KV/weight ratio by the same factor and walks the curve from r ≈ 0 (free) to r large (broken). Because the batch-1 rate is nearly context-independent (§20), the context effect is purely a reshaping of r — the cleanest possible demonstration that context acts through the amortization parameter, not through raw speed.

Context dominates the amortization surface — an order of magnitude above any other design factor in the ANOVA — and it acts entirely by scaling the bandwidth model's one parameter r. This is both the largest empirical effect and the most direct confirmation of the mechanism.

10.1 The 128K extension: the knee compresses to batch 4 at the max-r corner

The grid stops at 32,000 tokens; the bandwidth model predicts what happens beyond it, and a 10-cell extension on llama3.1-8b at 128,000 tokens (4× the grid ceiling, fp16/d64, the only one of the three models whose 131,072-token position limit reaches it) tests the prediction at the largest r the substrate can produce. Linear-in-C, r = C·k/W rises to 1.04 (from 0.26 at 32K), and the architectural knee B* = (0.35+r)/(0.65r) is ≈ 2, ≈ 3 after the H100 α correction. Observed: the knee lands at batch 4 on both the A100 and the H200, and just past batch 4 on the H100 (η = 0.682 at its largest fitting batch) — the η curves run 1.00 → 0.78 → 0.62 → 0.36 (H200, b1–b8) and 1.00 → 0.76 → 0.53 (A100, b1–b4). At 128K the KV cache is 16.8 GB per sequence, so the 80 GB A100/H100 hold only ~4 concurrent sequences and the 141 GB H200 only ~8 — but the amortization knee arrives at or before that VRAM ceiling, so the breakdown is a bandwidth event, not a capacity one. Observations. The knee marches exactly as 1/(C·k) says it must: from never-by-64 at short context, to batch 8–16 at 32K, to batch 4 at 128K — the architectural prediction (≈ 2–3) landing within the model's usual ~2× factor at a context 4× beyond anything in the fitted grid. This is the headline's validity range extended by extrapolation-then-measurement: the one-parameter bandwidth model predicts the knee at a context it was never fit on.

Push context to 128K and the knee compresses to batch 4, the architectural prediction (≈ 2–3) holding 4× past the grid ceiling. The free lunch is essentially gone at long context on a single replica: with a 16.8 GB/sequence KV cache, you reach the amortization knee at the same handful of sequences the VRAM will hold — the two limits converge at the extreme.

11. SS5. The Bandwidth-Amortization Model: Fit, Architectural Prediction, and the VRAM-Ceiling Source of α

This section is the report's centerpiece: the test of whether the derived model (§9) actually predicts the measured surface. Three results — the functional-form fit, the zero-parameter architectural prediction, and the batch-saturation factor with its VRAM-ceiling source — are reported in turn, with Figure 2 (fit overlays) and Figure 3 (architectural prediction) as the visual anchors.

11.1 The functional form fits

Fitting η(B) = (1+r)/(1+Br) to each of the 96 (model, GPU, decode, precision, context) curves by least squares (one free parameter r, the curve anchored at η(1) = 1 by construction) gives median R² 0.928, with 57 of 96 curves above R² 0.90 and 82 of 96 above 0.70 (L3_roofline.fit_r2_median, distribution verified). Figure 2 overlays the fitted curves on the observed points for four representative configurations; the rational form tracks the data across the full batch sweep at every context. Observations. A one-parameter rational function explaining 93% of the variance in a seven-point throughput curve, across 96 independent curves spanning two GPUs, two precisions, two decode lengths, and a 62.5× context range, is strong evidence that the bandwidth-bound decode step is the right physical picture. The single free parameter is not a curve-fitting degree of freedom chasing shape — it is a physical ratio whose architectural value is tested next.

11.2 The zero-parameter architectural prediction

Computing r_arch = C·k/W from verified architecture (no fitting) and the closed-form knee B* = (0.35 + r_arch)/(0.65·r_arch), the predicted knee correlates with the observed interpolated knee at Spearman ρ = 0.84 (Pearson on log₂ 0.79, n = 65 breaking curves; L3_roofline.arch_knee_vs_obs_knee_spearman, Figure 3). The prediction holds across two orders of magnitude: for qwen2.5-7b | H100 | d64 | fp16 the architectural knee runs 281 → 71 → 19 → 6 across contexts 512 → 2,048 → 8,192 → 32,000, against observed never → never → 41 → 11 — same ordering, same scale, no fit. Observations. The points in Figure 3 sit systematically above the y = x line: the observed knee is later than the pure-architectural prediction, consistently, because achieved bandwidth rises with batch so the per-request rate decays slower than constant-bandwidth predicts (α < 1) — the empirical effect §11.3 pins down (not compute overlap, not VRAM-capacity at the knee). The offset is the signal, not noise — it is the batch-saturation factor quantified next. That a knee computed from three config numbers and a context length predicts the measured breakdown at ρ = 0.84 is the report's central claim.

11.3 The batch-saturation factor α: what it is, what it is not, and where the model breaks

The ratio α = r_fit / r_arch measures how much gentler the real curve is than the constant-bandwidth pure model predicts; its median is 0.52 on the A100 and 0.33 on the H100 (L3_roofline.alpha_compute_overlap_by_gpu). The first-principles derivation (§9.3, derive_alpha.py) eliminates two candidate mechanisms before settling the third. (i) Not compute overlap. Decode is memory-bound throughout (compute/memory time median 0.046; > 0.2 in only 46 of 576 cells, all at short-context/high-batch), so there is no compute to hide the KV reads behind in the long-context regime that sets the knee. (ii) Not VRAM capacity, at the knee. vLLM holds at most n_max = (0.9·VRAM − W − overhead)/(C·k) sequences of length C, and the observed knee does correlate with n_max at Spearman ρ = 0.83 (n = 65, Q4_vram_ceiling) — but only because both are functions of the same KV footprint C·k (r = C·k/W and n_max = VRAM/(C·k) share the 1/(C·k) scaling that the knee, ρ = 0.84, also follows). The knee is not the capacity boundary: 95% of knees fall at a median 0.27 × n_max (Q5_knee_regime), i.e. well inside the regime where the KV still fits VRAM and the scheduler is not wave-batching, so capacity cannot be the trigger. (iii) The supported mechanism — achieved bandwidth rises with batch. Measured MBU(B) = t(B)·(W+B·C·k)/BW_peak climbs from ≈ 0.55 at batch 1 (a batch-1 GEMV under-utilizing HBM) toward saturation as the batch grows (weight reads become GEMMs, memory access coalesces), so the per-request rate decays slower than a constant-bandwidth model assumes — which is exactly the gentling α < 1. The model-implied MBU even crosses the physical 1.0 ceiling at the largest batches (1.2 on A100, 1.95 on H100), the signature of the constant-bandwidth assumption failing entirely where the KV cache exceeds VRAM and vLLM wave-batches — and that wave-scheduling regime (B > n_max, beyond the operational knee) is exactly where the model fails outright: the four H100 | d512 | ctx32000 curves return R² < 0 because they go non-monotone there. Observations. The honest status of the model: an excellent empirical knee-predictor (the knee scales as 1/(C·k), ρ = 0.84), with α an empirical, hardware-specific correction (the H100's MBU rises faster/further, so its α is lower) that the data attributes to bandwidth-utilization rising with batch — not compute overlap, and not VRAM capacity, which is a correlate of the scaling rather than its cause. A closed-form α would require the per-GPU MBU-vs-batch curve, which is a hardware property to be microbenchmarked, not read off the architecture (§27). That α is genuinely per-GPU and not a bandwidth law is now demonstrated, not just asserted: a third bandwidth point (the H200, §12.3) shows α is not even monotone in HBM bandwidth — the H100, not the higher-bandwidth H200, has the lowest α (11/12 matched fp16/d64 cells), so the A100→H100 decrease was a two-point coincidence, not a slope.

α is an empirical, hardware-specific correction, and the derivation pins down what it is not: not compute overlap (decode is 4.6%-compute) and not VRAM capacity at the knee (95% of knees sit at ~¼ of the VRAM ceiling, where KV still fits). What the data supports is achieved memory-bandwidth utilization rising with batch (MBU 0.55 → saturation), which makes per-request rate decay slower than a constant-bandwidth model assumes. The knee itself scales cleanly as 1/(C·k) (ρ = 0.84) — that is the predictive headline; α is the empirical second-order correction, and only beyond the VRAM ceiling (past the knee) does the model break outright.

12. SS6. The HBM Bandwidth Axis and the GPU×Context Interaction

12.1 The main effect

The GPU cross-cut is the cleanest causal probe in V4: only the memory subsystem changes. Over 288 matched (model, decode, precision, context, batch > 1) pairs the H100 (HBM3, ≈ 3.35 TB/s) sustains mean η 0.801 against the A100-80GB's (HBM2e, ≈ 2.0 TB/s) 0.724, paired Δ +0.077 (bootstrap 95% CI [0.067, 0.088], Holm-adjusted Wilcoxon p ≈ 0; L1_main_effects.gpu_H100_vs_A100). The absolute ceiling moves further: peak batch-64 aggregate decode throughput is 8,879 tok/s on the H100 versus 4,684 tok/s on the A100, means 6,210 vs 2,884 (§17). The compute and bandwidth advantages separate cleanly: the ≈ 1.9× absolute-aggregate gain reflects compute lifting the whole curve and cancels out of the normalized η, while the +0.077 η gain is specifically the bandwidth-delayed knee.

12.2 The interaction: the H100 edge grows where bandwidth binds

The bandwidth mechanism predicts that a faster memory subsystem should help most where memory traffic is largest. The difference-in-differences test confirms it: the H100-minus-A100 η gap grows from +0.036 at ctx512 to +0.125 at ctx32000 (DiD +0.088, bootstrap 95% CI [0.059, 0.121], Holm p < 1e-5; L2_interactions.gpu_gap_x_context). It is the single largest interaction in the grid after context×batch. Observations. This is a sign-and-magnitude prediction the data confirms: at short context, where decode is weight-read-bound and both GPUs amortize easily, the HBM generation barely matters (+3.6 points); at 32,000 tokens, where each request drags a multi-gigabyte KV cache through memory every step, the faster HBM is worth +12.5 points of retained efficiency. The interaction is the mechanistic complement to the α factor (§11.3): the H100's faster HBM (and its lower batch-saturation α 0.33 vs 0.52) is exactly what holds its knee out, and it does so most at long context where bandwidth binds.

The H100 advantage is bandwidth, shown two ways: a +7.7-point main effect and a +8.8-point difference-in-differences that concentrates the edge at long context where bandwidth binds. Same models, same engine, only the HBM changes — and it changes η most where the mechanism says it should.

12.3 The third bandwidth point: α does not track HBM bandwidth (a tested, refuted extrapolation)

The two-GPU contrast above invites an obvious extrapolation: if the H100's faster HBM gives it a lower batch-saturation factor (α 0.33 vs the A100's 0.52, §11.3) and holds its knee out, then an even-faster-memory GPU should push α lower still and the knee later still — α as a monotone function of HBM bandwidth. We tested it directly by adding a third bandwidth point, the H200 (HBM3e, ~4.8 TB/s vendor-spec peak vs the H100's 3.35 and the A100's 2.0), as an 84-cell fp16/d64 grid (3 models × 4 contexts × 7 batches, vLLM 0.10.2 V1 with CUDA graphs, ok_rate 1.0, 84/84). To compare like with like, each H200 curve's α is matched against the V4 fp16/d64 A100 and H100 per-curve α (not the all-curve per-GPU aggregate, which folds in fp8 and d512). The extrapolation fails.

α (= r_fit/r_arch), matched fp16/d64	A100 (2.0 TB/s)	H100 (3.35 TB/s)	H200 (4.8 TB/s)	monotone↓?
median, all contexts (n=12)	0.628	0.456	0.679	no
median, long context ctx≥8192 (n=6)	0.443	0.198	0.298	no

Observations. In both medians the H100 is the minimum, and the higher-bandwidth H200 sits above it — the trend reverses rather than continues. The reversal is not a median artifact: at the matched-cell level the H200's α exceeds the H100's in 11 of 12 (model, context) cells (e.g. llama3.1-8b | ctx32000: 0.43 A100 → 0.16 H100 → 0.31 H200; mistral-7b | ctx8192: 0.32 → 0.20 → 0.27). It is also not a precision artifact: an 84-cell H200 fp8-weight leg replicates the pattern independently — long-context α median A100 0.333 / H100 0.163 / H200 0.412, H100 again the minimum and H200 α > H100 α in 6 of 6 matched cells. The operator-facing knee tells the same mixed story: llama and mistral knees mostly shift later-or-equal with bandwidth (H200 ≥ H100 ≥ A100), but qwen reverses (ctx8192 knee 32 on H200 vs 64 on H100; ctx32000 knee 8 on H200 vs 16 — the latter a threshold-sensitive crossing, η = 0.642 just under the 0.65 line, with a non-monotone tail). So neither α nor the knee is a clean function of HBM bandwidth past the H100. The honest reading: α is an empirical per-GPU property (a fact §11.3 already stated and the forbidden-claims ledger already protected), and the third point now demonstrates it — the two-point A100→H100 line was a coincidence of two points, not a bandwidth law. Why the H100 is the gentlest is answered directly in §12.4 by measuring the per-GPU memory-bandwidth-utilization curve — and it is not explained by peak bandwidth, which would predict the opposite ordering. This is exactly the value of a third point: it converts "α decreases with bandwidth" from a plausible two-point trend into a refuted one.

The third bandwidth point is a negative result that strengthens the claim. The A100→H100 drop in α looked like a bandwidth law; the H200 — 43% more bandwidth than the H100 — has a higher α (11/12 matched cells), so α is empirical and per-GPU, not a monotone function of HBM bandwidth. The predictive headline (knee ∝ 1/(C·k), ρ = 0.84) is untouched; what the H200 refutes is a tempting story about the second-order α correction, exactly the kind of overclaim a third point exists to kill.

12.4 Why: the H200 underutilizes its bandwidth (the MBU mechanism, measured)

§12.3 raised the question; the per-GPU memory-bandwidth-utilization (MBU) curve answers it. Against a measured peak bandwidth (a best-of-N device-to-device copy probe, alpha_microbench.py ported into the Modal harness: A100 1,763 / H100 3,020 / H200 4,240 GB/s, 86–90% of vendor spec and in the same bandwidth order as nominal — so the §12.3 reversal is not a spec artifact), MBU(B) = achieved_bw(B)/measured_peak with achieved_bw(B) = (W + B·C·k)·agg_tps(B)/B. The gentling that gives a low α is MBU rising with batch — and the H200 rises least.

median MBU, fp16/d64	A100 (1,763 GB/s)	H100 (3,020 GB/s)	H200 (4,240 GB/s)
at batch 1	0.699	0.711	0.584
max over batch	1.00	1.11	0.744
rise (max ÷ b1)	×1.43	×1.56	×1.27

Observations. The MBU rise — how far achieved bandwidth climbs as the batch fills — orders exactly inversely to α: H100 rises most (×1.56, lowest α 0.46), A100 middle (×1.43, α 0.63), H200 rises least (×1.27, highest α 0.68; H200 MBU(1) is below the H100's in 12/12 cells). The mechanism is now legible: the H200's HBM3e bandwidth outruns the decode kernels. It starts at the lowest utilization (0.584 — the most "headroom") yet saturates lowest (0.744, never approaching the ~1.0+ the A100/H100 reach), because the GEMV→GEMM batched decode cannot generate enough memory pressure to use 4.2 TB/s — so its per-request rate decays closer to the constant-bandwidth prediction (less gentling, higher α). The A100/H100 model-implied MBU crossing 1.0 at the largest batches is the constant-bandwidth assumption breaking where the KV cache exceeds VRAM and vLLM wave-batches (§11.3); the H200's 141 GB VRAM avoids that regime, so its honest ≤1.0 ceiling at 0.744 is the robust signal — and that lower achievable ceiling is precisely why more peak bandwidth buys less batch-amortization, not more. α is therefore not "compute overlap," not "VRAM capacity at the knee," and not "HBM bandwidth" — it is the achieved MBU-vs-batch trajectory, which is a per-GPU kernel-and-memory-subsystem property the architecture does not predict.

The negative result has a measured cause. Plot achieved bandwidth against batch and the H200's curve is the flattest of the three — its 4.2 TB/s is under-driven by the decode kernels, so it gentles least and lands the highest α. "More bandwidth" only lowers α if the kernels can use it; on the H200 they cannot, which is why the two-point bandwidth law breaks. This is the first-principles content §11.3 promised the microbench would supply, now supplied.

13. SS7. The Precision Tradeoff and the Context-Invariant fp8 Penalty

13.1 The main effect

fp8 weight quantization halves the bytes read for the weights each decode step, raising absolute throughput (single-stream fp8 is 1.40× fp16, §20), but it leaves the KV cache at fp16, so it reaches the KV-dominated regime sooner and is −0.054 less batch-robust: mean η 0.735 (fp8) vs 0.790 (fp16) over 288 matched pairs (bootstrap 95% CI [−0.061, −0.047], Holm p ≈ 0; L1_main_effects.precision_fp8_vs_fp16). In the bandwidth model this is immediate: fp8 halves W, so r = C·k/W doubles at every context, shifting the binding constraint toward KV and pulling the knee in.

13.2 The correction: the fp8 penalty is context-invariant, not context-escalating

The intuitive expectation is that the fp8 penalty should be largest where the KV cache is largest — long context — because that is where the unquantized KV dominates most. The maximalist difference-in-differences test falsifies this intuition: the fp8 η-penalty is −0.053 at ctx512 and −0.051 at ctx32000, DiD +0.002 (bootstrap 95% CI [−0.016, +0.021], Holm p = 0.88, not significant; L2_interactions.precision_gap_x_context). The penalty is context-invariant at ≈ 5 points. It is significantly smaller on the H100 (precision×GPU DiD +0.029, Holm p < 1e-5; L2_interactions.precision_gap_x_gpu): −0.069 on the A100 versus −0.040 on the H100, consistent with the H100's larger batch-saturation relaxation (lower α, §11.3) softening every bandwidth-side effect including the fp8 one — not, the derivation shows, a compute effect. Observations. Why context-invariant? In the model, fp8 doubles r at every context uniformly, and the η-difference between an r and a 2r curve is roughly scale-stable across the breakdown region — so the penalty in η-points is approximately constant in context even though both curves shift. This is a genuine correction the maximalist analysis caught; the earlier main-effects-only writeup asserted a context-escalating penalty that the data does not support.

fp8 relocates the bottleneck onto the unquantized KV cache at a fixed ≈ 5-point batch-robustness cost that does not escalate with context — correcting a plausible intuition — but does shrink on a compute-richer GPU. The follow-up the sign motivates is fp8 KV-cache quantization (§13.3).

13.3 fp8 KV-cache: halving the KV footprint does not push the knee out

The §13.2 sign — fp8 weight quant relocates the bottleneck onto the unquantized KV cache — motivates the obvious next lever: quantize the KV cache itself. The bandwidth model makes a sharp prediction. fp8 (e4m3) KV-cache halves the KV-bytes-per-token k (qwen 57,344 → 28,672; Llama/Mistral 131,072 → 65,536), so it halves r = C·k/W at every context, and the knee — scaling as ~1/r — should move out by roughly 2×. We ran it: an 84-cell fp8-KV d64 grid on the H100 (where vLLM 0.10.2's V1 engine supports fp8 KV-cache, via the FlashAttention-3 backend on SM90; the kernel requires bf16 weights, which leaves W — and the knee comparison — unchanged since bf16 and fp16 are both 2-byte), ok_rate 1.0, 84/84. The prediction fails: across the six (model, context) cells where both the fp16 and fp8-KV knees are defined, the fp8-KV knee is equal in 4 and earlier in 2, later in 0 — median knee ratio 1.0. Halving the KV cache's memory footprint left the amortization knee where it was.

Observations. Two readings, and the honest one carries a caveat. The clean reading is that this is consistent with §12.4: the binding constraint on the knee is the achieved-MBU-vs-batch trajectory, not the nominal KV-read volume — so halving the bytes the model would read does not move the knee, because the decode was not saturating bandwidth on those reads to begin with (the same reason the H200's spare bandwidth went unused). The caveat is a confound: enabling fp8 KV-cache forces the FA3 attention backend, whereas the fp16 grid used vLLM's default backend, so part of the "no shift" (and especially the two earlier-knee cells, llama3.1-8b | ctx8192 64 → 16 and mistral-7b | ctx8192 64 → 16) may be an FA3-versus-default batch-scaling difference rather than a pure KV-dtype effect. The robust, defensible statement is the negative one — fp8 KV-cache halves KV memory but did not extend the batching free-lunch in this comparison — with a matched-backend (FA3-fp16) control flagged as the clean isolation (§27). The operational takeaway is already useful: fp8 KV-cache is a capacity lever (it lets more or longer sequences fit in VRAM), not a throughput-amortization lever — it should not be expected to move the batch at which continuous batching stops paying.

Halving the KV cache's bytes-per-token did not move the knee (median ratio 1.0, 0/6 later). The bandwidth model predicted ~2× later; the data refused. Read cleanly, it reinforces §12.4 — the knee tracks achieved bandwidth utilization, not nominal KV-read volume; read cautiously, the FA3-backend confound means the pure KV-dtype effect needs a matched-backend control. Either way, fp8 KV-cache is a capacity win, not an amortization win.

14. SS8. The Decode-Length Axis

The decode-length axis is the smallest physical effect and the only one whose practical reading is second-order. Over 288 matched (model, GPU, precision, context, batch > 1) cells, the longer decode (512) sustains mean η 0.780 against the shorter decode's (64) 0.745, Δ +0.036 (bootstrap 95% CI [0.028, 0.043], Holm p ≈ 0; L1_main_effects.decode_512_vs_64). The effect grows with context (decode×context DiD +0.042, Holm p = 0.001; L2_interactions.decode_gap_x_context). Observations. The sign is sensible: a longer decode runs more memory-bound steady-state steps against the one compute-leaning prefill, so a larger fraction of the cell's time is clean steady-state decode and a smaller fraction is the prefill-to-decode transition and warm-up, marginally raising retained per-request rate; and the effect is larger at long context because the prefill transition is proportionally costlier there. The magnitude is about half the precision effect and a sixth of the context effect (ANOVA partial η² 0.008, §15), so it shifts the knee by at most one batch step in the table (§22). It is reported in full because its sign is one more consistency check on the bandwidth picture (more decode steps, more amortization).

Decode length is a real but second-order modifier of the knee, larger at long context, consistent with more steady-state decode steps amortizing the fixed prefill. It is a consistency check, not a lever an operator would turn.

15. SS9. The Factorial ANOVA Lever Ranking

A Type-II ANOVA of η on all main effects plus 2-way interactions (576 batch > 1 cells, full-model R² = 0.949) ranks the levers quantitatively by partial η²:

Term	partial η²	reading
batch	0.509	the η argument itself (η varies along batch by construction)
context	0.266	the dominant design factor
context × batch	0.060	context reshapes the batch curve — the knee, formally
gpu	0.038	the HBM axis
precision	0.018	the fp8 axis
gpu × batch	0.013	the H100 holds the curve up — the α effect
model	0.009	the KV-architecture ordering
decode	0.008	second-order
model × batch	0.007	per-model curve shape
gpu × context	0.007	the H100-edge-grows-with-context interaction

These are L2_factorial_anova.top_terms. Observations. Three things. First, among the design factors (excluding batch, which is η's argument), context's partial η² of 0.266 is an order of magnitude above the next (GPU 0.038), precisely matching the paired-effect-size ordering and putting a number on "context dominates." Second, the largest interaction is context×batch (0.060) — the formal statement that context reshapes the amortization curve, i.e. moves the knee, which is the entire point of the report. Third, the GPU and model effects appear both as main effects and as ×batch interactions (gpu×batch 0.013, model×batch 0.007), the ANOVA's independent confirmation that these factors act on the curve shape (the α and KV-architecture mechanisms), not merely on the level.

The ANOVA ranks the levers and confirms the story numerically: context partial η² 0.27 ≫ gpu 0.038 > precision 0.018 > model 0.009, with the largest interaction (context×batch) being the knee itself. The variance decomposition agrees with the paired effect sizes and the bandwidth mechanism.

16. SS10. Cross-Model Robustness and the Architecture That Explains It

The three families share the amortization surface qualitatively — all collapse with context, all respond to GPU and precision in the same direction — with a small ordering the architecture predicts. Pairing η over every (GPU, decode, precision, context, batch > 1) cell, qwen2.5-7b sustains mean η 0.789, ahead of llama3.1-8b 0.750 and mistral-7b 0.748 (L8_per_model, model_comparison). The per-model reads below trace each family's profile back to its KV-byte architecture.

16.1 qwen2.5-7b — the most batch-robust, and why

Qwen's KV cache is 57,344 bytes/token — 2 × 28 layers × 4 KV heads × 128 head_dim × 2 bytes — less than half the 131,072 of the other two, because Qwen pairs a shallower stack (28 vs 32 layers) with an aggressive 7:1 grouped-query ratio (4 KV heads against 28 query heads). At any context its r = C·k/W is therefore roughly half, its η curve is the gentlest of the three, and its knee falls latest: across its eight (GPU, decode, precision) configurations the median knee at ctx32000 is batch 16, and its single most robust cell (H100 | d512 | fp16) does not break until batch 32 — the latest knee anywhere in the grid at that context. The Qwen-vs-Llama and Qwen-vs-Mistral contrasts are both Holm-significant (Δ +0.039, CI [0.027, 0.050]; Δ +0.041, CI [0.029, 0.052]; both Holm p ≈ 0). Observations. Qwen is not robust by accident or by tuning; it is robust because its architecture reads the fewest KV bytes per token, and the worked example (§9.4) predicts its batch-16 knee from that fact alone. Of the three, Qwen is the model an operator should reach for when long-context batched throughput matters.

16.2 llama3.1-8b — the baseline 4:1 GQA profile

Llama-3.1-8B has the standard 32-layer, 8-KV-head, 128-head-dim geometry (4:1 grouped-query), giving the 131,072-byte-per-token KV footprint and the largest weight matrix of the three (8.03 B params, W = 16.06 GB fp16). Its median knee at ctx32000 is batch 8, half of Qwen's, exactly as its 2.3× larger k predicts (the larger W partly offsets the larger k, but k dominates the ratio). Llama holds the free regime through ctx2048 on most configurations and only its H100 | d512 | fp16 cell reaches ctx32000 before breaking (knee 16 there, the H100 bandwidth advantage at the most-amortizing decode). Observations. Llama is the reference profile against which the other two are read: Qwen beats it by halving k, and Mistral matches it by sharing its KV geometry exactly.

16.3 mistral-7b — identical KV geometry, and the earliest knee in the grid

Mistral-7B-v0.3 shares Llama's KV geometry exactly (32 layers, 8 KV heads, 128 head_dim → 131,072 bytes/token) but has the smallest weight matrix of the three (7.25 B params, W = 14.50 GB fp16, the lowest). Same k, smaller W, so its r = C·k/W is slightly higher than Llama's at matched context, and its median ctx32000 knee is likewise batch 8. The grid's single earliest knee — batch 4, at mistral-7b | A100-80GB | d64 | fp8 | ctx32000 — is a clean mechanism consequence: the lowest-weight model, under fp8 (which halves W again to ≈ 7.25 GB), at the longest context, maximizes r (≈ 0.58 architectural), pushing the predicted knee to the lowest value anywhere in the grid (§9.4 arithmetic gives B* ≈ 2.5 pure, ≈ 3–4 α-corrected). Observations. Mistral demonstrates the weight side of r: with k fixed by geometry, the model carrying the fewest weight bytes breaks earliest, and fp8 — which cuts weight bytes — sharpens the effect. The earliest knee in the grid is exactly where the mechanism says it should be.

16.4 The Llama≈Mistral tie, reported as a tie

Llama and Mistral are a statistical tie in batch-robustness: Δ +0.002, bootstrap 95% CI [−0.004, 0.007], Holm p = 0.13 (not significant). This is the honest reading — the CI straddles zero and Holm rejects significance — and it is exactly what their identical KV geometry predicts: same layers, KV heads, and head_dim give the same k, and the small weight difference (16.06 vs 14.50 GB) is too minor to separate their mean η across the grid. Observations. The tie is not a null finding to be glossed; it is a prediction confirmed — two models with identical KV architecture tie in the metric the KV architecture governs, while the model with different KV architecture (Qwen) separates significantly from both. Forcing a Llama-vs-Mistral rank would invent a distinction the mechanism says should not exist.

Each family's robustness is an architectural readout: Qwen's half-sized KV footprint earns its 2× later knee, Mistral's smallest weight matrix earns the grid's earliest knee under fp8, and the architecturally identical Llama and Mistral tie. Model choice is the fourth axis of the same bandwidth mechanism, and the architecture predicts the ordering in advance.

17. SS11. The Aggregate-Throughput Scaling Ceiling

η measures retained per-request rate; the operator also cares about the absolute aggregate throughput batching buys. The aggregate speedup A(B)/A(1) at batch 64, averaged across the twelve configs per context, falls 42.2× (ctx512) → 39.6× → 34.1× → 24.9× (ctx32000) (L5_aggregate_scaling.mean_aggregate_speedup_at_b64_by_context), against the ideal 64×. Peak absolute aggregate decode throughput at batch 64 is 8,879 tok/s (H100) and 4,684 tok/s (A100); the means are 6,210 and 2,884 (Figure 4 traces the speedup curves). Observations. Even at the worst corner (32,000 tokens), batch 64 still buys 25× the single-stream aggregate throughput — batching is never worthless, it just stops being free. The distinction is the report's operational core: at short context the speedup is near-ideal (42× of 64×, η ≈ 0.66 at batch 64), while at long context the 25× speedup comes at η ≈ 0.39, i.e. the operator pays a 2.6× per-request latency inflation for it. The aggregate ceiling and the efficiency knee are two readings of one surface: aggregate speedup = B · η(B), so the speedup plateaus exactly as η falls, and the plateau is steeper at long context.

Batching always raises aggregate throughput (25–42× at batch 64), but at long context that throughput is bought with a 2.6× per-request latency inflation. The aggregate ceiling and the efficiency knee are the same surface read two ways: speedup is B·η, so it plateaus precisely as amortization breaks down.

18. SS12. Memory-Bandwidth Utilization: Confirming Decode Is Bandwidth-Bound

The bandwidth model assumes decode is HBM-bandwidth-bound; this section measures whether it is. At batch 1 each decode step reads W + C·k bytes and emits one token in 1/t(1) seconds, so achieved bandwidth is t(1)·(W + C·k), and the memory-bandwidth utilization (MBU) is that over the nominal peak HBM. For qwen2.5-7b | H100 | fp16 the achieved bandwidth is ≈ 2,100–2,200 GB/s across contexts and MBU is 0.63–0.66 (L4_bandwidth_mbu.detail); averaged over all models and precisions at batch 1, ctx512, MBU is 0.535 (A100) and 0.539 (H100). Observations. An MBU of 0.5–0.66 at batch 1 places decode firmly in the memory-bound regime — the GPU spends most of each step moving bytes, not computing — which is the premise the entire bandwidth model rests on. The MBU is roughly context-stable (the achieved bandwidth is a near-constant sustained ≈ 2.1 TB/s on the H100), confirming that the context effect on η is not an MBU artifact but a change in the bytes-per-token composition (more KV, same sustained bandwidth, fewer tokens per byte). The nominal peaks are vendor spec, not measured (Appendix D), so MBU is reported as achieved/peak with that caveat; the cross-GPU achieved-bandwidth ratio is a measured cross-check that does not depend on the exact peak.

Measured MBU of 0.5–0.66 at batch 1 confirms decode is bandwidth-bound — the premise of the model. The achieved bandwidth is near-constant across context, so the context effect is a change in bytes-per-token composition (KV growth), not a utilization artifact.

19. SS13. Measurement Reliability

The three timed reps per cell let measurement noise be quantified directly. The per-cell coefficient of variation of aggregate decode throughput (std/mean over 3 reps) has median 0.4%, mean 0.6%, 95th percentile 2.0%, and maximum 12.1% (L6_variance). By batch it rises gently from 0.6% at batch 1 to 1.05% at batch 64; by context it is flat at 0.6–0.7%. Observations. The surface is measured cleanly: a median 0.4% CV means the η curves and the knee crossings are not noise-limited, and the effect sizes (context 0.27, GPU 0.077, precision 0.054) are one to two orders of magnitude above the measurement noise floor. The mild rise at batch 64 (1.05%) reflects the larger scheduling and memory-contention variance at maximum batch; the 12.1% maximum is a single long-context, large-batch cell where the few-second timing window is most sensitive to contention. None of this threatens the headline: even the worst single cell's noise is small against the axis effects, and the bootstrap CIs already fold the cell-to-cell variance into the reported intervals.

Median per-cell CV of 0.4% (worst 12%) places the measurement noise one to two orders of magnitude below every reported effect. The surface is not noise-limited; the knees and curves are real.

20. SS14. The Single-Stream Baseline Surface

The batch-1 per-request rate t(1) is the amortization-free baseline, and its structure is a result in its own right. Three readings: the single-stream GPU speedup (H100/A100 at batch 1) is 1.68× averaged over the grid; the single-stream fp8 speedup (fp8/fp16) is 1.40×; and the prefill-amortization penalty t(1, ctx32000)/t(1, ctx512) is ≈ 0.94 (e.g. 0.939 for qwen2.5-7b | H100 | fp16; L7_single_stream). Observations. The third number is the keystone of the whole report: at batch 1 the per-request decode rate is only ~6% slower at 32,000 tokens than at 512. Context barely touches the single-stream rate — so the entire ≈ 27-point context effect on η is a reshaping of the batch curve, not a slowdown of decode. This is the quantitative justification for calling context an amortization lever (§3.2, §9.1). The H100's 1.68× single-stream edge (larger than its +7.7-point η edge) and fp8's 1.40× edge (against its −5.4-point η cost) make explicit that absolute speed and batch-robustness are distinct axes: the H100 is both faster and more robust; fp8 is faster but less robust.

Context barely changes the single-stream rate (penalty 0.94), so the whole context effect on amortization is curve-shape, not slowdown — the keystone of the report. Absolute speed (H100 1.68×, fp8 1.40×) and batch-robustness (H100 +7.7pts, fp8 −5.4pts) are distinct axes.

21. SS15. The Continuous-Knee Regression

Beyond the discrete knee, interpolating the exact batch at which η crosses 0.65 (linear in log₂ B) gives a continuous knee per curve, and regressing log₂(knee) on log₂(context) per configuration gives the compression rate. The median slope is −0.64 (mean −0.50; L9_knee_regression), i.e. the knee batch falls as roughly context^(−0.64). Observations. The pure-bandwidth model predicts the knee B* = (0.35 + r)/(0.65·r) with r ∝ C, which for r not-too-small gives B* ∝ 1/C — slope −1. The observed −0.64 is shallower than −1, and consistently so: this is the same batch-saturation relaxation (α < 1) seen in §11.3, which softens the context dependence because effective KV traffic is sublinear once the cache approaches and exceeds VRAM. fp8 configs show shallower slopes still (e.g. −0.33 vs −0.71 for the matched fp16 on qwen | A100 | d512), consistent with fp8 already sitting deeper in the KV-bound regime where the knee is less context-elastic. Observations. The regression turns the qualitative "knee marches left" into a rate: roughly a halving of the useful batch per ~3× of context, slower than the −1 a pure-bandwidth model would give, with the gap again being the VRAM-capacity batch-saturation effect, not compute.

The knee compresses as context^(−0.64), a measured rate between the pure-bandwidth prediction (−1) and context-independence (0); the gap from −1 is the batch-saturation relaxation (achieved bandwidth rising with batch, §11.3), the same α in a third guise.

22. SS16. The Knee Table — Where Batching Stops Paying

The knee table is the operational deliverable: for each of the 24 (model, GPU, decode, precision) configurations, the smallest batch with η < 0.65 at each context (full table, Appendix C). Reading down the context columns the knee compresses monotonically — modal entry "never" at ctx512 (14/24), still 14 "never" at ctx2048, clustering at batch 16–32 at ctx8192, and uniformly finite (≤ 32, mostly 8) at ctx32000. Reading across, two sub-structures recur: fp8 rows knee at equal-or-smaller batches than their fp16 partners (the §13 penalty), and H100 rows knee at equal-or-larger batches than their A100 partners (the §12 advantage), with the longest-held configs being H100 / decode-512 / fp16, free through ctx8192 for all three models. Observations. The table answers the operator's question directly — "at my model, GPU, precision, and prompt length, how deep can I batch before per-request latency degrades past a third" — and its most actionable row-family is ctx32000, where the answer is uniformly small (4–32, mostly 8). It is the per-configuration lookup that the bandwidth model (§11) generates and the cross-cut statistics (§10, §12, §13, §16) summarize.

The knee table is the deployment lookup the whole report generates: a per-configuration maximum-useful-batch as a function of prompt length, predicted by the bandwidth model and confirmed cell by cell.

23. SS17. Data Verification

Integrity rests on four independent main-thread checks. Completion and balance: exactly 672 rows, all status = "ok", zero error cells; 224 per model, 336 per GPU/decode/precision, all contexts and batches present. Physical sanity: no zero/negative throughput cell, and every cell's actual constructed context equals its target exactly (zero context_actual ≠ context mismatches across all 672 — the exact-token-id construction worked on every tokenizer). Dual-path agreement: the local results.json and the insurance modal.Dict each held all 672 cells and agreed. Analyzer faithfulness: one config's η curve and knee were recomputed from the raw per-request throughputs and matched the analyzer exactly (qwen2.5-7b|H100|d64|fp16|ctx32000: η = {1.00, 0.954, 0.864, 0.720, 0.562, 0.402, 0.388}, knee 16, identical), and the maximalist roofline fit was sanity-checked (median R² 0.93, the four negative-R² curves confirmed to be exactly the non-monotone H100/d512/ctx32000 tails, §11.3). Observations. Each check targets a distinct failure mode — a dropped axis, a truncating tokenizer, a partial save, a metric bug — and all passed, which is why the headline numbers are stated without hedging.

The data is verified four independent ways with persistence proven on two channels; the maximalist analysis additionally cross-checked the roofline fit and located its only failures exactly where the mechanism predicts. The 672/672 ok_rate is the headline; the verification is what licenses it.

24. SS18. Cross-Run Lineage and the Measured Served Knee — Closed-Loop versus Offline

V3 and V4 are two measurements of one physics from opposite ends. V3's closed-loop ladder asks what an operator sees under arrival-driven concurrency, folding amortization physics with scheduler behavior and queueing; V4's static-batch grid asks where the amortization ceiling sits when every slot is full of equal work, exposing the physics directly. They are consistent: V3 found a gradual, workload-shaped decline holding past the in-process baseline's N = 2 collapse and fanning out at N = 16–32; V4 finds the static-batch ceiling fans out exactly where context stresses KV bandwidth, with the knee in the same batch-16-to-32 region at moderate context and compressing to batch 8 only at the extreme 32,000-token corner V3's workload set did not isolate. Observations. The protocol difference is why both are needed and why neither substitutes for the other: a closed-loop operator will see the knee at or before V4's static-batch knee, because ragged batches, preemption, and queueing can only erode the idealized ceiling. So V4's knee is an upper bound on a real scheduler's knee — the operationally conservative direction. The original V4 stated this as an argument from protocol asymmetry; §24.1–§24.4 replace the argument with a direct measurement.

24.1 The closed-loop validation harness

To turn "the static knee upper-bounds the served knee" from a claim into a measurement, a dedicated closed-loop grid (modal_closed_loop_sweep.py) re-runs the V4 cells against a live server rather than an in-process batch. Each cell launches one vllm serve <model> --dtype float16 --max-model-len (ctx+dec) --gpu-memory-utilization 0.90 (with --quantization fp8 on the fp8 leg), health-polls /v1/models then /health (server log captured to disk and surfaced on failure — no silent bring-up), then drives it with a ladder of N ∈ {1, 2, 4, 6, 8, 12, 16, 24, 32, 48, 64} concurrent closed-loop agents (asyncio.gather over httpx, each agent send→await→send for four requests, two reps). The served metric is identical to the offline one — parallel_efficiency(N) = agg_decode_tps(N) / agg_decode_tps(1) / N, the same quantity as V4's offline η and V3's closed-loop efficiency — and the served knee is the smallest N > 1 with parallel_efficiency(N) < 0.65, the same threshold V4 uses offline. Decode is fixed at 64 tokens to match the offline d64 leg, ok_rate is 1.0 on every reported cell (no failed requests inflating or deflating the curve), and two prompt regimes are run to separate the two physical effects that a server folds together: distinct prompts (each agent sends a unique prompt — realistic traffic, no KV reuse) and shared-prefix prompts (all agents send the identical prompt — the prefix KV is cached once and prefill is amortized away). The static comparison point in every row is V4's own offline knee_table d64 leg (served_knee_analysis.py → served_knee_stats.json).

24.2 Distinct-prompt traffic: the static knee upper-bounds the served knee

Under distinct prompts — the regime a production endpoint actually sees — the served knee is at or below the offline static knee on 24 of 24 cells at contexts 512 and 8,192 (23 strictly below, 1 equal: qwen2.5-7b | H100 | fp8 | ctx512, served = static = 64), and on 2 of 2 at the 32,000-token diagnostic corner (qwen2.5-7b, both GPUs: served knee 2 against a static knee of 16) — 26 of 26 distinct-traffic configs satisfy served ≤ static, with zero violations. The bound is not merely satisfied; it loosens with context, and the loosening localizes the mechanism. At ctx512 the served knee sits in the 6–64 band (median ≈ 16–24) against static knees that are mostly "never breaks within batch 64," so decode amortization is still doing most of the work; at ctx8192 the served knee collapses to 2–4 (median 4) against static knees of 16–64; at ctx32000 it collapses to 2.

Traffic	Context	Median served knee	Static (offline d64) knee	served ≤ static
distinct	512	≈ 16–24	mostly never-by-64	12 / 12
distinct	8,192	4	16–64	12 / 12
distinct	32,000	2	16	2 / 2

Observations. The reason the served knee falls so far below the decode-amortization ceiling at long context is a different constraint than the one V4 measures: with distinct prompts there is no shared prefix, so every newly admitted request must run a full prefill whose FLOPs scale with context, and that prefill contends with the ongoing decode for the same HBM and SMs. At ctx512 prefill is cheap and decode physics dominates, so the served knee approaches the offline decode knee; at ctx8192 and beyond, prefill serialization becomes the binding constraint and the served parallel_efficiency collapses as ≈ 1/N from N = 2 (verified directly: qwen2.5-7b | A100 | fp8 | ctx8192 runs η = 1.00, 0.625, 0.36, 0.25, … at N = 1, 2, 4, 6 — pure 1/N, with ok_rate 1.0, so it is real throughput collapse, not request failure). This is the operationally important refinement of SS18: V4's offline knee is a necessary ceiling on decode amortization, but the served operating point under distinct long-context traffic is gated earlier by prefill contention, a physics V4's static-batch instrument deliberately excludes. The upper-bound direction is exactly right; the gap is the prefill term V4 does not claim to model.

24.3 Shared-prefix traffic: the decode knee transfers within a ladder step

When the prefill confound is removed — all agents share one cached prefix, so the server runs near-pure shared-prefix decode — the served knee tracks the offline static decode knee to within one concurrency-ladder step on 32 of 33 cells (19 exactly equal, 7 one step below, 6 one step above, 1 outlier two steps below: mistral-7b | A100 | fp8 | ctx512, static 64 → served 16). Neither cleanly above nor below: the served and offline knees coincide at the median. Observations. This is the controlled half of the experiment. Once prefill is amortized away by the cache, the only physics left in the served loop is the decode-amortization V4 measures offline — and the served knee lands on the offline knee, step-for-step, across both GPUs, both precisions, and the 512–32,000 context range. The seven one-step-above cases are the closed-loop scheduler overlapping the residual prefill of one request behind another's decode (continuous batching squeezing slightly past the equal-work static batch); they are within ladder granularity, not a violation of the mechanism. Taken with §24.2, the two regimes bracket the served knee from both sides: the decode-amortization ceiling V4 predicts is what governs serving when prefill is removed (§24.3), and it strictly upper-bounds the served knee when prefill is present (§24.2).

24.4 What the served grid measures, and what it does not

The closed-loop grid is N saturating closed-loop agents against a real vllm serve, not an open-loop arrival process: it measures the served decode-throughput knee under concurrency — the operating point an autoscaler targets when it asks "how many concurrent streams before per-request throughput halves" — not a tail-latency SLO or a Poisson-arrival queueing curve. Each cell is still equal-length within itself (ragged real batches will erode the served knee further, consistent with the upper-bound direction). No p95 or time-to-first-token number is produced here, and none is claimed. What the grid does establish, on its own substrate, is the lineage that SS18 previously argued: the offline static knee is a measured, conservative upper bound on the served knee under realistic distinct traffic (26/26), and the decode-amortization mechanism V4 predicts transfers to serving within one ladder step once prefill is held constant (32/33).

V4 does not revise V3; it grounds it, and the closed-loop grid now grounds V4's own upper-bound claim in measurement rather than argument. The served knee sits beneath the static-batch ceiling on every distinct-traffic config measured (served ≤ static, 26/26), the gap is the prefill-contention term V4's decode-physics instrument excludes, and once that confound is removed the served knee lands on the predicted decode knee step-for-step (32/33). The lineage is a measured bracket, not a contradiction — and the operationally conservative direction (plan to the offline knee, you will not over-provision) is now confirmed, not just asserted.

25. Limitations

The substrate bounds the claims. Three models, one scale band (7–8B; not a size sweep — V3 carries the size gradient; nothing about sub-7B or 13B+). Two GPUs, one engine version (A100-80GB, H100; vLLM 0.10.2 V1; the bandwidth axis is a two-point contrast, and the α factor is GPU-pair-specific, though the mechanism is engine-independent). fp8 weight, not fp8 KV (the KV cache is fp16 in every cell — the cause of the fp8 penalty's sign and the motivation for the fp8-KV follow-up; no fp8-KV claim is made). Static batch, with a saturating closed-loop validation but no open-loop curve (the main grid is one fixed equal-length batch per cell — the amortization ceiling, not a served latency; the companion closed-loop grid (§24) measures the served throughput knee under saturating concurrency and confirms it sits at or below the offline knee on 26/26 distinct-traffic configs, but neither grid produces a p95/TTFT or an open-loop arrival-rate curve). Equal-length prompts (real batches are ragged; the equal-length static batch is the best case, so the measured knee is an upper bound on a real scheduler's). Greedy fixed-length decode (no sampling or early stopping). The 32,000-token ceiling (shared across models with different position-embedding limits; the 131,072-token Llama regime is untested). The roofline model's scope (the pure form assumes constant effective bandwidth; it fits 85% of curves but breaks on the non-monotone H100/d512/ctx32000 corner where the scheduler wave-batches once KV exceeds VRAM, and α — the achieved-bandwidth-rising-with-batch correction (§11.3) — is an empirical per-GPU property, not a closed form derivable from architecture). The VRAM-ceiling co-predictor's assumptions (n_max uses an assumed 0.9 memory-utilization and a ~3 GB overhead; it co-predicts the knee via the shared 1/(C·k) scaling but is not its cause — the knee sits at ~¼ of it). Peak HBM now measured (the earlier nominal-peak limitation is resolved for the datacenter GPUs: a device-to-device-copy probe measured A100 1,763 / H100 3,020 / H200 4,240 GB/s, 86–90% of vendor spec and in the same order as nominal, and §12.4 uses these measured peaks for MBU; the §18 batch-1 MBU figures predate the probe and remain on vendor-spec peak). Observations. None of these threatens the headline, which is a predictive mechanism and a ranking on a defined substrate, not a universal constant. The limitations bound generalization, enumerated next.

Every limitation is a scope boundary, not a measurement flaw. The grid is internally complete and verified; the bandwidth model is validated where it applies and honest about where it breaks; what is not licensed is extrapolation past the three models, two GPUs, one engine, and static-batch protocol that ran.

26. Forbidden Claims

Explicitly not supported by V4:

Any closed-loop serving latency or p95 claim — V4 is offline static-batch decode throughput (§6.3, §24).
Generalization beyond 7–8B — three 7–8B families only (§25).
Generalization beyond the A100-80GB / H100 pair — a two-point bandwidth contrast (§12, §25).
Any fp8 KV-cache claim — only fp8 weight quantization was run; KV is fp16 everywhere (§6.2, §13).
An engine ranking or any cross-engine claim — one engine (vLLM 0.10.2); engine comparison is V3's territory under a different protocol.
The offline knee reported as a served knee or p95 — the offline knee is the equal-length static-batch decode ceiling, a measured upper bound on the served knee (§24: served ≤ static on 26/26 distinct-traffic configs), not the served value itself and not a tail-latency figure; the served decode-throughput knee is measured separately in the closed-loop grid (§24), and no p95/TTFT or open-loop arrival-rate claim is made by either grid.
An exact closed-form α — α's value is an empirical per-GPU property (achieved bandwidth rising with batch, §11.3), not a closed form derivable from architecture; the bandwidth model is licensed as predictive-up-to-α. Do not attribute α to compute overlap (refuted), to VRAM capacity at the knee (the knee sits at ~¼ of the VRAM ceiling, where KV still fits), or to HBM bandwidth as a monotone law — the H200 third bandwidth point refutes it (H100, not the higher-bandwidth H200, has the lowest α; §12.3).
Universal verbs — "always", "proves", "eliminates", "guarantees" are banned; the licensed reading is "on the 672 cells measured, the knee is governed by the KV footprint C·k, predicted from architecture at ρ = 0.84, up to a GPU-specific empirical batch-saturation factor."
A Llama-versus-Mistral robustness ordering — a statistical tie (Holm p = 0.13); reported as a tie (§16).

The result is strong inside its box and silent outside it. The box is three 7–8B models, two GPUs, one engine, static-batch offline throughput, and a bandwidth model whose knee scales as 1/(C·k) (ρ = 0.84) and is predictive up to an empirical, hardware-specific batch-saturation factor; every headline claim is bounded to that box.

27. Future Work

The sign of the fp8 penalty (§13) and its context-invariance motivate an fp8 KV-cache sweep to test whether quantizing the KV cache moves the knee back out and whether the penalty's context-invariance survives. Contrary to V1's earlier characterization, vLLM's V1 engine does support fp8 KV-cache via the FlashAttention-3 backend, but the vLLM source gates it to Hopper-class GPUs (SM90: H100/H200) — so the sweep is runnable on the H100 leg of this grid without the V0-engine fallback, but not on the A100 leg (SM80); the SM90/FA3 requirement and any prefix-caching/chunked-prefill interaction should be confirmed against the running vLLM version before the run. The two-point GPU contrast (§12) has now been extended with a third bandwidth point, the H200 (fp16/d64, 84 cells, §12.3), and the result is a refutation: α does not track HBM bandwidth — the H100, not the higher-bandwidth H200, has the lowest α (11/12 matched cells), so the A100→H100 trend does not extrapolate. What remains is to (i) explain why the H100 is the gentlest, via the per-GPU MBU-vs-batch microbench (alpha_microbench.py, whose device-to-device peak-bandwidth probe was validated on a consumer RTX 4080 Laptop at ~84% of vendor spec — confirming measured < nominal — and which should now run on the A100/H100/H200 to recover their achieved-bandwidth-vs-batch curves directly); (ii) add the H200 fp8 and d512 legs to complete the matched grid; and (iii) add a much-lower-bandwidth consumer point (the RTX 4080 Laptop) as an architecture-generality datapoint rather than a slope point. The 32,000-token ceiling (§25) leaves the 128K-context regime on Llama unmeasured (cloud-only: KV ≈ 16 GB/sequence exceeds a consumer GPU's VRAM), where r is far larger and the knee should compress toward batch 2–4. The static-batch ceiling has now been tied to the closed-loop served knee directly (§24: served ≤ static on 26/26 distinct-traffic configs, and the decode knee transfers within a ladder step under shared prefix on 32/33); the residual is to push past the saturating closed-loop grid to an open-loop arrival-rate sweep with p95/TTFT under ragged batches, which will quantify the prefill-contention term §24.2 isolates and turn the served knee into a tail-latency SLO bound rather than a throughput-knee bound. The α factor (§11.3) should be derived, not just fit, from the per-GPU memory-bandwidth-utilization-versus-batch curve; a microbench (alpha_microbench.py) that measures that MBU curve directly — and the achieved peak DRAM bandwidth, replacing the nominal-peak figure MBU currently assumes (§25) — is built and validated on a consumer GPU, and running it on the A100/H100 in the faithful CUDA-graph regime closes the predictive-up-to-α gap to a first-principles model.

Finally, the model surface should be extended across scale to test whether the KV-byte architectural prediction (§16) holds beyond 7–8B. The most consequential boundary is 70B (e.g. Llama-3.1-70B): at fp16 the 140 GB weight footprint requires tensor-parallel sharding across two 80 GB datacenter GPUs (fp8-weight ≈ 70 GB fits 2× with KV headroom), making this strictly a cloud, multi-GPU run. The instrument carries over unchanged — η(B) at fixed context, the architectural r = C·k/W from the 70B config — and a minimal scale point (one model, one 2-GPU config, decode 64, the four contexts × seven batches ≈ 28 cells) suffices to place 70B on the knee-vs-architecture plot; a full per-model slice (both precisions × both GPU pairs ≈ 224 cells, matching one model's share of the V4 grid) is the richer option. The dominant cost driver is long-context prefill at 70B (a 32K × high-batch cell prefills millions of tokens), so a 2–3-cell validate-before-pay calibration should fix the real per-cell wall-clock before committing the full spend, exactly as the V4 grid was scoped. This scale extension is the deliberately-deferred, paid-compute boundary; the cheaper cloud items above (the third bandwidth point, the fp8-KV H100 sweep, the α/peak-bandwidth microbench, and the 128K-context cell) are the lower-cost queue.

28. Conclusion

V4 turns the question "where does continuous batching stop amortizing" from a tabulation into a prediction. Across 672 verified cells — three 7–8B models, two datacenter GPUs, two decode lengths, two precisions, four contexts, seven batches, three reps, ok_rate 1.0, median per-cell CV 0.4% — a first-principles bandwidth model η(B) = (1+r)/(1+Br) with r = C·k/W fits the measured efficiency curves at median R² 0.93 and predicts the observed breakdown knee from architecture alone at Spearman ρ = 0.84, deviating only by a GPU-specific batch-saturation factor α (0.52 on A100, 0.33 on H100) that a first-principles roofline check shows is neither compute overlap nor a VRAM-capacity effect at the knee (95% of knees sit at ~¼ of the VRAM ceiling, where KV still fits) but an empirical, hardware-specific effect of achieved memory-bandwidth utilization rising with batch; the model's only outright failure (the non-monotone H100/d512/ctx32000 tails) is beyond the knee, where the KV cache exceeds VRAM and the scheduler wave-batches. Every axis sign falls out of that one KV-footprint scaling: context dominates (mean η 0.865 → 0.599, paired Δ −0.266, ANOVA partial η² 0.27); the H100 sustains η +0.077 with an advantage that grows to +0.125 at long context where bandwidth binds; fp8 is −0.054 less batch-robust at a context-invariant cost; a longer decode is +0.036; and qwen2.5-7b's half-sized 57,344-byte KV footprint predicts and matches its 2× later knee against a Llama ≈ Mistral tie. Decode is confirmed memory-bound (MBU ≈ 0.63 at batch 1; compute 4.6% of step time), the aggregate ceiling falls 42× → 25× across context, and the knee compresses as context^(−0.64). The claims are sized to the substrate — a best-case static-batch amortization ceiling for 7–8B models on two GPUs under one engine — and the contribution is a usable one: from a model's config and a GPU's datasheet, the batch at which continuous batching stops paying at a given context can be predicted in advance to within the empirical α. That ceiling is now confirmed to be the operationally conservative one: a closed-loop grid measures the served knee at or below the offline static knee on all 26 distinct-traffic configurations, and the decode-amortization mechanism transfers to serving within one ladder step once prefill is held constant (§24). The next measurements extend it toward fp8-KV, a third bandwidth point, the 128K-context regime, a closed-form α from the per-GPU bandwidth-utilization curve, and an open-loop arrival-rate sweep that turns the served throughput knee into a p95/TTFT bound.

29. References

TR164 V1/V2/V3 — research/tr164/DRAFT_Technical_Report_164{,_V2,_V3}.md (lineage, cited not recomputed; V3 forecasts this measurement).
TR165 — interpreter-axis GIL ablation companion (mechanism family).
Run artifact — research/tr164/results/modal_amortization/full_xmodel_672/results.json (672 cells).
Analysis artifacts — research/tr164/amortization_stats.json (main effects), research/tr164/amortization_full_stats.json (nine-layer maximalist analysis, all V4 numbers).
Analyzers — research/tr164/analyze_amortization.py, research/tr164/analyze_amortization_full.py, research/tr164/derive_alpha.py (the first-principles mechanism derivation → alpha_derivation.json: compute-boundedness, MBU rising with batch, the knee-below-VRAM-ceiling regime test, and the 1/(C·k) knee scaling of §11.3).
Closed-loop validation (§24) — harness research/tr164/modal_closed_loop_sweep.py, consolidation research/tr164/served_knee_analysis.py → research/tr164/served_knee_stats.json; results research/tr164/results/modal_closed_loop/{full_clfull1,full_clbfull1,diag_clbv1}/results.json (served knee vs offline static knee, distinct and shared-prefix traffic).
Bandwidth-utilization microbench (§27, future-work tool) — research/tr164/alpha_microbench.py (pytorch_direct steady-state decode timing + device-to-device-copy peak-bandwidth probe; measures the per-GPU MBU-vs-batch curve α derives from, and the achieved peak that would replace the nominal-peak MBU). Validated on a consumer GPU; queued to run on the A100/H100 in the CUDA-graph regime.
Third bandwidth point (§12.3) — grid research/tr164/modal_amortization_sweep.py --mode bw3 (H200, 84 fp16/d64 cells) + --mode h200fp8 (H200, 84 fp8-weight/d64 cells, the precision-robustness leg), analysis research/tr164/analyze_bw3_slope.py → research/tr164/bw3_slope_stats.json; results research/tr164/results/modal_amortization/{bw3_bw3a,h200fp8_h2f8a}/results.json. Figure generator papers/serving_stack_physics/v2_boundary/make_bw3_slope_figure.py.
MBU mechanism + measured peak (§12.4) — --mode peakbw (device-to-device-copy probe, A100/H100/H200) → results/modal_amortization/peakbw_pk1/results.json; analysis research/tr164/analyze_mbu.py → research/tr164/mbu_stats.json; figure papers/serving_stack_physics/v2_boundary/make_mbu_figure.py → fig_mbu_vs_batch.png.
fp8 KV-cache (§13.3) — grid --mode kv (H100, 84 fp8kv/d64 cells; bf16 weight via FA3/SM90), analysis research/tr164/analyze_kv.py → research/tr164/kv_stats.json; results results/modal_amortization/kv_kv3/results.json.
128K-context extension (§10.1) — grid --mode ctx128k (llama3.1-8b, 10 cells) → results/modal_amortization/ctx128k_c128a/results.json.
Figure generators — papers/serving_stack_physics/v2_boundary/make_amortization_figure.py, …/make_amortization_figures_full.py.
Architectures — verified from HF config.json / safetensors index for Qwen/Qwen2.5-7B-Instruct, meta-llama/Llama-3.1-8B-Instruct (unsloth mirror), mistralai/Mistral-7B-Instruct-v0.3 (Appendix D).

30. Appendix A — Reproducibility Pins

Engine: vLLM 0.10.2, V1 engine, enforce_eager = False, gpu_memory_utilization = 0.90, disable_log_stats = True.
Dependencies: transformers < 5, huggingface_hub; Python 3.12 Modal debian_slim image.
Models: Qwen/Qwen2.5-7B-Instruct, meta-llama/Llama-3.1-8B-Instruct (gated), mistralai/Mistral-7B-Instruct-v0.3 (gated); HF token via Modal secret.
Precisions: fp16 = dtype=float16; fp8 = dtype=float16, quantization="fp8" (weight quant; KV fp16).
Sampling: temperature = 0, min_tokens = max_tokens ∈ {64, 512} (exact-length greedy).
Prompts: exact prompt_token_ids of length ctx ∈ {512, 2048, 8192, 32000}; context_actual == context verified for all 672.
max_model_len: ctx + decode, capped under each model's max_position_embeddings.
Hardware: Modal gpu="A100-80GB" and gpu="H100", both pools concurrent.
Reps: 1 warm-up + 3 timed; aggregate-throughput mean and population std recorded.
Seeds: bootstrap seed 164, 10,000 resamples.

Closed-loop validation grid (§24). Harness modal_closed_loop_sweep.py; consolidation served_knee_analysis.py → served_knee_stats.json; results research/tr164/results/modal_closed_loop/{full_clfull1,full_clbfull1,diag_clbv1}/.

Server: vllm serve <model> --dtype float16 [--quantization fp8] --max-model-len (ctx+64) --gpu-memory-utilization 0.90 --disable-log-requests; health-poll /v1/models then /health (360 s budget, server log captured); fastapi==0.116.1, starlette==0.47.2 pinned (vLLM 0.10.2 lower-bounds fastapi>=0.115.0, and the unpinned June-2026 resolution tripped a Starlette router incompatibility — pinned below the break).
Load: N ∈ {1, 2, 4, 6, 8, 12, 16, 24, 32, 48, 64} concurrent closed-loop agents (asyncio.gather over httpx, each send→await→send for 4 requests, 2 reps); decode 64; ok_rate 1.0 on every reported cell.
Traffic regimes: distinct prompts (make_prompt(uid), 64 unique head tokens + shared body to exact ctx) vs shared-prefix (identical prompt); contexts {512, 8192} full + {32000} diagnostic.
Served knee: smallest N > 1 with parallel_efficiency(N) = agg_decode_tps(N)/agg_decode_tps(1)/N < 0.65 (same metric/threshold as offline η).

31. Appendix B — Statistical Methodology

η(B) = t(B)/t(1); knee = smallest B > 1 with η < 0.65 (Appendix C), continuous knee = interpolated η = 0.65 crossing in log₂ B (§21). Bandwidth model η = (1+r)/(1+Br) fit by least squares per curve (scipy curve_fit, bounded r > 0); R² against the curve mean; architectural r = C·k/W; α = r_fit/r_arch; architectural knee B* = (0.35+r)/(0.65r); arch-vs-observed knee correlation by Spearman ρ (log₂ Pearson reported too). Axis main effects and model comparisons are paired Wilcoxon signed-rank (batch 1 excluded) with Holm-Bonferroni within family and reproducible percentile-bootstrap 95% CIs (seed 164, 10,000 resamples). Interactions are difference-in-differences paired Wilcoxon, Holm-corrected, plus a Type-II factorial ANOVA (statsmodels) of η on all main effects + 2-way interactions with partial η² per term. Implementation: analyze_amortization_full.py → amortization_full_stats.json.

32. Appendix C — The Full 24-Configuration Knee Table

Knee = smallest batch with η < 0.65; "—" = never reached through batch 64. Source: amortization_stats.json knee_table (24/24 rows verified against the JSON, 0 mismatches).

Config (model \| GPU \| decode \| precision)	ctx 512	ctx 2,048	ctx 8,192	ctx 32,000
qwen2.5-7b \| A100-80GB \| d64 \| fp16	—	64	32	16
qwen2.5-7b \| A100-80GB \| d64 \| fp8	64	32	32	16
qwen2.5-7b \| A100-80GB \| d512 \| fp16	—	—	32	16
qwen2.5-7b \| A100-80GB \| d512 \| fp8	64	64	32	16
qwen2.5-7b \| H100 \| d64 \| fp16	—	—	64	16
qwen2.5-7b \| H100 \| d64 \| fp8	64	—	32	16
qwen2.5-7b \| H100 \| d512 \| fp16	—	—	—	32
qwen2.5-7b \| H100 \| d512 \| fp8	—	—	32	16
llama3.1-8b \| A100-80GB \| d64 \| fp16	—	—	16	8
llama3.1-8b \| A100-80GB \| d64 \| fp8	64	32	16	8
llama3.1-8b \| A100-80GB \| d512 \| fp16	—	—	32	8
llama3.1-8b \| A100-80GB \| d512 \| fp8	64	64	16	8
llama3.1-8b \| H100 \| d64 \| fp16	32	—	64	8
llama3.1-8b \| H100 \| d64 \| fp8	64	64	64	8
llama3.1-8b \| H100 \| d512 \| fp16	—	—	—	16
llama3.1-8b \| H100 \| d512 \| fp8	—	—	64	8
mistral-7b \| A100-80GB \| d64 \| fp16	—	64	32	8
mistral-7b \| A100-80GB \| d64 \| fp8	64	32	16	4
mistral-7b \| A100-80GB \| d512 \| fp16	—	—	16	8
mistral-7b \| A100-80GB \| d512 \| fp8	32	64	16	8
mistral-7b \| H100 \| d64 \| fp16	—	—	64	8
mistral-7b \| H100 \| d64 \| fp8	64	64	16	8
mistral-7b \| H100 \| d512 \| fp16	—	—	—	16
mistral-7b \| H100 \| d512 \| fp8	—	—	64	8

Observations. The ctx-32,000 column is uniformly small (4–32, mostly 8 for Llama/Mistral and 16 for Qwen — the KV-architecture gradient, §16); within each (model, GPU, decode) triple the fp8 row knees at an equal-or-smaller batch than fp16 (the §13 penalty); and the H100/d512/fp16 rows hold "—" through ctx 8,192 for all three models (the §12 bandwidth advantage at the most-amortizing decode length).

32.1 Appendix C.2 — Served Knee versus Offline Knee (closed-loop validation, §24)

Served knee = smallest concurrency N > 1 with closed-loop parallel_efficiency(N) < 0.65; static knee = the offline d64 leg of the table above (matching the closed-loop's max_tokens = 64). Source: served_knee_stats.json (served_knee_analysis.py over results/modal_closed_loop/). "never" = no offline knee through batch 64. The 24 distinct-prompt cells (ctx 512, 8,192) plus the 2 distinct 32,000-token diagnostic cells are the realistic-traffic regime; all 26 satisfy served ≤ static.

Config (model \| GPU \| precision)	ctx	static (offline d64)	served (distinct)	relation
qwen2.5-7b \| A100-80GB \| fp16	512	never	16	served < static
qwen2.5-7b \| A100-80GB \| fp16	8,192	32	4	served < static
qwen2.5-7b \| A100-80GB \| fp8	512	64	6	served < static
qwen2.5-7b \| A100-80GB \| fp8	8,192	32	2	served < static
qwen2.5-7b \| A100-80GB \| fp16	32,000	16	2	served < static
qwen2.5-7b \| H100 \| fp16	512	never	24	served < static
qwen2.5-7b \| H100 \| fp16	8,192	64	4	served < static
qwen2.5-7b \| H100 \| fp8	512	64	64	served = static
qwen2.5-7b \| H100 \| fp8	8,192	32	4	served < static
qwen2.5-7b \| H100 \| fp16	32,000	16	2	served < static
llama3.1-8b \| A100-80GB \| fp16	512	never	12	served < static
llama3.1-8b \| A100-80GB \| fp16	8,192	16	4	served < static
llama3.1-8b \| A100-80GB \| fp8	512	64	8	served < static
llama3.1-8b \| A100-80GB \| fp8	8,192	16	2	served < static
llama3.1-8b \| H100 \| fp16	512	32	24	served < static
llama3.1-8b \| H100 \| fp16	8,192	64	4	served < static
llama3.1-8b \| H100 \| fp8	512	64	24	served < static
llama3.1-8b \| H100 \| fp8	8,192	64	4	served < static
mistral-7b \| A100-80GB \| fp16	512	never	12	served < static
mistral-7b \| A100-80GB \| fp16	8,192	32	4	served < static
mistral-7b \| A100-80GB \| fp8	512	64	8	served < static
mistral-7b \| A100-80GB \| fp8	8,192	16	2	served < static
mistral-7b \| H100 \| fp16	512	never	24	served < static
mistral-7b \| H100 \| fp16	8,192	64	4	served < static
mistral-7b \| H100 \| fp8	512	64	24	served < static
mistral-7b \| H100 \| fp8	8,192	16	4	served < static

Observations. Distinct-traffic served ≤ static on all 26 cells (23 strictly below at ctx 512/8,192, 1 equal, plus 2 strictly below at ctx 32,000): 23 + 1 + 2 = 26/26. The served knee at ctx 512 (6–64) stays in the decode-amortization band, while at ctx 8,192 it collapses to 2–4 and at ctx 32,000 to 2 — the prefill-serialization signature (§24.2), since distinct long prompts cannot reuse the prefix KV. The single equality (qwen2.5-7b | H100 | fp8 | ctx512 at 64) is the lowest-r cell on the fastest GPU, where prefill is cheapest and the served knee can reach the offline ceiling. The shared-prefix companion regime (prefill amortized by the cache, 33 cells) is not retabulated here — its served knee equals the offline decode knee within one ladder step on 32/33 cells (19 exact, full per-cell breakdown in served_knee_stats.json), the controlled half of §24.3.

33. Appendix D — Architecture and Bandwidth Constants

Verified from HF config.json and safetensors index (param count = total_size / 2 for bf16 checkpoints). KV cache is fp16 (2 bytes) in every cell.

Model	layers	KV heads	head_dim	params	KV bytes/token `k`	weight bytes `W` (fp16)
qwen2.5-7b	28	4	128	7,615,616,512	57,344	15.23 GB
llama3.1-8b	32	8	128	8,030,261,248	131,072	16.06 GB
mistral-7b	32	8	128	7,248,023,552	131,072	14.50 GB

k = 2 (K,V) × layers × KV-heads × head_dim × 2 bytes; W_fp8 ≈ W_fp16 / 2 (≈ 1 byte/param, approximate). Nominal peak HBM (vendor spec, not measured): A100-80GB 2,039 GB/s (HBM2e), H100 3,350 GB/s (HBM3). Sources: Qwen/Qwen2.5-7B-Instruct, meta-llama/Llama-3.1-8B-Instruct (unsloth mirror, architecture fields unmodified), mistralai/Mistral-7B-Instruct-v0.3 config.json + model.safetensors.index.json.

34. Appendix E — Per-Curve Roofline-Fit Sample

Representative rows from L3_roofline.per_curve (fitted r, architectural r, batch-saturation α, fit R², architectural-predicted knee, observed interpolated knee). Full 96-curve table in the JSON; the derivation that pins down α (memory-bound, not VRAM-capacity at the knee, achieved bandwidth rising with batch) is in alpha_derivation.json (§11.3).

Curve	r_fit	r_arch	α	R²	knee_pred (arch)	knee_obs
qwen2.5-7b \| H100 \| d64 \| fp16 \| ctx512	—	0.0019	—	high	281	—
qwen2.5-7b \| H100 \| d64 \| fp16 \| ctx8192	—	0.031	—	high	19	41
qwen2.5-7b \| H100 \| d64 \| fp16 \| ctx32000	0.046	0.120	0.38	high	6	11
llama3.1-8b \| H100 \| d512 \| fp16 \| ctx32000	—	—	—	< 0 (non-monotone)	—	16

Observations. The architectural knee tracks the observed knee in order and scale across the context range (281 → 19 → 6 vs — → 41 → 11), under-predicting by the batch-saturation factor (α ≈ 0.38 here, near the H100 median 0.33; an empirical bandwidth-utilization effect, not compute overlap and not VRAM-capacity at the knee — §11.3); the negative-R² row is one of the four non-monotone H100/d512/ctx32000 tails where the monotone form cannot fit (§11.3).

35. Appendix F — Master Named-Value Reference (paper-substrate table)

This appendix is the mining surface for drafting the Tier-2 paper: every quantity the paper would cite, with its exact value, source key, and gloss. Pull values from here; do not re-derive. Keys resolve in research/tr164/amortization_stats.json (prefix m.) or research/tr164/amortization_full_stats.json (prefix f.).

Grid / provenance.

Name	Value	Source	Gloss
`gridCells`	672	run JSON length	total cells, `ok_rate` = 1.0
`timedGenerations`	2,016	672 × 3 reps	+ 672 warm-ups
`engine`	vLLM 0.10.2 V1	`f.provenance.engine`	production engine, CUDA graphs on
`gpus`	A100-80GB, H100	—	HBM2e ≈ 2.0 TB/s, HBM3 ≈ 3.35 TB/s
`models`	qwen2.5-7b, llama3.1-8b, mistral-7b	—	7.62 / 8.03 / 7.25 B params
`axes`	3 model × 2 gpu × 2 decode × 2 prec × 4 ctx × 7 batch	—	contexts {512, 2048, 8192, 32000}; batches {1…64}
`wallClockMin`	≈ 63	run log	both GPU pools concurrent, ~10–13 cells/min
`runDir`	`research/tr164/results/modal_amortization/full_xmodel_672/`	—	`results.json` (294 KB)

Main effects on η (paired Wilcoxon, Holm, bootstrap 95% CI, seed 164).

Name	Δη	95% CI	η lo→hi	n	Holm p	Source
`effContext` (ctx32000 vs 512)	−0.266	[−0.294, −0.240]	0.865→0.599	144	<0.001	`f.L1_main_effects.context_32000_vs_512`
`effGpu` (H100 vs A100)	+0.077	[0.067, 0.088]	0.724→0.801	288	<0.001	`…gpu_H100_vs_A100`
`effDecode` (512 vs 64)	+0.036	[0.028, 0.043]	0.745→0.780	288	<0.001	`…decode_512_vs_64`
`effPrecision` (fp8 vs fp16)	−0.054	[−0.061, −0.047]	0.790→0.735	288	<0.001	`…precision_fp8_vs_fp16`

Interactions (difference-in-differences, Holm).

Name	DiD	95% CI	gap lo→hi	Holm p	Source
`intGpuByContext`	+0.088	[0.059, 0.121]	+0.036→+0.125	4e-6	`f.L2_interactions.gpu_gap_x_context`
`intDecodeByContext`	+0.042	[0.021, 0.064]	+0.020→+0.062	0.001	`…decode_gap_x_context`
`intPrecisionByGpu`	+0.029	[0.017, 0.042]	−0.069→−0.040	4e-6	`…precision_gap_x_gpu`
`intPrecisionByContext` (NULL)	+0.002	[−0.016, 0.021]	−0.053→−0.051	0.88	`…precision_gap_x_context`

Factorial ANOVA partial η² (full-model R² = 0.949). batch 0.509 · context 0.266 · context×batch 0.060 · gpu 0.038 · precision 0.018 · gpu×batch 0.013 · model 0.009 · decode 0.008. Source f.L2_factorial_anova.top_terms.

Model comparison (paired η, Holm).

Name	Δη	95% CI	η a→b	Holm p	Source
`mdlQwenVsLlama`	+0.039	[0.027, 0.050]	0.789 vs 0.750	<0.001	`m.model_comparison_eta`
`mdlQwenVsMistral`	+0.041	[0.029, 0.052]	0.789 vs 0.748	<0.001	`m.model_comparison_eta`
`mdlLlamaVsMistral` (TIE)	+0.002	[−0.004, 0.007]	0.750 vs 0.748	0.13	`m.model_comparison_eta`

Bandwidth model (L3, the centerpiece).

Name	Value	Source	Gloss
`modelForm`	η(B) = (1+r)/(1+Br), r = C·k/W	`f.L3_roofline.model_form`	derived §9
`fitR2Median`	0.928	`f.L3_roofline.fit_r2_median`	57/96 > 0.9, 82/96 > 0.7
`fitR2P10`	0.655	`f.L3_roofline.fit_r2_p10`	10th percentile
`archKneeSpearman`	0.84	`f.L3_roofline.arch_knee_vs_obs_knee_spearman`	zero-param prediction, n=65
`archKneePearsonLog2`	0.79	`f.L3_roofline.arch_knee_vs_obs_knee_pearson_log2`	—
`alphaA100` / `alphaH100`	0.524 / 0.330	`f.L3_roofline.alpha_compute_overlap_by_gpu` (JSON key name retained)	batch-saturation factor r_fit/r_arch — empirical, NOT compute overlap and NOT VRAM-capacity at the knee (§9.3, §11.3)
`kneeVsVramSpearman`	0.83	`derive_alpha.py` → `alpha_derivation.json` `Q4_vram_ceiling`	knee vs VRAM ceiling n_max — a correlated co-predictor (shares 1/(C·k) scaling), n=65
`kneeOverNmaxRatio`	0.27	`…Q4_vram_ceiling.median_knee_over_nmax_ratio`	knee sits at ~¼ of the VRAM ceiling → capacity is NOT the trigger (KV still fits at knee)
`mbuAtKneeMedian`	0.75	`…Q5_knee_regime.mbu_at_knee_median`	MBU at knee (p10–p90 0.41–1.13) — bandwidth rising toward saturation, not a clean constant
`compFracMedian`	0.046	`…Q1_compute_boundedness.comp_frac_median`	decode compute/memory time — memory-bound, refutes compute-overlap
`kvBytesQwen`	57,344	`f.L3_roofline.kv_bytes_per_token`	2×28×4×128×2
`kvBytesLlamaMistral`	131,072	`f.L3_roofline.kv_bytes_per_token`	2×32×8×128×2
`wBytesFp16`	15.23 / 16.06 / 14.50 GB	`f.L3_roofline.weight_bytes_fp16`	qwen / llama / mistral

Bandwidth-bound confirmation, scaling, reliability, single-stream, knee rate.

Name	Value	Source	Gloss
`mbuBatch1Ctx512`	0.535 (A100), 0.539 (H100)	`f.L4_bandwidth_mbu.mean_mbu_batch1_ctx512_by_gpu`	decode is bandwidth-bound
`mbuQwenH100Fp16`	0.63–0.66	`f.L4_bandwidth_mbu.detail`	per-context, near-constant
`aggSpeedupAt64`	42.2 / 39.6 / 34.1 / 24.9	`f.L5_aggregate_scaling.mean_aggregate_speedup_at_b64_by_context`	by ctx512/2048/8192/32000
`peakAggTpsA100` / `H100`	4,684 / 8,879	`f.L5_aggregate_scaling.peak_aggregate_tps_by_gpu`	max, batch 64
`cvMedian` / `cvP95` / `cvMax`	0.4% / 2.0% / 12.1%	`f.L6_variance`	per-cell, 3 reps
`singleStreamGpuSpeedup`	1.68×	`f.L7_single_stream`	H100/A100 at batch 1
`singleStreamFp8Speedup`	1.40×	`f.L7_single_stream`	fp8/fp16 at batch 1
`prefillPenalty`	≈ 0.94	`f.L7_single_stream`	t1(ctx32000)/t1(ctx512) — keystone
`kneeSlope`	−0.64 (median)	`f.L9_knee_regression`	log₂(knee) per log₂(context)
`meanEtaByCtxB16`	0.897 / 0.838 / 0.686 / 0.442	`m.headline.mean_eta_by_context_batch`	ctx512→32000 at batch 16
`kneeDistCtx32000`	all 24 break; 13 by b8	`m.headline.knee_distribution_by_context`	{b8:13, b16:9, b32:1, b4:1}

Served-knee validation (closed-loop grid, §24). Source served_knee_stats.json (served_knee_analysis.py over results/modal_closed_loop/); harness modal_closed_loop_sweep.py.

Name	Value	Source	Gloss
`servedMetric`	parallel_efficiency(N)=agg_decode_tps(N)/agg_decode_tps(1)/N	—	same quantity as offline η; served knee at <0.65
`servedLeStaticDistinct`	26 / 26	`served_knee_stats.distinct_regime_*`	distinct-traffic served ≤ static (24 at ctx512/8192 + 2 at ctx32000); 0 violations
`servedLtStaticDistinct`	25 / 26	`…distinct_regime_*`	strictly below (1 equal: qwen·H100·fp8·ctx512=64)
`servedKneeCtx8192Distinct`	2–4 (median 4)	`…distinct_regime_ctx512_8192.cells`	prefill-serialization collapse, PE≈1/N from N=2
`servedKneeCtx32000Distinct`	2	`…distinct_regime_ctx32000_diag.cells`	qwen both GPUs, static=16
`servedTracksStaticSharedPrefix`	32 / 33 within ±1 ladder step	`…shared_prefix_regime`	19 exact, 7 one-below, 6 one-above, 1 two-below; decode mechanism transfers when prefill is cached
`servedGridLadder`	N ∈ {1,2,4,6,8,12,16,24,32,48,64}	`modal_closed_loop_sweep.py`	N concurrent closed-loop agents, decode 64, ok_rate 1.0

Third bandwidth point (H200, §12.3). Source bw3_slope_stats.json (analyze_bw3_slope.py); grid modal_amortization_sweep.py --mode bw3 (84 fp16/d64 H200 cells, ok_rate 1.0). α matched against V4 fp16/d64 per-curve α (not the all-curve aggregate).

Name	Value	Source	Gloss
`bwTbpsByGpu`	A100 2.0 / H100 3.35 / H200 4.8	vendor-spec peak HBM	HBM2e / HBM3 / HBM3e
`alphaMedAllCtx`	A100 0.628 / H100 0.456 / H200 0.679	`bw3_slope_stats.alpha_median_all_context`	matched fp16/d64, n=12 — H100 is the minimum (non-monotone)
`alphaMedLongCtx`	A100 0.443 / H100 0.198 / H200 0.298	`…alpha_median_long_context`	ctx≥8192, n=6 — H100 minimum again
`alphaH200gtH100`	11/12	`…h200_alpha_gt_h100_per_cell`	per-cell: H200 α > H100 α (trend reversal)
`alphaMonotoneDecreasing`	False (both medians)	`…alpha_monotonic_decreasing_*`	α is NOT a monotone function of HBM bandwidth

Measured peak bandwidth + MBU mechanism (§12.4). Source mbu_stats.json (analyze_mbu.py) + peakbw_pk1 probe.

Name	Value	Source	Gloss
`measuredPeakBwGbps`	A100 1,763 / H100 3,020 / H200 4,240	`peakbw_pk1` device-to-device copy	86–90% of vendor spec; same order as nominal
`mbuAtBatch1`	A100 0.699 / H100 0.711 / H200 0.584	`mbu_stats.median_mbu_at_batch1`	H200 starts lowest (most headroom), 12/12 below H100
`mbuMax`	A100 1.00 / H100 1.11 / H200 0.744	`mbu_stats.median_mbu_max`	H200 saturates lowest; A100/H100 >1 = const-bw model breaking at VRAM ceiling
`mbuRise`	A100 ×1.43 / H100 ×1.56 / H200 ×1.27	derived (max÷b1)	inversely orders α — H200 rises least → highest α; the §12.3 mechanism

fp8 KV-cache (§13.3). Source kv_stats.json (analyze_kv.py); grid --mode kv (84 H100 fp8kv d64 cells, ok_rate 1.0; fp8 e4m3 KV + bf16 weight via FA3/SM90).

Name	Value	Source	Gloss
`fp8kvKneeRatio`	1.0 (median)	`kv_stats.median_ratio`	halving k did NOT push the knee out (model predicted ~2×)
`fp8kvKneeShift`	0 later / 4 equal / 2 earlier (of 6)	`kv_stats.{fp8kv_knee_later,equal,earlier}`	negative result; FA3-backend confound flagged (matched-backend control = §27)

128K-context extension (§10.1). Source ctx128k_c128a (10 cells, llama3.1-8b, ok_rate 1.0).

Name	Value	Source	Gloss
`ctx128kRArch`	1.04	C·k/W at 128,000	4× the 32K value (linear in C)
`ctx128kKneeArch`	≈ 2 (≈ 3 with α)	(0.35+r)/(0.65r)	architectural prediction at the max-r corner
`ctx128kKneeObs`	4 (A100, H200); >4 (H100)	`ctx128k_c128a` η curves	knee compresses to batch 4 — prediction holds 4× past the grid
`ctx128kKvPerSeq`	16.8 GB/sequence	131,072 B/tok × 128,000	VRAM caps batch ~4 (80 GB) / ~8 (141 GB); knee arrives at/before the ceiling

36. Appendix G — Figure Manifest (caption-ready)

Seven figures in papers/serving_stack_physics/figures/, each traceable to the stats JSON and reproducible from its generator. Caption text is paper-ready; adapt as needed.

Fig	File	Generator	Caption / what it supports
1	`amortization_eta_vs_batch.png`	`make_amortization_figure.py`	"Amortization efficiency η(B)=t(B)/t(1) vs batch, one curve per context, A100 (left) vs H100 (right), averaged over the 12 (model, decode, precision) configs; 0.65 breakdown line dashed." Supports §3.2, §10 (context lever + GPU effect, visual).
2	`fig_roofline_fit.png`	`make_amortization_figures_full.py`	"Observed η (points) vs the fitted bandwidth model η=(1+r)/(1+Br) (lines) for four representative configs; per-curve R² annotated." Supports §11.1 (the form fits, median R²=0.93).
3	`fig_arch_knee_pred.png`	`make_amortization_figures_full.py`	"Zero-free-parameter architectural knee B*=(0.35+r)/(0.65r), r=C·k/W, vs observed interpolated knee (log-log, n=65 breaking curves, Spearman ρ=0.84); points sit above y=x by the empirical batch-saturation factor α (§11.3)." Supports §11.2 (the predictive validation — the paper's headline figure).
4	`fig_agg_speedup.png`	`make_amortization_figures_full.py`	"Aggregate throughput speedup A(B)/A(1) vs batch by context, A100 vs H100, with the linear-ideal reference; the plateau is the throughput ceiling." Supports §17 (scaling ceiling 42×→25×).
5	`fig_served_vs_static_knee.png`	`make_served_knee_figure.py`	"(left) Closed-loop served knee (distinct prompts) vs offline static knee, log₂-log₂, colored by context, with the y=x diagonal; all 26 distinct configs land on or below it — the measured upper bound. (right) Closed-loop parallel efficiency vs concurrency for representative cells: short context decays gently (served knee 24) while long context collapses as 1/N from N=2 (prefill serialization), against the 1/N reference and the 0.65 knee line." Supports §24.2–§24.3 (the served-knee validation).
6	`fig_bw3_bandwidth_slope.png`	`make_bw3_slope_figure.py`	"Batch-saturation factor α = r_fit/r_arch vs HBM bandwidth for A100 (2.0), H100 (3.35), H200 (4.8 TB/s), matched fp16/d64; per-(model,context) faint lines plus the all-context and ctx≥8192 medians. Both median lines dip at the H100 (the minimum) and rise at the H200, against the dashed (refuted) two-point extrapolation — α is not a monotone function of HBM bandwidth." Supports §12.3 (the third-bandwidth-point negative result).
7	`fig_mbu_vs_batch.png`	`make_mbu_figure.py`	"Median MBU(B) = achieved_bw/measured_peak vs batch, one curve per GPU (measured peaks A100 1.8, H100 3.0, H200 4.2 TB/s). The H100 rises highest (crosses 1.0 at batch 64 — the const-bandwidth model breaking at the VRAM ceiling), the H200 is the flattest and saturates at ~0.74. The MBU rise inversely orders α (H100 lowest α, H200 highest) — the measured mechanism for why the higher-bandwidth H200 gentles least." Supports §12.4 (the MBU explanation of the §12.3 reversal).

Suggested paper figure order: Fig 1 (the phenomenon) → Fig 3 (the predictive result, the money figure) → Fig 2 (the fit underpinning it) → Fig 4 (the throughput consequence) → Fig 5 (the served-knee validation, for the "does it hold under real serving?" reviewer). Fig 3 is the one to lead the results with.

37. Appendix H — Paper Claim Ladder (supported / licensed / forbidden)

Mirrors the bridge-paper claim-ladder convention so the paper-writing agent can lift bounded sentences directly. Tier 1 is verbatim-adaptable; Tier 2 must carry its caveat; Tier 3 must never be stated.

Tier 1 — Supported (state plainly, bounded to the substrate):

A one-parameter bandwidth model η(B)=(1+r)/(1+Br) with r=C·k/W fits the 96 measured efficiency curves at median R²=0.93 and predicts the observed breakdown knee from model architecture alone (zero fitted parameters) at Spearman ρ=0.84, on a 672-cell offline static-batch grid (three 7–8B models, A100-80GB + H100, vLLM 0.10.2).
Context length is the dominant lever: mean η falls 0.865→0.599 from 512 to 32,000 tokens (paired Δ=−0.266, 95% CI [−0.294,−0.240], Holm p<0.001), an order of magnitude above any other design factor (ANOVA partial η²: context 0.27 vs GPU 0.038).
The H100 sustains η +0.077 above the A100 (CI [0.067,0.088]), and the advantage grows with context (difference-in-differences +0.088, Holm p<1e-5) — the bandwidth mechanism's interaction prediction confirmed.
fp8 weight quantization is 1.40× faster single-stream but 0.054 less batch-robust (CI [−0.061,−0.047]); the penalty is context-invariant (DiD +0.002, Holm p=0.88), correcting the intuition that it should grow with context.
KV-bytes-per-token predicts the cross-model robustness order: Qwen (57,344 B/tok) knees at batch 16 at 32K context vs Llama/Mistral (131,072 B/tok) at batch 8; the architecturally identical Llama and Mistral tie (Holm p=0.13).
Decode is memory-bandwidth-bound (measured MBU ≈ 0.54–0.66 at batch 1; compute is 4.6% of step time across the grid).
The aggregate-throughput speedup ceiling at batch 64 falls from 42× to 25× across the context range; batching never stops raising throughput, only stops being free.
The breakdown knee is governed by — and scales inversely with — the KV footprint C·k, predicted from architecture at ρ=0.84 via the bandwidth ratio r=C·k/W. (The VRAM concurrency ceiling n_max=VRAM/(C·k) co-predicts it at ρ=0.83 because it shares the same 1/(C·k) scaling, but the knee sits at ~0.27×n_max, inside the regime where KV still fits VRAM — capacity is a correlate, not the cause.)
The offline static-batch knee is a measured upper bound on the served knee: a closed-loop grid (vLLM serve + N concurrent closed-loop agents, same parallel_efficiency<0.65 knee) finds the served knee at or below the offline knee on all 26 distinct-traffic configurations, and within one concurrency-ladder step of the offline decode knee on 32/33 shared-prefix configurations. Under distinct long-context traffic the served knee is gated earlier, by prefill serialization (PE≈1/N from N=2 at ctx≥8192) — a constraint the decode-only offline metric deliberately excludes (§24).
The batch-saturation factor α is an empirical per-GPU property, not a monotone function of HBM bandwidth: a third bandwidth point (H200, HBM3e 4.8 TB/s, matched fp16/d64) shows the H100 — not the higher-bandwidth H200 — has the lowest α (median 0.198 vs 0.298 at ctx≥8192; H200 α > H100 α in 11/12 matched cells), refuting the two-point A100→H100 trend. The predictive headline (knee ∝ 1/(C·k), ρ=0.84) is unaffected; only the second-order α correction is shown to lack a bandwidth law (§12.3).
The α ordering is explained by the measured MBU-vs-batch curve: against measured peak bandwidth (A100 1,763 / H100 3,020 / H200 4,240 GB/s, device-to-device copy), median MBU rises ×1.56 on the H100 (lowest α), ×1.43 on the A100, and only ×1.27 on the H200 (highest α) — the H200 starts at the lowest MBU (0.584) yet saturates lowest (0.744) because its HBM3e bandwidth outruns the decode kernels, so more peak bandwidth buys less batch-amortization (§12.4).
fp8 KV-cache is a capacity lever, not an amortization lever: halving the KV-bytes-per-token (fp8 e4m3 KV, H100/V1/FA3) did not push the breakdown knee out (median knee ratio 1.0 across 6 matched cells; 0 later), against the model's ~2× prediction — caveated by an FA3-vs-default backend confound that a matched-backend control would resolve (§13.3).
The 1/(C·k) knee prediction holds 4× beyond the fitted grid: at 128,000-token context on Llama-3.1-8B the knee compresses to batch 4 (A100, H200; >4 on H100), matching the architectural prediction of ≈2–3, with the amortization knee arriving at or before the 16.8 GB/sequence VRAM ceiling (§10.1).

Tier 2 — Licensed with caveat (state only with the bracketed qualifier):

The knee prediction is predictive up to a GPU-specific batch-saturation factor α (0.52 A100, 0.33 H100) which is an empirical, hardware-specific property the derivation attributes to achieved memory-bandwidth utilization rising with batch (§9.3, §11.3) — the zero-parameter architectural knee is conservative (~2× low) and α closes the gap; α's value is not derivable from architecture. Do NOT call α "compute overlap" (decode is 4.6%-compute, refuted) and do NOT call it a VRAM-capacity effect at the knee (95% of knees sit at ~¼ of the VRAM ceiling, where KV still fits).
The bandwidth model fits 85% of curves (82/96 at R²≥0.7); it breaks on the four H100/d512/ctx32000 non-monotone curves where the scheduler wave-batches once the KV cache exceeds VRAM — beyond the operational knee. State "fits up to the wave-scheduling regime," not "fits all curves."
The offline knee is a static-batch, equal-length-prompt ceiling, now measured to upper-bound the served knee (§24, Tier 1 #9: served ≤ static 26/26 distinct-traffic) — but it is the ceiling, not the served value, and the served grid is saturating-closed-loop, not open-loop; do not present the offline knee as a served latency or p95, and do not present the served grid as an arrival-rate/SLO curve.
Peak HBM bandwidth in MBU is vendor spec, not measured.

Tier 3 — Forbidden (never state; from §26): closed-loop serving latency / p95; generalization beyond 7–8B; generalization beyond the A100-80GB/H100 pair; any fp8-KV-cache claim (only fp8 weight quant ran); any engine ranking / cross-engine claim (one engine); a ragged-batch/production-scheduler knee; universal verbs ("always", "proves", "eliminates", "guarantees"); a Llama-vs-Mistral robustness ordering (it is a tie).

38. Appendix I — TR→Paper Section Map and Usage Note

Usage. This TR is the verified substrate for the Tier-2 paper "When Continuous Batching Stops Amortizing." An agent drafting the paper should: (i) pull every number from Appendix F (each traces to a JSON key — never re-derive); (ii) use the Tier-1 sentences in Appendix H as the claim skeleton, attach Tier-2 caveats where flagged, and never state a Tier-3 claim; (iii) place figures per Appendix G (lead results with Figure 3); (iv) draw the reproducibility appendix from §30, Appendix D, and the execution arc §8; (v) keep the venue-neutral exposure discipline the project's pre-push leak-grep rule defines (that rule enumerates the forbidden strings; this TR passes it clean and the paper must too). The paper's contribution is the predictive bandwidth model — lead with the prediction (§11, Fig 3), not the descriptive surface.

Section map.

Paper section	Draw from
Abstract	§1, §3 (exec summary), Appendix F headline rows
Introduction / motivation	§4 (V1→V4 arc, description→prediction), §5 (RQs)
Related work	§4.2–4.3 positioning; the closed-loop-vs-offline distinction (§24); the V3 forecast this answers
Method	§6 (cell contract, metric, model, stats), §6.6 (measurement protocol), Appendix D (architecture constants)
Method — reproducibility/rigor	§8 (full Modal execution arc — the V0-rejection, fp8-scoping, validate-before-pay, orphan-billing, dual-path persistence), §30 (pins)
Results — the model (lead)	§9 (derivation), §11 (fit + architectural prediction + α), §9.4 (worked example), Figs 2–3
Results — context	§10, §3.2–3.3, Fig 1, §21 (knee rate), §22 (knee table)
Results — hardware	§12 (GPU + interaction), §18 (MBU), §20 (single-stream)
Results — precision	§13 (fp8, the corrected context-invariance)
Results — variance decomposition	§15 (ANOVA lever ranking)
Results — cross-model	§16 (per-model deep reads + architecture)
Results — throughput	§17 (scaling ceiling), Fig 4
Reliability	§19 (CV), §23 (verification)
Discussion / limitations	§24 (lineage/protocol), §25 (limitations), §27 (future work)
Forbidden / claim discipline	§26, Appendix H

This TR is built to be mined. Appendix F is the number source, Appendix G the figure source, Appendix H the claim source, and this map the routing table — an agent can assemble the Tier-2 paper from these four plus the section prose without touching the raw JSON or re-running analysis.

Data Reconciliation & Cross-Version Lineage (2026-06-24)

This section ties V4's substrate to the canonical TR164 data-lineage map (research/tr164/TR164_DATA_LINEAGE.md) and the program measurement count (BANTERHEARTS_MEASUREMENT_COUNT.md, 2026-06-24 supplement), and folds in the post-V4 work that extends or re-interprets V4. It is the audit surface a reviewer opens to trace any V4 number to disk and to see what later substrate hardens it.

(a) V4's exact provenance. V4's static-batch amortization grid is research/tr164/results/modal_amortization (the 672-cell full_xmodel_672/results.json plus the diag/bw3/kv/h200fp8/ctx128k probes). At the per-request results.csv granularity the measurement doc counts the whole modal_amortization aggregate at 1,143 primary rows across 15 results.csv files (wc -l − 1 header per file, verified on disk 2026-06-24). V4's "672 cells / 2,016 timed generations" framing (§Overview, Appendix F) is a cell/curve presentation count at a coarser granularity over the same data — it does not double-count against the 1,143 primary rows (different granularity, same artifact). No data-accuracy conflict.

(b) V4 in the TR164 arc. V1 (results/20260531_120428_552237, 21,159 rows) found a uniform closed-loop parallel-efficiency breakdown at N=2; V2 (TGI/vLLM/SGLang cross-backend, 26,784 + 15,120 + 15,120 rows) showed a real server backend moves the knee 2–4 concurrency rungs right; V3 (matched A100 5-model vLLM/SGLang, 94,500 rows each + 3,780-row refill) retired V2's bandwidth confound and gave a workload-conditional sign flip. V4 changes the question from closed-loop concurrency to the offline static-batch best-case amortization ceiling, and turns description into prediction: η(B)=(1+r)/(1+Br), r=C·k/W, zero-parameter knee at ρ≈0.84. V5 (built on V4) and a post-hoc closure/served/policy layer follow; details below.

(c) Post-V4 work that extends or re-interprets V4 (each grounded in the lineage map; run-dirs verified on disk 2026-06-24):

70B confirmation — promoted from Future Work (§27) to a result. results/modal_amortization_70b_confirm/confirm_ctx32k_20260619_185314/replicates.csv (40 rows, per-rep wall_s/decode_tps) plus the broad pre-confirm sweep results/modal_amortization_70b (31 rows). Finding: the 70B knee lands ~1.8× later than the 8B knee, supporting the W-ordering direction of r=C·k/W beyond the 7–8B grid — caveated by a TP=2 aggregate-bandwidth confound (tensor-parallel sharding changes effective W and HBM, not just W). This converts §27's "extend across scale to test whether the KV-byte prediction holds beyond 7–8B" and partially discharges Forbidden Claim #2 (generalization beyond 7–8B): the architectural-ordering prediction now has one cloud multi-GPU confirmation point, so the scale claim moves from forbidden to licensed-with-caveat (the caveat is the TP=2 confound; no full-precision single-GPU 70B point exists).
Closure audits — hardening η against a "your throughput includes prefill" attack. mlsys_closure/marginal_decode_cells.csv (336 = D512−D64 marginal-decode subtraction, derived, 0 new rows) plus the streaming true-decode grid vllm_decode_timed_closure/decode_timed_batch_measurements.csv (252 rows). Finding: the C·k/W rank survives isolating true steady-state decode — streaming finite ρ=0.886, and the slope magnitude tightens from 2.022 to 1.322 on the clean-decode surface. This defends the §9/§18 decode interpretation: η's full-generation-throughput definition (prefill in the timing window) is not what produces the C·k/W ordering.
Cross-family weight-axis matrix — an independent isolation of W in r=C·k/W. weight_axis_decode/weight_axis_decode_replicates.csv (708 rows, Qwen/Mistral/Gemma3). V4's 7–8B grid barely moves W; this matrix sweeps it across families: pooled cross-family ρ=0.8142, with the Gemma3 4B→12B knee inversion mechanistically explained and explicitly blocking a "larger model always knees later" universal.
Served-ceiling engine robustness — extending §24's single-engine validation. §24 validates static-as-upper-bound on a vLLM closed-loop grid (results/modal_closed_loop, 68 rows). The compact SGLang served-ceiling check v5/results/sglang_served (12 rows, 8 ok cells) extends this toward an engine-robustness statement: SGLang served ≤ static on 8/8, so the upper-bound direction holds across both engines, not just vLLM.
Policy-cap operationalization. policy_replay/modal_policy_validation/{diag_pcdiag2,rescue32k_pc32k1,<ssp-policy-cap-run>}/results.csv (8 live Modal rows) turns the §22 knee table into a deployable batch-cap policy and validates the predicted ceiling on live cells. The offline per-policy expansions (policy_replay_cells.csv 1,407; final_policy_validation/combined_policy_rows.csv 24) are derived re-analysis over already-counted served cells and are excluded from the headline.
TR170 kernel-reproducibility pilot (separate TR, lineage cross-reference). research/tr170/results (555 rows: 20260622_214004/outcomes.jsonl 3 noise-floor + tritonbench_sweep/outcomes_t{331,340,360}.jsonl 184×3 across Triton 3.3.1/3.4.0/3.6.0). It is the kernel-level reproducibility companion to TR164's serving-level physics and the landing spot for the dram__bytes byte-moved instrumentation the V4 extension probes request. Pilot-grade, not paper-grade (clears the higher public-artifact bar only when expanded).

(d) Methodology cross-references (research/tr164/methodology/): V4's core model and mechanism are documented in TR_static_ckw_model.md and TR_roofline_mechanism.md; the extension probes (bw3/fp8-KV/128K/70B) in TR_extension_probes.md; the closure audits in TR_mlsys_closure.md; the weight-axis matrix in TR_weight_axis_matrix.md; the served validation (vLLM + SGLang) in TR_served_validation.md; the V1/V2/V3 closed-loop boundary in TR_closed_loop_boundary.md; the TR165 no-GIL companion in TR_nogil_ablation.md. The live-Modal policy-cap validation and TR170 have no dedicated methodology TR_*.md (marked NOT-verified for that column in the lineage map; closest is TR_served_validation.md).

Provenance note. Every row count above is re-verified on disk 2026-06-24 in the lineage map and reconciles to the row with the measurement doc's 2026-06-24 supplement. The SSP post-hoc increment that includes these substrates is 6,042 primary rows (3,780 refill + 1,143 static grid + 708 weight-axis + 252 closure + 40 + 31 70B + 68 closed-loop + 12 SGLang served + 8 policy) plus TR170 (555 new, separate TR); the full 2026-06-24 supplement adds +11,781 (also folding in the 5,184-row TR165 GIL-control arms surfaced by the cross-version reconciliation), taking the program headline to ~1,348,000. The downstream artifact is referred to generically as the SSP submission to a full-depth systems venue; this section names no venue, no submission identifier, and no gating language, consistent with the project's substrate-document exposure discipline.

Hardware. V4's static-batch amortization grid spans A100-80GB + H100 (modal_amortization_sweep.py), with the H200 supplying the third bandwidth point (HBM3e) and the alpha microbench sweeping across A100/H100/H200; the per-GPU alpha (A100 0.52 / H100 0.33) and the H200 alpha-reversal negative result are the load-bearing hardware findings. The 70B broad sweep runs on 2×A100/2×H100, and the fp8-KV and 128K probes run on H100.

Addendum to §29 References — post-V4 substrates

70B confirmation (promotes §27 scale future-work to a result) — research/tr164/results/modal_amortization_70b_confirm/confirm_ctx32k_20260619_185314/replicates.csv (40 rows) + broad sweep research/tr164/results/modal_amortization_70b/ (31 rows); method methodology/TR_extension_probes.md. Knee ~1.8× later than 8B; W-ordering holds, TP=2 confound flagged.
Closure audits (§9/§18 decode interpretation) — research/tr164/vllm_decode_timed_closure/decode_timed_batch_measurements.csv (252 rows, streaming true-decode) + research/tr164/mlsys_closure/marginal_decode_cells.csv (336, derived marginal-decode); method methodology/TR_mlsys_closure.md. C·k/W rank survives; magnitude tightens 2.022→1.322.
Cross-family weight-axis matrix — research/tr164/weight_axis_decode/weight_axis_decode_replicates.csv (708 rows, Qwen/Mistral/Gemma3); method methodology/TR_weight_axis_matrix.md. Pooled ρ=0.8142; Gemma3 4B→12B inversion blocks the "larger always later" universal.
SGLang served ceiling (extends §24 to engine robustness) — research/tr164/v5/results/sglang_served/ (12 rows, 8 ok cells); method methodology/TR_served_validation.md. SGLang served ≤ static 8/8.
Policy-cap live validation (operationalizes §22 knee table) — research/tr164/policy_replay/modal_policy_validation/{diag_pcdiag2,rescue32k_pc32k1,<ssp-policy-cap-run>}/results.csv (8 live rows). No dedicated methodology TR_*.md (NOT-verified column).
TR170 kernel-reproducibility pilot (separate TR, arc companion) — research/tr170/results/{20260622_214004/outcomes.jsonl,tritonbench_sweep/outcomes_t{331,340,360}.jsonl} (555 rows). Pilot-grade.

Addendum to Appendix F — post-V4 substrate number table

Name	Value	Run dir / artifact	Gloss
`v4PrimaryRows`	1,143	`results/modal_amortization/**/results.csv` (15 files)	per-request granularity; the 672-cell / 2,016-generation framing is a coarser cell/curve view of the same data (no double-count)
`rows70bConfirm`	40	`results/modal_amortization_70b_confirm/confirm_ctx32k_20260619_185314/replicates.csv`	70B knee ~1.8× later than 8B; W-ordering holds (TP=2 confound)
`rows70bBroad`	31	`results/modal_amortization_70b/`	broad pre-confirm 70B sweep
`rowsClosureStreaming`	252	`vllm_decode_timed_closure/decode_timed_batch_measurements.csv`	streaming true-decode; C·k/W rank survives (ρ=0.886; slope 2.022→1.322)
`rowsClosureMarginal`	336 (derived)	`mlsys_closure/marginal_decode_cells.csv`	D512−D64 subtraction; excluded from headline
`rowsWeightAxis`	708	`weight_axis_decode/weight_axis_decode_replicates.csv`	cross-family W isolation; pooled ρ=0.8142
`rowsSglangServed`	12 (8 ok)	`v5/results/sglang_served/`	SGLang served ≤ static 8/8 — engine-robust upper bound
`rowsPolicyValidation`	8	`policy_replay/modal_policy_validation/**/results.csv`	live Modal policy-cap validation of §22 knee table
`rowsTr170Pilot`	555	`research/tr170/results/*/outcomes.jsonl`	kernel-reproducibility pilot (separate TR); Triton 3.3.1/3.4.0/3.6.0
`sspPostHocIncrement`	6,042 (+555 TR170)	`BANTERHEARTS_MEASUREMENT_COUNT.md` 2026-06-24	SSP increment; full 2026-06-24 supplement +11,781 (incl. 5,184 TR165 GIL arms) → headline ~1,348,000

V4's 1,143 primary rows are the static-batch amortization grid; the 70B confirmation (40+31 rows) promotes §27's scale future-work into a licensed-with-caveat result that partially discharges Forbidden Claim #2; the closure (252), weight-axis (708), SGLang-served (12), and policy-cap (8) substrates each harden a specific V4 claim against a specific reviewer attack; and TR170 (555 rows, separate TR) is the kernel-level companion. All counts reconcile to the row with the 2026-06-24 measurement supplement and the lineage map; nothing here double-counts.