Research Archive
Technical Report

TR164 V3: Cross-Backend Serving-Stack Boundary Matrix (vLLM vs SGLang)

Matched A100 80GB PCIe head-to-head — 5 models × 5 workloads × 6 concurrency levels × 2 phases per backend (1,800 cells / 189,000 rows, ok_rate 1.0). Continuous batching holds 0.47–0.62 parallel efficiency at N=32, an order-of-magnitude shift up from V1 pytorch_direct's 0.056 collapse. vLLM/SGLang deltas are small (1.5–7.6pp) but Holm-significant on 4 of 5 workloads — a workload-conditional routing signal, not an engine ranking.

Table of Contents

Technical Report 164 V3: Cross-Backend Serving-Stack Boundary Matrix on Matched A100 80GB PCIe — vLLM 0.10.2 vs SGLang 0.5.12.post1 Across a 5-Model × 5-Workload Grid

Status: Complete and hand-narrated at full TR depth. Both backends executed on matched A100 80GB PCIe (900 cells / 94,500 metrics rows each, ok_rate = 1.0; aggregate 1,800 cells / 189,000 rows). All empirical numbers trace to research/tr164/v3_paper_stats.json or the two V3 findings docs (V3_SGLANG_RUN_FINDINGS_2026-06-13.md, V3_CROSS_BACKEND_FINDINGS_2026-06-13.md); V1/V2/TR165 lineage facts are cited, not recomputed. Run dirs: results/20260612_204254_036590 (vLLM), results/20260612_212816_266127 (SGLang merged master incl. Mistral refill).


Overview

The V3 boundary matrix is complete and hand-narrated at full TR depth on matched hardware — both backends executed on A100 80GB PCIe, so the vLLM-versus-SGLang head-to-head carries no crossover-anchor and no bandwidth-adjustment caveat, in deliberate contrast to V2, where the head-to-head was confounded by an A100-SXM-versus-A100-PCIe hardware split. Two independent runs landed: a vLLM 0.10.2 grid (run dir research/tr164/results/20260612_204254_036590/, A100 80GB PCIe, driver 580.159) and an SGLang 0.5.12.post1 grid (run dir research/tr164/results/20260612_212816_266127/, A100 80GB PCIe, driver 575.57). Each backend executed the full 2-phase cross-product of 5 model families × 5 workloads × 6 concurrency levels {1, 2, 4, 8, 16, 32} × 3 reps × 2 phases = 900 cells, 94,500 metrics.csv rows, ok_rate = 1.0 (provenance.cells_per_backend, provenance.rows_per_backend, provenance.ok_rate). The aggregate V3 substrate is 1,800 cells and 189,000 metrics rows at perfect cell completion across both backends. One comparability caveat is named up front and threaded through every cross-backend comparison: the Mistral × long_prefill_8k cell is tokenizer-confounded (vLLM's mistral_common produces ~16% fewer prompt tokens than SGLang's HF AutoTokenizer on char-matched prompts) and is documented-and-excluded from that workload's head-to-head (forward-ref §13, §15, §16). Every empirical number in this report traces to a key in research/tr164/v3_paper_stats.json or to one of the two V3 findings docs; lineage facts about V1 and V2 are cited from Technical_Report_164.md / Technical_Report_164_V2.md, not recomputed.

The load-bearing finding for this report is two-part. First, the efficiency physics: continuous batching produces a gradual, workload-shaped decline, not V1's catastrophic collapse — both backends retain 0.47–0.62 parallel efficiency at N=32, a more-than-order-of-magnitude upward shift of the breakdown boundary from V1 pytorch_direct's 0.056 at N=16. Second, the cross-backend head-to-head: on matched hardware the two engines are near-identical on decode-heavy and balanced traffic and diverge only at the prefill/short-prompt extremes; the per-workload deltas are small (1.5–7.6 percentage points) but Holm-significant on 4 of 5 workloads with bootstrap 95% CIs excluding zero, while balanced_2k is a statistical tie. We read this not as an engine ranking but as a workload-conditional routing signal on a clean, matched-SKU substrate — the load-bearing comparison V2 could not deliver.


1. Abstract

Technical Report 164 V3 is the capstone leg of the TR164 serving-stack-physics arc, extending the V1 single-backend pytorch_direct characterization and the V2 cross-backend cloud feasibility pass into a matched-hardware, five-model, both-backend boundary matrix. Both inference engines — vLLM 0.10.2 and SGLang 0.5.12.post1 — executed on A100 80GB PCIe (provenance.hardware, matched SKU, no bandwidth adjustment needed) across five model families (qwen2.5-7b, llama3.1-8b, mistral-7b, qwen2.5-14b, phi-4), five workloads (short_decode, balanced_2k, long_prefill_8k, long_decode, repeated_prefix), six concurrency levels {1, 2, 4, 8, 16, 32}, three reps, and two phases — 900 cells and 94,500 metrics rows per backend at ok_rate = 1.0 (aggregate 1,800 cells / 189,000 rows). The first load-bearing finding is that continuous batching produces a gradual, workload-shaped efficiency decline rather than V1's catastrophic collapse: both backends retain parallel efficiency in the 0.47–0.62 band at N=32, where V1 pytorch_direct had collapsed to 0.056 at N=16 — the breakdown boundary shifts up by more than an order of magnitude in concurrency. The second is the cross-backend head-to-head, per workload via paired Wilcoxon signed-rank with Holm-Bonferroni stepdown and a bootstrap 95% CI on the mean delta. On matched hardware the engines are near-identical on decode-heavy and balanced traffic and diverge only at the prefill/short-prompt extremes; the deltas are small (1.5–7.6pp) but Holm-significant on 4 of 5 workloads with CIs excluding zero — SGLang edges short_decode, long_decode, and repeated_prefix; vLLM edges long_prefill_8k — while balanced_2k is a tie (Δ +0.0022, Holm-adjusted p 0.65, CI [−0.01, 0.014] straddling zero). One caveat is documented-and-excluded: the Mistral × long_prefill_8k cell is tokenizer-confounded (mistral_common mean 8,148 prompt tokens versus HF AutoTokenizer 9,465, ratio 1.162), so that workload's head-to-head is reported over the four non-Mistral models (n_pairs = 20). V3 licenses a matched-SKU cross-backend boundary readout as a workload-conditional routing signal, not an engine leaderboard.


2. Table of Contents

  1. Abstract
  2. Table of Contents
  3. Executive Summary
  4. Introduction and Research Motivation
  5. Research Questions
  6. Methodology
  7. SS1. The 5-Model Grid Design
  8. SS2. Both-Backend Setup (Matched A100 80GB PCIe)
  9. SS3. The Four-Failure Environment Arc as Methodology-Honest Substrate
  10. SS4. Per-Model Efficiency Curves and the Model-Size Gradient
  11. SS5. Gradual Continuous-Batching Decline vs V1 Collapse
  12. SS6. The Workload-Shaped p95 Boundary
  13. SS7. vLLM vs SGLang — Per-Workload Holm + Bootstrap Deltas
  14. SS8. Balanced Tie, RadixAttention Prefix Edge, and the Same-SKU Retirement of the Bandwidth Adjustment
  15. SS9. Mistral Tokenizer Asymmetry: Overflow, Refill, and the Document-and-Exclude Decision
  16. SS10. Data Verification — md5 Integrity and the Row-vs-Cell-Level Catch
  17. SS11. Cross-Run Lineage — V1 Catastrophic, V3 Gradual, V2 Consistency
  18. SS12. Limitations
  19. SS13. Forbidden Claims
  20. SS14. Future Work
  21. Conclusion
  22. References
  23. Appendix A — Reproducibility Pins
  24. Appendix B — Statistical Methodology
  25. Appendix C — Per-Cell Boundary Summary

3. Executive Summary

Technical Report 164 V3 is the matched-hardware capstone of the TR164 serving-stack-physics line. Where V1 (Technical_Report_164.md) characterized a single pytorch_direct dispatcher and documented its uniform-at-N=2 collapse, and V2 (Technical_Report_164_V2.md) confirmed that continuous batching eliminates that collapse but could not deliver a clean vLLM-versus-SGLang head-to-head because the two engines ran on different A100 SKUs, V3 closes that gap. Both backends ran on identical A100 80GB PCIe hardware (provenance.hardware), so every cross-backend delta in this report is hardware-clean by construction — no SXM-versus-PCIe confound, no bandwidth-ratio adjustment. The substrate is 5 model families × 5 workloads × 6 concurrency levels × 3 reps × 2 phases per backend, landing 900 cells and 94,500 rows each at ok_rate = 1.0, for an aggregate 1,800 cells and 189,000 rows. The remainder of this summary states the four load-bearing readings; each is expanded with its full table, an Observations paragraph, and an interpretive blockquote in the corresponding body section.

3.1 The headline: gradual decline, not collapse

The defining V3 result is a regime contrast against V1. Under V1's pytorch_direct dispatcher, median parallel efficiency collapsed monotonically — 0.547 at N=2, 0.295 at N=4, 0.127 at N=8, and 0.056 at N=16 (Technical_Report_164.md SS4, median column) — operationally indistinguishable from sequential execution by N=16. On the V3 continuous-batching backends, efficiency declines gradually, and both engines still retain useful throughput at twice V1's deepest concurrency. The per-model N=32 retention band is 0.47–0.62 on both backends.

Model vLLM eff. @ N=32 SGLang eff. @ N=32
qwen2.5-7b 0.4997 0.5290
llama3.1-8b 0.4876 0.4777
mistral-7b 0.4701 0.4713
qwen2.5-14b 0.6118 0.6000
phi-4 0.6189 0.5952

Observations. Every cell in the table sits at or above 0.47 at N=32 — and the cleanest like-for-like read is one concurrency rung shallower: at N=16 the V3 backends retain 0.572–0.718 (vLLM) and 0.641–0.729 (SGLang) where V1 pytorch_direct had already collapsed to 0.056, an order-of-magnitude efficiency gap at the same concurrency level (per_model_efficiency_curves.*."16" vs Technical_Report_164.md SS4). Read across N, the V3 N=32 retention of ≥0.47 is itself roughly eight times the efficiency V1 retained one concurrency-doubling earlier at N=16 — an efficiency ratio across different N, distinct from the same-N gap and from the breakdown-boundary shift the blockquote quantifies. The two backends agree to within roughly 3 percentage points on every model at N=32, and the model-by-model ordering is preserved across engines (14B-class on top, 7–8B-class below). These are the N=32 endpoints of per_model_efficiency_curves.vllm.*."32" and per_model_efficiency_curves.sglang.*."32"; the V1 contrast is cited from Technical_Report_164.md SS4 (observed).

The single most important sentence in V3 is this: continuous batching does not merely soften V1's collapse, it moves the breakdown boundary up by more than an order of magnitude in concurrency. V1 was operationally dead at N=16; V3 retains half its single-stream efficiency at N=32. This is a dramatic, regime-changing effect — and it is precisely the magnitude the reader must hold separate from the small head-to-head deltas in §3.4, which are a different and far smaller phenomenon.

3.2 The model-size gradient

A secondary but consistent structure rides on top of the gradual-decline result: the 14B-class models retain more efficiency under deep batching than the 7–8B class, and this gradient is identical across both backends. At N=32, qwen2.5-14b and phi-4 sit at roughly 0.60–0.62 on both engines, while the 7–8B family (qwen2.5-7b, llama3.1-8b, mistral-7b) sits in the 0.47–0.53 band. The mechanism is intuitive: larger per-token compute amortizes the fixed per-step batching and kernel-launch overhead more favorably, so the marginal cost of an additional concurrent request is a smaller fraction of total work. The gradient is read directly off the N=32 column of per_model_efficiency_curves (both backends) and is flagged as the secondary gradient in V3_SGLANG_RUN_FINDINGS §5.1 (observed, consistent across both backends).

The size gradient is the kind of finding that only emerges once the grid spans a scale axis. V2 was 7B-only and could not have seen it; V3's 7B → 14B span makes it legible. It is a clean, monotone, engine-agnostic signal: the bigger the model, the gentler the batching penalty.

3.3 The workload-shaped p95 boundary

The breakdown boundary is not uniform across workloads — it is shaped by the prefill/decode character of the traffic. The cleanest one-number summary is the mean p95 latency multiplier at N=32 relative to N=1, reported per workload for each backend.

Workload vLLM p95× @ N=32 SGLang p95× @ N=32 vLLM eff. @ N=32 SGLang eff. @ N=32
short_decode 2.43 1.96 0.4493 0.5388
balanced_2k 1.81 1.85 0.5733 0.5514
long_prefill_8k 2.64 3.50 0.3875 0.2812
long_decode 1.41 1.40 0.7151 0.7231
repeated_prefix 1.80 1.76 0.5629 0.5788

Observations. The prefill-bound long_prefill_8k workload drives the worst tail on both backends (vLLM 2.64×, SGLang 3.50×) and the lowest N=32 efficiency (vLLM 0.39, SGLang 0.28) — note these long_prefill_8k workload-marginals average over all five models including the tokenizer-confounded Mistral cell, so the cross-backend gap here is read clean only in the §13 paired-and-excluded delta (§15) — while the decode-bound long_decode workload stays nearly flat (both ~1.40×) and retains the highest efficiency (~0.72 on both). The middle three workloads cluster between these extremes. These values are workload_boundary.vllm.*.mean_p95x_at_n32 / mean_efficiency_at_n32 and the SGLang mirror (observed).

Continuous batching works until workload shape breaks the amortization regime. Decode-heavy traffic — where the marginal request is mostly cheap decode steps — batches almost for free (1.4× tail). Prefill-heavy traffic — where each new request drags a large one-shot prefill through the scheduler — breaks the amortization and pushes the tail to 2.6–3.5×. The boundary is a property of the workload, not a fixed engine constant.

3.4 The cross-backend head-to-head (the deliverable)

The paper deliverable is the per-workload vLLM-versus-SGLang comparison on matched hardware, computed as a paired Wilcoxon signed-rank test over per-cell efficiency pairs, Holm-Bonferroni-adjusted across the five workloads, with a bootstrap 95% CI on the mean delta (Δ = vLLM − SGLang).

Workload n_pairs Mean Δ (vLLM−SGLang) Boot 95% CI Holm p Significant Read
short_decode 25 −0.0349 [−0.0544, −0.0176] 0.001824 yes SGLang edge on short prompts
balanced_2k 25 +0.0022 [−0.0100, 0.0140] 0.652841 no TIE (CI straddles zero)
long_prefill_8k 20 +0.0757 [0.0546, 0.0961] 0.000048 yes vLLM edge (Mistral excluded)
long_decode 25 −0.0349 [−0.0570, −0.0172] 0.000649 yes SGLang edge
repeated_prefix 25 −0.0148 [−0.0227, −0.0076] 0.001289 yes SGLang edge (prefix reuse)

Observations. The deltas span 1.5–7.6 percentage points in magnitude. Four of five workloads are Holm-significant with bootstrap CIs that exclude zero; the direction splits by workload — SGLang edges short_decode, long_decode, and repeated_prefix, vLLM edges long_prefill_8k — and balanced_2k is a genuine tie whose CI straddles zero. The long_prefill_8k row carries n_pairs = 20 rather than 25 because the Mistral cell is excluded for the tokenizer confound (§15). All values are from head_to_head_by_workload (observed).

These deltas are small but consistent — and the honest framing matters. This is not "vLLM beats SGLang" or "SGLang beats vLLM." The direction flips by workload, balanced traffic is a tie, and no delta exceeds ~7.6pp even at the extremes. What V3 delivers is a workload-conditional routing signal on a clean matched-SKU substrate: which engine to prefer is a function of the traffic shape, not a leaderboard rank. The reader must not conflate these single-digit-pp head-to-head deltas with the order-of-magnitude V1-vs-V3 regime shift in §3.1; they are two phenomena of entirely different size.

3.5 What V3 licenses and does not

V3 licenses a matched-SKU cross-backend boundary readout: because both engines ran on identical A100 80GB PCIe hardware, the §3.4 deltas need no bandwidth-adjustment caveat (the load-bearing methodological advance over V2, whose vLLM-vs-SGLang contrast required a 0.75 SXM/PCIe ratio correction). It also licenses the gradual-decline regime contrast against V1, the model-size gradient, and the workload-shaped p95 boundary — each on five model families.

V3 does not license: a clean Mistral × long_prefill_8k cross-backend cell (tokenizer-confounded, documented-and-excluded); any generalization beyond the five tested model families or beyond 14B scale; an engine ranking (the head-to-head deltas are a workload-conditional routing signal, not a leaderboard); or a claim that continuous batching always avoids breakdown (the boundary is workload-shaped, and long_prefill_8k still drives efficiency below 0.30 at N=32 on SGLang). These boundaries are enumerated in full in §18 (Limitations) and §19 (Forbidden Claims).


4. Introduction and Research Motivation

Technical Report 164 has run as a three-leg arc, each leg tightening the previous leg's open caveat. The through-line is a single question — how does a serving stack's parallel efficiency degrade as concurrency rises, and what does the degradation depend on — pursued first against one dispatcher, then across backends on cloud hardware, and finally on a matched-hardware grid broad enough to separate engine effects from hardware and scale effects. V3 is that final leg.

4.1 V1 — The catastrophic baseline

V1 (Technical_Report_164.md) was a single-backend study. Its pytorch_direct dispatcher fanned concurrency out inside one Python process across worker threads, so every per-request prefill and decode shared the interpreter, the CUDA context, and the thread pool. Under that architecture median parallel efficiency collapsed uniformly at N=2 across all 24 (model × workload × phase) combinations, falling 1.000 → 0.547 (N=2) → 0.295 (N=4) → 0.127 (N=8) → 0.056 (N=16) — operationally dead by N=16. The run also surfaced six (phase × workload × N=16) cell shapes on llama3.2-3b that deterministically hung the dispatcher (Technical_Report_164.md SS7, where the pathological coordinate is the (phase, workload, N) triple on the single llama3.2-3b model, all six at N=16) and a median TTFT of 188 seconds on llama3.2-1b at N=16 balanced_2k. V1 named a GIL-contention hypothesis for the collapse but stopped short of a matched-pair test. It is the pessimistic anchor of the arc (cited, not recomputed, from Technical_Report_164.md SS4 and SS14).

4.2 V2 — Continuous batching, but a hardware confound

V2 (Technical_Report_164_V2.md) was the cross-backend feasibility pass. It fanned the V1 cell shape across vLLM, SGLang, and TGI and confirmed the load-bearing H1 result: continuous batching eliminates the V1 N=2 breakdown — TGI kept parallel efficiency well above 0.65 at N=2 across every combination where V1 had collapsed. But V2's vLLM-versus-SGLang head-to-head was hardware-confounded: the two cloud engines ran on different A100 SKUs (vLLM on A100 PCIe, SGLang on A100 SXM, after the SGLang vendor image forced a native-launcher fallback onto a SXM-only pod). The raw contrast favored SGLang by 1.71pp, but a bandwidth adjustment dividing by the 0.75 SXM/PCIe HBM ratio inverted the result — meaning the raw head-to-head could not be reported as an engine comparison without a hardware caveat. V2 was also 7B-only on cloud, so it had no view of how batching efficiency scales with model size.

4.3 The V3 mandate

V3 closes the V2 confound directly. Both backends run on a matched A100 80GB PCIe SKU (provenance.hardware — "matched SKU, no bandwidth adjustment needed"), so the vLLM-versus-SGLang head-to-head is hardware-clean by construction. The grid scales to five model families spanning the 7B → 14B axis V2 lacked (qwen2.5-7b, llama3.1-8b, mistral-7b, qwen2.5-14b, phi-4) and five workloads (V2's four plus short_decode), with vLLM pinned at 0.10.2 and SGLang at 0.5.12.post1. Crucially, V3 is a standalone extension, not a controlled re-run of V2: V2 was the earlier, smaller feasibility pass on hardware-split SKUs, and V3 stands on its own matched-SKU substrate with no cross-version confound to manage (V3_CROSS_BACKEND_FINDINGS §4 — "V3 stands on its own; no cross-version confound to manage"). Any V2-vs-V3 agreement reported later is a confirmatory bonus, not a load-bearing claim. The organizing direction throughout is the V2-reframe thesis: continuous batching works until workload shape breaks the amortization regime (V3_SGLANG_RUN_FINDINGS §5.2).


5. Research Questions

V3 is structured around four research questions, each mapping to a results block and each phrased to be answerable from the v3_paper_stats.json substrate alone.

RQ1 — Regime. Does the gradual continuous-batching decline (versus V1's catastrophic N=2 collapse) hold across five model families on a matched-SKU grid? Operationalized as: do both backends retain parallel efficiency materially above V1's 0.056-at-N=16 floor across all five models at N=32? Answered in §11 against per_model_efficiency_curves.

RQ2 — Scale. Is there a model-size gradient in batching-efficiency retention, and is it engine-agnostic? Operationalized as: at N=32, do the 14B-class models retain more efficiency than the 7–8B class on both backends? Answered in §10 against the N=32 column of per_model_efficiency_curves.

RQ3 — Workload shape. Is the parallel-efficiency / p95 boundary workload-shaped rather than uniform, and how? Operationalized as: do the mean p95-multiplier and N=32-efficiency summaries separate prefill-bound from decode-bound workloads on both backends? Answered in §12 against workload_boundary and the per-cell boundary_matrix.

RQ4 — Engine. On matched hardware, do vLLM and SGLang differ in parallel efficiency — and if so, is the difference an engine ranking or a workload-conditional routing signal? Operationalized as: per workload, is the paired-Wilcoxon delta Holm-significant with a bootstrap CI excluding zero, and does the direction split by workload? Answered in §13–§14 against head_to_head_by_workload, with the magnitude-honesty discipline (small but consistent; routing signal, not leaderboard) carried throughout.



6. Methodology

The organizing principle of V3 is deliberately narrow: V3 is a single grid — backend × model × workload × concurrency × repetition × phase — executed on one matched GPU SKU, and it exists to retire the one confound that V2 could not retire. V2's cross-backend head-to-head was hardware-split — vLLM measured on A100 PCIe, SGLang on A100 SXM — and the resulting vLLM-vs-SGLang efficiency claim had to carry a bandwidth-adjustment caveat (the 0.75 SXM/PCIe ratio in Technical_Report_164_V2.md SS8). V3 removes that caveat by construction: both backends run on A100 80GB PCIe (provenance.hardware = "A100 80GB PCIe (both backends — matched SKU, no bandwidth adjustment needed)"). Every cross-backend delta reported downstream is therefore hardware-clean — the difference between two efficiency curves on the same SKU, with no memory-bandwidth correction interposed between the measurement and the claim. The remainder of this section states the cell-shape contract, the two-phase measurement design, the parallel-efficiency definition that all three reports share, and the paired statistical battery that turns the matched substrate into a defensible head-to-head — each in its own subsection so that a downstream reader can audit any reported number against the procedure that produced it.

6.1 The cell-shape contract and its arithmetic

The cell-shape contract is a fully-crossed grid. Each backend executes 5 model families × 5 workloads × 6 concurrency levels (N ∈ {1, 2, 4, 8, 16, 32}) × 3 repetitions × 2 measurement phases (the scaling phase and the time-to-first-token phase). That product is 900 cells per backend (provenance.cells_per_backend = 900), and the two backends together form the 1,800-cell aggregate matrix. Every cell completed: ok_rate = 1.0 on both backends (provenance.ok_rate), 94,500 metrics rows per backend (provenance.rows_per_backend), 189,000 rows aggregate. There were no degenerate cells (zero-throughput or none-throughput), and 25/25 (model × workload) combinations are clean on each backend (V3_SGLANG_RUN_FINDINGS §2; V3_CROSS_BACKEND_FINDINGS §1). The arithmetic is internally consistent: 94,500 rows over 900 cells is a mean of 105 measured requests per cell, and 25 (model × workload) combinations × 6 N-levels × 3 reps × 2 phases reconstructs the 900-cell total exactly.

The factorization is worth writing out because it is the cell census that every downstream count traces back to, and because it is the spine the verification of §16 audits. The fully-crossed product decomposes two ways, and both must agree. Reading by combination first: 5 models × 5 workloads = 25 (model × workload) combinations, each of which runs the 6-level concurrency ladder × 3 reps × 2 phases = 36 cells, and 25 × 36 = 900. Reading by axis first: 6 N-levels × 3 reps × 2 phases = 36 cell-shapes per combination, replicated across 25 combinations, again 900. The two decompositions reconciling exactly is not decorative — it is the check that no axis silently lost a level, no combination silently dropped, and no phase ran short. A grid that completed at 899 or 901 cells would signal a structural defect in the dispatch before any efficiency number was even computed; the fact that both backends land on exactly 900 / 900 is the first integrity gate, prior to the md5 and row-status gates of §16.

The matched cell-shape across the two backends is the precondition for everything that follows. The same 25 (model × workload) combinations, the same {1,2,4,8,16,32} ladder, the same 3 reps, and the same 2 phases run on vLLM and on SGLang, so every cell on one backend has an exact-coordinate twin on the other. That exact-coordinate correspondence is what makes the §13 head-to-head a paired comparison rather than a between-groups one: the test is not "vLLM's distribution versus SGLang's distribution" but "the per-coordinate difference vLLM(model, N, rep) − SGLang(model, N, rep)," and the pairing is only well-defined because the two grids are cell-shape-identical by construction. The matched cell-shape is the V3 analogue of the matched join-key discipline V2 used for its cross-run synthesis, scaled from V2's single cloud model to V3's five.

Observations. The cell census reconciles to 900 / 900 on both backends by two independent factorizations, and the per-cell row mean of 105 is uniform enough across the grid that no cell is sample-starved relative to another. The matched cell-shape is the structural reason the head-to-head can be paired; without it, the comparison would forfeit the within-coordinate variance reduction that makes the small backend deltas legible.

The cell census is the boring number that everything else stands on. A reader who trusts the §13 head-to-head is implicitly trusting that the 25-combination × 36-cell-shape grid completed identically on both backends, because the pairing is undefined otherwise. State of confidence: the 900 / 900 / 94,500 / ok_rate 1.0 counts are observed directly from provenance and the two findings docs §2/§1; the two-way factorization is arithmetic over those counts.

6.2 The two measurement phases and the per-cell request budget

The two measurement phases serve different roles and together account for the cell count. The scaling phase drives N concurrent agents at each concurrency level and measures per-agent decode throughput, which is the input to the parallel-efficiency curve; the time-to-first-token (ttft) phase measures the latency-to-first-token distribution under the same concurrency ladder, which is the input to the p95 latency multiplier. Both phases run the full {1,2,4,8,16,32} ladder over all three reps, so each (model, workload) combination contributes 6 N-levels × 3 reps × 2 phases = 36 cells, and the 25 (model × workload) combinations × 36 reconstructs the 900-cell-per-backend total. The two phases are kept separate rather than fused because they measure orthogonal facets of the same boundary: the scaling phase answers "how much aggregate decode throughput does the scheduler recover as concurrency rises" and the ttft phase answers "how far does the tail of first-token latency stretch under the same load." A single fused phase would conflate the throughput-amortization signal (which the efficiency curve isolates) with the queueing-and-admission signal (which the p95 multiplier isolates), and the workload-shaped boundary of §12 is only legible because the two are measured independently and then read against each other.

The requests-per-agent budget is what fills each cell: at 94,500 rows over 900 cells the substrate averages 105 measured requests per cell, and because a cell at concurrency N fans those requests across N agents, the per-agent request count scales down with N while the per-cell request budget stays in the same band — the design holds the per-cell measurement mass roughly constant across the ladder so that the high-N efficiency estimates are not starved of samples relative to the low-N ones. This is a deliberate design choice with a statistical consequence. If the request budget were held per-agent constant instead of per-cell constant, an N=32 cell would carry 32× the request mass of an N=1 cell and the high-N efficiency estimates would be far tighter than the low-N ones — a precision asymmetry that would bias any pooled summary toward the high-N tail. By holding the per-cell budget roughly constant the design keeps the efficiency estimate at every rung of the ladder on comparable statistical footing, which is the precondition for reading the N=1→N=32 descent as a clean curve rather than as an artifact of unequal sampling. The 105-request-per-cell average is the realized mass; the three reps per cell are the replication that lets a cell's efficiency be an average over independent dispatch rounds rather than a single noisy draw.

Observations. The scaling and ttft phases are the two orthogonal projections of the boundary — throughput amortization and tail-latency inflation — and they are measured under the identical concurrency ladder so that the §12 efficiency-vs-p95 anti-correlation is a same-substrate reading, not a cross-substrate one. The per-cell (rather than per-agent) request budget is the design decision that keeps the efficiency estimates comparable across the N ladder.

Two phases, one ladder, a roughly constant per-cell budget: that triad is what makes the efficiency curve and the p95 boundary two readings of the same grid rather than two separate experiments. State of confidence: the two-phase structure and the 36-cell-per-combination factorization are observed from the cell census; the per-cell-constant budget and its precision-symmetry rationale are inferred from the 105-row-per-cell mean and the dispatch design, stated as inferred.

6.3 The parallel-efficiency metric and the 0.65 breakdown threshold

The parallel-efficiency metric is carried over unchanged from V1 and V2 so that all three reports are directly comparable. Efficiency at concurrency N is the per-agent decode throughput at N normalized to the per-agent decode throughput at N=1 — equivalently, aggregate decode-tps at N divided by N times the N=1 decode-tps — so that a perfectly-amortizing batch scheduler scores 1.0 at every N and a fully-serializing dispatcher scores 1/N. By definition efficiency is anchored at 1.0 at N=1 (per_model_efficiency_curves confirms every model × backend reads exactly 1.0 at N=1). The metric is therefore a retention measure, not an absolute-throughput measure: it asks what fraction of the ideal linear-in-N scaling the backend actually delivers, and it is by construction blind to the absolute single-stream throughput of the model. That blindness is a feature for the cross-backend and cross-model comparison — a 14B model has lower absolute decode-tps than a 7B model, but efficiency normalizes that away and isolates the scaling behavior, which is the dimension the report is about. It is also why the V1→V3 contrast of §11 is legible across entirely different hardware and model sets: efficiency is a dimensionless ratio, so V1's 0.056-at-N=16 on an RTX 4080 and V3's 0.47–0.62-at-N=32 on an A100 are directly comparable as scaling-retention numbers even though their absolute throughputs are incomparable.

The breakdown threshold is fixed at 0.65 (provenance.efficiency_breakdown_threshold): a cell is said to "break" at the first N where its efficiency drops below 0.65, and the per-cell first_n_below_0.65 field records that crossing (null when a cell never breaks within the tested N ≤ 32). The 0.65 cut is the same operational threshold V2 used to separate the amortizing regime from the saturating regime; it is not re-derived here. The threshold is config-driven — analysis.efficiency_breakdown_threshold: 0.65 in the run config — rather than chosen post-hoc to make the V3 numbers land at a particular break-point distribution, which is the discipline that keeps the §23 break-point counts honest. A cell at exactly 0.65 efficiency is recovering roughly two-thirds of ideal linear scaling; below that the marginal request is costing more than a third of a full serial slot, which the program reads as the boundary between "batching is still buying most of the parallelism" and "batching is saturating." The threshold is a convention, not a physical constant, and the report is careful to read the boundary statistics categorically (broke-at-N versus never-broke) rather than treating the 0.65 line as a sharp phase transition.

Observations. Efficiency is a dimensionless scaling-retention ratio anchored at 1.0 at N=1, which is what licenses comparing it across the hardware and model sets V1, V2, and V3 each used. The 0.65 break threshold is inherited config-driven from V2, not re-tuned for V3, so the break-point distribution of §23 is a measurement, not a fitted result.

The efficiency definition is the one piece of machinery that has to be identical across V1, V2, and V3, because the entire lineage argument — V1 collapse, V2 elimination, V3 gradual decline — is a comparison of this single ratio across three substrates. Carrying it over unchanged, and carrying the 0.65 threshold over config-driven rather than re-derived, is what makes the cross-report comparison a like-for-like one. State of confidence: the metric definition and the 0.65 threshold are observed from provenance.efficiency_breakdown_threshold and the V1/V2 reports; the dimensionless-comparability rationale is inferred from the ratio's construction.

6.4 The paired head-to-head battery: Wilcoxon, Holm, bootstrap

The cross-backend head-to-head is a paired design, not a between-groups design. For each workload, the per-cell efficiency values from vLLM and SGLang are paired by (model, N, rep) and the pairwise differences vLLM − SGLang are tested with a paired Wilcoxon signed-rank test (head_to_head_by_workload.<workload>.wilcoxon_p_raw). The five per-workload tests form one family, so the raw p-values are corrected with a Holm-Bonferroni stepdown across the five workloads (wilcoxon_p_holm, holm_significant_0.05). On top of the rank test, a percentile bootstrap 95% confidence interval on the mean delta is reported per workload (boot_ci95_delta); the resample count and bootstrap seed follow the cross-run analysis script that produced boot_ci95_delta (the experiment seed is 164; the v3_paper_stats.json provenance block carries the interval but not the resample parameters, so those specifics are inferred from the analysis pipeline rather than substrate-traced), and the interpretation rule is explicit — a CI excluding zero is a consistent-direction signal, a CI straddling zero is a tie regardless of point estimate. The pairing count is n_pairs per workload: 25 for the four full-model workloads, and 20 for long_prefill_8k because Mistral is excluded there for a tokenizer reason developed in §15.

Each leg of the battery does a distinct job, and the redundancy between them is intentional rather than wasteful. The Wilcoxon signed-rank test answers the direction-reliability question — across the paired deltas in a workload, is the sign of the difference consistent enough to reject the null of no difference — and it does so non-parametrically, which matters because the efficiency deltas are heavy-tailed across the N ladder (a delta at N=2 is a few percent while a delta at N=32 is tens of percent), so a mean-based paired t-test would let the high-N cells dominate the variance. The Holm-Bonferroni stepdown answers the family-wise-error question — given that five workloads are tested as one family, what is the corrected significance after controlling for the five looks — and it is the correction that keeps the head-to-head from being a five-way fishing expedition where one nominally-significant workload could be cherry-picked. The bootstrap CI answers the effect-size question — not just "is the difference real" but "how large is it and could it plausibly be zero" — and it is the leg that enforces magnitude honesty, because a reader who sees a tight CI of [−0.0544, −0.0176] on short_decode cannot round the 3.5pp edge up into a dominance claim when the upper bound is 1.8pp. The decision rule pairs two of these legs explicitly: a workload counts as a directional finding only when the Holm-adjusted p is below 0.05 and the bootstrap CI excludes zero, so a borderline result that passed one test but not the other would be reported as ambiguous rather than as a win.

Observations. The battery is three independent statistical lenses — rank-based significance, family-wise correction, and a bootstrap effect-size interval — and the head-to-head only reports a directional finding when the significance and the interval agree. The pairing by exact (model, N, rep) coordinate is what removes the between-cell variance that would otherwise swamp the 1.5–7.6pp backend signal.

The whole point of V3's methodology is that the cross-backend delta is the only thing that moves. V2 had to subtract a bandwidth ratio before it could read the vLLM-vs-SGLang difference; V3 reads it directly off the same A100 PCIe SKU. The paired Wilcoxon design is what licenses calling a 1.5–7.6pp delta "consistent" rather than "noise" — the sign of the difference is stable across all five model families and three reps, even when the magnitude is small. State of confidence: the design is matched by construction (observed from provenance.hardware); the statistical machinery is paired-and-corrected (observed from the head_to_head_by_workload field set); the role-decomposition of the three legs is the standard interpretation of the battery, stated as inferred.


7. SS1. The 5-Model Grid Design

V3's model axis is the dimension V2 lacked. V2's cloud arm was 7B-only (Qwen2.5-7B on cloud); V3 spans the 7B → 14B scale axis with five families: two 7B models (qwen2.5-7b, mistral-7b), one 8B model (llama3.1-8b), and two 14B-class models (qwen2.5-14b, phi-4). The model list is exactly the key set of per_model_efficiency_curves and is identical across both backends, so every model contributes a matched pair to every head-to-head. The 7B/8B versus 14B split is intentional: it is the minimal grid that can answer RQ2 (is there a model-size gradient in batching-efficiency retention), and §10 shows that the gradient is real and consistent across both backends.

The workload axis is V2's four workloads plus one. V3 measures short_decode, balanced_2k, long_prefill_8k, long_decode, and repeated_prefix (the key set of head_to_head_by_workload); the new entry relative to V2 is short_decode, the short-prompt regime, added precisely because that is where the backends turn out to diverge most (§13). Each workload probes a distinct point on the prefill/decode amortization surface, which is the surface the V2 reframe is built on. The geometry below states each workload's role and the regime it is designed to expose; the token counts are nominal targets of the synthetic prompt generator (the long_prefill_8k row carries the ~8,704-target-token figure that §15 shows is char-matched, not token-matched, across tokenizers).

Workload Regime probed Nominal shape Role in the grid
short_decode Short-prompt Short prefill, short decode Short-prompt regime; where batching overhead is least amortized
balanced_2k Mixed reference ~2k context, balanced prefill/decode The mixed-traffic reference point; V2's anchor workload
long_prefill_8k Prefill-bound tail ~8.7k-target-token prefill, short decode Prefill-bound tail probe; the worst-case p95 driver
long_decode Decode-bound flat Short prefill, long decode Decode-bound flat regime; where continuous batching amortizes best
repeated_prefix Prefix-reuse Shared long prefix, varying suffix RadixAttention prefix-reuse test surface

Observations. The workload set is not an arbitrary five; it is a designed spread across the amortization regime. long_decode is the easy case (the decode loop dominates and batching amortizes prefill once), long_prefill_8k is the hard case (the prefill is re-paid per request and there is little decode to amortize over), and short_decode, balanced_2k, and repeated_prefix populate the middle. repeated_prefix is the one workload constructed to give a specific engine feature (SGLang's RadixAttention prefix cache) something to bite on, which is why it is the natural place to test whether V2's prefix-reuse finding reproduces at five models (§14).

The grid is the minimum that answers the four research questions cleanly. Five models give a size gradient; five workloads give a regime spread; one matched SKU gives a clean head-to-head. State of confidence: the model and workload sets are observed directly from the per_model_efficiency_curves and head_to_head_by_workload key sets; the regime-role assignment is the design intent inherited from V2 and the V2-reframe direction (V3_SGLANG_RUN_FINDINGS §5.2), stated as inferred.

The matched-variable discipline is the spine of the design, and it is worth tabulating exactly what is held constant, what is the one deliberate variant, and where the single residual confound lives. This is the table that licenses reading the §13 deltas as backend effects.

Category Variable Status
Constant Workloads (5), synthetic prompts, dispatcher orchestration Identical both backends
Constant Analysis pipeline, parallel-efficiency definition, 0.65 threshold Identical both backends
Constant Repetitions (3), concurrency ladder {1,2,4,8,16,32} Identical both backends
Constant Hardware SKU — A100 80GB PCIe Identical both backends (provenance.hardware)
Deliberate variant Serving backend vLLM 0.10.2 vs SGLang 0.5.12.post1 (provenance.vllm_version, provenance.sglang_version)
Residual confound Mistral tokenizer routing on long_prefill_8k only mistral_common (vLLM) vs HF AutoTokenizer (SGLang) — isolated and excluded (§15)

Observations. Every row above the deliberate-variant line is a held constant, including the hardware SKU — which is the row V2 could not hold. The single deliberate variant is the backend, so any clean per-workload delta is attributable to the engine. The single residual confound is fenced to one (model × workload) cell — Mistral × long_prefill_8k — and handled by exclusion rather than adjustment, so it contaminates nothing outside that cell. Twenty-four of the twenty-five (model × workload) combinations carry the same HF tokenizer on both backends and are clean (V3_CROSS_BACKEND_FINDINGS §3).

This table is the V3 advance over V2 in one frame. V2's matched-variable table had to list "hardware SKU" as a confound, not a constant; V3 promotes it to a constant. The only confound that remains is a single tokenizer-routing cell, and the design's response is the conservative one — exclude it, do not adjust it. State of confidence: every row is observed from the provenance block or the cross-backend findings; the "isolated and excluded" disposition is the documented PI decision (V3_CROSS_BACKEND_FINDINGS §3, option (a)).


8. SS2. Both-Backend Setup (Matched A100 80GB PCIe)

The two backends were stood up independently but to a matched hardware target. The vLLM side ran vLLM 0.10.2 in a native venv on A100 80GB PCIe under driver 580.159, with torch 2.8.0+cu128; Mistral was routed through --tokenizer-mode mistral (the mistral_common tokenizer) at a context ceiling of 8,768. The vLLM environment came up clean with no recovery arc, for the reason developed in §9: the pod's driver 580.159 already covers CUDA 13, so the cu128 torch wheel initialized on the first try (V3_CROSS_BACKEND_FINDINGS §1, §5). The vLLM grid was Codex-driven and verified on the main thread.

The SGLang side ran sglang 0.5.12.post1 via the native launcher (the vendor Docker image was bypassed, as in V2) on A100 80GB PCIe under driver 575.57, with torch 2.11.0+cu130 plus a cuda-compat-13-0 forward-compat bridge, kernels<0.15 pinned, and --cuda-graph-max-bs 32. Mistral was routed through the HF AutoTokenizer at a context ceiling of 11,264 (the refilled value; the original 8,768 overflowed — §15). The SGLang environment required a four-failure recovery arc that §9 documents in full; the working pin recorded here is the durable reproducibility contract (V3_SGLANG_RUN_FINDINGS §3). The two repro pins are tabulated below.

vLLM pod SGLang pod
GPU A100 80GB PCIe A100 80GB PCIe
Driver 580.159 575.57
Backend vLLM 0.10.2 (native venv) sglang 0.5.12.post1 (native launcher)
torch 2.8.0+cu128 2.11.0+cu130 + cuda-compat-13-0
Mistral routing --tokenizer-mode mistral, ctx 8768 HF AutoTokenizer, ctx 11264 (refilled)
Run dir 20260612_204254_036590 20260612_212816_266127

Observations. The two pods differ in three software particulars — driver version (580.159 vs 575.57), torch wheel flavor (cu128 vs cu130-plus-compat), and Mistral tokenizer routing — none of which compromise the matched-SKU comparison. The driver and wheel differences are environment-recovery artifacts, not measurement variables: both stacks resolve to a working CUDA-13-capable runtime on the same A100 80GB PCIe SKU, and the parallel-efficiency metric is invariant to which forward-compat path got the driver there. The Mistral routing difference is the one that matters for measurement, and it is fenced to the single long_prefill_8k Mistral cell and excluded (§15). The context-ceiling difference (8768 vs 11264) is itself benign for the metric — a KV ceiling at or above the actual sequence length does not change per-request latency or throughput — it only had to be raised on SGLang because that backend's tokenizer produced longer sequences (§15).

The setup is matched where it has to be (the SKU) and divergent only where the environment forced it (the driver/wheel path) or where a tokenizer asymmetry surfaced (Mistral). The discipline is to keep the divergences out of the measurement: the driver/wheel divergence does not touch efficiency, and the tokenizer divergence is excluded rather than averaged in. State of confidence: every pin is observed from V3_CROSS_BACKEND_FINDINGS §5 and V3_SGLANG_RUN_FINDINGS §3; the "benign for the metric" claim about context ceiling is the documented reasoning in V3_SGLANG_RUN_FINDINGS §4 (inferred, but consistent with the KV-ceiling semantics).

Verification was treated as a first-class step, not a formality. Both runs are byte-identical local-versus-pod by md5: SGLang's original 900-cell metrics file matches at 5fd15755… and its refill file at e0400837…, and vLLM matches at 540adec5… (V3_SGLANG_RUN_FINDINGS §2; V3_CROSS_BACKEND_FINDINGS §1). Both backends report 900/900 cells complete, 94,500 rows (the theoretical maximum, no request dropped), 0 errors, 25/25 (model × workload) combinations clean, and 0 degenerate cells. The full integrity discipline — and the row-versus-cell-level catch that exposed the Mistral overflow that a cell-level gate alone would have missed — is developed as a standalone result in §16.

Observations. The verification chain is two-layered: md5 byte-identity proves the local substrate is the pod substrate (no transfer corruption, no partial pull), and the per-cell/per-row status audit proves the substrate is request-clean (no silent error mass hiding under a "complete" cell flag). Both layers passed on both backends. The md5 hashes are the cheap, decisive check; the row-level audit is the one that caught the Mistral overflow (§16).

Byte-identity plus a row-level status audit is the verification floor for a substrate this size — 189,000 rows is too large to eyeball, and the §16 Mistral case is the concrete proof that cell-completion is necessary but not sufficient. State of confidence: md5 identity and the 25/25-clean / 0-degenerate counts are observed from the two findings docs §2/§1.


9. SS3. The Four-Failure Environment Arc as Methodology-Honest Substrate

The SGLang native environment took four attempts to stand up, and the arc is reported here as a reproducibility-contract disclosure, not as a confound or an apology. The 900-cell SGLang matrix completed at ok_rate 1.0; the environment arc is the cost of getting to that clean substrate, and each failure is a distinct, reusable lesson. The canonical record is the runpod-driver-preflight memory; the four failures, verbatim from V3_SGLANG_RUN_FINDINGS §3, are walked below.

Failure 1 — PEP 668 (externally-managed environment). The Ubuntu 24.04 pod image refuses bare pip install under PEP 668's externally-managed-environment guard. Fix: --break-system-packages (acceptable in a disposable container). This is a one-line speed bump, but it is the first thing that breaks on a fresh 24.04 pod and worth recording so the next stand-up does not rediscover it.

Failure 2 — unpinned transitive dependency kernels. sglang==0.5.12.post1 pulled kernels==0.15.2 (released into the dependency tree mid-window), which crashed the sglang import with ValueError: Either a revision or a version must be specified. Fix: pin kernels<0.15. This is the classic moving-transitive-dependency failure — the top-level pin was honored, but an unpinned transitive resolved to a fresh, incompatible release.

Failure 3 — CUDA wheel ↔ driver mismatch (the load-bearing failure). PyPI's default torch==2.11.0 resolves to cu130 wheels, but the pod host driver was 575.57 (CUDA 12.9), so torch.cuda.is_available() returned False and every SGLang server died at launch with "No accelerator." The naive fix — swapping torch to a cu128 wheel — then broke the cu13-built sgl-kernel/flashinfer wheels (libnvrtc.so.13 missing, ModuleNotFoundError: common_ops), because those kernels were compiled against the cu13 stack. The resolution was to keep the coherent cu13 stack (torch 2.11.0+cu130 + sgl-kernel cu13) and bridge the older host driver forward with cuda-compat-13-0 (580.159 forward-compat libraries) plus an LD_LIBRARY_PATH pointing at the compat libs. Verified after the bridge: torch.cuda.is_available() True, device reported as "NVIDIA A100 80GB PCIe," sgl_kernel imports. This is the failure that ate the most time and the one with the most transferable lesson: do not mix wheel CUDA flavors to chase a driver; keep one coherent stack and bridge the driver.

Failure 4 — silent-success manifests (the dangerous one). TR164's run.py exits 0 and writes a manifest with status:"complete" even when every cell is group_failed (the server never came up). The first driver's exit-code gates therefore reported three fake "smoke ok"s, each consuming exactly the 1200-second startup timeout — and that uniform-timeout signature is the tell that the smoke "passed" by timing out, not by serving. The fix: driver v2 verifies every run.py invocation at the manifest level (per-cell status plus a metrics-row count), never the exit code, and stage 0 asserts torch.cuda.is_available() and import sgl_kernel up front before any cell runs. This is the same silent-success failure class as the TR164 V1 nsys incident, where a wrapped time.sleep "succeeded" by capturing nothing — the canonical case in this program for why exit codes are not evidence.

The contrast with the vLLM side is the most useful single data point in the arc. The vLLM pod shipped driver 580.159 (CUDA-13-capable), so Codex's torch 2.8.0+cu128 initialized cleanly and none of failures 3 or 4 arose — there was no environment recovery at all on the vLLM side (V3_CROSS_BACKEND_FINDINGS §1). The lesson the contrast reinforces is the one that runs through the whole arc: match the wheel's CUDA flavor to the host driver, or pick a host whose driver already covers it. The SGLang pod's 575.57 driver did not cover the cu130 wheels and required the compat bridge; the vLLM pod's 580.159 driver did, and the stack just worked.

Observations. The four failures are not one bug rediscovered four times; they are four distinct classes — a packaging guard (PEP 668), a transitive-pin drift (kernels), a wheel/driver ABI mismatch (cu130-on-575), and a verification-gate fallacy (exit-code-as-success). Failures 3 and 4 are the load-bearing pair: failure 3 is the one that took real wall time and produced the durable cuda-compat-13-0 pin, and failure 4 is the dangerous one because it produces a substrate that looks complete while being empty. The fix for failure 4 — manifest-level verification rather than exit-code gating — is not SGLang-specific; it is the standing discipline this program applies after the V1 nsys silent-success, and it is what makes the final 900/900-clean claim trustworthy rather than merely asserted.

The environment arc belongs in the methodology, not in a footnote, because the working pin is the reproducibility contract for the SGLang side and the manifest-gating fix is why the clean-substrate claim is credible. A run that completes at ok_rate 1.0 only because the gate actually checked the rows — rather than trusting an exit code that lies — is a stronger result than one that completes because nothing was checked. State of confidence: all four failures and their fixes are observed verbatim from V3_SGLANG_RUN_FINDINGS §3; the vLLM clean-stand-up contrast is observed from V3_CROSS_BACKEND_FINDINGS §1; the framing of failure 4 as the same class as the V1 nsys incident is the program's own characterization (inferred-and-documented).


10. SS4. Per-Model Efficiency Curves and the Model-Size Gradient

The first results layer of V3 is the per-model parallel-efficiency curve: for each of the five model families, on each backend, how mean parallel efficiency degrades as the concurrency level N climbs from 1 to 32. This is the model-marginal granularity — efficiency averaged over the five workloads and three replicates — and it is the cleanest single picture of what continuous batching buys the deployment. Parallel efficiency is defined exactly as in V1 and V2: the per-N decode throughput normalized so that N=1 is 1.000 by construction, with values below 1.0 measuring how far the realized throughput falls short of the ideal linear-in-N scaling the dispatcher would deliver if every concurrent request were free. The efficiency-breakdown threshold carried through the whole boundary analysis is 0.65 (provenance.efficiency_breakdown_threshold): below it, a cell is judged to have exited the amortization regime. Every number in the two tables below is read directly from per_model_efficiency_curves in research/tr164/v3_paper_stats.json (observed, not inferred).

The defining feature of these curves, established before any cross-backend comparison, is that they are gradual. There is no concurrency level at which any model on either backend falls off a cliff. The curves descend monotonically and smoothly from 1.000 at N=1 through the high-0.9s at N=2, the low-0.9s at N=4, the high-0.8s at N=8, the mid-0.6s at N=16, and finally into the 0.47–0.62 band at N=32. That smoothness is the entire physics story of V3's scaling layer, and it is the deliberate contrast against V1's pytorch_direct collapse that Section 11 quantifies.

The vLLM 0.10.2 per-model curves on matched A100 80GB PCIe:

Model N=1 N=2 N=4 N=8 N=16 N=32
qwen2.5-7b 1.000 0.961 0.905 0.806 0.686 0.500
llama3.1-8b 1.000 0.971 0.918 0.805 0.644 0.488
mistral-7b 1.000 0.953 0.885 0.787 0.572 0.470
qwen2.5-14b 1.000 0.976 0.936 0.859 0.696 0.612
phi-4 1.000 0.983 0.941 0.850 0.718 0.619

Observations. Every family retains useful throughput all the way to N=32: the worst of the five (mistral-7b at 0.470) is still extracting just under half of ideal linear scaling at 32-way concurrency, and the best (phi-4 at 0.619) retains nearly two-thirds. The descent is orderly. At N=8 all five models are bunched within a 0.072 band (0.787 to 0.859); the families do not begin to visibly separate until N=16, where the 14B-class pair (qwen2.5-14b 0.696, phi-4 0.718) pulls clear of the 7–8B group (mistral-7b 0.572 trailing). By N=32 that separation has hardened into two tiers: the 14B-class pair sits at 0.612–0.619 while the three 7–8B models cluster at 0.470–0.500. The single largest per-step drop anywhere in the vLLM table is the N=16→N=32 transition, where every model sheds 0.10–0.19 of efficiency at once — the regime where 32-way batching begins to genuinely contend for the GPU's compute and memory-bandwidth resources rather than merely amortizing kernel-launch overhead.

Read as a deployment curve, the vLLM table says continuous batching holds the line well past the N=16 ceiling that V1 could not survive: even at 32-way concurrency, the worst-case efficiency loss is roughly 53% (mistral-7b), and the best case is a 38% loss (phi-4). There is no concurrency setting in the tested range at which vLLM degenerates to sequential behavior. The orderly, monotone shape is the signature of a scheduler that is working as designed; the only question the table leaves open is the size of the per-family separation, which Section 11 reads as the model-size gradient.

The SGLang 0.5.12.post1 per-model curves on the same matched A100 80GB PCIe SKU:

Model N=1 N=2 N=4 N=8 N=16 N=32
qwen2.5-7b 1.000 0.973 0.928 0.823 0.687 0.529
llama3.1-8b 1.000 0.959 0.897 0.795 0.648 0.478
mistral-7b 1.000 0.961 0.889 0.796 0.641 0.471
qwen2.5-14b 1.000 0.980 0.938 0.860 0.722 0.600
phi-4 1.000 0.981 0.937 0.848 0.729 0.595

Observations. The SGLang table is qualitatively indistinguishable from the vLLM table at the model-marginal granularity, which is itself a load-bearing result: two independently-built continuous-batching engines, on matched hardware, produce the same scaling shape and very nearly the same N=32 efficiency band. The 14B-class pair again pulls clear — qwen2.5-14b 0.600 and phi-4 0.595 at N=32, against the 7–8B group's 0.471–0.529. The ordering within the 7–8B group differs slightly from vLLM (here qwen2.5-7b leads the small models at 0.529, with mistral-7b 0.471 and llama3.1-8b 0.478 nearly tied at the bottom), but the band itself is the same. The N=16→N=32 drop is once more the steepest single step, and the high-N separation between the 14B pair and the 7–8B group is the same ~0.07–0.13 gap seen on vLLM.

The two backends agreeing this closely at the model-marginal level is the methodological payoff of the matched-SKU design: the per-model efficiency physics is a property of the model and the concurrency level, not of the engine. Where the engines do differ shows up only when the marginal is decomposed by workload (Section 13), not in these averaged-over-workload curves. For a deployment that mixes traffic shapes, the SGLang table licenses the same N=32 planning band as the vLLM table — roughly 0.47–0.53 for 7–8B and 0.60 for 14B.

The model-size gradient that both tables exhibit is the second observed finding of this section, and it is consistent across both backends (V3_SGLANG_RUN_FINDINGS §5.1, stated there as the "secondary, sensible gradient"). The 14B-class families retain more parallel efficiency at high concurrency than the 7–8B families: at N=32, qwen2.5-14b and phi-4 land at 0.60–0.62 on both backends, while qwen2.5-7b, llama3.1-8b, and mistral-7b land at 0.47–0.53. The gap is not an artifact of one engine — it appears with the same sign and roughly the same magnitude in both the vLLM and SGLang columns.

The mechanism is straightforward and is stated here as inferred-from-the-curves rather than separately instrumented: a 14B model does more floating-point work per generated token than a 7–8B model, so the fixed per-step batching and scheduling overhead is amortized over a larger compute payload. When the GPU spends a larger fraction of each decode step doing useful matrix multiplication rather than dispatching kernels and managing the batch, the marginal cost of adding another concurrent request is proportionally smaller, and parallel efficiency decays more slowly. The gradient is therefore exactly the direction a compute-amortization account predicts: bigger model, flatter efficiency curve. We note explicitly that this is observed across five families spanning only the 7B–14B range; the report makes no claim about whether the gradient continues, saturates, or reverses above 14B, because no model above 14B was tested.

10.1 Where the curve bends: the N=16→N=32 inflection

Reading the per-model curves as deployment-planning instruments rather than as static tables, the single most decision-relevant feature is where the curve bends — the concurrency level at which the marginal request stops being cheap. On both backends, and on every one of the five families, the steepest single-step efficiency loss in the entire ladder is the N=16→N=32 transition. The vLLM per-step drops make this concrete: qwen2.5-7b sheds 0.1861 of efficiency across that one step (0.686 → 0.500), llama3.1-8b sheds 0.1568 (0.644 → 0.488), mistral-7b 0.1021 (0.572 → 0.470), qwen2.5-14b 0.0839 (0.696 → 0.612), and phi-4 0.0991 (0.718 → 0.619) — all computed off the N=16 and N=32 columns of per_model_efficiency_curves.vllm. The SGLang side carries the same signature: the N=16→N=32 drop is the largest single step for qwen2.5-7b (0.1584), llama3.1-8b (0.1705), mistral-7b (0.1695), qwen2.5-14b (0.1219), and phi-4 (0.1338), each read off per_model_efficiency_curves.sglang. By contrast the early ladder is almost flat — the N=1→N=2 step costs only 0.017–0.047 of efficiency across all ten model×backend curves, an order of magnitude smaller than the N=16→N=32 step it bookends.

Observations. The curve is not a uniform slope; it is shallow through N=8 and then bends hard at the top of the ladder. The N=1→N=2 step — the exact step where V1's pytorch_direct lost nearly half its efficiency (1.000 → 0.547, §11) — costs the V3 continuous-batching backends at most 0.047, with the 14B-class models losing under 0.02. The cost of concurrency is back-loaded: it accumulates slowly until N=16 and then the N=16→N=32 doubling extracts the steepest toll of the ladder, as 32-way batching begins to contend for the GPU's compute and memory-bandwidth resources rather than merely amortizing kernel-launch overhead. The single largest drop anywhere in either table is mistral-7b's 0.2146 N=8→N=16 step on vLLM, which is the early-onset signature of the smallest model in the set saturating one rung before the others.

The deployment reading of the inflection is direct: on this substrate, doubling concurrency from N=8 to N=16 is comfortably amortized (every model stays above 0.57), but the N=16→N=32 doubling is where the planner must price in a 0.08–0.19 efficiency loss depending on model size. A capacity plan that treats the efficiency curve as a constant slope would over-provision at low N and under-provision at the top of the ladder; the curve's back-loaded shape says the marginal cost of concurrency is itself concurrency-dependent, and the inflection sits at N=16→N=32 for every family on both engines. State of confidence: every per-step drop is arithmetic over adjacent columns of per_model_efficiency_curves (observed); the contention-onset reading is inferred from the inflection's position.

10.2 Per-family deep reads

The model-marginal table hides as much as it shows, and the five families each carry a distinct signature worth stating individually, because the head-to-head of §13 and the boundary of §12 both decompose along these families. The reads below are model-marginal (averaged over five workloads and three reps); the per-cell joint that exposes the within-model workload spread is Appendix C.

qwen2.5-7b is the median 7B-class curve and the one cell where V2 and V3 share a model for the lineage check of §17. It descends 1.000 → 0.961 → 0.905 → 0.806 → 0.686 → 0.500 on vLLM and 1.000 → 0.973 → 0.928 → 0.823 → 0.687 → 0.529 on SGLang, landing within 0.029 of itself across the two engines at N=32 with SGLang marginally ahead. It is the model whose balanced_2k cell anchors the cross-run V2/V3 consistency check, and its model-marginal N=32 of 0.4997 (vLLM) is the workload-mean of cells spanning 0.354 (long_prefill_8k) to 0.681 (long_decode) — a 0.33 spread that the model-marginal averages away (boundary_matrix.vllm).

llama3.1-8b is the one family whose cross-engine sign is not uniformly SGLang-favoring at the model-marginal level: vLLM leads it at N=2, N=4, and N=8 (deltas +0.0119, +0.0211, +0.0100) before SGLang catches up at N=16 (−0.0038) and the two finish within 0.0099 at N=32 with vLLM nominally ahead (0.4876 vs 0.4777). It is also the family with the worst single short_decode cell in the matrix — vllm."llama3.1-8b|short_decode" drops to 0.3071 at N=32 with a 3.6× tail (boundary_matrix.vllm), the steepest short_decode cell on either backend — which is the per-cell substrate behind the §13 short_decode SGLang edge.

mistral-7b is the floor of the model set at N=32 on both engines (0.470 vLLM, 0.471 SGLang) and the family that saturates earliest: its vLLM N=8→N=16 step of 0.2146 is the single largest one-step drop anywhere in either per-model table. It is also the family carrying the §15 tokenizer confound, so its long_prefill_8k cell is the one excluded from that workload's head-to-head; its model-marginal curve still includes that cell (the exclusion is a head-to-head-only scoping, not a data drop), which is part of why mistral sits at the bottom of the band.

qwen2.5-14b and phi-4 are the 14B-class pair that defines the top of the band and the model-size gradient. They track each other closely — qwen2.5-14b 1.000 → 0.976 → 0.936 → 0.859 → 0.696 → 0.612 and phi-4 1.000 → 0.983 → 0.941 → 0.850 → 0.718 → 0.619 on vLLM — and both retain ~0.60–0.62 at N=32 on both engines, roughly 0.11–0.15 above the 7–8B floor. They also carry the two flattest cells in the matrix: vllm."phi-4|long_decode" never breaks and tops out at a 1.28× tail at N=32, the lowest tail multiplier anywhere, and vllm."phi-4|balanced_2k" is the only non-long_decode cell on vLLM that never crosses the 0.65 threshold (boundary_matrix.vllm).

Observations. The five families partition into three tiers at N=32: the 14B-class pair (0.60–0.62), the qwen2.5-7b mid-7B point (~0.50–0.53), and the llama3.1-8b / mistral-7b floor (~0.47–0.49). The tiering is stable across both engines, and the within-tier cross-engine deltas are smaller than the between-tier model-size gap — which is the model-marginal statement of the gradient. The one family that breaks the otherwise-uniform SGLang-leaning cross-engine sign is llama3.1-8b, where vLLM holds the early-ladder edge; this is the model-marginal shadow of the fact that the head-to-head sign is workload-conditional, not model-uniform (§13).

Read family by family, the curves say the same thing the marginal does but with the texture the marginal erases: the model-size gradient is a between-tier effect (14B-class above 7–8B), the cross-engine difference is a small within-tier effect that even flips sign on llama3.1-8b, and the within-model workload spread (0.33 on qwen2.5-7b alone) is larger than either. That ordering — workload spread > model-size gradient > cross-engine delta — is the load-bearing reason the report leads with the workload-shaped boundary (§12) and treats the model-size gradient as secondary and the head-to-head as a small-but-consistent routing signal. State of confidence: every curve and cell value is observed from per_model_efficiency_curves and boundary_matrix; the cross-engine delta signs are arithmetic over the two per-model tables.

10.3 The cross-engine model-marginal agreement as a result in its own right

That the two independently-built backends produce model-marginal curves agreeing to within ~3 percentage points at N=32 on every family is itself a load-bearing finding, not merely a sanity check. The per-N cross-engine deltas (vLLM − SGLang, averaged over workloads) are small at every rung: at N=2 they span −0.0119 to +0.0119, at N=8 they span −0.0176 to +0.0100, and at N=32 they span −0.0293 (qwen2.5-7b, SGLang ahead) to +0.0237 (phi-4, vLLM ahead). The largest single model-marginal cross-engine gap anywhere in the ladder is mistral-7b at N=16 (−0.0686), and even that washes out by N=32 to −0.0012. No model-marginal cross-engine delta exceeds ~0.07 at any concurrency level, and most sit under 0.03.

Observations. At the model-marginal granularity the two engines are nearly interchangeable — the per-model efficiency physics is a property of the model and the concurrency level, not of the backend. The cross-engine difference only becomes a Holm-significant directional signal once the marginal is decomposed by workload (§13), where the SGLang short_decode edge and the vLLM long_prefill_8k edge that partially cancel in the workload-mean are separated out. The model-marginal agreement is therefore the foil against which the §13 head-to-head is read: the engines agree on the average and differ only on the shape.

The model-marginal cross-engine agreement is the methodological payoff of the matched-SKU design stated at the model granularity: two engines, same hardware, same model, same concurrency, agree within ~3pp on the average. That agreement is exactly what makes the §13 workload-decomposed deltas trustworthy — they are not a baseline difference between two engines that disagree everywhere, but a workload-conditional structure riding on top of two engines that agree on the marginal. State of confidence: every cross-engine delta is arithmetic over per_model_efficiency_curves.vllm and .sglang (observed).

11. SS5. Gradual Continuous-Batching Decline versus the V1 Collapse

Section 10 established that the V3 curves are gradual; this section quantifies how gradual by setting them against the V1 pytorch_direct baseline, which is the catastrophic anchor the entire TR164 program was built to characterize. The contrast is the L2 thesis of the serving-stack line — that continuous batching does not merely improve the scaling curve, it relocates the breakdown boundary by more than an order of magnitude in concurrency — and it is the single most dramatic effect in this report. The V1 numbers below are cited from Technical_Report_164.md SS4, not recomputed here; the V3 numbers are the both-backend band from per_model_efficiency_curves.

V1 pytorch_direct, run on the local RTX 4080 Laptop substrate, exhibited a uniform-at-N=2 breakdown: median parallel efficiency across the model trio fell from 1.000 at N=1 to 0.547 at N=2, then 0.295 at N=4, 0.127 at N=8, and 0.056 at N=16 — at which point throughput was operationally indistinguishable from sequential execution. All 24 (model × workload × phase) combinations broke below the 0.65 threshold at N=2, the first batching step, and six (phase × workload × N=16) cell shapes on llama3.2-3b deterministically hung the dispatcher. That is the pessimistic anchor: a serving path that loses 45% of its efficiency the instant a second concurrent request arrives, and 94% of it by 16-way concurrency.

The side-by-side, V1's collapse against V3's both-backend per-model N=32 retention:

Reference point N=1 N=2 N=4 N=8 N=16 N=32
V1 pytorch_direct (trio median, RTX 4080) 1.000 0.547 0.295 0.127 0.056 (hung at N=16)
V3 vLLM 0.10.2 (per-model band, A100 PCIe) 1.000 0.953–0.983 0.885–0.941 0.787–0.859 0.572–0.718 0.470–0.619
V3 SGLang 0.5.12.post1 (per-model band, A100 PCIe) 1.000 0.959–0.981 0.889–0.938 0.795–0.860 0.641–0.729 0.471–0.600

Observations. The two regimes do not overlap at any concurrency level beyond N=1. At N=2, where V1 had already shed nearly half its efficiency to 0.547, both V3 backends remain above 0.95 across every model — a gap of roughly 40 percentage points at the very first batching step. The divergence then widens monotonically as V1 collapses and V3 declines gracefully: by N=8, V1 sits at 0.127 while the worst V3 cell is at 0.787, a better-than-six-fold ratio; by N=16, V1 is at 0.056 (and physically hanging on six cells) while the worst V3 model retains 0.572 — V3 holds roughly ten times the efficiency V1 had at the same concurrency. The most direct statement of the boundary shift: V3 retains 0.47–0.62 efficiency at N=32, a concurrency level V1 never reached because it had already collapsed to 0.056 and was hanging at N=16. Continuous batching moves the breakdown boundary up by more than an order of magnitude in concurrency (V3_SGLANG_RUN_FINDINGS §5.1).

This is the one place in the report where the effect genuinely is dramatic, and the framing must be kept clean: the V1-versus-V3 contrast is a categorical regime shift, not a small delta. V3 at N=32 sits an order of magnitude above where V1 had already failed at N=16. That magnitude is not to be confused with the cross-backend head-to-head deltas of Section 13, which are small (1.5–7.6pp) and workload-conditional. Two different magnitudes live in this report: the order-of-magnitude V1→V3 dispatcher effect, and the few-percentage-point vLLM-versus-SGLang routing signal. A reader who carries the order-of-magnitude framing into the head-to-head will badly overread the engine comparison; a reader who carries the few-points framing into this section will badly underread the continuous-batching result. The two must be held apart.

It is worth being explicit about what is and is not matched in this particular comparison. V1 ran on the local RTX 4080 Laptop on a different model trio (llama3.2-1b, qwen2.5-1.5b, llama3.2-3b) under the pytorch_direct dispatcher; V3 ran on cloud A100 80GB PCIe on the 7B–14B family set under vLLM and SGLang. The V1-versus-V3 comparison is therefore not hardware-matched or model-matched — it is a dispatcher-regime contrast, and the order-of-magnitude boundary shift it documents is attributable to the move from a per-thread pytorch_direct loop to a continuous-batching scheduler, consistent with the V2 H1 confirmation that the N=2 breakdown is dispatcher-specific rather than hardware-bound (Technical_Report_164_V2.md). The matched-hardware comparison in V3 is the vLLM-versus-SGLang head-to-head; the V1 contrast is the program-lineage anchor that gives the gradual V3 curves their meaning.

12. SS6. The Workload-Shaped p95 Boundary

The per-model curves of Section 10 average over workloads and so deliberately hide the dimension along which the boundary actually varies most: the traffic shape. This section reads the boundary by workload, using the p95 latency multiplier at N=32 versus N=1 — the factor by which tail latency inflates when the system goes from serving one request to serving thirty-two — alongside the companion mean parallel efficiency at N=32. Both quantities are read directly from workload_boundary in research/tr164/v3_paper_stats.json (observed). The headline of the section is that the boundary is workload-shaped, not uniform: continuous batching amortizes some traffic shapes almost perfectly and others only partially, and the difference is large and consistent across both backends.

The mean p95 multiplier and mean efficiency at N=32, by workload, on both backends:

Workload vLLM p95× @ N=32 vLLM eff @ N=32 SGLang p95× @ N=32 SGLang eff @ N=32
long_decode 1.41 0.715 1.40 0.723
repeated_prefix 1.80 0.563 1.76 0.579
balanced_2k 1.81 0.573 1.85 0.551
short_decode 2.43 0.449 1.96 0.539
long_prefill_8k 2.64 0.388 3.50 0.281

Observations. The workloads span the boundary, not cluster on it. At the flat end, long_decode inflates p95 by only 1.40–1.41× at 32-way concurrency on both backends and retains the highest efficiency of any workload (0.715 vLLM, 0.723 SGLang) — the decode-bound regime where there is little prefill to contend over and the continuous-batching scheduler amortizes near-ideally. At the steep end, long_prefill_8k is the worst tail on both backends (2.64× on vLLM, 3.50× on SGLang) and the lowest efficiency (0.388 vLLM, 0.281 SGLang) — the prefill-bound regime where 32-way concurrency forces the system to contend for memory bandwidth on large prompt tensors and the amortization regime breaks down. (Both long_prefill_8k workload-marginals average over all five models including the tokenizer-confounded Mistral cell, so the cross-backend gap here — and the sub-0.30 SGLang number in particular — is read clean only in the §13 paired-and-excluded delta; see §15.) Between those poles, repeated_prefix and balanced_2k sit in a middle band (p95 ~1.76–1.85×, efficiency ~0.55–0.58), and short_decode is the one workload where the two backends visibly part company at this granularity: vLLM inflates p95 to 2.43× and drops to 0.449 efficiency while SGLang holds 1.96× and 0.539 — the short-prompt regime that Section 13 reads as a Holm-significant SGLang edge. The p95-multiplier and efficiency columns are anti-correlated by construction and by mechanism: the workloads with the worst tail inflation are the workloads with the lowest retained efficiency, because both are downstream of the same fact — that 32-way concurrency on prefill-heavy traffic saturates the resource the scheduler cannot amortize.

The operative reading is that continuous batching works until the workload shape breaks the amortization regime (V3_SGLANG_RUN_FINDINGS §5.2). On decode-bound traffic the boundary is nearly flat — a 1.4× tail at 32-way concurrency is a benign, plannable degradation. On prefill-bound traffic the boundary bites hard — a 2.6–3.5× tail with sub-0.30 efficiency at the same concurrency is the regime where capacity planning must assume the scheduler will not save the deployment. A single p95 budget across all traffic shapes would be wrong in both directions: too conservative for long_decode and too optimistic for long_prefill_8k. The boundary is a surface over the workload axis, and the deployment must read it per shape.

The workload-marginal table above is the smoothed view; the boundary is in fact workload-shaped down to the individual (model × workload) cell, which is visible in boundary_matrix. The cell-level statistic that makes this concrete is first_n_below_0.65 — the first concurrency level at which a cell's efficiency drops below the 0.65 breakdown threshold, with a null value meaning the cell never breaks within the tested range N ≤ 32. The decode-bound and prefill-bound extremes are cleanly separated in this statistic.

A representative slice of the cell-level boundary, contrasting the flattest workload against the steepest on both backends:

Cell first N below 0.65 p95× @ N=32
vLLM phi-4 | long_decode never (null) 1.28
vLLM qwen2.5-7b | long_decode never (null) 1.44
vLLM qwen2.5-14b | long_decode never (null) 1.35
SGLang qwen2.5-7b | long_decode never (null) 1.32
SGLang qwen2.5-7b | long_prefill_8k N=8 3.79
SGLang llama3.1-8b | long_prefill_8k N=8 4.40
vLLM mistral-7b | long_prefill_8k N=16 3.24

Observations. The two workloads occupy opposite ends of the cell-level boundary with no overlap. Every long_decode cell shown never crosses below 0.65 within the tested range — phi-4 and qwen2.5-7b and qwen2.5-14b on vLLM, and qwen2.5-7b on SGLang, all carry first_n_below_0.65 = null and tails between 1.28× and 1.44× — these are cells where 32-way concurrency simply does not break the amortization regime. The long_prefill_8k cells break earliest of any workload: on SGLang, qwen2.5-7b and llama3.1-8b both fall below 0.65 by N=8, the earliest break in the entire matrix, with tails of 3.79× and 4.40×; vLLM's mistral-7b long_prefill cell breaks by N=16 with a 3.24× tail. The SGLang long_prefill cells breaking a full concurrency step earlier than the vLLM ones (N=8 versus the more common N=16) is the cell-level shadow of the workload-marginal gap (SGLang 3.50× versus vLLM 2.64× mean p95) that Section 13 reads as the long_prefill_8k head-to-head — read there over the four non-Mistral models only, because the Mistral long_prefill cell is tokenizer-confounded and is documented-and-excluded in Section 15.

At the cell granularity the workload-shaped boundary is unmistakable: long_decode cells dominate the never-breaks set, and long_prefill_8k cells dominate the breaks-earliest set, on both backends. This is the strongest form of the section's thesis — the boundary is not a single number, not even a single number per backend, but a surface over the (model × workload) grid whose shape is governed first by the workload and second by the model. The per-model curves of Section 10 and the workload-marginal table of this section are two projections of that same surface; the full per-cell substrate behind both is reported in Appendix C, and the cross-backend reading of the surface — where the two engines' boundaries actually differ — is the deliverable of Section 13.

The claim that the boundary is "governed first by the workload and second by the model" is quantifiable, and quantifying it is what licenses the report leading with the workload-shaped boundary rather than the model-size gradient. For each model, the N=32 efficiency spread across its five workloads — the gap between its best-case (always long_decode) and worst-case (almost always long_prefill_8k) cell — measures how much the workload axis moves a single model's efficiency at the top of the ladder. The per-model spreads, computed off boundary_matrix.{vllm,sglang}, are large on every family and on both backends.

Model vLLM N=32 best (long_decode) vLLM N=32 worst vLLM spread SGLang N=32 best (long_decode) SGLang N=32 worst SGLang spread
qwen2.5-7b 0.681 0.354 (long_prefill_8k) 0.327 0.758 0.252 (long_prefill_8k) 0.506
llama3.1-8b 0.686 0.307 (short_decode) 0.379 0.668 0.222 (long_prefill_8k) 0.445
mistral-7b 0.651 0.311 (long_prefill_8k) 0.339 0.645 0.297 (long_prefill_8k) 0.348
qwen2.5-14b 0.780 0.448 (long_prefill_8k) 0.332 0.774 0.320 (long_prefill_8k) 0.454
phi-4 0.778 0.466 (long_prefill_8k) 0.312 0.770 0.315 (long_prefill_8k) 0.456

Observations. The within-model workload spread is 0.31–0.38 on vLLM and 0.35–0.51 on SGLang — in every case larger than the model-size gradient (the ~0.11–0.15 gap between the 14B-class and 7–8B-class model-marginals at N=32 from §10). The worst-case cell is long_prefill_8k for nine of the ten model×backend pairs; the single exception is vllm."llama3.1-8b", whose worst N=32 cell is short_decode at 0.307 rather than its long_prefill cell — the per-cell signature of llama3.1-8b being the family that anchors the §13 short_decode SGLang edge. SGLang's spreads run systematically wider than vLLM's (0.45–0.51 versus 0.31–0.38 on the four non-mistral families) because SGLang's long_prefill_8k cells bottom out lower — qwen2.5-7b at 0.252 and llama3.1-8b at 0.222 at N=32, the two deepest cells in the entire matrix — which is the cell-level root of the §13 vLLM long_prefill edge.

The spread table is the quantitative form of "workload first, model second." A single model's efficiency at N=32 moves by 0.31–0.51 depending only on which workload it serves, which is two-to-four times the 0.11–0.15 model-size gradient that moves efficiency between models on a fixed workload. That ordering is not a stylistic preference in how the report is sequenced — it is a measured fact about which axis dominates the boundary, and it is the reason the headline is the workload-shaped boundary and the model-size gradient is the rider. State of confidence: every spread is arithmetic over the N=32 column of boundary_matrix.{vllm,sglang} (observed); the comparison to the §10 model-size gradient is arithmetic over per_model_efficiency_curves.


13. SS7. Results II — vLLM vs SGLang: Per-Workload Holm + Bootstrap Deltas

This is the load-bearing section of the report. Everything before it — the gradual continuous-batching decline (§11), the per-model efficiency gradient (§10), the workload-shaped p95 boundary (§12) — characterizes each backend on its own terms. This section is the contrast V2 set out to deliver but could not, because V2's vLLM and SGLang ran on different A100 SKUs (PCIe vs SXM) and the head-to-head was hardware-confounded. V3 closes that confound by construction: both backends ran on matched A100 80GB PCIe (provenance.hardware), so every per-workload efficiency delta below is hardware-clean — no bandwidth adjustment, no crossover anchor, no SXM/PCIe ratio to apply (the retirement of that adjustment is the subject of §14).

The statistical design is the paired estimator the V2 reframe established, now run cleanly on five model families. For each of the five workloads we form per-cell parallel-efficiency pairs (vLLM minus SGLang at the matched model × N × rep), compute a paired Wilcoxon signed-rank test on those deltas, apply a Holm-Bonferroni stepdown across the five-workload family to control the family-wise error rate, and attach a bootstrap 95% confidence interval on the mean delta (head_to_head_by_workload.* fields: n_pairs, mean_delta_vllm_minus_sglang, median_delta, boot_ci95_delta, wilcoxon_p_raw, wilcoxon_p_holm, holm_significant_0.05). The sign convention is fixed throughout: a negative delta means SGLang is the more efficient backend on that workload; a positive delta means vLLM is. The decision rule pairs two independent signals — a Holm-adjusted p below 0.05 and a bootstrap CI that excludes zero — so a workload only counts as a directional finding when both agree.

Two things are worth watching for as the table is read: the direction of the delta splits by workload rather than pointing one way, and exactly one workload (balanced_2k, the mixed-traffic reference) lands as a tie — the shape of a routing signal, not of an engine ranking. The full head-to-head table follows. All values are observed, traced to research/tr164/v3_paper_stats.jsonhead_to_head_by_workload.

Workload n pairs Mean Δ (vLLM − SGLang) Median Δ Bootstrap CI₉₅ Wilcoxon p (raw) Holm-adj p Holm-sig @ 0.05 Mistral excl. Directional read
short_decode 25 −0.0349 (−3.49pp) −0.0299 [−0.0544, −0.0176] 0.000912 0.001824 no SGLang edge
long_decode 25 −0.0349 (−3.49pp) −0.0120 [−0.0570, −0.0172] 0.000162 0.000649 no SGLang edge
repeated_prefix 25 −0.0148 (−1.48pp) −0.0070 [−0.0227, −0.0076] 0.000430 0.001289 no SGLang edge
balanced_2k 25 +0.0022 (+0.22pp) +0.0071 [−0.0100, +0.0140] 0.652841 0.652841 no Tie (CI straddles 0)
long_prefill_8k 20 +0.0757 (+7.57pp) +0.0864 [+0.0546, +0.0961] 0.000010 0.000048 yes vLLM edge (4-model)

Observations. Four of the five workloads produce a Holm-significant directional finding with a bootstrap CI that excludes zero; the fifth, balanced_2k, is a clean statistical tie. The three SGLang edges (short_decode, long_decode, repeated_prefix) all carry negative deltas with upper CI bounds well below zero (−0.0176, −0.0172, −0.0076 respectively), and Holm-adjusted p-values between 6.5 × 10⁻⁴ and 1.8 × 10⁻³ — small effects, but stable ones. The single vLLM edge (long_prefill_8k, +7.57pp, Holm-adj p = 4.8 × 10⁻⁵) is the largest delta in the table and is also the one workload reported over 20 pairs rather than 25: Mistral × long_prefill_8k is excluded because of the tokenizer asymmetry dissected in §15, so that row compares the four non-Mistral families only. The balanced_2k mean delta of +0.0022 (0.22pp) is the smallest in magnitude, its CI [−0.0100, +0.0140] straddles zero symmetrically, and its raw Wilcoxon p of 0.65 carries through Holm unchanged — there is simply no signal there to adjust. The median deltas track the means in sign on every workload, which rules out a single outlier cell driving any of the directional findings.

The honest reading of this table is small but consistent. Every directional delta here is a single-digit-percentage-point effect — 1.5pp on repeated_prefix, 3.5pp on the two decode-shaped workloads, 7.6pp at the prefill extreme — not the order-of-magnitude swings that separate continuous-batching backends from V1's pytorch_direct collapse (§11). What makes them publishable is not their size but their stability under paired statistics on matched hardware: the bootstrap CIs are tight and exclude zero, and the Holm stepdown survives the multiple-comparisons correction across the five-workload family. The correct interpretation is a workload-conditional routing signal, not an engine ranking. The deltas split direction by workload — SGLang ahead on the three decode/prefix-reuse-shaped traffic patterns, vLLM ahead on the one prefill-bound extreme, dead even on mixed traffic — which is precisely the shape of a "route this workload to that backend" finding, and precisely not the shape of "backend X is faster than backend Y." Any sentence in the downstream substrate that reads "vLLM beats SGLang" or "SGLang beats vLLM" is forbidden by this table (see §19); the licensed claim is the per-workload routing map.

A note on the n=20 versus n=25 asymmetry, since it is the one structural irregularity in the table. Four workloads pair all five model families across the matched N × rep grid that survives the per-cell efficiency join, yielding 25 pairs each. The long_prefill_8k row drops to 20 because Mistral's five contributing pairs are removed: on this workload the two backends do not tokenize Mistral's prompts to the same length (mistral_common on vLLM versus HF AutoTokenizer on SGLang produce a 1.162× token-count ratio — §15), so the Mistral cell is not an apples-to-apples cross-backend comparison and is documented-and-excluded rather than silently averaged in. The +7.57pp vLLM edge on long_prefill_8k is therefore a four-family finding, and the report never claims it for Mistral. This exclusion is the single asterisk on the head-to-head, and it is carried explicitly forward into the limitations (§18, L1) and forbidden claims (§19, FC2).

14. SS8. Balanced Tie, RadixAttention Prefix Edge, and the Same-SKU Retirement of the Bandwidth Adjustment

The §13 table supports three distinct readings, and it is worth separating them because they license different downstream claims. The first is a null, the second is a workload-conditional architecture effect that reproduces V2's qualitative finding at five models, and the third is the methodological advance that makes this entire head-to-head load-bearing where V2's was not.

A compact restatement at the operating point — mean parallel efficiency at N=32, the concurrency level where the backends have diverged most — frames all three readings. These per-workload means are the N=32 marginals behind the paired deltas of §13, traced to workload_boundary.{vllm,sglang}.*.mean_efficiency_at_n32.

Workload vLLM eff @ N=32 SGLang eff @ N=32 Δ (vLLM − SGLang) Reading
long_decode 0.7151 0.7231 −0.0080 Near-tie; SGLang edge in the paired test
balanced_2k 0.5733 0.5514 +0.0219 Tie (paired CI straddles zero)
repeated_prefix 0.5629 0.5788 −0.0159 SGLang edge — RadixAttention prefix reuse
short_decode 0.4493 0.5388 −0.0895 SGLang edge on short prompts
long_prefill_8k 0.3875 0.2812 +0.1063 vLLM edge — Mistral-tokenizer-confounded

Observations. At the N=32 operating point the backends are within a hair on decode-heavy and balanced traffic (long_decode −0.0080, balanced_2k +0.0219) and pull apart only at the two workload extremes (short_decode SGLang ahead by ~9pp, long_prefill_8k vLLM ahead by ~11pp before the Mistral exclusion is applied). The long_prefill_8k N=32 marginal of +0.1063 is larger than the paired-test mean delta of +0.0757 from §13 precisely because this marginal still includes the confounded Mistral cell, where SGLang's HF tokenizer inflates the prompt length and depresses its measured efficiency; the §13 paired estimator excludes Mistral and reports the cleaner +0.0757 over four families. This is exactly why the report leads with the §13 paired-and-excluded number for the licensed claim and uses the N=32 marginal here only for the workload-shape narrative. The qualitative shape is unambiguous: the cross-backend difference is workload-shaped, near-zero in the middle of the traffic distribution and signed at the prefill/short-prompt edges.

The single sentence that captures the head-to-head is: the backends are near-identical on decode-heavy and balanced traffic and diverge on the prefill and short-prompt extremes — SGLang on short_decode, vLLM (apparently) on long_prefill — now established on five model families on matched hardware. That is a backend-routing map, not a leaderboard, and the matched A100 80GB PCIe substrate is what lets us state it without an asterisk on the hardware.

The balanced tie — the honest null. balanced_2k returns a mean delta of +0.0022, a bootstrap CI of [−0.0100, +0.0140] that straddles zero, and a Holm-adjusted p of 0.65 (head_to_head_by_workload.balanced_2k). On the mixed-traffic reference workload — the one most representative of an undifferentiated production request stream — vLLM and SGLang are statistically equivalent. This null is load-bearing in its own right: it tells a deployer that the backend choice is immaterial for balanced traffic, and it disciplines the directional findings by demonstrating that the pipeline does not manufacture a signal where none exists. A study that reported four wins and zero ties would invite the suspicion that its estimator is biased toward detecting differences; the balanced_2k tie is the calibration that the four directional findings earn their significance honestly.

The RadixAttention prefix edge — workload-conditional, reproducing V2. repeated_prefix returns a SGLang edge of −0.0148 (Holm-adj p = 0.001289, CI [−0.0227, −0.0076]). This is the prefix-reuse advantage of SGLang's RadixAttention prefix cache, and it is workload-conditional: it appears on the one workload deliberately constructed to expose prefix-cacheable traffic and does not generalize to the decode-bound or mixed workloads. V2 found exactly this — on its smaller single-model grid, SGLang's RadixAttention benefit was Holm-significant on repeated_prefix (+3.7pp) and confined to prefix-cache-friendly workloads, with V2's interpretive reading that "SGLang does not generalize-dominate vLLM; it specializes-dominates on prefix-cacheable traffic" (Technical_Report_164_V2.md §3.5). V3 reproduces that qualitative direction at five model families: the prefix edge is real, it is small (1.48pp here versus V2's 3.7pp on its different SKU and model), and it is bounded to the prefix-reuse surface. State of confidence: observed, and qualitatively consistent with the V2 finding — not a controlled replication (V3 is a standalone extension, §17), but a directionally confirmatory one.

The same-SKU retirement of the bandwidth adjustment — the methodological advance. This is why V3's head-to-head is the load-bearing comparison and V2's was not. In V2, the raw vLLM-vs-SGLang contrast showed SGLang ahead by 1.71pp, but the two backends ran on different hardware — SGLang on SXM (~2.0 TB/s HBM), vLLM on PCIe (~1.5 TB/s HBM) — so V2 had to apply a bandwidth adjustment, dividing raw vLLM efficiency by the 0.75 SXM/PCIe bandwidth ratio, which inverted the result to a +24.55pp vLLM lead and forced the standing caveat that "a clean vLLM-vs-SGLang efficiency claim without the bandwidth-adjustment caveat is not available" (Technical_Report_164_V2.md §3.4, §8). V3 retires that adjustment entirely. Both backends ran on matched A100 80GB PCIe (provenance.hardware = "A100 80GB PCIe (both backends — matched SKU, no bandwidth adjustment needed)"), so the §13 deltas are hardware-clean by construction — there is no SXM/PCIe ratio to apply, no bandwidth confound to isolate, and no marketing-grade comparison to displace. The V2 bandwidth-adjustment sensitivity check is not refined in V3; it is made unnecessary.

The retirement of the bandwidth adjustment is the quiet methodological headline. V2's vLLM-vs-SGLang number was only interpretable after a counterfactual hardware correction, and a number that needs a counterfactual to be read is exactly the kind of comparison that invites disagreement about the size of the correction. V3's number needs no correction: identical GPU SKU, identical memory bandwidth, identical PCIe substrate, identical analysis pipeline — the only deliberate variable is the backend, and the per-workload deltas are therefore attributable to the backend (its scheduler, its KV management, its RadixAttention prefix cache) and nothing else. This is what "load-bearing" means here: the §13 table is the first matched-hardware vLLM-vs-SGLang head-to-head in the V1→V2→V3 arc, and it is the table the downstream serving-stack-physics substrate is built to carry.


15. SS9. Mistral Tokenizer Asymmetry: Overflow, Refill, and the Document-and-Exclude Decision

Every cross-backend comparison in this report rests on a single comparability premise: that a given (model, workload) prompt is the same prompt on both engines. For 24 of the 25 (model × workload) combinations that premise holds without qualification — the prompts are generated once, the identical character strings are dispatched to vLLM and to SGLang, and both engines route them through the same Hugging Face AutoTokenizer. One combination breaks the premise, and it breaks it in a way that is invisible at the cell-completion level and only surfaces under direct tokenization probing: Mistral-7B on long_prefill_8k. This section walks that one cell end to end — the overflow symptom, the verified root cause, the targeted refill that recovered the SGLang data, the cross-backend tokenizer asymmetry that makes the cell non-comparable, and the document-and-exclude decision that scopes it out of the long_prefill_8k head-to-head. It is the load-bearing engineering-rigor disclosure of V3, and it is the reason head_to_head_by_workload.long_prefill_8k.mistral_excluded = true with n_pairs = 20 instead of the 25 every other workload carries.

15.1 The Overflow Symptom (SGLang)

The SGLang main-track grid was dispatched as a single 900-cell run (results/20260612_212816_266127). Twenty-four of the 25 (model × workload) combinations returned clean — 90,720 ok rows, zero errors. The twenty-fifth, Mistral-7B × long_prefill_8k, returned HTTP 400 on all 3,780 requests spanning its 36 cells (6 N-levels × 3 reps × 2 phases). The failure was all-or-nothing: not a tail of dropped requests under load, but a uniform 400 across every concurrency level from N=1 to N=32. That uniformity is the signature of a request-shape rejection at the server boundary — the prompt was refused before it ever entered the batching scheduler — rather than a saturation or timeout effect that would have spared the low-N cells (V3_SGLANG_RUN_FINDINGS §4, observed).

The dangerous property of this failure is that it was not visible at the cell-completion gate. The TR164 manifest marks a cell complete when it writes its expected rows; the 36 Mistral × long_prefill_8k cells did write rows — 3,780 of them, each carrying an http_400 status field. The cell-level manifest therefore reported the run as 900/900 complete while 4.0% of the request mass (3,780 of 94,500) had failed. The error was only recoverable at the row level, by inspecting per-request status rather than per-cell completion. This is the cell-completion ≠ request-success catch, and §16 treats it as a first-class verification lesson in its own right.

15.2 Root Cause: Tokenizer Density Against a Char-Matched Prompt

The root cause was established not by inference but by a direct tokenization probe run on the pod (V3_SGLANG_RUN_FINDINGS §4, observed). The long_prefill_8k workload synthesizes prompts to a target token budget of roughly 8,704 tokens using a generic ~1.3 char/token heuristic to convert that budget into a character count. The character string is then frozen and reused across all five model families. Because the character string is fixed but tokenizers differ in density, the actual token count of that one string varies by model. Qwen2.5, Llama-3.1, and Phi-4 all tokenize the synthetic repeated-paragraph text to within the server's 8,768-token context ceiling (8,704 prompt-high + 64 max-new) and passed. Mistral-v0.3's tokenizer is denser on this text: the on-pod probe measured actual token counts up to 10,031 tokens across seeds {164, 1, 2, 3, 42}, overflowing the 8,768 ceiling by more than 1,200 tokens and tripping the server's context-length rejection — the HTTP 400.

This is a model-specific tokenizer × context interaction, not a pipeline fault. The prompt generator, the dispatcher, the analysis pipeline, and the server configuration were all identical to the 24 clean combinations; the only variable that moved was the tokenizer's per-character token yield, which is a property of the Mistral vocabulary and merge table, not of the harness. The distinction matters for the report's framing: a pipeline fault would invalidate the cell, whereas a tokenizer-density interaction is a real property of running char-matched prompts across heterogeneous tokenizers — the kind of confound that must be documented and scoped, not patched away.

Observations. The mechanism is fully accounted for: a char-matched (not token-matched) prompt design crossed with a denser-than-baseline tokenizer produces a token count that exceeds the context ceiling sized for the nominal target. The 8,704-target heuristic assumed ~1.3 char/token; Mistral-v0.3 ran denser than that assumption, and the ~1,200-token overshoot is exactly the gap between the heuristic and the realized density. Nothing about the failure is stochastic — it is deterministic in the seed set and reproducible on the probe.

The lesson generalizes past Mistral: any benchmark that fixes a character budget and dispatches the identical string across tokenizers is silently varying the token workload model-by-model. A context ceiling sized for the median tokenizer will reject the densest one. The defensible designs are either to token-match the prompts per model (regenerate to a fixed token budget) or to size the context ceiling for the worst-case tokenizer and then report the residual per-model token-count spread as a comparability caveat. V3 took the latter route for the refill and the former discipline for the head-to-head exclusion.

15.3 The Targeted Refill

The overflow was recovered with a targeted re-run rather than a full re-dispatch (V3_SGLANG_RUN_FINDINGS §4, observed). Run 20260613_055519_963364 re-executed Mistral-7B × long_prefill_8k only, with a configuration identical to the main grid except for one override: default_max_model_len raised from 8,768 to 11,264 (mistral_tokenizer_asymmetry.context_ceiling_default = 8768, mistral_tokenizer_asymmetry.sglang_refill_context = 11264). The 11,264 ceiling provides roughly 1,170 tokens of headroom over the observed worst-case sequence length (max prompt + max-new ≈ 10,095), which is enough to clear the densest seed without inflating the ceiling beyond what the measurement needs.

The result was 3,780 / 3,780 ok, including the previously-failing N=32 case at 320/320. The 3,780 http_400 rows in the master metrics.csv were replaced in place by the 3,780 ok rows, and the derived analysis.json / cell_summary.csv were regenerated from the spliced master. Provenance for the splice is recorded in results/20260612_212816_266127/REFILL_NOTE.md, and the pre-refill state is preserved untouched in results/20260612_212816_266127_BACKUP_preRefill/ for audit.

The critical methodological point about the refill is that the context-ceiling bump is benign for the measurement itself. A server KV-cache ceiling that is greater than or equal to the actual sequence length does not alter per-request latency or throughput — the ceiling governs admission, not execution. Raising the ceiling from 8,768 to 11,264 changed which requests the server accepts; it did not change how fast it processes the ones it accepts. The refilled Mistral × long_prefill_8k cells are therefore valid as a within-SGLang measurement of Mistral on this workload. What they are not valid for is a cross-backend comparison against vLLM — and that is a separate problem, addressed next.

15.4 The Cross-Backend Tokenizer Asymmetry

The vLLM side of the grid did not hit the overflow (results/20260612_204254_036590, Mistral × long_prefill_8k = 3,780 ok, no refill needed; V3_CROSS_BACKEND_FINDINGS §1, observed). The reason is the load-bearing comparability caveat of the entire V3 head-to-head: the two backends tokenize Mistral's prompts differently. The V3 configuration routes vLLM's Mistral through --tokenizer-mode mistral, which invokes the mistral_common tokenizer (Mistral's own reference implementation), while SGLang used the Hugging Face AutoTokenizer. On the identical char-matched long_prefill_8k prompts, those two tokenizers yield materially different token counts:

Backend Tokenizer prompt_tokens min mean max n
vLLM mistral_common 7,629 8,148 8,644 1,890
SGLang HF AutoTokenizer 8,860 9,465 10,039 1,890

The two distributions do not overlap. The vLLM mistral_common maximum (8,644) sits below the SGLang HF minimum (8,860); every single one of the 1,890 prompts tokenizes longer under the HF AutoTokenizer than the longest prompt does under mistral_common. The mean ratio is sglang_over_vllm_token_ratio = 1.162 — SGLang's tokenizer produces about 16% more tokens for the same character string (mistral_tokenizer_asymmetry, observed). This is the direct mechanical explanation for the asymmetric overflow: under mistral_common the worst-case Mistral prompt is 8,644 tokens and fits comfortably under the 8,768 ceiling, so vLLM never overflowed; under the HF AutoTokenizer the worst-case is 10,039 tokens, which is why SGLang overflowed and needed the 11,264 refill. (This 10,039-token max is the full n=1890 production-run distribution; it differs trivially from the 10,031-token max quoted in §15.2, which is the 5-seed on-pod overflow probe — two distinct samples of the same denser-tokenizer phenomenon, not a discrepancy.)

Observations. Read across the two rows, the asymmetry is total, not marginal: the per-prompt token counts are disjoint between the two backends. That disjointness propagates straight into the efficiency measurement, because efficiency on long_prefill_8k is dominated by prefill work and prefill work scales with token count. SGLang is doing ~16% more prefill tokens per request on this one cell than vLLM is, for prompts that are character-identical. Any naive efficiency delta on Mistral × long_prefill_8k therefore reads tokenizer density and engine architecture as a single composite signal — the same category of error the V2 report flagged when it separated SXM-vs-PCIe bandwidth from RadixAttention. Here the confound is tokenizer density rather than memory bandwidth, but the discipline is identical: the composite signal must be decomposed or the cell must be excluded.

The honest reading is that SGLang's lower efficiency on Mistral × long_prefill_8k partly reflects more prefill work, not backend inferiority. The two engines were never given the same token workload on this cell, so the cell cannot adjudicate engine architecture. It is the one place in the 25-combination grid where the "same prompt on both engines" premise fails, and the failure is intrinsic to running char-matched prompts across a tokenizer pair of differing density — not an artifact of the refill, which only changed admission, not the token counts.

15.5 The Document-and-Exclude Decision

Two options were on the table (V3_CROSS_BACKEND_FINDINGS §3). Option (b), harmonize, would re-run one side so both backends use the same Mistral tokenizer (SGLang under mistral_common, or vLLM under HF), producing a clean head-to-head cell at a cost of ~30–60 minutes of pod time per side. Option (a), document-and-exclude, keeps both runs as-is, footnotes the Mistral × long_prefill_8k cell, and reports the long_prefill_8k head-to-head over the four non-Mistral model families. V3 took option (a) — document-and-exclude — for two reasons. First, it costs zero additional compute and introduces no new run to verify. Second, and more importantly, the four-model long_prefill_8k contrast is already clean: Qwen2.5-7B, Llama-3.1-8B, Qwen2.5-14B, and Phi-4 all route through the HF AutoTokenizer on both backends, so their prompts are token-matched and the comparison is apples-to-apples. Excluding the single confounded model preserves a valid four-model head-to-head; harmonizing would have recovered the fifth at a cost that the workshop-tier scope does not require.

The mechanical consequence of the decision is visible directly in the statistics. The long_prefill_8k head-to-head carries n_pairs = 20 rather than the 25 every other workload carries, and mistral_excluded = true (head_to_head_by_workload.long_prefill_8k). The 20-vs-25 gap is exactly the five (model-rep) pairs that Mistral would have contributed had it been comparable. The reported long_prefill_8k delta — mean Δ +0.0757 (vLLM−SGLang), bootstrap 95% CI [0.0546, 0.0961], Holm-adjusted p 4.8×10⁻⁵ — is therefore the vLLM edge over the four non-Mistral families, not a five-model claim. Section 13 carries this delta into the head-to-head table with the exclusion noted; the magnitude-honesty framing there reads it as a workload-conditional routing signal over the four comparable models, never as a Mistral-inclusive engine ranking.

Observations. The exclusion is surgical rather than wholesale. It removes exactly one model from exactly one workload — the only cell where the comparability premise fails — and leaves the other 24 combinations, and the four-model long_prefill_8k contrast, fully intact. The refilled SGLang Mistral data is retained in the substrate (it is a valid within-SGLang measurement) and is excluded only from the cross-backend delta on that one workload. The distinction between "drop the data" and "scope the comparison" is the entire point: no data is discarded, one comparison is bounded.

The forbidden claim that this section forecloses is "Mistral is slower than the other models on long prefills." That claim is not supported by V3 because Mistral was never measured on a token-matched long_prefill_8k prompt against the others — it processed ~16% more prefill tokens for the same character budget. The licensed claim is the narrow one: on char-matched long_prefill_8k prompts, Mistral-v0.3's tokenizer is denser, Mistral does proportionally more prefill work, and its cross-backend efficiency delta is therefore documented-and-excluded rather than reported. Section 19 carries this as forbidden-claim FC2.


16. SS10. Data Verification: md5 Byte-Integrity and the Row-vs-Cell-Level Catch

Integrity verification is treated as a first-class result in V3, not as a checklist appendix, because two of the three things that could have silently corrupted the head-to-head — a truncated or mangled transfer off the pod, and a cell-complete-but-request-failed cell — were caught only because the verification was run at the right granularity. This section reports the byte-level md5 integrity checks that confirm the local substrate is identical to what ran on the pods, the final-data-state tables for both backends, and the row-vs-cell-level catch that surfaced the 3,780 Mistral http_400 errors that §15 then recovered. The durable lesson — that cell completion is necessary but not sufficient, and request success must be audited at the row level — is the same silent-success failure class as the §9 environment failure #4 and the TR164 V1 nsys incident, and it is the reason the V3 driver verifies at the manifest level rather than trusting an exit code.

16.1 md5 Byte-Identity, Local vs Pod

The first integrity gate is byte-identity between the metrics files as they were written on the pod and as they landed in the local repository after transfer. Each metrics.csv was md5-hashed on the pod immediately after the run completed and again locally after the pull; the two hashes must match exactly, byte for byte. All three files passed (V3_SGLANG_RUN_FINDINGS §2, V3_CROSS_BACKEND_FINDINGS §1, observed):

File Backend Local md5 Pod md5 Match
metrics.csv (main 900-cell) vLLM 540adec5… 540adec5…
metrics.csv (main 900-cell) SGLang 5fd15755… 5fd15755…
metrics.csv (mistral refill) SGLang e0400837… e0400837…

Observations. Three files, three exact matches. The vLLM main grid, the SGLang main grid, and the SGLang Mistral refill all hash identically local-vs-pod, which closes the transfer-corruption failure mode: nothing was truncated, re-encoded, or line-ending-mangled in the pull. The refill file is hashed separately from the main SGLang grid because it was generated by a distinct run (20260613_055519_963364) and spliced into the master afterward; verifying it independently means the splice provenance is auditable end to end rather than only at the merged-master level.

Byte-level md5 is the cheapest and strongest integrity gate available for tabular result data, and it is the one that most often gets skipped under deadline pressure because "the file looks fine." A file that looks fine can be silently truncated at a transfer boundary or re-encoded by a tool in the chain; md5 catches both for the cost of one hash per file. The discipline here is that every metrics.csv that feeds a reported number was hashed on both ends before any analysis ran on it.

16.2 Final Data State, Both Backends

The second integrity gate is the final-data-state audit: every cell complete, every expected row present, every request status ok, every (model × workload) combination clean, and zero degenerate cells. Both backends passed at identical totals (V3_SGLANG_RUN_FINDINGS §2, V3_CROSS_BACKEND_FINDINGS §1, observed):

Check vLLM SGLang
Cells complete 900 / 900 900 / 900
Rows (= theoretical max) 94,500 94,500
Request status 94,500 ok / 0 error 94,500 ok / 0 error
(model × workload) clean 25 / 25 25 / 25
Degenerate cells (zero/none throughput) 0 0
ok_rate per cell (min / max) 1.000 / 1.000 1.000 / 1.000

The aggregate is 1,800 cells / 189,000 rows / ok_rate 1.0, which is the substrate-scale headline carried in §1 (provenance.cells_per_backend = 900, provenance.rows_per_backend = 94500, provenance.ok_rate = 1.0). The 94,500-row figure is the theoretical maximum for the cell shape — 900 cells × 105 requests per cell averaged across the N-grid — meaning no request was dropped on either backend. The SGLang 94,500 ok includes the 3,780 refilled Mistral × long_prefill_8k rows; the pre-refill SGLang state carried 90,720 ok + 3,780 http_400 = 94,500 rows written, which is precisely the trap §16.3 dissects.

Observations. The two backends land on byte-for-byte-identical row totals and per-cell ok_rates, which is what a matched-SKU, matched-cell-shape design should produce when both halves complete cleanly. The ok_rate per cell min 1.000 max 1.000 line is the strongest single summary: there is no cell anywhere in either 900-cell grid that lost even one request after the refill. The "degenerate cells = 0" check is a separate guard against the silent-zero failure mode (a cell that completes and writes rows but reports zero or null throughput because the server stalled mid-run); zero degenerate cells means every cell carries real throughput numbers, not placeholder zeros.

The final-data-state table is what licenses every efficiency and head-to-head number downstream. A 100% ok_rate across 189,000 requests on matched hardware is the substrate quality that makes the small 1.5–7.6 pp head-to-head deltas trustworthy: there is no missing-data imputation, no degenerate-cell exclusion, and no per-backend completion asymmetry that could masquerade as an engine difference. The deltas are differences between two complete, byte-verified grids, not between two partially-failed ones.

16.3 The Row-vs-Cell-Level Catch

The durable lesson of V3's verification is that the right integrity check at the wrong granularity reports success on a failed run. The cell-level manifest gate — which marks a cell complete when it has written its expected number of rows — reported the original SGLang grid as 900/900 complete. That report was true and useless: the 36 Mistral × long_prefill_8k cells had written their expected 3,780 rows, but every one of those rows carried an http_400 status. Cell completion was satisfied; request success was not. The 4.0% error mass (3,780 of 94,500 requests) was invisible to the cell-level gate and visible only to a row-level audit that inspected the per-request status field rather than the per-cell completion flag (V3_SGLANG_RUN_FINDINGS §4, observed).

This is the same silent-success failure class that recurs across the TR164 program. It is the §9 environment failure #4, where run.py exited 0 and wrote manifest status:"complete" even when every cell was group_failed because the server never came up. It is the TR164 V1 nsys incident, where a profiler wrapper captured zero events and exited 0, reporting success on a run that measured nothing. The common structure is a coarse-grained success signal (exit code, cell-completion flag) that returns green while a fine-grained failure (group-failed cells, zero captured events, http_400 rows) sits underneath it. The defense is always the same: never trust the coarse signal; verify at the granularity where the failure actually lives.

Observations. The catch worked because the verification pipeline does not stop at cell completion. After the cell-level manifest confirms 900/900, the row-level audit sums per-request status across all 94,500 rows and asserts 94,500 ok / 0 error. On the original SGLang grid that assertion failed — 90,720 ok, 3,780 http_400 — and the failure pointed directly at the Mistral × long_prefill_8k combination, which §15 then root-caused and refilled. Had the verification stopped at the cell-completion gate, the 3,780 http_400 rows would have entered the analysis as missing or zero-throughput data on exactly the cell that the tokenizer asymmetry already makes fragile, and the long_prefill_8k head-to-head would have been computed over corrupted input.

The generalizable rule is that cell completion is necessary but not sufficient for data integrity, and request success must be audited at the row level. A cell that writes rows is not a cell that succeeded; a run that exits 0 is not a run that ran; a profiler that returns is not a profiler that captured. Each of these is a coarse-grained green light over a fine-grained red, and the TR164 program has now hit the pattern three times — V1 nsys, V3 environment failure #4, and V3 Mistral http_400. The V3 driver's manifest-level + row-level verification is the institutional response: the green signal is only issued after the granularity where the failure lives has been checked, never before.

16.4 Why Verification Is a Result, Not a Footnote

The three gates in this section — md5 byte-identity, final-data-state, and the row-level audit — are reported as a numbered result rather than relegated to an appendix because each one foreclosed a specific way the head-to-head could have been silently wrong. The md5 gate forecloses transfer corruption; the final-data-state gate forecloses missing-row and degenerate-cell contamination; the row-level audit forecloses the cell-complete-but-request-failed trap that would otherwise have poisoned the most fragile cell in the grid. Together they are the reason the V3 head-to-head deltas — small, Holm-significant, bootstrap-bounded — can be read as engine-and-workload signal on a clean, complete, byte-verified, matched-hardware substrate, rather than as artifacts of a partially-failed or partially-corrupted run.

Observations. The verification layer is load-bearing precisely because the headline deltas are small. When an effect is large, a few percent of corrupted or missing data rarely flips the conclusion; when the head-to-head deltas are 1.5–7.6 pp, a 4.0% slug of http_400 rows on the wrong cell can manufacture or erase a "difference" out of nothing. The same magnitude honesty that governs how the deltas are reported (§13) governs why the substrate underneath them is verified to byte-identity here.

The verification discipline and the magnitude-honesty discipline are two faces of the same commitment: small, real effects on a clean substrate are worth more than large, plausible effects on an unaudited one. V3 earns its head-to-head by making the substrate auditable end to end — byte-identical to the pod, complete to the theoretical row maximum, and clean at the row level after one documented refill — before it reports a single delta.


17. SS11. Cross-Run Lineage: V1 Catastrophic, V3 Gradual, V2 Consistency

V3 is not a standalone benchmark dropped onto a fresh substrate; it is the third entry in a three-report lineage that has, across V1, V2, and V3, moved the same question — where does in-process and continuous-batching serving break down under concurrency, and what mechanism drives the break — from a single-backend catastrophe report to a matched-SKU cross-backend characterization. Reading V3's numbers in isolation risks two opposite errors: under-reading the headline (the V1→V3 contrast is an order-of-magnitude effect, not a marginal one) and over-reading the head-to-head (the vLLM-vs-SGLang deltas are small, workload-conditional, and split direction by workload). This section positions V3 against its predecessors so that neither error is available to a downstream reader. All V1 and V2 numbers in this section are cited from the prior reports, not recomputed here; every V3 number traces to research/tr164/v3_paper_stats.json.

17.1 Three points on the serving-stack physics axis

The three reports occupy three distinct points on the serving-stack physics axis. V1 (PublishReady/reports/Technical_Report_164.md) is the pessimistic anchor: an in-process pytorch_direct backend on a consumer RTX 4080 Laptop that exhibits a uniform N=2 breakdown across all twenty-four (model × workload × phase) combinations, with median parallel efficiency collapsing from 1.000 at N=1 to 0.547 at N=2, 0.295 at N=4, 0.127 at N=8, and 0.056 at N=16 — a 95% loss of nominal scaling headroom by the highest concurrency V1 tested. Six (phase × workload × N=16) cell shapes on llama3.2-3b hung the dispatcher deterministically for 3.3-to-10 hours each, and median time-to-first-token on llama3.2-1b at N=16 balanced_2k reached 188.4 seconds. V3 (this report) is the gradual-decline result: on two continuous-batching backends across five model families on matched A100 80GB PCIe, parallel efficiency at N=32 retains 0.47-0.62 (per_model_efficiency_curves), with no deterministic hangs and 100% request success across 1,800 cells. V2 (Technical_Report_164_V2.md) sits between them as the bridge: it confirmed continuous batching eliminates the V1 breakdown across three backends but could not deliver a hardware-clean vLLM-vs-SGLang head-to-head because its cloud arc straddled A100 SXM and A100 PCIe.

What each leg added to the program is worth stating as a progression, because the three reports are not three independent studies but three tightenings of a single open caveat. V1 established that the in-process dispatcher collapses and characterized the collapse in full — the N=2 breakdown, the deterministic hangs, the TTFT catastrophe — but it left open whether the collapse was a property of the dispatcher architecture or of the hardware and small-model set it ran on. V2 closed the architecture question by fanning the same cell shape across vLLM, SGLang, and TGI and showing the N=2 breakdown does not replicate under continuous batching on any of the three — which localized the V1 pathology to the pytorch_direct dispatch path rather than the hardware. But V2's own deliverable, the vLLM-vs-SGLang head-to-head, inherited a new confound: its cloud arc straddled two A100 SKUs, so its engine comparison had to carry the bandwidth-adjustment caveat. V3 closes that caveat by running both engines on a matched A100 80GB PCIe SKU, and in doing so also adds the scale axis (7B→14B) that V2's 7B-only cloud arc lacked. The arc is therefore monotone in what it pins down: V1 pins the collapse, V2 pins the architecture-specificity, V3 pins the matched-SKU engine comparison and the model-size gradient.

The lineage table below states the three legs against the dimensions that distinguish them — backend, hardware, model set, headline efficiency outcome, and the open caveat each leg left for its successor. Every V1 and V2 entry is cited from the prior reports; every V3 entry traces to provenance and per_model_efficiency_curves.

Leg Backend(s) Hardware Models Headline efficiency Caveat left open
V1 pytorch_direct (in-process) RTX 4080 Laptop llama3.2-1b/qwen2.5-1.5b/llama3.2-3b Collapse: 0.056 @ N=16, 6 hangs Dispatcher- or hardware-bound?
V2 vLLM / SGLang / TGI A100 PCIe + A100 SXM (split) qwen2.5-7b (cloud arm) Collapse eliminated; engine H2H bandwidth-confounded Clean matched-SKU engine H2H
V3 vLLM / SGLang A100 80GB PCIe (matched) qwen2.5-7b/llama3.1-8b/mistral-7b/qwen2.5-14b/phi-4 Gradual: 0.47–0.62 @ N=32, 0 hangs Coverage >14B; 4-backend survey

Observations. The table reads as a confound-retirement ledger. V1's open question (dispatcher vs hardware) is answered by V2's three-backend elimination; V2's open question (matched-SKU engine comparison) is answered by V3's matched A100 PCIe grid; V3's own open questions (above-14B coverage, a four-backend survey) pass forward to the future-work queue of §20. The headline-efficiency column is the program's through-line: 0.056-at-N=16 collapse, then collapse-eliminated, then 0.47–0.62-at-N=32 gradual decline — three readings of the same parallel-efficiency ratio that only mean something against each other.

The lineage is a chain of single-confound retirements, not a set of parallel benchmarks. Each leg holds the prior leg's open caveat as its mandate and leaves exactly one new caveat for the next. That discipline is why the V1→V3 contrast can be read as an order-of-magnitude dispatcher effect (V2 having localized it to the dispatch path) and the V3 head-to-head as a clean engine signal (V3 having retired the SKU confound). State of confidence: V1/V2 entries observed from the prior reports; V3 entries observed from provenance and per_model_efficiency_curves.

17.2 The SGLang V2/V3 cross-run consistency check

The cross-run consistency check is the SGLang qwen2.5-7b balanced_2k efficiency curve, the one cell where V2 and V3 ran the same model on the same backend a week apart on two different A100 SKUs. The table below places the V3 PCIe curve (this run) against the V2 SXM curve from Technical_Report_164_V2.md, with the per-model-curve N=32 value 0.529 traceable to per_model_efficiency_curves.sglang."qwen2.5-7b"."32" and the balanced_2k-specific cell to boundary_matrix.sglang."qwen2.5-7b|balanced_2k".efficiency_by_n.

N SGLang V3 (PCIe, this run) SGLang V2 (SXM, 2026-06-05) Δ (V3 − V2)
1 1.00 1.00 0.00
2 0.98 0.98 0.00
4 0.95 0.93 +0.02
8 0.86 0.83 +0.03
16 0.73 0.70 +0.03
32 0.55 0.50 +0.05

Observations. Same model, same backend, two A100 SKUs a week apart: efficiency-vs-N agrees within ~5 percentage points at every concurrency level, with V3 (PCIe) running marginally higher at high N than V2 (SXM) — a difference small enough to sit inside run-to-run and SKU variation, and one that does not invert the qualitative shape. This is a confirmatory cross-run check, not a load-bearing claim: V3's validity rests on its own 900-cell-per-backend matched-SKU grid, not on agreeing with V2. The vLLM side does not cross-validate as cleanly — V2-vs-V3 vLLM diverges at high N (V3 0.52 vs V2 0.43 at N=32 on qwen2.5-7b balanced_2k per V3_CROSS_BACKEND_FINDINGS §4) — but that divergence is expected and explained: V3 pins vLLM 0.10.2 while V2 ran an earlier April build digest, and the qualitative shape is identical across the two builds. V3 is a standalone extension of the program, not a controlled re-run of V2; there is no cross-version confound to manage because no V3 claim inherits from V2.

The lineage reading is that V3 retires the one confound V2 could not. V1 established the catastrophe; V2 established that continuous batching eliminates it but left the vLLM-vs-SGLang comparison entangled with the SXM-vs-PCIe memory-bandwidth split; V3 runs both backends on matched A100 80GB PCIe and so delivers the hardware-clean head-to-head V2 explicitly deferred. The SGLang V2/V3 within-5% agreement is the kind of confirmatory bonus a program wants — it says the substrate is stable across SKUs and weeks — but it is structurally a bonus, not a dependency. The honest framing is "V3 stands on its own matched-SKU grid, and happens to reproduce the trusted V2 SGLang curve within noise."

The direction of the V2/V3 SGLang divergence is worth one further sentence, because its sign is informative rather than merely small. V3 (PCIe) runs marginally higher than V2 (SXM) at the top of the ladder (+0.05 at N=32), which is the opposite of what a naive bandwidth-ratio account would predict — SXM has the higher HBM bandwidth, so if memory bandwidth were the dominant factor at high concurrency the SXM run should retain more efficiency, not less. That the PCIe run retains slightly more is consistent with the efficiency metric being a normalized retention ratio rather than an absolute-throughput one: SXM's higher absolute throughput appears in the N=1 anchor and the N=32 numerator alike, so it largely cancels in the ratio, leaving the small residual to run-to-run and build variation rather than to the SKU's bandwidth. This is the same reason the V3 matched-SKU design retires the bandwidth adjustment by construction — on a normalized retention curve, the SKU's bandwidth is mostly divided out, and the residual SKU effect is within the ~5% cross-run noise band. The vLLM side does not cross-validate as cleanly precisely because its confound is a build difference (V3 0.10.2 vs V2's April digest) rather than a SKU difference, and a build change can move the scheduler's amortization behavior in a way the normalized ratio does not cancel.

17.3 The TR165 interpreter-axis companion and the compound-failure triangulation

V3 also has a mechanism companion that runs on a different axis entirely. TR165 (Technical_Report_165.md) holds the backend fixed at pytorch_direct and flips exactly one knob — the Python interpreter, from a stock GIL build to a 3.14.5 free-threaded (nogil) build — and returns a pre-registered H2_partial verdict: 19 of 24 (79.2%) of the N=2 (model × workload × phase) combinations improve above threshold under nogil and 5 of 24 (20.8%) do not, with mean N=2 parallel-efficiency uplift +17.91pp and 2 of the 6 V1 deterministic-hang cell shapes resolved (the other 4 still hang). V3 and TR165 are the two orthogonal probes of the V1 breakdown: V3 changes the backend architecture (in-process dispatch → continuous-batching server) and the breakdown disappears, while TR165 changes the interpreter (GIL → nogil) and roughly four-fifths of the breakdown lifts. Each test rules out the other as a sole cause, and together they triangulate the V1 collapse as a compound failure — a continuous-batching scheduler removes it at the backend layer (V3), and a no-GIL interpreter removes a large but partial slice at the runtime layer (TR165). V3's contribution to that triangulation is the backend-architecture axis at scale: five model families, two production engines, matched hardware.

The triangulation is worth laying out explicitly, because two probes that each lift the V1 collapse pin the causal account more tightly than either could alone. If only V3 existed, the V1 collapse would be consistent with "the in-process dispatcher is bad and a real serving engine fixes it," which leaves the mechanism of the in-process collapse unidentified — it could be the GIL, the CUDA-context sharing, the thread-pool contention, or any compound of them. If only TR165 existed, the partial nogil lift (79.2% of cells, +17.91pp) would be consistent with "the GIL is the whole story" only if the lift were complete, but it is not — five of twenty-four cells do not improve and four of six hangs persist, so the GIL cannot be the sole cause. The two probes read together give the sharper account: the V1 collapse is compound, with a large GIL-attributable component (the slice TR165's nogil build lifts) and a residual non-GIL component (the slice nogil does not lift — the four persistent hangs and five non-improving cells), and a continuous-batching scheduler removes the whole thing at the backend layer because it sidesteps the in-process dispatch path entirely rather than fixing one of its contended resources. V3 is the backend-layer probe; TR165 is the runtime-layer probe; the compound-failure reading is what neither produces in isolation.

The two probes also differ in their verdict shape in a way that is itself a methodological lesson. V3's verdict is categorical — the N=2 breakdown does not occur on either continuous-batching engine, full stop — because changing the backend architecture removes the failure surface rather than ameliorating it. TR165's verdict is graded and pre-registered as H2_partial — most but not all of the breakdown lifts — because changing the interpreter ameliorates one contended resource (the GIL) without touching the others. A program that ran only the graded probe might under-claim ("nogil helps but doesn't fully fix it, so the problem is hard"); a program that ran only the categorical probe might over-claim ("a real engine fixes it, so the in-process path is simply wrong"). Running both, and reporting the categorical V3 result and the graded TR165 result side by side, is what licenses the precise compound-failure claim instead of either of the looser single-probe ones.

Observations. V3 and TR165 are orthogonal single-knob probes of the same V1 breakdown — backend architecture and interpreter respectively — and their verdicts are complementary in shape (categorical vs graded-partial). Together they identify the V1 collapse as a compound failure with a large GIL-attributable slice (TR165's nogil lift) and a residual non-GIL slice (the persistent hangs), removed entirely at the backend layer by continuous batching (V3) and partially at the runtime layer by nogil (TR165).

The triangulation is the program-level payoff of running two single-knob probes rather than one multi-knob study. Each probe rules out the other as a sole cause — V3 rules out "the GIL is the whole story" (a non-interpreter change fixes it), TR165 rules out "the engine is the whole story" (an interpreter-only change lifts most of it) — and the conjunction pins the V1 collapse as compound. V3's specific contribution is that it runs the backend-architecture probe at scale (five families, two engines, matched hardware) rather than at the single-model feasibility scale, which is what makes the categorical "no N=2 breakdown" verdict a five-model result rather than an anecdote. State of confidence: the TR165 H2_partial numbers (19/24, +17.91pp, 2-of-6 hangs resolved) are cited from Technical_Report_165.md (observed); the compound-failure synthesis is the program's own reading of the two probes together (inferred-and-documented).


18. SS12. Limitations

V3 is the strongest entry in the TR164 lineage on the cross-backend axis — 1,800 cells, 189,000 rows, ok_rate 1.0 across every cell, and the first hardware-clean vLLM-vs-SGLang head-to-head the program has produced — but the substrate carries five structural limitations that bound the breadth of claims it can license. None of these invalidate the V1-vs-V3 gradual-decline result or the matched-SKU head-to-head deltas; each constrains how far those results travel. The limitations are stated in the order of how load-bearing the confound is, beginning with the one that already forces an explicit exclusion in the results tables.

Limitation 1: Mistral × long_prefill_8k is not apples-to-apples across backends. The configuration routes vLLM Mistral through --tokenizer-mode mistral (the mistral_common tokenizer) while SGLang Mistral used the HF AutoTokenizer, and on the synthetic repeated-paragraph long_prefill_8k prompts these tokenizers disagree on token count. Measured prompt_tokens over n=1890 requests per backend: vLLM mistral_common 7,629 / 8,148 / 8,644 (min/mean/max) versus SGLang HF 8,860 / 9,465 / 10,039, a sglang_over_vllm_token_ratio of 1.162 (mistral_tokenizer_asymmetry). SGLang therefore processes ~16% more prefill tokens on the identical char-matched prompt, which is both why SGLang Mistral overflowed the 8,768 context ceiling (forcing the 11,264 refill) and why the Mistral long_prefill cell cannot enter the cross-backend comparison cleanly. The resolution is to document and exclude: head_to_head_by_workload.long_prefill_8k.mistral_excluded = true and n_pairs = 20 (versus 25 on the four full-model workloads), so the long_prefill_8k vLLM "edge" (Δ +0.0757, Holm p 4.8×10⁻⁵) is over the four non-Mistral models, not five. The ~16% extra Mistral prefill work is a property of char-matched (not token-matched) prompts across tokenizers of differing density — inherent to the workload design, not an artifact of the context-length fix.

Limitation 2: Model scope is five families with no coverage above 14B. The grid spans 2× 7B (qwen2.5-7b, mistral-7b), 1× 8B (llama3.1-8b), and 2× 14B-class (qwen2.5-14b, phi-4), as enumerated in the per_model_efficiency_curves keys. This is a deliberate expansion over V2's 7B-only cloud arc and it is enough to expose the model-size gradient (14B-class retains ~0.60-0.62 efficiency at N=32 versus 7-8B ~0.47-0.53), but it does not extend past 14B. No claim in this report about efficiency retention, the workload-shaped boundary, or the head-to-head deltas is licensed beyond the tested set; in particular, the model-size gradient is observed across the 7B→14B span and is not extrapolated to 32B/70B-class models, where memory-bandwidth and KV-cache pressure regimes differ.

Limitation 3: Two backends only. V3 is a vLLM-vs-SGLang matched-SKU head-to-head, not a four-backend study. TGI, Ollama, and the V1 pytorch_direct path are not in the V3 grid — TGI and pytorch_direct live in V1/V2, Ollama remains deferred from the program's local arc. The head-to-head statistical battery (head_to_head_by_workload) is structurally a paired comparison of two continuous-batching engines; any framing of V3 as a broad serving-stack survey would overreach the substrate, which is precisely two production engines on one SKU across five models and five workloads.

Limitation 4: The A100 cross-device ncu roofline tail is blocked, not run. The intended cross-device roofline extension — re-running the ncu_roofline.py harness on the A100 to test whether the decode/prefill roof split observed on the RTX 4080 holds across the ~4.5× memory-bandwidth gap — was blocked at the perf-counter gate inside the RunPod container. Two independent confirmations from inside the pod: /proc/driver/nvidia/params reported RmProfilingAdminOnly: 1 (a host-level, RunPod-controlled driver restriction) and /proc/self/status CapEff showed no cap_sys_admin; ncu fails with ERR_NVGPUCTRPERM and there is no in-container workaround (the restriction is set at pod creation). This does not affect the V3 boundary result — the ncu roofline layer already stands complete on the RTX 4080 data (research/tr164/ncu_analysis.json), nsight-compute 2025.4.1 installed cleanly, and the harness/extractor are pod-ready (ncu_extract.py takes --peak-dram-gbs, A100=1935). Only the driver capability blocks it; the A100 cross-device roofline is deferred to a privileged / --cap-add=SYS_ADMIN pod (V3_SGLANG_RUN_FINDINGS §6b).

Limitation 5: The head-to-head deltas are small and one workload is a statistical tie. The vLLM-vs-SGLang deltas are 1.5-7.6pp in magnitude (head_to_head_by_workload mean deltas), Holm-significant on four of five workloads with bootstrap CIs excluding zero, but balanced_2k is a tie (Δ +0.0022, CI [−0.01, 0.014] straddling zero, Holm p 0.652841). These deltas are workload-routing signals — which engine to prefer for a given traffic shape on this hardware — not an engine-quality ranking, and they split direction by workload (SGLang on short_decode/long_decode/repeated_prefix, vLLM on long_prefill_8k over the four non-Mistral models). The substrate does not license a single-number "engine X is faster" claim.

Observations. The five limitations partition cleanly: Limitation 1 is a per-cell confound that forces an explicit exclusion (Mistral tokenizer asymmetry), Limitation 2 is a scope cap (≤14B), Limitation 3 is a backend-count cap (two engines), Limitation 4 is a blocked extension (A100 perf-counter restriction), and Limitation 5 is a magnitude-and-direction caveat on the deltas. None of them touch the two load-bearing V3 results — the V1-vs-V3 gradual-decline contrast is a within-program matched-direction claim against V1's documented collapse, and the head-to-head deltas are hardware-clean by construction because both backends ran on matched A100 80GB PCIe. The limitations bound the breadth of secondary claims, not the load-bearing comparisons.

The honest summary is that V3 has earned the matched-SKU head-to-head V2 could not — a hardware-clean vLLM-vs-SGLang comparison across five model families, with the one tokenizer-confounded cell documented and excluded — and it has earned the gradual-decline contrast against V1's collapse. It has NOT earned coverage above 14B, a four-backend survey, an A100 cross-device roofline, or an engine-ranking claim. Reporting these five limitations explicitly is the substrate-fidelity discipline the program runs on; the magnitude-honesty rule (small-but-consistent deltas, one tie, document-and-exclude on Mistral) is what keeps the head-to-head from being read as a leaderboard it is not.


19. SS13. Forbidden Claims

Substrate fidelity has a dual face. SS15 and SS16 name what V3 licenses and what bounds it; SS17 names the specific claims V3's numbers tempt but cannot pay for. Each forbidden claim corresponds to a real V3 result that an under-disciplined writeup would over-promote, and each one has a licensed form that the substrate does support. The forbidden-claims discipline is the report's structural boundary on conclusions; holding the line here is what lets the licensed claims carry weight at external review.

19.1 Forbidden Claim 1: "vLLM beats SGLang" (or "SGLang beats vLLM").

The temptation comes from having a clean matched-SKU head-to-head with Holm-significant deltas on four of five workloads. But the deltas are small (1.5-7.6pp), workload-conditional, and split direction by workload: SGLang holds the edge on short_decode (Δ −0.0349, Holm p 0.001824), long_decode (Δ −0.0349, Holm p 0.000649), and repeated_prefix (Δ −0.0148, Holm p 0.001289), while vLLM holds the edge on long_prefill_8k (Δ +0.0757, Holm p 4.8×10⁻⁵, over the four non-Mistral models), and balanced_2k is a tie (Δ +0.0022, Holm p 0.652841, CI straddling zero). A single-engine ranking folds five direction-split workload results into one number that does not exist in the substrate (head_to_head_by_workload).

Observations. The matched-SKU grid is the prerequisite for an engine comparison; the direction-split-by-workload result is the prerequisite for refusing to publish a single-engine ranking.

What IS licensed is a workload-conditional routing signal on matched hardware: for short-prompt and decode-heavy traffic SGLang is the marginally better route on this A100 80GB PCIe substrate, for prefill-heavy traffic vLLM is (over the non-Mistral models), and for balanced mixed traffic the two are equivalent. That is a deployment-routing statement, not a leaderboard. "vLLM beats SGLang" / "SGLang beats vLLM" is not licensed.

19.2 Forbidden Claim 2: "Mistral is slower than the other models on long prefills."

The temptation comes from Mistral's lower efficiency in the long_prefill_8k cells. But that cell is confounded by the tokenizer asymmetry: on the char-matched long_prefill_8k prompts, SGLang's HF AutoTokenizer produces ~16% more tokens for Mistral than vLLM's mistral_common (ratio 1.162, sglang_over_vllm_token_ratio), so Mistral genuinely processes ~10k-token prefills against ~8.7k for the other four families on the same nominal prompt. Lower measured efficiency there reflects more prefill work, not a slower model.

Observations. The char-matched (not token-matched) workload design means cross-model long_prefill comparisons must be read as char-matched; Mistral does more token-work per nominal prompt and that is a property of its tokenizer's density, not its serving speed.

The licensed form is document-and-exclude: head_to_head_by_workload.long_prefill_8k.mistral_excluded = true, the long_prefill_8k head-to-head reported over the four non-Mistral models, and the Mistral long_prefill cell footnoted as tokenizer-confounded. A "Mistral is slower on long prefills" claim collapses the tokenizer confound into a model-speed claim and is forbidden.

19.3 Forbidden Claim 3: "Continuous batching always avoids breakdown."

The temptation comes from the V1-vs-V3 headline — continuous batching turns V1's 0.056-at-N=16 collapse into 0.47-0.62-at-N=32 retention. But the V3 boundary is workload-shaped, not uniform: prefill-heavy long_prefill_8k still drives the p95 latency multiplier to 2.64× (vLLM) and 3.5× (SGLang) at N=32 and pulls mean efficiency below 0.30 on SGLang (0.2812, workload_boundary.sglang.long_prefill_8k.mean_efficiency_at_n32). Continuous batching shifts the breakdown boundary up by more than an order of magnitude in concurrency, but it does not erase it.

Observations. Three continuous-batching engines across two reports clearing the V1 N=2 collapse is structural evidence that the V1 pathology is dispatch-architecture-specific; it is not a universal-quantifier guarantee that continuous batching never breaks down.

The licensed form is a gradual, workload-shaped decline: continuous batching moves the boundary upward dramatically (V1 0.056 at N=16 → V3 0.47-0.62 at N=32) but the prefill-heavy tail still degrades to 2.6-3.5× p95 and sub-0.30 efficiency at the highest tested concurrency. "Continuous batching always avoids breakdown" overstates a real-but-bounded result and is forbidden.

19.4 Forbidden Claim 4: "V3 reproduces V2."

The temptation comes from the SGLang qwen2.5-7b balanced_2k curve agreeing within ~5% across V2 (SXM) and V3 (PCIe). But V3 is a standalone extension, not a controlled re-run of V2: different SKU set, different model count (5 vs 1 on cloud), different workload count, and a pinned vLLM 0.10.2 versus V2's April build digest. The SGLang within-5% agreement is a confirmatory bonus, and the vLLM V2-vs-V3 divergence at high N (V3 0.52 vs V2 0.43 at N=32) is the expected build difference, not a reproduction failure (V3_CROSS_BACKEND_FINDINGS §4).

Observations. A controlled-reproduction claim would require holding SKU, model set, workload set, and backend build fixed across the two runs; V3 deliberately changes all four to extend the program, so it cannot be a replication of V2.

The licensed form is "V3 is a standalone matched-SKU extension that happens to reproduce the trusted V2 SGLang curve within noise." V3 does not depend on that agreement and does not claim to replicate V2. A controlled "V3 reproduces V2" claim is forbidden.

19.5 Why these four are the discipline boundary.

Each forbidden claim corresponds to a real V3 result that an under-disciplined writeup would over-promote. The engine-ranking temptation comes from having Holm-significant matched-SKU deltas; the Mistral-is-slow temptation comes from a lower long_prefill efficiency that is really a tokenizer artifact; the continuous-batching-always-works temptation comes from the dramatic V1-vs-V3 contrast; the V3-reproduces-V2 temptation comes from the within-5% SGLang agreement. The substrate-fidelity rule is the same one that gated the program's earlier scaffolds: licensed claims travel, over-promoted claims get caught at external review. Each forbidden claim above has a licensed form one degree narrower, and reporting that boundary explicitly is what distinguishes the V3 synthesis from the genre of backend-benchmark posts that report a single raw delta and call it an engine comparison.


20. SS14. Future Work

Four extensions are queued against the V3 substrate, ordered by compute cost. First, harmonize the Mistral tokenizer: re-run one side of the grid with a matched Mistral tokenizer (SGLang Mistral on mistral_common, or vLLM Mistral on HF AutoTokenizer) so the long_prefill_8k cell becomes a clean five-model head-to-head rather than a documented-and-excluded four-model one. This is the cheapest extension — ~30-60 minutes of pod time per side (V3_CROSS_BACKEND_FINDINGS §3 option (b)) — and it converts Limitation 1 from a confound into a clean cell. Second, the A100 cross-device ncu roofline: re-run the ncu_roofline.py harness on a privileged / --cap-add=SYS_ADMIN pod (or a host with RmProfilingAdminOnly=0) to test whether the decode/prefill roof split observed on the RTX 4080 holds across the ~4.5× memory-bandwidth gap to the A100; the harness and extractor are already pod-ready and only the perf-counter capability blocks it (V3_SGLANG_RUN_FINDINGS §6b). Third, scale and breadth: extend the grid beyond 14B and add TGI and Ollama to the matched-SKU set, retiring Limitations 2 and 3 in one wave. Fourth, the formalized cross-run synthesis: fold both V3 backends into cross_run_analyze.py for the paired vLLM↔SGLang boundary matrix with the head-to-head deltas carried through the full Holm-adjusted paired battery as a tracked artifact (V3_SGLANG_RUN_FINDINGS §7, V3_CROSS_BACKEND_FINDINGS §6). The first and fourth fire today against existing substrate and existing or short pod time; the second waits on a privileged pod; the third is the only piece that meaningfully scales the compute envelope, and it aligns with the program's free-compute expansion lane.

21. Conclusion

Technical Report 164 V3 delivers the matched-SKU clean head-to-head that V1 and V2 could not. V1 established the pessimistic anchor — an in-process pytorch_direct backend collapsing to 0.056 parallel efficiency at N=16 with six deterministic dispatcher hangs and 188-second tail TTFT — and V2 confirmed continuous batching eliminates that collapse but left the vLLM-vs-SGLang comparison entangled with an A100 SXM-vs-PCIe memory-bandwidth confound. V3 runs both backends on matched A100 80GB PCIe (provenance.hardware), pins vLLM 0.10.2 and SGLang 0.5.12.post1 (provenance.vllm_version, provenance.sglang_version), and executes a 5-model × 5-workload × 6-N × 3-rep × 2-phase grid per backend — 900 cells and 94,500 rows per backend at ok_rate 1.0, an aggregate of 1,800 cells and 189,000 rows (provenance.cells_per_backend, provenance.rows_per_backend, provenance.ok_rate).

The first load-bearing result is the gradual continuous-batching decline. Across all five model families on both backends, parallel efficiency at N=32 retains 0.47-0.62 (per_model_efficiency_curves: vLLM qwen2.5-7b 0.4997, llama3.1-8b 0.4876, mistral-7b 0.4701, qwen2.5-14b 0.6118, phi-4 0.6189; SGLang qwen2.5-7b 0.529, llama3.1-8b 0.4777, mistral-7b 0.4713, qwen2.5-14b 0.60, phi-4 0.5952), where V1 pytorch_direct had already collapsed to 0.056 at N=16. Continuous batching moves the breakdown boundary up by more than an order of magnitude in concurrency, and a sensible model-size gradient rides on top of it: the 14B-class families (qwen2.5-14b, phi-4) retain ~0.60-0.62 at N=32 against the 7-8B class's ~0.47-0.53, consistent across both backends — larger per-token compute amortizes the batching overhead better. This V1-vs-V3 effect is large and dramatic, and is deliberately framed separately from the small head-to-head deltas so the two magnitudes are not conflated.

The second load-bearing result is the workload-shaped p95 boundary. The breakdown is not uniform: prefill-heavy long_prefill_8k drives the mean p95 latency multiplier at N=32 to 2.64× (vLLM) and 3.5× (SGLang) and pulls efficiency below 0.30 on SGLang, while decode-heavy long_decode stays nearly flat at ~1.4× on both backends and retains ~0.72 efficiency (workload_boundary). Continuous batching works until workload shape breaks the amortization regime — the prefill-bound tail is where it bends.

The third load-bearing result is the matched-SKU head-to-head itself. The vLLM-vs-SGLang deltas are small (1.5-7.6pp) but Holm-significant on four of five workloads with bootstrap CIs excluding zero, and they split direction by workload: SGLang holds the edge on short_decode (Δ −0.0349, Holm p 0.001824), long_decode (Δ −0.0349, Holm p 0.000649), and repeated_prefix (Δ −0.0148, Holm p 0.001289, the RadixAttention prefix-reuse advantage matching V2's qualitative direction), vLLM holds the edge on long_prefill_8k (Δ +0.0757, Holm p 4.8×10⁻⁵, over the four non-Mistral models), and balanced_2k is the honest tie (Δ +0.0022, Holm p 0.652841, CI [−0.01, 0.014] straddling zero) (head_to_head_by_workload). Because both backends ran on matched A100 80GB PCIe, these deltas are hardware-clean by construction — no bandwidth-adjustment caveat is needed, which is exactly the methodological advance V2 could not make. The deltas are a workload-conditional routing signal on this hardware, not an engine-quality ranking.

One confound is documented and excluded: Mistral × long_prefill_8k is tokenizer-confounded (mistral_common mean 8,148 tokens vs HF AutoTokenizer mean 9,465, ratio 1.162; mistral_tokenizer_asymmetry), so SGLang processes ~16% more prefill work on the char-matched prompt, and that cell is excluded from the long_prefill_8k head-to-head (mistral_excluded = true, n_pairs = 20). V3 reads alongside TR165's H2_partial nogil result as the two orthogonal probes of the V1 breakdown — V3 the backend-architecture axis at five-model scale, TR165 the interpreter axis — which together triangulate the V1 collapse as a compound failure that a continuous-batching scheduler removes at the backend layer and a no-GIL interpreter removes a large but partial slice of at the runtime layer. V3 is a standalone extension of the program, not a replication of V2; every numeric claim in this report is reproducible from research/tr164/v3_paper_stats.json and the run directories results/20260612_204254_036590 (vLLM) and results/20260612_212816_266127 (SGLang).


22. References

This section lists the artifacts and external sources load-bearing for TR164 V3. Every internal reference is a path inside this repository; every external reference is an open-source upstream surface, a vendor documentation page, or a public PEP. Citations to blind-review-active venues are deliberately omitted per substrate hygiene policy; where a forward-looking external venue would otherwise be named, the generic equivalent is used.

24.1 Banterhearts internal references

The internal substrate is organized along three axes: the TR164 lineage (V1 pytorch_direct, V2 cross-backend, and the TR165 interpreter-axis companion) that V3 extends; the two V3 run directories plus their findings docs that carry every number in this report; and the cross-program synthesis surfaces that consume the V3 substrate.

Ref ID Path Role
INT-1 PublishReady/reports/Technical_Report_164.md TR164 V1 hand-narrated report — pytorch_direct local_core catastrophic-collapse baseline (0.056 at N=16; six deterministic hangs); the pessimistic anchor V3 contrasts against
INT-2 PublishReady/reports/Technical_Report_164_V2.md TR164 V2 report — earlier cross-backend cloud_core feasibility pass with the SXM-vs-PCIe bandwidth confound that V3 retires
INT-3 PublishReady/reports/Technical_Report_165.md TR165 companion — Python 3.14t nogil GIL ablation (H2_partial); the interpreter axis to V3's backend-architecture axis
INT-4 research/tr164/results/20260612_204254_036590/ V3 vLLM 0.10.2 run directory (A100 80GB PCIe, 900 cells, 94,500 rows, ok_rate 1.0)
INT-5 research/tr164/results/20260612_212816_266127/ V3 SGLang 0.5.12.post1 run directory (A100 80GB PCIe, 900 cells, 94,500 rows, ok_rate 1.0; merged master incl. Mistral refill)
INT-6 research/tr164/results/20260612_212816_266127_BACKUP_preRefill/ Pre-refill SGLang snapshot (with the 3,780 http_400 Mistral rows), retained untouched for splice audit
INT-7 research/tr164/results/20260613_055519_963364/ Mistral × long_prefill_8k refill run (ctx 11264, 3,780 ok rows)
INT-8 research/tr164/v3_paper_stats.json Authoritative V3 statistics — head-to-head, boundary matrix, per-model curves, workload boundary, Mistral tokenizer asymmetry
INT-9 research/tr164/V3_SGLANG_RUN_FINDINGS_2026-06-13.md SGLang run/incident narrative — four-failure environment arc, Mistral refill, verification
INT-10 research/tr164/V3_CROSS_BACKEND_FINDINGS_2026-06-13.md vLLM grid verification, cross-backend head-to-head, Mistral tokenizer caveat, V2 relationship, repro pins
INT-11 research/tr164/config_cloud_5model_3rep.yaml Main-grid config (version 0.2_cloud5_3rep, seed 164) — the 900-cell cross-product and per-backend launch args

Observations. Every numeric claim in this report resolves to INT-8 (the statistics JSON) or to one of the findings docs INT-9/INT-10, which in turn resolve to the run directories INT-4/INT-5. INT-6 (the pre-refill backup) and INT-7 (the refill run) are the two references that make the Mistral document-and-exclude decision of Section 15 auditable rather than merely asserted — without them, the 3,780-row splice into the master metrics.csv would not be reconstructable.

The substrate-fidelity discipline this report inherits depends on every number resolving to INT-8 or, for the run/incident facts, to INT-9/INT-10. If a later synthesis pass invokes a number not traceable to those three documents, that claim is by construction outside the V3 substrate and must be flagged as TBD.

24.2 External references

Ref ID Citation Role
EXT-1 vLLM project documentation (continuous batching, PagedAttention) Backend vendor doc for INT-4 (vLLM 0.10.2)
EXT-2 SGLang project documentation (RadixAttention, prefix-cache reuse) Backend vendor doc for INT-5 (sglang 0.5.12.post1); the prefix-reuse mechanism behind the repeated_prefix SGLang edge
EXT-3 mistral_common tokenizer documentation (Mistral-v0.3) The --tokenizer-mode mistral routing on vLLM; one half of the Mistral tokenizer asymmetry
EXT-4 Hugging Face transformers AutoTokenizer documentation The HF tokenizer routing on SGLang; the other half of the Mistral asymmetry (ratio 1.162)
EXT-5 NVIDIA CUDA forward-compatibility (cuda-compat-13-0) documentation The 580.159 forward-compat libs that bridged the SGLang pod's 575.57 host driver to the cu130 torch stack
EXT-6 PEP 779 — Criteria for supported status for free-threaded Python Defines the Python 3.14t no-GIL build path exercised in companion TR165 (INT-3)

Observations. External references are restricted to open-source vendor documentation, two tokenizer-library surfaces, an NVIDIA forward-compatibility page, and a single PEP; no blind-review-active venue is cited. EXT-3 and EXT-4 are load-bearing in a way the other external references are not — they are the two tokenizer surfaces whose 1.162 token-count ratio is the entire root cause of the Mistral × long_prefill_8k document-and-exclude decision.

The two tokenizer references (EXT-3, EXT-4) carry more interpretive weight than the rest of the external set combined, because the asymmetry between them is the one cross-backend confound V3 could not engineer away. The NVIDIA forward-compatibility reference (EXT-5) is the second-heaviest, because the cuda-compat-13-0 bridge it documents is the load-bearing clause of the SGLang working pin in Appendix A. Together the external references map one-to-one onto the two caveats — the tokenizer asymmetry and the environment arc — that the body of this report spends the most prose disciplining.

23. Appendix A. Reproducibility Pins

This appendix is the reproduction contract for the V3 boundary matrix. Where the V2 report's environment appendix had to foreground a single load-bearing confound — the A100 PCIe versus A100 SXM memory-bandwidth split that contaminated its vLLM-versus-SGLang head-to-head — V3's environment appendix carries no such asterisk on the hardware axis. Both backends ran on matched A100 80GB PCIe (provenance.hardware: "A100 80GB PCIe (both backends — matched SKU, no bandwidth adjustment needed)"), so the per-cell repro pins below differ on the software stack and on the per-model tokenizer routing, but not on the instrument. The two consequential rows on this page are therefore the backend version row and the Mistral tokenizer/context row; the GPU row, which dominated the V2 appendix, is here a constant.

21.1 Per-backend reproducibility pin table

The table records the minimum (hardware, driver, backend version, torch flavor, tokenizer/context, run directory) tuple a third party needs to reproduce any per-cell row in metrics.csv from either V3 run directory. The driver and torch rows are not cosmetic: the divergence between them is the entire cause of the four-failure SGLang environment arc narrated in Section 9, and the symmetry the table does not show — that both pods landed on the same coherent CUDA-13 stack at the end — is what licenses reading the two backends as software-comparable.

Field vLLM pod SGLang pod
GPU NVIDIA A100 80GB PCIe NVIDIA A100 80GB PCIe
Host driver 580.159 (CUDA-13-capable) 575.57 (CUDA 12.9) + cuda-compat-13-0 (580.159 forward-compat)
Backend vLLM 0.10.2 (native venv) sglang 0.5.12.post1 (native launcher)
torch 2.8.0+cu128 2.11.0+cu130
Transitive pin (resolver-clean; no arc) kernels<0.15 (pin against kernels==0.15.2 import crash)
Mistral tokenizer --tokenizer-mode mistral (mistral_common) HF AutoTokenizer
Mistral context ceiling 8,768 (no overflow) 11,264 (refilled; default 8,768 overflowed)
--cuda-graph-max-bs n/a (vLLM default) 32
Cells planned / completed 900 / 900 900 / 900
metrics.csv rows 94,500 94,500
ok_rate 1.0 across every cell 1.0 across every cell
Local↔pod md5 540adec5… == 540adec5… 5fd15755… (orig) + e0400837… (refill), both byte-identical ✓
Run directory research/tr164/results/20260612_204254_036590/ research/tr164/results/20260612_212816_266127/

Observations. Three rows do the reproducibility work, and none of them is the GPU. First, the host-driver row is the load-bearing software fact: the vLLM pod shipped driver 580.159, which already covers CUDA 13, so Codex's torch 2.8.0+cu128 initialized cleanly with no environment arc; the SGLang pod shipped driver 575.57 (CUDA 12.9), which does not cover the cu130 wheels that the SGLang resolver pulled, forcing the cuda-compat-13-0 forward-compat bridge documented in Section 9. The driver mismatch, not the backend, is what made the two pods asymmetric to stand up. Second, the Mistral tokenizer/context row is the only per-model routing difference in the entire matrix — vLLM routes Mistral through mistral_common at the 8,768 default ceiling, SGLang routes Mistral through the HF AutoTokenizer at a refilled 11,264 ceiling — and that single asymmetry is the entire content of the document-and-exclude decision in Section 15. Third, the md5 row converts "we pulled the data back" from a claim into a verifiable artifact: both run directories are byte-identical local-versus-pod, with the SGLang side carrying two hashes because the Mistral refill spliced 3,780 rows into the master metrics.csv.

The V2 environment appendix had to teach the reader that A100 SXM is not the same instrument as A100 PCIe. The V3 appendix has the opposite shape: the instrument is held constant and the reader is instead taught that torch 2.8.0+cu128 on driver 580.159 and torch 2.11.0+cu130 + cuda-compat-13-0 on driver 575.57 are two routes to the same coherent CUDA-13 runtime. The matched-SKU design is exactly what moves the asterisk off the hardware row and onto the tokenizer row — which is a far narrower, single-model, single-workload caveat than a bandwidth confound that touched every prefill-heavy cell.

21.2 SGLang working-environment pin (verbatim)

The SGLang side took four attempts to stand up, and the durable artifact of that arc is a single working pin that any downstream reader inherits. Stated verbatim for reproduction (V3_SGLANG_RUN_FINDINGS §3): sglang 0.5.12.post1 + torch 2.11.0+cu130 + kernels<0.15 + cuda-compat-13-0 (580.159 forward-compat libs) on host driver 575.57, native launcher (the vendor Docker image bypassed exactly as in V2), with LD_LIBRARY_PATH=/usr/local/cuda-13.0/compat:/usr/local/lib/python3.12/dist-packages/nvidia/cu13/lib exported before launch and --cuda-graph-max-bs 32. The verification gate that proves the pin works is the up-front stage-0 assertion torch.cuda.is_available() True with device string "NVIDIA A100 80GB PCIe" and a clean import sgl_kernel; the pin is not considered live until both pass.

Observations. The verbatim pin is deliberately recorded as a single composable string rather than as four independent fixes, because the four fixes are not independent — pinning kernels<0.15 without the cuda-compat-13-0 bridge still leaves torch.cuda.is_available() False on driver 575.57, and bridging the driver without pinning kernels still crashes the sglang import. The pin is the contract; the individual failures in Section 9 are the rationale for each clause of it.

The cleanest test of whether a reproducibility pin is real is whether a reader who has never seen the failure can stand the environment up from the pin alone. The SGLang pin clears that bar because it carries the up-front torch.cuda.is_available() + import sgl_kernel assertion as part of the contract — a reader who copies the pin and runs stage 0 finds out in seconds whether the CUDA-13 stack is coherent, rather than discovering it 1,200 seconds later when the server startup timeout fires.

21.3 Config files, driver tooling, and artifact manifest

The main-grid config is config_cloud_5model_3rep.yaml (version 0.2_cloud5_3rep, seed 164), which encodes the 5-model × 5-workload × 6-N × 3-rep cross-product, the per-backend extra_args (vLLM --dtype float16 --gpu-memory-utilization 0.86; SGLang --dtype float16 --mem-fraction-static 0.82 --cuda-graph-max-bs 32), the cloud_full suite's --enable-prefix-caching on vLLM, and the Mistral backend_extra_args.vllm: ["--tokenizer-mode", "mistral"] routing that is the upstream cause of the tokenizer asymmetry. The Mistral refill override is config_mistral_refill.yaml, which is identical to the main grid except default_max_model_len: 8768 → 11264. The manifest-gated pipeline driver is v3_sglang_driver.sh (v2.2), which verifies every run.py invocation at the manifest level — per-cell status plus metrics-row count — rather than trusting the exit code, the fix for the silent-success failure mode in Section 9.

The on-disk artifact set is three directories plus the durable cross-run synthesis target. The merged master is results/20260612_212816_266127/ (SGLang: metrics + regenerated analysis.json/cell_summary.csv + REFILL_NOTE.md) and results/20260612_204254_036590/ (vLLM). The pre-refill snapshot results/20260612_212816_266127_BACKUP_preRefill/ retains the 3,780 http_400 Mistral rows untouched for audit, and the refill run itself is results/20260613_055519_963364/ (Mistral × long_prefill_8k only, ctx 11264). Per research/tr164/.gitignore, the results/ trees are gitignored; the tracked durable output is the cross-run synthesis JSON, which is the next thing extended over this substrate.

Observations. The artifact manifest is deliberately three-part — merged master, pre-refill backup, refill run — rather than a single overwritten directory, because the Mistral refill replaced 3,780 rows in place and a reader auditing the splice needs the pre-refill state to verify that exactly the http_400 rows were replaced and nothing else. The _BACKUP_preRefill directory is the receipt for that splice; without it, "we refilled Mistral" would be a claim rather than a diff a third party can reconstruct.

Reproducibility here is the same triad V2 established — pinned backend version plus pinned dispatcher driver plus the on-disk metrics.csv row count — with one V3-specific addition: the pre-refill backup. The backup is what turns the Mistral refill from an opaque mutation of the master metrics.csv into an auditable 3,780-row replacement, and it is the reason the document-and-exclude decision in Section 15 can be checked rather than merely trusted.


24. Appendix B. Statistical Methodology

This appendix states the statistical machinery behind the cross-backend head-to-head of Section 13 in full, so that every number in that table is reconstructable from the committed metrics.csv and the methodology below. The design is deliberately conservative: a paired non-parametric primary test, a family-wise error correction across the five-workload family, and a bootstrap interval as the effect-size companion. No number in the head-to-head depends on a parametric distributional assumption, and no workload's significance is read without the Holm correction applied.

22.1 Paired Wilcoxon signed-rank, per workload

The head-to-head is computed per workload over paired per-cell parallel-efficiency deltas. For each (model × N-level × rep) cell that exists on both backends, the pair is (vLLM efficiency, SGLang efficiency) at that coordinate, and the per-pair delta is vllm − sglang. The paired Wilcoxon signed-rank test is run over the within-workload set of these deltas, yielding the raw two-sided p-value reported as wilcoxon_p_raw per workload. The pairing is exact-coordinate by construction — the matched-cell-shape discipline that produces identical 900-cell grids on both backends is the prerequisite for the test, exactly as the matched cell-shape was the prerequisite for any cross-backend test in V2.

The number of pairs per workload is n_pairs: 25 for the four full-model workloads (short_decode, balanced_2k, long_decode, repeated_prefix) and 20 for long_prefill_8k, where the drop from 25 to 20 is the Mistral exclusion of Section 15 — five Mistral pairs removed because the Mistral × long_prefill_8k cell is tokenizer-confounded across backends and not apples-to-apples.

Observations. Wilcoxon signed-rank rather than a paired t-test is the deliberate choice, for two reasons that are visible in the substrate. First, parallel-efficiency deltas across N-levels are not symmetric — a cell at N=2 deviates from 1.0 by a few percent while a cell at N=32 deviates by tens of percent — so the rank-based test is robust to the heavy-tailed delta distribution in a way the mean-based t-test is not. Second, the test is paired at exact coordinates, which removes the between-cell variance that would otherwise swamp the small (1.5–7.6pp) backend signal; the signal is only legible because each delta is a within-coordinate contrast.

The choice of a paired rank test over a paired mean test is what keeps the small head-to-head deltas honest. An unpaired or mean-based comparison would either inflate the apparent gap (by ignoring the coordinate pairing) or wash it out (by letting the N=32 cells dominate the variance). The paired Wilcoxon is the test that can resolve a 1.5pp repeated_prefix edge as Holm-significant without overstating it.

22.2 Holm-Bonferroni stepdown across the five-workload family

The five workloads are tested as a single family, and the raw per-workload p-values are corrected by the Holm-Bonferroni stepdown procedure to control the family-wise error rate at 0.05. The corrected values are reported as wilcoxon_p_holm, and the boolean holm_significant_0.05 is the decision after correction. The stepdown is what licenses reading four of the five workloads as carrying a real backend difference while reading the fifth (balanced_2k) as a tie — without the correction, reporting five separate tests at the nominal 0.05 level would inflate the family-wise false-positive rate.

The corrected pattern is: short_decode wilcoxon_p_holm 0.001824 (significant), balanced_2k 0.652841 (not significant), long_prefill_8k 4.8×10⁻⁵ (significant, Mistral-excluded), long_decode 0.000649 (significant), repeated_prefix 0.001289 (significant). Four of five survive Holm; balanced_2k does not, and its raw and Holm-adjusted p-values are identical at 0.652841 because it is the largest p-value in the family and the Holm stepdown leaves the largest value uncorrected.

Observations. The detail that balanced_2k's raw and Holm-adjusted p-values are byte-identical (0.652841) is not a coincidence — it is the signature of the Holm stepdown applied to the least-significant member of a family, which is multiplied by 1 and therefore unchanged. The four surviving workloads, by contrast, were all corrected upward from their raw values (e.g. short_decode 0.000912 → 0.001824, a factor-of-2 correction consistent with its rank position in the stepdown), and all four still clear 0.05 after correction. The headline "four of five Holm-significant" is therefore a post-correction count, not a nominal one.

Holm-Bonferroni is the discipline that prevents the head-to-head from being a fishing expedition across five workloads. The honest claim it licenses is precise: on four of five workloads the backends differ at a family-wise-controlled 0.05 level, and on the fifth they do not. Dropping the correction would let balanced_2k's nominal p-value be cherry-picked one way or the other; keeping it forces balanced_2k to be reported as the tie it is.

22.3 Percentile bootstrap 95% confidence interval on the mean delta

For each workload, a percentile bootstrap is run on the within-workload set of per-cell deltas to produce a 95% confidence interval on the mean delta, reported as boot_ci95_delta. The procedure is a percentile bootstrap with replacement over the paired deltas — 10,000 resamples at a fixed seed of 164, traced directly to the analysis script research/tr164/v3_paper_stats.py (BOOT_RESAMPLES = 10000, BOOT_SEED = 164) that produced boot_ci95_delta — with the interval read as the 2.5th and 97.5th percentiles of the resampled mean-delta distribution. The v3_paper_stats.json provenance block carries the resulting interval; the resample-count and bootstrap-seed are fixed in the producing script (cited above), so the interval is reproducible to the exact resample, not merely to the analysis pipeline. The interpretation rule is stated once and applied uniformly: a CI that excludes zero indicates a consistent-direction backend difference; a CI that straddles zero indicates a tie.

Under that rule the five workloads read as: short_decode [−0.0544, −0.0176] (excludes zero, SGLang edge), balanced_2k [−0.01, 0.014] (straddles zero, tie), long_prefill_8k [0.0546, 0.0961] (excludes zero, vLLM edge over the four non-Mistral models), long_decode [−0.057, −0.0172] (excludes zero, SGLang edge), repeated_prefix [−0.0227, −0.0076] (excludes zero, SGLang edge). The bootstrap CI and the Holm decision agree on all five workloads — the four Holm-significant workloads all have CIs excluding zero, and the one Holm-null workload (balanced_2k) has the one CI straddling zero — which is the cross-check that the two methods are telling the same story.

Observations. The agreement between the Holm decision and the bootstrap CI on every workload is the load-bearing consistency check of the head-to-head. The Holm test answers "is the direction reliable?" and the bootstrap CI answers "how large is the effect and could it be zero?"; that they agree on all five workloads — four directional-and-nonzero, one null-and-zero-straddling — means the small-but-consistent framing is supported by two independent statistical lenses, not one. The bootstrap is seeded in the analysis script (10,000 resamples, seed 164, per v3_paper_stats.py) so the interval is itself reproducible to the exact resample.

The percentile bootstrap is the effect-size companion to the Wilcoxon significance test, and its job in this report is to keep magnitude honesty enforceable. A reader who sees short_decode's CI of [−0.0544, −0.0176] cannot round the 3.5pp SGLang edge up into "SGLang dominates" — the upper bound is 1.8pp — and a reader who sees balanced_2k's CI of [−0.01, 0.014] cannot round the tie into a vLLM win, because the interval crosses zero. The CI is what makes "routing signal, not engine ranking" a checkable claim rather than a rhetorical one.

22.4 Parallel-efficiency definition, breakdown threshold, and boundary semantics

The parallel-efficiency metric is inherited unchanged from V1 and V2: efficiency at concurrency N is (N × decode_tps_at_N) / decode_tps_at_N=1, normalized to 1.0 at N=1 by construction, so that 1.0 is perfect linear scaling and lower values quantify the fraction of linear scaling retained. The efficiency-breakdown threshold is 0.65 (provenance.efficiency_breakdown_threshold), the same threshold the config encodes (analysis.efficiency_breakdown_threshold: 0.65). The per-cell boundary statistic first_n_below_0.65 is the smallest tested concurrency level at which a cell's efficiency drops below 0.65; the value null encodes "never breaks within the tested N≤32 envelope." The companion per-cell statistic p95x_at_n32 is the ratio of p95 latency at N=32 to p95 latency at N=1, the tail-latency multiplier that quantifies how far the boundary has stretched the tail at the top of the tested concurrency range.

Observations. The choice to encode "never breaks" as null rather than as a sentinel like 64 or infinity is deliberate and matters for the Appendix C reading: a null is a qualitatively different outcome from a numeric break-point, because it says the cell retained ≥0.65 efficiency through the entire tested range, and treating it as a large number would let those cells be averaged into a misleadingly high mean break-point. The boundary statistics are therefore read categorically (broke-at-N versus never-broke) rather than as a continuous quantity.

The 0.65 threshold and the null-means-never-broke convention are the two definitions a reader needs before the per-cell boundary matrix of Appendix C is legible. The threshold is config-driven, not chosen post-hoc to make the numbers land; the null convention is what lets the long_decode cells — which dominate the never-break set — be read as a qualitatively flat regime rather than as cells with an implausibly high numeric break-point.


25. Appendix C. Per-Cell Boundary Summary

This appendix is the per-cell substrate that the body sections sample from: where Section 10 reports the model-marginal efficiency curves and Section 12 reports the workload-marginal p95 boundary, this appendix reports the full 25-cell-per-backend (model × workload) boundary matrix at the two granularities the body could only summarize. Each cell carries its full efficiency_by_n curve (N=1→32), its first_n_below_0.65 break-point, and its p95x_at_n32 tail multiplier. The matrix is large — 50 cells across the two backends — so the appendix leads with an orientation sweep over the structure rather than fifty separate readings, and then points at the specific cells that anchor the body claims. There are no naked tables here either: the structural reading precedes the table and the Observations sweep follows it.

23.1 Boundary structure across the matrix

The boundary statistic that organizes the matrix is first_n_below_0.65, and its distribution across the 25 cells per backend is the cleanest single view of the workload-shaped boundary. On the vLLM side, of 25 cells, 5 never break (first_n_below_0.65 = null), 11 break at N=16, and 9 break at N=32. On the SGLang side, of 25 cells, 4 never break, 2 break as early as N=8, 5 break at N=16, and 14 break at N=32. The never-break set is workload-pure on both backends: it is dominated by the long_decode cells. On vLLM the five never-break cells are llama3.1-8b|long_decode, phi-4|long_decode, qwen2.5-14b|long_decode, qwen2.5-7b|long_decode, plus the one non-long_decode entry phi-4|balanced_2k; on SGLang the four never-break cells are exactly the four long_decode cells llama3.1-8b|long_decode, phi-4|long_decode, qwen2.5-14b|long_decode, qwen2.5-7b|long_decode. (All counts derived from boundary_matrix.{vllm,sglang}.)

The earliest-break cells are the load-bearing other end of the distribution. The only two cells anywhere in the matrix that break as early as N=8 are both on SGLang and both on long_prefill_8k: sglang."qwen2.5-7b|long_prefill_8k".first_n_below_0.65 = 8 and sglang."llama3.1-8b|long_prefill_8k".first_n_below_0.65 = 8. The same two cells carry the worst tail multipliers in the matrix: sglang."llama3.1-8b|long_prefill_8k".p95x_at_n32 = 4.4 and sglang."qwen2.5-7b|long_prefill_8k".p95x_at_n32 = 3.79. The flattest cell on each backend is a long_decode cell — vllm."phi-4|long_decode".p95x_at_n32 = 1.28 and sglang."qwen2.5-14b|long_decode".p95x_at_n32 = 1.3 — confirming that the never-break set and the flat-tail set are the same cells.

The table below reports first_n_below_0.65 per (model × workload) for both backends side by side, which is the most compact view of where each cell breaks. An en-dash in the break-point column denotes null (never breaks within N≤32).

Model × Workload vLLM break-N SGLang break-N
qwen2.5-7b · short_decode 16 32
qwen2.5-7b · balanced_2k 32 32
qwen2.5-7b · long_prefill_8k 16 8
qwen2.5-7b · long_decode
qwen2.5-7b · repeated_prefix 32 32
llama3.1-8b · short_decode 16 32
llama3.1-8b · balanced_2k 32 32
llama3.1-8b · long_prefill_8k 16 8
llama3.1-8b · long_decode
llama3.1-8b · repeated_prefix 32 32
mistral-7b · short_decode 16 16
mistral-7b · balanced_2k 16 16
mistral-7b · long_prefill_8k 16 16
mistral-7b · long_decode 16 32
mistral-7b · repeated_prefix 16 32
qwen2.5-14b · short_decode 32 32
qwen2.5-14b · balanced_2k 32 32
qwen2.5-14b · long_prefill_8k 16 16
qwen2.5-14b · long_decode
qwen2.5-14b · repeated_prefix 32 32
phi-4 · short_decode 32 32
phi-4 · balanced_2k 32
phi-4 · long_prefill_8k 16 16
phi-4 · long_decode
phi-4 · repeated_prefix 32 32

Observations. Three structural facts fall out of the break-point column. First, long_decode is the never-break workload on both backends — every long_decode cell except none of them carries an en-dash, and the never-break set is otherwise empty save for the single vllm."phi-4|balanced_2k" entry — which is the per-cell substrate behind the body claim that decode-heavy traffic stays in the linear-scaling regime through the entire tested concurrency range. Second, long_prefill_8k is the earliest-break workload, and SGLang breaks earlier on it than vLLM does: the two N=8 break-points in the entire matrix are both SGLang long_prefill_8k cells, while no vLLM long_prefill_8k cell breaks before N=16. This is the per-cell signature of the long_prefill_8k head-to-head edge that Section 13 reports as vLLM-favoring (Δ +0.0757 over the four non-Mistral models). Third, the Mistral row is the row to read with the Section 15 caveat in hand — Mistral's long_prefill_8k cell breaks at N=16 on both backends and looks symmetric in the break-point column, but that symmetry is an artifact of the SGLang refill (context 11264) masking the tokenizer asymmetry, and the cell is excluded from the long_prefill_8k head-to-head for exactly that reason.

The break-point column is the per-cell proof that the boundary is workload-shaped rather than uniform. A uniform boundary would put the same break-N in every row; instead the long_decode rows are all en-dashes (never break), the long_prefill_8k rows cluster at N=8–16 (earliest break), and the short_decode/balanced_2k/repeated_prefix rows sit at N=16–32 in between. The shape of this column is the V2-reframe thesis at the granularity of a single model on a single workload: continuous batching holds the linear-scaling regime on decode-heavy traffic and surrenders it earliest on prefill-heavy traffic.

23.2 The two granularities and how they relate

The per-cell boundary matrix of this appendix and the per-model efficiency curves of Section 10 are two marginals of the same underlying 50-cell substrate, and the appendix is the place to state how they relate so that a reader does not mistake one for the other. The Section 10 curves are the model-marginal: each per_model_efficiency_curves entry is a single model's efficiency-vs-N curve averaged over all five workloads, so qwen2.5-7b's N=32 value of 0.4997 on vLLM is a workload-mean. The Appendix C matrix is the model × workload joint: each boundary_matrix entry is a single model on a single workload, so the five qwen2.5-7b cells (qwen2.5-7b|short_decode through qwen2.5-7b|repeated_prefix) are the five constituents that average into that 0.4997. The model-marginal hides the workload spread; the joint exposes it.

The spread is large enough that the two granularities must not be conflated. For qwen2.5-7b on vLLM, the workload-mean N=32 efficiency is 0.4997, but the constituent cells range from qwen2.5-7b|long_decode at 0.681 (never breaks) down to qwen2.5-7b|long_prefill_8k at 0.3539 (breaks at N=16) — a spread of roughly 0.33 in efficiency at N=32 across the five workloads of a single model. The model-size gradient that Section 10 reports (14B-class retains more efficiency than 7–8B-class) is a real effect in the model-marginal, but it is smaller than the workload spread within any single model, which is why the body reports the workload-shaped boundary as the dominant structure and the model-size gradient as the secondary one.

Observations. The relationship between the two granularities is the reason the report has both a Section 10 and an Appendix C rather than one or the other. The model-marginal is the right granularity for the model-size gradient claim because that claim is about how a model's overall scaling depends on its parameter count; the joint is the right granularity for the workload-shaped boundary claim because that claim is about how a single model's scaling depends on the workload it serves. Reading the 0.4997 qwen2.5-7b workload-mean as if it described any individual workload would be the error the appendix exists to prevent — no qwen2.5-7b cell actually sits at 0.4997; the value is an average of cells ranging from 0.35 to 0.68.

The two granularities are not redundant; they answer different questions and the workload spread within a single model is larger than the model-size gradient across models. That ordering — workload effect dominant, model-size effect secondary — is the load-bearing reason the report's headline is "workload-shaped boundary" and not "model-size-shaped boundary," and it is only legible once the per-cell joint of this appendix is placed next to the model-marginal of Section 10.