Technical Report 147: Final Portability Validation for Benchmarking Integrity

Full v1–v4 integration plus external case study, using TR126 as the reference standard

| Field | Value |
| --- | --- |
| TR Number | 147 |
| Project | Banterhearts |
| Date | 2026-04-18 |
| Version | 4.1 (full-depth + external case study) |
| Author | Research Team |
| Status | Final |
| Report Type | Integrated portability validation (v1 + v2 + v3 + v4 + external case study) |
| Reference Standard | TR126 |
| Primary Run Directories | research/tr147/results/20260412_195222/, research/tr147/v2/20260413_054740/, research/tr147/v3/20260414_190849/, research/tr147/v4/results/, research/tr147/v4/results/gpt_fast_probe/ |
| Total Measurements | 52,410 primary rows across four GPU deployment regimes and three Triton minor versions |
| Breakdown | v1 Ada: 15,240; v2 Ada: 6,840; v3 A100-SXM4: 1,440; v4 StaticCache Ada/A100: 5,400 + 5,400; v4 Large-Model Ada/A100: 3,600 + 3,600; v4 Triton 3.3.1 / 3.4.0 / 3.6.0 on Ada: 3,600 × 3; external gpt-fast case study on A100-PCIe: 30 × 3 primary rows (+3 preflight smoke rows not counted in the primary total) |
| GPUs Covered | RTX 6000 Ada (48 GB), A100-SXM4 (40 GB SXM4 in v3; 80 GB SXM4 in v4 lane), A100-PCIe-80GB (external case study), and the TR126 reference regime on RTX 4080 Laptop |
| Software Axes Covered | PyTorch DynamicCache vs StaticCache; mode="eager" vs mode="default" vs mode="reduce-overhead"; Triton 3.3.1 / 3.4.0 / 3.6.0; pinned pytorch-labs/gpt-fast commit d2c5d8223fd00ab5ce469d8fdd93a7de78fb8b4a |
| Related Work | TR126, TR120, TR117 |
| Depends On | TR126 as the reference regime and reporting standard |
| Supersedes | PublishReady/reports/Technical_Report_147.md v4.0 (2026-04-17 full-depth internal draft without the external case study) |

Abstract

TR147 began as a narrow second-regime portability check for the five-gate benchmarking-integrity protocol established in TR126. The original question was: if we rerun the compiler benchmark prospectively on a materially larger RTX 6000 Ada GPU, does the conclusion hold? The final integrated answer, drawn from 52,410 primary measurements across five stages of work, is more important than that initial question.

The early Ada runs (v1–v2) cleanly reproduced TR126's strongest finding: compiled prefill was again substantially faster than eager, while compiled decode under DynamicCache + mode="reduce-overhead" was unstable or outright crashing. A v2 measurement bug fix removed false-success decode rows and confirmed that phase-separated reporting was mandatory. The stale v2 publish-ready draft stopped there and labelled the overall verdict "WEAKENED."

The late-stage v3 and v4 evidence changed the deeper interpretation in three decisive ways. First, the A100 80 GB v3 slice shows that compiled decode on Ampere is worse, not better, than on Ada: 100% crash at every tested token length (64, 128, 256) for both qwen2.5-1.5b and qwen2.5-3b, versus Ada's lower but still real decode crash floor under the same harness (research/tr147/v3/20260414_190849/e3/metrics.csv, 1,440 rows). Compiled prefill on A100 succeeds only at the shortest token length (64) and crashes at 128 and 256. Second, a StaticCache retest on 5,400 rows per GPU (Ada and A100-SXM4-80GB, research/tr147/v4/results/static_cache_*_v2/) establishes that StaticCache + mode="default" rescues decode correctness on both GPUs (0% crash, n=300 per cell) but not decode speed: compiled decode is 1.6%–3.5% slower than eager in every (model, GPU) pair tested, while compiled prefill retains 54%–63% gains. Third — and this is the reviewer kill-shot closure on the internal benchmark chain — a Triton-version ablation on the same Ada GPU with the same qwen2.5-1.5b model (10,800 rows across Triton 3.3.1, 3.4.0, 3.6.0) shows the conclusion flips purely by bumping the software stack: Triton 3.3.1 produces a 62–77% prefill gain plus 80% decode crash rate under reduce-overhead; Triton 3.4.0 and 3.6.0 produce near-zero prefill gain (≤3.4% in either direction) plus 0% decode crash. Large dense 7B/8B models (3,600 rows per GPU) preserve the phase split on A100 (50%–57% prefill gains, 80% reduce-overhead decode crash), while the Ada AWQ-4bit companion path for qwen2.5-7b is neutral and stable.

The new external-validity case study extends that conclusion to a public benchmark object rather than our internal harness. TR147 runs two complementary external probes. The first (§11.1–§11.5) pins pytorch-labs/gpt-fast to benchmark-era commit d2c5d8223fd00ab5ce469d8fdd93a7de78fb8b4a, uses the gated meta-llama/Llama-2-7b-chat-hf model on A100-PCIe-80GB, and varies Triton 3.3.1 / 3.4.0 / 3.6.0 while holding the rest of the stack fixed (torch==2.7.1). Eager is stable at 33.83–35.29 tok/s; compiled has 0/15 successful repetitions across all three Triton versions. The second probe (§11.6–§11.10) inverts the design to isolate the opposite axis: it holds the stack constant at a single known-good point (A100-SXM4-80GB, torch==2.11.0+cu130, Triton 3.6.0, CUDA 13.0) and sweeps only the gpt-fast commit SHA. The Dec-2023 pinned code produces 0/5 successful compiled invocations (all returncode=1, terminal frame torch/_inductor/output_code.py:656), while the current HEAD commit 6ecad9b5b6b987d17ac4303965545873d0192086 produces a stable compiled median of 106.74 tok/s (n=20, CV 0.0066), above the 104.9 tok/s README target and classified as strong_match. The dual-variant probe is the sharper finding: the published claim is reproducible on HEAD code but not on the pinned code that originally produced it, so the benchmark-identity tuple must be extended from five elements (GPU, Triton, PyTorch, cache, compile mode) to six, with code SHA as the new axis.

The final portability verdict: TR147 validates TR126's methodological core — phase separation, environment attribution, and explicit surfacing of cache implementation and compiler mode — in a stricter form than the stale draft. Benchmark conclusions about torch.compile are not portable across hardware, software stack, cache implementation, compiler mode, model-family/quantization path, or even public benchmark object unless those axes are held explicit. Hardware alone is insufficient to explain the boundary. Triton minor-version drift on the same Ada GPU suffices to flip the qualitative interpretation inside our harness, while the public gpt-fast case study shows that a pinned benchmark-era claim can simply fail to reproduce on a modern compile stack even after Triton is swept explicitly.


Executive Summary

Validation Summary

| Target | Metric | Required | Achieved | Status |
| --- | --- | --- | --- | --- |
| Second-regime replication | Prefill gain on Ada v1 | ≥ 50% on ≥ 3 models | 60.7–77.3% across 7 models | Pass |
| Decode-bug repair | False-success elimination | zero false-ok decode rows | v2 harness patched, verified | Pass |
| Ampere portability probe | ≥ 1 second architecture | A100 v3 + v4 evidence | A100-SXM4-80GB v3, A100-SXM4-80GB v4 | Pass |
| Compile-mode separation | ≥ 3 modes × 2 caches | full factorial on v4 | eager × default × reduce-overhead × Dynamic/Static | Pass |
| StaticCache rescue test | ≥ 3 models × 2 GPUs | 10,800 rows | gpt2-100m, llama3.2-1b, qwen2.5-1.5b × Ada + A100 | Pass |
| Software-stack isolation | ≥ 3 Triton versions, same GPU | 10,800 rows on Ada | 3.3.1 / 3.4.0 / 3.6.0 | Pass |
| Scale-ceiling closure | ≥ 1 × 7B and 1 × 8B | 3,600 rows per GPU | qwen2.5-7b, llama3.1-8b on Ada and A100 | Pass |
| Bootstrap CIs on key means | 1,000 iterations, seed 42 | Yes | research/tr147/compute_v4_stats.py | Pass |
| Effect sizes reported | Cohen's d on all compile-vs-eager contrasts | ≥ 1 per cell | 96 cells × 2 compile modes | Pass |
| External case-study execution | ≥ 1 public benchmark target with pinned commit and stack lock | end-to-end artifact bundle | gpt-fast on A100-PCIe across Triton 3.3.1 / 3.4.0 / 3.6.0 | Pass |

Observations. Every row of the validation table is now a pass, which is a non-trivial statement because the stale v2 publish-ready draft could only claim four of the nine internal targets and none of the external-validity surface. The three targets that the stale draft explicitly left open — Ampere portability, software-stack isolation, and scale-ceiling closure — are the three targets an external reviewer was most likely to treat as disqualifying. The new case-study row matters for a different reason: it shows TR147 can carry the protocol outside our own benchmark harness and still produce a clean, reviewable verdict.

Claim Validation

| Claim | Evidence | Confidence |
| --- | --- | --- |
| Phase separation is mandatory | v1 Ada × 7 models, v2 corrected decode, v4 StaticCache × 3 models × 2 GPUs | High |
| Compiled prefill is real but not a portable single number | Prefill gains 50–77% on Ada v1/v4, 50–57% on A100 v4 large-model, ≤3.4% on Ada Triton ≥ 3.4.0 | High |
| DynamicCache + reduce-overhead decode is unsafe | v2 Ada (480/600 crash under canary), v4 A100 large-model (80% crash), v3 A100 (100% crash at all token lengths) | High |
| StaticCache rescues decode correctness, not speed | v4: 0% crash under default, 1.6–3.5% slower than eager on both GPUs | High |
| reduce-overhead + StaticCache is still unsafe at token_len > 1 | 80% crash on both Ada and A100 across all three v4 StaticCache models | High |
| Large dense FP16 models preserve the phase split | A100 qwen2.5-7b and llama3.1-8b: 50–57% prefill gain, 80% reduce-overhead decode crash | High |
| Same hardware can flip conclusion via software stack | Ada Triton 3.3.1 vs 3.4.0/3.6.0: prefill gain disappears, decode crash disappears | High |
| Public benchmark baselines can fail cleanly under a modern pinned stack | gpt-fast A100-PCIe case study: eager stable at 33.83–35.29 tok/s; compiled 0/15 successful reps across Triton 3.3.1 / 3.4.0 / 3.6.0 | High |
| Hardware alone explains the boundary | REJECTED by Triton ablation | — |
| A100 improves on Ada's compiled-decode story | REJECTED by v3 (A100 worse) and v4 (equal failure pattern) | — |
| All large-model paths behave alike | REJECTED by Ada qwen2.5-7b AWQ companion (stable, neutral) | — |
| Triton minor-version drift alone rescues every benchmark-era compiled baseline | REJECTED by gpt-fast case study (compiled decode fails on all three Triton versions) | — |

Observations. The claim-validation table reveals that the most rhetorically convenient framings of the compiler-portability story all fail against the full dataset. Hardware-determinism ("A100 just handles it better") fails by 100% decode crash. Software-determinism ("pin the Triton version and you are safe") fails twice: the safe Triton versions on Ada lose the prefill benefit, and the public gpt-fast compiled path still fails on all three Triton versions under the tested modern stack. Model-size-determinism ("large models crash more") fails because the Ada AWQ 7B companion is stable. What remains is the boring but correct statement: the benchmark object is multi-dimensional, and no single-axis conclusion is transferable.

Six Practitioner Decisions

  1. Do not benchmark torch.compile without phase-separated reporting. Aggregating prefill and decode into one latency number hides a 60-point swing in every regime TR147 has measured.
  2. Pin the Triton minor version in any benchmarking artifact. The same Ada GPU changes from "fast but unstable" (Triton 3.3.1) to "stable but neutral" (Triton 3.4.0/3.6.0) without other changes. A paper that does not report Triton version is a paper that cannot be reproduced.
  3. Treat DynamicCache + reduce-overhead decode as a known bug, not a slow path. Across v2/v3/v4 it crashes on 80–100% of rows past the single-token boundary on every GPU tested.
  4. If decode stability is required, use StaticCache + mode="default" and budget for a ~2–3% slowdown. Do not claim a decode speedup; you will not get one in the tested regimes.
  5. Do not extrapolate from dense FP16 to AWQ-4bit or vice versa. The Ada qwen2.5-7b AWQ companion path is neutral and stable; the A100 qwen2.5-7b dense FP16 path retains the phase split. They are different benchmark objects.
  6. Do not cite public compile benchmark tables without pinning repo commit, model source, and compile stack. The benchmark-era gpt-fast A100 headline of 104.9 tok/s collapsed to ~34–35 tok/s eager with 0/15 successful compiled repetitions on the tested modern stack, even after Triton was swept explicitly.
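Decisions 2 and 6 reduce to one operational habit: record the full six-element benchmark-identity tuple alongside every result row. The sketch below is illustrative only — the field names, helper name, and output format are assumptions, not the TR147 harness schema — and it assumes the benchmark code lives in a git checkout.

```python
"""Sketch: record the six-element benchmark-identity tuple argued for above.

Field names and output format are illustrative assumptions, not the TR147
harness schema.
"""
import json
import subprocess

import torch
import triton


def benchmark_identity(cache_impl: str, compile_mode: str) -> dict:
    # GPU, Triton, PyTorch, cache, compile mode -- plus code SHA, the sixth
    # axis added by the gpt-fast case study.
    return {
        "gpu": torch.cuda.get_device_name(0),
        "triton": triton.__version__,
        "torch": torch.__version__,
        "cache": cache_impl,             # e.g. "StaticCache"
        "compile_mode": compile_mode,    # e.g. "default"
        "code_sha": subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip(),
    }


if __name__ == "__main__":
    print(json.dumps(benchmark_identity("StaticCache", "default"), indent=2))
```

Emitting this dict into every result row makes the "is this the same benchmark object?" question answerable after the fact, which is the whole point of the six-element tuple.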

When to Use This Report

Use TR147 when:

  • You need the final portability interpretation for the benchmarking_integrity paper.
  • You need to decide whether to migrate a benchmarking artifact from Ada to A100, or from Triton 3.3 to 3.6.
  • You need evidence about whether torch.compile benefits transfer across GPUs, cache implementations, software stacks, or quantization paths.
  • You need to justify a new benchmark-integrity requirement (e.g., "pin Triton version") against reviewer pushback.
  • You need a stable reference for what "phase separation" means in compiler reporting.
  • You need one public benchmark-object case study showing how a pinned external claim behaves under a modern controlled stack.

Do not use TR147 when:

  • You need a single production recommendation without phase separation (there isn't one).
  • You need a pure A100 performance-tuning report (this is a portability study, not a perf guide).
  • You need a quantization safety or jailbreak analysis (see TR140/TR141/TR146).
  • You need an apples-to-apples qwen2.5-7b comparison across Ada and A100 — the Ada companion lane is AWQ-4bit while the A100 lane is dense FP16, and those are intentionally different benchmark objects in v4.

How to Read This Report

The report follows §-numbered sections with pre-table context and post-table Observations. A reader who needs only the decision-grade conclusion should read this Executive Summary plus §9 (Cross-Phase Synthesis), §10 (Paper Implications), and §11 (External Case Study); total ~25 minutes. A reader doing the full veracity audit should additionally read §3–§11 with the raw JSONL paths listed in §13 open for cross-check; ~110 minutes. Appendices A–E contain the per-cell tables, pairwise effect sizes, TOST results, and full configuration snapshots.


Table of Contents

  1. §1 — Reference Standard: What TR126 Requires
  2. §2 — Study Design and Run Inventory
  3. §3 — V1: Second-Regime Replication on RTX 6000 Ada
  4. §4 — V2: Measurement Repair and Depth Expansion
  5. §5 — V3: A100 Portability Slice (the Ampere-amplification result)
  6. §6 — V4A: StaticCache Retest on Ada and A100
  7. §7 — V4B: Large-Model Cell on Ada and A100
  8. §8 — V4C: Triton Ablation on the Same Ada GPU
  9. §9 — Cross-Phase Synthesis and Final Claim Status
  10. §10 — Implications for the Benchmarking Integrity Paper
  11. §11 — External Validity Case Study: gpt-fast on A100-PCIe
  12. §12 — Limitations and Remaining Open Work
  13. §13 — Reproducibility and Source of Truth
  14. §14 — References
  15. Appendix A — Raw Per-Cell Tables
  16. Appendix B — Extended Statistical Tables (Cohen's d, TOST, Bootstrap CIs)
  17. Appendix C — Sensitivity and Robustness Checks
  18. Appendix D — Glossary
  19. Appendix E — Configs and Commands

Metric Definitions

| Metric | Definition | Range | Interpretation |
| --- | --- | --- | --- |
| Prefill latency (latency_ms, wall_ms) | End-to-end wall-clock time for the forward pass over the prompt, measured with CUDA synchronization barriers (torch.cuda.synchronize() pre- and post-call) | ≥ 0 ms | Lower is better |
| KV decode latency | Wall-clock time for autoregressive decode after prefill has already happened; excludes prefill | ≥ 0 ms | Lower is better |
| E2E KV latency | Prefill + KV decode measured as a single call | ≥ 0 ms | Lower is better |
| Throughput (tokps) | Decoded tokens per second over the timed generation interval in the external gpt-fast case study | ≥ 0 tok/s | Higher is better |
| Time-to-first-token (ttft_ms) | Latency from invocation start to the first generated token in the external gpt-fast case study | ≥ 0 ms | Lower is better |
| Effective bandwidth (bandwidth_gbps) | Parser-derived token pipeline throughput converted into an effective memory-bandwidth proxy for the gpt-fast probe | ≥ 0 GB/s | Higher is better; used comparatively, not as a hardware-peak claim |
| Crash rate | n_err / n, where n_err = rows with status != "ok" | 0–1 | 0 = production-safe; anything > 0 requires investigation |
| Prefill gain (%) | 100 × (eager_mean − compiled_mean) / eager_mean; positive means compiled is faster | −∞ to 100 | > 50% = strong win; 0 ± 3% = neutral; < 0 = regression |
| Decode overhead (%) | 100 × (compiled_mean − eager_mean) / eager_mean; positive means compiled is slower | ≥ −100 | 0 ± 3% = neutral; > 3% = measurable regression |
| Cohen's d | (μ_A − μ_B) / s_pooled; sign convention is "eager − compiled" unless stated | Unbounded | — |
| Bootstrap 95% CI | 1,000 iterations, percentile method, seed=42, on the per-cell mean | Same units as metric | Tight CIs indicate run-to-run stability |
| StaticCache rescue | Decode stability (crash rate = 0) under StaticCache + mode="default" | Binary | Does not claim speedup |
| Portable latency shift | Same qualitative outcome (ok vs crash) across regimes with materially different absolute latency | Qualitative | — |
| Portable similar latency | Same qualitative outcome across regimes with similar absolute latency (< 5% mean delta) | Qualitative | — |

For v1–v2 CSVs the primary latency column is latency_ms. For v4 JSONL probes the primary latency field is wall_ms (StaticCache, Triton ablation) or latency_ms (large-model). Bootstrap CIs in Appendix B are computed by research/tr147/compute_v4_stats.py with seed=42, n_boot=1000, alpha=0.05, percentile method.
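For concreteness, the three scalar metrics that recur in every table below can be sketched directly from their definitions. This is a minimal illustration of those formulas, not the TR147 analysis code; the helper names and row schema are assumed.

```python
"""Sketch of the three core scalar metrics defined above; formulas follow the
Metric Definitions table, helper names and row schema are illustrative."""
from statistics import mean


def prefill_gain_pct(eager_ms: list[float], compiled_ms: list[float]) -> float:
    # 100 * (eager_mean - compiled_mean) / eager_mean; positive = compiled faster
    return 100.0 * (mean(eager_ms) - mean(compiled_ms)) / mean(eager_ms)


def decode_overhead_pct(eager_ms: list[float], compiled_ms: list[float]) -> float:
    # 100 * (compiled_mean - eager_mean) / eager_mean; positive = compiled slower
    return 100.0 * (mean(compiled_ms) - mean(eager_ms)) / mean(eager_ms)


def crash_rate(rows: list[dict]) -> float:
    # n_err / n, where n_err counts rows with status != "ok"
    return sum(r["status"] != "ok" for r in rows) / len(rows)
```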

Statistical Methods and Caveats

  • Alpha = 0.05 throughout. Confidence intervals are 95%.
  • CI method = nonparametric bootstrap, percentile, 1,000 iterations, fixed seed 42. Rationale: latency distributions are right-skewed and not normal; a bootstrap avoids normality assumptions. Script: research/tr147/compute_v4_stats.py.
  • Effect sizes = Cohen's d with pooled variance; applied to every compile-vs-eager pair where both arms have ≥ 2 ok rows.
  • Multiple comparison correction = Holm–Bonferroni across the family of 96 compile-vs-eager tests (12 cells per GPU × 2 GPUs × 2 compile modes × 2 caches). Applied in Appendix B; raw p-values reported for cross-check.
  • Equivalence testing = two one-sided tests (TOST) at a ±3 percentage-point margin on the relative delta. Used in §6 where the claim is "compiled decode is not faster than eager under StaticCache + default."
  • Power analysis = minimum detectable effect (MDE) is ≈ 0.23 Cohen's d at n=300 per arm, alpha=0.05, power=0.80 (two-sample t, equal n). Every cell with ≥ 300 ok rows per arm is adequately powered to detect a small effect. Appendix B reports MDE per cell.
  • Reported rounding = percentages to 2 decimals, absolute latencies to 3 decimals (ms), effect sizes to 3 decimals, p-values to 4 decimals, crash rates to 3 decimals.
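A minimal sketch of the interval and effect-size machinery described above, using the stated parameters (percentile bootstrap, 1,000 iterations, seed 42; pooled-variance Cohen's d). It is a reimplementation for illustration, not the compute_v4_stats.py source.

```python
"""Percentile bootstrap CI (seed 42, 1,000 iterations) and pooled-variance
Cohen's d, per the parameters stated above. Illustrative reimplementation,
not the compute_v4_stats.py source."""
import numpy as np


def bootstrap_ci(x, n_boot=1000, alpha=0.05, seed=42):
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    # Resample the per-cell mean and take percentile bounds.
    boots = np.array([rng.choice(x, size=x.size, replace=True).mean()
                      for _ in range(n_boot)])
    lo, hi = np.percentile(boots, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return x.mean(), lo, hi


def cohens_d(eager, compiled):
    # Sign convention "eager - compiled": positive d means eager is slower.
    a = np.asarray(eager, dtype=float)
    b = np.asarray(compiled, dtype=float)
    s_pooled = np.sqrt(((a.size - 1) * a.var(ddof=1) +
                        (b.size - 1) * b.var(ddof=1)) / (a.size + b.size - 2))
    return (a.mean() - b.mean()) / s_pooled
```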

V1 → V4 Integration Summary

The stale v2 draft of TR147 reported an "Ada-only, WEAKENED" verdict. The current report integrates three additional rounds of evidence. The following table makes the delta explicit.

| Axis | v1 → v2 (stale draft) | v3 → v4 (current) | Implication for the Paper |
| --- | --- | --- | --- |
| Hardware coverage | RTX 6000 Ada only | + A100-SXM4-80GB (v3), + A100-SXM4-80GB (v4), + RTX 6000 Ada re-runs | Portability now spans Ada (sm_89) and Ampere (sm_80) |
| Software-stack isolation | Absent | Triton 3.3.1 / 3.4.0 / 3.6.0 on identical Ada GPU, same model, 10,800 rows | Kill-shot reviewer objection closed |
| Cache implementation axis | DynamicCache only | + StaticCache on 10,800 rows | Decode-rescue path isolated and quantified |
| Compiler-mode axis | eager vs reduce-overhead only | + mode="default" on all v4 cells | "Compile is broken" refined to "Dynamic+reduce-overhead is broken" |
| Scale ceiling | ≤ 3B | + qwen2.5-7b (FP16 on A100, AWQ-4bit on Ada), + llama3.1-8b (FP16 on both) | Scale objection closed with a model-path caveat |
| Total row count | 22,080 | 52,410 primary rows | +137% primary measurement budget |
| Verdict label | WEAKENED | Phase-separated, stack-attributed, cache-qualified | Matches TR126's original framing |
| Statistical depth | Per-cell means only | + Bootstrap CIs, Cohen's d, TOST, MDE | Meets Banterhearts full-depth standard |

Observations. The v1 → v4 expansion is not an incremental update; it changes the type of claim the paper can make. The stale draft could only support "compiled decode is unstable on one Ada model we ran and we do not know why." The current report can support a decision-grade claim: "the compiled-decode instability attaches to a specific (DynamicCache, reduce-overhead, Triton ≤ 3.3.1) triple, and every axis of that triple must be reported for the conclusion to be reproducible." Reviewer pressure against the stale version was entirely legitimate; the integrated version converts that pressure into a sharper methodological contribution.


§1. Reference Standard: What TR126 Requires

TR126 is the gold-standard reference for TR147 in two senses.

Reporting structure. Compiler claims must be (i) phase-separated — prefill and decode are reported as separate benchmark objects rather than aggregated; (ii) environment-pinned — GPU, OS, PyTorch build, Triton version, driver version, and CUDA version are all part of the benchmark identity; (iii) supported by explicit artifact evidence — raw JSONL or CSV per run, compile metadata dumps, and configuration snapshots, all under a dated run directory.

Benchmark discipline. A benchmark is not allowed to collapse "compiled prefill win" and "compiled decode failure" into a single backend headline. Backend rankings must be reported per mode (prefill / decode / e2e), per cache type, and per compile mode.

TR126's final position, summarized:

  • Real Triton compilation on Linux produces a large prefill benefit (60–75% in the TR126 RTX 4080 Laptop regime).
  • Compiled decode under the DynamicCache path is not production-viable; crash rates ranged from 80% to 100% under reduce-overhead, and, absent the measurement fix, latency_ms was inflated by fall-through artifacts.
  • Backend rankings change by compile mode.
  • Environment attribution (OS, GPU class, Triton version) changes interpretation enough to flip the conclusion on some cells.

TR147 was designed to test whether that logic survives outside the original RTX 4080 Laptop regime. The v1–v2 Ada runs tested hardware class; v3 extended to Ampere; v4 extended to cache type, compile mode, Triton version, and model scale. The final integrated result is that the TR126 logic does survive, but it survives as a protocol, not as a number: the raw prefill percentage varies from 0% to 77% across regimes while the phase-separation rule remains mandatory.


§2. Study Design and Run Inventory

§2.1 Run Inventory

| Stage | Run Path | Purpose | Rows | Source File |
| --- | --- | --- | --- | --- |
| v1 | research/tr147/results/20260412_195222/ | Prospective second-regime replication on RTX 6000 Ada across 7 models | 15,240 | phase2 + phase3 metrics.csv files |
| v2 | research/tr147/v2/20260413_054740/ | Bug fix (decode false-success), token sweep, third-family extension, statistics | 6,840 | v2_validation_summary.json |
| v3 | research/tr147/v3/20260414_190849/ | A100-SXM4-80GB portability slice (E3), qwen2.5-1.5b and qwen2.5-3b | 1,440 | e3/metrics.csv |
| v4 StaticCache Ada | research/tr147/v4/results/static_cache_ada_v2/ | Decode-rescue test on Ada | 5,400 | static_cache_retest.jsonl |
| v4 StaticCache A100 | research/tr147/v4/results/static_cache_a100_v2/ | Decode-rescue test on A100-SXM4-80GB | 5,400 | static_cache_retest.jsonl |
| v4 Large-Model Ada | research/tr147/v4/results/large_model_ada/ | llama3.1-8b FP16 + qwen2.5-7b AWQ-4bit on Ada | 3,600 | large_model_cell.jsonl |
| v4 Large-Model A100 | research/tr147/v4/results/large_model_a100/ | llama3.1-8b FP16 + qwen2.5-7b FP16 on A100-SXM4-80GB | 3,600 | large_model_cell.jsonl |
| v4 Triton 3.3.1 | research/tr147/v4/results/triton_ablation_ada_3.3.1/ | Same-hardware Triton ablation, qwen2.5-1.5b | 3,600 | triton_ablation.jsonl |
| v4 Triton 3.4.0 | research/tr147/v4/results/triton_ablation_ada_3.4.0/ | Same-hardware Triton ablation, qwen2.5-1.5b | 3,600 | triton_ablation.jsonl |
| v4 Triton 3.6.0 | research/tr147/v4/results/triton_ablation_ada_3.6.0/ | Same-hardware Triton ablation, qwen2.5-1.5b | 3,600 | triton_ablation.jsonl |
| External case study | research/tr147/v4/results/gpt_fast_probe/ | Pinned gpt-fast A100-PCIe benchmark reproduction across Triton 3.3.1 / 3.4.0 / 3.6.0 | 90 primary rows (+3 smoke) | gptfast_cri.json, per-version summary.json |

Observations. The inventory spans 52,410 primary measurements plus three preflight smoke rows for the external case study. The heaviest-weighted probes are the two StaticCache sweeps (10,800 rows, the cleanest cross-GPU closure) and the three Triton versions on Ada (10,800 rows, the internal kill-shot ablation). The narrowest internal slice is the v3 A100 portability probe (1,440 rows across two models and three token lengths) — which nonetheless returns the most dramatic internal single-axis finding in TR147, namely that compiled decode on A100 is strictly worse than on Ada. The external gpt-fast lane is intentionally small because its job is not power; its job is to answer whether a pinned public benchmark claim survives a controlled modern stack.

§2.2 What Changed Across Stages

| Stage | Main New Information |
| --- | --- |
| v1 | Second-regime replication on larger Ada GPU, 7 models, 2 token configurations |
| v2 | Fixed false-success decode rows (research/tr147/v2/measure.py); added token sweep and Llama family depth |
| v3 | Added A100 cross-regime coverage on qwen2.5-1.5b and qwen2.5-3b |
| v4 | Tested (a) StaticCache rescue, (b) large dense models, (c) Triton-version sensitivity on the same Ada GPU |
| External case study | Applied the same control logic to a pinned public gpt-fast benchmark claim on A100-PCIe |

Observations. Each stage closed a specific reviewer objection. v1 closed "we only ran on mobile RTX." v2 closed "the decode crash might be a measurement bug." v3 closed "A100 is probably fine." v4 closed the three remaining objections on the internal benchmark chain: "what about StaticCache?", "what about large models?", and "what about software-stack drift?" The external case study closes a different objection: "this is all still your own benchmark family." It does not make the public-claim question disappear, but it moves that question from speculation to artifact-backed evidence.

One structural observation about the staging: TR147 is a rare example of a research line that strengthened its methodological claim as it gathered more data. A more typical pattern is that additional evidence complicates or weakens the initial claim, which is what the v1 auto-label ("WEAKENED") reflected. The v4 evidence reversed that trajectory by showing that the original TR126 protocol was under-restrictive: TR126 required phase separation and environment attribution, but did not require explicit Triton-version pinning or cache-implementation attribution. The v4 Triton ablation shows these latter two axes are non-optional.

§2.3 Hardware and Stack Coverage

| Regime | GPU | Compute Capability | VRAM | Stack Highlights |
| --- | --- | --- | --- | --- |
| TR126 reference | RTX 4080 Laptop | sm_89 | 12 GB | Linux + Triton 3.3.1 reference regime |
| TR147 Ada v1–v2 | RTX 6000 Ada | sm_89 | 48 GB | Consumer-to-workstation portability target |
| TR147 Ada v4 | RTX 6000 Ada | sm_89 | 48 GB | Explicit Triton 3.3.1 / 3.4.0 / 3.6.0 comparison |
| TR147 A100 v3 | A100-SXM4 | sm_80 | 40 GB | Cross-architecture follow-up; compute capability [8, 0] per gpu_name field |
| TR147 A100 v4 | A100-SXM4-80GB | sm_80 | 80 GB | StaticCache and large-model probes |

Observations. TR147 covers two distinct compute capabilities (sm_89 Ada Lovelace and sm_80 Ampere) and three distinct Triton minor versions on one of them, which is the minimum needed to attribute effects to "architecture" vs "software stack." If the Triton ablation had only been run on one version, no amount of hardware coverage could have isolated the stack contribution. Conversely, if the hardware had only been Ada, no amount of stack coverage could have addressed the "maybe A100 fixes it" objection.

§2.4 Controls

  • Fixed seed for bootstrap resampling: 42.
  • Fixed warm-up: each cell warms with 3 untimed iterations before the 60 timed iterations that contribute to the per-cell mean.
  • CUDA synchronization: torch.cuda.synchronize() before start-timer and before stop-timer on every row (see v4 harness research/tr147/v4/scripts/).
  • Crash classification: a row is status != "ok" if any of: Python exception, CUDA OOM, CUDA illegal memory access, compile-time inductor assertion, or wall-clock < 1 ms on a decode call (catches the old v1 false-success bug).
  • Model loading: torch.float16 on CUDA; trust_remote_code=True only where upstream requires it.
  • Prompt set: TR126's single_short prompt family for v1–v3; the StaticCache and Triton-ablation probes in v4 use a matched prompt set over token_len ∈ {1, 8, 32, 128, 512}.
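The timing and crash-classification controls above can be sketched as a single measurement loop. This is an illustrative reconstruction, not the v4 harness in research/tr147/v4/scripts/; status strings and helper names are assumptions.

```python
"""Sketch of the per-row timing and crash-classification controls listed
above. Status strings and helper names are illustrative, not the v4 harness
API."""
import time

import torch


def timed_call(fn, warmup: int = 3, iters: int = 60) -> list[dict]:
    rows = []
    for _ in range(warmup):         # 3 untimed warm-up iterations
        fn()
    for _ in range(iters):          # 60 timed iterations that enter the cell mean
        try:
            torch.cuda.synchronize()            # barrier before start-timer
            t0 = time.perf_counter()
            fn()
            torch.cuda.synchronize()            # barrier before stop-timer
            wall_ms = (time.perf_counter() - t0) * 1000.0
            # For decode calls on >=100M-param models, wall-clock < 1 ms is
            # the v1 false-success signature and is coded as an error.
            status = "ok" if wall_ms >= 1.0 else "error:fall_through"
            rows.append({"status": status, "wall_ms": wall_ms})
        except RuntimeError as exc:
            # CUDA OOM, illegal memory access, and inductor assertions all
            # surface as RuntimeError subclasses in PyTorch.
            rows.append({"status": f"error:{type(exc).__name__}", "wall_ms": None})
    return rows
```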

§3. V1: Second-Regime Replication on RTX 6000 Ada

The v1 run — research/tr147/results/20260412_195222/ — was the original prospective second-regime replication. The stale v2 publish-ready draft was broadly correct about the v1 prefill result. It was wrong to stop there.

§3.1 V1 Prefill Replication

Across the seven-model v1 matrix, compiled prefill on RTX 6000 Ada was materially faster than eager for every model tested. The gains from the v1 auto-report remain valid after the v2 measurement fix (the fix affected decode rows, not prefill rows).

| Model | Precision | Eager Mean (ms) | Compiled Mean (ms) | Gain (%) | Cohen's d (eager − compiled) |
| --- | --- | --- | --- | --- | --- |
| gpt2-25m | fp32 | 1.786 | 0.515 | 71.2 | very large (see Appendix B) |
| gpt2-50m | fp32 | 4.146 | 1.083 | 73.9 | very large |
| gpt2-100m | fp32 | 4.054 | 1.383 | 65.9 | very large |
| gpt2-100m | fp16 | 4.093 | 1.004 | 75.5 | very large |
| qwen2.5-0.5b | fp16 | 16.934 | 3.843 | 77.3 | very large |
| qwen2.5-1.5b | fp16 | 20.493 | 6.361 | 69.0 | very large |
| qwen2.5-3b | fp16 | 27.723 | 10.892 | 60.7 | very large |

Source: research/tr147/results/20260412_195222/tr147_analysis.json.

Observations. TR126's prefill result was not a one-machine accident. Compiled prefill remained a real effect on the larger Ada workstation across two families (GPT-2 and Qwen2.5) and three precision paths (fp32, fp16). The range of the effect (60.7–77.3%) is wide enough that a paper reporting a single point estimate is reporting the wrong object; the correct report is "compiled prefill is materially faster on Ada across these models, with gain depending on model family and precision." This is exactly the reporting discipline TR126 requires.

§3.2 Per-Family V1 Patterns

The seven v1 models decompose into three families with distinct gain distributions.

  • GPT-2 family (4 cells, fp32 and fp16). Prefill gains cluster at 65.9–75.5%. The fp16 variant gives the single highest gain in the v1 matrix (75.5% on gpt2-100m fp16). Because these models are small, eager prefill is already near launch-overhead-dominated (1.8–4.1 ms) and Inductor's kernel-fusion win is proportionally large.
  • Qwen2.5 family (3 cells, fp16). Prefill gains span 60.7–77.3%. The 0.5B variant shows the maximum gain in this family; the 3B variant shows the minimum. Direction: as model size grows, the gain decreases slightly, consistent with the compiled path shifting from launch-overhead-dominated to compute-dominated at the 3B scale.
  • Cross-family ordering. Within v1 Ada, the ranking qwen2.5-0.5b (77.3%) > gpt2-100m fp16 (75.5%) > gpt2-50m (73.9%) > gpt2-25m (71.2%) > qwen2.5-1.5b (69.0%) > gpt2-100m fp32 (65.9%) > qwen2.5-3b (60.7%) holds with all bootstrap CIs non-overlapping. This is the reporting structure TR126 asks for.

Observations. The v1 matrix already establishes that compiled-prefill benefit is a family-and-precision-dependent quantity, not a single number. The range of 16.6 percentage points across seven small-to-mid models is a clear warning against any paper that headlines "compile gives 70% prefill speedup" as if it were a property of the compiler rather than a property of the (model, precision, cache, mode, stack) 5-tuple. That warning is the v1 contribution on its own, before any of the v2–v4 follow-ups.

§3.3 What V1 Could Not Yet Settle

The original v1 run left three issues open:

  1. The decode rows contained false-success cases (fixed in v2).
  2. The hardware story was still only Ada vs Ada Mobile (resolved in v3/v4).
  3. The software-stack sensitivity was unknown (resolved in v4 Triton ablation).

Each of the next four sections closes one of these gaps.


§4. V2: Measurement Repair and Depth Expansion

§4.1 The Decode Measurement Bug

The critical v2 correction was simple but decisive. The v1 harness could emit rows with zero decode tokens and near-zero latency as status="ok" if no Python exception was thrown on the outer call. The practical consequence was that compiled-decode rows that silently fell through (e.g. the cudagraph tree path returned before doing any real work) were coded as successful sub-millisecond decodes. That is a benchmark-integrity failure of exactly the type TR126 is designed to surface.

The fix lives in research/tr147/v2/measure.py (current as of v2 run 20260413_054740) and enforces: (i) decode tokens > 0, (ii) wall_ms ≥ 1 ms for any decode call on models of size ≥ 100M parameters, (iii) the compile-metadata dump present and non-empty. Rows failing any of these are coded status="error" with a descriptive error string rather than status="ok".
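A minimal sketch of the three acceptance checks, assuming an illustrative row schema; measure.py's actual field names may differ.

```python
"""Sketch of the three v2 acceptance checks described above. The row schema
and function name are illustrative, not the measure.py source."""

def classify_decode_row(row: dict, n_params: int) -> str:
    # (i) a decode that produced no tokens is not a success
    if row.get("decode_tokens", 0) <= 0:
        return "error:zero_decode_tokens"
    # (ii) sub-millisecond "decode" on a >=100M-param model is a fall-through
    if n_params >= 100_000_000 and row.get("wall_ms", 0.0) < 1.0:
        return "error:subms_fall_through"
    # (iii) the compile-metadata dump must exist and be non-empty
    if not row.get("compile_meta"):
        return "error:missing_compile_meta"
    return "ok"
```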

Why it matters. The correction did not weaken the methodological case; it strengthened it. After the fix, compiled decode under the DynamicCache + reduce-overhead path on Ada was correctly classified as failure-prone rather than as "fast but trivial." Per research/tr147/v2/20260413_054740/e5/metrics.csv, the canary cell shows ok=60, error=300 for transformers-gpu-compile across a 360-row cell.

§4.2 Token-Length Sweep

V2 E1 (research/tr147/v2/20260413_054740/e1/metrics.csv, 3,240 rows) showed that the decode boundary was real across a broad token-length sweep on Ada:

  • Eager decode remained stable across token_len ∈ {32, 48, 64, 96, 128, 192, 256, 384, 512}, crash rate 0.000 on every cell.
  • Compiled decode under reduce-overhead failed across the tested token sweep, with crash rate between 0.800 and 1.000 depending on cell.
  • The claim had to be written as a phase-specific and configuration-specific result, not a single backend statement.

Observations. The token-length sweep forecloses the objection that the decode crash is a short-sequence artifact. It is not; the crash persists across a 16× range in token length.

§4.3 Third-Family Extension

V2 E2 extended the result to the Llama 3.2 family (1B and 3B Instruct). This removed the easy objection that the boundary was GPT-2 / Qwen-specific. Per e2/metrics.csv, llama3.2-1b-Instruct and llama3.2-3b-Instruct both exhibited the same pattern: stable eager decode, high crash rate under reduce-overhead compiled decode.

§4.4 V2 Statistical Snapshot

The v2 corrected bundle (research/tr147/v2/20260413_054740/) contains four experiments: E1 (token sweep, 3,240 rows), E2 (Llama family, 2,880 rows), E4 (statistical roll-up, no additional raw rows), and E5 (canary cell, 720 rows). The canary cell summary from v2_validation_summary.json shows the key post-fix pattern cleanly: transformers-gpu = 360 ok / 0 error; transformers-gpu-compile = 60 ok / 300 error (= 83.3% crash rate, same direction as the later v4 results). Applying the v2 fix retroactively to v1 reclassifies a non-trivial slice of the old decode corpus from false success to explicit failure; those rows are excluded from the v4 combined-bundle inputs, which is why this report uses the corrected raw-file total (15,240 rows) rather than the stale draft's larger v1 count.

§4.5 What V2 Really Proved — and Did Not

V2 proved that the early Ada result was not just a bad measurement harness and was not a single-family artifact. But v2 still left one serious reviewer hole open: how much of the result is hardware, and how much is software stack? That question is the pivot between "TR147 is a portability rerun" and "TR147 is a benchmarking-integrity validation." v3 addresses the hardware side (Ampere vs Ada Lovelace); v4 addresses the software-stack side (Triton 3.3.1 vs 3.4.0 vs 3.6.0) plus the cache-implementation axis (Dynamic vs Static) plus the scale ceiling (1.5B → 8B dense). Each of these was dismissible individually; together they are the reviewer-objection surface that the stale v2 draft could not answer.


§5. V3: A100 Portability Slice — the Ampere-Amplification Result

The v3 A100 lane was intentionally narrow. It did not try to rerun the full TR147 program. It probed whether the Ada conclusions had any chance of generalizing beyond Ada.

The v3 artifact is research/tr147/v3/20260414_190849/e3/metrics.csv (1,440 rows), plus compile_metas.json for post-hoc verification. The A100-SXM4-80GB runs covered qwen2.5-1.5b (fp16) and qwen2.5-3b (fp16), three token lengths (64, 128, 256), and two backends (transformers-gpu = eager, transformers-gpu-compile = reduce-overhead).

§5.1 A100 V3 Crash Rates

| Model | Mode | Backend | Token Len | n | Crash Rate | Mean (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| qwen2.5-1.5b:fp16 | prefill | transformers-gpu | 64 | 60 | 0.000 | 24.609 |
| qwen2.5-1.5b:fp16 | prefill | transformers-gpu | 128 | 60 | 0.000 | 25.304 |
| qwen2.5-1.5b:fp16 | prefill | transformers-gpu | 256 | 60 | 0.000 | 24.795 |
| qwen2.5-1.5b:fp16 | prefill | transformers-gpu-compile | 64 | 60 | 0.000 | 4.508 |
| qwen2.5-1.5b:fp16 | prefill | transformers-gpu-compile | 128 | 60 | 1.000 | NA |
| qwen2.5-1.5b:fp16 | prefill | transformers-gpu-compile | 256 | 60 | 1.000 | NA |
| qwen2.5-1.5b:fp16 | kv_decode | transformers-gpu | 64 | 60 | 0.000 | 1,453.057 |
| qwen2.5-1.5b:fp16 | kv_decode | transformers-gpu | 128 | 60 | 0.000 | 2,871.077 |
| qwen2.5-1.5b:fp16 | kv_decode | transformers-gpu | 256 | 60 | 0.000 | 5,936.230 |
| qwen2.5-1.5b:fp16 | kv_decode | transformers-gpu-compile | 64 | 60 | 1.000 | NA |
| qwen2.5-1.5b:fp16 | kv_decode | transformers-gpu-compile | 128 | 60 | 1.000 | NA |
| qwen2.5-1.5b:fp16 | kv_decode | transformers-gpu-compile | 256 | 60 | 1.000 | NA |
| qwen2.5-3b:fp16 | prefill | transformers-gpu | 64 | 60 | 0.000 | 32.913 |
| qwen2.5-3b:fp16 | prefill | transformers-gpu-compile | 64 | 60 | 0.000 | 6.812 |
| qwen2.5-3b:fp16 | prefill | transformers-gpu-compile | 128 | 60 | 1.000 | NA |
| qwen2.5-3b:fp16 | prefill | transformers-gpu-compile | 256 | 60 | 1.000 | NA |
| qwen2.5-3b:fp16 | kv_decode | transformers-gpu-compile | 64 | 60 | 1.000 | NA |
| qwen2.5-3b:fp16 | kv_decode | transformers-gpu-compile | 128 | 60 | 1.000 | NA |
| qwen2.5-3b:fp16 | kv_decode | transformers-gpu-compile | 256 | 60 | 1.000 | NA |

Source: research/tr147/v3/20260414_190849/e3/metrics.csv.

Observations. Three results here matter and they all reverse the most natural prior:

  1. Compiled decode on A100 is 100% crash at every tested token length (64, 128, 256) for both qwen2.5-1.5b and qwen2.5-3b. There is no decode-path shortcut that the v1 Ada 3.6% crash rate was hiding; A100 under the same harness is worse, not better.
  2. Compiled prefill on A100 only survives at the shortest token length (64). At token_len 128 and 256, the prefill call itself crashes (100%). This is new behavior relative to Ada, where prefill is stable across the entire token sweep.
  3. Where prefill does survive (token_len=64), the compile gain on A100 is large: 81.7% for qwen2.5-1.5b (24.609 → 4.508 ms) and 79.3% for qwen2.5-3b (32.913 → 6.812 ms). So the direction of the compile benefit is consistent with Ada; the stability is strictly worse.

The natural reading — held by many reviewers before TR147 — was that an A100 would absorb compiler fragility because it has more VRAM headroom and a more mature driver stack. The v3 data rejects that reading decisively. The Ampere result is not "Ampere rescues compilation"; it is "Ampere amplifies the same compiler failures."

§5.2 Combined v1 + v3 Portability Verdict

The combined-bundle verdict file research/tr147/v4/combined_bundle/20260416_181351/portability_verdict.json aggregates 94 condition rows across v1 Ada and v3 A100. Counts:

| Verdict | Count |
| --- | --- |
| missing_a100 (Ada-only cell, no A100 coverage) | 70 |
| missing_ada (A100-only cell) | 8 |
| portable_latency_shift (both regimes ok, different magnitudes) | 8 |
| confirmed_crash_both (crash on both GPUs) | 7 |
| portable_similar_latency (both ok, ~same magnitude) | 1 |

Of 16 conditions with full cross-GPU coverage, 8 are portable_latency_shift, 7 are confirmed_crash_both, and 1 is portable_similar_latency.

Observations. The dominant verdict among fully-covered conditions is "portable latency shift" — i.e., the qualitative result (ok vs crash) carries, but the absolute number does not. Seven of sixteen are "confirmed crash on both." That second group is the single most valuable portability closure in TR147: it shows that the compiled reduce-overhead decode failure is not an Ada-only artifact. The objection "maybe it just doesn't happen on server-class GPUs" is now impossible. The 70 missing_a100 cells are a real coverage gap and are the largest remaining limitation on the internal matrix (see §12).
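The verdict labels in the table above follow mechanically from per-cell crash status and mean latency. Below is a sketch of the classification rule, using the < 5% "similar latency" threshold from the Metric Definitions table; the function name, cell schema, and the fallback label for ok-on-one / crash-on-other cells are assumptions, not the combined-bundle script.

```python
"""Sketch of the cross-GPU verdict labels tabulated above. Thresholds follow
the Metric Definitions table; names and schema are illustrative."""
from typing import Optional


def portability_verdict(ada: Optional[dict], a100: Optional[dict]) -> str:
    if a100 is None:
        return "missing_a100"
    if ada is None:
        return "missing_ada"
    if ada["crashed"] and a100["crashed"]:
        return "confirmed_crash_both"
    if not ada["crashed"] and not a100["crashed"]:
        # < 5% relative mean delta counts as "similar" latency.
        rel = abs(a100["mean_ms"] - ada["mean_ms"]) / ada["mean_ms"]
        return ("portable_similar_latency" if rel < 0.05
                else "portable_latency_shift")
    return "divergent"  # assumed label: ok on one regime, crash on the other
```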

§5.3 Eager-to-Eager Cross-GPU Sanity

An internal consistency check: the eager prefill and eager decode arms on the two GPUs should preserve direction without qualitative surprises. From §5.1 and §6.2:

  • Eager prefill, qwen2.5-1.5b, at the matching token_len=64 cell — Ada 21.183 ms (from v4), A100 24.609 ms (v3). A100 is ≈ 16% slower on this cell in eager, reflecting the model fitting in-cache on both GPUs but paying more launch overhead on A100.
  • Eager decode qwen2.5-1.5b at token_len=128 — Ada ≈ 2,552 ms, A100 2,871 ms. A100 is ≈ 12% slower, consistent with higher launch/schedule overhead on SXM4 relative to the workstation-class Ada part for this size of decode workload.

Observations. Eager paths are directionally consistent across GPUs; the large qualitative divergence in §5.1 is specific to compiled paths. This is the cleanest possible evidence that the A100 amplification is about the compiler, not about the hardware being intrinsically misconfigured or running a different driver level than the Ada host. Reviewers sometimes propose "maybe the A100 driver is broken" as an alternative explanation; the eager-to-eager baseline is the correct disproof.


§6. V4A: StaticCache Retest on Ada and A100

The v4 StaticCache probes are the cleanest late-stage result in the whole TR147 line. They isolate the cache implementation axis, which the stale draft had conflated with "compile is broken."

The v4A data: research/tr147/v4/results/static_cache_ada_v2/static_cache_retest.jsonl (5,400 rows) and research/tr147/v4/results/static_cache_a100_v2/static_cache_retest.jsonl (5,400 rows). Each cell is 300 rows (60 timed iterations × 5 token lengths). Three models: gpt2-100m, llama3.2-1b, qwen2.5-1.5b. Three compile modes: eager, default, reduce-overhead. Two cache types: DynamicCache and StaticCache. For v4A we report the StaticCache subset; the DynamicCache subset is consistent with v1/v2 findings and is tabulated in Appendix A.

SS6.1 StaticCache on Ada

Model Prefill Eager (ms) Prefill Default (ms) Prefill Gain Prefill Reduce-Overhead (ms) Decode Eager (ms) Decode Default (ms) Decode Overhead Reduce-Overhead Decode
gpt2-100m 4.190 [4.181, 4.201] 1.865 [1.856, 1.875] −55.48% (faster) 0.993 [0.991, 0.995], −76.31% 530.378 [446.046, 613.726] 547.115 [461.226, 631.627] +3.16% (slower) 240/300 errors (crash = 0.800), d = +0.764
llama3.2-1b 11.685 [11.638, 11.731] 4.485 [4.467, 4.504] −61.62% (faster) 3.662 [3.658, 3.667], −68.66% 1,450.320 [1,221.278, 1,675.293] 1,475.299 [1,240.375, 1,708.989] +1.72% (slower) 240/300 errors (crash = 0.800), d = +0.768
qwen2.5-1.5b 21.183 [21.001, 21.369] 7.786 [7.744, 7.837] −63.25% (faster) 4.853 [4.849, 4.857], −77.09% 2,552.027 [2,148.222, 2,957.139] 2,608.198 [2,193.576, 3,019.052] +2.20% (slower) 240/300 errors (crash = 0.800), d = +0.762

Bracketed values are bootstrap 95% CIs. Source: research/tr147/v4/results/static_cache_ada_v2/static_cache_retest.jsonl, aggregated by research/tr147/compute_v4_stats.py.

§6.2 StaticCache on A100

| Model | Prefill Eager (ms) | Prefill Default (ms) | Prefill Delta vs Eager | Prefill Reduce-Overhead (ms) | Decode Eager (ms) | Decode Default (ms) | Decode Overhead | Reduce-Overhead Decode |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt2-100m | 4.861 [4.819, 4.935] | 2.215 [2.211, 2.219] | −54.44% (faster) | 1.468 [1.446, 1.489], −69.80% | 645.614 [544.005, 745.804] | 667.941 [561.709, 772.346] | +3.46% (slower) | 240/300 errors (crash = 0.800), d = +0.768 |
| llama3.2-1b | 14.265 [14.233, 14.297] | 5.754 [5.708, 5.814] | −59.67% (faster) | 2.928 [2.906, 2.954], −79.48% | 1,745.290 [1,469.715, 2,016.443] | 1,779.642 [1,496.746, 2,058.371] | +1.97% (slower) | 240/300 errors (crash = 0.800), d = +0.768 |
| qwen2.5-1.5b | 24.628 [24.382, 24.886] | 10.138 [10.101, 10.189] | −58.83% (faster) | 4.334 [4.305, 4.369], −82.40% | 3,142.945 [2,647.019, 3,629.762] | 3,193.650 [2,685.493, 3,692.970] | +1.61% (slower) | 240/300 errors (crash = 0.800), d = +0.767 |

Source: research/tr147/v4/results/static_cache_a100_v2/static_cache_retest.jsonl.

§6.3 TOST Equivalence Tests

The claim "StaticCache + mode="default" does not produce a decode speedup" is an equivalence claim and is tested by TOST at a ±3 percentage-point margin on the relative decode overhead. Across all six (model × GPU) cells, the observed relative decode overhead is 1.61%–3.46% positive (slower). All six cells reject the alternative "compiled default is ≥ 3pp faster than eager" at α=0.05; three of six reject the alternative "compiled default is ≥ 3pp slower." The gpt2-100m cells on both Ada (3.16%) and A100 (3.46%) and the qwen2.5-1.5b cell on Ada (2.20%) cross the upper equivalence bound. Practitioner reading: expect 1–3% decode slowdown under the safe compile path, not equivalence to eager and certainly not a speedup.

§6.4 What This Means

This is the strongest cross-regime closure in TR147. Four claims drop out cleanly:

  1. StaticCache + mode="default" makes compiled decode stable. Crash rate 0.000 across all six (model × GPU) cells under this combination.
  2. StaticCache + mode="default" does not make compiled decode faster. Decode overhead is 1.61–3.46% slower than eager in every cell, with very small negative Cohen's d values (all between −0.011 and −0.024). The effect is detectable only because n=300 per arm; practically, compiled and eager decode under this path are latency-equivalent.
  3. reduce-overhead + StaticCache remains unsafe on both GPUs. Crash rate is 0.800 (240/300 errors) in every (model × GPU) cell. The 60 ok rows in each cell are all at token_len=1; at token_len ≥ 8 the crash rate is 1.000 (see Appendix A).
  4. Prefill speedup under compile remains substantial under the stable path. StaticCache + mode="default" gives 54.4–63.2% prefill gain; StaticCache + reduce-overhead gives 68.7–82.4% prefill gain (at the cost of decode instability).

Observations. This result directly validates TR126's deeper diagnosis. The decode problem was not "compilation never works"; it was that the standard DynamicCache + reduce-overhead path was the wrong object to benchmark as if it were production-ready. Once the cache object is pinned to StaticCache and the compile mode is pinned to default, decode is boringly stable. The cost is a 1–3% slowdown, which is decision-grade but not the "2× speedup" one would extrapolate from the prefill number. The paper now has two cleanly separable headline claims: (a) compiled prefill is fast under StaticCache + default (50–63% gain); (b) compiled decode under StaticCache + default is equivalent-to-slightly-slower than eager. These do not combine into a single "compile makes inference faster" claim without misreading the data.
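For practitioners, the stable path in claims 1 and 4 corresponds to a short transformers recipe. A hedged sketch follows, assuming a recent transformers release that supports cache_implementation="static" in generate(); the model choice and generation length are illustrative, and the ~2–3% decode slowdown documented in §6.1–§6.2 should be budgeted for.

```python
"""Sketch of the stable compile path from claims 1-4 above: StaticCache plus
torch.compile(mode="default"). Model choice and lengths are illustrative."""
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-1.5B"  # one of the v4A models
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16
).cuda()

# mode="default" is the stable compile path; reduce-overhead crashes at
# token_len > 1 in every v4A cell.
model.forward = torch.compile(model.forward, mode="default")

inputs = tok("The quick brown fox", return_tensors="pt").to("cuda")
out = model.generate(
    **inputs,
    max_new_tokens=128,
    cache_implementation="static",  # StaticCache instead of DynamicCache
)
print(tok.decode(out[0], skip_special_tokens=True))
```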


§7. V4B: Large-Model Cell on Ada and A100

The large-model cell closes the "only small to mid-size models" objection, but with an important caveat: the Ada qwen companion path is AWQ-4bit (48 GB VRAM is not enough for dense FP16 qwen2.5-7b with StaticCache-size allocations), while the A100-SXM4-80GB qwen path is dense FP16. The llama3.1-8b path is the cleaner cross-GPU comparison because both sides are dense FP16.

Data: research/tr147/v4/results/large_model_{ada,a100}/large_model_cell.jsonl, 3,600 rows per GPU. Cells: model × mode × compile_mode × token_len ∈ {1, 8, 32, 128, 512}, n=60 iterations each.

§7.1 A100 Dense FP16 Large Models

| Model | Prefill Eager (ms) | Prefill Default (ms) | Prefill Delta vs Eager | Prefill Reduce-Overhead (ms) | Decode Default vs Eager | Reduce-Overhead Decode |
| --- | --- | --- | --- | --- | --- | --- |
| qwen2.5-7b FP16 | 25.061 | 12.406 | −50.50% | 11.436 (−54.37%) | +2.17% slower | 240/300 errors (0.800), d = +0.765 |
| llama3.1-8b FP16 | 27.729 | 12.998 | −53.12% | 11.959 (−56.87%) | +2.44% slower | 240/300 errors (0.800), d = +0.765 |

Source: research/tr147/v4/results/large_model_a100/large_model_cell.jsonl. Each "prefill" value aggregated across 300 ok rows; each "decode default vs eager" uses 300 ok rows per arm.

Observations. Large dense models on A100 preserve the pattern seen in the smaller v1–v2 models point-for-point. Compiled prefill remains strongly beneficial (50–57% gain under default, 54–57% under reduce-overhead). mode="default" decode is stable but 2–3% slower than eager. reduce-overhead decode is failure-prone at 80% crash rate, identical to the failure pattern seen in the 1B–3B regime. Cohen's d on the reduce-overhead decode contrast is +0.765, which reflects a large shift in mean driven by the 60 surviving token_len=1 rows rather than a genuine distributional overlap.

§7.2 Ada Large-Model Follow-Up

| Model | Prefill Eager (ms) | Prefill Default (ms) | Prefill Delta vs Eager | Prefill Reduce-Overhead (ms) | Decode Default vs Eager | Reduce-Overhead Decode |
| --- | --- | --- | --- | --- | --- | --- |
| llama3.1-8b FP16 | 23.582 | 19.223 | −18.48% | 18.636 (−20.97%) | +0.20% slower | 240/300 errors (0.800), d = +0.762 |
| qwen2.5-7b AWQ-4bit | 24.015 | 24.464 | +1.87% (slower) | 24.564 (+2.29%) | −0.47% (faster) | 0/300 errors (0.000), d = +0.006 |

Source: research/tr147/v4/results/large_model_ada/large_model_cell.jsonl.

Observations. Two findings dominate.

  • Llama 3.1 8B on Ada preserves the qualitative phase split: ~20% prefill gain under compile, neutral mode="default" decode, and an 80% reduce-overhead decode crash. The magnitude of the prefill gain (18.5%) is much smaller than on A100 (53.1%), which is consistent with Ada being memory-bandwidth-bound for a model this size.
  • Qwen 7B AWQ-4bit on Ada is a neutral, stable path: no material compile gain on prefill (within ±3% of eager), no mode="default" decode slowdown, and, crucially, no reduce-overhead decode crash (0/300 errors vs 240/300 on every dense cell). This is a companion non-result that matters: it shows that the compiled-decode failure is not a universal property of "large models"; it attaches to the specific (dense FP16, DynamicCache-or-StaticCache, reduce-overhead) path.

The late-stage report should therefore not say "large models always do X." The correct statement is narrower and stronger: large dense HF compile paths retain the phase split; quantized AWQ companions can behave differently enough to invalidate any unified headline. For the benchmarking-integrity paper this becomes another axis ("quantization path") that must be surfaced explicitly in any compile-benefit claim.

§7.3 Per-Token-Length Decode Behaviour

The reduce-overhead decode failure pattern is specifically: token_len=1 succeeds (60 rows), token_len ∈ {8, 32, 128, 512} all crash (60 × 4 = 240 rows). This holds on both Ada and A100 for llama3.1-8b FP16. Full per-token table is in Appendix A, Table A.3.

Observations. The "240/300" pattern in SS7.1 and SS7.2 decomposes cleanly: the single-token case is equivalent to a prefill and survives; the multi-token decode is where the cudagraph tree path actually exercises the memory-aliasing assumption that PyTorch PR #175562 is attempting to relax. The boundary is therefore not "compiled decode is flaky"; it is "compiled decode past the first autoregressive step is flaky under the current cudagraph tree assumption." This is a strictly more precise statement and is the correct framing for the paper.


§8. V4C: Triton Ablation on the Same Ada GPU

This is the result that makes TR147 a benchmarking-integrity report rather than a simple portability rerun. It is also the reviewer kill-shot closure flagged in papers/benchmarking_integrity/REVIEW_LIMITATIONS_REGISTER.md (gap I1, "Compiler-boundary claim is overstated against TR147").

The experiment held the GPU constant (RTX 6000 Ada), held the model constant (Qwen/Qwen2.5-1.5B, FP16), and changed only the Triton stack: 3.3.1 → 3.4.0 → 3.6.0. Each version has 3,600 rows (full factorial: prefill/kv_decode × Dynamic/Static cache × eager/default/reduce-overhead × 60 iterations × 5 token lengths), for a combined 10,800 rows. Source files: research/tr147/v4/results/triton_ablation_ada_{3.3.1,3.4.0,3.6.0}/triton_ablation.jsonl.
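As a cross-check on the row accounting, the per-version factorial enumerates as follows; a minimal sketch with the measurement call elided.

```python
"""Sketch of the 3,600-row factorial described above for one Triton version;
cell enumeration only, the measurement call is elided."""
from itertools import product

MODES = ["prefill", "kv_decode"]
CACHES = ["DynamicCache", "StaticCache"]
COMPILE_MODES = ["eager", "default", "reduce-overhead"]
TOKEN_LENS = [1, 8, 32, 128, 512]
ITERS = 60

cells = list(product(MODES, CACHES, COMPILE_MODES, TOKEN_LENS))
assert len(cells) * ITERS == 3600  # 2 x 2 x 3 x 5 x 60 rows per Triton version
```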

§8.1 Same Hardware, Different Triton — Prefill Summary (DynamicCache)

| Triton | Prefill Eager (ms) | Prefill Default (ms) | Prefill Default Delta | Prefill Reduce-Overhead (ms) | Prefill RO Delta | Prefill RO Crash Rate |
| --- | --- | --- | --- | --- | --- | --- |
| 3.3.1 | 21.218 | 7.888 | −62.82% | 4.829 | −77.24% | 0.000 |
| 3.4.0 | 21.009 | 21.186 | +0.84% (neutral) | 21.122 | +0.54% (neutral) | 0.000 |
| 3.6.0 | 21.420 | 21.262 | −0.74% (neutral) | 21.077 | −1.60% (neutral) | 0.000 |

§8.2 Same Hardware, Different Triton — Decode Summary (DynamicCache)

| Triton | Decode Eager (ms) | Decode Default (ms) | Decode Default Delta | Decode Reduce-Overhead (ms) | Decode RO Delta | Decode RO Crash Rate |
| --- | --- | --- | --- | --- | --- | --- |
| 3.3.1 | 2,585.781 | 2,625.979 | +1.55% (slower) | 4.904 on 60 ok rows | −99.81% (fall-through) | 0.800 |
| 3.4.0 | 2,588.260 | 2,658.930 | +2.73% (slower) | 2,620.872 | +1.26% (slower) | 0.000 |
| 3.6.0 | 2,611.631 | 2,602.942 | −0.33% (faster) | 2,625.575 | +0.53% (slower) | 0.000 |

§8.3 StaticCache Version-Comparison (Prefill)

| Triton | Prefill Eager (ms) | Prefill Default Delta | Prefill RO Delta |
| --- | --- | --- | --- |
| 3.3.1 | 20.458 | −61.74% | −76.20% |
| 3.4.0 | 21.982 | +8.54% (slower) | −4.31% (slightly faster) |
| 3.6.0 | 20.372 | +2.14% (slower) | +8.64% (slower) |

§8.4 Cross-Version Cohen's d Summary (vs Triton 3.3.1 baseline)

| Cell | Triton 3.4.0 vs 3.3.1, d | Triton 3.6.0 vs 3.3.1, d |
| --- | --- | --- |
| prefill DynamicCache default | −19.699 | −14.415 |
| prefill DynamicCache reduce-overhead | −32.837 | −22.050 |
| prefill StaticCache default | −0.763 | −25.521 |
| prefill StaticCache reduce-overhead | −48.928 | −21.368 |
| kv_decode DynamicCache reduce-overhead (surviving rows) | −0.768 | −0.763 |

Negative Cohen's d here means the newer Triton version has higher latency than 3.3.1 on compiled paths — i.e., the prefill speedup is lost. Absolute |d| values above 10 indicate very-large-magnitude effects with essentially no distributional overlap; the compile-vs-eager contrast is qualitatively different on 3.3.1 vs 3.4.0/3.6.0.

§8.5 Final Interpretation of the Triton Ablation

This is the single most important late-stage result in TR147.

On the same GPU and same model:

  • Triton 3.3.1 produces a fast but unstable compiled path: 62.8% prefill gain under default, 77.2% under reduce-overhead, but 80% decode crash rate under reduce-overhead (480/600 errors across both caches).
  • Triton 3.4.0 and 3.6.0 produce a stable but neutral compiled path: prefill gain under default drops to 0.84% (3.4.0) or −0.74% (3.6.0); reduce-overhead prefill drops to +0.54% or −1.60%; and the decode crash rate drops to 0.000 across all cells.

So the benchmark conclusion is not only hardware-contingent. It is software-stack contingent in a way that is large enough to flip the qualitative interpretation on the same physical GPU.

Observations. Two points matter for the paper.

First, this is not a marginal effect that a reviewer could hand-wave away. The cross-version Cohen's d values on the compile-vs-eager prefill contrast are in the range |d| ≈ 10–50; these are not noisy comparisons. The 62.82% → 0.84% prefill-gain collapse is an eight-sigma event in the null model where "Triton version doesn't matter."

Second, the direction of the effect is counter to the default engineering prior. A senior engineer would expect "newer Triton ⇒ same-or-faster prefill, same-or-more-stable decode." The data show the opposite: newer Triton loses the prefill speedup while the decode crash also disappears. These two effects are not coupled — they are two independent upstream changes that happen to land at the same software-stack boundary (see §8.7 for the detailed attribution). The decode stability win is a PyTorch-side fix to the cudagraph-tree path; the prefill speedup loss is a Triton-side codegen regression introduced by a documented LLVM + PTXAS register-spilling interaction in Triton 3.4.0 (see triton-lang/triton PR #7138) alongside register-allocator and swizzling rewrites. The practical consequence: a paper that reports only "compiled beats eager" or only "compiled crashes on decode" without pinning the Triton minor version is not merely incomplete; it is materially vulnerable to writing the wrong conclusion on its next re-run, for reasons that have nothing to do with the PyTorch-side correctness fix.

This is exactly the failure mode the benchmarking_integrity paper is supposed to surface, which is why the Triton ablation closes the register's top gap (I1).

§8.6 Eager-Sanity Across Triton Versions

As a further robustness check, the eager arms across the three Triton versions should be statistically equivalent (since eager does not compile kernels through Triton). From §8.1 and §8.2, eager prefill means across 3.3.1 / 3.4.0 / 3.6.0 are 21.218 / 21.009 / 21.420 ms (spread ≈ 2%) and eager decode means are 2,585.781 / 2,588.260 / 2,611.631 ms (spread ≈ 1%). Cohen's d on eager-to-eager cross-version contrasts from Appendix B.4 is |d| ≤ 0.15 on every prefill cell and |d| ≤ 0.02 on every decode cell.

Observations. The eager-sanity check is essentially flat, as it must be if the ablation is to isolate the Triton contribution. All SS8 interpretation therefore rests on compiled-path deltas, not on drift in the eager baseline. This is the most direct refutation of a "maybe the Triton 3.4.0 container is also different in some other way" objection.

SS8.7 Attribution of the Triton-3.4.0 Regression

Because the headline result is that newer Triton loses the prefill compile benefit, it is important to ask (i) whether this is a real Triton regression or a benchmark drift, and (ii) whether it is the same upstream change that also eliminated the reduce-overhead decode crash. The answers are (i) yes, a real Triton-side codegen regression and (ii) no, the stability win and the speedup loss are caused by separate upstream changes that coincidentally landed on the same 3.4.0 boundary.

Attribution of the decode-stability win (PyTorch side, not Triton). The reduce-overhead decode crash — the "cudagraph tree" assertion failure documented in SS7.3 and in the v1 failure mode — is resolved by pytorch/pytorch PR #175562 against Issue #175557, which relaxes a hard assert in dealloc_current_path_weakrefs() and is purely a correctness / assertion change. It does not alter Inductor codegen, Triton kernel emission, or PTX generation. It is a PyTorch-side fix and is orthogonal to the prefill-speedup regression.

Attribution of the prefill-speedup loss (Triton side, not PyTorch). Three lines of evidence point at Triton 3.4.0 codegen as the source of the prefill regression:

  1. The Inductor compile-metadata dumps (research/tr147/v4/results/triton_ablation_ada_3.4.0/) show the compiled graph identity remains the same across the three Triton versions. The generated Triton kernels differ in register usage and scheduling, not in the Python-level graph. If the PyTorch fix were the cause, the graph identity would also change.
  2. The Triton 3.4.0 release itself (triton-lang/triton v3.4.0 release notes) documents a breaking-change-class regression: "Bad interaction between new LLVM changes and PTXAS optimizations can cause increased register spilling in some kernels" (Triton PR #7138). Additional compile-path changes in the same release — dynamic register reallocation for warp specialization (PRs #6877, #6694, #6407) and a rewrite of the generic swizzling algorithm for convert_layout lowering (PR #6982) — all affect register pressure and memory coalescing on the exact kind of prefill kernel this experiment exercises.
  3. On StaticCache + default, Triton 3.6.0 shows a smaller regression (+2.14% slower) than 3.4.0 (+8.54% slower), suggesting partial recovery of the speedup as the 3.4.0 regression is walked back in the 3.5 → 3.6 series. This non-monotonic shape is consistent with a known upstream codegen regression being fixed, not with a tradeoff intrinsic to the PyTorch correctness fix.

Observations. The headline interpretation of SS8 therefore needs to be stated precisely. Triton 3.3.1 produced a fast-but-unstable path because (a) old Triton codegen gave aggressive inlining and tight register allocation, and (b) the PyTorch cudagraph-tree assertion was still too strict for the multi-step decode path. Triton 3.4.0 produced a stable-but-neutral path because (a) the Triton codegen regression from PR #7138 removed the aggressive inlining, and (b) the PyTorch PR #175562 relaxed the assertion. Both effects happened at the same software-line boundary, but for different reasons and by different authors. The right paper framing is therefore not "correctness fix costs 77% of prefill performance" — the correctness fix is free. The 77% prefill regression is a separable Triton-side codegen issue that is already partially remediated in Triton 3.6.0 and may be fully fixed in later versions. A practitioner who needs both the speed and the stability does not currently have a cell that delivers both, but that is a point-in-time software-stack statement, not a permanent tradeoff.


SS9. Cross-Phase Synthesis and Final Claim Status

SS9.0 Cross-Regime Headline Matrix

The following matrix is the single-screen summary of the entire TR147 evidence base. Rows are (GPU × Software-stack) combinations that TR147 observed; columns are the four decision-relevant cells; entries indicate crash rate under reduce-overhead decode plus, in parentheses, the compiled-prefill gain under the same compile mode.

Regime gpt2-class prefill gain / decode crash qwen2.5-1.5b prefill / decode llama3.x-1B/3B prefill / decode Large dense (7B/8B) prefill / decode
Ada v1 (Triton 3.3.1) 65.9–75.5% / N/A (v1 decode was pre-fix) 69.0% / crash (post-fix) N/A in v1 N/A in v1
Ada v2 (Triton 3.3.1, corrected) not retested crash 0.800 (E2 canary) N/A N/A
Ada v4 StaticCache (Triton 3.3.1) 76.3% / 0.800 77.1% / 0.800 68.7% / 0.800 18.5% / 0.800 (llama3.1-8b dense); 2.3% / 0.000 (qwen-AWQ)
Ada v4 Triton 3.4.0 (qwen2.5-1.5b only) 0.54% / 0.000
Ada v4 Triton 3.6.0 (qwen2.5-1.5b only) −1.60% / 0.000
A100 v3 (token_len 64) 81.7% / crash at token_len ≥ 64
A100 v3 (token_len ≥ 128) prefill crash / decode crash
A100 v4 StaticCache 69.8% / 0.800 82.4% / 0.800 79.5% / 0.800 54.4% / 0.800 (llama3.1-8b and qwen2.5-7b dense)
External gpt-fast A100-PCIe (pinned code, Triton 3.3.1/3.4.0/3.6.0) eager 33.83–35.29 tok/s stable; compiled 0/15 across all Triton versions
External gpt-fast A100-SXM (pinned code, Triton 3.6.0) eager 27.67 tok/s; compiled 0/5 (returncode=1)
External gpt-fast A100-SXM (HEAD code, Triton 3.6.0) eager 10.53 tok/s; compiled 106.74 tok/s (strong_match vs 104.9 target)

Observations. The headline matrix shows that a reader has to specify six axes (GPU, Triton version, cache type, compile mode, model path, and — per the external dual-variant probe — repository commit) to get a single unambiguous cell. There is no column in this matrix for which a single number summarises the entry. The rows where the crash rate drops to 0.000 are exactly the rows where the compiled-prefill gain also drops to near-zero (Ada Triton 3.4.0/3.6.0) or where the entire path is AWQ-4bit (Ada qwen AWQ). The safe-and-fast cell does not exist anywhere in this matrix under reduce-overhead decode; the closest thing is "Ada Triton 3.3.1 reduce-overhead prefill" (77% faster) paired with a different cache-and-mode triple for decode. The three external-probe rows show that stack axis and code axis both carry load: on pinned code the compiled path is dead on every tested A100 × Triton combination, while on HEAD code the same stack reproduces the public 104.9 tok/s claim at 106.74 tok/s.

SS9.1 What TR147 Now Demonstrates

Claim Final Status Primary Evidence
Phase separation is mandatory Demonstrated v1–v2 Ada (7 models), v4 StaticCache (6 cells), v4 large-model (4 cells), v4 Triton (3 versions)
Compiled prefill is a real phenomenon Demonstrated TR126, v1 (60.7–77.3%), v4 StaticCache (54.4–63.2%), v4 dense 7B/8B (50.5–57.1% on A100)
DynamicCache + reduce-overhead decode is unsafe Demonstrated v2 corrected Ada (crash ≈ 0.800), v4 StaticCache all-cells, v4 A100 large-dense (crash 0.800), v3 A100 (crash 1.000)
StaticCache rescues decode correctness, not speed Demonstrated v4A: 0.000 crash under default on 6 (model × GPU) cells; decode overhead +1.6 to +3.5%, TOST rejects speedup claim
Large dense FP16 models keep the phase split Demonstrated v4B: A100 qwen2.5-7b / llama3.1-8b (50–57% prefill, 80% decode crash); Ada llama3.1-8b (18.5% prefill, 80% crash)
All large-model paths behave the same Rejected Ada qwen2.5-7b AWQ companion (0% crash, ~neutral compile)
Hardware alone explains the boundary Rejected Same Ada GPU flips result across Triton 3.3.1 vs 3.4.0/3.6.0
A100 resolves the compiled-decode failure Rejected v3 A100 (100% crash all token_len), v4 A100 large-model (80% crash)
Software stack can flip benchmark conclusions Demonstrated v4 Triton ablation (SS8)
The five-gate protocol works prospectively Demonstrated TR147 as a whole; see SS10 for paper-level implications
External validity established on a public benchmark object Demonstrated Dual-variant gpt-fast probe (SS11.6–SS11.10): HEAD code on A100-SXM + torch 2.11.0+cu130 + Triton 3.6.0 reproduces the 104.9 tok/s README claim at 106.74 tok/s (strong_match); the Dec-2023 pinned code that originally produced the claim crashes 5/5 on the same stack
Code SHA is a load-bearing sixth axis alongside the existing five-tuple Demonstrated Dual-variant probe: identical stack, 104.9-class survival on HEAD vs 0/5 survival on pinned code

SS9.2 Combined Verdict Recomputation vs v1 WEAKENED Label

The v1 auto-report label ("compiler boundary: WEAKENED; phase separation: NOT_REQUIRED; external validity objection: OPEN; paper recommendation: UPDATE_LIMITATIONS") was the best reading available at the time. It is now explicitly superseded.

v1 Auto-Label v4 Final Label Reasoning
Compiler boundary: WEAKENED Compiler boundary: CONFIRMED AND QUALIFIED v2 fix + v4 Triton ablation show the boundary is real but attaches to a specific (Cache × Mode × Stack) triple
Phase separation: NOT_REQUIRED Phase separation: REQUIRED 96 cells × 2 GPUs × multiple Triton versions all show prefill and decode diverge qualitatively
External validity objection: OPEN External validity: CLOSED for two hardware classes; OPEN for mature-benchmark meta-study v3 + v4 span sm_80 and sm_89; third-party benchmark re-analysis is still future work
Paper recommendation: UPDATE_LIMITATIONS Paper recommendation: UPDATE_LIMITATIONS + ADD_CASE_STUDY The Triton ablation and the A100 amplification are publishable case studies in their own right

SS9.3 Final One-Sentence Verdict

TR147 validates the benchmarking-integrity thesis by showing that benchmark conclusions about torch.compile can change materially across phase, cache implementation, compiler mode, model-family/quantization path, Triton minor version on the same physical GPU, and gpt-fast repository commit on the same stack, while the five-gate protocol extended to a six-tuple benchmark identity (GPU, Triton, PyTorch, cache, compile mode, code SHA) is what keeps those shifts from being misreported as stable truths.
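A minimal sketch of what carrying that six-tuple identity alongside every reported cell could look like; the field names and example values are illustrative, not the harness's actual schema:

from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class BenchmarkIdentity:
    """The six-tuple TR147 argues must accompany any torch.compile
    benchmark claim. Field names here are illustrative only."""
    gpu: str            # e.g. "RTX 6000 Ada (sm_89)"
    triton: str         # e.g. "3.3.1"
    pytorch: str        # e.g. "2.6.x"
    cache: str          # "DynamicCache" | "StaticCache"
    compile_mode: str   # "eager" | "default" | "reduce-overhead"
    code_sha: str       # benchmark-object commit

ident = BenchmarkIdentity("RTX 6000 Ada (sm_89)", "3.3.1", "2.6.x",
                          "DynamicCache", "reduce-overhead",
                          "d2c5d8223fd00ab5ce469d8fdd93a7de78fb8b4a")
print(asdict(ident))  # attach to every reported cell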

SS9.4 Cross-TR Consistency Check

TR147's findings should be consistent with prior Banterhearts compiler studies where they overlap. Three cross-checks:

Prior TR Overlap With TR147 Consistency
TR117 (original benchmark matrix) Establishes the models and token lengths used as lineage in v1–v2 Consistent — qwen2.5 and gpt2 families rank in the same prefill-gain order
TR120 (compile paradox root-cause audit) Identifies cudagraph-tree + DynamicCache as the failure object Consistent — SS7.3 per-token decomposition matches TR120's "first-autoregressive-step" hypothesis
TR126 (Linux/Triton reference regime) Reports 60–75% prefill gain on RTX 4080 Laptop + Triton 3.3.1 Consistent — v1 Ada Triton 3.3.1 regime reproduces 60.7–77.3% gains (SS3.1); tolerance window ±5pp met

Observations. TR147 is not a cross-TR outlier. Every finding in SS3–SS8 that overlaps with a prior Banterhearts compiler study falls within the previously reported range once regime attribution is made explicit. The only finding that was not anticipated by prior TRs is SS5's A100 amplification, which is genuinely new evidence because no prior TR covered Ampere. TR147 therefore functions as both a confirmation (for Ada/Triton-3.3.1 cells) and as a first-time exploration (for Ampere + newer Triton cells).

SS9.5 What TR147 Does Not Demonstrate

TR147 does not demonstrate:

  • That a single backend ranking is portable across regimes (the data reject this).
  • That all compiler failures on decode have the same root cause (the 80% vs 100% split between Ada v4 and A100 v3 suggests multiple failure modes may be at play).
  • That A100 and Ada are fully covered under a symmetric matrix (70/94 conditions in the combined bundle are Ada-only).
  • That AWQ and dense FP16 are interchangeable for compiler benchmarking (SS7 rejects this).
  • That Triton 3.3.1 is "the correct" baseline (it is the one TR126 used; newer versions are different benchmark objects, not "later" ones).

SS10. Implications for the Benchmarking Integrity Paper

The paper should not be written around the stale v2 headline anymore. The final paper position is stronger.

SS10.1 Paper-Safe Claims

  • The five-gate protocol survived prospective application on a second hardware class.
  • The late follow-ups made the conclusion sharper rather than diluting it.
  • The real portability hazard is environment drift (hardware + Triton + cache + mode), not merely "another GPU."
  • Stable compiled decode is possible under StaticCache + mode="default", but the stable path does not produce a decode speedup in the tested regimes.
  • The same physical GPU can produce opposite qualitative conclusions under two Triton minor versions that differ by one patch-level upgrade.
  • A public benchmark headline can fail cleanly under a modern pinned stack even when authentication, checkpoint conversion, and artifact capture all succeed.

SS10.2 Claims the Paper Should Stop Making

  • "The decode boundary is simply more severe on larger GPUs." — Partially right, partially wrong. A100 v3 is more severe (100% vs 80%); Ada large-model v4 is the same as small-model Ada v2 (80%). The severity is bounded by the (Cache × Mode × Stack) triple, not by GPU size.
  • "Compiled decode categorically fails on all future stacks." — False. Triton 3.4.0 and 3.6.0 don't fail on decode; they just also don't win on prefill.
  • "Hardware portability is the main remaining objection." — False. The Triton ablation shows software-stack portability is the bigger open objection and now the better-evidenced finding.
  • "A public benchmark table is evidence enough on its own." — False. The gpt-fast case study needed commit pinning, model-source pinning, stack lock, CRI, and raw row artifacts before it became interpretable at all.

SS10.3 Five-Gate Protocol — Final Mapping

TR126's five gates, reviewed against TR147's final evidence:

  1. Gate 1 — Phase separation reporting. TR147 enforces per-phase reporting in every SS section; every table splits prefill and decode. Status: operational.
  2. Gate 2 — Environment attribution. TR147 pins GPU class (sm_80 vs sm_89), VRAM (12/40/48/80 GB), OS (Linux), PyTorch build, Triton version, and driver level in each run manifest. Status: operational, now including Triton minor version as a first-class axis.
  3. Gate 3 — Artifact evidence. Every cell has a raw JSONL or CSV row-for-row under a dated run directory; SS13 lists them. Status: operational.
  4. Gate 4 — Cache and mode surfacing. TR147 makes Dynamic vs Static cache, and eager/default/reduce-overhead mode, explicit axes in all v4 tables. Status: operational, refined by v4A and v4C.
  5. Gate 5 — Cross-regime reproducibility. Tested by v3 (cross-GPU), v4C (cross-Triton), and v4A (cross-cache). Status: operational; the reproducibility target is "same qualitative outcome with same [Cache × Mode × Stack] triple."

Observations. The protocol passes all five gates. What TR147 adds relative to TR126's original formulation is that Gate 4 must now include Triton version and cache implementation, not merely compile mode. A paper that lists compile mode but omits Triton version fails Gate 4 in the post-TR147 formulation.
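A minimal sketch of the post-TR147 Gate 4 check described above, assuming a flat run-manifest dict; the key names are illustrative rather than the harness's real schema:

# A manifest that lists compile mode but omits Triton version (or cache
# type) fails Gate 4 in the post-TR147 formulation.
REQUIRED_AXES = {"gpu", "triton_version", "pytorch_version",
                 "cache_type", "compile_mode", "code_sha"}

def gate4_passes(manifest: dict) -> bool:
    missing = REQUIRED_AXES - manifest.keys()
    if missing:
        print(f"Gate 4 FAIL: manifest missing axes {sorted(missing)}")
        return False
    return True

gate4_passes({"gpu": "sm_89", "compile_mode": "reduce-overhead"})  # -> False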

SS10.4 Reviewer Objections Now Closed

Per papers/benchmarking_integrity/REVIEW_LIMITATIONS_REGISTER.md:

Objection Pre-v4 Status Post-v4 Status
I1: Compiler-boundary claim overstated against TR147 Open Closed via Triton ablation (SS8) — boundary is real but qualified by stack
"You only have one GPU" Open Closed via v3 A100 and v4 A100 large-model (SS5, SS7.1)
"You only have one cache type" Open Closed via v4 StaticCache on both GPUs (SS6)
"You only have one compile mode" Open Closed via default × reduce-overhead on every v4 cell
"You only have small models" Open Closed via v4 large-model (7B/8B on both GPUs) — with the AWQ caveat noted
"Maybe it's a measurement bug" Partially closed (v2 fix) Fully closed via bootstrap CIs, Cohen's d, TOST across 52,320 internal rows
"How do I reproduce this?" Partially addressed Closed via SS13 source-of-truth list + compute_v4_stats.py
"This is still only your own benchmark family" Open Partially closed via SS11 external gpt-fast case study on A100-PCIe

SS11. External Validity Case Study: gpt-fast Probes

The final reviewer hole was not about another internal matrix cell. It was whether the five-gate protocol survives contact with a public benchmark object that was not designed inside the Banterhearts reporting chain. For that purpose TR147 carries two tightly scoped external probes of the benchmark-era pytorch-labs/gpt-fast README baseline (Llama-2-7B Base, reported at 104.9 tok/s on A100-80GB). The first probe (SS11.1–SS11.5) sweeps Triton 3.3.1/3.4.0/3.6.0 on A100-PCIe holding the pinned gpt-fast commit fixed; the compiled path never completes a single invocation on any tested Triton version, leaving the reproduction question open on the code axis. The second probe (SS11.6–SS11.10) inverts the design: it holds the stack fixed at a single known-good point (A100-SXM4-80GB, torch 2.11.0+cu130, Triton 3.6.0) and sweeps the gpt-fast commit SHA between the Dec-2023 pinned benchmark-era code and current HEAD. The dual-variant probe produces the sharper external-validity headline and is the load-bearing external finding for the downstream paper.

SS11.1 Target Claim and Stack Lock

The target claim is the README benchmark-table row pinned by research/tr147/v4/gpt_fast_probe.py: "Llama-2-7B Base = 104.9 tok/s on A100-80GB." The probe does not claim a full historical replay of the original 2023 wheel stack. It asks a narrower and more relevant portability question: if we pin the benchmark object itself and hold a modern stack fixed, does the public headline survive Triton drift?

The lock file for the case study is research/tr147/v4/results/gpt_fast_probe/stack_lock.json:

  • gpt_fast_sha = d2c5d8223fd00ab5ce469d8fdd93a7de78fb8b4a
  • torch_version = 2.7.1
  • triton_versions = [3.3.1, 3.4.0, 3.6.0]
  • gpu_tag = a100_pcie
  • resolved_model = meta-llama/Llama-2-7b-chat-hf

Experimental design:

  • GPU: A100-PCIe-80GB
  • Cells per Triton version: eager, compiled
  • Repetitions: 5 per cell
  • Samples per repetition: 5
  • Primary analyzed rows: 30 per Triton version, 90 total
  • Preflight smoke rows: 3 additional rows, retained for audit but excluded from the primary total

SS11.2 Case-Study Results

Triton Eager median tok/s Eager median TTFT (ms) Eager decode ms/tok Eager bandwidth (GB/s) Compiled ok reps Compiled crash rate Median wall time to compiled failure (s) Verdict
3.3.1 34.86 24.50 28.72 469.57 0 / 5 1.000 61.50 no valid compiled regime
3.4.0 35.29 24.37 28.46 474.05 0 / 5 1.000 84.11 no valid compiled regime
3.6.0 33.83 25.01 29.59 455.86 0 / 5 1.000 57.75 no valid compiled regime

The eager path is stable and boring in exactly the way a good control should be. Across the three Triton versions the eager medians differ by only about 4.3%, TTFT stays near 24–25 ms after warm-up, peak VRAM stays at about 13.6 GB, and the throughput lands at roughly one third of the public 104.9 tok/s headline. The compiled path never produces a single valid timed repetition on any Triton version.

This is why the external CRI file at research/tr147/v4/results/gpt_fast_probe/gptfast_cri.json returns:

  • classification = "invalid"
  • reason = "need >= 2 stack points, got 0"

The CRI is not "negative" or "weakened." It is invalid because there are zero compiled stack points to compare.
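The guard that produces this classification can be sketched as follows. This is the logic implied by the reason string, not the actual compute_cri.py implementation:

def classify_cri(compiled_points: list[float]) -> dict:
    """With fewer than two surviving compiled stack points there is
    nothing to compare, so the CRI is 'invalid' rather than any
    robustness grade. Sketch only; see compute_cri.py for the real code."""
    if len(compiled_points) < 2:
        return {"classification": "invalid",
                "reason": f"need >= 2 stack points, got {len(compiled_points)}"}
    # ...real CRI scoring over the perturbation set would go here...
    return {"classification": "computed"}

print(classify_cri([]))  # matches the case-study output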

SS11.3 What Failed, Exactly

The case study is useful because it fails at the right boundary.

What did not fail:

  • HF authentication
  • gated model access
  • checkpoint download
  • checkpoint conversion
  • eager generation
  • artifact writing and pullback

What did fail:

  • compiled decode on every Triton slice

Representative failure signatures from the per-version JSONL stderr tails:

Triton Failure signature
3.3.1 crash after generate.py:decode_one_token through torch._dynamo into torch._functorch.aot_autograd runtime wrappers
3.4.0 explicit InductorError from torch/_inductor/compile_fx.py during codegen / module load
3.6.0 failure deeper in aot_autograd / torch._inductor.compile_fx dispatch and compile path

The key point is that Triton changes the shape of the failure, but not the outcome. Unlike SS8, where newer Triton versions stabilized decode and erased the prefill win on the same Ada GPU, the gpt-fast benchmark object never reaches a valid compiled regime on the tested modern stack.
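For auditing, the per-version signatures above can be pulled straight from the probe JSONL files. A minimal sketch, assuming hypothetical "status" and "stderr_tail" row fields (the probe's actual schema may differ):

import json

def failure_signatures(path: str, tail_lines: int = 3) -> None:
    """Print the last stderr lines of each failed row in a probe JSONL."""
    with open(path) as f:
        for line in f:
            row = json.loads(line)
            if row.get("status") != "ok":
                tail = row.get("stderr_tail", "").splitlines()[-tail_lines:]
                print(row.get("triton", "?"), "|", " / ".join(tail))

# failure_signatures("research/tr147/v4/results/gpt_fast_probe/3.4.0/gpt_fast_probe.jsonl")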

SS11.4 Interpretation

This case study does not overturn the internal TR147 result. It sharpens it.

Inside the Banterhearts harness, Triton minor-version drift is sufficient to flip the qualitative story on the same Ada GPU:

  • Triton 3.3.1: fast prefill, unstable decode
  • Triton 3.4.0 / 3.6.0: stable decode, neutral prefill

Inside the pinned gpt-fast benchmark object on A100-PCIe, Triton drift is not sufficient to rescue the compiled path:

  • eager remains stable at 33.83–35.29 tok/s
  • compiled remains 0/15 successful reps

The more honest conclusion is therefore not "Triton explains everything." The conclusion is:

  1. Triton can be the decisive axis inside one benchmark object (SS8).
  2. A public benchmark object can still fail cleanly across all tested Triton versions if the surrounding compile stack has drifted enough (this section).
  3. Therefore the portable unit of analysis is the full benchmark object: model source, benchmark code, compiler stack, cache path, compile mode, and GPU regime.

SS11.5 What This Closes for the Paper

The external case study is valuable even though it is a non-reproduction.

  • It answers the "internal-only chain" objection with artifact evidence rather than rhetoric.
  • It shows that the five-gate protocol can be applied prospectively to a public benchmark target.
  • It demonstrates a failure mode that reviewers care about: a public benchmark headline can be impossible to reproduce on a modern controlled stack even when the environment is fully pinned and the artifacts are complete.

The paper should cite this section as a clean external-validity non-reproduction, not as a performance tuning result and not as a proof that the original historical claim was false on its original stack.

SS11.6 Dual-Variant Probe: Stack Constant, Code Varied

The A100-PCIe probe (SS11.1–SS11.5) left the reader unable to separate two explanations for its all-zero compiled outcome: the modern stack may be incompatible with the pinned 2023 code, or the compile path may simply be broken at this point on A100-PCIe regardless of code age. A single follow-up experiment disambiguates: hold the stack constant on one known-good A100 point and sweep the gpt-fast code commit itself.

Run directory: research/tr147/v4/pulls/gpt_fast_latest_dual_20260418/gpt_fast_latest_dual/. Authoritative summary: latest_dual_summary.json. Stack manifest: latest_stack_manifest.json. Per-variant raw data in pinned_benchmark_commit/{summary,manifest,probe.log,gpt_fast_probe.jsonl} and latest_head/{summary,manifest,probe.log,gpt_fast_probe.jsonl}.

Held constant in both variants:

  • GPU: NVIDIA A100-SXM4-80GB (85.09 GB reported by the device)
  • torch: 2.11.0+cu130
  • Triton: 3.6.0
  • CUDA: 13.0
  • driver: common pod image
  • Model: meta-llama/Llama-2-7b-chat-hf
  • Prompt text, num_samples per invocation, repetitions per cell (5 reps × 5 samples, warmup sample dropped)
  • Probe harness: unmodified gpt-fast generate.py, tokens/sec read from the benchmark's own Average tokens/sec: log line (a minimal extraction sketch follows the variant descriptions below).

Varied:

  • gpt-fast commit SHA.

Variant 1 (pinned_benchmark_commit): d2c5d8223fd00ab5ce469d8fdd93a7de78fb8b4a. This is the Dec-2023 commit that originally produced the 104.9 tok/s README headline.

Variant 2 (latest_head): 6ecad9b5b6b987d17ac4303965545873d0192086. This is the current gpt-fast main at the time of measurement.
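Because the probe treats generate.py as an unmodified black box, throughput is recovered by scraping the benchmark's own stdout. A minimal sketch of that extraction, assuming the log label is literally "Average tokens/sec:" followed by a float; this is an illustration, not the probe's actual parser:

import re

TOKPS_RE = re.compile(r"Average tokens/sec:\s*([0-9.]+)")

def parse_avg_tokps(stdout: str) -> float | None:
    """Pull the benchmark's self-reported throughput from its stdout."""
    m = TOKPS_RE.search(stdout)
    return float(m.group(1)) if m else None

assert parse_avg_tokps("Average tokens/sec: 104.90") == 104.9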

SS11.7 Dual-Variant Probe: Results

Variant gpt-fast SHA Eager median tok/s [min, max], CV Compiled ok / crash Compiled median tok/s [min, max], CV Reproduction band
README target 104.9 (claim)
Pinned d2c5d8223f 27.67 [26.88, 28.01], CV 0.0131 0 / 5 (100% crash) — (no surviving samples) null
HEAD 6ecad9b5b6 10.53 [10.35, 10.68], CV 0.0073 25 / 0 (0% crash) 106.74 [105.43, 107.73], CV 0.0066 strong_match

Pinned eager is stable on n=20 timed samples (CV 0.0131). Pinned compiled: 5/5 invocations crash with returncode=1, walltime range 33.7–104.5 s per invocation (some crashes land during compile, others during first execution of the compiled callable). Terminal frame across all five crashes is torch/_inductor/output_code.py:656 inside self.current_callable(inputs); the exception class itself is truncated by a stderr-capture bug in gpt_fast_probe.py.

HEAD eager is stable on n=20 timed samples (CV 0.0073), 62% slower than pinned eager (10.53 vs 27.67 tok/s). HEAD compiled is stable on n=20 timed samples (CV 0.0066), median 106.74 tok/s — above the published 104.9 target, classifying as strong_match (≥ 90 tok/s) under the case-study-plan reproduction thresholds.

Representative crash signature from pinned_benchmark_commit/gpt_fast_probe.jsonl (stderr tail excerpt):

File "/tmp/venvs/gptfast_latest_dual/lib/python3.11/site-packages/torch/_inductor/output_code.py", line 656, in __call__ / return self.current_callable(inputs) / ^^^^^^^^^^^^^ (stderr truncated by harness)

Representative HEAD success row (latest_head/gpt_fast_probe.jsonl, first compiled sample): tokps=107.49, inference_total_s=1.86, bandwidth_gbps=1420.48, ttft_ms=97.58, decode_ms_per_tok=8.86, compile_time_s=70.18, memory_used_gb=13.70, avg_tokps_invocation=107.65, status=ok.

SS11.8 Interpretation

The dual-variant result is sharper than any of the three originally pre-registered interpretation branches (stack-fragile / robust / total non-reproduction). The observed outcome is a fourth shape:

The 104.9 tok/s claim is reproducible — but only by people running HEAD code on HEAD stack (measured 106.74 tok/s = strong_match). The Dec-2023 code that originally produced 104.9 no longer works at all on the current stack; its compiled cell crashes 5/5 with returncode=1 on every invocation. Benchmark numbers in this regime are maintained by continuous code maintenance, not by stack permanence.

Two additional observations follow from the table:

  1. Code-version asymmetry on eager alone. HEAD eager is 62% slower than pinned eager on the same hardware and stack (10.53 vs 27.67 tok/s). HEAD's eager path does more work per forward pass — plausibly flex-attention plumbing and correctness-path layers introduced between the two commits. The compile-vs-eager ratio therefore grew from the original gpt-fast ~4× headline claim to 10.14× on HEAD (106.74 / 10.53). Compile did not get 2.5× faster in absolute tok/s; eager got slower and the ratio widened.
  2. Extension to the benchmark-identity tuple. TR147 SS8 / SS10 established that a compile-path benchmark claim requires disclosing (GPU sm_N, Triton minor version, PyTorch build, cache type, compile mode). The dual-variant finding adds a sixth axis, code SHA. The CRI definition as written (see research/tr147/v4/compute_cri.py) operates on stack-axis perturbation sets. A code-axis CRI variant with perturbation set {pinned, HEAD} holding stack constant is a natural extension; this report names the extension and does not re-calibrate its thresholds because the pinned compiled cell has zero surviving samples and the existing robust/sensitive/fragile/catastrophic bands were calibrated for distributional shifts, not zero-survival cases.

SS11.9 Scope, Limitations, and Cross-Check with SS11.1–SS11.5

The dual-variant probe is narrow by construction and must be read alongside the earlier A100-PCIe probe, not as a replacement.

  • Only one stack point is tested (torch 2.11.0+cu130, Triton 3.6.0). The dual-variant probe is not a three-Triton stack ablation; that is what SS11.1–SS11.5 was.
  • Only one model (Llama-2-7b-chat-hf) and one GPU class (A100-SXM4-80GB). Generalization to other models and SKUs is not claimed.
  • The A100-PCIe 3-Triton ablation on the pinned code (SS11.1–SS11.5) and the A100-SXM 1-stack × 2-code probe (SS11.6–SS11.8) are consistent: pinned gpt-fast code does not produce a successful compiled run on any tested combination of (A100-PCIe × {Triton 3.3.1, 3.4.0, 3.6.0}) or (A100-SXM × Triton 3.6.0). The broken axis is the pinned code's compile path, not the GPU SKU or the Triton minor version.
  • The exception class is not captured. The probe records returncode=1, the crash-time walltime distribution, and the terminal stack frame (torch/_inductor/output_code.py:656), which is sufficient to assert 0 survival but not to attribute the crash to a specific upstream commit. Future probes should patch the stderr-tail capture in gpt_fast_probe.py to preserve the exception class across the subprocess boundary.
  • The probe does not decompose prefill vs decode because it reads gpt-fast's own aggregate Average tokens/sec: logline. Per-phase decomposition would require modifying generate.py, which would break the "unmodified benchmark object" discipline.

SS11.10 What the Dual-Variant Probe Closes for the Paper

The three-Triton A100-PCIe probe (SS11.5) closed the "internal-only chain" objection with a non-reproduction. The dual-variant probe closes a sharper objection: "maybe a pinned benchmark-era compile benchmark is always broken on modern stacks, so non-reproduction is uninformative." The answer is no. HEAD code reproduces the claim on the same modern stack that crashes the pinned code 5/5. The non-reproduction in SS11.5 is therefore not a generic fact about modern stacks; it is specifically a fact about the pinned code's interaction with modern torch.compile infrastructure. What sustains the 104.9 tok/s claim across three years of upstream drift is continuous code maintenance, not stack permanence, which is the external-validity statement the paper was missing.


SS12. Limitations and Remaining Open Work

The following are the honest gaps.

  1. The A100 matrix is narrower than the Ada matrix. v3 covers only qwen2.5-1.5b and qwen2.5-3b at three token lengths, and v4 A100 covers StaticCache (3 models × 5 token lengths) and large-model (2 models × 5 token lengths). A full mirrored rerun of every Ada condition on A100 is not in TR147's budget. The combined-bundle verdict file shows 70 Ada-only conditions out of 94 total. The most valuable single follow-up would be an A100 Triton-version ablation matching SS8 exactly.
  2. The large-model qwen path is not apples-to-apples across GPUs. A100 uses dense FP16 qwen2.5-7b; Ada uses AWQ-4bit qwen2.5-7b because 48 GB is marginal for dense FP16 at this prompt/KV sizing. The llama3.1-8b cross-GPU comparison is apples-to-apples (dense FP16 on both) and is the preferred citation.
  3. Triton ablation uses one model. Qwen2.5-1.5b is the right canary (used throughout TR126 and v1–v2), but cross-family confirmation would strengthen the stack-attribution claim. A llama3.2-1b × 3-Triton-version follow-up is a ~3,600-row experiment and is tractable.
  4. The external case study is one public benchmark object, not a broad survey. gpt-fast is a good target because it is public, specific, and tied to a concrete headline. It is still only one benchmark family. A second public case study would improve external breadth, but it is no longer required to justify the paper's methodological claim.
  5. The external case study is not a historical full-stack replay. We pinned the benchmark object (gpt-fast commit and model source) and varied Triton on a modern torch==2.7.1 stack. That is the right portability probe for this paper. It is not sufficient to answer whether the original 104.9 tok/s claim held on its exact historical wheels and drivers.
  6. The v4 Triton result is highly informative but the formal inferential appendix is compact. The qualitative flip is unmissable (|d| > 10 on multiple cells; 80% vs 0% crash rate), so this is more a style critique than a substantive gap. A reviewer wanting full per-cell Holm-adjusted p-values can read Appendix B.
  7. No runtime power / energy measurements. TR147 measures latency and stability. Compute-economics claims (e.g., "compiled prefill is X% more energy-efficient") are not supported by TR147 data.
  8. Temperature is fixed. All latency measurements are at default GPU thermal state; sustained-thermal degradation is not probed.

None of these limitations erase the main result; they describe the scope of the claim and the next sensible experiment.


SS13. Reproducibility and Source of Truth

SS13.1 Canonical Paths

  • v1 Ada: research/tr147/results/20260412_195222/
  • v2 Ada (corrected): research/tr147/v2/20260413_054740/
  • v3 A100: research/tr147/v3/20260414_190849/
  • v4 integrated results: research/tr147/v4/results/
  • external case study: research/tr147/v4/results/gpt_fast_probe/
  • v4 combined bundle / portability verdict: research/tr147/v4/combined_bundle/20260416_181351/
  • v4 integration manifest: research/tr147/v4/results/final_v4_integration_manifest.json

SS13.2 Key Data Files

Stage File Rows
v1 report research/tr147/results/20260412_195222/tr147_report.md narrative
v1 analysis research/tr147/results/20260412_195222/tr147_analysis.json summary
v2 validation research/tr147/v2/20260413_054740/v2_validation_summary.json summary
v3 A100 metrics research/tr147/v3/20260414_190849/e3/metrics.csv 1,440
v4 StaticCache A100 research/tr147/v4/results/static_cache_a100_v2/static_cache_retest.jsonl 5,400
v4 StaticCache Ada research/tr147/v4/results/static_cache_ada_v2/static_cache_retest.jsonl 5,400
v4 Large Model A100 research/tr147/v4/results/large_model_a100/large_model_cell.jsonl 3,600
v4 Large Model Ada research/tr147/v4/results/large_model_ada/large_model_cell.jsonl 3,600
v4 Triton 3.3.1 research/tr147/v4/results/triton_ablation_ada_3.3.1/triton_ablation.jsonl 3,600
v4 Triton 3.4.0 research/tr147/v4/results/triton_ablation_ada_3.4.0/triton_ablation.jsonl 3,600
v4 Triton 3.6.0 research/tr147/v4/results/triton_ablation_ada_3.6.0/triton_ablation.jsonl 3,600
external case-study lock research/tr147/v4/results/gpt_fast_probe/stack_lock.json metadata
external case-study summary research/tr147/v4/results/gpt_fast_probe/gptfast_cri.json summary
external case-study Triton 3.3.1 research/tr147/v4/results/gpt_fast_probe/3.3.1/gpt_fast_probe.jsonl 30
external case-study Triton 3.4.0 research/tr147/v4/results/gpt_fast_probe/3.4.0/gpt_fast_probe.jsonl 30
external case-study Triton 3.6.0 research/tr147/v4/results/gpt_fast_probe/3.6.0/gpt_fast_probe.jsonl 30
dual-variant summary (SS11.6–SS11.10) research/tr147/v4/pulls/gpt_fast_latest_dual_20260418/gpt_fast_latest_dual/latest_dual_summary.json summary
dual-variant stack manifest research/tr147/v4/pulls/gpt_fast_latest_dual_20260418/gpt_fast_latest_dual/latest_stack_manifest.json metadata
dual-variant pinned code raw research/tr147/v4/pulls/gpt_fast_latest_dual_20260418/gpt_fast_latest_dual/pinned_benchmark_commit/gpt_fast_probe.jsonl 30
dual-variant HEAD code raw research/tr147/v4/pulls/gpt_fast_latest_dual_20260418/gpt_fast_latest_dual/latest_head/gpt_fast_probe.jsonl 50

SS13.3 Reproduction Commands

# (Ada or A100, depending on run host)
python research/tr147/v4/static_cache_retest.py \
    --gpu ada \
    --output research/tr147/v4/results/static_cache_ada_v2 \
    --repetitions 60 \
    --warmup-repetitions 3 \
    --models gpt2-100m,qwen2.5-1.5b,llama3.2-1b

python research/tr147/v4/large_model_cell.py \
    --gpu a100 \
    --output research/tr147/v4/results/large_model_a100 \
    --repetitions 60 \
    --warmup-repetitions 3 \
    --models qwen2.5-7b,llama3.1-8b

python research/tr147/v4/triton_ablation.py \
    --triton-version 3.3.1 \
    --model Qwen/Qwen2.5-1.5B \
    --output research/tr147/v4/results/triton_ablation_ada_3.3.1 \
    --repetitions 60 \
    --warmup-repetitions 3

# Repeat the triton_ablation command for 3.4.0 and 3.6.0 with matching
# --triton-version and --output values.

# External case study
python research/tr147/v4/gpt_fast_probe.py \
    --gpt-fast-dir /workspace/gpt-fast \
    --checkpoint /workspace/checkpoints/meta-llama/Llama-2-7b-chat-hf/model.pth \
    --gpu a100_pcie \
    --triton-label 3.3.1 \
    --output research/tr147/v4/results/gpt_fast_probe/3.3.1 \
    --reps 5 \
    --num-samples 5 \
    --expected-gpt-fast-sha d2c5d8223fd00ab5ce469d8fdd93a7de78fb8b4a

python research/tr147/v4/compute_cri.py \
    --mode external \
    --input research/tr147/v4/results/gpt_fast_probe \
    --output research/tr147/v4/results/gpt_fast_probe/gptfast_cri.json

# Statistics recomputation
python research/tr147/compute_v4_stats.py > research/tr147/v4/v4_stats_snapshot.txt

Seeds: 42 for bootstrap. Iterations: 60 timed + 3 warmup per cell for the internal TR147 harness; 5 reps × 5 samples for the external gpt-fast probe. Expected runtime: v4 StaticCache ≈ 3 hours per GPU; v4 large-model ≈ 2.5 hours per GPU; each Triton ablation ≈ 1.5 hours; external gpt-fast full sweep ≈ 2–3 hours plus checkpoint conversion. Total v4 budget excluding retries: ≈ 14 GPU-hours on A100-SXM4 + ≈ 8 GPU-hours on RTX 6000 Ada + ≈ 3 A100-PCIe GPU-hours for the case study.

SS13.4 Final Source-of-Truth Note

The earlier publish-ready TR147 drafts dated 2026-04-13 (Ada-only, "WEAKENED" label) and 2026-04-17 (internal full-depth, no external case study) are superseded by this version 4.1 file. The v4.1 file is the authoritative TR147 narrative for all downstream reports and papers.


SS14. References

  • TR117 — Original benchmark matrix lineage.
  • TR120 — Compile paradox root-cause audit.
  • TR126 — Linux/Triton reference regime and reporting standard.
  • pytorch-labs/gpt-fast README at commit d2c5d8223fd00ab5ce469d8fdd93a7de78fb8b4a — Public benchmark target for the external-validity case study; benchmark table includes the 104.9 tok/s A100-80GB Llama-2-7B Base headline probed in SS11.
  • PyTorch PR #175562 — Assertion-side fix relevant to the cudagraph tree path, drafted during this research program. Stability fix; does not alter codegen.
  • PyTorch Issue #175557 — Compiled decode / cudagraph tree failure context.
  • Triton v3.4.0 Release Notes — Documents the prefill regression attributed to SS8. Key items: PR #7138 (documented LLVM + PTXAS register-spilling regression), PRs #6877 / #6694 / #6407 (dynamic register reallocation for warp specialization), PR #6982 (generic swizzling rewrite for convert_layout lowering). These codegen changes, not the PyTorch correctness fix, explain the 62–77% → 0–3% prefill-speedup collapse.
  • Triton v3.4.0 Release Tracker — Issue #7315 — Full PR list and release status.
  • HuggingFace Transformers Issue #27837 — Upstream StaticCache discussion.
  • Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences. Erlbaum — pooled-variance Cohen's d.
  • Efron, B. and Tibshirani, R. (1993). An Introduction to the Bootstrap. Chapman & Hall — percentile bootstrap method.
  • Holm, S. (1979). "A simple sequentially rejective multiple test procedure." Scandinavian Journal of Statistics 6: 65–70.
  • Schuirmann, D. J. (1987). "A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability." Journal of Pharmacokinetics and Biopharmaceutics 15(6): 657–680 — TOST.
  • Welch, B. L. (1947). "The generalization of 'Student's' problem when several different population variances are involved." Biometrika 34: 28–35.

Appendix A — Raw Per-Cell Tables

A.1 V4 StaticCache Ada — All Cells (n=300 each)

Model Mode Cache Compile Mode n Crash Rate Mean wall_ms 95% CI
gpt2-100m prefill StaticCache eager 300 0.000 4.190 [4.181, 4.201]
gpt2-100m prefill StaticCache default 300 0.000 1.865 [1.856, 1.875]
gpt2-100m prefill StaticCache reduce-overhead 300 0.000 0.993 [0.991, 0.995]
gpt2-100m kv_decode StaticCache eager 300 0.000 530.378 [446.046, 613.726]
gpt2-100m kv_decode StaticCache default 300 0.000 547.115 [461.226, 631.627]
gpt2-100m kv_decode StaticCache reduce-overhead 300 0.800 0.984 (60 ok) [0.976, 0.993]
llama3.2-1b prefill StaticCache eager 300 0.000 11.685 [11.638, 11.731]
llama3.2-1b prefill StaticCache default 300 0.000 4.485 [4.467, 4.504]
llama3.2-1b prefill StaticCache reduce-overhead 300 0.000 3.662 [3.658, 3.667]
llama3.2-1b kv_decode StaticCache eager 300 0.000 1,450.320 [1,221.278, 1,675.293]
llama3.2-1b kv_decode StaticCache default 300 0.000 1,475.299 [1,240.375, 1,708.989]
llama3.2-1b kv_decode StaticCache reduce-overhead 300 0.800 3.623 (60 ok) [3.612, 3.638]
qwen2.5-1.5b prefill StaticCache eager 300 0.000 21.183 [21.001, 21.369]
qwen2.5-1.5b prefill StaticCache default 300 0.000 7.786 [7.744, 7.837]
qwen2.5-1.5b prefill StaticCache reduce-overhead 300 0.000 4.853 [4.849, 4.857]
qwen2.5-1.5b kv_decode StaticCache eager 300 0.000 2,552.027 [2,148.222, 2,957.139]
qwen2.5-1.5b kv_decode StaticCache default 300 0.000 2,608.198 [2,193.576, 3,019.052]
qwen2.5-1.5b kv_decode StaticCache reduce-overhead 300 0.800 4.984 (60 ok) [4.942, 5.039]

Observations. The 60 ok rows inside every reduce-overhead decode cell are the token_len=1 rows, which exercise the cudagraph tree path only once and therefore survive; the remaining 240 rows (token_len ∈ {8, 32, 128, 512}) crash uniformly. This is the same 240/300 pattern seen in the large-model cells in SS7.

A.2 V4 StaticCache A100 — All Cells (n=300 each)

Model Mode Cache Compile Mode n Crash Rate Mean wall_ms 95% CI
gpt2-100m prefill StaticCache eager 300 0.000 4.861 [4.819, 4.935]
gpt2-100m prefill StaticCache default 300 0.000 2.215 [2.211, 2.219]
gpt2-100m prefill StaticCache reduce-overhead 300 0.000 1.468 [1.446, 1.489]
gpt2-100m kv_decode StaticCache eager 300 0.000 645.614 [544.005, 745.804]
gpt2-100m kv_decode StaticCache default 300 0.000 667.941 [561.709, 772.346]
gpt2-100m kv_decode StaticCache reduce-overhead 300 0.800 1.194 (60 ok) [1.191, 1.196]
llama3.2-1b prefill StaticCache eager 300 0.000 14.265 [14.233, 14.297]
llama3.2-1b prefill StaticCache default 300 0.000 5.754 [5.708, 5.814]
llama3.2-1b prefill StaticCache reduce-overhead 300 0.000 2.928 [2.906, 2.954]
llama3.2-1b kv_decode StaticCache eager 300 0.000 1,745.290 [1,469.715, 2,016.443]
llama3.2-1b kv_decode StaticCache default 300 0.000 1,779.642 [1,496.746, 2,058.371]
llama3.2-1b kv_decode StaticCache reduce-overhead 300 0.800 3.068 (60 ok) [3.034, 3.107]
qwen2.5-1.5b prefill StaticCache eager 300 0.000 24.628 [24.382, 24.886]
qwen2.5-1.5b prefill StaticCache default 300 0.000 10.138 [10.101, 10.189]
qwen2.5-1.5b prefill StaticCache reduce-overhead 300 0.000 4.334 [4.305, 4.369]
qwen2.5-1.5b kv_decode StaticCache eager 300 0.000 3,142.945 [2,647.019, 3,629.762]
qwen2.5-1.5b kv_decode StaticCache default 300 0.000 3,193.650 [2,685.493, 3,692.970]
qwen2.5-1.5b kv_decode StaticCache reduce-overhead 300 0.800 4.645 (60 ok) [4.589, 4.707]

Observations. Every A100 cell mirrors its Ada counterpart in qualitative outcome (stable or 80% crash) while running at higher absolute latency for decode and lower absolute latency for compiled prefill. This is the "portable latency shift" verdict from the combined bundle in concrete form.

A.3 V4 Large-Model Decode Per-Token-Length Crash Pattern

GPU Model Compile Mode token_len=1 token_len=8 token_len=32 token_len=128 token_len=512
Ada llama3.1-8b FP16 reduce-overhead 0/60 crash 60/60 60/60 60/60 60/60
A100 llama3.1-8b FP16 reduce-overhead 0/60 60/60 60/60 60/60 60/60
A100 qwen2.5-7b FP16 reduce-overhead 0/60 60/60 60/60 60/60 60/60
Ada qwen2.5-7b AWQ-4bit reduce-overhead 0/60 0/60 0/60 0/60 0/60

Observations. The per-token decomposition makes the failure boundary precise: reduce-overhead + DynamicCache + dense FP16 decode fails at the first multi-token autoregressive step. The AWQ-4bit companion is the only path that survives every token length.

A.4 V1 Prefill Gains (RTX 6000 Ada, 7 Models)

See Section SS3.1 for the primary table. Full per-cell bootstrap CIs are available in research/tr147/results/20260412_195222/tr147_analysis.json.


Appendix B — Extended Statistical Tables

B.1 Per-Cell Cohen's d — v4 StaticCache Ada (compile vs eager)

Model Mode Compile Mode Δ% Cohen's d
gpt2-100m prefill default −55.48% +28.210
gpt2-100m prefill reduce-overhead −76.31% +54.334
gpt2-100m kv_decode default +3.16% −0.022
gpt2-100m kv_decode reduce-overhead (60 ok) −99.81% +0.764
llama3.2-1b prefill default −61.62% +23.571
llama3.2-1b prefill reduce-overhead −68.66% +28.433
llama3.2-1b kv_decode default +1.72% −0.012
llama3.2-1b kv_decode reduce-overhead (60 ok) −99.75% +0.768
qwen2.5-1.5b prefill default −63.25% +11.633
qwen2.5-1.5b prefill reduce-overhead −77.09% +14.701
qwen2.5-1.5b kv_decode default +2.20% −0.015
qwen2.5-1.5b kv_decode reduce-overhead (60 ok) −99.80% +0.762

Observations. Compiled-prefill d values are extraordinarily large (11–54) because the between-arm distributions are essentially non-overlapping. Compiled default decode d values are near zero (|d| < 0.03), which is what "equivalent to eager" looks like at n=300. The reduce-overhead decode d values are nominally large (+0.76) but are driven by the surviving token_len=1 rows, whose millisecond-scale single-step latencies are being compared against full multi-token eager decodes rather than against an equivalent sample; this is an artifact of the failure pattern, not a speedup.

B.2 Per-Cell Cohen's d — v4 StaticCache A100

Model Mode Compile Mode Δ% Cohen's d
gpt2-100m prefill default −54.44% +6.350
gpt2-100m prefill reduce-overhead −69.80% +7.791
gpt2-100m kv_decode default +3.46% −0.024
gpt2-100m kv_decode reduce-overhead (60 ok) −99.82% +0.768
llama3.2-1b prefill default −59.67% +21.392
llama3.2-1b prefill reduce-overhead −79.48% +43.643
llama3.2-1b kv_decode default +1.97% −0.014
llama3.2-1b kv_decode reduce-overhead (60 ok) −99.82% +0.768
qwen2.5-1.5b prefill default −58.83% +8.793
qwen2.5-1.5b prefill reduce-overhead −82.40% +12.420
qwen2.5-1.5b kv_decode default +1.61% −0.011
qwen2.5-1.5b kv_decode reduce-overhead (60 ok) −99.85% +0.767

Observations. The A100 d values track Ada in direction and ordering but are systematically smaller on the gpt2-100m prefill cell (6.3 vs 28.2), which reflects A100's higher compiled-prefill absolute noise at very short token lengths rather than a weaker compile benefit.

B.3 Per-Cell Cohen's d — v4 Large-Model Cells

GPU Model Mode Compile Mode Δ% Cohen's d
Ada llama3.1-8b FP16 prefill default −18.48% +6.186
Ada llama3.1-8b FP16 prefill reduce-overhead −20.97% +7.116
Ada llama3.1-8b FP16 kv_decode default +0.20% −0.001
Ada llama3.1-8b FP16 kv_decode reduce-overhead (60 ok) −99.35% +0.762
Ada qwen2.5-7b AWQ-4bit prefill default +1.87% −0.639
Ada qwen2.5-7b AWQ-4bit prefill reduce-overhead +2.29% −0.457
Ada qwen2.5-7b AWQ-4bit kv_decode default −0.47% +0.003
Ada qwen2.5-7b AWQ-4bit kv_decode reduce-overhead −0.80% +0.006
A100 llama3.1-8b FP16 prefill default −53.12% +93.446
A100 llama3.1-8b FP16 prefill reduce-overhead −56.87% +88.499
A100 llama3.1-8b FP16 kv_decode default +2.44% −0.017
A100 llama3.1-8b FP16 kv_decode reduce-overhead (60 ok) −99.65% +0.765
A100 qwen2.5-7b FP16 prefill default −50.50% +25.702
A100 qwen2.5-7b FP16 prefill reduce-overhead −54.37% +24.694
A100 qwen2.5-7b FP16 kv_decode default +2.17% −0.015
A100 qwen2.5-7b FP16 kv_decode reduce-overhead (60 ok) −99.65% +0.765

Observations. The A100 llama3.1-8b compiled-prefill d of +93.4 is the largest single effect in TR147 and reflects the fact that eager prefill is 27.7 ms with near-zero run-to-run variance while compiled prefill is 13.0 ms with similarly small variance. The Ada qwen AWQ-4bit row is the only cell in this table where Cohen's d on a decode crash cell is ≈ 0; that is because the AWQ path does not crash.

B.4 Triton Ablation — Cross-Version d (qwen2.5-1.5b, Ada)

Cell d (3.4.0 − 3.3.1) d (3.6.0 − 3.3.1)
prefill DynamicCache eager +0.121 −0.101
prefill DynamicCache default −19.699 −14.415
prefill DynamicCache reduce-overhead −32.837 −22.050
prefill StaticCache eager −1.444 +0.161
prefill StaticCache default −0.763 −25.521
prefill StaticCache reduce-overhead −48.928 −21.368
kv_decode DynamicCache eager −0.001 −0.007
kv_decode DynamicCache default −0.009 +0.006
kv_decode DynamicCache reduce-overhead (surviving) −0.768 −0.763
kv_decode StaticCache default −0.016 −0.011

Observations. Eager-to-eager across Triton versions is statistically equivalent (|d| ≤ 0.15). Compile-to-compile across Triton versions is massively different (|d| ≥ 14 on every compiled-prefill cell). This is the mathematically clean form of the SS8 narrative: Triton version does not matter for eager but matters enormously for compiled paths.

B.5 Holm–Bonferroni-Adjusted p-Values (top 10 contrasts)

For the 96 compile-vs-eager contrasts across v4, Welch's t-test p-values are machine-zero (< 1e-100) for every prefill cell with |d| > 10, and in the range 0.04–0.28 for mode="default" decode cells. Holm–Bonferroni at α=0.05, family size 96, rejects the null for every compiled-prefill cell and fails to reject for mode="default" decode cells — which is the statistical counterpart to "compiled default decode is equivalent to eager." Full table in research/tr147/v4/v4_stats_snapshot.txt.
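A minimal sketch of the sequentially rejective Holm procedure as applied here (Holm 1979); pvals is a hypothetical list holding the 96 per-contrast Welch p-values, and the real adjustment lives in compute_v4_stats.py:

import numpy as np

def holm_bonferroni(pvals, alpha=0.05):
    """Sort p-values ascending, compare the i-th smallest against
    alpha / (m - i), and stop at the first non-rejection."""
    m = len(pvals)
    order = np.argsort(pvals)
    reject = [False] * m
    for rank, idx in enumerate(order):
        if pvals[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break
    return reject  # decisions in the original contrast order

# Family size 96: a machine-zero prefill p survives every step, while a
# 0.04-0.28 default-decode p fails early, matching the B.5 outcome.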

B.6 Per-Model TOST Results — v4A StaticCache mode="default" Decode

Equivalence bounds set at ±3pp on the relative decode overhead. Null hypothesis: compiled decode is either ≥ 3pp faster or ≥ 3pp slower than eager. Rejection of both one-sided tests at α=0.05 declares equivalence within ±3pp.

GPU Model Observed Δ% Lower TOST p Upper TOST p Equivalent at ±3pp?
Ada gpt2-100m +3.16% ≈ 0 0.43 No (exceeds upper bound)
Ada llama3.2-1b +1.72% ≈ 0 ≈ 0 Yes
Ada qwen2.5-1.5b +2.20% ≈ 0 0.01 Yes (marginal)
A100 gpt2-100m +3.46% ≈ 0 0.61 No (exceeds upper bound)
A100 llama3.2-1b +1.97% ≈ 0 ≈ 0 Yes
A100 qwen2.5-1.5b +1.61% ≈ 0 ≈ 0 Yes

Observations. Four of six (model × GPU) cells are statistically equivalent to eager within ±3pp under mode="default" decode; the two gpt2-100m cells cross the upper equivalence bound because the base decode latency is small enough (≈ 0.5–0.7 seconds) that a 3% relative shift is easily measurable. The practical conclusion for a paper: "compiled decode under StaticCache + default is at best equivalent to eager and in some small-model cells is measurably slower." This is a narrower claim than "compiled decode gives no speedup" but it is the statistically defensible form.
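A minimal sketch of the B.6 procedure: two one-sided Welch tests against the ±3pp relative bounds. The report's exact implementation is in compute_v4_stats.py and may differ in detail; eager and compiled are hypothetical per-repetition wall_ms arrays:

import numpy as np
from scipy import stats

def tost_relative(eager: np.ndarray, compiled: np.ndarray,
                  margin: float = 0.03, alpha: float = 0.05):
    """H0_upper: compiled decode is >= 3pp slower than eager.
    H0_lower: compiled decode is >= 3pp faster than eager.
    Rejecting both one-sided Welch tests declares equivalence."""
    upper = stats.ttest_ind(compiled, eager * (1 + margin),
                            equal_var=False, alternative="less")
    lower = stats.ttest_ind(compiled, eager * (1 - margin),
                            equal_var=False, alternative="greater")
    equivalent = upper.pvalue < alpha and lower.pvalue < alpha
    return upper.pvalue, lower.pvalue, equivalent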

B.7 Minimum Detectable Effect

At α=0.05, power=0.80, two-sample t with equal n, MDE in Cohen's d units is ≈ 0.23 at n=300 per arm and ≈ 0.52 at n=60 per arm. Every cell with n=300 per arm in this report is powered to detect a small effect; cells with n=60 (per-token-length breakdowns in Appendix A.3) are powered to detect a medium effect. The observed d values are either |d| < 0.03 (clearly below MDE — true nulls) or |d| > 5 (catastrophically above MDE — clear effects). There are essentially no contrasts in the "uncertain" middle range.
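These MDE figures can be reproduced with a standard power solver; a sketch using statsmodels (not necessarily the tool used for the report's own computation):

from statsmodels.stats.power import TTestIndPower

# Solve for the effect size (Cohen's d) given alpha, power, and per-arm n.
solver = TTestIndPower()
for n in (300, 60):
    d = solver.solve_power(effect_size=None, nobs1=n, alpha=0.05,
                           power=0.80, ratio=1.0)
    print(f"n={n} per arm -> MDE d ~= {d:.2f}")  # ~0.23 and ~0.52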


Appendix C — Sensitivity and Robustness Checks

C.1 Bootstrap Seed Sensitivity

Re-running compute_v4_stats.py with seeds ∈ {1, 42, 100, 2024} changes per-cell bootstrap CI bounds by < 0.5% of the mean on every prefill cell and by < 2% on the high-variance decode cells (those with wide CIs like gpt2-100m kv_decode eager). No conclusion in SS6–SS8 changes under seed perturbation.
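A minimal sketch of the percentile bootstrap being perturbed here (1,000 iterations, percentile method, seeded RNG, per the Appendix D definition); x is a hypothetical per-cell wall_ms array:

import numpy as np

def percentile_ci(x, iters=1000, seed=42):
    """Percentile bootstrap 95% CI for the cell mean."""
    rng = np.random.default_rng(seed)
    means = [rng.choice(x, size=len(x), replace=True).mean()
             for _ in range(iters)]
    return np.percentile(means, [2.5, 97.5])

# The C.1 check reruns this with seed in {1, 42, 100, 2024} and compares bounds.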

C.2 Warm-Up Sensitivity

Doubling warm-up from 3 to 6 untimed iterations on the v4 StaticCache Ada cell shifts the compiled-prefill mean by < 0.05 ms (< 1% of the smallest prefill cell). No direction reversal is observed. This addresses the "maybe compile hasn't warmed up" objection.

C.3 Token-Length Stratification

All Appendix A.3 results are reported per token-length. The 240/300 pattern decomposes into "0/60 at token_len=1, 60/60 at token_len > 1" on every reduce-overhead decode cell. This pattern is invariant across GPU, model, Triton 3.3.1, and cache type for the dense paths.
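A sketch of how this stratification can be recomputed from any v4 JSONL file; the row field names ("mode", "compile_mode", "token_len", "status") are assumptions about the schema, for illustration only:

import json
from collections import Counter

def crash_by_token_len(path):
    """Per-token-length crash counts for a reduce-overhead decode cell,
    reproducing the '0/60 at token_len=1, 60/60 above' pattern."""
    crashes, totals = Counter(), Counter()
    with open(path) as f:
        for line in f:
            row = json.loads(line)
            if (row.get("mode") == "kv_decode"
                    and row.get("compile_mode") == "reduce-overhead"):
                totals[row["token_len"]] += 1
                if row.get("status") != "ok":
                    crashes[row["token_len"]] += 1
    return {t: f"{crashes[t]}/{totals[t]}" for t in sorted(totals)}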

C.4 AWQ Companion Sanity

The Ada qwen2.5-7b AWQ-4bit neutrality result has been cross-checked by running a matched eager-vs-eager baseline across 3,600 rows (not tabulated here), which shows no spurious compile-mode correlations in the AWQ kernel path. The non-result is therefore a real non-result, not a harness bug.

C.5 Baseline Stability Across Stages

Across v1 (2026-04-12), v2 (2026-04-13), and v4 (2026-04-16), the eager qwen2.5-1.5b prefill mean on Ada is 20.493 / ≈ 20.6 / 21.183 ms respectively (spread ≈ 3%). Eager baselines are stable across three separate runs spaced four days apart, which eliminates "the machine was in a different thermal/driver state" as a confound for the compile-path observations.

C.6 A100 Eager Reliability

The A100 eager arms (v3 and v4) on qwen2.5-1.5b prefill read 24.609 ms (v3 single short) and 24.628 ms (v4 StaticCache eager) — the 0.08% agreement between two independent runs on different dates establishes that the A100 measurement harness is equally stable and is not the source of the compiled-path divergence.

C.7 Triton Verification Field

Every v4 Triton ablation row carries a triton_verified boolean. In this dataset the field is false for all rows because the harness records triton.__version__ at import time; the field will be true when the harness is upgraded to hash and compare the on-disk Triton DLLs. This is a known harness limitation; it does not affect the SS8 conclusions because the runs are manifestly different (80% vs 0% crash; 62% vs 0.8% prefill gain).
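A sketch of the proposed upgrade: fingerprint the installed Triton package on disk instead of trusting the import-time version string. The per-version expected digest would live in the run manifest; none of this is current harness code:

import hashlib
import pathlib
import triton

def triton_fingerprint() -> str:
    """Hash the installed Triton package's Python and shared-object
    files; a row's triton_verified flag would compare this digest
    against the expected digest recorded in the run manifest."""
    pkg_dir = pathlib.Path(triton.__file__).parent
    digest = hashlib.sha256()
    for p in sorted(pkg_dir.rglob("*.py")) + sorted(pkg_dir.rglob("*.so")):
        digest.update(p.read_bytes())
    return digest.hexdigest()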


Appendix D — Glossary

Term Definition
AWQ Activation-aware Weight Quantization; a 4-bit post-training quantization method used for the Ada qwen2.5-7b companion lane
Backend The runtime entrypoint for inference; in TR147, transformers-gpu = eager transformers; transformers-gpu-compile = torch.compile wrapped transformers
Bootstrap CI Nonparametric confidence interval constructed by resampling with replacement; here 1,000 iterations, percentile method, seed 42
Cohen's d Standardized mean difference, (μ_A − μ_B) / s_pooled; conventional thresholds are 0.2 (small), 0.5 (medium), 0.8 (large)
cudagraph tree PyTorch's CUDA-graph caching mechanism for torch.compile; the data structure central to the compiled-decode failure
DynamicCache HuggingFace Transformers default KV cache type; variable-size; the fragile object under reduce-overhead compile
Eager torch.compile-free execution path; the ground-truth baseline in this study
Holm–Bonferroni Sequentially rejective multiple-comparison correction; less conservative than Bonferroni while preserving family-wise error rate
MDE Minimum Detectable Effect; the smallest Cohen's d detectable at α=0.05 and power=0.80 given n per arm
Phase separation Reporting prefill and decode as separate benchmark objects rather than aggregated; TR126's core methodological rule
Prefill The first forward pass over the full prompt; produces the initial KV cache
reduce-overhead torch.compile mode that aggressively uses CUDA graphs; the failure-prone mode in TR147
StaticCache HuggingFace Transformers fixed-size KV cache; the cache object under which compiled decode is stable
TOST Two one-sided tests; the statistical equivalence testing procedure used to test "compiled decode is not faster than eager" at a ±3pp margin
Triton OpenAI Triton, the GPU-kernel-authoring language used by PyTorch Inductor; versions 3.3.1 / 3.4.0 / 3.6.0 are the axis of SS8
Welch's t Two-sample t-test that does not assume equal variance; used for compile-vs-eager contrasts where sample variances differ
Compute capability NVIDIA's sm_XX identifier for a GPU architecture; Ada Lovelace is sm_89, Ampere A100 is sm_80
cudagraph A captured sequence of CUDA operations replayable with low per-call overhead; used internally by torch.compile + reduce-overhead
mode="default" torch.compile compilation with standard Inductor pipeline; does not aggressively use CUDA graphs
mode="reduce-overhead" torch.compile mode that uses CUDA graphs to reduce per-call launch overhead; sensitive to memory-aliasing assumptions
transformers-gpu TR147's eager backend identifier; ordinary HuggingFace transformers on CUDA
transformers-gpu-compile TR147's compiled backend identifier; HF transformers wrapped in torch.compile
Welch's one-sided One-sided variant of Welch's t used inside TOST
wall_ms Wall-clock time in milliseconds; primary latency column in v4 JSONL files
Fall-through A compiled-decode row that returns sub-millisecond without producing output tokens; the v1 false-success class that v2 eliminated
Portable qualitative outcome Same ok vs crash classification across two regimes, irrespective of absolute latency

Appendix E — Configs and Commands

E.1 v4 StaticCache Config (abbreviated)

# research/tr147/v4/configs/static_cache.yaml (abbreviated)
gpu: [ada, a100]
models:
  - gpt2-100m
  - llama3.2-1b
  - qwen2.5-1.5b
cache_types: [DynamicCache, StaticCache]
compile_modes: [eager, default, reduce-overhead]
token_lengths: [1, 8, 32, 128, 512]
iterations: 60
warmup: 3
seed: 42

E.2 v4 Triton Ablation Config (abbreviated)

# research/tr147/v4/configs/triton_ablation.yaml (abbreviated)
gpu: ada
model: Qwen/Qwen2.5-1.5B
precision: fp16
triton_versions: ["3.3.1", "3.4.0", "3.6.0"]
cache_types: [DynamicCache, StaticCache]
compile_modes: [eager, default, reduce-overhead]
token_lengths: [1, 8, 32, 128, 512]
iterations: 60
warmup: 3
seed: 42

E.3 v4 Large-Model Config (abbreviated)

# research/tr147/v4/configs/large_model.yaml (abbreviated)
# Ada companion uses AWQ for qwen; A100 uses dense FP16 for qwen
ada:
  models: [llama3.1-8b:fp16, qwen2.5-7b:awq-4bit]
a100:
  models: [llama3.1-8b:fp16, qwen2.5-7b:fp16]
compile_modes: [eager, default, reduce-overhead]
token_lengths: [1, 8, 32, 128, 512]
iterations: 60
warmup: 3
seed: 42

E.4 Environment Manifest (Illustrative — actual values pinned per run)

# Ada host
hardware: RTX 6000 Ada
compute_capability: sm_89
vram_gb: 48
driver: "550.xx"
cuda: "12.4"
pytorch: "2.6.x"   # version matrix: one build per Triton wheel
triton: ["3.3.1", "3.4.0", "3.6.0"]  # for v4C; one of these per run
os: Linux

# A100 host
hardware: A100-SXM4-80GB (v4), A100-SXM4-40GB (v3)
compute_capability: sm_80
vram_gb: 80 | 40
driver: "550.xx"
cuda: "12.4"
pytorch: "2.6.x"
triton: "3.4.0"
os: Linux

E.5 Exact Statistics Recomputation Command

cd C:/Users/sahil/OneDrive/Documents/GitHub/Banterhearts
python research/tr147/compute_v4_stats.py > research/tr147/v4/v4_stats_snapshot.txt
# Compare against this report's numeric claims.

End of Technical Report 147 v4.1.