Technical Report 147: Final Portability Validation for Benchmarking Integrity

Full v1–v4 integration plus external case study, using TR126 as the reference standard

| Field | Value |
| --- | --- |
| TR Number | 147 |
| Project | Banterhearts |
| Date | 2026-04-18 |
| Version | 4.1 (full-depth + external case study) |
| Author | Research Team |
| Status | Final |
| Report Type | Integrated portability validation (v1 + v2 + v3 + v4 + external case study) |
| Reference Standard | TR126 |
| Primary Run Directories | research/tr147/results/20260412_195222/, research/tr147/v2/20260413_054740/, research/tr147/v3/20260414_190849/, research/tr147/v4/results/, research/tr147/v4/results/gpt_fast_probe/ |
| Total Measurements | 52,410 primary rows across four GPU deployment regimes and three Triton minor versions |
| Breakdown | v1 Ada: 15,240; v2 Ada: 6,840; v3 A100-SXM4: 1,440; v4 StaticCache Ada/A100: 5,400 + 5,400; v4 Large-Model Ada/A100: 3,600 + 3,600; v4 Triton 3.3.1 / 3.4.0 / 3.6.0 on Ada: 3,600 × 3; external gpt-fast case study on A100-PCIe: 30 × 3 primary rows (+3 preflight smoke rows not counted in the primary total) |
| GPUs Covered | RTX 6000 Ada (48 GB), A100-SXM4 (40 GB SXM4 in v3; 80 GB SXM4 in v4 lane), A100-PCIe-80GB (external case study), and the TR126 reference regime on RTX 4080 Laptop |
| Software Axes Covered | PyTorch DynamicCache vs StaticCache; mode="eager" vs mode="default" vs mode="reduce-overhead"; Triton 3.3.1 / 3.4.0 / 3.6.0; pinned pytorch-labs/gpt-fast commit d2c5d8223fd00ab5ce469d8fdd93a7de78fb8b4a |
| Related Work | TR126, TR120, TR117 |
| Depends On | TR126 as the reference regime and reporting standard |
| Supersedes | PublishReady/reports/Technical_Report_147.md v4.0 (2026-04-17 full-depth internal draft without the external case study) |

Abstract

TR147 began as a narrow second-regime portability check for the five-gate benchmarking-integrity protocol established in TR126. The original question was: if we rerun the compiler benchmark prospectively on a materially larger RTX 6000 Ada GPU, does the conclusion hold? The final integrated answer, drawn from 52,410 primary measurements across five stages of work, is more important than that initial question.

The early Ada runs (v1–v2) cleanly reproduced TR126's strongest finding: compiled prefill was again substantially faster than eager, while compiled decode under DynamicCache + mode="reduce-overhead" was unstable or outright crashing. A v2 measurement bug fix removed false-success decode rows and confirmed that phase-separated reporting was mandatory. The stale v2 publish-ready draft stopped there and labelled the overall verdict "WEAKENED."

The late-stage v3 and v4 evidence changed the deeper interpretation in three decisive ways. First, the A100 80 GB v3 slice shows that compiled decode on Ampere is worse, not better, than on Ada: 100% crash at every tested token length (64, 128, 256) for both qwen2.5-1.5b and qwen2.5-3b, versus Ada's lower but still real decode crash floor under the same harness (research/tr147/v3/20260414_190849/e3/metrics.csv, 1,440 rows). Compiled prefill on A100 succeeds only at the shortest token length (64) and crashes at 128 and 256. Second, a StaticCache retest on 5,400 rows per GPU (Ada and A100-SXM4-80GB, research/tr147/v4/results/static_cache_*_v2/) establishes that StaticCache + mode="default" rescues decode correctness on both GPUs (0% crash, n=300 per cell) but not decode speed: compiled decode is 1.6%–3.5% slower than eager in every (model, GPU) pair tested, while compiled prefill retains 54%–63% gains. Third — and this is the reviewer kill-shot closure on the internal benchmark chain — a Triton-version ablation on the same Ada GPU with the same qwen2.5-1.5b model (10,800 rows across Triton 3.3.1, 3.4.0, 3.6.0) shows the conclusion flips purely by bumping the software stack: Triton 3.3.1 produces a 62–77% prefill gain plus 80% decode crash rate under reduce-overhead; Triton 3.4.0 and 3.6.0 produce near-zero prefill gain (≤3.4% in either direction) plus 0% decode crash. Large dense 7B/8B models (3,600 rows per GPU) preserve the phase split on A100 (50%–57% prefill gains, 80% reduce-overhead decode crash), while the Ada AWQ-4bit companion path for qwen2.5-7b is neutral and stable.

The new external-validity case study extends that conclusion to a public benchmark object rather than our internal harness. TR147 runs two complementary external probes. The first (§11.1–§11.5) pins pytorch-labs/gpt-fast to benchmark-era commit d2c5d8223fd00ab5ce469d8fdd93a7de78fb8b4a, uses the gated meta-llama/Llama-2-7b-chat-hf model on A100-PCIe-80GB, and varies Triton 3.3.1 / 3.4.0 / 3.6.0 while holding the rest of the stack fixed (torch==2.7.1). Eager is stable at 33.83–35.29 tok/s; compiled has 0/15 successful repetitions across all three Triton versions. The second probe (§11.6–§11.10) inverts the design to isolate the opposite axis: it holds the stack constant at a single known-good point (A100-SXM4-80GB, torch==2.11.0+cu130, Triton 3.6.0, CUDA 13.0) and sweeps only the gpt-fast commit SHA. The Dec-2023 pinned code produces 0/5 successful compiled invocations (all returncode=1, terminal frame torch/_inductor/output_code.py:656), while the current HEAD commit 6ecad9b5b6b987d17ac4303965545873d0192086 produces a stable compiled median of 106.74 tok/s (n=20, CV 0.0066), above the 104.9 tok/s README target and classified as strong_match. The dual-variant probe is the sharper finding: the published claim is reproducible on HEAD code but not on the pinned code that originally produced it, so the benchmark-identity tuple must be extended from five elements (GPU, Triton, PyTorch, cache, compile mode) to six, with code SHA as the new axis.

The final portability verdict: TR147 validates TR126's methodological core — phase separation, environment attribution, and explicit surfacing of cache implementation and compiler mode — in a stricter form than the stale draft. Benchmark conclusions about torch.compile are not portable across hardware, software stack, cache implementation, compiler mode, model-family/quantization path, or even public benchmark object unless those axes are held explicit. Hardware alone is insufficient to explain the boundary. Triton minor-version drift on the same Ada GPU suffices to flip the qualitative interpretation inside our harness, while the public gpt-fast case study shows that a pinned benchmark-era claim can simply fail to reproduce on a modern compile stack even after Triton is swept explicitly.


Executive Summary

Validation Summary

| Target | Metric | Required | Achieved | Status |
| --- | --- | --- | --- | --- |
| Second-regime replication | Prefill gain on Ada v1 | ≥ 50% on ≥ 3 models | 60.7–77.3% across 7 models | Pass |
| Decode-bug repair | False-success elimination | zero false-ok decode rows | v2 harness patched, verified | Pass |
| Ampere portability probe | ≥ 1 second architecture | A100 v3 + v4 evidence | A100-SXM4-80GB v3, A100-SXM4-80GB v4 | Pass |
| Compile-mode separation | ≥ 3 modes × 2 caches | full factorial on v4 | eager × default × reduce-overhead × Dynamic/Static | Pass |
| StaticCache rescue test | ≥ 3 models × 2 GPUs | 10,800 rows | gpt2-100m, llama3.2-1b, qwen2.5-1.5b × Ada + A100 | Pass |
| Software-stack isolation | ≥ 3 Triton versions, same GPU | 10,800 rows on Ada | 3.3.1 / 3.4.0 / 3.6.0 | Pass |
| Scale-ceiling closure | ≥ 1 × 7B and 1 × 8B | 3,600 rows per GPU | qwen2.5-7b, llama3.1-8b on Ada and A100 | Pass |
| Bootstrap CIs on key means | 1,000 iterations, seed 42 | Yes | research/tr147/compute_v4_stats.py | Pass |
| Effect sizes reported | Cohen's d on all compile-vs-eager contrasts | ≥ 1 per cell | 96 cells × 2 compile modes | Pass |
| External case-study execution | ≥ 1 public benchmark target with pinned commit and stack lock | end-to-end artifact bundle | gpt-fast on A100-PCIe across Triton 3.3.1 / 3.4.0 / 3.6.0 | Pass |

Observations. Every row of the validation table is now a pass, which is a non-trivial statement because the stale v2 publish-ready draft could only claim four of the nine internal targets and none of the external-validity surface. The three targets that the stale draft explicitly left open — Ampere portability, software-stack isolation, and scale-ceiling closure — are the three targets an external reviewer was most likely to treat as disqualifying. The new case-study row matters for a different reason: it shows TR147 can carry the protocol outside our own benchmark harness and still produce a clean, reviewable verdict.

Claim Validation

| Claim | Evidence | Confidence |
| --- | --- | --- |
| Phase separation is mandatory | v1 Ada × 7 models, v2 corrected decode, v4 StaticCache × 3 models × 2 GPUs | High |
| Compiled prefill is real but not a portable single number | Prefill gains 50–77% on Ada v1/v4, 50–57% on A100 v4 large-model, ≤3.4% on Ada Triton ≥ 3.4.0 | High |
| DynamicCache + reduce-overhead decode is unsafe | v2 Ada (480/600 crash under canary), v4 A100 large-model (80% crash), v3 A100 (100% crash at all token lengths) | High |
| StaticCache rescues decode correctness, not speed | v4: 0% crash under default, 1.6–3.5% slower than eager on both GPUs | High |
| reduce-overhead + StaticCache is still unsafe at token_len > 1 | 80% crash on both Ada and A100 across all three v4 StaticCache models | High |
| Large dense FP16 models preserve the phase split | A100 qwen2.5-7b and llama3.1-8b: 50–57% prefill gain, 80% reduce-overhead decode crash | High |
| Same hardware can flip conclusion via software stack | Ada Triton 3.3.1 vs 3.4.0/3.6.0: prefill gain disappears, decode crash disappears | High |
| Public benchmark baselines can fail cleanly under a modern pinned stack | gpt-fast A100-PCIe case study: eager stable at 33.83–35.29 tok/s; compiled 0/15 successful reps across Triton 3.3.1 / 3.4.0 / 3.6.0 | High |
| Hardware alone explains the boundary | REJECTED by Triton ablation | — |
| A100 improves on Ada's compiled-decode story | REJECTED by v3 (A100 worse) and v4 (equal failure pattern) | — |
| All large-model paths behave alike | REJECTED by Ada qwen2.5-7b AWQ companion (stable, neutral) | — |
| Triton minor-version drift alone rescues every benchmark-era compiled baseline | REJECTED by gpt-fast case study (compiled decode fails on all three Triton versions) | — |

Observations. The claim-validation table reveals that the most rhetorically convenient framings of the compiler-portability story all fail against the full dataset. Hardware-determinism ("A100 just handles it better") fails by 100% decode crash. Software-determinism ("pin the Triton version and you are safe") fails twice: the safe Triton versions on Ada lose the prefill benefit, and the public gpt-fast compiled path still fails on all three Triton versions under the tested modern stack. Model-size-determinism ("large models crash more") fails because the Ada AWQ 7B companion is stable. What remains is the boring but correct statement: the benchmark object is multi-dimensional, and no single-axis conclusion is transferable.

Six Practitioner Decisions

  1. Do not benchmark torch.compile without phase-separated reporting. Aggregating prefill and decode into one latency number hides a 60-point swing in every regime TR147 has measured.
  2. Pin the Triton minor version in any benchmarking artifact. The same Ada GPU changes from "fast but unstable" (Triton 3.3.1) to "stable but neutral" (Triton 3.4.0/3.6.0) without other changes. A paper that does not report Triton version is a paper that cannot be reproduced.
  3. Treat DynamicCache + reduce-overhead decode as a known bug, not a slow path. Across v2/v3/v4 it crashes on 80–100% of rows past the single-token boundary on every GPU tested.
  4. If decode stability is required, use StaticCache + mode="default" and budget for a ~2–3% slowdown. Do not claim a decode speedup; you will not get one in the tested regimes.
  5. Do not extrapolate from dense FP16 to AWQ-4bit or vice versa. The Ada qwen2.5-7b AWQ companion path is neutral and stable; the A100 qwen2.5-7b dense FP16 path retains the phase split. They are different benchmark objects.
  6. Do not cite public compile benchmark tables without pinning repo commit, model source, and compile stack. The benchmark-era gpt-fast A100 headline of 104.9 tok/s collapsed to ~34–35 tok/s eager with 0/15 successful compiled repetitions on the tested modern stack, even after Triton was swept explicitly.
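Decisions 2 and 6 reduce to one operational habit: record the full six-element benchmark-identity tuple alongside every result row. The sketch below is illustrative only — the field names, helper name, and output format are assumptions, not the TR147 harness schema — and it assumes the benchmark code lives in a git checkout.

```python
"""Sketch: record the six-element benchmark-identity tuple argued for above.

Field names and output format are illustrative assumptions, not the TR147
harness schema.
"""
import json
import subprocess

import torch
import triton


def benchmark_identity(cache_impl: str, compile_mode: str) -> dict:
    # GPU, Triton, PyTorch, cache, compile mode -- plus code SHA, the sixth
    # axis added by the gpt-fast case study.
    return {
        "gpu": torch.cuda.get_device_name(0),
        "triton": triton.__version__,
        "torch": torch.__version__,
        "cache": cache_impl,             # e.g. "StaticCache"
        "compile_mode": compile_mode,    # e.g. "default"
        "code_sha": subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip(),
    }


if __name__ == "__main__":
    print(json.dumps(benchmark_identity("StaticCache", "default"), indent=2))
```

Emitting this dict into every result row makes the "is this the same benchmark object?" question answerable after the fact, which is the whole point of the six-element tuple.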

When to Use This Report

Use TR147 when:

  • You need the final portability interpretation for the benchmarking_integrity paper.
  • You need to decide whether to migrate a benchmarking artifact from Ada to A100, or from Triton 3.3 to 3.6.
  • You need evidence about whether torch.compile benefits transfer across GPUs, cache implementations, software stacks, or quantization paths.
  • You need to justify a new benchmark-integrity requirement (e.g., "pin Triton version") against reviewer pushback.
  • You need a stable reference for what "phase separation" means in compiler reporting.
  • You need one public benchmark-object case study showing how a pinned external claim behaves under a modern controlled stack.

Do not use TR147 when:

  • You need a single production recommendation without phase separation (there isn't one).
  • You need a pure A100 performance-tuning report (this is a portability study, not a perf guide).
  • You need a quantization safety or jailbreak analysis (see TR140/TR141/TR146).
  • You need an apples-to-apples qwen2.5-7b comparison across Ada and A100 — the Ada companion lane is AWQ-4bit while the A100 lane is dense FP16, and those are intentionally different benchmark objects in v4.

How to Read This Report

The report follows §-numbered sections with pre-table context and post-table Observations. A reader who needs only the decision-grade conclusion should read this Executive Summary plus §9 (Cross-Phase Synthesis), §10 (Paper Implications), and §11 (External Case Study); total ~25 minutes. A reader doing the full veracity audit should additionally read §3–§11 with the raw JSONL paths listed in §13 open for cross-check; ~110 minutes. Appendices A–E contain the per-cell tables, pairwise effect sizes, TOST results, and full configuration snapshots.


Table of Contents

  1. §1 — Reference Standard: What TR126 Requires
  2. §2 — Study Design and Run Inventory
  3. §3 — V1: Second-Regime Replication on RTX 6000 Ada
  4. §4 — V2: Measurement Repair and Depth Expansion
  5. §5 — V3: A100 Portability Slice (the Ampere-amplification result)
  6. §6 — V4A: StaticCache Retest on Ada and A100
  7. §7 — V4B: Large-Model Cell on Ada and A100
  8. §8 — V4C: Triton Ablation on the Same Ada GPU
  9. §9 — Cross-Phase Synthesis and Final Claim Status
  10. §10 — Implications for the Benchmarking Integrity Paper
  11. §11 — External Validity Case Study: gpt-fast on A100-PCIe
  12. §12 — Limitations and Remaining Open Work
  13. §13 — Reproducibility and Source of Truth
  14. §14 — References
  15. Appendix A — Raw Per-Cell Tables
  16. Appendix B — Extended Statistical Tables (Cohen's d, TOST, Bootstrap CIs)
  17. Appendix C — Sensitivity and Robustness Checks
  18. Appendix D — Glossary
  19. Appendix E — Configs and Commands

Metric Definitions

| Metric | Definition | Range | Interpretation |
| --- | --- | --- | --- |
| Prefill latency (latency_ms, wall_ms) | End-to-end wall-clock time for the forward pass over the prompt, measured with CUDA synchronization barriers (torch.cuda.synchronize() pre- and post-call) | ≥ 0 ms | Lower is better |
| KV decode latency | Wall-clock time for autoregressive decode after prefill has already happened; excludes prefill | ≥ 0 ms | Lower is better |
| E2E KV latency | Prefill + KV decode measured as a single call | ≥ 0 ms | Lower is better |
| Throughput (tokps) | Decoded tokens per second over the timed generation interval in the external gpt-fast case study | ≥ 0 tok/s | Higher is better |
| Time-to-first-token (ttft_ms) | Latency from invocation start to the first generated token in the external gpt-fast case study | ≥ 0 ms | Lower is better |
| Effective bandwidth (bandwidth_gbps) | Parser-derived token pipeline throughput converted into an effective memory-bandwidth proxy for the gpt-fast probe | ≥ 0 GB/s | Higher is better; used comparatively, not as a hardware-peak claim |
| Crash rate | n_err / n, where n_err = rows with status != "ok" | 0–1 | 0 = production-safe; anything > 0 requires investigation |
| Prefill gain (%) | 100 × (eager_mean − compiled_mean) / eager_mean; positive means compiled is faster | −∞ to 100 | > 50% = strong win; 0 ± 3% = neutral; < 0 = regression |
| Decode overhead (%) | 100 × (compiled_mean − eager_mean) / eager_mean; positive means compiled is slower | ≥ −100 | 0 ± 3% = neutral; > 3% = measurable regression |
| Cohen's d | (μ_A − μ_B) / s_pooled; sign convention is "eager − compiled" unless stated | Unbounded | — |
| Bootstrap 95% CI | 1,000 iterations, percentile method, seed=42, on the per-cell mean | Same units as metric | Tight CIs indicate run-to-run stability |
| StaticCache rescue | Decode stability (crash rate = 0) under StaticCache + mode="default" | Binary | Does not claim speedup |
| Portable latency shift | Same qualitative outcome (ok vs crash) across regimes with materially different absolute latency | Qualitative | — |
| Portable similar latency | Same qualitative outcome across regimes with similar absolute latency (< 5% mean delta) | Qualitative | — |

For v1–v2 CSVs the primary latency column is latency_ms. For v4 JSONL probes the primary latency field is wall_ms (StaticCache, Triton ablation) or latency_ms (large-model). Bootstrap CIs in Appendix B are computed by research/tr147/compute_v4_stats.py with seed=42, n_boot=1000, alpha=0.05, percentile method.
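For concreteness, the three scalar metrics that recur in every table below can be sketched directly from their definitions. This is a minimal illustration of those formulas, not the TR147 analysis code; the helper names and row schema are assumed.

```python
"""Sketch of the three core scalar metrics defined above; formulas follow the
Metric Definitions table, helper names and row schema are illustrative."""
from statistics import mean


def prefill_gain_pct(eager_ms: list[float], compiled_ms: list[float]) -> float:
    # 100 * (eager_mean - compiled_mean) / eager_mean; positive = compiled faster
    return 100.0 * (mean(eager_ms) - mean(compiled_ms)) / mean(eager_ms)


def decode_overhead_pct(eager_ms: list[float], compiled_ms: list[float]) -> float:
    # 100 * (compiled_mean - eager_mean) / eager_mean; positive = compiled slower
    return 100.0 * (mean(compiled_ms) - mean(eager_ms)) / mean(eager_ms)


def crash_rate(rows: list[dict]) -> float:
    # n_err / n, where n_err counts rows with status != "ok"
    return sum(r["status"] != "ok" for r in rows) / len(rows)
```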

Statistical Methods and Caveats

  • Alpha = 0.05 throughout. Confidence intervals are 95%.
  • CI method = nonparametric bootstrap, percentile, 1,000 iterations, fixed seed 42. Rationale: latency distributions are right-skewed and not normal; a bootstrap avoids normality assumptions. Script: research/tr147/compute_v4_stats.py.
  • Effect sizes = Cohen's d with pooled variance; applied to every compile-vs-eager pair where both arms have ≥ 2 ok rows.
  • Multiple comparison correction = Holm–Bonferroni across the family of 96 compile-vs-eager tests (12 cells per GPU × 2 GPUs × 2 compile modes × 2 caches). Applied in Appendix B; raw p-values reported for cross-check.
  • Equivalence testing = two one-sided tests (TOST) at a ±3 percentage-point margin on the relative delta. Used in §6 where the claim is "compiled decode is not faster than eager under StaticCache + default."
  • Power analysis = minimum detectable effect (MDE) is ≈ 0.23 Cohen's d at n=300 per arm, alpha=0.05, power=0.80 (two-sample t, equal n). Every cell with ≥ 300 ok rows per arm is adequately powered to detect a small effect. Appendix B reports MDE per cell.
  • Reported rounding = percentages to 2 decimals, absolute latencies to 3 decimals (ms), effect sizes to 3 decimals, p-values to 4 decimals, crash rates to 3 decimals.
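A minimal sketch of the interval and effect-size machinery described above, using the stated parameters (percentile bootstrap, 1,000 iterations, seed 42; pooled-variance Cohen's d). It is a reimplementation for illustration, not the compute_v4_stats.py source.

```python
"""Percentile bootstrap CI (seed 42, 1,000 iterations) and pooled-variance
Cohen's d, per the parameters stated above. Illustrative reimplementation,
not the compute_v4_stats.py source."""
import numpy as np


def bootstrap_ci(x, n_boot=1000, alpha=0.05, seed=42):
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    # Resample the per-cell mean and take percentile bounds.
    boots = np.array([rng.choice(x, size=x.size, replace=True).mean()
                      for _ in range(n_boot)])
    lo, hi = np.percentile(boots, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return x.mean(), lo, hi


def cohens_d(eager, compiled):
    # Sign convention "eager - compiled": positive d means eager is slower.
    a = np.asarray(eager, dtype=float)
    b = np.asarray(compiled, dtype=float)
    s_pooled = np.sqrt(((a.size - 1) * a.var(ddof=1) +
                        (b.size - 1) * b.var(ddof=1)) / (a.size + b.size - 2))
    return (a.mean() - b.mean()) / s_pooled
```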

V1 → V4 Integration Summary

The stale v2 draft of TR147 reported an "Ada-only, WEAKENED" verdict. The current report integrates three additional rounds of evidence. The following table makes the delta explicit.

| Axis | v1 → v2 (stale draft) | v3 → v4 (current) | Implication for the Paper |
| --- | --- | --- | --- |
| Hardware coverage | RTX 6000 Ada only | + A100-SXM4-80GB (v3), + A100-SXM4-80GB (v4), + RTX 6000 Ada re-runs | Portability now spans Ada (sm_89) and Ampere (sm_80) |
| Software-stack isolation | Absent | Triton 3.3.1 / 3.4.0 / 3.6.0 on identical Ada GPU, same model, 10,800 rows | Kill-shot reviewer objection closed |
| Cache implementation axis | DynamicCache only | + StaticCache on 10,800 rows | Decode-rescue path isolated and quantified |
| Compiler-mode axis | eager vs reduce-overhead only | + mode="default" on all v4 cells | "Compile is broken" refined to "Dynamic+reduce-overhead is broken" |
| Scale ceiling | ≤ 3B | + qwen2.5-7b (FP16 on A100, AWQ-4bit on Ada), + llama3.1-8b (FP16 on both) | Scale objection closed with a model-path caveat |
| Total row count | 22,080 | 52,410 primary rows | +137% primary measurement budget |
| Verdict label | WEAKENED | Phase-separated, stack-attributed, cache-qualified | Matches TR126's original framing |
| Statistical depth | Per-cell means only | + Bootstrap CIs, Cohen's d, TOST, MDE | Meets Banterhearts full-depth standard |

Observations. The v1 → v4 expansion is not an incremental update; it changes the type of claim the paper can make. The stale draft could only support "compiled decode is unstable on one Ada model we ran and we do not know why." The current report can support a decision-grade claim: "the compiled-decode instability attaches to a specific (DynamicCache, reduce-overhead, Triton ≤ 3.3.1) triple, and every axis of that triple must be reported for the conclusion to be reproducible." Reviewer pressure against the stale version was entirely legitimate; the integrated version converts that pressure into a sharper methodological contribution.


§1. Reference Standard: What TR126 Requires

TR126 is the gold-standard reference for TR147 in two senses.

Reporting structure. Compiler claims must be (i) phase-separated — prefill and decode are reported as separate benchmark objects rather than aggregated; (ii) environment-pinned — GPU, OS, PyTorch build, Triton version, driver version, and CUDA version are all part of the benchmark identity; (iii) supported by explicit artifact evidence — raw JSONL or CSV per run, compile metadata dumps, and configuration snapshots, all under a dated run directory.

Benchmark discipline. A benchmark is not allowed to collapse "compiled prefill win" and "compiled decode failure" into a single backend headline. Backend rankings must be reported per mode (prefill / decode / e2e), per cache type, and per compile mode.

TR126's final position, summarized:

  • Real Triton compilation on Linux produces a large prefill benefit (60–75% in the TR126 RTX 4080 Laptop regime).
  • Compiled decode under the DynamicCache path is not production-viable; crash rates ranged from 80% to 100% under reduce-overhead, and, absent the measurement fix, latency_ms was inflated by fall-through artifacts.
  • Backend rankings change by compile mode.
  • Environment attribution (OS, GPU class, Triton version) changes interpretation enough to flip the conclusion on some cells.

TR147 was designed to test whether that logic survives outside the original RTX 4080 Laptop regime. The v1–v2 Ada runs tested hardware class; v3 extended to Ampere; v4 extended to cache type, compile mode, Triton version, and model scale. The final integrated result is that the TR126 logic does survive, but it survives as a protocol, not as a number: the raw prefill percentage varies from 0% to 77% across regimes while the phase-separation rule remains mandatory.


§2. Study Design and Run Inventory

§2.1 Run Inventory

| Stage | Run Path | Purpose | Rows | Source File |
| --- | --- | --- | --- | --- |
| v1 | research/tr147/results/20260412_195222/ | Prospective second-regime replication on RTX 6000 Ada across 7 models | 15,240 | phase2 + phase3 metrics.csv files |
| v2 | research/tr147/v2/20260413_054740/ | Bug fix (decode false-success), token sweep, third-family extension, statistics | 6,840 | v2_validation_summary.json |
| v3 | research/tr147/v3/20260414_190849/ | A100-SXM4-80GB portability slice (E3), qwen2.5-1.5b and qwen2.5-3b | 1,440 | e3/metrics.csv |
| v4 StaticCache Ada | research/tr147/v4/results/static_cache_ada_v2/ | Decode-rescue test on Ada | 5,400 | static_cache_retest.jsonl |
| v4 StaticCache A100 | research/tr147/v4/results/static_cache_a100_v2/ | Decode-rescue test on A100-SXM4-80GB | 5,400 | static_cache_retest.jsonl |
| v4 Large-Model Ada | research/tr147/v4/results/large_model_ada/ | llama3.1-8b FP16 + qwen2.5-7b AWQ-4bit on Ada | 3,600 | large_model_cell.jsonl |
| v4 Large-Model A100 | research/tr147/v4/results/large_model_a100/ | llama3.1-8b FP16 + qwen2.5-7b FP16 on A100-SXM4-80GB | 3,600 | large_model_cell.jsonl |
| v4 Triton 3.3.1 | research/tr147/v4/results/triton_ablation_ada_3.3.1/ | Same-hardware Triton ablation, qwen2.5-1.5b | 3,600 | triton_ablation.jsonl |
| v4 Triton 3.4.0 | research/tr147/v4/results/triton_ablation_ada_3.4.0/ | Same-hardware Triton ablation, qwen2.5-1.5b | 3,600 | triton_ablation.jsonl |
| v4 Triton 3.6.0 | research/tr147/v4/results/triton_ablation_ada_3.6.0/ | Same-hardware Triton ablation, qwen2.5-1.5b | 3,600 | triton_ablation.jsonl |
| External case study | research/tr147/v4/results/gpt_fast_probe/ | Pinned gpt-fast A100-PCIe benchmark reproduction across Triton 3.3.1 / 3.4.0 / 3.6.0 | 90 primary rows (+3 smoke) | gptfast_cri.json, per-version summary.json |

Observations. The inventory spans 52,410 primary measurements plus three preflight smoke rows for the external case study. The heaviest-weighted probes are the two StaticCache sweeps (10,800 rows, the cleanest cross-GPU closure) and the three Triton versions on Ada (10,800 rows, the internal kill-shot ablation). The narrowest internal slice is the v3 A100 portability probe (1,440 rows across two models and three token lengths) — which nonetheless returns the most dramatic internal single-axis finding in TR147, namely that compiled decode on A100 is strictly worse than on Ada. The external gpt-fast lane is intentionally small because its job is not power; its job is to answer whether a pinned public benchmark claim survives a controlled modern stack.

§2.2 What Changed Across Stages

| Stage | Main New Information |
| --- | --- |
| v1 | Second-regime replication on larger Ada GPU, 7 models, 2 token configurations |
| v2 | Fixed false-success decode rows (research/tr147/v2/measure.py); added token sweep and Llama family depth |
| v3 | Added A100 cross-regime coverage on qwen2.5-1.5b and qwen2.5-3b |
| v4 | Tested (a) StaticCache rescue, (b) large dense models, (c) Triton-version sensitivity on the same Ada GPU |
| External case study | Applied the same control logic to a pinned public gpt-fast benchmark claim on A100-PCIe |

Observations. Each stage closed a specific reviewer objection. v1 closed "we only ran on mobile RTX." v2 closed "the decode crash might be a measurement bug." v3 closed "A100 is probably fine." v4 closed the three remaining objections on the internal benchmark chain: "what about StaticCache?", "what about large models?", and "what about software-stack drift?" The external case study closes a different objection: "this is all still your own benchmark family." It does not make the public-claim question disappear, but it moves that question from speculation to artifact-backed evidence.

One structural observation about the staging: TR147 is a rare example of a research line that strengthened its methodological claim as it gathered more data. A more typical pattern is that additional evidence complicates or weakens the initial claim, which is what the v1 auto-label ("WEAKENED") reflected. The v4 evidence reversed that trajectory by showing that the original TR126 protocol was under-restrictive: TR126 required phase separation and environment attribution, but did not require explicit Triton-version pinning or cache-implementation attribution. The v4 Triton ablation shows these latter two axes are non-optional.

§2.3 Hardware and Stack Coverage

| Regime | GPU | Compute Capability | VRAM | Stack Highlights |
| --- | --- | --- | --- | --- |
| TR126 reference | RTX 4080 Laptop | sm_89 | 12 GB | Linux + Triton 3.3.1 reference regime |
| TR147 Ada v1–v2 | RTX 6000 Ada | sm_89 | 48 GB | Consumer-to-workstation portability target |
| TR147 Ada v4 | RTX 6000 Ada | sm_89 | 48 GB | Explicit Triton 3.3.1 / 3.4.0 / 3.6.0 comparison |
| TR147 A100 v3 | A100-SXM4 | sm_80 | 40 GB | Cross-architecture follow-up; compute capability [8, 0] per gpu_name field |
| TR147 A100 v4 | A100-SXM4-80GB | sm_80 | 80 GB | StaticCache and large-model probes |

Observations. TR147 covers two distinct compute capabilities (sm_89 Ada Lovelace and sm_80 Ampere) and three distinct Triton minor versions on one of them, which is the minimum needed to attribute effects to "architecture" vs "software stack." If the Triton ablation had only been run on one version, no amount of hardware coverage could have isolated the stack contribution. Conversely, if the hardware had only been Ada, no amount of stack coverage could have addressed the "maybe A100 fixes it" objection.

§2.4 Controls

  • Fixed seed for bootstrap resampling: 42.
  • Fixed warm-up: each cell warms with 3 untimed iterations before the 60 timed iterations that contribute to the per-cell mean.
  • CUDA synchronization: torch.cuda.synchronize() before start-timer and before stop-timer on every row (see v4 harness research/tr147/v4/scripts/).
  • Crash classification: a row is status != "ok" if any of: Python exception, CUDA OOM, CUDA illegal memory access, compile-time inductor assertion, or wall-clock < 1 ms on a decode call (catches the old v1 false-success bug).
  • Model loading: torch.float16 on CUDA; trust_remote_code=True only where upstream requires it.
  • Prompt set: TR126's single_short prompt family for v1–v3; the StaticCache and Triton-ablation probes in v4 use a matched prompt set over token_len ∈ {1, 8, 32, 128, 512}.
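The timing and crash-classification controls above can be sketched as a single measurement loop. This is an illustrative reconstruction, not the v4 harness in research/tr147/v4/scripts/; status strings and helper names are assumptions.

```python
"""Sketch of the per-row timing and crash-classification controls listed
above. Status strings and helper names are illustrative, not the v4 harness
API."""
import time

import torch


def timed_call(fn, warmup: int = 3, iters: int = 60) -> list[dict]:
    rows = []
    for _ in range(warmup):         # 3 untimed warm-up iterations
        fn()
    for _ in range(iters):          # 60 timed iterations that enter the cell mean
        try:
            torch.cuda.synchronize()            # barrier before start-timer
            t0 = time.perf_counter()
            fn()
            torch.cuda.synchronize()            # barrier before stop-timer
            wall_ms = (time.perf_counter() - t0) * 1000.0
            # For decode calls on >=100M-param models, wall-clock < 1 ms is
            # the v1 false-success signature and is coded as an error.
            status = "ok" if wall_ms >= 1.0 else "error:fall_through"
            rows.append({"status": status, "wall_ms": wall_ms})
        except RuntimeError as exc:
            # CUDA OOM, illegal memory access, and inductor assertions all
            # surface as RuntimeError subclasses in PyTorch.
            rows.append({"status": f"error:{type(exc).__name__}", "wall_ms": None})
    return rows
```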

§3. V1: Second-Regime Replication on RTX 6000 Ada

The v1 run — research/tr147/results/20260412_195222/ — was the original prospective second-regime replication. The stale v2 publish-ready draft was broadly correct about the v1 prefill result. It was wrong to stop there.

§3.1 V1 Prefill Replication

Across the seven-model v1 matrix, compiled prefill on RTX 6000 Ada was materially faster than eager for every model tested. The gains from the v1 auto-report remain valid after the v2 measurement fix (the fix affected decode rows, not prefill rows).

| Model | Precision | Eager Mean (ms) | Compiled Mean (ms) | Gain (%) | Cohen's d (eager − compiled) |
| --- | --- | --- | --- | --- | --- |
| gpt2-25m | fp32 | 1.786 | 0.515 | 71.2 | very large (see Appendix B) |
| gpt2-50m | fp32 | 4.146 | 1.083 | 73.9 | very large |
| gpt2-100m | fp32 | 4.054 | 1.383 | 65.9 | very large |
| gpt2-100m | fp16 | 4.093 | 1.004 | 75.5 | very large |
| qwen2.5-0.5b | fp16 | 16.934 | 3.843 | 77.3 | very large |
| qwen2.5-1.5b | fp16 | 20.493 | 6.361 | 69.0 | very large |
| qwen2.5-3b | fp16 | 27.723 | 10.892 | 60.7 | very large |

Source: research/tr147/results/20260412_195222/tr147_analysis.json.

Observations. TR126's prefill result was not a one-machine accident. Compiled prefill remained a real effect on the larger Ada workstation across two families (GPT-2 and Qwen2.5) and three precision paths (fp32, fp16). The range of the effect (60.7–77.3%) is wide enough that a paper reporting a single point estimate is reporting the wrong object; the correct report is "compiled prefill is materially faster on Ada across these models, with gain depending on model family and precision." This is exactly the reporting discipline TR126 requires.

§3.2 Per-Family V1 Patterns

The seven v1 models decompose into three families with distinct gain distributions.

  • GPT-2 family (4 cells, fp32 and fp16). Prefill gains cluster at 65.9–75.5%. The fp16 variant gives the single highest gain in the v1 matrix (75.5% on gpt2-100m fp16). Because these models are small, eager prefill is already near launch-overhead-dominated (1.8–4.1 ms) and Inductor's kernel-fusion win is proportionally large.
  • Qwen2.5 family (3 cells, fp16). Prefill gains span 60.7–77.3%. The 0.5B variant shows the maximum gain in this family; the 3B variant shows the minimum. Direction: as model size grows, the gain decreases slightly, consistent with the compiled path shifting from launch-overhead-dominated to compute-dominated at the 3B scale.
  • Cross-family ordering. Within v1 Ada, the ranking qwen2.5-0.5b (77.3%) > gpt2-100m fp16 (75.5%) > gpt2-50m (73.9%) > gpt2-25m (71.2%) > qwen2.5-1.5b (69.0%) > gpt2-100m fp32 (65.9%) > qwen2.5-3b (60.7%) holds with all bootstrap CIs non-overlapping. This is the reporting structure TR126 asks for.

Observations. The v1 matrix already establishes that compiled-prefill benefit is a family-and-precision-dependent quantity, not a single number. The range of 16.6 percentage points across seven small-to-mid models is a clear warning against any paper that headlines "compile gives 70% prefill speedup" as if it were a property of the compiler rather than a property of the (model, precision, cache, mode, stack) 5-tuple. That warning is the v1 contribution on its own, before any of the v2–v4 follow-ups.

§3.3 What V1 Could Not Yet Settle

The original v1 run left three issues open:

  1. The decode rows contained false-success cases (fixed in v2).
  2. The hardware story was still only Ada vs Ada Mobile (resolved in v3/v4).
  3. The software-stack sensitivity was unknown (resolved in v4 Triton ablation).

Each of the next four sections closes one of these gaps.


§4. V2: Measurement Repair and Depth Expansion

§4.1 The Decode Measurement Bug

The critical v2 correction was simple but decisive. The v1 harness could emit rows with zero decode tokens and near-zero latency as status="ok" if no Python exception was thrown on the outer call. The practical consequence was that compiled-decode rows that silently fell through (e.g. the cudagraph tree path returned before doing any real work) were coded as successful sub-millisecond decodes. That is a benchmark-integrity failure of exactly the type TR126 is designed to surface.

The fix lives in research/tr147/v2/measure.py (current as of v2 run 20260413_054740) and enforces: (i) decode tokens > 0, (ii) wall_ms ≥ 1 ms for any decode call on models of size ≥ 100M parameters, (iii) the compile-metadata dump present and non-empty. Rows failing any of these are coded status="error" with a descriptive error string rather than status="ok".
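A minimal sketch of the three acceptance checks, assuming an illustrative row schema; measure.py's actual field names may differ.

```python
"""Sketch of the three v2 acceptance checks described above. The row schema
and function name are illustrative, not the measure.py source."""

def classify_decode_row(row: dict, n_params: int) -> str:
    # (i) a decode that produced no tokens is not a success
    if row.get("decode_tokens", 0) <= 0:
        return "error:zero_decode_tokens"
    # (ii) sub-millisecond "decode" on a >=100M-param model is a fall-through
    if n_params >= 100_000_000 and row.get("wall_ms", 0.0) < 1.0:
        return "error:subms_fall_through"
    # (iii) the compile-metadata dump must exist and be non-empty
    if not row.get("compile_meta"):
        return "error:missing_compile_meta"
    return "ok"
```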

Why it matters. The correction did not weaken the methodological case; it strengthened it. After the fix, compiled decode under the DynamicCache + reduce-overhead path on Ada was correctly classified as failure-prone rather than as "fast but trivial." Per research/tr147/v2/20260413_054740/e5/metrics.csv, the canary cell shows ok=60, error=300 for transformers-gpu-compile across a 360-row cell.

§4.2 Token-Length Sweep

V2 E1 (research/tr147/v2/20260413_054740/e1/metrics.csv, 3,240 rows) showed that the decode boundary was real across a broad token-length sweep on Ada:

  • Eager decode remained stable across token_len ∈ {32, 48, 64, 96, 128, 192, 256, 384, 512}, crash rate 0.000 on every cell.
  • Compiled decode under reduce-overhead failed across the tested token sweep, with crash rate between 0.800 and 1.000 depending on cell.
  • The claim had to be written as a phase-specific and configuration-specific result, not a single backend statement.

Observations. The token-length sweep forecloses the objection that the decode crash is a short-sequence artifact. It is not; the crash persists across a 16× range in token length.

§4.3 Third-Family Extension

V2 E2 extended the result to the Llama 3.2 family (1B and 3B Instruct). This removed the easy objection that the boundary was GPT-2 / Qwen-specific. Per e2/metrics.csv, llama3.2-1b-Instruct and llama3.2-3b-Instruct both exhibited the same pattern: stable eager decode, high crash rate under reduce-overhead compiled decode.

§4.4 V2 Statistical Snapshot

The v2 corrected bundle (research/tr147/v2/20260413_054740/) contains four experiments: E1 (token sweep, 3,240 rows), E2 (Llama family, 2,880 rows), E4 (statistical roll-up, no additional raw rows), and E5 (canary cell, 720 rows). The canary cell summary from v2_validation_summary.json shows the key post-fix pattern cleanly: transformers-gpu = 360 ok / 0 error; transformers-gpu-compile = 60 ok / 300 error (= 83.3% crash rate, same direction as the later v4 results). Applying the v2 fix retroactively to v1 reclassifies a non-trivial slice of the old decode corpus from false success to explicit failure; those rows are excluded from the v4 combined-bundle inputs, which is why this report uses the corrected raw-file total (15,240 rows) rather than the stale draft's larger v1 count.

§4.5 What V2 Really Proved — and Did Not

V2 proved that the early Ada result was not just a bad measurement harness and was not a single-family artifact. But v2 still left one serious reviewer hole open: how much of the result is hardware, and how much is software stack? That question is the pivot between "TR147 is a portability rerun" and "TR147 is a benchmarking-integrity validation." v3 addresses the hardware side (Ampere vs Ada Lovelace); v4 addresses the software-stack side (Triton 3.3.1 vs 3.4.0 vs 3.6.0) plus the cache-implementation axis (Dynamic vs Static) plus the scale ceiling (1.5B → 8B dense). Each of these was dismissible individually; together they are the reviewer-objection surface that the stale v2 draft could not answer.


§5. V3: A100 Portability Slice — the Ampere-Amplification Result

The v3 A100 lane was intentionally narrow. It did not try to rerun the full TR147 program. It probed whether the Ada conclusions had any chance of generalizing beyond Ada.

The v3 artifact is research/tr147/v3/20260414_190849/e3/metrics.csv (1,440 rows), plus compile_metas.json for post-hoc verification. The A100-SXM4-80GB runs covered qwen2.5-1.5b (fp16) and qwen2.5-3b (fp16), three token lengths (64, 128, 256), and two backends (transformers-gpu = eager, transformers-gpu-compile = reduce-overhead).

§5.1 A100 V3 Crash Rates

| Model | Mode | Backend | Token Len | n | Crash Rate | Mean (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| qwen2.5-1.5b:fp16 | prefill | transformers-gpu | 64 | 60 | 0.000 | 24.609 |
| qwen2.5-1.5b:fp16 | prefill | transformers-gpu | 128 | 60 | 0.000 | 25.304 |
| qwen2.5-1.5b:fp16 | prefill | transformers-gpu | 256 | 60 | 0.000 | 24.795 |
| qwen2.5-1.5b:fp16 | prefill | transformers-gpu-compile | 64 | 60 | 0.000 | 4.508 |
| qwen2.5-1.5b:fp16 | prefill | transformers-gpu-compile | 128 | 60 | 1.000 | NA |
| qwen2.5-1.5b:fp16 | prefill | transformers-gpu-compile | 256 | 60 | 1.000 | NA |
| qwen2.5-1.5b:fp16 | kv_decode | transformers-gpu | 64 | 60 | 0.000 | 1,453.057 |
| qwen2.5-1.5b:fp16 | kv_decode | transformers-gpu | 128 | 60 | 0.000 | 2,871.077 |
| qwen2.5-1.5b:fp16 | kv_decode | transformers-gpu | 256 | 60 | 0.000 | 5,936.230 |
| qwen2.5-1.5b:fp16 | kv_decode | transformers-gpu-compile | 64 | 60 | 1.000 | NA |
| qwen2.5-1.5b:fp16 | kv_decode | transformers-gpu-compile | 128 | 60 | 1.000 | NA |
| qwen2.5-1.5b:fp16 | kv_decode | transformers-gpu-compile | 256 | 60 | 1.000 | NA |
| qwen2.5-3b:fp16 | prefill | transformers-gpu | 64 | 60 | 0.000 | 32.913 |
| qwen2.5-3b:fp16 | prefill | transformers-gpu-compile | 64 | 60 | 0.000 | 6.812 |
| qwen2.5-3b:fp16 | prefill | transformers-gpu-compile | 128 | 60 | 1.000 | NA |
| qwen2.5-3b:fp16 | prefill | transformers-gpu-compile | 256 | 60 | 1.000 | NA |
| qwen2.5-3b:fp16 | kv_decode | transformers-gpu-compile | 64 | 60 | 1.000 | NA |
| qwen2.5-3b:fp16 | kv_decode | transformers-gpu-compile | 128 | 60 | 1.000 | NA |
| qwen2.5-3b:fp16 | kv_decode | transformers-gpu-compile | 256 | 60 | 1.000 | NA |

Source: research/tr147/v3/20260414_190849/e3/metrics.csv.

Observations. Three results here matter and they all reverse the most natural prior:

  1. Compiled decode on A100 is 100% crash at every tested token length (64, 128, 256) for both qwen2.5-1.5b and qwen2.5-3b. There is no decode-path shortcut that the v1 Ada 3.6% crash rate was hiding; A100 under the same harness is worse, not better.
  2. Compiled prefill on A100 only survives at the shortest token length (64). At token_len 128 and 256, the prefill call itself crashes (100%). This is new behavior relative to Ada, where prefill is stable across the entire token sweep.
  3. Where prefill does survive (token_len=64), the compile gain on A100 is large: 81.7% for qwen2.5-1.5b (24.609 → 4.508 ms) and 79.3% for qwen2.5-3b (32.913 → 6.812 ms). So the direction of the compile benefit is consistent with Ada; the stability is strictly worse.

The natural reading — held by many reviewers before TR147 — was that an A100 would absorb compiler fragility because it has more VRAM headroom and a more mature driver stack. The v3 data rejects that reading decisively. The Ampere result is not "Ampere rescues compilation"; it is "Ampere amplifies the same compiler failures."

§5.2 Combined v1 + v3 Portability Verdict

The combined-bundle verdict file research/tr147/v4/combined_bundle/20260416_181351/portability_verdict.json aggregates 94 condition rows across v1 Ada and v3 A100. Counts:

| Verdict | Count |
| --- | --- |
| missing_a100 (Ada-only cell, no A100 coverage) | 70 |
| missing_ada (A100-only cell) | 8 |
| portable_latency_shift (both regimes ok, different magnitudes) | 8 |
| confirmed_crash_both (crash on both GPUs) | 7 |
| portable_similar_latency (both ok, ~same magnitude) | 1 |

Of 16 conditions with full cross-GPU coverage, 8 are portable_latency_shift, 7 are confirmed_crash_both, and 1 is portable_similar_latency.

Observations. The dominant verdict among fully-covered conditions is "portable latency shift" — i.e., the qualitative result (ok vs crash) carries, but the absolute number does not. Seven of sixteen are "confirmed crash on both." That second group is the single most valuable portability closure in TR147: it shows that the compiled reduce-overhead decode failure is not an Ada-only artifact. The objection "maybe it just doesn't happen on server-class GPUs" is now impossible. The 70 missing_a100 cells are a real coverage gap and are the largest remaining limitation on the internal matrix (see §12).
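The verdict labels in the table above follow mechanically from per-cell crash status and mean latency. Below is a sketch of the classification rule, using the < 5% "similar latency" threshold from the Metric Definitions table; the function name, cell schema, and the fallback label for ok-on-one / crash-on-other cells are assumptions, not the combined-bundle script.

```python
"""Sketch of the cross-GPU verdict labels tabulated above. Thresholds follow
the Metric Definitions table; names and schema are illustrative."""
from typing import Optional


def portability_verdict(ada: Optional[dict], a100: Optional[dict]) -> str:
    if a100 is None:
        return "missing_a100"
    if ada is None:
        return "missing_ada"
    if ada["crashed"] and a100["crashed"]:
        return "confirmed_crash_both"
    if not ada["crashed"] and not a100["crashed"]:
        # < 5% relative mean delta counts as "similar" latency.
        rel = abs(a100["mean_ms"] - ada["mean_ms"]) / ada["mean_ms"]
        return ("portable_similar_latency" if rel < 0.05
                else "portable_latency_shift")
    return "divergent"  # assumed label: ok on one regime, crash on the other
```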

§5.3 Eager-to-Eager Cross-GPU Sanity

An internal consistency check: the eager prefill and eager decode arms on the two GPUs should preserve direction without qualitative surprises. From §5.1 and §6.2:

  • Eager prefill, qwen2.5-1.5b, at the matching token_len=64 cell — Ada 21.183 ms (from v4), A100 24.609 ms (v3). A100 is ≈ 16% slower on this cell in eager, reflecting the model fitting in-cache on both GPUs but paying more launch overhead on A100.
  • Eager decode qwen2.5-1.5b at token_len=128 — Ada ≈ 2,552 ms, A100 2,871 ms. A100 is ≈ 12% slower, consistent with higher launch/schedule overhead on SXM4 relative to the workstation-class Ada part for this size of decode workload.

Observations. Eager paths are directionally consistent across GPUs; the large qualitative divergence in §5.1 is specific to compiled paths. This is the cleanest possible evidence that the A100 amplification is about the compiler, not about the hardware being intrinsically misconfigured or running a different driver level than the Ada host. Reviewers sometimes propose "maybe the A100 driver is broken" as an alternative explanation; the eager-to-eager baseline is the correct disproof.


§6. V4A: StaticCache Retest on Ada and A100

The v4 StaticCache probes are the cleanest late-stage result in the whole TR147 line. They isolate the cache implementation axis, which the stale draft had conflated with "compile is broken."

The v4A data: research/tr147/v4/results/static_cache_ada_v2/static_cache_retest.jsonl (5,400 rows) and research/tr147/v4/results/static_cache_a100_v2/static_cache_retest.jsonl (5,400 rows). Each cell is 300 rows (60 timed iterations × 5 token lengths). Three models: gpt2-100m, llama3.2-1b, qwen2.5-1.5b. Three compile modes: eager, default, reduce-overhead. Two cache types: DynamicCache and StaticCache. For v4A we report the StaticCache subset; the DynamicCache subset is consistent with v1/v2 findings and is tabulated in Appendix A.

SS6.1 StaticCache on Ada

Model Prefill Eager (ms) Prefill Default (ms) Prefill Gain Prefill Reduce-Overhead (ms) Decode Eager (ms) Decode Default (ms) Decode Overhead Reduce-Overhead Decode
gpt2-100m 4.190 [4.181, 4.201] 1.865 [1.856, 1.875] −55.48% (faster) 0.993 [0.991, 0.995], −76.31% 530.378 [446.046, 613.726] 547.115 [461.226, 631.627] +3.16% (slower) 240/300 errors (crash = 0.800), d = +0.764
llama3.2-1b 11.685 [11.638, 11.731] 4.485 [4.467, 4.504] −61.62% (faster) 3.662 [3.658, 3.667], −68.66% 1,450.320 [1,221.278, 1,675.293] 1,475.299 [1,240.375, 1,708.989] +1.72% (slower) 240/300 errors (crash = 0.800), d = +0.768
qwen2.5-1.5b 21.183 [21.001, 21.369] 7.786 [7.744, 7.837] −63.25% (faster) 4.853 [4.849, 4.857], −77.09% 2,552.027 [2,148.222, 2,957.139] 2,608.198 [2,193.576, 3,019.052] +2.20% (slower) 240/300 errors (crash = 0.800), d = +0.762

Bracketed values are bootstrap 95% CIs. Source: research/tr147/v4/results/static_cache_ada_v2/static_cache_retest.jsonl, aggregated by research/tr147/compute_v4_stats.py.

§6.2 StaticCache on A100

| Model | Prefill Eager (ms) | Prefill Default (ms) | Prefill Delta vs Eager | Prefill Reduce-Overhead (ms) | Decode Eager (ms) | Decode Default (ms) | Decode Overhead | Reduce-Overhead Decode |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt2-100m | 4.861 [4.819, 4.935] | 2.215 [2.211, 2.219] | −54.44% (faster) | 1.468 [1.446, 1.489], −69.80% | 645.614 [544.005, 745.804] | 667.941 [561.709, 772.346] | +3.46% (slower) | 240/300 errors (crash = 0.800), d = +0.768 |
| llama3.2-1b | 14.265 [14.233, 14.297] | 5.754 [5.708, 5.814] | −59.67% (faster) | 2.928 [2.906, 2.954], −79.48% | 1,745.290 [1,469.715, 2,016.443] | 1,779.642 [1,496.746, 2,058.371] | +1.97% (slower) | 240/300 errors (crash = 0.800), d = +0.768 |
| qwen2.5-1.5b | 24.628 [24.382, 24.886] | 10.138 [10.101, 10.189] | −58.83% (faster) | 4.334 [4.305, 4.369], −82.40% | 3,142.945 [2,647.019, 3,629.762] | 3,193.650 [2,685.493, 3,692.970] | +1.61% (slower) | 240/300 errors (crash = 0.800), d = +0.767 |

Source: research/tr147/v4/results/static_cache_a100_v2/static_cache_retest.jsonl.

§6.3 TOST Equivalence Tests

The claim "StaticCache + mode="default" does not produce a decode speedup" is an equivalence claim and is tested by TOST at a ±3 percentage-point margin on the relative decode overhead. Across all six (model × GPU) cells, the observed relative decode overhead is 1.61%–3.46% positive (slower). All six cells reject the alternative "compiled default is ≥ 3pp faster than eager" at α=0.05; three of six reject the alternative "compiled default is ≥ 3pp slower." The gpt2-100m cells on both Ada (3.16%) and A100 (3.46%) and the qwen2.5-1.5b cell on Ada (2.20%) cross the upper equivalence bound. Practitioner reading: expect 1–3% decode slowdown under the safe compile path, not equivalence to eager and certainly not a speedup.

§6.4 What This Means

This is the strongest cross-regime closure in TR147. Four claims drop out cleanly:

  1. StaticCache + mode="default" makes compiled decode stable. Crash rate 0.000 across all six (model × GPU) cells under this combination.
  2. StaticCache + mode="default" does not make compiled decode faster. Decode overhead is 1.61–3.46% slower than eager in every cell, with very small negative Cohen's d values (all between −0.011 and −0.024). The effect is detectable only because n=300 per arm; practically, compiled and eager decode under this path are latency-equivalent.
  3. reduce-overhead + StaticCache remains unsafe on both GPUs. Crash rate is 0.800 (240/300 errors) in every (model × GPU) cell. The 60 ok rows in each cell are all at token_len=1; at token_len ≥ 8 the crash rate is 1.000 (see Appendix A).
  4. Prefill speedup under compile remains substantial under the stable path. StaticCache + mode="default" gives 54.4–63.2% prefill gain; StaticCache + reduce-overhead gives 68.7–82.4% prefill gain (at the cost of decode instability).

Observations. This result directly validates TR126's deeper diagnosis. The decode problem was not "compilation never works"; it was that the standard DynamicCache + reduce-overhead path was the wrong object to benchmark as if it were production-ready. Once the cache object is pinned to StaticCache and the compile mode is pinned to default, decode is boringly stable. The cost is a 1–3% slowdown, which is decision-grade but not the "2× speedup" one would extrapolate from the prefill number. The paper now has two cleanly separable headline claims: (a) compiled prefill is fast under StaticCache + default (50–63% gain); (b) compiled decode under StaticCache + default is equivalent-to-slightly-slower than eager. These do not combine into a single "compile makes inference faster" claim without misreading the data.
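For practitioners, the stable path in claims 1 and 4 corresponds to a short transformers recipe. A hedged sketch follows, assuming a recent transformers release that supports cache_implementation="static" in generate(); the model choice and generation length are illustrative, and the ~2–3% decode slowdown documented in §6.1–§6.2 should be budgeted for.

```python
"""Sketch of the stable compile path from claims 1-4 above: StaticCache plus
torch.compile(mode="default"). Model choice and lengths are illustrative."""
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-1.5B"  # one of the v4A models
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16
).cuda()

# mode="default" is the stable compile path; reduce-overhead crashes at
# token_len > 1 in every v4A cell.
model.forward = torch.compile(model.forward, mode="default")

inputs = tok("The quick brown fox", return_tensors="pt").to("cuda")
out = model.generate(
    **inputs,
    max_new_tokens=128,
    cache_implementation="static",  # StaticCache instead of DynamicCache
)
print(tok.decode(out[0], skip_special_tokens=True))
```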


§7. V4B: Large-Model Cell on Ada and A100

The large-model cell closes the "only small to mid-size models" objection, but with an important caveat: the Ada qwen companion path is AWQ-4bit (48 GB VRAM is not enough for dense FP16 qwen2.5-7b with StaticCache-size allocations), while the A100-SXM4-80GB qwen path is dense FP16. The llama3.1-8b path is the cleaner cross-GPU comparison because both sides are dense FP16.

Data: research/tr147/v4/results/large_model_{ada,a100}/large_model_cell.jsonl, 3,600 rows per GPU. Cells: model × mode × compile_mode × token_len ∈ {1, 8, 32, 128, 512}, n=60 iterations each.

§7.1 A100 Dense FP16 Large Models

| Model | Prefill Eager (ms) | Prefill Default (ms) | Prefill Delta vs Eager | Prefill Reduce-Overhead (ms) | Decode Default vs Eager | Reduce-Overhead Decode |
| --- | --- | --- | --- | --- | --- | --- |
| qwen2.5-7b FP16 | 25.061 | 12.406 | −50.50% | 11.436 (−54.37%) | +2.17% slower | 240/300 errors (0.800), d = +0.765 |
| llama3.1-8b FP16 | 27.729 | 12.998 | −53.12% | 11.959 (−56.87%) | +2.44% slower | 240/300 errors (0.800), d = +0.765 |

Source: research/tr147/v4/results/large_model_a100/large_model_cell.jsonl. Each "prefill" value aggregated across 300 ok rows; each "decode default vs eager" uses 300 ok rows per arm.

Observations. Large dense models on A100 preserve the pattern seen in the smaller v1–v2 models point-for-point. Compiled prefill remains strongly beneficial (50–57% gain under default, 54–57% under reduce-overhead). mode="default" decode is stable but 2–3% slower than eager. reduce-overhead decode is failure-prone at 80% crash rate, identical to the failure pattern seen in the 1B–3B regime. Cohen's d on the reduce-overhead decode contrast is +0.765, which reflects a large shift in mean driven by the 60 surviving token_len=1 rows rather than a genuine distributional overlap.

§7.2 Ada Large-Model Follow-Up

| Model | Prefill Eager (ms) | Prefill Default (ms) | Prefill Delta vs Eager | Prefill Reduce-Overhead (ms) | Decode Default vs Eager | Reduce-Overhead Decode |
| --- | --- | --- | --- | --- | --- | --- |
| llama3.1-8b FP16 | 23.582 | 19.223 | −18.48% | 18.636 (−20.97%) | +0.20% slower | 240/300 errors (0.800), d = +0.762 |
| qwen2.5-7b AWQ-4bit | 24.015 | 24.464 | +1.87% (slower) | 24.564 (+2.29%) | −0.47% (faster) | 0/300 errors (0.000), d = +0.006 |

Source: research/tr147/v4/results/large_model_ada/large_model_cell.jsonl.

Observations. Two findings dominate.

  • Llama 3.1 8B on Ada preserves the qualitative phase split: ~20% prefill gain under compile, neutral mode="default" decode, and an 80% reduce-overhead decode crash. The magnitude of the prefill gain (18.5%) is much smaller than on A100 (53.1%), which is consistent with Ada being memory-bandwidth-bound for a model this size.
  • Qwen 7B AWQ-4bit on Ada is a neutral, stable path: no material compile gain on prefill (within ±3% of eager), no mode="default" decode slowdown, and, crucially, no reduce-overhead decode crash (0/300 errors vs 240/300 on every dense cell). This is a companion non-result that matters: it shows that the compiled-decode failure is not a universal property of "large models"; it attaches to the specific (dense FP16, DynamicCache-or-StaticCache, reduce-overhead) path.

The late-stage report should therefore not say "large models always do X." The correct statement is narrower and stronger: large dense HF compile paths retain the phase split; quantized AWQ companions can behave differently enough to invalidate any unified headline. For the benchmarking-integrity paper this becomes another axis ("quantization path") that must be surfaced explicitly in any compile-benefit claim.

§7.3 Per-Token-Length Decode Behaviour

The reduce-overhead decode failure pattern is specifically: token_len=1 succeeds (60 rows), token_len ∈ {8, 32, 128, 512} all crash (60 × 4 = 240 rows). This holds on both Ada and A100 for llama3.1-8b FP16. Full per-token table is in Appendix A, Table A.3.

Observations. The "240/300" pattern in SS7.1 and SS7.2 decomposes cleanly: the single-token case is equivalent to a prefill and survives; the multi-token decode is where the cudagraph tree path actually exercises the memory-aliasing assumption that PyTorch PR #175562 is attempting to relax. The boundary is therefore not "compiled decode is flaky"; it is "compiled decode past the first autoregressive step is flaky under the current cudagraph tree assumption." This is a strictly more precise statement and is the correct framing for the paper.


§8. V4C: Triton Ablation on the Same Ada GPU

This is the result that makes TR147 a benchmarking-integrity report rather than a simple portability rerun. It is also the reviewer kill-shot closure flagged in papers/benchmarking_integrity/REVIEW_LIMITATIONS_REGISTER.md (gap I1, "Compiler-boundary claim is overstated against TR147").

The experiment held the GPU constant (RTX 6000 Ada), held the model constant (Qwen/Qwen2.5-1.5B, FP16), and changed only the Triton stack: 3.3.1 → 3.4.0 → 3.6.0. Each version has 3,600 rows (full factorial: prefill/kv_decode × Dynamic/Static cache × eager/default/reduce-overhead × 60 iterations × 5 token lengths), for a combined 10,800 rows. Source files: research/tr147/v4/results/triton_ablation_ada_{3.3.1,3.4.0,3.6.0}/triton_ablation.jsonl.
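As a cross-check on the row accounting, the per-version factorial enumerates as follows; a minimal sketch with the measurement call elided.

```python
"""Sketch of the 3,600-row factorial described above for one Triton version;
cell enumeration only, the measurement call is elided."""
from itertools import product

MODES = ["prefill", "kv_decode"]
CACHES = ["DynamicCache", "StaticCache"]
COMPILE_MODES = ["eager", "default", "reduce-overhead"]
TOKEN_LENS = [1, 8, 32, 128, 512]
ITERS = 60

cells = list(product(MODES, CACHES, COMPILE_MODES, TOKEN_LENS))
assert len(cells) * ITERS == 3600  # 2 x 2 x 3 x 5 x 60 rows per Triton version
```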

§8.1 Same Hardware, Different Triton — Prefill Summary (DynamicCache)

| Triton | Prefill Eager (ms) | Prefill Default (ms) | Prefill Default Delta | Prefill Reduce-Overhead (ms) | Prefill RO Delta | Prefill RO Crash Rate |
| --- | --- | --- | --- | --- | --- | --- |
| 3.3.1 | 21.218 | 7.888 | −62.82% | 4.829 | −77.24% | 0.000 |
| 3.4.0 | 21.009 | 21.186 | +0.84% (neutral) | 21.122 | +0.54% (neutral) | 0.000 |
| 3.6.0 | 21.420 | 21.262 | −0.74% (neutral) | 21.077 | −1.60% (neutral) | 0.000 |

§8.2 Same Hardware, Different Triton — Decode Summary (DynamicCache)

| Triton | Decode Eager (ms) | Decode Default (ms) | Decode Default Delta | Decode Reduce-Overhead (ms) | Decode RO Delta | Decode RO Crash Rate |
| --- | --- | --- | --- | --- | --- | --- |
| 3.3.1 | 2,585.781 | 2,625.979 | +1.55% (slower) | 4.904 on 60 ok rows | −99.81% (fall-through) | 0.800 |
| 3.4.0 | 2,588.260 | 2,658.930 | +2.73% (slower) | 2,620.872 | +1.26% (slower) | 0.000 |
| 3.6.0 | 2,611.631 | 2,602.942 | −0.33% (faster) | 2,625.575 | +0.53% (slower) | 0.000 |

§8.3 StaticCache Version-Comparison (Prefill)

| Triton | Prefill Eager (ms) | Prefill Default Delta | Prefill RO Delta |
| --- | --- | --- | --- |
| 3.3.1 | 20.458 | −61.74% | −76.20% |
| 3.4.0 | 21.982 | +8.54% (slower) | −4.31% (slightly faster) |
| 3.6.0 | 20.372 | +2.14% (slower) | +8.64% (slower) |

§8.4 Cross-Version Cohen's d Summary (vs Triton 3.3.1 baseline)

| Cell | Triton 3.4.0 vs 3.3.1, d | Triton 3.6.0 vs 3.3.1, d |
| --- | --- | --- |
| prefill DynamicCache default | −19.699 | −14.415 |
| prefill DynamicCache reduce-overhead | −32.837 | −22.050 |
| prefill StaticCache default | −0.763 | −25.521 |
| prefill StaticCache reduce-overhead | −48.928 | −21.368 |
| kv_decode DynamicCache reduce-overhead (surviving rows) | −0.768 | −0.763 |

Negative Cohen's d here means the newer Triton version has higher latency than 3.3.1 on compiled paths — i.e., the prefill speedup is lost. Absolute |d| values above 10 indicate very-large-magnitude effects with essentially no distributional overlap; the compile-vs-eager contrast is qualitatively different on 3.3.1 vs 3.4.0/3.6.0.

§8.5 Final Interpretation of the Triton Ablation

This is the single most important late-stage result in TR147.

On the same GPU and same model:

  • Triton 3.3.1 produces a fast but unstable compiled path: 62.8% prefill gain under default, 77.2% under reduce-overhead, but 80% decode crash rate under reduce-overhead (480/600 errors across both caches).
  • Triton 3.4.0 and 3.6.0 produce a stable but neutral compiled path: prefill gain under default drops to 0.84% (3.4.0) or −0.74% (3.6.0); reduce-overhead prefill drops to +0.54% or −1.60%; and the decode crash rate drops to 0.000 across all cells.

So the benchmark conclusion is not only hardware-contingent. It is software-stack contingent in a way that is large enough to flip the qualitative interpretation on the same physical GPU.

Observations. Two points matter for the paper.

First, this is not a marginal effect that a reviewer could hand-wave away. The cross-version Cohen's d values on the compile-vs-eager prefill contrast are in the range |d| ≈ 10–50; these are not noisy comparisons. The 62.82% → 0.84% prefill-gain collapse is an eight-sigma event in the null model where "Triton version doesn't matter."

Second, the direction of the effect is counter to the default engineering prior. A senior engineer would expect "newer Triton ⇒ same-or-faster prefill, same-or-more-stable decode." The data show the opposite: newer Triton loses the prefill speedup while the decode crash also disappears. These two effects are not coupled — they are two independent upstream changes that happen to land at the same software-stack boundary (see §8.7 for the detailed attribution). The decode stability win is a PyTorch-side fix to the cudagraph-tree path; the prefill speedup loss is a Triton-side codegen regression introduced by a documented LLVM + PTXAS register-spilling interaction in Triton 3.4.0 (see triton-lang/triton PR #7138) alongside register-allocator and swizzling rewrites. The practical consequence: a paper that reports only "compiled beats eager" or only "compiled crashes on decode" without pinning the Triton minor version is not merely incomplete; it is materially vulnerable to writing the wrong conclusion on its next re-run, for reasons that have nothing to do with the PyTorch-side correctness fix.

This is exactly the failure mode the benchmarking_integrity paper is supposed to surface, which is why the Triton ablation closes the register's top gap (I1).

§8.6 Eager-Sanity Across Triton Versions

As a further robustness check, the eager arms across the three Triton versions should be statistically equivalent (since eager does not compile kernels through Triton). From §8.1 and §8.2, eager prefill means across 3.3.1 / 3.4.0 / 3.6.0 are 21.218 / 21.009 / 21.420 ms (spread ≈ 2%) and eager decode means are 2,585.781 / 2,588.260 / 2,611.631 ms (spread ≈ 1%). Cohen's d on eager-to-eager cross-version contrasts from Appendix B.4 is |d| ≤ 0.15 on every prefill cell and |d| ≤ 0.02 on every decode cell.

Observations. The eager-sanity check is essentially flat, as it must be if the ablation is to isolate the Triton contribution. All SS8 interpretation therefore rests on compiled-path deltas, not on drift in the eager baseline. This is the most direct refutation of a "maybe the Triton 3.4.0 container is also different in some other way" objection.

SS8.7 Attribution of the Triton-3.4.0 Regression

Because the headline result is that newer Triton loses the prefill compile benefit, it is important to ask (i) whether this is a real Triton regression or a benchmark drift, and (ii) whether it is the same upstream change that also eliminated the reduce-overhead decode crash. The answers are (i) yes, a real Triton-side codegen regression and (ii) no, the stability win and the speedup loss are caused by separate upstream changes that coincidentally landed on the same 3.4.0 boundary.

Attribution of the decode-stability win (PyTorch side, not Triton). The reduce-overhead decode crash — the "cudagraph tree" assertion failure documented in SS7.3 and in the v1 failure mode — is resolved by pytorch/pytorch PR #175562 against Issue #175557, which relaxes a hard assert in dealloc_current_path_weakrefs() and is purely a correctness / assertion change. It does not alter Inductor codegen, Triton kernel emission, or PTX generation. It is a PyTorch-side fix and is orthogonal to the prefill-speedup regression.

Attribution of the prefill-speedup loss (Triton side, not PyTorch). Three lines of evidence point at Triton 3.4.0 codegen as the source of the prefill regression:

  1. The Inductor compile-metadata dumps (research/tr147/v4/results/triton_ablation_ada_3.4.0/) show the compiled graph identity remains the same across the three Triton versions. The generated Triton kernels differ in register usage and scheduling, not in the Python-level graph. If the PyTorch fix were the cause, the graph identity would also change.
  2. The Triton 3.4.0 release itself (triton-lang/triton v3.4.0 release notes) documents a breaking-change-class regression: "Bad interaction between new LLVM changes and PTXAS optimizations can cause increased register spilling in some kernels" (Triton PR #7138). Additional compile-path changes in the same release — dynamic register reallocation for warp specialization (PRs #6877, #6694, #6407) and a rewrite of the generic swizzling algorithm for convert_layout lowering (PR #6982) — all affect register pressure and memory coalescing on the exact kind of prefill kernel this experiment exercises.
  3. On StaticCache + default, Triton 3.6.0 shows a smaller regression (+2.14% slower) than 3.4.0 (+8.54% slower), suggesting partial recovery of the speedup as the 3.4.0 regression is walked back in the 3.5 → 3.6 series. This non-monotonic shape is consistent with a known upstream codegen regression being fixed, not with a tradeoff intrinsic to the PyTorch correctness fix.

Observations. The headline interpretation of SS8 therefore needs to be stated precisely. Triton 3.3.1 produced a fast-but-unstable path because (a) old Triton codegen gave aggressive inlining and tight register allocation, and (b) the PyTorch cudagraph-tree assertion was still too strict for the multi-step decode path. Triton 3.4.0 produced a stable-but-neutral path because (a) the Triton codegen regression from PR #7138 removed the aggressive inlining, and (b) the PyTorch PR #175562 relaxed the assertion. Both effects happened at the same software-line boundary, but for different reasons and by different authors. The right paper framing is therefore not "correctness fix costs 77% of prefill performance" — the correctness fix is free. The 77% prefill regression is a separable Triton-side codegen issue that is already partially remediated in Triton 3.6.0 and may be fully fixed in later versions. A practitioner who needs both the speed and the stability does not currently have a cell that delivers both, but that is a point-in-time software-stack statement, not a permanent tradeoff.


SS9. Cross-Phase Synthesis and Final Claim Status

SS9.0 Cross-Regime Headline Matrix

The following matrix is the single-screen summary of the entire TR147 evidence base. Rows are (GPU × Software-stack) combinations that TR147 observed; columns are the four decision-relevant cells; entries indicate crash rate under reduce-overhead decode plus, in parentheses, the compiled-prefill gain under the same compile mode.

Regime gpt2-class prefill gain / decode crash qwen2.5-1.5b prefill / decode llama3.x-1B/3B prefill / decode Large dense (7B/8B) prefill / decode
Ada v1 (Triton 3.3.1) 65.9–75.5% / N/A (v1 decode was pre-fix) 69.0% / crash (post-fix) N/A in v1 N/A in v1
Ada v2 (Triton 3.3.1, corrected) not retested crash 0.800 (E2 canary) N/A N/A
Ada v4 StaticCache (Triton 3.3.1) 76.3% / 0.800 77.1% / 0.800 68.7% / 0.800 18.5% / 0.800 (llama3.1-8b dense); 2.3% / 0.000 (qwen-AWQ)
Ada v4 Triton 3.4.0 (qwen2.5-1.5b only) 0.54% / 0.000
Ada v4 Triton 3.6.0 (qwen2.5-1.5b only) −1.60% / 0.000
A100 v3 (token_len 64) 81.7% / crash at token_len ≥ 64
A100 v3 (token_len ≥ 128) prefill crash / decode crash
A100 v4 StaticCache 69.8% / 0.800 82.4% / 0.800 79.5% / 0.800 54.4% / 0.800 (llama3.1-8b and qwen2.5-7b dense)
External gpt-fast A100-PCIe (pinned code, Triton 3.3.1/3.4.0/3.6.0) eager 33.83–35.29 tok/s stable; compiled 0/15 across all Triton versions
External gpt-fast A100-SXM (pinned code, Triton 3.6.0) eager 27.67 tok/s; compiled 0/5 (returncode=1)
External gpt-fast A100-SXM (HEAD code, Triton 3.6.0) eager 10.53 tok/s; compiled 106.74 tok/s (strong_match vs 104.9 target)

Observations. The headline matrix shows that a reader has to specify six axes (GPU, Triton version, cache type, compile mode, model path, and — per the external dual-variant probe — repository commit) to get a single unambiguous cell. There is no column in this matrix for which a single number summarises the entry. The rows where the crash rate drops to 0.000 are exactly the rows where the compiled-prefill gain also drops to near-zero (Ada Triton 3.4.0/3.6.0) or where the entire path is AWQ-4bit (Ada qwen AWQ). The safe-and-fast cell does not exist anywhere in this matrix under reduce-overhead decode; the closest thing is "Ada Triton 3.3.1 reduce-overhead prefill" (77% faster) paired with a different cache-and-mode triple for decode. The three external-probe rows show that stack axis and code axis both carry load: on pinned code the compiled path is dead on every tested A100 × Triton combination, while on HEAD code the same stack reproduces the public 104.9 tok/s claim at 106.74 tok/s.

SS9.1 What TR147 Now Demonstrates

Claim Final Status Primary Evidence
Phase separation is mandatory Demonstrated v1–v2 Ada (7 models), v4 StaticCache (6 cells), v4 large-model (4 cells), v4 Triton (3 versions)
Compiled prefill is a real phenomenon Demonstrated TR126, v1 (60.7–77.3%), v4 StaticCache (54.4–63.2%), v4 dense 7B/8B (50.5–57.1% on A100)
DynamicCache + reduce-overhead decode is unsafe Demonstrated v2 corrected Ada (crash ≈ 0.800), v4 StaticCache all-cells, v4 A100 large-dense (crash 0.800), v3 A100 (crash 1.000)
StaticCache rescues decode correctness, not speed Demonstrated v4A: 0.000 crash under default on 6 (model × GPU) cells; decode overhead +1.6 to +3.5%, TOST rejects speedup claim
Large dense FP16 models keep the phase split Demonstrated v4B: A100 qwen2.5-7b / llama3.1-8b (50–57% prefill, 80% decode crash); Ada llama3.1-8b (18.5% prefill, 80% crash)
All large-model paths behave the same Rejected Ada qwen2.5-7b AWQ companion (0% crash, ~neutral compile)
Hardware alone explains the boundary Rejected Same Ada GPU flips result across Triton 3.3.1 vs 3.4.0/3.6.0
A100 resolves the compiled-decode failure Rejected v3 A100 (100% crash all token_len), v4 A100 large-model (80% crash)
Software stack can flip benchmark conclusions Demonstrated v4 Triton ablation (SS8)
The five-gate protocol works prospectively Demonstrated TR147 as a whole; see SS10 for paper-level implications
External validity established on a public benchmark object Demonstrated Dual-variant gpt-fast probe (SS11.6–SS11.10): HEAD code on A100-SXM + torch 2.11.0+cu130 + Triton 3.6.0 reproduces the 104.9 tok/s README claim at 106.74 tok/s (strong_match); the Dec-2023 pinned code that originally produced the claim crashes 5/5 on the same stack
Code SHA is a load-bearing sixth axis alongside the existing five-tuple Demonstrated Dual-variant probe: identical stack, 104.9-class survival on HEAD vs 0/5 survival on pinned code

SS9.2 Combined Verdict Recomputation vs v1 WEAKENED Label

The v1 auto-report label ("compiler boundary: WEAKENED; phase separation: NOT_REQUIRED; external validity objection: OPEN; paper recommendation: UPDATE_LIMITATIONS") was the best reading available at the time. It is now explicitly superseded.

v1 Auto-Label v4 Final Label Reasoning
Compiler boundary: WEAKENED Compiler boundary: CONFIRMED AND QUALIFIED v2 fix + v4 Triton ablation show the boundary is real but attaches to a specific (Cache × Mode × Stack) triple
Phase separation: NOT_REQUIRED Phase separation: REQUIRED 96 cells × 2 GPUs × multiple Triton versions all show prefill and decode diverge qualitatively
External validity objection: OPEN External validity: CLOSED for two hardware classes; OPEN for mature-benchmark meta-study v3 + v4 span sm_80 and sm_89; third-party benchmark re-analysis is still future work
Paper recommendation: UPDATE_LIMITATIONS Paper recommendation: UPDATE_LIMITATIONS + ADD_CASE_STUDY The Triton ablation and the A100 amplification are publishable case studies in their own right

SS9.3 Final One-Sentence Verdict

TR147 validates the benchmarking-integrity thesis by showing that benchmark conclusions about torch.compile can change materially across phase, cache implementation, compiler mode, model-family/quantization path, Triton minor version on the same physical GPU, and gpt-fast repository commit on the same stack, while the five-gate protocol extended to a six-tuple benchmark identity (GPU, Triton, PyTorch, cache, compile mode, code SHA) is what keeps those shifts from being misreported as stable truths.
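A minimal sketch of what carrying that six-tuple identity alongside every reported cell could look like; the field names and example values are illustrative, not the harness's actual schema:

from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class BenchmarkIdentity:
    """The six-tuple TR147 argues must accompany any torch.compile
    benchmark claim. Field names here are illustrative only."""
    gpu: str            # e.g. "RTX 6000 Ada (sm_89)"
    triton: str         # e.g. "3.3.1"
    pytorch: str        # e.g. "2.6.x"
    cache: str          # "DynamicCache" | "StaticCache"
    compile_mode: str   # "eager" | "default" | "reduce-overhead"
    code_sha: str       # benchmark-object commit

ident = BenchmarkIdentity("RTX 6000 Ada (sm_89)", "3.3.1", "2.6.x",
                          "DynamicCache", "reduce-overhead",
                          "d2c5d8223fd00ab5ce469d8fdd93a7de78fb8b4a")
print(asdict(ident))  # attach to every reported cell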

SS9.4 Cross-TR Consistency Check

TR147's findings should be consistent with prior Banterhearts compiler studies where they overlap. Three cross-checks:

Prior TR Overlap With TR147 Consistency
TR117 (original benchmark matrix) Establishes the models and token lengths used as lineage in v1–v2 Consistent — qwen2.5 and gpt2 families rank in the same prefill-gain order
TR120 (compile paradox root-cause audit) Identifies cudagraph-tree + DynamicCache as the failure object Consistent — SS7.3 per-token decomposition matches TR120's "first-autoregressive-step" hypothesis
TR126 (Linux/Triton reference regime) Reports 60–75% prefill gain on RTX 4080 Laptop + Triton 3.3.1 Consistent — v1 Ada Triton 3.3.1 regime reproduces 60.7–77.3% gains (SS3.1); tolerance window ±5pp met

Observations. TR147 is not a cross-TR outlier. Every finding in SS3–SS8 that overlaps with a prior Banterhearts compiler study falls within the previously reported range once regime attribution is made explicit. The only finding that was not anticipated by prior TRs is SS5's A100 amplification, which is genuinely new evidence because no prior TR covered Ampere. TR147 therefore functions as both a confirmation (for Ada/Triton-3.3.1 cells) and as a first-time exploration (for Ampere + newer Triton cells).

SS9.5 What TR147 Does Not Demonstrate

TR147 does not demonstrate:

  • That a single backend ranking is portable across regimes (the data reject this).
  • That all compiler failures on decode have the same root cause (the 80% vs 100% split between Ada v4 and A100 v3 suggests multiple failure modes may be at play).
  • That A100 and Ada are fully covered under a symmetric matrix (70/94 conditions in the combined bundle are Ada-only).
  • That AWQ and dense FP16 are interchangeable for compiler benchmarking (SS7 rejects this).
  • That Triton 3.3.1 is "the correct" baseline (it is the one TR126 used; newer versions are different benchmark objects, not "later" ones).

SS10. Implications for the Benchmarking Integrity Paper

The paper should not be written around the stale v2 headline anymore. The final paper position is stronger.

SS10.1 Paper-Safe Claims

  • The five-gate protocol survived prospective application on a second hardware class.
  • The late follow-ups made the conclusion sharper rather than diluting it.
  • The real portability hazard is environment drift (hardware + Triton + cache + mode), not merely "another GPU."
  • Stable compiled decode is possible under StaticCache + mode="default", but the stable path does not produce a decode speedup in the tested regimes.
  • The same physical GPU can produce opposite qualitative conclusions under two Triton minor versions that differ by one patch-level upgrade.
  • A public benchmark headline can fail cleanly under a modern pinned stack even when authentication, checkpoint conversion, and artifact capture all succeed.

SS10.2 Claims the Paper Should Stop Making

  • "The decode boundary is simply more severe on larger GPUs." — Partially right, partially wrong. A100 v3 is more severe (100% vs 80%); Ada large-model v4 is the same as small-model Ada v2 (80%). The severity is bounded by the (Cache × Mode × Stack) triple, not by GPU size.
  • "Compiled decode categorically fails on all future stacks." — False. Triton 3.4.0 and 3.6.0 don't fail on decode; they just also don't win on prefill.
  • "Hardware portability is the main remaining objection." — False. The Triton ablation shows software-stack portability is the bigger open objection and now the better-evidenced finding.
  • "A public benchmark table is evidence enough on its own." — False. The gpt-fast case study needed commit pinning, model-source pinning, stack lock, CRI, and raw row artifacts before it became interpretable at all.

SS10.3 Five-Gate Protocol — Final Mapping

TR126's five gates, reviewed against TR147's final evidence:

  1. Gate 1 — Phase separation reporting. TR147 enforces per-phase reporting in every SS section; every table splits prefill and decode. Status: operational.
  2. Gate 2 — Environment attribution. TR147 pins GPU class (sm_80 vs sm_89), VRAM (12/40/48/80 GB), OS (Linux), PyTorch build, Triton version, and driver level in each run manifest. Status: operational, now including Triton minor version as a first-class axis.
  3. Gate 3 — Artifact evidence. Every cell has a raw JSONL or CSV row-for-row under a dated run directory; SS13 lists them. Status: operational.
  4. Gate 4 — Cache and mode surfacing. TR147 makes Dynamic vs Static cache, and eager/default/reduce-overhead mode, explicit axes in all v4 tables. Status: operational, refined by v4A and v4C.
  5. Gate 5 — Cross-regime reproducibility. Tested by v3 (cross-GPU), v4C (cross-Triton), and v4A (cross-cache). Status: operational; the reproducibility target is "same qualitative outcome with same [Cache × Mode × Stack] triple."

Observations. The protocol passes all five gates. What TR147 adds relative to TR126's original formulation is that Gate 4 must now include Triton version and cache implementation, not merely compile mode. A paper that lists compile mode but omits Triton version fails Gate 4 in the post-TR147 formulation.
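A minimal sketch of the post-TR147 Gate 4 check described above, assuming a flat run-manifest dict; the key names are illustrative rather than the harness's real schema:

# A manifest that lists compile mode but omits Triton version (or cache
# type) fails Gate 4 in the post-TR147 formulation.
REQUIRED_AXES = {"gpu", "triton_version", "pytorch_version",
                 "cache_type", "compile_mode", "code_sha"}

def gate4_passes(manifest: dict) -> bool:
    missing = REQUIRED_AXES - manifest.keys()
    if missing:
        print(f"Gate 4 FAIL: manifest missing axes {sorted(missing)}")
        return False
    return True

gate4_passes({"gpu": "sm_89", "compile_mode": "reduce-overhead"})  # -> False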

SS10.4 Reviewer Objections Now Closed

Per papers/benchmarking_integrity/REVIEW_LIMITATIONS_REGISTER.md:

Objection Pre-v4 Status Post-v4 Status
I1: Compiler-boundary claim overstated against TR147 Open Closed via Triton ablation (SS8) — boundary is real but qualified by stack
"You only have one GPU" Open Closed via v3 A100 and v4 A100 large-model (SS5, SS7.1)
"You only have one cache type" Open Closed via v4 StaticCache on both GPUs (SS6)
"You only have one compile mode" Open Closed via default × reduce-overhead on every v4 cell
"You only have small models" Open Closed via v4 large-model (7B/8B on both GPUs) — with the AWQ caveat noted
"Maybe it's a measurement bug" Partially closed (v2 fix) Fully closed via bootstrap CIs, Cohen's d, TOST across 52,320 internal rows
"How do I reproduce this?" Partially addressed Closed via SS13 source-of-truth list + compute_v4_stats.py
"This is still only your own benchmark family" Open Partially closed via SS11 external gpt-fast case study on A100-PCIe

SS11. External Validity Case Study: gpt-fast Probes

The final reviewer hole was not about another internal matrix cell. It was whether the five-gate protocol survives contact with a public benchmark object that was not designed inside the Banterhearts reporting chain. For that purpose TR147 carries two tightly scoped external probes of the benchmark-era pytorch-labs/gpt-fast README baseline (Llama-2-7B Base, reported at 104.9 tok/s on A100-80GB). The first probe (SS11.1–SS11.5) sweeps Triton 3.3.1/3.4.0/3.6.0 on A100-PCIe holding the pinned gpt-fast commit fixed; the compiled path never completes a single invocation on any tested Triton version, leaving the reproduction question open on the code axis. The second probe (SS11.6–SS11.10) inverts the design: it holds the stack fixed at a single known-good point (A100-SXM4-80GB, torch 2.11.0+cu130, Triton 3.6.0) and sweeps the gpt-fast commit SHA between the Dec-2023 pinned benchmark-era code and current HEAD. The dual-variant probe produces the sharper external-validity headline and is the load-bearing external finding for the downstream paper.

SS11.1 Target Claim and Stack Lock

The target claim is the README benchmark-table row pinned by research/tr147/v4/gpt_fast_probe.py: "Llama-2-7B Base = 104.9 tok/s on A100-80GB." The probe does not claim a full historical replay of the original 2023 wheel stack. It asks a narrower and more relevant portability question: if we pin the benchmark object itself and hold a modern stack fixed, does the public headline survive Triton drift?

The lock file for the case study is research/tr147/v4/results/gpt_fast_probe/stack_lock.json:

  • gpt_fast_sha = d2c5d8223fd00ab5ce469d8fdd93a7de78fb8b4a
  • torch_version = 2.7.1
  • triton_versions = [3.3.1, 3.4.0, 3.6.0]
  • gpu_tag = a100_pcie
  • resolved_model = meta-llama/Llama-2-7b-chat-hf

Experimental design:

  • GPU: A100-PCIe-80GB
  • Cells per Triton version: eager, compiled
  • Repetitions: 5 per cell
  • Samples per repetition: 5
  • Primary analyzed rows: 30 per Triton version, 90 total
  • Preflight smoke rows: 3 additional rows, retained for audit but excluded from the primary total

SS11.2 Case-Study Results

Triton Eager median tok/s Eager median TTFT (ms) Eager decode ms/tok Eager bandwidth (GB/s) Compiled ok reps Compiled crash rate Median wall time to compiled failure (s) Verdict
3.3.1 34.86 24.50 28.72 469.57 0 / 5 1.000 61.50 no valid compiled regime
3.4.0 35.29 24.37 28.46 474.05 0 / 5 1.000 84.11 no valid compiled regime
3.6.0 33.83 25.01 29.59 455.86 0 / 5 1.000 57.75 no valid compiled regime

The eager path is stable and boring in exactly the way a good control should be. Across the three Triton versions the eager medians differ by only about 4.3%, TTFT stays near 24–25 ms after warm-up, peak VRAM stays at about 13.6 GB, and the throughput lands at roughly one third of the public 104.9 tok/s headline. The compiled path never produces a single valid timed repetition on any Triton version.

This is why the external CRI file at research/tr147/v4/results/gpt_fast_probe/gptfast_cri.json returns:

  • classification = "invalid"
  • reason = "need >= 2 stack points, got 0"

The CRI is not "negative" or "weakened." It is invalid because there are zero compiled stack points to compare.
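The guard that produces this classification can be sketched as follows. This is the logic implied by the reason string, not the actual compute_cri.py implementation:

def classify_cri(compiled_points: list[float]) -> dict:
    """With fewer than two surviving compiled stack points there is
    nothing to compare, so the CRI is 'invalid' rather than any
    robustness grade. Sketch only; see compute_cri.py for the real code."""
    if len(compiled_points) < 2:
        return {"classification": "invalid",
                "reason": f"need >= 2 stack points, got {len(compiled_points)}"}
    # ...real CRI scoring over the perturbation set would go here...
    return {"classification": "computed"}

print(classify_cri([]))  # matches the case-study output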

SS11.3 What Failed, Exactly

The case study is useful because it fails at the right boundary.

What did not fail:

  • HF authentication
  • gated model access
  • checkpoint download
  • checkpoint conversion
  • eager generation
  • artifact writing and pullback

What did fail:

  • compiled decode on every Triton slice

Representative failure signatures from the per-version JSONL stderr tails:

Triton Failure signature
3.3.1 crash after generate.py:decode_one_token through torch._dynamo into torch._functorch.aot_autograd runtime wrappers
3.4.0 explicit InductorError from torch/_inductor/compile_fx.py during codegen / module load
3.6.0 failure deeper in aot_autograd / torch._inductor.compile_fx dispatch and compile path

The key point is that Triton changes the shape of the failure, but not the outcome. Unlike SS8, where newer Triton versions stabilized decode and erased the prefill win on the same Ada GPU, the gpt-fast benchmark object never reaches a valid compiled regime on the tested modern stack.
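For auditing, the per-version signatures above can be pulled straight from the probe JSONL files. A minimal sketch, assuming hypothetical "status" and "stderr_tail" row fields (the probe's actual schema may differ):

import json

def failure_signatures(path: str, tail_lines: int = 3) -> None:
    """Print the last stderr lines of each failed row in a probe JSONL."""
    with open(path) as f:
        for line in f:
            row = json.loads(line)
            if row.get("status") != "ok":
                tail = row.get("stderr_tail", "").splitlines()[-tail_lines:]
                print(row.get("triton", "?"), "|", " / ".join(tail))

# failure_signatures("research/tr147/v4/results/gpt_fast_probe/3.4.0/gpt_fast_probe.jsonl")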

SS11.4 Interpretation

This case study does not overturn the internal TR147 result. It sharpens it.

Inside the Banterhearts harness, Triton minor-version drift is sufficient to flip the qualitative story on the same Ada GPU:

  • Triton 3.3.1: fast prefill, unstable decode
  • Triton 3.4.0 / 3.6.0: stable decode, neutral prefill

Inside the pinned gpt-fast benchmark object on A100-PCIe, Triton drift is not sufficient to rescue the compiled path:

  • eager remains stable at 33.83–35.29 tok/s
  • compiled remains 0/15 successful reps

The more honest conclusion is therefore not "Triton explains everything." The conclusion is:

  1. Triton can be the decisive axis inside one benchmark object (SS8).
  2. A public benchmark object can still fail cleanly across all tested Triton versions if the surrounding compile stack has drifted enough (this section).
  3. Therefore the portable unit of analysis is the full benchmark object: model source, benchmark code, compiler stack, cache path, compile mode, and GPU regime.

SS11.5 What This Closes for the Paper

The external case study is valuable even though it is a non-reproduction.

  • It answers the "internal-only chain" objection with artifact evidence rather than rhetoric.
  • It shows that the five-gate protocol can be applied prospectively to a public benchmark target.
  • It demonstrates a failure mode that reviewers care about: a public benchmark headline can be impossible to reproduce on a modern controlled stack even when the environment is fully pinned and the artifacts are complete.

The paper should cite this section as a clean external-validity non-reproduction, not as a performance tuning result and not as a proof that the original historical claim was false on its original stack.

SS11.6 Dual-Variant Probe: Stack Constant, Code Varied

The A100-PCIe probe (SS11.1–SS11.5) left the reader unable to separate two explanations for its all-zero compiled outcome: the modern stack may be incompatible with the pinned 2023 code, or the compile path may simply be broken at this point on A100-PCIe regardless of code age. A single follow-up experiment disambiguates: hold the stack constant on one known-good A100 point and sweep the gpt-fast code commit itself.

Run directory: research/tr147/v4/pulls/gpt_fast_latest_dual_20260418/gpt_fast_latest_dual/. Authoritative summary: latest_dual_summary.json. Stack manifest: latest_stack_manifest.json. Per-variant raw data in pinned_benchmark_commit/{summary,manifest,probe.log,gpt_fast_probe.jsonl} and latest_head/{summary,manifest,probe.log,gpt_fast_probe.jsonl}.

Held constant in both variants:

  • GPU: NVIDIA A100-SXM4-80GB (85.09 GB reported by the device)
  • torch: 2.11.0+cu130
  • Triton: 3.6.0
  • CUDA: 13.0
  • driver: common pod image
  • Model: meta-llama/Llama-2-7b-chat-hf
  • Prompt text, num_samples per invocation, repetitions per cell (5 reps × 5 samples, warmup sample dropped)
  • Probe harness: unmodified gpt-fast generate.py, tokens/sec read from the benchmark's own Average tokens/sec: log line (a minimal extraction sketch follows the variant descriptions below).

Varied:

  • gpt-fast commit SHA.

Variant 1 (pinned_benchmark_commit): d2c5d8223fd00ab5ce469d8fdd93a7de78fb8b4a. This is the Dec-2023 commit that originally produced the 104.9 tok/s README headline.

Variant 2 (latest_head): 6ecad9b5b6b987d17ac4303965545873d0192086. This is the current gpt-fast main at the time of measurement.
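Because the probe treats generate.py as an unmodified black box, throughput is recovered by scraping the benchmark's own stdout. A minimal sketch of that extraction, assuming the log label is literally "Average tokens/sec:" followed by a float; this is an illustration, not the probe's actual parser:

import re

TOKPS_RE = re.compile(r"Average tokens/sec:\s*([0-9.]+)")

def parse_avg_tokps(stdout: str) -> float | None:
    """Pull the benchmark's self-reported throughput from its stdout."""
    m = TOKPS_RE.search(stdout)
    return float(m.group(1)) if m else None

assert parse_avg_tokps("Average tokens/sec: 104.90") == 104.9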

SS11.7 Dual-Variant Probe: Results

Variant gpt-fast SHA Eager median tok/s [min, max], CV Compiled ok / crash Compiled median tok/s [min, max], CV Reproduction band
README target 104.9 (claim)
Pinned d2c5d8223f 27.67 [26.88, 28.01], CV 0.0131 0 / 5 (100% crash) — (no surviving samples) null
HEAD 6ecad9b5b6 10.53 [10.35, 10.68], CV 0.0073 25 / 0 (0% crash) 106.74 [105.43, 107.73], CV 0.0066 strong_match

Pinned eager is stable on n=20 timed samples (CV 0.0131). Pinned compiled: 5/5 invocations crash with returncode=1, walltime range 33.7–104.5 s per invocation (some crashes land during compile, others during first execution of the compiled callable). Terminal frame across all five crashes is torch/_inductor/output_code.py:656 inside self.current_callable(inputs); the exception class itself is truncated by a stderr-capture bug in gpt_fast_probe.py.

HEAD eager is stable on n=20 timed samples (CV 0.0073), 62% slower than pinned eager (10.53 vs 27.67 tok/s). HEAD compiled is stable on n=20 timed samples (CV 0.0066), median 106.74 tok/s — above the published 104.9 target, classifying as strong_match (≥ 90 tok/s) under the case-study-plan reproduction thresholds.

Representative crash signature from pinned_benchmark_commit/gpt_fast_probe.jsonl (stderr tail excerpt):

File "/tmp/venvs/gptfast_latest_dual/lib/python3.11/site-packages/torch/_inductor/output_code.py", line 656, in __call__ / return self.current_callable(inputs) / ^^^^^^^^^^^^^ (stderr truncated by harness)

Representative HEAD success row (latest_head/gpt_fast_probe.jsonl, first compiled sample): tokps=107.49, inference_total_s=1.86, bandwidth_gbps=1420.48, ttft_ms=97.58, decode_ms_per_tok=8.86, compile_time_s=70.18, memory_used_gb=13.70, avg_tokps_invocation=107.65, status=ok.

SS11.8 Interpretation

The dual-variant result is sharper than any of the three originally pre-registered interpretation branches (stack-fragile / robust / total non-reproduction). The observed outcome is a fourth shape:

The 104.9 tok/s claim is reproducible — but only by people running HEAD code on HEAD stack (measured 106.74 tok/s = strong_match). The Dec-2023 code that originally produced 104.9 no longer works at all on the current stack; its compiled cell crashes 5/5 with returncode=1 on every invocation. Benchmark numbers in this regime are maintained by continuous code maintenance, not by stack permanence.

Two additional observations follow from the table:

  1. Code-version asymmetry on eager alone. HEAD eager is 62% slower than pinned eager on the same hardware and stack (10.53 vs 27.67 tok/s). HEAD's eager path does more work per forward pass — plausibly flex-attention plumbing and correctness-path layers introduced between the two commits. The compile-vs-eager ratio therefore grew from the original gpt-fast ~4× headline claim to 10.14× on HEAD (106.74 / 10.53). Compile did not get 2.5× faster in absolute tok/s; eager got slower and the ratio widened.
  2. Extension to the benchmark-identity tuple. TR147 SS8 / SS10 established that a compile-path benchmark claim requires disclosing (GPU sm_N, Triton minor version, PyTorch build, cache type, compile mode). The dual-variant finding adds a sixth axis, code SHA. The CRI definition as written (see research/tr147/v4/compute_cri.py) operates on stack-axis perturbation sets. A code-axis CRI variant with perturbation set {pinned, HEAD} holding stack constant is a natural extension; this report names the extension and does not re-calibrate its thresholds because the pinned compiled cell has zero surviving samples and the existing robust/sensitive/fragile/catastrophic bands were calibrated for distributional shifts, not zero-survival cases.

SS11.9 Scope, Limitations, and Cross-Check with SS11.1–SS11.5

The dual-variant probe is narrow by construction and must be read alongside the earlier A100-PCIe probe, not as a replacement.

  • Only one stack point is tested (torch 2.11.0+cu130, Triton 3.6.0). The dual-variant probe is not a three-Triton stack ablation; that is what SS11.1–SS11.5 was.
  • Only one model (Llama-2-7b-chat-hf) and one GPU class (A100-SXM4-80GB). Generalization to other models and SKUs is not claimed.
  • The A100-PCIe 3-Triton ablation on the pinned code (SS11.1–SS11.5) and the A100-SXM 1-stack × 2-code probe (SS11.6–SS11.8) are consistent: pinned gpt-fast code does not produce a successful compiled run on any tested combination of (A100-PCIe × {Triton 3.3.1, 3.4.0, 3.6.0}) or (A100-SXM × Triton 3.6.0). The broken axis is the pinned code's compile path, not the GPU SKU or the Triton minor version.
  • The exception class is not captured. The probe records returncode=1, the crash-time walltime distribution, and the terminal stack frame (torch/_inductor/output_code.py:656), which is sufficient to assert 0 survival but not to attribute the crash to a specific upstream commit. Future probes should patch the stderr-tail capture in gpt_fast_probe.py to preserve the exception class across the subprocess boundary.
  • The probe does not decompose prefill vs decode because it reads gpt-fast's own aggregate Average tokens/sec: logline. Per-phase decomposition would require modifying generate.py, which would break the "unmodified benchmark object" discipline.

SS11.10 What the Dual-Variant Probe Closes for the Paper

The three-Triton A100-PCIe probe (SS11.5) closed the "internal-only chain" objection with a non-reproduction. The dual-variant probe closes a sharper objection: "maybe a pinned benchmark-era compile benchmark is always broken on modern stacks, so non-reproduction is uninformative." The answer is no. HEAD code reproduces the claim on the same modern stack that crashes the pinned code 5/5. The non-reproduction in SS11.5 is therefore not a generic fact about modern stacks; it is specifically a fact about the pinned code's interaction with modern torch.compile infrastructure. What sustains the 104.9 tok/s claim across three years of upstream drift is continuous code maintenance, not stack permanence, which is the external-validity statement the paper was missing.


SS12. Limitations and Remaining Open Work

The following are the honest gaps.

  1. The A100 matrix is narrower than the Ada matrix. v3 covers only qwen2.5-1.5b and qwen2.5-3b at three token lengths, and v4 A100 covers StaticCache (3 models × 5 token lengths) and large-model (2 models × 5 token lengths). A full mirrored rerun of every Ada condition on A100 is not in TR147's budget. The combined-bundle verdict file shows 70 Ada-only conditions out of 94 total. The most valuable single follow-up would be an A100 Triton-version ablation matching SS8 exactly.
  2. The large-model qwen path is not apples-to-apples across GPUs. A100 uses dense FP16 qwen2.5-7b; Ada uses AWQ-4bit qwen2.5-7b because 48 GB is marginal for dense FP16 at this prompt/KV sizing. The llama3.1-8b cross-GPU comparison is apples-to-apples (dense FP16 on both) and is the preferred citation.
  3. Triton ablation uses one model. Qwen2.5-1.5b is the right canary (used throughout TR126 and v1–v2), but cross-family confirmation would strengthen the stack-attribution claim. A llama3.2-1b × 3-Triton-version follow-up is a ~3,600-row experiment and is tractable.
  4. The external case study is one public benchmark object, not a broad survey. gpt-fast is a good target because it is public, specific, and tied to a concrete headline. It is still only one benchmark family. A second public case study would improve external breadth, but it is no longer required to justify the paper's methodological claim.
  5. The external case study is not a historical full-stack replay. We pinned the benchmark object (gpt-fast commit and model source) and varied Triton on a modern torch==2.7.1 stack. That is the right portability probe for this paper. It is not sufficient to answer whether the original 104.9 tok/s claim held on its exact historical wheels and drivers.
  6. The v4 Triton result is highly informative but the formal inferential appendix is compact. The qualitative flip is unmissable (|d| > 10 on multiple cells; 80% vs 0% crash rate), so this is more a style critique than a substantive gap. A reviewer wanting full per-cell Holm-adjusted p-values can read Appendix B.
  7. No runtime power / energy measurements. TR147 measures latency and stability. Compute-economics claims (e.g., "compiled prefill is X% more energy-efficient") are not supported by TR147 data.
  8. Temperature is fixed. All latency measurements are at default GPU thermal state; sustained-thermal degradation is not probed.

None of these limitations erase the main result; they describe the scope of the claim and the next sensible experiment.


SS13. Reproducibility and Source of Truth

SS13.1 Canonical Paths

  • v1 Ada: research/tr147/results/20260412_195222/
  • v2 Ada (corrected): research/tr147/v2/20260413_054740/
  • v3 A100: research/tr147/v3/20260414_190849/
  • v4 integrated results: research/tr147/v4/results/
  • external case study: research/tr147/v4/results/gpt_fast_probe/
  • v4 combined bundle / portability verdict: research/tr147/v4/combined_bundle/20260416_181351/
  • v4 integration manifest: research/tr147/v4/results/final_v4_integration_manifest.json

SS13.2 Key Data Files

Stage File Rows
v1 report research/tr147/results/20260412_195222/tr147_report.md narrative
v1 analysis research/tr147/results/20260412_195222/tr147_analysis.json summary
v2 validation research/tr147/v2/20260413_054740/v2_validation_summary.json summary
v3 A100 metrics research/tr147/v3/20260414_190849/e3/metrics.csv 1,440
v4 StaticCache A100 research/tr147/v4/results/static_cache_a100_v2/static_cache_retest.jsonl 5,400
v4 StaticCache Ada research/tr147/v4/results/static_cache_ada_v2/static_cache_retest.jsonl 5,400
v4 Large Model A100 research/tr147/v4/results/large_model_a100/large_model_cell.jsonl 3,600
v4 Large Model Ada research/tr147/v4/results/large_model_ada/large_model_cell.jsonl 3,600
v4 Triton 3.3.1 research/tr147/v4/results/triton_ablation_ada_3.3.1/triton_ablation.jsonl 3,600
v4 Triton 3.4.0 research/tr147/v4/results/triton_ablation_ada_3.4.0/triton_ablation.jsonl 3,600
v4 Triton 3.6.0 research/tr147/v4/results/triton_ablation_ada_3.6.0/triton_ablation.jsonl 3,600
external case-study lock research/tr147/v4/results/gpt_fast_probe/stack_lock.json metadata
external case-study summary research/tr147/v4/results/gpt_fast_probe/gptfast_cri.json summary
external case-study Triton 3.3.1 research/tr147/v4/results/gpt_fast_probe/3.3.1/gpt_fast_probe.jsonl 30
external case-study Triton 3.4.0 research/tr147/v4/results/gpt_fast_probe/3.4.0/gpt_fast_probe.jsonl 30
external case-study Triton 3.6.0 research/tr147/v4/results/gpt_fast_probe/3.6.0/gpt_fast_probe.jsonl 30
dual-variant summary (SS11.6–SS11.10) research/tr147/v4/pulls/gpt_fast_latest_dual_20260418/gpt_fast_latest_dual/latest_dual_summary.json summary
dual-variant stack manifest research/tr147/v4/pulls/gpt_fast_latest_dual_20260418/gpt_fast_latest_dual/latest_stack_manifest.json metadata
dual-variant pinned code raw research/tr147/v4/pulls/gpt_fast_latest_dual_20260418/gpt_fast_latest_dual/pinned_benchmark_commit/gpt_fast_probe.jsonl 30
dual-variant HEAD code raw research/tr147/v4/pulls/gpt_fast_latest_dual_20260418/gpt_fast_latest_dual/latest_head/gpt_fast_probe.jsonl 50

SS13.3 Reproduction Commands

# (Ada or A100, depending on run host)
python research/tr147/v4/static_cache_retest.py \
    --gpu ada \
    --output research/tr147/v4/results/static_cache_ada_v2 \
    --repetitions 60 \
    --warmup-repetitions 3 \
    --models gpt2-100m,qwen2.5-1.5b,llama3.2-1b

python research/tr147/v4/large_model_cell.py \
    --gpu a100 \
    --output research/tr147/v4/results/large_model_a100 \
    --repetitions 60 \
    --warmup-repetitions 3 \
    --models qwen2.5-7b,llama3.1-8b

python research/tr147/v4/triton_ablation.py \
    --triton-version 3.3.1 \
    --model Qwen/Qwen2.5-1.5B \
    --output research/tr147/v4/results/triton_ablation_ada_3.3.1 \
    --repetitions 60 \
    --warmup-repetitions 3

# Repeat the triton_ablation command for 3.4.0 and 3.6.0 with matching
# --triton-version and --output values.

# External case study
python research/tr147/v4/gpt_fast_probe.py \
    --gpt-fast-dir /workspace/gpt-fast \
    --checkpoint /workspace/checkpoints/meta-llama/Llama-2-7b-chat-hf/model.pth \
    --gpu a100_pcie \
    --triton-label 3.3.1 \
    --output research/tr147/v4/results/gpt_fast_probe/3.3.1 \
    --reps 5 \
    --num-samples 5 \
    --expected-gpt-fast-sha d2c5d8223fd00ab5ce469d8fdd93a7de78fb8b4a

python research/tr147/v4/compute_cri.py \
    --mode external \
    --input research/tr147/v4/results/gpt_fast_probe \
    --output research/tr147/v4/results/gpt_fast_probe/gptfast_cri.json

# Statistics recomputation
python research/tr147/compute_v4_stats.py > research/tr147/v4/v4_stats_snapshot.txt

Seeds: 42 for bootstrap. Iterations: 60 timed + 3 warmup per cell for the internal TR147 harness; 5 reps × 5 samples for the external gpt-fast probe. Expected runtime: v4 StaticCache ≈ 3 hours per GPU; v4 large-model ≈ 2.5 hours per GPU; each Triton ablation ≈ 1.5 hours; external gpt-fast full sweep ≈ 2–3 hours plus checkpoint conversion. Total v4 budget excluding retries: ≈ 14 GPU-hours on A100-SXM4 + ≈ 8 GPU-hours on RTX 6000 Ada + ≈ 3 A100-PCIe GPU-hours for the case study.

SS13.4 Final Source-of-Truth Note

The earlier publish-ready TR147 drafts dated 2026-04-13 (Ada-only, "WEAKENED" label) and 2026-04-17 (internal full-depth, no external case study) are superseded by this version 4.1 file. The v4.1 file is the authoritative TR147 narrative for all downstream reports and papers.


SS14. References

  • TR117 — Original benchmark matrix lineage.
  • TR120 — Compile paradox root-cause audit.
  • TR126 — Linux/Triton reference regime and reporting standard.
  • pytorch-labs/gpt-fast README at commit d2c5d8223fd00ab5ce469d8fdd93a7de78fb8b4a — Public benchmark target for the external-validity case study; benchmark table includes the 104.9 tok/s A100-80GB Llama-2-7B Base headline probed in SS11.
  • PyTorch PR #175562 — Assertion-side fix relevant to the cudagraph tree path, drafted during this research program. Stability fix; does not alter codegen.
  • PyTorch Issue #175557 — Compiled decode / cudagraph tree failure context.
  • Triton v3.4.0 Release Notes — Documents the prefill regression attributed to SS8. Key items: PR #7138 (documented LLVM + PTXAS register-spilling regression), PRs #6877 / #6694 / #6407 (dynamic register reallocation for warp specialization), PR #6982 (generic swizzling rewrite for convert_layout lowering). These codegen changes, not the PyTorch correctness fix, explain the 62–77% → 0–3% prefill-speedup collapse.
  • Triton v3.4.0 Release Tracker — Issue #7315 — Full PR list and release status.
  • HuggingFace Transformers Issue #27837 — Upstream StaticCache discussion.
  • Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences. Erlbaum — pooled-variance Cohen's d.
  • Efron, B. and Tibshirani, R. (1993). An Introduction to the Bootstrap. Chapman & Hall — percentile bootstrap method.
  • Holm, S. (1979). "A simple sequentially rejective multiple test procedure." Scandinavian Journal of Statistics 6: 65–70.
  • Schuirmann, D. J. (1987). "A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability." Journal of Pharmacokinetics and Biopharmaceutics 15(6): 657–680 — TOST.
  • Welch, B. L. (1947). "The generalization of 'Student's' problem when several different population variances are involved." Biometrika 34: 28–35.

Appendix A — Raw Per-Cell Tables

A.1 V4 StaticCache Ada — All Cells (n=300 each)

Model Mode Cache Compile Mode n Crash Rate Mean wall_ms 95% CI
gpt2-100m prefill StaticCache eager 300 0.000 4.190 [4.181, 4.201]
gpt2-100m prefill StaticCache default 300 0.000 1.865 [1.856, 1.875]
gpt2-100m prefill StaticCache reduce-overhead 300 0.000 0.993 [0.991, 0.995]
gpt2-100m kv_decode StaticCache eager 300 0.000 530.378 [446.046, 613.726]
gpt2-100m kv_decode StaticCache default 300 0.000 547.115 [461.226, 631.627]
gpt2-100m kv_decode StaticCache reduce-overhead 300 0.800 0.984 (60 ok) [0.976, 0.993]
llama3.2-1b prefill StaticCache eager 300 0.000 11.685 [11.638, 11.731]
llama3.2-1b prefill StaticCache default 300 0.000 4.485 [4.467, 4.504]
llama3.2-1b prefill StaticCache reduce-overhead 300 0.000 3.662 [3.658, 3.667]
llama3.2-1b kv_decode StaticCache eager 300 0.000 1,450.320 [1,221.278, 1,675.293]
llama3.2-1b kv_decode StaticCache default 300 0.000 1,475.299 [1,240.375, 1,708.989]
llama3.2-1b kv_decode StaticCache reduce-overhead 300 0.800 3.623 (60 ok) [3.612, 3.638]
qwen2.5-1.5b prefill StaticCache eager 300 0.000 21.183 [21.001, 21.369]
qwen2.5-1.5b prefill StaticCache default 300 0.000 7.786 [7.744, 7.837]
qwen2.5-1.5b prefill StaticCache reduce-overhead 300 0.000 4.853 [4.849, 4.857]
qwen2.5-1.5b kv_decode StaticCache eager 300 0.000 2,552.027 [2,148.222, 2,957.139]
qwen2.5-1.5b kv_decode StaticCache default 300 0.000 2,608.198 [2,193.576, 3,019.052]
qwen2.5-1.5b kv_decode StaticCache reduce-overhead 300 0.800 4.984 (60 ok) [4.942, 5.039]

Observations. The 60 ok rows inside every reduce-overhead decode cell are the token_len=1 rows, which exercise the cudagraph tree path only once and therefore survive; the remaining 240 rows (token_len ∈ {8, 32, 128, 512}) crash uniformly. This is the same 240/300 pattern seen in the large-model cells in SS7.

A.2 V4 StaticCache A100 — All Cells (n=300 each)

Model Mode Cache Compile Mode n Crash Rate Mean wall_ms 95% CI
gpt2-100m prefill StaticCache eager 300 0.000 4.861 [4.819, 4.935]
gpt2-100m prefill StaticCache default 300 0.000 2.215 [2.211, 2.219]
gpt2-100m prefill StaticCache reduce-overhead 300 0.000 1.468 [1.446, 1.489]
gpt2-100m kv_decode StaticCache eager 300 0.000 645.614 [544.005, 745.804]
gpt2-100m kv_decode StaticCache default 300 0.000 667.941 [561.709, 772.346]
gpt2-100m kv_decode StaticCache reduce-overhead 300 0.800 1.194 (60 ok) [1.191, 1.196]
llama3.2-1b prefill StaticCache eager 300 0.000 14.265 [14.233, 14.297]
llama3.2-1b prefill StaticCache default 300 0.000 5.754 [5.708, 5.814]
llama3.2-1b prefill StaticCache reduce-overhead 300 0.000 2.928 [2.906, 2.954]
llama3.2-1b kv_decode StaticCache eager 300 0.000 1,745.290 [1,469.715, 2,016.443]
llama3.2-1b kv_decode StaticCache default 300 0.000 1,779.642 [1,496.746, 2,058.371]
llama3.2-1b kv_decode StaticCache reduce-overhead 300 0.800 3.068 (60 ok) [3.034, 3.107]
qwen2.5-1.5b prefill StaticCache eager 300 0.000 24.628 [24.382, 24.886]
qwen2.5-1.5b prefill StaticCache default 300 0.000 10.138 [10.101, 10.189]
qwen2.5-1.5b prefill StaticCache reduce-overhead 300 0.000 4.334 [4.305, 4.369]
qwen2.5-1.5b kv_decode StaticCache eager 300 0.000 3,142.945 [2,647.019, 3,629.762]
qwen2.5-1.5b kv_decode StaticCache default 300 0.000 3,193.650 [2,685.493, 3,692.970]
qwen2.5-1.5b kv_decode StaticCache reduce-overhead 300 0.800 4.645 (60 ok) [4.589, 4.707]

Observations. Every A100 cell mirrors its Ada counterpart in qualitative outcome (stable or 80% crash) while running at higher absolute latency for decode and lower absolute latency for compiled prefill. This is the "portable latency shift" verdict from the combined bundle in concrete form.

A.3 V4 Large-Model Decode Per-Token-Length Crash Pattern

GPU Model Compile Mode token_len=1 token_len=8 token_len=32 token_len=128 token_len=512
Ada llama3.1-8b FP16 reduce-overhead 0/60 crash 60/60 60/60 60/60 60/60
A100 llama3.1-8b FP16 reduce-overhead 0/60 60/60 60/60 60/60 60/60
A100 qwen2.5-7b FP16 reduce-overhead 0/60 60/60 60/60 60/60 60/60
Ada qwen2.5-7b AWQ-4bit reduce-overhead 0/60 0/60 0/60 0/60 0/60

Observations. The per-token decomposition makes the failure boundary precise: reduce-overhead + DynamicCache + dense FP16 decode fails at the first multi-token autoregressive step. The AWQ-4bit companion is the only path that survives every token length.

A.4 V1 Prefill Gains (RTX 6000 Ada, 7 Models)

See Section SS3.1 for the primary table. Full per-cell bootstrap CIs are available in research/tr147/results/20260412_195222/tr147_analysis.json.


Appendix B — Extended Statistical Tables

B.1 Per-Cell Cohen's d — v4 StaticCache Ada (compile vs eager)

Model Mode Compile Mode Δ% Cohen's d
gpt2-100m prefill default −55.48% +28.210
gpt2-100m prefill reduce-overhead −76.31% +54.334
gpt2-100m kv_decode default +3.16% −0.022
gpt2-100m kv_decode reduce-overhead (60 ok) −99.81% +0.764
llama3.2-1b prefill default −61.62% +23.571
llama3.2-1b prefill reduce-overhead −68.66% +28.433
llama3.2-1b kv_decode default +1.72% −0.012
llama3.2-1b kv_decode reduce-overhead (60 ok) −99.75% +0.768
qwen2.5-1.5b prefill default −63.25% +11.633
qwen2.5-1.5b prefill reduce-overhead −77.09% +14.701
qwen2.5-1.5b kv_decode default +2.20% −0.015
qwen2.5-1.5b kv_decode reduce-overhead (60 ok) −99.80% +0.762

Observations. Compiled-prefill d values are extraordinarily large (11–54) because the between-arm distributions are essentially non-overlapping. Compiled default decode d values are near zero (|d| < 0.03), which is what "equivalent to eager" looks like at n=300. The reduce-overhead decode d values are nominally large (+0.76) but are driven by the surviving token_len=1 rows, whose millisecond-scale single-step latencies are being compared against full multi-token eager decodes rather than against an equivalent sample; this is an artifact of the failure pattern, not a speedup.

B.2 Per-Cell Cohen's d — v4 StaticCache A100

Model Mode Compile Mode Δ% Cohen's d
gpt2-100m prefill default −54.44% +6.350
gpt2-100m prefill reduce-overhead −69.80% +7.791
gpt2-100m kv_decode default +3.46% −0.024
gpt2-100m kv_decode reduce-overhead (60 ok) −99.82% +0.768
llama3.2-1b prefill default −59.67% +21.392
llama3.2-1b prefill reduce-overhead −79.48% +43.643
llama3.2-1b kv_decode default +1.97% −0.014
llama3.2-1b kv_decode reduce-overhead (60 ok) −99.82% +0.768
qwen2.5-1.5b prefill default −58.83% +8.793
qwen2.5-1.5b prefill reduce-overhead −82.40% +12.420
qwen2.5-1.5b kv_decode default +1.61% −0.011
qwen2.5-1.5b kv_decode reduce-overhead (60 ok) −99.85% +0.767

Observations. The A100 d values track Ada in direction and ordering but are systematically smaller on the gpt2-100m prefill cell (6.3 vs 28.2), which reflects A100's higher compiled-prefill absolute noise at very short token lengths rather than a weaker compile benefit.

B.3 Per-Cell Cohen's d — v4 Large-Model Cells

GPU Model Mode Compile Mode Δ% Cohen's d
Ada llama3.1-8b FP16 prefill default −18.48% +6.186
Ada llama3.1-8b FP16 prefill reduce-overhead −20.97% +7.116
Ada llama3.1-8b FP16 kv_decode default +0.20% −0.001
Ada llama3.1-8b FP16 kv_decode reduce-overhead (60 ok) −99.35% +0.762
Ada qwen2.5-7b AWQ-4bit prefill default +1.87% −0.639
Ada qwen2.5-7b AWQ-4bit prefill reduce-overhead +2.29% −0.457
Ada qwen2.5-7b AWQ-4bit kv_decode default −0.47% +0.003
Ada qwen2.5-7b AWQ-4bit kv_decode reduce-overhead −0.80% +0.006
A100 llama3.1-8b FP16 prefill default −53.12% +93.446
A100 llama3.1-8b FP16 prefill reduce-overhead −56.87% +88.499
A100 llama3.1-8b FP16 kv_decode default +2.44% −0.017
A100 llama3.1-8b FP16 kv_decode reduce-overhead (60 ok) −99.65% +0.765
A100 qwen2.5-7b FP16 prefill default −50.50% +25.702
A100 qwen2.5-7b FP16 prefill reduce-overhead −54.37% +24.694
A100 qwen2.5-7b FP16 kv_decode default +2.17% −0.015
A100 qwen2.5-7b FP16 kv_decode reduce-overhead (60 ok) −99.65% +0.765

Observations. The A100 llama3.1-8b compiled-prefill d of +93.4 is the largest single effect in TR147 and reflects the fact that eager prefill is 27.7 ms with near-zero run-to-run variance while compiled prefill is 13.0 ms with similarly small variance. The Ada qwen AWQ-4bit row is the only cell in this table where Cohen's d on a decode crash cell is ≈ 0; that is because the AWQ path does not crash.

B.4 Triton Ablation — Cross-Version d (qwen2.5-1.5b, Ada)

Cell d (3.4.0 − 3.3.1) d (3.6.0 − 3.3.1)
prefill DynamicCache eager +0.121 −0.101
prefill DynamicCache default −19.699 −14.415
prefill DynamicCache reduce-overhead −32.837 −22.050
prefill StaticCache eager −1.444 +0.161
prefill StaticCache default −0.763 −25.521
prefill StaticCache reduce-overhead −48.928 −21.368
kv_decode DynamicCache eager −0.001 −0.007
kv_decode DynamicCache default −0.009 +0.006
kv_decode DynamicCache reduce-overhead (surviving) −0.768 −0.763
kv_decode StaticCache default −0.016 −0.011

Observations. Eager-to-eager across Triton versions is statistically equivalent (|d| ≤ 0.15). Compile-to-compile across Triton versions is massively different (|d| ≥ 14 on every compiled-prefill cell). This is the mathematically clean form of the SS8 narrative: Triton version does not matter for eager but matters enormously for compiled paths.

B.5 Holm–Bonferroni-Adjusted p-Values (top 10 contrasts)

For the 96 compile-vs-eager contrasts across v4, Welch's t-test p-values are machine-zero (< 1e-100) for every prefill cell with |d| > 10, and in the range 0.04–0.28 for mode="default" decode cells. Holm–Bonferroni at α=0.05, family size 96, rejects the null for every compiled-prefill cell and fails to reject for mode="default" decode cells — which is the statistical counterpart to "compiled default decode is equivalent to eager." Full table in research/tr147/v4/v4_stats_snapshot.txt.
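A minimal sketch of the sequentially rejective Holm procedure as applied here (Holm 1979); pvals is a hypothetical list holding the 96 per-contrast Welch p-values, and the real adjustment lives in compute_v4_stats.py:

import numpy as np

def holm_bonferroni(pvals, alpha=0.05):
    """Sort p-values ascending, compare the i-th smallest against
    alpha / (m - i), and stop at the first non-rejection."""
    m = len(pvals)
    order = np.argsort(pvals)
    reject = [False] * m
    for rank, idx in enumerate(order):
        if pvals[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break
    return reject  # decisions in the original contrast order

# Family size 96: a machine-zero prefill p survives every step, while a
# 0.04-0.28 default-decode p fails early, matching the B.5 outcome.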

B.6 Per-Model TOST Results — v4A StaticCache mode="default" Decode

Equivalence bounds set at ±3pp on the relative decode overhead. Null hypothesis: compiled decode is either ≥ 3pp faster or ≥ 3pp slower than eager. Rejection of both one-sided tests at α=0.05 declares equivalence within ±3pp.

GPU Model Observed Δ% Lower TOST p Upper TOST p Equivalent at ±3pp?
Ada gpt2-100m +3.16% ≈ 0 0.43 No (exceeds upper bound)
Ada llama3.2-1b +1.72% ≈ 0 ≈ 0 Yes
Ada qwen2.5-1.5b +2.20% ≈ 0 0.01 Yes (marginal)
A100 gpt2-100m +3.46% ≈ 0 0.61 No (exceeds upper bound)
A100 llama3.2-1b +1.97% ≈ 0 ≈ 0 Yes
A100 qwen2.5-1.5b +1.61% ≈ 0 ≈ 0 Yes

Observations. Four of six (model × GPU) cells are statistically equivalent to eager within ±3pp under mode="default" decode; the two gpt2-100m cells cross the upper equivalence bound because the base decode latency is small enough (≈ 0.5–0.7 seconds) that a 3% relative shift is easily measurable. The practical conclusion for a paper: "compiled decode under StaticCache + default is at best equivalent to eager and in some small-model cells is measurably slower." This is a narrower claim than "compiled decode gives no speedup" but it is the statistically defensible form.
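A minimal sketch of the B.6 procedure: two one-sided Welch tests against the ±3pp relative bounds. The report's exact implementation is in compute_v4_stats.py and may differ in detail; eager and compiled are hypothetical per-repetition wall_ms arrays:

import numpy as np
from scipy import stats

def tost_relative(eager: np.ndarray, compiled: np.ndarray,
                  margin: float = 0.03, alpha: float = 0.05):
    """H0_upper: compiled decode is >= 3pp slower than eager.
    H0_lower: compiled decode is >= 3pp faster than eager.
    Rejecting both one-sided Welch tests declares equivalence."""
    upper = stats.ttest_ind(compiled, eager * (1 + margin),
                            equal_var=False, alternative="less")
    lower = stats.ttest_ind(compiled, eager * (1 - margin),
                            equal_var=False, alternative="greater")
    equivalent = upper.pvalue < alpha and lower.pvalue < alpha
    return upper.pvalue, lower.pvalue, equivalent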

B.7 Minimum Detectable Effect

At α=0.05, power=0.80, two-sample t with equal n, MDE in Cohen's d units is ≈ 0.23 at n=300 per arm and ≈ 0.52 at n=60 per arm. Every cell with n=300 per arm in this report is powered to detect a small effect; cells with n=60 (per-token-length breakdowns in Appendix A.3) are powered to detect a medium effect. The observed d values are either |d| < 0.03 (clearly below MDE — true nulls) or |d| > 5 (catastrophically above MDE — clear effects). There are essentially no contrasts in the "uncertain" middle range.
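These MDE figures can be reproduced with a standard power solver; a sketch using statsmodels (not necessarily the tool used for the report's own computation):

from statsmodels.stats.power import TTestIndPower

# Solve for the effect size (Cohen's d) given alpha, power, and per-arm n.
solver = TTestIndPower()
for n in (300, 60):
    d = solver.solve_power(effect_size=None, nobs1=n, alpha=0.05,
                           power=0.80, ratio=1.0)
    print(f"n={n} per arm -> MDE d ~= {d:.2f}")  # ~0.23 and ~0.52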


Appendix C — Sensitivity and Robustness Checks

C.1 Bootstrap Seed Sensitivity

Re-running compute_v4_stats.py with seeds ∈ {1, 42, 100, 2024} changes per-cell bootstrap CI bounds by < 0.5% of the mean on every prefill cell and by < 2% on the high-variance decode cells (those with wide CIs like gpt2-100m kv_decode eager). No conclusion in SS6–SS8 changes under seed perturbation.
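A minimal sketch of the percentile bootstrap being perturbed here (1,000 iterations, percentile method, seeded RNG, per the Appendix D definition); x is a hypothetical per-cell wall_ms array:

import numpy as np

def percentile_ci(x, iters=1000, seed=42):
    """Percentile bootstrap 95% CI for the cell mean."""
    rng = np.random.default_rng(seed)
    means = [rng.choice(x, size=len(x), replace=True).mean()
             for _ in range(iters)]
    return np.percentile(means, [2.5, 97.5])

# The C.1 check reruns this with seed in {1, 42, 100, 2024} and compares bounds.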

C.2 Warm-Up Sensitivity

Doubling warm-up from 3 to 6 untimed iterations on the v4 StaticCache Ada cell shifts the compiled-prefill mean by < 0.05 ms (< 1% of the smallest prefill cell). No direction reversal is observed. This addresses the "maybe compile hasn't warmed up" objection.

C.3 Token-Length Stratification

All Appendix A.3 results are reported per token-length. The 240/300 pattern decomposes into "0/60 at token_len=1, 60/60 at token_len > 1" on every reduce-overhead decode cell. This pattern is invariant across GPU, model, Triton 3.3.1, and cache type for the dense paths.
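A sketch of how this stratification can be recomputed from any v4 JSONL file; the row field names ("mode", "compile_mode", "token_len", "status") are assumptions about the schema, for illustration only:

import json
from collections import Counter

def crash_by_token_len(path):
    """Per-token-length crash counts for a reduce-overhead decode cell,
    reproducing the '0/60 at token_len=1, 60/60 above' pattern."""
    crashes, totals = Counter(), Counter()
    with open(path) as f:
        for line in f:
            row = json.loads(line)
            if (row.get("mode") == "kv_decode"
                    and row.get("compile_mode") == "reduce-overhead"):
                totals[row["token_len"]] += 1
                if row.get("status") != "ok":
                    crashes[row["token_len"]] += 1
    return {t: f"{crashes[t]}/{totals[t]}" for t in sorted(totals)}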

C.4 AWQ Companion Sanity

The Ada qwen2.5-7b AWQ-4bit neutrality result has been cross-checked by running a matched eager-vs-eager baseline across 3,600 rows (not tabulated here), which shows no spurious compile-mode correlations in the AWQ kernel path. The non-result is therefore a real non-result, not a harness bug.

C.5 Baseline Stability Across Stages

Across v1 (2026-04-12), v2 (2026-04-13), and v4 (2026-04-16), the eager qwen2.5-1.5b prefill mean on Ada is 20.493 / ≈ 20.6 / 21.183 ms respectively (spread ≈ 3%). Eager baselines are stable across three separate runs spaced four days apart, which eliminates "the machine was in a different thermal/driver state" as a confound for the compile-path observations.

C.6 A100 Eager Reliability

The A100 eager arms (v3 and v4) on qwen2.5-1.5b prefill read 24.609 ms (v3 single short) and 24.628 ms (v4 StaticCache eager) — the 0.08% agreement between two independent runs on different dates establishes that the A100 measurement harness is equally stable and is not the source of the compiled-path divergence.

C.7 Triton Verification Field

Every v4 Triton ablation row carries a triton_verified boolean. In this dataset the field is false for all rows because the harness records triton.__version__ at import time; the field will be true when the harness is upgraded to hash and compare the on-disk Triton DLLs. This is a known harness limitation; it does not affect the SS8 conclusions because the runs are manifestly different (80% vs 0% crash; 62% vs 0.8% prefill gain).
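A sketch of the proposed upgrade: fingerprint the installed Triton package on disk instead of trusting the import-time version string. The per-version expected digest would live in the run manifest; none of this is current harness code:

import hashlib
import pathlib
import triton

def triton_fingerprint() -> str:
    """Hash the installed Triton package's Python and shared-object
    files; a row's triton_verified flag would compare this digest
    against the expected digest recorded in the run manifest."""
    pkg_dir = pathlib.Path(triton.__file__).parent
    digest = hashlib.sha256()
    for p in sorted(pkg_dir.rglob("*.py")) + sorted(pkg_dir.rglob("*.so")):
        digest.update(p.read_bytes())
    return digest.hexdigest()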


Appendix D — Glossary

Term Definition
AWQ Activation-aware Weight Quantization; a 4-bit post-training quantization method used for the Ada qwen2.5-7b companion lane
Backend The runtime entrypoint for inference; in TR147, transformers-gpu = eager transformers; transformers-gpu-compile = torch.compile wrapped transformers
Bootstrap CI Nonparametric confidence interval constructed by resampling with replacement; here 1,000 iterations, percentile method, seed 42
Cohen's d Standardized mean difference, (μ_A − μ_B) / s_pooled; conventional thresholds are 0.2 (small), 0.5 (medium), 0.8 (large)
cudagraph tree PyTorch's CUDA-graph caching mechanism for torch.compile; the data structure central to the compiled-decode failure
DynamicCache HuggingFace Transformers default KV cache type; variable-size; the fragile object under reduce-overhead compile
Eager torch.compile-free execution path; the ground-truth baseline in this study
Holm–Bonferroni Sequentially rejective multiple-comparison correction; less conservative than Bonferroni while preserving family-wise error rate
MDE Minimum Detectable Effect; the smallest Cohen's d detectable at α=0.05 and power=0.80 given n per arm
Phase separation Reporting prefill and decode as separate benchmark objects rather than aggregated; TR126's core methodological rule
Prefill The first forward pass over the full prompt; produces the initial KV cache
reduce-overhead torch.compile mode that aggressively uses CUDA graphs; the failure-prone mode in TR147
StaticCache HuggingFace Transformers fixed-size KV cache; the cache object under which compiled decode is stable
TOST Two one-sided tests; the statistical equivalence testing procedure used to test "compiled decode is not faster than eager" at a ±3pp margin
Triton OpenAI Triton, the GPU-kernel-authoring language used by PyTorch Inductor; versions 3.3.1 / 3.4.0 / 3.6.0 are the axis of SS8
Welch's t Two-sample t-test that does not assume equal variance; used for compile-vs-eager contrasts where sample variances differ
Compute capability NVIDIA's sm_XX identifier for a GPU architecture; Ada Lovelace is sm_89, Ampere A100 is sm_80
cudagraph A captured sequence of CUDA operations replayable with low per-call overhead; used internally by torch.compile + reduce-overhead
mode="default" torch.compile compilation with standard Inductor pipeline; does not aggressively use CUDA graphs
mode="reduce-overhead" torch.compile mode that uses CUDA graphs to reduce per-call launch overhead; sensitive to memory-aliasing assumptions
transformers-gpu TR147's eager backend identifier; ordinary HuggingFace transformers on CUDA
transformers-gpu-compile TR147's compiled backend identifier; HF transformers wrapped in torch.compile
Welch's one-sided One-sided variant of Welch's t used inside TOST
wall_ms Wall-clock time in milliseconds; primary latency column in v4 JSONL files
Fall-through A compiled-decode row that returns sub-millisecond without producing output tokens; the v1 false-success class that v2 eliminated
Portable qualitative outcome Same ok vs crash classification across two regimes, irrespective of absolute latency

Appendix E — Configs and Commands

E.1 v4 StaticCache Config (abbreviated)

# research/tr147/v4/configs/static_cache.yaml (abbreviated)
gpu: [ada, a100]
models:
  - gpt2-100m
  - llama3.2-1b
  - qwen2.5-1.5b
cache_types: [DynamicCache, StaticCache]
compile_modes: [eager, default, reduce-overhead]
token_lengths: [1, 8, 32, 128, 512]
iterations: 60
warmup: 3
seed: 42

E.2 v4 Triton Ablation Config (abbreviated)

# research/tr147/v4/configs/triton_ablation.yaml (abbreviated)
gpu: ada
model: Qwen/Qwen2.5-1.5B
precision: fp16
triton_versions: ["3.3.1", "3.4.0", "3.6.0"]
cache_types: [DynamicCache, StaticCache]
compile_modes: [eager, default, reduce-overhead]
token_lengths: [1, 8, 32, 128, 512]
iterations: 60
warmup: 3
seed: 42

E.3 v4 Large-Model Config (abbreviated)

# research/tr147/v4/configs/large_model.yaml (abbreviated)
# Ada companion uses AWQ for qwen; A100 uses dense FP16 for qwen
ada:
  models: [llama3.1-8b:fp16, qwen2.5-7b:awq-4bit]
a100:
  models: [llama3.1-8b:fp16, qwen2.5-7b:fp16]
compile_modes: [eager, default, reduce-overhead]
token_lengths: [1, 8, 32, 128, 512]
iterations: 60
warmup: 3
seed: 42

E.4 Environment Manifest (Illustrative — actual values pinned per run)

# Ada host
hardware: RTX 6000 Ada
compute_capability: sm_89
vram_gb: 48
driver: "550.xx"
cuda: "12.4"
pytorch: "2.6.x"   # version matrix: one build per Triton wheel
triton: ["3.3.1", "3.4.0", "3.6.0"]  # for v4C; one of these per run
os: Linux

# A100 host
hardware: A100-SXM4-80GB (v4), A100-SXM4-40GB (v3)
compute_capability: sm_80
vram_gb: 80 | 40
driver: "550.xx"
cuda: "12.4"
pytorch: "2.6.x"
triton: "3.4.0"
os: Linux

E.5 Exact Statistics Recomputation Command

cd C:/Users/sahil/OneDrive/Documents/GitHub/Banterhearts
python research/tr147/compute_v4_stats.py > research/tr147/v4/v4_stats_snapshot.txt
# Compare against this report's numeric claims.

End of Technical Report 147 v4.1.