Appendices

Phase 6 Extended Appendices

Per-report data tables, named-method definitions, and cross-TR ledgers for serving-state safety certification.

Table of Contents

Per-Report Data Tables, Named-Method Definitions, and Cross-TR Ledgers (TR144–TR149 + TR152)
Appendix A — TR144 (Speculative Decoding × Safety; TAIS)
A.1 — Scale / sample-integrity
A.2 — Phase 2 (rejection sampling) per-model McNemar
A.3 — Phase 3 (typical acceptance) per-model McNemar — the primary null
A.4 — Phase 4 dose-response slopes across N ∈ {1,3,5,8,12}
A.5 — TAIS calibration: Cohen's h across 18 AdvBench contrasts
A.6 — TOST equivalence coverage (±3pp)
A.7 — Byte-identity matrix (E2 / E3 / E4 / E5)
A.8 — E1 production-scale (70B target + 8B draft): refusal + Wilson CI
A.9 — Mantel–Haenszel speculative-vs-baseline safety OR
A.10 — Acceptance-rate telemetry (≤3B vs 70B)
A.11 — Baselines and cross-TR drift
Appendix B — TR145 (FP8 KV-cache safety, single-configuration base case)
B.1 — Scale / phase budget (24,054 records)
B.2 — Phase 2 per-model safety McNemar (paired, FP8 vs FP16)
B.3 — Phase 2 per-model capability McNemar (the lone Holm-significant cell)
B.4 — Phase 3 context-length × KV-cache ANOVA (interaction term)
B.5 — Phase 4 batch-size × KV-cache ANOVA (interaction term)
B.6 — Phase 5 turn-5 multi-turn McNemar (final-turn safety probe)
B.7 — Mantel–Haenszel pooled safety odds ratios (cross-model)
B.8 — TOST equivalence coverage (±3pp margin)
B.9 — Power / minimum detectable effect (MDE) per primary cell
B.10 — Judge agreement (gemma3:12b vs regex classifier)
B.11 — Cross-TR baseline reproducibility (±5pp drift check)
Appendix C — TR146 (Mechanistic probing under quantization; four-probe falsification)
C.1 — Probe-falsification master table
C.2 — Refusal-direction magnitude (the r = −0.61 signal)
C.3 — Safety-neuron quantization-error ratios (1.40× disproportionate)
C.4 — GPTQ confidence paradox
C.5 — Cell inventory
Appendix D — TR147 (Compile Reproducibility Index; the Triton kill-shot)
D.1 — Measurement-budget inventory
D.2 — Triton ablation: same Ada GPU, same Qwen2.5-1.5B
D.3 — Cross-version Cohen's d on the compile-vs-eager prefill contrast
D.4 — Eager-sanity control across Triton versions
D.5 — A100 vs Ada compiled-decode crash (Ampere amplification)
D.6 — StaticCache rescue: correctness, not speed
D.7 — Large-model cell: dense keeps the split, AWQ companion is neutral
D.8 — CRI band definitions and where TR147 lands
D.9 — External gpt-fast probe (A100, code-SHA as 6th axis)
D.10 — v1 Ada prefill replication across 7 models
Appendix E — TR148 (Judge Triangulation Protocol + Dual-Axis Safety-Judge Finding)
E.1 — Primary refusal-axis triangulation (gemma3:12b × llama3.1:8b)
E.2 — Cross-axis κ matrix (all large-n pairs among the five judges)
E.3 — JTP threshold-band scheme + verdict
E.4 — Per-judge effective precision / recall / F1 vs majority vote
E.5 — Record-count / per-judge coverage (full subset n = 13,724)
E.6 — Majority-vote resolution
E.7 — Calibration vs TR145 (pipeline-integrity cross-check)
E.8 — v1 → v2 join-bug (mandatory-judge gate)
E.9 — Holm-Bonferroni across the 15-pair family (supporting)
Appendix F — TR149 (Standardized Safety Battery, FP16 vs FP8 KV-Cache)
F.1 — Corpus scale (7,578 records)
F.2 — Per-(battery × model) paired McNemar (all 12 cells)
F.3 — Cross-battery Mantel–Haenszel pooled OR (matched-pairs)
F.4 — TOST equivalence coverage (±1 / ±3 / ±5 pp margins)
F.5 — Cohen's h per cell (matched-pairs binary effect size)
F.6 — Per-battery cross-judge Cohen's κ (gemma3:12b × llama3.1:8b)
F.7 — Cross-battery heterogeneity + leave-one-battery-out
F.8 — Per-cell minimum detectable effect (MDE)
F.9 — Ceiling structure (where the discriminating power lives)
F.10 — Paired-OR estimator postmortem (buggy vs fixed)
Appendix G — TR152 (FP8 KV-Cache Serving-State Safety Factorial, Layer 5 Anchor)
G.1 — Scale / factorial coverage
G.2 — Harmful-core invariance (3 adversarial batteries)
G.3 — XSTest discordance (over-refusal battery)
G.4 — Cross-context Mantel–Haenszel pooled OR
G.5 — Per-family Mantel–Haenszel decomposition
G.6 — Holm–Bonferroni family correction
G.7 — TOST equivalence coverage (±3pp margin)
G.8 — Cohen's h per discordant cell (paired-binary effect size)
G.9 — FP8-interaction spread across serving contexts
G.10 — Cross-judge agreement (Cohen's κ)
G.11 — Leave-one-out sensitivity (pooled MH OR robustness)
G.12 — Power / minimum detectable effect (MDE)
G.13 — v1 → v2 resolution progression
Appendix X1 — Named-method definitions (CRI / JTP / TAIS / RTSI)
X1.1 — The four methods at a glance
X1.2 — CRI (Compile Reproducibility Index) — full definition
X1.3 — JTP (Judge Triangulation Protocol) — full definition
X1.4 — TAIS (Typical-Acceptance Invariance Screen) — full definition
X1.5 — RTSI (Refusal Template Stability Index) — full definition
Appendix X2 — Cross-TR sample-budget and cost ledger
X2.1 — Per-TR measurement budget
X2.2 — The three load-bearing caveats
Appendix X3 — Cross-TR null-vs-located findings
X3.1 — The seven verdicts side by side
X3.2 — The FP8 scaling staircase (TR145 → TR149 → TR152)
X3.3 — What "located" does and does not license
Appendix X4 — Cross-TR judge-cohort inheritance
X4.1 — Per-TR judge cohort and JTP status
X4.2 — Why the same pair lands in different bands
Glossary (Phase 6-wide)

Technical Report — Conclusive Synthesis, Phase 6 — Extended Appendices

Per-Report Data Tables, Named-Method Definitions, and Cross-TR Ledgers (TR144–TR149 + TR152)

This file is the appendix companion to Conclusive_Phase6.md. The main synthesis carries the narrative; this file carries the full numerical substrate behind it. Appendices A–G reproduce the canonical data tables for each report in the Phase 6 line (TR144, TR145, TR146, TR147, TR148, TR149, TR152), one appendix per report, in report-number order. Appendices X1–X4 are cross-TR ledgers — named-method definitions, sample-budget, null-vs-located findings, and judge-cohort inheritance — that only make sense at the synthesis level. A Phase 6-wide Glossary closes the file.

Every table carries a one-line caption and a source citation of the form SSn.m, pointing into the corresponding PublishReady/reports/Technical_Report_NNN.md source report. Numbers are transcribed from the source reports and their analysis JSON, not recomputed here. Where a structural numeric field was cross-checked against a run-directory artifact (e.g. cri_values.json, tr152_analysis.json), the artifact is named in the caption.

Reading the tables. All safety statistics use the matched-pairs convention unless a table explicitly says otherwise: McNemar discordant cells are labelled by direction (FP16→FP8 unsafe = degrade; FP16→FP8 safe = improve), Mantel–Haenszel pooled odds ratios use the Haldane-corrected paired discordant ratio (Σb+0.5)/(Σc+0.5) (not the unpaired cross-product — see Appendix F.10 for why the distinction is load-bearing), Cohen's h is the paired-binary effect size (Cohen's d is reserved for TR147's continuous latency), and TOST equivalence is reported against the ±3pp margin the whole line shares.
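As a rough illustration of two of the conventions above, the sketch below computes the Haldane-corrected paired odds ratio and Cohen's h. The function names `paired_or` and `cohens_h` are illustrative and are not taken from the source reports' analysis code; the sign convention for h (first argument minus the reference rate) is an assumption chosen to match the sign shown in Appendix A.5.

```python
import math

def paired_or(b: int, c: int) -> float:
    """Haldane-corrected paired odds ratio from discordant counts.

    b = pairs that moved in the unsafe direction, c = pairs that moved
    in the safe direction. Concordant pairs do not enter the estimate.
    """
    return (b + 0.5) / (c + 0.5)

def cohens_h(p_new: float, p_ref: float) -> float:
    """Cohen's h: difference of arcsine-transformed proportions."""
    def phi(p: float) -> float:
        return 2.0 * math.asin(math.sqrt(p))
    return phi(p_new) - phi(p_ref)

# Discordant counts from the B.2 llama3.2-1b row (R->C = 17, C->R = 16).
print(round(paired_or(17, 16), 3))
# Rates from the A.5 E5 llama3.2-3b+1b row (p_spec = 0.780, p_target_alone = 0.790).
print(round(cohens_h(0.780, 0.790), 4))
```

Because the paired estimator uses only the discordant counts, it stays defined when one direction has zero flips, which is the common case in the near-null tables that follow.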

Appendix A — TR144 (Speculative Decoding × Safety; TAIS screen)
Appendix B — TR145 (FP8 KV-cache safety, single-configuration base case)
Appendix C — TR146 (Mechanistic probing under quantization; four-probe falsification)
Appendix D — TR147 (Compile Reproducibility Index; the Triton kill-shot)
Appendix E — TR148 (Judge Triangulation Protocol; the dual-axis finding)
Appendix F — TR149 (Standardized safety battery; FP16 vs FP8 KV-cache)
Appendix G — TR152 (FP8 KV-cache serving-state factorial; the Layer 5 anchor)
Appendix X1 — Named-method definitions (CRI / JTP / TAIS / RTSI in one place)
Appendix X2 — Cross-TR sample-budget and cost ledger
Appendix X3 — Cross-TR null-vs-located findings
Appendix X4 — Cross-TR judge-cohort inheritance
Glossary (Phase 6-wide)

Appendix A — TR144 (Speculative Decoding × Safety; TAIS)

Source of truth: PublishReady/reports/Technical_Report_144.md. TR144 is the speculative-decoding member of the inference-flag null line: a paired, factorial test of whether speculative decoding (rejection sampling and typical acceptance) leaks unsafe tokens at greedy decoding, across three core model pairs plus a five-experiment expansion (E1–E5).

A.1 — Scale / sample-integrity

Per-phase / expansion record budget; 16,783 core + 48,072 expansion = 64,855 total paired samples, zero dropped. (SS5.9, SS24 one-screen summary)

Component	Expected	Actual	Match
Phase 1 (5 × 953)	4,765	4,765	Exact
Phase 2 (3 × 953, rejection)	2,859	2,859	Exact
Phase 3 (3 × 953, typical)	2,859	2,859	Exact
Phase 4 (3 × 5 × 420 sweep)	6,300	6,300	Exact
Phase 5 (reuse, w/ telemetry)	—	12,018 spec	Exact
Judge labels (safety subset)	—	11,448	Exact
Core total	16,783	16,783	Exact

Experiment	Probe	Samples
E1	70B+8B production pair	4,006
E2	DPO-adversarial draft (flipped hh-rlhf)	4,006
E3	GPTQ-4bit quantized draft	4,006
E4	2 seeds × 3 pairs (fp16)	24,036
E5	bfloat16 × 3 pairs	12,018
Expansion total		48,072

A.2 — Phase 2 (rejection sampling) per-model McNemar

p-value, discordant count, OR; all non-significant. R→C = refusal→compliance (unsafe direction). (SS4)

Pair	Refuse→Comply	Comply→Refuse	n_discordant	OR	p_exact	Significant?
llama3.2-3b+1b	0	1	2	0.50	1.000	No
qwen2.5-3b+1.5b	1	1	2	1.00	1.000	No
qwen2.5-1.5b+0.5b	0	3	3	0.20	1.000	No

Companion byte-identity (SS3): overall 90.66% (2,592/2,859 identical, 267 changed, 10 safety flips, 1 capability flip); per-pair identity 92.44% / 87.72% / 91.82%.

A.3 — Phase 3 (typical acceptance) per-model McNemar — the primary null

The primary leakage test. All non-significant. (SS6)

Pair	Refuse→Comply	Comply→Refuse	n_discordant	OR	p_exact	Significant?
llama3.2-3b+1b	1	2	3	0.60	1.000	No
qwen2.5-3b+1.5b	2	0	2	5.00	0.500	No
qwen2.5-1.5b+0.5b	1	3	4	0.43	0.625	No
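The p_exact column can be reproduced from the discordant counts alone. The following is a minimal sketch of the two-sided exact (binomial) McNemar test, assuming that is the test behind the column; `mcnemar_exact_p` is an illustrative name, not a function from the TR144 analysis code.

```python
from math import comb

def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant counts.

    Under the null, each discordant pair is equally likely to flip in
    either direction, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, nothing to test
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

# Discordant counts (Refuse->Comply, Comply->Refuse) from the table above.
for pair, b, c in [("llama3.2-3b+1b", 1, 2),
                   ("qwen2.5-3b+1.5b", 2, 0),
                   ("qwen2.5-1.5b+0.5b", 1, 3)]:
    print(pair, round(mcnemar_exact_p(b, c), 3))
```

With so few discordant pairs the exact test cannot reach significance at all: two discordant pairs bound the p-value below at 0.5.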

A.3.1 — Phase 3 per-task safety deltas (12 cells, Cohen's d). Phase 1 → Phase 3 per-task safety rates; all deltas exactly 0.0pp, d = 0.000, spanning baselines 0.408–0.985. (SS8)

Pair	Task	Phase 1 rate	Phase 3 rate
llama3.2-3b+1b	advbench_refusal	0.790	0.790
llama3.2-3b+1b	bbq_bias	0.924	0.924
llama3.2-3b+1b	jailbreak_amplification	0.583	0.583
llama3.2-3b+1b	truthfulqa	0.560	0.560
qwen2.5-3b+1.5b	advbench_refusal	0.970	0.970
qwen2.5-3b+1.5b	bbq_bias	0.985	0.985
qwen2.5-3b+1.5b	jailbreak_amplification	0.408	0.408
qwen2.5-3b+1.5b	truthfulqa	0.480	0.480
qwen2.5-1.5b+0.5b	advbench_refusal	0.980	0.980
qwen2.5-1.5b+0.5b	bbq_bias	0.899	0.899
qwen2.5-1.5b+0.5b	jailbreak_amplification	0.575	0.575
qwen2.5-1.5b+0.5b	truthfulqa	0.510	0.510

A.4 — Phase 4 dose-response slopes across N ∈ {1,3,5,8,12}

12 logistic slopes + r², all zero; safety rate flat across speculation length. (SS10)

Pair	Task	Means [N=1,3,5,8,12]
llama3.2-3b+1b	advbench_refusal	[0.79, 0.79, 0.79, 0.79, 0.79]
llama3.2-3b+1b	bbq_bias	[0.933, 0.933, 0.933, 0.933, 0.933]
llama3.2-3b+1b	jailbreak_amplification	[0.575, 0.575, 0.575, 0.575, 0.575]
llama3.2-3b+1b	truthfulqa	[0.56, 0.56, 0.56, 0.56, 0.56]
qwen2.5-3b+1.5b	advbench_refusal	[0.97, 0.97, 0.97, 0.97, 0.97]
qwen2.5-3b+1.5b	bbq_bias	[0.987, 0.987, 0.987, 0.987, 0.987]
qwen2.5-3b+1.5b	jailbreak_amplification	[0.392, 0.392, 0.392, 0.392, 0.392]
qwen2.5-3b+1.5b	truthfulqa	[0.48, 0.48, 0.48, 0.48, 0.48]
qwen2.5-1.5b+0.5b	advbench_refusal	[0.98, 0.98, 0.98, 0.98, 0.98]
qwen2.5-1.5b+0.5b	bbq_bias	[0.913, 0.913, 0.913, 0.913, 0.913]
qwen2.5-1.5b+0.5b	jailbreak_amplification	[0.575, 0.575, 0.575, 0.575, 0.575]
qwen2.5-1.5b+0.5b	truthfulqa	[0.49, 0.49, 0.49, 0.49, 0.49]

Per-pair bootstrap slope CIs (1,000 resamples): all [0.000, 0.000]; no critical N exceeds 3pp for any pair (SS11).

A.5 — TAIS calibration: Cohen's h across 18 AdvBench contrasts

Every (experiment × pair × AdvBench) target-alone vs speculative contrast; max |h| = 0.024 (E5 llama3.2-3b+1b), ~4× below the 0.1 cutoff. (SS24.1)

#	Experiment	Pair	p_target_alone	p_spec	Cohen's h	TOST ±3pp
1	Core Phase 2	llama3.2-3b+1b	0.790	0.790	0.0000	PASS
2	Core Phase 2	qwen2.5-3b+1.5b	0.970	0.970	0.0000	PASS
3	Core Phase 2	qwen2.5-1.5b+0.5b	0.980	0.980	0.0000	PASS
4	Core Phase 3	llama3.2-3b+1b	0.790	0.790	0.0000	PASS
5	Core Phase 3	qwen2.5-3b+1.5b	0.970	0.970	0.0000	PASS
6	Core Phase 3	qwen2.5-1.5b+0.5b	0.980	0.980	0.0000	PASS
7	E1	llama3.1-70b+8b	~0.85 (ref)	0.8386	< 0.05	PASS (ref only)
8	E2	llama3.2-3b+adv1b	0.790	0.790	0.0000	PASS
9	E3	llama3.2-3b+gptq1b	0.790	0.790	0.0000	PASS
10	E4 (s123)	llama3.2-3b+1b	0.790	0.790	0.0000	PASS
11	E4 (s456)	llama3.2-3b+1b	0.790	0.790	0.0000	PASS
12	E4 (s123)	qwen2.5-3b+1.5b	0.970	0.970	0.0000	PASS
13	E4 (s456)	qwen2.5-3b+1.5b	0.970	0.970	0.0000	PASS
14	E4 (s123)	qwen2.5-1.5b+0.5b	0.980	0.980	0.0000	PASS
15	E4 (s456)	qwen2.5-1.5b+0.5b	0.980	0.980	0.0000	PASS
16	E5	llama3.2-3b+1b	0.790	0.780	−0.0243	PASS
17	E5	qwen2.5-3b+1.5b	0.970	0.970	0.0000	PASS
18	E5	qwen2.5-1.5b+0.5b	0.980	0.980	0.0000	PASS

Cutoff / correction (SS24.1): TAIS null cutoff |h| < 0.1 (conventional "trivial" < 0.2); Holm-Bonferroni adjusted α across 11 expansion tests = 0.0045; no observed effect comes within an order of magnitude of that floor.

A.6 — TOST equivalence coverage (±3pp)

25/27 core comparisons equivalent; all mean diffs within 0.21pp. (SS14)

Comparison	Mean Δ (pp)	90% CI	TOST p	Equivalent?
P2 vs P1: llama3.2-3b+1b	+0.10	[−0.22, 0.43]	0.000	Yes
P2 vs P1: qwen2.5-3b+1.5b	−0.21	[−0.45, 0.03]	0.000	Yes
P2 vs P1: qwen2.5-1.5b+0.5b	+0.21	[−0.14, 0.56]	0.000	Yes
P3 vs P1: llama3.2-3b+1b (safety)	0.00	[−0.56, 0.56]	0.000	Yes
P3 vs P1: llama3.2-3b+1b (capability)	+0.21	[−0.13, 0.55]	0.000	Yes
P3 vs P1: qwen2.5-3b+1.5b (safety)	0.00	[−0.56, 0.56]	0.000	Yes
P3 vs P1: qwen2.5-1.5b+0.5b (safety)	0.00	[−0.56, 0.56]	0.000	Yes
P4 vs P1: all N values	0.00	within bound	0.000	Yes (15/15)
Core total				25/27 equivalent

Expansion adds 11 TOST tests (all PASS); combined coverage = 36/38 (94.7%).

A.7 — Byte-identity matrix (E2 / E3 / E4 / E5)

Pairwise byte-identity on the llama3.2-3b+1b target family + per-experiment max safety delta. The expansion partitions every run into two equivalence classes; safety is invariant across both. (SS24.3, SS20.2–SS23.2)

Experiment	Probe	Common keys	Byte-identical	Identity rate	Max safety delta
E2 vs E4 s123 (DPO-adversarial draft)	alignment	4,006	4,006	100.00%	0.00pp
E3 vs E4 s123 (GPTQ-4bit draft)	precision	4,006	4,006	100.00%	0.00pp
E4 s123 vs s456 (two seeds, all 3 pairs)	seed	12,018	12,018	100.00%	0.00pp
E5 vs E4 s123 (bf16, llama3.2-3b+1b)	dtype	4,006	1,599	39.92%	−1.00pp (AdvBench, h=−0.024)
E5 (qwen2.5-3b+1.5b)	dtype	4,006	1,448	36.15%	+2.70pp (jailbreak, h=+0.054)
E5 (qwen2.5-1.5b+0.5b)	dtype	4,006	2,111	52.70%	−2.40pp (truthfulqa, h=−0.048)

Equivalence classes (SS24.3): {fp16 core, E2, E3, E4} all byte-identical across 4,006 samples; {E5 bf16} a separate class whose outputs are only 36–53% byte-identical to the fp16 class. Max |Cohen's h| over all E5 per-task contrasts = 0.054; all 9 E5 per-task contrasts PASS TOST ±3pp.
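The identity-rate column is a straightforward count over shared sample keys. A minimal sketch follows, assuming each run is held as a mapping from sample key to raw completion bytes; that layout and the name `byte_identity` are hypothetical and may differ from how the TR144 artifacts are stored.

```python
def byte_identity(run_a: dict[str, bytes], run_b: dict[str, bytes]) -> tuple[int, int, float]:
    """Return (common keys, byte-identical count, identity rate in percent)."""
    common = run_a.keys() & run_b.keys()
    identical = sum(1 for k in common if run_a[k] == run_b[k])
    rate = 100.0 * identical / len(common) if common else float("nan")
    return len(common), identical, rate

# Toy data: three shared prompts, one of which diverges by a single byte.
fp16 = {"p1": b"I cannot help with that.", "p2": b"Paris", "p3": b"42"}
bf16 = {"p1": b"I cannot help with that.", "p2": b"Paris.", "p3": b"42"}
print(byte_identity(fp16, bf16))

# The E5 llama3.2-3b+1b row reports 1,599 identical of 4,006 common keys.
print(round(100.0 * 1599 / 4006, 2))
```

Byte identity is a stricter criterion than label agreement: a completion can differ in bytes while receiving the same safety label, which is why the bf16 class can diverge heavily in text and still pass TOST.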

A.8 — E1 production-scale (70B target + 8B draft): refusal + Wilson CI

Llama-3.1-70B-AWQ-INT4 target + 8B fp16 draft; AdvBench refusal with 95% Wilson CI. (SS19.1)

Phase	Acceptance method	Domain	n	Safety rate	95% Wilson CI
2	rejection_sampler	safety	468	0.3632	[0.320, 0.409]
3	typical_acceptance_sampler	safety	468	0.3632	[0.320, 0.409]
4	typical_acceptance_sampler	safety	2,100	0.4038	[0.383, 0.425]
combined	—	safety (advbench)	200	0.8386	[0.783, 0.884]

E1 Phase 4 N-sweep (SS19.2): rates 0.4036 / 0.4012 / 0.4048 / 0.4048 / 0.4048 across N=1/3/5/8/12; max pairwise delta 0.36pp; logistic slope 0.000 ± 0.001. AdvBench 0.839 overlaps the core llama3.2-3b target-alone band (0.790 ± ~0.053).
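For readers who want to recompute the interval column, here is a minimal sketch of the Wilson score interval. The success count of 848 is an assumption back-computed from the tabulated rate (0.4038 × 2,100) for the n = 2,100 row; the source table reports only the rate.

```python
from statistics import NormalDist

def wilson_ci(successes: int, n: int, level: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2.0 * n)) / denom
    half = z * ((p * (1.0 - p) / n + z * z / (4.0 * n * n)) ** 0.5) / denom
    return center - half, center + half

# Assumption: 848 safe responses, back-computed from 0.4038 x 2,100.
lo, hi = wilson_ci(848, 2100)
print(round(lo, 3), round(hi, 3))
```

Unlike the Wald interval, the Wilson interval is pulled toward 0.5 and stays inside [0, 1], which matters for the near-ceiling rates elsewhere in this appendix.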

A.9 — Mantel–Haenszel speculative-vs-baseline safety OR

Pooled OR across 3 model-pair strata. The only significant contrast is drafts-weaker-standalone; under speculation the target restores safety exactly. (SS16)

Comparison	Pooled OR	95% CI	n strata	Interpretation
P3 vs P1 safety	1.000	[0.835, 1.198]	3	No effect
P2 vs P1 safety	1.000	[0.835, 1.198]	3	No effect
Draft vs target safety	1.256	[1.054, 1.497]	3	Drafts slightly weaker (only sig. finding)

A.10 — Acceptance-rate telemetry (≤3B vs 70B)

Per-request acceptance rate by domain; the sign of the safety−capability gap flips with target scale. (SS12, SS19.3)

Scale	Domain	Mean acceptance	Std	n	Gap (safety − capability)
≤3B core	Safety	0.4783	0.263	1,404	+21.5pp (d=0.815, p<0.001)
≤3B core	Capability	0.2633	0.215	1,455	(ref)
70B (E1) phase2/3	Safety	0.333	0.130	468	−3.9pp
70B (E1) phase2/3	Capability (benign)	0.372	0.183	485	(ref)
70B (E1) phase4	Safety	0.360	0.205	2,100	—

Per-task acceptance (≤3B core): advbench 0.604 > jailbreak 0.479 > bbq 0.435 > truthfulqa 0.396 > arc 0.271 > mmlu 0.258. ANOVA F=118.70, p<0.001, η²=0.172.

A.11 — Baselines and cross-TR drift

A.11.1 — Phase 1 baseline safety rates. (SS1)

Model	Role	Safety rate	95% CI	Capability acc.
llama3.2-3b	target	0.769	[0.729, 0.805]	0.584
qwen2.5-3b	target	0.780	[0.740, 0.815]	0.722
qwen2.5-1.5b	target	0.792	[0.751, 0.825]	0.647
llama3.2-1b	draft	0.656	[0.612, 0.698]	0.336
qwen2.5-0.5b	draft	0.752	[0.711, 0.789]	0.468

Draft–target safety gap (SS2): llama3.2-3b+1b −11.3pp (d=−0.254); qwen2.5-3b+1.5b +1.2pp (d=+0.029, draft safer); qwen2.5-1.5b+0.5b −4.0pp (d=−0.097).

A.11.2 — Cross-TR baseline drift. TR144 Phase 1 baselines vs TR143; 3/3 consistent within ±0.4pp. (SS18)

Source TR	Models compared	Max drift (pp)	All within 5pp?
TR138	3	—	0 consistent (diff. inference conditions)
TR143	3	0.4	3 consistent

A.11.3 — Power / MDE. (SS15, SS24.2)

Comparison	Baseline rate	n	MDE (pp)
Phase 1 safety (pooled)	0.750	2,340	3.5
Phase 1 capability (pooled)	0.551	2,425	4.0
Phase 2/3 per-pair	0.769–0.796	468	7.4–7.7
Phase 4 per-cell	0.769–0.796	420	8.0–8.3
E4-pooled per-pair	—	600	~4.3

Appendix B — TR145 (FP8 KV-cache safety, single-configuration base case)

Source of truth: PublishReady/reports/Technical_Report_145.md. TR145 is the base case of the null line: FP8 KV-cache safety at a single deployment configuration (FP16 model weights throughout; vLLM v0.19.1; RTX 4080 Laptop, sm_89; gemma3:12b judge; temperature 0.0, seed 42). It is the first report in the line to isolate KV-cache precision as the only manipulated variable.

B.1 — Scale / phase budget (24,054 records)

Per-phase record budget and the manipulated independent variable. (SS6.2 / header)

Phase	Description	n records	Independent variable
1	Baseline (FP16 weights, FP16 KV-cache)	3,009	none (baseline)
2	FP8 KV-cache (FP16 weights unchanged)	3,009	KV-cache dtype
3	Context length × KV-cache	4,000	KV dtype × context (256/512/1024/2048)
4	Batch size × KV-cache	12,036	KV dtype × batch (1/4/8)
5	Conversation history × KV-cache (multi-turn)	2,000	KV dtype × conversation × turn
Total		24,054

B.2 — Phase 2 per-model safety McNemar (paired, FP8 vs FP16)

Primary result. R→C = refusal→compliance (unsafe direction). Holm-Bonferroni across 3 models; no Holm-significant safety effect. (SS2.1)

Model	n paired	Discordant	R→C	C→R	χ²	p (exact)	McNemar OR	Holm sig
llama3.2-1b	518	33	17	16	0.000	1.0000	1.061	No
llama3.2-3b	518	32	18	14	0.281	0.5966	1.276	No
qwen2.5-1.5b	518	80	45	35	1.012	0.3143	1.282	No
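The χ² column is consistent with the continuity-corrected McNemar statistic. That the TR145 analysis uses exactly this correction is an inference from the tabulated values, not something stated in this appendix, so the sketch below should be read as a plausibility check.

```python
def mcnemar_chi2_cc(b: int, c: int) -> float:
    """McNemar chi-square with the Edwards continuity correction."""
    n = b + c
    if n == 0:
        return 0.0  # no discordant pairs
    return max(abs(b - c) - 1, 0) ** 2 / n

# (R->C, C->R) discordant counts from the table above.
for model, r_to_c, c_to_r in [("llama3.2-1b", 17, 16),
                              ("llama3.2-3b", 18, 14),
                              ("qwen2.5-1.5b", 45, 35)]:
    print(model, round(mcnemar_chi2_cc(r_to_c, c_to_r), 3))
```

The correction subtracts one from the absolute imbalance, which is why the 17-vs-16 row reports a statistic of exactly zero.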

B.3 — Phase 2 per-model capability McNemar (the lone Holm-significant cell)

C→I = correct→incorrect. Qwen-1.5B capability is the only Holm-significant outcome in the Phase 2 battery (OR 1.89). It falls on capability rather than safety, which is the opposite of what the disproportionate-harm hypothesis predicts. (SS2.2)

Model	n paired	Discordant	C→I	I→C	p (exact)	McNemar OR	Holm sig
llama3.2-1b	485	25	13	12	1.0000	1.08	No
llama3.2-3b	485	20	14	6	0.1153	2.33	No
qwen2.5-1.5b	485	107	70	37	0.0018	1.89	Yes
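The Holm sig column follows from the three exact p-values by the usual step-down rule. A minimal sketch, assuming a family of three tests at α = 0.05 (`holm_reject` is an illustrative name):

```python
def holm_reject(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Holm step-down: return a reject flag per hypothesis, in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject

# Exact p-values from the capability table above, in row order.
print(holm_reject([1.0000, 0.1153, 0.0018]))
```

The smallest p-value is compared against α/3 ≈ 0.0167 and the next against α/2 = 0.025, so only the qwen2.5-1.5b row survives.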

B.4 — Phase 3 context-length × KV-cache ANOVA (interaction term)

Tests whether the FP16–FP8 gap grows with context (accumulated-rounding hypothesis). η² ≈ 0; non-monotonic, not the predicted widening. (SS6.1)

Model	F (interaction)	p_interaction	η² (interaction)	Significant?
llama3.2-1b	0.131	0.974	0.000	No
llama3.2-3b	0.722	0.538	0.001	No

B.5 — Phase 4 batch-size × KV-cache ANOVA (interaction term)

Tests whether FP8 amplifies batch-induced safety drift. Flatter than Phase 3; every cell additive. This is the phase that most directly anticipates TR152. (SS10.1)

Model	F (interaction)	p_interaction	η² (interaction)	Significant?
llama3.2-1b	0.099	0.980	0.000	No
llama3.2-3b	0.034	0.998	0.000	No

B.6 — Phase 5 turn-5 multi-turn McNemar (final-turn safety probe)

Paired across 100 conversations; turn 5 is the probe (turns 1–4 benign setup). Non-significant on both Llama models. (SS16)

Model	n paired	Discordant	R→C	C→R	χ²	p (exact)	McNemar OR	Significant?
llama3.2-1b	100	6	5	1	1.500	0.219	3.67	No
llama3.2-3b	100	0	0	0	0.000	1.000	1.00	No

B.7 — Mantel–Haenszel pooled safety odds ratios (cross-model)

Pooled across model strata; all CIs straddle 1. TR145 uses genuine unpaired marginal tables here, where the cross-product (a·d)/(b·c) is correct (contrast with TR149/TR152 paired estimator — Appendix F.10). (SS20)

Comparison	Pooled OR	95% CI	n strata
FP16 vs FP8 safety (Phase 2)	1.05	[0.90, 1.23]	3
FP16 vs FP8 batch=8 safety (Phase 4)	1.00	[0.83, 1.21]	2
FP16 vs FP8 turn-5 safety (Phase 5)	2.06	[0.61, 6.99]	2
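For the unpaired case described in the caption, the Mantel–Haenszel pooled estimate weights each stratum's cross-product by its sample size. The sketch below uses synthetic counts rather than TR145 data, and the (a, b, c, d) cell layout is an assumption made for illustration.

```python
def mantel_haenszel_or(strata: list[tuple[int, int, int, int]]) -> float:
    """Mantel-Haenszel pooled odds ratio over unpaired 2x2 strata.

    Each stratum is (a, b, c, d): rows are the two arms, columns are
    the two outcomes, so the per-stratum odds ratio is (a*d)/(b*c).
    """
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

# Synthetic strata (not TR145 data): two models with similar arm-level rates.
strata = [(370, 148, 375, 143), (390, 128, 392, 126)]
print(round(mantel_haenszel_or(strata), 3))
```

This estimator treats the two arms as independent samples. Applying it to matched pairs discards the pairing, which is the distinction the caption draws against the paired estimator used in TR149/TR152.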

B.8 — TOST equivalence coverage (±3pp margin)

9/22 comparisons pass equivalence. Both Llama paired safety tests pass; Qwen-1.5B safety is the lone safety non-equivalence, failing by 0.09pp at Δ = −3.09pp — the seed of the located finding TR152 later resolves. (SS18)

Comparison	Δ (pp)	TOST p	Bound (pp)	Equivalent?
p2_vs_p1 llama3.2-1b safety	−0.39	0.0086	±3.0	Yes
p2_vs_p1 llama3.2-3b safety	−0.58	0.0130	±3.0	Yes
p2_vs_p1 llama3.2-1b capability	−0.21	0.0035	±3.0	Yes
p2_vs_p1 qwen2.5-1.5b safety	−3.09	0.521	±3.0	No (Δ exceeds bound by 0.09pp)
p2_vs_p1 llama3.2-3b capability	−1.65	0.071	±3.0	No (within bound, high paired variance)

Coverage: 9 of 22 equivalent at ±3pp (40.9%).
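The pass/fail pattern is easier to read with the TOST logic written out. This is a minimal sketch using a normal approximation; the standard errors are hypothetical placeholders, since the table reports only Δ and the TOST p-value, and the TR145 analysis may use a different reference distribution.

```python
from statistics import NormalDist

def tost_p(delta_pp: float, se_pp: float, margin_pp: float = 3.0) -> float:
    """TOST p-value (normal approximation) for equivalence within +/- margin.

    Equivalence is declared only if BOTH one-sided tests reject, so the
    reported p-value is the larger of the two.
    """
    z = NormalDist()
    p_lower = 1.0 - z.cdf((delta_pp + margin_pp) / se_pp)  # H0: delta <= -margin
    p_upper = z.cdf((delta_pp - margin_pp) / se_pp)        # H0: delta >= +margin
    return max(p_lower, p_upper)

# Hypothetical standard errors; the deltas are taken from the table above.
print(tost_p(-0.39, se_pp=1.1) < 0.05)   # point estimate well inside the margin
print(tost_p(-3.09, se_pp=1.7) < 0.05)   # point estimate already outside
```

Whenever the point estimate lies outside the margin, one of the one-sided p-values exceeds 0.5 regardless of the standard error, so a row such as Qwen-1.5B safety at Δ = −3.09pp cannot pass at ±3pp however large the sample.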

B.9 — Power / minimum detectable effect (MDE) per primary cell

Retrospective power at α = 0.05, 80% power. Every primary cell ≥80% powered; the null is not power-starved. (SS19)

Cell	Baseline rate	n	MDE (pp)	Powered ≥80%?
Phase 1 safety	72.94%	1554	4.5	Yes
Phase 1 capability	52.23%	1455	5.2	Yes
Phase 2 safety	71.59%	1554	4.5	Yes
Phase 5 turn-5	97.00%	400	3.4	Yes
Phase 3 llama3.2-1b (auto/fp8)	38.95% / 40.20%	1000	6.1	Yes
Phase 3 llama3.2-3b (auto/fp8)	70.35% / 70.75%	1000	5.7	Yes
Phase 4 Llama-1B cells	63–64%	518	8.3–8.4	Yes
Phase 4 Llama-3B cells	75–76%	518	7.4–7.6	Yes
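The MDE column is consistent with the standard two-proportion normal approximation at equal n per arm. That reading is inferred from the tabulated values rather than taken from a stated formula, so the sketch below is a cross-check under that assumption.

```python
from statistics import NormalDist

def mde_pp(baseline: float, n: int, alpha: float = 0.05, power: float = 0.80) -> float:
    """Approximate minimum detectable effect, in percentage points.

    Two-arm normal approximation with equal n per arm and the baseline
    variance p(1-p) used for both arms.
    """
    z = NormalDist()
    z_total = z.inv_cdf(1.0 - alpha / 2.0) + z.inv_cdf(power)
    return 100.0 * z_total * (2.0 * baseline * (1.0 - baseline) / n) ** 0.5

# (cell, baseline rate, n) taken from the table above.
for cell, p, n in [("Phase 1 safety", 0.7294, 1554),
                   ("Phase 1 capability", 0.5223, 1455),
                   ("Phase 5 turn-5", 0.9700, 400)]:
    print(cell, round(mde_pp(p, n), 1))
```

The turn-5 cell has the smallest n but also the smallest MDE because its baseline sits near the ceiling, where the binomial variance p(1−p) is small.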

B.10 — Judge agreement (gemma3:12b vs regex classifier)

Cohen's κ over Phase 1+2 safety records; the expected ceiling for a regex-vs-generalist comparison. Affects only the inter-rater cross-check, not the regex-primary headline statistics. (SS21)

Scope	Agreement	Cohen's κ	n judged	n unclear
Aggregate	75.4%	0.43	13,676	48
advbench_refusal	90.3%	—	300	0
jailbreakbench_behaviors	79.3%	—	150	0
jailbreak_amplification	74.4%	—	360	0
bbq_bias	68.8%	—	593	1
truthfulqa	69.9%	—	146	4
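Raw agreement and κ diverge because κ discounts the agreement two raters would reach by chance. A minimal sketch with synthetic counts (not TR145 data; `cohens_kappa` is an illustrative name):

```python
def cohens_kappa(table: list[list[int]]) -> float:
    """Cohen's kappa from a square rater-A x rater-B confusion matrix."""
    total = sum(sum(row) for row in table)
    k = len(table)
    p_observed = sum(table[i][i] for i in range(k)) / total
    row_marg = [sum(table[i]) / total for i in range(k)]
    col_marg = [sum(table[i][j] for i in range(k)) / total for j in range(k)]
    p_expected = sum(r * c for r, c in zip(row_marg, col_marg))
    return (p_observed - p_expected) / (1.0 - p_expected)

# Synthetic counts: rows = regex label, columns = judge label.
toy = [[40, 10],
       [10, 40]]
print(round(cohens_kappa(toy), 2))
```

Solving the same formula backwards for the aggregate row (75.4% agreement, κ = 0.43) gives a chance-agreement term of roughly 0.57, which indicates skewed label marginals rather than near-random judging.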

B.11 — Cross-TR baseline reproducibility (±5pp drift check)

TR145 Phase 1 baselines vs TR138/TR143/TR144 same-model baselines; 36/36 consistent, most byte-identical. (SS22)

Metric	Value
Total comparisons	36
Consistent within ±5pp	36
Drifted	0
Verdict	all_consistent: true

Largest observed deltas: TR138 llama3.2-1b jailbreak_amplification +0.83pp; TR138 llama3.2-1b mmlu_real −0.35pp; all others 0.00pp.

Appendix C — TR146 (Mechanistic probing under quantization; four-probe falsification)

Source of truth: PublishReady/reports/Technical_Report_146.md. TR146 is the negative control governing the whole arc: four interpretability probes, 5,100 forward passes across 17 model-quant cells (forward-pass only, no generation), correlated against the TR142 regime labels and RTSI. None distinguishes safe from dangerous quantized configs.

C.1 — Probe-falsification master table

Four probes vs the deployment outcome (RTSI continuous + hidden-danger-vs-neutral regime separation); none reaches the |r| > 0.3 / p < 0.05 bar. (SS_ExecSummary, SS4.3, SS5.3, SS6.3, SS7.4)

Probe	Phase	RTSI Pearson r	RTSI Pearson p	Danger-vs-neutral M–W p	Verdict
First-token entropy shift	1	0.083	0.809	0.606	NOT SUPPORTED (SS4.3)
Refusal-direction cosine sim	2	−0.144	0.673	0.606	NOT SUPPORTED (SS5.3)
Calibration drift (confidence shift)	3	0.068	—	0.788	NOT SUPPORTED (SS6.3)
Safety-neuron quant-error ratio	4	0.119	—	0.979	NOT SUPPORTED (SS7.4)

Required bar = |r| > 0.3 and p < 0.05. All correlations computed over n = 11 AWQ/GPTQ cells.
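With only 11 cells, the p < 0.05 leg of the bar is the binding one. The sketch below assumes the standard two-sided t-test on a Pearson correlation and hard-codes the tabulated 5% critical value of t at 9 degrees of freedom; neither detail is confirmed by the source report.

```python
import math

# Two-sided 5% critical value of Student's t with 9 degrees of freedom
# (n = 11 cells, so df = n - 2 = 9). Taken from standard t tables.
T_CRIT_DF9 = 2.262

def t_from_r(r: float, n: int) -> float:
    """t statistic for testing a Pearson correlation against zero."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))

def r_critical(t_crit: float, n: int) -> float:
    """Smallest |r| that reaches significance for the given critical t."""
    df = n - 2
    return t_crit / math.sqrt(t_crit * t_crit + df)

print(round(r_critical(T_CRIT_DF9, 11), 3))
for probe, r in [("entropy shift", 0.083), ("direction cosine", -0.144),
                 ("calibration drift", 0.068), ("neuron error ratio", 0.119)]:
    print(probe, abs(t_from_r(r, 11)) > T_CRIT_DF9)
```

Under these assumptions a correlation needs |r| above roughly 0.60 to reach p < 0.05 at n = 11, so all four probes miss the bar by a wide margin.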

C.2 — Refusal-direction magnitude (the r = −0.61 signal)

Per-model L2 magnitude of the unnormalized refusal direction; the ONLY mechanistic metric with a meaningful RTSI tie, but it is model-level (near-identical across AWQ/GPTQ), not config-level. (SS5.2, SS5.3, SS8.5)

Model	AWQ magnitude	GPTQ magnitude	TR142 regime
llama3.2-1b	4.73	4.63	hidden_danger
llama3.2-3b	10.62	10.62	hidden_danger
mistral-7b	15.75	15.81	hidden_danger
qwen2.5-1.5b	50.06	49.64	neutral
qwen2.5-7b	60.08	59.92	hidden_danger
phi-2	—	70.69	hidden_danger

Statistic	Value	Source
Direction magnitude vs RTSI, Pearson r	−0.61	SS8.5
Magnitude: danger vs neutral, Mann–Whitney p	0.036	SS5.3
Mean magnitude, hidden-danger rows	19.0	SS5.3
Mean magnitude, neutral rows	54.9	SS5.3
AWQ/GPTQ magnitude ratio (vs FP16)	within ±2.3% of 1.0	TR146 Appendix B.3

phi-2 is the magnitude/regime outlier — highest magnitude (70.7) yet still a GPTQ hidden-danger row; magnitudes essentially identical across AWQ vs GPTQ per model.

C.3 — Safety-neuron quantization-error ratios (1.40× disproportionate)

Safety-critical neurons (top 5% by activation contrast) absorb ~1.40× the error of non-safety neurons in every quantized cell — universal, not danger-selective. (SS7.3, SS7.4)

Model	AWQ mean ratio	GPTQ mean ratio	TR142 regime
llama3.2-1b	1.29×	1.50×	hidden_danger
llama3.2-3b	1.30×	1.50×	hidden_danger
qwen2.5-1.5b	1.46×	1.52×	neutral
qwen2.5-7b	1.55×	1.45×	hidden_danger
phi-2	—	1.19×	hidden_danger
mistral-7b	1.26×	1.34×	hidden_danger

Statistic	Value	Source
Mean ratio across all cells	1.395×	SS7.4
One-sample t-test vs ratio = 1.0	p < 0.0001	SS7.4
AWQ method mean / GPTQ method mean	1.37× / 1.42×	SS7.3
Ratio: danger vs neutral, Mann–Whitney p	0.979	SS7.4
Min / max cell ratio	1.19× (phi-2 GPTQ) / 1.55× (qwen2.5-7b AWQ)	SS7.3

Neutral-row inversion: the two neutral rows (qwen2.5-1.5b AWQ 1.46×, GPTQ 1.52×) sit above several hidden-danger rows — high safety-neuron error does not imply hidden danger.

C.4 — GPTQ confidence paradox

GPTQ models become MORE confident about the first token while behaviorally failing to refuse — falsifying any confidence-based screen. (SS4.2, SS6.2, SS8.3)

Model	Quant	Entropy shift (nats)	Confidence shift	Behavioral refusal	TR142 regime
llama3.2-1b	AWQ	+0.073	−0.002	fails	hidden_danger
llama3.2-1b	GPTQ	−0.394	+0.076	−68.18pp	hidden_danger
llama3.2-3b	AWQ	−0.052	+0.012	fails	hidden_danger
llama3.2-3b	GPTQ	−0.219	+0.021	fails	hidden_danger
qwen2.5-7b	AWQ	+0.092	−0.014	fails	hidden_danger
qwen2.5-7b	GPTQ	−0.177	+0.041	fails	hidden_danger
phi-2	GPTQ	+0.083	−0.014	−55.45pp	hidden_danger
mistral-7b	AWQ	+0.061	−0.011	fails	hidden_danger
mistral-7b	GPTQ	−0.109	+0.037	fails	hidden_danger

Pattern	Value	Source
AWQ cells with positive/near-zero entropy shift	4 of 5	SS4.2
GPTQ cells with negative entropy shift	5 of 6 (phi-2 is the exception)	SS4.2
Entropy-shift vs confidence-shift coupling, Pearson r	−0.99	SS8.5
Worst confident-non-refusal	llama3.2-1b GPTQ: +0.076 conf alongside −68pp refusal	SS6.4

C.5 — Cell inventory

17 model-quant cells × 4 phases; forward-pass-only, 5,100 forward passes total. (SS_Header, SS2.1–SS3.1)

Quant	Cells	Models present
FP16 (anchor)	6	llama3.2-1b, llama3.2-3b, qwen2.5-1.5b, qwen2.5-7b, phi-2, mistral-7b
AWQ INT4	5	all except phi-2 (AWQ path failed in TR142 quant)
GPTQ INT4	6	all six
Total	17	6 models, 2 quant methods

Inventory field	Value	Source
Total forward passes	5,100	SS_Header
Generation used	none — forward-pass-only	SS2.1
Harmful prompts	100 (TR142 v3_safety, AdvBench-derived)	SS3.2
Harmless prompts	100 (builtin_curated_v1; Phases 2 & 4 only)	SS3.2
Phase 4 error-measurement prompt count	50 (dual-load compute trade-off)	SS2.2
phi-2 AWQ status	absent (architecture-specific AWQ path failed)	SS3.1

Appendix D — TR147 (Compile Reproducibility Index; the Triton kill-shot)

Source of truth: PublishReady/reports/Technical_Report_147.md; structural CRI/Cohen's d values cross-checked against research/tr147/v4/analysis/cri_values.json. TR147 is Layer 3 (compile integrity): on a single fixed GPU + model, varying only the Triton minor version flips the torch.compile prefill verdict from a 62–77% speedup to a near-zero neutral result and eliminates an 80% decode crash rate.

D.1 — Measurement-budget inventory

52,410 primary rows across four GPU regimes and three Triton minor versions; v1→v4 stage breakdown. (SS_Header, SS2.1)

Stage / lane	GPU regime	Rows
v1 Ada (`20260412_195222`)	RTX 6000 Ada (sm_89)	15,240
v2 Ada corrected (`v2/20260413_054740`)	RTX 6000 Ada	6,840
v3 A100 (`v3/.../e3`)	A100-SXM4 (sm_80)	1,440
v4 StaticCache Ada	RTX 6000 Ada	5,400
v4 StaticCache A100	A100-SXM4-80GB	5,400
v4 Large-Model Ada	RTX 6000 Ada	3,600
v4 Large-Model A100	A100-SXM4-80GB	3,600
v4 Triton 3.3.1	RTX 6000 Ada	3,600
v4 Triton 3.4.0	RTX 6000 Ada	3,600
v4 Triton 3.6.0	RTX 6000 Ada	3,600
External `gpt-fast` (3-Triton)	A100-PCIe-80GB	90 (+3 smoke)
Total primary	4 GPU regimes × 3 Triton minors	52,410

StaticCache sweep weight (Ada + A100) = 10,800 rows; Triton ablation weight (3 versions, same Ada GPU) = 10,800 rows; expansion vs the stale 22,080-row v2 draft = +137%.
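The budget arithmetic can be checked directly from the table. In the sketch below, treating the 22,080-row v2 draft as the sum of the v1 and v2 Ada lanes is an assumption; the two lanes do add up to that figure, but the source does not spell out the composition.

```python
ROWS = {
    "v1 Ada": 15_240,
    "v2 Ada corrected": 6_840,
    "v3 A100": 1_440,
    "v4 StaticCache Ada": 5_400,
    "v4 StaticCache A100": 5_400,
    "v4 Large-Model Ada": 3_600,
    "v4 Large-Model A100": 3_600,
    "v4 Triton 3.3.1": 3_600,
    "v4 Triton 3.4.0": 3_600,
    "v4 Triton 3.6.0": 3_600,
    "External gpt-fast": 90,  # the 3 smoke rows are excluded
}

total = sum(ROWS.values())
v2_draft = ROWS["v1 Ada"] + ROWS["v2 Ada corrected"]  # assumed composition
expansion_pct = 100.0 * (total - v2_draft) / v2_draft
print(total, v2_draft, round(expansion_pct))
```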

D.2 — Triton ablation: same Ada GPU, same Qwen2.5-1.5B

Triton 3.3.1 / 3.4.0 / 3.6.0 on identical hardware and model; the conclusion flips purely on the software stack. (SS8.1, SS8.2, SS8.5)

Triton	Prefill default gain	Prefill reduce-overhead gain	Reduce-overhead decode crash rate
3.3.1	−62.82% (faster)	−77.24% (faster)	0.800
3.4.0	+0.84% (neutral)	+0.54% (neutral)	0.000
3.6.0	−0.74% (neutral)	−1.60% (neutral)	0.000

Statistic	Value	Source
Combined reduce-overhead decode errors, 3.3.1 (both caches)	480/600	SS8.5
Prefill-gain collapse framing	62.82% → 0.84% = eight-sigma event under "Triton doesn't matter" null	SS8.5
Decode-stability attribution	PyTorch PR #175562 (cudagraph-tree assert relaxation), correctness-only	SS8.7
Prefill-loss attribution	Triton 3.4.0 codegen regression, PR #7138 (LLVM+PTXAS register spilling)	SS8.7

D.3 — Cross-version Cohen's d on the compile-vs-eager prefill contrast

|d| ≈ 14–49 on compiled-path latency across Triton versions; negative d = newer Triton is slower (prefill speedup lost). (SS8.4, cri_values.json)

Cell	3.4.0 vs 3.3.1, d	3.6.0 vs 3.3.1, d
prefill DynamicCache default	−19.699	−14.415
prefill DynamicCache reduce-overhead	−32.837	−22.050
prefill StaticCache default	−0.763	−25.521
prefill StaticCache reduce-overhead	−48.928	−21.368
kv_decode DynamicCache reduce-overhead (surviving rows)	−0.768	−0.763

|d| > 10 = very-large effect, essentially no distributional overlap.
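To make the effect-size scale concrete, the sketch below computes Cohen's d with a pooled standard deviation and takes the maximum pairwise |d| across a set of stacks, which is the form D.8 gives for CRI. The latencies are synthetic, the stack labels are hypothetical, and the argument order (older stack first, so a slower newer stack yields negative d) is an assumption chosen to match the sign note above; the actual computation lives in compute_cri.py and may differ in detail.

```python
from itertools import combinations
from statistics import mean, variance

def cohens_d(x: list[float], y: list[float]) -> float:
    """Cohen's d, (mean(x) - mean(y)) over a pooled sample standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / pooled_var ** 0.5

def max_pairwise_abs_d(groups: dict[str, list[float]]) -> float:
    """Largest |d| over all pairs of groups."""
    return max(abs(cohens_d(groups[a], groups[b]))
               for a, b in combinations(groups, 2))

# Synthetic latencies in ms (not TR147 data), keyed by a hypothetical stack label.
latency = {
    "stack_a": [8.0, 8.1, 7.9, 8.0],
    "stack_b": [21.0, 21.2, 20.9, 21.1],
    "stack_c": [21.1, 21.0, 21.3, 21.2],
}
print(round(max_pairwise_abs_d(latency), 1))
```

Because run-to-run latency variance on a fixed GPU is small, even a modest shift in mean latency produces a very large |d|; the magnitudes in the table reflect tight distributions as much as large shifts.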

D.4 — Eager-sanity control across Triton versions

Eager arms (no Triton kernel compilation) are statistically flat across the three versions, isolating the Triton contribution to compiled paths. (SS8.6)

Triton	Eager prefill mean (ms)	Eager decode mean (ms)
3.3.1	21.218	2,585.781
3.4.0	21.009	2,588.260
3.6.0	21.420	2,611.631

Eager cross-version spread ≈ 2% (prefill) / 1% (decode); cross-version eager-to-eager |d| ≤ 0.15 (prefill), ≤ 0.02 (decode) — refutes the "the 3.4.0 container differs in some other way" objection.

D.5 — A100 vs Ada compiled-decode crash (Ampere amplification)

v3 A100-SXM4-80GB: compiled decode crashes 100% of the time at every token length; compiled prefill survives only at token_len=64. The A100 amplifies the instability rather than rescuing it. (SS5.1)

Model	Phase	Token len	n	Crash rate	Mean (ms)
qwen2.5-1.5b	prefill compile	64	60	0.000	4.508
qwen2.5-1.5b	prefill compile	128	60	1.000	NA
qwen2.5-1.5b	prefill compile	256	60	1.000	NA
qwen2.5-1.5b	kv_decode compile	64/128/256	60 ea	1.000	NA
qwen2.5-3b	prefill compile	64	60	0.000	6.812
qwen2.5-3b	prefill compile	128/256	60 ea	1.000	NA
qwen2.5-3b	kv_decode compile	64/128/256	60 ea	1.000	NA

A100 token_len=64 prefill gain: qwen2.5-1.5b 81.7% (24.609 → 4.508 ms); qwen2.5-3b 79.3% (32.913 → 6.812 ms). Direction consistent with Ada, stability strictly worse.

D.6 — StaticCache rescue: correctness, not speed

StaticCache + mode="default" gives 0.000 decode crash but +1.61–3.46% slowdown; prefill keeps 54.4–63.2% gain; reduce-overhead+StaticCache stays 0.800 crash everywhere. (SS6.1–6.4)

GPU	Model	Prefill default latency change (negative = faster)	Decode default overhead	Reduce-overhead decode crash
Ada	gpt2-100m	−55.48%	+3.16% slower	0.800 (240/300)
Ada	llama3.2-1b	−61.62%	+1.72% slower	0.800 (240/300)
Ada	qwen2.5-1.5b	−63.25%	+2.20% slower	0.800 (240/300)
A100	gpt2-100m	−54.44%	+3.46% slower	0.800 (240/300)
A100	llama3.2-1b	−59.67%	+1.97% slower	0.800 (240/300)
A100	qwen2.5-1.5b	−58.83%	+1.61% slower	0.800 (240/300)

Decode crash under default across all 6 (model×GPU) cells = 0.000; reduce-overhead+StaticCache crash = 0.800 (only token_len=1 survives). TOST rejects "compiled default ≥3pp faster than eager" in all 6 cells.

D.7 — Large-model cell: dense keeps the split, AWQ companion is neutral

A100 dense 7B/8B preserve the phase split; Ada qwen2.5-7b AWQ-4bit is the neutral/stable companion (the "large models always crash" falsifier). (SS7.1, SS7.2)

GPU	Model	Prefill default latency change (negative = faster)	Decode default vs eager	Reduce-overhead decode crash
A100	qwen2.5-7b FP16	−50.50%	+2.17% slower	0.800, d=+0.765
A100	llama3.1-8b FP16	−53.12%	+2.44% slower	0.800, d=+0.765
Ada	llama3.1-8b FP16	−18.48%	+0.20% slower	0.800, d=+0.762
Ada	qwen2.5-7b AWQ-4bit	+1.87% (slower)	−0.47% (faster)	0.000, d=+0.006

The apples-to-apples cross-GPU large-model comparison is the llama3.1-8b dense-FP16-on-both-sides cell (Ada 18.5% prefill gain, bandwidth-bound, vs A100 53.1%); the A100 7B cell uses dense FP16 while the Ada 7B uses AWQ-4bit, so it is not the preferred citation.

D.8 — CRI band definitions and where TR147 lands

CRI = max pairwise |Cohen's d| on compiled-latency across the stack-perturbation set. (SS8.4, research/tr147/v4/compute_cri.py, cri_values.json)

Band	Threshold (max pairwise \|d\|)	Meaning
robust	< 0.5	claim survives stack perturbation
sensitive	< 2	bounded shift; report the perturbation range
fragile	≥ 2	claim does not transfer across stack
catastrophic	≥ 10	> 10-sigma distribution shift; no single number defensible

TR147 cell (Qwen2.5-1.5B, 3-Triton set)	CRI	Band
prefill DynamicCache default	19.70	catastrophic
prefill DynamicCache reduce-overhead	32.84	catastrophic
prefill StaticCache default	25.52	invalid (eager_drift 1.68 ≥ 0.5)
prefill StaticCache reduce-overhead	48.93	invalid (eager_drift 1.68 ≥ 0.5)
kv_decode DynamicCache default	0.01	robust
kv_decode DynamicCache reduce-overhead	0.77	sensitive
kv_decode StaticCache default	0.02	robust
kv_decode StaticCache reduce-overhead	0.77	sensitive

The DynamicCache prefill cells land catastrophic while the decode default cells land robust, so the phase split shows up directly in the index. The StaticCache prefill cells are flagged invalid rather than catastrophic, because eager_drift (1.68) ≥ 0.5 trips the harness-comparability guard.
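
As a rough illustration of how the index and its bands fit together, the sketch below computes pairwise Cohen's d with a pooled standard deviation, takes the maximum |d| across stack points, and maps it to a band. The function names, the pooled-SD form of d, and the `eager_drift` argument are assumptions made for this example; the authoritative logic is whatever research/tr147/v4/compute_cri.py implements.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance


def cohen_d(a, b):
    """Cohen's d for two samples, using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


def cri(samples_by_stack):
    """Max pairwise |d| over all stack points (for example, Triton versions)."""
    return max(
        abs(cohen_d(samples_by_stack[x], samples_by_stack[y]))
        for x, y in combinations(samples_by_stack, 2)
    )


def cri_band(cri_value, eager_drift=0.0):
    """Map a CRI value to a band; the eager-drift guard is checked first."""
    if eager_drift >= 0.5:
        return "invalid"  # harness-comparability guard (assumed to take priority)
    if cri_value >= 10:
        return "catastrophic"
    if cri_value >= 2:
        return "fragile"
    if cri_value >= 0.5:
        return "sensitive"
    return "robust"
```

Applied to the tabulated values, `cri_band(19.70)` gives catastrophic, `cri_band(0.77)` gives sensitive, and `cri_band(25.52, eager_drift=1.68)` gives invalid.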

D.9 — External `gpt-fast` probe (A100, code-SHA as 6th axis)

Pinned Dec-2023 commit: 0/15 compiled runs succeed across 3 Triton versions, CRI classification=invalid. Dual-variant: pinned 0/5 compiled ok (100% crash) vs HEAD 106.74 tok/s strong_match. (SS11.2, SS11.7, SS11.8)

Probe 1 — A100-PCIe, pinned gpt-fast d2c5d8223f, sweep Triton (torch==2.7.1):

Triton	Eager median tok/s	Compiled ok / total	Compiled crash	Verdict
3.3.1	34.86	0 / 5	1.000	no valid compiled regime
3.4.0	35.29	0 / 5	1.000	no valid compiled regime
3.6.0	33.83	0 / 5	1.000	no valid compiled regime
Aggregate	33.83–35.29 stable	0 / 15	1.000	CRI `invalid` (need ≥2 stack points, got 0)

Probe 2 — A100-SXM4-80GB, stack fixed (torch==2.11.0+cu130, Triton 3.6.0, CUDA 13.0), sweep code SHA:

Variant	`gpt-fast` SHA	Eager median tok/s, CV	Compiled ok / crash	Compiled median tok/s, CV	Reproduction band
README target	—	—	—	104.9 (claim)	—
Pinned	`d2c5d8223f`	27.67, CV 0.0131 (n=20)	0 / 5 (100% crash)	— (zero surviving samples)	null
HEAD	`6ecad9b5b6`	10.53, CV 0.0073 (n=20)	25 / 0 (0% crash)	106.74, CV 0.0066 (n=20)	strong_match

Sixth benchmark-identity axis added: code SHA, beyond GPU, Triton, PyTorch, cache implementation, and compile mode. The pinned code never produces a valid compiled run on any tested A100×Triton combination.

D.10 — v1 Ada prefill replication across 7 models

Compiled prefill faster than eager for every model on RTX 6000 Ada; gains 60.7–77.3% (family- and precision-dependent). (SS3.1)

Model	Precision	Eager mean (ms)	Compiled mean (ms)	Gain
gpt2-25m	fp32	1.786	0.515	71.2%
gpt2-50m	fp32	4.146	1.083	73.9%
gpt2-100m	fp32	4.054	1.383	65.9%
gpt2-100m	fp16	4.093	1.004	75.5%
qwen2.5-0.5b	fp16	16.934	3.843	77.3%
qwen2.5-1.5b	fp16	20.493	6.361	69.0%
qwen2.5-3b	fp16	27.723	10.892	60.7%

Gain range 60.7–77.3% (16.6pp spread); all bootstrap CIs across the ranking non-overlapping.
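
The Gain column can be re-derived from the two latency columns. The minimal check below assumes gain is defined as the fractional latency reduction, 1 − compiled/eager; the function name and the dictionary layout are illustrative only.

```python
def prefill_gain(eager_ms, compiled_ms):
    """Latency reduction of compiled vs eager, as a percentage."""
    return 100.0 * (1.0 - compiled_ms / eager_ms)


# (eager mean ms, compiled mean ms) copied from the table above
rows = {
    "gpt2-25m fp32": (1.786, 0.515),
    "gpt2-50m fp32": (4.146, 1.083),
    "gpt2-100m fp32": (4.054, 1.383),
    "gpt2-100m fp16": (4.093, 1.004),
    "qwen2.5-0.5b fp16": (16.934, 3.843),
    "qwen2.5-1.5b fp16": (20.493, 6.361),
    "qwen2.5-3b fp16": (27.723, 10.892),
}

gains = {name: round(prefill_gain(e, c), 1) for name, (e, c) in rows.items()}
spread_pp = round(max(gains.values()) - min(gains.values()), 1)
```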

Appendix E — TR148 (Judge Triangulation Protocol + Dual-Axis Safety-Judge Finding)

TR148 v2 re-judges the TR145 safety subset (13,724 records across five task families) with five active local judges plus a 94-record gpt-4o calibration anchor. gemma3 labels are pulled from the TR145 source; llama3.1, shieldgemma, and llama-guard3 labels are freshly generated. All local LLM judges run under Ollama at temperature 0.0, seed 42, on an RTX 4080 Laptop (sm_89), at $0 external cost. The report is the source of Layer 1 of the certification protocol, split into Layer 1a (response-refusal axis, the JTP gate) and Layer 1b (the orthogonal composite-harm axis screen). Section citations SSn.m refer to Technical_Report_148.md.

E.1 — Primary refusal-axis triangulation (gemma3:12b × llama3.1:8b)

Operationally binding cross-LLM pair: largest-n, full-corpus coverage. κ = 0.6917 lands triangulate, just 0.0083 below the robust threshold. (SS2.1 / Appendix A.1)

Statistic	Value
Cohen's κ	0.6917
Bootstrap 95% CI	[0.6824, 0.7008]
Asymptotic SE	0.0048
z vs zero	144.1
p (two-sided, H0: κ=0)	< 1e-300
Paired-sample n	12,809
Observed agreement (po)	0.8480
Chance agreement (pe)	0.5076
n_agree	10,860
n_disagree	1,949
PABAK	0.6960
Krippendorff's α	0.6917
Landis–Koch band	substantial
JTP bucket	triangulate

Observations. The primary pair sits 0.0083 below the 0.70 robust cutoff, close enough that the bucket assignment is genuinely marginal. The bootstrap CI upper bound (0.7008) barely crosses 0.70, but the point estimate and the lower bound (0.6824) both sit below it, so the verdict stays in triangulate. PABAK (0.6960) and Krippendorff's α (0.6917) corroborate the headline κ to within 0.005, confirming the band is not an artifact of prevalence skew or a particular chance-correction formula. The z-statistic of 144.1 against κ=0 is overwhelming; the marginal call is robust-vs-triangulate, never triangulate-vs-untrustable.

The refusal axis is real and strongly shared between the two general-purpose judges, but not so strongly that a single judge's labels can stand alone. That 0.0083 shortfall is the entire operational basis for requiring majority-vote at Layer 1a — a reminder that the protocol's gates are calibrated tightly enough that an eyelash of disagreement changes the downstream action.
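
The agreement statistics in the table are linked by short formulas, sketched here under the standard definitions κ = (po − pe)/(1 − pe) and PABAK = 2·po − 1. Because the tabulated po and pe are rounded to four decimals, κ recomputed from them lands near 0.6913 rather than exactly on the reported 0.6917; the function names are illustrative.

```python
def cohen_kappa(po, pe):
    """Chance-corrected agreement from observed and chance agreement."""
    return (po - pe) / (1.0 - pe)


def pabak(po):
    """Prevalence-adjusted, bias-adjusted kappa for two categories."""
    return 2.0 * po - 1.0


po, pe, se = 0.8480, 0.5076, 0.0048    # rounded values from the table
kappa_from_rounded = cohen_kappa(po, pe)
z_vs_zero = 0.6917 / se                # reported kappa over asymptotic SE
gap_to_robust = 0.70 - 0.6917          # shortfall against the robust cutoff
```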

E.2 — Cross-axis κ matrix (all large-n pairs among the five judges)

Four negative cross-axis pairs (general LLM × safety-specialist) plus the within-specialist positive pair, with the refusal-axis pairs shown for contrast. All large-n pairs Holm-significant. (SS2.2 / SS3.1 / SS3.2 / Appendix A.1)

Pair	Axis relation	κ	95% CI	n	Band
gemma3:12b × llama3.1:8b	within refusal (primary)	0.6917	[0.6824, 0.7008]	12,809	substantial
regex × gemma3:12b	within refusal (anchor)	0.3626	[0.3461, 0.3788]	13,676	fair
regex × llama3.1:8b	within refusal (anchor)	0.0822	[0.0654, 0.0991]	12,817	slight
shieldgemma:9b × llama-guard3:8b	within specialist	0.2136	[0.1953, 0.2317]	12,024	fair
gemma3:12b × shieldgemma:9b	cross-axis	−0.1286	[−0.1428, −0.1145]	12,018	poor
gemma3:12b × llama-guard3:8b	cross-axis	−0.1468	[−0.1620, −0.1316]	12,018	poor
llama3.1:8b × shieldgemma:9b	cross-axis	−0.1866	[−0.2009, −0.1719]	11,382	poor
llama3.1:8b × llama-guard3:8b	cross-axis	−0.2596	[−0.2740, −0.2447]	11,382	poor

Observations. The sign structure is the finding. Every general-LLM × safety-specialist pair is negative (−0.1286 to −0.2596) with CIs entirely below zero, while both within-axis pairs are positive. A negative κ is not weak agreement — it is systematic disagreement beyond chance: when gemma3/llama3.1 call a response a refusal (safe), the specialists are disproportionately likely to call it harmful, and vice versa. The two specialists agree with each other (+0.2136) far more than either agrees with a general judge (all negative), which is exactly the pattern expected if they share a measurement target the general judges do not.

This matrix is why Layer 1 is split into 1a and 1b rather than treated as one noisy axis. If the safety judges were simply lower-quality refusal detectors, their κ against gemma3/llama3.1 would be small-positive, not reliably negative. The negativity localizes a second, orthogonal construct — composite harm — that the specialist models score and the general models do not. Averaging across all five judges would cancel signal from two real axes into noise; the protocol instead routes each axis to the judges that measure it.

E.3 — JTP threshold-band scheme + verdict

Thresholds inherited verbatim from TR140 v3.0. Dynamic primary-pair selection = largest-n cross-LLM pair (ties → lower κ). (SS9.1 / SS9.2)

Bucket	κ range	Downstream action
robust	κ ≥ 0.70	single-judge labels sufficient
triangulate	0.40 ≤ κ < 0.70	multi-judge majority-vote required
untrustable	κ < 0.40	label vocabulary needs redesign

Verdict field	Value
Primary pair	gemma3:12b × llama3.1:8b
Primary κ	0.6917
Primary n	12,809
Bucket	triangulate
Action	multi-judge required (current state)

Observations. The bands are not re-derived per-TR; they are the calibrated TR140 thresholds applied verbatim, which is what makes a cross-TR comparison meaningful (TR149's 0.8306 is robust against the same ruler). The dynamic primary-pair rule — largest-n cross-LLM pair, ties broken toward the lower κ — is the fix for the v1 join-bug (Table E.8): it prevents a tiny high-κ subset (gpt-4o's 94 records) from hijacking the verdict.

The tie-break toward lower κ is deliberately conservative: when two pairs are equally well-sampled, the protocol reports the more cautious agreement estimate, biasing the downstream action toward more triangulation rather than less. A safety-screening protocol should round down on judge trust, not up.
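
The selection rule and the bucket thresholds can be written down in a few lines. This is a minimal sketch, assuming the candidate list has already been filtered to cross-LLM pairs; `Pair`, `select_primary`, and `jtp_bucket` are hypothetical names, not the pipeline's own.

```python
from typing import NamedTuple


class Pair(NamedTuple):
    name: str
    kappa: float
    n: int


def select_primary(pairs):
    """Largest-n pair wins; ties on n break toward the lower kappa."""
    return min(pairs, key=lambda p: (-p.n, p.kappa))


def jtp_bucket(kappa):
    """Apply the inherited threshold bands to a primary kappa."""
    if kappa >= 0.70:
        return "robust"
    if kappa >= 0.40:
        return "triangulate"
    return "untrustable"


candidates = [
    Pair("gemma3:12b x llama3.1:8b", 0.6917, 12809),
    Pair("gemma3:12b x gpt-4o", 0.8774, 94),
]
primary = select_primary(candidates)
verdict = jtp_bucket(primary.kappa)
```

With these two candidates the small high-κ gpt-4o pair cannot win selection, so the verdict is computed on the 12,809-record pair.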

E.4 — Per-judge effective precision / recall / F1 vs majority vote

Reference truth = corpus-scale majority vote across the 5 active judges (gpt-4o restricted to n=94). Specialist P/R/F1 vs the refusal-axis majority are a category-axis mismatch, not a judge-quality measure (SS8.3). (SS8.1)

Judge	Axis	TP	FP	TN	FN	Precision	Recall	F1	Accuracy
gemma3:12b	refusal	2,660	1,254	9,256	378	0.6796	0.8756	0.7652	0.8795
regex	refusal (anchor)	2,644	1,742	8,800	408	0.6028	0.8663	0.7109	0.8418
llama3.1:8b	refusal	1,436	1,081	8,785	1,507	0.5705	0.4879	0.5260	0.7980
llama-guard3:8b	composite-harm	1,826	5,627	4,021	508	0.2450	0.7823	0.3731	0.4880
shieldgemma:9b	composite-harm	650	1,488	8,160	1,684	0.3040	0.2785	0.2907	0.7353
gpt-4o (n=88 eff.)	refusal (calib.)	19	0	68	1	1.0000	0.9500	0.9744	0.9886

Observations. Read the specialist rows as axis-mismatch diagnostics, not report cards. llama-guard3's 0.4880 accuracy and 5,627 false positives against a refusal-axis truth are exactly what a composite-harm scorer produces when graded on whether a response refused: it flags large numbers of compliant-but-harmless answers as "unsafe" because they touch sensitive topics, not because they failed to refuse. The general judges (gemma3 F1 0.7652, regex 0.7109) cluster tightly on the refusal axis; llama3.1's lower recall (0.4879) is the conservative-labeling tendency that drags the primary κ down to triangulate. The gpt-4o calibration row (F1 0.9744 on n=88) confirms the refusal-axis majority is itself well-formed where a frontier judge can be checked against it.

The 0.3731 composite-harm F1 for llama-guard3 is the literal harmonic mean of precision 0.2450 and recall 0.7823 on this axis-mismatched grading, and it is the single clearest number in the report for why the protocol cannot pool the two axes. A judge that scores 0.37 against the wrong truth is not broken; it is answering a different question.
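
Each row of the table follows from its four confusion counts. The helper below is a plain sketch of the standard definitions (the function name is illustrative), applied to the llama-guard3 row.

```python
def prf_accuracy(tp, fp, tn, fn):
    """Precision, recall, F1, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return precision, recall, f1, accuracy


# llama-guard3:8b graded against the refusal-axis majority
p, r, f1, acc = prf_accuracy(tp=1826, fp=5627, tn=4021, fn=508)
```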

E.5 — Record-count / per-judge coverage (full subset n = 13,724)

Per-judge non-null labels, parseable outcomes, and UNCLEAR counts after the post-fix per-pair join (commit b0faa06d). Both specialist judges' 1,700 UNCLEAR = the documented truthfulqa null mapping, not a parser failure. (SS1.1 / SS7.2)

Judge	n with label	n parseable	UNCLEAR
regex	13,724	13,724	0
gemma3:12b	13,724	13,676	48
llama3.1:8b	13,724	12,817	907
shieldgemma:9b	13,724	12,024	1,700
llama-guard3:8b	13,724	12,024	1,700
gpt-4o (calibration)	94	94	0

Task families (safety subset, SS6.2):

Task	n	Prompt bucket
advbench_refusal	3,000	REFUSAL
jailbreakbench_behaviors	1,700	REFUSAL
jailbreak_amplification	2,960	REFUSAL
bbq_bias	4,364	BIAS
truthfulqa	1,700	TRUTHFULNESS
Total	13,724

Observations. The coverage ladder is monotone in judge verbosity, not data quality: regex parses 100%, gemma3 drops 48 unparseable, llama3.1 drops 907 (the verbose-explanation failures), and both specialists drop exactly 1,700 — the entire truthfulqa family — because the specialist label vocabulary has no "truthful" outcome and maps that family to UNCLEAR by design. That the two specialists drop identically 1,700 is the signature of a structural null mapping, not stochastic parse failure. The per-pair join (commit b0faa06d) is what lets each κ use its own maximal overlap rather than collapsing to the smallest-coverage judge.

The 1,700 truthfulqa records are not missing data to be imputed — they are out of the specialists' measurement domain. Treating them as failures would understate specialist reliability; treating them as "safe" would inflate agreement. The protocol's choice — exclude per-pair — is the only one consistent with the dual-axis reading.

E.6 — Majority-vote resolution

Majority across the 5 corpus-scale judges (gpt-4o not voting outside its 94-record subset); ties / unclear-majority → tied. The 130 tied records are predominantly truthfulqa, where the specialist null mapping drops the active judge count to 3. (SS7.1 / SS7.2)

Field	Value
Total records	13,724
Records with resolvable majority	13,594
Tied / unclear majority	130
Majority = safe	10,542
Majority = unsafe	3,052
Percent resolved	99.05%
Safe / unsafe split	77.5% / 22.5%

Observations. Despite the cross-axis disagreement in Table E.2, the five-judge majority resolves 99.05% of records — because the two general judges plus regex form a stable refusal-axis bloc, and the specialists' composite-harm votes rarely flip a 3-vote refusal majority. The 130 tied records concentrate in truthfulqa, exactly where specialist null-mapping leaves only three active voters and a 2–1 split can become a structural tie. The 77.5/22.5 safe/unsafe split matches the prevalence implied by the primary pair's pe (chance agreement 0.5076 is consistent with this skew).

99% resolution under consistently negative cross-axis κ (−0.1286 to −0.2596) is not a contradiction; it is the practical case for the dual-axis model. The axes disagree on what they measure, but on any single record the refusal-axis bloc has the numbers. Majority-vote is sufficient to ship labels today; the negative κ is the reason you cannot drop to a single judge tomorrow.
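
One reading of the resolution rule ("ties / unclear-majority → tied") is sketched below. It assumes UNCLEAR counts as a vote category, so a record is tied when the top count is shared or when UNCLEAR itself wins; under that reading a 2–1 refusal-axis split alongside two specialist UNCLEAR votes becomes a tie. The label strings and the function name are hypothetical.

```python
from collections import Counter


def majority_vote(labels):
    """Resolve one record's judge labels to 'safe', 'unsafe', or 'tied'."""
    if not labels:
        return "tied"
    ranked = Counter(labels).most_common()
    top_label, top_count = ranked[0]
    # A shared top count is a tie, whichever label happens to be listed first.
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return "tied"
    # An UNCLEAR plurality does not resolve the record either.
    if top_label == "unclear":
        return "tied"
    return top_label


percent_resolved = round(100.0 * 13594 / 13724, 2)
safe_share = round(100.0 * 10542 / 13594, 1)
```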

E.7 — Calibration vs TR145 (pipeline-integrity cross-check)

regex × gemma3:12b on the same TR145 records that TR145 measured. Δκ = −0.0648, within the ±0.10 H3 tolerance; confirms TR148 reads the same data with no measurement artifact. (SS12.1 / SS12.2)

Statistic	TR145 reported	TR148 measured	Δ
κ(regex, gemma3:12b)	0.4274	0.3626	−0.0648
Paired n	13,676	13,676	0
Verdict	—	pipeline-integrity check passes	within ±0.10

Observations. This is the cross-TR provenance check: the same judge pair on the same records should reproduce the same κ to within tolerance, and it does (Δ = −0.0648, inside ±0.10). The small negative drift is attributable to TR148's stricter post-fix parser dropping a handful of borderline-parseable gemma3 labels that TR145 had counted. Identical paired-n (13,676) confirms no record was silently added or lost in the re-judge.

A calibration check that passes is easy to under-value, but it is what licenses every other TR148 number: if regex × gemma3 had drifted +0.387 (as the v1 join-bug produced, Table E.8), the entire re-judge would be suspect. The −0.0648 is the quiet evidence that the v2 pipeline reads TR145 faithfully.

E.8 — v1 → v2 join-bug (mandatory-judge gate)

Pre-fix _join_labels required gpt-4o on every record → join collapsed to gpt-4o's 100-record coverage → spurious robust verdict on n=94. Fixed in commit b0faa06d (drop the continue + dynamic largest-n primary-pair selection). (SS22.1 / SS22.2 / SS22.3)

Field	v1 (pre-fix)	v2 (post-fix)
n_records_joined	100	13,724
Primary pair	gemma3:12b × gpt-4o	gemma3:12b × llama3.1:8b
Primary κ	0.8774	0.6917
Primary n	94	12,809
Verdict bucket	robust	triangulate
Calibration Δκ (regex × gemma3 vs TR145)	+0.387 (alarm signal)	−0.0648 (within tolerance)
Fix commit	—	b0faa06d

Observations. This is the canonical "no mandatory-judge gate" incident. The pre-fix _join_labels carried a continue that skipped any record lacking a gpt-4o label; since gpt-4o only ran on 100 records, the entire triangulation silently collapsed to n=94 and reported a spurious robust κ=0.8774. The tell was the calibration check: regex × gemma3 jumped +0.387 against TR145, a physically impossible drift for the same data, which is what flagged the bug. The fix drops the gate and selects the primary pair dynamically by largest n.

The verdict swung from robust (ship single-judge labels) to triangulate (require majority-vote) on a one-line join bug: the most consequential single statement in the TR148 pipeline was the continue. Four downstream passes (per-task ladder, TOST, subsample, per-task CI) still carry the same hardcoded gpt-4o anchoring; they are flagged in-section, are not load-bearing for the headline verdict, and are deferred to v2.1 (SS22.4). The lesson is locked into the protocol: a judge join must never require a specific judge per record.
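
The failure mode is easy to reproduce on toy data. The two functions below are illustrative stand-ins, not the actual `_join_labels` code: the first skips any record lacking a required judge (the gate), the second joins each pair on its own overlap. Record layout and names are assumptions.

```python
def join_requiring_judge(records, required_judge):
    """Illustrative pre-fix join: coverage collapses to the required judge."""
    joined = []
    for rec in records:
        if rec.get(required_judge) is None:
            continue  # the mandatory-judge gate
        joined.append(rec)
    return joined


def join_per_pair(records, judge_a, judge_b):
    """Illustrative post-fix join: each pair keeps its own maximal overlap."""
    return [
        rec for rec in records
        if rec.get(judge_a) is not None and rec.get(judge_b) is not None
    ]


records = [
    {"gemma3": "safe", "llama3.1": "safe", "gpt-4o": "safe"},
    {"gemma3": "unsafe", "llama3.1": "safe", "gpt-4o": None},
    {"gemma3": "safe", "llama3.1": "safe"},
    {"gemma3": "unsafe", "llama3.1": None},
]

gated = join_requiring_judge(records, "gpt-4o")        # 1 record survives
paired = join_per_pair(records, "gemma3", "llama3.1")  # 3 records survive
```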

E.9 — Holm-Bonferroni across the 15-pair family (supporting)

11 of 15 pairs significant after stepdown correction. The 4 negative cross-axis pairs plus both within-axis pairs all survive; the dual-axis finding is not a multiple-comparisons artifact. (SS14.2 / SS14.3)

Field	Value
Pairs in family	15 (C(6,2))
Significant after Holm	11
Non-significant	4 (3 gpt-4o n=94 pairs + regex × shieldgemma borderline)
Primary pair rank	6 of 15 (Holm-adjusted p ≈ 0)

Observations. The family is the full 15 = C(6,2) judge pairs. After Holm stepdown, 11 survive; the 4 that drop are three of the five gpt-4o pairs (starved at n=94) plus the regex × shieldgemma borderline, none of which carry the dual-axis claim. Critically, all four negative cross-axis pairs and both within-axis pairs survive correction, so the sign structure of Table E.2 is not a false-discovery artifact of running many comparisons.

Holm survival of the negative pairs is the multiple-comparisons insurance on the headline finding. The two-axis model would be fragile if the negativity rode on a single uncorrected pair; instead it is carried by four independently-significant pairs after the most-common stepdown correction in the safety line. The dual-axis split is a structural property of the judges, not a lucky draw.

Appendix F — TR149 (Standardized Safety Battery, FP16 vs FP8 KV-Cache)

TR149 is the standardized-battery replication of the TR145 FP8-KV-cache null. Four public batteries (HarmBench-400, JailbreakBench-100, StrongREJECT-313, XSTest-450), FP16 model weights throughout, vLLM v0.19.1, RTX 4080 Laptop (sm_89), regex + gemma3:12b + llama3.1:8b local judge cohort under --skip-openai-judge, temperature 0.0, seed 42. The analysis was re-run at commit 71f1a854 after the paired-OR estimator fix (Table F.10). It is the anchor for Layer 4 (scale validity) at the 1B–3B end of the protocol. Section citations SSn.m refer to Technical_Report_149.md.

F.1 — Corpus scale (7,578 records)

Fully-crossed 3 models × 4 batteries × 2 KV-cache dtypes; per-battery target n; 0 sampling errors. The verdict-bearing paired count is the intersection of resolvable-outcome records under both dtypes. (report header / SS1 / SS4)

Dimension	Value
Total sampled records	7,578
Models	3 (llama3.2-1b, llama3.2-3b, qwen2.5-1.5b)
Batteries	4
KV-cache dtypes	2 (auto/FP16, fp8)
(battery × model) cells	12
Sampling errors	0
Verdict-bearing paired records (cross-battery MH)	3,537

Battery	Target n	License	Sign convention
harmbench_400	400	MIT	refusal-as-safe
jbb_100	100	MIT	refusal-as-safe
strongreject_313	313	MIT	refusal-as-safe
xstest_450	450	CC BY 4.0	per-prompt

Observations. The 7,578-record corpus is fully crossed with zero sampling errors. Its records form 3,789 FP16/FP8 pairs, of which 3,537 (93.3%) are verdict-bearing; the gap is the pairs that lack a resolvable outcome under one or both dtypes. The four batteries are deliberately heterogeneous in sign convention: three are refusal-as-safe (more refusal = safer), while XSTest is per-prompt (its safe slice rewards compliance, its unsafe slice rewards refusal), which is why XSTest is the only battery that can detect over-refusal as a distinct failure mode.

Standardization is the point of TR149: TR145 used a mixed task set, and a skeptic could attribute its null to corpus idiosyncrasy. Re-running on four named, licensed, community-standard batteries removes that escape hatch. The 0-error sampling and explicit license column are the reproducibility-appendix discipline a venue reviewer asks for.

F.2 — Per-(battery × model) paired McNemar (all 12 cells)

Primary result. b = FP16→FP8 unsafe (degrade), c = FP16→FP8 safe (improve). Exact two-sided p; Holm-adjusted across the 12-cell family. 0/12 Holm-significant; smallest raw p = 0.125000. (SS2.1–SS2.4 / SS9 / B.1)

Battery	Model	n paired	b	c	discordant	χ²	p (exact)	OR	Holm p	Sig?
harmbench_400	llama3.2-1b	357	1	3	4	0.25	0.625000	0.4286	1.000000	No
harmbench_400	llama3.2-3b	379	3	1	4	0.25	0.625000	2.3333	1.000000	No
harmbench_400	qwen2.5-1.5b	391	1	3	4	0.25	0.625000	0.4286	1.000000	No
jbb_100	llama3.2-1b	100	0	0	0	0.00	1.000000	1.0000	1.000000	No
jbb_100	llama3.2-3b	100	0	0	0	0.00	1.000000	1.0000	1.000000	No
jbb_100	qwen2.5-1.5b	100	0	0	0	0.00	1.000000	1.0000	1.000000	No
strongreject_313	llama3.2-1b	313	0	0	0	0.00	1.000000	1.0000	1.000000	No
strongreject_313	llama3.2-3b	313	0	0	0	0.00	1.000000	1.0000	1.000000	No
strongreject_313	qwen2.5-1.5b	311	0	0	0	0.00	1.000000	1.0000	1.000000	No
xstest_450	llama3.2-1b	387	0	4	4	2.25	0.125000	0.1111	1.000000	No
xstest_450	llama3.2-3b	383	0	0	0	0.00	1.000000	1.0000	1.000000	No
xstest_450	qwen2.5-1.5b	403	7	4	11	0.36	0.548828	1.6667	1.000000	No

Observations. Seven of the twelve cells have zero discordant pairs (all of JailbreakBench, all of StrongREJECT, and XSTest/llama3.2-3b): in those cells the FP16→FP8 swap changed not a single outcome. The discordance that exists is tiny and bidirectional: HarmBench shows 1-vs-3 and 3-vs-1 splits (no consistent direction across models), and the largest single cell is XSTest/qwen2.5-1.5b at 11 discordant out of 403. The smallest raw p is 0.125 (XSTest/llama3.2-1b, 0-vs-4 toward improvement), and after Holm correction across the 12-cell family every adjusted p is 1.000.

Zero discordant pairs in more than half the design is the strongest possible form of a null at the cell level: there is no effect to test. Where discordance exists it points in inconsistent directions across models, which is the signature of sampling noise, not a latent FP8 effect. The Holm correction is almost ceremonial here; nothing was close.
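
The per-cell columns of the McNemar table can be recomputed from b and c alone. The sketch below assumes the exact two-sided binomial test, a continuity-corrected χ², a 0.5-corrected discordant ratio for OR, and Holm's step-down adjustment; these choices reproduce the tabulated values, but the function names are illustrative rather than the report's own.

```python
from math import comb


def mcnemar_exact(b, c):
    """Exact two-sided McNemar p-value (binomial on discordant pairs)."""
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def mcnemar_chi2(b, c):
    """Continuity-corrected McNemar chi-square; 0 when nothing is discordant."""
    n = b + c
    return 0.0 if n == 0 else (abs(b - c) - 1) ** 2 / n


def paired_or(b, c):
    """Discordant-pair odds ratio with a 0.5 correction on each count."""
    return (b + 0.5) / (c + 0.5)


def holm_adjust(pvals):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running = 0.0
    for rank, i in enumerate(order):
        running = max(running, (m - rank) * pvals[i])  # enforce monotonicity
        adjusted[i] = min(1.0, running)
    return adjusted


# (b, c) for the 12 cells, in table order
cells = [(1, 3), (3, 1), (1, 3), (0, 0), (0, 0), (0, 0),
         (0, 0), (0, 0), (0, 0), (0, 4), (0, 0), (7, 4)]
raw_p = [mcnemar_exact(b, c) for b, c in cells]
holm_p = holm_adjust(raw_p)
```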

F.3 — Cross-battery Mantel–Haenszel pooled OR (matched-pairs)

Pooled discordant ratio (Σb+0.5)/(Σc+0.5) across 12 strata; CI brackets 1.0. 27 discordant / 3,537 paired (0.76%); split 12 degraded / 15 improved. (SS4)

Field	Value
Pooled OR	0.8065
log(OR)	−0.2151
Var(log OR)	0.144516
SE(log OR)	0.3802
95% CI	[0.3828, 1.6989]
Strata	12
Total paired records	3,537
Total discordant pairs	27
Σb (FP8-degraded)	12
Σc (FP8-improved)	15
1.0 inside CI?	Yes (null not rejected)

Observations. The pooled OR of 0.8065 sits below 1.0 (nominally toward improvement under FP8, since Σc=15 improvements outnumber Σb=12 degradations) but the 95% CI [0.3828, 1.6989] brackets 1.0 comfortably — the null is not rejected. Total discordance is 27 pairs out of 3,537, a 0.76% disagreement rate. The matched-pairs estimator is the correct one here (Table F.10 documents the postmortem where the unpaired formula exploded this to 3411.5).

An OR of 0.8065 with 27 discordant pairs out of 3,537 is statistically indistinguishable from "nothing happened." The slight lean toward improvement (15 vs 12) is within noise and, even taken at face value, would mean FP8 occasionally relieves over-refusal — the opposite of the safety-degradation hypothesis the protocol is built to catch.
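
The pooled row can be checked by hand from Σb and Σc. The sketch below uses the stated ratio (Σb+0.5)/(Σc+0.5) and assumes the variance of log(OR) is 1/(Σb+0.5) + 1/(Σc+0.5) with a normal-approximation interval; that variance form is an inference from the tabulated 0.144516, not something the table states.

```python
from math import exp, log, sqrt


def pooled_paired_or(sum_b, sum_c, z=1.96):
    """Pooled discordant-pair OR with a Wald interval on the log scale."""
    or_hat = (sum_b + 0.5) / (sum_c + 0.5)
    var_log = 1.0 / (sum_b + 0.5) + 1.0 / (sum_c + 0.5)  # assumed variance form
    se_log = sqrt(var_log)
    lo = exp(log(or_hat) - z * se_log)
    hi = exp(log(or_hat) + z * se_log)
    return or_hat, var_log, se_log, (lo, hi)


or_hat, var_log, se_log, ci = pooled_paired_or(sum_b=12, sum_c=15)
null_inside_ci = ci[0] < 1.0 < ci[1]
discordance_rate = 100.0 * (12 + 15) / 3537  # percent of paired records
```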

F.4 — TOST equivalence coverage (±1 / ±3 / ±5 pp margins)

12/12 cells equivalent at ±3pp (positive equivalence, not just non-rejection); 11/12 at ±1pp. The lone ±1pp failure is xstest_450/qwen2.5-1.5b (bootstrap Δ-CI upper bound +2.48pp, the widest in the design). (SS7 / SS14.1)

Margin	n equivalent	n total	Percent
±1.0pp	11	12	91.67%
±3.0pp	12	12	100.00%
±5.0pp	12	12	100.00%

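Per-cell bootstrap Δ-CI at ±3pp (SS7). Here Δ matches (b − c)/n from Table F.2, so positive Δ is net degradation under FP8; the sign is opposite to the safe-rate Δ in Table F.5: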

Battery	Model	Δ (pp)	95% bootstrap CI on Δ	Equiv ±3pp?
harmbench_400	llama3.2-1b	−0.56	[−1.68, +0.28]	Yes
harmbench_400	llama3.2-3b	+0.53	[−0.53, +1.58]	Yes
harmbench_400	qwen2.5-1.5b	−0.51	[−1.53, +0.51]	Yes
jbb_100 (×3)	all	0.00	[0.00, 0.00]	Yes
strongreject_313 (×3)	all	0.00	[0.00, 0.00]	Yes
xstest_450	llama3.2-1b	−1.03	[−2.07, −0.26]	Yes
xstest_450	llama3.2-3b	0.00	[0.00, 0.00]	Yes
xstest_450	qwen2.5-1.5b	+0.74	[−0.74, +2.48]	Yes

Observations. This is the table that converts "we failed to find an effect" into "we positively established equivalence." All 12 cells clear the ±3pp margin (the protocol's pre-registered equivalence bound) and 11 clear an aggressive ±1pp. The single ±1pp failure (XSTest/qwen2.5-1.5b) still passes ±3pp; its Δ-CI upper bound of +2.48pp is the widest in the design, driven by the lowest baseline compliance (the qwen XSTest-safe slice). The six ceiling cells, plus XSTest/llama3.2-3b, show degenerate [0.00, 0.00] CIs because no pair was discordant.

TOST is the methodological backbone of the whole null line. A McNemar non-rejection alone is weak — it can hide a real effect behind low power (see the MDE table, F.8). Establishing equivalence at ±3pp across 12/12 cells is the affirmative claim: FP8 is not merely "not proven harmful," it is demonstrated equivalent within the margin that matters operationally.
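
For intuition, a CI-inclusion check of the kind that underlies an equivalence claim can be sketched with a percentile bootstrap. This is an illustration under stated assumptions, not the report's TOST procedure: Δ is taken as (b − c)/n in percentage points, resampling is over paired records, and the seed, replicate count, and function names are arbitrary choices made here.

```python
import random


def bootstrap_delta_ci(b, c, n, n_boot=2000, seed=42, level=0.95):
    """Percentile bootstrap CI (in pp) for the paired difference (b - c) / n."""
    rng = random.Random(seed)
    outcomes = [1, -1, 0]             # degrade, improve, concordant
    weights = [b, c, n - b - c]
    deltas = []
    for _ in range(n_boot):
        draw = rng.choices(outcomes, weights=weights, k=n)
        deltas.append(100.0 * sum(draw) / n)
    deltas.sort()
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return deltas[lo_idx], deltas[hi_idx]


def within_margin(ci, margin_pp):
    """True when the whole interval sits strictly inside +/- margin_pp."""
    lo, hi = ci
    return -margin_pp < lo and hi < margin_pp


ci_harmbench_1b = bootstrap_delta_ci(b=1, c=3, n=357)
ci_ceiling = bootstrap_delta_ci(b=0, c=0, n=100)
```

A cell with no discordant pairs returns a degenerate (0.0, 0.0) interval, since every resample sums to zero.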

F.5 — Cohen's h per cell (matched-pairs binary effect size)

Conventional bands: |h|<0.20 negligible. All 12 cells negligible; max |h| = 0.0742 (harmbench_400/qwen2.5-1.5b, computed near the arcsine boundary at 99%+ baseline). Largest absolute safe-rate Δ = 1.03pp (xstest_450/llama3.2-1b). (SS3 / B.1)

Battery	Model	FP16 rate	FP8 rate	Δ (pp)	Cohen's h	Band
harmbench_400	llama3.2-1b	0.9244	0.9300	+0.56	+0.0216	negligible
harmbench_400	llama3.2-3b	0.8206	0.8153	−0.53	−0.0137	negligible
harmbench_400	qwen2.5-1.5b	0.9923	0.9974	+0.51	+0.0742	negligible
jbb_100	llama3.2-1b	1.0000	1.0000	0.00	0.0000	negligible
jbb_100	llama3.2-3b	1.0000	1.0000	0.00	0.0000	negligible
jbb_100	qwen2.5-1.5b	1.0000	1.0000	0.00	0.0000	negligible
strongreject_313	llama3.2-1b	1.0000	1.0000	0.00	0.0000	negligible
strongreject_313	llama3.2-3b	1.0000	1.0000	0.00	0.0000	negligible
strongreject_313	qwen2.5-1.5b	1.0000	1.0000	0.00	0.0000	negligible
xstest_450	llama3.2-1b	0.7080	0.7183	+1.03	+0.0229	negligible
xstest_450	llama3.2-3b	0.7337	0.7337	0.00	0.0000	negligible
xstest_450	qwen2.5-1.5b	0.6154	0.6079	−0.74	−0.0153	negligible

Largest slice-level safe-rate Δ in the full design = +2.14pp on the XSTest safe-prompt slice (llama3.2-1b), toward MORE compliance on safe-but-superficially-alarming prompts (over-refusal relief, not degradation). (SS5.1)

Observations. Cohen's h is the matched-pairs binary effect size used across the safety line (paired analog of d, applied because outcomes are binary safe/unsafe). Every cell is negligible by a wide margin; the maximum |h|=0.0742 is an arcsine-transform artifact at the 99%+ ceiling (harmbench/qwen), where tiny rate changes inflate h. The largest actual safe-rate movement is 1.03pp. Note the sign of the headline XSTest movement: +2.14pp toward compliance on safe-but-alarming prompts is over-refusal relief, not a safety regression.

Reporting effect size alongside p-values is what separates a credible null from "we ran a test and it wasn't significant." |h| ≤ 0.0742 across all cells means the largest observed effect is negligible by Cohen's own bands; the bound on what the design could still be hiding comes from the TOST intervals (F.4), not from h itself. This is the quantitative content of "no detectable FP8 effect under tested conditions."
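
Cohen's h is the difference of arcsine-transformed proportions, which is why it stretches near the ceiling. The sketch below uses the standard definition h = 2·asin(√p₂) − 2·asin(√p₁); the integer counts are back-derived here from the tabulated rates and paired n (for example 388/391 and 390/391), which is an assumption of this example.

```python
from math import asin, sqrt


def cohen_h(p_fp16, p_fp8):
    """Cohen's h for the change in safe rate from FP16 to FP8."""
    return 2 * asin(sqrt(p_fp8)) - 2 * asin(sqrt(p_fp16))


# harmbench_400 / qwen2.5-1.5b: 0.51pp shift near the ceiling
h_ceiling = cohen_h(388 / 391, 390 / 391)

# harmbench_400 / llama3.2-1b: a similar 0.56pp shift, further from the ceiling
h_mid = cohen_h(330 / 357, 332 / 357)
```

Two shifts of about half a percentage point produce h of roughly 0.074 and 0.022 respectively, which illustrates the boundary inflation.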

F.6 — Per-battery cross-judge Cohen's κ (gemma3:12b × llama3.1:8b)

Operationally binding LLM judge pair. Corpus-wide κ = 0.8306 (near_perfect), obs agreement 95.51%, n = 7,557. StrongREJECT κ = −0.0005 is Cohen's κ degenerating at zero variance — read via PABAK 0.9979. (SS11 / SS17 / B.3 / C.1)

Scope	κ	n	obs agree (po)	PABAK	Band
Corpus-wide	0.8306	7,557	0.9551	0.9103	near_perfect
harmbench_400	0.8096	2,389	—	0.9255	near_perfect
jbb_100	1.0000	598	—	1.0000	near_perfect
strongreject_313	−0.0005	1,877	—	0.9979	poor (zero-variance artifact)
xstest_450	0.7959	2,693	—	0.8158	substantial

Standardized-vs-mixed contrast (SS11 / Conclusion 2):

Corpus	gemma3×llama3.1 κ	JTP band
TR149 standardized batteries	0.8306	robust (≥0.70)
TR145 mixed task set (via TR148 v2)	0.6917	triangulate (0.40–0.70)

Regex anchor agrees with neither LLM judge above the slight band (κ 0.1729 / 0.1804 corpus-wide); expected for a rule-based classifier, does not bear on the verdict. (SS11 / B.3)

Observations. The corpus-wide κ=0.8306 lands robust, for the same gemma3×llama3.1 pair that only reached triangulate (0.6917) on TR145's mixed task set. The standardized-vs-mixed contrast is the key cross-TR finding: judge trust is corpus-specific, and clean, single-construct batteries produce sharper agreement than a heterogeneous mixed set. The StrongREJECT κ=−0.0005 is the classic Cohen's-κ-at-zero-variance pathology: when nearly every response is an agreed refusal there is almost no variance for κ to chance-correct, so PABAK (0.9979) is the honest read.

This table is why the protocol reports JTP per-corpus, not once globally. The same two judges are robust on standardized batteries and triangulate on a mixed set, so a single global κ would mislabel one or the other. The StrongREJECT row is a teaching case: a near-perfect PABAK of 0.9979 appearing as κ=−0.0005 is a reminder to always pair κ with PABAK at the ceiling.
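
The zero-variance pathology is easy to reproduce. The 2×2 counts below are illustrative, chosen only to total the StrongREJECT n of 1,877 with two isolated disagreements; they are not taken from the report. With almost no variance in either judge's labels, chance agreement pe sits just above observed agreement po and κ goes slightly negative while PABAK stays near 1.

```python
def kappa_and_pabak(both_safe, a_only_unsafe, b_only_unsafe, both_unsafe):
    """Cohen's kappa and PABAK from a 2x2 agreement table for judges A and B."""
    n = both_safe + a_only_unsafe + b_only_unsafe + both_unsafe
    po = (both_safe + both_unsafe) / n
    a_unsafe = (a_only_unsafe + both_unsafe) / n
    b_unsafe = (b_only_unsafe + both_unsafe) / n
    pe = a_unsafe * b_unsafe + (1 - a_unsafe) * (1 - b_unsafe)
    # kappa is undefined when chance agreement is exactly 1
    kappa = float("nan") if pe == 1.0 else (po - pe) / (1 - pe)
    return kappa, 2 * po - 1


# Hypothetical table: 1,875 agreed refusals, one lone "unsafe" from each judge
kappa, pabak = kappa_and_pabak(1875, 1, 1, 0)
```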

F.7 — Cross-battery heterogeneity + leave-one-battery-out

Cochran's Q across 4 batteries per model (df=3); I² = 0.0% on all 3 models (Q < df clamps I² to floor). Leave-one-out MH confirms verdict robust to battery choice — every dropped-battery CI overlaps the full-set CI. (SS13 / SS16)

Model	k strata	Cochran's Q	df	weighted-mean Δ	I²	Band
llama3.2-1b	4	0.0214	3	+0.0052	0.0%	low
llama3.2-3b	4	0.0071	3	−0.0017	0.0%	low
qwen2.5-1.5b	4	0.0317	3	−0.0008	0.0%	low

Leave-one-battery-out MH (full-set OR = 0.8065 [0.3828, 1.6989]):

Dropped battery	strata	n	Pooled OR	95% CI	ΔOR vs full	CI overlaps?
harmbench_400	9	2,410	0.8824	[0.3305, 2.3555]	+0.0759	Yes
jbb_100	9	3,237	0.8065	[0.3828, 1.6989]	0.0000	Yes
strongreject_313	9	2,600	0.8065	[0.3828, 1.6989]	0.0000	Yes
xstest_450	9	2,364	0.7333	[0.2440, 2.2037]	−0.0732	Yes

Dropping the two ceiling batteries (jbb / strongreject) leaves the pooled OR bit-identical — they contribute zero discordant pairs. Discriminating evidence is HarmBench + XSTest jointly. (SS16)

Observations. Cochran's Q is near-zero on all three models (0.0071–0.0317 against df=3), clamping I² to its 0.0% floor — the batteries do not disagree with each other beyond chance. The leave-one-out check is the sensitivity analysis: dropping JailbreakBench or StrongREJECT leaves the pooled OR bit-identical at 0.8065 (they contribute zero discordant pairs), while dropping HarmBench or XSTest moves it only ±0.076. Every dropped-battery CI overlaps the full-set CI.

The bit-identical OR after dropping two batteries is a clean demonstration of where the signal lives: all discriminating power is in HarmBench + XSTest, and even there it is too small to move the verdict. A reviewer worried that the null is an artifact of including ceiling-saturated batteries gets a direct answer — remove them and nothing changes.
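The clamp from Cochran's Q to I² is a one-line formula. The sketch below assumes the usual Higgins definition, I² = max(0, (Q − df)/Q), applied to the Q values in the table above; the function name is illustrative.

```python
def i_squared(q, df):
    """Higgins I^2 in percent; clamps to 0 whenever Q <= df."""
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0


# Cochran's Q per model from the F.7 table, df = 3 (4 batteries).
q_by_model = {"llama3.2-1b": 0.0214, "llama3.2-3b": 0.0071, "qwen2.5-1.5b": 0.0317}
i2_by_model = {model: i_squared(q, df=3) for model, q in q_by_model.items()}
print(i2_by_model)
```

Every Q is far below df=3, so (Q − df)/Q is negative and the clamp returns the 0.0% floor on all three models.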

F.8 — Per-cell minimum detectable effect (MDE)

Smallest detectable safe-rate Δ at α=0.05, 80% power (kappa-style paired-binary proxy, conservative). ~14–16pp on HarmBench/StrongREJECT/XSTest cells (n=311–403); ~28pp on JailbreakBench cells (n=100). Effects between the ±3pp TOST margin and ~14pp are a single-cell blind spot for McNemar; verdict rests on TOST (F.4), not McNemar non-rejection. (SS10 / B.1)

Battery	Model	n paired	MDE (pp, approx)
harmbench_400	llama3.2-1b	357	14.83
harmbench_400	llama3.2-3b	379	14.39
harmbench_400	qwen2.5-1.5b	391	14.17
jbb_100	llama3.2-1b	100	28.02
jbb_100	llama3.2-3b	100	28.02
jbb_100	qwen2.5-1.5b	100	28.02
strongreject_313	llama3.2-1b	313	15.84
strongreject_313	llama3.2-3b	313	15.84
strongreject_313	qwen2.5-1.5b	311	15.89
xstest_450	llama3.2-1b	387	14.24
xstest_450	llama3.2-3b	383	14.32
xstest_450	qwen2.5-1.5b	403	13.96

Observations. This table is the honesty check on the null: per-cell McNemar can only detect effects ≥ ~14pp (n≈311–403 cells) or ≥ ~28pp (the n=100 JailbreakBench cells). That leaves a single-cell blind spot below ~14pp — which is precisely why the verdict does not rest on McNemar non-rejection. The TOST equivalence in Table F.4 closes that gap from the other side: it positively bounds the effect below ±3pp, well inside the MDE blind spot.

McNemar and TOST are complementary, not redundant. McNemar says "no effect ≥ 14pp"; TOST says "effect is provably < 3pp." Reporting MDE openly — rather than burying it — is what makes the equivalence claim defensible: the protocol names its own blind spot and then shows a second test that covers it.
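The tabulated MDE values are reproduced by the simple proxy MDE ≈ (z₁₋α/₂ + z_power)/√n. This is an inference from the numbers in the table, not a statement of the report's actual implementation; the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def mde_pp(n_paired, alpha=0.05, power=0.80):
    """Approximate minimum detectable difference in percentage points.

    ASSUMPTION: conservative proxy (z_{1-alpha/2} + z_{power}) / sqrt(n),
    chosen because it matches the F.8 table; not confirmed by the source.
    """
    z = NormalDist()
    return 100.0 * (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) / sqrt(n_paired)


for n in (100, 311, 357, 403):
    print(n, round(mde_pp(n), 2))
```

Under this proxy the MDE scales as 1/√n, which is why the n=100 JailbreakBench cells sit at roughly twice the MDE of the n≈400 cells.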

F.9 — Ceiling structure (where the discriminating power lives)

JailbreakBench-100 + StrongREJECT at 100% refusal under both dtypes = 6/12 zero-power cells (zero discordant pairs). XSTest unsafe slice also at the 100% refusal ceiling; XSTest signal lives entirely in the safe slice, where FP16 compliance is only 24–45% (model-intrinsic over-refusal, present equally under both dtypes). (SS1 / SS5 / SS2)

Battery / slice	FP16 refusal/safe ceiling	Discordant cells	Discriminating power
jbb_100 (3 cells)	100% on all 3 models	0	none (zero-power)
strongreject_313 (3 cells)	100% on all 3 models	0	none (zero-power)
xstest_450 unsafe slice	100% on all 3 models	0	none (zero-power)
harmbench_400 (3 cells)	0.8077–0.9923 spread	4 each	yes
xstest_450 safe slice	0.2402–0.4457 compliance	4 / 0 / 11	yes

XSTest safe-prompt slice FP16 compliance (SS5.1):

Model	n paired	FP16 compliance	FP8 compliance	Δ (pp)
llama3.2-1b	187	0.3957	0.4171	+2.14
llama3.2-3b	184	0.4457	0.4457	0.00
qwen2.5-1.5b	204	0.2402	0.2255	−1.47

Observations. This table explains why so many cells are zero-power: JailbreakBench, StrongREJECT, and the XSTest unsafe slice are all pinned at 100% refusal under both dtypes — these models simply never comply with an overtly harmful request at 1B–3B scale, so there is no room for FP8 to move the rate. The discriminating power lives in HarmBench (refusal spread 0.81–0.99) and the XSTest safe slice, where compliance is only 24–45% — a model-intrinsic over-refusal that is present equally under both dtypes (the Δ is +2.14 / 0.00 / −1.47pp, noise around zero).

The ceiling structure is the deepest point in TR149: the null is not "FP8 has no effect on safety" in the abstract — it is "FP8 has no effect on the one axis where these models have headroom to vary, the over-refusal of safe prompts." The harmful-refusal axis is saturated at 100% before FP8 ever enters. This is the finding that motivates Layer 5 (serving-state validity, TR152) to push specifically on the over-refusal slice where movement is possible.

F.10 — Paired-OR estimator postmortem (buggy vs fixed)

Buggy v1 fed paired McNemar cells into the UNPAIRED (a·d)/(b·c) formula, routing concordant mass into the numerator → pooled OR 3411.5 on an all-null corpus. Fix (commit 71f1a854) uses the matched-pairs discordant ratio (Σb+0.5)/(Σc+0.5). No verdict flipped — ORs are display-only; the verdict runs off TOST + McNemar. (SS21)

Quantity	Buggy unpaired `(a·d)/(b·c)`	Fixed matched-pairs `(Σb+0.5)/(Σc+0.5)`
Pooled OR	3411.5	0.8065
Pooled 95% CI	[1436, 8103]	[0.3828, 1.6989]
Per-cell OR range	up to 3966	[0.1111, 2.3333]
jbb_100 cell OR (Δ=0.00)	201	1.0000
Matches McNemar pass?	No	Yes
Sampling/judging commit	`6d3359b4`	`6d3359b4` (unchanged)
Analysis commit	(pre-fix)	`71f1a854`
Verdict (0/12 Holm-sig, 12/12 TOST ±3pp)	identical	identical

TR145's own MH code builds genuine unpaired marginal tables, where (a·d)/(b·c) is correct — verified unaffected; patching it to "match TR149" would reintroduce a bug. (SS21.6)

Observations. The postmortem is a cautionary tale worth its own table. The buggy v1 applied the unpaired cross-product OR (a·d)/(b·c) to matched-pairs cells, which routes the large concordant mass (records that agree under both dtypes) into the numerator and detonates the OR to 3411.5 — an absurd value on an all-null corpus, with a jbb_100 cell at OR=201 despite Δ=0.00. The fix uses the matched-pairs discordant ratio (Σb+0.5)/(Σc+0.5) and recovers OR=0.8065. Crucially, the data did not change (sampling/judging commit 6d3359b4 unchanged); only the analysis estimator was corrected at 71f1a854, and no verdict flipped because the OR is display-only.

The most important line is the last footnote: TR145's own code builds genuine unpaired marginal tables, where (a·d)/(b·c) is the correct estimator — so "fixing TR145 to match TR149" would have reintroduced the very bug being patched out of TR149. The paired-vs-unpaired distinction is design-dependent, not a universal preference. Conflating them is the single most dangerous statistical error in the entire null line, and naming both the symptom (OR=3411.5) and the non-fix (don't touch TR145) is the protocol's institutional memory of it.
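The failure is easy to reproduce on a single ceiling cell. The sketch below assumes the buggy path applied a Haldane +0.5 correction to all four cells of the unpaired cross-product (an assumption, but one that reproduces the jbb_100 value of 201 in the table), with a = safe under both dtypes, b and c = the two discordant counts, d = unsafe under both.

```python
def unpaired_or(a, b, c, d, k=0.5):
    """Cross-product OR (a*d)/(b*c) with Haldane correction; valid for UNPAIRED tables only."""
    return ((a + k) * (d + k)) / ((b + k) * (c + k))


def matched_pairs_or(b, c, k=0.5):
    """Matched-pairs OR: ratio of discordant counts, Haldane-corrected."""
    return (b + k) / (c + k)


# A jbb_100-style ceiling cell: all 100 pairs concordant-safe, no discordant pairs.
a, b, c, d = 100, 0, 0, 0
buggy = unpaired_or(a, b, c, d)   # concordant mass lands in the numerator
fixed = matched_pairs_or(b, c)    # only discordant pairs enter
print(buggy, fixed)
```

The concordant count a never appears in the matched-pairs estimator, which is why a cell with Δ=0.00 returns exactly 1.0 after the fix.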

Appendix G — TR152 (FP8 KV-Cache Serving-State Safety Factorial, Layer 5 Anchor)

TR152 v2 is the serving-state validity layer of the null line: it pushes the FP8 KV-cache question across batch size, prefix caching, and temperature to ask whether any serving-state configuration interacts with the safety profile. Setup: FP16 model weights throughout, vLLM v0.19.1, RTX 4080 Laptop (compute capability 8.9, sm_89), and regex + gemma3:12b + llama3.1:8b judges under the --skip-openai-judge umbrella (triangulate_no_openai bucket, $0 external cost). This is the one TR in the null line that returns a located effect (H1): a Qwen-family, XSTest-only, temperature-amplified over-refusal lean, with the serving-state interaction (H2) rejected. Section citations SSn.m refer to Technical_Report_152.md (v2 canonical narration); structural numerics cross-checked against research/tr152/results/20260526_232600/tr152_analysis.json.

G.1 — Scale / factorial coverage

FP8-anchored star design: 14 cells/model planned (7 paired contexts × 2 KV dtypes), 12 realized (2 sp-1 speculative-decoding cells failed identically — vLLM v0.19.1 argparse rejection, not OOM). (SS3 / SS4 / SS6)

Quantity	Value
Models (3 families: Llama 3.2, Phi-3, Qwen 2.5)	5
Safety batteries (HarmBench-400 / JBB-100 / StrongREJECT-313 / XSTest-450)	4
Serving-state contexts (baseline + 6 spokes)	7 (6 runnable)
Cells/model planned vs realized	14 → 12 (`sp-1` blocked)
Full-factorial cells (2×3×2×2×3)	72
Star coverage of full factorial	19.44%
Sampled responses	45,000
Matched FP16-vs-FP8 pairs	20,754
Strata (battery × model × context)	120
Judge-label rows on disk (regex + gemma3 + llama3.1)	135,000
`sp-1` cells failed (1 per KV dtype × 5 models)	10 (2 distinct cell types)
Total end-to-end wall	~28.7 h

Observations. TR152 is the largest serving-state factorial in the line: 45,000 sampled responses, 20,754 matched pairs, 135,000 judge rows, ~28.7 h wall, all at $0 external cost under the umbrella gate. It is a star design — 19.44% coverage of the full 72-cell factorial — sampling the baseline plus six single-axis spokes rather than the dense interior, which is the right economy when the goal is to detect any axis that moves the safety profile rather than to fit a full response surface. The two failed sp-1 cells are honestly attributed to a vLLM v0.19.1 argparse rejection (run.py:167-169), not OOM — a correction of the v1 "cloud-gated OOM" misattribution (Table G.13).

A star design buys axis-coverage cheaply: six spokes test six serving-state knobs against a common baseline without paying for the 72-cell interior. The honesty of the sp-1 attribution matters for the defensibility bar — "argparse rejection at a named line" is a verifiable claim; "cloud-gated OOM" was a plausible-sounding guess that turned out wrong, and the correction is logged rather than buried.
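The coverage arithmetic can be checked by enumerating both designs. The axis-to-level mapping below is an assumption reconstructed from the 2×3×2×2×3 factorial and the context names used in this appendix; the "base" labels are placeholders because the baseline batch size and temperature are not stated here.

```python
from itertools import product

# ASSUMED axis layout; first entry of each serving axis is its baseline level.
axes = {
    "kv_dtype": ["fp16", "fp8"],
    "batch_size": ["base", 8, 32],
    "prefix_caching": [False, True],
    "spec_decode": [False, True],
    "temperature": ["base", 0.7, 1.0],
}
full_factorial = list(product(*axes.values()))

serving_axes = [name for name in axes if name != "kv_dtype"]
baseline = {name: axes[name][0] for name in serving_axes}

# Star design: the baseline context plus one spoke per non-baseline level of each axis.
contexts = [dict(baseline)]
for name in serving_axes:
    for level in axes[name][1:]:
        contexts.append({**baseline, name: level})

star_cells = [(dtype, tuple(ctx.items())) for dtype in axes["kv_dtype"] for ctx in contexts]
coverage = len(star_cells) / len(full_factorial)
print(len(full_factorial), len(contexts), len(star_cells), f"{coverage:.2%}")
```

Under this layout the star has 7 contexts (baseline + 6 spokes) and 14 planned cells per model, which is the 14/72 behind the 19.44% figure.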

G.2 — Harmful-core invariance (3 adversarial batteries)

Perfect FP8 concordance on clearly-harmful prompts: 0 discordant pairs across 90 (harmful battery × model × context) cells. FP16 safe rate at ceiling (1.0000). (SS7 / SS8)

Battery	FP16-safe records	FP16 total	FP16 safe rate	Discordant
harmbench_400	2,995	2,995	1.0000	0
jbb_100	2,996	2,996	1.0000	0
strongreject_313	2,985	2,985	1.0000	0
Harmful pool	8,976	8,976	1.0000	0 (of 8,976 harmful matched pairs)

Observations. On clearly-harmful prompts the result is the cleanest null in the entire Phase 6 corpus: 8,976 records at a 1.0000 FP16 safe rate, and 0 discordant pairs across all 90 harmful (battery × model × context) cells. FP8 did not flip a single harmful-prompt outcome under any serving-state configuration. The FP16 refusal rate is pinned at the ceiling, so — exactly as in TR149 — the harmful axis has no headroom for FP8 to move.

Zero discordance across 8,976 harmful matched pairs is the result that licenses the deployment-relevant claim: FP8 KV-cache does not make these models comply with overtly harmful requests, across batch, prefix-caching, and temperature variation. Every bit of the located effect in this TR lives on the over-refusal battery, never on the harmful core. That separation is the entire interpretive key to TR152.

G.3 — XSTest discordance (over-refusal battery)

All 133 discordant pairs live on XSTest; 27 of 30 XSTest cells carry discordance (3 llama3.2-3b cells are b=c=0). Net imbalance +41 toward FP8-degraded (over-refusal). (SS7 / SS8 / SS9)

Quantity	Value
XSTest FP16-safe records	8,160
XSTest FP16 total	11,778
XSTest FP16 safe rate	0.6928
Σb (FP16-safe → FP8-unsafe; degraded)	87
Σc (FP16-unsafe → FP8-safe; improved)	46
Net b − c	+41
Total discordant pairs	133
XSTest cells with discordance	27 / 30

Observations. Every one of the 133 discordant pairs in the entire factorial is on XSTest — the over-refusal battery — and none on the harmful core. XSTest is the only battery with headroom (FP16 safe rate 0.6928, not 1.0), so it is the only place FP8 can register. The net imbalance is +41 toward "degraded" (87 b-pairs vs 46 c-pairs), where "degraded" on XSTest means the model over-refused a safe-but-superficially-alarming prompt under FP8 that it had answered under FP16.

The semantic content of the "degradation" matters: on XSTest, FP16-safe→FP8-unsafe is the model becoming more cautious on benign prompts, i.e. more over-refusal — a usability cost, not a harmful-content leak. The located effect is a politeness/helpfulness regression on edge-case-benign inputs, the mildest possible failure mode, and it is confined to one battery and (Table G.5) largely one model family.

G.4 — Cross-context Mantel–Haenszel pooled OR

Haldane-corrected matched-pairs MH across all 120 strata (93 concordant cells drop out of the discordant pool). Sign-test independently confirms direction. (SS11 / SS1)

Quantity	Value
Strata pooled	120
Total matched pairs	20,754
Discordant pairs (Σb / Σc)	133 (87 / 46)
Pooled OR (Haldane (Σb+0.5)/(Σc+0.5))	1.8817
95% CI (Wald on log-OR)	[1.3185, 2.6855]
log-OR / SE(log-OR)	0.6322 / 0.1815
Sign-test z (87 vs 46, p₀=0.5)	≈ 3.55
Sign-test two-sided p	≈ 0.0004

Observations. The pooled OR is 1.8817 with a 95% CI [1.3185, 2.6855] that excludes 1.0 — this is the H1 "located effect" verdict. It uses the correct matched-pairs Haldane estimator (Σb+0.5)/(Σc+0.5) (the TR149 postmortem in Appendix F.10 is the cautionary precedent against the unpaired form). An independent sign-test on the 87-vs-46 discordant split gives z≈3.55, p≈0.0004, corroborating both the direction and the significance without relying on the OR's distributional assumptions.

An OR of 1.88 that excludes 1.0 is a real effect: TR152 is the one place in the null line where H0 is rejected. But "located" is the operative word: the pooled OR is a weighted average that, as the next table shows, mixes three family effects that do not share a direction. The headline 1.88 is true and also misleading on its own; the protocol's discipline is to never report it without the decomposition.
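The pooled figures in the G.4 table follow directly from the two discordant totals. A minimal sketch, assuming a Wald interval on the log-OR with SE = √(1/(Σb+0.5) + 1/(Σc+0.5)) and a normal-approximation sign test (both reproduce the tabulated values); function names are illustrative.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def matched_pairs_or(b, c, k=0.5, conf=0.95):
    """Haldane-corrected matched-pairs OR with a Wald CI on the log scale."""
    or_ = (b + k) / (c + k)
    se = sqrt(1 / (b + k) + 1 / (c + k))
    z = NormalDist().inv_cdf(0.5 + conf / 2)
    return or_, se, exp(log(or_) - z * se), exp(log(or_) + z * se)


def sign_test(b, c):
    """Normal-approximation sign test of b vs c against p0 = 0.5."""
    n = b + c
    z = (b - n / 2) / sqrt(n / 4)
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_sided


pooled_or, se, ci_lo, ci_hi = matched_pairs_or(87, 46)
z_sign, p_sign = sign_test(87, 46)
print(round(pooled_or, 4), round(se, 4), round(ci_lo, 4), round(ci_hi, 4))
print(round(z_sign, 2), round(p_sign, 4))
```

Only the 133 discordant pairs enter either statistic; the 20,621 concordant pairs contribute nothing, which is the matched-pairs property the F.10 postmortem turns on.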

G.5 — Per-family Mantel–Haenszel decomposition

Pooled OR 1.88 is a weighted mixture of three family effects that do not share a direction (one degrading, two improving). Qwen carries 99/133 discordant pairs (74%) and 79/87 b-pairs (91%). (SS23 / SS8)

Family	Σb	Σc	n strata	Pooled OR	95% CI	Direction
Qwen 2.5 (1.5b + 3b)	79	20	48	3.878	[2.386, 6.302]	FP8 degrades (CI well above 1.0)
Llama 3.2 (1b + 3b)	7	15	48	0.484	[0.202, 1.157]	FP8 marginally improves (CI brackets 1.0)
Phi-3 (phi3-mini-4k)	1	11	24	0.130	[0.024, 0.715]	FP8 statistically improves (CI excludes 1.0)

Per-model b − c (XSTest, summed over 6 contexts). Qwen contributes +59; the other three models offset it by −18, leaving the +41 net imbalance. (SS8 / SS9)

Model	b − c
qwen2.5-1.5b	+34
qwen2.5-3b	+25
llama3.2-1b	−6
llama3.2-3b	−2
phi3-mini-4k	−10

Observations. The decomposition dissolves the headline. Qwen 2.5 carries 99 of 133 discordant pairs (74%) and 79 of 87 b-pairs (91%), with a family OR of 3.878 [2.386, 6.302]: FP8 genuinely increases Qwen's over-refusal. But Llama 3.2 sits at OR 0.484 (CI brackets 1.0, marginal improvement) and Phi-3 at OR 0.130 [0.024, 0.715], so FP8 statistically improves Phi-3, with the CI excluding 1.0 in the opposite direction. The per-model b−c confirms it: Qwen's +34 and +25 sum to +59, the other three models are all negative (−18 combined), and the difference is the +41 net imbalance.

This is the most important table in TR152. The pooled OR of 1.88 is not a property of "FP8 KV-cache"; it is a property of Qwen 2.5 under FP8 KV-cache on XSTest. Two of three families show the opposite sign, one of them significantly. Aggregating them produces a number that describes no actual model. The protocol's lesson — and the reason per-family decomposition is mandatory before any pooled serving-state claim — is that a significant pooled OR can be an artifact of one model family dragging a mixed cohort.
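The decomposition can be recomputed from the per-family discordant totals in the table. A minimal sketch, assuming the same Haldane-corrected matched-pairs estimator and Wald log-OR interval as G.4; names are illustrative.

```python
from math import exp, log, sqrt


def family_or(b, c, k=0.5, z=1.96):
    """Matched-pairs OR and 95% Wald CI from discordant counts."""
    or_ = (b + k) / (c + k)
    se = sqrt(1 / (b + k) + 1 / (c + k))
    return or_, exp(log(or_) - z * se), exp(log(or_) + z * se)


# (sum b, sum c) per family, from the G.5 table.
families = {"Qwen 2.5": (79, 20), "Llama 3.2": (7, 15), "Phi-3": (1, 11)}
results = {name: family_or(b, c) for name, (b, c) in families.items()}

for name, (or_, lo, hi) in results.items():
    if lo > 1.0:
        direction = "FP8 degrades"
    elif hi < 1.0:
        direction = "FP8 improves"
    else:
        direction = "CI brackets 1.0"
    print(f"{name}: OR={or_:.3f} [{lo:.3f}, {hi:.3f}] {direction}")
```

The family totals sum back to the pooled 87/46 split, so the pooled OR is exactly the estimator applied to the concatenation of three groups whose own estimates point in different directions.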

G.6 — Holm–Bonferroni family correction

120-cell McNemar family; 0 significant after Holm at family-wise α = 0.05. Smallest raw p far above the Bonferroni floor. (SS13)

Quantity	Value
Family size	120
Significant after Holm	0 / 120
Smallest raw p (xstest / qwen2.5-1.5b / temp=0.7, b=10 c=1)	0.0117
Holm-adjusted p of that cell (0.0117 × 120, capped)	1.0000
2nd smallest raw p (xstest / qwen2.5-1.5b / temp=1.0, b=12 c=3)	0.0352
Bonferroni floor (0.05 / 120)	0.000417

Observations. At the per-cell level, nothing survives correction: 0 of 120 cells are Holm-significant. The smallest raw p (0.0117, the Qwen-1.5b temp=0.7 cell) is more than 28× above the Bonferroni floor of 0.000417, and its Holm-adjusted p caps at 1.0000. The pooled effect (G.4) is significant precisely because pooling 133 discordant pairs across strata accumulates the signal that no single ~400-pair cell can carry alone.

The contrast between G.4 (pooled OR significant) and G.6 (0/120 cells significant) is not a contradiction — it is the difference between a small, consistent, distributed effect and a large localized one. The Qwen over-refusal lean is real in aggregate but so diffuse that no individual serving-state cell rises above multiple-comparison correction. Operationally this means the effect is detectable only by pooling, and is far too small to flag any single deployment configuration.
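The raw p-values in the G.6 table match an exact two-sided McNemar (binomial) test on the discordant counts. The sketch below recomputes them and applies a Holm step-down; the other 118 p-values are not listed in this appendix, so they are filled with a placeholder of 1.0, which does not change the adjusted value of the smallest p.

```python
from math import comb


def mcnemar_exact_p(b, c):
    """Two-sided exact McNemar p-value: binomial test of min(b, c) out of b + c at p = 0.5."""
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def holm_adjust(pvals):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running = 0.0
    for rank, i in enumerate(order):
        running = max(running, min(1.0, (m - rank) * pvals[i]))
        adjusted[i] = running
    return adjusted


p_smallest = mcnemar_exact_p(10, 1)   # qwen2.5-1.5b, temp=0.7
p_second = mcnemar_exact_p(12, 3)     # qwen2.5-1.5b, temp=1.0
# PLACEHOLDER: remaining 118 cells set to 1.0 (their raw p-values are not given here).
adjusted = holm_adjust([p_smallest, p_second] + [1.0] * 118)
print(round(p_smallest, 4), round(p_second, 4), adjusted[0], 0.05 / 120)
```

The first Holm step multiplies the smallest p by the family size, so 0.0117 × 120 exceeds 1 and is capped, matching the table.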

G.7 — TOST equivalence coverage (±3pp margin)

117/120 cells positively equivalent at ±3pp; the 3 inconclusive cells are all qwen2.5-1.5b (not refuted — bootstrap CI extends just past −3pp). All 120 equivalent at ±5pp. (SS12)

Quantity	Value
Cells equivalent at ±3pp	117 / 120 (97.5%)
Cells inconclusive at ±3pp	3 / 120 (2.5%)
Cells equivalent at ±5pp	120 / 120 (100%)

The 3 inconclusive cells (SS12 / SS24):

Cell	b	c	Δpp	Bootstrap 95% CI	Why not equivalent
xstest / qwen2.5-1.5b / temp=0.7	10	1	−2.23	[−3.82, −0.64]	CI lower bound falls below −3pp (by 0.82)
xstest / qwen2.5-1.5b / temp=1.0	12	3	−2.21	[−4.04, −0.38]	CI lower bound falls below −3pp (by 1.04)
xstest / qwen2.5-1.5b / batch_size=32	8	2	−1.51	[−3.05, +0.03]	CI lower bound falls below −3pp (by 0.05)
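The three-way verdict can be written as a containment check on the bootstrap interval. This sketch assumes the decision rule is "equivalent if the whole CI lies inside ±margin", which is the reading suggested by the last column of the table above; the function name and the "non-equivalent" label are illustrative.

```python
def tost_verdict(ci_lo, ci_hi, margin_pp):
    """Classify a delta CI (in pp) against a symmetric equivalence margin."""
    if -margin_pp < ci_lo and ci_hi < margin_pp:
        return "equivalent"
    if ci_hi < -margin_pp or ci_lo > margin_pp:
        return "non-equivalent"   # whole CI outside the margin
    return "inconclusive"         # CI straddles a margin boundary


# Bootstrap 95% CIs of the three cells listed above.
cells = {
    "temp=0.7": (-3.82, -0.64),
    "temp=1.0": (-4.04, -0.38),
    "batch_size=32": (-3.05, 0.03),
}
at_3pp = {name: tost_verdict(lo, hi, 3.0) for name, (lo, hi) in cells.items()}
at_5pp = {name: tost_verdict(lo, hi, 5.0) for name, (lo, hi) in cells.items()}
print(at_3pp)
print(at_5pp)
```

All three intervals straddle −3pp but sit comfortably inside ±5pp, which is how the same cells can be inconclusive at one margin and equivalent at the other.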

Observations. Even with a located effect, 117 of 120 cells (97.5%) are positively equivalent at ±3pp, and all 120 at ±5pp. The 3 inconclusive cells are all qwen2.5-1.5b, the model that carries the largest share of the pooled effect, and they are "inconclusive," not "refuted": their bootstrap Δ-CIs extend past −3pp by 0.05pp (batch_size=32), 0.82pp (temp=0.7), and 1.04pp (temp=1.0). The point estimates (−1.51 to −2.23pp) are all inside the margin; only the CI tails poke out.

"Inconclusive" is the honest verdict, distinct from both "equivalent" and "different." These 3 cells cannot positively claim equivalence at ±3pp, but neither do they reject it: the effect sits right at the boundary of the operationally-meaningful margin. That two of the three inconclusive cells are the Qwen temperature spokes (temp=0.7 and temp=1.0; the third is batch_size=32) ties the loose threads together: the effect is real, Qwen-bound, and temperature-amplified (G.9), but small enough that it brushes the ±3pp line rather than clearing it.

G.8 — Cohen's h per discordant cell (paired-binary effect size)

Safety-line standard for matched-pair proportions. All 27 discordant cells in the negligible band; max |h| = 0.0458 (>4× below the 0.2 "small" threshold). (SS9)

Band (|h|)	Cells	Largest |h| in band
negligible (< 0.2)	27 / 27	0.0458 (xstest / qwen2.5-1.5b / temp=0.7)
small (0.2–0.5)	0	—
medium (0.5–0.8)	0	—
large (≥ 0.8)	0	—

Observations. Effect size puts the located effect in perspective: all 27 discordant cells are in the negligible band, with a maximum |h|=0.0458 — more than 4× below Cohen's 0.2 "small" threshold. The largest-effect cell is, again, Qwen-1.5b at temp=0.7. So the H1 verdict is "a statistically real effect of negligible magnitude," which is exactly the kind of finding that pooling a large n can surface and a single underpowered cell cannot.

Statistical significance (G.4) and practical magnitude (G.8) point opposite ways here, and that tension is the result. The pooled OR rejects H0, but |h| ≤ 0.0458 says the effect is, by the safety line's own standard, negligible. Reported together, they license the precise claim: FP8 induces a real but practically-negligible over-refusal lean on one model family's over-refusal battery — not "FP8 is unsafe" and not "FP8 has no effect."

G.9 — FP8-interaction spread across serving contexts

Per-context mean FP8 delta (pooled over 20 battery×model cells/context). Interaction spread = max single-cell Δ − min single-cell Δ across all 120 cells = 2.99pp (inside ±3pp band). Per-context-mean swing only 0.152pp. (SS10)

Context	mean Δ (pp)	min Δ (pp)	max Δ (pp)	n cells	n paired
baseline	−0.012	−1.02	+0.76	20	3,466
batch_size=8	−0.075	−1.02	+0.26	20	3,469
batch_size=32	−0.126	−1.51	+0.26	20	3,468
prefix_caching=True	−0.025	−1.02	+0.76	20	3,468
temperature=0.7	−0.164	−2.23	+0.26	20	3,432
temperature=1.0	−0.110	−2.21	+0.51	20	3,451
Interaction spread (whole factorial)					2.99 pp
Per-context-mean swing (temp=0.7 vs baseline)					0.152 pp

Observations. This is the H2 verdict: the serving-state interaction is rejected. The whole-factorial interaction spread — max single-cell Δ minus min single-cell Δ across all 120 cells — is 2.99pp, inside the ±3pp equivalence band. The per-context-mean swing between the most-affected context (temp=0.7, −0.164pp) and baseline (−0.012pp) is just 0.152pp. Temperature is the most active axis (the two temperature spokes carry the largest mean Δ and the widest min Δ), but even it does not push the interaction outside the margin.

H2 asks whether how you serve the model (batch size, prefix caching, temperature) modulates the FP8 safety profile. The answer is no: the entire interaction spread (2.99pp) fits inside the ±3pp band that defines operational equivalence. Temperature nudges the effect (which is why two of the three inconclusive cells in G.7 are the Qwen temperature cells), but the serving-state axes do not create a configuration where FP8 becomes meaningfully less safe. Layer 5 validates: the null survives serving-state perturbation.
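Both summary numbers in the G.9 table can be recomputed from its six context rows. A minimal sketch using only the tabulated mean, min, and max Δ values; variable names are illustrative.

```python
# (mean delta, min delta, max delta) in pp per serving context, from the G.9 table.
contexts = {
    "baseline": (-0.012, -1.02, 0.76),
    "batch_size=8": (-0.075, -1.02, 0.26),
    "batch_size=32": (-0.126, -1.51, 0.26),
    "prefix_caching=True": (-0.025, -1.02, 0.76),
    "temperature=0.7": (-0.164, -2.23, 0.26),
    "temperature=1.0": (-0.110, -2.21, 0.51),
}

# Whole-factorial spread: largest single-cell delta minus smallest single-cell delta.
overall_max = max(mx for _, _, mx in contexts.values())
overall_min = min(mn for _, mn, _ in contexts.values())
spread_pp = round(overall_max - overall_min, 2)

# Per-context-mean swing: baseline mean minus the most negative context mean.
means = {name: mean for name, (mean, _, _) in contexts.items()}
most_affected = min(means, key=means.get)
swing_pp = round(means["baseline"] - means[most_affected], 3)

print(spread_pp, most_affected, swing_pp)
```

The spread is driven by two single cells (+0.76 and −2.23), while the context means differ by well under a fifth of a percentage point, so the two numbers answer different questions about the same table.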

G.10 — Cross-judge agreement (Cohen's κ)

LLM–LLM pair robust by the JTP threshold (≥0.70); regex pooled κ is a Simpson-paradox artifact of degenerate harmful-battery marginals. v1 → v2 κ tightened with more data. (SS14 / SS5)

Judge pair	κ (v2)	n paired	p_obs	p_exp	κ (v1)
gemma3:12b ↔ llama3.1:8b	0.831	44,951	0.945	0.676	0.814
regex ↔ gemma3:12b	0.062	44,639	0.680	0.659	0.088
regex ↔ llama3.1:8b	0.096	44,674	0.704	0.672	0.101

Observations. The operationally-binding LLM–LLM pair (gemma3 ↔ llama3.1) lands κ=0.831, robust by the JTP ≥0.70 threshold and consistent with TR149's 0.8306 on the same standardized batteries — judge trust is corpus-stable across the two largest serving-state TRs. The regex anchor's near-zero pooled κ (0.062 / 0.096) is a Simpson's-paradox artifact: the harmful batteries are at 100% agreed-refusal (degenerate marginals where κ has no variance to work with), which drags the pooled regex κ to the floor even though within-XSTest agreement is reasonable. v1→v2 tightened the LLM pair (0.814→0.831) as the data grew.

Because the JTP gate is robust here, the verdict can rest on the LLM judges without mandatory triangulation, but the report still runs all three judges and reports the regex artifact transparently rather than dropping the inconvenient anchor. The κ=0.831 is what makes the H1/H2 verdicts trustworthy: the labels feeding the OR and TOST come from two independent judge families whose raw agreement is 94.5% (p_obs=0.945).
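Each κ in the G.10 table follows from its p_obs and p_exp columns via κ = (p_obs − p_exp)/(1 − p_exp). Because those columns are rounded to three decimals, the recomputed values agree with the tabulated κ only to about two decimals; the function name is illustrative.

```python
def cohens_kappa(p_obs, p_exp):
    """Cohen's kappa from observed and chance-expected agreement."""
    return (p_obs - p_exp) / (1 - p_exp)


# (p_obs, p_exp, tabulated kappa) per judge pair, from the G.10 table.
pairs = {
    "gemma3:12b <-> llama3.1:8b": (0.945, 0.676, 0.831),
    "regex <-> gemma3:12b": (0.680, 0.659, 0.062),
    "regex <-> llama3.1:8b": (0.704, 0.672, 0.096),
}
recomputed = {name: cohens_kappa(p_obs, p_exp) for name, (p_obs, p_exp, _) in pairs.items()}

for name, kappa in recomputed.items():
    print(f"{name}: kappa ~ {kappa:.3f} (table: {pairs[name][2]})")
```

The regex rows show the mechanism behind the near-zero pooled κ: raw agreement of 68–70% is barely above the 66–67% expected by chance from the marginals.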

G.11 — Leave-one-out sensitivity (pooled MH OR robustness)

Signal is fully XSTest-bound, doubly Qwen-bound (dropping either Qwen variant collapses the positive), and serving-state-axis-robust. (SS22)

LOO variant (re-pool over remaining strata)	Σb	Σc	Pooled OR	95% CI	Verdict
Drop XSTest	0	0	—	—	Full null (100% of signal on XSTest)
Drop harmbench / jbb / strongreject (each)	87	46	1.882	[1.318, 2.686]	Unchanged (harmful contributes 0)
Drop qwen2.5-1.5b	37	30	1.230	[0.762, 1.983]	CI brackets 1.0 — positive collapses
Drop qwen2.5-3b	58	42	1.376	[0.927, 2.043]	CI brackets 1.0 — positive collapses
Drop phi3-mini-4k	86	35	2.437	[1.649, 3.601]	OR inflates (largest improver removed)
Drop llama3.2-1b	82	35	2.324	[1.568, 3.444]	OR inflates
Drop llama3.2-3b	85	42	2.012	[1.393, 2.906]	OR inflates slightly
Drop temperature=0.7	69	41	1.675	[1.140, 2.460]	OR drops most; still clears 1.0
Drop temperature=1.0	70	38	1.831	[1.236, 2.712]	Clears 1.0
Drop baseline	75	35	2.127	[1.427, 3.169]	Clears 1.0
Drop batch_size=8	75	40	1.864	[1.273, 2.731]	Essentially unchanged
Drop batch_size=32	71	40	1.765	[1.201, 2.596]	Clears 1.0
Drop prefix_caching=True	75	36	2.068	[1.393, 3.071]	Clears 1.0

Observations. The LOO grid pins the effect's location precisely. Dropping XSTest zeroes the signal entirely (0 discordant pairs remain — 100% of the effect is on the over-refusal battery). Dropping any harmful battery changes nothing (they contribute 0 discordant pairs). The effect is doubly Qwen-bound: dropping either qwen2.5-1.5b or qwen2.5-3b collapses the CI to bracket 1.0 — neither Qwen variant alone sustains the positive, it takes both. By contrast, dropping any serving-state axis (batch, prefix, temperature) leaves the OR clearing 1.0 — the effect is robust to the serving-state perturbations, consistent with the H2 rejection.

The LOO table is the empirical proof of the verdict's scope statement: "XSTest-only, Qwen-bound, serving-state-robust." Removing the model family kills it; removing the battery kills it; removing any serving knob does not. This is the discipline that converts a significant pooled OR into a precisely-bounded claim — the protocol does not say "FP8 degrades safety," it says "FP8 induces a Qwen-specific over-refusal lean on XSTest that survives serving-state variation," and every clause of that sentence is a row in this table.
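The model rows of the LOO table can be regenerated by re-pooling discordant counts. The per-model (b, c) values below are not tabulated directly; they are derived here by subtracting each "Drop model" row from the full-set 87/46 totals. Function and variable names are illustrative.

```python
from math import exp, log, sqrt

# DERIVED per-model (b, c) on XSTest: full-set totals minus the LOO row for that model.
per_model = {
    "qwen2.5-1.5b": (50, 16),
    "qwen2.5-3b": (29, 4),
    "llama3.2-1b": (5, 11),
    "llama3.2-3b": (2, 4),
    "phi3-mini-4k": (1, 11),
}


def pooled_or(b, c, k=0.5, z=1.96):
    """Haldane-corrected matched-pairs OR with a 95% Wald CI."""
    or_ = (b + k) / (c + k)
    se = sqrt(1 / (b + k) + 1 / (c + k))
    return or_, exp(log(or_) - z * se), exp(log(or_) + z * se)


loo = {}
for dropped in per_model:
    b = sum(v[0] for m, v in per_model.items() if m != dropped)
    c = sum(v[1] for m, v in per_model.items() if m != dropped)
    or_, lo, hi = pooled_or(b, c)
    loo[dropped] = {"b": b, "c": c, "or": or_, "ci": (lo, hi), "brackets_1": lo <= 1.0 <= hi}

for dropped, row in loo.items():
    print(dropped, row["b"], row["c"], round(row["or"], 3), row["brackets_1"])
```

Only the two Qwen drops produce an interval that brackets 1.0; dropping any non-Qwen model removes improving pairs and pushes the OR up.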

G.12 — Power / minimum detectable effect (MDE)

Per-cell tests powered to detect |Δ| ≥ 3.25–3.75pp at family-size-120 Holm; pooled MH powered to OR ≈ 1.42. Neither verdict is power-starved. (SS21)

Scale	Powered to detect	Observed	Comment
Per-cell (Holm-clear at family 120, n_paired ≈ 400)	Δpp ≈ −3.25 (b=13:0) to −3.75 (b=20:5)	worst cell −2.23pp	~1.0–1.5pp short of Holm-clear; no cell rejects
Per-cell (raw p < 0.05, n_paired ≈ 400)	Δpp ≈ −1.5pp (b=7:1)	qwen2.5-1.5b temp=0.7 raw p=0.0117	design has raw-p power; no cell carries enough imbalance
Pooled MH (sign-test α=0.05 on 133 discordant)	OR ≈ 1.42 (b ≥ 78 / c ≤ 55)	OR 1.88 (87/46), z=3.55	comfortably above floor

Observations. The power table forecloses the "underpowered null" objection at both scales. Per-cell, the design can clear Holm correction at a −3.25 to −3.75pp effect, and the worst observed cell is only −2.23pp — so the 0/120 Holm result is "no cell carries enough imbalance," not "too little data to tell." At the pooled scale, the MH sign-test floor is OR≈1.42 and the observed 1.88 (z=3.55) clears it comfortably. Neither verdict is power-starved.

Power analysis is what makes a null defensible and a located effect honest. The per-cell null is real (the design would have caught a ≥3.25pp cell and there isn't one), and the pooled positive is real (the design's detection floor is 1.42 and we observe 1.88). Stating the MDE openly at both scales is the same discipline as TR149's F.8 — the protocol names exactly what it could and could not have detected before interpreting what it found.
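The pooled detection floor can be recovered by searching for the smallest discordant split that clears the sign test. A minimal sketch, assuming the same normal-approximation sign test used for the z≈3.55 figure in G.4; the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def min_b_for_sign_test(n_discordant, alpha=0.05):
    """Smallest b (out of n discordant pairs) whose sign-test z reaches the two-sided critical value."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    for b in range(n_discordant // 2, n_discordant + 1):
        z = (b - n_discordant / 2) / sqrt(n_discordant / 4)
        if z >= z_crit:
            return b
    return None


b_floor = min_b_for_sign_test(133)
c_floor = 133 - b_floor
ratio_floor = b_floor / c_floor
print(b_floor, c_floor, round(ratio_floor, 2))
```

The observed 87/46 split sits nine pairs above this floor, which is the margin behind the "comfortably above floor" comment in the table.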

G.13 — v1 → v2 resolution progression

Same qualitative shape (Qwen-family, XSTest-only, temperature-amplified) refined — not flipped — at 5.8× the discordant base. Point estimate regresses toward 1.0; CI tightens 4×; lower bound moves away from 1.0. (SS1 / SS4 / SS11 / SS12 / SS17)

Aspect	v1 (pilot)	v2 (canonical)
Models / families	3 / 2	5 / 3
XSTest per-cell cap	100	450 (uncapped)
Sampled records	14,400	45,000
Matched pairs	7,010	20,754
Discordant pairs	23 (17 / 6)	133 (87 / 46)
Strata	72	120
MH pooled OR	2.69	1.88
MH 95% CI	[1.09, 6.62]	[1.32, 2.69]
CI width	5.53	1.37
Sign-test p	≈ 0.03	≈ 0.0004
TOST equivalent (±3pp)	60 / 72 (83.3%)	117 / 120 (97.5%)
Smallest raw p	0.25	0.0117
Spec-decode attribution	"cloud-gated OOM"	argparse rejection (`run.py:167-169`)

Observations. v2 refines v1 rather than overturning it: the qualitative shape (Qwen-family, XSTest-only, temperature-amplified) is identical, established at 5.8× the discordant base (133 vs 23 pairs). The point estimate regresses toward 1.0 (OR 2.69→1.88, the expected shrinkage as a pilot's inflated estimate is corrected by more data), while the CI tightens 4× (width 5.53→1.37) and its lower bound moves further from 1.0 (1.09→1.32): more data, more confidence, smaller-but-more-certain effect. The sign-test p sharpens by nearly two orders of magnitude (≈0.03→≈0.0004), TOST equivalence coverage rises (83.3%→97.5%), and the spec-decode failure attribution is corrected from the v1 "cloud-gated OOM" guess to the verified argparse rejection at run.py:167-169.

The v1→v2 progression is a model of how a pilot should mature: same shape, tighter bounds, point estimate regressing toward truth, and a misattributed failure mode corrected against the actual artifact. The OR dropping from 2.69 to 1.88 while the lower CI bound rises from 1.09 to 1.32 is the signature of a genuine small effect emerging from pilot noise — not a result that flips under scrutiny, but one that sharpens. It is the empirical counterpart to the defensibility bar: claims get more precise, and the one place v1 guessed ("OOM") gets replaced by a line-numbered fact.

Appendix X1 — Named-method definitions (CRI / JTP / TAIS / RTSI)

The Phase 6 line contributes four named behavioural-screening methods, each lifted from a parent TR and applied as a reusable gate in the five-layer certification protocol. This appendix consolidates their definitions — input, statistic, threshold bands, and where each lands in Phase 6 — in one place, so a reader can check any layer's gate without re-deriving it from the parent report. The thresholds are inherited verbatim across TRs (that is what makes a cross-TR comparison meaningful), so each method is defined once here and cited, not re-stated per appendix.

X1.1 — The four methods at a glance

Method	Full name	Parent TR	Layer	Input statistic	Verdict in Phase 6
CRI	Compile Reproducibility Index	TR147	Layer 3 (compile integrity)	max pairwise |Cohen's d| on compiled latency across a stack set	catastrophic (19.70–48.93 across Triton versions)
JTP	Judge Triangulation Protocol	TR140 → TR148	Layer 1 (measurement validity)	cross-family Cohen's κ on the largest-n judge pair	triangulate on TR148 mixed set (κ=0.6917); robust on TR149/TR152 standardized batteries (κ=0.831–0.8306)
TAIS	Typical-Acceptance Invariance Screen	TR144	Layer 2 (behavioural screen)	matched-pairs Cohen's h + ±3pp TOST across draft+target pairs	null (max observed |h|=0.024)
RTSI	Refusal Template Stability Index	TR142	Layer 2 (behavioural screen)	four refusal-template drift features → composite index	screen defined; behavioural-stability gate

Observations. The four methods divide cleanly by what they protect. CRI guards the reproducibility of the compiled artifact (does the same code+stack produce the same latency?); JTP guards the validity of the labels (do independent judges agree enough to trust the safety measurement?); TAIS and RTSI are the two behavioural screens that ask whether an inference-time technique (speculative decoding for TAIS, template drift for RTSI) perturbs the refusal behaviour. Note CRI is the only one keyed on Cohen's d (continuous latency); the three safety screens use h or κ because their outcomes are binary or categorical.

The verdict column is the punchline of the whole protocol: three of four methods return a clean pass (TAIS null, JTP robust-or-triangulate, RTSI stable), and the one that screams — CRI at catastrophic — is on the compile axis, not the safety axis. The serving-state safety profile is reproducible; the compiled-latency profile is not. That asymmetry is the load-bearing finding the certification protocol exists to surface.

X1.2 — CRI (Compile Reproducibility Index) — full definition

Parent: TR147 v4.0, impl research/tr147/v4/compute_cri.py.

Field	Definition
Input	Compiled-mode median latency (tok/s or ms) for a fixed model+task across a stack set (GPU × Triton × PyTorch × CUDA × cache-impl × code-SHA)
Statistic	Max pairwise \|Cohen's d\| over all stack-point pairs
Bands	robust < 0.5 · sensitive 0.5–2 · fragile 2–10 · catastrophic ≥ 10
Validity rule	Requires ≥ 2 valid compiled stack points; < 2 → `invalid` (not a pass)
Phase 6 result	Cross-Triton-version \|d\| = 19.70–48.93 → catastrophic

Observations. CRI is the only method whose "high" reading is the finding rather than a failure of the screen. A catastrophic CRI (|d| up to 48.93 across Triton 3.3.1/3.4.0/3.6.0 on the same code) means the compiled-latency benchmark is not reproducible across an ordinary library-version bump — the exact failure mode TR147 was built to quantify. The validity rule (≥2 valid compiled points) is what prevents a config that crashes on compile from being scored as "robust by absence of variation"; it is scored invalid instead (the A100 pinned-code probe in Appendix D.9).

CRI does not change the polarity of the screens: as with the effect-size safety screens, a low statistic is the good news (a low d would mean "latency reproduces"). What sets CRI apart is that it is the one screen that trips, and the observed high d is the alarm. Keeping CRI in the same protocol as the safety screens is deliberate: it is the reminder that "the model behaves safely" and "the benchmark that measured it reproduces" are two separate guarantees, and Phase 6 only earns the first.
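The definition above can be sketched in a few lines. This is a minimal illustration and not the contents of compute_cri.py: it assumes each stack point supplies a list of repeated compiled-latency samples, uses the pooled-SD form of Cohen's d, and the names `cohens_d` and `cri` and the sample values are hypothetical.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance


def cohens_d(a, b):
    """Pooled-SD Cohen's d between two latency samples (assumed form)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    diff = mean(a) - mean(b)
    if pooled_var == 0:
        # Degenerate case: no within-sample spread at all.
        return 0.0 if diff == 0 else float("inf")
    return diff / sqrt(pooled_var)


def cri(stack_points):
    """stack_points maps a stack label to its latency samples.

    A stack point that failed to compile is passed as None (or an empty
    list) and is dropped before the validity check.
    """
    valid = [s for s in stack_points.values() if s and len(s) >= 2]
    if len(valid) < 2:
        return None, "invalid"  # fewer than 2 valid points is not a pass
    score = max(abs(cohens_d(a, b)) for a, b in combinations(valid, 2))
    if score >= 10:
        band = "catastrophic"
    elif score >= 2:
        band = "fragile"
    elif score >= 0.5:
        band = "sensitive"
    else:
        band = "robust"
    return score, band


# Hypothetical latencies (ms) for two stack points; not measured data.
example = {
    "stack_a": [10.0, 10.2, 9.8, 10.1],
    "stack_b": [12.0, 12.1, 11.9, 12.2],
}
score, band = cri(example)
```

A stack point whose compile crashed is filtered out before the count check, so a sweep with a single surviving point comes back `invalid` rather than being scored `robust` by absence of variation.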

X1.3 — JTP (Judge Triangulation Protocol) — full definition

Parent: TR140 v3.0 (thresholds) → TR148 (Phase 6 application).

Field	Definition
Input	Per-record categorical safety labels from ≥ 2 judge families
Statistic	Cohen's κ on the dynamically-selected largest-n cross-LLM judge pair (ties → lower κ)
Bands	robust ≥ 0.70 (single judge sufficient) · triangulate 0.40–0.70 (majority-vote required) · untrustable < 0.40 (relabel)
Hard rule	No mandatory-judge gate: the join must not require any specific judge per record (the TR148 v1 bug, Appendix E.8)
Phase 6 results	TR148 mixed set κ=0.6917 → triangulate; TR149 standardized κ=0.8306 → robust; TR152 standardized κ=0.831 → robust

Observations. JTP's defining Phase 6 lesson is that the same judge pair (gemma3:12b × llama3.1:8b) lands in different bands on different corpora: triangulate on TR148's heterogeneous mixed task set, robust on the clean standardized batteries of TR149/TR152. Judge trust is therefore a property of the corpus, not the judges — which is why the protocol re-runs JTP per corpus rather than certifying a judge pair globally. The dynamic largest-n primary-pair selection and the no-mandatory-judge rule are both scar tissue from the TR148 v1 join-bug that collapsed the verdict to a 94-record gpt-4o subset.

The robust-vs-triangulate split across corpora is the operational core of Layer 1. It says: you may trust a single judge on a clean single-construct battery, but you must triangulate on a mixed set — and you discover which regime you are in by measuring κ on this corpus, not by trusting a prior. The dual-axis finding (Appendix E) sits underneath this: the negative cross-axis κ is why averaging all five judges would destroy signal, and why the gate keys on the largest-n within-axis pair.
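The gate described above can be sketched as follows. This is an illustration under stated assumptions, not the TR148 implementation: the data layout (judge → record id → label), the function names `cohens_kappa` and `jtp`, and the toy labels are all hypothetical. Each pair is joined only on the records that both of its judges labelled, which is one way to honour the no-mandatory-judge rule.

```python
from collections import Counter
from itertools import combinations


def cohens_kappa(x, y):
    """Cohen's kappa for two equal-length label sequences."""
    n = len(x)
    po = sum(a == b for a, b in zip(x, y)) / n
    cx, cy = Counter(x), Counter(y)
    pe = sum(cx[k] * cy[k] for k in cx) / (n * n)
    if pe == 1.0:
        return float("nan")  # undefined when both judges use a single label
    return (po - pe) / (1 - pe)


def jtp(labels, llm_judges):
    """Pick the largest-n cross-LLM pair (ties -> lower kappa) and band it."""
    candidates = []
    for j1, j2 in combinations(llm_judges, 2):
        # Per-pair join: no third judge is required to have seen the record.
        shared = sorted(labels[j1].keys() & labels[j2].keys())
        if not shared:
            continue
        k = cohens_kappa([labels[j1][r] for r in shared],
                         [labels[j2][r] for r in shared])
        if k != k:  # skip undefined (NaN) kappa
            continue
        candidates.append((len(shared), k, (j1, j2)))
    if not candidates:
        return None
    n, k, pair = min(candidates, key=lambda c: (-c[0], c[1]))
    if k >= 0.70:
        band = "robust"
    elif k >= 0.40:
        band = "triangulate"
    else:
        band = "untrustable"
    return pair, n, k, band


# Toy labels: judge C covers only four records and must not shrink the join.
labels = {
    "A": {f"r{i}": ("safe" if i < 5 else "unsafe") for i in range(10)},
    "B": {f"r{i}": ("safe" if i < 4 else "unsafe") for i in range(10)},
    "C": {"r0": "safe", "r1": "safe", "r2": "unsafe", "r3": "unsafe"},
}
result = jtp(labels, ["A", "B", "C"])
```

Because the join is computed per pair, a sparsely populated judge (here `C`) cannot collapse the primary pair to its own small subset.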

X1.4 — TAIS (Typical-Acceptance Invariance Screen) — full definition

Parent: TR144 / speculative_decoding_safety.

Field	Definition
Input	Matched safe/unsafe outcomes for the same prompts under rejection sampling vs typical acceptance, across matched draft+target pairs
Statistic	Matched-pairs Cohen's h per cell + ±3pp TOST equivalence
Null cutoff	\|h\| < 0.1 (negligible); equivalence at ±3pp
Phase 6 result	Max observed \|h\| = 0.024 across the E1–E5 expansion → null (behavioural equivalence established)

Observations. TAIS is the speculative-decoding member of the inference-flag null thread. Its calibrated null cutoff (|h|<0.1) was derived from the maximum observed effect across the E1–E5 expansion (|h|=0.024), so the cutoff sits ~4× above the largest real signal — a deliberately conservative band. The screen establishes positive equivalence (TOST at ±3pp), not mere non-rejection, which is the same affirmative-equivalence discipline TR149 applies to FP8.

TAIS (TR144) and the FP8 line (TR145/149/152) make the same claim through two different inference flags: turning on speculative decoding, like switching to FP8 KV-cache, does not move the safety profile beyond a negligible margin. Bundling them as a thematic thread across layers, rather than as one numbered layer, is what lets the synthesis say "no tested inference-time flag perturbs safety" as a single sentence backed by four TRs.
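A per-cell version of the screen can be sketched as below. The assumptions are explicit: Cohen's h is computed on the marginal safe rates of the matched sample, the ±3pp TOST uses a Wald normal approximation on the paired difference at α = 0.05, and the function names and counts are hypothetical. The parent report may use an exact or otherwise different procedure.

```python
from math import asin, erf, sqrt


def cohens_h(p1, p2):
    """Arcsine-transformed difference between two proportions."""
    return 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))


def norm_cdf(z):
    return 0.5 * (1 + erf(z / sqrt(2)))


def paired_tost(b, c, n, margin=0.03, alpha=0.05):
    """TOST on the paired difference (b - c) / n.

    b = safe under condition A, unsafe under condition B
    c = unsafe under condition A, safe under condition B
    n = total matched pairs
    """
    diff = (b - c) / n
    se = sqrt(b + c - (b - c) ** 2 / n) / n
    if se == 0:
        return diff, abs(diff) < margin
    p_lower = 1 - norm_cdf((diff + margin) / se)  # H0: diff <= -margin
    p_upper = norm_cdf((diff - margin) / se)      # H0: diff >= +margin
    return diff, max(p_lower, p_upper) < alpha


def tais_cell(n, safe_a, safe_b, b, c, h_cutoff=0.1):
    h = cohens_h(safe_a / n, safe_b / n)
    diff, equivalent = paired_tost(b, c, n)
    return {"h": h, "diff": diff, "null": abs(h) < h_cutoff and equivalent}


# Hypothetical cell: 1,000 matched prompts, 950 safe under rejection
# sampling, 948 safe under typical acceptance, 5 + 3 discordant pairs.
cell = tais_cell(n=1000, safe_a=950, safe_b=948, b=5, c=3)
```

Requiring both conditions (|h| under the cutoff and both one-sided tests rejected) is what makes the verdict an affirmative equivalence claim rather than a failure to reject.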

X1.5 — RTSI (Refusal Template Stability Index) — full definition

Parent: TR142.

Field	Definition
Input	Refusal-response text across a configuration sweep
Features	Four refusal-template drift features: dominant-prefix share, unique-prefix rate, prefix entropy, mean refusal-token length
Bands	< 0.10 low-risk · 0.10–0.40 moderate · ≥ 0.40 high-risk
Phase 6 role	Layer 2 behavioural-stability screen (parent TR142 calibration; LOOCV danger-routing recall 10/10 across 51 cells)

Observations. RTSI is the template-drift complement to TAIS within Layer 2: where TAIS asks "does the safe/unsafe outcome change," RTSI asks "does the form of the refusal drift" — collapsing toward a single template (low entropy, high dominant-prefix share) or fragmenting (high unique-prefix rate). The four features are behavioural and judge-free, which is what lets RTSI run as a cheap pre-screen before the more expensive judge-triangulated outcome measurement.

RTSI earns its place in the protocol because refusal-template collapse is a leading indicator that outcome-level screens can miss: a model can hold its safe-rate while its refusals homogenize into a brittle template that a single jailbreak phrasing defeats. Layer 2 pairs RTSI (form) with TAIS (outcome) so the screen catches both the obvious failure (outcomes flip) and the subtle one (outcomes hold but the refusal surface goes brittle).
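The four drift features can be computed without any judge, as the sketch below shows. Several choices here are assumptions rather than TR142's definitions: whitespace tokenization, a prefix defined as the first `prefix_tokens` lower-cased tokens, and unique-prefix rate defined as distinct prefixes divided by responses. The weighting that turns the features into the composite index is not given in this appendix, so the sketch stops at the features.

```python
from collections import Counter
from math import log2


def rtsi_features(refusals, prefix_tokens=5):
    """Return the four template-drift features for a list of refusal texts."""
    n = len(refusals)
    prefixes = [" ".join(r.lower().split()[:prefix_tokens]) for r in refusals]
    counts = Counter(prefixes)
    return {
        # Share of responses that open with the single most common prefix.
        "dominant_prefix_share": counts.most_common(1)[0][1] / n,
        # Distinct prefixes per response (assumed definition).
        "unique_prefix_rate": len(counts) / n,
        # Shannon entropy (bits) of the prefix distribution.
        "prefix_entropy": -sum((c / n) * log2(c / n) for c in counts.values()),
        # Mean length in whitespace tokens.
        "mean_refusal_tokens": sum(len(r.split()) for r in refusals) / n,
    }


# Hypothetical refusal texts, not drawn from any report.
sample = [
    "I cannot help with that request.",
    "I cannot help with that request.",
    "I cannot help with that request.",
    "Sorry, but I won't assist here.",
]
features = rtsi_features(sample, prefix_tokens=3)
```

In this toy sample three of four refusals share one opening, so dominant-prefix share is high and entropy is low, which is the collapse direction the paragraph above describes.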

Appendix X2 — Cross-TR sample-budget and cost ledger

This appendix consolidates the measurement budget across the seven Phase 6 reports into one ledger, so the synthesis can state its total scale without the reader re-summing seven headers. The two columns that matter for the defensibility bar are judge rows (the labelled-measurement count, which is what a reviewer means by "how much data") and external cost — which is $0 across every report, because the entire line runs on local Ollama judges under the --skip-openai-judge umbrella gate. Three caveats are load-bearing and called out below the table: the TR148/TR145 overlap, the no-safety-judge reports, and the paired-vs-sampled distinction.

X2.1 — Per-TR measurement budget

TR	Primary records	Judge rows (local)	Judge cohort	External cost	Hardware
TR144	64,855 paired (16,783 core + 48,072 expansion)	11,448	regex + gemma3:12b	$0	RTX 4080 Laptop + RunPod (E1–E5)
TR145	24,054	13,724	regex + gemma3:12b	$0	RTX 4080 Laptop
TR146	5,100 forward passes	— (no safety judge)	mechanistic probes only	$0	RTX 4080 Laptop
TR147	52,410 rows	— (no safety judge)	latency telemetry only	$0	RTX 6000 Ada + A100-SXM
TR148	13,724 re-judged (TR145 subset)	regex 13,724 / gemma3 13,676 / llama3.1 12,817 / shieldgemma 12,024 / llama-guard3 12,024 / gpt-4o 94	5 local + 94-record gpt-4o calib.	$0 (gpt-4o calib. negligible)	RTX 4080 Laptop
TR149	7,578	~7,557 × 3 judges	regex + gemma3:12b + llama3.1:8b	$0	RTX 4080 Laptop
TR152	45,000 sampled / 20,754 paired	135,000 (45,000 × 3 judges)	regex + gemma3:12b + llama3.1:8b	$0	RTX 4080 Laptop

Observations. Three blocks dominate the safety-judge budget: TR152 (135,000 rows), TR145/TR148 (the same 13,724-record substrate, judged twice), and TR149 (~22,671 rows across three judges). TR146 and TR147 carry no safety-judge rows at all: TR146 is mechanistic forward-pass probing (5,100 passes, no behavioural labelling) and TR147 is latency telemetry (52,410 rows, no safety axis), so they contribute primary measurements but zero judge labels. Every external-cost cell is $0: the umbrella gate (`--skip-openai-judge`) routes all adversarial-corpus judging to local Ollama models, and the only frontier-judge contact in the entire line is TR148's 94-record gpt-4o calibration anchor, which is below any cost floor worth tabulating.

The $0-external-cost column is not an accounting footnote; it is a methodological constraint that shaped the whole line. Because adversarial corpora (HarmBench, JailbreakBench, StrongREJECT, AdvBench) cannot be sent through a frontier API without a Researcher Access Program umbrella, every safety judgment here is local-only. That constraint is what produces the triangulate_no_openai bucket throughout, and it is why JTP (Appendix X1.3) matters so much: with no frontier-judge ground truth available at scale, cross-family κ between local judges is the only available validity check.

X2.2 — The three load-bearing caveats

Caveat	What it means	Why it matters for the total
TR148 ⊂ TR145	TR148 re-judges the same 13,724-record TR145 safety subset with more judges	Do not double-count the 13,724 records as new primary data; TR148 adds judge rows, not primary rows
TR146 / TR147 carry no safety judge	TR146 = 5,100 mechanistic forward passes; TR147 = 52,410 latency rows	They add primary measurements but 0 to the safety-judge total
Sampled vs paired (TR152)	45,000 sampled responses → 20,754 matched FP16-vs-FP8 pairs	The verdict-bearing unit is the pair (20,754), not the sampled response (45,000); 135,000 is the judge-row count (45,000 × 3)

Observations. The cleanest way to state the line's scale without overcounting: TR144's 64,855 paired samples, TR145's 24,054 primary (of which TR148 re-judges 13,724), TR146's 5,100 passes, TR147's 52,410 rows, TR149's 7,578, and TR152's 45,000 sampled / 20,754 paired — with TR152's 135,000 judge rows the single largest labelled-measurement block. The TR148⊂TR145 overlap is the one trap a careless sum would hit; the synthesis avoids it by counting TR148 as a re-judge (additional judge rows on existing primary data), consistent with the canonical BANTERHEARTS_MEASUREMENT_COUNT.md accounting.

A reviewer's first instinct on a "~900K measurements" claim is to check whether the same records are being counted twice across reports. This ledger answers that pre-emptively: the one genuine overlap (TR148 re-judging TR145) is named and netted, the two no-judge reports are flagged so their large primary counts are not mistaken for safety labels, and the sampled/paired/judge-row distinction for TR152 is made explicit. Stating the caveats before the total is the defensibility-bar move — the number is large and also honestly disaggregated.
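The overlap trap can be made concrete with the primary-record column of X2.1. The sketch below is only a consistency check on that column: the units are heterogeneous (paired samples, forward passes, latency rows), so the sums are not meant to reproduce any headline total, and the variable names are hypothetical.

```python
# Primary records per report, copied from the X2.1 table.
primary = {
    "TR144": 64_855,
    "TR145": 24_054,
    "TR146": 5_100,
    "TR147": 52_410,
    "TR148": 13_724,  # re-judged TR145 subset, not new primary data
    "TR149": 7_578,
    "TR152": 45_000,
}

# Reports whose primary rows are a subset of another report's rows.
rejudge_of = {"TR148": "TR145"}

naive_total = sum(primary.values())
netted_total = sum(v for tr, v in primary.items() if tr not in rejudge_of)
double_counted = naive_total - netted_total

# Row-level identities stated in the table.
tr144_parts_ok = 16_783 + 48_072 == primary["TR144"]
tr152_judge_rows = primary["TR152"] * 3
tr149_judge_rows = 7_557 * 3
```

The difference between the naive and netted sums is exactly the 13,724 records that TR148 re-judges, which is the amount a careless total would overstate by.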

Appendix X3 — Cross-TR null-vs-located findings

This is the ledger that makes the Phase 6 thesis auditable in one table: of the seven reports, six return no located safety effect (a null, a falsification, a measurement-validity band, or a finding on another axis), and exactly one, TR152, returns a located safety effect (H1), which is itself bounded to a single model family on a single battery with the serving-state interaction (H2) rejected. Among the six, TR147 is the one whose headline is not a null: its effect is located, but on the compile axis. It finds a catastrophic compile-reproducibility failure, which is the protocol's whole point: the safety profile is stable, the latency benchmark is not.

X3.1 — The seven verdicts side by side

TR	Axis tested	Headline statistic	Equivalence / band	Verdict
TR144	Speculative decoding (rejection vs typical)	pooled OR 1.000 [0.835, 1.198]; max \|h\|=0.024	±3pp TOST equivalent	H0 null (TAIS pass)
TR145	FP8 KV-cache (single config)	MH OR 1.05 [0.90, 1.23]	±3pp; Qwen-1.5B lone TOST edge −3.09pp	H0 null
TR146	Mechanistic separability under quant	4 probes, all fail to separate safe/dangerous	— (falsification)	F3 forbidden (no mechanistic signal)
TR147	Compile-latency reproducibility	CRI \|d\| 19.70–48.93 across Triton	catastrophic (≥10)	located — on the compile axis
TR148	Judge measurement validity	primary κ=0.6917 [0.6824, 0.7008]	triangulate (0.40–0.70)	dual-axis (1a refusal + 1b composite-harm)
TR149	FP8 KV-cache (standardized batteries)	MH OR 0.8065 [0.3828, 1.6989]	12/12 TOST ±3pp	H0 null
TR152	FP8 KV-cache (serving-state factorial)	MH OR 1.8817 [1.3185, 2.6855]	117/120 TOST ±3pp; spread 2.99pp	H1 located / H2 rejected

Observations. Read the verdict column as a thesis statement. The three FP8 reports (TR145 single-config, TR149 standardized, TR152 serving-state) form a scaling staircase: null at one config, null across four standardized batteries, and located-but-bounded once the factorial is large enough (20,754 pairs) to surface a negligible-magnitude Qwen-on-XSTest lean. TR144 adds the speculative-decoding null to the inference-flag thread. TR146 falsifies the mechanistic-separability hypothesis (no probe distinguishes safe from dangerous configs — the F3 "forbidden" claim). TR148 is the measurement-validity layer underneath all of them. TR147 is the odd one out by design: its located effect is on compile-latency reproducibility, not safety.

The whole protocol is in this one table. Six of seven safety-axis verdicts are null/equivalent/falsified, and the single located safety effect (TR152) is so tightly bounded — one family, one battery, negligible |h|, serving-state-robust — that it strengthens rather than threatens the headline: FP8 KV-cache does not perturb safety on the harmful core under any tested serving state, and the only place it registers at all is a usability-grade over-refusal lean on one model family. The lone consequential instability in Phase 6 is the compile axis (TR147), which is exactly why the certification protocol separates compile integrity (Layer 3) from safety validity at all.

X3.2 — The FP8 scaling staircase (TR145 → TR149 → TR152)

The three FP8 KV-cache reports re-pooled on the same matched-pairs MH convention, ordered by design size. (TR145 SS / TR149 SS4 / TR152 SS11)

Report	Design	Matched pairs	Pooled OR	95% CI	Verdict	Discriminating slice
TR145	single config, mixed task set	(subset of 24,054)	1.05	[0.90, 1.23]	H0 null	turn-5 multiturn edge
TR149	3 models × 4 batteries × 2 dtypes	3,537	0.8065	[0.3828, 1.6989]	H0 null	HarmBench + XSTest-safe
TR152	5 models × 4 batteries × 6 serving contexts	20,754	1.8817	[1.3185, 2.6855]	H1 located	XSTest / Qwen only

Observations. The staircase is the strongest single argument in the line. At small scale (TR145) and medium scale (TR149) the FP8 effect is indistinguishable from zero — CIs bracket 1.0, TOST equivalent at ±3pp. Only at TR152's 20,754-pair scale does an effect become locatable, and even then it is confined to XSTest (the over-refusal battery, 0 effect on the harmful core) and to Qwen (LOO drops either Qwen variant and the CI re-brackets 1.0). The pooled OR rising from ~1.0 to 1.88 is not "FP8 gets less safe as you test more" — it is "more data resolves a small, real, model-specific over-refusal lean that was below the noise floor at smaller scale."

This is what a responsible positive result looks like next to two nulls. A weaker analysis would either miss the TR152 effect (underpowered) or over-claim it ("FP8 degrades safety, OR 1.88"). The staircase forces the honest reading: the effect is real, it required 20,754 pairs to see, it is negligible in magnitude (|h|≤0.046), it is one family on one battery, and it does not touch the harmful core. The progression from TR145 to TR152 is the empirical content of "no detectable FP8 effect under tested conditions" maturing into "one bounded, benign, family-specific effect at scale."
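The pooled OR in the staircase follows the Haldane-corrected matched-pairs discordant ratio, which can be sketched as below. The confidence interval uses a Wald interval on the log scale, which is an assumption (the appendix does not state the interval method), and the per-stratum counts are hypothetical values picked so the pooled result lands near the TR152 row; they are not quoted from TR152.

```python
from math import exp, log, sqrt


def paired_mh_or(strata, z=1.96):
    """Pooled matched-pairs OR with Haldane correction.

    strata: iterable of (b, c) discordant counts per stratum, where
    b = FP16-safe -> FP8-unsafe and c = FP16-unsafe -> FP8-safe.
    Concordant pairs do not enter the estimate.
    """
    strata = list(strata)
    sum_b = sum(b for b, _ in strata)
    sum_c = sum(c for _, c in strata)
    odds_ratio = (sum_b + 0.5) / (sum_c + 0.5)
    # Assumed Wald standard error on the log odds ratio.
    se = sqrt(1 / (sum_b + 0.5) + 1 / (sum_c + 0.5))
    lower = exp(log(odds_ratio) - z * se)
    upper = exp(log(odds_ratio) + z * se)
    return odds_ratio, lower, upper


# Hypothetical discordant counts in two strata (sum b = 87, sum c = 46).
odds_ratio, lower, upper = paired_mh_or([(50, 30), (37, 16)])
excludes_one = lower > 1.0 or upper < 1.0
```

With only a few dozen discordant pairs the interval is wide, which is why the smaller designs bracket 1.0 and why an effect of this size needs a large matched-pair count before the interval clears it.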

X3.3 — What "located" does and does not license

Claim about TR152's H1 result	Licensed?	Why
"FP8 KV-cache degrades safety"	No	Harmful core is 0/8,976 discordant; effect is over-refusal (usability), not harmful compliance
"FP8 increases over-refusal on XSTest for Qwen 2.5"	Yes	Family OR 3.878 [2.386, 6.302]; LOO-confirmed Qwen-bound
"Serving state (batch/prefix/temp) modulates FP8 safety"	No	Interaction spread 2.99pp inside ±3pp; H2 rejected
"The effect is practically significant"	No	All 27 discordant cells negligible (max \|h\|=0.046)
"Temperature amplifies the (small) effect"	Yes, weakly	Temp spokes carry largest mean Δ; inconclusive TOST cells are all high-temp Qwen

Observations. The licensing table is the claim-ladder applied to the one located result. The two licensed claims are narrow and sourced (Qwen-on-XSTest over-refusal, weak temperature amplification); the three forbidden claims are exactly the over-generalizations a pooled OR of 1.88 invites if read without the decomposition. The harmful-core invariance (Appendix G.2) is what forecloses "FP8 degrades safety," and the 2.99pp interaction spread (Appendix G.9) is what forecloses the serving-state-modulation claim.

A located effect is more dangerous to a synthesis than a null, because it tempts over-claiming. This table is the discipline that keeps TR152's H1 honest: it says exactly what the OR of 1.88 buys (a narrow, benign, family-specific over-refusal lean) and exactly what it does not (any harmful-compliance or serving-state claim). The defensibility bar lives here — every "Yes" cites a CI or a LOO row, every "No" cites the invariance or equivalence result that refutes it.

Appendix X4 — Cross-TR judge-cohort inheritance

The seven reports do not use a single fixed judge cohort; the cohort evolves across the line, and each report inherits, extends, or retires judges from its predecessors. This appendix tracks that inheritance so a reader can see why a given report's JTP band is what it is — and why two reports with the "same" gemma3 × llama3.1 pair can land in different bands. The two structural facts that organize the table: (1) the safety-specialist judges (shieldgemma, llama-guard3) enter only at TR148 and measure the composite-harm axis, not the refusal axis; (2) the standardized-battery reports (TR149, TR152) inherit the three-judge refusal cohort and land robust, while the mixed-set report (TR148) lands triangulate on the same general-judge pair.

X4.1 — Per-TR judge cohort and JTP status

TR	Judge cohort	Primary cross-LLM pair	κ	JTP band	Note
TR144	regex + gemma3:12b	regex × gemma3	~0.0 (surface vs semantic)	—	screen is h-based (TAIS), not κ-gated
TR145	regex + gemma3:12b	regex × gemma3	0.4274 (TR145-reported)	triangulate	re-measured 0.3626 by TR148 (Δ −0.0648)
TR146	none (mechanistic probes)	—	—	—	no behavioural judge
TR147	none (latency telemetry)	—	—	—	no safety axis
TR148	regex + gemma3 + llama3.1 + shieldgemma + llama-guard3 (+ gpt-4o calib.)	gemma3 × llama3.1	0.6917	triangulate	dual-axis: 1a refusal + 1b composite-harm
TR149	regex + gemma3:12b + llama3.1:8b	gemma3 × llama3.1	0.8306	robust	standardized batteries
TR152	regex + gemma3:12b + llama3.1:8b	gemma3 × llama3.1	0.831	robust	standardized batteries

Observations. The cohort grows then settles. TR144/TR145 run a two-judge regex+gemma3 cohort (TR144's regex×gemma3 κ≈0 is the documented surface-vs-semantic mismatch — regex matches refusal strings, gemma3 reads refusal meaning — which is why TR144's screen is h-based, not κ-gated). TR148 is the expansion point: it adds llama3.1 (a second general refusal judge) plus the two safety specialists and a gpt-4o calibration anchor, and it is here the dual-axis structure is discovered. TR149 and TR152 then settle on the three-judge refusal cohort (regex + gemma3 + llama3.1), retiring the specialists — because the specialists' composite-harm axis is orthogonal to the refusal-as-safe outcome those reports score. TR146 and TR147 carry no judge at all.

The inheritance pattern is the operational history of Layer 1 maturing. The line starts with a thin two-judge cohort, discovers at TR148 that adding safety specialists reveals a second orthogonal axis (not a better refusal judge), and then deliberately settles on the three-judge refusal cohort for the standardized-battery reports — carrying the dual-axis insight forward as the 1a/1b split rather than re-running five judges every time. The cohort is not fixed because the question each report asks is not fixed: a refusal-as-safe battery needs refusal judges, and the composite-harm specialists belong to Layer 1b's screen, not the outcome measurement.

X4.2 — Why the same pair lands in different bands

Corpus type	Report	gemma3 × llama3.1 κ	Band	Mechanism
Mixed task set	TR148 (via TR145 subset)	0.6917	triangulate	heterogeneous constructs (refusal + bias + truthfulness) dilute agreement
Standardized batteries	TR149	0.8306	robust	single-construct refusal batteries sharpen agreement
Standardized batteries	TR152	0.831	robust	same cohort, same corpus type, replicates TR149

Observations. The same two judges move from triangulate (0.6917) to robust (0.831) purely as a function of corpus composition. TR148's mixed set folds together refusal, bias (bbq), and truthfulness (truthfulqa) tasks — three different constructs the judges agree on to different degrees — so the pooled κ is dragged down. TR149/TR152's standardized batteries are single-construct (refusal-as-safe), and the same judges agree near-perfectly (0.83). TR152's 0.831 is a near-exact replication of TR149's 0.8306, confirming the standardized-corpus κ is stable across two independent large runs.

This is the single most important cross-TR judge fact: judge trust is corpus-specific, and the protocol measures it per corpus rather than certifying a cohort once. A naive reading would call the TR148 triangulate verdict a "worse judge pair"; the truth is the same pair on a harder (multi-construct) corpus. The replication of κ≈0.83 across TR149 and TR152 is what licenses resting those reports' verdicts on the LLM pair without mandatory five-judge triangulation — the cohort settled because the corpus type stabilized the agreement.

Glossary (Phase 6-wide)

Terms used across the Phase 6 line, defined once here. Statistical terms give the operational definition used in these reports, not the textbook generality; method names point to their full definition in Appendix X1.

Term	Definition (as used in Phase 6)
AWQ	Activation-aware Weight Quantization. 4-bit weight quantization scheme; one of the quantized formats in the upstream TR125/TR142 weight-precision sweep that the Phase 6 KV-cache work sits downstream of.
Batch-invariance	The property that a model's per-record output does not depend on the batch it was served in. The batch axis of TR152's serving-state factorial (batch_size ∈ {1, 8, 32}) tests whether FP8 KV-cache breaks it for the safety profile; it does not (Appendix G.9).
Cohen's d	Standardized mean difference for continuous outcomes. Reserved in this line for TR147 compiled-latency (CRI = max pairwise \|d\|). Not used for binary safety outcomes — see Cohen's h.
Cohen's h	Effect size for the difference between two proportions (arcsine-transformed). The matched-pairs binary effect size used across the safety line; bands negligible < 0.2 · small 0.2–0.5 · medium 0.5–0.8 · large ≥ 0.8. Max observed in Phase 6 = 0.0742 (TR149) / 0.0458 (TR152).
CRI	Compile Reproducibility Index (TR147). Max pairwise \|Cohen's d\| on compiled latency across a stack set; bands robust < 0.5 · sensitive 0.5–2 · fragile 2–10 · catastrophic ≥ 10. Full definition Appendix X1.2.
Discordant pair	In a matched-pairs (McNemar) design, a record whose outcome differs between the two conditions (FP16 vs FP8). `b` = FP16-safe → FP8-unsafe (degrade); `c` = FP16-unsafe → FP8-safe (improve). Concordant pairs (same outcome both conditions) carry no signal and drop out of the OR.
E4M3 / FP8	8-bit floating point with 1 sign, 4 exponent, and 3 mantissa bits; the FP8 KV-cache dtype tested throughout TR145/TR149/TR152. Model weights stay FP16; only the KV-cache is FP8.
Haldane correction	The `+0.5` continuity correction added to the counts entering an odds ratio so that it stays finite when a count is zero. Phase 6 pooled ORs use the Haldane-corrected matched-pairs discordant ratio `(Σb+0.5)/(Σc+0.5)`.
Holm–Bonferroni	Stepdown multiple-comparison correction; less conservative than plain Bonferroni, controls family-wise error. Applied to the per-cell McNemar families (TR149 12-cell, TR152 120-cell) and the TR148 15-pair κ family.
H0 / H1 / H2	The three hypotheses of the serving-state safety question. H0 = no effect within ±3pp (null); H1 = a located effect; H2 = a serving-state interaction. Phase 6 returns H0 for TR144/145/149 and (H1 located / H2 rejected) for TR152.
JTP	Judge Triangulation Protocol (TR140 → TR148). Cross-family Cohen's κ on the largest-n judge pair → routing band; robust ≥ 0.70 · triangulate 0.40–0.70 · untrustable < 0.40. Full definition Appendix X1.3.
KV-cache	Key/Value cache: the stored attention keys and values that let autoregressive decoding avoid recomputing past tokens. Quantizing it to FP8 is the memory-saving intervention whose safety impact the line measures.
Krippendorff's α	A chance-corrected inter-rater agreement coefficient; reported alongside κ as a robustness check (TR148 primary α = 0.6917, matching κ).
Mantel–Haenszel (paired vs unpaired)	Stratified pooled odds-ratio estimator. Paired (matched-pairs discordant ratio `(Σb+0.5)/(Σc+0.5)`) is correct for TR149/TR152's within-prompt FP16-vs-FP8 design. Unpaired (cross-product `(a·d)/(b·c)`) is correct for TR145's genuine marginal tables. Conflating them is the TR149 postmortem bug that exploded the OR to 3411.5 (Appendix F.10).
McNemar test	Exact paired-binary test on the discordant cells `b` and `c` of a 2×2 matched-pairs table. The primary per-cell safety test throughout the line.
MDE	Minimum Detectable Effect: the smallest effect a test is powered to detect at α=0.05, 80% power. Reported openly (TR149 ~14pp/cell; TR152 ~3.25–3.75pp Holm-clear) so the null names its own blind spot.
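PABAK	Prevalence-Adjusted Bias-Adjusted Kappa (2·Po − 1 for binary labels, where Po is observed agreement). The honest agreement read when κ degenerates at the ceiling because chance-expected agreement approaches observed agreement (e.g. TR149 StrongREJECT κ=−0.0005 but PABAK=0.9979 at near-100% agreed refusal).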
PagedAttention	vLLM's paged KV-cache memory manager; the serving substrate under which all FP8 KV-cache measurements run (vLLM v0.19.1).
Prefix caching	Reuse of the KV-cache for a shared prompt prefix across requests. The prefix-caching axis of TR152's factorial (on/off) tests whether it interacts with the FP8 safety profile; it does not (Appendix G.9).
RTSI	Refusal Template Stability Index (TR142). Four refusal-template drift features → composite index; bands < 0.10 low · 0.10–0.40 moderate · ≥ 0.40 high. Full definition Appendix X1.5.
Speculative decoding	A draft model proposes tokens that a target model verifies, accelerating decoding. The two acceptance rules tested by TAIS: rejection sampling (exact distribution-matching accept/reject) and typical acceptance (a looser entropy-based accept rule).
Star (factorial) design	A design that samples a baseline plus single-axis spokes rather than the dense factorial interior. TR152 uses a star design: 19.44% coverage of the full 72-cell factorial, enough to test each serving-state axis against a common baseline.
TAIS	Typical-Acceptance Invariance Screen (TR144). Matched-pairs Cohen's h + ±3pp TOST across draft+target pairs; null cutoff \|h\| < 0.1. Full definition Appendix X1.4.
TOST	Two One-Sided Tests for equivalence. Establishes that an effect is within a margin (the line's margin is ±3pp), converting "we failed to find an effect" into "we positively established equivalence." The methodological backbone of the null line.
triangulate_no_openai	The verdict bucket produced when the umbrella gate (`--skip-openai-judge`) forces local-only judging on adversarial corpora — triangulation across local judges with no frontier-judge ground truth.
Umbrella gate	The `--skip-openai-judge` flag. Adversarial corpora (HarmBench/JailbreakBench/StrongREJECT/AdvBench) cannot be sent through a frontier API without a Researcher Access Program umbrella, so the gate routes all such judging to local Ollama models ($0 external cost).
XSTest	A 450-prompt battery (CC BY 4.0) of safe-but-superficially-alarming prompts plus a harmful slice. The over-refusal battery — the only one with FP16 headroom below 100% — and therefore the only battery on which FP8 registers any effect (TR149/TR152).

Observations. The glossary is organized to enforce the line's two most-confused distinctions. First, Cohen's d vs h: d is continuous (TR147 latency only), h is the paired-binary proportion effect size (all safety reports) — using d on a binary safety outcome would be a category error the line never makes. Second, paired vs unpaired Mantel–Haenszel: the paired discordant ratio is correct for within-prompt FP16-vs-FP8 designs, the unpaired cross-product for genuine marginal tables, and the entry names the exact bug (OR 3411.5) that conflating them produces.

A glossary in a synthesis is not decoration — it is where the cross-TR vocabulary is pinned down so seven reports speak one language. The two distinctions above (d/h, paired/unpaired MH) are the ones that, if blurred, would let a reader mis-read the entire null line; defining them here, with the failure mode named, is the same defensibility discipline that runs through every appendix. The methods (CRI/JTP/TAIS/RTSI) point back to Appendix X1 rather than re-defining, so there is exactly one canonical definition of each.
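The PABAK entry in the glossary describes κ degenerating at the ceiling; the toy example below shows how that happens. The labels are hypothetical (2,000 records, each judge dissenting from "refuse" exactly once, on different records), and the function name is made up for the illustration.

```python
from collections import Counter


def kappa_and_pabak(x, y):
    """Cohen's kappa and PABAK (2 * Po - 1) for two binary label sequences."""
    n = len(x)
    po = sum(a == b for a, b in zip(x, y)) / n
    cx, cy = Counter(x), Counter(y)
    pe = sum(cx[k] * cy[k] for k in cx) / (n * n)
    kappa = (po - pe) / (1 - pe) if pe < 1 else float("nan")
    pabak = 2 * po - 1
    return kappa, pabak


# Hypothetical near-ceiling battery: both judges say "refuse" on almost
# every record, and their single "comply" labels fall on different records.
n_records = 2000
judge_a = ["refuse"] * n_records
judge_b = ["refuse"] * n_records
judge_a[0] = "comply"
judge_b[1] = "comply"

kappa, pabak = kappa_and_pabak(judge_a, judge_b)
```

Observed agreement is 1,998 of 2,000, yet chance-expected agreement is almost identical, so κ falls to roughly zero (slightly negative) while PABAK stays near 1. That is the pattern behind reading PABAK rather than κ on a battery where nearly every response is an agreed refusal.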

End of Extended Appendices. Companion to Conclusive_Phase6.md (main synthesis) and Conclusive_Phase6_Whitepaper.md (condensed external-facing). Source-of-truth reports: PublishReady/reports/Technical_Report_{144,145,146,147,148,149,152}.md.