Technical Report 152 (v1, local pilot): The Serving-State Safety Factorial — FP8 KV-Cache Across Batch, Prefix-Caching, Speculative-Decoding, and Temperature
Phase 4 / bridge-paper Layer 5. Hand-narrated from run research/tr152/results/20260521_192747 (executed 2026-05-21, RTX 4080 Laptop 12GB, vLLM v0.19.1). All numbers trace to tr152_analysis.json in that run directory. The auto-generated tr152_report.md (242 lines) remains in the run directory; this is the promoted narration, not a copy of it.
Scope marker — this is the v1 local pilot, not the full-scale experiment. It runs 3 models (≤3B, 2 families) on one 12GB card, with XSTest capped at 100/cell and the speculative-decoding axis absent (12GB OOM). It establishes the pipeline and locates the effect (xstest-Qwen over-refusal lean), but its headline statistics rest on 23 discordant pairs and should not be read as the definitive Layer 5 result. A v2 local expansion is staged (
research/tr152/config_v2_local.yaml): 5 models across 3 families with XSTest uncapped to 450, which gives the xstest-Qwen effect real per-cell power. The full-scale cloud version (TR151's 7B–70B matrix × these axes, with the speculative-decoding axis restored on A100-class VRAM) is deferred to the paper-build phase. Read every claim below as bounded to the pilot's scale.*
SS1. Executive summary
TR149 established that an FP8 KV-cache produces no detectable refusal change against FP16 on four standardized safety batteries, under a single serving configuration. The obvious objection — and the one a serving engineer asks first — is that production never runs that single configuration. Real deployments batch requests, share prefix caches across them, speculate tokens with a draft model, and sample at non-zero temperature. TR152 folds those four serving-state axes into one factorial and asks whether any of them modulates the FP8 safety contrast: is the FP8 null robust to the serving state, or does it break once batching, prefix reuse, speculation, or temperature is layered on?
The answer, across 14,400 sampled responses and 7,010 matched FP16-vs-FP8 pairs, is that the null holds where it matters and bends only at the margin:
- Clearly-harmful refusal is invariant to FP8 under every tested serving state. On HarmBench-400, JailbreakBench-100, and StrongREJECT-313, every model refuses every harmful prompt under both FP16 and FP8, in every cell of the factorial. Three of the four batteries show perfect concordance — zero discordant pairs.
- The only movement is on the over-refusal battery (XSTest), and only for the Qwen models. All 23 discordant pairs and all 12 discordant cells are XSTest;
qwen2.5-3bcontributes 13 of the 17 FP8-degraded discordances,qwen2.5-1.5bthe rest.llama3.2-3bshows zero discordance on XSTest as well. - The effect is operationally negligible and survives no per-cell correction. Per-context mean FP8 deltas range from −0.10pp to −0.37pp; the worst single cell is −3.3pp; 60 of 72 cells (83.3%) are statistically equivalent at ±3pp under TOST; and zero of 72 cells are significant after Holm–Bonferroni (smallest raw p = 0.25).
- It is nonetheless directionally consistent. The maximally-pooled Mantel–Haenszel odds ratio is 2.69 with a 95% CI of [1.09, 6.62] — the lower bound excludes 1.0 — built on a 17-vs-6 discordant imbalance. The direction is that FP8 makes the Qwen models slightly more likely to flip on benign-edge XSTest prompts, not on harmful ones.
The certifiable statement TR152 supports is therefore narrow and precise: an FP8 KV-cache does not change refusal behavior on harmful prompts under any tested combination of batch size, prefix caching, or temperature; its only measurable footprint is a sub-percentage-point increase in over-refusal-boundary instability on the Qwen family, detectable only when pooled across the whole factorial. It is not a license to call FP8 "unsafe," and it is not a clean cross-the-board null either. It is a bounded, located effect.
One axis could not be tested locally: every speculative-decoding cell failed to launch (six uniform out-of-memory timeouts on the 12GB card). That leg of the factorial is cloud-gated and is reported as not-run, not as null.
SS2. Motivation and research question
The Phase 4 safety line has accumulated a sequence of null results under tested conditions: the FP8 KV-cache null (TR145, replicated on standardized batteries in TR149), the speculative-decoding behavioral null at temperature zero (TR144 / TAIS), the continuous-batch composition null (TR143), and the concurrency null (TR135). Each was measured along a single axis. The recurring reviewer question is whether these axes interact: a treatment that is null in isolation can still become non-null in the presence of a second treatment, and a serving stack applies all of them at once.
TR152 is the interaction study. Its design question is not "does FP8 change safety?" — TR149 answered that for one configuration — but "does any serving-state axis change the FP8 contrast?" Formally, for each serving-state context $s$ we measure the matched FP16-vs-FP8 safe-rate delta $\Delta(s) = \text{rate}{\text{fp8}}(s) - \text{rate}{\text{auto}}(s)$, and we ask whether $\Delta$ depends on $s$. A flat $\Delta$ across contexts means FP8 can be certified independently of the serving state; a $\Delta$ that swings with one axis means the certification verdict must be conditioned on that axis.
This is the Layer 5 anchor of the serving-state-safety-certification bridge paper: the layer that justifies (or refuses) a serving-state-independent FP8 certificate.
The framing matters for what counts as a positive result here. A large interaction would be the interesting finding — it would mean the simple TR149 null is an artifact of testing one configuration. A flat interaction is the useful finding — it licenses a single FP8 certificate that a deployment can rely on without re-validating every batch setting. TR152 returns the useful finding, with one located caveat.
SS3. Design — the FP8-anchored star factorial
A full crossing of the five axes (KV dtype × batch size × prefix caching × speculative decoding × temperature) is 2 × 3 × 2 × 2 × 3 = 72 configurations per model per battery, which is both wasteful (most of the interior is uninformative) and breaks the matched-pair structure the safety estimators require. TR152 instead uses an FP8-anchored star design: a single baseline configuration at the center, and one "spoke" per off-baseline level of each axis, with every spoke run at both FP16 (auto) and FP8 so each spoke is a clean matched pair.
The baseline is kv_cache_dtype=auto, batch_size=1, prefix_caching=false, speculative_decoding=false, temperature=0.0. The seven paired contexts (the baseline plus six spokes) are:
| Context label | Deviating axis | Setting |
|---|---|---|
baseline |
— | bs=1, pf=off, sp=off, temp=0.0 |
batch_size=8 |
batch_size | bs=8 |
batch_size=32 |
batch_size | bs=32 |
prefix_caching=True |
prefix_caching | pf=on |
speculative_decoding=True |
speculative_decoding | sp=on |
temperature=0.7 |
temperature | temp=0.7 |
temperature=1.0 |
temperature | temp=1.0 |
Observations. Each row is one FP16/FP8 twin: 7 paired contexts × 2 dtypes = 14 cells per model per battery, the "14 cells, 7 pairs" the design name refers to. The star shape buys the matched-pair contrast cheaply — every spoke shares the baseline's prompt set and differs in exactly one axis, so the FP8 delta at that spoke is attributable to the dtype change and not to a confound. The cost is that interactions between two off-baseline axes (e.g., batch=32 and temperature=1.0 simultaneously) are not measured; the star tests each axis's modulation of FP8 independently, not their joint modulation.
This is a deliberate resolution trade. The star answers "does any single serving axis bend the FP8 contrast?" at full matched-pair rigor, and explicitly declines to answer "do two serving axes jointly bend it?" The latter would require a denser design and is out of scope; the report does not claim joint-interaction coverage.
SS4. Models, batteries, and sample budget
Models (three, all locally resident on the 12GB card): llama3.2-3b (Llama-3.2-3B-Instruct), qwen2.5-3b (Qwen2.5-3B-Instruct), qwen2.5-1.5b (Qwen2.5-1.5B-Instruct). The two families let the report separate a model-family effect from a size effect.
Batteries (the four TR149 standardized sets): harmbench_400 (400 harmful instructions), jbb_100 (100 JailbreakBench behaviors), strongreject_313 (313 StrongREJECT prompts), xstest_450 (450 XSTest prompts — the over-refusal battery, mixing safe prompts that should be answered with unsafe prompts that should be refused). Each cell subsamples up to 100 prompts per battery to keep the factorial tractable.
Sample budget. 14,400 sampled responses reached the analysis. Of these, 7,010 form complete FP16-vs-FP8 matched pairs across the 72 realized strata (4 batteries × 3 models × 6 contexts; the seventh context, speculative decoding, did not run — see SS6). The shortfall from the nominal 16,800 (14 cells × 3 models × 4 batteries × 100) is the six speculative-decoding cells that failed to launch (≈2,400 responses) plus per-prompt judge dropouts.
Observations. The matched-pair count (7,010) is the denominator that matters for every safety estimator below; the raw 14,400 includes the FP16 and FP8 sides separately and the unpaired residue. The pairing is keyed on (battery, model, context, sample_id, dtype), so a pair exists only where both the FP16 and FP8 sides produced a scored response for the same prompt under the same serving context.
SS5. Judges and TR148 routing
TR152 inherits its judge cohort from the TR148 verdict at research/tr148/results/20260512_174624, which resolved to triangulate. Because three of the four batteries are adversarial (HarmBench, JailbreakBench, StrongREJECT), the run was launched with --skip-openai-judge per the OpenAI-safety-umbrella gate, which strips GPT-4o and produces the triangulate_no_openai bucket. The effective judge cohort is therefore regex + gemma3:12b + llama3.1:8b, all local (Ollama).
Observations. Per-record outcomes are a sign-aware majority vote across the available LLM judges, with the regex classifier as calibration rather than a voting member on the composite axis (the rationale appears in SS14, where regex is shown to measure a related-but-distinct axis from the LLM judges — fair-to-moderate within-battery agreement, not the orthogonality the pooled κ first suggests). No record requires any specific judge to be present — the join is judge-agnostic, per the standing analyze.py rule, so a record with two of three judges still scores.
Routing through local judges is not a compromise here; it is the correct posture for adversarial corpora until a researcher-access umbrella is in place. The cost is the absence of the GPT-4o cross-family check; the benefit is that no advbench/jailbreak content leaves the machine. The cross-judge κ in SS14 shows the two local LLM judges agree strongly enough (κ = 0.81) that the verdict does not hinge on the missing third family.
SS6. Execution and the speculative-decoding wall
Sampling ran from 15:28 to 22:16 on 2026-05-21; judging completed at 01:09 on 2026-05-22; analysis and report generation followed immediately. The FP8 validation gate (Phase 0) passed cleanly — vLLM started with --kv-cache-dtype fp8 on Qwen2.5-1.5B and returned a completion — confirming the FP8 path works on the card before any sampling.
The one systematic execution failure was speculative decoding. Every speculative-decoding cell — both the auto and fp8 sides, on all three models — failed with "vLLM failed to start within 420s":
| Model | sp-1 cell (auto) |
sp-1 cell (fp8) |
|---|---|---|
| llama3.2-3b | timeout | timeout |
| qwen2.5-3b | timeout | timeout |
| qwen2.5-1.5b | timeout | timeout |
Observations. All six speculative-decoding cells failed identically, regardless of model size or KV dtype. Speculative decoding requires the draft and target models to be co-resident in GPU memory; on the 12GB card, loading both within the 420-second startup budget did not succeed. This is the same constraint that sent TR144's speculative-decoding study to an A100 in the first place. The run loop is resume-safe and treats a cell launch failure as a skip-and-continue, so the six failures are gaps in the factorial, not a crash — every other cell completed.
The uniformity of the failure is itself a clean finding: the speculative-decoding leg of the serving-state factorial is infeasible on consumer 12GB hardware, full stop, independent of model or dtype. TR152's certification of FP8 across serving state is therefore explicitly conditional on the four runnable axes (batch size, prefix caching, temperature, and the dtype contrast itself); the speculative-decoding × FP8 interaction is deferred to a cloud run and is not claimed here.
The realized coverage, as a per-model cell matrix (✓ = both FP16 and FP8 sides completed; ✗ = launch timeout), is:
| Context (deviating axis) | llama3.2-3b | qwen2.5-3b | qwen2.5-1.5b |
|---|---|---|---|
| baseline (bs=1, pf=off, sp=off, t=0.0) | ✓ | ✓ | ✓ |
| batch_size=8 | ✓ | ✓ | ✓ |
| batch_size=32 | ✓ | ✓ | ✓ |
| prefix_caching=on | ✓ | ✓ | ✓ |
| speculative_decoding=on | ✗ | ✗ | ✗ |
| temperature=0.7 | ✓ | ✓ | ✓ |
| temperature=1.0 | ✓ | ✓ | ✓ |
Observations. Six of seven paired contexts completed for all three models — 18 of 21 model-context pairs, each at both dtypes (36 of 42 cells), times four batteries gives the 72 realized strata the estimators run on. The single missing row is speculative decoding, missing for every model. Coverage is otherwise complete: no battery, model, or non-speculative context is partial.
Reporting coverage as a matrix rather than a count matters for the certificate's honesty. The gap is not scattered dropout that might bias the estimate in some unknown direction; it is one entire axis, missing uniformly, for a mechanical (memory) reason. That is the cleanest possible kind of missing data — it subtracts a known axis from the claim's scope rather than introducing a selection effect into the axes that did run.
SS7. Main effects — the serving axes on their own
Before the FP8 contrast, it is worth asking whether the serving axes move safety at all, independent of dtype. The main-effects screen computes the FP16-only safe-rate at each level of each axis (FP16-only so the FP8 contrast does not contaminate the marginal):
| Axis | Level | n | Safe rate |
|---|---|---|---|
| batch_size | 1 | 4,715 | 0.9578 |
| batch_size | 8 | 1,179 | 0.9567 |
| batch_size | 32 | 1,179 | 0.9567 |
| prefix_caching | off | 5,894 | 0.9576 |
| prefix_caching | on | 1,179 | 0.9567 |
| temperature | 0.0 | 4,716 | 0.9567 |
| temperature | 0.7 | 1,177 | 0.9592 |
| temperature | 1.0 | 1,180 | 0.9585 |
Observations. Every level of every axis sits within a 0.25-percentage-point band of 0.957. Batch size moves the FP16 safe rate by 0.11pp from bs=1 to bs=32; prefix caching by 0.09pp; temperature by 0.25pp (and, mildly counter-intuitively, upward at higher temperature — 0.9567 at temp 0 vs 0.9592 at temp 0.7). None of these is a material main effect: the serving axes, on their own, do not move aggregate refusal.
This is the necessary backdrop for the interaction analysis. If a serving axis had a large main effect, an FP8 interaction would be hard to read against it. Because the main effects are flat to within a quarter-point, any FP8 interaction stands out cleanly — there is no axis-level safety swing for an FP8 effect to hide behind.
SS8. Per-cell paired McNemar — where the concordance lives
The matched-pair contrast is one 2×2 McNemar table per (battery, model, context): cell a = both FP16 and FP8 safe, b = FP16 safe but FP8 unsafe (FP8 degraded), c = FP16 unsafe but FP8 safe (FP8 improved), d = both unsafe. Aggregated to the (battery, model) level, the picture is stark:
| Battery | Model | n paired | a (both safe) | d (both unsafe) | b (FP8 degraded) | c (FP8 improved) |
|---|---|---|---|---|---|---|
| harmbench_400 | llama3.2-3b | 599 | 599 | 0 | 0 | 0 |
| harmbench_400 | qwen2.5-1.5b | 597 | 597 | 0 | 0 | 0 |
| harmbench_400 | qwen2.5-3b | 600 | 600 | 0 | 0 | 0 |
| jbb_100 | llama3.2-3b | 596 | 596 | 0 | 0 | 0 |
| jbb_100 | qwen2.5-1.5b | 600 | 600 | 0 | 0 | 0 |
| jbb_100 | qwen2.5-3b | 600 | 600 | 0 | 0 | 0 |
| strongreject_313 | llama3.2-3b | 600 | 600 | 0 | 0 | 0 |
| strongreject_313 | qwen2.5-1.5b | 596 | 596 | 0 | 0 | 0 |
| strongreject_313 | qwen2.5-3b | 598 | 598 | 0 | 0 | 0 |
| xstest_450 | llama3.2-3b | 545 | 477 | 68 | 0 | 0 |
| xstest_450 | qwen2.5-1.5b | 540 | 420 | 113 | 4 | 3 |
| xstest_450 | qwen2.5-3b | 539 | 434 | 89 | 13 | 3 |
Observations. The three adversarial batteries are perfectly concordant: a = n, b = c = d = 0, for all three models. Every harmful prompt is refused under both FP16 and FP8, in every cell — there is no FP8 footprint on harmful-prompt refusal at all. XSTest is the only battery with any discordance, and even there, llama3.2-3b is fully concordant (b = c = 0; its d = 68 is the consistent over-refusal that both dtypes share). The discordance is entirely Qwen: qwen2.5-3b carries 13 of the 17 FP8-degraded pairs, qwen2.5-1.5b the other 4, with 6 FP8-improved pairs split across the two.
Two facts are doing all the work here. First, the harmful batteries are a floor test — these models refuse everything dangerous regardless of dtype, so there is no headroom for FP8 to degrade harmful refusal, and none is observed. Second, XSTest is where refusal is a decision rather than a reflex (it deliberately includes safe prompts that look unsafe), and that is exactly where a small numerical perturbation like FP8 quantization can tip a borderline case. The d = 68/113/89 columns show the Qwen models already over-refuse a large share of XSTest under FP16; FP8 nudges a handful more, asymmetrically.
SS9. The discordance distribution — all twelve cells
A note on what "safe" and "unsafe" mean on XSTest, because the battery's structure is what makes it the sensitive probe. XSTest is built from two slices: a safe slice (250 prompts that look dangerous but are benign — "how do I kill a Python process?" — and should be answered) and an unsafe slice (200 genuinely harmful prompts that should be refused). In the scored records the two slices appear in a balanced split (1,800 comply-expected vs 1,800 refuse-expected rows across the dtype × context multiplicity). The sign-aware scorer marks a response "safe" when it does the expected thing for its slice — answers a safe-slice prompt, refuses an unsafe-slice prompt — and "unsafe" otherwise. So on XSTest an "unsafe" outcome is usually an over-refusal (declining a benign prompt), not a harmful compliance; the d column (both dtypes unsafe) is dominated by shared over-refusal of the safe slice, and the b/c discordance is FP8 tipping a borderline safe-slice prompt from answered to over-refused or back.
Because the entire effect lives in twelve cells, the report names them all rather than summarizing. Every discordant cell is XSTest:
| Cell | b (FP8 degraded) | c (FP8 improved) | n_disc | exact p | delta_pp |
|---|---|---|---|---|---|
| xstest | qwen2.5-3b | temperature=0.7 | 3 | 0 | 3 | 0.25 | −3.30 |
| xstest | qwen2.5-3b | prefix_caching=True | 2 | 1 | 3 | 1.00 | −1.14 |
| xstest | qwen2.5-3b | batch_size=32 | 2 | 1 | 3 | 1.00 | −1.11 |
| xstest | qwen2.5-3b | batch_size=8 | 2 | 1 | 3 | 1.00 | −1.10 |
| xstest | qwen2.5-3b | baseline | 2 | 0 | 2 | 0.50 | −2.30 |
| xstest | qwen2.5-3b | temperature=1.0 | 2 | 0 | 2 | 0.50 | −2.17 |
| xstest | qwen2.5-1.5b | temperature=1.0 | 2 | 0 | 2 | 0.50 | −2.20 |
| xstest | qwen2.5-1.5b | baseline | 0 | 1 | 1 | 1.00 | +1.10 |
| xstest | qwen2.5-1.5b | temperature=0.7 | 1 | 0 | 1 | 1.00 | −1.14 |
| xstest | qwen2.5-1.5b | prefix_caching=True | 0 | 1 | 1 | 1.00 | +1.10 |
| xstest | qwen2.5-1.5b | batch_size=32 | 1 | 0 | 1 | 1.00 | −1.11 |
| xstest | qwen2.5-1.5b | batch_size=8 | 0 | 1 | 1 | 1.00 | +1.12 |
Observations. No cell has more than three discordant pairs. The largest single-cell effect is qwen2.5-3b | temperature=0.7 at −3.3pp (3 FP8-degraded, 0 improved, exact p = 0.25). For qwen2.5-3b, every discordant cell leans negative (FP8 degrades) — the b column dominates c in all six of its cells. For qwen2.5-1.5b, the direction is mixed: three cells lean negative, three lean positive (FP8 improves XSTest behavior). The temperature=0.7 spoke is the most degraded context for both models.
The per-cell exact p-values tell the honest story: the smallest is 0.25, and most are 0.5 or 1.0. With at most three discordant pairs in any cell, no individual cell can reach significance — there simply is not enough within-cell signal. Whatever TR152 detects, it detects only by pooling across cells, and the consistency of the
qwen2.5-3bdirection (six of six cells negative) is the strongest single thread in the data.
SS10. FP8 interaction across serving contexts
The interaction pass is the design's headline question: does the FP8 delta depend on the serving context? It computes, per context, the mean and range of the per-cell FP8 deltas across all twelve (battery × model) cells:
| Context | mean Δ (pp) | min Δ (pp) | max Δ (pp) | n cells | n paired |
|---|---|---|---|---|---|
| baseline | −0.100 | −2.30 | +1.10 | 12 | 1,166 |
| batch_size=8 | +0.002 | −1.10 | +1.12 | 12 | 1,167 |
| batch_size=32 | −0.185 | −1.11 | 0.00 | 12 | 1,168 |
| prefix_caching=True | −0.003 | −1.14 | +1.10 | 12 | 1,167 |
| temperature=0.7 | −0.370 | −3.30 | 0.00 | 12 | 1,168 |
| temperature=1.0 | −0.364 | −2.20 | 0.00 | 12 | 1,174 |
Interaction spread (max cell delta minus min cell delta across the whole factorial): 4.42pp.
Observations. Every context's mean FP8 delta is within four-tenths of a percentage point of zero. The two non-zero-temperature contexts (−0.37pp at temp 0.7, −0.36pp at temp 1.0) are the most negative means, consistent with the per-cell finding that the temperature spokes carry the largest single-cell degradations. Batch size and prefix caching are essentially flat (means of +0.002, −0.185, −0.003pp). The interaction spread of 4.42pp exceeds the ±3pp equivalence margin — but that spread is driven by two outlier cells (the −3.3pp temp=0.7 cell and a +1.1pp cell), not by a context-level mean.
This is the crux of the Layer 5 verdict, and it requires care. The context means are all within ±0.4pp, which says no serving axis materially shifts the FP8 contrast on average. The interaction spread is 4.42pp, which exceeds ±3pp and so, read literally against the pre-registered margin, says "at least one cell moves more than the margin allows." Both are true. The reconciliation (SS15) is that the spread is a two-cell tail on the over-refusal battery, not a context-level interaction — temperature mildly amplifies the XSTest-Qwen effect, but no axis flips the FP8 verdict on harmful prompts, where the spread is exactly zero.
The context-level means pool across all three models and so hide the cleanest signal in the study: the effect is entirely a model-family phenomenon. Breaking the per-context FP8 delta out by model (mean over the four batteries) makes that explicit:
| Model | baseline | bs=8 | bs=32 | prefix=on | temp=0.7 | temp=1.0 |
|---|---|---|---|---|---|---|
| llama3.2-3b | +0.000 | +0.000 | +0.000 | +0.000 | +0.000 | +0.000 |
| qwen2.5-3b | −0.575 | −0.275 | −0.278 | −0.285 | −0.825 | −0.542 |
| qwen2.5-1.5b | +0.275 | +0.280 | −0.278 | +0.275 | −0.285 | −0.550 |
Observations. llama3.2-3b is exactly 0.000pp in every cell of the table — under no serving context does FP8 move its refusal rate by a single matched pair, on any battery. qwen2.5-3b is negative in all six contexts (range −0.275 to −0.825pp), with temperature=0.7 the most degraded — a uniform, monotone-leaning signature. qwen2.5-1.5b is genuinely mixed (three positive, three negative contexts), consistent with its near-balanced 4-vs-3 discordance. The family split is the real structure: Llama is invariant, the larger Qwen carries a consistent small negative lean, and the smaller Qwen is noise around zero.
This table is the single most informative artifact in the report for a deployment decision. It says the FP8-over-refusal lean is not a property of "models" in general or of any serving axis in particular — it is a property of the Qwen-3B model specifically, amplified mildly by temperature. A deployment running Llama-3.2-3B can treat the FP8 certificate as unconditional; a deployment running Qwen2.5-3B should note the sub-1pp over-refusal lean and, if it matters, validate the exact configuration. That is a per-model certification verdict, not a per-axis one.
SS11. Cross-context Mantel–Haenszel
Pooling the twelve discordant XSTest cells (and the 60 concordant cells, which contribute no discordant mass) with the matched-pairs Mantel–Haenszel estimator gives the program-level FP8 odds ratio. The estimator is the Haldane-corrected pooled discordant ratio $(\sum b + 0.5)/(\sum c + 0.5)$, which is the correct form for matched pairs — concordant cells drop out, as they must:
- Strata: 72; total paired n: 7,010; discordant pairs: 23 ($\sum b = 17$, $\sum c = 6$).
- Pooled OR: 2.69; log-OR 0.990, SE 0.459; 95% CI [1.09, 6.62].
Observations. The pooled OR is 2.69 and its 95% CI lower bound (1.09) excludes 1.0, so by the normal-approximation Wald interval the pooled FP8 effect is "significant": FP8 is associated with roughly 2.7× the odds of an FP16→FP8 degradation as the reverse, on the discordant pairs. But the estimate rests on 23 discordant pairs out of 7,010 — a 0.33% discordance rate — and the 17-vs-6 split is what an exact binomial sign test scores at roughly p ≈ 0.03–0.04, borderline and fragile to the Haldane correction at this small n.
The Mantel–Haenszel pass is the one place TR152 produces a CI that excludes 1.0, and it would be easy to over-read. The honest framing is that this is the weakest possible form of a positive result: a pooled directional lean, built entirely on the over-refusal battery's twenty-three discordant pairs, that no individual cell corroborates at significance. It is real in the sense that the direction is consistent (especially for
qwen2.5-3b); it is negligible in the sense that the absolute rate is sub-percentage-point and the harmful batteries contribute exactly zero to it. TR149's pooled OR of 0.81 bracketed 1.0; TR152's 2.69 does not — but the move is from "no detectable lean" to "a faint, located lean," not from "safe" to "unsafe."
SS12. TOST equivalence
The complement of significance-testing is equivalence-testing: rather than asking whether FP8 differs from FP16, TOST asks whether the two are equivalent within a ±3pp margin on the bootstrap delta CI. Across the 72 cells:
- Equivalent at ±3pp: 60 / 72 (83.3%).
- Not shown-equivalent: 12 / 72.
Observations. Five of every six cells are positively shown equivalent — the FP8 and FP16 safe rates are statistically indistinguishable within the deployment-relevant margin. The twelve cells that fail the equivalence test are not cells that show a difference; they are cells where the matched-pair discordance is small enough that the bootstrap delta CI is too wide to certify equivalence at ±3pp. These are the same XSTest-heavy cells with sparse discordance — equivalence is inconclusive there, not refuted.
TOST and the Mantel–Haenszel pass point the same direction once read correctly. 83% of the factorial is positively equivalent; the remaining 17% is not "different," it is "too sparse to certify either way." A reviewer who wants the strongest defensible claim should lead with the 60/72 equivalence, not the pooled OR — equivalence is the result that licenses the certificate, and the OR is the located caveat on it.
SS13. Holm–Bonferroni family correction
The 72 per-cell McNemar tests form one family; controlling the family-wise error rate with Holm–Bonferroni:
- Family size: 72; significant after Holm: 0.
- Smallest raw p: 0.25 (
xstest | qwen2.5-3b | temperature=0.7), adjusted to 1.0.
Observations. No cell survives multiplicity correction. The smallest raw p-value in the entire factorial is 0.25 — already non-significant before correction — and Holm pushes every adjusted p to 1.0. There is no individual (battery, model, serving-context) cell at which FP8 produces a statistically significant refusal change.
This is the single most important number for bounding the claim. Whatever the pooled Mantel–Haenszel CI suggests, not one of the 72 cells shows a significant FP8 effect on its own. The pooled lean is an aggregate-only phenomenon. A certification protocol built on TR152 should state plainly that no per-condition FP8 effect was detected at any tested serving state, and that the pooled directional signal is a sub-percentage-point over-refusal lean on one battery and one model family.
SS14. Cross-judge agreement — the two-axis structure
The three judges agree with each other very unevenly:
| Judge pair | Cohen's κ |
|---|---|
| gemma3:12b ↔ llama3.1:8b | 0.814 |
| regex ↔ gemma3:12b | 0.088 |
| regex ↔ llama3.1:8b | 0.101 |
Observations. The two LLM judges agree strongly (κ = 0.81, "robust" by the JTP calibration). The pooled regex–LLM κ is much lower (≈ 0.09–0.10). But the pooled regex number is a misleading aggregate, and unpacking it is the point of this pass.
The global κ pools all four batteries, whose base rates differ enormously — the three harmful batteries refuse ≈100% of prompts (degenerate marginals), while XSTest is a deliberate 50/50 split of comply-expected and refuse-expected prompts. Cohen's κ deflates sharply when categorical agreement is pooled across strata with such different marginals (a Simpson-type effect on the chance-correction term). Computing κ within each battery tells a different story:
| Battery | gemma↔llama | regex↔gemma | regex↔llama |
|---|---|---|---|
| harmbench_400 | +0.711 | +0.210 | +0.247 |
| jbb_100 | +0.793 | +0.252 | +0.267 |
| strongreject_313 | +0.762 | +0.223 | +0.264 |
| xstest_450 | +0.828 | +0.560 | +0.608 |
Observations. Within every battery the regex–LLM agreement is fair-to-moderate (0.21–0.61), not the "essentially chance" the pooled 0.09 suggested. And the agreement is highest on XSTest (0.56–0.61), exactly the battery where the two axes were supposed to diverge most. The LLM–LLM agreement is robust everywhere (0.71–0.83) and likewise peaks on XSTest (0.828). The pooled global κ of 0.09 is therefore a base-rate-pooling artifact, not a measurement of true regex–LLM disagreement.
This refines, rather than overturns, the TR148 two-axis reading. The regex and LLM judges do measure different things — response-refusal-prefix versus composite-harm — and that is why regex is held out of the voting and used only as calibration. But the divergence is moderate within-battery, not chance; the dramatic-looking pooled κ = 0.09 is an artifact of averaging a 100%-refuse harmful corpus against a 50/50 over-refusal corpus. The honest statement is: the two LLM judges concur strongly (and most strongly on the battery that carries all the signal), the verdict rests on that concurrence, and the regex axis adds a fair-to-moderate independent check rather than an orthogonal one. A report that quoted only the pooled 0.09 would have overstated the judge disagreement.
SS15. Reconciling the pooled lean with the per-cell null
TR152 produces two facts that look contradictory and are not: a pooled Mantel–Haenszel CI that excludes 1.0 (SS11), and zero significant cells after Holm (SS13). The reconciliation is the structure of the discordance:
- The effect is confined to one battery (XSTest) and one family (Qwen) — SS8.
- Within that confinement it is spread thinly: 23 discordant pairs across 12 cells, at most 3 per cell — SS9.
- No single cell has enough discordant mass to reach significance — SS13.
- But the direction is consistent enough (17 degraded vs 6 improved, and 6/6 negative for
qwen2.5-3b) that pooling all 23 pairs into one estimator crosses the significance threshold — SS11.
Observations. This is the textbook signature of a real-but-tiny effect: invisible per-cell, detectable only in aggregate, and located in a specific corner of the design rather than smeared across it. The pooled OR is not a statistical artifact — the direction is genuinely consistent — but it is also not a per-condition effect a deployment would ever observe, because at any single serving configuration the discordance is one to three prompts out of ninety.
The defensible reading is the conjunction, not either fact alone: FP8 induces a consistent but sub-percentage-point increase in over-refusal-boundary instability on the Qwen family, visible only when the whole factorial is pooled, and absent entirely on harmful prompts and on the Llama model. Reporting the pooled OR without the per-cell null would overclaim; reporting the per-cell null without the pooled OR would miss the one real thread. Layer 5 of the bridge paper should carry both.
SS16. Threats to validity
- Sparse discordance. The pooled OR rests on 23 discordant pairs. At this n the Haldane correction and the normal-approximation CI are both load-bearing; an exact-test framing (binomial sign p ≈ 0.03–0.04) is the more honest significance statement and is reported alongside.
- Over-refusal battery dominance. All discordance is XSTest. XSTest is precisely the battery where refusal is a borderline decision, so it is the most sensitive probe — but a claim about "safety" must be careful that the moved cases are over-refusal-boundary flips, not harmful-compliance flips. The harmful batteries (zero discordance) confirm the latter does not occur.
- Speculative-decoding gap. One of five axes did not run (SS6). The FP8 certificate TR152 supports is conditional on the four runnable axes; the speculative × FP8 interaction is untested here.
- Subsampling. Each cell caps at 100 prompts; the per-cell McNemar power is correspondingly limited (a cell cannot detect an effect smaller than ~1/100). The design trades per-cell power for factorial breadth and recovers aggregate power through pooling.
- Single hardware platform. All cells ran on one RTX 4080 Laptop. Cross-vendor numerical-path effects (the CRI / TR161 question) are not addressed; the FP8 contrast is within one CUDA/Triton stack.
None of these threatens the central negative result (harmful-refusal invariance to FP8), which is supported by perfect concordance on three batteries and is not a sparse-data inference. They bound the positive result (the XSTest-Qwen lean), which is exactly where the report already declines to overclaim.
SS17. Relation to the FP8 null line (TR145, TR149)
TR145 found no FP8 KV-cache safety effect at a single configuration; TR149 replicated that null on the four standardized batteries (pooled OR 0.81 [0.90, 1.23], 12/12 cells TOST-equivalent). TR152 extends the contrast across the serving state and reaches a compatible but sharper conclusion.
Observations. TR149's pooled OR (0.81) bracketed 1.0; TR152's (2.69) does not. Read naively this looks like a contradiction — one null, one not. It is not: TR149 pooled across batteries at one serving configuration and saw symmetric, near-zero discordance; TR152 pools across 72 serving-state strata and surfaces the thin, consistent XSTest-Qwen lean that the narrower design averaged out. The two are the same underlying phenomenon at different resolutions — harmful refusal invariant, over-refusal boundary faintly perturbed — with TR152's richer factorial having just enough strata to push the over-refusal lean's pooled CI off 1.0.
The line's verdict is therefore stable and gains precision with each TR: FP8 KV-cache does not move harmful-prompt refusal (TR145, TR149, and TR152's three concordant batteries all agree), and its only measurable footprint is at the over-refusal boundary, which TR152 is the first to resolve because it is the first with the strata to do so. This is the depth-compounds pattern: more design resolution converts a clean null into a clean null-with-a-located-caveat, which is the more useful certification object.
SS18. Implications for the bridge-paper Layer 5
Layer 5 of the serving-state-safety-certification protocol asks whether an FP8 KV-cache certificate can be issued independently of the serving state. TR152's answer:
- Yes for harmful-prompt refusal, unconditionally across batch size, prefix caching, and temperature: perfect concordance on three adversarial batteries means the certificate holds without re-validation per serving setting.
- With a caveat for over-refusal behavior on the Qwen family: a sub-percentage-point increase in XSTest instability that no single configuration exhibits significantly but that pools to a directional lean. The certificate should carry this as a footnote, not a blocker.
- Conditional on the speculative-decoding axis being untested locally: Layer 5 must either mark speculative × FP8 as deferred or supply a cloud run.
The clean Layer 5 statement the bridge paper can license: "Across batch size {1, 8, 32}, prefix caching {on, off}, and temperature {0.0, 0.7, 1.0}, an FP8 KV-cache produces no per-condition refusal change on harmful prompts (0/72 cells significant; 60/72 TOST-equivalent at ±3pp); the only pooled effect is a sub-1pp over-refusal lean on the Qwen family (MH OR 2.69 [1.09, 6.62], 23 discordant pairs, all XSTest). The speculative-decoding axis is deferred pending cloud compute." That sentence is fully supported by the artifacts and overclaims nothing.
SS19. Limitations and what would tighten the result
- Run speculative decoding on an A100 to close the fifth axis; this is the single largest gap and is a known-feasible cloud run (TR144 precedent).
- Expand the XSTest discordant cells with the full 450-prompt set (uncapped) on the Qwen models specifically, to convert the pooled lean into a per-cell estimate with a real CI rather than 1–3 discordant pairs.
- Add a non-Qwen, non-Llama family (e.g., Mistral-7B) to test whether the over-refusal lean is Qwen-specific or generalizes — TR152 has a family signal (Llama clean, Qwen not) that two families cannot fully resolve.
- Fix the
_exact_mcnemarkey-name inconsistency (b/cvsb_fp16safe_fp8unsafe/c_fp16unsafe_fp8safe) for downstream-consumer robustness; cosmetic, changes no number.
SS20. Conclusion
TR152 set out to test whether the FP8 KV-cache safety null survives the serving state, and it does, with one located caveat. Across 14,400 responses and 7,010 matched pairs spanning batch size, prefix caching, and temperature, an FP8 KV-cache produces no significant per-condition refusal change at any tested serving state (0 of 72 cells survive Holm correction), and five of every six cells are positively equivalent to FP16 within ±3pp. Refusal on clearly-harmful prompts is perfectly invariant to FP8 — three of four batteries show zero discordance across the entire factorial. The only measurable FP8 footprint is a sub-percentage-point increase in over-refusal-boundary instability on the Qwen family's XSTest responses, consistent in direction (17 degraded vs 6 improved pairs) and detectable only when pooled across the whole factorial (Mantel–Haenszel OR 2.69, CI [1.09, 6.62]), never at any single configuration. The speculative-decoding axis could not be tested on the 12GB card — all six cells exceeded the GPU-memory budget — and is reported as deferred, not null.
The result the serving-state-safety-certification protocol inherits is therefore a precise one: FP8 is refusal-neutral on harmful prompts under every runnable serving configuration, with a documented, operationally-negligible over-refusal lean on one model family at the benign-edge battery. The franchise discipline holds — the data show a real but tiny effect, and the report locates it exactly rather than inflating it into a safety alarm or flattening it into a featureless null.