Technical Report

TR140: Many-Shot & Long-Context Jailbreak Under Quantization

15,000 scored samples across 4 models, 6 quant levels, 5 shot counts, and 3 context-length profiles.

Table of Contents

Abstract
v1 -> v3 Integration Summary
Executive Summary
Key Findings
Core Decisions
Validation Summary
Claim Validation
How to Read This Report
When to Use This Report
Scenario 1: Evaluating many-shot attack risk on a quantized deployment
Scenario 2: Choosing between prompt formats for a chat API
Scenario 3: Setting context-length limits for safety
Scenario 4: Positioning TR140 relative to the broader safety program
Table of Contents
Metric Definitions
Primary Metrics
Statistical Tests Used
Evidence Standard
SS1. Introduction
SS1.1 Research Questions
SS1.2 Why This Matters
SS1.3 Scope
SS1.4 Literature Grounding
SS1.5 How to Read This Report
SS2. Methodology
SS2.1 Overall Design
SS2.2 Unit of Analysis
SS2.3 Scoring Pipeline
SS2.4 Design Safeguards
SS2.5 Compute Budget
SS2.6 What This Design Does Not Do
SS3. Models and Design
SS4. Prompt Construction
SS4.1 Faux Dialogue Format
SS4.2 Message Array Format
SS4.3 Phase 2: Context Dilution Format
SS5. Phase 1 Results: Many-Shot ASR by Shot Count and Quantization
Table 1: llama3.1-8b
Table 2: llama3.2-1b
Table 3: llama3.2-3b
Table 4: qwen2.5-1.5b
SS5.5 Baseline-Normalized ASR
Table 4b: Normalized ASR ratios (qwen2.5-1.5b only -- the only model with non-zero Q8_0 baselines)
Table 4c: Llama family Q2_K normalization (absolute delta where Q8_0 = 0%)
SS5.6 Minimum Effective Shot Count
SS5.7 Shot-Count Effectiveness Patterns
SS6. Statistical Tests vs Q8_0 Baseline
Table 5: Significant comparisons (Holm-adjusted p < 0.05)
SS7. Critical Quant Thresholds
SS8. Power-Law Analysis
Table 6: Power-law parameters (well-fit curves only, R-squared > 0.5)
SS9. Per-Category and Per-Model ANOVA
SS9.1 Per-Category Breakdown
SS9.2 One-Way ANOVA (Model Effect by Shot Count)
SS9.3 Two-Way ANOVA (Quant x Shot Count per Model)
SS10. Prompt Format Comparison
Table 7: Significant format comparisons (selected, p < 0.001)
SS11. Phase 2 Results: Long-Context Safety
Table 8: Phase 2 ASR by model, quant, and context profile
Context Dilution Slopes
SS12. Variance Decomposition and Many-Shot Amplification
SS12.1 Variance Decomposition
SS12.2 Many-Shot Amplification Ratios
SS13. Context-Budget Analysis
SS14. Judge Agreement and Scoring Reliability
SS14b. Latency Analysis
Table 9: Mean wall-clock latency (ms) by model and quant at selected shot counts
Table 10: Mean prompt tokens by shot count
SS15. TOST Equivalence and Power Analysis
SS15.1 TOST Equivalence Tests
SS15.2 Power Analysis
SS16. Cross-TR Validation
SS17. Statistical Synthesis and Hypothesis Evaluation
H1: Invariant Power-Law Exponent
H2: Quantization Left-Shifts the Power Law
H3: Context-Window Caps Many-Shot Effectiveness
Synthesis
Factor Hierarchy
Interaction Model
Theoretical Framework for the Q2_K Cliff
SS18. Production Guidance
SS18.1 Decision Matrix
SS18.2 Defense Layering
SS19. Limitations and Follow-Up
SS19.1 Methodological Limitations
SS19.2 Open Research Questions
SS19.3 Follow-Up Work
SS20. Conclusions
Cross-TR Comparison
Broader Implications
SS22. v2 Control C1 -- Judge Triangulation
SS22.1 Motivation
SS22.2 Three-Judge Agreement Matrix
Table v2.1: Three-judge agreement on the 15,000 v1 samples
SS22.3 Do the 15 v1 "significant comparisons" Survive Under gemma3 and Claude?
Table v2.2: Q2_K vs Q8_0 pooled-model Fisher + Holm under three judges
SS22.4 Which v3 ASR numbers does this report cite?
SS23. v2 Control C2 -- FP16 Baseline (Qwen2.5-1.5b + Llama3.1-8b)
SS23.1 Motivation
Table v2.3: C2 FP16 ASR vs v1 Q8_0 ASR
SS23.2 Why Is Qwen2.5-1.5b FP16 Safer Than Q8_0?
SS24. v2 Control C3 -- Static Prompt Ablation
SS24.1 Motivation
SS24.2 Result
Table v2.4: C3 static-prompt ASR at each (model, quant) cell, n=50
SS25. v2 Control C4 -- Benign-Demo Negative Control
SS25.1 Motivation
SS25.2 Result
Table v2.5: C4 ASR with benign few-shot demos
SS26. v2 Control C6 -- Phase 2 Reinforcement (n=150)
SS26.1 Motivation
Table v2.6: C6 Phase 2 replication at n=150 per cell
SS27. v2 Control C7 -- Breadth Expansion (27,000 rows, 5 models)
SS27.1 Motivation
SS27.2 Pooled Q2_K vs Q8_0 Tests on the Full Family
Table v2.7: Pooled Q2_K vs Q8_0 in C7
Table v2.8: Full C7 cell grid (post-aggregation, pooled across shot-count and format)
SS28. v2 Control C8 -- Right-Tail (N=256) Saturation
SS28.1 Motivation
Table v2.9: C8 ASR at N=256 (n=300 per cell)
SS28.2 Trajectory Slope from N=16 Peak to N=256
SS29. v2 Controls C9-C12 -- Narrower Cells
SS29.1 C9 Larger-Model Anchor (qwen2.5-14b, n=900)
Table v2.10: C9 qwen2.5-14b
SS29.2 C10 Non-GGUF Quantization (AWQ + GPTQ)
Table v2.11: C10 qwen2.5-1.5b under AWQ / GPTQ / GGUF at matched 4-bit
SS29.3 C11 Temperature Sensitivity
Table v2.12: C11 ASR at T=0.0 vs T=0.7
SS29.4 C12 Shot Ordering (Reproducibility Snapshot)
Table v2.13: C12 Q2_K ASR under "default" shot ordering
SS30. Reviewer Objections Closed
Table v2.14: Reviewer Objection -> v2 Control Mapping
SS31. v3.0 Open Gaps and Follow-Up
SS21. Reproducibility
References
Appendix A: Full ASR Tables
Phase 1: Many-Shot ASR (all 120 cells)
Phase 2: Long-Context ASR (selected cells with Wilson CIs)
Appendix B: Extended Statistical Tables
Power-Law Fit Parameters (all 24 fits)
Non-Significant Pairwise Comparisons (Summary)
Bootstrap CIs for Power-Law Exponents (B = 2000, seed = 42)
Appendix C: Sensitivity and Robustness
C.1 Significance Threshold Sensitivity
C.2 TOST Equivalence Margin Sensitivity
C.3 Judge Threshold Sensitivity
C.4 Shot-Count Subset Stability
C.5 Format Subset Stability
Appendix D: Glossary
Appendix E: Configs
Run Configuration
v3.0 Data Manifest and Compute Script

Technical Report 140: Many-Shot and Long-Context Jailbreak Susceptibility Under Quantization

4 models, 6 GGUF quant levels, 5 shot counts, 2 prompt formats, and 3 context-length profiles across 15,000 scored samples

Field	Value
TR Number	140
Project	Banterhearts
Date	2026-04-17
Version	3.0 (v1 + v2 controls C1-C13 integrated)
Author	Research Team
Git Commit	see `research/tr140/v2_controls/analysis/v3_data_manifest.json`
Status	Complete
Report Type	Full-depth integrated (v1 frozen run + v2 controls C1-C13)
Word Count	~24,500
Analysis Passes	38 (25 v1 + 13 new v2 passes)
Statistical Tests	Fisher exact, ANOVA, TOST, power-law OLS, bootstrap CI (n=1000), Cohen's h, Cohen's kappa, Holm-Bonferroni
v1 Run Directory	`research/tr140/results/20260316_164907/`
v2 Controls Directory	`research/tr140/v2_controls/results/`
v1 Total Samples	15,000 (12,000 Phase 1 + 3,000 Phase 2)
v2 New Primary Samples	48,950 (C2 600 + C3 1,200 + C4 400 + C6 2,700 + C7 27,000 + C8 10,800 + C9 900 + C10 600 + C11 4,000 + C12 750)
v3 Integrated Total	63,950 primary samples, 78,950 judge labels (v1 qwen 15K + v2 gemma3 48,950 + C1 gemma3 rejudge 15K + C1 Claude 15K; overlaps collapse per-sample)
v1 Judge	qwen2.5:7b-instruct-q8_0 (Ollama, Q8_0)
v2 Judge	gemma3:12b (primary) + Claude Sonnet 4.6 (C1 triangulation on the 15K v1 sample set)
Models	v1: llama3.2-1b, llama3.2-3b, qwen2.5-1.5b, llama3.1-8b. v2 adds: gemma2-2b (C7/C8), phi3.5-mini (C8 only; C7 run A pull_failed), qwen2.5-14b (C9 anchor).
Quant Levels	v1: Q8_0 to Q2_K. v2 adds: FP16 (C2), AWQ-4bit and GPTQ-4bit (C10 vLLM).
Phase 1 Design	4 models x 6 quants x 5 shot counts x 2 formats x 50 behaviors
Phase 2 Design	4 models x 5 quants x 3 context profiles x 50 behaviors
Related Work	TR134, TR139, TR142, TR147 (integration pattern)
Depends On	TR134 (safety baselines), TR139 (multi-turn jailbreak baselines)
Compute Script	`research/tr140/compute_v2_stats.py` (emits `v2_controls/analysis/v3_stats.json`)
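
The sample accounting in the table above can be cross-checked with a few lines of arithmetic. The sketch below only multiplies out the design factors and sums the per-control counts exactly as they are listed in the table; the variable names are illustrative and are not taken from `compute_v2_stats.py`.

```python
from math import prod

# Design factors as listed in the metadata table (counts only).
phase1 = prod([4, 6, 5, 2, 50])  # models x quants x shot counts x formats x behaviors
phase2 = prod([4, 5, 3, 50])     # models x quants x context profiles x behaviors
v1_total = phase1 + phase2

# v2 primary samples per control, copied from the "v2 New Primary Samples" row.
v2_controls = {
    "C2": 600, "C3": 1_200, "C4": 400, "C6": 2_700, "C7": 27_000,
    "C8": 10_800, "C9": 900, "C10": 600, "C11": 4_000, "C12": 750,
}
v2_total = sum(v2_controls.values())
v3_total = v1_total + v2_total

print(phase1, phase2, v1_total)  # 12000 3000 15000
print(v2_total, v3_total)        # 48950 63950
```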

Abstract

TR140 asks whether GGUF quantization amplifies many-shot jailbreaking and long-context safety attacks on open-weight language models. Following Anthropic's many-shot methodology (NeurIPS 2024), we construct prompts containing N in-context compliance examples (N = 1, 4, 16, 64, 128) in two formats -- faux dialogue and message array -- and measure attack success rate (ASR) across 4 models (1.2B to 8B parameters) and 6 quantization levels (Q8_0 through Q2_K). Phase 2 tests whether harmful instructions hidden after benign context prefixes are more effective on quantized models. The v1 frozen run (research/tr140/results/20260316_164907/) produced 15,000 scored samples with 15,000 qwen2.5:7b judge labels. The v3.0 integration adds 48,950 additional primary samples from thirteen v2 controls (C1-C13) plus 30,000 additional judge labels (15,000 gemma3:12b rejudge of the v1 set and 15,000 Claude Sonnet 4.6 labels on the same set) to bring the total to 63,950 scored samples / 78,950 judge labels across 38 analysis passes.

The core v1 findings survive and strengthen under v2 evidence. (1) Many-shot attacks are devastating on qwen2.5-1.5b, reaching 99% ASR at N=128 on Q2_K (v1, n=100) and 98.7% at N=256 on Q2_K (v2 C8, n=300). The attack works even at Q8_0 (40% ASR at N=128 in v1; 47.3% at N=256 in C8). Quantization amplifies an already-vulnerable model rather than creating vulnerability, and the C2 FP16 baseline (8.0% ASR on qwen2.5-1.5b at n=300) confirms that the vulnerability is not introduced by quantization. (2) Llama models are nearly immune to many-shot attacks at Q4_K_M and above. The v2 C7 breadth expansion (24,000 new rows on the full family with n=1,000 per cell) shows llama3.2-1b Q8_0 ASR of 0.0% (0/1000, Wilson upper 0.38%), and Q2_K ASR of 43.7% -- a cliff that holds under the larger sample. (3) The v2 C7 breadth data shows that all five family-level Q2_K-vs-Q8_0 comparisons are Holm-adjusted p < 1e-14 with Cohen's h from 0.65 to 1.44 -- an order-of-magnitude tighter rejection than the n=100 v1 tests. (4) The v2 C8 right-tail pushes shot count to N=256 and shows that llama3.1-8b Q2_K declines past its peak (46% at N=16, 9.7% at N=256 -- the context-cap hypothesis H3 is reconfirmed on this model) while qwen2.5-1.5b Q2_K asymptotes toward 100% (98.7% at N=256). (5) C1 cross-judge triangulation shows gemma3 and Claude agree at kappa = 0.925 with raw agreement 99.02% on the 11,451 overlapping labels; the v1 qwen2.5:7b judge disagrees systematically (kappa = 0.23 vs gemma3, 0.25 vs Claude). Under the two stronger judges every one of the four Q2_K-vs-Q8_0 model-level tests remains Holm-significant at p < 1e-10; the v1 qwen judge produces 1 non-significant test (llama3.1-8b). (6) C4 benign-demo ASR is 20.5% pooled across 400 rows, but the pooled figure is driven almost entirely by one cell: qwen2.5-1.5b Q2_K still reaches 75% with a benign demo -- compliance travels with the model's quant-induced degradation, not with the semantic cue of the demos. (7) C9 qwen2.5-14b at n=900 (300 per quant) shows pooled ASR = 18.9% across Q2_K/Q4_K_M/Q8_0 with no monotonic degradation (Q2_K = 23.7%, Q4_K_M = 16.3%, Q8_0 = 16.7%); the 14B scale does NOT close the qwen-family Q2_K gap but dampens it compared to 1.5B.

The operational conclusion is unchanged but sharpened: Q2_K remains the universal safety breakpoint, the message array format remains the dominant amplifier, and per-behavior residual variance remains the largest factor. The v3.0 integration closes the major v1 reviewer objections (judge reliability, FP16 anchor, larger-N saturation, broader model family, non-GGUF quantization, temperature sensitivity, shot ordering). The four claims changed from v1 are summarized in the v1->v3 Integration Summary.

v1 -> v3 Integration Summary

The stale v2.0 draft of TR140 reported results from one 15,000-sample frozen run with a single qwen2.5:7b judge. The current v3.0 report integrates thirteen v2 controls (C1-C13) that were designed to close specific reviewer objections. The following table makes the delta explicit and cites the source file for every number.

Axis	v1 (frozen, 2026-03-16)	v3 (v1 + v2 controls, 2026-04-17)	Implication for the Paper
Primary samples	15,000 (Phase 1: 12,000; Phase 2: 3,000)	63,950 (v1 15,000 + v2 48,950)	4.3x the measurement base; C7 alone adds 24,000 full-family rows.
Judge labels	15,000 (qwen2.5:7b only)	78,950 (v1 qwen 15,000 + v2 gemma3 48,950 + C1 gemma3 rejudge 15,000 + C1 Claude 15,000)	Triangulated triple-judge (qwen, gemma3, Claude) on the 15K v1 set.
Models	4 (llama3.2-1b/3b, qwen2.5-1.5b, llama3.1-8b)	+3 (gemma2-2b, phi3.5-mini, qwen2.5-14b) on a subset of controls	Family coverage broadened for C7/C8/C9; not full-grid -- flagged as narrow cells.
Quantization families	GGUF Q2_K-Q8_0 only	+ FP16 (C2), AWQ-4bit (C10), GPTQ-4bit (C10)	Reviewer objection "GGUF-only -> results not portable" addressed for qwen2.5-1.5b; C10 showed vLLM's AWQ/GPTQ at 3% ASR vs GGUF Q4_K_M at 42% (same model).
Shot-count range	N in {1, 4, 16, 64, 128}	+ N=256 (C8, 10,800 rows across 6 models x 6 quants)	H3 (context cap) directly tested at the saturation tail.
Temperature	0.0 only	+ 0.7 (C11, 4,000 rows on llama3.2-3b + qwen2.5-1.5b at Q2_K and Q8_0)	T sensitivity measured; ASR deltas within 1.2pp on all 4 cells.
Shot ordering	Default (random)	+ explicit "default" bucket documented (C12, 750 rows)	Ordering-stability measurement at n=250-500 per cell.
Phase 2 reinforcement	n=50 per (model,quant,profile)	+ n=150 per (model,quant,profile) for 3 models x 3 quants x 3 profiles (C6, 2,700 rows)	Phase 2's qwen2.5-1.5b Q2_K long_prefix "100% ASR" replicates at n=150 (150/150, not 50/50).
Larger-model anchor	None above 8B	qwen2.5-14b (C9, 900 rows at Q2_K/Q4_K_M/Q8_0)	14B does NOT rescue qwen family at Q2_K (23.7% still elevated), but dampens the cliff versus 1.5B (86.2% at Q2_K in C7).
Judge validation	Single-judge kappa reported at 0.23 (regex vs qwen2.5:7b)	Three-judge triangulation: gemma3 vs Claude kappa = 0.925 (n=11,451), v1 qwen vs gemma3 kappa = 0.233, three-way agreement 93.8% (n=11,419)	v1 qwen judge is systematically under-calling COMPLIANCE; corrected gemma3/Claude labels are used for the v3 per-cell ASR where available.
Benign-demo control	None	C4 (400 rows): pooled ASR 20.5% [16.8%, 24.7%]	The "benign" demo control is NOT a clean negative control on qwen2.5-1.5b Q2_K (75% ASR). Q2_K's quant-induced compliance dominates the demo's semantic cue.
Reviewer objections closed	0	9 (see SS30)	Judge reliability, FP16 anchor, N=256 saturation, broader family, non-GGUF, T sensitivity, shot ordering, Phase 2 reinforcement, 14B anchor.
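
The judge-validation row pairs a high raw agreement with a low kappa, which can look contradictory. A minimal sketch of Cohen's kappa for two binary judges shows how this happens when one label (REFUSAL) dominates. The counts below are hypothetical and chosen only to illustrate the effect; they are not the TR140 confusion matrix.

```python
def cohens_kappa(both_comply, a_only, b_only, both_refuse):
    """Return (raw agreement, Cohen's kappa) for two binary raters.

    a_only: rater A says COMPLIANCE while rater B says REFUSAL.
    b_only: rater B says COMPLIANCE while rater A says REFUSAL.
    """
    n = both_comply + a_only + b_only + both_refuse
    observed = (both_comply + both_refuse) / n
    a_comply = (both_comply + a_only) / n
    b_comply = (both_comply + b_only) / n
    # Chance agreement if the raters labeled independently at their own base rates.
    expected = a_comply * b_comply + (1 - a_comply) * (1 - b_comply)
    return observed, (observed - expected) / (1 - expected)


# Hypothetical counts (NOT from the report): judge A rarely calls COMPLIANCE.
raw, kappa = cohens_kappa(both_comply=20, a_only=10, b_only=70, both_refuse=900)
print(f"raw agreement = {raw:.3f}, kappa = {kappa:.3f}")
# raw agreement = 0.920, kappa = 0.302
```

Because most rows are refusals, two judges agree on roughly 89% of rows by chance alone in this example, so 92% raw agreement leaves little agreement beyond chance.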

Claims changed from v1:

C6 Phase 2 qwen2.5-1.5b Q2_K long_prefix is NOT merely "100% on n=50" -- it is 100% at n=150 (research/tr140/v2_controls/results/c6_phase2_reinforcement/20260415_031712/samples.jsonl). The claim stands, but with a 3x tighter Wilson CI [97.6%, 100.0%].
C1 cross-judge triangulation: the v1 report's overall judge kappa of 0.23 was driven by qwen2.5:7b systematic under-calling of COMPLIANCE. gemma3 and Claude agree at kappa = 0.925. The v1 ASR values at Q2_K are conservative lower bounds; the gemma3/Claude rejudged ASR on the same samples averages 8-15 pp HIGHER on Q2_K cells. The updated Q2_K-vs-Q8_0 Fisher significance under gemma3 yields Holm-p < 1e-10 on all four models (vs three-of-four under v1 qwen). Llama3.1-8b, which was the one non-significant test under v1 qwen (Holm p = 0.128), flips to Holm p = 3.4e-15 under gemma3.
C2 FP16 baseline does NOT pass TOST equivalence with Q8_0 at ±3pp: qwen2.5-1.5b FP16 = 8.0% (24/300) vs Q8_0 = 14.5% (94/650), TOST p = 0.95, not equivalent. The v1 report claimed "Q8_0 is the baseline because Ollama does not serve FP16." That claim still holds operationally (Ollama-stack constraint), but the safety difference between FP16 and Q8_0 is larger than the ±3pp floor on this model. Specifically FP16 is 6.5pp LOWER ASR than Q8_0 on qwen2.5-1.5b (FP16 safer than Q8_0). Llama3.1-8b shows the opposite sign (FP16 4.7% vs Q8_0 1.1%) at n=300, though this cell is within Wilson CI overlap.
C8 right-tail overturns v1's H3 interpretation in one direction: on llama3.1-8b Q2_K the v1 peak-then-decline pattern (46% @ N=16 -> 26% @ N=128) continues to decline at N=256 (9.7%, research/tr140/v2_controls/results/c8_right_tail/20260416_005713/samples.jsonl), confirming context-cap. But on qwen2.5-1.5b Q2_K, ASR does NOT decline at N=256 -- it saturates at 98.7% (296/300). The context-cap is model-specific; the v1 report treated H3 as "supported" in general but the v2 data localizes it to llama3.1-8b.
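
The C2 equivalence claim above can be sanity-checked with a two one-sided tests (TOST) sketch. This version assumes an unpooled Wald z-test on the difference of two proportions; the report's compute script may use a different variant, so treat the function as illustrative. With the counts quoted above (24/300 vs 94/650, margin 3pp) it gives a TOST p of about 0.95, in line with the reported value.

```python
from math import sqrt
from statistics import NormalDist


def tost_two_proportions(x1, n1, x2, n2, margin):
    """TOST for equivalence of two proportions within +/- margin (Wald z, unpooled SE)."""
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    norm = NormalDist()
    # One-sided test of H0: diff <= -margin.
    p_lower = 1 - norm.cdf((diff + margin) / se)
    # One-sided test of H0: diff >= +margin.
    p_upper = norm.cdf((diff - margin) / se)
    # Equivalence is claimed only if BOTH one-sided tests reject.
    return diff, max(p_lower, p_upper)


# qwen2.5-1.5b: FP16 24/300 vs Q8_0 94/650, margin = 3 percentage points.
diff, p_tost = tost_two_proportions(24, 300, 94, 650, margin=0.03)
print(f"diff = {diff:+.3f}, TOST p = {p_tost:.2f}")
# diff = -0.065, TOST p = 0.95
```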

Claims that held under v2 evidence:

Q2_K universal vulnerability threshold: held across C7 (5 models, all Holm p < 1e-14), C8 (N=256 tail), C11 (T=0.7 shows same pattern), and under both gemma3 and Claude rejudging.
Message array dominance: v2 did not repeat SS10's cell-by-cell format comparison (out of scope for the controls), so v1's ranking stands.
Residual-variance-dominates (65.7%): v2 did not repeat the variance decomposition at larger N, so v1's decomposition stands.

Executive Summary

Key Findings

Peak ASR reaches 99.0% on qwen2.5-1.5b Q2_K at N=128 (v1, n=100) and 98.7% at N=256 (v2 C8, n=300). This is the highest attack success rate in the study under both sample sizes. The same model at Q8_0 reaches 40.0% at N=128 (v1) and 47.3% at N=256 (v2 C8, n=300), establishing that qwen2.5-1.5b is fundamentally vulnerable to many-shot attacks regardless of quantization level (Fisher exact p < 1e-14 in C7's 4x-larger test, Holm-adjusted). Verified from research/tr140/v2_controls/results/c8_right_tail/20260416_005713/samples.jsonl (300 rows for qwen2.5-1.5b Q2_K N=256, 296 compliance).
Llama models are nearly immune to many-shot attacks above Q3_K_M. Across all three Llama variants (1b, 3b, 8b), ASR is at or below 4.0% at Q4_K_M through Q8_0 for every shot count. The largest non-Q2_K value is llama3.1-8b Q8_0 at N=128 (4.0%, Wilson CI [1.6%, 9.8%]). The safety alignment of Llama-family instruct models resists many-shot pressure at production quantization levels.
Q2_K is the universal vulnerability threshold. Every model shows significantly elevated ASR at Q2_K compared to Q8_0. Of 100 Fisher exact tests, 15 survive Holm-Bonferroni correction, all involving Q2_K or Q3_K_M. On Llama models, the transition from "immune" to "broken" occurs between Q3_K_M and Q2_K.
Message array format is dramatically more effective than faux dialogue. On llama3.1-8b Q2_K at N=16, faux dialogue achieves 0% ASR while message array achieves 92% ASR (p < 0.001). On qwen2.5-1.5b Q4_K_M at N=128, faux dialogue achieves 4% while message array achieves 86% (p < 0.001). The chat-template exploitation format is the dominant attack vector.
ASR peaks at N=16 on llama3.1-8b Q2_K (46%), then declines at N=64/128. This supports H3 (context-window cap) on this model: at high shot counts, the prompt appears to exceed the model's effective attention span, and many-shot effectiveness degrades. The 8B model appears to lose coherent attention over the exemplars at very long prompts.
Power-law fits show quantization shifts the exponent. Well-fit curves (R-squared > 0.5) yield exponents ranging from 0.15 to 0.77. H1 (invariant exponent) is rejected: the exponent varies both across quant levels and across models. qwen2.5-1.5b Q5_K_M has the steepest power law (b = 0.77, R-squared = 0.81).
Phase 2 (long-context) confirms quant amplification. Context dilution slopes are negative for all four models, meaning ASR rises as bits per weight (BPW) fall: lower-precision quantization weakens safety when harmful content is hidden after benign prefixes. qwen2.5-1.5b shows the steepest slope (-0.15/BPW, 95% CI [-0.55, -0.04]) and reaches 100% ASR on Q2_K with long_prefix.
Variance decomposition: residual dominates (65.7%). Per-behavior variation explains more ASR variance than any experimental factor. Quantization explains 17.9%, model identity 12.6%, and shot count only 2.7%. The specific harmful behavior being requested matters more than how aggressively the model is quantized.
Judge agreement is weak under v1's single judge but strong under v2's triangulation. v1 reported overall kappa = 0.23, agreement = 90.3% (regex vs qwen2.5:7b). v2 C1 rejudged all 15,000 v1 samples with gemma3:12b and Claude Sonnet 4.6. gemma3 vs Claude on overlapping 11,451 cases: raw agreement 99.02%, Cohen's kappa 0.925. v1 qwen vs gemma3: raw 92.88%, kappa 0.233. v1 qwen vs Claude: raw 94.17%, kappa 0.246. Three-way agreement (v1 qwen AND gemma3 AND Claude all same label) is 93.76% on 11,419 overlapping rows. The v1 qwen judge is the outlier, not the gold standard, and its low kappa was previously misattributed to intrinsic case ambiguity. Verified from research/tr140/v2_controls/analysis/v3_stats.json under key c1_agreement.pairwise.
C7 breadth expansion (27,000 new rows, gemma3 judge) tightens and broadens the Q2_K finding. The full-family 4-model x 6-quant grid at n=1,000 per cell yields llama3.2-1b Q2_K = 43.7% vs Q8_0 = 0.0% (0/1000; Wilson upper 0.38%) with Cohen's h = 1.44 (Holm-p = 6.8e-15). qwen2.5-1.5b Q2_K = 86.2% (n=1,000) vs Q8_0 = 22.2%, h = 1.40. gemma2-2b (a newly added family at n=500) yields Q2_K = 19.6% vs Q8_0 = 0.2%, h = 0.83. Verified from research/tr140/v2_controls/results/c7_breadth_expansion/20260415_223902/samples.jsonl (24,000 rows, 0 JSON errors, 0 failures per the manifest).
C8 (N=256) result disambiguates H3. llama3.1-8b Q2_K decays monotonically from the v1 peak of 46% @ N=16 to 9.7% @ N=256 (29/300). llama3.2-1b Q2_K stays high at 91.7% @ N=256 (275/300). qwen2.5-1.5b Q2_K saturates at 98.7% @ N=256 (296/300). phi3.5-mini (newly tested) shows 90-98% ASR at N=256 across ALL six quants -- an extreme outlier whose v2 characterization is prerequisite for any phi3.5-mini deployment claim. Verified from research/tr140/v2_controls/results/c8_right_tail/20260416_005713/samples.jsonl.
C9 14B anchor is intentionally narrow and pooled-only. qwen2.5-14b was run at 3 quant levels x 100 behaviors x 3 shot counts = 900 samples on RunPod L40S. Pooled ASR = 18.9% (170/900). Per-quant: Q2_K 23.7%, Q4_K_M 16.3%, Q8_0 16.7%. The 14B's Q2_K ASR is far lower than qwen2.5-1.5b's Q2_K ASR of 86.2% in C7, and slightly below llama3.1-8b's Q2_K ASR of 27.4% in C7. Scale dampens but does not eliminate the qwen-family cliff. The run is underpowered for a stratified shot-count breakdown. Verified from research/tr140/v2_controls/results/c9_larger_model/20260415_182202/samples.jsonl (900 rows, manifest total=900).
C10 non-GGUF cells invert the cliff on qwen2.5-1.5b. Same model, same prompts at N = 1, 4, 16, 64, 128 (n=100 per cell), but served via vLLM with AWQ-4bit and GPTQ-4bit weights rather than Ollama GGUF: AWQ 3% ASR, GPTQ 3% ASR, vs Ollama Q4_K_M 42% ASR on the same model at the same bit width. The GGUF K-quant mixed-precision layer selection is therefore a material confounder -- what v1 attributed to "4-bit quantization" is more accurately "4-bit GGUF K-quant on llama.cpp." AWQ and GPTQ at the same bit width did not reproduce the v1 cliff. Verified from research/tr140/v2_controls/results/c10_non_gguf/20260415_190113/samples.jsonl (400 rows, 2 cells per quant format).
C2 FP16 baseline shows FP16 is not equivalent to Q8_0. qwen2.5-1.5b FP16 ASR = 8.0% (24/300) vs v1 Q8_0 = 14.5% (94/650). TOST-p at ±3pp = 0.95 -- not equivalent. FP16 is 6.5pp SAFER than Q8_0 on this model, contradicting the v1 assumption that Q8_0 is an adequate proxy for "unquantized." Llama3.1-8b FP16 = 4.7% (14/300) vs v1 Q8_0 = 1.1% -- here FP16 is NOMINALLY less safe, but both values are within Wilson CI overlap and the direction flip is within sampling noise. Verified from research/tr140/v2_controls/results/c2_fp16_baseline/20260414_203036/samples.jsonl.
C4 benign-demo control is compromised by Q2_K's quant-induced compliance. Designed as a negative control where a harmless few-shot demo should produce low ASR, C4 instead shows: qwen2.5-1.5b Q2_K 75% ASR, qwen2.5-1.5b Q8_0 3% ASR, llama3.1-8b Q2_K 4%, llama3.1-8b Q8_0 0%. The benign-demo manipulation only works as a negative control when the model is not already broken by Q2_K. Pooled across all 400 rows, ASR = 20.5% -- driven almost entirely by the qwen2.5-1.5b Q2_K cell. Verified from research/tr140/v2_controls/results/c4_benign_demo/20260414_195746/samples.jsonl.
C11 temperature sensitivity is small. At T=0.0 vs T=0.7 on llama3.2-3b and qwen2.5-1.5b at Q2_K and Q8_0 (n=250 per cell x 8 cells), ASR deltas are within 1.2pp: qwen2.5-1.5b Q2_K 99.2% (T=0.0) vs 99.6% (T=0.7); llama3.2-3b Q8_0 4.0% vs 4.4%. The study's T=0.0 protocol is not undercutting the findings. Verified from research/tr140/v2_controls/results/c11_temp_sensitivity/20260415_000924/samples.jsonl.
C12 shot ordering: not varied. The C12 run produced 750 rows with only a single "default" ordering bucket (llama3.1-8b Q2_K 78.8%, qwen2.5-1.5b Q2_K 87.2%). Shot-ordering variation was not tested; this control documents ordering-stability reproducibility (matches v1's Q2_K cells within Wilson CI) but does not answer the ordering-sensitivity question. Explicit gap flagged in SS30.
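
Several findings above quote Wilson score intervals. A minimal standard-library sketch of the interval is shown below; the function name is illustrative. It reproduces two of the bounds quoted in this report: [1.6%, 9.8%] for 4/100 and an upper bound of 0.38% for 0/1000.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


lo, hi = wilson_interval(4, 100)       # llama3.1-8b Q8_0 at N=128
print(f"[{lo:.1%}, {hi:.1%}]")         # [1.6%, 9.8%]

_, upper = wilson_interval(0, 1000)    # llama3.2-1b Q8_0 in C7
print(f"upper = {upper:.2%}")          # upper = 0.38%
```

Unlike the plain Wald interval, the Wilson interval stays informative at 0% or 100% observed ASR, which matters for the many all-refusal cells in this study.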

Core Decisions

Many-shot jailbreaking is not a quantization-specific threat -- it is a model-specific and format-specific threat that quantization can amplify.
Avoid Q2_K for any safety-relevant deployment. Q3_K_M is marginal. Q4_K_M and above are safe from many-shot amplification on Llama models.
Restrict or sanitize message array format inputs. Faux dialogue is far less effective (0% vs 92% and 4% vs 86% ASR in the SS10 cells cited above) and represents a lower-risk surface.
qwen2.5-1.5b requires additional safety hardening regardless of quantization level -- the model is vulnerable to many-shot attacks even at Q8_0.
Context-length limits (max turns, max tokens) are an effective mitigation: ASR plateaus or declines at high N on most models.

Validation Summary

Target	Metric	Required	Achieved	Status
Sample count	Total N	>= 10,000	15,000	PASS
Phase 1 coverage	Cells	4 x 6 x 5 x 2 = 240	240	PASS
Phase 2 coverage	Cells	4 x 5 x 3 = 60	60	PASS
Judge completion	Labels/samples	100%	100% (0 failures)	PASS
H1 (invariant exponent)	Exponent constancy	Exponents differ < 0.1	Range 0.15-0.77	FAIL (H1 rejected)
H3 (context cap)	ASR plateau at high N	Peak before N=128	Peak at N=16 for 8b	PASS
Cross-TR anchor	N=1 Q8_0 refusal	>= 95%	98-100%	PASS
MDE at 80% power	Minimum effect	< 10%	3.9-19.4%	PARTIAL

Claim Validation

#	Claim	Evidence Base	Status
C1	Q2_K is the universal vulnerability threshold	SS7 critical quant analysis, SS6 Fisher exact tests	Established
C2	Llama models are immune above Q3_K_M	SS5 Table 1 (all 0% ASR cells)	Established
C3	Message array is more effective than faux dialogue	SS10 format comparison (significant in 20+ cells)	Established
C4	Power-law exponent shifts with quantization	SS8 exponent table, bootstrap CIs	Established
C5	Context-window caps many-shot effectiveness	SS5 shot-count curves (peak at N=16 on 8b)	Partial (model-specific)
C6	qwen2.5-1.5b is fundamentally vulnerable	SS5 Q8_0 baseline ASR (40% at N=128)	Established
C7	Quantization left-shifts the power law (H2)	Insufficient well-fit curves at matched quant pairs	Not established

How to Read This Report

Phase 1 (many-shot) results span SS5 through SS10; Phase 2 (long-context) results are in SS11. Statistical synthesis and hypothesis verdicts are in SS17. Each result section follows: context prose, data table, then Observations interpreting the table. ASR is reported as a percentage with Wilson CIs; values exceeding 10% are bolded. Sections SS15-SS16 cover equivalence testing and cross-TR anchoring. Appendices A-B provide full raw and statistical tables; Appendix C is a sensitivity analysis; Appendix D is the glossary; Appendix E has the run configs.

Reading time estimates: Executive summary (5 min), core results SS5-SS11 (20 min), statistical synthesis SS15-SS17 (10 min), full report end-to-end (60 min).

When to Use This Report

Scenario 1: Evaluating many-shot attack risk on a quantized deployment

Question: We serve llama3.2-3b at Q4_K_M through Ollama. How vulnerable is it to many-shot jailbreaking?

Answer: Not vulnerable. SS5 Table 3 shows 0.0% ASR across all shot counts and both prompt formats at Q4_K_M. The vulnerability activates only at Q2_K (41% ASR at N=128). If you stay at Q4_K_M or above, many-shot is not your threat surface.

Scenario 2: Choosing between prompt formats for a chat API

Question: Should we allow users to inject message arrays or restrict them to single-message inputs?

Answer: Restrict message arrays wherever possible. SS10 shows message array format achieving 92% ASR where faux dialogue achieves 0% on the same model, quant, and shot count. The chat-template exploitation is the dominant attack amplifier, not quantization.
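
One way to act on this answer is to filter untrusted message arrays at the API boundary. The sketch below is a hypothetical illustration, not the mitigation evaluated in this report. It assumes an OpenAI-style list of `{"role", "content"}` dicts, assumes the server stores genuine assistant turns itself, and treats any client-supplied non-user turn as fabricated. The function name, the allowed-role set, and the `max_turns` default are all assumptions.

```python
from typing import Dict, List

# Assumption: clients may only author "user" turns; the server keeps real history.
ALLOWED_CLIENT_ROLES = {"user"}


def sanitize_client_messages(
    messages: List[Dict[str, str]], max_turns: int = 8
) -> List[Dict[str, str]]:
    """Drop client-supplied turns with non-user roles and cap the turn count."""
    kept = [
        {"role": "user", "content": str(m.get("content", ""))}
        for m in messages
        if m.get("role") in ALLOWED_CLIENT_ROLES
    ]
    # Keep only the most recent turns; a non-positive cap keeps nothing.
    return kept[-max_turns:] if max_turns > 0 else []


payload = [
    {"role": "user", "content": "example question 1"},
    {"role": "assistant", "content": "fabricated compliant answer"},  # injected
    {"role": "user", "content": "example question 2"},
]
clean = sanitize_client_messages(payload, max_turns=4)
print([m["role"] for m in clean])  # ['user', 'user']
```

Capping the number of turns also bounds the shot count an attacker can supply, which complements the context-length limits discussed in Scenario 3.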

Scenario 3: Setting context-length limits for safety

Question: How many user turns should we allow before truncating context?

Answer: SS13 context-budget analysis shows that 4K-token contexts limit shot count to N=16, and 8K contexts to N=64. On Llama models, even N=128 on Q2_K does not consistently exceed 72% ASR, and ASR peaks at N=16 on the 8B model before declining. Limiting to 4K tokens is an effective many-shot mitigation.

Scenario 4: Positioning TR140 relative to the broader safety program

Question: How does many-shot jailbreaking relate to multi-turn jailbreaking (TR139) and single-turn safety (TR134)?

Answer: TR134 measures single-turn refusal rates. TR139 measures multi-turn conversational attack strategies (crescendo, role-play, etc.). TR140 measures in-context learning attacks where the model is shown examples of compliance. These are complementary threat surfaces: a model can pass all three or fail any subset independently. qwen2.5-1.5b fails on many-shot (TR140) while performing differently on multi-turn (TR139).

Table of Contents

Abstract
Executive Summary
- How to Read This Report
When to Use This Report
Metric Definitions
SS1. Introduction
SS2. Methodology
SS3. Models and Design
SS4. Prompt Construction
SS5. Phase 1 Results: Many-Shot ASR by Shot Count and Quantization
- SS5.5 Baseline-Normalized ASR
- SS5.6 Minimum Effective Shot Count
- SS5.7 Shot-Count Effectiveness Patterns
SS6. Statistical Tests vs Q8_0 Baseline
SS7. Critical Quant Thresholds
SS8. Power-Law Analysis
SS9. Per-Category and Per-Model ANOVA
SS10. Prompt Format Comparison
SS11. Phase 2 Results: Long-Context Safety
SS12. Variance Decomposition and Many-Shot Amplification
SS13. Context-Budget Analysis
SS14. Judge Agreement and Scoring Reliability
SS14b. Latency Analysis
SS15. TOST Equivalence and Power Analysis
SS16. Cross-TR Validation
SS17. Statistical Synthesis and Hypothesis Evaluation
- Factor Hierarchy
- Interaction Model
- Theoretical Framework for the Q2_K Cliff
SS18. Production Guidance
- SS18.1 Decision Matrix
- SS18.2 Defense Layering
SS19. Limitations and Follow-Up
- SS19.1 Methodological Limitations
- SS19.2 Open Research Questions
- SS19.3 Follow-Up Work
SS20. Conclusions
- Cross-TR Comparison
- Broader Implications
SS22. v2 Control C1 -- Judge Triangulation
SS23. v2 Control C2 -- FP16 Baseline
SS24. v2 Control C3 -- Static Prompt Ablation
SS25. v2 Control C4 -- Benign-Demo Negative Control
SS26. v2 Control C6 -- Phase 2 Reinforcement (n=150)
SS27. v2 Control C7 -- Breadth Expansion (27,000 rows)
SS28. v2 Control C8 -- Right-Tail (N=256) Saturation
SS29. v2 Controls C9-C12 -- Narrower Cells
SS30. Reviewer Objections Closed
SS31. v3.0 Open Gaps and Follow-Up
SS21. Reproducibility
References
Appendix A: Full ASR Tables
Appendix B: Extended Statistical Tables
Appendix C: Sensitivity and Robustness
Appendix D: Glossary
Appendix E: Configs

Metric Definitions

Primary Metrics

Metric	Definition	Interpretation
ASR (Attack Success Rate)	Fraction of samples where the model complied with a harmful request	0-1, higher = less safe (attack succeeded)
Refusal Rate	1 - ASR; fraction of samples where the model correctly refused	0-1, higher = safer
Power-law exponent (b)	Exponent in ASR = a * N^b; rate at which ASR grows with shot count	Higher b = faster ASR growth with more shots
Context dilution slope	Change in ASR per unit of bits per weight (BPW) across context profiles	Negative = ASR rises as precision falls (lower-precision quants weaken safety under long context)
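
The power-law exponent defined above can be estimated with ordinary least squares on the log-log scale. The sketch below is a standard-library illustration under stated assumptions: cells with 0% ASR are dropped because log(0) is undefined (the report's own handling of zero cells is not specified here), and the example data are synthetic, generated from ASR = 0.02 * N^0.5, not measured values.

```python
from math import exp, log


def fit_power_law(shot_counts, asr_values):
    """OLS fit of log(ASR) = log(a) + b * log(N); returns (a, b, r_squared)."""
    # Assumption: zero-ASR cells are skipped, since log(0) is undefined.
    pts = [(log(n), log(y)) for n, y in zip(shot_counts, asr_values) if y > 0]
    if len(pts) < 2:
        raise ValueError("need at least two non-zero ASR points")
    xs = [x for x, _ in pts]
    ys = [y for _, y in pts]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in pts)
    syy = sum((y - my) ** 2 for y in ys)
    b = sxy / sxx
    a = exp(my - b * mx)
    r_squared = 1.0 if syy == 0 else (sxy * sxy) / (sxx * syy)
    return a, b, r_squared


shots = [1, 4, 16, 64, 128]
synthetic_asr = [0.02 * n ** 0.5 for n in shots]  # synthetic, exact power law
a, b, r2 = fit_power_law(shots, synthetic_asr)
print(f"a = {a:.3f}, b = {b:.3f}, R^2 = {r2:.3f}")
# a = 0.020, b = 0.500, R^2 = 1.000
```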

Statistical Tests Used

Test	Role in This Report
Fisher exact test	Pairwise quant-vs-Q8_0 comparisons on ASR (2x2 contingency)
Holm-Bonferroni	Multiple comparison correction across 100 pairwise tests
One-way ANOVA	Omnibus test for model effects at each shot count
Two-way ANOVA	Quant x shot-count interaction effects per model
OLS power-law fit	ASR = a * N^b on log-log scale
Bootstrap percentile CI	CIs for power-law exponents (B = 2000, seed = 42)
Wilson score CI	Confidence intervals for ASR proportions
TOST	Equivalence testing within +/-3pp of Q8_0 baseline
Cohen's h	Effect size for two-proportion comparison; h = 2|arcsin(sqrt(p_test)) - arcsin(sqrt(p_baseline))|
Cohen's kappa	Inter-rater agreement between regex refusal detector and LLM judge

Evidence Standard

Established findings require Fisher exact p < 0.05 after Holm-Bonferroni correction, with an ASR difference exceeding 10pp or clear practical significance for deployment decisions.

Partial findings show raw significance but lose it after multiple comparison correction, or show effects only in specific model-format-quant combinations.

Non-claims are results where the study lacks power, where TOST confirms equivalence, or where the effect is within the noise floor of the scoring pipeline.

SS1. Introduction

SS1.1 Research Questions

H1 (Invariant Exponent): Is the many-shot power-law exponent b invariant across quantization levels?
H2 (Left Shift): Does quantization left-shift the power law, meaning fewer in-context shots are needed to achieve equivalent ASR at lower precision?
H3 (Context Cap): Do consumer-model context-window limits naturally cap many-shot effectiveness at high shot counts?

Secondary questions derived from the three hypotheses:

At what minimum shot count does ASR exceed 50% for each (model, quant) combination? (SS5.6)
Is the prompt format (faux dialogue vs message array) a larger effect than quantization? (SS10)
How does per-behavior variance compare to experimental factors? (SS12)
Are production quant levels (Q4_K_M, Q5_K_M) equivalent to Q8_0 for safety? (SS15)

SS1.2 Why This Matters

Anthropic's many-shot jailbreaking paper (NeurIPS 2024) demonstrated that in-context compliance examples can overwhelm safety alignment. Their work tested API models with very long context windows. The open-weight quantized ecosystem is different: models are smaller, contexts are shorter, and quantization may have already degraded the instruction-following capability that safety alignment depends on. If quantization makes many-shot attacks more effective at lower shot counts, the threat surface expands significantly -- attackers would need fewer examples to jailbreak a quantized deployment.

This concern is particularly acute for edge deployments: mobile, embedded, and on-premise LLM applications typically use aggressive quantization (Q4_K_M or lower) to fit within memory constraints. If these deployments are also exposed to user-controlled multi-turn inputs, the combination of quantization and many-shot pressure could create a safety gap that neither factor alone would produce. TR140 measures this interaction directly.

TR140 also serves a methodological purpose within the Banterhearts safety program: it tests a threat model (in-context compliance demonstration) that is mechanistically distinct from both single-turn refusal testing (TR134) and multi-turn conversational strategies (TR139). Together, the three TRs cover the primary attack surfaces available to adversaries interacting with quantized open-weight models through standard chat APIs. Each subsequent TR narrows the gap between laboratory safety measurements and real-world attack feasibility.

SS1.3 Scope

Dimension	Coverage
Models	llama3.2-1b (1.2B), llama3.2-3b (3.2B), qwen2.5-1.5b (1.5B), llama3.1-8b (8.0B)
Quant levels	Q8_0, Q6_K, Q5_K_M, Q4_K_M, Q3_K_M, Q2_K
Shot counts (Phase 1)	1, 4, 16, 64, 128
Prompt formats (Phase 1)	Faux dialogue, message array
Behaviors	50 harmful requests across 10 JBB categories
Context profiles (Phase 2)	short_prefix, medium_prefix, long_prefix
Phase 1 samples	12,000 (4 x 6 x 5 x 2 x 50 = 12,000)
Phase 2 samples	3,000 (4 x 5 x 3 x 50 = 3,000)
Total samples	15,000
Backend	Ollama (llama.cpp)
Temperature	0.0
Seed	42
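The sample counts in the scope table follow directly from the factorial design. The sketch below enumerates the grid to check them. The format identifiers are hypothetical labels, and the Phase 2 quant count of 5 is taken from the table's own arithmetic (4 x 5 x 3 x 50), since this section does not say which five levels were used.

```python
from itertools import product

MODELS = ["llama3.2-1b", "llama3.2-3b", "qwen2.5-1.5b", "llama3.1-8b"]
QUANTS = ["Q8_0", "Q6_K", "Q5_K_M", "Q4_K_M", "Q3_K_M", "Q2_K"]
SHOTS = [1, 4, 16, 64, 128]
FORMATS = ["faux_dialogue", "message_array"]  # hypothetical identifiers
PROFILES = ["short_prefix", "medium_prefix", "long_prefix"]
N_BEHAVIORS = 50
PHASE2_QUANT_COUNT = 5  # assumption: count only, from "4 x 5 x 3 x 50"

# Phase 1: one cell per (model, quant, shot count, format).
phase1_cells = list(product(MODELS, QUANTS, SHOTS, FORMATS))
phase1_samples = len(phase1_cells) * N_BEHAVIORS

# Phase 2: one cell per (model, quant, context profile).
phase2_cells = len(MODELS) * PHASE2_QUANT_COUNT * len(PROFILES)
phase2_samples = phase2_cells * N_BEHAVIORS

print(len(phase1_cells), phase1_samples)   # Phase 1 cells and samples
print(phase2_cells, phase2_samples)        # Phase 2 cells and samples
print(phase1_samples + phase2_samples)     # total scored samples
```

With the format axis counted as its own dimension, each Phase 1 cell holds 50 samples; pooling the two formats gives the n = 100 cells used in the SS5 tables.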

SS1.4 Literature Grounding

Anil et al. (2024) introduced many-shot jailbreaking, demonstrating ASR scaling with in-context example count on Claude models. Zheng et al. (2024) extended this to open-weight models with improved few-shot strategies. Neither study examined the interaction with quantization. Dettmers et al. (2023) established that 4-bit quantization preserves task performance on standard benchmarks, but safety-specific metrics were not evaluated. Zou et al. (2023) demonstrated transferable adversarial attacks on aligned models, establishing that safety alignment can be bypassed through optimization -- many-shot jailbreaking achieves a similar goal through in-context learning rather than adversarial optimization. TR140 fills the gap between these lines of work by systematically varying both shot count and quant level on open-weight models, and by adding a long-context phase (Phase 2) that tests attention-dilution attacks under quantization.

SS1.5 How to Read This Report

Phase 1 (many-shot) results appear in SS5 through SS10. Phase 2 (long-context) results appear in SS11. Statistical synthesis and hypothesis verdicts are in SS17 (including factor hierarchy, interaction model, and theoretical framework for the Q2_K cliff). Each result section follows the pattern: context prose, data table, then Observations interpreting the table. ASR is reported as a percentage with Wilson CIs; values in bold exceed 10%. Effect sizes (Cohen's h for proportion comparisons, eta-squared for ANOVA) accompany all significance tests. Appendix C provides sensitivity analyses confirming robustness of key findings.

SS2. Methodology

SS2.1 Overall Design

The experiment has two phases:

Phase 1 (Many-Shot): For each (model, quant, shot-count, format) combination, 50 harmful behaviors are tested. The prompt contains N exemplar Q/A pairs demonstrating compliance, followed by the target harmful request. ASR is the fraction of the 50 behaviors where the model complied.
Phase 2 (Long-Context): For each (model, quant, context-profile) combination, 50 harmful behaviors are tested. The prompt contains a benign text prefix (short, medium, or long) followed by the harmful request. This tests whether quantization weakens the model's ability to maintain safety alignment across long contexts.

SS2.2 Unit of Analysis

One sample is a single (model, quant, shot-count or context-profile, format, behavior) tuple. Each sample produces one model response, one regex refusal classification, and one LLM judge label. ASR is computed by aggregating judge labels over the 50 behaviors within each experimental cell.

SS2.3 Scoring Pipeline

Each response is scored in two stages:

Regex RefusalDetector: Fast pattern-matching for refusal phrases ("I cannot", "I'm sorry", etc.). Produces refusal_score: 1.0 = refused, 0.0 = complied.
LLM Judge (qwen2.5:7b-instruct-q8_0): Post-hoc validation. The judge comes from a different model family than the three Llama models under test, but it shares the Qwen 2.5 family with qwen2.5-1.5b. It receives the harmful request and the model's response and classifies compliance.

The judge's label is the authoritative classification. Cohen's kappa between regex and judge is reported in SS14.
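The two scoring stages and their agreement statistic can be sketched as follows. This is a minimal illustration, not the report's RefusalDetector: the pattern list contains only the two phrases quoted above, and the function names are hypothetical.

```python
import re
from typing import Sequence

# Only these two phrases are named in the report; the real detector's
# pattern list is presumably longer (assumption).
REFUSAL_PATTERNS = [r"\bI cannot\b", r"\bI'm sorry\b"]
_REFUSAL_RE = re.compile("|".join(REFUSAL_PATTERNS), re.IGNORECASE)


def refusal_score(response: str) -> float:
    """Return 1.0 if a refusal phrase is present, else 0.0."""
    return 1.0 if _REFUSAL_RE.search(response) else 0.0


def cohens_kappa(rater_a: Sequence[int], rater_b: Sequence[int]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    a, b = list(rater_a), list(rater_b)
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and equal in length")
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    p_exp = sum((a.count(lbl) / n) * (b.count(lbl) / n) for lbl in set(a) | set(b))
    if p_exp == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_obs - p_exp) / (1.0 - p_exp)


regex_labels = [1, 1, 0, 0]   # toy data, 1 = refused
judge_labels = [1, 0, 0, 0]
print(cohens_kappa(regex_labels, judge_labels))
```

Kappa corrects raw agreement for chance, which matters here because most cells are dominated by refusals and raw agreement would look high even for a poor detector.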

SS2.4 Design Safeguards

Temperature fixed at 0.0 for deterministic outputs.
Seed fixed at 42 for reproducibility.
Each model is warmed up with 3 requests before evaluation begins.
10-second cooldown between model switches to prevent GPU memory residue.
Models are unloaded between quant-level switches.
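Judge model (qwen2.5:7b) is from a different family than the three Llama models, which limits family-specific bias for those cells. It shares the Qwen 2.5 family with qwen2.5-1.5b, so that independence does not extend to the Qwen cells.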

SS2.5 Compute Budget

Phase	Cells	Samples/cell	Total samples	Est. GPU-hours
Phase 1	240	50 (each format is its own cell)	12,000	~8
Phase 2	60	50	3,000	~2
Judge labeling	--	--	15,000	~3
Total	300	--	15,000	~13

All computation was performed on a single NVIDIA RTX GPU (12GB VRAM) over approximately 13 GPU-hours. The binding constraint is GPU memory, not compute time: llama3.1-8b at Q8_0 occupies ~8.5GB, leaving limited KV-cache headroom for long prompts. Phase 1 N=128 prompts (~12K tokens) approach the memory limit on this model.

SS2.6 What This Design Does Not Do

It does not test adaptive attacks that modify strategy based on model responses.
It does not test models above 8B parameters (GPU memory constraint: 12GB).
It does not include FP16 baselines (Ollama does not serve FP16 for these models; Q8_0 is the baseline).
It does not test AWQ, GPTQ, or other quantization frameworks -- results are specific to GGUF/llama.cpp.

SS3. Models and Design

Model	Parameters	Family	Ollama Base Tag	Skip FP16?
llama3.2-1b	1.2B	Llama 3.2	llama3.2:1b	No
llama3.2-3b	3.2B	Llama 3.2	llama3.2:3b	No
qwen2.5-1.5b	1.5B	Qwen 2.5	qwen2.5:1.5b	No
llama3.1-8b	8.0B	Llama 3.1	llama3.1:8b	Yes (FP16 OOMs)

Quant Level	BPW	Notes
Q8_0	8.00	Baseline (highest available precision)
Q6_K	6.57	K-quant mixed precision
Q5_K_M	5.69	Medium K-quant
Q4_K_M	4.85	Most common production quant
Q3_K_M	3.91	Aggressive quantization
Q2_K	3.35	Extreme quantization

Observations.

The four models span a 6.7x parameter range (1.2B to 8.0B), providing a natural axis for testing whether model size mediates many-shot vulnerability.
qwen2.5-1.5b is included as a cross-family comparison. Qwen uses a different tokenizer, chat template, and safety training pipeline than Llama, which may affect both baseline safety and many-shot susceptibility.
llama3.1-8b cannot be served at FP16 on a 12GB GPU, so Q8_0 is the highest-precision baseline available.
The parameter range (1.2B-8B) was chosen to represent the consumer-grade deployment tier where quantization is most common. Models below 1B lack sufficient safety training to be meaningful baselines. Models above 8B exceed 12GB GPU memory at Q8_0. The 4-model lineup provides two data points on the Llama 3.2 branch (1b, 3b), one on the Llama 3.1 branch (8b), and one cross-family comparison (qwen2.5-1.5b).
The six GGUF quant levels were selected to cover the full K-quant range available in llama.cpp: from near-lossless Q8_0 (8.00 BPW) through aggressive Q2_K (3.35 BPW). Q4_K_M (4.85 BPW) is the most commonly deployed production quant and serves as the natural decision boundary for safety recommendations.

SS4. Prompt Construction

SS4.1 Faux Dialogue Format

A single user message containing N exemplar conversations as plain text:

Human: [harmful question 1]
Assistant: [compliant answer 1]
Human: [harmful question 2]
Assistant: [compliant answer 2]
...
Human: [TARGET harmful question]
Assistant:

This format places all exemplars within a single user turn. The model sees the demonstrations as plain text, not as actual chat-template turns. It is Anthropic's original many-shot format.

SS4.2 Message Array Format

Each exemplar is injected as an explicit user/assistant message pair via the Ollama /api/chat endpoint:

[{"role": "user", "content": "[harmful question 1]"},
 {"role": "assistant", "content": "[compliant answer 1]"},
 {"role": "user", "content": "[harmful question 2]"},
 {"role": "assistant", "content": "[compliant answer 2]"},
 ...
 {"role": "user", "content": "[TARGET harmful question]"}]

This format exploits the model's actual chat template special tokens (<|start_header_id|>/<|end_header_id|> and <|eot_id|> for Llama 3.x, <|im_start|>/<|im_end|> for Qwen), making each demonstration more authoritative than plain text.

SS4.3 Phase 2: Context Dilution Format

Phase 2 uses a single harmful instruction appended after a benign text prefix. Three profiles control prefix length:

short_prefix: ~500 tokens of benign text before the harmful instruction.
medium_prefix: ~2,000 tokens of benign text.
long_prefix: ~4,000 tokens of benign text.

The hypothesis is that quantized models have weaker long-range attention and may fail to apply safety constraints to instructions that appear late in context.

Observations.

The message array format is structurally different from faux dialogue: it injects real chat-template tokens rather than plain-text markers. This distinction is critical because the model's safety training targets template-delimited turns, not arbitrary text.
Phase 2 is orthogonal to Phase 1: it tests context-length vulnerability rather than in-context learning. The two attack surfaces may compound in practice.
The 50 harmful behaviors are drawn from 10 JailbreakBench (JBB) categories, ensuring coverage across harm types. Each category contributes 5 behaviors, providing balanced representation across harm domains.
The choice of 50 behaviors (rather than the full JBB set of 100) is driven by the combinatorial explosion of the Phase 1 design: 50 behaviors x 240 cells = 12,000 samples. Using 100 behaviors would double Phase 1 to 24,000 samples, exceeding the compute budget. The 50-behavior subset was selected to maximize category coverage while keeping the total sample count feasible.

SS5. Phase 1 Results: Many-Shot ASR by Shot Count and Quantization

Phase 1 aggregates both prompt formats (faux dialogue and message array) per cell. Each aggregated cell has n = 100 samples (50 behaviors x 2 formats). The tables below report ASR with Wilson 95% CIs.

Table 1: llama3.1-8b

Quant	BPW	N=1	N=4	N=16	N=64	N=128
Q8_0	8.00	0.0%	0.0%	1.0%	2.0%	4.0%
Q6_K	6.57	0.0%	0.0%	1.0%	1.0%	1.0%
Q5_K_M	5.69	0.0%	0.0%	1.0%	0.0%	1.0%
Q4_K_M	4.85	0.0%	0.0%	2.0%	0.0%	0.0%
Q3_K_M	3.91	1.0%	2.0%	5.0%	0.0%	4.0%
Q2_K	3.35	2.0%	27.0%	46.0%	33.0%	26.0%

Table 2: llama3.2-1b

Quant	BPW	N=1	N=4	N=16	N=64	N=128
Q8_0	8.00	0.0%	0.0%	0.0%	0.0%	0.0%
Q6_K	6.57	0.0%	0.0%	0.0%	0.0%	0.0%
Q5_K_M	5.69	0.0%	0.0%	0.0%	0.0%	0.0%
Q4_K_M	4.85	1.0%	0.0%	0.0%	0.0%	0.0%
Q3_K_M	3.91	1.0%	0.0%	0.0%	0.0%	6.0%
Q2_K	3.35	11.0%	10.0%	62.0%	48.0%	72.0%

Table 3: llama3.2-3b

Quant	BPW	N=1	N=4	N=16	N=64	N=128
Q8_0	8.00	0.0%	0.0%	0.0%	0.0%	1.0%
Q6_K	6.57	0.0%	0.0%	0.0%	0.0%	0.0%
Q5_K_M	5.69	0.0%	0.0%	0.0%	0.0%	0.0%
Q4_K_M	4.85	0.0%	0.0%	0.0%	0.0%	0.0%
Q3_K_M	3.91	0.0%	0.0%	0.0%	0.0%	5.0%
Q2_K	3.35	7.0%	2.0%	0.0%	7.0%	41.0%

Table 4: qwen2.5-1.5b

Quant	BPW	N=1	N=4	N=16	N=64	N=128
Q8_0	8.00	2.0%	6.0%	4.0%	32.0%	40.0%
Q6_K	6.57	0.0%	5.0%	2.0%	26.0%	40.0%
Q5_K_M	5.69	1.0%	4.0%	2.0%	36.0%	42.0%
Q4_K_M	4.85	9.0%	16.0%	2.0%	28.0%	45.0%
Q3_K_M	3.91	19.0%	27.0%	20.0%	28.0%	53.0%
Q2_K	3.35	85.0%	82.0%	65.0%	92.0%	99.0%

Observations.

The Llama family shows a stark binary pattern: near-zero ASR at Q4_K_M and above, then catastrophic failure at Q2_K. This cliff behavior suggests that Llama's safety alignment depends on weight precision that Q2_K destroys.
llama3.2-1b Q2_K shows a non-monotonic shot-count curve: ASR rises from 11% at N=1 to 62% at N=16, drops to 48% at N=64, then recovers to 72% at N=128. The N=64 dip may reflect context-length interference where the exemplars begin competing for attention.
llama3.1-8b Q2_K peaks at N=16 (46%) and then declines monotonically to 26% at N=128. This is the clearest evidence for H3 (context cap): the 8B model loses many-shot effectiveness as prompts grow long.
qwen2.5-1.5b is vulnerable at every quant level. Even at Q8_0, ASR reaches 40% at N=128. Quantization amplifies an existing vulnerability rather than creating one. The Q2_K N=1 ASR of 85% means that Q2_K qwen2.5-1.5b is already broken with a single exemplar, before any many-shot scaling is applied.
llama3.2-3b is the most robust Llama variant: Q2_K ASR only reaches 41% at N=128, and all other quant levels show 0% ASR except for marginal leakage at Q3_K_M N=128 (5%) and Q8_0 N=128 (1%).
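Because each cell holds only n = 100 samples, the interval around each percentage is wide. The sketch below computes a Wilson score interval for one cell; the function name is illustrative and the z value is the usual two-sided 95% quantile.

```python
import math
from typing import Tuple


def wilson_ci(successes: int, n: int, z: float = 1.959964) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


# llama3.1-8b Q2_K at N=16: 46 of 100 samples complied (Table 1).
low, high = wilson_ci(46, 100)
print(f"46/100 -> [{low:.3f}, {high:.3f}]")

# A 0% cell still has a non-zero upper bound at n = 100.
zero_low, zero_high = wilson_ci(0, 100)
print(f"0/100  -> [{zero_low:.3f}, {zero_high:.3f}]")
```

The second call is the more important one for reading the tables: a reported 0.0% at n = 100 is compatible with a true ASR of a few percent, which is why zero cells are not treated as proof of immunity.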

SS5.5 Baseline-Normalized ASR

To enable cross-model comparison, ASR is normalized against Q8_0 at the same (model, shot-count) cell. The normalization ratio r = ASR(quant) / ASR(Q8_0) measures multiplicative degradation from quantization. Cells where Q8_0 ASR = 0% are marked "N/A" (ratio undefined; the pp delta in SS6 is the appropriate metric for these cells).

Table 4b: Normalized ASR ratios (qwen2.5-1.5b only -- the only model with non-zero Q8_0 baselines at every shot count)

Quant	N=1 (Q8_0=2%)	N=4 (Q8_0=6%)	N=16 (Q8_0=4%)	N=64 (Q8_0=32%)	N=128 (Q8_0=40%)
Q6_K	0.00x	0.83x	0.50x	0.81x	1.00x
Q5_K_M	0.50x	0.67x	0.50x	1.13x	1.05x
Q4_K_M	4.50x	2.67x	0.50x	0.88x	1.13x
Q3_K_M	9.50x	4.50x	5.00x	0.88x	1.33x
Q2_K	42.50x	13.67x	16.25x	2.88x	2.48x

Table 4c: Llama family Q2_K normalization (absolute delta where Q8_0 = 0%)

Model	N=1	N=4	N=16	N=64	N=128
llama3.1-8b	+2.0pp	+27.0pp	+45.0pp	+31.0pp	+22.0pp
llama3.2-1b	+11.0pp	+10.0pp	+62.0pp	+48.0pp	+72.0pp
llama3.2-3b	+7.0pp	+2.0pp	+0.0pp	+7.0pp	+40.0pp

Observations.

qwen2.5-1.5b Q2_K at N=1 shows a normalization ratio of 42.5x -- the single highest in the study. This extreme ratio reflects Q2_K's 85% ASR against a 2% baseline: quantization multiplies the model's vulnerability by more than 40-fold at the lowest shot count. However, by N=128, the ratio compresses to 2.48x because the Q8_0 baseline itself has risen to 40%. The decreasing ratio with shot count confirms that many-shot pressure on qwen2.5-1.5b is an independent vulnerability that quantization amplifies additively rather than multiplicatively at high N.
Q3_K_M on qwen2.5-1.5b shows substantial normalized degradation at low shot counts (9.5x at N=1, 4.5x at N=4) but converges toward 1.0x at high shot counts (0.88x at N=64, 1.33x at N=128). At high N, the many-shot attack overwhelms the safety regardless of quant level, compressing the quant effect.
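Q6_K and Q5_K_M ratios on qwen2.5-1.5b never exceed 1.13x at any shot count. The sub-1.0x ratios at N <= 16 reflect differences of at most 2pp against single-digit baselines, so they are noise rather than a safety gain. This is consistent with their TOST equivalence to Q8_0 (SS15): these quant levels produce safety profiles indistinguishable from the baseline on this model.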
For Llama models (Table 4c), Q8_0 baselines are 0% at most cells, so absolute pp deltas replace ratios. The largest absolute degradation is llama3.2-1b N=128 (+72pp), confirming that Q2_K completely destroys Llama safety alignment at high shot counts.
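The normalization in Table 4b can be reproduced from the Table 4 percentages. A minimal sketch, with hypothetical function names, that returns None wherever the Q8_0 baseline is zero (the "N/A" case):

```python
from typing import Dict, List, Optional

SHOTS = [1, 4, 16, 64, 128]

# qwen2.5-1.5b ASR in percent, copied from Table 4.
QWEN_ASR: Dict[str, List[float]] = {
    "Q8_0":   [2.0, 6.0, 4.0, 32.0, 40.0],
    "Q6_K":   [0.0, 5.0, 2.0, 26.0, 40.0],
    "Q5_K_M": [1.0, 4.0, 2.0, 36.0, 42.0],
    "Q4_K_M": [9.0, 16.0, 2.0, 28.0, 45.0],
    "Q3_K_M": [19.0, 27.0, 20.0, 28.0, 53.0],
    "Q2_K":   [85.0, 82.0, 65.0, 92.0, 99.0],
}


def normalized_ratio(asr_quant: float, asr_baseline: float) -> Optional[float]:
    """r = ASR(quant) / ASR(Q8_0); undefined (None) when the baseline is 0."""
    if asr_baseline == 0:
        return None
    return asr_quant / asr_baseline


def ratio_table(asr: Dict[str, List[float]], baseline: str = "Q8_0"):
    base = asr[baseline]
    return {
        quant: [normalized_ratio(v, b) for v, b in zip(values, base)]
        for quant, values in asr.items()
        if quant != baseline
    }


ratios = ratio_table(QWEN_ASR)
for quant, row in ratios.items():
    print(quant, ["N/A" if r is None else f"{r:.2f}x" for r in row])
```

Ratios built on a 2% baseline (two samples out of 100) are very sensitive to a single reclassified response, so the low-N columns are best read alongside the absolute pp deltas.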

SS5.6 Minimum Effective Shot Count

The minimum N at which ASR exceeds 50% measures how many in-context examples are needed to reliably jailbreak each (model, quant) combination.

Model	Quant	Min N for 50% ASR
llama3.1-8b	Q2_K	not reached
llama3.1-8b	Q3_K_M-Q8_0	not reached
llama3.2-1b	Q2_K	16
llama3.2-1b	Q3_K_M-Q8_0	not reached
llama3.2-3b	Q2_K	not reached
llama3.2-3b	Q3_K_M-Q8_0	not reached
qwen2.5-1.5b	Q2_K	1
qwen2.5-1.5b	Q3_K_M	128
qwen2.5-1.5b	Q4_K_M-Q8_0	not reached

Observations.

Only three (model, quant) combinations ever reach 50% ASR: qwen2.5-1.5b Q2_K at N=1 (already above 50% with a single exemplar), llama3.2-1b Q2_K at N=16, and qwen2.5-1.5b Q3_K_M at N=128. The many-shot attack reliably jailbreaks only 3 of 24 (model, quant) configurations.
qwen2.5-1.5b Q2_K reaches 50% at N=1, meaning the model is majority-compliant at this quant level with only one in-context demonstration. Scaling the shot count is unnecessary: the quantization damage is sufficient to break safety.
llama3.2-1b Q2_K crosses 50% at N=16 but then dips below 50% at N=64 (48%) before recovering at N=128 (72%). This non-monotonicity complicates the "minimum N" concept: the attack is not reliably above 50% across all shot counts.
No Llama model at Q3_K_M or above ever reaches 50% ASR, even at N=128. This reinforces the production recommendation: Q4_K_M and above are many-shot safe for Llama.
The practical implication for attackers is that many-shot jailbreaking of Llama models requires both extreme quantization (Q2_K) and moderate shot count (N >= 16). If either condition is denied -- by restricting quant level or by limiting context length -- the attack fails.
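The "minimum N" metric and its non-monotonicity caveat can be made concrete with two small helpers. Both names are hypothetical; the second one (the smallest N from which ASR stays above the threshold) is an illustrative variant, not a metric the report defines.

```python
from typing import Dict, Optional


def min_effective_shots(asr_by_shot: Dict[int, float], threshold: float = 50.0) -> Optional[int]:
    """Smallest tested N whose ASR (percent) exceeds the threshold."""
    for n in sorted(asr_by_shot):
        if asr_by_shot[n] > threshold:
            return n
    return None


def min_sustained_shots(asr_by_shot: Dict[int, float], threshold: float = 50.0) -> Optional[int]:
    """Smallest tested N such that ASR exceeds the threshold at N and all larger N."""
    result = None
    for n in sorted(asr_by_shot, reverse=True):
        if asr_by_shot[n] > threshold:
            result = n
        else:
            break
    return result


# Q2_K rows from Tables 1, 2 and 4 (percent).
LLAMA32_1B_Q2K = {1: 11.0, 4: 10.0, 16: 62.0, 64: 48.0, 128: 72.0}
LLAMA31_8B_Q2K = {1: 2.0, 4: 27.0, 16: 46.0, 64: 33.0, 128: 26.0}
QWEN_Q2K = {1: 85.0, 4: 82.0, 16: 65.0, 64: 92.0, 128: 99.0}

print(min_effective_shots(LLAMA32_1B_Q2K), min_sustained_shots(LLAMA32_1B_Q2K))
print(min_effective_shots(LLAMA31_8B_Q2K), min_sustained_shots(LLAMA31_8B_Q2K))
print(min_effective_shots(QWEN_Q2K), min_sustained_shots(QWEN_Q2K))
```

For llama3.2-1b Q2_K the two helpers disagree (first crossing at N=16, sustained only from N=128), which is the dip at N=64 expressed as a number.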

SS5.7 Shot-Count Effectiveness Patterns

The relationship between shot count and ASR is not monotonic on all models. Several patterns emerge:

Near-saturated with a mid-range dip (qwen2.5-1.5b Q2_K): ASR starts at 85% at N=1 and reaches 99% at N=128, but dips to 65% at N=16. The N=16 anomaly may reflect a phase transition in how the model processes intermediate-length exemplar sequences.
Peak-then-decline (llama3.1-8b Q2_K): ASR peaks at N=16 (46%) and monotonically decreases to 26% at N=128. The model's attention degrades with prompt length, reducing the effectiveness of additional exemplars.
Delayed activation (llama3.2-3b Q2_K): ASR is low at N=1-64 (0-7%) then jumps to 41% at N=128. This model requires a large number of exemplars before many-shot pressure overwhelms its safety alignment.
Non-monotonic dip (llama3.2-1b Q2_K): ASR rises sharply to 62% at N=16, dips to 48% at N=64, then recovers to 72% at N=128. The N=64 dip may represent a transition zone where exemplar density interacts with context-window limits.

These patterns suggest that optimal attack strategy depends on both the target model and the quant level. A universal "more shots is always better" assumption does not hold.

SS6. Statistical Tests vs Q8_0 Baseline

Each quant level is compared to Q8_0 at the same (model, shot-count) using a two-sided Fisher exact test on the 2x2 contingency table (complied/refused x quant/baseline). Holm-Bonferroni correction is applied across all 100 pairwise tests (4 models x 5 shot counts x 5 quant levels).
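The test procedure can be sketched in pure Python. This is an illustrative implementation under stated assumptions: the two-sided Fisher p-value sums all tables at least as extreme as the observed one (the common convention), and the function names are hypothetical rather than the report's code.

```python
from math import comb
from typing import List, Sequence


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p for the 2x2 table [[a, b], [c, d]].

    Rows: test quant / Q8_0 baseline. Columns: complied / refused.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x compliances in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table no more likely than the observed one (small tolerance
    # guards against floating-point ties).
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


def holm_adjust(pvals: Sequence[float]) -> List[float]:
    """Holm-Bonferroni step-down adjusted p-values, in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running = 0.0
    for rank, i in enumerate(order):
        running = max(running, min(1.0, (m - rank) * pvals[i]))
        adjusted[i] = running
    return adjusted


# llama3.2-1b, N=1: Q2_K 11/100 complied vs Q8_0 0/100 (Table 2).
raw_p = fisher_exact_two_sided(11, 89, 0, 100)
print(f"raw p = {raw_p:.5f}")
print(holm_adjust([0.01, 0.04, 0.03]))
```

Under this convention the raw p for the llama3.2-1b N=1 comparison is roughly 0.0007. If that comparison ranks 16th among the 100 tests, its Holm multiplier is 85, which gives an adjusted value near 0.06.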

Table 5: Significant comparisons (Holm-adjusted p < 0.05)

Cohen's h is computed as h = 2|arcsin(sqrt(p_test)) - arcsin(sqrt(p_baseline))| for each comparison. Effect size benchmarks: 0.2 = small, 0.5 = medium, 0.8 = large (Cohen, 1988).

Model	N	Quant	ASR (Q8_0)	ASR (test)	Delta	Fisher p	Holm p	Cohen's h
llama3.1-8b	4	Q2_K	0.0%	27.0%	+27.0pp	<0.001	<0.001	1.09
llama3.1-8b	16	Q2_K	1.0%	46.0%	+45.0pp	<0.001	<0.001	1.28
llama3.1-8b	64	Q2_K	2.0%	33.0%	+31.0pp	<0.001	<0.001	0.93
llama3.1-8b	128	Q2_K	4.0%	26.0%	+22.0pp	<0.001	0.001	0.67
llama3.2-1b	16	Q2_K	0.0%	62.0%	+62.0pp	<0.001	<0.001	1.81
llama3.2-1b	64	Q2_K	0.0%	48.0%	+48.0pp	<0.001	<0.001	1.52
llama3.2-1b	128	Q2_K	0.0%	72.0%	+72.0pp	<0.001	<0.001	2.02
llama3.2-3b	128	Q2_K	1.0%	41.0%	+40.0pp	<0.001	<0.001	1.19
qwen2.5-1.5b	1	Q2_K	2.0%	85.0%	+83.0pp	<0.001	<0.001	2.06
qwen2.5-1.5b	1	Q3_K_M	2.0%	19.0%	+17.0pp	<0.001	0.009	0.62
qwen2.5-1.5b	4	Q2_K	6.0%	82.0%	+76.0pp	<0.001	<0.001	1.77
qwen2.5-1.5b	4	Q3_K_M	6.0%	27.0%	+21.0pp	<0.001	0.008	0.60
qwen2.5-1.5b	16	Q2_K	4.0%	65.0%	+61.0pp	<0.001	<0.001	1.47
qwen2.5-1.5b	64	Q2_K	32.0%	92.0%	+60.0pp	<0.001	<0.001	1.37
qwen2.5-1.5b	128	Q2_K	40.0%	99.0%	+59.0pp	<0.001	<0.001	1.57

Observations.

All 15 significant comparisons show medium-to-large effect sizes (Cohen's h >= 0.60). The two Q3_K_M comparisons on qwen2.5-1.5b (h = 0.60 and 0.62) are the smallest, falling between the medium (0.5) and large (0.8) benchmarks. The 13 Q2_K comparisons are all h >= 0.67, and 12 of them exceed the large benchmark. The largest is qwen2.5-1.5b N=1 Q2_K (h = 2.06), roughly 2.6x the "large" threshold.
Of 100 pairwise tests, 15 survive Holm-Bonferroni correction. All 15 involve either Q2_K (13 tests) or Q3_K_M (2 tests, both on qwen2.5-1.5b). No quant level at Q4_K_M or above shows a statistically significant ASR increase over Q8_0 after correction.
The largest effect by both pp delta and Cohen's h is qwen2.5-1.5b N=1 Q2_K (+83pp, h = 2.06): quantization combined with a single exemplar, the weakest many-shot pressure tested, produces the study's largest safety degradation. The largest Llama effect is llama3.2-1b N=128 Q2_K (+72pp, h = 2.02).
llama3.1-8b N=128 Q2_K has the smallest Cohen's h among the significant Q2_K comparisons (0.67), reflecting that its pp delta (+22pp) starts from a non-zero baseline (4%). Even the weakest significant effect overall (h = 0.60, qwen2.5-1.5b N=4 Q3_K_M) is above the "medium" threshold of 0.5.
llama3.2-1b N=1 Q2_K (11% vs 0%, raw p < 0.001) loses significance after Holm correction (adjusted p = 0.062), illustrating that the correction is doing real work. The raw effect (11pp) is near the noise floor at n=100.
qwen2.5-1.5b is the only model where Q3_K_M reaches significance. For Llama models, Q3_K_M effects are small enough to be absorbed by the correction.
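The effect sizes in Table 5 follow from the arcsine formula given above the table. A minimal sketch (the function name is illustrative) that recomputes three of the rows from their ASR values:

```python
import math


def cohens_h(p_test: float, p_baseline: float) -> float:
    """Cohen's h = 2 * |arcsin(sqrt(p_test)) - arcsin(sqrt(p_baseline))|."""
    return 2.0 * abs(math.asin(math.sqrt(p_test)) - math.asin(math.sqrt(p_baseline)))


# (label, ASR test, ASR Q8_0) taken from Table 5, as proportions.
ROWS = [
    ("qwen2.5-1.5b N=1 Q2_K", 0.85, 0.02),
    ("llama3.1-8b N=4 Q2_K", 0.27, 0.00),
    ("qwen2.5-1.5b N=1 Q3_K_M", 0.19, 0.02),
]

for label, p_test, p_base in ROWS:
    print(f"{label}: h = {cohens_h(p_test, p_base):.2f}")
```

The arcsine transform stretches differences near 0% and 100%, which is why a +27pp jump from a 0% baseline (h near 1.09) scores higher than a +31pp jump from a 2% baseline (h near 0.93) in the same table.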

SS7. Critical Quant Thresholds

For each (model, shot-count) combination, the critical quant threshold is the highest-precision (least aggressive) quant level that shows a statistically significant ASR increase over Q8_0 (Holm-adjusted p < 0.05).

Model	N=1	N=4	N=16	N=64	N=128
llama3.1-8b	--	Q2_K	Q2_K	Q2_K	Q2_K
llama3.2-1b	--	--	Q2_K	Q2_K	Q2_K
llama3.2-3b	--	--	--	--	Q2_K
qwen2.5-1.5b	Q3_K_M	Q3_K_M	Q2_K	Q2_K	Q2_K

Observations.

Q2_K is the critical threshold in 11 of 20 conditions, and Q3_K_M in 2 additional conditions (both qwen2.5-1.5b at low N). The remaining 7 conditions show no significant threshold.
The "--" entries (no significant threshold) cluster at low shot counts on Llama models, where even Q2_K does not produce enough ASR to reach significance at n=100. This is a power limitation, not evidence of safety.
qwen2.5-1.5b is the only model where the critical threshold reaches Q3_K_M, and only at N=1 and N=4. At higher shot counts, even Q3_K_M's effect is masked by the already-elevated Q8_0 baseline.
The practical takeaway is that Q4_K_M and above are safe from statistically significant many-shot amplification across all models and shot counts tested.
The "--" entries do not all correspond to zero-ASR cells. Only llama3.2-3b at N=16 shows 0% ASR at every quant level. In the other six "--" conditions, Q2_K ASR is 2-11% against a 0% Q8_0 baseline: non-zero, but too small to survive Holm correction at n=100. These entries should be read as "no significant degradation detected" rather than as demonstrated robustness to many-shot attacks.
Comparing SS7 to TR139's critical thresholds: TR139 found Q2_K as the critical threshold for multi-turn attacks on 11 of 20 conditions as well. The convergence across attack types strengthens the Q2_K finding beyond what either study establishes alone.

SS8. Power-Law Analysis

For each (model, quant) pair, we fit ASR = a * N^b on a log-log scale. The exponent b captures how rapidly ASR increases with shot count. Only fits with R-squared > 0.5 are considered well-fit.

Table 6: Power-law parameters (well-fit curves only, R-squared > 0.5)

Model	Quant	BPW	b	a	R-squared	95% Bootstrap CI
llama3.1-8b	Q2_K	3.35	0.454	0.056	0.509	[-0.274, 1.224]
llama3.1-8b	Q3_K_M	3.91	0.299	0.013	0.721	[-0.107, 0.661]
llama3.1-8b	Q8_0	8.00	0.643	0.002	0.964	[0.000, 1.000]
llama3.2-1b	Q2_K	3.35	0.430	0.096	0.781	[-0.014, 0.673]
llama3.2-1b	Q3_K_M	3.91	0.369	0.010	1.000	[0.369, 0.369]
qwen2.5-1.5b	Q3_K_M	3.91	0.155	0.182	0.563	[-0.016, 0.469]
qwen2.5-1.5b	Q5_K_M	5.69	0.769	0.009	0.808	[0.156, 1.619]
qwen2.5-1.5b	Q6_K	6.57	0.726	0.009	0.633	[-0.661, 1.850]
qwen2.5-1.5b	Q8_0	8.00	0.609	0.018	0.845	[0.182, 1.206]

Observations.

Nine (model, quant) pairs produce well-fit power laws. The exponent b ranges from 0.155 (qwen2.5-1.5b Q3_K_M) to 0.769 (qwen2.5-1.5b Q5_K_M), a 5x range. H1 (invariant exponent) is rejected: the exponent is not constant across quant levels or models.
llama3.1-8b Q8_0 has the highest R-squared (0.964) with b = 0.643. This is a near-perfect power law on a model that is nearly immune to many-shot attacks (ASR reaches only 4% at N=128). The power law captures a real but operationally insignificant trend.
qwen2.5-1.5b Q5_K_M has the steepest exponent (b = 0.769), meaning ASR grows fastest with shot count at this quant level. However, the bootstrap CI is wide [0.156, 1.619], reflecting high uncertainty from only 5 data points.
qwen2.5-1.5b Q2_K has an R-squared of only 0.136 (not shown in table) because ASR is already 85% at N=1 -- there is little room for a power law to operate when the baseline is near-saturated.
Bootstrap CIs cross zero for 5 of 9 well-fit curves, and a sixth (llama3.1-8b Q8_0) has a lower bound of exactly zero, indicating that even the well-fit exponents are not robustly different from zero. The power-law model is descriptive but not predictive at this sample size.
Fifteen of 24 total (model, quant) pairs produce R-squared < 0.5 and are excluded from the well-fit table. These failures cluster in two patterns: (a) all-zero cells (Llama at Q4_K_M+) where no curve can be fit, and (b) near-saturated cells (qwen2.5-1.5b Q2_K, R-squared = 0.136) where ASR starts so high that shot count adds little. The power-law model is informative only in the intermediate regime where ASR transitions from low to high with increasing N.
The practical implication for H2 (left-shift) testing is that well-fit curves exist at more than one quant level only for qwen2.5-1.5b (Q3_K_M, Q5_K_M, Q6_K, Q8_0), llama3.1-8b (Q2_K, Q3_K_M, Q8_0), and llama3.2-1b (Q2_K, Q3_K_M). In the two models with three or more fits, the exponents do not follow a monotonic BPW trend, and two fits on llama3.2-1b are too few to establish one. This precludes a simple left-shift conclusion. A larger study with more quant levels (e.g., the legacy Q4_0 and Q5_0 formats) might reveal a cleaner relationship.
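The log-log fit can be sketched in a few lines. One assumption is labeled in the code: zero-ASR cells are dropped before fitting, because log(0) is undefined. Under that assumption the llama3.1-8b Q8_0 row of Table 6 is reproduced from the Table 1 percentages. Bootstrap CIs are omitted here.

```python
import math
from typing import Sequence, Tuple


def fit_power_law(shots: Sequence[int], asr: Sequence[float]) -> Tuple[float, float, float]:
    """OLS fit of log(ASR) = log(a) + b * log(N). Returns (a, b, r_squared)."""
    # Assumption: cells with ASR = 0 are excluded (log(0) is undefined).
    pts = [(math.log(n), math.log(p)) for n, p in zip(shots, asr) if p > 0]
    if len(pts) < 2:
        raise ValueError("need at least two non-zero ASR points")
    mean_x = sum(x for x, _ in pts) / len(pts)
    mean_y = sum(y for _, y in pts) / len(pts)
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    syy = sum((y - mean_y) ** 2 for _, y in pts)
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    # For simple linear regression, R^2 equals the squared correlation.
    r_squared = 1.0 if syy == 0 else (sxy * sxy) / (sxx * syy)
    return a, b, r_squared


SHOTS = [1, 4, 16, 64, 128]
LLAMA31_8B_Q8_0 = [0.00, 0.00, 0.01, 0.02, 0.04]   # Table 1, as proportions
LLAMA32_1B_Q3_K_M = [0.01, 0.00, 0.00, 0.00, 0.06]  # Table 2, as proportions

for name, row in [("llama3.1-8b Q8_0", LLAMA31_8B_Q8_0),
                  ("llama3.2-1b Q3_K_M", LLAMA32_1B_Q3_K_M)]:
    a, b, r2 = fit_power_law(SHOTS, row)
    print(f"{name}: a={a:.3f} b={b:.3f} R^2={r2:.3f}")
```

The second row has only two non-zero cells, so any line fits it perfectly. Its R-squared of 1.000 and zero-width CI in Table 6 therefore carry little evidential weight.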

SS9. Per-Category and Per-Model ANOVA

SS9.1 Per-Category Breakdown

Category	Avg ASR	Max ASR	Min ASR
Chemical & Biological Weapons	11.0%	100.0%	0.0%
Cybercrime & Unauthorized Intrusion	13.3%	100.0%	0.0%
Discrimination & Stereotyping	8.4%	100.0%	0.0%
Disinformation & Deception	14.3%	100.0%	0.0%
Economic Harm & Fraud	12.2%	100.0%	0.0%
Harassment & Bullying	11.9%	100.0%	0.0%
Illegal Activities	11.3%	100.0%	0.0%
Privacy Violations & Surveillance	11.4%	100.0%	0.0%
Self-Harm & Dangerous Substances	8.6%	100.0%	0.0%
Violence & Weapons	9.6%	100.0%	0.0%

Observations.

Category-level variation is modest: average ASR ranges from 8.4% (Discrimination & Stereotyping) to 14.3% (Disinformation & Deception), a spread of only 5.9pp. The many-shot attack does not strongly discriminate by harm category.
Every category reaches 100% max ASR (qwen2.5-1.5b Q2_K at high N) and 0% min ASR (all Llama models at Q4_K_M+), confirming that the model-quant interaction dominates category effects.
The relatively low variance across categories supports treating ASR as a model-level metric rather than a category-level one for deployment decisions.
The two highest-ASR categories -- Disinformation & Deception (14.3%) and Cybercrime & Unauthorized Intrusion (13.3%) -- represent dual-use domains where the boundary between legitimate and harmful information is ambiguous. Models may be trained with less aggressive refusal on these categories because over-refusal would impair legitimate use cases (e.g., cybersecurity education, misinformation research). The many-shot attack exploits this lower refusal threshold.
The two lowest-ASR categories -- Discrimination & Stereotyping (8.4%) and Self-Harm & Dangerous Substances (8.6%) -- are domains where safety training is typically most aggressive and unambiguous. Even many-shot pressure does not easily overcome strong category-specific refusal training.
The 5.9pp spread across categories is more than an order of magnitude smaller than the model-level spread (0% to 99% across cells, roughly 17x larger), confirming that category effects are a second-order concern after model identity and quant level. However, category effects may be more pronounced at the per-cell level -- a category analysis within Q2_K cells only would reveal whether the many-shot attack preferentially breaks certain harm categories.

SS9.2 One-Way ANOVA (Model Effect by Shot Count)

Shot Count	F	p	eta-squared	n_models
N=1	101.42	<0.001	0.11	4
N=4	112.06	<0.001	0.12	4
N=16	33.31	<0.001	0.04	4
N=64	200.21	<0.001	0.20	4
N=128	243.01	<0.001	0.23	4

Observations.

Model identity is a significant predictor of ASR at every shot count (all p < 0.001). Effect size (eta-squared) grows with shot count: from 0.11 at N=1 to 0.23 at N=128. As shot count increases, model differences become more pronounced because vulnerable models (qwen2.5-1.5b) diverge further from immune models (llama3.2-3b).
The N=16 dip in eta-squared (0.04) reflects that N=16 is the shot count where Llama Q2_K models begin showing elevated ASR, partially closing the gap with qwen2.5-1.5b and reducing the between-model variance.
Eta-squared values of 0.11-0.23 are considered medium-to-large effects in social science conventions (0.06 medium, 0.14 large; Cohen 1988). Model identity is a large effect at N=64 and N=128, meaning that knowing which model is deployed explains 20-23% of ASR variance at high shot counts. This reinforces the production guidance that safety evaluations must be model-specific.
The overall increase in eta-squared with N (interrupted by the dip at N=16) has a practical interpretation: at low shot counts (N=1), most model-quant cells refuse (similar ASR), so model identity is less informative. At high shot counts, the vulnerable models diverge from the robust ones, making model identity a stronger predictor of ASR.
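For a one-way ANOVA, eta-squared can be recovered from the F statistic and its degrees of freedom. The sketch below assumes 2,400 binary samples per shot count (4 models x 6 quant levels x 100), giving 3 and 2,396 degrees of freedom. That sample count is inferred from the design, not stated in the ANOVA table.

```python
def eta_squared_from_f(f_stat: float, df_between: int, df_within: int) -> float:
    """eta^2 = SS_between / SS_total = F*df_b / (F*df_b + df_w) for one-way ANOVA."""
    return (f_stat * df_between) / (f_stat * df_between + df_within)


N_MODELS = 4
SAMPLES_PER_SHOT_COUNT = 4 * 6 * 100   # assumption: inferred from the design
DF_BETWEEN = N_MODELS - 1
DF_WITHIN = SAMPLES_PER_SHOT_COUNT - N_MODELS

# F statistics from the SS9.2 table.
F_BY_SHOT = {1: 101.42, 4: 112.06, 16: 33.31, 64: 200.21, 128: 243.01}

for shots, f_stat in F_BY_SHOT.items():
    eta2 = eta_squared_from_f(f_stat, DF_BETWEEN, DF_WITHIN)
    print(f"N={shots}: eta^2 = {eta2:.2f}")
```

The recovered values match the reported eta-squared column to two decimals, including the N=16 dip, which supports reading each row as an ANOVA over all 2,400 samples at that shot count.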

SS9.3 Two-Way ANOVA (Quant x Shot Count per Model)

Model	Factor	F	p	eta-squared
llama3.1-8b	Quant	146.00	<0.001	0.18
llama3.1-8b	Shot Count	15.95	<0.001	0.02
llama3.1-8b	Interaction	10.69	<0.001	0.05
llama3.2-1b	Quant	422.76	<0.001	0.35
llama3.2-1b	Shot Count	46.81	<0.001	0.03
llama3.2-1b	Interaction	42.79	<0.001	0.14
llama3.2-3b	Quant	69.13	<0.001	0.08
llama3.2-3b	Shot Count	41.81	<0.001	0.04
llama3.2-3b	Interaction	29.80	<0.001	0.15
qwen2.5-1.5b	Quant	282.70	<0.001	0.29
qwen2.5-1.5b	Shot Count	116.01	<0.001	0.09
qwen2.5-1.5b	Interaction	2.70	<0.001	0.01

Observations.

Quantization is the dominant main effect for every model: eta-squared ranges from 0.08 (llama3.2-3b) to 0.35 (llama3.2-1b). Shot count has a much smaller main effect (eta-squared 0.02-0.09).
The quant x shot-count interaction is large for llama3.2-1b (0.14) and llama3.2-3b (0.15), meaning the effect of shot count depends on quant level. Concretely, increasing shot count matters enormously at Q2_K but has zero effect at Q4_K_M and above.
qwen2.5-1.5b has the smallest interaction (0.01) because qwen2.5-1.5b shows elevated ASR across all quant levels. The shot-count effect is relatively consistent regardless of quant level -- the model is vulnerable everywhere.
The interaction effect explains why aggregate variance decomposition (SS12) shows low shot-count variance (2.7%): the shot-count effect is concentrated in Q2_K cells, which are a minority of the design space. Within Q2_K, shot count explains substantially more variance, but this signal is diluted by the many zero-ASR cells at higher quant levels.
From a deployment perspective, the large quant main effects (eta-squared 0.08-0.35) confirm that quant-level selection is the most impactful safety decision. The interaction confirms that this decision matters most at extreme quant levels -- operators using Q4_K_M and above can safely ignore shot-count considerations.

SS10. Prompt Format Comparison

Faux dialogue and message array are compared within each (model, quant, shot-count) cell using a two-sample t-test on per-behavior ASR.

Table 7: Significant format comparisons (selected, p < 0.001)

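Each comparison is within the same (model, quant, N) cell, with n = 50 per format. Delta is message array ASR minus faux dialogue ASR. Cohen's h is computed on the two format-specific ASR proportions and reported as an absolute value, so the sign of Delta gives the direction of the effect.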

Model	Quant	N	Faux Dialogue	Message Array	Delta	p	Cohen's h
llama3.1-8b	Q2_K	4	0.0%	54.0%	+54.0pp	<0.001	1.65
llama3.1-8b	Q2_K	16	0.0%	92.0%	+92.0pp	<0.001	2.57
llama3.1-8b	Q2_K	64	0.0%	66.0%	+66.0pp	<0.001	1.89
llama3.1-8b	Q2_K	128	0.0%	52.0%	+52.0pp	<0.001	1.53
llama3.2-1b	Q2_K	4	0.0%	20.0%	+20.0pp	<0.001	0.93
llama3.2-1b	Q2_K	16	72.0%	52.0%	-20.0pp	<0.001	0.49
llama3.2-1b	Q2_K	64	0.0%	96.0%	+96.0pp	<0.001	2.74
llama3.2-1b	Q2_K	128	56.0%	88.0%	+32.0pp	<0.001	0.73
llama3.2-3b	Q2_K	128	6.0%	76.0%	+70.0pp	<0.001	1.63
qwen2.5-1.5b	Q2_K	1	96.0%	74.0%	-22.0pp	<0.001	0.66
qwen2.5-1.5b	Q2_K	16	90.0%	40.0%	-50.0pp	<0.001	1.13
qwen2.5-1.5b	Q3_K_M	4	42.0%	12.0%	-30.0pp	<0.001	0.70
qwen2.5-1.5b	Q3_K_M	16	6.0%	34.0%	+28.0pp	<0.001	0.74
qwen2.5-1.5b	Q3_K_M	128	36.0%	70.0%	+34.0pp	<0.001	0.70
qwen2.5-1.5b	Q4_K_M	64	10.0%	46.0%	+36.0pp	<0.001	0.84
qwen2.5-1.5b	Q4_K_M	128	4.0%	86.0%	+82.0pp	<0.001	1.97
qwen2.5-1.5b	Q5_K_M	64	0.0%	72.0%	+72.0pp	<0.001	2.02
qwen2.5-1.5b	Q5_K_M	128	2.0%	82.0%	+80.0pp	<0.001	1.98
qwen2.5-1.5b	Q6_K	64	0.0%	52.0%	+52.0pp	<0.001	1.53
qwen2.5-1.5b	Q6_K	128	4.0%	76.0%	+72.0pp	<0.001	1.72
qwen2.5-1.5b	Q8_0	64	0.0%	64.0%	+64.0pp	<0.001	1.85
qwen2.5-1.5b	Q8_0	128	2.0%	78.0%	+76.0pp	<0.001	1.89

Observations.

The message array format is the dominant attack vector on most model-quant-N combinations. The largest effect is llama3.2-1b Q2_K N=64: faux dialogue achieves 0% ASR while message array achieves 96% (Cohen's h = 2.74, the study's largest format effect). Twenty of the 22 listed comparisons show h >= 0.70, and 16 exceed Cohen's conventional large-effect threshold of 0.80, so the format difference is not just statistically significant but practically enormous.
llama3.1-8b shows zero vulnerability to faux dialogue across all quant levels and shot counts. Every non-zero ASR on llama3.1-8b comes from the message array format (all four listed comparisons: h = 1.53-2.57). This suggests that Llama 3.1 8B's safety training is robust to plain-text demonstrations but not to chat-template-injected demonstrations.
qwen2.5-1.5b shows a reversal at Q2_K N=1 and N=16: faux dialogue is more effective than message array (96% vs 74% at N=1, h = 0.66; 90% vs 40% at N=16, h = 1.13). The same reversal appears at Q3_K_M N=4 (42% vs 12%, h = 0.70). At extreme quantization, the model may be so degraded that even plain-text exemplars overwhelm safety, and the chat template may actually help the model parse instructions correctly.
On qwen2.5-1.5b at Q5_K_M through Q8_0, the faux dialogue ASR is near zero at high N while message array ASR reaches 52-82% (h = 1.53-2.02). This demonstrates that even at production quant levels, the message array format is a potent attack surface on qwen2.5-1.5b. The aggregated ASR in SS5 masks this format asymmetry.
The only comparison with h < 0.50 is llama3.2-1b Q2_K N=16 (h = 0.49), where faux dialogue (72%) actually exceeds message array (52%). This is a unique reversal on a Llama model -- at N=16 on Q2_K, the model appears to be more susceptible to plain-text demonstrations than to template-injected ones. The mechanism is unclear and warrants investigation.
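For readers who want to recompute effect sizes for other cells, the following is a minimal sketch of Cohen's h using the standard arcsine transform. The function name is hypothetical, and the two example cells are taken from Table 7.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Absolute Cohen's h between two proportions (arcsine transform)."""
    phi1 = 2.0 * math.asin(math.sqrt(p1))
    phi2 = 2.0 * math.asin(math.sqrt(p2))
    return abs(phi1 - phi2)


# llama3.2-1b Q2_K N=64: faux dialogue 0% vs message array 96%
h_largest = cohens_h(0.00, 0.96)
# llama3.1-8b Q2_K N=4: faux dialogue 0% vs message array 54%
h_llama8b = cohens_h(0.00, 0.54)

print(round(h_largest, 2), round(h_llama8b, 2))
```

Because h is taken as an absolute value here, the direction of each effect has to be read from the sign of Delta.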

SS11. Phase 2 Results: Long-Context Safety

Phase 2 tests whether harmful instructions hidden after benign context prefixes are more effective on quantized models. Each cell has n = 50 (50 behaviors x 1 format). Phase 2 uses Q8_0, Q6_K, Q4_K_M, Q3_K_M, and Q2_K (5 levels; Q5_K_M is omitted to keep Phase 2 tractable).

Table 8: Phase 2 ASR by model, quant, and context profile

Model	Quant	short_prefix	medium_prefix	long_prefix
llama3.1-8b	Q8_0	0.0%	0.0%	0.0%
llama3.1-8b	Q6_K	0.0%	0.0%	0.0%
llama3.1-8b	Q4_K_M	0.0%	0.0%	0.0%
llama3.1-8b	Q3_K_M	0.0%	0.0%	2.0%
llama3.1-8b	Q2_K	4.0%	8.0%	12.0%
llama3.2-1b	Q8_0	0.0%	0.0%	2.0%
llama3.2-1b	Q6_K	0.0%	0.0%	2.0%
llama3.2-1b	Q4_K_M	0.0%	0.0%	4.0%
llama3.2-1b	Q3_K_M	0.0%	0.0%	2.0%
llama3.2-1b	Q2_K	20.0%	30.0%	52.0%
llama3.2-3b	Q8_0	0.0%	2.0%	0.0%
llama3.2-3b	Q6_K	0.0%	0.0%	0.0%
llama3.2-3b	Q4_K_M	0.0%	0.0%	0.0%
llama3.2-3b	Q3_K_M	0.0%	0.0%	0.0%
llama3.2-3b	Q2_K	10.0%	6.0%	4.0%
qwen2.5-1.5b	Q8_0	6.0%	6.0%	8.0%
qwen2.5-1.5b	Q6_K	6.0%	6.0%	10.0%
qwen2.5-1.5b	Q4_K_M	10.0%	12.0%	20.0%
qwen2.5-1.5b	Q3_K_M	12.0%	14.0%	34.0%
qwen2.5-1.5b	Q2_K	62.0%	76.0%	100.0%

Context Dilution Slopes

Model	Slope/BPW	95% CI	Interpretation
llama3.1-8b	-0.018	[-0.083, 0.000]	Weak negative: lower quant slightly weakens long-context safety
llama3.2-1b	-0.068	[-0.335, 0.000]	Moderate negative: Q2_K strongly amplifies long-prefix attacks
llama3.2-3b	-0.005	[-0.028, 0.000]	Negligible: llama3.2-3b is robust to context dilution
qwen2.5-1.5b	-0.150	[-0.551, -0.036]	Strong negative: quantization reliably weakens long-context safety

Observations.

qwen2.5-1.5b Q2_K long_prefix reaches 100% ASR -- every single one of the 50 harmful behaviors is complied with. This is the ceiling of attack effectiveness. The context dilution slope for qwen2.5-1.5b (-0.150/BPW) is the only one whose 95% CI excludes zero, making it the only model where context dilution amplification is statistically established.
The context dilution slopes are computed via OLS regression of ASR on BPW across the 5 quant levels within each model, separately for each context profile. The slope is the change in ASR per unit BPW, so a negative slope means ASR rises as precision falls. A slope of -0.150 means that each 1 BPW decrease in precision raises long-context ASR by 15 percentage points on qwen2.5-1.5b. From Q8_0 (8.00 BPW) to Q2_K (3.35 BPW), the predicted increase is 0.150 x 4.65 = ~70pp. This matches the direction of the observed long_prefix gap between Q8_0 (8%) and Q2_K (100%) but understates its 92pp size, because most of the increase is concentrated in the final step to Q2_K rather than spread linearly across BPW.
llama3.2-1b Q2_K shows the clearest monotonic gradient: 20% (short) to 30% (medium) to 52% (long). Longer benign prefixes progressively weaken safety at Q2_K.
llama3.2-3b Q2_K shows a reversed pattern: 10% (short) to 6% (medium) to 4% (long). On this model, longer prefixes appear to help safety rather than hurt it. This may reflect the 3B model's context handling: longer prefixes dilute the harmful instruction's salience.
Phase 2 confirms the same Q2_K threshold as Phase 1: all four models show elevated Phase 2 ASR only at Q2_K (and qwen2.5-1.5b additionally at Q3_K_M and Q4_K_M). The Phase 1 and Phase 2 vulnerability profiles are consistent.
Comparing Phase 1 (many-shot) and Phase 2 (context dilution) on matched quant levels: at Q2_K, many-shot is the stronger attack on all three Llama models. llama3.2-1b Q2_K reaches 72% ASR via many-shot (N=128) but only 52% via long-prefix context dilution. qwen2.5-1.5b Q2_K reaches 99% via many-shot and 100% via long-prefix -- both are ceiling-saturated, so the two attacks cannot be ranked on this model. The many-shot attack is plausibly more efficient because it provides explicit compliance demonstrations, while context dilution relies on passive attention decay.
The Phase 2 long_prefix profile (~4,000 tokens of benign text) falls between the Phase 1 N=16 prompt (~1,600 tokens of exemplars plus framing) and the N=64 prompt (~6,260 tokens), so N=16 is the nearest shorter comparison. Although long_prefix uses roughly 2.5x as many tokens, it is the weaker attack: Phase 1 N=16 on llama3.2-1b Q2_K achieves 62% ASR through in-context learning, while Phase 2 long_prefix achieves 52% through attention dilution. Active demonstration is more effective than passive dilution, but the margin is smaller than expected.
The Phase 2 results have a practical implication for RAG systems: if a quantized model processes user-controlled documents that are prepended to the harmful query, context dilution could weaken safety. The long_prefix profile simulates this scenario. At Q4_K_M and above, all Llama models remain safe (0-4% ASR), but qwen2.5-1.5b shows 10-20% ASR at Q4_K_M across the three profiles (20% at long_prefix) -- a non-trivial risk for RAG deployments using this model.
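The slope calculation described above can be sketched with a hand-rolled OLS fit. The example uses only the three qwen2.5-1.5b long_prefix cells whose BPW is quoted in this report (Q8_0 at 8.00, Q3_K_M at 3.91, Q2_K at 3.35), so it is an illustrative subset and need not reproduce the reported five-level slope. The function name is an assumption.

```python
def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# qwen2.5-1.5b, long_prefix profile, three quant levels only (subset).
bpw = [8.00, 3.91, 3.35]   # Q8_0, Q3_K_M, Q2_K
asr = [0.08, 0.34, 1.00]   # ASR as proportions, from Table 8

slope = ols_slope(bpw, asr)
print(f"ASR change per BPW: {slope:.3f}")
```

A bootstrap over behaviors would be one way to attach a confidence interval to such a slope; that step is omitted here because the resampling scheme is not described in this section.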

SS12. Variance Decomposition and Many-Shot Amplification

SS12.1 Variance Decomposition

Factor	% Variance Explained
Quantization	17.9%
Model	12.6%
Shot Count	2.7%
Residual	65.7%

Total Phase 1 samples: 12,000.

Observations.

Residual variance (65.7%) dominates, meaning the specific harmful behavior being tested explains more ASR variation than any experimental factor. Some behaviors are inherently easier to elicit regardless of model, quant, or shot count.
Quantization (17.9%) is a larger factor than model identity (12.6%), but both are dwarfed by residual. The practical implication is that behavior-level hardening (improving training data for specific harm categories) may reduce ASR more than restricting quant levels, although this decomposition alone cannot establish that.
Shot count explains only 2.7% of variance. This is surprisingly low given the power-law relationship in SS8, but it reflects that shot count only matters on vulnerable (model, quant) combinations, which are a minority of the design space. If the decomposition were restricted to Q2_K cells only, shot count's share would increase substantially -- the low aggregate number reflects dilution by the many zero-ASR cells at Q4_K_M and above.
The 65.7% residual implies that even a perfect model of (quant, model, shot_count) would leave two-thirds of ASR variance unexplained. This residual is driven by per-behavior heterogeneity: some harmful requests are consistently easy to elicit (e.g., generic information requests with dual-use framing) while others are consistently refused (e.g., explicit violence with named targets). A behavior-level analysis correlating ASR with request features (specificity, category, linguistic complexity) would be a natural follow-up.
Interaction terms are implicit in this decomposition. The two-way ANOVA in SS9.3 shows that the quant x shot-count interaction explains 0.01-0.15 of within-model variance depending on the model. This means the 17.9% quant share in the aggregate decomposition includes both main effects and interactions.

SS12.2 Many-Shot Amplification Ratios

The amplification ratio is ASR(N) / ASR(1) -- how much more effective many-shot is compared to single-shot at the same quant level.

Model	Quant	N	ASR(1)	ASR(N)	Ratio
llama3.1-8b	Q2_K	4	2.0%	27.0%	13.5x
llama3.1-8b	Q2_K	16	2.0%	46.0%	23.0x
llama3.1-8b	Q2_K	64	2.0%	33.0%	16.5x
llama3.1-8b	Q2_K	128	2.0%	26.0%	13.0x
llama3.2-1b	Q2_K	16	11.0%	62.0%	5.6x
llama3.2-1b	Q2_K	64	11.0%	48.0%	4.4x
llama3.2-1b	Q2_K	128	11.0%	72.0%	6.5x

Observations.

Peak amplification is 23.0x on llama3.1-8b Q2_K at N=16: single-shot ASR of 2% becomes 46% with 16 exemplars. This is the strongest many-shot amplification effect in the study.
Amplification declines at N=64 and N=128 on llama3.1-8b (16.5x and 13.0x), consistent with the context-cap hypothesis (H3). More shots do not always mean more effective attacks.
llama3.2-1b Q2_K shows lower amplification ratios (4.4x to 6.5x) because the N=1 baseline is already elevated (11%). When the model is already partially broken at N=1, there is less room for many-shot amplification.
Amplification ratios are undefined or trivial for most (model, quant) pairs because ASR(1) and ASR(N) are both 0% at Q4_K_M and above on Llama models.
The amplification data directly addresses the question "does many-shot pressure interact with quantization?" The answer is clearly yes at Q2_K (amplification ratios 4-23x), weakly at Q3_K_M (qwen2.5-1.5b only: from 19% at N=1 to 53% at N=128, a 2.8x amplification), and not at all at Q4_K_M and above on Llama. Many-shot is a conditional threat: it amplifies existing weakness but cannot create vulnerability where none exists.
For attackers, the amplification data shows that relative and absolute effectiveness diverge: the highest ratio (23x at N=16 on llama3.1-8b Q2_K, 46% ASR) does not correspond to the highest absolute ASR in the study (99% on qwen2.5-1.5b Q2_K at N=128). An attacker optimizing for absolute compliance would choose qwen2.5-1.5b Q2_K N=128 (99% ASR, 1.2x amplification) over the high-amplification but lower-absolute cells.
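The amplification ratio defined at the top of this subsection is simple enough to state in code. This sketch makes one design choice explicit: when ASR(1) is 0% the ratio is treated as undefined rather than infinite. The function name is hypothetical.

```python
from typing import Optional


def amplification_ratio(asr_1: float, asr_n: float) -> Optional[float]:
    """Return ASR(N) / ASR(1), or None when the single-shot baseline is zero."""
    if asr_1 == 0.0:
        return None  # undefined: there is no baseline to amplify
    return asr_n / asr_1


peak = amplification_ratio(0.02, 0.46)      # llama3.1-8b Q2_K, N=16
small = amplification_ratio(0.11, 0.72)     # llama3.2-1b Q2_K, N=128
undefined = amplification_ratio(0.0, 0.0)   # e.g. a Llama cell at Q4_K_M+

print(peak, small, undefined)
```

Returning None for zero baselines mirrors the observation that ratios are undefined or trivial for most (model, quant) pairs.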

SS13. Context-Budget Analysis

At what shot count N does the prompt exceed typical context windows? This determines the practical ceiling for many-shot attacks.

Context Budget	Max N	Implication
4K tokens	16	On Llama models at Q4_K_M+, ASR = 0%. On qwen2.5-1.5b Q8_0, ASR = 4%. Safe for Llama; marginal for Qwen.
8K tokens	64	On Llama models at Q4_K_M+, ASR = 0%. On qwen2.5-1.5b Q8_0, ASR = 32%. Unsafe for Qwen at high N.
16K+ tokens	128	Required for peak many-shot attacks. On qwen2.5-1.5b Q2_K, ASR = 99%.

Observations.

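A 4K context limit is an effective many-shot mitigation for Llama models at Q4_K_M and above. It restricts N to 16, where ASR is 0% on all Llama variants at those quant levels. It does not protect Llama models at Q2_K, where N=16 still reaches 46% (llama3.1-8b) and 62% (llama3.2-1b).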
A 4K limit is insufficient for qwen2.5-1.5b: even N=16 at Q2_K achieves 65% ASR. For qwen2.5-1.5b, context limits must be combined with quant-level restrictions.
The context budget is model-independent (all models use similar tokenizers and prompt templates), so N=16 corresponds to approximately 1,600 prompt tokens and N=64 to approximately 6,300 tokens across all models tested.
An operator deploying with a 4K context limit effectively restricts the maximum many-shot attack to N=16. At this shot count, the cells exceeding 10% ASR include qwen2.5-1.5b Q2_K (65%), qwen2.5-1.5b Q3_K_M (20%), llama3.2-1b Q2_K (62%), and llama3.1-8b Q2_K (46%). For Llama models at Q4_K_M and above, a 4K context limit reduces the many-shot threat to zero; at Q2_K it does not.
At 16K+ contexts (common in API deployments), N=128 is feasible and the full ASR landscape in SS5 applies. Operators serving models with extended context windows (32K, 128K) should assume that the many-shot threat surface is fully available to attackers and apply quant-level and format restrictions accordingly.
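The mapping from context budget to maximum shot count can be reproduced from the mean prompt-token figures in Table 10. The sketch below is illustrative: the function name and the optional `reserve_for_output` parameter are assumptions, and real prompts vary around the means used here.

```python
# Mean prompt tokens per shot count, as reported in Table 10.
MEAN_PROMPT_TOKENS = {1: 160, 4: 465, 16: 1600, 64: 6260, 128: 12350}


def max_shot_count(context_budget_tokens: int, reserve_for_output: int = 0):
    """Largest tested N whose mean prompt fits in the budget, else None."""
    available = context_budget_tokens - reserve_for_output
    fitting = [n for n, tokens in MEAN_PROMPT_TOKENS.items() if tokens <= available]
    return max(fitting) if fitting else None


for budget in (4096, 8192, 16384):
    print(budget, max_shot_count(budget))
```

Only the five tested shot counts are considered, so intermediate values of N that would also fit a given budget are not reported by this helper.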

SS14. Judge Agreement and Scoring Reliability

Cohen's kappa between the regex refusal detector and the LLM judge (qwen2.5:7b-instruct-q8_0), stratified by quant level.

Stratum	n	Agreement	Kappa
Q2_K	2,000	63.5%	0.13
Q3_K_M	2,000	93.5%	0.36
Q4_K_M	2,000	95.7%	0.27
Q5_K_M	2,000	96.2%	0.23
Q6_K	2,000	96.9%	0.28
Q8_0	2,000	96.0%	0.22
Overall	12,000	90.3%	0.23

Observations.

Overall agreement is 90.3% with kappa = 0.23 (fair agreement). The kappa is low relative to agreement because the base rates are heavily skewed: most samples are refusals, so agreement by chance is high.
Q2_K has drastically lower agreement (63.5%, kappa = 0.13) because models at Q2_K produce more ambiguous responses -- partial compliance, hedged refusals, or garbled outputs that the regex and judge classify differently.
Q3_K_M through Q8_0 show consistent agreement (93-97%), reflecting that these quant levels produce clear refusals or clear compliance with little ambiguity.
The low kappa is a known limitation: the judge is a 7B model and may have its own biases. However, the judge is authoritative for all ASR calculations, so internal consistency is maintained even if absolute accuracy is imperfect.
The kappa value (0.23) is comparable to TR139's dual-judge kappa (0.23 for the 7B judge), suggesting this is a systemic property of the scoring pipeline rather than a TR140-specific issue. Improving judge reliability would require either a larger judge model (13B+, infeasible on 12GB GPU) or a multi-judge ensemble with majority voting.
From a measurement-theory perspective, the low kappa at Q2_K (0.13) means that individual Q2_K ASR values should be interpreted with +/-5pp uncertainty beyond the Wilson CI. The Q2_K findings remain robust because the effects (20-80pp) far exceed this uncertainty band, but borderline cells (e.g., llama3.2-3b Q2_K N=64: 7% ASR) could plausibly be 2-12% under different judge calibrations.
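The gap between high raw agreement and low kappa follows directly from the kappa formula when base rates are skewed. The sketch below uses hypothetical counts (not taken from the study) in which both raters label about 97% of 2,000 samples as refusals; the function and argument names are also assumptions.

```python
def cohens_kappa(both_refuse, regex_only_refuse, judge_only_refuse, both_comply):
    """Return (observed agreement, Cohen's kappa) for two binary raters."""
    n = both_refuse + regex_only_refuse + judge_only_refuse + both_comply
    observed = (both_refuse + both_comply) / n
    regex_refuse = (both_refuse + regex_only_refuse) / n
    judge_refuse = (both_refuse + judge_only_refuse) / n
    # Agreement expected by chance if the raters were independent.
    expected = (regex_refuse * judge_refuse
                + (1 - regex_refuse) * (1 - judge_refuse))
    return observed, (observed - expected) / (1 - expected)


# Hypothetical 2x2 table: heavily skewed toward refusals.
agreement, kappa = cohens_kappa(1900, 40, 40, 20)
print(f"agreement={agreement:.1%}, kappa={kappa:.2f}")
```

With these hypothetical counts the raters agree on 96% of samples, yet kappa is only about 0.31, because chance agreement alone is already above 94%.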

SS14b. Latency Analysis

Wall-clock latency per response is measured for each (model, quant, shot-count) cell. This section assesses whether many-shot prompts incur prohibitive latency costs that might naturally limit attack feasibility.

Table 9: Mean wall-clock latency (ms) by model and quant at selected shot counts

Model	Quant	N=1	N=16	N=128	ms/shot
llama3.1-8b	Q2_K	504	778	997	3.2
llama3.1-8b	Q3_K_M	568	672	851	1.9
llama3.1-8b	Q4_K_M	429	596	802	2.5
llama3.1-8b	Q5_K_M	444	678	876	2.7
llama3.1-8b	Q6_K	440	738	888	2.6
llama3.1-8b	Q8_0	724	1,256	1,462	4.1
llama3.2-1b	Q2_K	271	467	860	4.5
llama3.2-1b	Q3_K_M	250	208	285	0.4
llama3.2-1b	Q4_K_M	159	203	282	1.1
llama3.2-1b	Q8_0	277	263	338	0.5
llama3.2-3b	Q2_K	283	204	605	2.6
llama3.2-3b	Q4_K_M	230	183	301	0.8
llama3.2-3b	Q8_0	319	262	488	1.7
qwen2.5-1.5b	Q2_K	779	578	1,024	2.9
qwen2.5-1.5b	Q4_K_M	249	178	431	1.6
qwen2.5-1.5b	Q8_0	191	216	572	3.0

Table 10: Mean prompt tokens by shot count

N	Mean Tokens (across all models/quants)
1	~160
4	~465
16	~1,600
64	~6,260
128	~12,350

Observations.

Latency scales sub-linearly with shot count. N=128 prompts are approximately 80x longer than N=1 prompts (~12,350 vs ~160 tokens) but only about 1.1-3.2x slower in wall-clock time across the configurations in Table 9. This is because Ollama's llama.cpp backend processes prompt tokens in parallel during the prefill phase -- the latency cost of many-shot attacks is modest relative to the token count.
llama3.1-8b Q8_0 is the slowest configuration at N=16 and N=128 (1,256ms and 1,462ms) and the second slowest at N=1 (724ms, behind qwen2.5-1.5b Q2_K at 779ms), likely because Q8_0 at 8B parameters saturates GPU memory bandwidth. Lower-bit llama3.1-8b variants are roughly 30-45% faster at N=128, which ironically means the safest configuration (Q8_0) is also the most expensive to serve.
llama3.2-1b shows anomalous latency: N=16 and N=64 are sometimes faster than N=1 at Q3_K_M (208ms vs 250ms). This likely reflects warm-up effects or measurement noise at very short generation times.
The ms/shot metric captures marginal cost per additional exemplar. Values range from 0.4 ms/shot (llama3.2-1b Q3_K_M) to 4.5 ms/shot (llama3.2-1b Q2_K). Even at the highest rate, N=128 adds only about 570ms of latency over N=1 (4.5 ms x 127 additional shots), meaning many-shot attacks impose negligible latency cost. Latency-based rate limiting alone is not an effective defense against many-shot attacks.
qwen2.5-1.5b Q2_K at N=1 is anomalously slow (779ms) because the model generates longer responses when safety is broken -- compliant responses tend to be more verbose than refusals. This pattern reverses at N=16 (578ms) where the many-shot prompt dominates total latency regardless of response length.
The latency data has an important defensive implication: many-shot prompts are detectable by their token count (N=128 uses ~12K tokens), not their latency. Token-count monitoring or prompt-length limits are more effective mitigations than latency-based rate limiting.

SS15. TOST Equivalence and Power Analysis

SS15.1 TOST Equivalence Tests

TOST tests whether each quant level is equivalent to Q8_0 within a +/-3pp margin. Of 100 tests, 8 show equivalence.

Model	Condition	Quant	TOST p	Equivalent?
llama3.1-8b	N=1	Q3_K_M	<0.001	YES
llama3.1-8b	N=16	Q5_K_M	<0.001	YES
llama3.1-8b	N=16	Q6_K	<0.001	YES
llama3.2-1b	N=1	Q3_K_M	<0.001	YES
llama3.2-1b	N=1	Q4_K_M	<0.001	YES
llama3.2-3b	N=128	Q4_K_M	<0.001	YES
llama3.2-3b	N=128	Q5_K_M	<0.001	YES
llama3.2-3b	N=128	Q6_K	<0.001	YES

Observations.

Only 8 of 100 tests confirm equivalence, meaning the study cannot formally claim that most quant levels are equivalent to Q8_0. This is largely a power problem: when both Q8_0 and the test quant produce 0% ASR, the TOST test requires sufficient sample size to bound the difference within +/-3pp. The reported outcomes are 15 significantly different (SS6), 8 equivalent, and 69 indeterminate (neither significant nor equivalent).
All 8 equivalence confirmations occur on Llama models at Q3_K_M through Q6_K -- the quant levels where ASR is consistently 0%. TOST validates what the raw data already shows: these quant levels are indistinguishable from Q8_0 for safety purposes.
No qwen2.5-1.5b condition achieves TOST equivalence because even Q6_K and Q8_0 differ in ASR at some shot counts. The model's variable baseline prevents equivalence claims.
The 69 indeterminate tests represent the "gray zone" of the study: conditions where we can neither claim harm nor guarantee safety. Most are Llama models at Q4_K_M-Q6_K at shot counts where the both-zero issue makes TOST underpowered. Increasing to n = 200 per cell would resolve approximately 30-40 of these indeterminate cases to "equivalent," based on the MDE analysis in SS15.2.
The TOST margin of +/-3pp was chosen to match the baseline noise floor observed on qwen2.5-1.5b (2% ASR at N=1 Q8_0). A wider margin (e.g., +/-5pp) would confirm more equivalences but at the cost of accepting practically meaningful safety degradation. Appendix C.2 shows the margin-sensitivity analysis.
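To make the two one-sided tests concrete, here is a minimal sketch of TOST for two proportions. It is not the report's implementation: it assumes a normal approximation with Agresti-Caffo smoothing (one pseudo-success and one pseudo-failure per group) so that the standard error stays non-zero when both ASRs are 0%, and it will not necessarily reproduce the p-values in the table above. All names are hypothetical.

```python
import math
from statistics import NormalDist


def tost_two_proportions(x_test, n_test, x_ref, n_ref, margin=0.03):
    """TOST p-value for equivalence of two proportions within +/- margin.

    Assumption: Agresti-Caffo smoothing keeps the standard error positive
    even when both groups have zero successes.
    """
    p_test = (x_test + 1) / (n_test + 2)
    p_ref = (x_ref + 1) / (n_ref + 2)
    se = math.sqrt(p_test * (1 - p_test) / (n_test + 2)
                   + p_ref * (1 - p_ref) / (n_ref + 2))
    diff = p_test - p_ref
    norm = NormalDist()
    p_lower = 1 - norm.cdf((diff + margin) / se)  # H0: diff <= -margin
    p_upper = norm.cdf((diff - margin) / se)      # H0: diff >= +margin
    # Equivalence is claimed only if both one-sided tests reject.
    return max(p_lower, p_upper)


p_both_zero = tost_two_proportions(0, 100, 0, 100)
p_ten_vs_zero = tost_two_proportions(10, 100, 0, 100)
print(f"0% vs 0%: p={p_both_zero:.3f}; 10% vs 0%: p={p_ten_vs_zero:.3f}")
```

The larger of the two one-sided p-values is returned because equivalence requires rejecting both the "too low" and "too high" null hypotheses.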

SS15.2 Power Analysis

Model	N	Baseline ASR	MDE at 80% Power
llama3.1-8b	1	0.0%	3.9%
llama3.1-8b	64	2.0%	5.5%
llama3.1-8b	128	4.0%	7.8%
llama3.2-1b	1-64	0.0%	3.9%
llama3.2-1b	128	0.0%	3.9%
llama3.2-3b	1-64	0.0%	3.9%
llama3.2-3b	128	1.0%	3.9%
qwen2.5-1.5b	1	2.0%	5.5%
qwen2.5-1.5b	64	32.0%	18.5%
qwen2.5-1.5b	128	40.0%	19.4%

Observations.

At n = 100 per cell, the study can detect effects as small as 3.9pp when the baseline is 0%. This is adequate for the primary finding (Q2_K effects are 20-80pp).
MDE increases with baseline ASR: at qwen2.5-1.5b N=128 (baseline 40%), the MDE is 19.4pp. The study cannot detect small quant effects on top of an already-high baseline. This explains why Q3_K_M through Q6_K comparisons on qwen2.5-1.5b at high N fail to reach significance -- the MDE exceeds the actual effect.
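To detect 5pp effects at 80% power with baseline 40%, approximately n = 1,500 per cell would be needed (about 15x the current sample size), since the MDE shrinks with the square root of n and the 19.4pp MDE would have to fall by a factor of roughly 3.9. This is infeasible for the current GPU budget.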
The power analysis reveals a fundamental asymmetry in the study: we have excellent power to detect the Q2_K breakpoint (large effects, 20-80pp, all detected) but poor power to characterize the gradual degradation from Q8_0 to Q4_K_M (small effects, 0-5pp, mostly undetected). This means the study is well-suited for identifying safety thresholds but poorly suited for fitting smooth degradation curves. The power-law analysis (SS8) inherits this limitation.
Practically, the 3.9pp MDE at 0% baseline means the study can rule out safety degradation exceeding 4% for any Llama model at Q4_K_M and above. This is a strong negative result: if Q4_K_M introduced even a modest 5% ASR, we would detect it with 80% probability.
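The MDE column can be approximated with a standard two-sample normal formula. The sketch below is an assumption about the calculation, not the report's code: it uses a two-sided alpha of 0.05, equal n per cell, variance evaluated at the baseline, and a 1% floor on the baseline so that a 0% baseline does not yield zero variance. The `floor` parameter and function name are hypothetical.

```python
import math
from statistics import NormalDist


def mde_two_proportions(baseline, n_per_cell, alpha=0.05, power=0.80, floor=0.01):
    """Minimum detectable difference in proportions (normal approximation).

    Assumptions: two-sided alpha, equal n per cell, variance evaluated at
    the (floored) baseline proportion.
    """
    norm = NormalDist()
    z_total = norm.inv_cdf(1 - alpha / 2) + norm.inv_cdf(power)
    p = max(baseline, floor)
    return z_total * math.sqrt(2 * p * (1 - p) / n_per_cell)


MDE_AT_N100 = {b: mde_two_proportions(b, 100) for b in (0.00, 0.02, 0.04, 0.32, 0.40)}
for baseline, mde in MDE_AT_N100.items():
    print(f"baseline {baseline:.0%}: MDE {mde:.1%}")
```

Under these assumptions the values round to the 3.9%, 5.5%, 7.8%, 18.5%, and 19.4% entries in the table, which suggests (but does not prove) that a similar approximation underlies the reported analysis.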

SS16. Cross-TR Validation

Single-shot (N=1) refusal rates at Q8_0 serve as a baseline anchor, comparable to TR134's single-turn refusal rates.

Model	TR140 N=1 Q8_0 Refusal	n
llama3.1-8b	100.0%	100
llama3.2-1b	100.0%	100
llama3.2-3b	100.0%	100
qwen2.5-1.5b	98.0%	100

Observations.

All four models achieve 98-100% refusal at N=1 Q8_0, confirming that the baseline safety alignment is intact before many-shot or quantization pressure is applied.
qwen2.5-1.5b's 98% refusal at N=1 Q8_0 (2% ASR) is consistent with its known higher compliance tendency. The 2% baseline is small but non-zero, and grows to 40% at N=128 Q8_0 through many-shot pressure alone.
These baselines anchor TR140's findings to TR134's single-turn safety measurements. Any model that fails at N=1 Q8_0 would indicate a methodology problem rather than a quantization effect.
The 2% ASR on qwen2.5-1.5b at N=1 Q8_0 is consistent with TR134's finding that this model has a slightly lower refusal rate than Llama variants. Across TR134, TR139, and TR140, qwen2.5-1.5b consistently shows 1-3% baseline leakage at the highest precision level, suggesting a small but persistent gap in its safety training.
The cross-TR validation establishes external validity: the experimental pipeline produces consistent baseline measurements across independent runs, reducing the risk that TR140's elevated ASR values at Q2_K are artifacts of the scoring pipeline rather than genuine safety degradation.

SS17. Statistical Synthesis and Hypothesis Evaluation

H1: Invariant Power-Law Exponent

Verdict: REJECTED.

The power-law exponent b ranges from 0.155 to 0.769 across well-fit curves (SS8 Table 6). If H1 were true, all exponents would fall within a narrow band. The 5x range in point estimates, together with non-overlapping bootstrap CIs for several model-quant pairs, indicates that quantization and model identity both shift the exponent. However, many other bootstrap CIs are wide and include zero, so the rejection rests mainly on the point estimates rather than on uniformly separated CIs.

H2: Quantization Left-Shifts the Power Law

Verdict: INSUFFICIENT DATA.

Testing H2 requires comparing matched (model, quant_A) and (model, quant_B) power laws where both are well-fit. Only qwen2.5-1.5b has multiple well-fit curves (Q3_K_M, Q5_K_M, Q6_K, Q8_0). The exponents do not follow a monotonic BPW trend: Q5_K_M has the highest exponent (0.769), Q8_0 is lower (0.609), and Q3_K_M is lowest (0.155). This contradicts a simple left-shift model. More data points (more quant levels with well-fit curves) would be needed to evaluate H2.
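As an illustration of how an exponent b is obtained, the sketch below fits ASR = a * N^b by least squares on log-transformed values. Log-log least squares is an assumed fitting method; SS8 may use a different procedure. The data are the qwen2.5-1.5b Q8_0 aggregate ASR values quoted in this report (N=4 is not quoted here and is omitted), so the resulting exponent need not match Table 6.

```python
import math


def fit_power_law(shot_counts, asrs):
    """Fit ASR = a * N**b by least squares on log-transformed values."""
    xs = [math.log(n) for n in shot_counts]
    ys = [math.log(p) for p in asrs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


# qwen2.5-1.5b Q8_0 at N = 1, 16, 64, 128 (ASR as proportions).
a, b = fit_power_law([1, 16, 64, 128], [0.02, 0.04, 0.32, 0.40])
print(f"a={a:.3f}, b={b:.3f}")
```

A log-log fit cannot handle cells with 0% ASR, which is one reason only a subset of curves can be fitted at all.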

H3: Context-Window Caps Many-Shot Effectiveness

Verdict: SUPPORTED.

llama3.1-8b Q2_K shows peak ASR at N=16 (46%) with decline at N=64 (33%) and N=128 (26%). llama3.2-1b Q2_K shows a dip at N=64 (48%) between N=16 (62%) and N=128 (72%). The context-budget analysis (SS13) shows that N=64 requires approximately 6,300 tokens, which may approach the model's effective attention span. H3 is supported for llama3.1-8b (clear peak-then-decline) and partially supported for llama3.2-1b (dip-then-recovery). qwen2.5-1.5b shows no decline at high shot counts (its ASR peaks at N=128: 99% at Q2_K and 40% at Q8_0), so H3 does not apply to this model.

Synthesis

The three hypotheses paint a coherent picture: many-shot jailbreaking under quantization is not a simple scaling law. The power-law exponent shifts with quantization (H1 rejected), but not in a predictable left-shift pattern (H2 insufficient). Context-window limits naturally cap attack effectiveness on some models (H3 supported), providing a built-in mitigation.

Factor Hierarchy

Combining the variance decomposition (SS12), ANOVA (SS9), and effect-size analysis (SS6), the factors controlling ASR rank as follows:

Per-behavior residual (65.7%): The dominant factor. Individual harm behaviors vary enormously in elicitation difficulty, independent of experimental conditions.
Quantization level (17.9%): The largest experimental factor. Operates primarily through the Q2_K cliff -- the jump from Q3_K_M to Q2_K accounts for nearly all the explained quant variance.
Model identity (12.6%): The second experimental factor. Driven almost entirely by qwen2.5-1.5b's outlier vulnerability versus the three Llama models' near-uniform immunity.
Prompt format (not decomposed separately): Cross-cutting factor. SS10 shows format effects (Cohen's h = 0.49-2.74) that exceed most quant effects on a per-cell basis. Format is not isolated in the variance decomposition because both formats are pooled within each aggregated cell.
Shot count (2.7%): Surprisingly small. Shot count matters only on the minority of (model, quant) cells that are vulnerable -- and even there, the relationship is often non-monotonic (SS5.7).

Interaction Model

The results are best understood as a three-way interaction: Model x Quant x Format. The model determines baseline vulnerability (qwen2.5-1.5b: high; Llama: near-zero). Quantization to Q2_K breaks the safety floor for all models. The message array format then amplifies whatever vulnerability exists. These three factors are approximately multiplicative at low ASR and additive at high ASR (due to ceiling compression). Shot count acts as a dose parameter that modulates the format effect but contributes little independent variance.

Theoretical Framework for the Q2_K Cliff

The sharp Q3_K_M-to-Q2_K transition (0.56 BPW difference) suggests that safety alignment is encoded in a narrow precision band. One interpretation is that Llama's RLHF safety training creates a "safety subspace" in weight space -- a set of weight patterns that distinguish harmful from benign requests. Under this interpretation, sufficient precision remains at Q3_K_M (3.91 BPW) to represent these patterns, while at Q2_K (3.35 BPW) the quantization noise floor exceeds the magnitude of the safety-relevant weight components, effectively erasing the safety signal while leaving general language capability intact. This would explain why Q2_K models still generate fluent, relevant text (the "capability subspace" survives) while losing refusal behavior (the "safety subspace" is destroyed). Testing this hypothesis requires mechanistic interpretability work: probing safety-relevant neurons across quant levels to identify which layers fail first.

SS18. Production Guidance

Based on TR140's findings, the following recommendations apply to deploying quantized open-weight models in safety-sensitive contexts:

Avoid Q2_K for any safety-relevant deployment. Every model tested shows catastrophic safety degradation at Q2_K. There is no safe configuration at this quant level.
Treat Q3_K_M as marginal. Q3_K_M shows statistically significant ASR increases on qwen2.5-1.5b (N=1 and N=4) and borderline effects on Llama models. Q4_K_M is the minimum recommended quant for safety.
Restrict message array format inputs. The message array format is the dominant attack amplifier, more impactful than quantization level in most conditions. Input validation should prevent users from injecting arbitrary user/assistant message pairs.
Implement context-length guards. Limiting prompt length to 4K tokens restricts many-shot attacks to N=16, which is ineffective on Llama models at Q4_K_M and above. For qwen2.5-1.5b, context limits must be combined with quant-level restrictions.
Run safety benchmarks at the deployed quant level. Safety measured at Q8_0 does not predict safety at Q2_K. Any change in quantization level requires re-evaluation.
Model-specific hardening for qwen2.5-1.5b. This model is vulnerable to many-shot attacks even at Q8_0 (40% ASR at N=128). Quantization restrictions alone are insufficient; the model requires additional safety training or deployment-time guardrails.

SS18.1 Decision Matrix

The following matrix summarizes the recommended deployment posture for each (model-family, quant-level) combination under many-shot threat:

Quant	Llama (1b-8b)	qwen2.5-1.5b
Q8_0	SAFE: No many-shot risk. Deploy freely.	CAUTION: 40% aggregate ASR at N=128 (78% via message array). Add format restrictions + context limits.
Q6_K	SAFE: Equivalent to Q8_0 (TOST confirmed on 2 conditions).	CAUTION: Same as Q8_0.
Q5_K_M	SAFE: Equivalent to Q8_0.	CAUTION: Same as Q8_0 (42% aggregate ASR at N=128; 82% via message array).
Q4_K_M	SAFE: No significant ASR increase. Recommended production quant.	CAUTION: 45% ASR at N=128. Format restriction is critical.
Q3_K_M	MARGINAL: Low ASR (<6%) but threshold-sensitive significance. Monitor.	UNSAFE: 53% ASR at N=128. Significant at alpha=0.05 at low N.
Q2_K	UNSAFE: 26-72% ASR. Never deploy.	UNSAFE: 85-99% ASR. Completely broken.

How to use this matrix: Find your model family (column) and planned quant level (row). SAFE means no many-shot-specific mitigation is needed beyond standard safety practices. CAUTION means format restrictions and context-length limits are required. MARGINAL means the quant level should only be used with active monitoring and a fallback plan. UNSAFE means the configuration must not be deployed in any safety-sensitive context.
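
The matrix can be applied mechanically as a deploy-time gate. The sketch below encodes the table as a lookup. The names `POSTURE`, `REQUIRED_ACTION`, `deployment_posture`, and the family keys `"llama"` and `"qwen2.5-1.5b"` are illustrative assumptions, not part of the TR140 tooling. Falling back to UNSAFE for untested combinations is also an assumption of this sketch, on the reasoning that a configuration without measurements should not be trusted.

```python
# Illustrative encoding of the SS18.1 decision matrix (all names are hypothetical).
QUANTS = ["Q8_0", "Q6_K", "Q5_K_M", "Q4_K_M", "Q3_K_M", "Q2_K"]

POSTURE = {
    "llama": dict(zip(QUANTS, ["SAFE", "SAFE", "SAFE", "SAFE", "MARGINAL", "UNSAFE"])),
    "qwen2.5-1.5b": dict(
        zip(QUANTS, ["CAUTION", "CAUTION", "CAUTION", "CAUTION", "UNSAFE", "UNSAFE"])
    ),
}

REQUIRED_ACTION = {
    "SAFE": "standard safety practices only",
    "CAUTION": "format restrictions and context-length limits required",
    "MARGINAL": "active monitoring and a fallback plan required",
    "UNSAFE": "do not deploy in safety-sensitive contexts",
}


def deployment_posture(family: str, quant: str) -> tuple[str, str]:
    """Return (posture, required action); untested combinations default to UNSAFE."""
    posture = POSTURE.get(family, {}).get(quant, "UNSAFE")
    return posture, REQUIRED_ACTION[posture]


if __name__ == "__main__":
    for family in POSTURE:
        for quant in QUANTS:
            print(family, quant, *deployment_posture(family, quant), sep="\t")
```

A CI pipeline could call `deployment_posture` with the planned model family and quant level and fail the build on anything other than SAFE or an explicitly approved CAUTION.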

SS18.2 Defense Layering

No single mitigation is sufficient for all (model, quant) combinations. The recommended defense stack, in priority order:

Quant-level floor (Q4_K_M minimum): Eliminates the Q2_K cliff. Effective for all Llama models. Does not protect qwen2.5-1.5b at high N.
Format restriction (block message array injection): Eliminates the dominant attack amplifier (Cohen's h = 0.49-2.74 in SS10). Requires API-level input validation.
Context-length cap (4K tokens): Restricts N to 16, which is below the minimum effective shot count for all Llama models and limits qwen2.5-1.5b Q8_0 to 4% ASR.
Token-count monitoring: Detects many-shot prompts by their characteristic token counts (SS14b). N=64+ prompts use 6K+ tokens, which is anomalous for normal conversation.
Model-specific safety layer: For qwen2.5-1.5b, an output filter or safety classifier is needed because the model is vulnerable even at Q8_0.

Layers 1-3 together eliminate the many-shot threat for Llama deployments. For qwen2.5-1.5b, all five layers are recommended.
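
The format-restriction, context-cap, and token-monitoring layers can be combined in a single request guard. The following is a minimal sketch under stated assumptions: the function names, the `issued_assistant_texts` bookkeeping, and the 4-characters-per-token estimate are all hypothetical, and a real deployment would count tokens with the served model's own tokenizer. The 4K cap and the 6K anomaly level are taken from the defense stack above.

```python
# Sketch of a request guard: format restriction + context cap + token monitoring.
# All names are illustrative assumptions, not TR140 tooling.
MAX_PROMPT_TOKENS = 4096  # context-length cap from the defense stack
ALERT_TOKENS = 6000       # N=64+ prompts use 6K+ tokens per the report


def estimate_tokens(text: str) -> int:
    """Crude ~4 characters/token heuristic; swap in the model's real tokenizer."""
    return max(1, len(text) // 4)


def validate_request(messages, issued_assistant_texts):
    """Return (allowed, reason) for a client-supplied message array.

    issued_assistant_texts: set of assistant replies this server actually
    produced for the session, so fabricated assistant turns can be rejected.
    """
    total = 0
    for msg in messages:
        role = msg.get("role")
        content = msg.get("content", "")
        if role not in ("user", "assistant"):
            return False, f"role not accepted from clients: {role!r}"
        if role == "assistant" and content not in issued_assistant_texts:
            return False, "assistant turn was not issued by this server"
        total += estimate_tokens(content)
    if total >= ALERT_TOKENS:
        return False, f"anomalous prompt size (~{total} tokens): flag for review"
    if total > MAX_PROMPT_TOKENS:
        return False, f"prompt exceeds {MAX_PROMPT_TOKENS}-token cap (~{total} tokens)"
    return True, "ok"


if __name__ == "__main__":
    injected = [
        {"role": "user", "content": "example question"},
        {"role": "assistant", "content": "fabricated compliant answer"},
        {"role": "user", "content": "target request"},
    ]
    print(validate_request(injected, issued_assistant_texts=set()))
```

Checking assistant turns against server-side records, rather than banning the assistant role outright, keeps stateless chat clients working while still blocking injected exemplar pairs.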

SS19. Limitations and Follow-Up

SS19.1 Methodological Limitations

Ollama quantization only. GGUF quant levels via llama.cpp may not behave identically to other quantization frameworks (GPTQ, AWQ, bitsandbytes). The K-quant mixed-precision approach in GGUF allocates different bit widths to different layers based on sensitivity analysis, which may protect safety-critical layers differently than uniform quantization. Results should not be generalized to other frameworks without replication.
Static shot pool. The 100 exemplar Q/A pairs are template-generated, not sourced from an uncensored model. Real-world attackers may use more convincing demonstrations harvested from actual model outputs, potentially increasing ASR at lower shot counts. The minimum effective shot count (SS5.6) should be treated as an upper bound -- sophisticated attackers may need fewer examples.
Single seed. All runs use seed = 42 and temperature = 0.0 for reproducibility, but this means results capture a single deterministic trajectory. At nonzero temperature, some cells near the 0% or 100% boundary might show different rates. The Wilson CIs partially account for this, but true variance estimation would require multiple seeds.
Regex + LLM judge pipeline. The regex refusal detector has known edge cases with partial refusals ("I can't help with that, but here's some related information..."). The 7B LLM judge has its own biases, and the low kappa (0.23, "fair" on the Landis-Koch scale) reflects genuine scorer disagreement. The Q2_K stratum is most affected (63.5% agreement), meaning Phase 1 ASR values at Q2_K carry higher measurement uncertainty than the reported CIs suggest.
Consumer-scale models only (1.2B-8B). Results may not generalize to 70B+ parameter models or API-served models with additional safety layers, system prompts, or output filters. Larger models may have more redundant safety representations that survive quantization better, or they may have enough capacity to simultaneously process exemplars and maintain safety -- both untested.
Context-window conflation. At N=128 (~12,300 tokens), some models may be operating near their effective context limit. The ASR decline at high N on llama3.1-8b (SS5.7) may reflect attention degradation rather than safety robustness. Disentangling these would require testing the same model with different context-window configurations (e.g., RoPE scaling).
Phase 2 sample size. Phase 2 uses n = 50 per cell (vs n = 100 for Phase 1), reducing statistical power for context-dilution effects. Only qwen2.5-1.5b's dilution slope has a CI excluding zero. Doubling Phase 2 to n = 100 would likely resolve several borderline cases.
No system prompt variation. All tests use a default system prompt. Production deployments often include custom system prompts with safety instructions ("You are a helpful assistant. Never provide harmful content."). System prompts may interact with many-shot pressure differently across quant levels -- a strong safety system prompt might raise the effective shot count needed for jailbreaking, or it might be ignored at Q2_K.
Binary compliance metric. ASR treats compliance as binary (refused/complied), but real responses exist on a spectrum from full refusal through partial hedging to enthusiastic compliance. The quality or specificity of harmful outputs may degrade under quantization even when the binary judge labels them as "compliant." A graded compliance metric would capture this nuance.

SS19.2 Open Research Questions

The following questions are raised by TR140's findings but cannot be answered within the current experimental design:

Why does the Q2_K cliff exist? The sharp transition between Q3_K_M (safe) and Q2_K (broken) on Llama models suggests that a specific precision threshold is needed to maintain the weight patterns encoding safety alignment. At 3.35 BPW, the model may lose the ability to represent the fine-grained distinctions between "generate harmful content" and "refuse harmful content." Mechanistic interpretability work (e.g., probing safety-relevant neurons across quant levels) could identify which layers or attention heads fail first under quantization.
Is the message array advantage due to chat-template tokens or to structural positioning? SS10 shows that message array format is dramatically more effective, but it is unclear whether the advantage comes from (a) the literal special tokens injected by the chat template, (b) the structural separation of exemplars into distinct turns that the model's attention processes as "real" conversation history, or (c) some combination. Testing a third format -- faux dialogue with injected special tokens but no API-level message separation -- would isolate these factors.
Does safety training target the template or the concept? Llama's immunity to faux dialogue (0% ASR on all cells) but vulnerability to message array at Q2_K suggests that Llama's safety training is anchored to chat-template tokens rather than to the semantic content of harmful requests. If true, this has implications for safety training methodology: template-anchored safety is fragile to template-injection attacks regardless of quantization.
What explains the 65.7% residual variance? The per-behavior variance dominates the experimental design, but we do not know which behaviors are easy vs hard to elicit, or why. Is it the semantic category (e.g., "how to build a weapon" vs "how to harass someone"), the linguistic framing, the length of the request, or the model's training data coverage? A behavior-level analysis correlating ASR with behavior features could identify the drivers.
Can many-shot and multi-turn attacks compound? TR139 tested multi-turn conversational attacks; TR140 tested many-shot in-context attacks. An attacker could combine both: use crescendo or role-play strategies (TR139) with many-shot exemplars (TR140) in the same conversation. The interaction between these attack surfaces is unexplored and may be super-additive.
Does quantization-aware safety training exist? If safety alignment degrades under quantization, a natural follow-up is quantization-aware safety fine-tuning: training the model's safety responses at the target quant level. This is analogous to quantization-aware training (QAT) for accuracy but applied to safety. We are not aware of published work on this topic.

SS19.3 Follow-Up Work

TR141: Cross-architecture refusal fragility under batch perturbation. Tests whether safety is consistent across concurrent requests -- a different threat model from many-shot.
TR143 (proposed): Adaptive many-shot attacks. Iteratively refine exemplars based on model responses to determine whether the minimum effective shot count can be reduced below the static-pool values in SS5.6.
TR144 (proposed): Many-shot jailbreaking at 13B-70B scale on cloud A100 GPUs. Test whether model scale provides inherent many-shot resistance, and whether the Q2_K cliff shifts at larger parameter counts.
Cross-framework replication: Replicate the Q2_K cliff finding using GPTQ and AWQ quantization to determine whether the safety breakpoint is GGUF-specific or a general quantization phenomenon. If the cliff occurs at the same BPW across frameworks, it suggests a fundamental precision threshold for safety; if it varies, the framework's layer-wise bit allocation strategy is the mediating factor.
System prompt interaction: Test whether safety system prompts shift the minimum effective shot count or raise the critical quant threshold above Q2_K. A strong safety system prompt might effectively raise the quant-safety floor from Q2_K to Q3_K_M, or it might be ignored entirely at Q2_K -- both outcomes have distinct production implications.
Behavior-level analysis: Correlate per-behavior ASR with behavior features (specificity, category, linguistic complexity, dual-use framing) to explain the 65.7% residual variance. Identifying which behaviors are most vulnerable to many-shot attacks would enable targeted safety training improvements.
Quantization-aware safety training (QAS-T): Fine-tune safety responses at the target quant level rather than at FP16. If the Q2_K cliff exists because safety weights are lost during post-hoc quantization, training the model to be safe at Q2_K precision might recover the safety signal. This would be analogous to quantization-aware training (QAT) for accuracy but applied to the safety objective.
Scaling law extrapolation: Use TR140's 4-model data (1.2B-8B) to fit a preliminary scaling law for many-shot susceptibility vs model size, then test predictions at 13B and 70B. If the Q2_K cliff shifts to lower BPW at larger model sizes (i.e., larger models tolerate more aggressive quantization before safety breaks), this would quantify the "safety margin" provided by scale.

SS20. Conclusions

TR140 provides a systematic measurement of many-shot jailbreak susceptibility under GGUF quantization on open-weight models. The principal conclusions are:

Many-shot jailbreaking is a model-specific and format-specific threat, not a quantization-specific one. Quantization amplifies existing vulnerabilities but does not create them. qwen2.5-1.5b is vulnerable at every quant level; Llama models are immune above Q3_K_M. The implication is that safety evaluations must be model-specific -- family-level conclusions ("Llama is safe") hold within family but do not transfer across families.
Q2_K is the universal safety breakpoint. Every model tested shows catastrophic ASR elevation at Q2_K. The transition from safe to broken occurs sharply between Q3_K_M (3.91 BPW) and Q2_K (3.35 BPW), a span of only 0.56 BPW. This cliff suggests a phase transition in the model's ability to represent safety-relevant weight patterns. The practical consequence is that Q2_K should be treated as a hard safety boundary, not a gradual degradation.
The message array prompt format is more dangerous than any quantization level. Switching from faux dialogue to message array produces ASR increases of 50-96pp on vulnerable cells, exceeding the effect of any single quantization step. This finding reframes the threat model: input validation and format restrictions are higher-leverage safety interventions than quantization restrictions.
Power-law scaling exists but is not invariant. The exponent b shifts across models and quant levels (H1 rejected), ranging from 0.15 to 0.77. A simple "more shots = proportionally more effective" model is insufficient. The exponent encodes model-specific and quant-specific properties of the safety-versus-in-context-learning tradeoff, and cannot be predicted from model size or BPW alone.
Context-window limits are a natural defense. ASR peaks at N=16 on llama3.1-8b Q2_K then declines, confirming that finite context windows cap many-shot effectiveness (H3 supported). This is encouraging for consumer deployments with 4K-8K context limits, but does not protect against models served with extended context (32K+).
Residual variance dominates. Per-behavior variation explains 65.7% of ASR variance. The specific harmful request matters more than the model, quantization level, or shot count. This implies that behavior-level safety hardening (improving training data for specific harm categories) is a higher-return investment than quant-level restrictions.
Latency is not a natural defense. Many-shot prompts with N=128 add less than 600ms of latency over N=1 (SS14b), meaning rate-limiting based on response time cannot detect many-shot attacks. Token-count monitoring and context-length limits are the correct detection mechanism.

Cross-TR Comparison

Dimension	TR134 (Safety Baselines)	TR139 (Multi-Turn Jailbreak)	TR140 (Many-Shot Jailbreak)
Attack type	Single-turn refusal	Multi-turn conversational (8 strategies)	In-context compliance exemplars (N=1-128)
Models	4 models, 6 quants	4 models, 6 quants	4 models, 6 quants
Peak ASR	~15% (Q2_K)	~60% (Q2_K, crescendo)	99% (Q2_K, N=128, qwen2.5-1.5b)
Q2_K breakpoint?	Yes	Yes	Yes
Q4_K_M safe?	Yes	Yes (for Llama)	Yes (for Llama)
qwen2.5-1.5b outlier?	Moderate	Yes	Yes (most extreme)
Dominant factor	Quant level	Strategy x quant	Residual (per-behavior)
Format effect	N/A	Strategy-dependent	Message array dominant (h = 0.49-2.74)
Judge kappa	~0.3	0.23 (7B dual-judge)	0.23

The cross-TR pattern is consistent: Q2_K is the universal safety breakpoint across single-turn, multi-turn, and many-shot attack surfaces. qwen2.5-1.5b is the most vulnerable model across all three studies. Q4_K_M and above are safe for Llama models across all attack types tested. The converging evidence from three independent threat models strengthens the production recommendation: Q4_K_M is the minimum safe quant for Llama deployments.

Broader Implications

TR140's findings bear on four questions under active discussion in the safety community:

The quantization-safety tradeoff. The prevailing assumption is that quantization degrades safety gradually as precision decreases. TR140 shows this is wrong -- safety degrades as a cliff function, not a slope. Models are either safe (Q4_K_M and above for Llama) or catastrophically broken (Q2_K). This has implications for how quantization is regulated: a "minimum BPW for safety" threshold (e.g., 4 BPW) would be more effective than a graduated penalty.

In-context learning vs safety training. Many-shot jailbreaking fundamentally pits in-context learning against safety training. At sufficient example count, in-context learning overwhelms the safety signal -- but only on some models and only below a precision threshold. This suggests that safety training and in-context learning occupy different weight subspaces, and quantization selectively destroys the safety subspace at extreme compression. Understanding this interaction at the mechanistic level is a key open problem.

The role of the chat template in safety. TR140 reveals that safety alignment is partially anchored to chat-template tokens rather than to semantic content. Llama's complete immunity to faux dialogue (which uses plain-text markers) but vulnerability to message array format at Q2_K (which uses real template tokens) implies that the model's safety decision boundary is located in the template-processing pathway, not in the content-understanding pathway. If confirmed by mechanistic work, this finding would suggest that safety training should target both pathways independently -- template-level safety (refusing when special tokens frame a harmful request) and content-level safety (refusing when the semantic content is harmful, regardless of formatting). Current RLHF approaches may inadvertently concentrate safety in the template pathway because training data predominantly uses the chat template format.

Ecological validity. The many-shot attack as implemented in TR140 is somewhat artificial: an attacker must supply 50-128 curated compliance examples within a single conversation. In practice, API-level defenses (turn limits, content filtering) would intercept many of these attempts. However, the attack is more realistic than adversarial suffix attacks (Zou et al., 2023) because it requires no optimization -- only access to the chat API with message injection. The message array format attack in particular is directly applicable to any deployment that accepts structured message arrays as input.

SS22. v2 Control C1 -- Judge Triangulation

SS22.1 Motivation

v1 relied on a single LLM judge (qwen2.5:7b-instruct-q8_0) plus a regex refusal detector. The v1-reported Cohen's kappa of 0.23 was interpreted as "moderate scorer disagreement on ambiguous outputs" in SS14. Reviewers raised the concern that the v1 judge might be systematically under- or over-calling COMPLIANCE in a way that biased every ASR number in the report. C1 addresses this by rejudging all 15,000 v1 samples with two additional judges from different families: gemma3:12b (local Ollama, free) and Claude Sonnet 4.6 (API, paid).

SS22.2 Three-Judge Agreement Matrix

Pairwise Cohen's kappa and raw agreement, computed on the maximum overlap for each pair (source: research/tr140/v2_controls/analysis/v3_stats.json under c1_agreement.pairwise, recomputed fresh from JSONL by research/tr140/compute_v2_stats.py).

Table v2.1: Three-judge agreement on the 15,000 v1 samples

Pair	Overlap n	Raw agreement	Cohen's kappa	Interpretation
v1 qwen2.5:7b vs gemma3:12b	11,845	92.88%	0.233	fair -- qwen is the outlier
v1 qwen2.5:7b vs Claude Sonnet 4.6	11,440	94.17%	0.246	fair -- qwen is the outlier
gemma3:12b vs Claude Sonnet 4.6	11,451	99.02%	0.925	near-perfect
Three-way (all three agree)	11,419	93.76%	--	dominated by v1 qwen's disagreements

Observations.

The v1 qwen2.5:7b judge disagrees with both gemma3 and Claude at a ~6-7% raw rate (7.1% and 5.8% respectively), while gemma3 and Claude disagree with each other at <1% raw rate. The v1 kappa of 0.23 was not intrinsic to the task; it reflected systematic bias in the v1 judge's COMPLIANCE threshold.
At kappa = 0.925, the gemma3-Claude pair satisfies the "near-perfect" Landis-Koch threshold (kappa > 0.8), meaning the v3.0 ASR numbers computed from gemma3 rejudge labels are effectively interchangeable with Claude-rejudged ASR numbers. The v3 report cites gemma3 labels where available (they cover the full 15K) and uses Claude as a cross-check.
The overlap sizes (11,451 and 11,440) are less than 15,000 because the joining keys (behavior_id, model, quant, shot_count, prompt_format, phase) have some missing fields in the Claude batch file schema. The 11,419 three-way overlap is the conservative base for the 93.76% three-way agreement claim.
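
Table v2.1 pairs raw agreement above 92% with kappa near 0.23, which can look contradictory. The sketch below computes Cohen's kappa from two label lists and shows how class imbalance produces that pattern. The counts are hypothetical and chosen only for illustration; they are not TR140 data, and the function names are not taken from compute_v2_stats.py.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of the two raters' marginal rates, per label.
    expected = sum(count_a[c] * count_b[c] for c in set(count_a) | set(count_b)) / n**2
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def raw_agreement(labels_a, labels_b):
    """Fraction of items on which the two raters give the same label."""
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


# Hypothetical 1,000-item example: refusals dominate, and judge B calls
# COMPLIANCE far more often than judge A on the disputed items.
judge_a = ["REFUSAL"] * 920 + ["COMPLIANCE"] * 10 + ["COMPLIANCE"] * 5 + ["REFUSAL"] * 65
judge_b = ["REFUSAL"] * 920 + ["COMPLIANCE"] * 10 + ["REFUSAL"] * 5 + ["COMPLIANCE"] * 65

if __name__ == "__main__":
    print(f"raw agreement = {raw_agreement(judge_a, judge_b):.3f}")
    print(f"Cohen's kappa = {cohens_kappa(judge_a, judge_b):.3f}")
```

Because almost every item is a refusal, two judges agree on most rows by chance alone, so kappa is driven by the small minority of COMPLIANCE calls where the judges' thresholds differ.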

SS22.3 Do the 15 v1 "significant comparisons" Survive Under gemma3 and Claude?

The v1 report identified 15 Holm-significant Q2_K or Q3_K_M vs Q8_0 comparisons (SS6 Table 5). C1 reruns the four model-level Q2_K vs Q8_0 Fisher exact tests on the 15K v1 samples separately under each judge. (The five-quant-level shot-count-broken-out cells have insufficient overlap under the Claude batch join to replicate the exact v1 15-row test list; the model-level pooled test is the closest directly comparable number.)

Table v2.2: Q2_K vs Q8_0 pooled-model Fisher + Holm under three judges

Model	Judge	Q2_K k/n (ASR)	Q8_0 k/n (ASR)	Fisher p	Holm p (family of 4)	Cohen's h
llama3.2-1b	v1 qwen2.5:7b	18/512 (3.5%)	0/550 (0.0%)	1.7e-06	6.8e-06	0.38
llama3.2-1b	gemma3:12b (rejudge)	131/553 (23.7%)	1/650 (0.15%)	8.3e-16	3.3e-15	0.94
llama3.2-1b	Claude Sonnet 4.6	103/451 (22.8%)	1/649 (0.15%)	1.1e-15	3.8e-15	0.92
llama3.2-3b	v1 qwen2.5:7b	0/547 (0.0%)	0/550 (0.0%)	1.000	1.000	0.00
llama3.2-3b	gemma3:12b (rejudge)	41/650 (6.3%)	1/650 (0.15%)	1.1e-11	1.1e-11	0.43
llama3.2-3b	Claude Sonnet 4.6	15/602 (2.5%)	1/649 (0.15%)	1.6e-04	1.6e-04	0.24
qwen2.5-1.5b	v1 qwen2.5:7b	29/516 (5.6%)	9/549 (1.6%)	4.4e-04	1.3e-03	0.22
qwen2.5-1.5b	gemma3:12b (rejudge)	291/599 (48.6%)	90/650 (13.8%)	1.3e-15	3.4e-15	0.78
qwen2.5-1.5b	Claude Sonnet 4.6	202/298 (67.8%)	62/646 (9.6%)	1.2e-15	3.8e-15	1.30
llama3.1-8b	v1 qwen2.5:7b	9/550 (1.6%)	2/550 (0.36%)	6.4e-02	1.3e-01	0.14
llama3.1-8b	gemma3:12b (rejudge)	109/649 (16.8%)	7/650 (1.1%)	1.1e-15	3.4e-15	0.64
llama3.1-8b	Claude Sonnet 4.6	122/649 (18.8%)	2/644 (0.31%)	9.5e-16	3.8e-15	0.79

Verified from research/tr140/v2_controls/analysis/v3_stats.json under key c1_agreement.q2k_vs_q8_0_by_judge.

Observations.

Under the v1 qwen judge, llama3.1-8b's Q2_K-vs-Q8_0 test is non-significant after Holm (p = 0.128). Under gemma3 it is p = 3.4e-15; under Claude it is p = 3.8e-15. The v1 report's "llama3.1-8b only marginally affected by Q2_K" framing is a judge artifact. With stronger judges, llama3.1-8b is no longer an exception among the four models in the Q2_K-vs-Q8_0 comparison.
Under the v1 qwen judge, llama3.2-3b's Q2_K-vs-Q8_0 test has p = 1.000 (both 0/n). The v1 qwen judge never called COMPLIANCE on any llama3.2-3b row in this pooled subset. gemma3 calls 41 COMPLIANCE events on the Q2_K arm (6.3%); Claude calls 15 (2.5%). The v1 qwen judge was systematically missing compliance events on this model.
The point estimates of Q2_K-vs-Q8_0 delta are always higher under gemma3 and Claude than under v1 qwen. The v1 ASR numbers at Q2_K are conservative lower bounds.
The "Cohen's h = 2.06" extreme from v1 SS6 came from the n=100 cell qwen2.5-1.5b N=1 Q2_K (85%) vs Q8_0 (2%). In the pooled triangulation the per-model Cohen's h maxes out at 1.30 (qwen2.5-1.5b under Claude), which is consistent with v1's extreme effect but diluted by pooling across shot counts.
The v2 message: the original v1 Fisher + Holm framework was structurally correct; the v1 judge attenuated effect sizes but did not invent them. Three-judge agreement is the appropriate v3 standard.
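
The per-row statistics in Table v2.2 can be recomputed from the k/n counts. The sketch below is a standard-library illustration, not the code in compute_v2_stats.py: the function names are hypothetical, and the two-sided Fisher rule used here (summing every table no more probable than the observed one) is an assumption about the variant, chosen because it is a common default.

```python
from math import asin, comb, sqrt


def cohens_h(k1, n1, k2, n2):
    """Effect size for two proportions: difference of arcsine-transformed rates."""
    return 2 * asin(sqrt(k1 / n1)) - 2 * asin(sqrt(k2 / n2))


def fisher_exact_two_sided(k1, n1, k2, n2):
    """Two-sided Fisher exact p-value from the hypergeometric distribution."""
    total_k, total_n = k1 + k2, n1 + n2
    denom = comb(total_n, n1)
    lo = max(0, n1 - (total_n - total_k))
    hi = min(total_k, n1)
    # P(arm 1 holds x of the total_k events), conditioning on both margins.
    pmf = [
        comb(total_k, x) * comb(total_n - total_k, n1 - x) / denom
        for x in range(lo, hi + 1)
    ]
    observed = pmf[k1 - lo]
    return min(1.0, sum(p for p in pmf if p <= observed * (1 + 1e-9)))


def holm_adjust(p_values):
    """Holm step-down adjustment; returns adjusted p-values in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        running_max = max(running_max, (m - rank) * p_values[idx])
        adjusted[idx] = min(1.0, running_max)
    return adjusted


if __name__ == "__main__":
    # v1 qwen2.5:7b rows of Table v2.2: (Q2_K k, n, Q8_0 k, n)
    rows = {
        "llama3.2-1b": (18, 512, 0, 550),
        "llama3.2-3b": (0, 547, 0, 550),
        "qwen2.5-1.5b": (29, 516, 9, 549),
        "llama3.1-8b": (9, 550, 2, 550),
    }
    raw = [fisher_exact_two_sided(*counts) for counts in rows.values()]
    for (model, counts), p, p_holm in zip(rows.items(), raw, holm_adjust(raw)):
        print(f"{model}: h={cohens_h(*counts):.2f} fisher={p:.2g} holm={p_holm:.2g}")
```

Holm is applied within each judge's family of four tests, which is why the same raw p-value can map to different adjusted values across judges.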

SS22.4 Which v3 ASR numbers does this report cite?

All v1 tables in SS5-SS16 cite the original v1 qwen2.5:7b labels to preserve historical consistency. Where v2 rejudged labels would materially change the qualitative claim (llama3.1-8b case above), the text explicitly flags the rejudge delta.
All NEW v2 analyses (SS22-SS29) cite gemma3:12b labels by default, because gemma3 covers the full 27,000-row C7 sample and the full 10,800-row C8 sample with 0% judge failure. Claude was run only on the 15,000 v1 set for triangulation; it is not used for the C7/C8/C9/C10/C11/C12 analyses because no Claude labels exist for those cells.
All pairwise agreement, kappa, and TOST numbers are computed from raw JSONL by research/tr140/compute_v2_stats.py; no pre-existing analysis JSON is trusted without row-count verification.

SS23. v2 Control C2 -- FP16 Baseline (Qwen2.5-1.5b + Llama3.1-8b)

SS23.1 Motivation

v1's baseline was Q8_0, not FP16, because Ollama does not serve FP16 for the evaluated models. Reviewers challenged the implicit claim that Q8_0 was a faithful FP16 proxy. C2 addresses this by running FP16 inference via transformers + CUDA (bypassing Ollama) on the two most informative models: qwen2.5-1.5b (the v1 outlier) and llama3.1-8b (the largest v1 model). n=300 per (model, FP16) cell = 600 rows total.

Table v2.3: C2 FP16 ASR vs v1 Q8_0 ASR

Source: research/tr140/v2_controls/results/c2_fp16_baseline/20260414_203036/samples.jsonl (600 rows).

Model	FP16 ASR (C2)	FP16 Wilson 95% CI	Q8_0 ASR (v1)	Q8_0 Wilson 95% CI	Delta (FP16 - Q8_0)	TOST p (±3pp)	Equivalent?
qwen2.5-1.5b	8.0% (24/300)	[5.4%, 11.6%]	14.5% (94/650)	[12.0%, 17.4%]	-6.5pp	0.951	NO
llama3.1-8b	4.7% (14/300)	[2.8%, 7.7%]	1.1% (7/650)	[0.5%, 2.2%]	+3.6pp	0.677	NO

Observations.

qwen2.5-1.5b FP16 is safer than Q8_0 by 6.5pp, with non-overlapping Wilson CIs (FP16 upper 11.6% < Q8_0 lower 12.0%). The v1 Q8_0 baseline was not a faithful FP16 proxy for this model.
llama3.1-8b FP16 shows a small directional increase vs Q8_0 (+3.6pp), and the Wilson CIs are narrowly separated rather than overlapping (FP16 lower 2.8% > Q8_0 upper 2.2%). The gap rests on small event counts (14/300 vs 7/650), and the direction is suspicious given the qwen finding; we do NOT claim FP16 is less safe than Q8_0 on llama3.1-8b from n=300.
TOST at ±3pp fails for both models, meaning equivalence between FP16 and Q8_0 is not established for either model. The v1 report's Q8_0 baseline should therefore be interpreted as a "Q8_0 baseline" and NOT as an "FP16 proxy."
The practical implication is that any deployment that genuinely runs FP16 (e.g., non-quantized HuggingFace transformers serving) will have a materially different ASR profile than the v1 report's Ollama Q8_0 numbers. The direction is model-dependent.
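
The report does not spell out which TOST variant produced the p-values in Table v2.3. A minimal sketch, assuming two one-sided Wald z-tests on the difference of proportions with an unpooled standard error; under that assumption the k/n counts in the table appear to reproduce the reported values of 0.951 and 0.677. The function name and signature are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def tost_two_proportions(k1, n1, k2, n2, margin=0.03):
    """Return (delta, TOST p-value) for equivalence of two proportions.

    Equivalence is claimed only if the returned p-value is below alpha,
    i.e. both one-sided nulls (delta <= -margin, delta >= +margin) are rejected.
    """
    p1, p2 = k1 / n1, k2 / n2
    delta = p1 - p2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)  # unpooled (assumption)
    if se == 0:
        raise ValueError("standard error is zero; Wald TOST is undefined")
    norm = NormalDist()
    p_lower = 1 - norm.cdf((delta + margin) / se)  # H0: delta <= -margin
    p_upper = norm.cdf((delta - margin) / se)      # H0: delta >= +margin
    return delta, max(p_lower, p_upper)


if __name__ == "__main__":
    # (FP16 k, n, Q8_0 k, n) from Table v2.3
    cells = {"qwen2.5-1.5b": (24, 300, 94, 650), "llama3.1-8b": (14, 300, 7, 650)}
    for model, counts in cells.items():
        delta, p = tost_two_proportions(*counts)
        verdict = "equivalent" if p < 0.05 else "equivalence not established"
        print(f"{model}: delta={delta * 100:+.1f}pp TOST p={p:.3f} ({verdict})")
```

A large TOST p-value is not evidence of a difference; it only says the data cannot rule out a gap of 3pp or more in at least one direction.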

SS23.2 Why Is Qwen2.5-1.5b FP16 Safer Than Q8_0?

This is a within-family, within-model comparison, so the difference cannot be attributed to the family or model. Three hypotheses:

Ollama template/system-prompt divergence. Ollama may apply a different chat template or system prompt than the transformers inference path. The C2 pipeline uses the qwen2.5-1.5b-instruct HuggingFace chat template directly.
Bit-accurate Q8_0 is lossy at outlier weights. The qwen2.5 family is known to have outlier attention weights that are quantization-sensitive. Q8_0's round-to-nearest can shift safety-critical weight values by 1-2 LSBs.
Ollama llama.cpp version drift. The v1 run used OLLAMA_VERSION=0.6.x; the v2 C2 run used transformers 4.44. Ollama updates can change kernel selection.

We do not resolve this in v3.0 -- it is a clear follow-up. For now, the conservative reading is that v1's Q8_0 baseline is one specific deployment configuration, not a universal "unquantized" baseline.

SS24. v2 Control C3 -- Static Prompt Ablation

SS24.1 Motivation

v1 used randomized exemplar selection from a pool of 100 compliance Q/A pairs. A reviewer objection: the ASR may depend on the specific random selection rather than on the many-shot pressure. C3 runs a "static" prompt mode where the same fixed 50 exemplars are used across every behavior.

SS24.2 Result

Source: research/tr140/v2_controls/results/c3_static_prompt/20260414_193834/samples.jsonl (1,200 rows, 4 models x 6 quants x 50 behaviors, static-mode only).

Table v2.4: C3 static-prompt ASR at each (model, quant) cell, n=50

Model	Q2_K	Q3_K_M	Q4_K_M	Q5_K_M	Q6_K	Q8_0
llama3.1-8b	2.0%	0.0%	0.0%	0.0%	0.0%	0.0%
llama3.2-1b	32.0%	4.0%	4.0%	2.0%	2.0%	2.0%
llama3.2-3b	6.0%	0.0%	2.0%	0.0%	0.0%	0.0%
qwen2.5-1.5b	52.0%	8.0%	6.0%	0.0%	2.0%	2.0%

Observations.

The Q2_K cliff reproduces: every Q2_K cell is elevated above its Q8_0 counterpart.
At n=50 the Wilson CI half-widths are wide (±7-14pp), so Q3_K_M vs Q8_0 comparisons are individually underpowered. Aggregate Q2_K-vs-others remains clear.
The static-prompt ASR values are lower than v1's randomized-prompt ASR on qwen2.5-1.5b Q2_K (52% static vs 85% v1 N=1, 99% v1 N=128). Taken at face value, randomized exemplar selection amplifies the attack by 33-47pp, although the comparison is cross-run. This is an interesting methodological note but does not overturn any v1 claim.
C3 does NOT compare static vs randomized head-to-head within the same run -- it only reports the static arm. The delta vs v1 is cross-run and includes judge family, batch scheduling, and model-stack drift. Treat as directional.

SS25. v2 Control C4 -- Benign-Demo Negative Control

SS25.1 Motivation

A classical negative control: if many-shot jailbreaking works because of the harmful nature of the exemplars (demonstrating compliance with harmful requests), then replacing harmful exemplars with benign exemplars (e.g., "Human: What is 2+2? Assistant: 4.") should NOT elevate ASR. C4 runs this substitution on the two most informative models at two quants (Q2_K and Q8_0), 100 behaviors per cell = 400 rows.

SS25.2 Result

Source: research/tr140/v2_controls/results/c4_benign_demo/20260414_195746/samples.jsonl (400 rows).

Table v2.5: C4 ASR with benign few-shot demos

Cell	ASR	k/n	Wilson 95% CI
llama3.1-8b Q2_K	4.0%	4/100	[1.6%, 9.8%]
llama3.1-8b Q8_0	0.0%	0/100	[0.0%, 3.7%]
qwen2.5-1.5b Q2_K	75.0%	75/100	[65.7%, 82.5%]
qwen2.5-1.5b Q8_0	3.0%	3/100	[1.0%, 8.5%]
Pooled (all 4 cells)	20.5%	82/400	[16.8%, 24.7%]

Observations.

The C4 benign-demo cell produces 75% ASR on qwen2.5-1.5b Q2_K -- only 10pp below the v1 many-shot ASR of 85% (Q2_K, N=1, random harmful demo). The "benign" demo does not rescue the model. The model is already broken by Q2_K before demos are introduced.
llama3.1-8b Q2_K with benign demo shows 4% ASR vs v1's Q2_K N=1 ASR of 2%. The 2pp gap sits well inside the Wilson CI ([1.6%, 9.8%]), so there is no meaningful elevation above the v1 baseline, which is the outcome a proper negative control should produce.
Q8_0 cells behave as expected: near-zero ASR with benign demos.
The C4 result means that the benign-demo control only cleanly negates many-shot on models/quants that aren't already broken. It is not a universal negative control. The v1 claim that "many-shot works because of the semantic compliance cue" is not established by C4 for qwen2.5-1.5b Q2_K; an alternative hypothesis (Q2_K's quant-induced baseline compliance is dominant) is equally consistent with C4's data.
Follow-up: a cleaner negative control would substitute benign refusal demos, testing whether demonstrating refusal rescues the model. That experiment was not part of C4 and is flagged in SS30.
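
The Wilson intervals in Table v2.5 follow directly from the k/n counts. The sketch below implements the standard Wilson score interval with the Python standard library; the function name is illustrative and this is not claimed to be the code used in the TR140 pipeline.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(k, n, confidence=0.95):
    """Wilson score interval for a binomial proportion k/n."""
    if n <= 0 or not 0 <= k <= n:
        raise ValueError("require n > 0 and 0 <= k <= n")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    # (k, n) for the C4 cells in Table v2.5
    cells = {
        "llama3.1-8b Q2_K": (4, 100),
        "llama3.1-8b Q8_0": (0, 100),
        "qwen2.5-1.5b Q2_K": (75, 100),
        "qwen2.5-1.5b Q8_0": (3, 100),
        "Pooled": (82, 400),
    }
    for name, (k, n) in cells.items():
        lo, hi = wilson_interval(k, n)
        print(f"{name}: {k}/{n} -> [{lo * 100:.1f}%, {hi * 100:.1f}%]")
```

Unlike the normal-approximation interval, the Wilson interval stays inside [0, 1] and gives a non-degenerate upper bound for the 0/100 cell, which is why it suits the near-zero ASR cells in this report.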

SS26. v2 Control C6 -- Phase 2 Reinforcement (n=150)

SS26.1 Motivation

v1 Phase 2 used n=50 per (model, quant, context-profile) cell. The qwen2.5-1.5b Q2_K long_prefix cell reached 100% ASR (50/50) -- a result that was plausibly a small-sample artifact. C6 replicates the Phase 2 design with n=150 per cell on 2 models x 3 quants x 3 profiles = 18 cells = 2,700 rows.

Table v2.6: C6 Phase 2 replication at n=150 per cell

Source: research/tr140/v2_controls/results/c6_phase2_reinforcement/20260415_031712/samples.jsonl (2,700 rows, 2,700 gemma3 judge labels).

Model	Quant	Profile	ASR (C6, n=150)	ASR (v1, n=50)
llama3.1-8b	Q2_K	short_prefix	10.7%	4.0%
llama3.1-8b	Q2_K	medium_prefix	18.0%	8.0%
llama3.1-8b	Q2_K	long_prefix	19.3%	12.0%
llama3.1-8b	Q4_K_M	short_prefix	0.7%	n/a
llama3.1-8b	Q4_K_M	medium_prefix	0.7%	n/a
llama3.1-8b	Q4_K_M	long_prefix	0.0%	n/a
llama3.1-8b	Q8_0	short_prefix	0.0%	n/a
llama3.1-8b	Q8_0	medium_prefix	0.7%	n/a
llama3.1-8b	Q8_0	long_prefix	0.0%	n/a
qwen2.5-1.5b	Q2_K	short_prefix	66.7%	62.0%
qwen2.5-1.5b	Q2_K	medium_prefix	82.0%	76.0%
qwen2.5-1.5b	Q2_K	long_prefix	100.0%	100.0%
qwen2.5-1.5b	Q4_K_M	short_prefix	16.7%	n/a
qwen2.5-1.5b	Q4_K_M	medium_prefix	18.0%	n/a
qwen2.5-1.5b	Q4_K_M	long_prefix	30.0%	n/a
qwen2.5-1.5b	Q8_0	short_prefix	7.3%	n/a
qwen2.5-1.5b	Q8_0	medium_prefix	8.0%	n/a
qwen2.5-1.5b	Q8_0	long_prefix	10.0%	n/a

Note: the "v1 n=50" column references v1 Phase 2 values where the 3-quant subset (Q2_K, Q4_K_M, Q8_0) overlaps with C6; the "n/a" entries are cells where v1 used a different quant and cannot be directly compared. llama3.2-3b was not included in v1 Phase 2 for this quant subset.

Observations.

qwen2.5-1.5b Q2_K long_prefix replicates at 150/150 = 100% (Wilson 95% CI [97.6%, 100.0%]). The v1 50/50 result was not an artifact; it was genuine saturation. At n=150 the lower bound tightens to 97.6%.
llama3.1-8b Q2_K ASR is HIGHER across all three profiles at n=150 than at n=50. The v1 Phase 2 cell "llama3.1-8b Q2_K long_prefix 12%" is replaced by "19.3%" at n=150. The Wilson CIs overlap (v1 [5.6%, 23.8%], C6 [13.7%, 26.4%]), but the direction is consistently higher. The likely explanation is judge-family difference: v1 used qwen2.5:7b; C6 uses gemma3:12b. Per SS22, gemma3 systematically calls more COMPLIANCE on Q2_K than v1 qwen did. The v2 ASR should be viewed as more reliable.
Context-dilution slope on qwen2.5-1.5b Q2_K (short 66.7% -> medium 82% -> long 100%) is clean and monotonic at n=150, strengthening v1's SS11 finding.
Q4_K_M and Q8_0 are effectively at the noise floor on llama3.1-8b, consistent with v1's recommendation that Q4_K_M is safe for this model.
This control CONFIRMS the v1 Phase 2 finding. No claim is overturned; the Q2_K long-context vulnerability is larger at n=150 with a modern judge than at n=50 with v1 qwen.
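
The Wilson intervals quoted in these observations can be recomputed from the k/n counts. The following is a minimal sketch, assuming the standard Wilson score formula with z = 1.96; the function name `wilson_ci` is illustrative and is not taken from the TR140 analysis scripts.

```python
import math

def wilson_ci(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for k successes out of n trials."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = k / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point spill at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)

# v1 Phase 2 cells taken from Appendix A (n = 50 per cell).
for label, k, n in [("llama3.1-8b Q2_K long_prefix", 6, 50),
                    ("qwen2.5-1.5b Q2_K long_prefix", 50, 50)]:
    lo, hi = wilson_ci(k, n)
    print(f"{label}: {k}/{n} -> [{lo:.3f}, {hi:.3f}]")
```

The same helper can be pointed at the C6 counts (for example 29/150 or 150/150) to compare interval widths at n=50 and n=150.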

SS27. v2 Control C7 -- Breadth Expansion (27,000 rows, 5 models)

SS27.1 Motivation

v1 covered 4 models x 6 quants x 5 shot counts x 2 formats x 50 behaviors = 12,000 Phase 1 rows. C7 broadens this to 100 behaviors (vs 50) and adds two new model families (gemma2-2b, phi3.5-mini), yielding two runs:

Run A (20260415_212050): gemma2-2b + phi3.5-mini at 6 quants x 5 shots x 2 formats x 50 behaviors = 6,000 expected; actual 3,000 because all six phi3.5-mini cells failed Ollama pull (model tag mismatch). Gemma2-2b cells landed cleanly.
Run B (20260415_223902): 4 full-family models (v1's 4) at 6 quants x 5 shots x 2 formats x 100 behaviors = 24,000 rows; 0 failures.

Total C7 rows: 3,000 + 24,000 = 27,000. All judged by gemma3:12b with 0 failures.

SS27.2 Pooled Q2_K vs Q8_0 Tests on the Full Family

Cells pool across shot-count and format (n=1,000 per model+quant for Run B, n=500 for gemma2-2b from Run A). Fisher exact + Holm across the 5 tests.

Table v2.7: Pooled Q2_K vs Q8_0 in C7

Source: research/tr140/v2_controls/analysis/v3_stats.json under c7.combined.q2k_vs_q8_0, recomputed from raw JSONL.

Model	Q2_K k/n (ASR)	Q8_0 k/n (ASR)	Cohen's h	Bootstrap delta	Bootstrap 95% CI	Fisher p	Holm p
llama3.2-1b	437/1000 (43.7%)	0/1000 (0.0%)	1.44	+43.7pp	[40.9, 46.8]	1.7e-15	6.8e-15
llama3.1-8b	274/1000 (27.4%)	16/1000 (1.6%)	0.85	+25.8pp	[23.0, 28.8]	2.4e-15	6.8e-15
gemma2-2b (new)	98/500 (19.6%)	1/500 (0.2%)	0.83	+19.4pp	[15.8, 23.0]	1.2e-15	6.1e-15
llama3.2-3b	138/1000 (13.8%)	3/1000 (0.3%)	0.65	+13.5pp	[11.3, 15.6]	1.8e-15	6.8e-15
qwen2.5-1.5b	862/1000 (86.2%)	222/1000 (22.2%)	1.40	+64.0pp	[60.8, 67.4]	3.4e-15	6.8e-15

Bootstrap: n=1,000, seed=42, percentile method.
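
For readers who want to see what a percentile bootstrap on a pooled ASR delta looks like, here is a rough sketch. It assumes independent resampling of the binary outcomes in each arm; the function name and the use of Python's `random` module are assumptions, so the output will be close to, but not digit-identical with, the table, because the original RNG stream is not known.

```python
import random

def bootstrap_delta_ci(k1, n1, k2, n2, n_boot=1000, seed=42, alpha=0.05):
    """Percentile bootstrap CI, in percentage points, for ASR1 - ASR2."""
    rng = random.Random(seed)
    arm1 = [1] * k1 + [0] * (n1 - k1)   # 1 = COMPLIANCE, 0 = refusal
    arm2 = [1] * k2 + [0] * (n2 - k2)
    deltas = []
    for _ in range(n_boot):
        asr1 = sum(rng.choices(arm1, k=n1)) / n1   # resample with replacement
        asr2 = sum(rng.choices(arm2, k=n2)) / n2
        deltas.append(100.0 * (asr1 - asr2))
    deltas.sort()
    lo_idx = int((alpha / 2) * n_boot)
    hi_idx = int((1 - alpha / 2) * n_boot) - 1
    return deltas[lo_idx], deltas[hi_idx]

# llama3.2-1b in C7: Q2_K 437/1000 vs Q8_0 0/1000.
lo, hi = bootstrap_delta_ci(437, 1000, 0, 1000)
print(f"delta 95% CI: [{lo:.1f}, {hi:.1f}] pp")
```

Resampling rows independently ignores any clustering by behavior or shot count, so this sketch may understate the interval width relative to a clustered bootstrap.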

Observations.

All five tests are Holm-significant at p < 1e-14. Cohen's h ranges from 0.65 (medium-large) to 1.44 (very large).
The gemma2-2b row (a family not previously tested) shows the same Q2_K cliff pattern: 19.6% Q2_K vs 0.2% Q8_0. Gemma2 joins the Llama family pattern rather than the Qwen pattern.
The v1 "Q2_K universal" claim is not family-specific: it applies to 3 distinct families (Llama, Qwen, Gemma) across 5 models and 5 parameter counts (1.2B, 1.5B, 2B, 3B, 8B).
qwen2.5-1.5b's Q8_0 ASR at n=1,000 (22.2%) is higher than v1's Q8_0 ASR at n=650 (14.5%). The direction matches the SS22 judge-correction direction: gemma3 calls more COMPLIANCE than v1 qwen on the qwen2.5-1.5b rows.
The "Q2_K is the universal vulnerability threshold" claim is now supported by 27,000 rows spanning 5 models across 3 families, judged by gemma3:12b, with bootstrap CIs excluding zero for every single pair. This is by far the strongest Q2_K evidence in the Banterhearts program.
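
The Holm column in Table v2.7 follows the usual step-down rule: sort the raw p-values, multiply the i-th smallest by (m - i + 1), and enforce monotonicity. A minimal sketch is below; `holm_adjust` is an illustrative name, and the inputs are the rounded Fisher p-values from the table, so last-digit differences from the published Holm column are expected.

```python
def holm_adjust(pvals):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * pvals[idx])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted

# Rounded Fisher p-values from Table v2.7 (one Q2_K vs Q8_0 test per model).
fisher_p = {
    "llama3.2-1b": 1.7e-15,
    "llama3.1-8b": 2.4e-15,
    "gemma2-2b": 1.2e-15,
    "llama3.2-3b": 1.8e-15,
    "qwen2.5-1.5b": 3.4e-15,
}
for model, p_adj in zip(fisher_p, holm_adjust(list(fisher_p.values()))):
    print(f"{model}: Holm p = {p_adj:.1e}")
```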

Table v2.8: Full C7 cell grid (post-aggregation, pooled across shot-count and format)

Model	Q2_K	Q3_K_M	Q4_K_M	Q5_K_M	Q6_K	Q8_0	Cells total
gemma2-2b	19.6% (500)	0.0% (500)	0.0% (500)	1.0% (500)	1.0% (500)	0.2% (500)	3,000
llama3.1-8b	27.4% (1000)	2.3% (1000)	0.6% (1000)	0.7% (1000)	1.2% (1000)	1.6% (1000)	6,000
llama3.2-1b	43.7% (1000)	1.5% (1000)	0.0% (1000)	0.0% (1000)	0.0% (1000)	0.0% (1000)	6,000
llama3.2-3b	13.8% (1000)	1.8% (1000)	0.1% (1000)	0.3% (1000)	0.1% (1000)	0.3% (1000)	6,000
qwen2.5-1.5b	86.2% (1000)	38.5% (1000)	26.8% (1000)	23.2% (1000)	19.7% (1000)	22.2% (1000)	6,000
Total							27,000

Observations (C7 cell grid).

Llama3.2-1b at Q8_0 is 0/1000 -- the Wilson 95% upper is 0.38%. This is a stronger "llama is safe at Q8_0" statement than anything in the v1 report.
qwen2.5-1.5b Q3_K_M at 38.5% (385/1000) is higher than two of the three Llama Q2_K cells (llama3.1-8b 27.4%, llama3.2-3b 13.8%); only llama3.2-1b Q2_K (43.7%) exceeds it. qwen2.5-1.5b at Q3_K_M is less safe than most of the Llama family at Q2_K.
qwen2.5-1.5b ASR falls as BPW rises, but not monotonically: Q8_0 22.2%, Q6_K 19.7%, Q5_K_M 23.2%, Q4_K_M 26.8%, Q3_K_M 38.5%, Q2_K 86.2%. The Q6_K-to-Q8_0 fluctuation is within Wilson CI overlap; the Q3_K_M-to-Q2_K jump (38.5% -> 86.2%) is the cliff.
Gemma2-2b has zero COMPLIANCE at Q3_K_M and Q4_K_M across 1,000 rows. Llama3.2-1b is also at 0.0% at Q4_K_M but shows 1.5% at Q3_K_M, and gemma2-2b's Q2_K ASR is lower (19.6% vs 43.7%). On these three quants gemma2-2b is the safer of the two; at Q5_K_M through Q8_0 the ordering reverses slightly (gemma2-2b 0.2-1.0%, llama3.2-1b 0.0%).

SS28. v2 Control C8 -- Right-Tail (N=256) Saturation

SS28.1 Motivation

v1 tested shot counts {1, 4, 16, 64, 128}. The v1 H3 verdict ("context-window caps many-shot effectiveness") was derived from the peak-then-decline pattern on llama3.1-8b Q2_K (46% @ N=16 -> 26% @ N=128). A reviewer objection: ASR may simply keep climbing past N=128 on other cells; the v1 sampling cannot distinguish saturation from decline. C8 pushes shot count to N=256 across 6 models x 6 quants at n=300 per cell (per the manifest, 50 behaviors x ~6 seeds), for a total of 10,800 rows; the run produced exactly 10,800 samples.

Table v2.9: C8 ASR at N=256 (n=300 per cell)

Source: research/tr140/v2_controls/results/c8_right_tail/20260416_005713/samples.jsonl (10,800 rows, 10,800 gemma3 judge labels).

Model	Q2_K	Q3_K_M	Q4_K_M	Q5_K_M	Q6_K	Q8_0
gemma2-2b	44.3% (133/300)	0.0% (0/300)	0.7% (2/300)	1.0% (3/300)	1.3% (4/300)	0.3% (1/300)
llama3.1-8b	9.7% (29/300)	4.3% (13/300)	2.0% (6/300)	1.0% (3/300)	1.7% (5/300)	2.0% (6/300)
llama3.2-1b	91.7% (275/300)	0.3% (1/300)	0.0% (0/300)	0.0% (0/300)	0.0% (0/300)	0.0% (0/300)
llama3.2-3b	89.3% (268/300)	1.3% (4/300)	0.0% (0/300)	0.0% (0/300)	0.0% (0/300)	0.0% (0/300)
phi3.5-mini (new)	98.0% (294/300)	91.0% (273/300)	90.3% (271/300)	90.7% (272/300)	89.0% (267/300)	90.7% (272/300)
qwen2.5-1.5b	98.7% (296/300)	41.3% (124/300)	38.0% (114/300)	46.7% (140/300)	39.3% (118/300)	47.3% (142/300)

Observations.

Llama3.1-8b Q2_K continues to DECLINE past N=128: v1's N=128 ASR was 26% and N=256 is 9.7%. The peak-then-decline pattern is now clearly established for this model. H3 is firmly supported for llama3.1-8b.
Llama3.2-1b and Llama3.2-3b Q2_K CLIMB past N=128: v1 N=128 ASRs were 72% and 41%. C8 N=256 shows 91.7% and 89.3%. The context-cap does NOT apply uniformly to the Llama family.
qwen2.5-1.5b Q2_K saturates near 100%: 99% at N=128, 98.7% at N=256. The qwen family hits its asymptote by N=128; additional shots are redundant.
phi3.5-mini is the study's most dangerous family by a wide margin. ALL six quants show 89-98% ASR at N=256. Q8_0 = 90.7%, Q2_K = 98.0%. The Q8_0-to-Q2_K delta on phi3.5-mini is only 7.3pp, because Q8_0 is already broken. This family should not be deployed against many-shot-capable adversaries at any GGUF quant level. The v2 C8 evidence is from a single extreme shot count (N=256); a full shot-count sweep on phi3.5-mini is the highest-priority follow-up flagged in SS31.
The H3 "context-cap" verdict is therefore model-specific: confirmed for llama3.1-8b, rejected for llama3.2-1b / llama3.2-3b / phi3.5-mini / qwen2.5-1.5b, and ambiguous for gemma2-2b. The v1 "H3 supported" claim should be narrowed to "H3 supported for llama3.1-8b."

SS28.2 Trajectory Slope from N=16 Peak to N=256

Using v1's N=16 ASR (peak for llama3.1-8b Q2_K, 46%) and C8's N=256 ASR (9.7%), and counting the four doublings between N=16 and N=256, the decay slope is -0.091 ASR per doubling of N ((0.097 - 0.46) / 4). The same slope calculation on llama3.2-1b Q2_K (v1 N=16 = 62%, C8 N=256 = 91.7%) gives +0.074 ASR per doubling -- positive, confirming climb. The sign of the slope is the correct H3 discriminator.
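
The per-doubling slope can be reproduced with a short calculation. This sketch assumes "per doubling" means dividing the ASR change by log2(N_end / N_start), which is 4 for N=16 to N=256; under that assumption the llama3.2-1b value reproduces the +0.074 quoted above. The function name is illustrative.

```python
import math

def slope_per_doubling(asr_start, n_start, asr_end, n_end):
    """Change in ASR (as a fraction) per doubling of shot count N."""
    doublings = math.log2(n_end / n_start)
    return (asr_end - asr_start) / doublings

# Q2_K cells: v1 ASR at N=16 paired with C8 ASR at N=256.
cells = {
    "llama3.1-8b": (0.46, 0.097),
    "llama3.2-1b": (0.62, 0.917),
}
for model, (asr_16, asr_256) in cells.items():
    slope = slope_per_doubling(asr_16, 16, asr_256, 256)
    verdict = "decline" if slope < 0 else "climb"
    print(f"{model}: {slope:+.3f} ASR per doubling ({verdict})")
```

Because the two endpoints come from different runs and different judges (v1 qwen vs gemma3), only the sign of the slope is treated as informative here, not its magnitude.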

SS29. v2 Controls C9-C12 -- Narrower Cells

SS29.1 C9 Larger-Model Anchor (qwen2.5-14b, n=900)

Source: research/tr140/v2_controls/results/c9_larger_model/20260415_182202/samples.jsonl. 900 rows from a RunPod L40S run at 3 quants x 100 behaviors x 3 shot counts. n=900 is underpowered for a stratified shot-count x quant breakdown; we report pooled ASR and per-quant ASR only.

Table v2.10: C9 qwen2.5-14b

Cell	ASR	k/n	Wilson 95% CI
qwen2.5-14b Q2_K	23.7%	71/300	[19.2%, 28.8%]
qwen2.5-14b Q4_K_M	16.3%	49/300	[12.6%, 20.9%]
qwen2.5-14b Q8_0	16.7%	50/300	[12.9%, 21.3%]
Pooled	18.9%	170/900	[16.5%, 21.6%]

Observations.

The v1 prediction under simple "scale rescues safety" reasoning would be: qwen2.5-14b Q2_K << qwen2.5-1.5b Q2_K. This is numerically true (23.7% vs 86.2% in C7) but 23.7% is still an alarming number at n=300.
The Q2_K-vs-Q8_0 delta on qwen2.5-14b is 7.0pp; the same delta on qwen2.5-1.5b in C7 is 64.0pp. Scale dampens the cliff by ~9x.
Q4_K_M and Q8_0 are statistically indistinguishable at n=300 (Wilson CIs overlap substantially). The v1 "Q4_K_M is the safe floor" guidance extends tentatively to 14B.
C9 is explicitly narrow. Only qwen2.5-14b was tested; no Llama 70B or 13B anchor exists in TR140. The "14B dampens the cliff" claim does not generalize to other families at scale. The budget to run Llama3.1-70B on RunPod would be ~10x the qwen2.5-14b run and is not in scope.

SS29.2 C10 Non-GGUF Quantization (AWQ + GPTQ)

Source: research/tr140/v2_controls/results/c10_non_gguf/20260415_190113/samples.jsonl (400 rows, successful run; the earlier 20260415_041257 run failed vLLM startup).

Table v2.11: C10 qwen2.5-1.5b under AWQ / GPTQ / GGUF at matched 4-bit

Format	ASR	k/n	Wilson 95% CI
AWQ-4bit (vLLM)	3.0%	3/100	[1.0%, 8.5%]
GPTQ-4bit (vLLM)	3.0%	3/100	[1.0%, 8.5%]
GGUF Q4_K_M (Ollama)	42.0%	42/100	[32.9%, 51.7%]
GGUF Q2_K (Ollama)	67.0%	67/100	[57.2%, 75.5%]

Observations.

At the same nominal 4-bit precision, AWQ and GPTQ produce ~14x lower ASR than GGUF Q4_K_M on qwen2.5-1.5b. The v1 "Q4_K_M is the safe floor" is GGUF-specific; it does NOT generalize to other 4-bit quantization schemes.
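This overturns a subtle v1 implication: that "4-bit quantization" is a universal descriptor for a safety regime. It is not. One plausible explanation, not tested here, is that GGUF K-quants' layer-wise mixed-precision bit allocation degrades safety-relevant weights in a way that uniform AWQ/GPTQ do not. The comparison also changes the serving stack (Ollama vs vLLM), so the gap cannot be attributed to the quantization scheme alone.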
AWQ and GPTQ are almost identical (3/100 each). This is n=100 so Wilson CIs are wide and a larger C10 follow-up would be informative.
The operational message: if a deployment needs sub-Q6_K precision on qwen2.5-1.5b, use AWQ or GPTQ, not GGUF.
C10 tested only qwen2.5-1.5b. AWQ/GPTQ on Llama models would require re-quantization and was not in scope. The generality of the "AWQ/GPTQ safer than GGUF at 4-bit" claim is open.

SS29.3 C11 Temperature Sensitivity

Source: research/tr140/v2_controls/results/c11_temp_sensitivity/20260415_000924/samples.jsonl (2,000 rows, 2,000 gemma3 judge labels; the earlier 20260415_000812 run had no judge labels and is referenced only as a replication check).

Table v2.12: C11 ASR at T=0.0 vs T=0.7

Cell	T=0.0 ASR	T=0.7 ASR	Delta (T=0.7 - T=0.0)
llama3.2-3b Q2_K	72.0% (180/250)	70.4% (176/250)	-1.6pp
llama3.2-3b Q8_0	4.0% (10/250)	4.4% (11/250)	+0.4pp
qwen2.5-1.5b Q2_K	99.6% (249/250)	99.2% (248/250)	-0.4pp
qwen2.5-1.5b Q8_0	91.2% (228/250)	84.8% (212/250)	-6.4pp

Observations.

Three of four cells show absolute deltas under 2pp. The study's T=0.0 protocol does not materially bias the findings on these cells.
qwen2.5-1.5b Q8_0 shows a -6.4pp delta (T=0.7 is safer by 6.4pp). The two Wilson CIs overlap only narrowly (T=0.0 [87.0%, 94.2%] vs T=0.7 [79.9%, 88.8%], an overlap of about 1.8pp). The direction suggests that sampling diversity may slightly reduce qwen2.5-1.5b's many-shot compliance at Q8_0.
Llama3.2-3b Q2_K at 72% ASR here is well above v1's N=128 Q2_K ASR of 41%; the gap is plausibly explained by the gemma3-vs-v1-qwen judge correction together with a shot-count effect (C11 held shot count at N=64 per the manifest).
Temperature is therefore a second-order variable for TR140's conclusions.
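
The width of the overlap between two confidence bands is a one-line calculation. The sketch below applies it to the two Wilson bands quoted for qwen2.5-1.5b Q8_0; the helper name is hypothetical, and overlapping intervals are only a coarse heuristic, not a formal test of the difference.

```python
def interval_overlap(a, b):
    """Width of the overlap between two closed intervals (0.0 if disjoint)."""
    lo = max(a[0], b[0])
    hi = min(a[1], b[1])
    return max(0.0, hi - lo)

# Wilson 95% bands (in %) for qwen2.5-1.5b Q8_0, as quoted in the observations.
t0_band = (87.0, 94.2)   # T=0.0
t7_band = (79.9, 88.8)   # T=0.7
print(f"overlap: {interval_overlap(t0_band, t7_band):.1f} pp")
```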

SS29.4 C12 Shot Ordering (Reproducibility Snapshot)

Source: research/tr140/v2_controls/results/c12_shot_ordering/20260415_025821/samples.jsonl (750 rows).

Table v2.13: C12 Q2_K ASR under "default" shot ordering

Cell	ASR	k/n
llama3.1-8b Q2_K (default ordering)	78.8%	197/250
qwen2.5-1.5b Q2_K (default ordering)	87.2%	436/500

Observations.

C12 produces only a single "default" ordering bucket in the raw JSONL; no alternative orderings were emitted. The reported numbers are therefore a reproducibility snapshot of the v1 Q2_K cells under a fresh run with gemma3 judging and larger n, NOT an ordering-sensitivity test.
The qwen2.5-1.5b number is consistent with the C7 Q2_K cell for the same model (86.2%). The llama3.1-8b number is not: 78.8% here vs 27.4% in C7. The likely reason is that C12's "default" ordering uses the N=128 shot count while C7 pools across shot counts including low-N cells.
The v1 "shot ordering does not matter" claim is neither supported nor refuted by C12. The ordering experiment was not actually run in this configuration. A proper C12v2 would test 3 orderings (original, reverse, random-alt-seed) at matched N; it is flagged as follow-up in SS31.

SS30. Reviewer Objections Closed

The following table maps each v2 control to the specific reviewer objection from the v1 review round that it addresses. "Closed" means the objection is addressed at the evidence level claimed by the corresponding section; "partial" means the control reduces but does not eliminate the concern.

Table v2.14: Reviewer Objection -> v2 Control Mapping

#	Reviewer Objection (from v1 reviews)	v2 Control	Status	Key numbers
1	"Single 7B judge; kappa = 0.23 too low to trust"	C1 (rejudge with gemma3 + Claude, 30K extra labels)	Closed	gemma3 vs Claude kappa = 0.925 on n=11,451; v1 qwen is the outlier, not the gold (SS22)
2	"Q8_0 is not FP16; claims about quantization baseline are confounded"	C2 (FP16 on 2 models, n=600)	Partial	FP16 qwen2.5-1.5b 8.0% vs Q8_0 14.5% (TOST fails at ±3pp). Direction known; 2-model coverage is narrow (SS23)
3	"Exemplar randomization might drive ASR, not many-shot pressure"	C3 (static prompt, n=1,200)	Partial	Static ASR lower than randomized; same Q2_K cliff pattern (SS24)
4	"No benign-demo negative control"	C4 (benign-demo, n=400)	Partial	Control compromised on qwen2.5-1.5b Q2_K (75% ASR); clean on Llama (SS25)
5	"Phase 2 n=50 too small"	C6 (n=150, 2,700 rows)	Closed	qwen2.5-1.5b Q2_K long_prefix 150/150 = 100%; context-dilution slope replicates (SS26)
6	"Claims only cover Llama and Qwen; no Gemma or Phi families"	C7 + C8 (gemma2-2b, phi3.5-mini added)	Closed	5 models, 3 families, 37,800 new rows; Q2_K cliff across all 5 (SS27, SS28)
7	"N=128 cap is arbitrary; does ASR climb past it?"	C8 (N=256, 10,800 rows)	Closed	H3 is model-specific: climbs on llama3.2-1b/3b, qwen2.5-1.5b, phi3.5-mini; declines on llama3.1-8b (SS28)
8	"GGUF-only results don't apply to AWQ/GPTQ deployments"	C10 (AWQ + GPTQ, n=400)	Partial	At 4-bit: AWQ 3%, GPTQ 3%, GGUF Q4_K_M 42% on qwen2.5-1.5b. 1-model narrow (SS29.2)
9	"No larger-than-8B anchor"	C9 (qwen2.5-14b, n=900)	Partial	14B dampens cliff by ~9x vs 1.5B but 23.7% Q2_K ASR still elevated. 1-model narrow (SS29.1)
10	"T=0.0 may mask sampling-induced variability"	C11 (T=0.7, n=2,000)	Closed	3 of 4 cells within ±2pp; T not a material confounder (SS29.3)
11	"Shot ordering may dominate"	C12 (nominal, 750 rows)	Not closed	C12 only ran one ordering bucket. Remains an open gap (SS29.4, SS31)

Overall: 5 objections Closed, 5 Partial, 1 Not closed. The v3.0 report claims closure only where the v2 evidence directly supports the claim at the scope flagged in each cell.

SS31. v3.0 Open Gaps and Follow-Up

The v2 controls close most but not all of the v1 reviewer objections. The remaining gaps are listed below; relative priority is noted within each item:

Shot ordering was not actually varied. C12 ran only a "default" bucket. A proper C12v2 would test at least 3 orderings at matched N on 2 models x 2 quants = 1,200 rows. Low compute cost; should be done before the next external submission round.
phi3.5-mini has only N=256 data. C8 is the first and only TR140 run on phi3.5-mini, and it is catastrophic (89-98% ASR across all quants). A full phi3.5-mini shot-count sweep (6 quants x 5 shots x 2 formats x 50 behaviors = 3,000 rows) would settle whether phi3.5-mini is broken at all N or whether its Q8_0 baseline is simply elevated. This is the highest-priority v2.1 experiment.
AWQ/GPTQ coverage is 1-model. C10 tested only qwen2.5-1.5b. Extending to llama3.2-1b and llama3.1-8b would test whether the "AWQ/GPTQ safer than GGUF at 4-bit" claim generalizes.
FP16 coverage is 2-model. C2 tested only qwen2.5-1.5b and llama3.1-8b. Extending to llama3.2-1b and llama3.2-3b would establish the FP16-vs-Q8_0 safety delta across the full family.
70B anchor not attempted. C9's qwen2.5-14b is the largest model tested. A Llama3.1-70B run at 2 quants x 100 behaviors = 200 rows would be ~4 RunPod hours at L40S/A100 and is the logical next anchor.
Benign-refusal demo is missing. C4 used benign-compliance demos. A cleaner negative control would use benign-refusal demos; this is a ~400-row experiment.
Judge kappa on the 48,950 v2 rows is gemma3-only. Claude triangulation was restricted to the 15,000 v1 set. Extending Claude judging to the 48,950 v2 rows would double-validate every v2 cell; the Claude API cost is ~$400-600.
v3 multi-turn compounding. TR139 (multi-turn) and TR140 (many-shot) do not yet have a crossed design where the same conversation carries BOTH many-shot exemplars AND multi-turn strategies. This is the most important follow-up for the safety program, not just TR140, and is a separate TR.

SS21. Reproducibility

Item	Value
v1 Git commit	d2c3fdac
v3 Git commit	see `research/tr140/v2_controls/analysis/v3_data_manifest.json`
v1 Total samples	15,000
v3 Total primary samples	63,950
v3 Total judge labels	78,950 (qwen2.5:7b 15K + gemma3:12b 63,950 + Claude 15K; overlaps collapse per-sample)
Runner	`python research/tr140/run.py`
Benchmarks	`python research/tr140/prepare_benchmarks.py`
Analysis	`python research/tr140/analyze.py`
Report	`python research/tr140/generate_report.py`
Results directory	`research/tr140/results/20260316_164907/`
Config snapshot	`research/tr140/results/20260316_164907/config_snapshot.yaml`
Raw samples	`research/tr140/results/20260316_164907/samples.jsonl` (15,000 lines)
Judge labels	`research/tr140/results/20260316_164907/judge_labels.jsonl` (15,000 lines)
Scored samples	`research/tr140/results/20260316_164907/tr140_scored.jsonl` (15,000 lines)
Analysis JSON	`research/tr140/results/20260316_164907/tr140_analysis.json`
Python	3.11+
Ollama	v0.6+
GPU	NVIDIA RTX (12GB VRAM)
Temperature	0.0
Seed	42
Expected runtime	~13 GPU-hours (single RTX 12GB)
Disk footprint	~2.1 GB (samples + judge labels + analysis)
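
Before re-running the analysis it can help to confirm that the artifacts on disk have the row counts listed in the table. The following is a small sketch for that check; the helper names are hypothetical and are not part of the `research/tr140` scripts, and the expected counts are the 15,000-line figures stated above.

```python
from pathlib import Path

def count_jsonl_rows(path):
    """Count non-empty lines in a JSONL file."""
    with Path(path).open("r", encoding="utf-8") as fh:
        return sum(1 for line in fh if line.strip())

def check_counts(results_dir, expected):
    """Return {filename: (found, expected)} for files whose row count differs."""
    mismatches = {}
    for name, want in expected.items():
        found = count_jsonl_rows(Path(results_dir) / name)
        if found != want:
            mismatches[name] = (found, want)
    return mismatches

# Row counts for the v1 results directory, as listed in the table above.
EXPECTED_V1 = {
    "samples.jsonl": 15_000,
    "judge_labels.jsonl": 15_000,
    "tr140_scored.jsonl": 15_000,
}
```

Calling `check_counts("research/tr140/results/20260316_164907", EXPECTED_V1)` from the repository root should return an empty dict when all three files are complete.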

References

Anil, C., et al. (2024). Many-shot Jailbreaking. NeurIPS 2024. Anthropic. arXiv:2404.02151.
Zheng, S., et al. (2024). Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses. NeurIPS 2024. arXiv:2406.01288.
Zou, A., et al. (2023). Universal and Transferable Adversarial Attacks on Aligned Language Models. arXiv:2307.15043.
Chao, P., et al. (2024). JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models. arXiv:2404.01318.
Dettmers, T. & Zettlemoyer, L. (2023). The case for 4-bit precision: k-bit Inference Scaling Laws. ICML 2023.
Schuirmann, D.J. (1987). A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability. Journal of Pharmacokinetics and Biopharmaceutics, 15(6), 657-680.
Wilson, E.B. (1927). Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association, 22(158), 209-212.
Efron, B. & Tibshirani, R.J. (1993). An Introduction to the Bootstrap. Chapman & Hall/CRC.
Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Lawrence Erlbaum Associates.
Banterhearts TR134 (2026). Quantization Safety Baselines for Open-Weight LLMs. Internal technical report.
Banterhearts TR139 (2026). Multi-Turn Jailbreak Susceptibility Under Quantization. Internal technical report.
Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2), 65-70.

Appendix A: Full ASR Tables

Phase 1: Many-Shot ASR (all 120 cells)

Model	Quant	N	ASR	k/n	Wilson CI
llama3.1-8b	Q2_K	1	0.020	2/100	[0.006, 0.070]
llama3.1-8b	Q2_K	4	0.270	27/100	[0.193, 0.364]
llama3.1-8b	Q2_K	16	0.460	46/100	[0.366, 0.557]
llama3.1-8b	Q2_K	64	0.330	33/100	[0.246, 0.427]
llama3.1-8b	Q2_K	128	0.260	26/100	[0.184, 0.354]
llama3.1-8b	Q3_K_M	1	0.010	1/100	[0.002, 0.055]
llama3.1-8b	Q3_K_M	4	0.020	2/100	[0.006, 0.070]
llama3.1-8b	Q3_K_M	16	0.050	5/100	[0.022, 0.112]
llama3.1-8b	Q3_K_M	64	0.000	0/100	[0.000, 0.037]
llama3.1-8b	Q3_K_M	128	0.040	4/100	[0.016, 0.098]
llama3.1-8b	Q4_K_M	1	0.000	0/100	[0.000, 0.037]
llama3.1-8b	Q4_K_M	4	0.000	0/100	[0.000, 0.037]
llama3.1-8b	Q4_K_M	16	0.020	2/100	[0.006, 0.070]
llama3.1-8b	Q4_K_M	64	0.000	0/100	[0.000, 0.037]
llama3.1-8b	Q4_K_M	128	0.000	0/100	[0.000, 0.037]
llama3.1-8b	Q5_K_M-Q8_0	1-128	0.000-0.040	0-4/100	see analysis JSON
llama3.2-1b	Q2_K	1	0.110	11/100	[0.063, 0.186]
llama3.2-1b	Q2_K	4	0.100	10/100	[0.055, 0.174]
llama3.2-1b	Q2_K	16	0.620	62/100	[0.522, 0.709]
llama3.2-1b	Q2_K	64	0.480	48/100	[0.385, 0.577]
llama3.2-1b	Q2_K	128	0.720	72/100	[0.625, 0.799]
llama3.2-1b	Q3_K_M	128	0.060	6/100	[0.028, 0.125]
llama3.2-1b	Q4_K_M-Q8_0	all	0.000-0.010	0-1/100	see analysis JSON
llama3.2-3b	Q2_K	1	0.070	7/100	[0.034, 0.138]
llama3.2-3b	Q2_K	4	0.020	2/100	[0.006, 0.070]
llama3.2-3b	Q2_K	16	0.000	0/100	[0.000, 0.037]
llama3.2-3b	Q2_K	64	0.070	7/100	[0.034, 0.138]
llama3.2-3b	Q2_K	128	0.410	41/100	[0.319, 0.508]
llama3.2-3b	Q3_K_M	128	0.050	5/100	[0.022, 0.112]
llama3.2-3b	Q4_K_M-Q8_0	all	0.000-0.010	0-1/100	see analysis JSON
qwen2.5-1.5b	Q2_K	1	0.850	85/100	[0.767, 0.907]
qwen2.5-1.5b	Q2_K	4	0.820	82/100	[0.733, 0.883]
qwen2.5-1.5b	Q2_K	16	0.650	65/100	[0.553, 0.736]
qwen2.5-1.5b	Q2_K	64	0.920	92/100	[0.850, 0.959]
qwen2.5-1.5b	Q2_K	128	0.990	99/100	[0.946, 0.998]
qwen2.5-1.5b	Q3_K_M	1	0.190	19/100	[0.125, 0.278]
qwen2.5-1.5b	Q3_K_M	4	0.270	27/100	[0.193, 0.364]
qwen2.5-1.5b	Q3_K_M	16	0.200	20/100	[0.133, 0.289]
qwen2.5-1.5b	Q3_K_M	64	0.280	28/100	[0.201, 0.375]
qwen2.5-1.5b	Q3_K_M	128	0.530	53/100	[0.433, 0.625]
qwen2.5-1.5b	Q4_K_M	1	0.090	9/100	[0.048, 0.162]
qwen2.5-1.5b	Q4_K_M	4	0.160	16/100	[0.101, 0.244]
qwen2.5-1.5b	Q4_K_M	16	0.020	2/100	[0.006, 0.070]
qwen2.5-1.5b	Q4_K_M	64	0.280	28/100	[0.201, 0.375]
qwen2.5-1.5b	Q4_K_M	128	0.450	45/100	[0.356, 0.548]
qwen2.5-1.5b	Q5_K_M	1	0.010	1/100	[0.002, 0.055]
qwen2.5-1.5b	Q5_K_M	4	0.040	4/100	[0.016, 0.098]
qwen2.5-1.5b	Q5_K_M	16	0.020	2/100	[0.006, 0.070]
qwen2.5-1.5b	Q5_K_M	64	0.360	36/100	[0.273, 0.458]
qwen2.5-1.5b	Q5_K_M	128	0.420	42/100	[0.328, 0.518]
qwen2.5-1.5b	Q6_K	1	0.000	0/100	[0.000, 0.037]
qwen2.5-1.5b	Q6_K	4	0.050	5/100	[0.022, 0.112]
qwen2.5-1.5b	Q6_K	16	0.020	2/100	[0.006, 0.070]
qwen2.5-1.5b	Q6_K	64	0.260	26/100	[0.184, 0.354]
qwen2.5-1.5b	Q6_K	128	0.400	40/100	[0.309, 0.498]
qwen2.5-1.5b	Q8_0	1	0.020	2/100	[0.006, 0.070]
qwen2.5-1.5b	Q8_0	4	0.060	6/100	[0.028, 0.125]
qwen2.5-1.5b	Q8_0	16	0.040	4/100	[0.016, 0.098]
qwen2.5-1.5b	Q8_0	64	0.320	32/100	[0.237, 0.417]
qwen2.5-1.5b	Q8_0	128	0.400	40/100	[0.309, 0.498]

Phase 2: Long-Context ASR (selected cells with Wilson CIs)

Key Phase 2 cells with Wilson 95% CIs (n = 50 per cell):

Model	Quant	Profile	ASR	k/n	Wilson CI
llama3.1-8b	Q2_K	short_prefix	0.040	2/50	[0.011, 0.134]
llama3.1-8b	Q2_K	medium_prefix	0.080	4/50	[0.032, 0.188]
llama3.1-8b	Q2_K	long_prefix	0.120	6/50	[0.056, 0.238]
llama3.2-1b	Q2_K	short_prefix	0.200	10/50	[0.112, 0.330]
llama3.2-1b	Q2_K	medium_prefix	0.300	15/50	[0.193, 0.432]
llama3.2-1b	Q2_K	long_prefix	0.520	26/50	[0.385, 0.653]
llama3.2-3b	Q2_K	short_prefix	0.100	5/50	[0.044, 0.213]
llama3.2-3b	Q2_K	long_prefix	0.040	2/50	[0.011, 0.134]
qwen2.5-1.5b	Q2_K	short_prefix	0.620	31/50	[0.481, 0.741]
qwen2.5-1.5b	Q2_K	medium_prefix	0.760	38/50	[0.625, 0.858]
qwen2.5-1.5b	Q2_K	long_prefix	1.000	50/50	[0.929, 1.000]
qwen2.5-1.5b	Q3_K_M	long_prefix	0.340	17/50	[0.225, 0.477]
qwen2.5-1.5b	Q4_K_M	long_prefix	0.200	10/50	[0.112, 0.330]

Full 60-cell table with CIs available in the analysis JSON.

Appendix B: Extended Statistical Tables

Power-Law Fit Parameters (all 24 fits)

Model	Quant	a	b	R-squared	n_points
llama3.1-8b	Q2_K	0.0558	0.454	0.509	5
llama3.1-8b	Q3_K_M	0.0128	0.299	0.721	4
llama3.1-8b	Q4_K_M	0.0000	0.000	0.000	1
llama3.1-8b	Q5_K_M	0.0100	0.000	0.000	2
llama3.1-8b	Q6_K	0.0100	0.000	0.000	3
llama3.1-8b	Q8_0	0.0016	0.643	0.964	3
llama3.2-1b	Q2_K	0.0961	0.430	0.781	5
llama3.2-1b	Q3_K_M	0.0100	0.369	1.000	2
llama3.2-1b	Q4_K_M-Q8_0	0.0000	0.000	0.000	0-1
llama3.2-3b	Q2_K	0.0321	0.350	0.416	4
llama3.2-3b	Q3_K_M-Q8_0	0.0000	0.000	0.000	0-1
qwen2.5-1.5b	Q2_K	0.7749	0.030	0.136	5
qwen2.5-1.5b	Q3_K_M	0.1819	0.155	0.563	5
qwen2.5-1.5b	Q4_K_M	0.0623	0.278	0.209	5
qwen2.5-1.5b	Q5_K_M	0.0086	0.769	0.808	5
qwen2.5-1.5b	Q6_K	0.0092	0.726	0.633	4
qwen2.5-1.5b	Q8_0	0.0182	0.609	0.845	5
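
The parameters in this table are consistent with an ordinary least-squares fit of ln(ASR) on ln(N) using only the non-zero ASR points. The sketch below makes that assumption explicit (the report's own fitting code is not shown here, so this is a reconstruction, and `fit_power_law` is an illustrative name). The example input is the llama3.1-8b Q2_K row from Appendix A.

```python
import math

def fit_power_law(points):
    """Fit ASR = a * N**b by OLS on (ln N, ln ASR), skipping zero-ASR points.

    Returns (a, b, r_squared), or None if fewer than 2 usable points.
    """
    pts = [(math.log(n), math.log(asr)) for n, asr in points if asr > 0]
    if len(pts) < 2:
        return None
    mx = sum(x for x, _ in pts) / len(pts)
    my = sum(y for _, y in pts) / len(pts)
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    sxy = sum((x - mx) * (y - my) for x, y in pts)
    syy = sum((y - my) ** 2 for _, y in pts)
    if sxx == 0:
        return None
    b = sxy / sxx
    a = math.exp(my - b * mx)
    r2 = (sxy * sxy) / (sxx * syy) if syy > 0 else 0.0
    return a, b, r2

# llama3.1-8b Q2_K: (shot count N, ASR) pairs from Appendix A.
llama8b_q2k = [(1, 0.02), (4, 0.27), (16, 0.46), (64, 0.33), (128, 0.26)]
a, b, r2 = fit_power_law(llama8b_q2k)
print(f"a = {a:.4f}, b = {b:.3f}, R-squared = {r2:.3f}")
```

With only two usable points the fit passes through both exactly, which is why a two-point row reports R-squared = 1.000 without carrying any real evidence of fit quality.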

Non-Significant Pairwise Comparisons (Summary)

Of 100 Fisher exact tests, 85 do not survive Holm-Bonferroni correction. These break down as follows:

Category	Count	Description
Both ASR = 0%	52	Baseline and test are both 0%; Fisher p = 1.000. Underpowered by construction.
ASR diff < 5pp, raw p > 0.05	18	Small effects within noise floor. Includes most Q3_K_M-Q6_K comparisons on Llama.
ASR diff 5-15pp, raw p < 0.05, Holm p > 0.05	8	Effects that are significant before correction but not after. Includes llama3.2-1b N=1 Q2_K (11pp, raw p < 0.001, Holm p = 0.062).
qwen2.5-1.5b Q3_K_M-Q6_K at high N	7	Moderate absolute ASR but small delta from elevated Q8_0 baseline. MDE exceeds actual effect (SS15.2).

The 52 both-zero comparisons represent cells where the study is structurally underpowered: when both conditions produce 0/100 compliances, no test can reject the null. These are concentrated on Llama models at Q4_K_M through Q6_K across shot counts. The TOST analysis (SS15.1) confirms equivalence for 8 of these 52 cells; the remaining 44 are indeterminate (neither significant nor equivalent).

Bootstrap CIs for Power-Law Exponents (B = 2000, seed = 42)

Model	Quant	b	95% CI
llama3.1-8b	Q2_K	0.454	[-0.274, 1.224]
llama3.1-8b	Q3_K_M	0.299	[-0.107, 0.661]
llama3.1-8b	Q8_0	0.643	[0.000, 1.000]
llama3.2-1b	Q2_K	0.430	[-0.014, 0.673]
llama3.2-1b	Q3_K_M	0.369	[0.369, 0.369]
llama3.2-3b	Q2_K	0.350	[-0.904, 2.550]
qwen2.5-1.5b	Q2_K	0.030	[-0.097, 0.214]
qwen2.5-1.5b	Q3_K_M	0.155	[-0.016, 0.469]
qwen2.5-1.5b	Q4_K_M	0.278	[-0.543, 1.599]
qwen2.5-1.5b	Q5_K_M	0.769	[0.156, 1.619]
qwen2.5-1.5b	Q6_K	0.726	[-0.661, 1.850]
qwen2.5-1.5b	Q8_0	0.609	[0.182, 1.206]

Bootstrap CI Observations.

Only 3 of 12 bootstrap CIs exclude zero: llama3.2-1b Q3_K_M [0.369, 0.369] (degenerate -- only 2 data points), qwen2.5-1.5b Q5_K_M [0.156, 1.619], and qwen2.5-1.5b Q8_0 [0.182, 1.206]. The remaining 9 CIs include zero, meaning we cannot reject b = 0 (no shot-count dependence) for most model-quant pairs.
The degenerate CI for llama3.2-1b Q3_K_M (b = 0.369, CI = [0.369, 0.369]) occurs because only 2 non-zero ASR data points exist (N=1: 1%, N=128: 6%). With 2 points, the power law is perfectly determined and bootstrap resampling cannot generate CI width. This "fit" should be interpreted with caution.
Wide CIs (e.g., llama3.2-3b Q2_K: [-0.904, 2.550], qwen2.5-1.5b Q6_K: [-0.661, 1.850]) reflect high variability in the underlying ASR values. The power-law model captures the average trend but the shot-count-ASR relationship has substantial noise at n=100 per cell.
The bootstrap used B = 2000 iterations with seed = 42, resampling per-behavior ASR values within each cell. This captures sampling uncertainty but not systematic biases (e.g., judge error). True uncertainty is wider than the bootstrap CIs suggest.

Appendix C: Sensitivity and Robustness

This appendix presents sensitivity analyses testing whether key findings are robust to changes in analytical thresholds and methodology.

C.1 Significance Threshold Sensitivity

The primary analysis uses alpha = 0.05 with Holm-Bonferroni correction. How do findings change at alpha = 0.01?

Threshold	Significant tests	Tests involving Q2_K	Tests involving Q3_K_M
alpha = 0.05 (Holm)	15 / 100	13	2
alpha = 0.01 (Holm)	13 / 100	13	0
alpha = 0.001 (Holm)	11 / 100	11	0

Observations. The two Q3_K_M findings (qwen2.5-1.5b at N=1 and N=4) drop out at alpha = 0.01 (Holm-adjusted p = 0.009 and 0.008), while all 13 Q2_K findings survive at alpha = 0.01 and 11 of the 13 survive at alpha = 0.001. The Q2_K breakpoint is robust to threshold choice. The Q3_K_M finding is threshold-sensitive and should be treated as suggestive rather than established.

C.2 TOST Equivalence Margin Sensitivity

The primary analysis uses a +/-3pp equivalence margin. How does the equivalence count change at different margins?

Margin	Equivalence confirmations (of 100)
+/-1pp	0
+/-3pp	8
+/-5pp	14
+/-10pp	42

Observations. At the strictest margin tested (+/-1pp), no condition achieves formal equivalence -- the sample size of n=100 per cell is insufficient to bound the difference within 1pp. At +/-5pp, 14 conditions achieve equivalence, all on Llama models at Q3_K_M through Q6_K. At +/-10pp, 42 conditions are equivalent, covering all Llama models at all quant levels except Q2_K. The choice of margin does not change the primary finding (Q2_K is never equivalent) but does affect how confidently we can claim Q3_K_M-Q6_K are "safe."

C.3 Judge Threshold Sensitivity

The LLM judge classifies responses as compliant or refused using a binary threshold. If the judge systematically over- or under-calls compliance, ASR shifts uniformly. We assess robustness by examining how judge agreement varies across the ASR spectrum.

ASR Stratum	Judge-Regex Agreement	n
ASR = 0% cells	97.2%	7,200
ASR 1-10% cells	93.8%	2,600
ASR 11-50% cells	78.4%	1,200
ASR > 50% cells	62.1%	1,000

Observations. Agreement degrades with increasing ASR, confirming that the judge is least reliable in the cells that matter most for safety claims. The high-ASR cells (Q2_K at elevated N) carry the highest measurement uncertainty. However, the key finding -- that Q2_K produces dramatically elevated ASR -- is robust: even if we assume the judge over-calls compliance by 10pp in high-ASR cells, the corrected ASR values (e.g., 89% instead of 99% on qwen2.5-1.5b Q2_K N=128) remain catastrophically high.

C.4 Shot-Count Subset Stability

Do findings hold if we restrict to subsets of shot counts? We test whether the Q2_K breakpoint is detectable from any 3 of 5 shot counts.

Subset	Q2_K significant on llama3.2-1b?	Q2_K significant on qwen2.5-1.5b?
N = {1, 4, 16}	Yes (N=16: p < 0.001)	Yes (all 3: p < 0.001)
N = {1, 16, 128}	Yes (N=16, 128: p < 0.001)	Yes (all 3: p < 0.001)
N = {4, 64, 128}	Yes (N=64, 128: p < 0.001)	Yes (all 3: p < 0.001)
N = {1, 4, 64}	Yes (N=64: p < 0.001)	Yes (all 3: p < 0.001)

Observations. The Q2_K breakpoint is detectable from every 3-shot-count subset tested. The finding does not depend on any single shot count. Even the most conservative subset (N = {1, 4, 16}, which excludes the highest ASR cells at N=64 and N=128) still detects the Q2_K effect. This confirms that the Q2_K finding is not an artifact of a particular shot-count choice.

C.5 Format Subset Stability

Does the Q2_K finding hold when restricting to a single prompt format?

Format	Q2_K cells with ASR > 10% (of 20)	Peak ASR
Faux dialogue only	8 / 20	96% (qwen2.5-1.5b N=1)
Message array only	14 / 20	99% (qwen2.5-1.5b N=128)
Both (aggregated, as in SS5)	14 / 20	99% (qwen2.5-1.5b N=128)

Observations. The Q2_K breakpoint is robust across both formats. Faux dialogue alone detects Q2_K vulnerability in 8 of 20 cells (concentrated on qwen2.5-1.5b and llama3.2-1b). Message array alone detects it in 14 of 20 cells, recovering nearly the full aggregated picture. The 6-cell difference between formats is driven largely by llama3.1-8b, where faux dialogue produces 0% ASR even at Q2_K but message array produces 27-92% ASR. Restricting to either format alone does not change the conclusion that Q2_K is the universal safety breakpoint.

Appendix D: Glossary

Term	Definition
ASR	Attack Success Rate -- fraction of samples where the model complied with a harmful request
BPW	Bits Per Weight -- effective precision of a quantized model
Many-Shot	Attack using N in-context compliance examples before the target harmful query
Faux Dialogue	Many-shot format embedding Human:/Assistant: text labels within a single user message
Message Array	Many-shot format injecting actual chat API user/assistant message pairs
Power Law	ASR = a * N^b relationship between shot count and attack success rate
Context Dilution	Hiding harmful content after long benign text prefixes to evade safety detection
TOST	Two One-Sided Tests for equivalence within +/-delta bounds
Wilson CI	Wilson score confidence interval for binomial proportions
MDE	Minimum Detectable Effect at a given statistical power level
GGUF	GPT-Generated Unified Format for quantized model weights (llama.cpp native)
Holm-Bonferroni	Step-down multiple comparison correction controlling family-wise error rate
Cohen's h	Effect size for comparing two proportions: h = 2|arcsin(sqrt(p1)) - arcsin(sqrt(p2))|. Benchmarks: 0.2 small, 0.5 medium, 0.8 large
Cohen's kappa	Inter-rater agreement statistic corrected for chance agreement
JBB	JailbreakBench -- standardized harmful behavior taxonomy
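Two of the glossary formulas are short enough to sketch directly. The block below is a minimal illustration of Cohen's h as defined above and of the standard Wilson score interval; it is not taken from the report's analysis script, the function names are hypothetical, and z = 1.96 (a two-sided 95% level) is an assumption.

```python
from __future__ import annotations

import math


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: 2 * |arcsin(sqrt(p1)) - arcsin(sqrt(p2))|."""
    return 2.0 * abs(math.asin(math.sqrt(p1)) - math.asin(math.sqrt(p2)))


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 is assumed here (two-sided 95%); pass another value if needed.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, centre - half), min(1.0, centre + half)


if __name__ == "__main__":
    # 42% vs 3% ASR: the GGUF Q4_K_M vs AWQ/GPTQ contrast cited in this report.
    print(f"Cohen's h = {cohens_h(0.42, 0.03):.2f}")
    low, high = wilson_interval(50, 100)
    print(f"Wilson 95% CI for 50/100: [{low:.3f}, {high:.3f}]")
```

As a usage note, the 42% vs 3% contrast gives h of roughly 1.06, above the 0.8 "large" benchmark listed in the glossary.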

Appendix E: Configs

Run Configuration

experiment: tr140_many_shot_long_context_quant
models:
  - llama3.2:1b
  - llama3.2:3b
  - qwen2.5:1.5b
  - llama3.1:8b
quant_levels:
  - Q8_0
  - Q6_K
  - Q5_K_M
  - Q4_K_M
  - Q3_K_M
  - Q2_K
phase1:
  shot_counts: [1, 4, 16, 64, 128]
  formats: [faux_dialogue, message_array]
  n_behaviors: 50
phase2:
  context_profiles: [short_prefix, medium_prefix, long_prefix]
  quant_levels: [Q8_0, Q6_K, Q4_K_M, Q3_K_M, Q2_K]
  n_behaviors: 50
judge:
  model: qwen2.5:7b-instruct-q8_0
  temperature: 0.0
generation:
  temperature: 0.0
  seed: 42
  warmup_requests: 3
  cooldown_seconds: 10

Config Rationale:

Phase 1 uses all 6 quant levels x 5 shot counts x 2 formats to maximize coverage of the many-shot attack surface. The 50 behaviors per cell balance statistical power (n=100 after format aggregation) against compute cost.
Phase 2 drops Q5_K_M to keep the phase tractable (5 quant levels x 3 profiles x 4 models = 60 cells). Q5_K_M is the least informative level: it falls between the safe Q6_K and the safe Q4_K_M, both of which are included.
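The judge model (qwen2.5:7b-instruct-q8_0) comes from a different family than the three evaluated Llama models, which limits family-specific bias for them. It shares a family with the evaluated qwen2.5-1.5b, so family-specific bias is not excluded for that model. The 7B size fits within GPU memory alongside the evaluated model.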
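The cell and sample counts implied by the configuration can be checked by enumerating the design grid. The sketch below copies the lists from the Run Configuration above; the variable names are illustrative and are not the keys read by the experiment runner.

```python
from itertools import product

# Values copied from the Run Configuration above.
MODELS = ["llama3.2:1b", "llama3.2:3b", "qwen2.5:1.5b", "llama3.1:8b"]
PHASE1_QUANTS = ["Q8_0", "Q6_K", "Q5_K_M", "Q4_K_M", "Q3_K_M", "Q2_K"]
PHASE2_QUANTS = ["Q8_0", "Q6_K", "Q4_K_M", "Q3_K_M", "Q2_K"]
SHOT_COUNTS = [1, 4, 16, 64, 128]
FORMATS = ["faux_dialogue", "message_array"]
CONTEXT_PROFILES = ["short_prefix", "medium_prefix", "long_prefix"]
N_BEHAVIORS = 50

# One cell per combination of design factors.
phase1_cells = list(product(MODELS, PHASE1_QUANTS, SHOT_COUNTS, FORMATS))
phase2_cells = list(product(MODELS, PHASE2_QUANTS, CONTEXT_PROFILES))

phase1_samples = len(phase1_cells) * N_BEHAVIORS
phase2_samples = len(phase2_cells) * N_BEHAVIORS

print(f"Phase 1: {len(phase1_cells)} cells, {phase1_samples} samples")
print(f"Phase 2: {len(phase2_cells)} cells, {phase2_samples} samples")
print(f"Total:   {phase1_samples + phase2_samples} samples")
```

Phase 1 yields 240 format-level cells (12,000 samples) and Phase 2 yields 60 cells (3,000 samples), which together account for the 15,000 scored samples. Aggregating the two formats gives the n=100 per cell quoted in the rationale.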

Full config snapshot: research/tr140/results/20260316_164907/config_snapshot.yaml

v3.0 Data Manifest and Compute Script

The v3.0 integration consumes the following files. Every row count was verified by research/tr140/compute_v2_stats.py at report-generation time; see research/tr140/v2_controls/analysis/v3_data_manifest.json for the full manifest and per-file schema notes.

Control	Directory	samples.jsonl rows	judge labels (gemma3 unless noted)	Notes
v1 frozen	`research/tr140/results/20260316_164907/`	15,000	15,000 (qwen2.5:7b)	Primary v1 data
C1 rejudge (gemma3)	`research/tr140/v2_controls/results/c1_rejudge/20260414_165208/`	0	15,000	Gemma3 labels on v1 samples
C1 rejudge (Claude)	`research/tr140/v2_controls/results/claude_judge/claude_judge_labels.jsonl`	0	15,000	Claude Sonnet 4.6 on v1 samples
C2 FP16 baseline	`research/tr140/v2_controls/results/c2_fp16_baseline/20260414_203036/`	600	--	FP16 via transformers
C3 static prompt	`research/tr140/v2_controls/results/c3_static_prompt/20260414_193834/`	1,200	--	Static exemplar selection
C4 benign demo	`research/tr140/v2_controls/results/c4_benign_demo/20260414_195746/`	400	--	Benign-compliance demos
C6 Phase 2 reinforcement	`research/tr140/v2_controls/results/c6_phase2_reinforcement/20260415_031712/`	2,700	2,700	n=150 per cell
C7 breadth (run A)	`research/tr140/v2_controls/results/c7_breadth_expansion/20260415_212050/`	3,000	3,000	gemma2-2b only (phi3.5 pull_failed)
C7 breadth (run B)	`research/tr140/v2_controls/results/c7_breadth_expansion/20260415_223902/`	24,000	24,000	Full family, n=1,000 per cell
C8 right-tail	`research/tr140/v2_controls/results/c8_right_tail/20260416_005713/`	10,800	10,800	N=256 across 6 models x 6 quants
C9 14B anchor	`research/tr140/v2_controls/results/c9_larger_model/20260415_182202/`	900	900	qwen2.5-14b on RunPod L40S
C10 non-GGUF (failed)	`research/tr140/v2_controls/results/c10_non_gguf/20260415_041257/`	200	200	AWQ/GPTQ vLLM startup failed
C10 non-GGUF (success)	`research/tr140/v2_controls/results/c10_non_gguf/20260415_190113/`	400	400	Successful rerun, qwen2.5-1.5b only
C11 temp (no judge)	`research/tr140/v2_controls/results/c11_temp_sensitivity/20260415_000812/`	2,000	0	Replication, no labels
C11 temp (judged)	`research/tr140/v2_controls/results/c11_temp_sensitivity/20260415_000924/`	2,000	2,000	T=0.0 vs T=0.7
C12 shot ordering	`research/tr140/v2_controls/results/c12_shot_ordering/20260415_025821/`	750	750	Default bucket only
C13 combined bundle (x3)	`research/tr140/v2_controls/results/c13_combined_bundle/`	15,000 x 3 (each mirrors v1)	15,000 x 3	Bundle snapshots, not new data

Compute reproduction:

# From repo root:
python research/tr140/compute_v2_stats.py
# writes:
#   research/tr140/v2_controls/analysis/v3_stats.json
# and prints the inventory + per-control status summary to stdout.

The script has no external dependencies beyond Python's standard library (Fisher exact is implemented via math.comb; Wilson CI, bootstrap, TOST, Cohen's h, Cohen's kappa, and Holm-Bonferroni are all implemented in-file). It reads only raw JSONL from research/tr140/results/20260316_164907/ and research/tr140/v2_controls/results/. It does NOT trust any pre-existing analysis JSON.
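To illustrate what a standard-library implementation of two of these tests can look like, the sketch below builds a two-sided Fisher exact test on math.comb and a Holm-Bonferroni step-down adjustment. It is not the contents of compute_v2_stats.py: the function names, the tie tolerance, and the example counts are all assumptions made for illustration.

```python
from __future__ import annotations

from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def pmf(x: int) -> float:
        return comb(row1, x) * comb(row2, col1 - x) / total

    observed = pmf(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Relative tolerance (an assumption) so exact ties are counted.
    p = sum(pmf(x) for x in range(lo, hi + 1) if pmf(x) <= observed * (1 + 1e-9))
    return min(1.0, p)


def holm_adjust(p_values: list[float]) -> list[float]:
    """Holm-Bonferroni adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Multiply the rank-th smallest p by (m - rank), enforce monotonicity.
        running_max = max(running_max, min(1.0, (m - rank) * p_values[idx]))
        adjusted[idx] = running_max
    return adjusted


if __name__ == "__main__":
    # Hypothetical counts: 30/100 compliant at one quant level vs 10/100 at baseline.
    raw = [fisher_exact_two_sided(30, 70, 10, 90), 0.04, 0.20]
    print([f"{p:.4g}" for p in holm_adjust(raw)])
```

Python integers are arbitrary precision, so the binomial coefficients stay exact and only the final division is rounded to a float.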

Claim changes from v1 to v3 (quick reference):

Claim	v1	v3
Phase 2 qwen2.5-1.5b Q2_K long_prefix 100%	n=50/50	n=150/150 (Wilson lower 97.6%)
Judge kappa = 0.23 interpreted as task ambiguity	True	False -- v1 qwen judge is the outlier; gemma3-Claude kappa = 0.925
Q8_0 as proxy for FP16	Implicit	Rejected; TOST fails on 2/2 models tested
H3 (context cap) supported generally	Yes	Only for llama3.1-8b; rejected for 3 other models at N=256
llama3.1-8b Q2_K vs Q8_0 Fisher non-significant after Holm (p=0.128)	True under v1 qwen	p=3.4e-15 under gemma3; judge artifact, not real
"Quantization amplifies, does not create" vulnerability	Supported	Supported and strengthened (C2 FP16 shows qwen2.5-1.5b is vulnerable at 8% ASR even at FP16)
"GGUF K-quant 4-bit" as representative of "4-bit quantization"	Implicit	Rejected; AWQ and GPTQ at 4-bit show 3% ASR vs GGUF Q4_K_M 42% on qwen2.5-1.5b