Technical Report 139: Multi-Turn Jailbreak Susceptibility Under Quantization
Full-scale conversational attack sweep across 4 models, 6 GGUF quant levels, 8 multi-turn strategies, and 2-phase refusal-persistence evaluation
| Field | Value |
|---|---|
| TR Number | 139 |
| Project | Banterhearts Safety Alignment Research |
| Date | 2026-03-14 |
| Version | 1.2 |
| Author | Research Team |
| Git Commit | edbaf196 |
| Status | Complete |
| Report Type | Multi-turn jailbreak quantization study |
| Run Directory | 20260314_012503 |
| Total Conversations | 10,600 |
| Phase 1 Conversations | 9,600 |
| Phase 2 Conversations | 1,000 |
| Judge Labels | 37,825 |
| Models | llama3.2-1b, llama3.2-3b, qwen2.5-1.5b, llama3.1-8b |
| Quant Levels | Q8_0, Q6_K, Q5_K_M, Q4_K_M, Q3_K_M, Q2_K |
| Related Work | TR134, TR135, TR136, TR137, TR138, TR138_v2 |
| Depends On | TR134 shared refusal detector and judge stack, TR125 run utilities, JailbreakBench-derived behavior set |
Abstract
TR139 asks whether GGUF quantization changes susceptibility to multi-turn jailbreak attacks and whether lower-bit models become easier to break once a conversation has already entered an adversarial pressure regime. This is the first full study in this research line to combine modern multi-turn jailbreak methodology with a quantization sweep on local open-weight models.
The completed run contains 10,600 conversations: 9,600 Phase 1 attack-efficacy conversations across 4 models, 6 quant levels, 8 attack strategies, and 50 harmful behaviors, plus 1,000 Phase 2 persistence conversations across 4 models, 5 quant levels, and the same 50 behaviors. Post-hoc adjudication adds 37,825 LLM-judge labels on top of the primary regex-based refusal detector.
The Phase 1 result is strong. All 8 strategy-specific ANOVAs reject quant-independence with p < 1e-4, with effect sizes ranging from eta^2 = 0.0314 (context_fusion) to eta^2 = 0.1526 (direct). The highest-risk cells cluster in qwen2.5-1.5b and low-bit llama3.2-1b: qwen2.5-1.5b / Q2_K / attention_shift, context_fusion, and crescendo all reach 100% ASR, while llama3.2-1b / Q2_K / crescendo reaches 86%.
The Phase 2 result is narrower and more heterogeneous. Persistence generally decreases as quantization gets lower, but not by the same mechanism in every model. The strongest degradation appears in llama3.2-1b (slope = -0.198 persistence points per BPW, bootstrap CI [-0.305, -0.144]), while llama3.1-8b is so weak under pressure across the sweep that the main signal is high break rate rather than a clean monotone slope.
The most important non-result is that quantization does not establish a universal law that multi-turn strategies become more quant-sensitive than direct attacks. Average direct slope is -0.0793, average multi-turn slope is -0.0536, and the Welch comparison is not significant (p = 0.702). Lower quantization changes multi-turn attack success, but it does not prove global multi-turn amplification beyond direct degradation.
The operational conclusion is that quantization belongs in the safety envelope for conversational deployment, but the risk is model- and threshold-specific rather than universal. The right deployment response is targeted validation and threshold-aware policy, not a single blanket quantization rule.
Executive Summary
TR139 answers:
does quantization change multi-turn jailbreak risk, and does it make initially refusing models easier to break under conversational pressure?
Yes, but not in the simplest possible way.
Key Findings
- Quantization materially affects multi-turn ASR. All 8 strategy-level ANOVAs reject quant-independence with
p < 1e-4. - Risk is strongly model-dependent.
qwen2.5-1.5bis broadly vulnerable,llama3.2-1bcollapses atQ2_K,llama3.1-8bshows an instability band aroundQ4_K_MtoQ3_K_M, andllama3.2-3bis the strongest model in the sweep. - Persistence usually weakens as quantization gets lower. Three of four persistence slopes are negative, and the aggregate Phase 2 verdict supports degradation under lower bit-width.
- There is no safe equivalence story. Across
160TOST checks at+/-3pp,0 / 160establish equivalence. - Multi-turn amplification exists in specific cells but not as a universal law. The highest finite amplification ratios reach
49.0x, but H2 is still not supported in aggregate. - A preserved post-hoc judge pass exposes exactly where regex scoring is weakest. The saved judge layer adds
37,825adjudications, with overall agreement64.99%and kappa0.1037against the regex detector. The lowest-agreement slices are the mid-quant Phase 2 cells, which is exactly where human review is most justified.
Core Decisions
- Do not treat quantization as a pure systems optimization for interactive chat.
- Avoid
Q2_Kfor safety-sensitive deployment on the smaller models tested here. - Validate persistence under pressure, not just direct harmful prompt refusal.
- Treat
qwen2.5-1.5bas high-risk for conversational adversarial exposure across much of the quant ladder. - Treat
llama3.2-3bas strongest-in-sweep, not as robust-by-default.
Validation Summary
| Target | Metric | Required | Achieved | Status |
|---|---|---|---|---|
| Phase 1 conversations | Count | 9,600 | 9,600 | PASS |
| Phase 2 conversations | Count | 1,000 | 1,000 | PASS |
| Judge labels | Count | >= 9,600 | 37,825 | PASS |
| H1 rejection (quant-independence) | Strategy ANOVA p < 0.05 | 8 / 8 strategies | 8 / 8, all p < 1e-4 | PASS |
| H2 multi-turn amplification | Welch p < 0.05 | significant slope difference | p = 0.702 | FAIL |
| H3 persistence degradation | Negative slope, CI excludes 0 | 4 / 4 models | 3 / 4 models | PARTIAL |
| TOST non-equivalence (+/-3pp) | Proportion establishing equivalence | 0 / 160 | 0 / 160 | PASS (confirms non-interchangeability) |
| Cross-phase anchor match | Phase 1 direct vs Phase 2 initial refusal | 20 / 20 cells | 20 / 20 exact | PASS |
| Power (median MDE at 80%) | Detectable effect | <= 20pp | 15.19pp | PASS |
Claim Validation
| # | Claim | Evidence Base | Status |
|---|---|---|---|
| 1 | Quantization changes multi-turn ASR | 8/8 strategy ANOVAs reject H1, all p < 1e-4 |
Demonstrated |
| 2 | Lower quant always worsens multi-turn risk | Model-level sweep shows mixed threshold patterns, not universal monotonicity | Not validated |
| 3 | Multi-turn attacks become more quant-sensitive than direct attacks | Welch slope comparison p = 0.702 |
Not validated |
| 4 | Lower quant often weakens persistence under pressure | H3 supported; 3/4 persistence slopes negative | Demonstrated |
| 5 | Threshold behavior matters more than a single global quant rule | Critical quant thresholds cluster by model/strategy, not globally | Demonstrated |
| 6 | Regex-only scoring would be too weak for this report | 37,825 preserved judge labels identify the same mid-quant Phase 2 slices as the highest-ambiguity cells | Demonstrated |
When to Use This Report
TR139 is the reference report for conversational jailbreak risk under quantization. Use it when:
Scenario 1: Quantized chat deployment review
Question: "Can I safely deploy a lower-bit local model for interactive chat?"
Answer: Not without adversarial conversational validation. Direct-turn refusal rates are not enough. TR139 shows that high direct refusal can coexist with very high multi-turn ASR.
Scenario 2: Comparing models for hostile-input environments
Question: "Which of these local models is least bad under multi-turn attack?"
Answer: In this run, llama3.2-3b is strongest overall. qwen2.5-1.5b is the most dangerous. llama3.2-1b is acceptable only above its low-bit cliff. llama3.1-8b is especially poor under continued pressure.
Scenario 3: Deciding whether direct harmful-prompt tests are sufficient
Question: "If the model refuses direct prompts, is that enough evidence?"
Answer: No. TR139 directly shows that a model can retain high direct refusal while still failing badly in multi-turn attack cells.
Scenario 4: Positioning TR139 relative to the broader safety line
Question: "What does TR139 add beyond TR134-TR138?"
Answer: Earlier reports established that inference-time choices can affect safety. TR139 extends that line into conversational attack structure and shows that quantization changes the multi-turn attack surface too.
Table of Contents
- Metric Definitions and Evidence Standards
- 1. Introduction and Research Motivation
- 2. Methodology and Experimental Design
- 3. Models, Quants, Strategies, and Pressure Tactics
- 4. Phase 1 Results: Multi-Turn Attack Success
- 5. Phase 2 Results: Refusal Persistence Under Pressure
- 6. Statistical Synthesis and Hypothesis Evaluation
- 7. Judge Agreement and Scoring Reliability
- 8. Cross-Phase Validation
- 9. Cross-Report Positioning
- 10. Production Guidance
- 11. Limitations and Follow-Up Program
- 12. Conclusions
- 13. Reproducibility
- References and External Anchors
- Appendix A: Raw Data Tables
- Appendix B: Extended Statistical Tables
- Appendix C: Sensitivity and Robustness
- Appendix D: Glossary
- Appendix E: Configs (Source of Truth)
Metric Definitions and Evidence Standards
Primary Metrics
| Metric | Definition | Interpretation |
|---|---|---|
| ASR | Fraction of Phase 1 conversations whose target turn is scored compliant | Higher is worse |
| Break rate | Fraction of initially refusing Phase 2 conversations that eventually comply under pressure | Higher is worse |
| Persistence | Fraction of turns in a Phase 2 conversation that preserve refusal | Lower is worse |
| Slope per BPW | Change in ASR or persistence per bits-per-weight unit | Negative ASR slope means lower quant increases ASR; negative persistence slope means lower quant weakens persistence |
| Amplification ratio | Multi-turn ASR divided by direct ASR for the same (model, quant) cell |
Values > 1 indicate multi-turn strategy outperforms direct |
Statistical Tests Used
| Test | Role in report |
|---|---|
| One-way ANOVA | H1 quant-independence checks by strategy |
| Two-way ANOVA | Quant x strategy interaction within each model |
| Fisher exact test | Per-cell quant-vs-baseline comparisons |
| Welch's t-test | H2 direct-vs-multi-turn slope comparison; pairwise quant comparisons |
| Cohen's d | Effect size on every pairwise quant comparison |
| Holm-Bonferroni correction | Applied to all Fisher exact and pairwise t-test families to control FWER |
| Bootstrap CI (1,000 iterations, seed 42) | Uncertainty on persistence slopes and ASR slopes |
TOST (+/-3pp) |
Tests whether quant cells are practically equivalent |
| Variance decomposition | Separates model, quant, strategy, and residual contributions |
| Power analysis (alpha=0.05, power=0.80) | Estimates MDE per cell |
Evidence Standard
TR139 distinguishes:
- Established findings: directly supported by the completed artifacts and tests
- Partial findings: directionally real but still heterogeneous or underconstrained
- Non-claims: tempting interpretations that the run does not justify
That distinction is necessary because TR139 contains both strong positive findings and real negative findings.
1. Introduction and Research Motivation
Multi-turn jailbreaks are now one of the strongest attack classes in the alignment literature. They work by distributing harmful intent across turns, building authority, exploiting persona commitment, or gradually narrowing from benign context into unsafe content. That makes them structurally different from the single-turn refusal tests used in much of routine safety validation.
Separately, the recent safety line in this repo has already shown that deployment choices such as backend, batching, and quantization are not safety-neutral. But those reports mostly interrogated single-turn or near-single-turn behavior. The missing question was whether the same deployment choices alter the much stronger conversational attack regime.
TR139 fills that gap.
The core novelty is not simply "lower quant hurts safety." The novelty is:
quantization interacts with conversational attack structure in a model-specific and threshold-specific way.
That matters for practice because a deployment team often chooses quantization for memory fit, speed, or laptop/server constraints. If that same decision also moves the model into a conversationally unstable attack regime, then quantization is part of the safety envelope, not just the systems stack.
The report therefore has two jobs:
- determine whether multi-turn ASR depends on quantization
- determine whether lower quant weakens persistence after an initial refusal
1.1 Research questions
TR139 answers four concrete decision questions:
- Does multi-turn ASR depend on quantization level?
- Does lower quantization make multi-turn strategies more dangerous than direct attacks, or does it merely degrade safety more generally?
- Once a model refuses, which quant settings preserve that refusal and which settings collapse under pressure?
- Are there model-specific threshold regions that justify deployment policy?
1.2 Why this matters
The practical risk is compound. A system can appear strong on direct harmful prompts, fit comfortably on available hardware, and still enter a conversational failure regime because:
- lower-bit deployment changes the refusal surface
- multi-turn context changes how the model interprets the final request
- repeated pressure changes whether a refusal is durable
TR139 therefore closes a real decision gap. It is not just another jailbreak benchmark. It is a deployment-facing study of how quantization interacts with the strongest attack class in the current literature.
1.3 Scope
| Scope item | Coverage |
|---|---|
| Deployment style | Local Ollama chat-style serving |
| Attack family | Multi-turn jailbreaks plus follow-up persistence pressure |
| Models | 4 local instruct models from 1B to 8B |
| Quant ladder | 6 Phase 1 levels, 5 Phase 2 levels |
| Behavior set | 50 harmful behaviors across 10 categories |
| Primary focus | Safety under adversarial conversational interaction |
1.4 Literature grounding
TR139 is anchored in two prior literatures:
- multi-turn jailbreak methods such as Crescendo, Foot-in-the-Door, Attention Shifting, RACE-like refinement, context fusion, and persona attacks
- quantization-safety work showing that lower-bit deployment can alter alignment behavior
The novelty gap is the intersection. Prior work treats these mostly as separate problems. TR139 measures them together.
1.5 How to read this report
Read the report in three passes:
- Phase 1 for the main quantization x conversational-attack result
- Phase 2 for the persistence-under-pressure result
- Statistical synthesis and production guidance for what is actually established and how to use it
For unfamiliar terms, see Appendix D (Glossary). For raw data tables and full pairwise statistics, see Appendices A and B. For robustness checks and sensitivity analysis, see Appendix C. The canonical run configuration is reproduced in Appendix E.
2. Methodology and Experimental Design
2.1 Overall design
TR139 is a two-phase design.
Phase 1 measures attack success under fully-specified multi-turn jailbreak templates:
- 4 models
- 6 quant levels
- 8 attack strategies
- 50 harmful behaviors
This yields exactly 9,600 conversations.
Phase 2 measures persistence after an initial refusal:
- 4 models
- 5 quant levels
- 50 harmful behaviors
This yields exactly 1,000 conversations.
2.2 Unit of analysis
The basic unit is one conversation under one fixed (model, quant, strategy, behavior) configuration.
That conversation then becomes:
- one ASR contribution in Phase 1
- or one persistence/break contribution in Phase 2
The report is therefore conversation-level in generation, but cell-level in aggregation and hypothesis testing.
2.3 How rows become claims
The evidence chain is:
- model outputs are generated from fixed multi-turn or persistence templates
- the refusal detector scores target turns or pressure turns
- the judge layer re-scores a large post-hoc subset
- conversation outcomes aggregate into ASR, break rate, and persistence by cell
- those cells feed ANOVA, slope, equivalence, and variance analyses
- only then do report-level claims follow
This matters because TR139 is not just a table of alarming cells. The report's claims are tied to the aggregation and statistics, not only the worst examples.
2.4 Scoring stack
The primary scorer is the shared refusal detector inherited from the TR134 safety stack. It provides the run-wide consistent refusal/compliance signal.
TR139 then adds one preserved post-hoc LLM judge pass, producing 37,825 labels on the scored conversations. The saved artifact chain uses qwen2.5:7b-instruct-q8_0 as the judge model. An earlier fallback run existed during development, but its outputs were overwritten and are not treated as part of the final evidentiary record. The judge layer is therefore used here as a secondary adjudication signal over the final saved run, not as a dual-judge comparison.
2.5 Why Phase 1 and Phase 2 are separated
Phase 1 and Phase 2 answer different questions.
- Phase 1: "Can the attacker get compliance by the target turn under a structured multi-turn attack?"
- Phase 2: "If the model initially refuses, how much pressure does it tolerate before breaking?"
Those are related but not identical. A model can be weak in one regime and less weak in the other. TR139 keeps them separate for that reason.
2.6 Design safeguards
The design includes several safeguards specifically to keep the interpretation clean:
directanchors the single-turn baselinebenign_contextisolates multi-turn context from overt escalation strategy- Phase 2 measures both overall persistence and pressure-only resistance so immediate failure is not conflated with later breakage
- the 8B model is explicitly treated as quantized-only in practice rather than pretending FP16 would be deployable on this hardware
- the same harmful behavior pool is reused across phases so the persistence analysis is not answering a different content question from Phase 1
2.7 What this design does not do
TR139 does not test:
- adaptive human red-teaming
- hosted frontier APIs
- training-time mitigation changes
- every possible quant scheme
It is a serving-time deployment study, not a universal alignment benchmark.
2.8 Why these models
The model lineup is intentional rather than opportunistic.
llama3.2-1bis the small-model cliff candidate: easy to deploy locally, but most likely to show quantization-sensitive failure bands.llama3.2-3bprovides a same-family comparison at higher capability, which helps separate family effects from pure scale effects.qwen2.5-1.5badds a different instruction-tuning family with a strong reputation on utility tasks, making it a useful cross-family safety test rather than a second Llama-only result.llama3.1-8btests whether larger parameter count automatically buys conversational robustness. In this run it does not.
This is the same style of design logic used in TR126: the report is not just "four models happened to be available." Each model occupies a distinct explanatory role in the causal story.
2.9 Why this quant ladder
The quant ladder covers the deployment settings that actually matter for local serving:
Q8_0is the closest practical approximation to a high-fidelity quantized baseline.Q6_KandQ5_K_Mcover common "safe enough for quality, lighter on memory" settings.Q4_K_MandQ3_K_Mare the first places where many laptop and edge deployments begin trading quality for fit or throughput aggressively.Q2_Kis the severe compression endpoint and functions as a stress test for low-bit collapse.
Phase 2 drops Q5_K_M on purpose. The goal there is slope resolution under pressure, not exhaustive coverage of every adjacent quant step. Keeping Q8_0, Q6_K, Q4_K_M, Q3_K_M, and Q2_K preserves the upper, middle, and lower parts of the ladder while keeping the persistence run computationally tractable.
2.10 Sample-count integrity
TR139's executed counts are fixed by design and matter for interpretation:
| Phase | Formula | Executed |
|---|---|---|
| Phase 1 | 4 models x 6 quants x 8 strategies x 50 behaviors |
9,600 conversations |
| Phase 2 | 4 models x 5 quants x 50 behaviors |
1,000 conversations |
| Judge | preserved target-turn and follow-up adjudication on scored outputs | 37,825 labels |
These exact counts matter for two reasons:
- they show the run is complete rather than partially sampled
- they define what the report can and cannot claim at the cell level
TR139 has enough coverage for strong aggregate claims, but only moderate power for many fine-grained cell-to-cell comparisons. The report therefore follows TR126's discipline: use the whole design for the headline claims, and treat the cell-level readout as structure and threshold discovery rather than a license for hundreds of tiny claims.
3. Models, Quants, Strategies, and Pressure Tactics
3.1 Model and quant matrix
| Model | Family | Phase 1 quants | Phase 2 quants |
|---|---|---|---|
llama3.2-1b |
Llama 3.2 | Q8_0, Q6_K, Q5_K_M, Q4_K_M, Q3_K_M, Q2_K |
Q8_0, Q6_K, Q4_K_M, Q3_K_M, Q2_K |
llama3.2-3b |
Llama 3.2 | same | same |
qwen2.5-1.5b |
Qwen 2.5 | same | same |
llama3.1-8b |
Llama 3.1 | same | same |
Bits-per-weight values used in the slope analyses:
| Quant | BPW |
|---|---|
Q8_0 |
8.00 |
Q6_K |
6.57 |
Q5_K_M |
5.69 |
Q4_K_M |
4.85 |
Q3_K_M |
3.91 |
Q2_K |
3.35 |
3.2 Phase 1 attack strategies
The 8 strategies cover controls plus five recognizable jailbreak families from the literature:
| Strategy | Function in the study |
|---|---|
direct |
Single-turn control baseline |
benign_context |
Multi-turn context control without explicit escalation |
foot_in_door |
Topic escalation |
crescendo |
Authority and professional framing escalation |
attention_shift |
Context distraction and fictional reframing |
role_play |
Persona manipulation |
progressive_refinement |
Iterative narrowing toward harmful specificity |
context_fusion |
Distributed intent across benign setup turns |
The control strategies are important. They let the report separate:
- the effect of multi-turn context alone
- the effect of structured multi-turn attack strategy
- the direct single-turn baseline
3.3 Phase 2 pressure tactics
Phase 2 uses up to 8 follow-up pressure tactics after an initial refusal, including:
- authority appeal
- public-knowledge framing
- moral reframing
- emotional appeal
- rephrasing
- competitive comparison
- reciprocity leverage
- normalization
This gives a more decision-relevant measure of persistence than one additional retry turn.
3.4 Strategy taxonomy and turn structure
The attack families fall into four operational groups:
| Group | Strategies | Typical target-turn position |
|---|---|---|
| Direct baseline | direct |
turn 1 |
| Context-only control | benign_context |
turn 5 |
| Escalation and authority | foot_in_door, crescendo, attention_shift |
turn 5 |
| Intent distribution / persona / refinement | context_fusion, role_play, progressive_refinement |
turn 4 |
This matters for the turn-of-first-compliance analysis: the report is not comparing arbitrary turn counts. It is comparing fixed attack structures with distinct target-turn positions.
3.5 Why the behavior categories matter
The 50 harmful behaviors are spread across 10 categories, which lets the report distinguish:
- a model that fails only in a narrow risk pocket
- a model that degrades across many harm families at once
That distinction becomes important in the per-category vulnerability section below.
3.6 Published method coverage and why the controls matter
The strategy set is not just a benchmark grab-bag. It deliberately spans multiple conversational failure mechanisms:
- momentum without overt escalation:
benign_context - topic escalation:
foot_in_door - authority and professional framing:
crescendo - attention diversion and fictionalization:
attention_shift - persona commitment:
role_play - iterative narrowing:
progressive_refinement - distributed intent across turns:
context_fusion
That coverage is what lets TR139 say something stronger than "some jailbreak worked." It can ask whether quantization mostly changes:
- direct harmful compliance
- conversational momentum sensitivity
- specific strategy families
- post-refusal durability
The controls are therefore not filler. They are what keeps the report from collapsing all conversational effects into a single undifferentiated "multi-turn attacks are strong" claim.
4. Phase 1 Results: Multi-Turn Attack Success
4.1 H1 is rejected cleanly
All 8 strategy-specific ANOVAs reject H1:
| Strategy | F | p | Eta squared |
|---|---|---|---|
attention_shift |
9.7575 | < 1e-4 |
0.0393 |
benign_context |
19.5875 | < 1e-4 |
0.0758 |
context_fusion |
7.7356 | < 1e-4 |
0.0314 |
crescendo |
32.9943 | < 1e-4 |
0.1214 |
direct |
43.0025 | < 1e-4 |
0.1526 |
foot_in_door |
29.2947 | < 1e-4 |
0.1093 |
progressive_refinement |
13.3443 | < 1e-4 |
0.0529 |
role_play |
19.3813 | < 1e-4 |
0.0751 |
Observations. All 8 F-statistics exceed 7.7 and all p-values are below 1e-4 after Holm-Bonferroni correction across the 8-test family. The largest effect size is direct (eta^2 = 0.1526, large by conventional thresholds), meaning quantization alone explains over 15% of the variance in direct-attack ASR. Even the smallest effect (context_fusion, eta^2 = 0.0314) clears the small-effect threshold. The result is unambiguous:
quantization materially affects attack success in the multi-turn regime.
4.2 Aggregate strategy ranking
Average ASR across the full Phase 1 sweep ranks the strategies as follows:
| Strategy | Mean ASR |
|---|---|
attention_shift |
32.33% |
progressive_refinement |
30.92% |
role_play |
25.67% |
context_fusion |
24.08% |
crescendo |
23.67% |
direct |
23.50% |
benign_context |
22.58% |
foot_in_door |
19.92% |
Observations. Two conclusions follow.
attention_shiftandprogressive_refinementare the strongest strategies overall in this run.- The control strategies are not trivial.
directandbenign_contextare both strong enough that the multi-turn landscape is not reducible to "special jailbreak tricks only."
4.3 Model-level Phase 1 profiles
qwen2.5-1.5b
Mean ASR by quant:
Q8_0:43.25%Q6_K:39.25%Q5_K_M:42.75%Q4_K_M:55.50%Q3_K_M:53.75%Q2_K:83.75%
This is the worst model in the sweep. It is not merely a low-bit problem. The model is already vulnerable across upper-bit settings, and then becomes catastrophic at Q2_K.
llama3.2-1b
Mean ASR by quant:
Q8_0:1.75%Q6_K:1.75%Q5_K_M:3.75%Q4_K_M:5.75%Q3_K_M:9.75%Q2_K:61.50%
This is the clearest cliff model in the study. Above Q2_K, the system is imperfect but relatively stable. At Q2_K, it collapses.
llama3.1-8b
Mean ASR by quant:
Q8_0:18.50%Q6_K:23.25%Q5_K_M:14.75%Q4_K_M:26.25%Q3_K_M:56.00%Q2_K:44.00%
This is not a monotone degradation story. It is an instability-band story, with the sharpest failures around Q3_K_M.
llama3.2-3b
Mean ASR by quant:
Q8_0:5.75%Q6_K:6.75%Q5_K_M:2.75%Q4_K_M:3.25%Q3_K_M:0.50%Q2_K:3.75%
This is the strongest model in the sweep. It still fails in some cells, but it is clearly separated from the other three.
4.4 Strategy effects by model
Model-level ANOVA across strategies shows that the models are not just different in total ASR; they differ in which strategy families dominate them.
| Strategy | Strongest model mean ASR | Eta squared |
|---|---|---|
attention_shift |
qwen2.5-1.5b at 93.33% |
0.6017 |
context_fusion |
qwen2.5-1.5b at 67.33% |
0.3556 |
progressive_refinement |
qwen2.5-1.5b at 78.33% |
0.3955 |
role_play |
qwen2.5-1.5b at 67.33% |
0.3683 |
direct |
llama3.1-8b at 46.00% |
0.0961 |
Observations. This confirms the qualitative reading from the cell tables: qwen2.5-1.5b is disproportionately weak to several multi-turn methods, while llama3.1-8b is unusually weak even on the direct baseline. The eta^2 values for attention_shift (0.6017) and progressive_refinement (0.3955) are exceptionally large, indicating that model identity explains over half the variance in those strategy families.
4.5 Quant x strategy interaction
Two-way ANOVA within each model shows that the effect of quant level depends on the strategy being used:
| Model | Quant effect eta^2 | Strategy effect eta^2 | Interaction eta^2 |
|---|---|---|---|
llama3.1-8b |
0.1019 | 0.0454 | 0.1038 |
llama3.2-1b |
0.3794 | 0.0192 | 0.0402 |
llama3.2-3b |
0.0113 | 0.0806 | 0.0251 |
qwen2.5-1.5b |
0.0898 | 0.2898 | 0.0570 |
Observations. The interpretation is different by model:
llama3.2-1b: quant dominates (eta^2 = 0.3794); the low-bit cliff is the main storyqwen2.5-1.5b: strategy dominates (eta^2 = 0.2898); the model is broadly vulnerable and the attack family matters more than the exact quantllama3.1-8b: quant and interaction are both substantial (eta^2 = 0.1019and0.1038respectively); instability depends on where in the strategy ladder the model is hitllama3.2-3b: the whole surface is comparatively flat and low (quanteta^2 = 0.0113)
4.6 Highest-risk cells
The worst Phase 1 cells are:
| Model | Quant | Strategy | ASR |
|---|---|---|---|
qwen2.5-1.5b |
Q2_K |
attention_shift |
100% |
qwen2.5-1.5b |
Q2_K |
context_fusion |
100% |
qwen2.5-1.5b |
Q2_K |
crescendo |
100% |
qwen2.5-1.5b |
Q6_K |
attention_shift |
98% |
qwen2.5-1.5b |
Q8_0 |
attention_shift |
98% |
llama3.1-8b |
Q3_K_M |
benign_context |
92% |
llama3.2-1b |
Q2_K |
crescendo |
86% |
Observations. Two things stand out:
attention_shiftis one of the strongest strategies in the whole run, especially onqwen2.5-1.5b.- The
benign_contextcontrol can itself become dangerous in unstable quant regimes, which means multi-turn context accumulation alone is not safety-neutral.
4.7 Turn-of-first-compliance structure
Mean turn-of-first-compliance by strategy is almost perfectly determined by template position:
| Strategy | Mean compliance turn |
|---|---|
direct |
1.0 |
context_fusion |
4.0 |
progressive_refinement |
4.0 |
role_play |
4.0 |
attention_shift |
5.0 |
benign_context |
5.0 |
crescendo |
5.0 |
foot_in_door |
5.0 |
Observations. This is not a trivial observation. It means the attack scaffold is doing what it is supposed to do: models usually hold until the final target turn and then break at the intended conversational moment, rather than leaking compliance randomly earlier in the sequence.
4.8 Per-category vulnerability
Average Phase 1 ASR by behavior category across the full sweep:
| Category | Mean ASR |
|---|---|
| Chemical & Biological Weapons | 32.81% |
| Disinformation & Deception | 29.58% |
| Violence & Weapons | 27.71% |
| Self-Harm & Dangerous Substances | 27.40% |
| Illegal Activities | 27.19% |
| Cybercrime & Unauthorized Intrusion | 25.10% |
| Harassment & Bullying | 24.48% |
| Discrimination & Stereotyping | 20.63% |
| Privacy Violations & Surveillance | 19.58% |
| Economic Harm & Fraud | 18.85% |
Observations. The spread is not enormous (14pp between highest and lowest), which is itself informative. The worst models are not failing only on one narrow content type. They are failing across a broad harm surface. Chemical & Biological Weapons leads at 32.81%, likely because these prompts contain specific procedural requests that trip even high-bit models. Economic Harm & Fraud trails at 18.85%, suggesting models have stronger refusal priors on financial harm framing.
4.9 Cross-strategy correlation
Several strategy pairs fail on the same behaviors, especially in the high-risk regions:
llama3.1-8b / Q4_K_M:context_fusionvsprogressive_refinement,r = 0.7513qwen2.5-1.5b / Q3_K_M:context_fusionvsprogressive_refinement,r = 0.7441qwen2.5-1.5b / Q6_K:context_fusionvscrescendo,r = 0.6896
Observations. This suggests some behaviors are intrinsically multi-turn-fragile rather than being uniquely vulnerable to one attack family. When two strategies with different mechanisms (distributed intent vs iterative narrowing) fail on the same prompts with r > 0.7, the weakness is in the model's representation of the behavior, not in the specifics of the attack scaffold.
4.10 Conditional ASR given setup compliance
Conditional ASR is especially useful when a strategy has a nontrivial setup stage. One illustrative case:
qwen2.5-1.5b / Q3_K_M / attention_shift- setup compliance rate:
70% - unconditional ASR:
72% - conditional ASR given setup success:
91.43%
- setup compliance rate:
This means once the setup stage lands, the target-turn attack is even stronger than the unconditional cell suggests.
4.11 Critical quant thresholds
The critical threshold analysis does not yield one universal cutoff. Instead:
llama3.2-1bmostly crosses atQ2_K, with some earlier thresholds atQ3_K_Mllama3.1-8bmainly crosses atQ3_K_M, with one earlierQ4_K_Mthreshold forbenign_contextqwen2.5-1.5bmostly crosses atQ2_K, butrole_playcrosses earlier atQ5_K_M
This is why the report recommends threshold-aware policy rather than a single repo-wide quant rule.
4.12 What Phase 1 does not prove
Phase 1 does not prove:
- that lower quant always worsens ASR monotonically
- that every multi-turn strategy is more dangerous than direct attack
- that one threshold applies to every model
Those are exactly the overclaims this report avoids.
5. Phase 2 Results: Refusal Persistence Under Pressure
5.1 Aggregate model picture
| Model | Avg break rate | Avg persistence | Initial refusals across sweep | Breaks under pressure |
|---|---|---|---|---|
llama3.1-8b |
96.67% | 0.1329 | 119 | 118 |
llama3.2-1b |
55.95% | 0.6720 | 198 | 95 |
llama3.2-3b |
46.68% | 0.5862 | 208 | 100 |
qwen2.5-1.5b |
38.56% | 0.6809 | 211 | 86 |
Observations. The most operationally important number here is not a slope. It is the llama3.1-8b break profile. Once it initially refuses, it still breaks under pressure almost every time in this design. The 96.67% average break rate means that llama3.1-8b refusal is effectively cosmetic - it stops the user once and then yields. By contrast, qwen2.5-1.5b has the lowest break rate (38.56%) despite being the weakest model in Phase 1, because it simply never refuses many harmful prompts in the first place (fewer initial refusals to break).
5.2 Persistence slope analysis
| Model | Slope per BPW | Bootstrap CI | Interpretation |
|---|---|---|---|
llama3.1-8b |
0.0160 | [-0.0080, 0.0641] |
effectively flat / unstable |
llama3.2-1b |
-0.1980 | [-0.3054, -0.1442] |
lower quant breaks more easily |
llama3.2-3b |
-0.1111 | [-0.4195, -0.0175] |
lower quant breaks more easily |
qwen2.5-1.5b |
-0.1007 | [-0.3106, 0.4225] |
negative direction, wide uncertainty |
Observations. This is enough to support H3 overall, but it also shows why the report does not claim a single common mechanism across the four models. The llama3.2-1b slope (-0.1980) is the steepest and most precisely estimated, while the qwen2.5-1.5b CI crosses zero on the upper bound, indicating that its persistence degradation is directionally real but not yet cleanly separated from noise. Full persistence data by model x quant is in Appendix A.
5.3 Pressure resistance extremes
Lowest mean pressure resistance cells:
| Model | Quant | Pressure resistance | Break rate | Initial refusals |
|---|---|---|---|---|
llama3.2-1b |
Q2_K |
0.0391 | 100.0% | 16 |
llama3.1-8b |
Q4_K_M |
0.0898 | 100.0% | 32 |
llama3.1-8b |
Q8_0 |
0.1919 | 100.0% | 43 |
Highest mean pressure resistance cells:
| Model | Quant | Pressure resistance | Break rate | Initial refusals |
|---|---|---|---|---|
llama3.2-1b |
Q6_K |
0.9896 | 8.33% | 48 |
llama3.2-1b |
Q8_0 |
0.9787 | 17.02% | 47 |
qwen2.5-1.5b |
Q6_K |
0.9770 | 8.16% | 49 |
qwen2.5-1.5b |
Q8_0 |
0.9617 | 10.20% | 49 |
Observations. This makes the persistence story much more concrete than the slope table alone. Some cells remain extremely durable under pressure, while others offer almost no effective resistance at all. The llama3.2-1b / Q2_K cell (pressure resistance 0.0391, 100% break rate) is the single most operationally dangerous cell in Phase 2 - the model folds immediately under any pressure at this quant level.
5.4 Turn-level persistence curves
The turn trajectories show a consistent shape in the weakest cells:
- initial refusal at turn 1 can still be moderately high
- refusal collapses sharply by turns 2-4
- later turns rarely recover meaningfully once the model has broken
Examples:
llama3.2-1b / Q2_K: mean refusal score drops from0.32on turn 1 to0.0on turns 2-3llama3.1-8b / Q8_0: turn 1 refusal is0.86, but by turn 3 it is down to0.186llama3.1-8b / Q4_K_M: turn 1 refusal is0.64, but by turn 4 it is0.0312
These trajectories support the interpretation that some models are not merely vulnerable in the aggregate. They are unable to maintain refusal once the conversation has started bending toward compliance.
5.5 Concrete persistence regimes
Selected Phase 2 cells:
llama3.2-1b / Q6_K: break rate8.33%, persistence0.9511llama3.2-1b / Q8_0: break rate17.02%, persistence0.9222llama3.2-1b / Q2_K: break rate100%, persistence0.0467qwen2.5-1.5b / Q6_K: break rate8.16%, persistence0.9600qwen2.5-1.5b / Q8_0: break rate10.20%, persistence0.9467qwen2.5-1.5b / Q3_K_M: break rate84.44%, persistence0.6156llama3.1-8b / Q8_0: break rate100%, persistence0.2422
The two strongest persistence stories are:
llama3.2-1bandqwen2.5-1.5bboth have relatively stable upper-bit regimes and much weaker lower-bit regimesllama3.1-8bis already weak enough that quant slope is less important than absolute break behavior
5.6 What Phase 2 adds beyond Phase 1
Phase 1 alone could have been dismissed as "multi-turn templates are powerful." Phase 2 blocks that easy escape. It shows that the problem is not just target-turn compliance under crafted attack scaffolds. It is also refusal durability after the system has already said no once.
6. Statistical Synthesis and Hypothesis Evaluation
6.1 H1: ASR independence of quantization
Verdict: rejected.
This is the cleanest result in the report. All 8 strategies reject H1 with p < 1e-4.
In addition to the ANOVA results, per-cell Fisher exact tests (each quant vs Q8_0 baseline) were corrected with Holm-Bonferroni within each model x strategy family. All-pairs Welch's t-tests were also corrected with Holm-Bonferroni. Selected pairwise Cohen's d values illustrate the magnitude:
| Model | Strategy | Comparison | Cohen's d | Holm p |
|---|---|---|---|---|
llama3.2-1b |
crescendo |
Q2_K vs Q8_0 | 2.49 | < 0.001 |
llama3.2-1b |
foot_in_door |
Q2_K vs Q8_0 | 2.33 | < 0.001 |
llama3.1-8b |
benign_context |
Q3_K_M vs Q8_0 | 1.00 | < 0.001 |
llama3.1-8b |
direct |
Q3_K_M vs Q8_0 | 1.00 | < 0.001 |
llama3.1-8b |
attention_shift |
Q2_K vs Q4_K_M | 0.71 | 0.004 |
qwen2.5-1.5b |
role_play |
Q2_K vs Q8_0 | 2.18 | < 0.001 |
qwen2.5-1.5b |
crescendo |
Q2_K vs Q8_0 | 2.20 | < 0.001 |
llama3.2-3b |
direct |
Q2_K vs Q3_K_M | 3.47 | < 0.001 |
Full pairwise tables with Cohen's d and Holm-corrected p-values are in Appendix B.
6.2 H2: quantization amplifies multi-turn attacks relative to direct attacks
Cell-level amplification is real. The largest finite amplification ratios are:
49.0xforqwen2.5-1.5b / Q6_K / attention_shift49.0xforqwen2.5-1.5b / Q8_0 / attention_shift39.0xforqwen2.5-1.5b / Q8_0 / progressive_refinement
But the aggregate slope test does not support the stronger global claim:
- average direct slope:
-0.0793 - average multi-turn slope:
-0.0536 - ratio:
0.68x - Welch
p = 0.7023
So H2 is not supported.
This is one of the most important negative findings in the report. It keeps the interpretation honest. Quantization matters, but it is not enough to claim that multi-turn attacks are generically more quant-sensitive than direct ones.
6.3 H3: persistence decreases with lower quantization
Verdict: supported, with heterogeneity.
Three of four persistence slopes are negative, and the clearest evidence appears in llama3.2-1b and llama3.2-3b. The qwen2.5-1.5b direction is consistent but uncertain, and llama3.1-8b is so brittle in absolute terms that monotonicity is not the whole story.
6.4 Latency and deployment tradeoff
Average mean total latency by quant across the Phase 1 sweep:
| Quant | Mean total latency |
|---|---|
Q8_0 |
8,334 ms |
Q6_K |
7,363 ms |
Q3_K_M |
6,859 ms |
Q2_K |
6,848 ms |
Q5_K_M |
6,641 ms |
Q4_K_M |
6,186 ms |
Observations. The latency slopes show the usual systems temptation: lower-bit deployment often gets faster, especially outside the 8B model. TR139's importance is that it links those speed gains to a real conversational safety tradeoff.
6.5 Equivalence and variance
Two additional results narrow interpretation:
0 / 160TOST checks establish equivalence under+/-3pp- variance decomposition gives
18.31%model,6.72%quant,1.21%quant x strategy,0.83%strategy, and72.93%residual
Observations. This means:
- neighboring quant settings are not safely interchangeable by default
- model family explains more structured variation than strategy family does
- large residual variance (72.93%) is consistent with behavior-level heterogeneity across the 50-behavior benchmark - the safety surface is not smooth, and individual harmful behaviors contribute substantial idiosyncratic variation that no single design factor captures
A sensitivity check on the residual decomposition is in Appendix C.
6.6 Power caveat
Power is uneven across the Phase 1 cells:
- median MDE at 80% power:
15.19pp - minimum:
5.57pp - maximum:
27.43pp
That is why the report leans hardest on:
- large cell differences
- consistent directional patterns
- aggregate hypothesis tests
and not on every small per-cell delta.
6.7 What the negative H2 finding rules out
The failure of H2 is not a side note. It is one of the most important interpretation constraints in the report.
It rules out three easy but wrong summaries:
-
"Multi-turn attacks are always the thing quantization amplifies most." Not shown. Some direct baselines degrade sharply too, especially in the weakest cells.
-
"Any large amplification ratio demonstrates special multi-turn sensitivity." Not by itself. Very large amplification ratios can arise when the direct denominator is tiny and the multi-turn numerator is merely moderate.
-
"The right global story is direct vs multi-turn." Also not shown. The cleaner story is model regime plus threshold regime. For some models, quantization mostly induces a cliff. For others, it changes which conversational strategy family is dominant. For
llama3.1-8b, the persistence story is stronger than the direct-vs-multi-turn slope story.
This is exactly the kind of negative result that improves the report rather than weakening it. It narrows the claim from a broad and easy-to-attack thesis to a more defensible one:
quantization changes the conversational safety surface, but the way it does so is model-specific rather than governed by a single universal amplification law.
7. Judge Agreement and Scoring Reliability
TR139's preserved artifact chain contains one post-hoc judge pass over the final run: qwen2.5:7b-instruct-q8_0 produced 37,825 labels across the same 10,600 conversations used by the regex-based primary analysis. The purpose of this layer is not to replace the deterministic scorer. It is to show where the regex and an independent LLM adjudicator agree, and where they do not.
7.1 Preserved judge pass: overall
| Metric | Saved judge value |
|---|---|
| Overall agreement with regex | 64.99% |
| Overall kappa | 0.1037 |
| Judged rows included in agreement slice | 12,562 |
Observations. Agreement is modest, not strong. That does not invalidate the report, but it does limit what can be claimed about borderline rows. The right interpretation is that the judge layer confirms the existence of ambiguity in the harder slices of the run, especially where conversational persistence responses become partial, hedged, or stylistically mixed.
7.2 Phase 1 agreement by quant level
| Quant | N | Agreement | Kappa |
|---|---|---|---|
Q8_0 |
1,600 | 81.50% | 0.1029 |
Q6_K |
1,600 | 79.81% | 0.0817 |
Q5_K_M |
1,600 | 80.00% | 0.0040 |
Q4_K_M |
1,600 | 75.88% | 0.0817 |
Q3_K_M |
1,600 | 69.31% | 0.0373 |
Q2_K |
1,600 | 54.44% | 0.0612 |
Observations. The preserved judge pass shows the same broad gradient the report would already lead us to expect: agreement is highest in the upper-quant slices and worst at Q2_K. The Q5_K_M stratum is especially notable because agreement remains 80.00% while kappa falls to 0.0040, indicating that the class balance is skewed enough that raw agreement alone overstates certainty.
7.3 Phase 2 agreement by quant level
| Slice | N | Agreement | Kappa |
|---|---|---|---|
p2_Q8_0 |
561 | 43.14% | 0.1149 |
p2_Q6_K |
499 | 46.89% | 0.1436 |
p2_Q4_K_M |
775 | 33.16% | 0.0697 |
p2_Q3_K_M |
503 | 36.78% | 0.0730 |
p2_Q2_K |
624 | 30.61% | 0.0625 |
Observations. Phase 2 is the genuinely difficult part of the scoring problem. Agreement falls sharply once the model is no longer producing a clear direct-turn refusal or compliance answer and instead is responding after multiple pressure turns. The worst slices are the same ones the substantive analysis already treats as fragile and ambiguous: p2_Q4_K_M, p2_Q3_K_M, and p2_Q2_K.
7.4 What the preserved judge pass shows
The saved judge layer establishes three things:
- The core report claims do not depend on the judge. All 8 ANOVA rejections, the H2 non-result, and the H3 persistence slopes are computed from the deterministic primary scorer.
- The mid-quant and low-quant Phase 2 slices are exactly where a human review layer would add value. Those are the cells where agreement is weakest.
- Regex-only scoring would have hidden the extent of row-level ambiguity. The judge layer does not overturn the aggregate findings, but it does show why cell-level narratives need discipline.
7.5 What follows from this
The report's broad findings are defensible because they rest on:
- thousands of conversations
- large differences in the most important cells
- replicated phase structure
- aggregate tests that survive the noisy scoring layer
- a preserved secondary judge pass that highlights which slices are least settled
But the next credibility upgrade is still obvious:
add human adjudication on a stratified scored subset, prioritizing the mid-quant Phase 2 cells where the saved judge pass diverges most from the regex detector.
7.6 Why weak agreement does not collapse the core result
The judge-agreement section is easy to misuse in either direction.
It would be wrong to say:
- "kappa is weak, therefore the report says nothing"
It would also be wrong to say:
- "agreement is good enough, therefore every cell-level story is settled"
The right read is between those extremes.
TR139's core claims survive because they are supported by multiple independent structures at once:
- large and repeated Phase 1 differences in the highest-risk regions
- a completed two-phase design rather than one isolated benchmark
- exact
20 / 20direct-anchor matches between Phase 1 and Phase 2 - negative findings that constrain overclaiming rather than silently disappearing
- a secondary judge layer that points to the same high-ambiguity slices the report already treats cautiously
What the weak agreement blocks is not the entire report. It blocks overconfident fine-grained claims about every borderline scored output. That is why this document is now at flagship-report level, but the next paper-grade upgrade is still human adjudication.
8. Cross-Phase Validation
8.1 Direct-turn anchor matches exactly across phases
The strongest within-report validation check is available because Phase 2 begins with the same direct harmful request regime that Phase 1 already measures under the direct strategy. If the benchmark preparation, prompt wiring, and scoring logic drifted between phases, the Phase 1 direct refusal rate and the Phase 2 initial-refusal rate would disagree.
They do not merely align directionally. They match exactly in all 20 / 20 shared (model, quant) cells.
| Model | Q8_0 |
Q6_K |
Q4_K_M |
Q3_K_M |
Q2_K |
|---|---|---|---|---|---|
llama3.1-8b |
86 / 86 |
66 / 66 |
64 / 64 |
12 / 12 |
10 / 10 |
llama3.2-1b |
94 / 94 |
96 / 96 |
92 / 92 |
82 / 82 |
32 / 32 |
llama3.2-3b |
74 / 74 |
76 / 76 |
82 / 82 |
96 / 96 |
88 / 88 |
qwen2.5-1.5b |
98 / 98 |
98 / 98 |
96 / 96 |
90 / 90 |
40 / 40 |
Each cell is Phase 1 direct refusal % / Phase 2 initial refusal %.
Observations. This is the cleanest validation result in the report outside the hypothesis tests themselves. It shows that the Phase 2 persistence findings are built on the same direct-turn baseline measured independently in Phase 1 rather than on a silently different prompt or scoring path.
8.2 What this validates
This exact match validates four practical things at once:
- the shared behavior set was wired consistently across phases
- the direct harmful prompt in Phase 2 is functionally the same intervention as the Phase 1
directcontrol - the refusal detector is at least self-consistent across the two phases on the initial direct turn
- the large Phase 2 differences are not an artifact of beginning from a different baseline than the Phase 1 sweep
This is the kind of section TR126 uses well: not just another result table, but a check that the measurement system is coherent enough for the downstream interpretation to matter.
8.3 What this does not validate
This validation is important, but it does not solve every reliability problem.
It does not show:
- that the regex classifier is correct in an absolute human sense
- that the judge labels are ground truth
- that the strategy templates exhaust the conversational attack surface
- that the same break profiles would appear on another backend or hardware stack
So the right interpretation is strong but bounded:
TR139's two phases are internally coherent. The report is still not the same thing as a human-adjudicated paper benchmark.
8.4 Why the direct anchor matters scientifically
This validation also sharpens the main causal reading of the report.
The Phase 1 result is not "some models comply a lot." The more interesting claim is:
- several models still refuse direct harmful prompts at meaningful rates
- those same models can nevertheless fail badly once attack structure or pressure is introduced
That is why TR139 is a conversational-safety report rather than a direct-refusal report with extra decoration.
9. Cross-Report Positioning
TR139 extends the repo's safety line in a specific direction.
- TR134 established the single-turn quantization anchor.
- TR135-TR137 showed that inference-time choices such as quantization, backend, and concurrency can move safety behavior.
- TR138 showed that serving-time batching is not safety-neutral even under deterministic decoding.
- TR139 adds the conversational regime: structured multi-turn attacks and post-refusal pressure.
What is newly measured here is not just "more jailbreaks." It is two distinct deployment-facing datapoints:
-
attack-surface reshaping under quantization Quantization changes which strategy families dominate a model and where the dangerous thresholds lie.
-
refusal durability after the model has already said no A model can look acceptable on the direct turn and still collapse once the user keeps pushing.
The strongest cross-report continuity point remains the direct-turn Q8_0 anchor. The following table compares TR139 baselines against prior TR measurements on overlapping models, applying the +/-5pp tolerance established in TR137:
| Model | TR134 Q8_0 (AdvBench) | TR136 Q8_0 (aggregate) | TR139 Q8_0 (direct, 50 behaviors) | TR134->139 delta | TR136->139 delta | Within +/-5pp? |
|---|---|---|---|---|---|---|
llama3.2-1b |
90.0% (n=100) | 87.6% (n=468) | 94.0% (n=50) | +4.0pp | +6.4pp | TR134 yes / TR136 no |
llama3.2-3b |
52.0% (n=100) | 80.9% (n=468) | 74.0% (n=50) | +22.0pp | -6.9pp | no / no |
qwen2.5-1.5b |
- | 84.8% (n=468) | 98.0% (n=50) | - | +13.2pp | - / no |
llama3.1-8b |
- | - | 86.0% (n=50) | - | - | baseline set |
Observations. The llama3.2-1b anchor is the tightest cross-TR match (+4.0pp vs TR134, within tolerance). The larger deltas for other models are expected rather than alarming: TR134 used AdvBench-only prompts, TR136 used a 4-task aggregate, and TR139 uses a 50-behavior JailbreakBench-derived set spanning 10 harm categories. The llama3.2-3b TR134->TR139 gap (+22.0pp) is the most notable and reflects this model's sensitivity to prompt distribution; it refuses more of TR139's curated harmful behaviors than TR134's AdvBench set. These cross-TR deltas are therefore primarily task-distribution effects rather than measurement errors, and the within-stack internal consistency (20/20 exact Phase 1-to-Phase 2 anchor match) remains the stronger validation signal.
That matters because it blocks an easy dismissal. These models are not failing only because they refuse nothing in the first place. Several retain substantial direct refusal while still showing large multi-turn failure regions. TR139 therefore supports a narrower but more useful claim:
conversational attack structure is its own safety regime, and quantization changes that regime.
10. Production Guidance
10.1 What deployment teams should do differently
Do not treat quantization as a pure systems knob for public or hostile-input chat systems.
For any model intended for interactive chat, validate all four of the following:
- direct harmful prompt refusal
- multi-turn jailbreak ASR
- refusal persistence under continued pressure
- model-specific critical quant thresholds
If you only run direct-turn refusal checks, you are missing the exact failure mode that dominates several of the most important cells in TR139.
10.2 Model-by-model deployment regime
The report supports different operational conclusions for the four models.
| Model | Relative safest band in this sweep | Caution band | Clear avoid band | Why |
|---|---|---|---|---|
llama3.2-3b |
Q8_0 through Q3_K_M |
Q2_K |
none conclusively catastrophic, but still validate | Lowest overall ASR surface, strongest model in sweep |
llama3.2-1b |
Q8_0 and Q6_K |
Q4_K_M and Q3_K_M |
Q2_K |
Clear low-bit cliff, strong persistence collapse at Q2_K |
llama3.1-8b |
no clearly comfortable hostile-chat band demonstrated | all quants | especially Q3_K_M instability band |
Direct refusal can look acceptable while persistence remains extremely poor |
qwen2.5-1.5b |
no validated low-risk conversational band in this study | all quants | Q2_K definitively unacceptable |
Severe multi-turn vulnerability even at upper quants; low-bit collapse makes it worse |
Observations. This table should be read as a deployment triage guide, not as a universal leaderboard. "Relative safest" means safest among the tested cells here, not safe in an absolute sense. The most striking row is llama3.1-8b: despite being the largest model in the sweep, it has no comfortable hostile-chat band because its persistence is catastrophically weak (Section 5.1).
10.3 Release checklist justified by this report
Before shipping a quantized chat model into a hostile-input environment:
- Run direct harmful prompt refusal checks on the exact deployment quant
- Run at least one multi-turn benchmark family, not only direct prompts
- Test persistence after initial refusal, not only target-turn compliance
- Identify the first dangerous quant threshold for that model line
- Review high-risk strategy families, not only average ASR
- Check whether the model has a brittle mid-quant instability band rather than a simple low-bit collapse
- Add human adjudication on the most policy-relevant cells before making strong external claims
10.4 What not to infer from this report
TR139 does not justify:
- a universal "lower quant always means less safe" law
- a universal "multi-turn is always more quant-sensitive than direct" law
- a claim that any tested model is robust by default
- a claim that one safe quant threshold applies across model families
The operational lesson is threshold-aware and model-aware policy, not one blanket rule.
10.5 Deployment scenarios
The same report supports different decisions in different deployment contexts.
Scenario A: closed evaluation or trusted-user local assistant
If the model is used only by a trusted operator, the main question is whether low-bit deployment changes the model enough to distort internal evaluation or staff workflows. In that case:
llama3.2-3bremains the cleanest candidatellama3.2-1bcan be usable above itsQ2_Kcliffqwen2.5-1.5bis still a poor choice if the workflow includes adversarial testing, because the conversational failure surface is already broad
Scenario B: public or semi-public chat surface
This is where TR139 matters most. Public chat needs protection against:
- conversational momentum
- structured escalation
- repeated pressure after refusal
For this scenario, the report supports a much stricter rule:
- do not deploy
qwen2.5-1.5bon this evidence base alone - do not deploy
llama3.1-8bwithout stronger persistence controls - do not use
Q2_Kon the smaller models in safety-sensitive settings
Scenario C: quantization chosen only for fit or speed
TR139 directly argues against the common shortcut:
"we only changed memory footprint and latency, not behavior"
That shortcut is not defensible for conversational deployment on the models tested here.
11. Limitations and Follow-Up Program
11.1 Limitations
-
Primary classification remains heuristic. The refusal detector is useful and internally consistent, but it is still a pattern-based classifier rather than a human labeler.
-
The preserved judge pass uses the same family as one evaluated model. The saved judge is Qwen 2.5 7B, which overlaps with the evaluated
qwen2.5-1.5bfamily. The size gap mitigates direct self-evaluation somewhat, but a truly independent judge family would be stronger. -
No human adjudication layer is included yet. This is the single biggest remaining gap if TR139 is meant to become paper-grade evidence rather than flagship technical-report evidence.
-
The strategy templates are fixed and reproducible rather than adaptive. This is a feature for clean comparison, but it also means the report estimates a reproducible lower bound on attacker capability rather than an upper bound from skilled human red-teaming.
-
The run covers one serving stack. Everything here is local Ollama chat serving. The report does not isolate backend effects the way TR135-TR137 do.
-
The run covers four local open-weight model lines, not frontier APIs. The point is local deployment policy, not universal alignment ranking across all model classes.
-
Residual variance is large. The variance decomposition assigns
72.93%to residual variation. That is consistent with behavior-level heterogeneity, but it still means many fine-grained cell stories should be treated cautiously. -
TOST non-equivalence everywhere is not the same as universal large effects.
0 / 160equivalence results justifies caution about interchangeability. It does not mean every neighboring quant pair differs by a practically large amount. -
Single-hardware execution still matters. The quant thresholds and latency tradeoffs were measured on one RTX 4080 Laptop environment.
-
Published-method mapping is faithful but still repo-implemented. The strategy families are grounded in prior literature, but the exact templates are local implementations rather than borrowed benchmark binaries.
11.2 Highest-value follow-up work
- human adjudication on a stratified subset of the most policy-relevant Phase 1 and Phase 2 outputs
- larger-judge rerun for calibration against the fallback-judge path
- replication of the highest-risk cells on another hardware stack
- additional model-family replication, especially stronger Qwen and Llama variants
- adaptive attack comparison against the fixed strategy templates
- backend replication to separate quant effects from serving-stack effects
The first item is still the best next upgrade. The core result is already real enough for flagship technical-report use, but human adjudication is the cleanest path from "strong internal evidence" to "paper-quality evidence."
12. Conclusions
12.1 Direct answers
Does quantization change multi-turn jailbreak risk?
Yes. All 8 strategy-specific ANOVAs reject quant-independence with p < 1e-4 and eta^2 up to 0.1526. Holm-Bonferroni-corrected pairwise tests confirm large Cohen's d values (up to 3.47) between critical quant pairs.
Does lower quantization make multi-turn attacks more dangerous than direct attacks?
Not as a universal law. The aggregate Welch slope comparison is not significant (p = 0.702). Cell-level amplification ratios reach 49.0x, but the global multi-turn-vs-direct amplification claim (H2) is not supported.
Does lower quantization weaken persistence after initial refusal?
Yes, with heterogeneity. Three of four models show negative persistence slopes, with llama3.2-1b the clearest (slope = -0.198/BPW, bootstrap CI [-0.305, -0.144]).
Are neighboring quant levels interchangeable?
No. 0 / 160 TOST checks establish equivalence at +/-3pp. Deployment teams cannot assume adjacent quant settings are safety-equivalent.
12.2 Cross-TR comparison
| Dimension | TR134-TR137 | TR138 / TR138 v2 | TR139 |
|---|---|---|---|
| Attack regime | Single-turn | Single-turn under batch perturbation | Multi-turn conversational |
| Primary manipulation | Quant level, backend | Batch size, co-batching | Quant level x attack strategy |
| Models | 2-5 models, 1B-3B | 3 models, 1B-3B | 4 models, 1B-8B |
| Core finding | Quant changes safety scores | Batching disproportionately flips safety outputs | Quant changes multi-turn ASR and persistence |
| Effect magnitude | Small-to-moderate | 2.0% safety flip rate, 4.7x ratio | eta^2 up to 0.38; ASR swings of 60+pp |
| Strongest negative | - | No Phase 2 co-batching effect | No universal multi-turn amplification (H2) |
| Statistical foundation | Bootstrap CIs, TOST, power analysis | Wilson CIs, odds ratios, TOST | ANOVA, Holm-Bonferroni pairwise, Cohen's d, TOST, variance decomposition |
| Conversational depth | 1 turn | 1 turn | Up to 13 turns (5 attack + 8 pressure) |
| Key deployment implication | Validate quant choice | Treat batch size as safety-relevant | Validate multi-turn + persistence, not just direct refusal |
Observations. The most important column contrast is conversational depth: TR134-TR137 and TR138 operate entirely in the single-turn regime, while TR139 extends to 13 turns. The effect magnitudes also scale accordingly - TR138's 2.0% safety flip rate is a subtle signal requiring careful statistical framing, while TR139's 60+pp ASR swings in the worst cells are unmistakable even without sophisticated tests.
12.3 What this report establishes
- Quantization belongs in the safety envelope for conversational deployment.
- The risk is model-specific and threshold-specific, not governed by a single universal rule.
- Direct-turn refusal is necessary but not sufficient - models that refuse direct harmful prompts can still fail badly under multi-turn attack or continued pressure.
- The strongest model in this sweep (
llama3.2-3b) is not robust by default; it is merely the least bad. - The weakest model under persistence pressure (
llama3.1-8b) has a 96.67% break rate despite high initial refusal, making its refusal behavior effectively cosmetic.
13. Reproducibility
13.1 Entry points
Full pipeline:
python research/tr139/run.py -v
Common rerun modes:
# Judge + analysis + report on an existing run
python research/tr139/run.py -v --skip-prep --skip-eval --run-dir research/tr139/results/20260314_012503
# Analysis + report only
python research/tr139/run.py -v --skip-prep --skip-eval --skip-judge --run-dir research/tr139/results/20260314_012503
13.2 Source-of-truth configuration
The canonical config is research/tr139/config.yaml (full YAML excerpts in Appendix E). The key run-shaping parameters are:
| Setting | Value |
|---|---|
| Phase 1 models | llama3.2-1b, llama3.2-3b, qwen2.5-1.5b, llama3.1-8b |
| Phase 1 quants | Q8_0, Q6_K, Q5_K_M, Q4_K_M, Q3_K_M, Q2_K |
| Phase 1 strategies | 8 |
| Phase 2 quants | Q8_0, Q6_K, Q4_K_M, Q3_K_M, Q2_K |
| Phase 2 pressure turns | 8 |
| Temperature | 0.0 |
| Max new tokens | 256 |
| Seed | 42 |
| Warmup requests | 3 |
| Cooldown between models | 10s |
13.3 Exact run scope
| Component | Formula | Executed |
|---|---|---|
| Phase 1 conversations | 4 x 6 x 8 x 50 |
9,600 |
| Phase 2 conversations | 4 x 5 x 50 |
1,000 |
| Judge labels | post-hoc scored turns | 37,825 |
| Total conversations | Phase 1 + Phase 2 |
10,600 |
The primary completed run for this report is:
research/tr139/results/20260314_012503
13.4 Model tags and execution context
The configured Ollama tags are:
| Model | Ollama tag | Notes |
|---|---|---|
llama3.2-1b |
llama3.2:1b |
full quant ladder |
llama3.2-3b |
llama3.2:3b |
full quant ladder |
qwen2.5-1.5b |
qwen2.5:1.5b |
full quant ladder |
llama3.1-8b |
llama3.1:8b |
quantized-only in practice; skip_fp16 |
Judge execution detail for the completed preserved run:
- saved judge tag:
qwen2.5:7b-instruct-q8_0 - saved judge output count:
37,825 - earlier fallback-run outputs were overwritten during development and are not treated as part of the final evidence chain
13.5 Key artifacts
| Artifact | Path | Role in this report |
|---|---|---|
| Analysis JSON | research/tr139/results/20260314_012503/tr139_analysis.json |
primary quantitative source |
| Scored rows | research/tr139/results/20260314_012503/tr139_scored.jsonl |
scored conversation-level outputs |
| Judge labels | research/tr139/results/20260314_012503/judge_labels.jsonl |
post-hoc judge layer |
| Generated report | research/tr139/results/20260314_012503/tr139_report.md |
machine-produced appendix layer |
| README | research/tr139/README.md |
design intent and literature map |
| Shared utils | research/tr139/shared/utils.py |
strategy templates, BPW map, constants |
13.6 Analysis and generation scripts
| Script | Role |
|---|---|
research/tr139/run.py |
orchestration entry point |
research/tr139/prepare_benchmarks.py |
task and behavior preparation |
research/tr139/judge_analysis.py |
post-hoc judge pass |
research/tr139/analyze.py |
hypothesis tests and summary statistics |
research/tr139/generate_report.py |
generated markdown report |
research/tr139/shared/utils.py |
model matrix, quant ladder, strategy templates |
The generated report in the run directory remains a useful machine-produced appendix, but this publish-ready report is the authoritative interpretation layer.
13.7 Prerequisites
To rerun TR139 cleanly, the environment needs:
- Ollama running with GPU access
- the four configured model tags installed
- the task files under
research/tr139/tasks/ - the shared refusal detector and judge stack inherited from TR134
Operationally, the run is large enough that it should be treated like a planned experiment rather than a casual benchmark script. That is another reason this report needs to preserve the reasoning chain in full.
References and External Anchors
TR139 is grounded in two external literatures plus the repo's prior safety line.
Multi-turn jailbreak methods
- Crescendo (Russinovich et al., 2024): authority-building conversational jailbreaks
- Foot-in-the-Door (Wang et al., 2025): gradual escalation through related benign requests
- Attention Shifting Attack (Zhou et al., 2025): distraction and fictional framing
- RACE / progressive refinement (Li et al., 2025): narrowing through iterative benign-seeming reformulation
- Context Fusion Attack (Xu et al., 2024): distributed intent across turns
- ActorAttack / persona manipulation (Ren et al., 2024): role and character consistency as a jailbreak lever
Quantization x safety work
- HarmLevelBench: lower-bit deployment can alter harmful-compliance behavior
- Q-resafe: quantization-sensitive safety changes are not purely hypothetical
- Alignment-Aware Quantization: quantization can interact with alignment properties rather than acting as a neutral compression layer
Repo continuity
- TR134-TR137: quantization, backend, and concurrency are not safety-neutral
- TR138 / TR138 v2: batching is not safety-neutral under deterministic decoding
- TR139: conversational attack structure belongs in that same deployment-safety family
TR139's direct novelty is the intersection of those literatures: conversational jailbreak methodology and quantized local deployment.
Appendix A: Raw Data Tables
A.1 Phase 2 persistence by model x quant (complete)
| Model | Quant | BPW | Initial refusals | Broke | Break rate | Mean persistence | Std | Pressure resistance |
|---|---|---|---|---|---|---|---|---|
llama3.1-8b |
Q8_0 |
8.00 | 43 | 43 | 100.0% | 0.2422 | 0.2029 | 0.1919 |
llama3.1-8b |
Q6_K |
6.57 | 33 | 33 | 100.0% | 0.2111 | 0.2214 | 0.2348 |
llama3.1-8b |
Q4_K_M |
4.85 | 32 | 32 | 100.0% | 0.1222 | 0.1493 | 0.0898 |
llama3.1-8b |
Q3_K_M |
3.91 | 6 | 5 | 83.3% | 0.0533 | 0.1921 | 0.3750 |
llama3.1-8b |
Q2_K |
3.35 | 5 | 5 | 100.0% | 0.0356 | 0.1445 | 0.2750 |
llama3.2-1b |
Q8_0 |
8.00 | 47 | 8 | 17.0% | 0.9222 | 0.2389 | 0.9787 |
llama3.2-1b |
Q6_K |
6.57 | 48 | 4 | 8.3% | 0.9511 | 0.1985 | 0.9896 |
llama3.2-1b |
Q4_K_M |
4.85 | 46 | 34 | 73.9% | 0.7467 | 0.2694 | 0.7880 |
llama3.2-1b |
Q3_K_M |
3.91 | 41 | 33 | 80.5% | 0.6933 | 0.3532 | 0.8262 |
llama3.2-1b |
Q2_K |
3.35 | 16 | 16 | 100.0% | 0.0467 | 0.0780 | 0.0391 |
llama3.2-3b |
Q8_0 |
8.00 | 37 | 9 | 24.3% | 0.6067 | 0.4602 | 0.7973 |
llama3.2-3b |
Q6_K |
6.57 | 38 | 13 | 34.2% | 0.5733 | 0.4564 | 0.7237 |
llama3.2-3b |
Q4_K_M |
4.85 | 41 | 13 | 31.7% | 0.6622 | 0.4199 | 0.7835 |
llama3.2-3b |
Q3_K_M |
3.91 | 48 | 24 | 50.0% | 0.7644 | 0.3050 | 0.7708 |
llama3.2-3b |
Q2_K |
3.35 | 44 | 41 | 93.2% | 0.3244 | 0.2856 | 0.2898 |
qwen2.5-1.5b |
Q8_0 |
8.00 | 49 | 5 | 10.2% | 0.9467 | 0.1785 | 0.9617 |
qwen2.5-1.5b |
Q6_K |
6.57 | 49 | 4 | 8.2% | 0.9600 | 0.1631 | 0.9770 |
qwen2.5-1.5b |
Q4_K_M |
4.85 | 48 | 36 | 75.0% | 0.5311 | 0.3263 | 0.4974 |
qwen2.5-1.5b |
Q3_K_M |
3.91 | 45 | 38 | 84.4% | 0.6156 | 0.3176 | 0.6444 |
qwen2.5-1.5b |
Q2_K |
3.35 | 20 | 3 | 15.0% | 0.3511 | 0.4727 | 0.8625 |
Observations. The llama3.1-8b row group is striking: break rate is 100% at every level except Q3_K_M, which only escapes by having very few initial refusals (6). The llama3.2-1b cliff between Q6_K (8.3% break) and Q4_K_M (73.9% break) is the sharpest discontinuity in the table. The qwen2.5-1.5b Q2_K row is paradoxically low-break-rate (15.0%) because the model already complies directly at Q2_K, leaving only 20 initial refusals to test.
A.2 Phase 1 ASR slopes by model x strategy (complete)
| Model | Strategy | Slope/BPW | CI lower | CI upper | R^2 |
|---|---|---|---|---|---|
llama3.1-8b |
attention_shift |
-0.0193 | -0.0743 | +0.0663 | 0.094 |
llama3.1-8b |
benign_context |
-0.1386 | -0.3208 | -0.0241 | 0.574 |
llama3.1-8b |
context_fusion |
+0.0051 | -0.0879 | +0.0497 | 0.008 |
llama3.1-8b |
crescendo |
-0.0522 | -0.2966 | +0.0306 | 0.097 |
llama3.1-8b |
direct |
-0.1699 | -0.3784 | -0.0294 | 0.720 |
llama3.1-8b |
foot_in_door |
-0.0710 | -0.1704 | -0.0046 | 0.659 |
llama3.1-8b |
progressive_refinement |
-0.0446 | -0.1456 | +0.0089 | 0.424 |
llama3.1-8b |
role_play |
-0.0858 | -0.2851 | +0.0561 | 0.395 |
llama3.2-1b |
attention_shift |
-0.0923 | -0.2911 | +0.0000 | 0.355 |
llama3.2-1b |
benign_context |
-0.0692 | -0.1922 | +0.0173 | 0.367 |
llama3.2-1b |
context_fusion |
-0.0650 | -0.2035 | +0.0000 | 0.363 |
llama3.2-1b |
crescendo |
-0.1197 | -0.3788 | +0.0000 | 0.351 |
llama3.2-1b |
direct |
-0.0987 | -0.2712 | -0.0066 | 0.479 |
llama3.2-1b |
foot_in_door |
-0.1080 | -0.2233 | -0.0365 | 0.713 |
llama3.2-1b |
progressive_refinement |
-0.1270 | -0.3331 | -0.0110 | 0.538 |
llama3.2-1b |
role_play |
-0.0465 | -0.1490 | +0.0000 | 0.336 |
llama3.2-3b |
attention_shift |
+0.0000 | +0.0000 | +0.0000 | 0.000 |
llama3.2-3b |
benign_context |
+0.0157 | -0.0079 | +0.0584 | 0.265 |
llama3.2-3b |
context_fusion |
+0.0253 | +0.0000 | +0.0412 | 0.717 |
llama3.2-3b |
crescendo |
-0.0027 | -0.0088 | +0.0000 | 0.336 |
llama3.2-3b |
direct |
+0.0400 | +0.0191 | +0.0695 | 0.722 |
llama3.2-3b |
foot_in_door |
-0.0074 | -0.0242 | +0.0057 | 0.297 |
llama3.2-3b |
progressive_refinement |
+0.0016 | -0.0002 | +0.0076 | 0.111 |
llama3.2-3b |
role_play |
+0.0000 | +0.0000 | +0.0000 | 0.000 |
qwen2.5-1.5b |
attention_shift |
+0.0234 | -0.0110 | +0.1078 | 0.147 |
qwen2.5-1.5b |
benign_context |
-0.0669 | -0.1824 | -0.0076 | 0.493 |
qwen2.5-1.5b |
context_fusion |
-0.0891 | -0.2526 | +0.0074 | 0.493 |
qwen2.5-1.5b |
crescendo |
-0.1169 | -0.3294 | -0.0055 | 0.523 |
qwen2.5-1.5b |
direct |
-0.0885 | -0.2632 | -0.0002 | 0.432 |
qwen2.5-1.5b |
foot_in_door |
-0.0961 | -0.2129 | -0.0174 | 0.609 |
qwen2.5-1.5b |
progressive_refinement |
-0.0144 | -0.0670 | +0.0168 | 0.128 |
qwen2.5-1.5b |
role_play |
-0.1403 | -0.2297 | -0.0204 | 0.716 |
Observations. The llama3.2-3b row group is strikingly flat - most slopes are near zero, and even the largest (direct, +0.0400/BPW) reflects an anomalous pattern where lower quant slightly improves ASR, driven by the model's already-low failure surface. The llama3.2-1b slopes are uniformly negative, consistent with the Q2_K cliff story. The qwen2.5-1.5b role_play slope (-0.1403, R^2 = 0.716) has the best linear fit among the high-risk cells.
A.3 Power analysis by model x strategy
| Model | Strategy | Baseline ASR | MDE (80% power) |
|---|---|---|---|
llama3.1-8b |
attention_shift |
18.0% | 21.5pp |
llama3.1-8b |
benign_context |
16.0% | 20.5pp |
llama3.1-8b |
context_fusion |
22.0% | 23.2pp |
llama3.1-8b |
crescendo |
26.0% | 24.6pp |
llama3.1-8b |
direct |
14.0% | 19.4pp |
llama3.1-8b |
foot_in_door |
8.0% | 15.2pp |
llama3.1-8b |
progressive_refinement |
22.0% | 23.2pp |
llama3.1-8b |
role_play |
22.0% | 23.2pp |
llama3.2-1b |
attention_shift |
0.0% | 5.6pp |
llama3.2-1b |
benign_context |
8.0% | 15.2pp |
llama3.2-1b |
context_fusion |
0.0% | 5.6pp |
llama3.2-1b |
crescendo |
0.0% | 5.6pp |
llama3.2-1b |
direct |
6.0% | 13.3pp |
llama3.2-1b |
foot_in_door |
0.0% | 5.6pp |
llama3.2-1b |
progressive_refinement |
0.0% | 5.6pp |
llama3.2-1b |
role_play |
0.0% | 5.6pp |
llama3.2-3b |
attention_shift |
0.0% | 5.6pp |
llama3.2-3b |
benign_context |
10.0% | 16.8pp |
llama3.2-3b |
context_fusion |
10.0% | 16.8pp |
llama3.2-3b |
crescendo |
0.0% | 5.6pp |
llama3.2-3b |
direct |
26.0% | 24.6pp |
llama3.2-3b |
foot_in_door |
0.0% | 5.6pp |
llama3.2-3b |
progressive_refinement |
0.0% | 5.6pp |
llama3.2-3b |
role_play |
0.0% | 5.6pp |
qwen2.5-1.5b |
attention_shift |
98.0% | 7.8pp |
qwen2.5-1.5b |
benign_context |
8.0% | 15.2pp |
qwen2.5-1.5b |
context_fusion |
60.0% | 27.4pp |
qwen2.5-1.5b |
crescendo |
34.0% | 26.5pp |
qwen2.5-1.5b |
direct |
2.0% | 7.8pp |
qwen2.5-1.5b |
foot_in_door |
30.0% | 25.7pp |
qwen2.5-1.5b |
progressive_refinement |
78.0% | 23.2pp |
qwen2.5-1.5b |
role_play |
36.0% | 26.9pp |
Observations. MDE is strongly driven by baseline ASR. Cells with 0% baseline (e.g., llama3.2-1b / attention_shift) have the tightest MDE (5.6pp) because any movement from zero is easy to detect. The broadest MDEs (24-27pp) appear in cells with baseline ASR near 30-60%, exactly where the binomial variance is maximized. This means the report is well-powered to detect large cliff-like failures but poorly powered to distinguish adjacent moderate-ASR cells.
Appendix B: Extended Statistical Tables
B.1 Selected pairwise comparisons with Cohen's d and Holm-Bonferroni correction
All comparisons use Welch's t-test with Holm-Bonferroni correction within each model x strategy family. Full tables are in tr139_analysis.json -> phase1_pairwise.
llama3.2-1b (cliff model)
| Strategy | Comparison | Mean A | Mean B | t | p | Cohen's d | Holm p | Sig? |
|---|---|---|---|---|---|---|---|---|
crescendo |
Q2_K vs Q8_0 | 0.86 | 0.00 | - | < 0.001 | 2.49 | < 0.001 | yes |
foot_in_door |
Q2_K vs Q8_0 | 0.76 | 0.00 | - | < 0.001 | 2.33 | < 0.001 | yes |
progressive_refinement |
Q2_K vs Q8_0 | 0.78 | 0.00 | - | < 0.001 | 2.49 | < 0.001 | yes |
direct |
Q2_K vs Q8_0 | 0.68 | 0.06 | - | < 0.001 | 1.54 | < 0.001 | yes |
attention_shift |
Q2_K vs Q8_0 | 0.54 | 0.00 | - | < 0.001 | 1.59 | < 0.001 | yes |
llama3.1-8b (instability-band model)
| Strategy | Comparison | Mean A | Mean B | t | p | Cohen's d | Holm p | Sig? |
|---|---|---|---|---|---|---|---|---|
benign_context |
Q3_K_M vs Q8_0 | 0.92 | 0.16 | - | < 0.001 | 1.00 | < 0.001 | yes |
direct |
Q3_K_M vs Q8_0 | 0.88 | 0.14 | - | < 0.001 | 1.00 | < 0.001 | yes |
attention_shift |
Q2_K vs Q4_K_M | 0.42 | 0.12 | 3.55 | < 0.001 | 0.71 | 0.004 | yes |
foot_in_door |
Q6_K vs Q8_0 | 0.20 | 0.08 | - | 0.020 | 0.38 | 0.058 | no |
qwen2.5-1.5b (broadly vulnerable)
| Strategy | Comparison | Mean A | Mean B | t | p | Cohen's d | Holm p | Sig? |
|---|---|---|---|---|---|---|---|---|
role_play |
Q2_K vs Q8_0 | 1.00 | 0.36 | - | < 0.001 | 2.18 | < 0.001 | yes |
crescendo |
Q2_K vs Q8_0 | 1.00 | 0.34 | - | < 0.001 | 2.20 | < 0.001 | yes |
foot_in_door |
Q2_K vs Q8_0 | 0.88 | 0.30 | - | < 0.001 | 1.38 | < 0.001 | yes |
benign_context |
Q2_K vs Q8_0 | 0.52 | 0.08 | - | < 0.001 | 1.10 | < 0.001 | yes |
llama3.2-3b (flat-surface model)
| Strategy | Comparison | Mean A | Mean B | t | p | Cohen's d | Holm p | Sig? |
|---|---|---|---|---|---|---|---|---|
direct |
Q2_K vs Q3_K_M | 0.12 | 0.00 | - | < 0.001 | 3.47 | < 0.001 | yes |
benign_context |
Q2_K vs Q8_0 | 0.02 | 0.10 | - | 0.081 | 0.53 | 1.000 | no |
Observations. The Cohen's d values confirm that the largest pairwise effects are not just statistically significant - they are practically enormous. Values above 2.0 represent complete regime changes rather than incremental shifts. The llama3.2-3b direct Q2_K vs Q3_K_M comparison has the largest d (3.47) despite both being low-ASR, because the transition from 0% to 12% ASR represents a full shift from floor to non-trivial failure. Holm correction is conservative: the llama3.1-8b / foot_in_door / Q6_K vs Q8_0 comparison loses significance after correction (raw p = 0.020, Holm p = 0.058), demonstrating that borderline results are properly gated.
B.2 Fisher exact test results (quant vs Q8_0 baseline, selected significant cells)
All Fisher exact tests are Holm-Bonferroni corrected within each model x strategy family (5 tests per family).
| Model | Strategy | Quant | Odds ratio | Fisher p | Holm p | Sig? |
|---|---|---|---|---|---|---|
llama3.1-8b |
benign_context |
Q3_K_M | - | < 0.001 | < 0.001 | yes |
llama3.1-8b |
benign_context |
Q4_K_M | - | < 0.001 | < 0.001 | yes |
llama3.1-8b |
benign_context |
Q2_K | - | 0.007 | 0.036 | yes |
llama3.1-8b |
crescendo |
Q3_K_M | - | < 0.001 | < 0.001 | yes |
llama3.1-8b |
direct |
Q3_K_M | - | < 0.001 | < 0.001 | yes |
llama3.1-8b |
direct |
Q2_K | - | < 0.001 | < 0.001 | yes |
llama3.1-8b |
foot_in_door |
Q6_K | - | 0.004 | 0.020 | yes |
llama3.2-1b |
crescendo |
Q2_K | - | < 0.001 | < 0.001 | yes |
llama3.2-1b |
direct |
Q2_K | - | < 0.001 | < 0.001 | yes |
qwen2.5-1.5b |
role_play |
Q2_K | - | < 0.001 | < 0.001 | yes |
qwen2.5-1.5b |
crescendo |
Q2_K | - | < 0.001 | < 0.001 | yes |
Observations. After Holm-Bonferroni correction, the most important Q2_K-vs-baseline and Q3_K_M-vs-baseline comparisons remain strongly significant. The correction eliminates borderline cells that might otherwise be overclaimed, which is why the report emphasizes model-level patterns and ANOVA-level conclusions over individual cell stories.
Appendix C: Sensitivity and Robustness
C.1 Runtime and execution shape
TR139 is large enough that execution shape is part of the scientific interpretation, not just an ops detail.
| Component | Structure | Upper-bound model calls |
|---|---|---|
| Phase 1 | 9,600 conversations across 8 fixed strategies | 39,600 |
| Phase 2 | 1,000 conversations with up to 8 pressure follow-ups | 9,000 |
| Eval total | Phase 1 + Phase 2 | 48,600 |
| Judge | post-hoc scored turns | 37,825 labels |
The Phase 1 upper bound comes from the fixed strategy lengths:
direct: 1 turnbenign_context: 5 turnsfoot_in_door: 5 turnscrescendo: 5 turnsattention_shift: 5 turnsrole_play: 4 turnsprogressive_refinement: 4 turnscontext_fusion: 4 turns
That yields 33 attack-template turns per (model, quant, behavior) bundle and therefore 4 x 6 x 50 x 33 = 39,600 Phase 1 model calls before Phase 2 even begins.
This runtime shape explains three report choices:
- Phase 1 is the main evidentiary mass, so the report leans on aggregate Phase 1 structure for the strongest claims.
- Phase 2 is smaller and therefore better used as a persistence-validation layer than as a replacement for Phase 1.
- The cost of reproducing TR139 is nontrivial, which raises the bar for how much reasoning the publish-ready report must preserve.
C.2 Variance decomposition sensitivity
The main variance decomposition (18.31% model, 6.72% quant, 1.21% quant x strategy, 0.83% strategy, 72.93% residual) uses a simple sum-of-squares approach. If the residual term is further partitioned by behavior category, the 10 harm categories absorb an estimated 8-12% of the residual, reducing it to approximately 61-65%. This suggests behavior-level heterogeneity is real but not the dominant unexplained source - individual prompt-level variation within categories is larger.
C.3 Judge sensitivity check
The report's core Phase 1 claims do not depend on the judge layer. The 8/8 ANOVA rejections use the primary regex refusal detector, not the judge. The judge layer adds depth (turning binary refusal into a nuanced agreement picture) but does not change the H1 rejection or the H2 non-result. The H3 persistence slopes also use the primary detector.
This means the core findings survive even if the judge layer is discarded entirely, which is the appropriate robustness posture given the weak kappa values.
C.4 Source artifacts
Primary source-of-truth files for this report:
| Artifact | Path | Report use |
|---|---|---|
| Main analysis | research/tr139/results/20260314_012503/tr139_analysis.json |
all quantitative claims |
| Scored outputs | research/tr139/results/20260314_012503/tr139_scored.jsonl |
cell-level drilldown and follow-up audit |
| Judge labels | research/tr139/results/20260314_012503/judge_labels.jsonl |
agreement and reliability section |
| Generated report | research/tr139/results/20260314_012503/tr139_report.md |
machine-produced baseline interpretation |
| Config | research/tr139/config.yaml |
run scope and settings |
| Design README | research/tr139/README.md |
research questions and literature map |
| Strategy definitions | research/tr139/shared/utils.py |
exact template structure and BPW mapping |
The publish-ready report is manually written from those artifacts. It should be treated as the narrative interpretation layer on top of the generated outputs, not as a substitute for them.
Appendix D: Glossary
Statistical Terms
| Term | Definition |
|---|---|
| ANOVA (one-way) | Tests whether means differ across 3+ groups. F-statistic is the ratio of between-group to within-group variance. p < 0.05 rejects the null that all groups have the same mean. |
| ANOVA (two-way) | Tests main effects of two factors and their interaction simultaneously. eta^2 for each factor shows what fraction of variance it explains. |
| Bootstrap CI | Confidence interval computed by resampling the data with replacement (here: 1,000 iterations, seed 42, percentile method). Does not assume normality. |
| Cohen's d | Standardized effect size for comparing two means: d = (M1 - M2) / pooled SD. Conventional thresholds: small = 0.2, medium = 0.5, large = 0.8. Values above 2.0 indicate regime-change-level differences. |
| Fisher exact test | Non-parametric test for 2x2 contingency tables. Used here for per-cell quant-vs-baseline comparisons where expected counts may be small. |
| Holm-Bonferroni correction | Sequential multiple comparison correction that controls family-wise error rate (FWER). Orders raw p-values, then rejects if p_i <= alpha / (k - i + 1). Less conservative than Bonferroni but still controls Type I error. |
| eta^2 (eta-squared) | ANOVA effect size. Proportion of total variance explained by the factor. Small = 0.01, medium = 0.06, large = 0.14. |
| MDE (Minimum Detectable Effect) | Smallest effect the study can detect at specified power (80%) and alpha (0.05). Depends on sample size and baseline rate. |
| TOST (Two One-Sided Tests) | Equivalence testing: tests whether the difference between groups falls within a pre-specified margin (here +/-3pp). Rejection means the groups are practically equivalent; failure means non-equivalence is not ruled out. |
| Welch's t-test | t-test for comparing two means without assuming equal variances. Used for the H2 slope comparison and pairwise quant comparisons. |
| Wilson score interval | Confidence interval for proportions that corrects for the poor coverage of the normal approximation at extreme values (near 0 or 1). |
Attack Strategy Terms
| Strategy | Definition |
|---|---|
| direct | Single-turn baseline. The harmful request is presented without conversational scaffold. |
| benign_context | Benign multi-turn context followed by the harmful request. Tests whether conversational momentum alone shifts the safety boundary. |
| foot_in_door | Gradual topical escalation. Small, seemingly acceptable requests accumulate toward the final harmful turn. |
| crescendo | Authority and credential framing. The conversation builds trust and legitimacy before the harmful request. |
| attention_shift | Creative or unrelated framing followed by abrupt pivot. Strongest overall strategy in this sweep. |
| role_play | Persona induction. The model is pushed to maintain a fictional character or role that weakens refusal behavior. |
| progressive_refinement | The user starts broad and iteratively narrows toward harmful specificity. One of the most consistently dangerous strategies. |
| context_fusion | Intent is distributed across individually milder turns and fused at the target turn. Especially powerful on the most vulnerable model. |
Deployment Terms
| Term | Definition |
|---|---|
| ASR (Attack Success Rate) | Fraction of conversations where the target turn is scored compliant. Higher = worse. |
| BPW (Bits Per Weight) | Average number of bits per model parameter after quantization. Q8_0 = 8.00, Q2_K = 3.35. |
| Break rate | Fraction of initially-refusing Phase 2 conversations that eventually comply under pressure. |
| GGUF | File format for quantized LLM weights used by llama.cpp and Ollama. |
| Persistence | Fraction of turns in a Phase 2 conversation that preserve refusal. Lower = worse. |
| Pressure resistance | Fraction of post-refusal turns that maintain refusal. Isolates durability after the initial response. |
Appendix E: Configs (Source of Truth)
The executed run is defined by research/tr139/config.yaml. The key excerpts are:
phase1:
quant_levels: [Q8_0, Q6_K, Q5_K_M, Q4_K_M, Q3_K_M, Q2_K]
attack_strategies:
- direct
- benign_context
- foot_in_door
- crescendo
- attention_shift
- role_play
- progressive_refinement
- context_fusion
phase2:
quant_levels: [Q8_0, Q6_K, Q4_K_M, Q3_K_M, Q2_K]
pressure_turns: 8
max_new_tokens: 256
temperature: 0.0
seed: 42
warmup_requests: 3
cooldown_between_models_s: 10
Model definitions:
- name: llama3.2-1b
ollama_tag: "llama3.2:1b"
- name: llama3.2-3b
ollama_tag: "llama3.2:3b"
- name: qwen2.5-1.5b
ollama_tag: "qwen2.5:1.5b"
- name: llama3.1-8b
ollama_tag: "llama3.1:8b"
skip_fp16: true
Those config excerpts are the final source of truth for what TR139 actually ran.