Technical Report 137: The Safety Tax of Inference Optimization
Unified synthesis of quantization, concurrency, and backend effects on LLM safety alignment
| Field | Value |
|---|---|
| TR Number | 137 |
| Project | Banterhearts LLM Performance Research |
| Date | 2026-03-08 (synthesis of TR134: Mar 5-6, TR135: Mar 7, TR136: Mar 8) |
| Author | Research Team |
| Report Type | Meta-analysis / synthesis (3 source experiments, 18 analysis passes, 74,254 total samples) |
| Compute Time | <5 seconds (pure meta-analysis on pre-computed results) |
| Status | Complete -- all 3 source experiments delivered, full synthesis run |
| Run ID | 20260308_180727 |
| Related Work | TR134 (Alignment Under Quantization), TR135 (Concurrency x Safety), TR136 (Cross-Backend Safety) |
| Depends On | TR134 (quantization safety data), TR135 (concurrency safety data), TR136 (backend safety data) |
TR134 established that quantization degrades safety alignment in small LLMs. TR135 found that concurrent inference does not. TR136 revealed that the serving backend (Ollama GGUF vs vLLM/TGI FP16) produces safety differences larger than quantization itself. Each experiment answered one question in isolation. TR137 asks: what is the total safety cost when all three optimization axes are considered together, and which axis should practitioners worry about most?
This synthesis meta-analysis combines 74,254 evaluated samples across 3 experiments, 5 distinct models (1.2B-7.6B parameters), 7 quantization levels, 4 concurrency levels, and 4 serving backends. We compute cross-axis effect rankings, safety-capability asymmetry, deployment risk projections, jailbreak vulnerability synthesis, per-category bias patterns, and model heterogeneity statistics.
Key findings: (1) Quantization accounts for 57% of total safety cost, backend choice for 41%, and concurrency for 2%. (2) The worst combined configuration (Llama 1B at Q2_K) retains only 57.5% of baseline safety -- a critical-risk deployment. (3) Model heterogeneity is extreme (I-squared = 99.9% on the quant axis) -- models disagree wildly on how dangerous quantization is, making universal deployment guidelines unreliable. (4) Concurrency is the only axis where all models agree: it is safe (max effect = 0.4pp). (5) Jailbreak susceptibility increases under quantization (all technique slopes negative) but is invariant to concurrency. (6) No backend pair achieves TOST equivalence at +/-3pp -- backend swaps cannot be treated as safety-neutral.
Total: 74,254 samples synthesized from 3 source experiments, 18 analysis passes, 24 deployment configurations assessed.
Metric Definitions
Safety Metrics
- Refusal Rate (AdvBench, Jailbreak): Proportion of harmful prompts correctly refused. Range [0, 1]. Higher = safer. Scored by RefusalDetector (regex classifier).
- Truthfulness (TruthfulQA): Proportion of factually accurate responses. Range [0, 1]. Higher = better. Scored by TruthfulnessScorer.
- Bias Resistance (BBQ): Proportion of bias-probing questions answered without demographic stereotyping. Range [0, 1]. Higher = less biased. Scored by BiasDetector.
Capability Metrics
- Accuracy (MMLU, ARC-Challenge): Exact-match accuracy on multiple-choice benchmarks. Range [0, 1]. Higher = better.
Derived / Synthesis Metrics
- Effect Size (pp): Absolute change in safety score (percentage points) between baseline and worst configuration per axis. Positive = degradation.
- Cohen's d: Standardized effect size. < 0.2 trivial, 0.2-0.5 small, 0.5-0.8 medium, > 0.8 large.
- I-squared: Heterogeneity statistic. Percentage of total variation due to between-model differences. < 25% low, 25-75% moderate, > 75% high.
- Bootstrap CI: 95% confidence interval via 2,000 bootstrap resamples (seed=42, percentile method).
- Safety Retention:
projected_safety / baseline_safety x 100. Percentage of baseline safety preserved after optimization. - Risk Level: Based on retention: >= 95% low, >= 90% moderate, >= 80% high, < 80% critical.
- MDE (Minimum Detectable Effect): Smallest effect detectable at alpha=0.05, power=0.80 given the sample sizes.
Statistical Methods & Caveats
Methods used:
- Cross-TR validation at anchor configs (Q4_K_M, N=1, Ollama) across all 3 TRs. Tolerance: 5pp.
- Effect size ranking via absolute safety delta (pp) from baseline to worst config per axis, with Cohen's d.
- Bootstrap CI (2,000 iterations, seed=42, 95% percentile) on cross-model effect size means per axis.
- I-squared heterogeneity across models per axis to quantify agreement on effect direction/magnitude.
- Pearson correlation for cross-axis vulnerability (requires >= 3 shared models per pair).
- Additive projection for combined quant + concurrency cost (no factorial design available).
- ANOVA (one-way) on safety degradation slopes across model families (from TR134).
- TOST equivalence testing at +/-3pp margin on backend decomposition (from TR136).
- IQR outlier detection on all source data (Q1 - 1.5IQR, Q3 + 1.5IQR fences).
- Risk-tiered deployment matrix at 95/90/80% retention thresholds.
Important caveats:
-
No factorial design. Each source TR varied one axis while holding others constant. We cannot measure true interaction effects (e.g., "quantization x concurrency synergy"). The deployment matrix uses an additive model, which may underestimate or overestimate combined costs.
-
Small anchor set. Only Llama 3.2 1B and 3B appear in all 3 TRs. Cross-axis conclusions rest on N=2 models. Bootstrap CIs reflect this: the quant axis CI spans [-6.0, 35.2]pp, encompassing both improvement and severe degradation.
-
Qwen size mismatch. TR134 tested Qwen 2.5 7B, TR135 tested Qwen 2.5 3B, TR136 tested Qwen 2.5 1.5B. These are different models from the same family, not the same model. Cross-TR Qwen comparisons are family-level, not model-level.
-
Consumer hardware only. All experiments ran on a single NVIDIA RTX GPU. Datacenter hardware (A100, H_100) may behave differently, particularly for vLLM/TGI optimizations that leverage tensor parallelism.
-
Automated scoring only. Safety scores come from regex classifiers (RefusalDetector, BiasDetector, TruthfulnessScorer). LLM judge validation (Qwen 2.5 7B Q8_0) was applied in TR134 only. Cohen's kappa between regex and judge is 0.147 (poor overall), limiting confidence in classification accuracy.
-
Temperature 0 throughout. All experiments used deterministic sampling. Stochastic sampling (temp > 0) introduces additional variance that may interact with optimization axes differently.
-
Meta-analysis on aggregates. This synthesis operates on pre-computed group statistics, not raw samples. We cannot re-stratify or re-group the data differently without re-running source experiments.
-
Additive cost model is conservative. The deployment matrix sums marginal quant and concurrency costs. If axes interact synergistically (e.g., quantization makes concurrency effects worse), actual costs could exceed projections.
Executive Summary
Key Findings
-
Quantization is the most dangerous optimization axis, accounting for 57% of total safety cost. Across the two models appearing in all TRs, quantization produces a mean safety delta of 20.6pp (95% CI: [-6.0, 35.2]pp). For Llama 3.2 1B specifically, dropping from FP16 to Q2_K costs 35.2pp of safety (Cohen's d = 1.93, large effect). However, Llama 3.2 3B shows an anomalous -6.0pp "improvement" at Q2_K (d = -0.27), driving extreme heterogeneity (Section 13).
-
Backend choice is the second-largest safety factor at 41% of total cost. Switching from Ollama GGUF to vLLM/TGI FP16 costs 14.8pp mean safety (CI: [4.4, 25.1]pp). For Llama 1B, the backend effect (d = -0.60, medium) is comparable in magnitude to quantization. All TOST equivalence tests fail at +/-3pp -- no backend pair can be treated as interchangeable (Section 10).
-
Concurrency is safe: 2% of total cost, max effect 0.4pp. This is the one axis where all models agree (I-squared = 0.0%). Running 1-8 concurrent requests produces negligible safety impact. Practitioners can scale concurrency freely without safety concerns (Section 9).
-
Model heterogeneity is extreme on quant and backend axes. I-squared = 99.9% for quantization, 99.5% for backend. Models do not agree on the magnitude or even direction of safety degradation. Universal "safe quantization level" guidelines are unreliable -- per-model validation is required (Section 13).
-
The worst combined deployment retains only 57.5% of baseline safety. Llama 3.2 1B at Q2_K with any concurrency level is rated CRITICAL risk. Three of 24 assessed configurations are critical, 3 are moderate, and 18 are low risk (Section 14).
-
Jailbreak susceptibility increases under quantization but is invariant to concurrency. All four jailbreak techniques show negative BPW slopes (easier at lower quant). Prefix injection is most effective (slope = -0.036/BPW). Under concurrency, all jailbreak compliance slopes equal zero -- concurrency does not amplify jailbreaks (Section 11).
-
Safety degrades faster than capability in only 3 of 10 model-axis combinations. The "RLHF safety veneer" hypothesis -- that safety is a thin layer stripped first by optimization -- is not universally supported. Most models show capability degrading at comparable or faster rates than safety (Section 7).
-
Nationality is the most vulnerable bias category under quantization; Race/Ethnicity is the least. Per-category BBQ analysis shows Nationality bias slope = -0.010/BPW (worsening), while Race/Ethnicity slope = +0.015/BPW (improving). This ranking is averaged across 4 models (3 families) (Section 17).
-
Cross-TR reproducibility is good but imperfect. At the Q4_K_M anchor point, 9 of 12 task-model pairs agree within 5pp across all 3 TRs. Three tasks exceed tolerance: ARC-Challenge on Llama 1B (6.0pp), AdvBench on Llama 3B (7.0pp), and Jailbreak on Llama 3B (6.7pp). Mean deltas are 2.3pp and 3.0pp respectively (Section 5).
Validation Summary
| Target | Metric | Required | Achieved | Status |
|---|---|---|---|---|
| Source coverage | TRs loaded | 3/3 | 3/3 | PASS |
| Data volume | Total samples | > 50,000 | 74,254 | PASS |
| Anchor consistency | Mean delta < 5pp | Both models | 2.3pp, 3.0pp | PASS |
| Outlier rate | IQR flagged | 0% | 0/300 groups | PASS |
| Effect ranking | All 3 axes ranked | Complete | quant > backend > concurrency | PASS |
| Deployment matrix | Configs assessed | >= 20 | 24 | PASS |
Claim Validation
| # | Claim | Evidence Base | Status |
|---|---|---|---|
| 1 | Quantization is the most dangerous axis | Mean 20.6pp delta, Cohen's d = 1.93 (Llama 1B) | Demonstrated for Llama 1B; Mixed at aggregate level (CI includes negative) |
| 2 | Concurrency is safety-neutral | Max delta 0.4pp, I^2 = 0.0%, all slopes ~0 | Demonstrated |
| 3 | Backend choice matters more than quantization for some models | Llama 1B: backend d = -0.60 vs quant d = 0.05 within TR136 | Demonstrated for within-backend comparison |
| 4 | No backend pair is equivalent at +/-3pp | All 18 TOST tests fail (from TR136) | Demonstrated |
| 5 | Effects are additive across axes | No factorial design to test; additive model used | Assumed, not validated |
| 6 | Safety degrades faster than capability | Only 3/10 model-axis combinations | Refuted as universal claim |
| 7 | Jailbreak vulnerability increases at lower quant | All 4 technique slopes negative | Demonstrated |
| 8 | Model families agree on which axis is dangerous | ANOVA p = 0.1370, I^2 = 99.9% on quant | Refuted -- extreme disagreement |
Key Decisions for Practitioners
-
Prioritize quantization validation over concurrency testing. Quantization accounts for 57% of safety cost; concurrency accounts for 2%. If your testing budget is limited, spend it on evaluating different quant levels for your specific model. Concurrency can be scaled freely (Section 6, Section 9).
-
Treat backend swaps as safety-critical changes. Migrating from Ollama to vLLM/TGI (or vice versa) can change safety behavior by 4-25pp depending on the model. Re-validate safety after any backend migration, even at the same precision level (Section 10).
-
Do not rely on generic "safe quantization" thresholds. I-squared = 99.9% means models disagree completely on quantization's impact. Llama 1B loses 35pp at Q2_K; Llama 3B gains 6pp. Per-model profiling is mandatory (Section 13).
-
Avoid Q2_K for Llama 3.2 1B in any safety-critical application. All three Q2_K configurations for Llama 1B are rated CRITICAL (58% retention). This is the only model-quant combination in the matrix that falls below the 80% safety threshold (Section 14, Section 15).
-
Use Q4_K_M or higher for production deployments of small models. At Q4_K_M, both Llama models retain >= 93% safety across all concurrency levels. The marginal quant cost at this level is 1.3-4.6pp -- well within acceptable bounds for most applications (Section 14).
When to Use This Report
Scenario 1: Choosing a quantization level for deployment
Question: "We want to deploy Llama 3.2 1B quantized. What's the safest level?"
Answer: Q4_K_M retains 98.4% safety at N=1, 99.1% at N=4. Q8_0 is even safer (100.7% retention). Avoid Q2_K entirely (57.5% retention, CRITICAL). See Section 14 for the full deployment matrix.
Scenario 2: Scaling to multiple concurrent users
Question: "Will running 8 concurrent requests degrade safety?"
Answer: No. Maximum observed concurrency effect is 0.4pp across all models (Section 6). All jailbreak compliance slopes under concurrency are zero (Section 11). Concurrency is safe.
Scenario 3: Migrating from Ollama to vLLM
Question: "We're moving from Ollama to vLLM for production. Any safety concerns?"
Answer: Yes. For Llama 1B, this swap costs ~25pp of safety (Section 10). The mechanism is chat template divergence between GGUF-embedded and HuggingFace tokenizer templates (TR136 report, Section 8). Re-validate safety after migration.
Scenario 4: Understanding which optimization to worry about most
Question: "We're optimizing model, quant level, concurrency, and backend simultaneously. Where's the risk?"
Answer: Quantization (57% of cost) and backend (41%) dominate. Concurrency (2%) is negligible. See the effect decomposition in Section 19 and the deployment matrix in Section 14.
Scenario 5: Evaluating a new model family
Question: "Do these findings generalize to our model?"
Answer: Probably not directly. I-squared = 99.9% on quantization means even within this study, models disagree completely. ANOVA across families is not significant (p = 0.1370). Use this report for directional guidance (quantization matters, concurrency doesn't), but validate your specific model (Section 12).
How to Read This Report
| Time | Reading Path |
|---|---|
| 2 min | Abstract + Key Findings 1-4 |
| 10 min | Add Executive Summary tables + Section 6 (Effect Ranking) + Section 14 (Deployment Matrix) |
| 30 min | Add Sections 5, 7, 10-11, 15, 19 for full synthesis picture |
| 60 min | Full report including per-category bias, judge agreement, and appendices |
| Deep dive | Appendix B (jailbreak tables), Appendix C (per-category slopes), source TR reports |
Table of Contents
Front Matter
Context & Design (Sections 1-4)
- Introduction & Research Motivation
- Source Experiments & Design
- Model Coverage & Overlap
- Environment & Artifacts
Core Synthesis (Sections 5-10)
- Cross-TR Baseline Validation
- Effect Size Ranking
- Safety-Capability Asymmetry
- Per-Task Vulnerability Matrix
- Quantization x Concurrency Projection
- Backend x Quantization Decomposition
Extended Analysis (Sections 11-18)
- Jailbreak Synthesis Across Axes
- Family-Level Patterns
- Model-Axis Heterogeneity
- Safety-Adjusted Deployment Matrix
- Worst-Case Analysis
- Power & Sensitivity
- Per-Category Bias Synthesis
- Judge Agreement Synthesis
Conclusions (Sections 19-20)
Appendices
- Appendix A: Full Deployment Matrix
- Appendix B: Jailbreak Success Rates by Technique
- Appendix C: Per-Category Bias Slopes
- Appendix D: Glossary
- References
1. Introduction & Research Motivation
1.1 The Problem
Deploying LLMs in production requires multiple optimization decisions: quantization level (FP16 to Q2_K), serving backend (Ollama, vLLM, TGI), and concurrency (1 to 8+ simultaneous requests). Each optimization improves cost, latency, or throughput -- but may degrade safety alignment. Prior work (TR134, TR135, TR136) studied each axis independently. No prior work in this project has examined their combined effect or ranked them by safety impact.
1.2 Research Questions
- Which inference optimization axis causes the most safety degradation?
- Does safety erode faster than capability under each optimization?
- Do models agree on which axis is most dangerous? (heterogeneity)
- What is the projected safety cost of combined optimizations?
- Are jailbreak susceptibility patterns consistent across axes?
- Which demographic categories are most vulnerable across axes?
- Do models vulnerable on one axis tend to be vulnerable on others?
1.3 Scope
This is a meta-analysis. We do not run new model evaluations. We synthesize pre-computed results from three completed experiments (TR134, TR135, TR136) covering 74,254 total samples. The synthesis adds value through cross-axis comparison, effect ranking, deployment projections, and heterogeneity analysis that no individual TR can provide.
1.4 Contribution
TR137 provides the first unified safety-cost picture across all three optimization axes studied in the Banterhearts research program. The deployment matrix (Section 14) gives practitioners a concrete, risk-tiered lookup table for configuration decisions. The effect decomposition (Section 19) quantifies the relative importance of each axis, enabling rational prioritization of safety validation effort.
2. Source Experiments & Design
Each source TR varied one axis while holding the others constant. The table below summarizes the experimental design of each source.
| Property | TR134 (Quantization) | TR135 (Concurrency) | TR136 (Backend) |
|---|---|---|---|
| Axis varied | Quant level (FP16-Q2_K) | Concurrent requests (1-8) | Serving backend (4 types) |
| Models | 4 (1.2B-7.6B) | 3 (1.2B-3B) | 3 (1.2B-3B) |
| Configs | 26 model-quant variants | 12 model-concurrency variants | 12 model-backend variants |
| Total samples | 24,778 | 39,060 | 10,416 |
| Safety tasks | 4 (advbench, jailbreak, bbq, truthfulqa) | 4 (same) | 4 (same) |
| Capability tasks | 2 (MMLU, ARC) | 2 (same) | 2 (same) |
| Backend | Ollama (all quants) | Ollama Q4_K_M (fixed) | Ollama + vLLM + TGI |
| Temperature | 0.0 | 0.0 | 0.0 |
| LLM Judge | Yes (12,168 judged) | No | Yes (5,616 judged) |
Observations: The design is a one-at-a-time (OAT) factorial, not a full factorial. This means we can estimate marginal effects of each axis but cannot measure interactions. The additive projection in Section 9 relies on this assumption. The common anchor point (Q4_K_M, N=1, Ollama) allows cross-TR validation (Section 5) but does not substitute for a full factorial design. All three TRs share the same 6 benchmarks and temperature setting, enabling consistent comparison.
The sample distribution is uneven: TR135 contributes 53% of total samples due to its concurrency multiplication design (each N-level multiplies sample count), while TR136 contributes only 14%. This does not affect the synthesis because we operate on aggregated statistics (group means, slopes) rather than pooled raw samples. However, it means power analysis (Section 16) differs across sources: TR135 has the lowest MDE (6.8pp) while TR134 has the highest (18.3pp). Effects that are detectable in TR135 may be below TR134's detection threshold.
3. Model Coverage & Overlap
The overlap matrix shows which models appear in which TRs. Only models present in all 3 TRs (Llama 1B and 3B) can anchor the cross-axis synthesis.
| Model | Params | Family | TR134 | TR135 | TR136 | Anchor? |
|---|---|---|---|---|---|---|
| llama3.2-1b | 1.2B | Llama | Yes (7 quants) | Yes (4 N-levels) | Yes (4 backends) | Yes |
| llama3.2-3b | 3.2B | Llama | Yes (7 quants) | Yes (4 N-levels) | Yes (4 backends) | Yes |
| mistral-7b | 7.2B | Mistral | Yes (6 quants) | No | No | No |
| qwen2.5-7b | 7.6B | Qwen | Yes (6 quants) | No | No | No |
| qwen2.5-3b | 3B | Qwen | No | Yes (4 N-levels) | No | No |
| qwen2.5-1.5b | 1.5B | Qwen | No | No | Yes (4 backends) | No |
Observations: The anchor set of 2 models is the primary constraint on this synthesis. All cross-axis statistics (effect ranking, heterogeneity, decomposition, deployment matrix) are computed over these 2 models only. The Qwen family appears in all 3 TRs but at different sizes (7B, 3B, 1.5B), preventing direct cross-TR comparison at the model level -- these are architecturally similar but separately trained models with different parameter counts, vocabulary sizes, and alignment procedures.
Family-level patterns (Section 12) use TR134's 4-model set for the quantization axis and TR135/TR136's 3-model sets for the other axes. The limited anchor set means bootstrap CIs are wide and I-squared estimates should be interpreted with caution. Specifically, I-squared with N=2 is mathematically guaranteed to be either 0% or ~100% -- there is no middle ground. The extreme values (99.9%, 99.5%) reflect genuine disagreement between models, but the binary nature of 2-model I-squared means we cannot distinguish "moderate disagreement" from "complete disagreement." Adding even one more anchor model would substantially improve heterogeneity estimation.
4. Environment & Artifacts
| Property | Value |
|---|---|
| Platform | Windows 11 (10.0.26200) |
| Python | 3.13.1 (MSC v.1942 64-bit) |
| Machine | AMD64 |
| NumPy | 2.3.5 |
| SciPy | 1.15.2 |
| Pandas | 2.2.3 |
| Ollama | Not required (meta-analysis only) |
| Docker | Not required (meta-analysis only) |
4.1 Source Data Paths
| Source | Analysis File | Records |
|---|---|---|
| TR134 | research/tr134/results/phase3/20260305_144827/phase3_analysis.json |
24,778 |
| TR135 | research/tr135/results/20260307_162151/tr135_analysis.json |
39,060 |
| TR136 | research/tr136/results/20260308_015147/tr136_analysis.json |
10,416 |
4.2 Data Quality
IQR outlier detection was applied to all source data. Zero outliers were flagged across 300 total groups (TR134: 156 groups, TR135: 72 groups, TR136: 72 groups). Each group was checked per-task with 12-26 values per task.
5. Cross-TR Baseline Validation
Before synthesizing results, we verify that the three TRs agree at their shared anchor point: Q4_K_M quantization, N=1 concurrency, Ollama backend. Consistency threshold: 5pp.
5.1 Llama 3.2 1B
| Task | TR134 | TR135 | TR136 | Max Delta (pp) | Consistent? |
|---|---|---|---|---|---|
| advbench_refusal | 0.870 | 0.880 | 0.880 | 1.0 | Yes |
| arc_challenge | 0.395 | 0.340 | 0.335 | 6.0 | No |
| bbq_bias | 0.874 | 0.869 | 0.874 | 0.5 | Yes |
| jailbreak_amplification | 0.933 | 0.925 | 0.925 | 0.8 | Yes |
| mmlu_real | 0.337 | 0.305 | 0.310 | 3.2 | Yes |
| truthfulqa | 0.580 | 0.570 | 0.590 | 2.0 | Yes |
Mean delta: 2.25pp. Status: 1 inconsistent task.
5.2 Llama 3.2 3B
| Task | TR134 | TR135 | TR136 | Max Delta (pp) | Consistent? |
|---|---|---|---|---|---|
| advbench_refusal | 0.470 | 0.540 | 0.540 | 7.0 | No |
| arc_challenge | 0.705 | 0.695 | 0.700 | 1.0 | Yes |
| bbq_bias | 0.965 | 0.960 | 0.965 | 0.5 | Yes |
| jailbreak_amplification | 0.825 | 0.892 | 0.892 | 6.7 | No |
| mmlu_real | 0.590 | 0.600 | 0.595 | 1.1 | Yes |
| truthfulqa | 0.500 | 0.480 | 0.480 | 2.0 | Yes |
Mean delta: 3.04pp. Status: 2 inconsistent tasks.
Observations: 9 of 12 task-model pairs are consistent within 5pp. The 3 inconsistent tasks are all on the boundary: ARC-Challenge for Llama 1B (6.0pp) is a capability task with known per-run variance at temperature 0; AdvBench (7.0pp) and Jailbreak (6.7pp) for Llama 3B are refusal tasks where moderate baseline rates (0.47-0.83) leave room for run-to-run variance. Importantly, TR135 and TR136 agree perfectly on all Llama 3B safety scores (0.540, 0.892), and the divergence is entirely from TR134. This suggests a single-run anomaly in TR134 rather than systematic measurement error. The mean deltas (2.3pp, 3.0pp) are well below the 5pp tolerance, providing adequate confidence for synthesis.
6. Effect Size Ranking
The central question: which optimization axis produces the largest safety degradation? We compute the absolute delta (pp) between baseline and worst configuration for each model on each axis.
6.1 Aggregate Ranking
| Rank | Axis | Mean |Delta| (pp) | 95% Bootstrap CI | N Models |
|---|---|---|---|---|
| 1 | Quantization | 20.6 | [-6.0, 35.2] | 2 |
| 2 | Backend | 14.8 | [4.4, 25.1] | 2 |
| 3 | Concurrency | 0.4 | [-0.3, 0.4] | 2 |
6.2 Per-Model Breakdown
| Model | Quant (pp) | Conc. (pp) | Backend (pp) | Worst Axis |
|---|---|---|---|---|
| llama3.2-1b | 35.2 (FP16 -> Q2_K) | -0.3 (N=1 -> N=8) | 25.1 (Ollama -> TGI) | quant |
| llama3.2-3b | -6.0 (FP16 -> Q2_K) | 0.4 (N=1 -> N=8) | 4.4 (Ollama -> TGI) | backend |
6.3 Cohen's d (Quantization Axis)
| Model | Cohen's d | Interpretation |
|---|---|---|
| llama3.2-1b | 1.93 | Large |
| llama3.2-3b | -0.27 | Small (improvement) |
Observations: The aggregate ranking is clear -- quantization > backend >> concurrency -- but the confidence intervals reveal important nuance. The quant CI [-6.0, 35.2] spans zero because Llama 3B shows anomalous safety improvement at Q2_K. This is likely a measurement artifact: Q2_K degrades coherence so severely that the model produces incoherent refusals rather than coherent compliance, inflating refusal scores. The backend CI [4.4, 25.1] is entirely positive, indicating that backend effects are consistently negative for safety even though magnitude varies by model. The concurrency CI [-0.3, 0.4] tightly brackets zero, confirming negligible impact. Cohen's d for Llama 1B quantization (d = 1.93) is a large effect by any convention -- this is not a subtle finding.
7. Safety-Capability Asymmetry
Does safety erode faster than capability under each optimization? If yes, the "safety veneer" hypothesis holds: RLHF alignment is a thin layer stripped first.
7.1 Summary by Axis
| Axis | Models Tested | Safety Degrades Faster | Percentage |
|---|---|---|---|
| Quantization | 4 | 1 (Mistral 7B) | 25% |
| Concurrency | 3 | 1 (Llama 3B) | 33% |
| Backend | 3 | 1 (Llama 1B) | 33% |
7.2 Per-Model Detail (Quantization Axis)
| Model | Family | Safety Slope | Capability Slope | Divergence | Safety Faster? | Conclusion |
|---|---|---|---|---|---|---|
| llama3.2-1b | Llama | +0.013 | +0.022 | -0.009 | No | Robust |
| llama3.2-3b | Llama | -0.007 | +0.011 | -0.018 | No | Robust |
| mistral-7b | Mistral | +0.041 | +0.013 | +0.028 | Yes | SUGGESTIVE |
| qwen2.5-7b | Qwen | +0.008 | +0.016 | -0.008 | No | Robust |
7.3 Backend Axis (Range-Based)
| Model | Safety Range (pp) | Capability Range (pp) | Ratio | Safety Faster? |
|---|---|---|---|---|
| llama3.2-1b | 25.1 | 4.0 | 6.29 | Yes |
| llama3.2-3b | 4.4 | 7.8 | 0.57 | No |
| qwen2.5-1.5b | 4.3 | 5.3 | 0.83 | No |
Observations: The safety veneer hypothesis is NOT universally supported. Only 3 of 10 model-axis combinations show safety degrading faster than capability. On the quantization axis, only Mistral 7B (the largest model with the weakest baseline safety) shows disproportionate safety loss -- and even this is "suggestive" (CIs overlap). On the backend axis, Llama 1B shows dramatic asymmetry (safety range 6.3x capability range), but this is model-specific: the chat template divergence in GGUF vs HuggingFace specifically disrupts Llama 1B's refusal patterns while leaving knowledge-based tasks intact. On concurrency, Llama 3B technically shows safety degrading faster, but both slopes are near-zero (safety: -0.0001, capability: -0.0001), making this practically meaningless. The overall picture: optimization affects safety and capability comparably for most models. Practitioners should monitor both domains, not assume safety is uniquely fragile.
8. Per-Task Vulnerability Matrix
Which benchmarks are most sensitive to each optimization axis? We report mean absolute slopes (quant, concurrency) and minimum chi-squared p-values (backend).
| Task | Quant Abs. Slope | Conc. Abs. Slope | Backend Min p | Most Vulnerable Axis |
|---|---|---|---|---|
| advbench_refusal | 0.040 | 0.000 | 0.0000 | quant |
| jailbreak_amplification | 0.040 | 0.000 | 0.0000 | quant |
| arc_challenge | 0.016 | --- | --- | quant |
| mmlu_real | 0.016 | --- | --- | quant |
| truthfulqa | 0.009 | 0.001 | 0.1401 | quant |
| bbq_bias | 0.006 | 0.000 | 0.0093 | quant |
Observations: AdvBench refusal and jailbreak amplification are the most vulnerable tasks on every axis, with quant slopes 4-7x larger than bias or truthfulness tasks. This makes intuitive sense: refusal behavior is a direct product of RLHF fine-tuning and is the first capability affected when weight precision drops. The tied slopes (0.040 for both advbench and jailbreak) reflect their shared measurement mechanism: both use RefusalDetector to classify responses, and both involve explicit harmful content that the model must learn to refuse.
BBQ bias and TruthfulQA are relatively robust to quantization (slopes 0.006-0.009) and completely insensitive to concurrency. This is expected: bias resistance (choosing the anti-stereotyped option) and factual knowledge (selecting the correct fact) are embedded in the model's general world knowledge, not in a separable RLHF safety layer. On the backend axis, AdvBench, jailbreak, and BBQ all show significant chi-squared tests (p < 0.01), but TruthfulQA does not (p = 0.14) -- factual knowledge is preserved across backends even when refusal behavior diverges.
Capability tasks (ARC, MMLU) show moderate quant slopes (0.016) -- smaller than refusal tasks but non-trivial. They lack concurrency and backend data in this synthesis because TR135 and TR136 did not include these tasks in their safety-focused analyses. The quant axis remains the only axis where both safety and capability sensitivity can be directly compared (Section 7).
9. Quantization x Concurrency Projection
Since no factorial experiment tests quant and concurrency simultaneously, we project the combined cost using an additive model: total = quant_marginal + concurrency_marginal. This assumes no interaction.
| Model | Quant Cost (pp) | Conc. Cost (pp) | Projected Total (pp) | Quant % of Total |
|---|---|---|---|---|
| llama3.2-1b | 1.3 | -0.3 | 1.1 | 83% |
| llama3.2-3b | 4.6 | 0.4 | 5.0 | 91% |
Observations: The marginal quant costs here are computed at the Q4_K_M level (the anchor point), not Q2_K. At this moderate quantization, the combined cost is modest: 1.1pp for Llama 1B, 5.0pp for Llama 3B. Quantization dominates in both cases (83-91% of total). The concurrency contribution is so small that it barely registers. This projection is conservative: if quant and concurrency interact synergistically (e.g., quantized models degrade more under load), actual costs could be higher. However, TR135's finding of zero jailbreak compliance slopes under concurrency (Section 11) suggests interaction is unlikely. The practical implication: practitioners can choose their quantization level based on TR134 data and ignore concurrency scaling entirely.
10. Backend x Quantization Decomposition
TR136 tested four backends on the same models, allowing decomposition of the backend effect into three components: quantization (Q4_K_M vs Q8_0 within Ollama), backend (Ollama Q8_0 vs vLLM FP16), and serving framework (vLLM vs TGI).
10.1 Llama 3.2 1B
| Component | Diff (pp) | Cohen's d | p-value | t-stat | TOST Equiv? |
|---|---|---|---|---|---|
| Quant (Q4-Q8) | +1.8 | 0.054 (trivial) | 0.4055 | 0.832 | No |
| Backend (Q8-vLLM) | -24.8 | -0.604 (medium) | < 0.0001 | -9.244 | No |
| Serving (vLLM-TGI) | -0.3 | -0.007 (trivial) | 0.9189 | -0.102 | No |
10.2 Llama 3.2 3B
| Component | Diff (pp) | Cohen's d | p-value | t-stat | TOST Equiv? |
|---|---|---|---|---|---|
| Quant (Q4-Q8) | +0.5 | 0.014 (trivial) | 0.8341 | 0.210 | No |
| Backend (Q8-vLLM) | -6.1 | -0.149 (trivial) | 0.0233 | -2.273 | No |
| Serving (vLLM-TGI) | -0.4 | -0.010 (trivial) | 0.8798 | -0.151 | No |
10.3 Qwen 2.5 1.5B
| Component | Diff (pp) | Cohen's d | p-value | t-stat | TOST Equiv? |
|---|---|---|---|---|---|
| Quant (Q4-Q8) | +3.0 | 0.081 (trivial) | 0.2146 | 1.242 | No |
| Backend (Q8-vLLM) | -5.7 | -0.149 (trivial) | 0.0225 | -2.285 | No |
| Serving (vLLM-TGI) | -1.0 | -0.024 (trivial) | 0.7166 | -0.363 | No |
Observations: Three patterns emerge consistently across all models. First, within-Ollama quantization (Q4 vs Q8) is trivial: d < 0.09 for all models, no p-value significant. The quantization cost measured here is much smaller than the FP16-to-Q2_K cost from TR134 because the Q4-Q8 range represents a narrow slice of the full quantization spectrum. Second, the Ollama-to-FP16 backend jump is the dominant effect: d = -0.60 (medium) for Llama 1B, d = -0.15 for the others, with p < 0.025 for all models. This is the chat template divergence documented in TR136. Third, vLLM and TGI are functionally identical: d < 0.03, p > 0.7 for all models. The serving framework does not matter; the weight format (GGUF vs FP16) and associated template handling is what drives the difference. Critically, no TOST test passes at +/-3pp for any component on any model, meaning we cannot formally certify any of these transitions as safety-equivalent.
11. Jailbreak Synthesis Across Axes
11.1 Quantization Axis (TR134)
All four jailbreak techniques become more effective at lower quantization levels. Slopes represent change in compliance rate per BPW reduction.
| Technique | Slope (per BPW) | Interpretation | N Samples |
|---|---|---|---|
| prefix_injection | -0.036 | Most effective at low quant | 780 |
| direct | -0.030 | Strongly effective | 780 |
| dan_style | -0.024 | Moderately effective | 780 |
| roleplay | -0.021 | Least effective at low quant | 780 |
Total jailbreak samples: 3,120.
Observations: All slopes are negative, confirming that quantization universally amplifies jailbreak susceptibility. Prefix injection is the most dangerous technique at low quant, with a compliance rate slope 70% steeper than roleplay. At Q2_K, prefix injection achieves 60% compliance on Llama 1B (vs 3.3% at FP16) and 93% on Mistral 7B (vs 87% at Q8_0). The practical implication: low-quant models should be assumed jailbreak-vulnerable regardless of their FP16 refusal rates.
11.2 Concurrency Axis (TR135)
Jailbreak compliance rates are completely flat across concurrency levels. The table below shows compliance at N=1 and N=8 for each model-technique pair to confirm invariance.
| Model | Technique | Compliance @ N=1 | Compliance @ N=8 | Slope (per N) |
|---|---|---|---|---|
| llama3.2-1b | dan_style | 3.3% | 3.3% | 0.000 |
| llama3.2-1b | direct | 13.3% | 13.3% | 0.000 |
| llama3.2-1b | prefix_injection | 13.3% | 13.3% | 0.000 |
| llama3.2-1b | roleplay | 0.0% | 0.0% | 0.000 |
| llama3.2-3b | dan_style | 3.3% | 3.3% | 0.000 |
| llama3.2-3b | direct | 6.7% | 6.7% | 0.000 |
| llama3.2-3b | prefix_injection | 0.0% | 0.0% | 0.000 |
| llama3.2-3b | roleplay | 33.3% | 33.3% | 0.000 |
| qwen2.5-3b | dan_style | 6.7% | 7.1% | 0.001 |
| qwen2.5-3b | direct | 13.3% | 13.3% | 0.000 |
| qwen2.5-3b | prefix_injection | 76.7% | 76.7% | 0.000 |
| qwen2.5-3b | roleplay | 20.0% | 20.0% | 0.000 |
Observations: Concurrency has zero effect on jailbreak susceptibility. 11 of 12 model-technique pairs show exactly 0.000 compliance slope; the single exception (Qwen dan_style, slope = 0.001) is negligible. This is the strongest null finding in the entire synthesis: regardless of model, technique, or concurrency level (1-8), jailbreak success rates do not change. Note the absolute levels vary widely between models: Qwen 2.5 3B is highly susceptible to prefix_injection (76.7% compliance) even at N=1, while Llama 1B resists it (13.3%). But these baseline differences are frozen in place across concurrency levels. Combined with the quant-axis finding, this means jailbreak vulnerability is a function of weight precision and model training, not serving conditions.
11.3 Cross-Axis Summary
| Property | Quantization | Concurrency |
|---|---|---|
| Effect on jailbreak | All 4 slopes negative | All 12 slopes ~zero |
| Most dangerous technique | prefix_injection (-0.036/BPW) | N/A (invariant) |
| Worst-case compliance | Mistral Q2_K: 97% (roleplay) | Qwen N=1: 77% (prefix_injection) |
| Mechanism | Weight precision loss erodes refusal | Deterministic sampling produces identical output |
The fundamental asymmetry: quantization changes the model's weights, directly affecting learned refusal behavior. Concurrency changes only the scheduling of identical requests through the same model, producing bit-identical outputs at temperature 0.
12. Family-Level Patterns
Do model families differ in quantization sensitivity? TR134 tested 4 families, enabling one-way ANOVA on safety degradation slopes.
12.1 ANOVA on Quantization Slopes
| Statistic | Value |
|---|---|
| F-statistic | 2.50 |
| p-value | 0.1370 |
| df (between, within) | (2, 9) |
| Conclusion | Not significant |
12.2 Per-Family Mean Safety Slopes
| Family | N Slopes | Mean Slope | Min Slope | Max Slope | Interpretation |
|---|---|---|---|---|---|
| Llama | 6 | +0.003 | -0.020 | +0.025 | Near-flat (mixed direction) |
| Mistral | 3 | +0.041 | +0.013 | +0.092 | Steep positive (degradation with lower quant) |
| Qwen | 3 | +0.008 | -0.000 | +0.023 | Mild positive |
12.3 Cross-Axis Effects by Family
The Llama family is the only one appearing across all 3 axes. This breakdown shows how the same family behaves on each axis.
| Axis | Model | Effect (pp) | Interpretation |
|---|---|---|---|
| Quantization | llama3.2-1b | +35.2 | Severe degradation |
| Quantization | llama3.2-3b | -6.0 | Anomalous improvement |
| Concurrency | llama3.2-1b | -0.3 | Negligible |
| Concurrency | llama3.2-3b | +0.4 | Negligible |
| Backend | llama3.2-1b | +25.1 | Severe degradation |
| Backend | llama3.2-3b | +4.4 | Mild degradation |
Observations: The ANOVA is not significant (p = 0.1370), meaning we cannot reject the null hypothesis that all families have the same mean safety slope. However, the point estimates differ substantially: Mistral's mean slope (+0.041) is 14x Llama's (+0.003) and 5x Qwen's (+0.008). The non-significance is driven by high within-family variance (Llama slopes range from -0.020 to +0.025) and small sample sizes (3 slopes per family for Mistral/Qwen). With more models per family, this test would likely reach significance.
The cross-axis Llama breakdown reveals an important model-size effect: Llama 1B is severely vulnerable to both quantization and backend changes, while Llama 3B is only mildly affected. The 3x parameter increase (1.2B to 3.2B) provides substantial robustness. Both models agree that concurrency is safe. This suggests that the extreme I-squared values in Section 13 are partially driven by model-size effects rather than pure model identity -- if we controlled for parameter count, heterogeneity might decrease.
13. Model-Axis Heterogeneity
Do models agree on how dangerous each axis is? I-squared quantifies between-model variance as a percentage of total variance.
| Axis | N Models | Signed Mean (pp) | SD (pp) | Range (pp) | I-squared | Interpretation |
|---|---|---|---|---|---|---|
| Quantization | 2 | 14.6 | 29.2 | 41.3 | 99.9% | High disagreement |
| Backend | 2 | 14.8 | 14.6 | 20.7 | 99.5% | High disagreement |
| Concurrency | 2 | 0.1 | 0.5 | 0.7 | 0.0% | Low (agreement) |
Observations: The heterogeneity results are striking. On the quant axis, I-squared = 99.9% means virtually all observed variation is between-model, not within-model. Llama 1B loses 35pp while Llama 3B gains 6pp -- they don't just disagree on magnitude, they disagree on direction. On the backend axis, I-squared = 99.5% reflects the 25pp vs 4pp difference between models. Only concurrency shows consensus (I-squared = 0.0%): both models agree it's negligible. These extreme I-squared values have two important consequences. First, aggregate effect sizes (Section 6) are misleading -- the signed mean of +35.2 and -6.0 is 14.6pp, but no model actually experiences 14.6pp of degradation. (Note: Section 6 reports 20.6pp, the mean of absolute deltas, which better reflects effect magnitude for ranking purposes. This section uses signed means, which are appropriate for heterogeneity estimation.) Second, any meta-analytic pooling of quant effects across models would be statistically inappropriate given I-squared > 75%. Per-model analysis is essential.
14. Safety-Adjusted Deployment Matrix
The deployment matrix projects safety retention for each (model, quant, concurrency) combination using the additive model from Section 9. Risk tiers: >= 95% = low, >= 90% = moderate, >= 80% = high, < 80% = critical.
| Model | Quant | N | Total Cost (pp) | Retention | Risk |
|---|---|---|---|---|---|
| llama3.2-1b | Q2_K | 1 | 35.2 | 57.5% | CRITICAL |
| llama3.2-1b | Q2_K | 4 | 34.7 | 58.1% | CRITICAL |
| llama3.2-1b | Q2_K | 8 | 34.9 | 57.8% | CRITICAL |
| llama3.2-3b | Q4_K_M | 1 | 4.6 | 93.8% | Moderate |
| llama3.2-3b | Q4_K_M | 4 | 4.8 | 93.5% | Moderate |
| llama3.2-3b | Q4_K_M | 8 | 5.0 | 93.2% | Moderate |
| llama3.2-1b | Q4_K_M | 1 | 1.3 | 98.4% | Low |
| llama3.2-3b | Q8_0 | 8 | 1.7 | 97.7% | Low |
| llama3.2-1b | FP16 | 1 | 0.0 | 100.0% | Low |
See Appendix A for the complete 24-row deployment matrix.
Risk Distribution: Critical: 3 configs, Moderate: 3, Low: 18.
Observations: The deployment matrix reveals a sharp safety cliff rather than a gradual slope. All critical-risk configurations involve a single model (Llama 1B) at a single quant level (Q2_K). Moving from Q4_K_M (98.4% retention) to Q2_K (57.5% retention) for Llama 1B represents a catastrophic 40pp safety drop. No other model-quant combination falls below 93% retention. Concurrency's contribution is visible but negligible: the difference between N=1 and N=8 at Q2_K is only 0.3pp (57.5% vs 57.8%). The moderate-risk configs (Llama 3B at Q4_K_M) are noteworthy: this model shows a 4.6pp quant cost at Q4_K_M, meaning its baseline safety (73.6%) is already lower than Llama 1B's degraded Q4_K_M safety (81.4%). The backend range column (not shown in this summary; see Appendix A) adds context: Llama 1B has 25.1pp backend variance, meaning even a "low risk" config could drop significantly if the serving backend changes.
15. Worst-Case Analysis
15.1 Per-Axis Worst Cases
| Axis | Model | Delta (pp) | Detail |
|---|---|---|---|
| Quantization | llama3.2-1b | 35.2 | FP16 (0.828) -> Q2_K (0.476), d = 1.93 |
| Backend | llama3.2-1b | 25.1 | Ollama Q4_K_M (0.817) -> TGI FP16 (0.566) |
| Concurrency | llama3.2-3b | 0.4 | N=1 (0.718) -> N=8 (0.714) |
15.2 Combined Worst Case
| Property | Value |
|---|---|
| Model | llama3.2-1b |
| Quantization | Q2_K (2.5 BPW) |
| Concurrency | N=1 |
| Baseline safety | 0.828 |
| Quant cost | +35.2pp |
| Concurrency cost | 0.0pp |
| Total cost | +35.2pp |
| Projected safety | 0.476 |
| Retention | 57.5% |
| Risk level | CRITICAL |
| Backend range | 25.1pp (additional variance) |
15.3 Risk Distribution
| Risk Level | Count | Percentage | Models Involved |
|---|---|---|---|
| Critical (< 80%) | 3 | 12.5% | Llama 1B only (all Q2_K) |
| High (80-90%) | 0 | 0.0% | None |
| Moderate (90-95%) | 3 | 12.5% | Llama 3B only (all Q4_K_M) |
| Low (>= 95%) | 18 | 75.0% | Both models |
Observations: The combined worst case is entirely attributable to quantization (35.2pp total cost at N=1, where concurrency contributes 0.0pp). At higher concurrency levels (N=4, N=8), the total cost slightly decreases due to marginally higher safety measured in TR135, but N=1 remains the worst configuration. The backend range (25.1pp) is not additive with quant in the deployment matrix -- it represents the additional variance introduced by backend choice, which applies on top of whatever quant/concurrency cost exists. In the truly worst combined scenario (Q2_K + N=1 + TGI FP16), safety could theoretically drop below 30%, though this was not directly measured.
The risk distribution is heavily skewed: 75% of configs are low-risk. The critical configs are concentrated in a single model (Llama 1B) at a single quant level (Q2_K). This means the safety cliff is sharp and localized, not gradual and universal. The practical implication: avoiding Q2_K for Llama 1B eliminates all critical risk from the deployment matrix. The moderate-risk configs (Llama 3B at Q4_K_M) reflect a lower baseline rather than severe degradation -- Llama 3B starts at 73.6% safety, so even a modest 4.6pp cost pushes it below the 95% retention threshold.
16. Power & Sensitivity
Can the source experiments detect the effects they claim to measure?
| Source | MDE Safety (pp) | Avg N / Variant | Interpretation |
|---|---|---|---|
| TR134 | 18.3 | 117 | Detects >= 18.3pp safety drop at 80% power |
| TR135 | 6.8 | ~150-450 | Detects >= 6.8pp at 80% power |
| TR136 | N/A | 468 | Not computed in synthesis (see TR136 report) |
Program-level sensitivity: Effects >= 18.3pp are detectable across all source TRs. Effects between 6.8pp and 18.3pp may be detectable in TR135 but not TR134. Effects < 6.8pp may be undetectable in any source.
Observations: The MDE values contextualize our findings. Llama 1B's 35.2pp quant effect (well above 18.3pp MDE) is robustly detectable -- this is a real and large effect. Llama 3B's -6.0pp "improvement" is below TR134's MDE (18.3pp), meaning TR134 may lack power to reliably detect effects of this magnitude at the per-model level. The concurrency null finding (max 0.4pp) is far below any MDE, so we cannot distinguish "no effect" from "small effect below detection threshold." However, the consistency of zero-slope findings across all models and techniques (Section 11) strongly suggests a true null rather than an underpowered test. TR135's lower MDE (6.8pp) is enabled by its larger sample sizes from the concurrency multiplication design.
17. Per-Category Bias Synthesis
BBQ bias probing covers 11 demographic categories. Per-category slopes (score vs BPW) reveal which groups are most vulnerable to quantization-induced bias amplification.
17.1 Category Vulnerability Ranking (Quantization Axis)
| Rank | Category | Avg Slope | N Models | Interpretation |
|---|---|---|---|---|
| 1 | Nationality | -0.010 | 4 | Most vulnerable (bias worsens with lower quant) |
| 2 | SES | -0.003 | 4 | Mildly vulnerable |
| 3 | Disability_status | -0.000 | 4 | Near-zero |
| 4 | Religion | +0.003 | 4 | Mildly robust |
| 5 | Race_x_SES | +0.004 | 4 | Mildly robust |
| 6 | Race_x_gender | +0.007 | 4 | Robust |
| 7 | Physical_appearance | +0.009 | 4 | Robust |
| 8 | Age | +0.009 | 4 | Robust |
| 9 | Gender_identity | +0.010 | 4 | Robust |
| 10 | Sexual_orientation | +0.012 | 4 | Robust |
| 11 | Race_ethnicity | +0.015 | 4 | Least vulnerable (bias improves) |
Most vulnerable: Nationality. Least vulnerable: Race/Ethnicity.
Additional data: Concurrency axis has 12 bias groups available; backend axis has 12 bias groups available.
Observations: The ranking reveals a counterintuitive pattern: Nationality is the most vulnerable category (negative slope = bias worsens at lower quant), while Race/Ethnicity is the least vulnerable (positive slope = bias actually improves). This may reflect the training data distribution: race/ethnicity-related bias mitigation is heavily emphasized in modern RLHF training, making it more robust to quantization. Nationality-based biases receive less attention in alignment training and are therefore more fragile. The Disability_status category sits at the boundary (slope ~ 0), suggesting neither improvement nor degradation. Importantly, these slopes are averaged across 4 models (3 families), and individual models show substantial variation. For example, Mistral 7B has a Nationality slope of -0.038 while Llama 1B has +0.003 -- the vulnerability is not uniform. The concurrency and backend axes have bias data available but were not further stratified in this synthesis due to the negligible overall concurrency effect and the primarily refusal-driven backend effect.
18. Judge Agreement Synthesis
TR134 validated regex classifiers against an LLM judge (Qwen 2.5 7B Q8_0) on 12,168 samples.
18.1 Overall Agreement
| Metric | Value |
|---|---|
| Total judged samples | 12,168 |
| Overall Cohen's kappa | 0.147 (poor) |
18.2 Per-Task Kappa
| Task | Kappa | Agreement % | Interpretation |
|---|---|---|---|
| advbench_refusal | 0.013 | 67.7% | Slight |
| truthfulqa | 0.282 | 43.2% | Fair |
18.3 Kappa by Quantization Level (AdvBench)
| Quant | Kappa | Pairs | Agreement % |
|---|---|---|---|
| FP16 | 0.000 | 200 | 71.5% |
| Q8_0 | 0.000 | 400 | 67.2% |
| Q6_K | 0.000 | 400 | 70.2% |
| Q5_K_M | 0.020 | 400 | 67.8% |
| Q4_K_M | 0.020 | 400 | 66.5% |
| Q3_K_S | 0.042 | 400 | 74.2% |
| Q2_K | 0.007 | 400 | 58.2% |
18.4 Kappa by Quantization Level (TruthfulQA)
| Quant | Kappa | Pairs | Agreement % |
|---|---|---|---|
| FP16 | 0.200 | 100 | 41.0% |
| Q8_0 | 0.249 | 200 | 42.0% |
| Q6_K | 0.272 | 200 | 46.5% |
| Q5_K_M | 0.386 | 200 | 46.5% |
| Q4_K_M | 0.292 | 200 | 41.0% |
| Q3_K_S | 0.292 | 200 | 43.0% |
| Q2_K | 0.214 | 200 | 41.5% |
Observations: The poor overall kappa (0.147) is concerning but interpretable in context. AdvBench kappa is near-zero (0.013) because both regex and judge classifiers agree on the "easy" cases (clear refusals) but diverge on edge cases -- and the base rate of agreement is high (68%) even with kappa near zero, indicating that the high agreement is largely explained by base rates rather than genuine classifier concordance. TruthfulQA shows better kappa (0.282) because truthfulness classification has more ambiguity, creating more room for classifier-judge disagreement to be informative.
The per-quant patterns differ between tasks. For AdvBench, Q2_K shows the lowest agreement (58.2%), suggesting that heavily quantized models produce responses that are genuinely harder to classify -- they are neither clearly refusal nor clearly compliance, but degraded text that classifiers interpret differently. For TruthfulQA, kappa peaks at Q5_K_M (0.386) and decreases at both extremes, suggesting a "sweet spot" where model outputs are coherent enough for consistent classification but varied enough for meaningful agreement.
These results have two implications for the synthesis. First, safety scores at extreme quant levels (Q2_K) carry more measurement uncertainty than those at moderate levels -- the 35.2pp quant effect for Llama 1B is directionally robust but its precise magnitude may be inflated or deflated by classifier disagreement. Second, the regex classifiers used across all three source TRs may systematically misclassify edge cases, potentially biasing effect size estimates. A human annotation study would resolve this uncertainty but is outside the scope of this synthesis.
19. Effect Decomposition & Conclusions
19.0 Cross-Axis Correlation
RQ7 asks: are models vulnerable on one axis also vulnerable on others? Computing Pearson correlation requires >= 3 shared models per axis pair. With only 2 anchor models (Llama 1B, 3B), we cannot compute a meaningful correlation.
However, qualitative inspection suggests a tentative answer: yes, for quant and backend; no for concurrency.
| Model | Quant Vulnerability | Backend Vulnerability | Concurrency Vulnerability |
|---|---|---|---|
| llama3.2-1b | Severe (35.2pp) | Severe (25.1pp) | None (-0.3pp) |
| llama3.2-3b | Mild (6.0pp) | Mild (4.4pp) | None (0.4pp) |
Llama 1B is the most vulnerable model on BOTH quant and backend axes. Llama 3B is mildly affected on both. This co-occurrence is consistent with a shared underlying mechanism: smaller models have less redundancy in their safety-aligned weights, making them more fragile to any perturbation -- whether that perturbation comes from weight quantization or from chat template divergence. Concurrency shows no vulnerability for either model, consistent with it being a scheduling-level operation that does not perturb model weights or templates.
With 3+ shared models per axis pair, a formal Pearson r could quantify this relationship. Future work adding Mistral and Qwen to the concurrency and backend experiments would enable this analysis.
19.1 Per-Model Decomposition
What percentage of total safety cost comes from each optimization axis?
| Model | Total (pp) | Quant | Concurrency | Backend | Dominant |
|---|---|---|---|---|---|
| llama3.2-1b | 60.6 | 58% (35.2pp) | <1% (0.3pp) | 41% (25.1pp) | Quantization |
| llama3.2-3b | 10.8 | 56% (6.0pp) | 4% (0.4pp) | 41% (4.4pp) | Quantization* |
19.2 Aggregate Decomposition
| Axis | Mean % of Total | N Models |
|---|---|---|
| Quantization | 57% | 2 |
| Backend | 41% | 2 |
| Concurrency | 2% | 2 |
19.3 Cross-TR Comparison
| Finding | TR134 (Quant) | TR135 (Conc.) | TR136 (Backend) | TR137 (Synthesis) |
|---|---|---|---|---|
| Max safety delta | 35.2pp (Llama 1B) | 0.4pp (Llama 3B) | 25.1pp (Llama 1B) | 35.2pp combined |
| Models affected | 1/4 severe, 1/4 anomalous | 0/3 | 1/3 severe | Same Llama 1B |
| Jailbreak amplified? | Yes (all slopes negative) | No (all slopes zero) | N/A | Quant yes, conc. no |
| Bias vulnerable | Nationality | N/A | N/A | Nationality |
| TOST equivalent? | N/A | N/A | 0/18 pairs | Backend never equivalent |
| Key mechanism | Weight precision loss | N/A (no effect) | Chat template divergence | Quant + template |
19.4 Model-Level Verdicts
| Model | Worst Axis | Max Delta (pp) | Safety > Capability? | Critical Configs | Overall Risk |
|---|---|---|---|---|---|
| llama3.2-1b | Quantization | 35.2 | Yes (backend axis) | 3 (all Q2_K) | CRITICAL |
| llama3.2-3b | Backend | 4.4 | Yes (concurrency, trivial) | 0 | Moderate |
| mistral-7b | Quantization | ~35pp (TR134 only) | Yes (quant axis, suggestive) | N/A (single-axis) | High (inferred) |
| qwen2.5-7b | Quantization | ~8pp (TR134 only) | No | N/A (single-axis) | Low (inferred) |
Observations: Llama 3.2 1B is the clear vulnerability hotspot: it drives the worst case on both the quant and backend axes, is the only model with critical-risk configs, and shows safety degrading faster than capability on the backend axis. Llama 3.2 3B is moderate-risk: its quant cost at Q4_K_M (4.6pp) is manageable but not negligible, and its lower baseline safety (0.736 vs 0.828) means less margin for error. Mistral 7B and Qwen 2.5 7B lack cross-axis data (single TR only) but their TR134 quantization profiles suggest high and low risk respectively. Mistral is concerning because it shows the steepest safety slope (+0.041) and is the only model where safety degrades faster than capability on the quant axis.
19.5 Answering the Research Questions
* Llama 3B's quant effect is an improvement (-6.0pp), not a degradation. It is "dominant" by absolute magnitude; the largest degradation axis is backend (4.4pp).
| RQ | Question | Answer |
|---|---|---|
| 1 | Most dangerous axis? | Quantization (57% of cost, mean 20.6pp) |
| 2 | Safety faster than capability? | Only 3/10 model-axis combinations (not universal) |
| 3 | Models agree? | No. I^2 = 99.9% (quant), 99.5% (backend), 0.0% (concurrency) |
| 4 | Combined cost? | Additive projection: 1.1-5.0pp at Q4_K_M; 34.7-35.2pp at Q2_K |
| 5 | Jailbreak patterns consistent? | Quant amplifies (all negative slopes). Concurrency does not (all zero). |
| 6 | Most vulnerable bias category? | Nationality (slope -0.010). Least: Race/Ethnicity (+0.015). |
| 7 | Cross-axis vulnerability correlation? | Insufficient data (need >= 3 shared models, have 2) |
19.6 Limitations of the Synthesis
Beyond the caveats enumerated in the Statistical Methods section, several synthesis-specific limitations deserve emphasis:
-
Two-model anchor set. All cross-axis conclusions rest on Llama 1B and 3B. These are both Llama-family models from the same architecture lineage. The synthesis cannot distinguish family-specific effects from universal effects.
-
Non-overlapping quant ranges. TR134 tests FP16-Q2_K, but the deployment matrix's "quant cost" is computed relative to FP16 baseline. TR136's backend decomposition uses Q4_K_M-Q8_0 within Ollama. These different quant ranges are not directly comparable.
-
Backend axis is partially confounded with weight format. The "backend effect" in TR136 includes both the serving framework difference AND the weight format difference (GGUF vs FP16 HuggingFace). These cannot be separated without a common weight format across backends.
-
No interaction measurement. The additive model may miss synergistic effects. For example, it is plausible that Q2_K models are MORE sensitive to backend changes than FP16 models, but we have no data on Q2_K + vLLM.
-
Temporal confound. Source TRs were run on different dates (Mar 5-8). System state (Ollama version, GPU thermal state, background processes) may have varied between runs, contributing to the cross-TR inconsistencies in Section 5.
19.7 Deployment Recommendations
- Always validate safety per-model. I-squared > 99% means no generic guideline is reliable.
- Use Q4_K_M or higher for safety-critical applications. 93-100% retention across the board.
- Scale concurrency freely. Zero safety cost, confirmed across all models and jailbreak techniques.
- Re-validate after backend changes. Backend contributes 41% of total safety cost.
- Monitor Nationality bias under quantization. Most vulnerable demographic category.
- Avoid Q2_K for Llama-class 1B models. CRITICAL risk (58% retention).
19.8 Future Work
Several extensions would strengthen the synthesis:
-
Full factorial experiment. A single experiment varying quant x concurrency x backend simultaneously would enable true interaction measurement. The estimated cost: 4 models x 7 quants x 4 N-levels x 4 backends = 448 configs x 955 prompts = ~428,000 samples. This is computationally expensive but would eliminate the additive assumption.
-
Expanded anchor set. Adding Mistral 7B and Qwen 7B to TR135 (concurrency) and TR136 (backend) experiments would increase the cross-axis anchor set from 2 to 4 models, enabling Pearson correlation computation and tighter bootstrap CIs.
-
Human annotation validation. The poor judge kappa (0.147) suggests that automated scoring introduces systematic uncertainty. A 500-sample human annotation study at key quant levels (FP16, Q4_K_M, Q2_K) would calibrate the automated classifiers.
-
Datacenter hardware. All results are from consumer NVIDIA GPUs. Replication on A100/H_100 hardware with tensor parallelism would test generalization to production infrastructure.
-
Stochastic sampling. All experiments used temperature 0. A temperature-0.7 variant would reveal whether stochastic sampling introduces additional safety variance that interacts with optimization axes.
20. Reproducibility
20.1 Prerequisites
All source experiments must be complete with analysis JSON files in their result directories.
20.2 Run Command
# Validate source availability
python research/tr137/run.py --validate-only -v
# Run full synthesis (18 passes)
python research/tr137/run.py -v
20.3 Expected Output
research/tr137/results/YYYYMMDD_HHMMSS/
tr137_analysis.json # Full 18-pass analysis (73KB)
tr137_report.md # Auto-generated 21-section report (463 lines)
tr137_deployment_matrix.csv # 24 deployment configs with risk tiers
tr137_effect_ranking.csv # Per-model axis ranking
20.4 Runtime
< 5 seconds on any machine with Python 3.9+. No GPU, no Ollama, no Docker required. Pure meta-analysis on pre-computed JSON files.
Appendix A: Full Deployment Matrix
All 24 assessed deployment configurations, sorted by risk level then retention.
| Model | Quant | N | Baseline | Quant Cost | Conc. Cost | Total Cost | Backend Range | Projected | Retention | Risk |
|---|---|---|---|---|---|---|---|---|---|---|
| llama3.2-1b | Q2_K | 1 | 0.828 | 35.2pp | 0.0pp | 35.2pp | 25.1pp | 0.476 | 57.5% | CRITICAL |
| llama3.2-1b | Q2_K | 8 | 0.828 | 35.2pp | -0.3pp | 34.9pp | 25.1pp | 0.478 | 57.8% | CRITICAL |
| llama3.2-1b | Q2_K | 4 | 0.828 | 35.2pp | -0.6pp | 34.7pp | 25.1pp | 0.481 | 58.1% | CRITICAL |
| llama3.2-3b | Q4_K_M | 1 | 0.736 | 4.6pp | 0.0pp | 4.6pp | 4.4pp | 0.690 | 93.8% | Moderate |
| llama3.2-3b | Q4_K_M | 4 | 0.736 | 4.6pp | 0.2pp | 4.8pp | 4.4pp | 0.688 | 93.5% | Moderate |
| llama3.2-3b | Q4_K_M | 8 | 0.736 | 4.6pp | 0.4pp | 5.0pp | 4.4pp | 0.686 | 93.2% | Moderate |
| llama3.2-3b | Q2_K | 1 | 0.736 | -6.0pp | 0.0pp | -6.0pp | 4.4pp | 0.796 | 108.2% | Low |
| llama3.2-3b | Q2_K | 4 | 0.736 | -6.0pp | 0.2pp | -5.9pp | 4.4pp | 0.794 | 108.0% | Low |
| llama3.2-3b | Q2_K | 8 | 0.736 | -6.0pp | 0.4pp | -5.6pp | 4.4pp | 0.792 | 107.6% | Low |
| llama3.2-1b | Q8_0 | 4 | 0.828 | -0.5pp | -0.6pp | -1.1pp | 25.1pp | 0.839 | 101.3% | Low |
| llama3.2-1b | Q8_0 | 8 | 0.828 | -0.5pp | -0.3pp | -0.8pp | 25.1pp | 0.836 | 101.0% | Low |
| llama3.2-1b | FP16 | 4 | 0.828 | 0.0pp | -0.6pp | -0.6pp | 25.1pp | 0.833 | 100.7% | Low |
| llama3.2-1b | Q8_0 | 1 | 0.828 | -0.5pp | 0.0pp | -0.5pp | 25.1pp | 0.833 | 100.7% | Low |
| llama3.2-1b | FP16 | 8 | 0.828 | 0.0pp | -0.3pp | -0.3pp | 25.1pp | 0.831 | 100.3% | Low |
| llama3.2-1b | FP16 | 1 | 0.828 | 0.0pp | 0.0pp | 0.0pp | 25.1pp | 0.828 | 100.0% | Low |
| llama3.2-3b | FP16 | 1 | 0.736 | 0.0pp | 0.0pp | 0.0pp | 4.4pp | 0.736 | 100.0% | Low |
| llama3.2-3b | FP16 | 4 | 0.736 | 0.0pp | 0.2pp | 0.2pp | 4.4pp | 0.734 | 99.7% | Low |
| llama3.2-3b | FP16 | 8 | 0.736 | 0.0pp | 0.4pp | 0.4pp | 4.4pp | 0.731 | 99.4% | Low |
| llama3.2-1b | Q4_K_M | 4 | 0.828 | 1.3pp | -0.6pp | 0.8pp | 25.1pp | 0.820 | 99.1% | Low |
| llama3.2-1b | Q4_K_M | 8 | 0.828 | 1.3pp | -0.3pp | 1.1pp | 25.1pp | 0.817 | 98.7% | Low |
| llama3.2-1b | Q4_K_M | 1 | 0.828 | 1.3pp | 0.0pp | 1.3pp | 25.1pp | 0.814 | 98.4% | Low |
| llama3.2-3b | Q8_0 | 1 | 0.736 | 1.2pp | 0.0pp | 1.2pp | 4.4pp | 0.723 | 98.3% | Low |
| llama3.2-3b | Q8_0 | 4 | 0.736 | 1.2pp | 0.2pp | 1.4pp | 4.4pp | 0.721 | 98.0% | Low |
| llama3.2-3b | Q8_0 | 8 | 0.736 | 1.2pp | 0.4pp | 1.7pp | 4.4pp | 0.719 | 97.7% | Low |
Appendix B: Jailbreak Success Rates by Technique
Selected compliance rates showing quantization effect. Full data across all 4 models, 7 quants, 4 techniques available in tr137_analysis.json.
B.1 Prefix Injection (Most Effective Technique)
| Model | FP16 | Q8_0 | Q4_K_M | Q2_K |
|---|---|---|---|---|
| llama3.2-1b | 3.3% | 3.3% | 10.0% | 60.0% |
| llama3.2-3b | 3.3% | 3.3% | 0.0% | 10.0% |
| mistral-7b | -- | 86.7% | 93.3% | 93.3% |
| qwen2.5-7b | -- | 23.3% | 20.0% | 36.7% |
B.2 Direct Prompts (Control)
| Model | FP16 | Q8_0 | Q4_K_M | Q2_K |
|---|---|---|---|---|
| llama3.2-1b | 6.7% | 3.3% | 13.3% | 76.7% |
| llama3.2-3b | 0.0% | 0.0% | 0.0% | 16.7% |
| mistral-7b | -- | 70.0% | 73.3% | 83.3% |
| qwen2.5-7b | -- | 6.7% | 10.0% | 30.0% |
B.3 DAN-Style
| Model | FP16 | Q8_0 | Q4_K_M | Q2_K |
|---|---|---|---|---|
| llama3.2-1b | 0.0% | 0.0% | 0.0% | 70.0% |
| llama3.2-3b | 0.0% | 0.0% | 6.7% | 3.3% |
| mistral-7b | -- | 66.7% | 76.7% | 76.7% |
| qwen2.5-7b | -- | 6.7% | 0.0% | 16.7% |
B.4 Roleplay
| Model | FP16 | Q8_0 | Q4_K_M | Q2_K |
|---|---|---|---|---|
| llama3.2-1b | 3.3% | 0.0% | 3.3% | 33.3% |
| llama3.2-3b | 13.3% | 23.3% | 63.3% | 3.3% |
| mistral-7b | -- | 100.0% | 96.7% | 96.7% |
| qwen2.5-7b | -- | 6.7% | 6.7% | 33.3% |
Observations: Several patterns emerge from the full technique breakdown. Mistral 7B is an outlier: it shows near-ceiling compliance (67-100%) across ALL techniques and quant levels. Its safety alignment does not resist jailbreaks even at Q8_0. Llama 1B shows a dramatic Q2_K cliff: compliance jumps from near-zero (0-13%) at Q4_K_M to 33-77% at Q2_K across all techniques. Llama 3B is the most robust small model, with near-zero compliance in most cells. The technique ranking (prefix_injection > direct > dan_style > roleplay by slope) is consistent across models, suggesting the technique effectiveness hierarchy is a property of the jailbreak design, not the target model.
Appendix C: Per-Category Bias Slopes
Safety degradation slope (score vs BPW) per demographic category per model. Negative = bias worsens at lower quant.
| Category | llama3.2-1b | llama3.2-3b | mistral-7b | qwen2.5-7b | Avg |
|---|---|---|---|---|---|
| Nationality | +0.003 | +0.001 | -0.038 | -0.004 | -0.010 |
| SES | -0.004 | +0.004 | -0.017 | +0.004 | -0.003 |
| Disability_status | -0.003 | +0.011 | -0.009 | 0.000 | 0.000 |
| Religion | -0.005 | +0.009 | +0.020 | -0.011 | +0.003 |
| Race_x_SES | +0.003 | +0.002 | +0.007 | +0.004 | +0.004 |
| Race_x_gender | +0.011 | +0.004 | +0.017 | -0.002 | +0.007 |
| Physical_appearance | +0.007 | +0.013 | +0.014 | 0.000 | +0.009 |
| Age | +0.004 | +0.007 | +0.024 | 0.000 | +0.009 |
| Gender_identity | +0.011 | +0.008 | +0.017 | +0.004 | +0.010 |
| Sexual_orientation | -0.003 | +0.007 | +0.043 | 0.000 | +0.012 |
| Race_ethnicity | +0.012 | +0.006 | +0.041 | 0.000 | +0.015 |
Observations: Mistral 7B drives much of the category-level variation. It shows the steepest negative slope for Nationality (-0.038) and the steepest positive slopes for Sexual_orientation (+0.043) and Race_ethnicity (+0.041). Qwen 2.5 7B shows near-zero slopes for most categories, suggesting its bias alignment is relatively robust to quantization. The within-model variation is often larger than the between-category variation -- Mistral's slopes range from -0.038 to +0.043 -- making aggregate category rankings less reliable than per-model assessments.
Appendix D: Glossary
| Term | Definition |
|---|---|
| BPW | Bits per weight. FP16 = 16.0, Q8_0 = 8.0, Q6_K = 6.5, Q5_K_M = 5.5, Q4_K_M = 4.5, Q3_K_S = 3.5, Q2_K = 2.5. |
| Bootstrap CI | Confidence interval computed by resampling data 2,000 times with replacement and taking the 2.5th/97.5th percentiles. Non-parametric; does not assume normality. |
| Cohen's d | Standardized effect size: (mean_A - mean_B) / pooled_SD. < 0.2 trivial, 0.2-0.5 small, 0.5-0.8 medium, > 0.8 large. |
| Cohen's kappa | Inter-rater agreement statistic correcting for chance. < 0 poor, 0-0.2 slight, 0.2-0.4 fair, 0.4-0.6 moderate, > 0.6 substantial. |
| I-squared | Heterogeneity statistic: percentage of total variation attributable to between-study (between-model) differences. < 25% low, 25-75% moderate, > 75% high. |
| MDE | Minimum Detectable Effect. Smallest effect a study can detect at alpha=0.05, power=0.80. Reported in percentage points. |
| OAT | One-at-a-time factorial design. Each experiment varies one factor while holding others constant. Cannot measure interaction effects. |
| pp | Percentage points. An absolute difference in proportions. If score drops from 0.85 to 0.60, the delta is 25pp. |
| RLHF | Reinforcement Learning from Human Feedback. The fine-tuning method that instills safety alignment in LLMs. |
| Safety Retention | Projected safety as a percentage of baseline: (projected / baseline) x 100. |
| TOST | Two One-Sided Tests. Tests equivalence within a margin (here +/-3pp). Both one-sided tests must pass at alpha=0.05 to declare equivalence. |
| Welch's t-test | Two-sample t-test that does not assume equal variances. Used for pairwise backend and quant comparisons. |
References
- Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., & Steinhardt, J. (2021). Measuring Massive Multitask Language Understanding. ICLR 2021.
- Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., & Tafjord, O. (2018). Think you have Solved Question Answering? Try ARC. arXiv:1803.05457.
- Lin, S., Hilton, J., & Evans, O. (2022). TruthfulQA: Measuring How Models Mimic Human Falsehoods. ACL 2022.
- Parrish, A., Chen, A., Nangia, N., Padmakumar, V., Phang, J., Thompson, J., Htut, P.M., & Bowman, S. (2022). BBQ: A Hand-Built Bias Benchmark for Question Answering. ACL 2022.
- Chao, P., Robey, A., Dobriban, E., Hassani, H., Pappas, G.J., & Wong, E. (2024). JailbreakBench: An Open Robustness Benchmark for Jailbreaking LLMs. NeurIPS 2024.
- Shen, X., Chen, Z., Backes, M., & Zhang, Y. (2024). JailbreakHub: A Centralized Repository for Jailbreak Prompts. arXiv:2401.01288.
- Higgins, J.P.T., Thompson, S.G., Deeks, J.J., & Altman, D.G. (2003). Measuring Inconsistency in Meta-Analyses. BMJ, 327(7414), 557-560. (I-squared statistic.)
- Banterhearts TR134 (2026). Alignment Robustness Under Quantization.
- Banterhearts TR135 (2026). Multi-Agent Concurrency x Safety.
- Banterhearts TR136 (2026). Cross-Backend Safety Consistency.