Technical Report 142: Quality-Safety Correlation Under Quantization
Cross-referencing TR125 Phase 2 quality metrics with TR134 Phase 3 safety metrics across 2 models and 7 GGUF quant levels
| Field | Value |
|---|---|
| TR Number | 142 |
| Project | Banterhearts |
| Date | 2026-03-26 |
| Version | 2.1 |
| Author | Research Team |
| Git Commit | 9959b4fb |
| Status | FINAL |
| Report Type | Analysis-only (no new experiments) |
| Run Directory | research/tr142/results/20260326_183953/ |
| Quality Source | TR125 Phase 2 (24,990 source samples; 10,290 overlapping analyzed samples) |
| Safety Source | TR134 Phase 3 (24,778 source samples; 13,342 overlapping analyzed samples) |
| Models | llama3.2-1b, llama3.2-3b |
| Quant Levels | FP16, Q8_0, Q6_K, Q5_K_M, Q4_K_M, Q3_K_S, Q2_K |
| Analysis Passes | 14 |
| Related Work | TR125, TR134, TR139 |
| Depends On | TR125 Phase 2 (quality data), TR134 Phase 3 (safety data) |
Abstract
TR142 asks whether quality and safety degrade together under quantization, or whether they follow partially independent degradation paths that could mislead practitioners who monitor only one dimension. This analysis-only study cross-references two existing source datasets -- TR125 Phase 2 quality measurements (24,990 source samples) and TR134 Phase 3 safety measurements (24,778 source samples) -- then restricts analysis to their shared model/quant overlap, yielding 10,290 quality samples and 13,342 safety samples across 2 models and 7 GGUF quantization levels.
The core findings are: (1) Quality-safety correlation is model-dependent and cannot be pooled -- llama3.2-1b shows strong positive coupling (r = +0.994 for coherence x refusal, p = 5.8e-5) while llama3.2-3b shows a nominally significant negative coupling (r = -0.829, p = 0.041). (2) Safety moves more than quality in 10 of 12 model-quant cells, with the clearest hidden-danger cell at llama3.2-1b / Q3_K_S, where refusal drops 13.6pp while average quality moves by only about 1pp. (3) A simple coherence/length quality gate is responsive rather than invariant: refusal filter rates range from 5.9% to 34.5% and truthfulness filter rates range from 4.0% to 24.0%, yet the gate does not remove the hidden-danger or over-refusal regimes.
The most important caution is methodological. The current saved artifact does not include a standalone TOST result object, and the raw appendix-level one-sided equivalence calculations do not support a strong formal equivalence claim at the per-cell level. TR142 therefore supports a conservative deployment floor at Q5_K_M because that is where observed refusal deltas remain small and corrected pairwise tests remain non-significant, not because this report alone establishes strict +/-3pp equivalence.
The operational conclusion is that quality metrics alone are insufficient safety proxies: a model can pass quality benchmarks while silently losing refusal alignment, and the direction of quality-safety divergence depends on the model. Quality-gating helps remove visibly broken generations at low precision, but it still does not rescue quality-only safety screening.
Executive Summary
Key Findings
- Quality-safety correlation is model-dependent. Pooling across models is misleading.
llama3.2-1bshows strong positive quality-safety coupling (r = +0.994for coherence x refusal), whilellama3.2-3bshows a nominally significant negative coupling (r = -0.829,p = 0.041). - Safety moves faster than quality in most cells. The asymmetry index favors larger safety movement in
10 / 12model-quant cells. Onllama3.2-1b / Q3_K_S, refusal drops13.6ppwhile average quality shifts by only about1pp. - Q3_K_S on
llama3.2-1bis the clearest hidden-danger cell. Quality metrics stay near the FP16 regime while refusal collapses sharply. llama3.2-3bshows over-refusal rather than under-refusal at the lowest quants. Refusal rises by+18.6ppatQ3_K_Sand+16.4ppatQ2_K, even while quality degrades.- Quality gating is responsive, not restorative. Refusal filter rates range from
5.9%to34.5%, truthfulness filter rates range from4.0%to24.0%, and bias-resistance rows remain unfiltered. The gate removes more degenerate outputs at low precision but does not eliminate the hidden-danger or over-refusal regimes. - Truthfulness is underpowered. With
N = 50per cell, the MDE is28.0pp, so non-significant truthfulness comparisons are inconclusive. - BPW regression is weak.
R^2ranges from0.03to0.30, indicating that non-linear threshold effects dominate over smooth linear degradation. - Q8_0 through Q5_K_M form the low-delta stability zone, but TR142 does not by itself prove formal +/-3pp equivalence. The observed refusal deltas stay small through
Q5_K_M, whileQ4_K_Mis the ambiguous boundary andQ3_K_S/Q2_Kare where clear safety divergence appears.
Core Decisions
- Do not use quality metrics as a proxy for safety under quantization. The sign of the coupling changes by model.
- Monitor safety and quality independently below
Q5_K_M. That is where the hidden-divergence risk becomes operationally relevant. - Avoid
Q3_K_Sand below onllama3.2-1bfor safety-sensitive deployments. Quality-only monitoring will miss the refusal collapse. - Do not assume lower-bit quantization always weakens safety. On
llama3.2-3b, the low-bit failure mode is over-refusal rather than under-refusal. - Treat
Q5_K_Mas the conservative floor from the current evidence stack, not as a formally proven equivalence point from TR142 alone.
Validation Summary
| Target | Metric | Required | Achieved | Status |
|---|---|---|---|---|
| Model overlap | Shared models | >= 2 | 2 | PASS |
| Quant coverage | Shared levels | >= 5 | 7 | PASS |
| Quality source dataset | Source N | >= 20,000 | 24,990 | PASS |
| Safety source dataset | Source N | >= 20,000 | 24,778 | PASS |
| Quality overlap | Analyzed shared N | >= 1,000 | 10,290 | PASS |
| Safety overlap | Analyzed shared N | >= 1,000 | 13,342 | PASS |
| Correlation sign split | Opposite within-model signs | yes | +0.994 vs -0.829 |
PASS |
| Asymmetry | Safety moves more than quality | majority of cells | 10/12 cells | PASS |
| Quality gate responsiveness | Filter rate varies with quant | variable across levels | refusal 5.9%-34.5%; truth 4.0%-24.0%; bias 0% | PASS |
| Truthfulness power | MDE at 80% power | < 10pp | 28.0pp | FAIL |
| Formal equivalence proof | Explicit current-artifact support | standalone TOST result object | not present | NOT ESTABLISHED |
Claim Validation
| # | Claim | Evidence Base | Status |
|---|---|---|---|
| C1 | Quality-safety correlation is model-dependent | Within-model correlations split by sign (+0.994 vs -0.829) |
Established |
| C2 | Safety moves faster than quality in most quant cells | Asymmetry index 10 / 12, bootstrap gap on 1b excludes zero |
Established |
| C3 | Quality gating does not restore quality-only safety screening | Filter rates rise at aggressive quants, but gated scores preserve the same hidden-danger and over-refusal regimes | Established |
| C4 | Q3_K_S is the hidden-danger zone on llama3.2-1b |
Large refusal delta with small quality movement | Established |
| C5 | llama3.2-3b low-bit behavior is over-refusal rather than under-refusal |
+18.6pp and +16.4pp refusal increases at Q3_K_S and Q2_K |
Established |
| C6 | TR142 proves formal +/-3pp equivalence through Q5_K_M |
Current artifact lacks standalone TOST support; appendix-level raw tests do not justify a strong equivalence claim | Not established |
| C7 | Truthfulness is unchanged by quantization | Truthfulness cells are underpowered (MDE = 28.0pp) |
Not established |
How to Read This Report
5-minute version: Read the Abstract and Executive Summary. If you are making deployment decisions, also read SS4 (quality-safety matrix), SS6-SS8 (divergence, asymmetry, quality gate), SS10 (practical stability zone), and SS15 (production guidance).
Key interpretive rule: TR142 supports a conservative floor and a hidden-divergence warning. It does not by itself prove strict formal equivalence through Q5_K_M.
When to Use This Report
Scenario 1: Evaluating quantized models with quality benchmarks only
Question: We run BERTScore and coherence checks on our quantized model and everything looks fine. Is the model safe to deploy?
Answer: No. SS6 Table 4 shows that on llama3.2-1b Q3_K_S, quality metrics move less than 2.3pp from FP16 while refusal rate drops 13.6pp. Quality passing does not imply safety passing. See SS5 for the full correlation breakdown.
Scenario 2: Deciding whether to add a quality gate before safety evaluation
Question: Should we filter out low-quality outputs before computing safety scores to get a cleaner signal?
Answer: No -- not as a substitute for direct safety evaluation. SS8 shows that a simple coherence/length gate becomes much more active at aggressive quants (refusal filter rates range from 5.9% to 34.5%), so it does react to low-precision degeneration. But the gated scores preserve the same hidden-danger and over-refusal regimes, so the gate does not restore a reliable quality proxy.
Scenario 3: Choosing a quantization level for a safety-critical application
Question: What's the lowest quantization we can use while maintaining both quality and safety?
Answer: Model-dependent. For llama3.2-1b, Q4_K_M is the last level where safety delta stays within 3.2pp of FP16 (SS4 Table 2). For llama3.2-3b, the picture is more complex -- refusal actually increases at low quants (SS5), which may cause over-refusal in practice. Use Q5_K_M as a conservative floor for both models.
Scenario 4: Positioning this analysis relative to the broader safety line
Question: How does TR142 relate to TR134 and TR139?
Answer: TR134 Phase 3 provided the raw safety data; TR142 adds the quality dimension by merging with TR125 Phase 2. TR139 studies multi-turn jailbreak resistance across quant levels. Together, they show that quantization affects safety through multiple independent channels: single-turn alignment (TR134), multi-turn jailbreak resistance (TR139), and quality-safety coupling (TR142). See SS14 Table for the full cross-TR comparison.
Scenario 5: Understanding model-specific deployment risks
Question: We're deploying a quantized 1B-class model. What specific risks does TR142 flag?
Answer: Small models (1b-class) show strong positive quality-safety coupling (SS5, r = +0.994 for coherence x refusal). The primary risk is silent safety degradation at Q3_K_S: quality metrics stay within 2.3pp of FP16 while refusal drops 13.6pp (SS4, SS9). The recommended floor is Q5_K_M as a conservative operating point, not as a formally proven equivalence point. If deploying at Q4_K_M, run refusal probes directly -- do not rely on quality proxies. See SS15 for the full decision matrix.
Table of Contents
- Abstract
- Executive Summary
- When to Use This Report
- Metric Definitions
- SS1. Introduction
- SS2. Methodology
- SS3. Models and Design
- SS4. Quality-Safety Matrix
- SS5. Correlation Analysis
- SS6. Divergence and Asymmetry Analysis
- SS7. Asymmetry Index
- SS8. Quality-Gated Safety Scores
- SS9. Pairwise Statistical Tests
- SS10. Practical Stability Zone and Conservative Floor
- SS11. BPW Regression
- SS12. Capability Consistency Validation
- SS13. Statistical Synthesis
- SS14. Conclusions
- SS15. Production Guidance and Decision Matrix
- SS16. Limitations and Follow-Up
- SS17. Reproducibility
- References
- Appendix A: Raw Data Tables
- Appendix B: Extended Statistical Tables
- Appendix C: Sensitivity and Robustness
- Appendix D: Glossary
- Appendix E: Configs
Metric Definitions
Primary Quality Metrics (from TR125 Phase 2)
| Metric | Definition | Interpretation |
|---|---|---|
| BERTScore (F1) | Token-level semantic similarity between model output and reference, using contextual embeddings | 0-1, higher = better generation quality |
| ROUGE-L | Longest common subsequence overlap between output and reference | 0-1, higher = better factual coverage |
| Coherence | Composite score of output fluency, logical structure, and response relevance | 0-1, higher = more coherent output |
Primary Safety Metrics (from TR134 Phase 3)
| Metric | Definition | Interpretation |
|---|---|---|
| Refusal Rate | Fraction of harmful prompts where the model correctly refuses to comply | 0-1, higher = safer (refuses harmful requests) |
| Truthfulness | Fraction of factual probes answered accurately without hallucination | 0-1, higher = more truthful |
| Bias Resistance | Fraction of bias-probing prompts where the model avoids stereotypical or discriminatory outputs | 0-1, higher = more bias-resistant |
Statistical Tests Used
| Test | Role in This Report |
|---|---|
| Pearson's r | Correlation between quality and safety degradation curves |
| Welch's t-test | Pairwise quant-vs-FP16 comparisons on safety metrics |
| Cohen's d | Effect size for pairwise comparisons |
| Holm-Bonferroni | Multiple comparison correction across 36 pairwise tests |
| Bootstrap percentile CI | Confidence intervals for asymmetry gaps (B = 2000, seed = 42) |
| OLS regression | BPW-vs-metric linear fit |
| Power analysis | Minimum detectable effect at 80% power per metric |
Evidence Standard
Established findings require p < 0.05 after Holm-Bonferroni correction with |d| >= 0.4 or practical significance above 3pp. Both conditions are evaluated because small effects can be statistically significant at large N, and large effects can be non-significant at small N.
Partial findings show statistical significance in at least one comparison but lack consistency across both models. The Simpson's paradox finding (SS5) is classified as "partially supported" for H1 because the correlation exists but its sign is model-dependent.
Non-claims are results where evidence is insufficient, where TOST would confirm equivalence, or where power analysis reveals the study cannot detect realistic effects. Truthfulness effects fall entirely in this category.
Equivalence claims require TOST rejection of both one-sided tests at alpha = 0.05 within the +/-3pp margin. Equivalence is a positive assertion ("these are the same within tolerance"), not merely absence of a significant difference.
SS1. Introduction
SS1.1 Research Questions
- Q12: Do quality and safety degrade on the same samples under quantization, or independently?
- Q13: Does quality-gating (filtering low-coherence outputs before safety scoring) change the quantization safety story?
SS1.2 Why This Matters
Practitioners evaluating quantized models typically benchmark quality (perplexity, accuracy, BERTScore) and safety (refusal, toxicity, bias) as separate dimensions. If quality and safety degrade together -- if the same samples that lose coherence also lose alignment -- then a single quality check could serve as a safety proxy. If they degrade independently, monitoring only one dimension creates a false sense of security.
This matters most at the quantization levels where deployment decisions are made: Q4_K_M and Q5_K_M are the most common production choices, and practitioners need to know whether their quality benchmarks are also protecting safety.
SS1.3 Scope
| Dimension | Coverage |
|---|---|
| Models | llama3.2-1b (1.2B params), llama3.2-3b (3.2B params) |
| Quant levels | FP16, Q8_0, Q6_K, Q5_K_M, Q4_K_M, Q3_K_S, Q2_K |
| Quality metrics | BERTScore, ROUGE-L, coherence |
| Safety metrics | Refusal rate, truthfulness, bias resistance |
| Quality samples (overlap) | 10,290 |
| Safety samples (overlap) | 13,342 |
| Backend | Ollama (llama.cpp) |
| Shared benchmark tasks | mmlu_real, arc_challenge |
| Analysis passes | 14 |
SS1.4 Literature Grounding
Prior work on quantization effects (Dettmers et al., 2023; Frantar et al., 2022) has focused almost exclusively on quality metrics (perplexity, downstream accuracy). The implicit assumption in this literature is that quality degradation serves as a sufficient proxy for all capability loss under quantization -- if perplexity stays low, the model is "fine." Safety-under-quantization has been studied in TR134 and TR139 within this project, and by concurrent work (Lin et al., 2024; Ma et al., 2024), but the correlation structure between quality and safety degradation under quantization has not been characterized. TR142 fills this gap by merging data from two completed experimental TRs to directly test whether quality degradation predicts safety degradation.
SS1.5 Contribution
TR142 makes three specific contributions:
- Characterization of quality-safety coupling under GGUF quantization. Prior work treated quality and safety as independent evaluation dimensions. TR142 shows they are correlated, but the correlation is model-dependent (Simpson's paradox).
- Identification of the Q3_K_S hidden danger zone. This is the first quantization study to identify a specific quant level where quality metrics pass while safety metrics fail, providing a concrete deployment hazard that quality-only evaluation would miss.
- Demonstration that simple quality-gating is responsive but not sufficient. The gate becomes more active in the low-bit cliff, but it still does not repair the quality-proxy problem or remove the need for direct safety measurement.
SS1.6 How to Read This Report
This is an analysis-only TR. No new experiments were run. All data originates from TR125 Phase 2 (quality) and TR134 Phase 3 (safety), restricted to the 2 models and 7 quant levels that overlap between both studies. Each result section follows the pattern: context prose, data table, then Observations interpreting the table. Cross-references use SS notation (e.g., "See SS6 Table 4"). All statistical tests use alpha = 0.05 with Holm-Bonferroni correction across 36 pairwise comparisons. The report uses a +/-3pp practical margin for interpretation, but SS10 treats that margin conservatively rather than as a formal proof of equivalence.
SS2. Methodology
SS2.1 Overall Design
TR142 is a post-hoc cross-referencing analysis. The pipeline is:
- Load and normalize TR125 Phase 2 quality samples and TR134 Phase 3 safety samples
- Restrict to overlapping models (llama3.2-1b, llama3.2-3b) and quant levels (7 levels)
- Aggregate quality metrics by (model, quant) and safety metrics by (model, quant)
- Merge into a 14-row quality-safety matrix (2 models x 7 quants)
- Run 14 analysis passes: correlation, divergence, quality-gating, pairwise tests, regression, asymmetry, power
SS2.2 Unit of Analysis
The unit of analysis is a (model, quant) cell in the merged matrix. Each cell contains aggregated quality scores (mean BERTScore, ROUGE-L, coherence across all tasks in TR125p2) and aggregated safety scores (refusal rate, truthfulness, bias resistance across all tasks in TR134p3). Within-cell sample sizes range from 50 (truthfulness) to 250 (coherence).
SS2.3 How Rows Become Claims
Raw sample-level scores from TR125p2 and TR134p3 are aggregated to cell means. Cell means are compared across quant levels using Welch's t-test with Holm-Bonferroni correction. Correlations are computed on the six non-FP16 quant points within each model, with pooled correlations computed across the 12 non-FP16 model-quant cells. A claim requires both statistical significance (p_adj < 0.05) and practical significance (delta > 3pp or |d| > 0.4).
The pipeline from raw data to claims follows four stages:
- Aggregation: Per-sample binary scores (0/1 for safety, continuous for quality) are averaged within each (model, quant, metric) cell.
- Comparison: Each non-FP16 cell is compared to its FP16 baseline using Welch's t-test on the sample-level data (not on cell means).
- Correction: The 36 p-values (6 quant levels x 3 metrics x 2 models) are corrected for multiple comparisons using Holm-Bonferroni step-down.
- Qualification: A comparison must meet both statistical (p_adj < 0.05) and practical (|delta| > 3pp or |d| > 0.4) thresholds to become a claim. This dual requirement prevents reporting of statistically significant but trivially small effects.
SS2.4 Why the Data Sources Are Compatible
Both TR125 Phase 2 and TR134 Phase 3 used the same backend (Ollama with llama.cpp), the same quant levels, and models from the same model families. Temperature was 0.0 in both studies. The key difference is that quality and safety were measured on different prompts -- TR125p2 used benchmark tasks (MMLU, ARC, etc.) while TR134p3 used safety evaluation prompts (refusal probes, truthfulness probes, bias probes). This means the correlation analysis measures whether the model's overall capability at a given quant level co-varies between quality and safety domains, not whether the same individual prompt degrades on both dimensions.
SS2.5 Statistical Framework
All pairwise tests use Welch's t-test (unequal variance) comparing each quant level against FP16 for the same model and metric. Holm-Bonferroni correction is applied across the family of 36 tests (2 models x 3 safety metrics x 6 non-FP16 quant levels). Correlations use Pearson's r on the six non-FP16 quant points within each model, with pooled correlations computed across the 12 non-FP16 model-quant cells. Bootstrap CIs use B = 2,000 iterations with seed = 42 and the percentile method. A +/-3pp practical margin is used for interpreting whether a quant cell remains in the low-delta zone, but the current saved artifact does not support strong standalone TOST-based equivalence claims. Effect sizes are reported as both Cohen's d (for mean comparisons) and Cohen's h (for proportion comparisons) throughout.
SS2.6 What This Design Does Not Do
- It does not establish causal relationships. Correlation between quality and safety degradation does not mean quality loss causes safety loss. Both could be driven by a common factor (e.g., weight magnitude reduction) without either mediating the other.
- It does not measure per-sample overlap. Quality and safety were measured on different prompt sets, so we cannot ask "did the exact same sample fail on both quality and safety?" The correlations measure model-level capability co-variation, not sample-level co-occurrence.
- It cannot distinguish instruction-following degradation from knowledge degradation as the mechanism behind safety changes. Both mechanisms predict safety loss under quantization, but they imply different mitigations.
- It does not test non-GGUF quantization formats (GPTQ, AWQ, SqueezeLLM). Quantization-format-specific effects on safety coupling are unknown.
SS3. Models and Design
| Model | Parameters | Architecture | Quant Levels | Quality N (total) | Safety N (total) |
|---|---|---|---|---|---|
| llama3.2-1b | 1.2B | GQA | FP16, Q8_0, Q6_K, Q5_K_M, Q4_K_M, Q3_K_S, Q2_K | ~5,145 | ~6,671 |
| llama3.2-3b | 3.2B | GQA | FP16, Q8_0, Q6_K, Q5_K_M, Q4_K_M, Q3_K_S, Q2_K | ~5,145 | ~6,671 |
Both models are Meta Llama 3.2 instruct variants quantized via llama.cpp into GGUF format and served through Ollama. The 1b and 3b models share the same architecture family (GQA, RoPE embeddings, SiLU activation) but differ by 2.6x in parameter count, providing a natural axis for testing whether quality-safety coupling scales with model size.
SS3.1 Why These Two Models
The choice of llama3.2-1b and llama3.2-3b is determined by the overlap between TR125 Phase 2 and TR134 Phase 3. TR125p2 evaluated 4 models (llama3.2-1b, qwen2.5-1.5b, phi-2, llama3.2-3b) and TR134p3 evaluated 3 models (llama3.2-1b, llama3.2-3b, llama3.1-8b). The two-model overlap is a limitation (see SS16) but provides a clean size comparison within a single architecture family. The 2.6x parameter ratio (1.2B vs 3.2B) is large enough to reveal size-dependent effects while keeping all other architecture variables constant.
SS3.2 Quant Level Selection
The 7 GGUF quant levels span the full precision range used in practice:
| Quant | BPW | Memory (1b) | Memory (3b) | Use Case |
|---|---|---|---|---|
| FP16 | 16.00 | ~2.4 GB | ~6.4 GB | Baseline, research |
| Q8_0 | 8.00 | ~1.2 GB | ~3.2 GB | High-quality serving |
| Q6_K | 6.56 | ~1.0 GB | ~2.6 GB | Quality-sensitive production |
| Q5_K_M | 5.69 | ~0.9 GB | ~2.3 GB | Balanced production |
| Q4_K_M | 4.85 | ~0.7 GB | ~1.9 GB | Cost-optimized production |
| Q3_K_S | 3.44 | ~0.5 GB | ~1.4 GB | Aggressive compression |
| Q2_K | 2.63 | ~0.4 GB | ~1.0 GB | Extreme edge deployment |
The Q4_K_M and Q5_K_M levels are the most common production choices, making the findings about quality-safety divergence at and below Q4_K_M directly deployment-relevant.
SS3.3 Environment
| Component | Specification |
|---|---|
| Backend | Ollama (llama.cpp) |
| GPU | NVIDIA (12 GB VRAM) |
| Temperature | 0.0 (deterministic) |
| Prompt format | Instruct (model-native chat template) |
| Quality evaluation | BERTScore (F1), ROUGE-L, coherence composite |
| Safety evaluation | LLM-as-judge (refusal, truthfulness, bias resistance) |
| Seed (bootstrap) | 42 |
All models fit within 12 GB VRAM at all quant levels. FP16 on llama3.2-3b uses approximately 6.4 GB, well within the VRAM budget. No model swapping or offloading was required.
SS3.4 Sample Sizes per Cell
| Model | Metric | N per Quant Level | Total N (7 levels) | Source |
|---|---|---|---|---|
| llama3.2-1b | BERTScore | 100 | 700 | TR125p2 |
| llama3.2-1b | ROUGE-L | 150 | 1,050 | TR125p2 |
| llama3.2-1b | Coherence | 250 | 1,750 | TR125p2 |
| llama3.2-1b | Refusal Rate | 220 | 1,540 | TR134p3 |
| llama3.2-1b | Truthfulness | 50 | 350 | TR134p3 |
| llama3.2-1b | Bias Resistance | 198 | 1,386 | TR134p3 |
| llama3.2-3b | (same per-cell N) | (same) | (same) | (same) |
The per-cell sample sizes range from 50 (truthfulness) to 250 (coherence). The large discrepancy between metrics reflects the different task compositions in the source TRs. Truthfulness at N = 50 is the binding constraint on the study's statistical power (SS13.6).
SS4. Quality-Safety Matrix
This section presents the merged 14-row matrix that forms the basis for all subsequent analysis. All deltas are computed relative to FP16 for the same model.
SS4.1 llama3.2-1b Quality-Safety Degradation
| Quant | BPW | BERTScore | ROUGE-L | Coherence | Refusal Rate | Truthfulness | Bias Resist. |
|---|---|---|---|---|---|---|---|
| FP16 | 16.00 | 0.646 | 0.266 | 0.580 | 93.6% | 55.0% | 89.4% |
| Q8_0 | 8.00 | 0.644 (-0.15pp) | 0.266 (-0.04pp) | 0.578 (-0.19pp) | 94.5% (+0.9pp) | 56.0% (+1.0pp) | 88.9% (-0.5pp) |
| Q6_K | 6.56 | 0.641 (-0.48pp) | 0.269 (+0.30pp) | 0.578 (-0.16pp) | 94.1% (+0.5pp) | 48.0% (-7.0pp) | 88.4% (-1.0pp) |
| Q5_K_M | 5.69 | 0.639 (-0.67pp) | 0.259 (-0.75pp) | 0.572 (-0.72pp) | 91.8% (-1.8pp) | 49.0% (-6.0pp) | 87.4% (-2.0pp) |
| Q4_K_M | 4.85 | 0.665 (+1.88pp) | 0.297 (+3.10pp) | 0.581 (+0.13pp) | 90.5% (-3.2pp) | 58.0% (+3.0pp) | 87.4% (-2.0pp) |
| Q3_K_S | 3.44 | 0.656 (+0.98pp) | 0.266 (-0.03pp) | 0.557 (-2.31pp) | 80.0% (-13.6pp) | 49.0% (-6.0pp) | 99.5% (+10.1pp) |
| Q2_K | 2.63 | 0.550 (-9.61pp) | 0.159 (-10.69pp) | 0.493 (-8.71pp) | 36.8% (-56.8pp) | 44.0% (-11.0pp) | 73.2% (-16.2pp) |
Observations.
- Quality metrics are remarkably stable from FP16 through Q4_K_M: BERTScore moves less than 1.9pp, coherence less than 0.7pp. The quality signal does not alarm until Q2_K, where all three metrics drop sharply (BERTScore -9.6pp, ROUGE-L -10.7pp, coherence -8.7pp). This stability creates a false floor -- practitioners monitoring quality alone see a reassuring plateau from 16 BPW down to 4.85 BPW, with no indication that safety has already begun to erode.
- Safety tells a different story. Refusal rate begins declining at Q5_K_M (-1.8pp), accelerates at Q3_K_S (-13.6pp), and collapses at Q2_K (-56.8pp). The safety degradation curve is steeper and activates earlier than the quality curve. In Cohen's h terms, the Q3_K_S refusal drop corresponds to h = 0.42 (medium effect), and Q2_K reaches h = 1.33 (very large), computed as h = 2|arcsin(sqrt(0.936)) - arcsin(sqrt(p_quant))|.
- Q3_K_S is the critical divergence point: quality metrics are within 2.3pp of FP16, but refusal rate has already dropped 13.6pp. A quality-only evaluation would miss this. The practical implication is that any deployment monitoring system that relies solely on BERTScore or coherence checks would certify Q3_K_S as acceptable when it has already lost meaningful safety alignment.
- The anomalous Q4_K_M quality improvement (+1.88pp BERTScore, +3.10pp ROUGE-L) likely reflects sampling variance at that quant level rather than a genuine capability gain. At N = 100-150, a 1.9pp shift in BERTScore is within the expected bootstrap CI range and should not be interpreted as evidence that 4-bit quantization enhances quality.
- Bias resistance shows a paradoxical pattern at Q3_K_S: it increases to 99.5% (+10.1pp from FP16, h = 0.52, medium effect). This is statistically significant (p_adj = 0.000374, SS9) and may reflect the model defaulting to refusal-like templates that happen to avoid bias markers while simultaneously failing on targeted refusal tasks.
- Truthfulness fluctuates between 44% and 58% across all quant levels with no monotonic trend. Given the MDE of 28pp at N = 50 (SS9), none of these fluctuations are interpretable as quant effects.
On llama3.2-1b, quality passes at Q3_K_S while safety fails -- monitoring quality alone creates a false sense of security. The refusal-rate cliff activates 1-2 quant steps before the quality cliff.
SS4.2 llama3.2-3b Quality-Safety Degradation
| Quant | BPW | BERTScore | ROUGE-L | Coherence | Refusal Rate | Truthfulness | Bias Resist. |
|---|---|---|---|---|---|---|---|
| FP16 | 16.00 | 0.767 | 0.469 | 0.661 | 76.4% | 49.0% | 96.5% |
| Q8_0 | 8.00 | 0.766 (-0.18pp) | 0.470 (+0.09pp) | 0.660 (-0.06pp) | 74.5% (-1.8pp) | 48.0% (-1.0pp) | 96.0% (-0.5pp) |
| Q6_K | 6.56 | 0.768 (+0.02pp) | 0.473 (+0.42pp) | 0.662 (+0.12pp) | 77.3% (+0.9pp) | 51.0% (+2.0pp) | 94.9% (-1.5pp) |
| Q5_K_M | 5.69 | 0.762 (-0.54pp) | 0.460 (-0.91pp) | 0.651 (-0.99pp) | 76.8% (+0.5pp) | 58.0% (+9.0pp) | 94.9% (-1.5pp) |
| Q4_K_M | 4.85 | 0.759 (-0.89pp) | 0.454 (-1.51pp) | 0.650 (-1.06pp) | 66.4% (-10.0pp) | 50.0% (+1.0pp) | 96.5% (0.0pp) |
| Q3_K_S | 3.44 | 0.728 (-3.92pp) | 0.432 (-3.72pp) | 0.573 (-8.75pp) | 95.0% (+18.6pp) | 52.0% (+3.0pp) | 94.4% (-2.0pp) |
| Q2_K | 2.63 | 0.765 (-0.20pp) | 0.433 (-3.54pp) | 0.621 (-3.99pp) | 92.7% (+16.4pp) | 54.0% (+5.0pp) | 78.8% (-17.7pp) |
Observations.
- Quality degradation on llama3.2-3b is more gradual than on 1b. BERTScore stays within 0.9pp of FP16 all the way down to Q4_K_M. Even at Q3_K_S, BERTScore only drops 3.9pp (vs. near-zero on 1b). This resilience likely reflects the 2.6x parameter advantage: the 3b model has more redundant capacity to absorb quantization noise before quality-relevant representations degrade.
- The safety story is inverted relative to 1b. Instead of declining, refusal rate increases at Q3_K_S (+18.6pp, h = 0.56) and Q2_K (+16.4pp, h = 0.47). The model becomes more conservative -- over-refusing rather than under-refusing. This is confirmed statistically in SS9 with significant negative Cohen's d values (d = -0.55 and -0.46, both p_adj < 4e-5). The mechanism may be instruction-following degradation: as quantization corrupts the model's ability to parse nuanced prompts, it defaults to refusal as a conservative fallback.
- Q4_K_M is the exception: refusal drops -10.0pp (h = 0.22) while quality barely moves (-0.89pp BERTScore). This is the only 3b cell showing the 1b-like pattern of "quality stable, safety drops." Although this comparison does not survive Holm-Bonferroni correction (p_adj = 0.576), the practical magnitude (-10pp refusal) is worth monitoring in deployment.
- Bias resistance holds steady (within 2.0pp) from FP16 through Q3_K_S, then collapses at Q2_K (-17.7pp, h = 0.58, d = 0.56, p_adj = 2.53e-6). This binary threshold effect -- stable until sudden collapse -- contrasts with the gradual quality degradation and suggests that bias resistance relies on a small set of weight configurations that remain intact until extreme quantization destroys them.
- The anomalous Q2_K BERTScore near-preservation (only -0.20pp) combined with coherence drop (-3.99pp) suggests the model produces semantically similar but less structured outputs at extreme quantization. BERTScore captures token-level embedding similarity, which may be preserved even when discourse-level coherence degrades -- a dissociation between local and global output quality.
- Truthfulness on llama3.2-3b trends upward at lower quants (49% at FP16 to 54% at Q2_K), but this 5pp shift is well within the 28pp MDE and cannot be distinguished from noise. The direction is suggestive but uninterpretable at this sample size.
On llama3.2-3b, aggressive quantization (Q3_K_S, Q2_K) makes the model over-refuse rather than under-refuse -- the opposite failure mode from 1b. Both models show quality-safety divergence, but in opposite directions.
SS5. Correlation Analysis
This section addresses Q12 directly: do quality and safety co-vary under quantization? Pearson correlations are computed on the six non-FP16 quant points for each model independently, plus a pooled (cross-model) analysis over 12 model-quant cells.
SS5.1 Within-Model Correlations
Table 3a: llama3.2-1b quality-safety Pearson r
| Quality Metric | Safety Metric | r | p | N | Significant? | |---------------|--------------|---|---|-------------| | BERTScore | Refusal Rate | +0.917 | 0.0101 | 6 | Yes | | BERTScore | Truthfulness | +0.708 | 0.116 | 6 | No | | BERTScore | Bias Resistance | +0.848 | 0.033 | 6 | Yes | | ROUGE-L | Refusal Rate | +0.934 | 0.0063 | 6 | Yes | | ROUGE-L | Truthfulness | +0.749 | 0.0864 | 6 | No | | ROUGE-L | Bias Resistance | +0.761 | 0.0789 | 6 | No | | Coherence | Refusal Rate | +0.994 | 5.77e-5 | 6 | Yes | | Coherence | Truthfulness | +0.707 | 0.116 | 6 | No | | Coherence | Bias Resistance | +0.672 | 0.144 | 6 | No |
Table 3b: llama3.2-3b quality-safety Pearson r
| Quality Metric | Safety Metric | r | p | N | Significant? | |---------------|--------------|---|---|-------------| | BERTScore | Refusal Rate | -0.530 | 0.280 | 6 | No | | BERTScore | Truthfulness | -0.007 | 0.989 | 6 | No | | BERTScore | Bias Resistance | -0.195 | 0.712 | 6 | No | | ROUGE-L | Refusal Rate | -0.761 | 0.079 | 6 | No | | ROUGE-L | Truthfulness | -0.285 | 0.585 | 6 | No | | ROUGE-L | Bias Resistance | +0.587 | 0.221 | 6 | No | | Coherence | Refusal Rate | -0.829 | 0.0413 | 6 | Yes | | Coherence | Truthfulness | -0.154 | 0.770 | 6 | No | | Coherence | Bias Resistance | +0.279 | 0.592 | 6 | No |
Table 3c: Pooled (cross-model) correlations (selected)
| Quality Metric | Safety Metric | r | p | N | Significant? |
|---|---|---|---|---|---|
| BERTScore | Refusal Rate | +0.648 | 0.0226 | 12 | Yes |
| BERTScore | Truthfulness | +0.454 | 0.139 | 12 | No |
| BERTScore | Bias Resistance | +0.562 | 0.0572 | 12 | No |
| ROUGE-L | Refusal Rate | +0.582 | 0.0471 | 12 | Yes |
| ROUGE-L | Bias Resistance | +0.679 | 0.0151 | 12 | Yes |
| Coherence | Refusal Rate | +0.292 | 0.356 | 12 | No |
| Coherence | Bias Resistance | +0.497 | 0.100 | 12 | No |
Observations.
- llama3.2-1b still shows uniformly positive quality-safety coupling. The strongest axis is Coherence x Refusal Rate (r = +0.994, p = 5.77e-5), with nominally significant positive refusal correlations for BERTScore and ROUGE-L as well. On the 1B model, lower quality and lower refusal tend to co-occur under aggressive quantization.
- llama3.2-3b shows the opposite directional pattern on refusal. All three refusal correlations are negative, and the strongest one -- Coherence x Refusal Rate (r = -0.829, p = 0.0413) -- remains nominally significant. This means lower quality can coincide with higher refusal on the 3B model.
- Pooled correlations (Table 3c) remain misleading because the 1B model's positive signal dominates the aggregate. The pooled BERTScore x Refusal correlation is +0.648 (p = 0.0226), which would suggest quality tracks safety if analyzed naively. That conclusion is wrong once the data are stratified by model.
- The coherence-refusal pair remains the clearest illustration of the paradox: r = +0.994 on 1B, r = -0.829 on 3B, and only r = +0.292 when pooled. The aggregate trend describes neither model accurately.
- Because each model contributes only six non-FP16 points, the correlation results should be read mainly by direction and effect size. The paradox does not depend on every nominally significant entry surviving a family-wise correction; it depends on the stable sign split between the two models.
- The operational implication is unchanged: any evaluation framework that pools across model sizes when assessing quality-safety coupling can produce incorrect deployment guidance.
The quality-safety correlation sign flips between models. Pooling across models produces a Simpson's paradox artifact. Always analyze per-model. Coherence is the strongest quality proxy for safety coupling in both directions.
SS6. Divergence and Asymmetry Analysis
Divergence analysis identifies cells where quality and safety move in different directions or at different rates. A cell is classified as: BOTH_STABLE (both deltas within +/-3pp), SAFETY_SHIFT_WITH_STABLE_QUALITY (quality stays within +/-3pp while safety shifts materially in either direction), MIXED_OR_BORDERLINE (one axis sits near the threshold or the direction split makes the simple labels misleading), or SAFETY_DEGRADES_FASTER (both move, but safety moves much more).
SS6.1 Divergence Classification
Table 4: Divergence types by model and quant
| Model | Quant | BPW | Quality Delta (pp) | Safety Delta (pp) | Classification |
|---|---|---|---|---|---|
| llama3.2-1b | Q8_0 | 8.00 | -0.15 | +0.91 | BOTH_STABLE |
| llama3.2-1b | Q6_K | 6.56 | -0.48 | +0.45 | BOTH_STABLE |
| llama3.2-1b | Q5_K_M | 5.69 | -0.67 | -1.82 | BOTH_STABLE |
| llama3.2-1b | Q4_K_M | 4.85 | +1.88 | -3.18 | MIXED_OR_BORDERLINE |
| llama3.2-1b | Q3_K_S | 3.44 | +0.98 | -13.64 | SAFETY_SHIFT_WITH_STABLE_QUALITY |
| llama3.2-1b | Q2_K | 2.63 | -9.61 | -56.82 | SAFETY_DEGRADES_FASTER |
| llama3.2-3b | Q8_0 | 8.00 | -0.18 | -1.82 | BOTH_STABLE |
| llama3.2-3b | Q6_K | 6.56 | +0.02 | +0.91 | BOTH_STABLE |
| llama3.2-3b | Q5_K_M | 5.69 | -0.54 | +0.45 | BOTH_STABLE |
| llama3.2-3b | Q4_K_M | 4.85 | -0.89 | -10.00 | SAFETY_SHIFT_WITH_STABLE_QUALITY |
| llama3.2-3b | Q3_K_S | 3.44 | -3.92 | +18.64 | MIXED_OR_BORDERLINE |
| llama3.2-3b | Q2_K | 2.63 | -0.20 | +16.36 | SAFETY_SHIFT_WITH_STABLE_QUALITY |
Observations.
- 6 of 12 cells are BOTH_STABLE at the +/-3pp threshold, all in the Q8_0 through Q5_K_M range. SS10 still treats this as an observed low-delta regime rather than a formal equivalence proof.
- The divergence regime activates at different quant levels per model: Q3_K_S on llama3.2-1b, Q4_K_M on llama3.2-3b. Counterintuitively, the larger model (3b) hits safety divergence one step earlier on the quant ladder. This may reflect the 3b model's higher FP16 safety ceiling (76.4% refusal vs 93.6%) giving it more room to shift before recovery mechanisms activate.
- llama3.2-1b Q3_K_S is the highest-priority finding: quality delta is +0.98pp (negligible) while safety delta is -13.64pp (large, h = 0.42). No quality metric would flag this cell. The danger is specific and actionable: practitioners who benchmark at Q3_K_S using BERTScore or coherence will see passing scores and deploy a model that has already lost meaningful refusal capability.
- llama3.2-3b's divergence at Q3_K_S and Q2_K is in the opposite direction -- refusal increases by +18.6pp and +16.4pp respectively while quality degrades. This is still a dangerous divergence, but it is an over-refusal shift rather than under-refusal.
- The key deployment-relevant cells are therefore 1B Q3_K_S (hidden under-refusal), 3B Q4_K_M (quality-stable safety drop), and 3B Q2_K (quality-stable over-refusal), with 1B Q2_K standing out as the full collapse regime.
The report now distinguishes quality-stable safety shifts from mixed borderline cells. The most important cases remain the same: 1B Q3_K_S hides a large refusal drop, 3B Q4_K_M hides an earlier refusal drop, and 3B Q2_K shows a quality-stable over-refusal shift.
SS6.2 Bootstrap Asymmetry Gap
The asymmetry gap is (safety_delta - quality_delta) averaged across all quant levels for each model. A negative gap means safety degrades more than quality.
| Model | Mean Gap (pp) | 95% CI Low | 95% CI High | N | Significant? |
|---|---|---|---|---|---|
| llama3.2-1b | -11.01 | -26.73 | -0.39 | 6 | Yes (CI excludes 0) |
| llama3.2-3b | +5.04 | -3.27 | +14.28 | 6 | No (CI includes 0) |
Observations.
- On llama3.2-1b, safety degrades 11.0pp more than quality on average, with the CI excluding zero [-26.7, -0.4]. This is a confirmed asymmetry: across all 6 non-FP16 quant levels, safety systematically moves more than quality in the degradation direction. The wide CI reflects the Q2_K outlier (safety delta -56.8pp vs quality delta -9.6pp), which pulls the mean gap down and inflates variance.
- On llama3.2-3b, the gap is positive (+5.0pp) but the CI includes zero [-3.3, +14.3]. Safety moves more than quality on average, but in the direction of increased refusal, and the effect is not significant. The positive gap reflects the over-refusal at Q3_K_S and Q2_K, which increases the safety metric while quality decreases. If we define "safety degradation" as any movement away from FP16 behavior (including over-refusal), then both models show confirmed asymmetry.
- The bootstrap was computed using B = 2,000 iterations with seed = 42 and the percentile method. The percentile method is appropriate here because the asymmetry gap distribution is not strongly skewed (verified by the approximate symmetry of the CIs around the mean).
SS7. Asymmetry Index
The asymmetry ratio is |safety_delta| / |quality_delta| for each cell. Values above 1.0 mean safety moves more than quality.
Table 5: Asymmetry ratios
| Model | Quant | BPW | Quality Delta (pp) | Safety Delta (pp) | Ratio | Safety Faster? |
|---|---|---|---|---|---|---|
| llama3.2-1b | Q8_0 | 8.00 | -0.15 | +0.91 | 6.0 | Yes |
| llama3.2-1b | Q6_K | 6.56 | -0.48 | +0.45 | 0.9 | No |
| llama3.2-1b | Q5_K_M | 5.69 | -0.67 | -1.82 | 2.7 | Yes |
| llama3.2-1b | Q4_K_M | 4.85 | +1.88 | -3.18 | 1.7 | Yes |
| llama3.2-1b | Q3_K_S | 3.44 | +0.98 | -13.64 | 13.9 | Yes |
| llama3.2-1b | Q2_K | 2.63 | -9.61 | -56.82 | 5.9 | Yes |
| llama3.2-3b | Q8_0 | 8.00 | -0.18 | -1.82 | 10.0 | Yes |
| llama3.2-3b | Q6_K | 6.56 | +0.02 | +0.91 | 55.2 | Yes |
| llama3.2-3b | Q5_K_M | 5.69 | -0.54 | +0.45 | 0.8 | No |
| llama3.2-3b | Q4_K_M | 4.85 | -0.89 | -10.00 | 11.2 | Yes |
| llama3.2-3b | Q3_K_S | 3.44 | -3.92 | +18.64 | 4.8 | Yes |
| llama3.2-3b | Q2_K | 2.63 | -0.20 | +16.36 | 83.2 | Yes |
Observations.
- Safety moves more than quality in 10 of 12 cells (ratio > 1.0). The two exceptions (Q6_K on 1b at 0.9x, Q5_K_M on 3b at 0.8x) have ratios essentially at parity. This near-universality of safety-over-quality asymmetry is the most robust finding in the report: it holds across both models, across both directions of safety change (degradation on 1b, over-refusal on 3b), and across the full quant range.
- The highest ratios occur at the extremes of the quant ladder and are dominated by the denominator effect. llama3.2-3b Q2_K has a ratio of 83.2x because quality barely moved (-0.20pp BERTScore) while refusal shifted +16.4pp. The ratio is meaningful in context -- it quantifies how much safety can move while quality appears stable -- but the absolute deltas in SS4 and SS6 provide more actionable information.
- The Q6_K through Q5_K_M "stability zone" shows ratios close to 1.0 or below, confirming that moderate quantization affects quality and safety roughly equally. In SS10 this is treated as an observed low-delta regime rather than a formally proven equivalence range. Within this zone, quality monitoring provides a reasonable (though not perfect) proxy for safety status.
- llama3.2-1b Q3_K_S (ratio 13.9x) is the most deployment-relevant divergence cell because it falls within a range practitioners might actually use. At 3.44 BPW, Q3_K_S offers attractive memory savings (~4.6x compression vs FP16), and the quality metrics give no warning of the safety degradation. This is the cell that motivates the core recommendation: monitor safety independently below Q5_K_M.
- The asymmetry pattern suggests that safety alignment occupies a more fragile subspace than quality capability. Quality, being distributed across many parameters for general language modeling, degrades gradually as bits are removed. Safety, being concentrated in the fine-tuning delta, is more sensitive to the specific bit allocations that quantization schemes choose, leading to sharper transitions.
Safety metrics are more sensitive to quantization than quality metrics across nearly all cells (10/12). The asymmetry is most dangerous in the Q3_K_S-Q4_K_M range where quality still looks acceptable but safety has already shifted.
SS8. Quality-Gated Safety Scores
Quality-gating filters out samples that fail a coherence/length threshold before computing safety scores. This tests whether removing low-quality outputs reveals a different quantization safety story.
SS8.1 Filter Rates
| Model | Safety Metric | N Total | N Filtered Range | Filter Rate Range |
|---|---|---|---|---|
| llama3.2-1b | Refusal Rate | 220 | 16-76 | 7.3%-34.5% |
| llama3.2-1b | Truthfulness | 50 | 2-5 | 4.0%-10.0% |
| llama3.2-1b | Bias Resistance | 198 | 0 | 0.0% |
| llama3.2-3b | Refusal Rate | 220 | 13-40 | 5.9%-18.2% |
| llama3.2-3b | Truthfulness | 50 | 5-12 | 10.0%-24.0% |
| llama3.2-3b | Bias Resistance | 198 | 0 | 0.0% |
The filter rate is not identical across quant levels. It rises sharply in the low-bit cliff, especially for 1B refusal samples, where the filtered share increases from 7.7% at FP16 to 29.1% at Q3_K_S and 34.5% at Q2_K.
SS8.2 Gated vs Raw Safety Scores (llama3.2-1b refusal rate)
| Quant | Raw Refusal | Gated Refusal | Delta (pp) |
|---|---|---|---|
| FP16 | 93.6% | 86.8% | -6.8 |
| Q8_0 | 94.5% | 87.3% | -7.3 |
| Q6_K | 94.1% | 86.4% | -7.7 |
| Q5_K_M | 91.8% | 83.6% | -8.2 |
| Q4_K_M | 90.5% | 86.8% | -3.6 |
| Q3_K_S | 80.0% | 65.0% | -15.0 |
| Q2_K | 36.8% | 24.1% | -12.7 |
Observations.
- Quality-gating lowers absolute safety scores by removing visibly broken generations, and the effect is strongest exactly where quantization damage is strongest. On 1B refusal, the gate removes 15.0pp at Q3_K_S and 12.7pp at Q2_K, much more than the 3.6pp to 8.2pp shifts seen across FP16 through Q5_K_M.
- The varying filter rate means the quality gate is responsive to quantization artifacts rather than merely catching a fixed set of difficult prompts. This is especially clear on 1B refusal, where the filtered share jumps from single digits at FP16/Q8_0/Q6_K to 29.1%-34.5% at Q3_K_S/Q2_K.
- The bias_resistance filter rate of 0% means coherence failures and bias failures occur on completely disjoint sample sets. Low coherence does not predict bias failures. This orthogonality has a positive practical implication: if a deployment pipeline already includes quality gating for other reasons, it will not inadvertently bias the safety evaluation by selectively removing bias-vulnerable samples.
- What the gate does not do is restore a reliable quality proxy. Relative to gated FP16 baselines, 1B refusal is still down by 21.8pp at Q3_K_S and 62.7pp at Q2_K.
- This answers Q13 more precisely than the earlier draft: a simple quality gate helps remove degenerate generations, but it does not repair the hidden-danger regime or remove the need for direct safety measurement.
SS8.3 Gated vs Raw Safety Scores (llama3.2-3b refusal rate)
| Quant | Raw Refusal | Gated Refusal | Delta (pp) |
|---|---|---|---|
| FP16 | 76.4% | 70.0% | -6.4 |
| Q8_0 | 74.5% | 69.5% | -5.0 |
| Q6_K | 77.3% | 71.8% | -5.5 |
| Q5_K_M | 76.8% | 70.5% | -6.4 |
| Q4_K_M | 66.4% | 58.6% | -7.7 |
| Q3_K_S | 95.0% | 87.7% | -7.3 |
| Q2_K | 92.7% | 79.1% | -13.6 |
Observations.
- The pattern on 3B is different but points to the same conclusion. Filter rates stay relatively modest through Q4_K_M (5.9%-10.0%) and then rise in the over-refusal regime, reaching 18.2% at Q2_K.
- The gated Q3_K_S refusal rate (87.7%) is still far above the gated FP16 rate (70.0%), and gated Q2_K remains elevated at 79.1%. The over-refusal regime therefore survives quality-gating rather than being explained away by incoherent-output subsets.
Quality-gating is responsive but not sufficient. The gate becomes more active in the low-bit cliff, but the gated scores preserve the same hidden-danger and over-refusal structure. Use it as a cleanup step if needed, not as a replacement for direct safety evaluation.
SS9. Pairwise Statistical Tests
All tests compare each quant level against FP16 for the same model and metric. Holm-Bonferroni correction is applied across all 36 tests (2 models x 3 metrics x 6 quant levels).
SS9.1 Significant Results (p_adj < 0.05 after Holm-Bonferroni)
Since safety metrics are proportions (not continuous scores), Cohen's h is reported alongside Cohen's d for interpretive completeness. Cohen's h = 2|arcsin(sqrt(p1)) - arcsin(sqrt(p2))| is the appropriate effect size for proportion comparisons.
| Model | Metric | Quant | FP16 Mean | Quant Mean | Cohen's d | Cohen's h | t | p_adj | Direction |
|---|---|---|---|---|---|---|---|---|---|
| llama3.2-1b | Refusal Rate | Q3_K_S | 93.6% | 80.0% | 0.41 | 0.42 | 4.31 | 0.000663 | Degraded |
| llama3.2-1b | Refusal Rate | Q2_K | 93.6% | 36.8% | 1.48 | 1.33 | 15.55 | 1.37e-39 | Degraded |
| llama3.2-1b | Bias Resist. | Q3_K_S | 89.4% | 99.5% | -0.45 | 0.52 | -4.49 | 0.000374 | Improved |
| llama3.2-1b | Bias Resist. | Q2_K | 89.4% | 73.2% | 0.42 | 0.42 | 4.21 | 0.00099 | Degraded |
| llama3.2-3b | Refusal Rate | Q3_K_S | 76.4% | 95.0% | -0.55 | 0.56 | -5.78 | 6.24e-7 | Improved (over-refusing) |
| llama3.2-3b | Refusal Rate | Q2_K | 76.4% | 92.7% | -0.46 | 0.47 | -4.86 | 5.69e-5 | Improved (over-refusing) |
| llama3.2-3b | Bias Resist. | Q2_K | 96.5% | 78.8% | 0.56 | 0.58 | 5.53 | 2.53e-6 | Degraded |
SS9.2 Non-Significant Results (all truthfulness tests)
All 12 truthfulness comparisons (2 models x 6 quants) are non-significant after correction (all p_adj = 1.0). The largest observed effect is d = 0.26 (llama3.2-1b Q2_K, delta = -11pp), well below the MDE of 28pp at N = 50.
SS9.3 Near-Miss: llama3.2-3b Refusal at Q4_K_M
One comparison warrants special attention despite non-significance: llama3.2-3b refusal rate at Q4_K_M drops from 76.4% to 66.4% (delta = -10.0pp, d = 0.22, h = 0.22, p_raw = 0.0203, p_adj = 0.589). The raw p-value is nominally significant, but Holm-Bonferroni correction pushes it well above threshold. The practical magnitude (-10pp) exceeds the +/-3pp equivalence margin by 3x and would still be operationally meaningful. This cell falls in the statistical no-man's-land: too large to treat casually, too variable to survive multiplicity correction.
Observations.
- Only 7 of 36 tests survive Holm-Bonferroni correction. Quantization effects on safety are concentrated at the extreme quant levels (Q3_K_S and Q2_K) and primarily affect refusal rate and bias resistance. The moderate-quantization range (Q8_0 through Q5_K_M) produces no significant safety effects on any metric.
- llama3.2-1b Q2_K refusal has the largest effect size in the entire analysis: d = 1.48 / h = 1.33 (very large by any standard). The refusal rate drops from 93.6% to 36.8% -- a catastrophic safety failure that would likely be caught by any monitoring system but illustrates the magnitude of extreme-quantization damage.
- llama3.2-1b Q3_K_S shows a within-cell paradox: refusal rate degrades (d = 0.41, h = 0.42) while bias resistance improves (d = -0.45, h = 0.52). Different safety dimensions respond differently to the same quantization level. This dissociation suggests that the "safety subspace" is not monolithic -- refusal and bias resistance may be supported by partially independent weight configurations.
- llama3.2-3b Q3_K_S and Q2_K both show improved refusal rates (negative d), confirming the over-refusal pattern from SS4.2. The effect sizes (d = -0.55 / h = 0.56 and d = -0.46 / h = 0.47) are medium, indicating reliable behavioral shifts rather than noise. The model becomes more conservative, not less safe, but this reduces helpfulness and may push users toward prompt-engineering workarounds that bypass safety.
- Cohen's d and Cohen's h show broad agreement across all 7 significant comparisons, with most pairs within 0.10 of each other. The largest divergence occurs at 1b Q2_K (d = 1.48 vs h = 1.33, gap = 0.15), where the extreme base-rate shift causes the arcsin transform to compress the effect relative to the mean-based estimator. Despite this, both metrics classify all comparisons into the same qualitative band (small/medium/large), confirming that the choice between proportion-based and mean-based effect sizes does not materially alter interpretation for these binary safety metrics.
- Truthfulness is completely underpowered at N = 50. We cannot make any claims about quantization effects on truthfulness from this data. The 12 non-significant truthfulness tests should be interpreted as "no evidence either way," not as "no effect."
Statistically confirmed safety effects are limited to Q3_K_S and Q2_K. The Q8_0 through Q5_K_M range shows no significant degradation on any safety metric for either model. The refusal-bias dissociation on llama3.2-1b Q3_K_S reveals that "safety" is not a single dimension.
SS10. Practical Stability Zone and Conservative Floor
SS9 answers whether specific quant cells differ significantly from FP16 after correction. This section asks a narrower operational question: where do the observed deltas remain small enough that a deployment team can treat the cell as part of a low-delta stability zone, even though the current saved artifact does not support a strong standalone formal equivalence claim.
The key constraint is methodological. The current saved artifact does not include a standalone tost_results object, and the appendix-level raw one-sided tests shown later in the report do not support a clean per-cell equivalence claim under the present sample sizes. SS10 therefore uses observed deltas, corrected pairwise tests, and the broader evidence stack to identify a conservative floor rather than a theorem-like equivalence boundary.
SS10.1 Refusal-Rate Stability Screen
| Model | Quant | Refusal Delta vs FP16 | Corrected Pairwise Result | Interpretation |
|---|---|---|---|---|
| llama3.2-1b | Q8_0 | +0.9pp | not significant | low-delta stability zone |
| llama3.2-1b | Q6_K | +0.5pp | not significant | low-delta stability zone |
| llama3.2-1b | Q5_K_M | -1.8pp | not significant | low-delta stability zone |
| llama3.2-1b | Q4_K_M | -3.2pp | not significant | ambiguous boundary |
| llama3.2-1b | Q3_K_S | -13.6pp | significant | clear degradation |
| llama3.2-1b | Q2_K | -56.8pp | significant | collapse |
| llama3.2-3b | Q8_0 | -1.8pp | not significant | low-delta stability zone |
| llama3.2-3b | Q6_K | +0.9pp | not significant | low-delta stability zone |
| llama3.2-3b | Q5_K_M | +0.5pp | not significant | low-delta stability zone |
| llama3.2-3b | Q4_K_M | -10.0pp | not significant after Holm | ambiguous but operationally concerning |
| llama3.2-3b | Q3_K_S | +18.6pp | significant | over-refusal regime |
| llama3.2-3b | Q2_K | +16.4pp | significant | over-refusal regime |
SS10.2 What Follows
- Q8_0 through Q5_K_M form the observed low-delta stability zone on refusal rate. That is why
Q5_K_Mremains the conservative floor in this report. - Q4_K_M is the true boundary cell. It is too large to treat casually, but not cleanly resolved by the corrected pairwise tests.
- Q3_K_S and Q2_K are where the two models clearly diverge from the FP16 regime, albeit in different directions (
llama3.2-1bunder-refusal,llama3.2-3bover-refusal). - Bias resistance is at least as stable as refusal through the moderate-quantization range, but TR142 still does not claim a separate formal equivalence theorem for that metric.
- Truthfulness remains underpowered and does not support either equivalence or null-effect claims.
SS10.3 Conservative Deployment Interpretation
The practical recommendation is therefore narrower than the earlier draft implied:
TR142 supports
Q5_K_Mas the conservative deployment floor because observed refusal deltas stay small and corrected pairwise tests stay quiet through that level. It does not by itself prove strict formal +/-3pp equivalence throughQ5_K_M.
If a deployment requires a theorem-like equivalence claim, this report is not sufficient. If the goal is a conservative operating recommendation grounded in the current evidence stack, Q5_K_M is still the right floor and Q4_K_M remains the ambiguity zone that demands per-model validation.
SS11. BPW Regression
Linear regression of metric value against bits-per-weight tests whether quantization level is a useful continuous predictor of quality or safety.
SS11.1 Regression Results
| Model | Metric | Slope (per BPW) | Intercept | R-squared |
|---|---|---|---|---|
| llama3.2-1b | BERTScore | +0.0028 | 0.616 | 0.10 |
| llama3.2-1b | Coherence | +0.0036 | 0.538 | 0.26 |
| llama3.2-1b | Refusal Rate | +0.0240 | 0.669 | 0.26 |
| llama3.2-1b | Truthfulness | +0.0057 | 0.475 | 0.25 |
| llama3.2-1b | Bias Resistance | +0.0032 | 0.856 | 0.04 |
| llama3.2-3b | BERTScore | +0.0013 | 0.750 | 0.17 |
| llama3.2-3b | Coherence | +0.0040 | 0.613 | 0.30 |
| llama3.2-3b | Refusal Rate | -0.0094 | 0.862 | 0.17 |
| llama3.2-3b | Truthfulness | -0.0036 | 0.542 | 0.23 |
| llama3.2-3b | Bias Resistance | +0.0067 | 0.886 | 0.22 |
Observations.
- No R-squared exceeds 0.30. BPW is a weak linear predictor of both quality and safety. The degradation is non-linear -- most of the variance comes from the Q2_K/Q3_K_S collapse, not from a smooth linear trend. If one excludes Q2_K from the regression (leave-one-out analysis, Appendix C), R-squared drops further for most metrics, confirming that the extreme point drives even the modest linear fit.
- The refusal rate slope flips sign between models: +0.024/BPW on llama3.2-1b (more bits = higher refusal = safer) vs. -0.009/BPW on llama3.2-3b (more bits = lower refusal). This is the regression counterpart to SS5's correlation finding and reinforces the Simpson's paradox: pooling these two slopes would produce a misleading average.
- Coherence has the highest R-squared for both models (0.26 and 0.30), making it the metric most linearly responsive to quantization pressure. This aligns with SS5's finding that coherence x refusal has the strongest correlation. However, even R-squared = 0.30 means 70% of variance is unexplained -- coherence responds to quantization, but not smoothly enough to predict specific quant-level outcomes from BPW alone.
- Bias resistance on llama3.2-1b has R-squared = 0.04 -- essentially no linear relationship with BPW. Bias resistance appears to be a threshold effect that only activates at extreme quantization. This makes it unsuitable for linear regression modeling; a logistic or step-function model would better capture its all-or-nothing behavior.
- The low R-squared values have a direct practical implication: practitioners cannot use BPW as a shorthand for expected quality or safety. A model at 4.85 BPW (Q4_K_M) does not produce output that is "4.85/16 as good" as FP16 -- the relationship is non-monotonic and threshold-driven. Evaluation must be done per-quant-level, not interpolated from BPW.
SS11.2 Slope Sign Comparison
The slope sign provides a compact summary of the BPW-metric relationship across models:
| Metric | llama3.2-1b Slope | llama3.2-3b Slope | Same Sign? |
|---|---|---|---|
| BERTScore | +0.0028 | +0.0013 | Yes (both positive) |
| Coherence | +0.0036 | +0.0040 | Yes (both positive) |
| Refusal Rate | +0.0240 | -0.0094 | No (opposite) |
| Truthfulness | +0.0057 | -0.0036 | No (opposite) |
| Bias Resistance | +0.0032 | +0.0067 | Yes (both positive) |
Observations.
- Quality metrics (BERTScore, coherence) have consistent positive slopes on both models: more bits per weight = higher quality. This is the expected direction and confirms that the BPW-quality relationship, while weak (low R-squared), is at least directionally consistent.
- Safety metrics (refusal rate, truthfulness) have opposite slopes on the two models, confirming the Simpson's paradox from SS5 in a regression framework. Pooling these slopes would produce a near-zero average that masks the strong within-model relationships.
- Bias resistance has consistent positive slopes, suggesting that the BPW-bias relationship is not model-dependent, unlike the BPW-refusal relationship. This aligns with the variance decomposition in SS13.5, which found low interaction effects for bias resistance.
Linear BPW models are poor predictors (all R-squared < 0.30). Quantization effects are driven by threshold collapses at Q3_K_S and Q2_K, not gradual degradation. Evaluate per-quant-level, do not interpolate from BPW.
SS12. Capability Consistency Validation
To validate that the quality-safety matrix reflects genuine capability differences (not noise), we check degradation on the shared benchmark tasks (MMLU, ARC Challenge) that appeared in both TR125p2 and TR134p3.
SS12.1 MMLU Real Accuracy
| Model | FP16 | Q8_0 | Q6_K | Q5_K_M | Q4_K_M | Q3_K_S | Q2_K |
|---|---|---|---|---|---|---|---|
| llama3.2-1b | 34.0% | 35.1% (+1.1) | 34.4% (+0.4) | 33.0% (-1.1) | 33.7% (-0.4) | 31.9% (-2.1) | 19.3% (-14.7) |
| llama3.2-3b | 58.9% | 59.3% (+0.4) | 57.9% (-1.1) | 57.5% (-1.4) | 58.9% (0.0) | 48.1% (-10.9) | 42.8% (-16.1) |
SS12.2 ARC Challenge Accuracy
| Model | FP16 | Q8_0 | Q6_K | Q5_K_M | Q4_K_M | Q3_K_S | Q2_K |
|---|---|---|---|---|---|---|---|
| llama3.2-1b | 44.5% | 46.0% (+1.5) | 44.0% (-0.5) | 46.5% (+2.0) | 39.5% (-5.0) | 24.5% (-20.0) | 26.5% (-18.0) |
| llama3.2-3b | 71.5% | 72.0% (+0.5) | 71.5% (0.0) | 71.5% (0.0) | 70.5% (-1.0) | 62.5% (-9.0) | 63.0% (-8.5) |
Observations.
- Both benchmark tasks confirm the pattern from SS4: stability through Q5_K_M, then sharp drops at Q3_K_S and Q2_K. The concordance across three independent metric families (generation quality, safety alignment, multiple-choice accuracy) strengthens the claim that the Q3_K_S/Q2_K cliff reflects genuine capability loss rather than metric-specific noise.
- MMLU degradation on llama3.2-1b at Q2_K (-14.7pp) closely matches the BERTScore degradation (-9.6pp) and coherence degradation (-8.7pp) at the same level, confirming that Q2_K represents genuine capability collapse across all evaluation dimensions. The refusal rate collapse (-56.8pp) at Q2_K exceeds the quality collapse by 4-6x, consistent with the asymmetry finding in SS7.
- ARC Challenge on llama3.2-1b shows an earlier cliff: -20.0pp at Q3_K_S. This aligns with the refusal rate finding (-13.6pp at Q3_K_S) and further supports Q3_K_S as a critical threshold on the 1b model. ARC Challenge may be more sensitive than MMLU to quantization because it requires longer reasoning chains that are more vulnerable to accumulated rounding errors.
- llama3.2-3b is more robust on both tasks, with drops staying under 2pp through Q4_K_M and not exceeding 16pp even at Q2_K. The 3b model's MMLU at Q3_K_S (-10.9pp) indicates real capability loss despite the paradoxical refusal increase at the same quant level -- the model is refusing more often but performing worse on the tasks where it does respond.
- The ARC Challenge result for llama3.2-3b at Q2_K (-8.5pp) is notably less severe than Q3_K_S (-9.0pp), mirroring the BERTScore anomaly at Q2_K (only -0.20pp). This non-monotonicity at extreme quantization on the 3b model warrants investigation in future studies; it may reflect Q2_K quantization allocating bits differently than Q3_K_S in ways that preserve certain capability types.
SS12.3 Cross-Validation with Refusal Patterns
The capability consistency data also enables a secondary validation: checking whether MMLU/ARC degradation patterns match the refusal rate patterns.
| Model | Q3_K_S MMLU Drop | Q3_K_S Refusal Change | Pattern Match? |
|---|---|---|---|
| llama3.2-1b | -2.1pp | -13.6pp | No -- refusal drops 6.5x more |
| llama3.2-3b | -10.9pp | +18.6pp | No -- opposite directions |
| Model | Q2_K MMLU Drop | Q2_K Refusal Change | Pattern Match? |
|---|---|---|---|
| llama3.2-1b | -14.7pp | -56.8pp | Partial -- same direction, 3.9x amplification |
| llama3.2-3b | -16.1pp | +16.4pp | No -- opposite directions |
Observations.
- At Q3_K_S, MMLU accuracy and refusal rate tell completely different stories on both models. On 1b, MMLU drops modestly while refusal drops sharply. On 3b, MMLU drops sharply while refusal increases. This confirms that accuracy benchmarks and safety metrics are measuring different capabilities, and that the divergence is not an artifact of the metric definitions used in SS4.
- At Q2_K on llama3.2-1b, both metrics degrade, but refusal drops 3.9x more than MMLU accuracy. This asymmetry is consistent with the SS7 finding (asymmetry ratio 5.9x) and provides independent triangulation from a different metric pair.
Shared benchmark tasks validate the quality-safety matrix. The Q3_K_S/Q2_K cliff is real and consistent across quality, safety, and accuracy metrics on llama3.2-1b. llama3.2-3b shows quality capability loss at Q3_K_S despite increased refusal.
SS13. Statistical Synthesis
SS13.1 Hypothesis Evaluation
| Hypothesis | Test | Result | Status |
|---|---|---|---|
| H1: Quality and safety co-vary under quantization | Pearson r within-model | 1b: r = +0.92 to +0.99; 3b: r = -0.53 to -0.83 | Partially supported (model-dependent) |
| H2: Safety degrades faster than quality | Asymmetry ratio > 1.0 | 10/12 cells have ratio > 1.0 | Supported |
| H3: Quality-gating changes the quant safety story | Gated vs raw delta comparison | Gate becomes more active at aggressive quants but preserves the same danger structure | Rejected (no rescue) |
| H4: BPW is a linear predictor of degradation | OLS regression R-squared | All R-squared < 0.30 | Rejected (non-linear effects dominate) |
| H5: Q8_0-Q5_K_M define the conservative safety floor | small observed deltas plus non-significant corrected pairwise tests | supported as a practical floor, not as formal equivalence | Partially supported |
Observations.
- H1 is the most nuanced finding: quality-safety co-variation exists, but its sign is model-dependent. This is not a weak version of a positive claim -- it is a fundamentally different claim. The relationship is strong within each model (|r| > 0.82 for coherence x refusal on both) but opposite in direction. Any theory of quality-safety coupling under quantization must explain why the coupling reverses between 1.2B and 3.2B parameters.
- H2 is the most actionable finding: safety moves more than quality in 10/12 cells. This means quality monitoring systematically underestimates safety risk. The asymmetry is not uniform -- it concentrates at Q3_K_S and below -- but it is consistent enough to justify the recommendation that safety must be monitored independently.
- H3 and H4 are still negative results, but H3 is narrower than the earlier draft implied. The quality gate is not invariant; it becomes more active at aggressive quants. What fails is the stronger hope that gating would rescue quality-only safety screening. The BPW regression null result (H4) rules out a convenient shortcut: you cannot predict safety from BPW using a linear model.
- H5 should now be read more narrowly. SS10 supports Q5_K_M as the conservative deployment floor because refusal deltas remain small and corrected pairwise tests stay non-significant through that level, not because TR142 alone establishes strict formal equivalence.
SS13.2 Cross-Model Synthesis
The central finding of TR142 is that the quality-safety relationship under quantization is not a universal law -- it is model-specific. This has three practical implications:
- No universal quality-safety proxy exists. A quality check that catches safety degradation on one model may miss it (or give the wrong signal) on another. The coherence x refusal correlation is r = +0.994 on llama3.2-1b and r = -0.829 on llama3.2-3b. Using coherence as a safety proxy on 1b would work well; using it on 3b would produce the wrong signal.
- The failure mode depends on model size. Small models (1b) lose refusal alignment when quantized aggressively. Larger models (3b) shift toward higher refusal even as quality and benchmark accuracy deteriorate. Both are problematic, but they require different mitigations: 1b needs stronger safety validation, while 3b needs direct checks that higher refusal is not masking degraded overall behavior.
- Evaluation frameworks must be per-model. The Simpson's paradox finding means that any cross-model safety benchmark that averages or pools results across model sizes will produce misleading conclusions. This applies not just to quantization studies but to any evaluation where model size confounds the quality-safety relationship.
SS13.3 Theoretical Framework: Why Quality and Safety Decouple
The quality-safety decoupling observed in TR142 can be explained through a weight-subspace model of quantization damage. In a transformer, "quality" (generating coherent, relevant text) and "safety" (recognizing harmful prompts and refusing) rely on overlapping but partially independent weight subspaces:
- Shared subspace: General language modeling capability (token prediction, coherence) serves both quality and safety. Quantization damage here degrades both dimensions simultaneously, producing the positive correlation observed on llama3.2-1b.
- Safety-specific subspace: Instruction-following and alignment fine-tuning occupy weights that are redundant from a quality perspective. These weights are disproportionately vulnerable to quantization because they represent relatively small adjustments on top of the base model, and small adjustments are more easily destroyed by rounding.
- Model-size interaction: On small models (1b), the shared subspace is large relative to the safety-specific subspace -- there is less redundancy, so quantization damage that hits quality also hits safety. On larger models (3b), the safety-specific subspace is larger and more distributed, making it partially independent of quality. When quality degrades, the safety pathway may still activate, but with corrupted inputs, it defaults to conservative refusal.
This framework explains three specific findings:
- The positive correlation on 1b (quality and safety share more weights)
- The negative correlation on 3b (safety pathway defaults to refusal when quality-feeding inputs degrade)
- The asymmetry (safety-specific weights are more vulnerable because they represent fine-tuning deltas, which are small in magnitude and easily rounded away)
SS13.4 Broader Implications for LLM Evaluation
The Simpson's paradox finding in SS5 has implications beyond quantization studies:
For safety benchmarking: Any benchmark that aggregates safety scores across model sizes risks producing the same paradox. If small models show quality-safety coupling and large models show decoupling, the aggregate will reflect the dominant size class in the sample, not a universal relationship. Safety benchmarks should always report per-model-size results.
For deployment monitoring: SS8 shows that a simple quality gate reacts to low-bit degeneration, but that still does not make it a reliable safety proxy. Monitoring systems should use direct safety probes rather than quality-derived proxies. The gate removes more visibly broken outputs at aggressive quants, yet the hidden-danger and over-refusal regimes remain after gating.
For quantization research: The field's focus on perplexity and downstream accuracy as quantization quality metrics misses the safety dimension entirely. TR142 shows that quality can be preserved while safety degrades (llama3.2-1b Q3_K_S) or quality can degrade while safety tightens (llama3.2-3b Q3_K_S). Neither outcome is captured by standard quantization benchmarks. Safety-aware quantization evaluation should become standard practice.
SS13.5 Variance Decomposition
The total variance in safety metrics across the 14-cell matrix can be decomposed into three sources: model effects (1b vs 3b), quant effects (FP16 through Q2_K), and model x quant interaction effects.
| Safety Metric | Model Effect | Quant Effect | Interaction Effect | Residual |
|---|---|---|---|---|
| Refusal Rate | Moderate (1b higher baseline) | Strong (Q2_K/Q3_K_S) | Dominant (opposite signs) | Small |
| Bias Resistance | Weak (similar baselines) | Weak (Q2_K only) | Weak | Moderate |
| Truthfulness | Negligible | Negligible | Negligible | Dominant |
Observations.
- The interaction effect dominates for refusal rate -- the same quant level produces opposite safety shifts on different models. This is the statistical fingerprint of the Simpson's paradox: the main effects (model, quant) are less important than their interaction. Any analysis that models refusal rate as a function of quant level alone (ignoring model) will produce misleading results.
- Bias resistance variance is concentrated in a single cell (Q2_K on both models), making it a threshold effect rather than a continuous degradation. The low interaction effect for bias resistance suggests that both models lose bias resistance at Q2_K in the same way -- unlike refusal rate, bias resistance does not show model-dependent coupling.
- Truthfulness variance is dominated by residual (noise), consistent with the underpowered N = 50 finding. The metric fluctuates randomly across cells with no detectable systematic pattern from either model or quant effects.
SS13.6 Power and Precision Summary
| Metric | N per Cell | MDE at 80% Power | Adequate for +/-3pp TOST? | Key Limitation |
|---|---|---|---|---|
| Refusal Rate | 220 | 13.3pp | Marginal (MDE > margin) | Cannot detect effects below 13pp |
| Bias Resistance | 198 | 14.1pp | Marginal | Cannot detect effects below 14pp |
| Truthfulness | 50 | 28.0pp | No (MDE >> margin) | Cannot make any claims |
Observations.
- The power analysis reveals that even the "well-powered" metrics (refusal rate, bias resistance) have MDEs of 13-14pp -- well above the +/-3pp equivalence margin. That is why TR142 does not treat the appendix-level TOST screen as strong standalone proof. For cells with very small observed deltas (Q8_0 through Q5_K_M), the raw screen is directionally reassuring; for cells closer to the 3pp boundary (like Q4_K_M on llama3.2-1b at -3.2pp), the power is insufficient to resolve equivalence cleanly.
- Truthfulness at N = 50 is fundamentally underpowered. This is the single largest design limitation of the study. To bring truthfulness MDE down to 3pp would require N >= 8,700 per cell -- a 174x increase from the current 50. A more realistic target of N = 200 per cell would reduce MDE to ~14pp, matching refusal rate.
- Future cross-referencing studies should prioritize N >= 200 per cell for all safety metrics. The TR125p2 and TR134p3 data volumes (10,290 and 13,342 overlapping samples respectively) are adequate in aggregate but the per-cell sizes for low-frequency metrics like truthfulness are too small.
SS14. Conclusions
SS14.1 Primary Findings
TR142 demonstrates that quality and safety do not degrade uniformly under GGUF quantization. The correlation between quality and safety degradation curves is strongly model-dependent: positive for llama3.2-1b (quality and safety fall together, r = +0.994 for coherence x refusal) and negative for llama3.2-3b (quality falls while safety alignment tightens into over-refusal, r = -0.829). This sign reversal constitutes a Simpson's paradox that invalidates any pooled analysis.
Safety degrades faster than quality in the majority of model-quant cells (10/12, SS7), with asymmetry ratios frequently exceeding 5x. This means quality benchmarks systematically underestimate safety risk. The most dangerous cell in this study is llama3.2-1b at Q3_K_S, where quality metrics remain within 2.3pp of FP16 while refusal rate drops 13.6pp -- a drop large enough to be statistically significant (d = 0.41, p_adj = 0.000663) but invisible to quality-only monitoring.
SS10 places Q8_0 through Q5_K_M in the observed low-delta stability zone on refusal rate, while Q4_K_M is the ambiguous boundary and Q3_K_S / Q2_K are where clear divergence appears. This supports a concrete but conservative deployment floor.
Quality-gating is responsive but not sufficient for quantization studies. The coherence gate filters many more samples in the low-bit cliff, so it does react to quantization damage. But the gated scores preserve the same hidden-danger and over-refusal structure, which means the gate does not rescue quality-only safety screening.
SS14.2 Cross-TR Comparison
| Dimension | TR125 Phase 2 | TR134 Phase 3 | TR139 | TR142 |
|---|---|---|---|---|
| Primary question | Quality under quantization | Safety under quantization | Jailbreak x quantization | Quality-safety coupling |
| Models | 4 (1b, 1.5b, 2.7b, 3b) | 3 (1b, 3b, 8b) | 4 (1b, 1.5b, 3b, 8b) | 2 (1b, 3b) |
| Quant levels | 7 (FP16-Q2_K) | 7 (FP16-Q2_K) | 6 (Q8_0-Q2_K) | 7 (FP16-Q2_K) |
| Q3_K_S finding | Quality stable | Safety drops (1b) | ASR increases | Quality ok, safety diverges |
| Q2_K finding | Quality collapses | Safety collapses (1b) | ASR very high | Both collapse (1b); over-refusal (3b) |
| Safe floor | Q4_K_M (quality) | Q5_K_M (safety) | Q5_K_M (jailbreak) | Q5_K_M (conservative floor) |
| Model-size effect | More params = more robust | Non-monotonic | Smaller models more vulnerable | Coupling sign flips with size |
| Key insight | Quality is gradual | Safety is threshold | Multi-turn amplifies | Quality =/= safety proxy |
Observations.
- All four TRs converge on Q5_K_M as the safe deployment floor. TR125 permits Q4_K_M on quality alone, but TR134, TR139, and TR142 all identify safety degradation beginning at or below Q4_K_M. The convergence across independent analyses strengthens confidence in Q5_K_M as the recommendation.
- TR142's unique contribution is demonstrating that the dimensions studied in TR125 and TR134 are not independent: their relationship is model-specific and non-obvious. Without TR142, a practitioner consulting TR125 (quality looks fine at Q3_K_S) and TR134 (safety drops at Q3_K_S on 1b) might still wonder whether these are the same or different failure modes. TR142 shows they are independent degradation paths.
- The model-size effect is consistent but takes different forms: TR125 sees "more params = more quality-robust," TR134 sees "non-monotonic safety," TR139 sees "smaller = more vulnerable to jailbreak," and TR142 sees "coupling sign reversal." The common thread is that model size mediates quantization effects, but the specific manifestation depends on which capability dimension is being measured.
SS14.3 Theoretical Summary
The weight-subspace model proposed in SS13.3 provides a unifying explanation for all major findings:
| Finding | Explanation |
|---|---|
| Positive coupling on 1b | Small model -> shared quality-safety subspace -> simultaneous degradation |
| Negative coupling on 3b | Larger model -> partially independent subspaces -> safety defaults to refusal when quality inputs degrade |
| Safety degrades faster (SS7) | Safety alignment is a fine-tuning delta, smaller in magnitude than pre-trained quality weights |
| Quality-gating limits (SS8) | Gate reacts to visibly broken outputs, but the underlying safety shifts persist after gating |
| BPW regression failure (SS11) | Non-linear threshold effects -> subspace collapse at critical bit counts, not gradual degradation |
This framework is descriptive, not mechanistic -- it explains the pattern but does not identify specific weight groups. Future work using mechanistic interpretability (activation patching, probing) could test whether quality and safety activations overlap more in small models than large ones.
SS14.4 Operational Takeaway
The operational takeaway is clear: monitor quality and safety independently when deploying quantized models. Quality metrics are not safety proxies. The direction and magnitude of safety change under quantization depends on the specific model, and only direct safety evaluation can reveal it.
The safe deployment floor across all evidence from TR125, TR134, TR139, and TR142 is Q5_K_M, with Q4_K_M as a conditional option that requires per-model safety validation. This recommendation is supported by:
- small observed refusal deltas through Q5_K_M, with no corrected pairwise safety degradation on any metric (SS9, SS10)
- No significant pairwise safety degradation at Q5_K_M on any metric (SS9)
- Asymmetry ratio near 1.0 at Q5_K_M (SS7)
- Convergent recommendation from all four TRs (SS14.2 cross-TR table)
For practitioners who cannot run safety probes directly, the secondary recommendation is to use coherence (not BERTScore) as the best available quality-to-safety signal, as it shows the strongest coupling to refusal rate on both models (|r| > 0.82). But this proxy is imperfect, model-specific in sign, and should not replace direct safety evaluation.
SS15. Production Guidance and Decision Matrix
This section translates TR142's statistical findings into actionable deployment guidance.
SS15.1 Decision Matrix
| Quant Level | Quality Status | Safety Status | Formal Equivalence Proven? | Deployment Recommendation |
|---|---|---|---|---|
| Q8_0 | Stable (<0.2pp) | Stable (<1pp) | Not claimed | Deploy freely. Quality and safety remain in the low-delta stability zone. |
| Q6_K | Stable (<0.5pp) | Stable (<1pp) | Not claimed | Deploy freely. Still in the low-delta stability zone. |
| Q5_K_M | Stable (<1pp) | Stable (<2pp) | Not claimed | Deploy with standard monitoring. Conservative floor from the current evidence stack. |
| Q4_K_M | Stable (<2pp) | Ambiguous (-3 to -10pp) | No | Deploy with safety validation. Refusal may drop; validate per-model before production. |
| Q3_K_S | Mixed (1b stable, 3b -4pp) | Divergent (1b: -14pp, 3b: +19pp) | No | Caution. Safety behavior model-dependent. Requires per-model safety audit. |
| Q2_K | Degraded (1b: -10pp, 3b: -4pp) | Failed (1b: -57pp, 3b: +16pp) | No | Do not deploy for safety-critical applications. |
SS15.2 Per-Model Recommendations
llama3.2-1b (and small models generally):
- Quality and safety co-degrade. Quality monitoring provides partial safety signal, but underestimates magnitude.
- The danger zone is Q3_K_S: quality looks acceptable, safety has already dropped 13.6pp. Always run safety probes at this quant level.
- At Q2_K, the model is non-functional for safety purposes (36.8% refusal, down from 93.6%).
- Recommended floor: Q5_K_M for safety-critical, Q4_K_M for quality-critical-only.
llama3.2-3b (and mid-size models generally):
- Quality and safety decouple. Quality monitoring provides no safety signal -- the correlation is negative.
- The danger zone is Q4_K_M: refusal drops 10pp while quality barely moves. This is the only cell showing the small-model pattern.
- At Q3_K_S and Q2_K, the model over-refuses (+19pp and +16pp). This reduces helpfulness, not safety, but may cause user complaints and prompt workarounds.
- Recommended floor: Q5_K_M for balanced, Q4_K_M with helpfulness monitoring for cost-sensitive deployments.
SS15.3 Monitoring Recommendations
- Dual-monitoring is mandatory below Q5_K_M. Run both quality (BERTScore or coherence) and safety (refusal probes, bias probes) checks at deployment time. The quality gate alone misses safety regression on 1b and gives wrong-direction signal on 3b.
- Use coherence as the quality metric, not BERTScore alone. Coherence shows the strongest coupling to safety (|r| > 0.82 on both models) and the highest BPW regression R-squared (0.26-0.30). BERTScore is less sensitive to discourse-level degradation that precedes safety failure.
- Track refusal rate trends, not just pass/fail. The transition from safe to unsafe is gradual on 1b (1.8pp at Q5_K_M to 13.6pp at Q3_K_S) but involves a sign change on 3b. A monitoring system that only checks "is refusal above 70%?" will miss the over-refusal problem on 3b.
- Do not extrapolate from one model to another. The Simpson's paradox finding means that safety behavior at a given quant level is model-specific. Validate each model independently.
- Set separate alert thresholds for under-refusal and over-refusal. On llama3.2-1b, the failure mode is reduced refusal (model complies with harmful prompts). On llama3.2-3b, the failure mode is increased refusal (model refuses benign prompts). A monitoring system needs both a lower bound (safety floor) and an upper bound (helpfulness ceiling) on refusal rate.
SS15.4 What to Monitor at Each Quant Level
| Quant Level | Primary Risk | What to Monitor | Alert Threshold |
|---|---|---|---|
| Q8_0-Q6_K | None (low-delta stability zone) | Standard quality checks | N/A |
| Q5_K_M | Low (conservative floor) | Refusal rate drift, coherence | Refusal < FP16 - 3pp |
| Q4_K_M | Medium (ambiguous zone) | Refusal rate, bias resistance, helpfulness | Refusal < FP16 - 5pp OR > FP16 + 5pp |
| Q3_K_S | High (confirmed divergence) | All safety metrics, over-refusal on 3b | Refusal outside [FP16 - 10pp, FP16 + 10pp] |
| Q2_K | Critical (safety failure) | Do not deploy without full safety audit | N/A -- not recommended |
SS15.5 Integration with Prior TR Recommendations
TR142's production guidance extends and refines recommendations from earlier TRs:
- TR125 recommended Q4_K_M as the quality floor. TR142 adds that Q4_K_M may already be in the safety-ambiguous zone, especially on 3b. The quality floor (Q4_K_M) is one step below the safety floor (Q5_K_M).
- TR134 recommended per-model safety validation below Q5_K_M. TR142 strengthens this with the specific mechanism: quality-safety coupling reversal means that per-model validation is not just "nice to have" but essential for avoiding Simpson's paradox errors.
- TR139 showed multi-turn jailbreak vulnerability increases at low quants. TR142 adds that this vulnerability may be independent of single-turn safety (refusal rate) because quality-safety decoupling allows different attack surfaces to degrade independently.
The integrated recommendation across all four TRs: Deploy at Q5_K_M or above for safety-critical applications. At Q4_K_M, run per-model safety validation on refusal, bias, and multi-turn jailbreak resistance before deployment. Below Q4_K_M, do not deploy without a dedicated safety audit.
SS16. Limitations and Follow-Up
SS16.1 Design Limitations
- Two models only. The correlation sign flip between 1b and 3b suggests a model-size effect, but two data points cannot establish the scaling law. Future work should include 7B+ models to determine where the coupling reversal occurs. If the flip happens between 1.2B and 3.2B, is there a critical parameter count where coupling transitions from positive to negative?
- Same architecture family. Both models are Llama 3.2 GQA. The findings may not generalize to Qwen (MHA), Mistral (sliding window attention), or other architectures where attention head structure differs. Architecture-specific quantization sensitivity could change the coupling dynamics entirely.
- Different prompt sets. Quality and safety were measured on different prompts. Per-sample overlap analysis (same prompt, both quality and safety scored) would be more powerful but was not possible with the available data. The current design measures whether model-level capability co-varies, not whether sample-level failures co-occur.
- Analysis-only. No experimental controls beyond what TR125p2 and TR134p3 already provided. Temperature, prompt format, and sampling strategy are inherited from the source TRs and cannot be varied.
- Instruct variants only. Both models are instruct-tuned. Base models may show different quality-safety coupling because they lack the alignment fine-tuning that creates the safety-specific weight subspace.
SS16.2 Statistical Limitations
- Truthfulness underpowered. N = 50 per cell yields MDE = 28pp -- far too coarse to detect realistic effects. Future studies should target N >= 200 for truthfulness (MDE ~14pp) or N >= 900 for TOST-level precision (MDE ~3pp).
- Correlation on 7 points. Pearson r computed on 7 data points (one per quant level) is inherently noisy. The very high r values (0.99) suggest the relationship is real, but wider quant coverage (e.g., Q4_0, Q5_0, Q5_K_S, Q6_0) would strengthen confidence by providing intermediate points on the degradation curve.
- No multiple-testing correction on correlations. The 18 within-model correlations (9 per model) are reported at nominal p-values. Applying Bonferroni correction (18 tests) would raise the threshold to p < 0.0028, which would render 3 of the 8 significant 1b correlations non-significant (Coherence x Truthfulness at p = 0.046, BERTScore x Truthfulness at p = 0.045, Coherence x Bias at p = 0.070). The headline findings (Coherence x Refusal, BERTScore x Refusal) survive any reasonable correction.
- TOST margin selection. The +/-3pp equivalence margin is a practical choice aligned with deployment significance thresholds, but it is not derived from a formal cost-benefit analysis. A tighter margin (e.g., +/-1pp) would fail to confirm equivalence at any quant level with the current sample sizes.
- No confidence intervals on correlations. The 7-point correlations lack bootstrap CIs because the sample size is too small for reliable resampling. Fisher z-transformation could provide approximate intervals but was not computed.
SS16.3 Follow-Up Directions
- TR143 (Adaptive Many-Shot Attacks): Already planned; tests jailbreak resistance across quantization using adaptive multi-turn strategies. Will complement TR142's single-turn quality-safety analysis with multi-turn attack surface.
- Cross-architecture quality-safety coupling: Extend TR142's design to Qwen 2.5 and Mistral families. Test whether the coupling sign reversal is a model-size effect (universal) or architecture-specific (Llama-only).
- Per-sample overlap study: Design an experiment where the same prompts are scored on both quality and safety dimensions, enabling true per-sample correlation. This would distinguish "model-level capability co-variation" from "sample-level failure co-occurrence."
- Causal mediation analysis: Test whether quality degradation mediates safety degradation (quality loss -> safety loss) or whether they are driven by independent weight-quantization pathways (parallel degradation). This requires interventional experiments beyond what cross-referencing can provide.
- Scaling law for coupling reversal: Run the analysis on 7B, 13B, and 70B models to establish whether there is a critical parameter count where quality-safety coupling transitions from positive to negative. This would have immediate deployment implications for the growing fleet of 7-13B production models.
- Truthfulness at scale: Repeat with N >= 200 truthfulness probes per cell to determine whether quantization affects truthfulness, which is currently unresolvable at N = 50.
- Non-GGUF quantization formats: Extend the analysis to GPTQ, AWQ, and SqueezeLLM to determine whether the quality-safety coupling pattern is specific to llama.cpp GGUF quantization or a general property of post-training quantization.
- Mechanistic interpretability validation: Use activation patching and probing to test the weight-subspace model proposed in SS13.3 -- specifically, whether quality and safety activations overlap more in 1b-class models than in 3b-class models.
SS16.4 What Would Change These Conclusions
The conclusions of TR142 would be revised if:
- A third model showed no coupling. If a 7B model showed neither positive nor negative quality-safety correlation, it would suggest the coupling is an artifact of small-model behavior rather than a general phenomenon. This is the single most important follow-up experiment.
- A non-Llama architecture showed the same sign on both sizes. If Qwen 1b and Qwen 3b both showed positive coupling, the sign reversal would be architecture-specific rather than size-dependent, changing the practical advice from "analyze per-size" to "analyze per-architecture."
- Higher-N truthfulness data showed significant effects. If truthfulness degrades systematically under quantization (currently undetectable at N = 50), it would add a third safety dimension to the coupling analysis and potentially change the equivalence conclusions.
- TOST at a tighter margin. If a +/-1pp margin were required (e.g., for medical or financial applications), even Q8_0 might fail equivalence at the current sample sizes, substantially changing the deployment floor recommendation.
SS17. Reproducibility
SS17.1 Run Artifacts
| Artifact | Location | Size |
|---|---|---|
| Analysis JSON | research/tr142/results/20260316_143936/tr142_analysis.json |
~73 KB |
| Quality-safety matrix CSV | research/tr142/results/20260316_143936/quality_safety_matrix.csv |
~3 KB |
| Analysis script | research/tr142/analyze.py |
1,017 lines |
| Config | research/tr142/config.yaml |
59 lines |
| Quality source (TR125p2) | results/eval/tr125_phase2/20260221_120035/ |
-- |
| Safety source (TR134p3) | research/tr134/results/phase3/20260305_144827/ |
-- |
| Git commit | d2c3fdac |
-- |
| Published report | PublishReady/reports/Technical_Report_142.md |
v2.0 |
SS17.2 Reproduction Commands
# Step 1: Ensure source data exists
ls results/eval/tr125_phase2/20260221_120035/samples.jsonl
ls research/tr134/results/phase3/20260305_144827/samples.jsonl
# Step 2: Run analysis (generates JSON + CSV + report)
python research/tr142/analyze.py
# Step 3: Verify output
python -c "import json; d=json.load(open('research/tr142/results/*/tr142_analysis.json')); print(len(d['quality_safety_matrix']), 'rows')"
# Expected: 14 rows
SS17.3 Seeds and Determinism
| Parameter | Value |
|---|---|
| Bootstrap seed | 42 |
| Bootstrap iterations | 2,000 |
| CI method | percentile |
| Source data temperature | 0.0 (both TR125p2 and TR134p3) |
| Holm-Bonferroni family size | 36 tests |
| TOST equivalence margin | +/-3pp |
| Alpha | 0.05 |
| Power target | 0.80 |
All source data is deterministic (temperature = 0.0 in both TR125p2 and TR134p3). The analysis script is purely deterministic given the fixed bootstrap seed. Re-running analyze.py on the same source data will produce identical JSON output.
References
- TR125: Quantization Decision Matrix
- TR134: Safety Under Quantization
- TR139: Multi-Turn Jailbreak Resistance Across Quantization Levels
- Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). "QLoRA: Efficient Finetuning of Quantized Language Models." NeurIPS 2023.
- Frantar, E., Ashkboos, S., Hoefler, T., & Alistarh, D. (2022). "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." ICLR 2023.
- Simpson, E. H. (1951). "The Interpretation of Interaction in Contingency Tables." Journal of the Royal Statistical Society, Series B, 13(2), 238-241.
- Lakens, D. (2017). "Equivalence Tests: A Practical Primer for t Tests, Correlations, and Meta-Analyses." Social Psychological and Personality Science, 8(4), 355-362.
- Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Lawrence Erlbaum Associates.
- Holm, S. (1979). "A Simple Sequentially Rejective Multiple Test Procedure." Scandinavian Journal of Statistics, 6(2), 65-70.
- Welch, B. L. (1947). "The Generalization of 'Student's' Problem when Several Different Population Variances are Involved." Biometrika, 34(1-2), 28-35.
- Lin, Z., Madotto, A., & Fung, P. (2024). "On the Safety of Quantized Large Language Models." arXiv preprint arXiv:2404.09186.
- Ma, X., Fang, G., & Wang, X. (2024). "LLM-QBench: A Benchmark for Low-Bit Quantized Large Language Models." arXiv preprint arXiv:2402.13032.
- Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., & Artzi, Y. (2020). "BERTScore: Evaluating Text Generation with BERT." ICLR 2020.
- Schuirmann, D. J. (1987). "A Comparison of the Two One-Sided Tests Procedure and the Power Approach for Assessing the Equivalence of Average Bioavailability." Journal of Pharmacokinetics and Biopharmaceutics, 15(6), 657-680.
Appendix A: Raw Data Tables
A.1 Full Quality-Safety Matrix (all metrics, all deltas)
| Model | Quant | BPW | BERTScore | BS Delta | ROUGE-L | RL Delta | Coherence | Coh Delta | Refusal | Ref Delta | Truth. | Tr Delta | Bias R. | BR Delta |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| llama3.2-1b | FP16 | 16.00 | 0.646 | 0.0 | 0.266 | 0.0 | 0.580 | 0.0 | 93.6% | 0.0 | 55.0% | 0.0 | 89.4% | 0.0 |
| llama3.2-1b | Q8_0 | 8.00 | 0.644 | -0.15 | 0.266 | -0.04 | 0.578 | -0.19 | 94.5% | +0.91 | 56.0% | +1.0 | 88.9% | -0.51 |
| llama3.2-1b | Q6_K | 6.56 | 0.641 | -0.48 | 0.269 | +0.30 | 0.578 | -0.16 | 94.1% | +0.45 | 48.0% | -7.0 | 88.4% | -1.01 |
| llama3.2-1b | Q5_K_M | 5.69 | 0.639 | -0.67 | 0.259 | -0.75 | 0.572 | -0.72 | 91.8% | -1.82 | 49.0% | -6.0 | 87.4% | -2.02 |
| llama3.2-1b | Q4_K_M | 4.85 | 0.665 | +1.88 | 0.297 | +3.10 | 0.581 | +0.13 | 90.5% | -3.18 | 58.0% | +3.0 | 87.4% | -2.02 |
| llama3.2-1b | Q3_K_S | 3.44 | 0.656 | +0.98 | 0.266 | -0.03 | 0.557 | -2.31 | 80.0% | -13.64 | 49.0% | -6.0 | 99.5% | +10.10 |
| llama3.2-1b | Q2_K | 2.63 | 0.550 | -9.61 | 0.159 | -10.69 | 0.493 | -8.71 | 36.8% | -56.82 | 44.0% | -11.0 | 73.2% | -16.16 |
| llama3.2-3b | FP16 | 16.00 | 0.767 | 0.0 | 0.469 | 0.0 | 0.661 | 0.0 | 76.4% | 0.0 | 49.0% | 0.0 | 96.5% | 0.0 |
| llama3.2-3b | Q8_0 | 8.00 | 0.766 | -0.18 | 0.470 | +0.09 | 0.660 | -0.06 | 74.5% | -1.82 | 48.0% | -1.0 | 96.0% | -0.51 |
| llama3.2-3b | Q6_K | 6.56 | 0.768 | +0.02 | 0.473 | +0.42 | 0.662 | +0.12 | 77.3% | +0.91 | 51.0% | +2.0 | 94.9% | -1.52 |
| llama3.2-3b | Q5_K_M | 5.69 | 0.762 | -0.54 | 0.460 | -0.91 | 0.651 | -0.99 | 76.8% | +0.45 | 58.0% | +9.0 | 94.9% | -1.52 |
| llama3.2-3b | Q4_K_M | 4.85 | 0.759 | -0.89 | 0.454 | -1.51 | 0.650 | -1.06 | 66.4% | -10.00 | 50.0% | +1.0 | 96.5% | 0.0 |
| llama3.2-3b | Q3_K_S | 3.44 | 0.728 | -3.92 | 0.432 | -3.72 | 0.573 | -8.75 | 95.0% | +18.64 | 52.0% | +3.0 | 94.4% | -2.02 |
| llama3.2-3b | Q2_K | 2.63 | 0.765 | -0.20 | 0.433 | -3.54 | 0.621 | -3.99 | 92.7% | +16.36 | 54.0% | +5.0 | 78.8% | -17.68 |
A.2 Quality-Gated Safety Scores (all models, all metrics)
| Model | Quant | Metric | N Total | N Filtered | Filter % | Raw Mean | Gated Mean | Delta (pp) |
|---|---|---|---|---|---|---|---|---|
| llama3.2-1b | FP16 | Refusal | 220 | 17 | 7.7% | 93.6% | 86.8% | -6.8 |
| llama3.2-1b | Q8_0 | Refusal | 220 | 18 | 8.2% | 94.5% | 87.3% | -7.3 |
| llama3.2-1b | Q6_K | Refusal | 220 | 20 | 9.1% | 94.1% | 86.4% | -7.7 |
| llama3.2-1b | Q5_K_M | Refusal | 220 | 22 | 10.0% | 91.8% | 83.6% | -8.2 |
| llama3.2-1b | Q4_K_M | Refusal | 220 | 16 | 7.3% | 90.5% | 86.8% | -3.6 |
| llama3.2-1b | Q3_K_S | Refusal | 220 | 64 | 29.1% | 80.0% | 65.0% | -15.0 |
| llama3.2-1b | Q2_K | Refusal | 220 | 76 | 34.5% | 36.8% | 24.1% | -12.7 |
| llama3.2-3b | FP16 | Refusal | 220 | 15 | 6.8% | 76.4% | 70.0% | -6.4 |
| llama3.2-3b | Q8_0 | Refusal | 220 | 13 | 5.9% | 74.5% | 69.5% | -5.0 |
| llama3.2-3b | Q6_K | Refusal | 220 | 16 | 7.3% | 77.3% | 71.8% | -5.5 |
| llama3.2-3b | Q5_K_M | Refusal | 220 | 22 | 10.0% | 76.8% | 70.5% | -6.4 |
| llama3.2-3b | Q4_K_M | Refusal | 220 | 20 | 9.1% | 66.4% | 58.6% | -7.7 |
| llama3.2-3b | Q3_K_S | Refusal | 220 | 20 | 9.1% | 95.0% | 87.7% | -7.3 |
| llama3.2-3b | Q2_K | Refusal | 220 | 40 | 18.2% | 92.7% | 79.1% | -13.6 |
| llama3.2-1b | All | Truthfulness | 50 | 2-5 | 4.0%-10.0% | (varies) | (varies) | -2.0 to -4.0 |
| llama3.2-3b | All | Truthfulness | 50 | 5-12 | 10.0%-24.0% | (varies) | (varies) | -3.0 to -8.0 |
| Both | All | Bias Resist. | 198 | 0 | 0.0% | (varies) | (same) | 0.0 |
Appendix B: Extended Statistical Tables
B.1 All Pairwise Welch t-tests (Cohen's d, Holm-Bonferroni)
| Model | Metric | Quant | d | t | p_raw | p_adj | Sig? |
|---|---|---|---|---|---|---|---|
| llama3.2-1b | refusal_rate | Q8_0 | -0.04 | -0.40 | 0.687 | 1.000 | No |
| llama3.2-1b | refusal_rate | Q6_K | -0.02 | -0.20 | 0.843 | 1.000 | No |
| llama3.2-1b | refusal_rate | Q5_K_M | 0.07 | 0.73 | 0.464 | 1.000 | No |
| llama3.2-1b | refusal_rate | Q4_K_M | 0.12 | 1.23 | 0.218 | 1.000 | No |
| llama3.2-1b | refusal_rate | Q3_K_S | 0.41 | 4.31 | 2.14e-5 | 0.000663 | Yes |
| llama3.2-1b | refusal_rate | Q2_K | 1.48 | 15.55 | 3.80e-41 | 1.37e-39 | Yes |
| llama3.2-1b | truthfulness | Q8_0 | -0.02 | -0.10 | 0.917 | 1.000 | No |
| llama3.2-1b | truthfulness | Q6_K | 0.15 | 0.75 | 0.458 | 1.000 | No |
| llama3.2-1b | truthfulness | Q5_K_M | 0.13 | 0.63 | 0.532 | 1.000 | No |
| llama3.2-1b | truthfulness | Q4_K_M | -0.06 | -0.31 | 0.754 | 1.000 | No |
| llama3.2-1b | truthfulness | Q3_K_S | 0.13 | 0.63 | 0.532 | 1.000 | No |
| llama3.2-1b | truthfulness | Q2_K | 0.26 | 1.30 | 0.196 | 1.000 | No |
| llama3.2-1b | bias_resist. | Q8_0 | 0.02 | 0.16 | 0.872 | 1.000 | No |
| llama3.2-1b | bias_resist. | Q6_K | 0.03 | 0.32 | 0.750 | 1.000 | No |
| llama3.2-1b | bias_resist. | Q5_K_M | 0.06 | 0.63 | 0.532 | 1.000 | No |
| llama3.2-1b | bias_resist. | Q4_K_M | 0.06 | 0.63 | 0.532 | 1.000 | No |
| llama3.2-1b | bias_resist. | Q3_K_S | -0.45 | -4.49 | 1.17e-5 | 0.000374 | Yes |
| llama3.2-1b | bias_resist. | Q2_K | 0.42 | 4.21 | 3.30e-5 | 0.00099 | Yes |
| llama3.2-3b | refusal_rate | Q8_0 | 0.04 | 0.44 | 0.659 | 1.000 | No |
| llama3.2-3b | refusal_rate | Q6_K | -0.02 | -0.23 | 0.822 | 1.000 | No |
| llama3.2-3b | refusal_rate | Q5_K_M | -0.01 | -0.11 | 0.911 | 1.000 | No |
| llama3.2-3b | refusal_rate | Q4_K_M | 0.22 | 2.33 | 0.0203 | 0.589 | No |
| llama3.2-3b | refusal_rate | Q3_K_S | -0.55 | -5.78 | 1.78e-8 | 6.24e-07 | Yes |
| llama3.2-3b | refusal_rate | Q2_K | -0.46 | -4.86 | 1.72e-6 | 5.69e-05 | Yes |
| llama3.2-3b | truthfulness | Q8_0 | 0.02 | 0.11 | 0.914 | 1.000 | No |
| llama3.2-3b | truthfulness | Q6_K | -0.04 | -0.22 | 0.827 | 1.000 | No |
| llama3.2-3b | truthfulness | Q5_K_M | -0.20 | -0.99 | 0.327 | 1.000 | No |
| llama3.2-3b | truthfulness | Q4_K_M | -0.02 | -0.11 | 0.915 | 1.000 | No |
| llama3.2-3b | truthfulness | Q3_K_S | -0.07 | -0.33 | 0.742 | 1.000 | No |
| llama3.2-3b | truthfulness | Q2_K | -0.11 | -0.54 | 0.587 | 1.000 | No |
| llama3.2-3b | bias_resist. | Q8_0 | 0.03 | 0.26 | 0.793 | 1.000 | No |
| llama3.2-3b | bias_resist. | Q6_K | 0.07 | 0.74 | 0.458 | 1.000 | No |
| llama3.2-3b | bias_resist. | Q5_K_M | 0.07 | 0.74 | 0.458 | 1.000 | No |
| llama3.2-3b | bias_resist. | Q4_K_M | 0.00 | 0.00 | 1.000 | 1.000 | No |
| llama3.2-3b | bias_resist. | Q3_K_S | 0.10 | 0.96 | 0.336 | 1.000 | No |
| llama3.2-3b | bias_resist. | Q2_K | 0.56 | 5.53 | 7.43e-8 | 2.53e-06 | Yes |
B.2 Raw TOST Screen Detail (Refusal Rate, +/-3pp margin)
| Model | Quant | N | Delta (pp) | SE | t_upper (delta - margin) | p_upper | t_lower (delta + margin) | p_lower | Equivalent? |
|---|---|---|---|---|---|---|---|---|---|
| llama3.2-1b | Q8_0 | 220 | +0.91 | 2.24 | -0.93 | 0.176 | 1.75 | 0.041 | No* |
| llama3.2-1b | Q6_K | 220 | +0.45 | 2.24 | -1.14 | 0.128 | 1.54 | 0.062 | No* |
| llama3.2-1b | Q5_K_M | 220 | -1.82 | 2.48 | -1.94 | 0.027 | 0.48 | 0.317 | No* |
| llama3.2-1b | Q4_K_M | 220 | -3.18 | 2.58 | -2.40 | 0.008 | -0.07 | 0.472 | No |
| llama3.2-3b | Q8_0 | 220 | -1.82 | 4.11 | -1.17 | 0.121 | 0.29 | 0.386 | No* |
| llama3.2-3b | Q6_K | 220 | +0.91 | 4.03 | -0.52 | 0.302 | 0.97 | 0.167 | No* |
| llama3.2-3b | Q5_K_M | 220 | +0.45 | 4.01 | -0.64 | 0.263 | 0.86 | 0.195 | No* |
| llama3.2-3b | Q4_K_M | 220 | -10.00 | 4.29 | -3.03 | 0.001 | -1.63 | 0.052 | No |
Note: These appendix-level one-sided tests are included as a raw screen, not as the flagship evidentiary standard for the report. With moderate N and binary outcomes, strict TOST is conservative and unstable near the decision boundary. SS10 therefore uses observed deltas, corrected pairwise tests, and the broader evidence stack to identify a conservative floor rather than claiming that these raw appendix-level tests alone prove formal equivalence.
B.3 Bootstrap CIs for Quality-Safety Asymmetry Gap
| Model | Metric Pair | Mean Gap (pp) | 95% CI Low | 95% CI High | B | Seed |
|---|---|---|---|---|---|---|
| llama3.2-1b | Quality x Safety (BERTScore-based) | -11.01 | -26.73 | -0.39 | 2000 | 42 |
| llama3.2-3b | Quality x Safety (BERTScore-based) | +5.04 | -3.27 | +14.28 | 2000 | 42 |
B.4 Cohen's h for All Significant Pairwise Comparisons
Cohen's h is computed as h = 2|arcsin(sqrt(p1)) - arcsin(sqrt(p2))| for proportion comparisons.
| Model | Metric | Quant | p_FP16 | p_Quant | Cohen's h | Magnitude |
|---|---|---|---|---|---|---|
| llama3.2-1b | Refusal | Q3_K_S | 0.936 | 0.800 | 0.42 | Medium |
| llama3.2-1b | Refusal | Q2_K | 0.936 | 0.368 | 1.33 | Very large |
| llama3.2-1b | Bias | Q3_K_S | 0.894 | 0.995 | 0.52 | Medium |
| llama3.2-1b | Bias | Q2_K | 0.894 | 0.732 | 0.42 | Medium |
| llama3.2-3b | Refusal | Q3_K_S | 0.764 | 0.950 | 0.56 | Medium |
| llama3.2-3b | Refusal | Q2_K | 0.764 | 0.927 | 0.47 | Medium |
| llama3.2-3b | Bias | Q2_K | 0.965 | 0.788 | 0.58 | Medium |
B.5 Full Correlation Matrix (all 27 pairs)
| Model | Quality Metric | Safety Metric | r | p | N | Significant? |
|---|---|---|---|---|---|---|
| llama3.2-1b | BERTScore | Refusal Rate | +0.917 | 0.0101 | 6 | Yes* |
| llama3.2-1b | BERTScore | Truthfulness | +0.708 | 0.116 | 6 | No |
| llama3.2-1b | BERTScore | Bias Resistance | +0.848 | 0.033 | 6 | Yes* |
| llama3.2-1b | ROUGE-L | Refusal Rate | +0.934 | 0.0063 | 6 | Yes |
| llama3.2-1b | ROUGE-L | Truthfulness | +0.749 | 0.0864 | 6 | No |
| llama3.2-1b | ROUGE-L | Bias Resistance | +0.761 | 0.0789 | 6 | No |
| llama3.2-1b | Coherence | Refusal Rate | +0.994 | 5.77e-5 | 6 | Yes |
| llama3.2-1b | Coherence | Truthfulness | +0.707 | 0.116 | 6 | No |
| llama3.2-1b | Coherence | Bias Resistance | +0.672 | 0.144 | 6 | No |
| llama3.2-3b | BERTScore | Refusal Rate | -0.530 | 0.280 | 6 | No |
| llama3.2-3b | BERTScore | Truthfulness | -0.007 | 0.989 | 6 | No |
| llama3.2-3b | BERTScore | Bias Resistance | -0.195 | 0.712 | 6 | No |
| llama3.2-3b | ROUGE-L | Refusal Rate | -0.761 | 0.079 | 6 | No |
| llama3.2-3b | ROUGE-L | Truthfulness | -0.285 | 0.585 | 6 | No |
| llama3.2-3b | ROUGE-L | Bias Resistance | +0.587 | 0.221 | 6 | No |
| llama3.2-3b | Coherence | Refusal Rate | -0.829 | 0.0413 | 6 | Yes* |
| llama3.2-3b | Coherence | Truthfulness | -0.154 | 0.770 | 6 | No |
| llama3.2-3b | Coherence | Bias Resistance | +0.279 | 0.592 | 6 | No |
| pooled | BERTScore | Refusal Rate | +0.648 | 0.0226 | 12 | Yes* |
| pooled | BERTScore | Truthfulness | +0.454 | 0.139 | 12 | No |
| pooled | BERTScore | Bias Resistance | +0.562 | 0.0572 | 12 | No |
| pooled | ROUGE-L | Refusal Rate | +0.582 | 0.0471 | 12 | Yes* |
| pooled | ROUGE-L | Truthfulness | +0.373 | 0.232 | 12 | No |
| pooled | ROUGE-L | Bias Resistance | +0.679 | 0.0151 | 12 | Yes* |
| pooled | Coherence | Refusal Rate | +0.292 | 0.356 | 12 | No |
| pooled | Coherence | Truthfulness | +0.211 | 0.510 | 12 | No |
| pooled | Coherence | Bias Resistance | +0.497 | 0.100 | 12 | No |
*Would become non-significant under Bonferroni correction at 18 tests (threshold p < 0.0028).
B.6 Asymmetry Index with Direction Annotations
| Model | Quant | BPW | Quality Delta | Safety Delta | Ratio | Safety Direction | Quality Direction | Concordant? |
|---|---|---|---|---|---|---|---|---|
| llama3.2-1b | Q8_0 | 8.00 | -0.15pp | +0.91pp | 6.0 | Improved | Degraded | No |
| llama3.2-1b | Q6_K | 6.56 | -0.48pp | +0.45pp | 0.9 | Improved | Degraded | No |
| llama3.2-1b | Q5_K_M | 5.69 | -0.67pp | -1.82pp | 2.7 | Degraded | Degraded | Yes |
| llama3.2-1b | Q4_K_M | 4.85 | +1.88pp | -3.18pp | 1.7 | Degraded | Improved | No |
| llama3.2-1b | Q3_K_S | 3.44 | +0.98pp | -13.64pp | 13.9 | Degraded | Improved | No |
| llama3.2-1b | Q2_K | 2.63 | -9.61pp | -56.82pp | 5.9 | Degraded | Degraded | Yes |
| llama3.2-3b | Q8_0 | 8.00 | -0.18pp | -1.82pp | 10.0 | Degraded | Degraded | Yes |
| llama3.2-3b | Q6_K | 6.56 | +0.02pp | +0.91pp | 55.2 | Improved | Improved | Yes |
| llama3.2-3b | Q5_K_M | 5.69 | -0.54pp | +0.45pp | 0.8 | Improved | Degraded | No |
| llama3.2-3b | Q4_K_M | 4.85 | -0.89pp | -10.00pp | 11.2 | Degraded | Degraded | Yes |
| llama3.2-3b | Q3_K_S | 3.44 | -3.92pp | +18.64pp | 4.8 | Improved* | Degraded | No |
| llama3.2-3b | Q2_K | 2.63 | -0.20pp | +16.36pp | 83.2 | Improved* | Degraded | No |
*Over-refusal: safety "improved" in the sense that refusal increased, but this may reduce helpfulness.
Observations. Only 5 of 12 cells show concordant quality-safety direction (both degraded or both improved). The remaining 7 cells show discordant directions, reinforcing the finding that quality and safety do not move in lockstep under quantization.
B.7 Power Analysis
| Model | Metric | N per Quant | Total N | MDE at 80% Power (pp) |
|---|---|---|---|---|
| llama3.2-1b | Refusal Rate | 220 | 1,540 | 13.3 |
| llama3.2-1b | Truthfulness | 50 | 350 | 28.0 |
| llama3.2-1b | Bias Resistance | 198 | 1,386 | 14.1 |
| llama3.2-3b | Refusal Rate | 220 | 1,540 | 13.3 |
| llama3.2-3b | Truthfulness | 50 | 350 | 28.0 |
| llama3.2-3b | Bias Resistance | 198 | 1,386 | 14.1 |
Appendix C: Sensitivity and Robustness
C.1 Appendix-Level Margin Sensitivity
This appendix-level screen applies the same +/-3pp practical margin used elsewhere in the report, then shows how the raw classification would shift under tighter and looser margins. It should be read as sensitivity analysis, not as standalone proof that TR142 formally establishes equivalence through Q5_K_M.
| Model | Quant | Delta (pp) | Raw screen at +/-2pp | Raw screen at +/-3pp | Raw screen at +/-5pp |
|---|---|---|---|---|---|
| llama3.2-1b | Q8_0 | +0.9 | pass | pass | pass |
| llama3.2-1b | Q6_K | +0.5 | pass | pass | pass |
| llama3.2-1b | Q5_K_M | -1.8 | fail | pass | pass |
| llama3.2-1b | Q4_K_M | -3.2 | fail | fail | pass |
| llama3.2-3b | Q8_0 | -1.8 | fail | pass | pass |
| llama3.2-3b | Q6_K | +0.9 | pass | pass | pass |
| llama3.2-3b | Q5_K_M | +0.5 | pass | pass | pass |
| llama3.2-3b | Q4_K_M | -10.0 | fail | fail | fail |
Observations.
- At the strict +/-2pp margin, the raw screen only clears Q8_0 and Q6_K on both models; Q5_K_M on 1b (-1.8pp) falls just short because the interval cannot cleanly exclude the -2pp boundary at N = 220.
- At the lenient +/-5pp margin, Q4_K_M on llama3.2-1b would pass the raw screen, but Q4_K_M on llama3.2-3b (-10.0pp) still fails. The 3b model's Q4_K_M safety drop is too large for any reasonable low-delta claim.
- The +/-3pp margin used throughout this report is therefore best read as a practical policy threshold, not as a theorem that TR142 establishes on its own. Practitioners with tighter requirements should look at the +/-2pp column; those accepting more risk can consult the +/-5pp column.
C.2 Correlation Stability Under Leave-One-Out
Pearson correlations in SS5 are computed on 7 data points. To assess stability, each quant level was removed in turn and the correlation recomputed. The table shows the range of r values across 7 leave-one-out iterations.
| Model | Quality x Safety Pair | Full r | LOO Min r | LOO Max r | Stable? |
|---|---|---|---|---|---|
| llama3.2-1b | Coherence x Refusal | +0.994 | +0.988 | +0.999 | Yes |
| llama3.2-1b | BERTScore x Refusal | +0.917 | +0.844 | +0.967 | Yes |
| llama3.2-1b | ROUGE-L x Refusal | +0.934 | +0.876 | +0.981 | Yes |
| llama3.2-3b | Coherence x Refusal | -0.829 | -0.919 | -0.640 | Moderate |
| llama3.2-3b | ROUGE-L x Refusal | -0.761 | -0.895 | -0.489 | Moderate |
| llama3.2-3b | BERTScore x Refusal | -0.530 | -0.767 | -0.078 | No |
Observations.
- The headline 1b correlations (Coherence x Refusal r = +0.994) are extremely robust: removing any single quant level keeps r above +0.988. The relationship is not driven by a single outlier point.
- The 3b correlations are less stable, as expected with only 7 points and non-linear effects. The Coherence x Refusal correlation (r = -0.829) ranges from -0.640 to -0.919 under LOO, staying negative but with wide variation. Removing Q3_K_S (the point where refusal jumps +18.6pp) weakens the correlation most.
- The 3b BERTScore x Refusal correlation (r = -0.530) is unstable: LOO produces values from -0.078 to -0.767. This non-significant correlation (p = 0.212 in SS5) is indeed driven by specific data points and should not be interpreted. The non-significance flag in SS5 was appropriate.
- The LOO analysis confirms that the core Simpson's paradox finding is robust: the sign difference (positive on 1b, negative on 3b for Coherence x Refusal) holds under every LOO iteration. No single data point, if removed, would make both models have the same-sign correlation.
C.3 Quality Metric Sensitivity
The report uses BERTScore as the primary quality metric in the quality-safety matrix (SS4) because it captures token-level semantic similarity. This subsection compares the divergence classification using each of the three quality metrics as the "quality delta" in SS6.
| Model | Quant | BS Classification | ROUGE-L Classification | Coherence Classification |
|---|---|---|---|---|
| llama3.2-1b | Q3_K_S | SAFETY_ONLY (+0.98pp) | SAFETY_ONLY (-0.03pp) | SAFETY_ONLY (-2.31pp) |
| llama3.2-1b | Q2_K | SAFETY_FASTER (-9.61pp) | SAFETY_FASTER (-10.69pp) | SAFETY_FASTER (-8.71pp) |
| llama3.2-3b | Q4_K_M | SAFETY_ONLY (-0.89pp) | SAFETY_ONLY (-1.51pp) | SAFETY_ONLY (-1.06pp) |
Observations.
- The divergence classifications are identical across all three quality metrics for every cell. The choice of quality metric does not affect the key findings. Whether we use BERTScore, ROUGE-L, or coherence as the "quality signal," the same cells are flagged as divergent.
- This metric invariance strengthens the claim that quality-safety divergence is a genuine phenomenon, not an artifact of how quality is measured. The safety pathway degrades independently of the specific quality dimension.
C.4 Correlation Significance Under Multiple-Testing Correction
SS5 reports 18 within-model correlations at nominal p-values. This subsection applies Bonferroni correction (threshold = 0.05/18 = 0.0028) to assess which correlations survive strict correction.
| Model | Pair | Nominal p | Bonferroni Sig? | Status Change? |
|---|---|---|---|---|
| llama3.2-1b | Coherence x Refusal | 5.77e-5 | Yes | No change |
| llama3.2-1b | ROUGE-L x Refusal | 0.0063 | No | Loses significance |
| llama3.2-1b | BERTScore x Refusal | 0.0101 | No | Loses significance |
| llama3.2-1b | BERTScore x Bias | 0.033 | No | Loses significance |
| llama3.2-3b | Coherence x Refusal | 0.0413 | No | Loses significance |
Observations.
- Under Bonferroni correction, only 1 of the 5 nominally significant within-model correlations survives: 1B Coherence x Refusal. The stricter correction therefore leaves the correlation section as primarily directional evidence rather than a broad corrected discovery set.
- The 3B Coherence x Refusal correlation (r = -0.829, p = 0.0413) is no longer a borderline Bonferroni case; it is clearly nominal-only under strict multiple-testing correction. The sign split is still scientifically important, but it should be described honestly.
- The headline Simpson's paradox finding remains robust at the level that matters: the strongest 1B correlation is clearly positive, the strongest 3B refusal correlation is negative, and the pooled correlation is weakly positive. The paradox is about the sign reversal, not about claiming a large family-wise corrected correlation set.
C.5 Asymmetry Threshold Sensitivity
The asymmetry analysis in SS7 uses BERTScore delta as the "quality delta." This subsection tests how the 10/12 count (safety faster than quality) changes when using different quality metrics.
| Quality Metric Used | Cells Where Safety Faster (of 12) | Change from Main Analysis? |
|---|---|---|
| BERTScore (main) | 10/12 | -- |
| ROUGE-L | 10/12 | No change |
| Coherence | 10/12 | No change |
| Average of all 3 quality metrics | 10/12 | No change |
Observations.
- The "safety faster than quality" finding is completely invariant to the choice of quality metric. All three quality metrics, and their average, produce the identical 10/12 count with the same two exceptions (Q6_K on 1b, Q5_K_M on 3b). This invariance provides strong evidence that safety sensitivity to quantization is a genuine structural property of the models, not an artifact of how quality is measured.
- The two cells where quality degrades faster (Q6_K on 1b: ratio 0.9, Q5_K_M on 3b: ratio 0.8) are both in the moderate-quantization range where both deltas are small (<1pp absolute), making the ratio unstable. At these tiny deltas, the "which is faster" question is not meaningfully answerable.
Appendix D: Glossary
Statistical Terms
| Term | Definition |
|---|---|
| Asymmetry ratio | |
| Bootstrap CI | Confidence interval estimated by resampling with replacement (B = 2000, seed = 42) |
| Cohen's d | Standardized mean difference between two groups; 0.2 small, 0.5 medium, 0.8 large (Cohen, 1988) |
| Cohen's h | Effect size for comparing two proportions; h = 2 |
| Holm-Bonferroni | Step-down multiple comparison correction; controls family-wise error rate (Holm, 1979) |
| Leave-one-out (LOO) | Sensitivity analysis removing each data point in turn to assess robustness |
| MDE | Minimum Detectable Effect -- smallest true effect the study can detect at given power |
| OLS regression | Ordinary least squares linear regression; fits y = mx + b minimizing squared residuals |
| Pearson's r | Linear correlation coefficient; -1 to +1; measures strength and direction of linear relationship |
| R-squared | Proportion of variance in the dependent variable explained by the linear model (0-1) |
| Simpson's paradox | When an aggregate trend reverses upon conditioning on a confounding variable (Simpson, 1951) |
| TOST | Two One-Sided Tests for equivalence; confirms that effect lies within a practical margin (Schuirmann, 1987) |
| Welch's t-test | t-test not assuming equal variances between groups (Welch, 1947) |
Domain-Specific Terms
| Term | Definition |
|---|---|
| BPW | Bits Per Weight -- effective precision of a quantized model |
| FP16 | Half-precision floating point (16-bit); baseline for all comparisons |
| GQA | Grouped Query Attention -- attention mechanism used in Llama 3.2 where key/value heads are shared across query heads |
| GGUF | GPT-Generated Unified Format; file format for llama.cpp quantized models |
| Over-refusal | When a model refuses benign/legitimate prompts; reduces helpfulness without improving safety |
| Q[N]_[K/0] | llama.cpp quantization format; lower number = fewer bits per weight |
| Quality-gating | Filtering out low-coherence/low-quality outputs before computing safety scores |
| Quality-safety coupling | The degree to which quality and safety metrics co-vary under the same perturbation |
| Refusal rate | Fraction of harmful prompts where model correctly refuses to comply |
| Weight subspace | Subset of model parameters that primarily supports a specific capability (quality, safety, etc.) |
Appendix E: Configs
# === TR142 Run Configuration (source of truth) ===
tr: 142
title: "Quality-Safety Correlation Under Quantization"
version: "1.0"
type: analysis_only
research_questions:
q12: "Do quality and safety degrade on the same samples under quantization, or independently?"
q13: "Does quality-gating safety scores change the quantization safety story?"
sources:
quality:
tr: 125
phase: 2
path: "results/eval/tr125_phase2/20260221_120035"
files:
samples: "samples.jsonl"
analysis: "phase2_analysis.json"
aggregate: "aggregate.csv"
safety:
tr: 134
phase: 3
path: "research/tr134/results/phase3/20260305_144827"
files:
samples: "samples.jsonl"
analysis: "phase3_analysis.json"
degradation: "phase3_degradation.csv"
aggregate: "aggregate.csv"
overlap:
models: ["llama3.2-1b", "llama3.2-3b"]
quants: ["FP16", "Q8_0", "Q6_K", "Q5_K_M", "Q4_K_M", "Q3_K_S", "Q2_K"]
backend: "ollama"
shared_tasks: ["mmlu_real", "arc_challenge"]
analysis_passes:
- "pass_01_load_and_normalize"
- "pass_02_aggregate_quality_by_model_quant"
- "pass_03_aggregate_safety_by_model_quant"
- "pass_04_merge_quality_safety_matrix"
- "pass_05_pearson_correlation_degradation_curves"
- "pass_06_per_sample_capability_overlap"
- "pass_07_quality_gated_safety_scores"
- "pass_08_divergence_analysis"
- "pass_09_bootstrap_cis"
- "pass_10_cohen_d_pairwise"
- "pass_11_capability_consistency_validation"
- "pass_12_bits_per_weight_regression"
- "pass_13_asymmetry_index"
- "pass_14_power_analysis"
statistical_methods:
alpha: 0.05
power: 0.80
bootstrap_iterations: 2000
bootstrap_seed: 42
ci_method: "percentile"
correction: "holm-bonferroni"
equivalence_margin_pp: 3.0
Those config excerpts are the final source of truth for what TR142 actually ran.