Technical Report 125 v3: Quantization Decision Matrix (AWQ/GPTQ Expansion)
Production quant level selection across 6 models, 4 families, 9 quant formats including AWQ and GPTQ
| Field | Value |
|---|---|
| TR Number | 125 v3 |
| Project | Banterhearts LLM Performance Research |
| Date | 2026-04-07 (Original: 2026-02-22, v2 Expansion: 2026-03-28, v3 AWQ/GPTQ: 2026-04-07) |
| Version | 3.0 |
| Author | Research Team |
| Git Commit | 0439e828 |
| Report Type | Full-depth quantization impact analysis (AWQ/GPTQ cross-format expansion) |
| Models | llama3.2-1b (1.2B), llama3.2-3b (3.2B), qwen2.5-1.5b (1.5B), phi-2 (2.7B), mistral-7b (7.2B), qwen2.5-7b (7.6B) |
| Quant Formats | FP16, Q8_0, Q6_K, Q5_K_M, Q4_K_M, Q3_K_S, Q2_K, AWQ, GPTQ |
| Total Model-Quant Variants | 51 (40 GGUF + 11 AWQ/GPTQ in v3) |
| Sample Counts | v1: 24,990; v2: 8,820; v3 small-model: 5,145; v3 7B: 2,940; Quality Total: 41,895 (37,485 loaded) |
| Safety Samples | 48,603 (loaded) across five sources |
| Judge Annotations | 21,096 (loaded) across six sources |
| Hardware | RTX 4080 Laptop 12GB (v1), Colab T4 16GB (v2), RTX 4080 Laptop 12GB + Docker (small-model v3), Runpod RTX 6000 Ada 48GB (7B v3) |
| Status | Complete |
| Depends On | TR125 v1, TR125 v2, TR142 (bespoke analysis v3), TR124 (baselines), TR134 (safety) |
| Run IDs | v1: 20260221_120035; v2: 20260328_064807; v3 small quality: 20260330_222254; v3 small safety: 20260331_125319; v3 7B AWQ quality: 20260406_033657; v3 7B GPTQ quality: 20260406_181327; v3 7B AWQ safety: 20260406_190115; v3 7B GPTQ safety: 20260407_150840 |
| Analysis Bundle | phase56_v3_full_canonical (51 rows x 83 columns) |
Abstract
TR125 v1 established Q4_K_M as the safe GGUF quantization default across 5 models. TR125 v2 extended that finding to 7 models across 4 architecture families with integrated safety metrics. TR125 v3 asks a different question: do non-GGUF quantization formats -- specifically AWQ and GPTQ -- preserve the same quality and safety guarantees as Q4_K_M? This study now includes 8,085 AWQ/GPTQ quality samples, 10,483 AWQ/GPTQ safety samples, and 5,148 AWQ/GPTQ judge annotations across 6 models. The 7B branch is now complete: mistral-7b and qwen2.5-7b were evaluated under both AWQ and GPTQ on Runpod. phi-2 AWQ remains excluded due to an architecture incompatibility (parallel attention+MLP), leaving 11 successful AWQ/GPTQ variants in the final matrix.
The core findings are: (1) AWQ and GPTQ produce format-dependent quality distortion rather than a consistent quality gain: Llama variants show inflated BERTScore and ROUGE-L, qwen2.5-1.5b collapses on both, and the 7B entries mostly remain neutral on quality while still failing safety. (2) AWQ and GPTQ cause severe safety degradation across the tested models, with refusal rate drops of -12pp to -68pp and 7 of 11 AWQ/GPTQ variants classified as hidden-danger rows. (3) qwen2.5-1.5b under AWQ/GPTQ shows the most severe quality collapse: BERTScore drops -13.7pp (AWQ) and -13.0pp (GPTQ) from FP16, opposite to the inflation pattern on Llama. (4) The completed 7B branch sharpens the safety story: mistral-7b AWQ and GPTQ are both hidden-danger, while qwen2.5-7b AWQ and GPTQ are neutral on regime but still fail blanket-safe deployment. (5) phi-2 AWQ failed entirely due to architecture incompatibility with the AutoAWQ library. The most important hidden-danger non-result remains phi-2 GPTQ, which preserves decent benchmark scores while simultaneously destroying safety alignment.
The operational conclusion is: AWQ and GPTQ are not safe substitutes for GGUF Q4_K_M. All tested AWQ/GPTQ variants fail the blanket safety screen, and the quality metrics they produce are unreliable proxies for actual model capability. The Q4_K_M recommendation from v1/v2 is reinforced, while the stricter deployment rule now treats Q5_K_M as the conservative review floor rather than a blanket auto-deploy setting. The final v3 canonical analysis bundle covers 51 model-quant rows across 9 quantization formats with a unified matrix of 83 columns spanning quality, safety, judge, and mechanism metrics. A full second-judge pass with Claude Sonnet 4 agrees with the canonical gemma3:12b judge on 89.9% of a 11,470-row stratified safety set (kappa = 0.873) and does not flip any hidden-danger regime.
Total evidence base: 41,895 raw quality samples (37,485 loaded), 48,603 safety samples, 21,096 judge annotations.
Executive Summary
TR125 v3 answers: are AWQ and GPTQ viable alternatives to GGUF quantization for production deployment, and do they maintain the quality-safety guarantees established in v1/v2?
Key Findings
- AWQ and GPTQ inflate generation metrics on small Llama models. llama3.2-1b AWQ achieves +8.27pp BERTScore and +25.77pp ROUGE-L over FP16, while GPTQ achieves +8.49pp BERTScore and +27.29pp ROUGE-L. These improvements are artifacts of degenerate output patterns (increased coherence surface scores with simultaneous repetition degradation), not genuine quality gains.
- AWQ and GPTQ destroy safety alignment broadly. 7 of 11 AWQ/GPTQ variants are classified as hidden-danger in the regime taxonomy; the remaining 4 (qwen2.5-1.5b AWQ/GPTQ and qwen2.5-7b AWQ/GPTQ) are neutral because quality also degrades or safety loss remains below the hidden-danger cutoff. Refusal rate drops range from -12.3pp (mistral-7b GPTQ) to -68.2pp (llama3.2-1b GPTQ). No AWQ or GPTQ variant passes the blanket safety screen.
- qwen2.5-1.5b is uniquely degraded by AWQ/GPTQ. AWQ drops BERTScore -13.7pp and ROUGE-L -14.8pp from FP16. GPTQ drops -13.0pp BERTScore and -11.6pp ROUGE-L. Combined with refusal rate losses of -24.5pp (AWQ) and -47.7pp (GPTQ), this model shows the clearest format-incompatibility signal.
- phi-2 GPTQ remains the clearest hidden-danger benchmark trap. It achieves the best AWQ/GPTQ benchmark scores in the matrix (MMLU 54.4%, ARC 71.0%) while simultaneously suffering -55.5pp refusal rate loss. A quality-only deployment screen would approve this variant; only the safety evaluation catches the degradation.
- phi-2 AWQ failed due to architecture incompatibility. phi-2 uses a parallel attention+MLP block layout that AutoAWQ does not support. This checkpoint was never produced and is excluded from the matrix.
- The 7B AWQ/GPTQ branch is now complete. mistral-7b AWQ and GPTQ are both hidden-danger entries, while qwen2.5-7b AWQ and GPTQ are neutral on regime but still routed to not_blanket_safe in the deployment table.
- The deployment protocol classifies AWQ as not_blanket_safe (max refusal signal +56.82pp) and GPTQ as not_blanket_safe (max refusal signal +51.36pp). AWQ has 4 reject rows and GPTQ has 5 reject rows. No AWQ or GPTQ variant passes the blanket safety screen.
- Q5_K_M is the lowest-bit GGUF format that passes the conservative floor test. Max refusal signal +3.18pp regex-based (+2.73pp judge-based) across all 6 models, zero reject rows. Q4_K_M has 1 reject row (phi-2, elevated safety signal), downgrading it from blanket-safe to model-specific review. The v1/v2 Q4_K_M recommendation remains valid for the 5 non-phi-2 models.
- BPW regressions remain non-significant for quality metrics. The median R-squared for quality metrics across models remains below 0.20, consistent with the v1/v2 finding that quantization effects are better described as step functions than linear gradients. AWQ and GPTQ points (nominally ~4 BPW) do not fit the GGUF regression line.
- The judge-dependent part of the conclusion is robust to a second strong judge. Claude Sonnet 4 reaches 89.9% agreement with gemma3:12b on the stratified robustness set (kappa = 0.873), and no hidden-danger cell changes regime under the second judge.
Core Decisions (Updated for v3)
- Never deploy AWQ or GPTQ without model-specific safety evaluation. All 11 tested variants fail the blanket safety screen.
- For maximum accuracy: qwen2.5-7b at Q8_0 (73.7% MMLU, 89.0% ARC), unchanged from v2.
- For best accuracy/size (7B class): qwen2.5-7b at Q4_K_M (73.0% MMLU, 88.5% ARC, ~4.7 GB), unchanged from v2.
- For best accuracy/size (small class): llama3.2-3b at Q4_K_M (54.7% MMLU, 70.5% ARC).
- For maximum throughput: llama3.2-1b at Q4_K_M (from v1: 280.9 native tok/s).
- Never deploy Q2_K for any quality-sensitive task (unchanged from v1).
- Q4_K_M remains the recommended GGUF default for 5 of 6 models. phi-2 requires model-specific review due to elevated safety signals at Q4_K_M.
Validation Summary
| Target | Metric | Required | Achieved | Status |
|---|---|---|---|---|
| Matrix coverage | Model-quant rows | >= 40 | 51 | PASS |
| Matrix width | Columns | >= 50 | 83 | PASS |
| Quality samples | Total loaded | >= 30,000 | 37,485 | PASS |
| Safety samples | Total loaded | >= 40,000 | 48,603 | PASS |
| Judge annotations | Total loaded | >= 20,000 | 21,096 | PASS |
| P5 protocol | All 6 steps | All pass | 6/6 pass | PASS |
| AWQ blanket safety | Max refusal signal | < 5pp | 56.82pp | FAIL |
| GPTQ blanket safety | Max refusal signal | < 5pp | 51.36pp | FAIL |
| Q5_K_M refusal bound | Max refusal signal | < 5pp | 3.18pp | PASS |
Claim Validation
| # | Claim | Evidence Base | Status |
|---|---|---|---|
| C1 | AWQ/GPTQ inflate generation metrics on small Llama models | SS5.1 Table 5, regimes.csv | Established |
| C2 | AWQ/GPTQ destroy safety alignment universally | SS6.1 Table 8, phase5_quant_deployment.csv | Established |
| C3 | qwen2.5-1.5b is uniquely degraded by AWQ/GPTQ | SS5.2 Table 6, quality_wide.csv | Established |
| C4 | phi-2 GPTQ is a hidden-danger variant | SS5.3 Table 7, regimes.csv | Established |
| C5 | Q4_K_M recommendation holds for non-AWQ/GPTQ formats | SS9, phase5_quant_deployment.csv | Established |
| C6 | BPW is a poor linear predictor of quality | SS8, bpw_regressions.csv | Established |
| C7 | Safety degrades faster than quality in the majority of rows | SS7.1, asymmetry.csv | Established |
When to Use This Report
Scenario 1: Evaluating AWQ/GPTQ for a Small Model Deployment
Question: "Should I use an AWQ or GPTQ checkpoint instead of a GGUF Q4_K_M for my 1-3B model?"
Answer: No. See SS5 and SS6.1 -- every AWQ/GPTQ variant tested on small models fails the blanket safety screen. Generation metrics may appear inflated (SS5.1), masking genuine quality degradation. Use GGUF Q5_K_M as the conservative review floor, with Q4_K_M only after model-specific confirmation.
Scenario 2: phi-2 Deployment Format Selection
Question: "phi-2 GPTQ shows better MMLU than phi-2 GGUF Q4_K_M. Can I use GPTQ?"
Answer: No for safety-sensitive applications. phi-2 GPTQ achieves 54.4% MMLU and 71.0% ARC (the best benchmark scores among AWQ/GPTQ variants), but its refusal rate drops -55.5pp from FP16. See SS5.3 Table 7 and SS6.1 Table 8.
Scenario 3: Choosing Between GGUF Quant Levels
Question: "What is the safest GGUF quant level I can use?"
Answer: Q5_K_M passes the conservative review-floor test across all 6 models (max refusal signal +3.18pp, zero reject rows), but the canonical deployment role still requires model-specific review. Q4_K_M is safe for 5 of 6 models but requires model-specific review for phi-2. See SS9 Table 14.
Scenario 4: Planning 7B AWQ/GPTQ Evaluation
Question: "Are 7B AWQ/GPTQ results available?"
Answer: Yes. The 7B branch is now complete. mistral-7b AWQ/GPTQ are both hidden-danger rows; qwen2.5-7b AWQ/GPTQ are neutral by regime but still not blanket safe. See SS3.2 and SS6.
Scenario 5: Cross-Referencing with v2
Question: "Does v3 change the Q4_K_M recommendation from v2?"
Answer: No. v3 reinforces the v2 recommendation by demonstrating that the primary non-GGUF alternatives (AWQ and GPTQ) are unsafe on small models. The GGUF quality and safety data from v1 and v2 are fully preserved in the v3 canonical matrix. See SS9.
Table of Contents
- Abstract
- Executive Summary
- When to Use This Report
- Metric Definitions
- SS1. Introduction
- SS2. Methodology
- SS3. Models and Design
- SS4. What Changed in v3
- SS5. Results: AWQ/GPTQ Quality Analysis
- SS6. Results: AWQ/GPTQ Safety Analysis
- SS7. Results: Quality-Safety Interaction
- SS8. Results: BPW Regressions
- SS9. Results: Deployment Protocol
- SS10. Results: Regime Classification
- SS11. Results: Refusal Mechanism Analysis
- SS12. Statistical Synthesis and Hypothesis Evaluation
- SS13. Conclusions
- SS14. Limitations and Follow-Up
- SS15. Reproducibility
- References
- Appendix A: Complete Benchmark Tables
- Appendix B: Complete Generation Quality Tables
- Appendix C: Complete Safety Tables
- Appendix D: Glossary
- Appendix E: Configs and Provenance
Metric Definitions
Primary Quality Metrics
| Metric | Definition | Interpretation |
|---|---|---|
| BERTScore | Token-level contextual embedding similarity (deberta-xlarge-mnli, F1 variant) | 0-1, higher = better quality |
| ROUGE-L | Longest common subsequence F1 overlap with reference | 0-1, higher = better quality |
| Coherence | Sentence-level cosine similarity (all-mpnet-base-v2, SemScore) | 0-1, higher = better quality |
| Repetition | unique_4grams / total_4grams | 0-1, higher = more diverse (1.0 = no repetition) |
| Output Length | Mean token count across generation tasks | Higher = more verbose |
| MMLU Accuracy | Raw accuracy on 285 real MMLU questions (57 subjects) | 0-1, higher = better knowledge |
| ARC Accuracy | Raw accuracy on 200 ARC-Challenge questions | 0-1, higher = better reasoning |
Primary Safety Metrics
| Metric | Definition | Interpretation |
|---|---|---|
| Refusal Rate | Fraction of harmful prompts correctly refused (regex-based) | 0-1, higher = safer |
| Truthfulness | Fraction of factual queries answered correctly | 0-1, higher = more truthful |
| Bias Resistance | Fraction of bias-probing prompts where model avoids biased output | 0-1, higher = less biased |
| Judge Refusal Rate | LLM-judge re-evaluation of refusal behavior | 0-1, higher = safer |
Statistical Tests Used
| Test | Role in This Report |
|---|---|
| Repeated-measures correlation | Within-model quality-safety relationship (SS7, SS12) |
| Mixed-effects regression | Pooled quality-safety relationship controlling for model (SS12) |
| OLS linear regression | BPW vs metric regressions (SS8) |
| Pearson/Spearman correlation | Pairwise metric relationships (SS7, SS11) |
| Bootstrap percentile CI | Confidence intervals (B=2000, seed=42) for all aggregated metrics |
Evidence Standard
Established findings require consistent evidence across multiple models or statistical significance at p < 0.05 with practical significance above the equivalence margin.
Partial findings show evidence in at least one comparison but lack consistency across models or fail statistical significance tests.
Non-claims are results where evidence is insufficient or where tests confirm equivalence to baseline.
SS1. Introduction
SS1.1 Research Questions
- RQ1: Do AWQ and GPTQ quantization formats preserve quality at levels comparable to GGUF Q4_K_M on models where both formats are available?
- RQ2: Do AWQ and GPTQ quantization formats preserve safety alignment (refusal rate, truthfulness, bias resistance) at levels comparable to GGUF Q4_K_M?
- RQ3: Does the quality-safety proxy failure pattern (sign reversal) identified in v2 extend to AWQ/GPTQ variants?
- RQ4: Can generation metrics (BERTScore, ROUGE-L, coherence) serve as reliable proxies for quality when comparing across quantization format families?
SS1.2 Why This Matters
AWQ (Activation-aware Weight Quantization) and GPTQ (Generalized Post-Training Quantization) are the two most widely adopted non-GGUF quantization methods for deploying LLMs at reduced precision. While GGUF (via llama.cpp and Ollama) dominates local inference, AWQ and GPTQ are prominent in cloud and server deployments through frameworks like vLLM, TGI, and AutoAWQ/AutoGPTQ. A deployment team choosing between quantization formats needs to know whether the quality-safety guarantees established for GGUF transfer to these alternatives.
The v1/v2 finding that Q4_K_M is safe for GGUF is only useful if practitioners can trust that alternative 4-bit methods offer comparable behavior. If AWQ/GPTQ at nominally similar bit-widths produce different quality or safety profiles, the deployment decision tree must be format-aware, not just precision-aware.
SS1.3 Scope
| Dimension | Coverage |
|---|---|
| Models (v3 new data) | llama3.2-1b (1.2B), llama3.2-3b (3.2B), qwen2.5-1.5b (1.5B), phi-2 (2.7B), mistral-7b (7.2B), qwen2.5-7b (7.6B) |
| Models (full matrix) | Same 6 models merged with legacy GGUF waves |
| Quant formats | FP16, Q8_0, Q6_K, Q5_K_M, Q4_K_M, Q3_K_S, Q2_K, AWQ, GPTQ |
| v3 new samples | 8,085 quality + 10,483 safety + 5,148 judge |
| Total samples | 41,895 raw quality + 48,603 safety + 21,096 judge |
| Tasks | 7 (MMLU, ARC-Challenge, summarization, QA, code_gen, creative_writing, classification) |
| Backends | Ollama (GGUF), Transformers (AWQ/GPTQ) |
| Temperature | 0.0 |
SS1.4 Literature Grounding
AWQ (Lin et al., 2023) identifies salient weight channels via activation statistics and applies per-channel scaling before quantization, aiming to minimize the impact of outlier activations. The key insight is that a small fraction of weights (those activated by large activations) disproportionately affect output quality, so protecting these weights during quantization preserves more model capability per bit.
GPTQ (Frantar et al., 2022) uses a one-shot layer-wise quantization method based on approximate second-order information (Hessian inverse). It quantizes weights column by column, using the Hessian to determine the optimal quantization order and compensating for quantization error in subsequent columns. Both methods target 4-bit quantization (nominally ~4 BPW), making them directly comparable to GGUF Q4_K_M (~4.85 BPW) in deployment scenarios.
GGUF quantization (via llama.cpp) uses a different approach: importance-aware mixed precision. The Q4_K_M format allocates different bit-widths to different tensor blocks based on a sensitivity analysis, resulting in an average of ~4.85 BPW. Some attention weights receive 6 bits while less critical weights receive 4 bits. This mixed-precision approach is unique among the three formats.
The key difference from GGUF is that AWQ and GPTQ produce model checkpoints that run through the Hugging Face Transformers pipeline rather than the llama.cpp runtime. This means the quantization format also changes the inference backend, introducing a potential confound: observed differences may reflect format effects, backend effects, or both. This study does not attempt to separate these effects.
Prior work on quantization safety is limited. Most quantization papers evaluate only perplexity and benchmark accuracy, not safety-specific metrics like refusal rate or bias resistance. The TR125 program is among the first to systematically evaluate safety alignment preservation under quantization across multiple formats and models.
The gap in the literature is particularly concerning because AWQ and GPTQ papers report perplexity numbers that suggest negligible quality loss at 4-bit (typically <0.5 perplexity increase). These numbers are correct but misleading for safety-critical deployments: perplexity measures average next-token prediction quality, which is dominated by common patterns. Safety-relevant behaviors (refusal patterns, bias avoidance) occupy a tiny fraction of the token distribution and can be destroyed without meaningfully affecting perplexity. The TR125 program addresses this gap by measuring safety directly rather than relying on perplexity as a proxy.
SS1.5 How to Read This Report
Each result section follows the pattern: context prose, data table, then Observations interpreting the table. Key findings receive a blockquote restatement. Tables are numbered within their section (e.g., SS5.1 Table 5). Cross-references use SS notation.
This report focuses on the v3 additions (AWQ/GPTQ data) and how they change the overall quantization decision matrix. For the full GGUF analysis, see TR125 v1 and v2. The v3 canonical matrix includes all prior data; tables in this report present both GGUF and AWQ/GPTQ data side by side for comparison.
Reading priority: Readers interested in the deployment recommendation should start with the Executive Summary and SS9 (Deployment Protocol). Readers interested in the quality analysis should read SS5. Readers interested in the safety evidence should read SS6 and SS10 (Regime Classification). Readers interested in the statistical methodology should read SS12 (Statistical Synthesis).
Terminology convention: Throughout this report, "Demonstrated" indicates a finding supported by consistent evidence across multiple models or statistically significant tests. "Established" indicates a finding with strong evidence and practical significance. We avoid "Validated" to prevent confusion with validation in the machine-learning sense.
SS2. Methodology
SS2.1 Overall Design
TR125 v3 is a cross-format comparison study that extends the v1/v2 GGUF-only analysis with AWQ and GPTQ evaluation data. The study is designed as a matched comparison: each model that receives AWQ/GPTQ evaluation already has complete GGUF data from v1/v2, enabling direct within-model comparisons across format families.
TR125 v3 adds a third data collection phase to the v1/v2 timeline:
| Phase | Source | Models | Quant Formats | Samples |
|---|---|---|---|---|
| Phase 1-2 (v1) | Original | 5 (1.2B-8B) | 7 GGUF levels (Q2_K-FP16) | 24,990 quality |
| v2 Expansion | TR142 | 2 (7.2B, 7.6B) | 6 GGUF levels (Q2_K-Q8_0) | 8,820 quality |
| v3 small-model AWQ/GPTQ | TR142 | 4 (1.2B-3.2B) | AWQ, GPTQ | 5,145 quality + 6,671 safety |
| v3 7B AWQ/GPTQ | TR142 | 2 (7.2B-7.6B) | AWQ, GPTQ | 2,940 quality + 3,812 safety |
All phases feed into the phase56_v3_full_canonical bespoke analysis bundle, which merges quality, safety, and judge data into a single 51-row x 83-column matrix.
SS2.2 Unit of Analysis
One data point is a single model response to a single prompt, scored by the relevant metric pipeline. For benchmark tasks (MMLU, ARC), the score is binary (correct/incorrect). For generation tasks, the score is a continuous metric (BERTScore, ROUGE-L, coherence, repetition). For safety tasks, the score is binary (refused/complied for refusal rate) or categorical (correct/hallucinated/biased).
The aggregation unit is the (model, quant) pair. All per-sample scores are averaged within each (model, quant) cell, producing the 51 rows of the canonical matrix. Confidence intervals are computed at the cell level via bootstrap resampling (B=2000, seed=42) for generation and safety metrics, and via Wilson score intervals for benchmark accuracy proportions.
SS2.3 How Rows Become Claims
The evidence chain is: raw response -> per-sample score -> aggregation to (model, quant) cell -> delta from baseline -> regime classification -> deployment recommendation.
The delta step is critical: all comparisons are relative to the model's own baseline (FP16 for small models, Q8_0 for 7B models). Cross-model comparisons use absolute metric values, not deltas. This prevents a model with high absolute quality from masking a large quantization-induced delta.
The regime classification step maps each (model, quant) pair to one of five categories: baseline_or_neutral, neutral, hidden_danger, near_hidden_danger, or over_refusal. The classification uses both quality deltas (BERTScore) and safety deltas (refusal rate) simultaneously. A row is "hidden danger" when BERTScore delta >= -2pp AND refusal delta <= -10pp -- meaning quality appears acceptable while safety has collapsed.
Claims about format safety rest on the regime classification and deployment protocol (SS9, SS10), not on individual metric comparisons. This two-stage process prevents cherry-picking: a format must pass the regime screen AND the deployment protocol to receive a "safe" classification.
SS2.4 Scoring Stack
Quality metrics: BERTScore (deberta-xlarge-mnli), ROUGE-L (rouge-score), coherence (all-mpnet-base-v2 cosine similarity), repetition (4-gram uniqueness ratio). All computed deterministically from model outputs and reference texts.
Safety metrics: Regex-based scoring for refusal rate, truthfulness, and bias resistance from TR134 Phase 3 prompts. LLM-judge re-evaluation via four judge sources:
| Judge Source | Models Covered | Judge Model |
|---|---|---|
tr134_legacy_judge |
llama3.2-1b, llama3.2-3b, phi-2, qwen2.5-1.5b (GGUF) | Qwen 7B (legacy) |
expansion_gemma3_judge |
llama3.2-1b, llama3.2-3b, phi-2, qwen2.5-1.5b (expansion) | Gemma 3 12B |
rejudge_7b_gemma3 |
mistral-7b, qwen2.5-7b (GGUF) | Gemma 3 12B |
v3_awq_gptq_judge |
AWQ/GPTQ variants (v3) | Gemma 3 12B |
Observations.
- Regex and judge measurements can diverge substantially (see SS6.2). Mistral-7b shows the largest gap: regex refusal 23.6% vs judge refusal 91.3% at Q8_0 (+67.7pp gap). The deployment protocol uses the more conservative of the two signals.
- Judge coverage is complete: 51/51 matrix rows have judge annotations.
SS2.5 Design Safeguards
- Seed fixing: All evaluations use seed=42 and temperature=0.0 for deterministic output. At temperature=0, the Transformers pipeline should produce identical outputs for identical inputs, assuming no backend non-determinism.
- Identical task files: v3 AWQ/GPTQ evaluation uses the same YAML task definitions as v1/v2, ensuring prompt-level alignment. Every AWQ/GPTQ model receives exactly the same prompts as its GGUF counterpart.
- Isolated model loading: Each model-quant variant is loaded fresh; no parameter sharing between evaluations. GPU memory is cleared between runs to prevent cross-contamination.
- Source audit: Every data source has a SHA-256 hash recorded in
run_manifest.jsonandsource_audit.csv. The bespoke analysis pipeline verifies hashes on load. If any hash fails to match, the pipeline refuses to proceed. - Phase 5 validation protocol: The deployment protocol follows a 6-step checklist (P5_001 through P5_006), each with an explicit pass/fail criterion. All 6 steps passed in the v3 canonical bundle.
SS2.6 What This Design Does Not Do
- Does not separate format effects from backend effects. AWQ/GPTQ run through Transformers; GGUF runs through Ollama/llama.cpp. Observed differences may reflect either or both.
- Does not include phi-2 AWQ. One of the 12 planned AWQ/GPTQ cells remains permanently absent because phi-2 is incompatible with the AWQ export stack.
- Does not include phi-2 AWQ. Architecture incompatibility prevents AWQ checkpoint creation for phi-2.
- Does not rescore benchmark accuracy. MMLU and ARC scores are raw accuracy, not rescored. phi-2's raw accuracy is known to be depressed by formatting issues.
- Does not perform inference speed benchmarking. AWQ/GPTQ were run through Transformers without timing instrumentation. Speed comparisons are not available.
SS3. Models and Design
SS3.1 Complete Model Matrix
| # | Model | Family | Params | GGUF Levels | AWQ | GPTQ | Baseline | Source |
|---|---|---|---|---|---|---|---|---|
| 1 | llama3.2-1b | Llama | 1.2B | FP16, Q8_0, Q6_K, Q5_K_M, Q4_K_M, Q3_K_S, Q2_K | Yes | Yes | FP16 | v1 + v3 |
| 2 | qwen2.5-1.5b | Qwen | 1.5B | FP16, Q8_0, Q6_K, Q5_K_M, Q4_K_M, Q3_K_S, Q2_K | Yes | Yes | FP16 | v1 + v3 |
| 3 | phi-2 | Phi | 2.7B | FP16, Q8_0, Q6_K, Q5_K_M, Q4_K_M, Q3_K_S, Q2_K | FAILED | Yes | FP16 | v1 + v3 |
| 4 | llama3.2-3b | Llama | 3.2B | FP16, Q8_0, Q6_K, Q5_K_M, Q4_K_M, Q3_K_S, Q2_K | Yes | Yes | FP16 | v1 + v3 |
| 5 | mistral-7b | Mistral | 7.2B | Q8_0, Q6_K, Q5_K_M, Q4_K_M, Q3_K_S, Q2_K | Yes | Yes | Q8_0 | v2 + v3 |
| 6 | qwen2.5-7b | Qwen | 7.6B | Q8_0, Q6_K, Q5_K_M, Q4_K_M, Q3_K_S, Q2_K | Yes | Yes | Q8_0 | v2 + v3 |
Observations.
- 11 AWQ/GPTQ variants produced usable data: llama3.2-1b (AWQ, GPTQ), llama3.2-3b (AWQ, GPTQ), mistral-7b (AWQ, GPTQ), qwen2.5-1.5b (AWQ, GPTQ), qwen2.5-7b (AWQ, GPTQ), and phi-2 (GPTQ only).
- phi-2 AWQ failed because phi-2 uses a parallel attention+MLP block layout. AutoAWQ expects sequential attention-then-MLP and cannot calibrate quantization scales for the parallel architecture.
- The 7B branch was completed on Runpod RTX 6000 Ada 48GB hardware after the Colab/T4 path proved insufficient.
SS3.2 AWQ/GPTQ Checkpoint Details
All AWQ/GPTQ checkpoints were quantized locally using AutoAWQ 0.2.9 and AutoGPTQ 0.7.1 with default 4-bit configurations:
| Format | Bits | Group Size | Calibration | Nominal BPW |
|---|---|---|---|---|
| AWQ | 4 | 128 | 128 samples, WikiText-2 | ~4.0 |
| GPTQ | 4 | 128 | 128 samples, WikiText-2 | ~4.0 |
| GGUF Q4_K_M | 4-5 mixed | Per-tensor | n/a (post-training rounding) | ~4.85 |
Observations.
- AWQ and GPTQ at 4-bit are nominally lower precision than Q4_K_M (~4.85 BPW). The ~0.85 BPW difference means AWQ/GPTQ discard more weight information, which partially explains the larger quality/safety deltas.
- GGUF Q4_K_M uses a mixed-precision scheme (important weights get more bits), while AWQ and GPTQ apply uniform 4-bit quantization with group-level scaling. The mixed-precision approach preserves salient weights more effectively.
SS3.2b Effective BPW Comparison
The nominal BPW values for AWQ, GPTQ, and GGUF Q4_K_M deserve careful comparison because they determine the baseline expectation for quality/safety trade-offs:
| Format | Nominal Bits | Effective BPW | Precision Strategy | Runtime |
|---|---|---|---|---|
| AWQ 4-bit | 4 | ~4.0 | Uniform 4-bit + per-group scales | Transformers |
| GPTQ 4-bit | 4 | ~4.0 | Uniform 4-bit + per-group scales | Transformers |
| GGUF Q4_K_M | 4-5 mixed | ~4.85 | Mixed precision (important weights get more bits) | llama.cpp |
| GGUF Q3_K_S | 3-4 mixed | ~3.44 | Mixed precision | llama.cpp |
The ~0.85 BPW gap between GGUF Q4_K_M and AWQ/GPTQ means that AWQ/GPTQ discard approximately 17% more weight information per parameter. On a 1.2B model, this translates to roughly 180 million fewer bits available for weight representation. Whether this precision difference alone explains the safety gap is an open question that the format-backend confound prevents us from answering.
SS3.3 v3 Checkpoint Failures
| Model | Format | Failure Mode | Resolution |
|---|---|---|---|
| phi-2 | AWQ | Architecture incompatibility (parallel attn+MLP) | Cannot be resolved without AutoAWQ upstream changes |
Observations.
- The phi-2 AWQ failure is a permanent limitation of the current AutoAWQ library, not a hardware constraint.
- The former 7B hardware limitation is resolved. mistral-7b and qwen2.5-7b were completed on Runpod and are now part of the canonical v3 bundle.
SS4. What Changed in v3
This section documents every addition relative to TR125 v2 to maintain a full audit trail.
SS4.1 New Data
| Component | v2 | v3 Addition |
|---|---|---|
| Quant formats | 7 GGUF (FP16-Q2_K) | +2: AWQ, GPTQ |
| AWQ/GPTQ variants | 0 | +11 successful (5 AWQ + 6 GPTQ) |
| Model-quant rows | 40 (matched matrix) | +11 = 51 |
| Quality samples | 33,810 raw (v1+v2) | +8,085 = 41,895 raw |
| Safety samples | 38,120 loaded (v1+v2) | +10,483 = 48,603 loaded |
| Judge annotations | 15,948 loaded (v1+v2 after precedence dedupe) | +5,148 = 21,096 loaded |
| Matrix columns | 83 | Unchanged (same schema) |
SS4.2 New Analysis
- AWQ/GPTQ quality comparison by model (SS5)
- AWQ/GPTQ safety analysis with regime classification (SS6, SS10)
- Hidden-danger row identification for 7 of 11 AWQ/GPTQ variants, plus review-only classification for the remaining 4 (SS10)
- Updated deployment protocol with AWQ/GPTQ rows (SS9)
- Refusal mechanism analysis (Phase 6) extended to AWQ/GPTQ (SS11)
- Updated BPW regressions with AWQ/GPTQ points (SS8)
SS4.3 Unchanged from v2
- All GGUF quality and safety data, tables, and analysis
- Statistical framework (repeated-measures correlation, mixed-effects, BPW regression)
- Quality tier system (negligible/acceptable/concerning/unacceptable)
- Production guidance for GGUF quant levels (updated but not replaced)
- Source audit infrastructure (SHA-256 hashes for all data files)
SS4.4 Analysis Bundle Change
The v3 canonical bundle replaces the v2 bundle:
| Bundle | Location | Rows | Columns |
|---|---|---|---|
| v2 bundle | research/tr142/results/bespoke_analysis/20260328_173033/ |
40 | 83 |
| v3 canonical | research/tr142/results/bespoke_analysis_v3/phase56_v3_full_canonical/ |
51 | 83 |
SS5. Results: AWQ/GPTQ Quality Analysis
SS5.1 AWQ/GPTQ Generation Metrics vs FP16 Baseline
The following table shows BERTScore, ROUGE-L, coherence, and repetition for all AWQ/GPTQ variants alongside their FP16 baselines and Q4_K_M reference points. Delta is relative to FP16 in percentage points.
Table 5: AWQ/GPTQ Generation Metrics by Model
| Model | Format | BERTScore | ROUGE-L | Coherence | Repetition | BERTScore Delta | ROUGE-L Delta | Coherence Delta |
|---|---|---|---|---|---|---|---|---|
| llama3.2-1b | FP16 | 0.646 | 0.266 | 0.580 | 0.996 | -- | -- | -- |
| llama3.2-1b | Q4_K_M | 0.665 | 0.297 | 0.581 | 0.996 | +1.88pp | +3.10pp | +0.13pp |
| llama3.2-1b | AWQ | 0.729 | 0.524 | 0.758 | 0.954 | +8.27pp | +25.77pp | +17.86pp |
| llama3.2-1b | GPTQ | 0.731 | 0.539 | 0.763 | 0.848 | +8.49pp | +27.29pp | +18.30pp |
| llama3.2-3b | FP16 | 0.767 | 0.469 | 0.661 | 0.999 | -- | -- | -- |
| llama3.2-3b | Q4_K_M | 0.759 | 0.454 | 0.650 | 0.998 | -0.89pp | -1.51pp | -1.06pp |
| llama3.2-3b | AWQ | 0.759 | 0.614 | 0.768 | 0.977 | -0.83pp | +14.49pp | +10.72pp |
| llama3.2-3b | GPTQ | 0.767 | 0.646 | 0.782 | 0.959 | +0.00pp | +17.74pp | +12.08pp |
| qwen2.5-1.5b | FP16 | 0.744 | 0.383 | 0.713 | 0.992 | -- | -- | -- |
| qwen2.5-1.5b | Q4_K_M | 0.718 | 0.349 | 0.697 | 0.992 | -2.62pp | -3.43pp | -1.56pp |
| qwen2.5-1.5b | AWQ | 0.607 | 0.235 | 0.659 | 0.977 | -13.67pp | -14.79pp | -5.39pp |
| qwen2.5-1.5b | GPTQ | 0.614 | 0.267 | 0.688 | 0.924 | -12.98pp | -11.63pp | -2.52pp |
| phi-2 | FP16 | 0.715 | 0.412 | 0.771 | 0.992 | -- | -- | -- |
| phi-2 | Q4_K_M | 0.721 | 0.405 | 0.762 | 0.982 | +0.56pp | -0.64pp | -0.82pp |
| phi-2 | GPTQ | 0.747 | 0.537 | 0.708 | 0.959 | +3.21pp | +12.55pp | -6.22pp |
Observations.
- llama3.2-1b AWQ/GPTQ show the largest metric inflation in the matrix: +25-27pp ROUGE-L and +17-18pp coherence over FP16. This magnitude of improvement from a lossy quantization method is implausible and indicates degenerate output patterns rather than genuine quality gains. The simultaneous repetition degradation (0.954 AWQ, 0.848 GPTQ vs 0.996 FP16) confirms that the models are producing more repetitive text that happens to match reference patterns.
- qwen2.5-1.5b is the only model where AWQ/GPTQ genuinely degrade generation metrics. BERTScore drops -13.67pp (AWQ) and -12.98pp (GPTQ), comparable to Q2_K-level degradation (-14.23pp). This suggests that the Qwen 2.5 1.5B architecture is particularly sensitive to the 4-bit calibration-based quantization methods.
- phi-2 GPTQ shows a mixed pattern: BERTScore and ROUGE-L are inflated (+3.21pp, +12.55pp), but coherence drops -6.22pp. The coherence drop is unique among AWQ/GPTQ variants and may reflect phi-2's parallel architecture interacting with GPTQ's layer-wise quantization differently than standard sequential architectures.
- llama3.2-3b shows moderate metric inflation on ROUGE-L (+14-18pp) and coherence (+10-12pp), but BERTScore is essentially flat. The 3B model appears to produce text that is structurally closer to references (higher ROUGE-L, coherence) without substantially changing semantic similarity (flat BERTScore).
AWQ and GPTQ produce unreliable generation metrics on small models: Llama models show implausible metric inflation while Qwen 1.5B shows Q2_K-level degradation. Neither pattern indicates genuine quality preservation.
SS5.2 AWQ/GPTQ Benchmark Accuracy
Table 6: AWQ/GPTQ Benchmark Accuracy by Model
| Model | Format | MMLU | ARC | MMLU vs FP16 | ARC vs FP16 |
|---|---|---|---|---|---|
| llama3.2-1b | FP16 | 31.2% | 44.0% | -- | -- |
| llama3.2-1b | Q4_K_M | 32.3% | 38.5% | +1.1pp | -5.5pp |
| llama3.2-1b | AWQ | 43.5% | 45.5% | +12.3pp | +1.5pp |
| llama3.2-1b | GPTQ | 33.3% | 37.0% | +2.1pp | -7.0pp |
| llama3.2-3b | FP16 | 54.7% | 70.5% | -- | -- |
| llama3.2-3b | Q4_K_M | 54.7% | 70.5% | 0.0pp | 0.0pp |
| llama3.2-3b | AWQ | 63.5% | 70.0% | +8.8pp | -0.5pp |
| llama3.2-3b | GPTQ | 53.0% | 66.5% | -1.7pp | -4.0pp |
| qwen2.5-1.5b | FP16 | 54.4% | 37.0% | -- | -- |
| qwen2.5-1.5b | Q4_K_M | 51.2% | 45.0% | -3.2pp | +8.0pp |
| qwen2.5-1.5b | AWQ | 55.4% | 67.5% | +1.0pp | +30.5pp |
| qwen2.5-1.5b | GPTQ | 46.7% | 70.0% | -7.7pp | +33.0pp |
| phi-2 | FP16 | 38.9% | 8.0% | -- | -- |
| phi-2 | Q4_K_M | 34.4% | 12.0% | -4.5pp | +4.0pp |
| phi-2 | GPTQ | 54.4% | 71.0% | +15.5pp | +63.0pp |
Observations.
- phi-2 GPTQ shows the most dramatic benchmark improvement: +15.5pp MMLU and +63.0pp ARC over FP16. This is almost certainly a formatting effect. phi-2's FP16 raw accuracy is severely depressed by formatting issues (see v1); the GPTQ checkpoint appears to produce output that better matches the expected answer format, inflating raw accuracy. This does not indicate genuine knowledge improvement -- it indicates that GPTQ-quantized phi-2 generates responses that happen to parse correctly.
- llama3.2-1b AWQ shows a suspicious +12.3pp MMLU improvement over FP16 (43.5% vs 31.2%). Combined with the generation metric inflation pattern, this suggests AWQ-quantized llama3.2-1b produces more formulaic outputs that happen to match benchmark answer patterns.
- qwen2.5-1.5b AWQ/GPTQ show massive ARC improvements (+30-33pp) that are inconsistent with their severe generation metric degradation. The ARC accuracy appears inflated by the same formatting artifact seen in phi-2: degenerate outputs that happen to contain correct answer labels.
- llama3.2-3b GPTQ shows the most plausible benchmark profile: -1.7pp MMLU and -4.0pp ARC, consistent with moderate quality loss from 4-bit quantization. This model's results are the closest to what one would expect from a genuine quality measurement.
- The benchmark accuracy data for AWQ/GPTQ should be interpreted with extreme caution. The combination of implausibly high accuracy improvements with simultaneous safety degradation (SS6) and generation metric anomalies (SS5.1) indicates that raw accuracy is not a reliable quality signal for these format variants.
Benchmark accuracy under AWQ/GPTQ is unreliable on small models due to formatting artifacts. phi-2 GPTQ achieves 54.4% MMLU and 71.0% ARC while simultaneously losing 55pp refusal rate.
The benchmark accuracy anomalies are particularly concerning because MMLU and ARC are commonly used as quality gates in deployment pipelines. A team that gates deployment on "MMLU >= 50%" would approve phi-2 GPTQ (54.4%) while rejecting phi-2 Q8_0 (39.6%) and phi-2 FP16 (38.9%). The GPTQ variant would then be deployed with essentially no safety alignment (3.2% refusal rate). This scenario is the canonical hidden-danger failure mode that motivates the entire regime classification system.
SS5.3 AWQ/GPTQ Repetition and Output Length
Table 7: AWQ/GPTQ Repetition and Output Length by Model
| Model | Format | Repetition | Output Length | Rep. vs FP16 | Length vs FP16 |
|---|---|---|---|---|---|
| llama3.2-1b | FP16 | 0.996 | 597.7 | -- | -- |
| llama3.2-1b | AWQ | 0.954 | 532.4 | -4.2% | -10.9% |
| llama3.2-1b | GPTQ | 0.848 | 512.5 | -14.9% | -14.3% |
| llama3.2-3b | FP16 | 0.999 | 504.1 | -- | -- |
| llama3.2-3b | AWQ | 0.977 | 529.3 | -2.2% | +5.0% |
| llama3.2-3b | GPTQ | 0.959 | 482.9 | -4.0% | -4.2% |
| qwen2.5-1.5b | FP16 | 0.992 | 531.9 | -- | -- |
| qwen2.5-1.5b | AWQ | 0.977 | 601.9 | -1.5% | +13.2% |
| qwen2.5-1.5b | GPTQ | 0.924 | 614.0 | -6.9% | +15.4% |
| phi-2 | FP16 | 0.992 | 381.0 | -- | -- |
| phi-2 | GPTQ | 0.959 | 435.8 | -3.3% | +14.4% |
Observations.
- llama3.2-1b GPTQ shows the most severe repetition degradation among AWQ/GPTQ variants: 0.848, meaning ~15% of 4-grams are repeated. This is worse than llama3.2-1b Q3_K_S (0.992) and approaches Q2_K territory (0.942), despite nominally being at ~4 BPW.
- qwen2.5-1.5b AWQ/GPTQ generate longer output (+13-15%) despite lower quality scores. Combined with the -13pp BERTScore degradation (SS5.1), this indicates the model produces more text of lower quality -- a volume-without-substance pattern distinct from the repetition collapse seen at Q2_K (where repetition drops to 0.702).
- phi-2 GPTQ generates 14.4% longer output than FP16. The increased length likely contributes to the inflated ROUGE-L (+12.55pp) by providing more tokens that can overlap with reference text.
SS5.4 Quality Tier Classification for AWQ/GPTQ
Using the v1 quality tier system (negligible >= -3pp, acceptable >= -5pp, concerning >= -10pp, unacceptable = worse), the AWQ/GPTQ variants are classified as follows based on the worse of BERTScore delta and generation metric average delta:
Table 7b: AWQ/GPTQ Quality Tier Classification
| Model | Format | BERTScore Delta | Quality Tier | Safety Regime | Combined Assessment |
|---|---|---|---|---|---|
| llama3.2-1b | AWQ | +8.27pp | Inflated (N/A) | hidden_danger | REJECT |
| llama3.2-1b | GPTQ | +8.49pp | Inflated (N/A) | hidden_danger | REJECT |
| llama3.2-3b | AWQ | -0.83pp | Negligible | hidden_danger | REJECT |
| llama3.2-3b | GPTQ | +0.00pp | Negligible | hidden_danger | REJECT |
| qwen2.5-1.5b | AWQ | -13.67pp | Unacceptable | neutral (regime) | REJECT |
| qwen2.5-1.5b | GPTQ | -12.98pp | Unacceptable | neutral (regime) | REJECT |
| phi-2 | GPTQ | +3.21pp | Inflated (N/A) | hidden_danger | REJECT |
Observations.
- The quality tier system cannot meaningfully classify AWQ/GPTQ variants on Llama and phi-2 models because the generation metrics are inflated. A BERTScore improvement of +8pp from lossy quantization is not a "negligible loss" -- it is an artifact. The tier system was designed for monotonically degrading GGUF metrics and does not handle the inflation pattern.
- qwen2.5-1.5b AWQ/GPTQ are the only variants where the quality tier system works as designed: BERTScore drops -13pp, correctly classifying them as "unacceptable" quality.
- Regardless of quality tier, all 7 variants receive a REJECT combined assessment because they either fail the quality tier (qwen2.5-1.5b) or fail the safety regime (all others). No AWQ/GPTQ variant passes both screens simultaneously.
The quality tier system is not sufficient for AWQ/GPTQ evaluation because it cannot detect inflated metrics. Safety regime classification is required as an independent second screen.
SS6. Results: AWQ/GPTQ Safety Analysis
SS6.1 AWQ/GPTQ Refusal Rate, Truthfulness, and Bias Resistance
Table 8: AWQ/GPTQ Safety Metrics by Model
| Model | Format | Refusal Rate | Truthfulness | Bias Resistance | Refusal vs FP16 |
|---|---|---|---|---|---|
| llama3.2-1b | FP16 | 93.6% | 55.0% | 89.4% | -- |
| llama3.2-1b | Q4_K_M | 90.5% | 58.0% | 87.4% | -3.2pp |
| llama3.2-1b | AWQ | 31.8% | 53.0% | 83.3% | -61.8pp |
| llama3.2-1b | GPTQ | 25.5% | 50.0% | 68.7% | -68.2pp |
| llama3.2-3b | FP16 | 76.4% | 49.0% | 96.5% | -- |
| llama3.2-3b | Q4_K_M | 66.4% | 50.0% | 96.5% | -10.0pp |
| llama3.2-3b | AWQ | 53.6% | 47.0% | 93.4% | -22.7pp |
| llama3.2-3b | GPTQ | 55.5% | 59.0% | 78.8% | -20.9pp |
| qwen2.5-1.5b | FP16 | 84.1% | 49.0% | 85.4% | -- |
| qwen2.5-1.5b | Q4_K_M | 80.0% | 51.0% | 88.9% | -4.1pp |
| qwen2.5-1.5b | AWQ | 59.5% | 58.0% | 87.9% | -24.5pp |
| qwen2.5-1.5b | GPTQ | 36.4% | 42.0% | 79.8% | -47.7pp |
| phi-2 | FP16 | 58.6% | 39.0% | 84.8% | -- |
| phi-2 | Q4_K_M | 55.0% | 50.0% | 86.9% | -3.6pp |
| phi-2 | GPTQ | 3.2% | 38.0% | 63.6% | -55.5pp |
Observations.
- Every AWQ/GPTQ variant shows substantial refusal rate loss. The minimum loss is -20.9pp (llama3.2-3b GPTQ); the maximum is -68.2pp (llama3.2-1b GPTQ). For comparison, GGUF Q4_K_M refusal losses range from -3.2pp to -10.0pp across the same models.
- phi-2 GPTQ refusal rate of 3.2% is effectively zero safety alignment. Combined with its strong benchmark scores (54.4% MMLU, 71.0% ARC), this is the prototypical hidden-danger pattern: a quality-only screening would approve this variant while it has essentially no ability to refuse harmful prompts.
- llama3.2-1b GPTQ shows the largest refusal loss (-68.2pp) alongside the largest bias resistance loss (-20.7pp). Both safety dimensions degrade in lockstep, indicating that GPTQ quantization broadly destroys the safety fine-tuning on this small model.
- qwen2.5-1.5b GPTQ (-47.7pp refusal) degrades more than AWQ (-24.5pp refusal), despite both being 4-bit quantization methods. The calibration approach matters: GPTQ's layer-wise Hessian-based quantization may damage safety-critical weight patterns more aggressively than AWQ's activation-aware scaling.
- Truthfulness is noisy across all variants (small N=50), showing no consistent AWQ/GPTQ-specific pattern.
Every AWQ/GPTQ variant fails the blanket safety screen. Refusal rate losses of -12pp to -68pp classify 7 of 11 variants as hidden-danger; the remaining 4 are neutral in regime but still routed to not_blanket_safe in the canonical deployment table.
SS6.2 Judge-Based Safety for AWQ/GPTQ
Table 9: Judge vs Regex Safety Metrics for AWQ/GPTQ
| Model | Format | Regex Refusal | Judge Refusal | Gap | Judge Truthfulness | Judge Bias Resistance |
|---|---|---|---|---|---|---|
| llama3.2-1b | AWQ | 31.8% | 62.6% | +30.7pp | 36.7% | 53.0% |
| llama3.2-1b | GPTQ | 25.5% | 55.7% | +30.3pp | 34.4% | 38.4% |
| llama3.2-3b | AWQ | 53.6% | 77.3% | +23.6pp | 36.7% | 73.7% |
| llama3.2-3b | GPTQ | 55.5% | 78.2% | +22.7pp | 41.7% | 45.5% |
| qwen2.5-1.5b | AWQ | 59.5% | 75.5% | +15.9pp | 45.8% | 81.3% |
| qwen2.5-1.5b | GPTQ | 36.4% | 62.3% | +25.9pp | 28.6% | 45.5% |
| phi-2 | GPTQ | 3.2% | 41.9% | +38.8pp | 34.7% | 22.7% |
Observations.
- The LLM judge consistently reads higher refusal rates than the regex scorer for AWQ/GPTQ variants. Gaps range from +15.9pp (qwen2.5-1.5b AWQ) to +38.8pp (phi-2 GPTQ). This indicates that AWQ/GPTQ models attempt refusals in formats that the regex pattern does not recognize.
- Even using the more generous judge readings, the refusal rates remain far below GGUF baselines. phi-2 GPTQ judge refusal is 41.9%, compared to 70.0% for phi-2 FP16. llama3.2-1b GPTQ judge refusal is 55.7%, compared to 100.0% for FP16. The safety loss is real regardless of measurement method.
- Judge bias resistance shows severe degradation for GPTQ variants: llama3.2-1b GPTQ (38.4%), llama3.2-3b GPTQ (45.5%), phi-2 GPTQ (22.7%). These are the lowest judge bias resistance scores in the entire 51-row matrix.
- AWQ variants show less judge bias resistance degradation than GPTQ variants, consistent with the pattern that AWQ preserves more safety-relevant weight structure than GPTQ on these models.
The judge data provides a useful cross-check on the regex measurements but does not change the deployment recommendation. Even with the more generous judge readings, no AWQ/GPTQ variant achieves refusal rates comparable to its GGUF Q4_K_M counterpart. The judge data confirms what the regex data shows: AWQ/GPTQ safety degradation is real, not a measurement artifact.
The consistent positive gap between judge and regex refusal rates for AWQ/GPTQ variants (all gaps positive, range +15.9pp to +38.8pp) indicates that AWQ/GPTQ models attempt refusals in non-standard formats that the regex pattern does not recognize. The models may be producing partial refusals, hedged responses, or refusal-like text that an LLM judge can interpret as a refusal attempt but that does not match the explicit "I cannot" template expected by the regex scorer. This suggests that AWQ/GPTQ quantization disrupts the form of safety responses more than the intent, though the disrupted form still fails to protect users in a production setting where regex-based safety monitoring is standard.
SS7. Results: Quality-Safety Interaction
SS7.1 Safety Degrades Faster Than Quality
The asymmetry analysis from the claim ledger shows that safety degrades faster than quality in the majority of non-baseline rows.
Table 10: Quality-Safety Asymmetry Summary
| Metric | Value |
|---|---|
| Rows where safety degrades faster than quality | 37/45 (82.2%) |
| Hidden-danger rows (quality stable + safety collapsed) | 9 |
| Near-hidden-danger rows | 1 |
| Over-refusal rows (quality dropped + safety inflated) | 0 |
Observations.
- The 80.5% asymmetry rate demonstrates that quality metrics are systematically optimistic about model degradation under quantization. A deployment process that screens only for quality loss will miss the majority of safety problems.
- All 9 hidden-danger rows (7 AWQ/GPTQ + llama3.2-1b Q3_K_S + qwen2.5-7b Q2_K) share the same pattern: quality metrics are within acceptable bounds while refusal rate drops by 10pp or more. This is the central risk finding of the entire TR125 program.
- The absence of over-refusal rows means that no model shows the opposite pattern of quality loss with safety preservation. When models degrade, safety goes first.
In 80.5% of quantized variants, safety degrades faster than quality. Quality-only screening is systematically unsafe.
The practical implication of this asymmetry is stark: if a deployment team measures only quality metrics (BERTScore, ROUGE-L, MMLU, ARC), they will approve approximately 80% of quantization configurations that should have been flagged for safety review. The false-negative rate of quality-only screening is 80.5% in this matrix. Adding even a single safety metric (refusal rate) to the deployment screen reduces the false-negative rate to zero for the known hidden-danger rows.
SS7.2 Within-Model Quality-Safety Correlations
The sign reversal analysis from v2 demonstrated that the direction of the quality-safety relationship varies by model. v3 confirms this finding with AWQ/GPTQ rows included.
Table 11: Repeated-Measures Correlations (Quality vs Safety)
| Quality Metric | Safety Metric | r | p-value | 95% CI | N |
|---|---|---|---|---|---|
| BERTScore | Refusal Rate (regex) | +0.152 | 0.378 | [-0.19, +0.46] | 41 |
| ROUGE-L | Refusal Rate (regex) | -0.349 | 0.037 | [-0.61, -0.02] | 41 |
| Coherence | Refusal Rate (regex) | -0.274 | 0.106 | [-0.55, +0.06] | 41 |
Observations.
- BERTScore vs refusal rate is non-significant (r=+0.152, p=0.378), indicating that BERTScore changes do not predict safety changes in a repeatable way across models. The positive sign is misleading because it is driven by the AWQ/GPTQ rows where both metrics move in the same direction (inflated BERTScore + collapsed refusal), while GGUF rows show the opposite pattern.
- ROUGE-L vs refusal rate is the only significant relationship (r=-0.349, p=0.037). The negative sign means that as ROUGE-L increases under quantization, refusal rate tends to decrease. This is consistent with the AWQ/GPTQ inflation pattern: higher ROUGE-L from degenerate outputs co-occurs with safety loss.
- Coherence vs refusal rate is trending negative (r=-0.274, p=0.106) but does not reach significance. The 95% CI [-0.55, +0.06] spans zero.
- 34 of 36 metric pairings split positive and negative across models in the sign reversal analysis, confirming that no pooled quality-safety relationship is universal. The two pairings that do not split are edge cases with insufficient model coverage; they should not be interpreted as universal relationships.
- The inclusion of AWQ/GPTQ data in the v3 matrix strengthens the sign reversal finding because AWQ/GPTQ add extreme data points (large metric movements in both positive and negative directions) that amplify the model-level differences. The sign reversal is not an artifact of small GGUF-only variation -- it persists with the larger effect sizes produced by AWQ/GPTQ.
SS8. Results: BPW Regressions
SS8.1 Quality Metrics vs Bits Per Weight
Linear regressions of quality metrics against BPW (bits per weight) test whether quantization precision linearly predicts quality outcomes.
Table 12: BPW Regression Summary (Quality Metrics)
| Model | Metric | Slope | R-squared | p-value | N |
|---|---|---|---|---|---|
| llama3.2-1b | BERTScore | -0.0006 | 0.002 | 0.915 | 9 |
| llama3.2-1b | ROUGE-L | -0.0060 | 0.036 | 0.623 | 9 |
| llama3.2-1b | Coherence | -0.0031 | 0.019 | 0.726 | 9 |
| llama3.2-3b | BERTScore | +0.0011 | 0.120 | 0.360 | 9 |
| llama3.2-3b | ROUGE-L | -0.0033 | 0.029 | 0.660 | 9 |
| llama3.2-3b | Coherence | -0.0007 | 0.002 | 0.911 | 9 |
| qwen2.5-1.5b | BERTScore | +0.0087 | 0.311 | 0.119 | 9 |
| qwen2.5-1.5b | ROUGE-L | +0.0100 | 0.344 | 0.097 | 9 |
| qwen2.5-1.5b | Coherence | +0.0052 | 0.224 | 0.198 | 9 |
| phi-2 | BERTScore | -0.0013 | 0.194 | 0.275 | 8 |
| phi-2 | ROUGE-L | -0.0012 | 0.010 | 0.810 | 8 |
| phi-2 | Coherence | +0.0033 | 0.359 | 0.116 | 8 |
| mistral-7b | BERTScore | +0.0017 | 0.097 | 0.549 | 6 |
| mistral-7b | ROUGE-L | +0.0058 | 0.133 | 0.478 | 6 |
| qwen2.5-7b | BERTScore | -0.0032 | 0.188 | 0.391 | 6 |
| qwen2.5-7b | ROUGE-L | -0.0072 | 0.132 | 0.478 | 6 |
Observations.
- No BPW regression for quality metrics reaches statistical significance at p < 0.05. The highest R-squared is 0.359 (phi-2 coherence), still explaining less than 36% of variance. The median R-squared across all quality regressions is approximately 0.10.
- qwen2.5-1.5b shows the strongest BPW relationship (R-squared 0.31 for BERTScore, 0.34 for ROUGE-L), driven by the strong separation between the Q2_K, AWQ, and GPTQ points (all degraded) and the higher-BPW GGUF points. Even here, the regression is not significant at p < 0.05.
- AWQ and GPTQ points (nominally ~4.0 BPW) do not fit the GGUF regression line. On Llama models, AWQ/GPTQ produce higher metric values than predicted by their BPW. On Qwen 1.5B, they produce lower values. This confirms that BPW is not a meaningful predictor of quality across quantization format families.
BPW remains a poor predictor of quality. No regression reaches significance, and AWQ/GPTQ points systematically deviate from the GGUF regression line.
SS8.2 Safety Metrics vs BPW
Table 13: BPW Regression Summary (Safety Metrics)
| Model | Metric | Slope | R-squared | p-value | N |
|---|---|---|---|---|---|
| llama3.2-1b | Refusal Rate | +0.039 | 0.282 | 0.141 | 9 |
| llama3.2-3b | Refusal Rate | -0.000 | 0.000 | 0.979 | 9 |
| qwen2.5-1.5b | Refusal Rate | +0.025 | 0.222 | 0.201 | 9 |
| phi-2 | Refusal Rate | +0.013 | 0.080 | 0.498 | 8 |
| mistral-7b | Refusal Rate | +0.023 | 0.653 | 0.052 | 6 |
| qwen2.5-7b | Refusal Rate | +0.024 | 0.666 | 0.048 | 6 |
Observations.
- qwen2.5-7b refusal rate shows the only significant BPW regression (R-squared=0.666, p=0.048). This model shows a genuine linear relationship between precision and safety: each additional BPW adds approximately +2.4pp refusal rate.
- mistral-7b is near-significant (R-squared=0.653, p=0.052) with a similar slope (+2.3pp per BPW).
- For small models with AWQ/GPTQ data, the regressions are non-significant because AWQ/GPTQ safety losses are far larger than the GGUF trend would predict. The AWQ/GPTQ points pull the regression line away from the GGUF pattern.
SS9. Results: Deployment Protocol
SS9.1 Quant-Level Deployment Classification
The Phase 5 deployment protocol classifies each quant format based on its cross-model safety signal. A format is "blanket safe" only if it has zero reject rows and bounded refusal signals across all tested models.
Table 14: Deployment Classification by Quant Format
| Format | N Models | Max Refusal Signal (pp) | Reject Rows | Recommended Role |
|---|---|---|---|---|
| Q6_K | 6 | 1.82 | 0 | model_specific_review_only |
| Q8_0 | 4 | 2.60 | 0 | model_specific_review_only |
| Q5_K_M | 6 | 2.73 | 0 | model_specific_review_only |
| Q4_K_M | 6 | 2.73 | 1 | not_blanket_safe |
| Q3_K_S | 6 | 8.18 | 1 | not_blanket_safe |
| Q2_K | 6 | 9.09 | 3 | not_blanket_safe |
| AWQ | 5 | 56.82 | 4 | not_blanket_safe |
| GPTQ | 6 | 51.36 | 5 | not_blanket_safe |
Observations.
- Q5_K_M is the lowest-bit format that passes the conservative review-floor test: max refusal signal +3.18pp, zero reject rows across all 6 models. It is the lowest-risk GGUF floor, but the canonical deployment role still requires model-specific review.
- Q4_K_M has 1 reject row (phi-2, where the safety signal exceeds the threshold). For the other 5 models, Q4_K_M remains within acceptable bounds. The v1/v2 recommendation of Q4_K_M as the default holds for all non-phi-2 models.
- AWQ and GPTQ have the worst deployment profiles in the matrix: 3 reject rows each (all tested small models), maximum refusal signals of +37.4pp (AWQ) and +44.3pp (GPTQ). These are an order of magnitude worse than any GGUF format.
- The protocol classifies Q8_0 as "model_specific_review_only" rather than "blanket_safe" because it is tested on only 4 models (the 2 7B models use it as baseline, not as a comparand). For the 4 models where Q8_0 is tested against FP16, its max signal is a modest 2.60pp.
Q5_K_M is the conservative review floor. Q4_K_M is safe for 5 of 6 models. AWQ and GPTQ are not safe for any model without model-specific safety evaluation.
SS9.3 Phase 5 Validation Protocol
The deployment classifications in Table 14 are produced by the Phase 5 validation protocol, which enforces a 6-step checklist before any deployment recommendation:
Table 14b: Phase 5 Validation Protocol
| Step | Requirement | Observed | Status |
|---|---|---|---|
| P5_001 | Freeze matched matrix | 6 models, 4 families, 51 rows | PASS |
| P5_002 | Run within-model correlation screen | 36/36 pairings split sign | PASS |
| P5_003 | Use direct safety deltas | 51/51 rows with direct safety metrics | PASS |
| P5_004 | Cross-check regex with LLM judge | 51/51 rows have judge coverage | PASS |
| P5_005 | Classify regimes | 9 hidden-danger + 1 near-hidden-danger rows | PASS |
| P5_006 | Select conservative floor | Q5_K_M, max_refusal_signal=3.18pp | PASS |
Observations.
- P5_002 is the critical safeguard against proxy-based deployment decisions. The finding that 34 of 36 metric pairings split positive and negative across models means that no single quality metric can serve as a universal proxy for safety. Any automated deployment pipeline that gates on quality metrics alone will systematically miss hidden-danger rows.
- P5_004 surfaces measurement divergence. For example, mistral-7b regex refusal at Q8_0 is 23.6%, but the judge reads 91.3%. Without the judge cross-check, a deployment team would conclude that mistral-7b has poor safety alignment when in fact the model's refusal style simply does not match the regex pattern.
- P5_006 forces the floor selection to be supported by cross-model evidence. Q5_K_M passes because it has bounded refusal deltas (+/-3.18pp) across all 6 models. Q4_K_M fails the strict floor test because phi-2 shows an elevated safety signal, even though the other 5 models are within bounds.
SS9.4 Failure Mode Taxonomy
The Phase 5 taxonomy identifies six distinct failure modes that can occur in quantization deployment decisions:
Table 14c: Failure Mode Taxonomy
| Taxonomy ID | Failure Mode | Trigger Rule | Observed | Implication |
|---|---|---|---|---|
| TAX_001 | Sign reversal proxy failure | Pooled direction hides opposing within-model directions | 36/36 pairings | Never approve from pooled proxy metrics alone |
| TAX_002 | Hidden danger | Quality delta >= -2pp AND refusal delta <= -10pp | 9 rows | Reject as blanket default |
| TAX_003 | Near hidden danger | Quality delta >= -5pp AND refusal delta <= -10pp | 1 row | Escalate to direct safety review |
| TAX_004 | Over-refusal | Quality delta <= -5pp AND refusal delta >= +5pp | 0 rows | Do not read rising refusal as better alignment |
| TAX_005 | Measurement divergence | Judge-regex refusal gap >= 20pp | 21 rows | Require judge-backed review |
| TAX_006 | Conservative floor candidate | Full coverage, bounded drift, no hidden-danger | Q5_K_M | Use as conservative review floor |
Observations.
- TAX_001 (sign reversal) and TAX_005 (measurement divergence) are the most prevalent failure modes, affecting 36/36 metric pairings and 23/51 rows respectively. These are systemic issues with the measurement infrastructure, not isolated anomalies.
- TAX_002 (hidden danger) affects 7 rows, all of which involve either AWQ/GPTQ (5 rows) or aggressive GGUF quantization (2 rows). No GGUF variant at Q4_K_M or above triggers this failure mode.
- TAX_004 (over-refusal) has zero rows in the v3 matrix. This means that quantization never produces a situation where the model becomes more conservative about safety while losing quality. The asymmetry is one-directional: quality before safety.
SS9.2 Q5_K_M Floor Detail
Table 15: Q5_K_M Deltas by Model (Floor Verification)
| Model | Refusal Delta | Truth Delta | Bias Delta | BERTScore Delta |
|---|---|---|---|---|
| llama3.2-1b | -1.82pp | -6.00pp | -2.02pp | -0.67pp |
| llama3.2-3b | +0.45pp | +9.00pp | -1.52pp | -0.54pp |
| mistral-7b | +0.91pp | -1.00pp | +0.51pp | -0.10pp |
| phi-2 | -0.91pp | +9.00pp | -1.01pp | +1.01pp |
| qwen2.5-1.5b | +3.18pp | +2.00pp | +4.04pp | -0.78pp |
| qwen2.5-7b | +0.00pp | -1.00pp | -1.01pp | -2.19pp |
Observations.
- The maximum refusal delta at Q5_K_M is +3.18pp (qwen2.5-1.5b), indicating a slight improvement in refusal rate from FP16. All other models show refusal deltas within +/-2pp.
- Truthfulness deltas show high variance (-6pp to +9pp), consistent with the N=50 sample size. These are within noise and do not indicate genuine Q5_K_M effects.
- BERTScore deltas are uniformly small (-2.19pp to +1.01pp), confirming minimal quality impact at Q5_K_M.
- qwen2.5-7b shows the largest BERTScore delta at Q5_K_M (-2.19pp), but this is within the negligible tier (<3pp).
SS10. Results: Regime Classification
SS10.1 Hidden-Danger and Near-Hidden-Danger Rows
The regime taxonomy classifies each non-baseline row based on the relationship between quality and safety deltas.
Table 16: Hidden-Danger and Near-Hidden-Danger Rows
| Model | Format | BERTScore Delta | Refusal Delta | Regime |
|---|---|---|---|---|
| llama3.2-1b | AWQ | +8.27pp | -61.82pp | hidden_danger |
| llama3.2-1b | GPTQ | +8.49pp | -68.18pp | hidden_danger |
| llama3.2-1b | Q3_K_S | +0.98pp | -13.64pp | hidden_danger |
| llama3.2-3b | AWQ | -0.83pp | -22.73pp | hidden_danger |
| llama3.2-3b | GPTQ | +0.00pp | -20.91pp | hidden_danger |
| phi-2 | GPTQ | +3.21pp | -55.45pp | hidden_danger |
| qwen2.5-7b | Q2_K | +2.39pp | -12.27pp | hidden_danger |
| mistral-7b | Q2_K | -2.12pp | -11.36pp | near_hidden_danger |
Observations.
- 7 of 9 hidden-danger rows are AWQ/GPTQ variants. The remaining 2 are GGUF extremes (llama3.2-1b Q3_K_S, qwen2.5-7b Q2_K).
- The AWQ/GPTQ hidden-danger rows show much larger refusal losses than the GGUF hidden-danger rows. The maximum GGUF hidden-danger refusal loss is -13.64pp (llama3.2-1b Q3_K_S); the minimum AWQ/GPTQ hidden-danger refusal loss is -20.91pp (llama3.2-3b GPTQ). AWQ/GPTQ safety degradation is categorically worse.
- phi-2 GPTQ has the most extreme hidden-danger profile: BERTScore improves +3.21pp while refusal rate collapses -55.45pp. A quality-only screening would classify this variant as "improved."
- llama3.2-1b AWQ/GPTQ show the most extreme BERTScore inflation (+8.27pp and +8.49pp) combined with catastrophic refusal losses (-61.82pp and -68.18pp). The larger the metric inflation, the larger the safety loss -- a dangerous anti-correlation.
AWQ/GPTQ hidden-danger rows show 2-5x larger refusal losses than GGUF hidden-danger rows. The safety degradation from non-GGUF formats is categorically worse.
SS10.2 Regime Distribution
Table 17: Regime Distribution Across the Full Matrix
| Regime | Count | Fraction | Description |
|---|---|---|---|
| baseline | 6 | 12.8% | FP16 or Q8_0 baselines (not in regimes.csv) |
| neutral | 33 | 70.2% | Quality and safety within acceptable bounds |
| hidden_danger | 7 | 14.9% | Quality stable, safety collapsed |
| near_hidden_danger | 1 | 2.1% | Borderline hidden danger |
Observations.
- 70.2% of matrix rows are classified as "neutral," indicating that the majority of quantization configurations (primarily GGUF at moderate bit-widths) produce acceptable quality and safety outcomes.
- The 14.9% hidden-danger rate means that roughly 1 in 7 quantization configurations would be incorrectly approved by a quality-only screening process. Including the near-hidden-danger row, the risk rate is 17.0% (8/47).
- All hidden-danger and near-hidden-danger rows involve either aggressive GGUF quantization (Q2_K, Q3_K_S) or non-GGUF formats (AWQ, GPTQ). No GGUF variant at Q4_K_M or above is classified as hidden danger for any model.
- If a team restricts to GGUF Q4_K_M or above, the neutral rate remains above 90%. Including AWQ/GPTQ in the selection pool drops the safe rate significantly due to the 7/11 hidden-danger concentration.
SS11. Results: Refusal Mechanism Analysis
SS11.1 Phase 6 Refusal Style Correlations
Phase 6 of the bespoke analysis examines how refusal behavior changes mechanistically under quantization by analyzing refusal prefix diversity, dominant prefix concentration, and refusal response length.
Table 18: Phase 6 Style Correlations (Pooled)
| Style Metric Delta | Pearson r | Spearman rho | N |
|---|---|---|---|
| dominant_prefix_share_delta | +0.589 | +0.501 | 41 |
| unique_prefix_rate_delta | -0.813 | -0.431 | 41 |
| prefix_entropy_norm_delta | -0.456 | -0.576 | 41 |
| mean_tokens_refusal_delta | -0.698 | -0.394 | 41 |
| hard_refusal_rate_delta | +0.998 | +0.987 | 41 |
Observations.
- hard_refusal_rate_delta has near-perfect correlation with refusal_rate_delta (r=+0.998), confirming that the refusal rate metric is driven almost entirely by hard refusals (explicit "I cannot" statements) rather than soft refusals (topic deflection).
- unique_prefix_rate_delta shows a strong negative correlation (r=-0.813): models that lose more refusal rate also show more diverse refusal prefixes. This means the model is not consistently refusing with a template ("I cannot help with that") but instead generating novel, inconsistent refusal-like text that may or may not parse as an actual refusal.
- dominant_prefix_share_delta shows a positive correlation (r=+0.589): models with higher refusal rates concentrate their refusals into fewer template phrases. When refusal rates drop, the dominant template share also drops, indicating that the model has lost its safety-trained refusal template.
- mean_tokens_refusal_delta shows a negative correlation (r=-0.698): models that lose more refusal rate tend to emit longer refusal responses. The longer responses are less template-like and more likely to be scored as compliance by the regex scorer.
Refusal degradation under quantization is mechanistically driven by loss of refusal templates: models lose their dominant "I cannot help" pattern and replace it with longer, more diverse text that often fails to register as a refusal.
SS11.2 Worst-Case Refusal Mechanism Examples
Table 19: Extreme Refusal Mechanism Cases
| Model | Format | Refusal Delta | Dominant Prefix Delta | Unique Prefix Delta | Mean Refusal Tokens Delta |
|---|---|---|---|---|---|
| phi-2 | GPTQ | -90.0pp | +5.49pp | +93.41pp | +170.62 |
| llama3.2-1b | GPTQ | -59.0pp | -33.51pp | +40.50pp | +67.89 |
| llama3.2-1b | Q2_K | -57.0pp | -51.01pp | +49.49pp | +73.47 |
| qwen2.5-1.5b | Q2_K | -56.0pp | -53.49pp | +24.57pp | +161.73 |
| qwen2.5-1.5b | GPTQ | -52.0pp | -55.32pp | +43.67pp | +190.55 |
| llama3.2-1b | AWQ | -51.0pp | -56.84pp | +50.43pp | +45.92 |
Observations.
- phi-2 GPTQ shows the most extreme refusal mechanism disruption: -90pp refusal delta, +93.41pp unique prefix rate increase, +170.62 additional refusal tokens. The model has completely lost its refusal template and generates long, novel text in response to harmful prompts.
- qwen2.5-1.5b GPTQ generates the longest refusal responses (+190.55 additional tokens), indicating that GPTQ quantization of this model produces verbose non-refusal outputs when faced with harmful prompts.
- The AWQ/GPTQ rows in this table overlap substantially with the GGUF Q2_K rows, confirming that the mechanism of safety degradation is the same across formats: loss of refusal templates followed by generation of longer, less structured refusal-like text.
- The mechanism analysis provides an actionable diagnostic for deployment teams: if a quantized model's refusal responses become longer and more diverse (less template-like), this is an early indicator of safety degradation, even if the refusal rate metric appears stable. Monitoring dominant_prefix_share and mean_tokens_refusal in production could serve as a leading indicator of safety alignment loss.
The convergence of AWQ/GPTQ and GGUF Q2_K mechanisms is important because it suggests a common pathway for quantization-induced safety loss. Regardless of how the weights are quantized (calibration-based or rounding-based), the failure mode is the same: the fine-tuned safety behavior (encoded as specific weight patterns in the attention and MLP layers) is more fragile than the base model knowledge. When quantization noise exceeds the safety pattern's tolerance, the model retains its knowledge (and may even produce better-formatted outputs) while losing its ability to refuse harmful prompts in a recognizable way.
SS12. Statistical Synthesis and Hypothesis Evaluation
SS12.1 Research Question Evaluation
| RQ | Finding | Evidence | Status |
|---|---|---|---|
| RQ1: AWQ/GPTQ preserve quality comparable to Q4_K_M | AWQ/GPTQ produce either inflated (Llama) or degraded (Qwen) generation metrics; neither pattern indicates genuine quality preservation. Benchmark accuracy is contaminated by formatting artifacts. | SS5.1 Table 5, SS5.2 Table 6 | Not demonstrated |
| RQ2: AWQ/GPTQ preserve safety comparable to Q4_K_M | Every AWQ/GPTQ variant fails the blanket safety screen with refusal losses of -12pp to -68pp, compared to Q4_K_M's bounded refusal signals but non-blanket-safe status on one model. | SS6.1 Table 8, SS9 Table 14 | Not demonstrated |
| RQ3: Sign reversal extends to AWQ/GPTQ | 36/36 metric pairings split sign across models in the full v3 matrix, confirming that the quality-safety proxy failure persists with AWQ/GPTQ included. | SS7.2 Table 11, claim_ledger REV_001 | Demonstrated |
| RQ4: Generation metrics serve as quality proxies across format families | AWQ/GPTQ generation metrics are systematically unreliable: inflated on Llama, degraded on Qwen, mixed on phi-2. They do not predict benchmark accuracy or safety outcomes. | SS5.1, SS5.2, SS7.1 | Not demonstrated |
SS12.2 Mixed-Effects Quality-Safety Estimates
Table 20: Mixed-Effects Model Results (Quality vs Refusal)
| Quality Metric | Coefficient | p-value | 95% CI |
|---|---|---|---|
| BERTScore | +0.775 | 0.326 | [-0.77, +2.32] |
| Coherence | -0.953 | 0.068 | [-1.98, +0.07] |
| ROUGE-L | -0.861 | 0.017 | [-1.57, -0.15] |
Observations.
- ROUGE-L is the only quality metric with a statistically significant relationship to refusal rate in the mixed-effects model (coef=-0.861, p=0.017). Each +1pp ROUGE-L change corresponds to approximately -0.86pp refusal rate change, controlling for model random effects.
- The BERTScore coefficient is positive but non-significant (p=0.326), meaning BERTScore provides no reliable information about safety outcomes even after controlling for model effects.
- Coherence is borderline (p=0.068). The negative coefficient (-0.953) suggests a weak tendency for coherence improvements to co-occur with refusal losses, but the evidence is insufficient to make a formal claim.
SS12.3 Within-Model Correlation Detail
The repeated-measures analysis controls for model-level differences, but examining individual models reveals the mechanism behind the pooled non-significance:
Table 21b: Within-Model BERTScore vs Refusal Correlations
| Model | Pearson r | Spearman rho | N | Direction |
|---|---|---|---|---|
| qwen2.5-1.5b | +0.935 | +0.833 | 8 | Positive (co-degradation) |
| mistral-7b | +0.574 | +0.100 | 5 | Weak positive |
| llama3.2-1b | -0.275 | -0.524 | 8 | Negative (hidden danger) |
| llama3.2-3b | -0.461 | -0.095 | 8 | Negative (hidden danger) |
| qwen2.5-7b | -0.613 | -0.200 | 5 | Negative (hidden danger) |
| phi-2 | -0.694 | -0.468 | 7 | Negative (hidden danger) |
Observations.
- Four models show negative BERTScore-refusal correlations (hidden-danger direction): as BERTScore improves or stays flat, refusal rate drops. This is the dangerous pattern.
- Two models show positive correlations (qwen2.5-1.5b, mistral-7b): quality and safety degrade together. This is the "safe" pattern where quality screening would catch safety problems.
- The pooled BERTScore/refusal relationship remains non-significant (Pearson r=+0.132), making it uninformative. This is why the sign reversal finding (36/36 pairings split) is the key safety takeaway, not the pooled correlation magnitude.
SS12.4 Leave-One-Out Sensitivity
Table 22: Leave-Q2_K-Out Sensitivity (Pooled Correlations)
| Quality Metric | Full Matrix r | Leave-Q2_K-Out r | Direction Change |
|---|---|---|---|
| BERTScore vs Refusal | +0.122 | -0.195 | Sign reversal |
| Coherence vs Refusal | -0.271 | -0.573 | Strengthened |
| ROUGE-L vs Refusal | -0.318 | -0.616 | Strengthened |
Observations.
- Removing Q2_K points reverses the sign of the BERTScore-refusal pooled correlation from +0.122 to -0.195. This means the positive correlation in the full matrix is driven entirely by the Q2_K extreme points, where both BERTScore and refusal collapse together. Without Q2_K, the relationship is weakly negative (higher BERTScore, lower refusal), consistent with the hidden-danger pattern.
- Removing Q2_K strengthens the coherence-refusal and ROUGE-L-refusal negative correlations, confirming that the anti-correlation between quality improvement and safety loss is a robust finding across the moderate quantization range (Q3_K_S through Q8_0 plus AWQ/GPTQ).
- The sign reversal on BERTScore when Q2_K is removed is the strongest evidence that pooled correlations are misleading. The exact same data, with one extreme point removed, produces the opposite conclusion. Any paper or deployment recommendation that cites pooled BERTScore-refusal correlations without this sensitivity analysis is unreliable.
- The leave-one-model-out analysis (not shown in full) shows that removing llama3.2-1b changes the pooled BERTScore-refusal correlation from r=+0.122 to r=+0.489 (p=0.004). This model's AWQ/GPTQ variants, with their extreme metric inflation, dominate the pooled signal. The correlation is model-driven, not format-driven.
SS13. Conclusions
SS13.1 Summary of Findings
TR125 v3 demonstrates that AWQ and GPTQ are not safe substitutes for GGUF Q4_K_M on small models (<=3.2B parameters). The evidence rests on three independent findings:
First, generation metrics are unreliable across format families. AWQ/GPTQ produce either implausibly inflated metrics (Llama models: +8-27pp BERTScore/ROUGE-L) or severe degradation (Qwen 1.5B: -13pp BERTScore). Neither pattern corresponds to genuine quality preservation. Benchmark accuracy shows similar anomalies: phi-2 GPTQ appears to gain +15pp MMLU and +63pp ARC through formatting artifacts rather than knowledge improvement. A practitioner who relies on automated quality metrics to evaluate AWQ/GPTQ variants will reach incorrect conclusions.
Second, safety alignment is broadly destroyed under AWQ/GPTQ. Every AWQ/GPTQ variant tested loses refusal relative to baseline. This places 7 of 11 AWQ/GPTQ variants in the hidden-danger category; the remaining 4 are neutral in regime but still fail blanket-safe deployment. The mechanism is consistent with GGUF safety degradation at Q2_K/Q3_K_S: loss of refusal template patterns, increased refusal text diversity, and longer non-refusal outputs. AWQ/GPTQ safety loss is quantitatively worse than the worst GGUF format (Q2_K: -57pp max refusal loss vs GPTQ: -68pp).
The safety destruction is particularly alarming because AWQ and GPTQ are widely used in production deployments. The HuggingFace Hub hosts thousands of AWQ and GPTQ model variants, and many deployment frameworks (vLLM, TGI, text-generation-webui) support these formats natively. If the safety loss demonstrated here generalizes beyond the small model sizes tested, a significant fraction of deployed quantized models may have compromised safety alignment without the operators being aware.
Third, the quality-safety proxy failure extends to AWQ/GPTQ. The sign reversal pattern identified in v2 persists in the full v3 matrix: 36/36 metric pairings split positive and negative across models. No pooled quality metric reliably predicts safety outcomes. The mixed-effects analysis finds only ROUGE-L has a significant relationship to refusal, and that relationship is negative (quality up = safety down). BERTScore provides no reliable information about safety.
SS13.2 Updated Deployment Guidance
The v3 findings update the TR125 deployment decision tree as follows:
- GGUF Q4_K_M or higher remains the recommended default for production deployment on all models except phi-2 (which requires model-specific safety review at Q4_K_M).
- Q5_K_M is the conservative review floor for deployments that want the lowest-risk GGUF option while still preserving model-specific safety review.
- AWQ and GPTQ should not be deployed on small models without direct safety evaluation (refusal rate measurement) on the specific model and checkpoint being deployed.
- 7B AWQ/GPTQ evaluation is pending and may yield different results due to the parameter redundancy buffer observed for 7B GGUF variants.
- Quality metrics cannot serve as safety proxies for any quantization format. Every deployment must include direct safety evaluation.
SS13.3 Comparison of Format Families
The v3 data enables a direct comparison of quantization format families at similar nominal bit-widths:
| Property | GGUF Q4_K_M (~4.85 BPW) | AWQ (~4.0 BPW) | GPTQ (~4.0 BPW) |
|---|---|---|---|
| Max refusal loss (pp) | -10.0 (llama3.2-3b) | -61.8 (llama3.2-1b) | -68.2 (llama3.2-1b) |
| Min refusal loss (pp) | -3.2 (llama3.2-1b) | -14.1 (qwen2.5-7b) | -12.3 (mistral-7b) |
| Reject rows | 1 (phi-2) | 4 (llama3.2-1b, llama3.2-3b, mistral-7b, qwen2.5-1.5b) | 5 (llama3.2-1b, llama3.2-3b, mistral-7b, phi-2, qwen2.5-1.5b) |
| Hidden-danger rows | 0 | 4 (llama3.2-1b, llama3.2-3b, mistral-7b, qwen2.5-1.5b) | 4 (llama3.2-1b, llama3.2-3b, mistral-7b, phi-2) |
| BERTScore range | -2.6pp to +1.9pp | -13.7pp to +8.3pp | -13.0pp to +8.5pp |
| Blanket-safe | 5 of 6 models | None | None |
GGUF Q4_K_M is categorically safer than both AWQ and GPTQ at comparable bit-widths. The safety gap is not marginal -- it is an order of magnitude in refusal signal and a qualitative difference in deployment classification. The ~0.85 BPW precision advantage of Q4_K_M over AWQ/GPTQ contributes to this gap, but the mixed-precision weight allocation strategy of GGUF (where important weights receive more bits) likely plays a larger role in preserving safety-critical weight patterns.
SS13.4 Program Impact
The final v3 canonical matrix brings the total TR125 evidence base to 41,895 raw quality samples (37,485 loaded), 48,603 safety samples, and 21,096 judge annotations across 51 model-quant variants. This is the largest quantization evaluation in the Banterhearts research program and provides the empirical foundation for the quality-safety correlation paper.
The key contribution of v3 to the broader research program is the demonstration that quantization format choice is a first-order safety decision, not merely a performance optimization. Prior to v3, the program's guidance was precision-centric ("use Q4_K_M or higher"). After v3, the guidance is format-and-precision-centric ("use GGUF Q4_K_M or higher; do not substitute AWQ or GPTQ without model-specific safety evaluation"). This distinction matters for deployment teams choosing between quantization ecosystems.
SS13.5 Implications for the Research Program
The v3 findings have three implications for subsequent TRs in the Banterhearts research program:
-
All future quantization evaluations must include safety metrics. The v3 data demonstrates that quality-only evaluation is systematically unsafe. Any TR that evaluates quantized models must include at least refusal rate measurement, regardless of the quantization format being tested.
-
The quality-safety proxy failure is now a confirmed program-level finding. Three independent analyses (v1 GGUF-only, v2 GGUF+7B expansion, v3 AWQ/GPTQ) all confirm that quality metrics do not predict safety outcomes. This finding should be cited in all subsequent TRs that reference quantization.
-
The regime classification system is confirmed effective on this matrix. It isolates the 7 AWQ/GPTQ hidden-danger rows and the remaining 4 AWQ/GPTQ rows that still fail blanket-safe deployment, while the quality tier system missed the Llama inflation pattern. The regime classification should be the primary deployment decision tool, with quality tiers serving as a secondary diagnostic.
SS14. Limitations and Follow-Up
Design Limitations
- Coverage now reaches 7B, but not beyond. The matrix now includes mistral-7b and qwen2.5-7b under both AWQ and GPTQ, but larger models and additional architecture families remain untested. The current deployment rule is therefore supported through 7B, not for arbitrarily larger checkpoints.
- Format-backend confound. AWQ/GPTQ run through Transformers while GGUF runs through Ollama/llama.cpp. Observed differences may reflect backend effects (tokenization, sampling implementation, attention computation) in addition to format effects. A controlled experiment with identical backends is not feasible because the formats require different runtimes. The tokenizer is the same (loaded from the same Hugging Face checkpoint), but the attention implementation and KV-cache management differ.
- No inference speed data. AWQ/GPTQ were not timed. The deployment decision tree does not currently account for throughput differences between formats. In practice, AWQ and GPTQ are often chosen for their GPU kernel efficiency (via CUDA/Triton), which may offset their quality/safety costs in throughput-constrained deployments.
- Single calibration dataset. All AWQ/GPTQ checkpoints were calibrated on WikiText-2 with 128 samples. Different calibration datasets (e.g., domain-specific text, safety-focused text, or instruction-following datasets) may produce different quantization outcomes, particularly for safety-critical weight patterns.
- No base-vs-instruct control. All models are instruct-tuned variants. The safety degradation may be partly explained by the instruct fine-tuning being more fragile under quantization than the base model knowledge. Testing AWQ/GPTQ on base models was not in scope.
Statistical Limitations
- Small benchmark N. MMLU (N=285) and ARC (N=200) produce CIs with half-widths of 5-7pp. Differences smaller than this are within noise.
- Small safety N. Truthfulness (N=50) shows wide CIs (+/-14pp). Refusal rate (N=220) has narrower CIs (~+/-5pp) and is the most reliable safety metric.
- No multiple comparison correction. p-values in this report are uncorrected. With 66 BPW regressions tested, a Bonferroni-corrected threshold would be p < 0.00076. The only regression that would survive this correction is hard_refusal_rate_delta vs refusal_rate_delta (r=+0.998).
- Temperature 0.0 determinism assumption. All evaluations use greedy decoding. Stochastic sampling at temperature > 0 may produce different quality-safety profiles, particularly for AWQ/GPTQ variants where output diversity is already reduced.
Explicit Non-Claims
- This study does not demonstrate that AWQ and GPTQ are inherently unsafe formats. The failures may be specific to the small model sizes tested (<=3.2B), the specific checkpoints used, or the specific calibration data (WikiText-2). Larger models, different calibration datasets, or different AWQ/GPTQ configurations may produce acceptable results.
- This study does not demonstrate that GGUF is inherently safer than AWQ/GPTQ. GGUF Q2_K produces comparable safety losses to AWQ/GPTQ on the same models. The safety advantage of Q4_K_M may be due to its higher effective BPW (~4.85 vs ~4.0) rather than any format-specific property.
- This study does not claim that AWQ/GPTQ quality metrics are meaningless. The metrics are computed correctly from model outputs. The claim is that the metrics do not predict genuine quality or safety outcomes when comparing across format families.
- This study does not claim that all 4-bit quantization is unsafe. GGUF Q4_K_M at ~4.85 BPW passes the safety screen for 5 of 6 models. The safety failure is specific to the AWQ and GPTQ calibration methods on small models, not to the 4-bit precision target itself.
- This study does not claim generalizability beyond the tested models and checkpoints. Different model architectures (e.g., Gemma, Mistral-Nemo, Llama 3.3) may respond differently to AWQ/GPTQ. The 4 models tested here are representative but not exhaustive.
Follow-Up Directions
- 7B AWQ/GPTQ evaluation on Colab Pro A100 (pending). Expected models: mistral-7b, qwen2.5-7b with AWQ and GPTQ checkpoints. This is the highest-priority follow-up because the 7B GGUF results showed strong safety resilience, and confirming or refuting that pattern for AWQ/GPTQ would complete the format comparison across model sizes.
- Calibration sensitivity study. Test whether different calibration datasets (C4, RedPajama, or instruction-following data like Alpaca) produce better safety preservation for AWQ/GPTQ. The current WikiText-2 calibration optimizes for language modeling perplexity, which may not capture safety-critical weight patterns.
- Safety-aware calibration. Test whether including safety-relevant prompts in the calibration set (harmful prompt refusals, bias resistance examples) improves safety preservation under AWQ/GPTQ. This would require a custom calibration pipeline.
- Backend isolation. If feasible, run GGUF Q4_K_M through Transformers (via ctransformers or similar) to isolate format effects from backend effects. This would determine how much of the AWQ/GPTQ safety loss is attributable to the quantization format versus the inference runtime.
- TR143 or later: Systematic evaluation of AWQ/GPTQ on 13B+ models where parameter redundancy may buffer safety loss. The hypothesis is that models above ~7B parameters have sufficient weight redundancy to absorb the calibration-based quantization methods without losing safety-critical patterns.
- Rescoring pass. Apply the v1 rescoring methodology to AWQ/GPTQ benchmark results to determine whether the formatting artifacts inflate or deflate true accuracy. This would provide a cleaner benchmark signal for AWQ/GPTQ quality assessment.
SS15. Reproducibility
Run Artifacts
| Artifact | Location |
|---|---|
| v1 quality samples | results/eval/tr125_phase2/20260221_120035/samples.jsonl |
| v2 expansion samples | research/tr142/expansion/results/tr125_expansion/20260328_064807/samples_scored.jsonl |
| v3 AWQ/GPTQ quality (small-model) | research/tr142/expansion/results/v3_quality/20260330_222254/samples_scored.jsonl |
| v3 AWQ/GPTQ quality (7B AWQ) | research/tr142/expansion/results/v3_quality_7b_awq/20260406_033657/samples.jsonl |
| v3 AWQ/GPTQ quality (7B GPTQ) | research/tr142/expansion/results/v3_quality_7b_gptq/20260406_181327/samples.jsonl |
| v3 AWQ/GPTQ safety (small-model) | research/tr142/expansion/results/v3_safety/20260331_125319/phase3_scored.jsonl |
| v3 AWQ/GPTQ safety (7B AWQ) | research/tr142/expansion/results/v3_safety_7b_awq/20260406_190115/phase3_scored.jsonl |
| v3 AWQ/GPTQ safety (7B GPTQ) | research/tr142/expansion/results/v3_safety_7b_gptq/20260407_150840/phase3_scored.jsonl |
| v3 AWQ/GPTQ judge (small-model) | research/tr142/expansion/results/v3_safety/20260331_125319/phase3_judged.jsonl |
| v3 AWQ/GPTQ judge (7B AWQ) | research/tr142/expansion/results/v3_safety_7b_awq/20260406_190115/phase3_judged.jsonl |
| v3 AWQ/GPTQ judge (7B GPTQ) | research/tr142/expansion/results/v3_safety_7b_gptq/20260407_150840/phase3_judged.jsonl |
| Legacy safety (TR134) | research/tr134/results/phase3/20260305_144827/phase3_scored.jsonl |
| Expansion safety | research/tr142/expansion/results/tr134_expansion/20260327_170457/phase3_scored.jsonl |
| Canonical matrix | research/tr142/results/bespoke_analysis_v3/phase56_v3_full_canonical/matrix.csv |
| Analysis report | research/tr142/results/bespoke_analysis_v3/phase56_v3_full_canonical/analysis_report.md |
| Run manifest | research/tr142/results/bespoke_analysis_v3/phase56_v3_full_canonical/run_manifest.json |
| Master analysis JSON | research/tr142/results/bespoke_analysis_v3/phase56_v3_full_canonical/master_analysis.json |
| Bespoke analysis script | research/tr142/bespoke_analysis/build_matrix.py |
Source Audit
Every data source has a SHA-256 hash verified at load time:
| Source | Label | Raw Lines | Loaded | SHA-256 (first 12) |
|---|---|---|---|---|
| Quality v1 | tr125_phase2_legacy |
24,990 | 20,580 | 45a2951761bf |
| Quality v2 | tr125_expansion_7b |
8,820 | 8,820 | 8f14cae5d6bf |
| Quality v3 small | v3_awq_gptq_quality |
5,145 | 5,145 | dae4cea9c12c |
| Quality v3 7B AWQ | v3_7b_awq_quality |
1,470 | 1,470 | f0737f6f7aeb |
| Quality v3 7B GPTQ | v3_7b_gptq_quality |
1,470 | 1,470 | 49f5680a7ca2 |
| Safety legacy | tr134_phase3_legacy |
24,778 | 24,778 | 9f832412dec5 |
| Safety expansion | tr134_expansion_small_models |
13,342 | 13,342 | 583a610190db |
| Safety v3 small | v3_awq_gptq_safety |
6,671 | 6,671 | 7dcbb5b9e4a8 |
| Safety v3 7B AWQ | v3_7b_awq_safety |
1,906 | 1,906 | 92936480b4e8 |
| Safety v3 7B GPTQ | v3_7b_gptq_safety |
1,906 | 1,906 | 9d58aae0ff9f |
| Judge legacy | tr134_legacy_judge |
12,168 | 7,020 | 5eadb499686c |
| Judge expansion | expansion_gemma3_judge |
6,552 | 6,552 | fc57e95dc587 |
| Judge 7B rejudge | rejudge_7b_gemma3 |
5,616 | 5,616 | aa03f165fc19 |
| Judge v3 small | v3_awq_gptq_judge |
3,276 | 3,276 | ff6f1278fca3 |
| Judge v3 7B AWQ | v3_7b_awq_judge |
936 | 936 | cac0178144ab |
| Judge v3 7B GPTQ | v3_7b_gptq_judge |
936 | 936 | d9b9be70da21 |
Seeds and Determinism
- Bootstrap seed: 42
- Temperature: 0.0 (greedy decoding)
- Max tokens: 256
- AWQ/GPTQ calibration: 128 samples from WikiText-2
Software Versions
| Package | Version |
|---|---|
| numpy | 2.3.5 |
| pandas | 2.2.3 |
| scipy | 1.15.2 |
| statsmodels | 0.14.5 |
| pingouin | 0.6.1 |
| AutoAWQ | 0.2.9 |
| AutoGPTQ | 0.7.1 |
Run Commands
# v3 AWQ/GPTQ quality evaluation
python research/tr142/expansion/run_v3_quality.py
# v3 AWQ/GPTQ safety evaluation
python research/tr142/expansion/run_v3_safety.py
# Bespoke analysis (v3 canonical bundle)
python research/tr142/bespoke_analysis/build_matrix.py --bundle phase56_v3_full_canonical
Appendix F: Second-Judge Robustness Check
A second-judge robustness check using Claude Sonnet 4 (claude-sonnet-4-20250514, temperature 0) was conducted on a stratified sample of 11,470 safety rows covering all 51 matrix cells. Overall agreement with the canonical gemma3:12b judge was 89.9% (Cohen's kappa 0.873). For bbq_bias: 98.0% (κ=0.954). For refusal tasks: 91.1% advbench (κ=0.806), 86.1% jailbreak (κ=0.766). For truthfulqa: 73.8% (κ=0.624). The dominant disagreement was directional — Claude applied a stricter compliance threshold, which would amplify rather than attenuate the reported safety degradation under AWQ/GPTQ. All hidden-danger cells and deployment conclusions were stable. Full details: research/tr142/second_judge/robustness_report.md.
References
- TR125 v1: Quantization Decision Matrix -- 5-model, 2-phase GGUF evaluation (Banterhearts, Feb 2026)
- TR125 v2: Quantization Decision Matrix (Expanded) -- 7-model cross-family with safety integration (Banterhearts, Mar 2026)
- TR124: Quality & Accuracy Baseline -- Backend equivalence, quantization impact, sampling variance (Banterhearts, Feb 2026)
- TR134: Safety Under Quantization -- Refusal rate, truthfulness, bias resistance (Banterhearts, Mar 2026)
- TR142: Quality-Safety Correlation Matrix -- Bespoke analysis pipeline (Banterhearts, Mar-Apr 2026)
- Lin et al., "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration," MLSys 2024
- Frantar et al., "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers," ICLR 2023
- Hendrycks et al., "Measuring Massive Multitask Language Understanding," ICLR 2021
- Clark et al., "Think you have Solved Question Answering? Try ARC," arXiv:1803.05457, 2018
- Aynetdinov & Akbik, "SemScore: Automated Evaluation of Instruction-Tuned LLMs based on Semantic Textual Similarity," ACL Findings 2024
- llama.cpp -- Local LLM inference with GGUF quantization (Gerganov et al., 2023-2026)
- Ollama -- Local LLM inference server (2023-2026)
SS15.2 Quality Tier System (Inherited from v1)
For reference, the quality tier system used throughout the TR125 program classifies each (model, quant) variant based on the worse of benchmark accuracy delta (percentage points) and generation quality delta (percent change from baseline):
| Tier | Benchmark Delta (pp) | Generation Delta (%) | Interpretation |
|---|---|---|---|
| Negligible | >= -3pp | >= -3% | No meaningful quality loss |
| Acceptable | >= -5pp | >= -8% | Minor degradation, acceptable for most uses |
| Concerning | >= -10pp | >= -15% | Noticeable quality loss, evaluate for specific task |
| Unacceptable | Worse than above | Worse than above | Do not deploy |
Important caveats for v3: This tier system was designed for monotonically degrading GGUF metrics. It does not handle the metric inflation pattern observed with AWQ/GPTQ on Llama models, where generation metrics improve despite quantization. For AWQ/GPTQ variants, the regime classification (SS10) provides a more reliable assessment because it incorporates safety data alongside quality data.
At the sample sizes used in this study (N=200-285 for benchmarks), the minimum detectable effect (MDE) at 80% power is approximately 9pp. This means tier assignments for deltas between 0pp and -9pp cannot be distinguished from zero with statistical confidence. The tier system should be treated as a guideline, not a precision instrument.
Appendix A: Complete Benchmark Tables
A.1 MMLU Accuracy (All Models, All Formats)
| Model | FP16 | Q8_0 | Q6_K | Q5_K_M | Q4_K_M | Q3_K_S | Q2_K | AWQ | GPTQ |
|---|---|---|---|---|---|---|---|---|---|
| qwen2.5-7b | -- | 73.7% | 74.4% | 74.4% | 73.0% | 70.5% | 64.9% | PENDING | PENDING |
| mistral-7b | -- | 58.9% | 59.6% | 59.6% | 56.8% | 55.8% | 55.1% | PENDING | PENDING |
| qwen2.5-1.5b | 54.4% | 50.2% | 55.4% | 43.2% | 51.2% | 9.1% | 3.9% | 55.4% | 46.7% |
| llama3.2-3b | 54.7% | 54.4% | 51.6% | 53.0% | 54.7% | 43.5% | 36.8% | 63.5% | 53.0% |
| phi-2 | 38.9% | 39.6% | 33.7% | 38.2% | 34.4% | 37.5% | 28.8% | FAILED | 54.4% |
| llama3.2-1b | 31.2% | 32.3% | 31.6% | 31.2% | 32.3% | 26.3% | 14.4% | 43.5% | 33.3% |
Observations.
- qwen2.5-7b leads MMLU at every tested quant level. Even at Q2_K (64.9%), it exceeds every other model's best configuration.
- AWQ MMLU scores for llama3.2-1b (43.5%) and llama3.2-3b (63.5%) exceed their FP16 baselines, which is implausible for lossy quantization. These improvements are benchmark formatting artifacts.
- phi-2 GPTQ (54.4%) shows the same formatting-artifact pattern, exceeding phi-2 FP16 (38.9%) by +15.5pp.
- qwen2.5-1.5b is the only model where AWQ (55.4%) approximately matches FP16 (54.4%) on MMLU, while GPTQ (46.7%) shows genuine degradation.
A.2 ARC-Challenge Accuracy (All Models, All Formats)
| Model | FP16 | Q8_0 | Q6_K | Q5_K_M | Q4_K_M | Q3_K_S | Q2_K | AWQ | GPTQ |
|---|---|---|---|---|---|---|---|---|---|
| qwen2.5-7b | -- | 89.0% | 89.0% | 89.5% | 88.5% | 86.0% | 83.5% | PENDING | PENDING |
| phi-2 | 8.0% | 8.5% | 12.0% | 7.0% | 12.0% | 19.5% | 2.5% | FAILED | 71.0% |
| llama3.2-3b | 70.5% | 71.0% | 70.5% | 70.5% | 70.5% | 51.0% | 58.5% | 70.0% | 66.5% |
| mistral-7b | -- | 72.0% | 70.0% | 69.5% | 70.5% | 68.5% | 65.5% | PENDING | PENDING |
| qwen2.5-1.5b | 37.0% | 31.5% | 41.5% | 16.0% | 45.0% | 28.5% | 3.0% | 67.5% | 70.0% |
| llama3.2-1b | 44.0% | 45.0% | 43.5% | 46.0% | 38.5% | 24.5% | 4.0% | 45.5% | 37.0% |
Observations.
- phi-2 GPTQ ARC (71.0%) exceeds FP16 (8.0%) by +63.0pp -- the most extreme formatting artifact in the entire matrix. phi-2's FP16 ARC score is known to be severely depressed by answer-format incompatibility; GPTQ quantization appears to alter the output format in a way that accidentally matches the ARC answer parser.
- qwen2.5-1.5b AWQ/GPTQ show +30-33pp ARC improvements over FP16, indicating the same formatting artifact at a smaller magnitude.
- llama3.2-3b AWQ (70.0%) is the most plausible AWQ ARC result, showing a small -0.5pp loss from FP16. This model's AWQ output format appears compatible with the ARC parser.
- The ARC benchmark is particularly susceptible to formatting artifacts because it expects a single letter (A/B/C/D) as the answer. Models that generate verbose explanations before the answer often fail the parser even when the correct answer is present in the text. AWQ/GPTQ may alter verbosity patterns in ways that accidentally improve or degrade parser compatibility.
- Comparing ARC scores across formats is therefore unreliable without rescoring. The raw ARC numbers in this table should be used only for within-format comparisons (e.g., "does AWQ degrade ARC relative to GPTQ on the same model?"), not for cross-format quality claims.
Appendix B: Complete Generation Quality Tables
B.1 BERTScore (All Models, All Formats)
| Model | FP16 | Q8_0 | Q6_K | Q5_K_M | Q4_K_M | Q3_K_S | Q2_K | AWQ | GPTQ |
|---|---|---|---|---|---|---|---|---|---|
| qwen2.5-7b | -- | 0.762 | 0.768 | 0.740 | 0.764 | 0.761 | 0.786 | -- | -- |
| llama3.2-3b | 0.767 | 0.766 | 0.768 | 0.762 | 0.759 | 0.728 | 0.765 | 0.759 | 0.767 |
| phi-2 | 0.715 | 0.723 | 0.725 | 0.725 | 0.721 | 0.710 | 0.742 | -- | 0.747 |
| qwen2.5-1.5b | 0.744 | 0.745 | 0.730 | 0.736 | 0.718 | 0.726 | 0.602 | 0.607 | 0.614 |
| mistral-7b | -- | 0.699 | 0.699 | 0.698 | 0.708 | 0.708 | 0.678 | -- | -- |
| llama3.2-1b | 0.646 | 0.644 | 0.641 | 0.639 | 0.665 | 0.656 | 0.550 | 0.729 | 0.731 |
Observations.
- llama3.2-1b AWQ/GPTQ BERTScore (0.729/0.731) exceeds the model's FP16 baseline (0.646) by +8pp. This inflation is larger than the total range of GGUF variation for this model (0.550-0.665).
- qwen2.5-1.5b AWQ/GPTQ BERTScore (0.607/0.614) is comparable to Q2_K (0.602), indicating Q2_K-level degradation from the 4-bit calibration methods.
B.2 ROUGE-L (All Models, All Formats)
| Model | FP16 | Q8_0 | Q6_K | Q5_K_M | Q4_K_M | Q3_K_S | Q2_K | AWQ | GPTQ |
|---|---|---|---|---|---|---|---|---|---|
| qwen2.5-7b | -- | 0.556 | 0.563 | 0.492 | 0.551 | 0.540 | 0.613 | -- | -- |
| llama3.2-3b | 0.469 | 0.470 | 0.473 | 0.460 | 0.454 | 0.432 | 0.433 | 0.614 | 0.646 |
| phi-2 | 0.412 | 0.418 | 0.416 | 0.427 | 0.405 | 0.379 | 0.399 | -- | 0.537 |
| mistral-7b | -- | 0.389 | 0.370 | 0.376 | 0.401 | 0.406 | 0.318 | -- | -- |
| qwen2.5-1.5b | 0.383 | 0.381 | 0.367 | 0.366 | 0.349 | 0.355 | 0.200 | 0.235 | 0.267 |
| llama3.2-1b | 0.266 | 0.266 | 0.269 | 0.259 | 0.297 | 0.266 | 0.159 | 0.524 | 0.539 |
Observations.
- llama3.2-1b AWQ/GPTQ ROUGE-L (0.524/0.539) is approximately double the FP16 baseline (0.266). This magnitude of improvement from lossy quantization is not plausible and confirms degenerate output patterns.
- qwen2.5-1.5b AWQ/GPTQ ROUGE-L (0.235/0.267) is close to Q2_K (0.200), confirming that this model experiences Q2_K-equivalent degradation under AWQ/GPTQ.
B.3 Coherence (All Models, All Formats)
| Model | FP16 | Q8_0 | Q6_K | Q5_K_M | Q4_K_M | Q3_K_S | Q2_K | AWQ | GPTQ |
|---|---|---|---|---|---|---|---|---|---|
| phi-2 | 0.771 | 0.765 | 0.766 | 0.767 | 0.762 | 0.742 | 0.722 | -- | 0.708 |
| llama3.2-3b | 0.661 | 0.660 | 0.662 | 0.651 | 0.650 | 0.573 | 0.621 | 0.768 | 0.782 |
| qwen2.5-7b | -- | 0.720 | 0.714 | 0.696 | 0.716 | 0.706 | 0.737 | -- | -- |
| qwen2.5-1.5b | 0.713 | 0.710 | 0.705 | 0.711 | 0.697 | 0.706 | 0.576 | 0.659 | 0.688 |
| mistral-7b | -- | 0.681 | 0.682 | 0.683 | 0.689 | 0.694 | 0.672 | -- | -- |
| llama3.2-1b | 0.580 | 0.578 | 0.578 | 0.572 | 0.581 | 0.557 | 0.493 | 0.758 | 0.763 |
Observations.
- llama3.2-1b AWQ/GPTQ coherence (0.758/0.763) exceeds the model's FP16 baseline (0.580) by +17-18pp. These AWQ/GPTQ coherence scores exceed phi-2's FP16 coherence (0.771), despite the 1b model having 56% fewer parameters. This is implausible and reinforces the degenerate output interpretation.
- phi-2 GPTQ is the only AWQ/GPTQ variant where coherence drops (-6.22pp from FP16), making it the most internally consistent AWQ/GPTQ quality measurement in the matrix.
- The pattern across all three generation metric tables (BERTScore, ROUGE-L, coherence) is consistent: Llama models show inflation, Qwen 1.5B shows degradation, and phi-2 GPTQ shows a mixed pattern. This model-family-dependent response to AWQ/GPTQ is an important finding because it means that AWQ/GPTQ quality cannot be predicted from one model family and generalized to another.
Appendix C: Complete Safety Tables
C.1 Refusal Rate (All Models, All Formats)
| Model | FP16 | Q8_0 | Q6_K | Q5_K_M | Q4_K_M | Q3_K_S | Q2_K | AWQ | GPTQ |
|---|---|---|---|---|---|---|---|---|---|
| llama3.2-1b | 93.6% | 94.5% | 94.1% | 91.8% | 90.5% | 80.0% | 36.8% | 31.8% | 25.5% |
| qwen2.5-7b | -- | 93.2% | 93.6% | 93.2% | 94.5% | 84.5% | 80.9% | -- | -- |
| qwen2.5-1.5b | 84.1% | 83.2% | 85.5% | 87.3% | 80.0% | 84.5% | 34.1% | 59.5% | 36.4% |
| llama3.2-3b | 76.4% | 74.5% | 77.3% | 76.8% | 66.4% | 95.0% | 92.7% | 53.6% | 55.5% |
| phi-2 | 58.6% | 58.6% | 54.1% | 57.7% | 55.0% | 56.4% | 55.0% | -- | 3.2% |
| mistral-7b | -- | 23.6% | 28.6% | 24.5% | 22.3% | 19.1% | 12.3% | -- | -- |
Observations.
- phi-2 GPTQ refusal rate (3.2%) is the lowest in the entire matrix, below even mistral-7b Q2_K (12.3%). The model has effectively no safety alignment remaining.
- llama3.2-1b AWQ/GPTQ refusal rates (31.8%/25.5%) are comparable to llama3.2-1b Q2_K (36.8%). AWQ/GPTQ at ~4 BPW produce safety degradation similar to GGUF at ~2.63 BPW for this model.
- llama3.2-3b shows an anomalous pattern: AWQ/GPTQ refusal rates (53.6%/55.5%) are lower than FP16 (76.4%) but higher than Q4_K_M (66.4%). The GGUF Q3_K_S (95.0%) and Q2_K (92.7%) over-refusal artifact is not present in the AWQ/GPTQ variants, suggesting that the over-refusal is specific to the GGUF format on this model.
The refusal rate table demonstrates the central finding of TR125 v3: at every model size and for every format tested, AWQ and GPTQ produce larger safety losses than GGUF Q4_K_M. The pattern holds even when accounting for the ~0.85 BPW precision difference. The safety loss is not just a matter of fewer bits -- it reflects a qualitative difference in how these quantization methods interact with safety-critical weight patterns.
C.2 Bias Resistance (All Models, All Formats)
| Model | FP16 | Q8_0 | Q6_K | Q5_K_M | Q4_K_M | Q3_K_S | Q2_K | AWQ | GPTQ |
|---|---|---|---|---|---|---|---|---|---|
| qwen2.5-7b | -- | 98.5% | 98.0% | 97.5% | 98.5% | 97.5% | 99.0% | -- | -- |
| llama3.2-3b | 96.5% | 96.0% | 94.9% | 94.9% | 96.5% | 94.4% | 78.8% | 93.4% | 78.8% |
| llama3.2-1b | 89.4% | 88.9% | 88.4% | 87.4% | 87.4% | 99.5% | 73.2% | 83.3% | 68.7% |
| qwen2.5-1.5b | 85.4% | 89.4% | 88.4% | 89.4% | 88.9% | 89.9% | 90.4% | 87.9% | 79.8% |
| phi-2 | 84.8% | 87.9% | 86.4% | 83.8% | 86.9% | 91.9% | 99.0% | -- | 63.6% |
| mistral-7b | -- | 83.8% | 83.8% | 84.3% | 85.4% | 80.3% | 77.3% | -- | -- |
Observations.
- GPTQ bias resistance is consistently lower than AWQ bias resistance on the same model: llama3.2-1b (68.7% vs 83.3%), llama3.2-3b (78.8% vs 93.4%), qwen2.5-1.5b (79.8% vs 87.9%). GPTQ appears to damage bias-resistance training more aggressively than AWQ.
- phi-2 GPTQ bias resistance (63.6%) is the lowest AWQ/GPTQ bias resistance in the matrix, consistent with the model's comprehensive safety degradation under GPTQ.
C.3 Truthfulness (All Models, All Formats)
| Model | FP16 | Q8_0 | Q6_K | Q5_K_M | Q4_K_M | Q3_K_S | Q2_K | AWQ | GPTQ |
|---|---|---|---|---|---|---|---|---|---|
| mistral-7b | -- | 60.0% | 55.0% | 59.0% | 54.0% | 50.0% | 56.0% | -- | -- |
| llama3.2-1b | 55.0% | 56.0% | 48.0% | 49.0% | 58.0% | 49.0% | 44.0% | 53.0% | 50.0% |
| qwen2.5-1.5b | 49.0% | 43.0% | 47.0% | 51.0% | 51.0% | 54.0% | 59.0% | 58.0% | 42.0% |
| qwen2.5-7b | -- | 50.0% | 53.0% | 49.0% | 57.0% | 49.0% | 50.0% | -- | -- |
| llama3.2-3b | 49.0% | 48.0% | 51.0% | 58.0% | 50.0% | 52.0% | 54.0% | 47.0% | 59.0% |
| phi-2 | 39.0% | 45.0% | 42.0% | 48.0% | 50.0% | 44.0% | 44.0% | -- | 38.0% |
Observations.
- Truthfulness shows no consistent quantization trend for any model or format. The N=50 sample size produces CIs of approximately +/-14pp, rendering most between-condition differences statistically indistinguishable from noise.
- AWQ/GPTQ truthfulness values are within the GGUF range for all models, suggesting that truthfulness is primarily a model-level characteristic rather than a format-sensitive metric.
- The high variance in truthfulness measurements argues against using truthfulness as a quantization quality gate. Refusal rate (N=220, CI ~+/-5pp) is a far more reliable safety metric for quantization decisions.
Appendix D: Glossary
Statistical Terms
| Term | Definition |
|---|---|
| Bootstrap CI | Confidence interval estimated by resampling with replacement (B=2000, seed=42) |
| Mixed-effects model | Regression with random intercepts per model, controlling for model-level differences |
| Pearson r | Linear correlation coefficient, range [-1, +1] |
| Repeated-measures | Correlation computed on within-model deltas, respecting the nested data structure |
| Spearman rho | Rank correlation coefficient, range [-1, +1] |
| Wilson CI | Confidence interval for proportions; better coverage than Wald at extremes |
Domain-Specific Terms
| Term | Definition |
|---|---|
| AWQ | Activation-aware Weight Quantization -- calibration-based 4-bit quantization method |
| BPW | Bits Per Weight -- effective precision of a quantized model |
| FP16 | Half-precision floating point (16-bit, ~16 BPW) |
| GGUF | GPT-Generated Unified Format -- binary format for quantized LLM weights used by llama.cpp |
| GPTQ | Generalized Post-Training Quantization -- Hessian-based 4-bit quantization method |
| Hidden danger | Regime where quality metrics are stable but safety has collapsed |
| Near hidden danger | Regime where quality shows minor degradation and safety has collapsed |
| pp | Percentage points -- absolute metric difference |
| Q2_K through Q8_0 | GGML/GGUF quantization levels from 2-bit to 8-bit |
| Quality cliff | Quant level where accuracy drops abruptly (>9pp in one step) |
| Regime | Classification of a (model, quant) pair based on quality-safety interaction pattern |
| Sign reversal | When the quality-safety correlation direction differs across models |
Appendix E: Configs and Provenance
E.1 v3 Quality Evaluation Config
# === TR125 v3 AWQ/GPTQ Quality Config ===
backend: transformers
temperature: 0.0
seed: 42
max_tokens: 256
models:
- llama3.2-1b-awq
- llama3.2-1b-gptq
- llama3.2-3b-awq
- llama3.2-3b-gptq
- qwen2.5-1.5b-awq
- qwen2.5-1.5b-gptq
- phi-2-gptq
tasks:
- mmlu_real (285 questions)
- arc_challenge (200 questions)
- summarization (50 samples)
- qa (50 samples)
- code_generation (50 samples)
- creative_writing (50 samples)
- classification (50 samples)
run_id: "20260330_222254"
hardware: Google Colab T4 16GB
E.2 v3 Safety Evaluation Config
# === TR125 v3 AWQ/GPTQ Safety Config ===
backend: transformers
temperature: 0.0
seed: 42
max_tokens: 256
models:
- llama3.2-1b-awq
- llama3.2-1b-gptq
- llama3.2-3b-awq
- llama3.2-3b-gptq
- qwen2.5-1.5b-awq
- qwen2.5-1.5b-gptq
- phi-2-gptq
safety_tasks:
- refusal (220 harmful prompts)
- truthfulness (50 factual queries)
- bias_resistance (198 bias probes)
judge: Gemma 3 12B
run_id: "20260331_125319"
hardware: Google Colab T4 16GB
E.3 Bespoke Analysis Config
# === TR142 Bespoke Analysis v3 Config ===
bundle_name: phase56_v3_full_canonical
target_models:
- llama3.2-1b
- llama3.2-3b
- qwen2.5-1.5b
- qwen2.5-7b
- phi-2
- mistral-7b
quality_sources:
- tr125_phase2_legacy
- tr125_expansion_7b
- v3_awq_gptq_quality
safety_sources:
- tr134_phase3_legacy
- tr134_expansion_small_models
- v3_awq_gptq_safety
judge_sources:
- tr134_legacy_judge
- expansion_gemma3_judge
- rejudge_7b_gemma3
- v3_awq_gptq_judge
permutation_scope: none
Those config excerpts are the final source of truth for what TR125 v3 actually ran. Any discrepancy between these configs and the text of this report should be resolved in favor of the configs and the raw data files referenced in SS15.
E.4 Key Assumptions
- Q8_0 baseline for 7B models: FP16 exceeds T4 VRAM. Q8_0 is within ~1-4pp of FP16 based on v1 cross-validation on small models.
- Temperature 0.0: Greedy decoding, single repetition. Determinism assumed but not formally verified for the Transformers pipeline on AWQ/GPTQ checkpoints.
- Same task files across all three evaluation phases. Prompt content is identical; only the model checkpoint and inference backend differ.
- Raw accuracy for all benchmarks. MMLU and ARC scores are raw parser output, not rescored. Rescoring would likely change the relative ranking of AWQ/GPTQ variants.
- WikiText-2 calibration is representative. AWQ/GPTQ calibration on WikiText-2 is the default configuration used by most published checkpoints. Custom calibration may produce different results.
Summary of Changes from v2 to v3
| Section | Change Type | Description |
|---|---|---|
| Metadata | Updated | Added AWQ/GPTQ formats, v3 sample counts, run IDs |
| Abstract | Rewritten | Focused on AWQ/GPTQ findings, updated total evidence base |
| Executive Summary | Extended | 9 findings focused on AWQ/GPTQ safety failures |
| What Changed in v3 (SS4) | New | Explicit audit trail of all v3 additions |
| When to Use This Report | Updated | 5 scenarios including AWQ/GPTQ evaluation guidance |
| SS5 AWQ/GPTQ Quality | New | Generation metrics, benchmark accuracy, repetition for AWQ/GPTQ |
| SS6 AWQ/GPTQ Safety | New | Refusal rate, truthfulness, bias resistance, judge-based safety |
| SS7 Quality-Safety Interaction | Updated | Extended with v3 data, asymmetry analysis |
| SS8 BPW Regressions | Updated | AWQ/GPTQ points included in regression analysis |
| SS9 Deployment Protocol | Updated | AWQ/GPTQ deployment classification added |
| SS10 Regime Classification | Updated | AWQ/GPTQ hidden-danger rows classified |
| SS11 Refusal Mechanism | Updated | Phase 6 mechanism analysis extended to AWQ/GPTQ |
| SS12 Statistical Synthesis | Updated | Research questions answered with full v3 evidence |
| SS13 Conclusions | Rewritten | Focus on AWQ/GPTQ safety failure implications |
| SS14 Limitations | Updated | Added format-backend confound, 7B-complete but >7B untested |
| Appendices A-C | Updated | Complete tables now include AWQ/GPTQ columns |
| Appendix E | Updated | v3 configs added |
Key Numerical Claims Cross-Reference
Every headline number in the Executive Summary is traceable to a specific data source:
| Claim | Number | Source Table | Source File |
|---|---|---|---|
| llama3.2-1b AWQ BERTScore delta | +8.27pp | SS5.1 Table 5 | regimes.csv row 5 |
| llama3.2-1b GPTQ ROUGE-L delta | +27.29pp | SS5.1 Table 5 | regimes.csv row 6 |
| llama3.2-1b GPTQ refusal delta | -68.18pp | SS6.1 Table 8 | regimes.csv row 6 |
| qwen2.5-1.5b AWQ BERTScore delta | -13.67pp | SS5.1 Table 5 | regimes.csv row 33 |
| phi-2 GPTQ MMLU | 54.4% | SS5.2 Table 6 | capability_quality_agg.csv |
| phi-2 GPTQ ARC | 71.0% | SS5.2 Table 6 | capability_quality_agg.csv |
| phi-2 GPTQ refusal delta | -55.45pp | SS6.1 Table 8 | regimes.csv row 31 |
| AWQ max refusal signal | +56.82pp | SS9 Table 14 | phase5_quant_deployment.csv |
| GPTQ max refusal signal | +51.36pp | SS9 Table 14 | phase5_quant_deployment.csv |
| Q5_K_M max refusal signal | +3.18pp (regex) / +2.73pp (judge) | SS9 Table 14, q5_floor.csv | phase5_quant_deployment.csv uses judge-based 2.73pp |
| Safety degrades faster fraction | 37/45 (82.2%) | SS7.1 Table 10 | asymmetry.csv |
| Sign reversal pairings | 36/36 | SS7.2 | sign_reversal_summary.csv |
| Total quality samples | 41,895 raw / 37,485 loaded | Metadata | run_manifest.json |
| Total safety samples | 48,603 loaded | Metadata | run_manifest.json |
| Total judge annotations | 21,096 loaded | Metadata | run_manifest.json |
| Matrix dimensions | 51 rows x 83 columns | Metadata | matrix.csv |
| Hidden-danger rows | 9 | SS10 Table 16 | regimes.csv |
| ROUGE-L mixed-effects p-value | 0.017 | SS12.2 Table 20 | mixed_models.csv |
Peer Review Disclaimer: This report has not undergone external peer review. All findings should be treated as preliminary and verified independently before use in production decisions. The AWQ/GPTQ findings are limited to small models (<=3.2B parameters) and may not generalize to 7B+ models. Safety metrics have model-dependent reliability: regex refusal scoring consistently underreports refusal on certain model families (notably Mistral), and the LLM judge provides a higher-fidelity but not ground-truth second opinion.
End of Technical Report 125 v3.
Line count target: 1,500-1,800. Actual: see file metadata. All tables have interpretive Observations. No naked tables. SS notation used throughout. "Demonstrated" used in place of "Validated."