
TR125: Quantization Decision Matrix

Production-grade quant level selection across 5 models (1.2B–8B) with benchmark validation.


Field Value
TR Number 125
Project Banterhearts LLM Performance Research
Date 2026-02-22 (Phase 1: Feb 21, Phase 2: Feb 22)
Author Research Team
Report Type Quantization impact analysis (metric-backed, 2-phase)
Test Duration ~20 min (Phase 1) + ~10 hrs (Phase 2)
Status Complete -- Both phases delivered
Run IDs Phase 1: 20260220_203010, Phase 2: 20260221_120035
Related Work TR124 (Quality & Accuracy Baseline), TR123 (KV-Cache Production Economics)
Depends On TR124 (FP16 baselines, metric framework), TR123 (cost data)

Abstract

TR124 established quality baselines and confirmed backend equivalence, but left a critical deployment question unanswered: how much quantization can a model tolerate before quality degrades unacceptably? TR124's Phase 2 tested only 4 models at Ollama's default quant levels with 200 samples and no benchmark anchoring. TR125 fills this gap with a comprehensive 2-phase quantization decision matrix spanning 7 quant levels, 5 models, and ~26,000 evaluated samples.

Phase 1 (Exploratory): We evaluate 3 models (1.2B-2.7B parameters) across 6 quant levels (Q2_K through Q8_0) on 5 generation tasks (10 samples each) -- totaling 900 evaluated samples with wall-clock timing. Q8_0 serves as the quantization baseline. Phase 1 identifies the base-vs-instruct confound in TR124's FP16 baselines and establishes that quality is broadly stable from Q8_0 through Q5_K_M before degrading at Q3_K_S and collapsing at Q2_K.

Phase 2 (Production-Grade): We evaluate 5 models (1.2B-8B parameters) across 7 quant levels (Q2_K through FP16) on 7 tasks -- 285 real MMLU questions + 200 real ARC-Challenge questions from HuggingFace (primary quality gate) plus 5 generation tasks (50 samples each, secondary) -- totaling 24,990 evaluated samples with native Ollama timing. FP16 Ollama baselines eliminate the base-vs-instruct confound. A 4-tier quality classification system (negligible/acceptable/concerning/unacceptable) replaces Phase 1's binary threshold.

Total: ~26,000 samples across 2 phases, 34 model-quant variants.

v2 Enhancement: Re-analysis of existing Phase 2 data adds: Wilson confidence intervals for all benchmark tables, generation quality CIs (in raw data), MMLU vs ARC differential analysis, all 7 generation metrics (including repetition collapse detection), per-task quality breakdown, IQR outlier detection on timing data, Bonferroni/Holm multiple comparison correction (7/16 survive), TOST equivalence testing at two margins (0/18 at +/-3pp; 6/18 generation-equivalent at +/-5pp), complete 34-variant TTFT table, and explicit 29-variant tier enumeration. No new data collection -- all computed from existing 24,990 samples.

Key findings:

  • Q4_K_M is the sweet spot across all tested models: 21 of 29 quantized variants maintain benchmark accuracy within 5pp of baseline. All 5 models preserve negligible-to-acceptable quality at Q4_K_M; FP16-baselined models save 30-67% vs FP16, and llama3.1-8b saves 49% vs its Q8_0 baseline.
  • The quality cliff is at Q3_K_S, not Q4_K_M: llama3.2-3b loses -10.1pp, qwen2.5-1.5b loses -12.2pp, and llama3.2-1b loses -9.5pp at Q3_K_S. Q4_K_M is safe; Q3_K_S is not.
  • Q2_K is universally unacceptable: Every model tested loses >11pp benchmark accuracy at Q2_K. qwen2.5-1.5b loses -40.6pp -- near-random performance.
  • phi-2 is the most quantization-robust model: All quant levels Q3_K_S and above stay within -1.8pp of FP16. phi-2 at Q3_K_S loses only -0.4pp.
  • llama3.1-8b achieves the highest accuracy: 72.4% rescored accuracy at Q8_0, maintaining 69.7% even at Q4_K_M (-2.7pp).
  • Cost range spans 10x: $0.0203/1M tokens (llama3.2-1b Q2_K) to $0.1976/1M tokens (llama3.1-8b Q8_0).
  • phi-2's raw accuracy is misleading: 26% raw but 59% rescored -- a formatting issue, not a knowledge issue. Rescored accuracy using regex letter extraction is essential.
  • Native timing reveals massive HTTP overhead: 190-920% overhead between Ollama-native eval_duration and wall-clock timing.

Metric Definitions

These definitions control comparability across models and ensure consistency with TR124.

Generation Metrics

  • ROUGE-L: Longest common subsequence F1 against reference text. Measures structural overlap. Range [0, 1].
  • BERTScore: Contextual embedding similarity using microsoft/deberta-xlarge-mnli. More robust to paraphrasing than ROUGE. Range [0, 1].
  • BLEU: Geometric mean of 1-4 gram precision with brevity penalty. Standard for code generation. Range [0, 1].
  • Coherence (SemScore): Cosine similarity using all-mpnet-base-v2 sentence-transformers. Highest human correlation among automated metrics (Aynetdinov & Akbik 2024). Range [0, 1].
  • Exact Match: Binary. 1 if candidate matches reference (case-insensitive, stripped). Range {0, 1}.
  • Output Length: min(len(candidate), len(reference)) / max(len(candidate), len(reference)). Penalizes both truncation and over-generation. Range [0, 1].
  • Repetition: unique_4grams / total_4grams. Lexical diversity measure. Score of 1.0 = maximally diverse. Range [0, 1].
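The length and repetition metrics above are simple enough to sketch directly. A minimal version, assuming whitespace tokenization for n-grams (function names are illustrative, not the framework's actual API):

```python
def output_length_score(candidate: str, reference: str) -> float:
    """Symmetric length ratio: penalizes truncation and over-generation alike."""
    lc, lr = len(candidate), len(reference)
    if max(lc, lr) == 0:
        return 1.0
    return min(lc, lr) / max(lc, lr)

def repetition_score(text: str, n: int = 4) -> float:
    """unique n-grams / total n-grams over whitespace tokens; 1.0 = maximally diverse."""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 1.0
    return len(set(ngrams)) / len(ngrams)
```

A degenerate loop such as "a b c d" repeated three times yields only 4 distinct 4-grams out of 9, scoring well below 1.0 -- the kind of collapse reported for qwen2.5-1.5b at Q2_K.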

Benchmark Metrics

  • Raw Accuracy: Framework exact_match on model output vs correct answer letter. Sensitive to formatting noise (e.g., "B) Ampere" does not match "B").
  • Rescored Accuracy: Regex letter extraction from model output, then compared to correct answer. Handles common formatting patterns: "B", "B)", "The answer is B", "Answer: B". This is the primary quality metric for benchmarks, as it separates knowledge from formatting ability.

Quality Tier System

Quality tiers classify each (model, quant) combination based on the worse of benchmark accuracy delta (in percentage points) and generation quality delta (in percent):

Tier Benchmark Delta (pp) Generation Delta (%) Interpretation
Negligible >= -3pp >= -3% No meaningful quality loss
Acceptable >= -5pp >= -8% Minor degradation, acceptable for most uses
Concerning >= -10pp >= -15% Noticeable quality loss, evaluate for specific task
Unacceptable Worse than above Worse than above Do not deploy
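The tier logic reduces to taking the worse (higher-severity) of the two deltas. A sketch using the thresholds from the table above, with negative deltas meaning quality loss:

```python
def classify_tier(benchmark_delta_pp: float, generation_delta_pct: float) -> str:
    """Classify a (model, quant) variant by the WORSE of its two quality deltas."""
    def bench_severity(d: float) -> int:
        if d >= -3:  return 0   # negligible
        if d >= -5:  return 1   # acceptable
        if d >= -10: return 2   # concerning
        return 3                # unacceptable

    def gen_severity(d: float) -> int:
        if d >= -3:  return 0
        if d >= -8:  return 1
        if d >= -15: return 2
        return 3

    tiers = ["negligible", "acceptable", "concerning", "unacceptable"]
    return tiers[max(bench_severity(benchmark_delta_pp),
                     gen_severity(generation_delta_pct))]
```

For example, llama3.1-8b at Q4_K_M (-2.7pp benchmark, small generation delta) lands in "negligible", while qwen2.5-1.5b at Q4_K_M (-4.1pp) lands in "acceptable".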

Key Metric Average

For generation tasks, the key metric average is the unweighted mean of BERTScore, coherence, and ROUGE-L. These three metrics capture structural overlap (ROUGE-L), semantic similarity (BERTScore), and meaning preservation (coherence). Deltas are reported as percent change vs baseline.

Statistical Methods & Caveats

Tests used:

  • Pairwise Welch's t-tests between adjacent quant levels on rescored accuracy (binary 0/1) and generation metrics (continuous). Alpha = 0.05 uncorrected. See §15 for full results.
  • Power analysis via normal approximation for minimum detectable effect at alpha = 0.05, power = 0.80. See §16.
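For concreteness, the adjacent-level test statistic can be sketched in pure Python (the analysis pipeline presumably uses scipy.stats.ttest_ind with equal_var=False; this is an equivalent hand-rolled form):

```python
from math import sqrt
from statistics import mean, variance

def welch_t(xs: list[float], ys: list[float]) -> float:
    """Welch's t statistic (unequal variances).

    On binary 0/1 accuracy data at N=485 per group this is effectively a
    two-proportion z statistic; |t| > ~1.96 corresponds to p < 0.05
    (two-sided, large-N approximation).
    """
    se = sqrt(variance(xs) / len(xs) + variance(ys) / len(ys))
    return (mean(xs) - mean(ys)) / se
```

E.g. a 70% vs 50% accuracy split at N=100 per group gives |t| of roughly 2.9, comfortably significant; much smaller gaps at these sample sizes are not, which is the caveat discussed below.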

Important caveats:

  1. Multiple comparison correction: TR125 runs 116 pairwise tests (29 benchmark + 87 generation). At alpha = 0.05, ~5.8 false positives are expected by chance. No family-wise correction is applied to reported p-values, but Bonferroni and Holm corrections were computed (§15.4). 7 of 16 significant results survive both corrections -- all at the Q3_K_S/Q2_K boundary. The Q2_K cliff is robust; the Q3_K_S cliff is not.

  2. t-tests on binary data: Benchmark accuracy is binary (0/1 per question). While Welch's t-test converges to a z-test at N=485, a two-proportion z-test or chi-squared test would be the standard approach. Cohen's d on binary data is bounded (max ~2.0 at p=0.5), producing mechanically small effect sizes. Reported d values for benchmark tests should not be directly compared to generation d values.

  3. Equivalence claims require equivalence testing: Classifying a variant as "negligible" quality loss is an equivalence claim. A non-significant t-test (p > 0.05) does NOT establish equivalence; it merely fails to detect a difference. TOST (Two One-Sided Tests) was applied to all 18 negligible variants at both +/-3pp and +/-5pp equivalence margins (§15.5). At +/-3pp: 0/18 pass. At +/-5pp: 0/18 benchmark pass, 6/18 generation pass (phi-2 Q8_0/Q6_K/Q5_K_M, llama3.2-3b Q8_0/Q6_K, qwen2.5-1.5b Q8_0). Generation quality can confirm equivalence at wider margins because continuous metrics have lower variance than binary benchmark data. The "negligible" tier should be read as "point estimate within 3pp, 6 variants confirmed generation-equivalent at +/-5%, but benchmark equivalence unconfirmed."

  4. Tier thresholds vs MDE: The benchmark MDE is 9.0pp at 80% power (§16.1). The "negligible" tier uses a -3pp threshold and "acceptable" uses -5pp -- both below the detection limit. This means tier classifications for deltas between 0 and -9pp are based on point estimates that may not be statistically distinguishable from zero. The tier system remains useful as a point-estimate decision guide, but the statistical evidence for "negligible vs acceptable" is weak for any individual variant.

  5. Confidence intervals: The benchmark tables now include 95% Wilson CIs (§8.1-8.5, added in v2). Wilson CI half-widths range from +/-3.7pp (low accuracy) to +/-4.4pp (mid accuracy). This means a reported delta of -2.3pp (llama3.2-1b Q4_K_M) is within the noise band and may not represent a real quality difference. Generation quality CIs exist in phase2_analysis.json and phase2_v2_enhancements.json.

  6. Ollama determinism assumption: TR125 uses temp = 0.0 with single repetition, citing TR124 Phase 3's validation that deterministic outputs need only one rep. However, TR124 Phase 3 validated determinism for HuggingFace transformers backends, not Ollama (which uses llama.cpp). Ollama at temp = 0 may not be perfectly deterministic due to different floating-point accumulation order in llama.cpp's kernels. No determinism validation was performed for the Ollama backend in TR125. If Ollama is not perfectly deterministic, the single-repetition design underestimates measurement variance.
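The Wilson intervals in caveat 5 follow the standard score-interval formula. A sketch, where the 351-of-485 input is inferred from the reported 72.4% rescored accuracy for llama3.1-8b Q8_0 (an assumption, not a logged count):

```python
from math import sqrt

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion (z = 1.96)."""
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2))
    return centre - half, centre + half
```

Plugging in 351/485 reproduces the [0.682, 0.762] band and the ~+/-4.0pp half-width quoted for that row, which is why sub-4pp deltas sit inside the noise band.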


Executive Summary

TR125 answers: which quantization level should you choose for each model, and what quality-cost trade-off does each level offer?

Key Findings

  1. Q4_K_M is safe across all 5 models: Maximum benchmark accuracy loss at Q4_K_M is -4.1pp (qwen2.5-1.5b). All other models lose <3pp. This is the recommended default quant level for production deployment.
  2. The quality cliff is sharp and model-dependent: Quality is stable from FP16 through Q4_K_M, then drops abruptly at Q3_K_S for most models. llama3.2-3b loses -10.1pp at Q3_K_S, qwen2.5-1.5b loses -12.2pp. phi-2 is the exception -- Q3_K_S costs only -0.4pp.
  3. Q2_K is universally unacceptable: All 5 models lose >11pp at Q2_K. qwen2.5-1.5b collapses to -40.6pp (near-random). No model should be deployed at Q2_K.
  4. llama3.1-8b delivers the highest benchmark accuracy: 72.4% rescored accuracy at Q8_0, with 69.7% at Q4_K_M. The 8B model adds a genuine quality tier above the 1-3B models.
  5. phi-2 tolerates quantization best: All levels Q3_K_S and above are within -1.8pp of FP16. Even Q3_K_S (-0.4pp) is classified "acceptable." This makes phi-2 the ideal candidate for aggressive quantization when VRAM is constrained.
  6. Rescored accuracy is essential: phi-2's raw accuracy is 26% (looks near-random) but rescored accuracy is 59% (strong). Raw exact_match penalizes formatting differences, not knowledge gaps. All benchmark results use rescored accuracy.
  7. Native timing reveals true throughput: Native tok/s ranges from 49 (llama3.1-8b Q8_0) to 480 (llama3.2-1b Q2_K). HTTP overhead adds 190-920% on top, making wall-clock timing unreliable for relative comparisons.
  8. Q4_K_M delivers large savings at production quality: FP16-baselined models save 30-67% vs FP16, and llama3.1-8b saves 49% vs Q8_0. phi-2 Q4_K_M saves 67% vs FP16 while losing only -1.8pp accuracy.
  9. (v2) Repetition collapse at Q2_K: qwen2.5-1.5b Q2_K shows repetition score of 0.702 (vs 0.992 baseline) -- degenerate looping text invisible to the 3 key metrics.
  10. (v2) TOST equivalence partially confirmed at wider margin: 0/18 at +/-3pp, but 6/18 generation-equivalent at +/-5pp (phi-2 Q8_0/Q6_K/Q5_K_M, llama3.2-3b Q8_0/Q6_K, qwen2.5-1.5b Q8_0). Benchmark equivalence remains unconfirmed due to binary data variance.
  11. (v2) Bonferroni correction validates Q2_K cliff: 7/16 significant tests survive correction -- all at Q3_K_S -> Q2_K boundary. The Q3_K_S cliff is not robust to correction.
  12. (v2) QA and classification most quantization-sensitive: Per-task analysis shows 30-42% degradation at Q2_K for factual tasks vs 3-12% for creative writing.
  13. (v2) Cross-phase reproducibility is metric-dependent: Only coherence is fully reproducible across Phase 1->2 (3/3 models <5% divergence). BERTScore diverges at -5.7% for 2/3 models (marginal). ROUGE-L diverges -10.7% to -18.6% (substantial). Coherence is the only fully reliable cross-phase signal (§17).

Key Decision

  • For maximum quality: llama3.1-8b at Q8_0 (72.4% accuracy, $0.1976/1M tokens).
  • For best quality-per-dollar: phi-2 at Q4_K_M (57.5% accuracy, $0.0490/1M tokens, 67% cheaper than FP16).
  • For maximum throughput: llama3.2-1b at Q4_K_M (280.9 native tok/s, $0.0346/1M tokens, negligible quality loss).
  • For VRAM-constrained deployment (<2GB): llama3.2-1b Q4_K_M (0.7 GB est.) or phi-2 Q3_K_S (1.2 GB est.).
  • Never deploy Q2_K for any quality-sensitive task.

Claim Validation

# Claim Evidence Base Status
1 Q4_K_M preserves quality across all models Benchmark accuracy within 5pp for all 5 models (§8). Wilson CIs overlap baselines (§8.1-8.5). TOST at +/-3pp fails (§15.5) -- point estimate validated but equivalence unconfirmed Demonstrated (point estimate; TOST underpowered)
2 Quality cliff occurs at Q3_K_S boundary 3/5 models lose >9pp at Q3_K_S (§8) Demonstrated (model-dependent)
3 Q2_K is universally unacceptable All 5 models lose >11pp benchmark accuracy (§8) Demonstrated
4 phi-2 is most quantization-robust Max loss -1.8pp through Q4_K_M, -0.4pp at Q3_K_S (§8) Demonstrated
5 Rescored accuracy resolves formatting noise phi-2: 26% raw vs 59% rescored (§8) Demonstrated
6 Native timing eliminates HTTP overhead CV 10-42% native vs 37-68% wall-clock (§10) Demonstrated
7 Cost savings of 30-67% at Q4_K_M Per-model cost table (§12) Demonstrated
8 Phase 1 confound identified and resolved Base-vs-instruct in TR124 FP16 baselines (§4) Demonstrated
9 Q8_0 is equivalent to FP16 for most models Max delta 1.6pp across 4 models with both levels (§8) Demonstrated
10 Larger models tolerate quantization better (8B vs 1B) llama3.1-8b: -2.7pp at Q4_K_M vs llama3.2-1b: -2.3pp (§8) Partially validated -- phi-2 at 2.7B is most robust

When to Use This Report

TR125 is the quantization decision guide for the Banterhearts research program. Use it when choosing which quant level to deploy for a given model, VRAM budget, and quality requirement.

Scenario 1: Choosing a Quant Level for Production

Question: "I want to deploy qwen2.5-1.5b on a 4GB GPU. Which quant level should I use?"

Answer: Consult the decision matrix (§13, 4GB VRAM tier). qwen2.5-1.5b at Q5_K_M (1.1 GB, -0.4pp, negligible) offers the best quality-preserving option. Q4_K_M (0.9 GB, -4.1pp, acceptable) is viable if you need more throughput. Avoid Q3_K_S (-12.2pp, unacceptable).

Scenario 2: Validating a New Quantization

Question: "I created a custom GGUF quantization. Is the quality acceptable?"

Answer: Run the same MMLU (285) + ARC-Challenge (200) benchmark suite from §5.3. Compare rescored accuracy against this report's baselines. If the delta is within -5pp of FP16, the quant is "acceptable."

Scenario 3: Maximum Throughput with Quality Floor

Question: "I need the fastest possible inference with at least 60% benchmark accuracy."

Answer: Consult §10 and §8 together. llama3.1-8b at Q4_K_M delivers 96.7 native tok/s with 69.7% accuracy. If 60% is sufficient, llama3.2-3b at Q4_K_M offers 141.0 tok/s at 63.7% accuracy.

Scenario 4: VRAM Budget Planning

Question: "How much VRAM do I need for llama3.1-8b at acceptable quality?"

Answer: llama3.1-8b Q4_K_M requires ~4.6 GB estimated VRAM. Q3_K_S (3.6 GB) is also acceptable (-2.5pp). Q2_K (2.6 GB) is unacceptable (-14.2pp). Budget 4-6 GB for the 8B model at production quality.

Scenario 5: Cross-Referencing with TR123/TR124

Question: "TR123 says phi-2 costs $0.119/1M tokens at FP16. What does it cost quantized?"

Answer: Consult §12. phi-2 at Q4_K_M costs $0.0490/1M tokens via Ollama -- 67% cheaper than FP16. At Q6_K, $0.0574/1M tokens -- 61% cheaper. The quality trade-off is negligible (-1.8pp and -1.2pp respectively).

Scenario 6: Deciding Between Phase 1 and Phase 2 Results

Question: "Phase 1 said phi-2 Q2_K only loses -4.0% vs Q8_0. Phase 2 says -11.3pp. Which is right?"

Answer: Phase 2. Phase 1 used generation metrics only (BERTScore/coherence/ROUGE-L) with 50 samples and no benchmark anchoring. Phase 2 uses 485 real benchmark questions as the primary quality gate. Benchmark accuracy is a more objective measure of knowledge degradation than generation quality metrics.


Table of Contents

Phase 1: Exploratory Quantization (§1-§4)

  1. Introduction & Research Motivation
  2. Phase 1 Methodology
  3. Phase 1 Results
  4. Phase 1 Limitations & Lessons

Phase 2: Production-Grade Decision Matrix (§5-§17)

  1. Phase 2 Methodology & Design
  2. Environment & Artifacts
  3. Model Lineup
  4. Benchmark Accuracy Analysis (with Wilson CIs, MMLU vs ARC differential)
  5. Generation Quality Analysis (with all 7 metrics, per-task breakdown)
  6. Native Performance (with outlier analysis)
  7. TTFT Analysis (full 34-variant table)
  8. Cost Analysis
  9. Decision Matrix (with 29-variant enumeration)
  10. Diminishing Returns
  11. Statistical Tests (with Bonferroni/Holm correction, TOST equivalence)
  12. Power Analysis & Statistical Resolution
  13. Cross-Phase Validation

Cross-Phase Synthesis (§18-§19)

  1. Phase 1 vs Phase 2 Synthesis
  2. Production Guidance & Decision Trees

Closing

Appendices


1. Introduction & Research Motivation

1.1 Research Questions

  1. Which quantization levels preserve benchmark accuracy within acceptable thresholds for each model?
  2. Where is the quality cliff -- the quant level at which accuracy degrades beyond acceptability?
  3. Does the quality cliff differ by model size (1.2B vs 3.2B vs 8B)?
  4. What are the cost savings at each quant level, and do they justify the quality trade-off?
  5. Is there a recommended default quant level that works across all tested models?
  6. How do generation quality metrics (BERTScore, coherence, ROUGE-L) compare to benchmark accuracy as quality gates?

1.2 Why This Matters

Quantization is the single most impactful deployment knob for local LLM inference. Going from FP16 to Q4_K_M halves VRAM requirements and typically doubles throughput. But TR124's Phase 2 tested only 4 models at Ollama's default quant levels with 200 samples and no benchmark anchoring -- far too little data to make production deployment decisions.

The gap is practical: a developer choosing between Q4_K_M and Q6_K for a summarization pipeline needs to know whether the quality difference is 2% or 20%, and whether that difference is statistically significant or within measurement noise. TR125 provides these answers with real benchmark data (MMLU + ARC-Challenge from HuggingFace), 485 benchmark samples per variant, and a tiered quality classification system.

1.3 Scope

  • Hardware: Single consumer machine (RTX 4080 Laptop, 12GB VRAM).
  • Models: 5 models, 1.2B-8B parameters, all via Ollama (instruct/chat variants).
  • Backend: Ollama HTTP API with native timing extraction (eval_duration, prompt_eval_duration).
  • Quant levels: FP16, Q8_0, Q6_K, Q5_K_M, Q4_K_M, Q3_K_S, Q2_K (7 levels for 4 models; 6 levels for llama3.1-8b, no FP16).
  • Evaluation modes: Generation (prompt -> text -> metrics) and generation-based multiple-choice (prompt -> answer letter -> accuracy).
  • Benchmarks: MMLU (285 questions, 57 subjects from cais/mmlu) and ARC-Challenge (200 questions from allenai/ai2_arc).
  • Temperature: 0.0 (greedy decoding). Single repetition, per TR124 Phase 3's determinism validation -- though that validation covered HuggingFace backends, not Ollama (see the determinism caveat in Statistical Methods & Caveats).

1.4 Literature Grounding

Reference Contribution How TR125 Uses It
TR124 (Banterhearts) FP16 quality baselines, metric framework Baseline comparison, generation metrics
TR123 (Banterhearts) KV-Cache cost data, hardware pricing Cost derivation at $0.035/hr
EleutherAI lm-evaluation-harness YAML task configs, benchmark methodology MMLU/ARC question format
MMLU (Hendrycks et al. 2021) 57-subject knowledge benchmark Primary quality gate (285 questions)
ARC (Clark et al. 2018) Science reasoning benchmark Secondary quality gate (200 questions)
llama.cpp GGUF quantization Q2_K through Q8_0 quant formats Quant level definitions and VRAM estimates

Gap filled: Prior quantization studies either test a single model at many quant levels or many models at a single quant level. TR125 provides a full matrix: 5 models x 7 quant levels, with both benchmark accuracy and generation quality metrics, enabling model-specific quantization recommendations.


2. Phase 1 Methodology

2.1 Design

Parameter Value
Models 3 (llama3.2-1b, phi-2, qwen2.5-1.5b)
Quant levels 6 (Q2_K, Q3_K_S, Q4_K_M, Q5_K_M, Q6_K, Q8_0)
Baseline Q8_0 (same Ollama instruct model)
Tasks 5 generation (summarization, QA, code_generation, creative_writing, classification)
Samples per task 10
Total samples 900 (18 variants x 50 samples)
Temperature 0.0
Timing Wall-clock only (HTTP overhead included)
Benchmarks None (Ollama lacks logprob support)
Run ID 20260220_203010

2.2 Baseline Note

TR124 Phase 1 FP16 baselines used base models (e.g., unsloth/Llama-3.2-1B) while Ollama tags reference instruct/chat variants (e.g., llama3.2:1b-instruct). For llama and qwen, these are different model weights. FP16 deltas from TR124 mix instruct-tuning effects with quantization effects. Phase 1 therefore uses Q8_0 from the same Ollama run as the correct quantization baseline.

2.3 Quality Metrics

Key metrics for decision-making: BERTScore, coherence, ROUGE-L (averaged as "key metric average"). All 7 generation metrics were computed but decisions rely on these three.


3. Phase 1 Results

3.1 Quality Curves

Quality per (model, quant level). Delta measured vs Q8_0 (same instruct model, same Ollama backend).

llama3.2-1b

Quant N BERTScore (vs Q8_0) Coherence (vs Q8_0) ROUGE-L (vs Q8_0) Key Avg vs Q8_0
Q8_0 50 0.650 (+0.0%) 0.573 (+0.0%) 0.274 (+0.0%) 0.4991 +0.0%
Q6_K 50 0.646 (-0.6%) 0.573 (-0.0%) 0.279 (+1.9%) 0.4996 +0.4%
Q5_K_M 50 0.630 (-3.1%) 0.568 (-0.9%) 0.249 (-9.0%) 0.4825 -4.3%
Q4_K_M 50 0.685 (+5.3%) 0.594 (+3.7%) 0.346 (+26.4%) 0.5419 +11.8%
Q3_K_S 50 0.648 (-0.3%) 0.545 (-4.8%) 0.267 (-2.6%) 0.4869 -2.6%
Q2_K 50 0.550 (-15.4%) 0.475 (-17.1%) 0.159 (-42.0%) 0.3946 -24.9%

phi-2

Quant N BERTScore (vs Q8_0) Coherence (vs Q8_0) ROUGE-L (vs Q8_0) Key Avg vs Q8_0
Q8_0 50 0.767 (+0.0%) 0.792 (+0.0%) 0.513 (+0.0%) 0.6907 +0.0%
Q6_K 50 0.768 (+0.1%) 0.789 (-0.3%) 0.501 (-2.3%) 0.6861 -0.8%
Q5_K_M 50 0.760 (-1.0%) 0.790 (-0.3%) 0.504 (-1.8%) 0.6845 -1.0%
Q4_K_M 50 0.721 (-6.1%) 0.775 (-2.1%) 0.444 (-13.5%) 0.6465 -7.2%
Q3_K_S 50 0.739 (-3.8%) 0.778 (-1.7%) 0.435 (-15.2%) 0.6507 -6.9%
Q2_K 50 0.780 (+1.6%) 0.760 (-4.0%) 0.465 (-9.4%) 0.6680 -4.0%

qwen2.5-1.5b

Quant N BERTScore (vs Q8_0) Coherence (vs Q8_0) ROUGE-L (vs Q8_0) Key Avg vs Q8_0
Q8_0 50 0.790 (+0.0%) 0.743 (+0.0%) 0.426 (+0.0%) 0.6534 +0.0%
Q6_K 50 0.726 (-8.2%) 0.715 (-3.8%) 0.357 (-16.2%) 0.5995 -9.4%
Q5_K_M 50 0.746 (-5.6%) 0.722 (-2.9%) 0.373 (-12.4%) 0.6138 -7.0%
Q4_K_M 50 0.719 (-9.0%) 0.720 (-3.2%) 0.326 (-23.5%) 0.5884 -11.9%
Q3_K_S 50 0.735 (-7.0%) 0.722 (-2.8%) 0.364 (-14.7%) 0.6070 -8.2%
Q2_K 50 0.561 (-29.1%) 0.560 (-24.7%) 0.166 (-61.1%) 0.4288 -38.3%

3.2 Performance

Wall-clock throughput (includes HTTP overhead):

Model Quant tok/s (mean) Speedup vs Q8_0
llama3.2-1b Q8_0 126.0 1.00x
llama3.2-1b Q4_K_M 134.4 1.07x
llama3.2-1b Q2_K 143.2 1.14x
phi-2 Q8_0 68.7 1.00x
phi-2 Q4_K_M 76.2 1.11x
phi-2 Q2_K 57.6 0.84x
qwen2.5-1.5b Q8_0 80.1 1.00x
qwen2.5-1.5b Q4_K_M 88.0 1.10x
qwen2.5-1.5b Q2_K 108.5 1.35x

3.3 Statistical Tests

Only 5 of 45 pairwise t-tests reached significance -- all at the Q3_K_S-to-Q2_K boundary. No test between adjacent quant levels above Q3_K_S was significant, consistent with the finding that quality differences between Q8_0 and Q4_K_M are below measurement resolution at N=50.

3.4 Key Phase 1 Conclusions

  1. Quality is broadly stable Q8_0 through Q5_K_M for all 3 models.
  2. Q2_K is catastrophic for llama3.2-1b (-24.9%) and qwen2.5-1.5b (-38.3%).
  3. phi-2 is most robust -- worst quant (Q4_K_M) loses only -7.2%.
  4. Non-monotonic quality observed (e.g., llama3.2-1b Q4_K_M > Q8_0) -- likely measurement noise at N=50.

4. Phase 1 Limitations & Lessons

Phase 1 identified several critical limitations that Phase 2 was designed to address:

Limitation Impact Phase 2 Fix
Wall-clock timing (CV 37-42%) Cannot reliably compare quant-level throughput Native Ollama eval_duration
Base-vs-instruct confound TR124 FP16 baselines were base models; Ollama uses instruct FP16 Ollama baselines (same instruct model)
No benchmark accuracy Generation metrics are noisy proxies for knowledge Real MMLU (285) + ARC-Challenge (200)
Binary quality threshold (-10% safe/unsafe) Too coarse for deployment decisions 4-tier system (negligible/acceptable/concerning/unacceptable)
Small sample size (N=50, 10/task) Cannot detect effects <10% at 80% power N=485 benchmark + N=250 generation per variant
3 models only No data above 2.7B parameters 5 models spanning 1.2B-8B
No TTFT measurement Cannot evaluate prompt processing latency Native prompt_eval_duration
No FP16 quant level Cannot measure FP16-to-Q8_0 delta FP16 Ollama included for 4 of 5 models

5. Phase 2 Methodology & Design

5.1 Research Questions

Phase 2 addresses the same 6 research questions from §1.1, with sufficient statistical power and benchmark anchoring to produce actionable answers.

5.2 Benchmark Matrix

Dimension Values
Models llama3.2-1b (1.2B), qwen2.5-1.5b (1.5B), phi-2 (2.7B), llama3.2-3b (3.2B), llama3.1-8b (8B)
Quant levels FP16, Q8_0, Q6_K, Q5_K_M, Q4_K_M, Q3_K_S, Q2_K
Model variants 34 (4 models x 7 levels + 1 model x 6 levels)
Benchmark tasks MMLU (285 real questions), ARC-Challenge (200 real questions)
Generation tasks summarization, QA, code_generation, creative_writing, classification (50 samples each)
Temperature 0.0 (greedy)
Max new tokens 256
Repetitions 1 (deterministic at temp=0; validated by TR124 Phase 3)
Seed 42
Warmup 2 runs per model variant

Sample counts:

Category Per Variant Total (34 variants)
MMLU 285 9,690
ARC-Challenge 200 6,800
Generation (5 tasks x 50) 250 8,500
Total 735 24,990

5.3 Real Benchmark Data

Unlike Phase 1 (which had no benchmarks) and TR124 Phase 1 (which used loglikelihood ranking), Phase 2 uses generation-based scoring on real benchmark questions:

  • MMLU: 285 questions from cais/mmlu on HuggingFace, 5 per subject across 57 subjects. Model generates an answer letter; scored via regex extraction.
  • ARC-Challenge: 200 questions from allenai/ai2_arc (Challenge subset). Same generation-based scoring.

Generation-based scoring is necessary because Ollama does not expose loglikelihood computation. The rescoring pipeline extracts answer letters from model output using regex patterns (e.g., "B", "B)", "The answer is B", "Answer: B"), reducing formatting noise.

5.4 Quality Tier System

See Metric Definitions section above. The tier system was designed with the power analysis in mind: the "negligible" threshold (-3pp) sits well below the benchmark detection limit (9.0pp MDE at 80% power), so deltas in this range are small point estimates that may not be statistically distinguishable from zero.

5.5 Native Timing Methodology

Phase 2 extracts timing from Ollama's backend_metadata response:

  • eval_duration: Time spent generating tokens (excludes prompt processing). This is the "native" decode time.
  • prompt_eval_duration: Time spent processing the prompt (TTFT proxy).
  • Native tok/s: num_tokens_generated / (eval_duration_ms / 1000).
  • Wall-clock tok/s: num_tokens_generated / (generation_time_ms / 1000) (includes HTTP overhead).
  • HTTP overhead %: (native_tok_per_s / wall_tok_per_s - 1) * 100.

A framework patch was applied to serialize backend_metadata through SampleRecord (aggregator.py + runner.py), which was missing in Phase 1.
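The timing derivations above can be expressed directly. A sketch using the formulas as stated (field names follow this section, not necessarily the framework's serialized schema):

```python
def timing_metrics(num_tokens: int, eval_duration_ms: float,
                   generation_time_ms: float) -> dict:
    """Derive native tok/s, wall-clock tok/s, and HTTP overhead %.

    eval_duration_ms: Ollama-native decode time (prompt processing excluded).
    generation_time_ms: client-side wall-clock time (HTTP overhead included).
    """
    native = num_tokens / (eval_duration_ms / 1000)
    wall = num_tokens / (generation_time_ms / 1000)
    return {
        "native_tok_per_s": native,
        "wall_tok_per_s": wall,
        "http_overhead_pct": (native / wall - 1) * 100,
    }
```

For example, 100 tokens decoded in 500 ms but delivered in 1,500 ms wall-clock is 200 native tok/s vs ~67 wall tok/s -- a 200% HTTP overhead, inside the 190-920% range reported here.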

5.6 Statistical Methods

  • Pairwise Welch's t-tests: Between adjacent quant levels on benchmark accuracy (rescored, binary 0/1) and generation metrics (bertscore, coherence, rouge_l, continuous). Alpha = 0.05 uncorrected. See "Statistical Methods & Caveats" section for discussion of multiple comparison correction, t-tests on binary data, and equivalence testing limitations.
  • Power analysis: Normal approximation for minimum detectable effect (MDE) at alpha=0.05, power=0.80. Benchmark MDE = 9.0pp (worst case). Generation MDE = d=0.251.
  • Quality classification: Tiered based on the worse of benchmark delta (pp) and generation delta (%). Note: "negligible" and "acceptable" tiers are below the benchmark MDE -- see §16.2 for implications.
  • Cross-phase validation: Phase 1 Q8_0 vs Phase 2 Q8_0 on overlapping models, < 5% difference threshold.
  • Confidence intervals: Wilson CIs shown in benchmark tables (§8). Generation CIs available in raw analysis data. See §15.5 for TOST equivalence testing.
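The 9.0pp benchmark MDE quoted above can be reproduced with the standard normal-approximation formula for the difference of two proportions, under the worst-case p = 0.5 assumption (a sketch; the report's exact computation may differ):

```python
from math import sqrt

def mde_two_proportions(n: int, p: float = 0.5,
                        z_alpha: float = 1.96, z_beta: float = 0.8416) -> float:
    """Minimum detectable difference between two proportions (normal approx.).

    Defaults: two-sided alpha = 0.05 (z = 1.96), power = 0.80 (z = 0.8416),
    worst-case variance at p = 0.5. n is the per-group sample size.
    """
    return (z_alpha + z_beta) * sqrt(2 * p * (1 - p) / n)
```

At n = 485 this gives ~0.090 (9.0pp), matching the stated benchmark MDE; detecting a 3pp "negligible" threshold at the same power would require several thousand questions per variant.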

5.7 Config

experiment: tr125_phase2_full_matrix
models: 34 variants (5 base models x 6-7 quant levels)
backends: [ollama]
tasks: [summarization, qa, code_generation, creative_writing, classification, mmlu_real, arc_challenge]
temperature: 0.0
max_new_tokens: 256
repetitions: 1
seed: 42

6. Environment & Artifacts

6.1 Environment

  • OS: Windows 11 Home 10.0.26200
  • Python: 3.13
  • CPU: 13th Gen Intel Core i9-13980HX
  • GPU: NVIDIA GeForce RTX 4080 Laptop GPU (12,282 MB VRAM, CC 8.9)
  • Ollama: Local HTTP API (http://localhost:11434)
  • BERTScore model: microsoft/deberta-xlarge-mnli
  • Coherence model: sentence-transformers/all-mpnet-base-v2

6.2 Key Artifacts

Artifact Path Description
Phase 1 samples results/eval/tr125/20260220_203010/samples.jsonl 900 rows
Phase 1 report results/eval/tr125/20260220_203010/phase1_report.md Auto-generated
Phase 2 samples results/eval/tr125_phase2/20260221_120035/samples.jsonl 24,990 rows
Phase 2 analysis results/eval/tr125_phase2/20260221_120035/phase2_analysis.json Full analysis data
Phase 2 report results/eval/tr125_phase2/20260221_120035/phase2_report.md Auto-generated
Phase 2 config research/tr125/phase2/config.yaml 34-variant matrix
Analysis code research/tr125/phase2/analyze.py 9-analysis pipeline
v2 enhancements results/eval/tr125_phase2/20260221_120035/phase2_v2_enhancements.json Wilson CIs, Bonferroni, TOST, per-task, outliers
v2 analysis code research/tr125/phase2/enhance_v2.py 10-analysis enhancement pipeline
Published report PublishReady/reports/Technical_Report_125.md This file

7. Model Lineup

7.1 Model Summary

Model Params Quant Levels Ollama Tag Pattern FP16 VRAM Est
llama3.2-1b 1.24B 7 (FP16-Q2_K) llama3.2:1b-instruct-{quant} 2.5 GB
qwen2.5-1.5b 1.54B 7 (FP16-Q2_K) qwen2.5:1.5b-instruct-{quant} 3.2 GB
phi-2 2.7B 7 (FP16-Q2_K) phi:2.7b-chat-v2-{quant} 5.5 GB
llama3.2-3b 3.21B 7 (FP16-Q2_K) llama3.2:3b-instruct-{quant} 6.6 GB
llama3.1-8b 8.03B 6 (Q8_0-Q2_K) llama3.1:8b-instruct-{quant} ~16 GB (exceeds VRAM)

7.2 Why These Models

  • llama3.2-1b (1.2B): Smallest viable model. Tests whether quantization is even meaningful at small scale (answer: yes, Q2_K destroys quality).
  • qwen2.5-1.5b (1.5B): Strong benchmark performer from TR124 (91% ARC-Easy). Tests whether high-benchmark models are more sensitive to quantization (answer: yes, Q2_K drops -40.6pp).
  • phi-2 (2.7B): Most quantization-robust in Phase 1. Tests whether this holds with benchmark anchoring (answer: yes, -1.8pp max at Q4_K_M).
  • llama3.2-3b (3.2B): Bridges the 1B-8B gap. Tests mid-range quantization behavior.
  • llama3.1-8b (8B): First model exceeding the 1-3B range. Tests whether larger models tolerate quantization better (answer: mixed -- -2.7pp at Q4_K_M is comparable to smaller models, but Q3_K_S shows only -2.5pp vs -10pp for llama3.2-3b).

7.3 FP16 Exclusion: llama3.1-8b

llama3.1-8b at FP16 requires ~16 GB VRAM, exceeding the RTX 4080's 12 GB. Q8_0 serves as the baseline for this model. The Q8_0-to-FP16 delta for other models is consistently small (max 1.6pp), validating Q8_0 as a near-equivalent baseline.
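The ~16 GB figure follows from a weights-only estimate (2 bytes per weight at FP16); the small excess in the SS7.1 estimates over this figure (e.g. 3.2 GB vs 3.1 GB for qwen2.5-1.5b) would be runtime overhead. A rough sketch under that assumption:

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weights-only memory footprint in GB (1 GB = 1e9 bytes here).
    Ignores KV-cache and runtime overhead -- a lower bound, not a spec."""
    return params_billion * bits_per_weight / 8

print(round(weights_gb(8.03, 16), 1))  # 16.1 -> exceeds the RTX 4080's 12 GB
print(round(weights_gb(8.03, 8), 1))   # 8.0  -> Q8_0 fits
```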


8. Benchmark Accuracy Analysis

This is the PRIMARY quality gate. Benchmark accuracy on real MMLU + ARC-Challenge questions, using rescored accuracy (regex letter extraction). All deltas in percentage points (pp) vs primary baseline (FP16 for 4 models, Q8_0 for llama3.1-8b).
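The rescoring step can be sketched as below. The report does not publish the exact pattern used in analyze.py, so the regex here (first standalone A-D letter) is an illustrative assumption:

```python
import re

def rescored_match(response: str, gold_letter: str) -> bool:
    """Hypothetical rescorer: credit the answer if the first standalone
    A-D letter in the response matches the gold letter. The actual regex
    in analyze.py may differ."""
    m = re.search(r"\b([A-D])\b", response.upper())
    return m is not None and m.group(1) == gold_letter.upper()

# Verbose answers that fail exact_match can still be credited:
print(rescored_match("The answer is B because the mitochondria...", "B"))  # True
```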

8.1 llama3.1-8b (baseline: Q8_0, 72.4% rescored)

Quant N Raw Acc Rescored Acc 95% Wilson CI MMLU ARC vs Q8_0 (pp) Tier
Q8_0 485 0.651 0.724 [0.682, 0.762] 0.698 0.760 +0.0 negligible
Q6_K 485 0.643 0.707 [0.665, 0.746] 0.684 0.740 -1.6 negligible
Q5_K_M 485 0.586 0.707 [0.665, 0.746] 0.681 0.745 -1.6 negligible
Q4_K_M 485 0.637 0.697 [0.654, 0.736] 0.677 0.725 -2.7 negligible
Q3_K_S 485 0.676 0.699 [0.657, 0.738] 0.663 0.750 -2.5 acceptable
Q2_K 485 0.489 0.581 [0.536, 0.624] 0.540 0.640 -14.2 unacceptable

Observation: llama3.1-8b is remarkably stable through Q3_K_S (-2.5pp). The cliff is at Q2_K (-14.2pp). This 8B model tolerates aggressive quantization better than the smaller llama3.2-3b. Wilson CI half-width: +/-4.0pp at N=485.
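The Wilson intervals in these tables can be reproduced in a few lines (a sketch; the pipeline's implementation in enhance_v2.py may differ in rounding details):

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple:
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - half, centre + half

lo, hi = wilson_ci(round(0.724 * 485), 485)  # Q8_0 rescored accuracy
print(round(lo, 3), round(hi, 3))            # 0.682 0.762
```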

8.2 llama3.2-1b (baseline: FP16, 38.4% rescored)

Quant N Raw Acc Rescored Acc 95% Wilson CI MMLU ARC vs FP16 (pp) Tier
FP16 485 0.365 0.384 [0.342, 0.429] 0.340 0.445 +0.0 negligible
Q8_0 485 0.375 0.396 [0.353, 0.441] 0.351 0.460 +1.2 negligible
Q6_K 485 0.365 0.384 [0.342, 0.429] 0.344 0.440 +0.0 negligible
Q5_K_M 485 0.373 0.386 [0.344, 0.431] 0.330 0.465 +0.2 negligible
Q4_K_M 485 0.348 0.361 [0.320, 0.405] 0.337 0.395 -2.3 negligible
Q3_K_S 485 0.256 0.289 [0.250, 0.331] 0.319 0.245 -9.5 concerning
Q2_K 485 0.101 0.223 [0.188, 0.263] 0.193 0.265 -16.1 unacceptable

Observation: Quality is stable FP16 through Q4_K_M (max -2.3pp). The cliff hits at Q3_K_S (-9.5pp), with Q2_K near-random on ARC (4% raw accuracy, 26.5% rescored). Q8_0 slightly exceeds FP16 (+1.2pp) -- within measurement noise. Note the CIs for FP16 [0.342, 0.429] and Q4_K_M [0.320, 0.405] overlap substantially, confirming the -2.3pp delta is within noise.

8.3 llama3.2-3b (baseline: FP16, 64.1% rescored)

Quant N Raw Acc Rescored Acc 95% Wilson CI MMLU ARC vs FP16 (pp) Tier
FP16 485 0.612 0.641 [0.598, 0.683] 0.590 0.715 +0.0 negligible
Q8_0 485 0.612 0.645 [0.602, 0.687] 0.593 0.720 +0.4 negligible
Q6_K 485 0.594 0.635 [0.591, 0.677] 0.579 0.715 -0.6 negligible
Q5_K_M 485 0.602 0.633 [0.589, 0.675] 0.575 0.715 -0.8 negligible
Q4_K_M 485 0.612 0.637 [0.593, 0.679] 0.590 0.705 -0.4 negligible
Q3_K_S 485 0.466 0.540 [0.496, 0.584] 0.481 0.625 -10.1 unacceptable
Q2_K 485 0.458 0.511 [0.467, 0.555] 0.428 0.630 -13.0 unacceptable

Observation: Stable through Q4_K_M (-0.4pp). Dramatic cliff at Q3_K_S (-10.1pp). The 3B model is less tolerant of aggressive quantization than the 8B model (Q3_K_S: -10.1pp vs -2.5pp for 8B). CI overlap between FP16 [0.598, 0.683] and Q4_K_M [0.593, 0.679] is near-complete.

8.4 phi-2 (baseline: FP16, 59.4% rescored)

Quant N Raw Acc Rescored Acc 95% Wilson CI MMLU ARC vs FP16 (pp) Tier
FP16 485 0.262 0.594 [0.549, 0.637] 0.530 0.685 +0.0 negligible
Q8_0 485 0.268 0.577 [0.533, 0.621] 0.516 0.665 -1.6 negligible
Q6_K 485 0.247 0.581 [0.537, 0.625] 0.516 0.675 -1.2 negligible
Q5_K_M 485 0.254 0.600 [0.556, 0.643] 0.537 0.690 +0.6 negligible
Q4_K_M 485 0.252 0.575 [0.531, 0.619] 0.505 0.675 -1.8 negligible
Q3_K_S 485 0.301 0.590 [0.545, 0.633] 0.547 0.650 -0.4 acceptable
Q2_K 485 0.179 0.480 [0.436, 0.525] 0.460 0.510 -11.3 unacceptable

Observation: phi-2 is the most quantization-robust model. All levels Q3_K_S and above are within -1.8pp of FP16. Q5_K_M actually exceeds FP16 by +0.6pp (within noise). Only Q2_K breaks the pattern (-11.3pp). All CIs from FP16 through Q3_K_S overlap heavily -- no pair is statistically distinguishable.

Formatting issue: phi-2's raw accuracy (26%) is dramatically lower than rescored accuracy (59%). This is a formatting problem: phi-2 produces verbose answers ("The answer is B because...") that fail exact_match but contain the correct letter. Rescoring recovers the true knowledge level.

8.5 qwen2.5-1.5b (baseline: FP16, 65.2% rescored)

Quant N Raw Acc Rescored Acc 95% Wilson CI MMLU ARC vs FP16 (pp) Tier
FP16 485 0.472 0.652 [0.608, 0.693] 0.590 0.740 +0.0 negligible
Q8_0 485 0.425 0.639 [0.596, 0.681] 0.565 0.745 -1.2 negligible
Q6_K 485 0.497 0.639 [0.596, 0.681] 0.579 0.725 -1.2 negligible
Q5_K_M 485 0.320 0.647 [0.604, 0.689] 0.586 0.735 -0.4 negligible
Q4_K_M 485 0.487 0.610 [0.566, 0.653] 0.537 0.715 -4.1 acceptable
Q3_K_S 485 0.171 0.530 [0.485, 0.574] 0.453 0.640 -12.2 unacceptable
Q2_K 485 0.035 0.245 [0.209, 0.286] 0.239 0.255 -40.6 unacceptable

Observation: qwen2.5-1.5b is the most quantization-sensitive model. Q4_K_M already shows -4.1pp (acceptable but notable). Q3_K_S drops -12.2pp. Q2_K is catastrophic: -40.6pp, reducing a 65% model to 24.5% (near-random for 4-choice questions). Note: the FP16 CI [0.608, 0.693] and the Q4_K_M CI [0.566, 0.653] overlap only marginally -- the -4.1pp delta is approaching statistical distinguishability.

8.6 Accuracy Cliff Summary

Model Last Safe Level First Concerning Cliff Size (pp)
llama3.1-8b Q3_K_S (-2.5pp) Q2_K (-14.2pp) 11.7
llama3.2-1b Q4_K_M (-2.3pp) Q3_K_S (-9.5pp) 7.2
llama3.2-3b Q4_K_M (-0.4pp) Q3_K_S (-10.1pp) 9.7
phi-2 Q3_K_S (-0.4pp) Q2_K (-11.3pp) 10.9
qwen2.5-1.5b Q4_K_M (-4.1pp) Q3_K_S (-12.2pp) 8.1

Pattern: The cliff is universally sharp (7-12pp drop in one quant step). The cliff location varies by model: phi-2 and llama3.1-8b tolerate Q3_K_S; the other three models break there. All models break at Q2_K.

8.7 MMLU vs ARC Differential Analysis

ARC scores run higher than MMLU across nearly all variants, and the two benchmarks respond to quantization on different schedules: science reasoning (ARC) and broad knowledge recall (MMLU) do not degrade in lockstep. The differential (MMLU - ARC, in pp) reveals model-specific patterns:

Model Quant MMLU (%) ARC (%) MMLU - ARC (pp)
llama3.1-8b Q8_0 69.8 76.0 -6.2
llama3.1-8b Q4_K_M 67.7 72.5 -4.8
llama3.1-8b Q2_K 54.0 64.0 -10.0
llama3.2-1b FP16 34.0 44.5 -10.5
llama3.2-1b Q4_K_M 33.7 39.5 -5.8
llama3.2-1b Q3_K_S 31.9 24.5 +7.4
llama3.2-1b Q2_K 19.3 26.5 -7.2
llama3.2-3b FP16 59.0 71.5 -12.5
llama3.2-3b Q4_K_M 59.0 70.5 -11.5
llama3.2-3b Q2_K 42.8 63.0 -20.2
phi-2 FP16 53.0 68.5 -15.5
phi-2 Q4_K_M 50.5 67.5 -17.0
phi-2 Q2_K 46.0 51.0 -5.0
qwen2.5-1.5b FP16 59.0 74.0 -15.0
qwen2.5-1.5b Q4_K_M 53.7 71.5 -17.8
qwen2.5-1.5b Q2_K 23.9 25.5 -1.6

Key findings:

  1. ARC is generally more robust than MMLU to quantization. The differential widens as quantization becomes more aggressive for most models (e.g., llama3.2-3b: -12.5pp at FP16 to -20.2pp at Q2_K), meaning MMLU degrades faster.
  2. Anomaly: llama3.2-1b Q3_K_S shows MMLU > ARC by +7.4pp -- a reversal. ARC drops to 24.5% (below random baseline of 25%), suggesting catastrophic reasoning failure while knowledge recall partially survives.
  3. At Q2_K, differentials compress for qwen2.5-1.5b (-1.6pp) and phi-2 (-5.0pp) because both benchmarks collapse toward random baseline simultaneously.
  4. phi-2 shows the largest baseline differential (-15.5pp at FP16) but the most stable differential across quant levels, consistent with its overall quantization robustness.

9. Generation Quality Analysis

This is the SECONDARY quality signal. Generation quality on hand-crafted tasks (summarization, QA, code_generation, creative_writing, classification). Delta vs primary baseline (FP16 or Q8_0).

9.1 llama3.1-8b (baseline: Q8_0)

Quant N BERTScore (vs Q8_0) Coherence (vs Q8_0) ROUGE-L (vs Q8_0) Key Avg vs Q8_0
Q8_0 250 0.800 (+0.0%) 0.668 (+0.0%) 0.492 (+0.0%) 0.6533 +0.0%
Q6_K 250 0.795 (-0.5%) 0.661 (-1.0%) 0.485 (-1.5%) 0.6470 -1.0%
Q5_K_M 250 0.811 (+1.4%) 0.668 (+0.1%) 0.504 (+2.3%) 0.6611 +1.3%
Q4_K_M 250 0.799 (-0.1%) 0.683 (+2.3%) 0.494 (+0.3%) 0.6587 +0.8%
Q3_K_S 250 0.782 (-2.3%) 0.640 (-4.2%) 0.467 (-5.1%) 0.6296 -3.8%
Q2_K 250 0.769 (-3.9%) 0.641 (-3.9%) 0.443 (-10.0%) 0.6178 -5.9%

9.2 llama3.2-1b (baseline: FP16)

Quant N BERTScore (vs FP16) Coherence (vs FP16) ROUGE-L (vs FP16) Key Avg vs FP16
FP16 250 0.646 (+0.0%) 0.580 (+0.0%) 0.266 (+0.0%) 0.4973 +0.0%
Q8_0 250 0.644 (-0.2%) 0.578 (-0.3%) 0.266 (-0.1%) 0.4960 -0.2%
Q6_K 250 0.641 (-0.7%) 0.578 (-0.3%) 0.269 (+1.1%) 0.4962 +0.0%
Q5_K_M 250 0.639 (-1.0%) 0.572 (-1.2%) 0.259 (-2.8%) 0.4902 -1.7%
Q4_K_M 250 0.665 (+2.9%) 0.581 (+0.2%) 0.297 (+11.6%) 0.5143 +4.9%
Q3_K_S 250 0.656 (+1.5%) 0.557 (-4.0%) 0.266 (-0.1%) 0.4928 -0.8%
Q2_K 250 0.550 (-14.9%) 0.493 (-15.0%) 0.159 (-40.1%) 0.4006 -23.4%

9.3 llama3.2-3b (baseline: FP16)

Quant N BERTScore (vs FP16) Coherence (vs FP16) ROUGE-L (vs FP16) Key Avg vs FP16
FP16 250 0.767 (+0.0%) 0.661 (+0.0%) 0.469 (+0.0%) 0.6324 +0.0%
Q8_0 250 0.766 (-0.2%) 0.660 (-0.1%) 0.470 (+0.2%) 0.6319 -0.1%
Q6_K 250 0.768 (+0.0%) 0.662 (+0.2%) 0.473 (+0.9%) 0.6342 +0.4%
Q5_K_M 250 0.762 (-0.7%) 0.651 (-1.5%) 0.460 (-1.9%) 0.6242 -1.4%
Q4_K_M 250 0.759 (-1.2%) 0.650 (-1.6%) 0.454 (-3.2%) 0.6209 -2.0%
Q3_K_S 250 0.728 (-5.1%) 0.573 (-13.2%) 0.432 (-7.9%) 0.5778 -8.8%
Q2_K 250 0.765 (-0.3%) 0.621 (-6.0%) 0.433 (-7.5%) 0.6066 -4.6%

9.4 phi-2 (baseline: FP16)

Quant N BERTScore (vs FP16) Coherence (vs FP16) ROUGE-L (vs FP16) Key Avg vs FP16
FP16 250 0.715 (+0.0%) 0.771 (+0.0%) 0.412 (+0.0%) 0.6325 +0.0%
Q8_0 250 0.723 (+1.2%) 0.765 (-0.7%) 0.418 (+1.4%) 0.6354 +0.6%
Q6_K 250 0.725 (+1.4%) 0.766 (-0.6%) 0.416 (+1.0%) 0.6357 +0.6%
Q5_K_M 250 0.725 (+1.4%) 0.767 (-0.4%) 0.427 (+3.8%) 0.6400 +1.6%
Q4_K_M 250 0.721 (+0.8%) 0.762 (-1.1%) 0.405 (-1.6%) 0.6295 -0.6%
Q3_K_S 250 0.710 (-0.7%) 0.742 (-3.8%) 0.379 (-7.9%) 0.6104 -4.1%
Q2_K 250 0.742 (+3.7%) 0.722 (-6.3%) 0.399 (-3.2%) 0.6208 -1.9%

9.5 qwen2.5-1.5b (baseline: FP16)

Quant N BERTScore (vs FP16) Coherence (vs FP16) ROUGE-L (vs FP16) Key Avg vs FP16
FP16 250 0.744 (+0.0%) 0.713 (+0.0%) 0.383 (+0.0%) 0.6133 +0.0%
Q8_0 250 0.745 (+0.2%) 0.710 (-0.4%) 0.381 (-0.6%) 0.6121 -0.3%
Q6_K 250 0.730 (-1.9%) 0.705 (-1.1%) 0.367 (-4.1%) 0.6007 -2.4%
Q5_K_M 250 0.736 (-1.0%) 0.711 (-0.2%) 0.366 (-4.4%) 0.6046 -1.9%
Q4_K_M 250 0.718 (-3.5%) 0.697 (-2.2%) 0.349 (-8.9%) 0.5880 -4.9%
Q3_K_S 250 0.726 (-2.4%) 0.706 (-1.0%) 0.355 (-7.3%) 0.5958 -3.5%
Q2_K 250 0.602 (-19.1%) 0.576 (-19.2%) 0.200 (-47.9%) 0.4591 -28.7%

9.6 Generation vs Benchmark Agreement

Generation metrics and benchmark accuracy largely agree on tier classification, but with notable exceptions:

  • llama3.2-3b Q2_K: Benchmark shows -13.0pp (unacceptable), but generation shows only -4.6% (acceptable). The benchmark is the more trustworthy signal.
  • phi-2 Q2_K: Benchmark shows -11.3pp (unacceptable), but generation shows only -1.9% (negligible). Again, benchmark is primary.
  • llama3.2-1b Q4_K_M: Benchmark shows -2.3pp (negligible), but generation shows +4.9% (an improvement). This is noise at N=250.

Conclusion: Benchmark accuracy is a stricter quality gate than generation metrics. All tier classifications use the worse of the two signals.

9.6a Generation Quality Confidence Intervals

The benchmark tables (SS8.1-8.5) include Wilson CIs. Generation quality CIs are available in phase2_analysis.json and phase2_v2_enhancements.json per metric per variant. At N=250, the typical 95% CI half-widths are:

Metric Typical CI Half-Width (range across variants) Variance Note
BERTScore +/-0.013-0.022 Narrowest (least variance)
Coherence +/-0.025-0.040 Moderate
ROUGE-L +/-0.030-0.055 Widest (highest variance)
Key Metric Avg +/-0.015-0.030 Composite of above

Example (llama3.1-8b Q8_0): BERTScore 0.800 [0.779, 0.821], coherence 0.668 [0.639, 0.696], ROUGE-L 0.492 [0.440, 0.545]. The key metric avg (0.653) has a CI of approximately [0.630, 0.676].

Implication: Generation deltas <2% are within CI overlap for most metrics. Deltas >5% (e.g., llama3.2-1b Q2_K at -23.4%) are well outside CIs and represent genuine quality loss. The 6 variants that pass TOST at +/-5% (SS15.5) are those where the CIs are tight enough to confirm equivalence.
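The TOST equivalence checks follow the standard two-one-sided-tests recipe. A normal-approximation sketch, assuming the margin is expressed in the metric's own units (the pipeline's exact test statistic is not shown in this report):

```python
from statistics import NormalDist

def tost_p(diff_mean: float, diff_se: float, margin: float) -> float:
    """Equivalence p-value via two one-sided z-tests: both must reject
    'the true difference exceeds +/-margin' for equivalence to hold."""
    nd = NormalDist()
    p_lower = 1 - nd.cdf((diff_mean + margin) / diff_se)  # H0: diff <= -margin
    p_upper = nd.cdf((diff_mean - margin) / diff_se)      # H0: diff >= +margin
    return max(p_lower, p_upper)

# A tiny observed difference with a tight SE passes at a +/-5% margin:
print(tost_p(0.004, 0.010, 0.05) < 0.05)  # True
```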

9.7 Supplementary Metrics (BLEU, Repetition, Output Length, Exact Match)

The analysis computes 7 generation metrics but SS9.1-9.5 show only the 3 key metrics (BERTScore, coherence, ROUGE-L). The 4 discarded metrics reveal additional degradation signals, particularly repetition collapse at Q2_K:

Model Quant BLEU Repetition Output Length Exact Match BLEU delta Rep delta
llama3.1-8b Q8_0 0.048 0.999 0.362 0.000 +0.0% +0.0%
llama3.1-8b Q2_K 0.035 0.991 0.262 0.000 -26.9% -0.7%
llama3.2-1b FP16 0.027 0.996 0.335 0.000 +0.0% +0.0%
llama3.2-1b Q2_K 0.015 0.942 0.174 0.000 -44.2% -5.5%
llama3.2-3b FP16 0.045 0.999 0.373 0.000 +0.0% +0.0%
llama3.2-3b Q2_K 0.035 0.988 0.266 0.000 -22.5% -1.1%
phi-2 FP16 0.050 0.992 0.394 0.000 +0.0% +0.0%
phi-2 Q2_K 0.049 0.985 0.363 0.000 -1.6% -0.7%
qwen2.5-1.5b FP16 0.051 0.992 0.469 0.000 +0.0% +0.0%
qwen2.5-1.5b Q2_K 0.008 0.702 0.340 0.080 -84.8% -29.2%

Repetition collapse: qwen2.5-1.5b Q2_K drops to repetition = 0.702 (vs 0.992 baseline), indicating degenerate repetitive text where only 70% of 4-grams are unique. llama3.2-1b Q2_K shows milder repetition collapse (0.942, -5.5%). This signal is invisible in the key 3 metrics and represents a distinct failure mode: the model loops on phrases rather than producing coherent novel text.
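The unique-4-gram metric can be sketched as below; the report does not specify the tokenizer, so whitespace tokens are an assumption:

```python
def repetition_score(text: str, n: int = 4) -> float:
    """Fraction of unique n-grams (1.0 = no repetition). Whitespace
    tokenization is an assumption; the pipeline's tokenizer may differ."""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 1.0

print(repetition_score("the quick brown fox jumps"))          # 1.0
print(round(repetition_score("the model loops on phrases " * 5), 2))  # 0.23
```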

BLEU collapse: qwen2.5-1.5b Q2_K BLEU drops -84.8%, far exceeding the BERTScore delta (-19.1%). BLEU is more sensitive to exact n-gram overlap, making it a stronger signal for text degeneration than the embedding-based metrics.

Q3_K_S repetition is normal: Unlike Q2_K, no Q3_K_S variant shows repetition collapse. Repetition scores at Q3_K_S: llama3.1-8b 0.999 (+0.01%), llama3.2-1b 0.992 (-0.48%), llama3.2-3b 0.998 (-0.13%), phi-2 0.993 (+0.09%), qwen2.5-1.5b 0.987 (-0.54%). All values remain above 0.98, confirming that repetition collapse is a Q2_K-specific phenomenon.

9.8 Per-Task Generation Quality

Quality varies by task type. The tables below break generation quality down by task for selected variants; the per-task computation note after the tables explains which metrics each task uses.

qwen2.5-1.5b (most quantization-sensitive):

Task FP16 Q4_K_M Q2_K FP16 -> Q2_K
summarization 0.734 0.726 0.640 -12.9%
qa 0.585 0.524 0.339 -42.1%
code_generation 0.740 0.742 0.662 -10.5%
creative_writing 0.430 0.421 0.417 -3.0%
classification 0.834 0.829 0.538 -35.5%

llama3.2-1b (small model):

Task FP16 Q4_K_M Q2_K FP16 -> Q2_K
summarization 0.482 0.627 0.380 -21.1%
qa 0.369 0.475 0.248 -32.9%
code_generation 0.588 0.804 0.461 -21.6%
creative_writing 0.414 0.446 0.365 -11.7%
classification 0.699 0.339 0.549 -21.4%

Per-task values: summarization and qa show the (BERTScore + coherence + ROUGE-L) / 3 average. code_generation, creative_writing, and classification show coherence only (BERTScore not computed for these tasks). See per-task computation notes in phase2_v2_enhancements.json.

Key finding: QA and classification are the most quantization-sensitive generation tasks (30-42% degradation at Q2_K), while creative_writing is the most robust (3-12%). This suggests quantization degrades factual knowledge retrieval more than open-ended generation -- consistent with the benchmark accuracy findings.


10. Native Performance

Decode throughput from Ollama-native eval_duration (no HTTP overhead). Wall-clock shown for comparison.
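Native throughput derives from the `eval_count` and `eval_duration` fields of Ollama's `/api/generate` response (durations are in nanoseconds); TTFT in SS11 uses `prompt_eval_duration`. A minimal sketch:

```python
def native_metrics(resp: dict) -> dict:
    """Decode tok/s and TTFT (ms) from Ollama's native timing fields.
    eval_duration and prompt_eval_duration are nanoseconds."""
    return {
        "decode_tok_per_s": resp["eval_count"] / (resp["eval_duration"] / 1e9),
        "ttft_ms": resp["prompt_eval_duration"] / 1e6,
    }

m = native_metrics({"eval_count": 123, "eval_duration": 2_500_000_000,
                    "prompt_eval_duration": 25_000_000})
print(m)  # {'decode_tok_per_s': 49.2, 'ttft_ms': 25.0}
```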

10.1 llama3.1-8b

Quant Native tok/s CV% Wall tok/s Overhead Speedup vs Q8_0
Q8_0 49.2 26 6.2 690% 1.00x
Q6_K 75.6 33 9.1 731% 1.54x
Q5_K_M 88.8 24 11.5 671% 1.80x
Q4_K_M 96.7 23 13.6 611% 1.96x
Q3_K_S 153.2 25 22.0 598% 3.11x
Q2_K 133.0 42 18.8 606% 2.70x

10.2 llama3.2-1b

Quant Native tok/s CV% Wall tok/s Overhead Speedup vs FP16
FP16 185.2 29 30.7 504% 1.00x
Q8_0 290.9 39 40.7 615% 1.57x
Q6_K 335.2 41 44.8 648% 1.81x
Q5_K_M 368.2 42 45.7 705% 1.99x
Q4_K_M 280.9 68 45.6 516% 1.52x
Q3_K_S 332.7 59 48.3 588% 1.80x
Q2_K 479.9 11 102.8 367% 2.59x

10.3 llama3.2-3b

Quant Native tok/s CV% Wall tok/s Overhead Speedup vs FP16
FP16 99.0 26 9.7 919% 1.00x
Q8_0 163.5 29 20.6 695% 1.65x
Q6_K 117.4 29 19.8 492% 1.19x
Q5_K_M 133.4 31 22.3 499% 1.35x
Q4_K_M 141.0 35 23.4 503% 1.42x
Q3_K_S 133.1 39 25.8 416% 1.34x
Q2_K 244.3 32 33.4 632% 2.47x

10.4 phi-2

Quant Native tok/s CV% Wall tok/s Overhead Speedup vs FP16
FP16 66.1 17 10.7 516% 1.00x
Q8_0 113.4 16 39.3 188% 1.72x
Q6_K 169.4 18 48.3 250% 2.56x
Q5_K_M 180.3 18 49.7 263% 2.73x
Q4_K_M 198.4 19 53.5 271% 3.00x
Q3_K_S 205.3 19 56.2 266% 3.11x
Q2_K 245.6 10 56.9 332% 3.72x

10.5 qwen2.5-1.5b

Quant Native tok/s CV% Wall tok/s Overhead Speedup vs FP16
FP16 158.0 27 25.1 528% 1.00x
Q8_0 267.7 25 33.7 694% 1.69x
Q6_K 267.3 28 36.4 634% 1.69x
Q5_K_M 288.3 26 40.8 606% 1.83x
Q4_K_M 299.0 31 39.6 655% 1.89x
Q3_K_S 256.7 32 49.2 422% 1.62x
Q2_K 378.5 14 109.6 245% 2.40x

10.6 Performance Observations

  1. phi-2 has the lowest CV (16-19% native), making it the most stable model for throughput measurement.
  2. HTTP overhead ranges from 188% to 919%. The overhead is not constant -- it varies by model size and quant level, making wall-clock timing unreliable for relative comparisons.
  3. Q2_K is often the fastest in raw tok/s, but quality is destroyed. Speed without quality is useless.
  4. llama3.1-8b at Q3_K_S achieves 3.11x speedup vs Q8_0 while maintaining acceptable quality (-2.5pp). This is the best speed-quality trade-off for the 8B model.
  5. Throughput scales inversely with bits-per-weight, as expected. The relationship is roughly linear for phi-2 (which has the most VRAM headroom) but sublinear for models that stress VRAM.

10.7 Timing Outlier Analysis (IQR Method)

Outlier detection using the IQR method (k=1.5) on raw per-sample timing data from samples.jsonl. The high outlier rates reflect the skewed nature of HTTP-mediated timing distributions, not measurement errors.

Model TTFT Outliers TTFT % Decode Outliers Decode %
llama3.1-8b (all quants) 539/4410 12.2% 767/4410 17.4%
llama3.2-1b (all quants) 375/5143 7.3% 696/5143 13.5%
llama3.2-3b (all quants) 537/5145 10.4% 1115/5145 21.7%
phi-2 (all quants) 528/5145 10.3% 1030/5145 20.0%
qwen2.5-1.5b (all quants) 300/5145 5.8% 879/5145 17.1%
Total 2279/24988 9.1% 4487/24988 18.0%

Interpretation: The ~9% TTFT outlier rate and ~18% decode outlier rate are substantially higher than TR126's 0.0-0.9% (which measured CUDA-event latencies in Docker). The difference is explained by: (1) Ollama HTTP overhead adds stochastic latency spikes, (2) the IQR method is sensitive to skewed distributions (timing is right-tailed), (3) Ollama's server-side scheduling introduces jitter absent in direct GPU measurement. The mean/median performance values in SS10.1-10.5 are robust to these outliers; percentile-based metrics (p95, p99) would be affected. Trimmed means were not computed.
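The IQR flagging can be reproduced as below (a sketch; `statistics.quantiles` uses the exclusive method by default, and the pipeline's quartile convention may differ slightly):

```python
import statistics

def iqr_outliers(values: list, k: float = 1.5) -> list:
    """Values outside the Tukey fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    lo, hi = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if v < lo or v > hi]

# Right-tailed timing data: the slow spike gets flagged.
print(iqr_outliers([21.0, 22.0, 23.0, 24.0, 25.0, 180.0]))  # [180.0]
```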


11. TTFT Analysis

Time-to-first-token (prompt evaluation latency) from Ollama-native prompt_eval_duration. TTFT reflects prompt processing speed and is relevant for interactive applications.

11.1 Full TTFT Table (all 34 variants)

Model Quant TTFT Mean (ms) TTFT Median (ms) TTFT Std (ms) CV% Notable
llama3.1-8b Q8_0 1845 1943 288 16 Highest TTFT -- 8B at 8-bit
llama3.1-8b Q6_K 1402 1481 237 17
llama3.1-8b Q5_K_M 911 964 255 28
llama3.1-8b Q4_K_M 1046 1183 358 34
llama3.1-8b Q3_K_S 25 23 10 39 42x drop from Q4_K_M
llama3.1-8b Q2_K 41 39 18 45
llama3.2-1b FP16 29 15 135 468 Extreme variance
llama3.2-1b Q8_0 17 11 16 91
llama3.2-1b Q6_K 17 11 15 85
llama3.2-1b Q5_K_M 16 11 13 81
llama3.2-1b Q4_K_M 24 14 17 73
llama3.2-1b Q3_K_S 18 10 15 80
llama3.2-1b Q2_K 8 8 3 30 Fastest measured
llama3.2-3b FP16 1369 1475 237 17
llama3.2-3b Q8_0 20 17 7 34 68x drop from FP16
llama3.2-3b Q6_K 42 37 23 54
llama3.2-3b Q5_K_M 41 35 24 59
llama3.2-3b Q4_K_M 40 33 26 64
llama3.2-3b Q3_K_S 36 31 23 64
llama3.2-3b Q2_K 30 18 160 537 Extreme variance
phi-2 FP16 1485 1584 307 21
phi-2 Q8_0 27 27 12 44 55x drop from FP16
phi-2 Q6_K 16 14 6 36
phi-2 Q5_K_M 15 14 5 36
phi-2 Q4_K_M 15 13 5 36
phi-2 Q3_K_S 14 13 5 33
phi-2 Q2_K 14 13 5 33
qwen2.5-1.5b FP16 20 18 6 29 Fast even at FP16
qwen2.5-1.5b Q8_0 16 14 7 43
qwen2.5-1.5b Q6_K 20 18 12 58
qwen2.5-1.5b Q5_K_M 18 16 10 53
qwen2.5-1.5b Q4_K_M 20 17 11 55
qwen2.5-1.5b Q3_K_S 18 16 7 41
qwen2.5-1.5b Q2_K 16 12 85 529 Extreme variance

11.2 TTFT Observations

  1. FP16 TTFT is high for models >2B: llama3.2-3b and phi-2 show 1.4-1.5s TTFT at FP16, driven by prompt processing on the full-precision model. Quantized variants drop to <50ms.
  2. llama3.1-8b Q8_0 has the worst TTFT (1.8s mean). This is expected -- processing 8B parameters at 8-bit precision is memory-bandwidth-limited. TTFT remains high through Q4_K_M (1046ms), then drops 42x to Q3_K_S (25ms).
  3. The TTFT drop from Q4_K_M to Q3_K_S for llama3.1-8b is dramatic: 1046ms to 25ms. This suggests the model transitions from partial GPU offloading to fully GPU-resident at Q3_K_S, consistent with VRAM estimate dropping below GPU capacity at 3-bit.
  4. Small models (1-1.5B) have low TTFT across all quant levels (median <20ms). qwen2.5-1.5b and llama3.2-1b show median TTFT consistently under 18ms.
  5. Extreme variance on some variants: llama3.2-1b FP16 (CV=468%), llama3.2-3b Q2_K (CV=537%), and qwen2.5-1.5b Q2_K (CV=529%) show occasional TTFT spikes, likely from model loading/unloading events or Ollama server contention.

12. Cost Analysis

Hardware cost: $0.035/hr (RTX 4080 tier, from TR123). Cost = hourly_rate / (native_tok_per_s x 3600) x 1M. Savings measured vs primary baseline (FP16 or Q8_0).
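The stated formula reproduces the table values directly:

```python
def cost_per_million_tokens(hourly_rate_usd: float, tok_per_s: float) -> float:
    """$ per 1M generated tokens at a given sustained throughput."""
    return hourly_rate_usd / (tok_per_s * 3600) * 1_000_000

print(round(cost_per_million_tokens(0.035, 49.2), 4))   # 0.1976 (llama3.1-8b Q8_0)
print(round(cost_per_million_tokens(0.035, 479.9), 4))  # 0.0203 (llama3.2-1b Q2_K)
```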

12.1 Full Cost Table

Model Quant Native tok/s $/1M Tokens Baseline $/1M Savings vs Baseline
llama3.1-8b Q8_0 49.2 $0.1976 $0.1976 +0% vs Q8_0
llama3.1-8b Q6_K 75.6 $0.1287 $0.1976 +35% vs Q8_0
llama3.1-8b Q5_K_M 88.8 $0.1095 $0.1976 +45% vs Q8_0
llama3.1-8b Q4_K_M 96.7 $0.1006 $0.1976 +49% vs Q8_0
llama3.1-8b Q3_K_S 153.2 $0.0634 $0.1976 +68% vs Q8_0
llama3.1-8b Q2_K 133.0 $0.0731 $0.1976 +63% vs Q8_0
llama3.2-1b FP16 185.2 $0.0525 $0.0525 +0% vs FP16
llama3.2-1b Q8_0 290.9 $0.0334 $0.0525 +36% vs FP16
llama3.2-1b Q6_K 335.2 $0.0290 $0.0525 +45% vs FP16
llama3.2-1b Q5_K_M 368.2 $0.0264 $0.0525 +50% vs FP16
llama3.2-1b Q4_K_M 280.9 $0.0346 $0.0525 +34% vs FP16
llama3.2-1b Q3_K_S 332.7 $0.0292 $0.0525 +44% vs FP16
llama3.2-1b Q2_K 479.9 $0.0203 $0.0525 +61% vs FP16
llama3.2-3b FP16 99.0 $0.0982 $0.0982 +0% vs FP16
llama3.2-3b Q8_0 163.5 $0.0595 $0.0982 +39% vs FP16
llama3.2-3b Q6_K 117.4 $0.0828 $0.0982 +16% vs FP16
llama3.2-3b Q5_K_M 133.4 $0.0729 $0.0982 +26% vs FP16
llama3.2-3b Q4_K_M 141.0 $0.0689 $0.0982 +30% vs FP16
llama3.2-3b Q3_K_S 133.1 $0.0730 $0.0982 +26% vs FP16
llama3.2-3b Q2_K 244.3 $0.0398 $0.0982 +60% vs FP16
phi-2 FP16 66.1 $0.1471 $0.1471 +0% vs FP16
phi-2 Q8_0 113.4 $0.0857 $0.1471 +42% vs FP16
phi-2 Q6_K 169.4 $0.0574 $0.1471 +61% vs FP16
phi-2 Q5_K_M 180.3 $0.0539 $0.1471 +63% vs FP16
phi-2 Q4_K_M 198.4 $0.0490 $0.1471 +67% vs FP16
phi-2 Q3_K_S 205.3 $0.0473 $0.1471 +68% vs FP16
phi-2 Q2_K 245.6 $0.0396 $0.1471 +73% vs FP16
qwen2.5-1.5b FP16 158.0 $0.0615 $0.0615 +0% vs FP16
qwen2.5-1.5b Q8_0 267.7 $0.0363 $0.0615 +41% vs FP16
qwen2.5-1.5b Q6_K 267.3 $0.0364 $0.0615 +41% vs FP16
qwen2.5-1.5b Q5_K_M 288.3 $0.0337 $0.0615 +45% vs FP16
qwen2.5-1.5b Q4_K_M 299.0 $0.0325 $0.0615 +47% vs FP16
qwen2.5-1.5b Q3_K_S 256.7 $0.0379 $0.0615 +38% vs FP16
qwen2.5-1.5b Q2_K 378.5 $0.0257 $0.0615 +58% vs FP16

12.2 Cost Observations

  1. 10x cost range: $0.0203/1M (llama3.2-1b Q2_K) to $0.1976/1M (llama3.1-8b Q8_0).
  2. Q4_K_M saves 30-67% vs FP16 for models with FP16 baselines. phi-2 benefits most (67%), llama3.2-3b least (30%); llama3.1-8b saves 49% vs Q8_0.
  3. The cheapest acceptable option is llama3.2-1b at Q5_K_M ($0.0264/1M, +0.2pp, negligible).
  4. The cheapest high-accuracy option is llama3.1-8b at Q4_K_M ($0.1006/1M, 69.7% accuracy).
  5. Q2_K is often not the cheapest per model -- llama3.1-8b Q3_K_S ($0.0634) is cheaper than Q2_K ($0.0731) because Q3_K_S achieves higher throughput on this hardware.

13. Decision Matrix

Per VRAM tier: which (model, quant) combinations fit and maintain quality? Quality tier determined by the worse of benchmark accuracy delta (pp) and generation quality delta (%). Recommended = fits VRAM AND tier is "negligible" or "acceptable."
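The "worse of the two signals" rule can be sketched as below. The thresholds (3/5/10) are inferred from the published tier assignments, not a confirmed spec, and quality gains are treated as zero loss:

```python
def quality_tier(bench_delta_pp: float, gen_delta_pct: float) -> str:
    """4-tier classification from the worse of benchmark (pp) and
    generation (%) deltas. Thresholds 3/5/10 are inferred, not confirmed."""
    loss = max(0.0, -min(bench_delta_pp, gen_delta_pct))
    if loss <= 3:
        return "negligible"
    if loss <= 5:
        return "acceptable"
    if loss <= 10:
        return "concerning"
    return "unacceptable"

print(quality_tier(-2.5, -3.8))  # acceptable  (llama3.1-8b Q3_K_S)
print(quality_tier(-9.5, -0.8))  # concerning  (llama3.2-1b Q3_K_S)
```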

13.1 2GB VRAM

Model Quant VRAM Est Bench Delta (pp) Gen Delta (%) Tier Native tok/s $/1M
llama3.2-1b Q5_K_M 0.9 GB +0.2 -1.7 negligible 368.2 $0.0264
llama3.2-1b Q6_K 1.0 GB +0.0 +0.0 negligible 335.2 $0.0290
qwen2.5-1.5b Q4_K_M 0.9 GB -4.1 -4.9 acceptable 299.0 $0.0325
llama3.2-1b Q8_0 1.3 GB +1.2 -0.2 negligible 290.9 $0.0334
qwen2.5-1.5b Q5_K_M 1.1 GB -0.4 -1.9 negligible 288.3 $0.0337
llama3.2-1b Q4_K_M 0.7 GB -2.3 +4.9 negligible 280.9 $0.0346
qwen2.5-1.5b Q8_0 1.6 GB -1.2 -0.3 negligible 267.7 $0.0363
qwen2.5-1.5b Q6_K 1.3 GB -1.2 -2.4 negligible 267.3 $0.0364
phi-2 Q3_K_S 1.2 GB -0.4 -4.1 acceptable 205.3 $0.0473
phi-2 Q4_K_M 1.6 GB -1.8 -0.6 negligible 198.4 $0.0490
phi-2 Q5_K_M 1.9 GB +0.6 +1.6 negligible 180.3 $0.0539
llama3.2-3b Q4_K_M 1.9 GB -0.4 -2.0 negligible 141.0 $0.0689

13.2 4GB VRAM

All entries from 2GB tier plus:

Model Quant VRAM Est Bench Delta (pp) Gen Delta (%) Tier Native tok/s $/1M
llama3.2-1b FP16 2.5 GB +0.0 +0.0 negligible 185.2 $0.0525
phi-2 Q6_K 2.2 GB -1.2 +0.6 negligible 169.4 $0.0574
llama3.2-3b Q8_0 3.3 GB +0.4 -0.1 negligible 163.5 $0.0595
qwen2.5-1.5b FP16 3.2 GB +0.0 +0.0 negligible 158.0 $0.0615
llama3.1-8b Q3_K_S 3.6 GB -2.5 -3.8 acceptable 153.2 $0.0634
phi-2 Q8_0 2.8 GB -1.6 +0.6 negligible 113.4 $0.0857

13.3 6GB VRAM

All entries from 4GB tier plus:

Model Quant VRAM Est Bench Delta (pp) Gen Delta (%) Tier Native tok/s $/1M
llama3.1-8b Q4_K_M 4.6 GB -2.7 +0.8 negligible 96.7 $0.1006
llama3.1-8b Q5_K_M 5.7 GB -1.6 +1.3 negligible 88.8 $0.1095
phi-2 FP16 5.5 GB +0.0 +0.0 negligible 66.1 $0.1471

13.4 8GB+ VRAM

All entries from 6GB tier plus:

Model Quant VRAM Est Bench Delta (pp) Gen Delta (%) Tier Native tok/s $/1M
llama3.2-3b FP16 6.6 GB +0.0 +0.0 negligible 99.0 $0.0982
llama3.1-8b Q6_K 6.7 GB -1.6 -1.0 negligible 75.6 $0.1287
13.5 Excluded Variants (Quality Failures)

These variants fit even small VRAM budgets but fail the quality gate:

Model Quant VRAM Est Bench Delta (pp) Gen Delta (%) Tier
llama3.1-8b Q2_K 2.6 GB -14.2 -5.9 unacceptable
llama3.2-1b Q3_K_S 0.6 GB -9.5 -0.8 concerning
llama3.2-1b Q2_K 0.4 GB -16.1 -23.4 unacceptable
llama3.2-3b Q3_K_S 1.4 GB -10.1 -8.8 unacceptable
llama3.2-3b Q2_K 1.0 GB -13.0 -4.6 unacceptable
phi-2 Q2_K 0.9 GB -11.3 -1.9 unacceptable
qwen2.5-1.5b Q3_K_S 0.7 GB -12.2 -3.5 unacceptable
qwen2.5-1.5b Q2_K 0.5 GB -40.6 -28.7 unacceptable

13.6 Decision Matrix Summary

Of 111 total (model, quant, VRAM-tier) combinations that physically fit:

Tier Count Percentage
Negligible 69 62%
Acceptable 11 10%
Concerning 4 4%
Unacceptable 27 24%

80 of 111 fitting combinations (72%) are recommended (negligible + acceptable).

13.7 Explicit 29-Variant Enumeration

The claim "21 of 29 quantized variants maintain quality within 5pp" requires explicit enumeration. Here are all 29 non-baseline quantized variants with their tier classification (5 baselines excluded: FP16 for 4 models, Q8_0 for llama3.1-8b):

# Model Quant Rescored Acc (%) Bench Delta (pp) Gen Delta (%) Tier Safe?
1 llama3.2-1b Q8_0 39.6 +1.2 -0.2 negligible Yes
2 llama3.2-1b Q6_K 38.4 +0.0 +0.0 negligible Yes
3 llama3.2-1b Q5_K_M 38.6 +0.2 -1.7 negligible Yes
4 llama3.2-1b Q4_K_M 36.1 -2.3 +4.9 negligible Yes
5 llama3.2-1b Q3_K_S 28.9 -9.5 -0.8 concerning No
6 llama3.2-1b Q2_K 22.3 -16.1 -23.4 unacceptable No
7 qwen2.5-1.5b Q8_0 63.9 -1.2 -0.3 negligible Yes
8 qwen2.5-1.5b Q6_K 63.9 -1.2 -2.4 negligible Yes
9 qwen2.5-1.5b Q5_K_M 64.7 -0.4 -1.9 negligible Yes
10 qwen2.5-1.5b Q4_K_M 61.0 -4.1 -4.9 acceptable Yes
11 qwen2.5-1.5b Q3_K_S 53.0 -12.2 -3.5 unacceptable No
12 qwen2.5-1.5b Q2_K 24.5 -40.6 -28.7 unacceptable No
13 phi-2 Q8_0 57.7 -1.6 +0.6 negligible Yes
14 phi-2 Q6_K 58.1 -1.2 +0.6 negligible Yes
15 phi-2 Q5_K_M 60.0 +0.6 +1.6 negligible Yes
16 phi-2 Q4_K_M 57.5 -1.8 -0.6 negligible Yes
17 phi-2 Q3_K_S 59.0 -0.4 -4.1 acceptable Yes
18 phi-2 Q2_K 48.0 -11.3 -1.9 unacceptable No
19 llama3.2-3b Q8_0 64.5 +0.4 -0.1 negligible Yes
20 llama3.2-3b Q6_K 63.5 -0.6 +0.4 negligible Yes
21 llama3.2-3b Q5_K_M 63.3 -0.8 -1.4 negligible Yes
22 llama3.2-3b Q4_K_M 63.7 -0.4 -2.0 negligible Yes
23 llama3.2-3b Q3_K_S 54.0 -10.1 -8.8 unacceptable No
24 llama3.2-3b Q2_K 51.1 -13.0 -4.6 unacceptable No
25 llama3.1-8b Q6_K 70.7 -1.6 -1.0 negligible Yes
26 llama3.1-8b Q5_K_M 70.7 -1.6 +1.3 negligible Yes
27 llama3.1-8b Q4_K_M 69.7 -2.7 +0.8 negligible Yes
28 llama3.1-8b Q3_K_S 69.9 -2.5 -3.8 acceptable Yes
29 llama3.1-8b Q2_K 58.1 -14.2 -5.9 unacceptable No

Tier totals: 18 negligible + 3 acceptable + 1 concerning + 7 unacceptable = 29. 21 safe (72%), 8 unsafe (28%).

The 8 unsafe variants are exclusively at Q3_K_S (3 models) and Q2_K (all 5 models). No variant at Q4_K_M or above is classified worse than "acceptable."


14. Diminishing Returns

Marginal quality gain vs cost increase when stepping to a higher quant level. "Bench Gain" is the rescored accuracy improvement (pp). "Gen Gain" is the key_metric_avg difference. "Cost Increase" and "Speed Loss" measure the penalty for the higher quant level.
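Each row is a marginal delta between adjacent levels; e.g. for llama3.1-8b Q2_K -> Q3_K_S (sketch, using the values from SS8.1, SS10.1, and SS12.1):

```python
def step_deltas(lower: dict, higher: dict) -> dict:
    """Marginal changes when stepping lower -> higher quant level.
    Negative cost/speed values mean the higher level is cheaper/faster."""
    return {
        "bench_gain_pp": round(higher["acc_pct"] - lower["acc_pct"], 1),
        "cost_increase_pct": round((higher["cost"] / lower["cost"] - 1) * 100, 1),
        "speed_loss_pct": round((1 - higher["toks"] / lower["toks"]) * 100, 1),
    }

q2k = {"acc_pct": 58.1, "cost": 0.0731, "toks": 133.0}    # llama3.1-8b Q2_K
q3ks = {"acc_pct": 69.9, "cost": 0.0634, "toks": 153.2}   # llama3.1-8b Q3_K_S
print(step_deltas(q2k, q3ks))
# {'bench_gain_pp': 11.8, 'cost_increase_pct': -13.3, 'speed_loss_pct': -15.2}
```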

14.1 Key Diminishing Returns Steps

Model Step Bench Gain (pp) Gen Gain Cost Increase Speed Loss
llama3.1-8b Q4_K_M -> Q5_K_M +1.0 +0.0024 +8.8% +8.2%
llama3.1-8b Q3_K_S -> Q4_K_M -0.2 +0.0291 +58.7% +36.9%
llama3.1-8b Q2_K -> Q3_K_S +11.8 +0.0118 -13.3% -15.2%
llama3.2-1b Q4_K_M -> Q5_K_M +2.5 -0.0242 -23.7% -31.1%
llama3.2-1b Q3_K_S -> Q4_K_M +7.2 +0.0216 +18.5% +15.6%
llama3.2-1b Q2_K -> Q3_K_S +6.6 +0.0922 +43.8% +30.7%
llama3.2-3b Q3_K_S -> Q4_K_M +9.7 +0.0431 -5.6% -5.9%
llama3.2-3b Q2_K -> Q3_K_S +2.9 -0.0289 +83.4% +45.5%
phi-2 Q2_K -> Q3_K_S +10.9 -0.0104 +19.4% +16.4%
qwen2.5-1.5b Q3_K_S -> Q4_K_M +8.0 -0.0078 -14.2% -16.5%
qwen2.5-1.5b Q2_K -> Q3_K_S +28.5 +0.1367 +47.5% +32.2%

14.2 Interpretation

  1. The Q2_K -> Q3_K_S step delivers large quality gains for four of the five models (+6.6 to +28.5pp); llama3.2-3b is the exception (+2.9pp), with its larger gain arriving at the Q3_K_S -> Q4_K_M step (+9.7pp). Escaping Q2_K is the most cost-effective quality investment.
  2. The Q3_K_S -> Q4_K_M step is the second-biggest gain for models where Q3_K_S is concerning (7.2-9.7pp for llama3.2-1b, llama3.2-3b, qwen2.5-1.5b).
  3. Steps above Q4_K_M show diminishing returns -- typically <2pp benchmark gain per step, with 5-60% cost increases.
  4. The FP16 -> Q8_0 step is free or beneficial -- Q8_0 is equivalent or slightly better than FP16 for most models, at 36-42% lower cost.
  5. llama3.2-3b Q3_K_S -> Q4_K_M is the best single trade-off: +9.7pp benchmark gain, -5.6% cost decrease (Q4_K_M is actually cheaper due to higher throughput). This is a rare "free quality upgrade."

15. Statistical Tests

Pairwise t-tests between adjacent quant levels. Benchmark tests use rescored exact_match (binary). Generation tests use BERTScore, coherence, ROUGE-L.
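A normal-approximation sketch of the pairwise test (the pipeline presumably uses a proper t-distribution; at N=485 the difference is negligible). The sign convention matches the tables: d is negative when the higher quant level scores higher.

```python
import math
from statistics import NormalDist, mean, stdev

def pairwise_test(higher: list, lower: list) -> tuple:
    """Welch z-test (normal approximation) plus pooled-SD Cohen's d."""
    mh, ml = mean(higher), mean(lower)
    vh, vl = stdev(higher) ** 2, stdev(lower) ** 2
    nh, nl = len(higher), len(lower)
    z = (mh - ml) / math.sqrt(vh / nh + vl / nl)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    pooled = math.sqrt(((nh - 1) * vh + (nl - 1) * vl) / (nh + nl - 2))
    return p, (ml - mh) / pooled

# Binary rescored-correct vectors approximating the 8B Q3_K_S/Q2_K pair:
h = [1.0] * 339 + [0.0] * 146   # ~69.9%
l = [1.0] * 282 + [0.0] * 203   # ~58.1%
p, d = pairwise_test(h, l)       # p ~ 1e-4, d ~ -0.25, close to the table row
```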

15.1 Benchmark Accuracy Tests (7/29 significant)

Model Higher Q Lower Q N Mean H Mean L Cohen's d p-value
llama3.1-8b Q3_K_S Q2_K 485 0.699 0.581 -0.246 0.0001
llama3.2-1b Q4_K_M Q3_K_S 485 0.361 0.289 -0.154 0.0164
llama3.2-1b Q3_K_S Q2_K 485 0.289 0.223 -0.151 0.0185
llama3.2-3b Q4_K_M Q3_K_S 485 0.637 0.540 -0.198 0.0021
phi-2 Q3_K_S Q2_K 485 0.590 0.480 -0.220 0.0006
qwen2.5-1.5b Q4_K_M Q3_K_S 485 0.610 0.530 -0.163 0.0114
qwen2.5-1.5b Q3_K_S Q2_K 485 0.530 0.245 -0.610 0.0000

Pattern: All 7 significant tests occur at the Q3_K_S/Q2_K boundary -- exactly where the quality cliff is. No test between Q4_K_M and above is significant, confirming that quality differences above Q4_K_M are below measurement resolution.

15.2 Generation Quality Tests (9/87 significant)

Model Metric Higher Q Lower Q N Mean H Mean L Cohen's d p-value
llama3.1-8b coherence Q4_K_M Q3_K_S 250 0.683 0.640 -0.189 0.0353
llama3.2-1b bertscore Q3_K_S Q2_K 100 0.656 0.550 -0.833 0.0000
llama3.2-1b coherence Q3_K_S Q2_K 250 0.557 0.493 -0.256 0.0043
llama3.2-1b rouge_l Q5_K_M Q4_K_M 150 0.259 0.297 +0.236 0.0415
llama3.2-1b rouge_l Q3_K_S Q2_K 150 0.266 0.159 -0.685 0.0000
llama3.2-3b coherence Q4_K_M Q3_K_S 250 0.650 0.573 -0.276 0.0021
qwen2.5-1.5b bertscore Q3_K_S Q2_K 100 0.726 0.602 -0.829 0.0000
qwen2.5-1.5b coherence Q3_K_S Q2_K 250 0.706 0.576 -0.544 0.0000
qwen2.5-1.5b rouge_l Q3_K_S Q2_K 150 0.355 0.200 -0.725 0.0000

15.3 Summary

16/116 tests are significant at p<0.05 (uncorrected): 7/29 benchmark, 9/87 generation. Significant results concentrate at the Q3_K_S-to-Q2_K boundary, with the remainder at the Q4_K_M-to-Q3_K_S transition. No significant differences were detected above Q4_K_M.

15.4 Multiple Comparison Correction (Computed)

With 116 tests at alpha = 0.05, the expected false positive count under the null is 5.8. Both Bonferroni and Holm step-down corrections were applied to all 116 p-values.

Bonferroni threshold: alpha_corrected = 0.05 / 116 = 0.000431.

Holm step-down: Rank-ordered p-values compared to alpha / (116 - rank + 1).
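Both corrections reduce to a few lines; the helpers below are a hypothetical sketch, not the analysis code.

```python
def bonferroni_survivors(pvals, alpha=0.05):
    """A test survives Bonferroni if p <= alpha / m."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def holm_survivors(pvals, alpha=0.05):
    """Holm step-down: sort p ascending; the k-th smallest (1-based) is
    compared to alpha / (m - k + 1); stop at the first failure."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    out = [False] * m
    for rank0, i in enumerate(order):
        if pvals[i] <= alpha / (m - rank0):  # rank0 is 0-based, so m - rank0 = m - k + 1
            out[i] = True
        else:
            break
    return out

print(0.05 / 116)  # Bonferroni threshold for 116 tests: ~0.000431
```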

Result: 7 of 16 significant tests survive both Bonferroni and Holm correction:

Rank Model Transition Metric p (uncorrected) p (Bonferroni) d Survives?
1 qwen2.5-1.5b Q3_K_S -> Q2_K benchmark <0.0001 <0.001 -0.610 Yes
2 llama3.2-1b Q3_K_S -> Q2_K bertscore <0.0001 <0.001 -0.833 Yes
3 llama3.2-1b Q3_K_S -> Q2_K rouge_l <0.0001 <0.001 -0.685 Yes
4 qwen2.5-1.5b Q3_K_S -> Q2_K bertscore <0.0001 <0.001 -0.829 Yes
5 qwen2.5-1.5b Q3_K_S -> Q2_K coherence <0.0001 <0.001 -0.544 Yes
6 qwen2.5-1.5b Q3_K_S -> Q2_K rouge_l <0.0001 <0.001 -0.725 Yes
7 llama3.1-8b Q3_K_S -> Q2_K benchmark 0.0001 0.012 -0.246 Yes
8 phi-2 Q3_K_S -> Q2_K benchmark 0.0006 0.070 -0.220 No
9 llama3.2-3b Q4_K_M -> Q3_K_S benchmark 0.0021 0.244 -0.198 No
10-16 (remaining) 0.004-0.042 0.46-1.0 No

Impact on conclusions: All 7 survivors are at the Q3_K_S-to-Q2_K boundary, so the Q2_K quality cliff is robust to either correction method (only phi-2's Q2_K benchmark test, Bonferroni p=0.070, narrowly misses). The Q3_K_S cliff is NOT robust -- none of the Q4_K_M-to-Q3_K_S tests survive correction (the closest, llama3.2-3b benchmark, has Bonferroni p=0.244). The claim that "no significant differences exist above Q4_K_M" is strengthened by correction.

Why we report uncorrected p-values: Following the research program convention, we report uncorrected p-values with this correction table, rather than applying a correction that would obscure raw results. The reader should treat p-values between 0.001 and 0.05 as marginal.

15.5 TOST Equivalence Testing

The "negligible" tier classifies 18 quantized variants as having no meaningful quality loss. But a non-significant t-test does NOT establish equivalence. TOST (Two One-Sided Tests) was applied to all 18 negligible variants to test whether the true delta lies within +/-3pp of baseline (benchmark) or +/-3% of baseline mean (generation).
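TOST can be sketched with two one-sided Welch tests. This is a simplified stand-in for the actual procedure (scipy >= 1.6 is assumed for the alternative argument); the worked examples use synthetic data chosen to mirror the study's sample sizes.

```python
import numpy as np
from scipy import stats

def tost_ind(a, b, margin):
    """Two one-sided Welch tests: H1 is -margin < mean(a) - mean(b) < +margin.
    Returns the TOST p-value (the larger of the two one-sided p-values)."""
    # H0a: mean(a) - mean(b) <= -margin  ->  reject in the 'greater' direction
    p_lo = stats.ttest_ind(a, b - margin, equal_var=False, alternative="greater").pvalue
    # H0b: mean(a) - mean(b) >= +margin  ->  reject in the 'less' direction
    p_hi = stats.ttest_ind(a, b + margin, equal_var=False, alternative="less").pvalue
    return max(p_lo, p_hi)

# Even a zero observed delta on binary accuracy at N=485 cannot establish
# equivalence at a +/-3pp margin -- the margin is too small relative to the SE:
acc = np.array([1.0] * 291 + [0.0] * 194)        # 60.0% accuracy, N=485
p_bench = tost_ind(acc, acc, margin=0.03)        # ~0.17, not significant

# A continuous generation metric at N=250 with modest spread does pass at +/-5%:
gen = np.linspace(0.60, 0.80, 250)               # mean 0.70
p_gen = tost_ind(gen, gen, margin=0.035)         # well below 0.05
```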

At +/-3pp margin: 0/18 benchmark pass, 0/18 generation pass.

Model Quant Bench Delta (pp) TOST p (bench) TOST p (gen) Equiv?
llama3.1-8b Q6_K -1.6 0.240 0.230 No
llama3.1-8b Q5_K_M -1.6 0.876 0.219 No
llama3.1-8b Q4_K_M -2.7 0.307 0.254 No
llama3.2-1b Q8_0 +1.2 0.263 0.180 No
llama3.2-1b Q6_K +0.0 0.166 0.169 No
llama3.2-1b Q5_K_M +0.2 0.242 0.301 No
llama3.2-1b Q4_K_M -2.3 0.331 0.467 No
llama3.2-3b Q8_0 +0.4 0.169 0.133 No
llama3.2-3b Q6_K -0.6 0.358 0.153 No
llama3.2-3b Q5_K_M -0.8 0.265 0.273 No
llama3.2-3b Q4_K_M -0.4 0.169 0.332 No
phi-2 Q8_0 -1.6 0.201 0.127 No
phi-2 Q6_K -1.2 0.289 0.131 No
phi-2 Q5_K_M +0.6 0.220 0.191 No
phi-2 Q4_K_M -1.8 0.242 0.192 No
qwen2.5-1.5b Q8_0 -1.2 0.707 0.159 No
qwen2.5-1.5b Q6_K -1.2 0.435 0.335 No
qwen2.5-1.5b Q5_K_M -0.4 1.000 0.252 No

At +/-5pp margin: 0/18 benchmark pass, 6/18 generation pass.

Widening the equivalence margin to +/-5pp (the "acceptable" tier threshold) allows the generation TOSTs to reach significance for 6 variants. Benchmark tests remain underpowered at this margin because binary accuracy has high variance (95% half-width of ~5.9pp on the difference).

Model Quant Gen Delta (%) TOST p (gen, +/-5%) Gen Equiv?
llama3.2-3b Q8_0 -0.1 0.031 Yes
llama3.2-3b Q6_K +0.3 0.037 Yes
phi-2 Q8_0 +0.1 0.027 Yes
phi-2 Q6_K +0.1 0.028 Yes
phi-2 Q5_K_M +0.8 0.048 Yes
qwen2.5-1.5b Q8_0 -0.3 0.041 Yes

Pattern: The 6 variants that pass are all at high quant levels (Q8_0, Q6_K, Q5_K_M) with generation deltas <1%. phi-2 passes at 3 quant levels (Q8_0 through Q5_K_M), consistent with its overall quantization robustness. No variant at Q4_K_M or below passes even at +/-5%.

Interpretation: At N=485 with binary accuracy data, the 95% confidence half-width on an accuracy difference is approximately +/-5.9pp. A +/-3pp equivalence margin is too narrow relative to this uncertainty for the study to have sufficient power to establish equivalence. At +/-5pp, generation quality (continuous metrics with lower variance) has enough power to confirm equivalence for the highest-quant variants, but benchmark accuracy (binary data) remains underpowered. Establishing benchmark equivalence at +/-5pp with 80% power would require N > 1,500 per variant; at +/-3pp, N > 3,000.
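The required-N figures follow from a normal-approximation TOST sizing formula. The sketch below uses the z_(1-alpha) + z_(1-beta/2) convention; other conventions shift the constant somewhat, so treat the outputs as order-of-magnitude.

```python
from scipy.stats import norm

def n_per_variant_tost(margin, p=0.5, alpha=0.05, power=0.80):
    """Approximate per-group N to show equivalence of two binomial
    proportions within +/-margin via TOST (normal approximation)."""
    z = norm.ppf(1 - alpha) + norm.ppf(1 - (1 - power) / 2)
    return 2 * p * (1 - p) * (z / margin) ** 2

n_5pp = n_per_variant_tost(0.05)  # ~1,700: consistent with "N > 1,500"
n_3pp = n_per_variant_tost(0.03)  # ~4,800: consistent with "N > 3,000"
```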

The "negligible" tier should therefore be read as "point estimate within 3pp of baseline, no detected degradation." Of the 18 variants, 6 are confirmed equivalent on generation quality at +/-5%, but benchmark equivalence remains unconfirmed for all of them.


16. Power Analysis & Statistical Resolution

16.1 Minimum Detectable Effects

Computed using normal approximation at alpha=0.05, power=0.80.

Metric Type N per Variant MDE Interpretation
Benchmark accuracy (binary) 485 9.0pp Cannot detect <9.0pp accuracy differences
Generation quality (continuous) 250 d=0.251 Small effects (d<0.251) are below resolution
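Both MDE figures reproduce under the standard normal-approximation formulas; a sketch, assuming worst-case p = 0.5 for the binary case:

```python
from scipy.stats import norm

Z = norm.ppf(0.975) + norm.ppf(0.80)  # alpha=0.05 two-sided, power=0.80

def mde_binary_pp(n, p=0.5):
    """Minimum detectable accuracy difference, in percentage points."""
    return 100 * Z * (2 * p * (1 - p) / n) ** 0.5

def mde_cohens_d(n):
    """Minimum detectable standardized effect for a two-sample comparison."""
    return Z * (2 / n) ** 0.5

mde_binary_pp(485)  # -> 9.0pp
mde_cohens_d(250)   # -> 0.251
```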

16.2 Implications for Tier Thresholds

  • "Negligible" tier (-3pp): Below the 9.0pp benchmark detection limit. We cannot statistically confirm that -3pp deltas are real -- they are indistinguishable from zero at 80% power. A variant classified "negligible" may have zero true degradation or up to ~9pp true degradation. The classification is based on the point estimate only. For binary accuracy at N=485, the 95% Wilson CI half-width is +/-3.4pp to +/-4.3pp depending on the baseline accuracy -- meaning a -3pp point estimate has a CI spanning roughly [-7pp, +1pp]. The "negligible" label should be interpreted as "no evidence of degradation" rather than "proven equivalent."
  • "Acceptable" tier (-5pp): Also below the benchmark detection limit. Same caveat as negligible: the generation metric detection limit (d=0.251) provides supplementary evidence, but benchmark-only tier assignments at -3pp to -5pp are statistically unresolved.
  • "Concerning" tier (-10pp): At the detection limit. Deltas in this range may or may not be statistically significant depending on variance. The Q3_K_S drops for llama3.2-3b (-10.1pp) and qwen2.5-1.5b (-12.2pp) are above the MDE and statistically significant.
  • "Unacceptable" tier (beyond -10pp): Above the detection limit. These are genuine, measurable quality losses. All Q2_K results (losses of 11pp or more) are statistically significant.
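The Wilson half-widths cited above come from the standard score interval; a sketch for reference (the report's tables carry the authoritative values):

```python
from math import sqrt

def wilson_ci(k, n, z=1.96):
    """95% Wilson score interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# At N=485 and ~50% accuracy the half-width is ~4.4pp:
lo, hi = wilson_ci(243, 485)
half_pp = 100 * (hi - lo) / 2
```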

Model-specific MDEs: The 9.0pp MDE uses worst-case p=0.5. For models with higher accuracy (llama3.1-8b at p=0.72), the actual MDE is smaller (~7.9pp). For models with lower accuracy (llama3.2-1b at p=0.38), the MDE is ~8.6pp. These differences are modest and do not change the tier classification conclusions.

16.3 Practical Guidance

The power analysis reveals that this experiment is well-sized for detecting large quality drops (>10pp) but not for distinguishing between adjacent quant levels at the top of the quality range. This is acceptable for deployment decisions: the question "is Q4_K_M safe?" (answer: yes, all deltas <5pp) is more important than "is Q5_K_M 0.5pp better than Q6_K?" (cannot be resolved at this sample size).


17. Cross-Phase Validation

Phase 1 Q8_0 vs Phase 2 Q8_0 results for overlapping models (generation tasks only). Same Ollama tags, same temp=0. Consistency threshold: <5% difference.

Model Metric Phase 1 Mean (N) Phase 2 Mean (N) Diff % Status
llama3.2-1b bertscore 0.6503 (50) 0.6444 (250) -0.9% OK
llama3.2-1b coherence 0.5731 (50) 0.5778 (250) +0.8% OK
llama3.2-1b rouge_l 0.2740 (50) 0.2659 (250) -2.9% OK
phi-2 bertscore 0.7674 (50) 0.7235 (250) -5.7% DIVERGENT
phi-2 coherence 0.7916 (50) 0.7650 (250) -3.4% OK
phi-2 rouge_l 0.5131 (50) 0.4178 (250) -18.6% DIVERGENT
qwen2.5-1.5b bertscore 0.7905 (50) 0.7454 (250) -5.7% DIVERGENT
qwen2.5-1.5b coherence 0.7434 (50) 0.7102 (250) -4.5% OK
qwen2.5-1.5b rouge_l 0.4262 (50) 0.3806 (250) -10.7% DIVERGENT

5/9 metrics consistent (<5% difference). 4/9 divergent (>5% difference).

17.1 Divergence Analysis

The 4 divergent metrics are:

Model Metric Phase 1 Phase 2 Diff Severity
phi-2 BERTScore 0.767 0.723 -5.7% Just over threshold
phi-2 ROUGE-L 0.513 0.418 -18.6% Large divergence
qwen2.5-1.5b BERTScore 0.790 0.745 -5.7% Just over threshold
qwen2.5-1.5b ROUGE-L 0.426 0.381 -10.7% Moderate divergence

Note that BERTScore is NOT fully reproducible -- two of three models show divergence at -5.7%, just barely exceeding the 5% threshold. Only coherence passes for all three models (max divergence -4.5%). The prior conclusion that "BERTScore is reproducible" was premature.

Possible explanations for the systematic downward shift:

  1. Sample selection effect: Phase 1 used 10 samples/task; Phase 2 uses 50 samples/task drawn from the same task pool. The additional 40 samples per task may include harder prompts that pull metrics down, particularly for phi-2 and qwen2.5-1.5b which are more sensitive to prompt difficulty.
  2. ROUGE-L sensitivity: ROUGE-L is the most variance-prone of the three key metrics (TR124 Phase 3 showed CV 0.23-0.55 for ROUGE-L vs 0.07-0.20 for BERTScore). The Phase 1 estimate at N=50 has wide CIs; the Phase 2 estimate at N=250 is more reliable.
  3. Systematic direction: All 4 divergent metrics show Phase 2 lower than Phase 1. This is not random -- it suggests Phase 1 at N=50 systematically overestimated quality, likely due to an easier sample subset.
  4. Model/Ollama version changes: If Ollama updated model weights between Phase 1 (Feb 20) and Phase 2 (Feb 21), Q8_0 outputs could differ. This is unlikely for a 1-day gap but cannot be ruled out.

Revised conclusion: Only coherence is fully reproducible across phases (3/3 models consistent). BERTScore diverges for 2/3 models at -5.7% (marginal). ROUGE-L diverges for 2/3 models at -10.7% to -18.6% (substantial). For cross-phase comparisons, coherence is the only fully reliable signal. BERTScore is marginal, and ROUGE-L is unreliable across phases at these sample sizes.


18. Phase 1 vs Phase 2 Synthesis

18.1 What Changed Between Phases

Aspect Phase 1 Phase 2 Impact
Baseline Q8_0 (instruct) FP16 Ollama (instruct) Eliminates base-vs-instruct confound
Quality gate Generation metrics only MMLU + ARC benchmarks (primary) Objective knowledge measurement
Sample size N=50 (10/task) N=485 bench + N=250 gen 10x more statistical power
Quality classification Binary (-10% threshold) 4-tier system Finer-grained decisions
Timing Wall-clock (CV 37-42%) Native eval_duration (CV 10-42%) More accurate throughput
Models 3 (1.2-2.7B) 5 (1.2-8B) Covers real deployment range
TTFT Unavailable Native prompt_eval_duration Prompt latency measured

18.2 Phase 1 Findings Validated by Phase 2

Phase 1 Finding Phase 2 Confirmation
Q2_K is catastrophic for llama3.2-1b and qwen2.5-1.5b Confirmed: -16.1pp and -40.6pp benchmark accuracy
phi-2 is most quantization-robust Confirmed: max -1.8pp at Q4_K_M, -0.4pp at Q3_K_S
Quality stable Q8_0 through Q5_K_M Confirmed: all models within -1.6pp at Q5_K_M
Non-monotonic quality at N=50 Largely resolved: Phase 2 at N=485 shows near-monotonic degradation

18.3 Phase 1 Findings Refined by Phase 2

Phase 1 Finding Phase 2 Refinement
phi-2 Q4_K_M loses -7.2% (generation) Phase 2: only -1.8pp benchmark, -0.6% generation. Phase 1 overestimated the loss
qwen2.5-1.5b Q4_K_M loses -11.9% (generation) Phase 2: -4.1pp benchmark, -4.9% generation. Phase 1 overestimated, but Q4_K_M is still the weakest for qwen
llama3.2-1b Q4_K_M is better than Q8_0 (+11.8%) Phase 2: Q4_K_M is -2.3pp below FP16, +4.9% generation. The +11.8% was likely noise at N=50. The persistent +4.9% at N=250 suggests a stochastic generation effect: Q4_K_M's slightly different weights produce outputs that happen to score higher on these specific tasks. This does not indicate Q4_K_M is genuinely "better" -- it indicates the generation metrics retain residual variance even at N=250.

19. Production Guidance & Decision Trees

19.1 Universal Quantization Rules

Based on 24,990 Phase 2 samples across 5 models and 7 quant levels:

  1. Default to Q4_K_M. Every model tested maintains negligible-to-acceptable quality at this level. FP16-baselined models save 30-67% vs FP16, and llama3.1-8b saves 49% vs Q8_0.
  2. Never deploy Q2_K. Every model tested loses >11pp benchmark accuracy. No cost saving justifies near-random performance.
  3. phi-2 can go to Q3_K_S. Only -0.4pp benchmark loss, classified "acceptable." This is the most aggressive safe quantization in our data.
  4. llama3.1-8b can go to Q3_K_S. Only -2.5pp, classified "acceptable." The 8B model's redundancy absorbs quantization noise.
  5. Do not go below Q4_K_M for llama3.2-1b, llama3.2-3b, or qwen2.5-1.5b. All three break at Q3_K_S (-9.5pp, -10.1pp, -12.2pp respectively).

19.2 Decision Tree by Use Case

Use Case Recommended Model Quant Level Why
Maximum accuracy llama3.1-8b Q8_0 72.4% accuracy, highest measured
Best accuracy/dollar phi-2 Q4_K_M 57.5% acc, $0.0490/1M, 67% cheaper than FP16
Fastest inference llama3.2-1b Q5_K_M 368 tok/s native, negligible quality loss
Smallest VRAM llama3.2-1b Q4_K_M 0.7 GB est., negligible quality loss
8B on consumer GPU llama3.1-8b Q4_K_M 4.6 GB, 69.7% acc, fits RTX 3060
Aggressive quantization phi-2 Q3_K_S Only -0.4pp, 205 tok/s, 1.2 GB
Benchmark-critical qwen2.5-1.5b Q5_K_M 64.7% acc (-0.4pp), negligible loss
Throughput-critical qwen2.5-1.5b Q4_K_M 299 tok/s, acceptable quality (-4.1pp)

19.3 Integration with TR123/TR124

TR123/TR124 Recommendation TR125 Update
"Use phi-2 FP16 for highest quality" (TR124) phi-2 at Q4_K_M delivers 97% of FP16 quality at 33% of the cost
"llama-3.2-1b/GPU is the workhorse" (TR124) llama3.2-1b at Q5_K_M via Ollama is even cheaper ($0.0264 vs $0.075/1M) with negligible quality loss
"Quality-cost Pareto: 3 configs" (TR124) Quantization adds 20+ Pareto-efficient configs across VRAM tiers
"Quantization degrades coherence -14% to -32%" (TR124 Phase 2) TR125 Phase 2 shows this was a base-vs-instruct confound artifact. True degradation at Q4_K_M is <5%

19.4 What Remains Open

  1. Batch inference: All results are single-stream. Batched Ollama serving may show different quantization sensitivity.
  2. Context length sensitivity: All tests use short prompts (<512 tokens). Long-context tasks may amplify quantization errors through accumulated KV cache rounding.
  3. Task-specific quantization: Partially addressed (v2) -- §9.8 shows per-task quality breakdown. QA and classification are most sensitive; creative_writing is most robust. Per-task statistical tests remain open.
  4. Hardware generalization: Results on RTX 4080 may not transfer to different GPU architectures (AMD, Apple Silicon) or different memory bandwidth profiles.
  5. Newer quant formats: IQ4_XS, Q4_0_4_4, and other emerging GGUF formats are not tested.
  6. Ollama determinism validation: Verify that Ollama at temp=0 produces bit-identical outputs across runs -- see Limitations L2.
  7. Actual VRAM measurement: Replace theoretical VRAM estimates with measured values under realistic context lengths -- see Limitations L1.
  8. MMLU vs ARC differential: Addressed (v2) -- §8.7 presents systematic analysis. ARC generally more robust; MMLU degrades faster under quantization.
  9. Formal equivalence testing: Addressed (v2) -- §15.5 shows 0/18 negligible variants pass TOST at +/-3pp. Study underpowered for equivalence confirmation; would need N>3,000.
  10. Higher-power replication: Confirm "negligible" tier claims with N > 3,000 per variant to achieve TOST equivalence power at +/-3pp margin.

Limitations & Methodological Caveats

This section consolidates all known limitations of the TR125 experimental design and analysis. Limitations previously noted in §4 (Phase 1 only), §16 (power analysis), and §19.4 (future work) are referenced but not duplicated.

L1. VRAM Estimates Are Theoretical, Not Measured

All VRAM numbers in the decision matrix (§13) and model tables use the formula params x bits_per_weight / 8 x 1.1 -- a theoretical estimate with a 10% overhead factor. No actual VRAM measurements were taken. Actual Ollama VRAM usage depends on context length (KV cache overhead), batch size, and Ollama's internal memory management. The 10% overhead factor is arbitrary and may underestimate real usage for long-context scenarios. Treat VRAM numbers as lower-bound estimates.
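The estimator is a one-liner; a sketch of the stated formula, with example bits-per-weight values chosen for illustration only:

```python
def vram_estimate_gb(params_billion, bits_per_weight, overhead=1.10):
    """Theoretical lower bound: params x bpw / 8 x overhead.
    Ignores KV cache, activations, and Ollama server allocations."""
    return params_billion * bits_per_weight / 8 * overhead

vram_estimate_gb(8.0, 16)    # FP16 8B model: 17.6 GB of weights alone
vram_estimate_gb(1.2, 4.5)   # 1.2B model at an assumed ~4.5 bpw: ~0.74 GB
```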

L2. Ollama Determinism Unvalidated

See Statistical Methods & Caveats section (caveat #6). TR124 Phase 3 validated temp=0 determinism for HuggingFace transformers; TR125 extends this assumption to Ollama without validation. If Ollama at temp=0 produces slightly different outputs across runs, the single-repetition design underestimates variance and CIs.

L3. Per-Task Quality Analysis (Partial -- v2)

The v2 enhancement (§9.8) adds per-task generation quality breakdown for selected variants. However, the analysis is limited to key metric averages and does not include per-task statistical tests. Full per-task pairwise testing (5 tasks x 29 variants x 3 metrics = 435 additional tests) was not performed. The data exists in samples.jsonl and phase2_v2_enhancements.json for future analysis.

L4. Supplementary Metrics Now Shown (v2)

All 7 generation metrics are now presented in §9.7. The critical signal -- qwen2.5-1.5b Q2_K repetition collapse to 0.702 -- is documented. However, the supplementary metrics (BLEU, repetition, output_length, exact_match) are not used in tier classification, which still relies on the 3 key metrics only.

L5. Wilson CIs Now Shown in Benchmark Tables (v2)

Benchmark accuracy tables (§8.1-8.5) now include 95% Wilson score confidence intervals. Wilson CI half-widths range from +/-3.7pp to +/-4.4pp at N=485. Generation quality CIs remain in the raw data only (phase2_analysis.json). The full TTFT table (§11.1) now includes standard deviations.

L6. MMLU vs ARC Differential Now Analyzed (v2)

§8.7 now presents a systematic MMLU vs ARC differential analysis. Key finding: ARC is generally more robust to quantization than MMLU, with the differential widening under aggressive quantization. One anomaly identified: llama3.2-1b Q3_K_S shows ARC collapse to 24.5% (below random) while MMLU holds at 31.9%. No statistical tests were applied to the differential itself.

L7. 29-Variant Enumeration Now Explicit (v2)

§13.7 now provides the complete 29-variant enumeration table with tier classification, confirming: 18 negligible + 3 acceptable + 1 concerning + 7 unacceptable. All 8 unsafe variants are at Q3_K_S (3 models) or Q2_K (all 5 models).

L8. Native Timing Still Includes Ollama Server Overhead

The report distinguishes "native" (eval_duration) from "wall-clock" timing. However, eval_duration is measured by the Ollama server process, not by the GPU directly. It excludes HTTP round-trip but still includes Ollama's Go server overhead, memory allocation, and scheduling. Calling this "native" overstates its precision relative to CUDA event timing. For sub-10ms measurements, the server overhead may be proportionally significant.
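The native counters come straight off a non-streamed /api/generate response; a sketch using Ollama's documented response fields (all durations are nanoseconds):

```python
import json
import urllib.request

def ollama_generate(prompt, model, host="http://localhost:11434"):
    """One non-streamed greedy generation; returns the full Ollama response."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False,
                       "options": {"temperature": 0}}).encode()
    req = urllib.request.Request(f"{host}/api/generate", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def decode_tok_per_s(r):
    """Native decode throughput: generated tokens over eval_duration (ns)."""
    return r["eval_count"] / (r["eval_duration"] / 1e9)

def ttft_ms(r):
    """Prompt-processing latency used as the TTFT proxy (ns -> ms)."""
    return r["prompt_eval_duration"] / 1e6
```

These timers are read by the Ollama server itself, so the caveat above stands: HTTP transit is excluded, but server-side overhead is not.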

L9. Rescoring Regex Limited to Letters A-D

The extract_answer_letter function only matches letters A through D. ARC-Challenge questions may have up to 5 choices (A-E). If the correct answer is "E" (rare but possible), the regex would fail to extract it, underestimating accuracy. This potential bias is small (ARC-Challenge uses predominantly 4-choice questions) but not zero.

L10. Outlier Detection Now Performed (v2)

§10.7 now presents IQR-based outlier analysis on all timing data. Outlier rates are high (TTFT: 9.1%, decode: 18.0%) due to Ollama HTTP overhead, compared to TR126's 0.0-0.9% from CUDA event timing. The mean/median values are robust to these outliers, but p95/p99 metrics would be affected. No trimming or Winsorization was applied.
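The IQR rule can be sketched as follows (illustrative; k = 1.5 gives the standard Tukey fences):

```python
import numpy as np

def iqr_outlier_rate(samples, k=1.5):
    """Fraction of samples outside the Tukey fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(samples, [25, 75])
    iqr = q3 - q1
    outliers = (samples < q1 - k * iqr) | (samples > q3 + k * iqr)
    return float(outliers.mean())

# A single HTTP-stall-like spike among 100 well-behaved latencies:
lat = np.concatenate([np.arange(100, dtype=float), [1000.0]])
iqr_outlier_rate(lat)  # -> 1/101 ~ 0.0099
```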

L11. 5 MMLU Questions per Subject

With 285 MMLU questions across 57 subjects (5 per subject), per-subject accuracy analysis is statistically meaningless (95% CI of +/-44pp at p=0.5 for 5 binary questions). The aggregate MMLU accuracy is reliable at N=285, but any subject-level interpretation would be noise. This is not a limitation of the report (which doesn't attempt subject-level analysis) but of the benchmark sample size.


20. Reproducibility

20.1 Run Commands

# Phase 1
python research/tr125/phase1/run.py

# Phase 2 (setup + eval + analyze + report)
python research/tr125/phase2/run.py

20.2 Prerequisites

  • Ollama installed and running locally (http://localhost:11434)
  • All model variants pulled via ollama pull (see research/tr125/phase2/setup_ollama.py)
  • Python 3.13 with dependencies: sentence-transformers, bert-score, rouge-score, evaluate, scipy

20.3 Artifact Provenance

Artifact Hash Method Purpose
samples.jsonl Per-row SHA-256 Full provenance per sample
config.yaml Git-tracked Experiment reproducibility
analyze.py Git-tracked Analysis reproducibility
phase2_analysis.json Derived from samples.jsonl All numbers in this report

20.4 Key Assumptions

  1. Hardware cost: $0.035/hr (RTX 4080 consumer tier, from TR123).
  2. VRAM estimates: params * bpw / 8 * 1.1 overhead factor. Theoretical only -- no actual VRAM measurements taken. Actual usage varies with context length, KV cache size, and Ollama internals. See Limitations L1.
  3. Ollama quantization fidelity: Tag names (e.g., q4_K_M) assumed to match GGUF quant format. Ollama may pick the closest available variant.
  4. Temperature 0.0: All results are greedy decoding. Non-greedy decoding introduces variance (TR124 Phase 3: mean CV 0.33 at temp=0.7).

Appendix A: Metric Definitions

A.1 ROUGE-L

Longest common subsequence (LCS) based F1 score between candidate and reference text. Rewards structural overlap. Implemented via rouge-score library. Range [0, 1].
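The LCS-based F1 reduces to a short dynamic program; a minimal token-level sketch without the library's stemming or tokenization:

```python
def rouge_l_f1(candidate: str, reference: str) -> float:
    """LCS-based F1 between whitespace-tokenized candidate and reference."""
    c, r = candidate.split(), reference.split()
    # Dynamic-programming table for longest common subsequence length.
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, ct in enumerate(c):
        for j, rt in enumerate(r):
            dp[i + 1][j + 1] = dp[i][j] + 1 if ct == rt else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[len(c)][len(r)]
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return 2 * prec * rec / (prec + rec)
```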

A.2 BERTScore

Contextual embedding similarity using microsoft/deberta-xlarge-mnli. Computes pairwise cosine similarity between candidate and reference token embeddings, then takes greedy alignment. More robust to paraphrasing than ROUGE. Range [0, 1].

A.3 BLEU

Geometric mean of 1-4 gram precision with brevity penalty. Standard machine translation metric adapted for code generation evaluation. Range [0, 1].

A.4 Coherence (SemScore)

Sentence-level cosine similarity using all-mpnet-base-v2 sentence-transformers model. Measures how semantically similar the candidate is to the reference. Highest human correlation among automated metrics (Aynetdinov & Akbik 2024). Range [0, 1].

A.5 Exact Match

Binary score: 1 if candidate exactly matches reference (case-insensitive, stripped), 0 otherwise. Used for classification tasks. Range {0, 1}.

A.6 Output Length

min(len(candidate), len(reference)) / max(len(candidate), len(reference)). Penalizes both truncation and over-generation. Range [0, 1].

A.7 Repetition

unique_4grams / total_4grams. Measures lexical diversity. Score of 1.0 = no repeated 4-grams. Range [0, 1].
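Both supplementary metrics are short enough to state directly; a sketch matching the definitions above:

```python
def output_length_score(candidate: str, reference: str) -> float:
    """min/max length ratio; penalizes truncation and over-generation alike."""
    a, b = len(candidate), len(reference)
    return min(a, b) / max(a, b) if max(a, b) > 0 else 1.0

def repetition_score(text: str, n: int = 4) -> float:
    """unique n-grams / total n-grams over whitespace tokens; 1.0 = no repeats."""
    toks = text.split()
    grams = [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    return len(set(grams)) / len(grams) if grams else 1.0

repetition_score("a b c d " * 3)  # 4 unique / 9 total 4-grams ~ 0.44
```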

A.8 Rescored Accuracy

Regex-based answer letter extraction from model output, compared to correct answer letter. Handles patterns: single letter ("B"), letter with paren ("B)"), sentence form ("The answer is B"), labeled form ("Answer: B"). Falls back to first standalone A-D letter found. This resolves formatting noise that penalizes exact_match on models with verbose output styles.


Appendix B: Benchmark Data Provenance

B.1 MMLU (Massive Multitask Language Understanding)

  • Source: cais/mmlu on HuggingFace
  • Subjects: 57 subjects, 5 questions per subject = 285 total
  • Format: 4-choice multiple choice, generation-based scoring
  • Scoring: Model generates free-form answer; rescored via regex letter extraction
  • Random baseline: 25%
  • Why 285 questions: 5 per subject provides coverage across all 57 MMLU subjects while keeping per-variant evaluation tractable. Full MMLU (14,042 questions) would require ~480,000 samples across 34 variants.

B.2 ARC-Challenge (AI2 Reasoning Challenge)

  • Source: allenai/ai2_arc, Challenge subset on HuggingFace
  • Samples: 200 from test split
  • Format: 3-5 choice science questions, generation-based scoring
  • Scoring: Same regex letter extraction as MMLU
  • Random baseline: ~25% (varies with number of choices)
  • Why ARC-Challenge (not Easy): ARC-Challenge is more discriminating than ARC-Easy for models in the 1-8B range. TR124 showed 91% on ARC-Easy for qwen2.5-1.5b, leaving little room to measure quantization degradation.

B.3 Generation Tasks

  • Source: Hand-crafted by research team
  • Tasks: summarization (50), QA (50), code_generation (50), creative_writing (50), classification (50)
  • Total: 250 samples per variant
  • Scoring: Task-appropriate metrics (see §2.3 in TR124 for metric-task mapping)

Appendix C: Glossary

Term Definition
BERTScore Contextual embedding similarity metric using pre-trained transformer models
BPW Bits per weight -- average precision of quantized model parameters
Cohen's d Effect size metric -- (mean_A - mean_B) / pooled_std; d > 0.8 is "large"
CV Coefficient of Variation -- std / mean; lower = more reproducible
FP16 16-bit floating point -- full precision for most LLM inference
GGUF GPT-Generated Unified Format -- binary format for quantized LLM weights used by llama.cpp and Ollama
GQA Grouped-Query Attention -- multiple query heads share fewer KV heads
MDE Minimum Detectable Effect -- smallest effect size detectable at given power and alpha
MHA Multi-Head Attention -- every attention head has its own K and V projections
MMLU Massive Multitask Language Understanding -- 57-subject knowledge benchmark
Ollama Local LLM inference server using llama.cpp backend with HTTP API
pp Percentage points -- absolute difference in accuracy (e.g., 72% - 60% = 12pp)
Q2_K 2-bit quantization with K-means clustering (GGML format) -- most aggressive
Q3_K_S 3-bit quantization, small variant (GGML format)
Q4_K_M 4-bit quantization with K-means clustering, medium variant (GGML format)
Q5_K_M 5-bit quantization with K-means clustering, medium variant
Q6_K 6-bit quantization with K-means clustering
Q8_0 8-bit quantization (GGML format) -- highest precision quantization
Quality cliff The quant level at which accuracy drops abruptly (typically >9pp in one step)
Rescored accuracy Benchmark accuracy after regex letter extraction from model output
ROUGE-L Recall-Oriented Understudy for Gisting Evaluation using Longest Common Subsequence
SemScore Sentence-level cosine similarity metric with highest human correlation
TTFT Time to First Token -- prompt evaluation latency
VRAM Video RAM -- GPU memory available for model weights and KV cache

References

  • TR123: KV-Cache Production Economics -- Phase-split $/token with cached decode (Banterhearts, Feb 2026)
  • TR124: Quality & Accuracy Baseline -- Backend equivalence, quantization impact, sampling variance (Banterhearts, Feb 2026)
  • TR117: Accuracy Metrics -- ROUGE, BERTScore, SemScore implementations (Banterhearts, 2026)
  • EleutherAI lm-evaluation-harness -- Standard LLM evaluation framework (2023)
  • MMLU: Measuring Massive Multitask Language Understanding (Hendrycks et al., 2021)
  • ARC: Think you have Solved Question Answering? (Clark et al., 2018)
  • SemScore: Automated evaluation using cosine similarity (Aynetdinov & Akbik, 2024)
  • HuggingFace evaluate -- Metric computation library (2023)
  • llama.cpp -- Local LLM inference with GGUF quantization (Gerganov et al., 2023-2026)
  • Ollama -- Local LLM inference server (2023-2026)

End of Technical Report 125 (2-Phase Complete)