TR134-TR137 Decision Whitepaper

Executive guidance for safety-critical LLM deployment

Field	Value
Project	Banterhearts LLM Performance Research
Date	2026-03-08
Version	1.0
Report Type	Decision whitepaper
Audience	Decision makers, product leaders, ML ops leaders
Scope	TR134 (Alignment Under Quantization), TR135 (Concurrency x Safety), TR136 (Cross-Backend Safety), TR137 (Safety Tax Synthesis)
Primary Source	`PublishReady/reports/Technical_Report_Conclusive_134-137.md`

Abstract

This whitepaper distills four technical reports (TR134-TR137) and 74,254 evaluated samples into deployment policy for safety-critical LLM inference. The central question: when optimizing a small LLM for production (quantization, concurrency, backend), which optimization degrades safety alignment, by how much, and for which models? Outcome: six shippable decisions that cover the full safety-optimization space, backed by bootstrap confidence intervals, effect sizes, and per-model profiling across 5 models, 7 quantization levels, 4 concurrency levels, and 4 serving backends.

Boundary conditions (do not skip)

This guidance is valid only under the measured boundary:

Models at or below 7.6B parameters (Llama 3.2 1B/3B, Mistral 7B Instruct v0.3, Qwen 2.5 7B Instruct, Qwen 2.5 3B/1.5B)
Fixed hardware class: NVIDIA RTX consumer GPU, single-GPU inference
Quantization via Ollama GGUF (Q2_K through FP16, 7 levels)
Serving backends: Ollama, vLLM, TGI, HuggingFace Transformers
Safety scored by automated classifiers (RefusalDetector, BiasDetector, TruthfulnessScorer) -- not human annotation
Temperature 0 (deterministic sampling) throughout
Safety tasks: AdvBench refusal (100 samples), TruthfulQA (50), BBQ bias (200), jailbreak amplification (120)
Capability tasks: MMLU (285 samples), ARC-Challenge (200)

If any of these change, re-run the safety evaluation matrix and re-validate before applying these decisions.

Six decisions you can ship now

Quantization floor for safety-critical applications: Q4_K_M. At Q4_K_M, all tested models retain >= 93% of baseline safety. The marginal safety cost at this level is 1.3-4.6pp -- well within acceptable bounds. Q8_0 or FP16 are safer but offer marginal improvement (98-101% retention). Below Q4_K_M, safety degrades model-specifically and unpredictably.
Concurrency scaling is safety-neutral: scale freely. Maximum observed safety effect across all models, tasks, and concurrency levels (N=1 to N=8) is 0.4pp. All jailbreak compliance slopes under concurrency are exactly zero. No safety testing is required for concurrency changes.
Backend migrations are safety-critical changes: re-validate. Switching from Ollama GGUF to vLLM/TGI FP16 can cost 4-25pp of safety depending on the model. No backend pair achieves TOST equivalence at +/-3pp (0/18 tests pass). Treat every backend change as requiring full safety re-evaluation.
Per-model safety profiling is mandatory. I-squared heterogeneity = 99.9% on the quantization axis. Models do not agree on the direction of safety degradation, let alone the magnitude. Llama 1B loses 35pp at Q2_K; Llama 3B gains 6pp. No generic "safe quantization level" guideline is reliable across models.
Ban Q2_K for Llama-class 1B models. All Q2_K configurations for Llama 3.2 1B are rated CRITICAL risk (57.5% safety retention). This is the only model-quant combination that falls below the 80% safety threshold. The ban is model-specific -- other models do not show this catastrophic pattern at Q2_K.
Monitor jailbreak susceptibility under quantization. All four jailbreak techniques (prefix injection, direct, DAN-style, roleplay) become more effective at lower quantization levels (all BPW slopes negative). Prefix injection is the most dangerous (-0.036 compliance/BPW). Jailbreak vulnerability is invariant to concurrency.

Decision matrix (one-glance policy)

Condition	Quantization	Concurrency	Backend	Safety Action
Safety-critical, any model	Q4_K_M or higher	Any (N=1-8)	Current (no change)	Standard monitoring
Safety-critical, Llama 1B	Q4_K_M or higher	Any	Current	Enhanced monitoring
Safety-critical, new model	Profile first	Any	Current	Full safety evaluation before deployment
Cost-optimized, non-critical	Q3_K_S (model-dependent)	Any	Any	Spot-check safety
Backend migration	Keep current quant	Any	Re-validate safety	Full safety evaluation
Scaling concurrency	Keep current quant	Scale freely	Keep current backend	No action needed
Maximum risk tolerance	Q2_K (not for Llama 1B)	Any	Any	Accept CRITICAL risk

Key findings (decision-grade)

Quantization accounts for 57% of total safety cost. Mean absolute delta: 20.6pp across 2 anchor models. Cohen's d = 1.93 for Llama 1B (large effect). However, the 95% bootstrap CI spans [-6.0, 35.2]pp because one model improves under quantization while the other degrades severely. This extreme variance is itself the finding.
Backend choice accounts for 41% of total safety cost. Mean delta: 14.8pp (CI: [4.4, 25.1]pp). The mechanism is chat template divergence between GGUF-embedded and HuggingFace tokenizer templates, not the serving framework itself (vLLM and TGI are functionally identical: d < 0.03). This is a weight format issue, not a software issue.
Concurrency accounts for 2% of total safety cost. Mean delta: 0.1pp (CI: [-0.3, 0.4]pp). This is the only axis where all models agree (I-squared = 0.0%). Deterministic sampling produces identical outputs regardless of concurrent load.
The safety veneer hypothesis is not universally supported. Only 3 of 10 model-axis combinations show safety degrading faster than capability. RLHF alignment is not a uniquely fragile "thin layer" for most models -- safety and capability degrade comparably.
Nationality is the most vulnerable bias category under quantization. BBQ per-category slopes show Nationality at -0.010/BPW (worsening) and Race/Ethnicity at +0.015/BPW (improving). Race-related bias mitigation is more robust to quantization, likely due to heavier emphasis in RLHF training data.
Automated classifiers disagree more at low quantization. Cohen's kappa between regex and LLM judge drops from 0.000 (baseline, but 72% raw agreement) to 0.007 (Q2_K, 58% agreement) on AdvBench. Safety scores at extreme quant levels carry higher measurement uncertainty.

Operational recommendations (policy statements)

Quantization policy

Policy: Q4_K_M is the safety-validated default for all tested models at or below 7.6B parameters.
Policy gate: Use Q8_0 or FP16 when safety retention >= 98% is required.
Policy ban: Never deploy Q2_K for Llama 3.2 1B in any safety-sensitive context (57.5% retention = CRITICAL).
Policy: Always run per-model safety profiling before deploying a new model at any quantization level. Do not extrapolate from other models.

Concurrency policy

Policy: Scale concurrency (N=1 to N=8+) without safety constraints.
Rationale: Zero jailbreak slope, zero bias slope, max 0.4pp overall safety delta across all tested models and tasks.

Backend policy

Policy: Treat any backend migration as a safety-critical change requiring full re-evaluation.
Policy gate: If migrating from Ollama to vLLM/TGI, expect 4-25pp safety impact (model-dependent). Budget for safety testing before production cutover.
Policy: Do not assume vLLM and TGI are safety-equivalent to Ollama merely because they serve the same weights. Chat template handling differs by weight format (GGUF vs HuggingFace FP16).

Jailbreak monitoring

Policy: Test all deployed quantized models against prefix injection and direct prompt jailbreaks.
Policy: Models at Q3_K_S or below require jailbreak testing at deployment and after any weight format change.
Policy: Concurrency scaling does not require jailbreak re-testing.

Bias monitoring

Policy: Monitor Nationality-related bias under quantization -- most vulnerable category.
Policy: Race/Ethnicity bias is the most robust category, but this should not be assumed for untested models.

Risk impact

Safety cost by optimization axis (ranked)

Quantization: 57% of total safety cost. Range: -6pp to +35pp depending on model. The dominant lever and the most dangerous because of extreme model disagreement (I-squared = 99.9%).
Backend: 41% of total safety cost. Range: 4pp to 25pp. Driven by weight format and chat template, not serving framework.
Concurrency: 2% of total safety cost. Range: 0pp to 0.4pp. Negligible and consistent across models.

Worst-case deployment

Model: Llama 3.2 1B at Q2_K, N=1, Ollama
Safety retention: 57.5% (CRITICAL)
Additional backend variance: 25.1pp (if backend changes)
Theoretical floor (Q2_K + TGI): below 30% retention (not directly measured)

Risk distribution across 24 assessed configurations

Risk Level	Count	Percentage
Critical (< 80% retention)	3	12.5%
High (80-90%)	0	0.0%
Moderate (90-95%)	3	12.5%
Low (>= 95%)	18	75.0%

Avoiding Q2_K for Llama 1B eliminates all critical risk. Using Q4_K_M or higher reduces risk to moderate or low for all configurations.

Implementation plan (30-day view)

Days 1-7: establish safety baseline

Run per-model safety profiling at your deployed quantization level using the 4-task safety battery (advbench, jailbreak, bbq, truthfulqa).
Compute safety retention against FP16/Q8_0 baseline for each model.
Flag any model-quant combination below 90% retention for immediate review.

Days 8-14: enforce quantization policy

Set Q4_K_M as default quantization for all safety-sensitive deployments.
Ban Q2_K for Llama-class 1B models. Require explicit risk acceptance for Q2_K on other models.
Document model-specific safety profiles and store alongside deployment configs.

Days 15-21: backend migration protocol

If backend migration is planned (Ollama <-> vLLM/TGI), run full safety evaluation on the target backend before cutover.
Compare safety scores across backends. If delta exceeds 3pp, investigate chat template handling before proceeding.
Create a backend migration safety checklist.

Days 22-30: ongoing monitoring

Add jailbreak resistance (prefix injection + direct) to CI/CD safety gates for quantized models.
Monitor Nationality-bias scores under quantization in production if applicable.
Schedule quarterly re-evaluation when models, quant levels, or backends change.

Risks, limitations, invalidation triggers

Limitations

Two-model anchor set. All cross-axis conclusions rest on Llama 3.2 1B and 3B. Adding models to the cross-axis analysis would improve confidence.
No factorial design. Each TR varied one axis. Interaction effects (e.g., quantization x concurrency synergy) are not measured. The deployment matrix uses an additive model.
Automated scoring only. Safety scores come from regex classifiers with poor LLM judge agreement (kappa = 0.147). Human annotation would improve confidence.
Temperature 0 only. Stochastic sampling may interact differently with optimization axes.
Consumer hardware only. Datacenter GPUs (A100, H_100) may behave differently, especially for vLLM/TGI.
Small models only (<= 7.6B). Larger models may show different quantization resilience.

What invalidates this guidance

Model family not in the tested set (Llama 3.2, Mistral 7B, Qwen 2.5) -- re-profile
Quantization method other than GGUF (e.g., GPTQ, AWQ) -- different degradation profile
Temperature > 0 -- unknown interaction with safety under quantization
Multi-GPU tensor parallelism -- untested serving configuration
Model size > 7.6B parameters -- different redundancy characteristics
Safety classifier change -- scores are not absolute, they depend on the classifier

Evidence anchors (audit-ready)

Decision	Artifact	Validation
Q4_K_M preserves >= 93% safety	`research/tr134/results/phase3/20260305_144827/phase3_analysis.json`	Per-model retention at Q4_K_M: 98.4% (Llama 1B), 93.8% (Llama 3B)
Concurrency is safety-neutral	`research/tr135/results/20260307_162151/tr135_analysis.json`	Max delta 0.4pp, all jailbreak slopes = 0.000
No backend pair equivalent at +/-3pp	`research/tr136/results/20260308_015147/tr136_analysis.json`	0/18 TOST tests pass
Q2_K is CRITICAL for Llama 1B	`research/tr137/results/20260308_180727/tr137_deployment_matrix.csv`	57.5% retention, 3/3 Q2_K configs critical
I-squared = 99.9% on quant axis	`research/tr137/results/20260308_180727/tr137_analysis.json`	Heterogeneity pass, N=2 models
Prefix injection most effective jailbreak	`research/tr134/results/phase3/20260305_144827/phase3_analysis.json`	Slope = -0.036/BPW, steepest of 4 techniques
Nationality most vulnerable bias category	`research/tr134/results/phase3/20260305_144827/phase3_analysis.json`	Avg slope = -0.010/BPW across 4 models
Backend effect is chat template, not framework	`research/tr136/results/20260308_015147/tr136_analysis.json`	vLLM vs TGI d < 0.03; Ollama vs vLLM d = 0.60

References

Conclusive report: PublishReady/reports/Technical_Report_Conclusive_134-137.md
TR134: PublishReady/reports/Technical_Report_134.md (Alignment Under Quantization)
TR135: PublishReady/reports/Technical_Report_135.md (Concurrency x Safety)
TR136: PublishReady/reports/Technical_Report_136.md (Cross-Backend Safety)
TR137: PublishReady/reports/Technical_Report_137.md (Safety Tax Synthesis)
Prior phase whitepaper: PublishReady/reports/Technical_Report_Conclusive_123-133_Whitepaper.md

Optional upgrades (board-ready polish)

Add a 1-page Safety Decision Card: the six-row matrix + boundary conditions + three invalidation triggers.
Add a safety degradation heatmap (model x quant x task) as a visual summary.
Add a Cost-Safety Pareto chart: plot $/token savings vs safety retention for each quant level.
Add a Change Control clause: any model, backend, or quantization change triggers re-run of the safety profiling battery.
Commission human annotation study (500 samples at FP16, Q4_K_M, Q2_K) to calibrate automated classifiers.

Phase 3 Decision Whitepaper