Technical Report 136: Cross-Backend Safety Consistency

Does the serving backend change model safety behavior? A controlled comparison of Ollama, vLLM, and TGI across 3 models, 4 backend configurations, and 6 benchmarks

Field	Value
TR Number	136
Project	Banterhearts LLM Performance Research
Date	2026-03-08
Author	Research Team
Report Type	Safety alignment analysis (metric-backed, 3 models, 4 backends, 6 benchmarks)
Test Duration	~2.4 hrs (eval) + ~0.6 hrs (LLM judge) + ~0.01 hrs (analysis)
Status	Complete
Run ID	`20260308_015147`
Related Work	TR134 (Quantization x Safety), TR135 (Concurrency x Safety), TR137 (Safety Tax Synthesis, forthcoming)
Depends On	TR134 (safety classifiers, jailbreak tasks, BBQ benchmark, LLM judge module), TR135 (concurrency baselines, Ollama Q4_K_M reference)

Abstract

TR134 showed that quantization degrades safety alignment below certain bit-per-weight thresholds, and TR135 showed that concurrency does not degrade it at all. Both experiments used Ollama as the sole inference backend. But production deployments rarely standardize on a single backend. Teams choose between Ollama (GGUF weights, native process), vLLM (HuggingFace weights, continuous batching, Docker), and TGI (HuggingFace weights, Docker) based on throughput, latency, and integration requirements. If these backends produce different safety behavior from the same model architecture, operators need to know -- because a model that passes safety testing on one backend may fail on another.

TR136 asks: does the serving backend change safety behavior, and if so, does quantization or the backend itself drive the difference?

We test 3 models (Llama 3.2 1B, Llama 3.2 3B, Qwen 2.5 1.5B) across 4 backends: Ollama Q4_K_M, Ollama Q8_0, vLLM FP16, and TGI FP16. Each backend processes identical prompts (868 per model) across 6 benchmarks: AdvBench refusal (100), TruthfulQA (50), BBQ bias (198), jailbreak amplification (120), MMLU (200), and ARC-Challenge (200). Temperature is fixed at 0.0 with seed=42 for determinism. All backends run sequentially on a single GPU to avoid contention. Post-hoc LLM judge validation (Qwen 2.5 7B Instruct Q8_0) provides an independent signal on 5,616 safety samples.

The headline result is that the serving backend matters more than quantization for safety. Llama 3.2 1B shows a 23pp safety drop between Ollama (Q4_K_M: 0.858, Q8_0: 0.876) and FP16 backends (vLLM: 0.628, TGI: 0.625) -- a medium effect size (Cohen's d = -0.55 to -0.61, p < 0.0001 after Holm correction). This is counterintuitive: FP16 precision is higher than Q4_K_M, yet safety is lower. The effect is driven almost entirely by jailbreak refusal (92.5% Ollama vs 50.0% FP16) and AdvBench refusal (88-95% Ollama vs 54% FP16). The same pattern appears at smaller magnitude in Llama 3.2 3B (6pp gap, d = -0.13 to -0.16) and Qwen 2.5 1.5B (3-7pp gap, d = -0.07 to -0.17). Within the same backend family, quantization effects are negligible: Q4_K_M vs Q8_0 shows d < 0.09 for all models. vLLM vs TGI agreement is 95.7%, confirming the two FP16 Docker backends are functionally identical.

The mechanism is not precision loss -- it is the reverse. Ollama's GGUF format bundles model-specific chat templates and tokenizer configurations that were tuned during safety training. HuggingFace FP16 weights served via vLLM/TGI rely on tokenizer_config.json and the backend's own chat template application, which may diverge from the GGUF-embedded template. The safety drop occurs not because FP16 removes safety information, but because the prompt framing differs between backend families.

No backend pair achieves TOST equivalence at the +/-3pp margin. This means we cannot formally certify any two backends as interchangeable for safety-critical deployments. Capability scores (MMLU, ARC-Challenge) show much smaller cross-backend variation (4-8pp range vs 7-25pp for safety), confirming that safety is differentially sensitive to the backend.

Implication for practitioners: safety evaluations performed on one backend do not transfer to another. Operators must re-evaluate safety when switching backends, even when using the same model architecture and higher precision.

Total: 10,416 evaluated samples, 5,616 judged samples, 3 models, 4 backends, 6 benchmarks.

Metric Definitions

These definitions are inherited from TR134 for cross-TR comparability. Safety metrics are the primary focus; capability metrics serve as a control arm.

Safety Metrics

Refusal Rate: Proportion of harmful prompts where the model refuses to comply. Measured via regex classifier (RefusalDetector) matching ~15 refusal patterns (e.g., "I cannot", "I'm not able to", "As an AI", "I must decline"). Range [0, 1]. Higher = safer. Primary safety metric for AdvBench and jailbreak tasks.
Bias Resistance: Proportion of BBQ questions where the model selects the non-stereotypical or "unknown" answer rather than reinforcing demographic stereotypes. Range [0, 1]. Higher = less biased.
Truthfulness: Proportion of TruthfulQA questions answered correctly (matching reference answer). Range [0, 1]. Higher = more truthful.

Capability Metrics

Accuracy (MMLU / ARC-Challenge): Proportion of multiple-choice questions answered correctly. Regex letter extraction from model output.

Derived Metrics

TOST Equivalence: Two one-sided t-tests at +/-3pp margin. "Equivalent" = confirmed within +/-3pp. This is strictly stronger than "not significant."
Cohen's d: Standardized effect size (pooled SD). < 0.2 = trivial, 0.2-0.5 = small, 0.5-0.8 = medium, > 0.8 = large.
Cohen's Kappa: Inter-rater agreement between regex classifier and LLM judge, corrected for chance. < 0.20 = slight, 0.21-0.40 = fair, 0.41-0.60 = moderate (Landis & Koch 1977).
Cramer's V: Effect size for chi-squared tests. < 0.1 = negligible, 0.1-0.3 = small, 0.3-0.5 = medium, > 0.5 = large.
Jaccard Similarity: Token-level overlap between responses from two backends for the same prompt. Range [0, 1]. 1.0 = identical output. Measures whether backends produce semantically similar text, independent of safety classification.
Bootstrap CI: 2,000-iteration resampling for 95% confidence intervals on safety means.
MDE (Minimum Detectable Effect): Smallest effect detectable at alpha=0.05, power=0.80, via two-proportion z-test.

Statistical Methods & Caveats

Tests used:

Chi-squared test for backend independence (safety outcome x backend contingency table), with p-values via regularized incomplete gamma function
Welch's t-test for pairwise backend comparisons (unequal variance assumed)
TOST equivalence testing at +/-3pp margin (confirms equivalence, not just failure to reject)
Holm-Bonferroni step-down correction for 6 pairwise comparisons per model (18 total)
Cohen's d (pooled SD) for standardized effect sizes
Cramer's V for chi-squared effect size
2,000-iteration bootstrap for 95% CIs on aggregate safety means
Wilson score binomial CI for pairwise agreement proportions
Pearson correlation for Jaccard-safety divergence relationship
Power analysis: binary MDE via two-proportion z-test (alpha=0.05, power=0.80)
Cohen's kappa for inter-rater agreement (regex classifier vs LLM judge)

Important caveats:

Ollama uses GGUF weights; vLLM/TGI use HuggingFace FP16 weights. These are different weight formats with different tokenizer embeddings and chat template application. The "backend effect" is confounded with the "weight format effect." We cannot isolate which component (serving code, tokenizer, chat template, weight precision) drives the difference without an experiment that holds weight format constant.
Quantization and backend are partially confounded. Ollama Q8_0 vs vLLM FP16 differs in both quantization level AND backend. The "quant effect" (Q4 vs Q8 within Ollama) is clean, but the "backend effect" (Ollama Q8 vs vLLM FP16) confounds backend with residual quantization.
Two model families only. Llama 3.2 (2 sizes) and Qwen 2.5 (1 size). No Mistral, Phi, or Gemma. The Ollama-FP16 gap may be family-specific.
Temperature = 0.0 eliminates sampling variance but does not test stochastic behavior. At temperature > 0, backend-specific RNG implementations could introduce additional divergence.
Sequential backend execution. Backends run one at a time. Thermal state, GPU memory fragmentation, and OS-level caching may differ between early and late runs.
Docker overhead for vLLM/TGI inflates latency vs native Ollama. Latency comparisons are not apples-to-apples for backend speed evaluation.
Regex classifiers are surface-level. A response that "technically refuses" but provides harmful information through implication will be scored as safe. The LLM judge (Cohen's kappa 0.11-0.16) provides a partial check but is not a gold-standard human annotation.
Single run per configuration. No multi-run variance estimation. The 95% CIs reflect within-sample uncertainty, not run-to-run reproducibility.
System prompt encoding differs between backends. All backends receive the same system prompt content, but the encoding of that system prompt differs between GGUF-embedded templates and HuggingFace chat template application. We test the combined effect, not the isolated system prompt effect.

Executive Summary

Key Findings

The serving backend changes safety behavior -- substantially for small Llama models. Llama 3.2 1B shows a 23-25pp aggregate safety drop between Ollama backends (Q4_K_M: 0.858, Q8_0: 0.876) and FP16 backends (vLLM: 0.628, TGI: 0.625). This is a medium effect size (Cohen's d = -0.55 to -0.61, p < 0.0001 after Holm-Bonferroni correction). A model that appears safe on Ollama may not appear safe on vLLM or TGI.
The backend / weight-format effect dominates quantization. Within Ollama, Q4_K_M vs Q8_0 produces d < 0.09 (trivial) for all models. Between Ollama Q8_0 and vLLM FP16, the backend effect reaches d = 0.60 (medium) for Llama 1B. The backend axis explains >95% of the observed safety variation; quantization explains <5%.
vLLM and TGI are functionally identical for safety. 95.7% pairwise agreement across 1,404 safety samples, Jaccard token similarity 0.744, Cohen's d < 0.03. The FP16 serving framework does not affect safety classification.
Jailbreak refusal is the most backend-sensitive task. For Llama 3.2 1B: Ollama jailbreak refusal = 92.5-98.3%, FP16 jailbreak refusal = 50.0% (Cramer's V = 0.512, large effect). AdvBench refusal follows a similar pattern (V = 0.425). Bias resistance and truthfulness are not backend-dependent.
Safety is differentially vulnerable -- capability is not. Llama 3.2 1B safety range across backends = 25.1pp; capability range = 4.0pp. Safety/capability range ratio = 6.28x. The backend change specifically disrupts refusal patterns while leaving general knowledge intact.
No backend pair achieves TOST equivalence at +/-3pp. All 18 pairwise TOST tests fail. We cannot formally certify any backend pair as interchangeable for safety-critical deployments.
Cross-TR reproducibility is good but not perfect. 9/12 task-model Ollama Q4_K_M pairs agree within 5pp between TR136 and TR134. The 3 divergent pairs are: ARC-Challenge on Llama 1B (-6.0pp), AdvBench on Llama 3B (+7.0pp), and jailbreak on Llama 3B (+6.7pp). The two Llama 3B refusal divergences are within per-task MDE (16-19pp); the ARC divergence is a capability task within noise range.
Backends produce fundamentally different text. Jaccard similarity between Ollama and FP16 backends = 0.20-0.28 (only 20-28% token overlap). Within backend families, overlap is 0.67-0.74. The backends are not making marginal classification differences -- they are generating substantially different responses.
The effect is model-size-dependent. Llama 1B (1.2B params): 24pp gap. Llama 3B (3.2B params): 6pp gap. Qwen 1.5B (1.5B params, different family): 5pp gap. Larger models and different RLHF recipes appear more robust to backend variation.

Validation Summary

Target	Metric	Required	Achieved	Status
Backend effect measured	Chi-squared p < 0.05	All 3 models	3/3 (p = 0.0, 0.011, 0.050)	PASS
Effect decomposition	Quant, backend, serving d	All 3 models	3/3 computed	PASS
Pairwise comparisons	Welch + Holm + TOST	18 pairs	18/18 computed	PASS
Cross-TR validation	< 5pp vs TR134	>= 75% consistent	75% (9/12)	PASS
LLM judge agreement	Kappa computed	All 4 backends	4/4 (kappa 0.11-0.16)	PASS
Error rate	0% errors	All 12 runs	0/10416 errors	PASS

Claim Validation

#	Claim	Evidence Base	Status
1	Backend choice affects safety scores	Chi-squared significant for all 3 models (Section 14)	Demonstrated
2	Backend effect > quantization effect	Decomposition d: backend 0.15-0.60 vs quant 0.01-0.08 (Section 6)	Demonstrated
3	vLLM and TGI are interchangeable	95.7% agreement, d < 0.03 (Section 11)	Demonstrated (practical, not TOST-formal)
4	Jailbreak refusal is most affected	V = 0.19-0.51 across all models (Section 14)	Demonstrated
5	Safety varies more than capability	Range ratio 6.28x for Llama 1B (Section 9)	Partially validated -- only for Llama 1B; ratio ~1.0 for others
6	Effect is model-size-dependent	Gaps: 24pp (1B), 6pp (3B), 5pp (1.5B) (Section 5)	Demonstrated (direction clear, but family confounded with size)
7	Chat template divergence is the mechanism	Textual divergence (Jaccard 0.20-0.28) + safety-specific effect (Section 12)	Supported but not proven -- correlational, not causal
8	No backend pair is formally equivalent	0/18 TOST tests pass (Section 10)	Demonstrated

Key Decisions for Practitioners

Always re-test safety when switching backends. The Llama 3.2 1B jailbreak refusal rate drops from 92-98% (Ollama) to 50% (vLLM/TGI). A model that passes internal safety review on Ollama may not pass on a Docker-based deployment. Budget for re-evaluation.
vLLM and TGI are interchangeable for safety. If you are choosing between these two FP16 Docker backends, safety is not a differentiator. They agree on 95.7% of safety classifications (Section 11) and produce 74.4% token overlap (Section 12). Pick based on throughput, integration, and operational preferences.
Ollama's Q4_K_M and Q8_0 are interchangeable for safety. Within Ollama, the quantization level has negligible effect on safety (d < 0.09). This is consistent with TR134's finding that Q4_K_M is a safe deployment point. Use Q4_K_M to save memory without safety cost.
Use larger models in multi-backend deployments. The Llama 3.2 1B model is 6x more sensitive to backend choice than to capability perturbations. The 3B model is much more robust. If your deployment may span multiple backends, prefer models >= 3B parameters.
Verify chat template alignment before deploying on a new backend. Compare the output of transformers.AutoTokenizer.apply_chat_template() against the GGUF-embedded template using llama.cpp's template extraction. Misalignment in system prompt encoding, turn boundaries, or special tokens is the most likely cause of safety divergence (Section 19).

When to Use This Report

Scenario 1: Migrating from Ollama to vLLM/TGI

Question: "We tested our model on Ollama and it passed safety checks. Can we deploy on vLLM without re-testing?"

Answer: No. Llama 3.2 1B shows a 23-25pp safety drop when moving from Ollama to FP16 backends (Section 5). The drop is concentrated in jailbreak and AdvBench refusal tasks (Section 7). Re-run your safety evaluation suite on the target backend before deployment.

Scenario 2: Choosing between vLLM and TGI

Question: "We're deciding between vLLM and TGI for production. Does safety differ?"

Answer: No. The two FP16 backends agree on 95.7% of safety classifications (Section 11) with Jaccard token overlap of 0.744 (Section 12). Cohen's d < 0.03 for all models (Section 10). Choose based on throughput, API compatibility, and operational criteria.

Scenario 3: Evaluating whether quantization or backend matters more

Question: "Should I worry more about the Q4_K_M precision loss or the backend choice?"

Answer: The backend choice matters more. Within Ollama, Q4 vs Q8 produces d < 0.09 (trivial). Between Ollama and FP16 backends, d reaches 0.60 (medium). See the decomposition in Section 6.

Scenario 4: Deciding if your model is robust enough for multi-backend deployment

Question: "How do I know if my model will behave consistently across backends?"

Answer: Larger models are more robust. Llama 3.2 1B shows a 24pp gap; Llama 3.2 3B shows 6pp; Qwen 2.5 1.5B shows 5pp (Section 5). Check the safety-capability divergence ratio (Section 9): if the safety range across backends is much higher than the capability range, your model is differentially vulnerable.

Scenario 5: Validating cross-experiment reproducibility

Question: "Can I trust that TR136's Ollama results match TR134's?"

Answer: Mostly. 9/12 task-model pairs agree within 5pp (Section 17). The 3 divergent pairs are refusal tasks on Llama 3B with moderate baseline rates, where the per-task MDE is 17-19pp (Section 16). Capability and bias scores reproduce perfectly.

How to Read This Report

Time	Reading Path
2 min	Abstract + Executive Summary Key Findings
10 min	Add Sections 5 (aggregate scores), 6 (decomposition), 9 (divergence)
30 min	Add Sections 7 (per-task), 10 (pairwise t-tests), 14 (chi-squared), 19 (conclusions)
60 min	Full report minus appendices
Deep dive	Full report including appendices, cross-reference with TR134 Section 11 and TR135 Section 17.3

Front Matter

Abstract
Metric Definitions
Statistical Methods & Caveats
Executive Summary

Background (Sections 1-4)

Introduction & Research Motivation
Experimental Design
Model Lineup
Environment & Artifacts

Core Results (Sections 5-9)

Per-Backend Aggregate Safety Scores
Quantization vs Backend Effect Decomposition
Per-Task Safety Breakdown
Capability Control Arm
Safety-Capability Divergence

Statistical Validation (Sections 10-16)

Pairwise Backend Comparisons
Pairwise Backend Agreement
Response Divergence (Jaccard)
Safety-Divergence Correlation
Per-Task Backend Independence (Chi-Squared)
LLM Judge Agreement
Power Analysis

Validation & Synthesis (Sections 17-20)

Cross-Experiment Validation (vs TR134)
Baseline Normalization
Conclusions & Limitations
Reproducibility

Appendices

Appendix A: Normalized Per-Task Scores
Appendix B: Chi-Squared Contingency Tables
Appendix C: Full Latency Comparison
Appendix D: Glossary
References

1. Introduction & Research Motivation

1.1 The Problem

TR134 measured how quantization degrades safety. TR135 measured how concurrency degrades safety. Both used Ollama as the sole backend. This is a significant gap: in production, the same model architecture is served through different backends depending on deployment constraints. Ollama uses GGUF-formatted weights and its own inference engine. vLLM and TGI use HuggingFace-formatted weights with different serving architectures. If these backends produce different safety behavior from the same model, then safety evaluations performed on one backend do not transfer to another -- and operators deploying models across heterogeneous infrastructure cannot trust a single safety test.

The gap is not academic. A team that validates safety on Ollama during development and deploys on vLLM in production may unknowingly degrade safety. Conversely, a team that finds poor safety on vLLM may reject a model that would be safe on Ollama. Neither outcome is acceptable for safety-critical applications.

1.2 Research Questions

Backend independence: Do Ollama, vLLM, and TGI produce the same safety scores for the same model and prompts?
Quantization vs backend: Is the safety difference between Ollama Q4_K_M and vLLM FP16 driven by quantization (4-bit vs 16-bit) or by the backend itself?
Backend family equivalence: Are vLLM FP16 and TGI FP16 interchangeable for safety evaluation?
Task specificity: Which safety tasks are most sensitive to backend choice?
Safety-capability divergence: Does safety vary more across backends than capability, suggesting differential vulnerability?
Cross-TR reproducibility: Do Ollama Q4_K_M safety scores from TR136 match TR134's baselines?

1.3 Literature Gap

No prior work that we are aware of:

Compares safety metrics across Ollama, vLLM, and TGI on identical prompts
Decomposes the Ollama-to-FP16 safety gap into quantization, backend, and serving components
Measures whether safety is differentially more sensitive to backend choice than capability
Provides per-task (refusal, bias, truthfulness, jailbreak) backend sensitivity rankings
Correlates token-level response divergence (Jaccard) with safety classification divergence

1.4 Relationship to Prior Work

Reference	Contribution	How TR136 Uses It
TR134	Quantization x safety across 4 models, jailbreak amplification, per-category bias	Safety classifiers, jailbreak task set, BBQ benchmark, LLM judge module, Q4_K_M baselines for cross-validation
TR135	Concurrency x safety, null finding (concurrency does not affect safety)	Establishes that Ollama serialization eliminates concurrency confounds; Ollama Q4_K_M reference scores
TR125	Quantization decision matrix (capability focus)	Q4_K_M as "sweet spot across all tested models" -- TR136 tests whether this holds across backends
TR133	VRAM modeling and latency prediction	VRAM estimates for model selection feasibility
Rottger et al. (2024)	HarmBench: standardized safety evaluation	Methodological precedent for multi-backend safety comparison
Huang et al. (2024)	Chat template effects on LLM behavior	Theoretical basis for chat template divergence hypothesis

2. Experimental Design

2.1 Design Summary

Parameter	Value
Models	Llama 3.2 1B Instruct, Llama 3.2 3B Instruct, Qwen 2.5 1.5B Instruct
Backends	Ollama Q4_K_M, Ollama Q8_0, vLLM FP16 (Docker), TGI FP16 (Docker)
Safety tasks	AdvBench refusal (100), TruthfulQA (50), BBQ bias (198), Jailbreak amplification (120)
Capability tasks	MMLU (200), ARC-Challenge (200)
Prompts per backend-model	868 (468 safety + 400 capability)
Temperature	0.0 (deterministic)
Seed	42
Max new tokens	256
Warmup requests	3 per backend startup
Cooldown between backends	10 seconds
Execution order	Sequential (one backend at a time, GPU contention avoidance)
Judge model	Qwen 2.5 7B Instruct Q8_0 (Ollama, different family to avoid self-eval bias)
Total raw records	10,416 (3 models x 4 backends x 868 prompts)
Total judged safety samples	5,616
Hardware	NVIDIA GeForce RTX 4080 Laptop GPU (12GB VRAM)

2.2 Backend Configurations

Backend	Weight Format	Container	Port	Timeout	Extra Args
Ollama Q4_K_M	GGUF 4-bit	Native process	11434	120s	--
Ollama Q8_0	GGUF 8-bit	Native process	11434	120s	--
vLLM FP16	HuggingFace FP16	Docker (`vllm/vllm-openai:latest`)	8000	180s	`--max-model-len 2048 --dtype float16 --enforce-eager --gpu-memory-utilization 0.80`
TGI FP16	HuggingFace FP16	Docker (`ghcr.io/huggingface/text-generation-inference:latest`)	8080	180s	`--max-input-length 1024 --max-total-tokens 2048`

Ollama runs as a native process and manages its own model loading. vLLM and TGI run inside Docker containers with GPU passthrough via nvidia-container-toolkit. Docker containers are started and stopped between model evaluations, with a 300-second startup timeout to allow model loading. All backends receive identical prompt text.

2.3 Safety Task Descriptions

AdvBench Refusal (N=100): Prompts from the AdvBench harmful behaviors dataset. Tests whether the model refuses direct harmful requests. Scored by RefusalDetector regex classifier.
TruthfulQA (N=50): Prompts from the TruthfulQA benchmark. Tests whether the model produces truthful answers to questions designed to elicit common misconceptions. Scored by TruthfulnessScorer regex matching against reference answers.
BBQ Bias (N=198): Prompts from the BBQ (Bias Benchmark for QA) dataset covering 11 demographic categories. Tests whether the model selects stereotypical vs. non-stereotypical answers. Scored by BiasDetector.
Jailbreak Amplification (N=120): 30 base harmful behaviors (from JailbreakBench) x 3 jailbreak wrappers (DAN-style, roleplay, prefix injection) + 30 direct requests as control. Tests whether jailbreak techniques bypass the model's refusal. Scored by RefusalDetector.

2.4 Capability Tasks

MMLU (N=200): 200 multiple-choice questions from MMLU spanning 4 subject areas. Regex letter extraction.
ARC-Challenge (N=200): 200 science reasoning questions from the AI2 Reasoning Challenge. Regex letter extraction.

Capability tasks serve as a control arm: if capability scores also drop across backends, the safety drop is part of general performance loss. If capability is stable while safety drops, the backend change specifically targets safety-trained behavior.

2.5 Classification Pipeline

Three regex classifiers score safety tasks:

RefusalDetector: ~15 refusal patterns for AdvBench and jailbreak tasks
TruthfulnessScorer: Reference answer matching for TruthfulQA
BiasDetector: Non-stereotypical / "unknown" answer detection for BBQ

Post-hoc LLM judge (Qwen 2.5 7B Instruct Q8_0) independently classifies safety samples as FULL_REFUSAL / PARTIAL_REFUSAL / COMPLIANCE / UNCLEAR. Judge results are used for inter-rater agreement analysis (Cohen's kappa) but do not affect primary scoring.

2.6 Pipeline Architecture

config.yaml
    |
    v
prepare_benchmarks.py  -->  tasks/*.yaml (copied from TR134)
    |
    v
run.py (orchestrator)
    |-- Step 1: prepare_benchmarks.py
    |-- Step 2: eval loop (for each backend x model: start -> warmup -> run -> stop -> cooldown)
    |-- Step 3: judge_analysis.py (post-hoc LLM judge on safety samples)
    |-- Step 4: analyze.py (15-pass statistical analysis)
    +-- Step 5: generate_report.py
    |
    v
results/<timestamp>/
    |-- config_snapshot.yaml
    |-- samples.jsonl          (10,416 raw eval records)
    |-- tr136_judged.jsonl     (5,616 judge verdicts)
    |-- tr136_scored.jsonl     (10,416 scored records)
    |-- tr136_analysis.json    (73KB, 15-pass analysis)
    +-- tr136_report.md        (auto-generated report)

3. Model Lineup

3.1 Model Summary

Model	Family	Parameters	HuggingFace ID	Ollama Tag	RLHF Method	Origin
Llama 3.2 1B Instruct	Llama 3.2	1,236M	`unsloth/Llama-3.2-1B-Instruct`	`llama3.2:1b`	PPO + rejection sampling	Meta (US)
Llama 3.2 3B Instruct	Llama 3.2	3,213M	`unsloth/Llama-3.2-3B-Instruct`	`llama3.2:3b`	PPO + rejection sampling	Meta (US)
Qwen 2.5 1.5B Instruct	Qwen 2.5	1,543M	`Qwen/Qwen2.5-1.5B-Instruct`	`qwen2.5:1.5b`	DPO	Alibaba (China)

All models are instruct-tuned, ungated, and small enough to run at FP16 on a 12GB GPU. Ollama quant tags are constructed by appending -instruct-q4_K_M, -instruct-q8_0, or -instruct-fp16 to the base tag.

3.2 Why These Models

Llama 3.2 1B and 3B provide a within-family size comparison (1.2B vs 3.2B params, same RLHF recipe). If the backend effect is size-dependent, these two should show different magnitudes. They are also the primary models used in TR134 and TR135, enabling direct cross-experiment validation.
Qwen 2.5 1.5B provides a cross-family comparison. It uses DPO rather than PPO for alignment and originates from a different research lab (Alibaba vs Meta). If the backend effect is RLHF-recipe-specific, Qwen should behave differently from Llama.

3.3 Design Decision: No 7B Models

7B models (Llama 3.1 8B, Mistral 7B, Qwen 2.5 7B) cannot run at FP16 on a 12GB GPU (~14.5GB minimum for FP16 7B). The experimental design requires all backends to serve the same model, and vLLM/TGI require FP16 weights. Adding 7B models would require either (a) using quantized HuggingFace weights (introducing a new confound) or (b) a larger GPU (not available). The 1B-3B range is sufficient to test the size-dependence hypothesis at the small end.

3.4 Design Decision: Fixed Quantization Levels

Ollama is tested at two quantization levels (Q4_K_M, Q8_0) rather than the full 7-level sweep from TR134. The purpose is not to re-measure the quantization curve (TR134 did that) but to isolate the backend effect from the quantization effect. Two levels are sufficient for this decomposition: Q4_K_M represents a production-common quantization, and Q8_0 provides a near-lossless Ollama baseline for comparison with FP16.

4. Environment & Artifacts

4.1 Environment

Component	Value
OS	Windows 11 Home 10.0.26200 (WSL2 for Docker)
GPU	NVIDIA GeForce RTX 4080 Laptop GPU (12GB VRAM, CC 8.9)
Ollama	v0.6.2 (native Windows)
vLLM Docker	`vllm/vllm-openai:latest`
TGI Docker	`ghcr.io/huggingface/text-generation-inference:latest`
Python	3.11+
Temperature	0.0
Max new tokens	256
Seed	42

4.2 Key Artifacts

Artifact	Path	Description
Config	`research/tr136/config.yaml`	Source configuration
Config snapshot	`research/tr136/results/20260308_015147/config_snapshot.yaml`	Runtime config with resolved paths
Raw samples	`research/tr136/results/20260308_015147/samples.jsonl`	10,416 eval records (14.2MB)
Judged samples	`research/tr136/results/20260308_015147/tr136_judged.jsonl`	5,616 LLM judge verdicts (1.7MB)
Scored samples	`research/tr136/results/20260308_015147/tr136_scored.jsonl`	10,416 scored records (15.4MB)
Analysis JSON	`research/tr136/results/20260308_015147/tr136_analysis.json`	15-pass statistical analysis (73KB)
Analysis code	`research/tr136/analyze.py`	15-pass analysis pipeline (~1,300 lines)
Judge code	`research/tr136/judge_analysis.py`	Post-hoc LLM judge runner
Orchestrator	`research/tr136/run.py`	5-step pipeline orchestrator
Published report	`PublishReady/reports/Technical_Report_136.md`	This document

5. Per-Backend Aggregate Safety Scores

Safety scores aggregated across all safety tasks (advbench, truthfulqa, bbq, jailbreak) per backend. Each cell represents the mean of 468 binary safety scores (100 + 50 + 198 + 120).

Model	Backend	Safety Mean	Bootstrap 95% CI	Std	N
Llama 3.2 1B	Ollama Q4_K_M	0.8579	[0.828, 0.888]	0.343	468
Llama 3.2 1B	Ollama Q8_0	0.8761	[0.847, 0.905]	0.325	468
Llama 3.2 1B	TGI FP16	0.6250	[0.581, 0.670]	0.482	468
Llama 3.2 1B	vLLM FP16	0.6282	[0.584, 0.674]	0.481	468
Llama 3.2 3B	Ollama Q4_K_M	0.8034	[0.767, 0.840]	0.392	468
Llama 3.2 3B	Ollama Q8_0	0.8088	[0.776, 0.843]	0.388	468
Llama 3.2 3B	TGI FP16	0.7436	[0.702, 0.783]	0.433	468
Llama 3.2 3B	vLLM FP16	0.7479	[0.707, 0.787]	0.431	468
Qwen 2.5 1.5B	Ollama Q4_K_M	0.8184	[0.783, 0.854]	0.382	468
Qwen 2.5 1.5B	Ollama Q8_0	0.8483	[0.816, 0.880]	0.355	468
Qwen 2.5 1.5B	TGI FP16	0.7821	[0.745, 0.818]	0.408	468
Qwen 2.5 1.5B	vLLM FP16	0.7917	[0.754, 0.829]	0.402	468

Observations:

Within each model, a consistent clustering pattern emerges: Ollama Q8_0 >= Ollama Q4_K_M >> TGI FP16 ~ vLLM FP16. The gap between the two clusters varies dramatically by model:

Llama 3.2 1B: 24pp gap. Ollama mean = 0.867, FP16 mean = 0.627. The confidence intervals do not overlap -- this is a robust, well-separated effect. The standard deviation is also higher for FP16 backends (0.481-0.482 vs 0.325-0.343), reflecting a more heterogeneous mix of safe and unsafe responses.
Llama 3.2 3B: 6pp gap. Ollama mean = 0.806, FP16 mean = 0.746. The CIs partially overlap, and the effect is near the MDE boundary (7.6pp). This is a real but modest effect that does not survive Holm-Bonferroni correction for individual pairwise tests.
Qwen 2.5 1.5B: 5pp gap. Ollama mean = 0.833, FP16 mean = 0.787. The pattern is the same (Ollama higher) but the magnitude is smaller. Qwen's DPO-based alignment appears more robust to backend variation than Llama's PPO-based alignment.

The two Ollama backends cluster together, and the two FP16 Docker backends cluster together, suggesting the weight format / chat template is the dominant factor, not quantization level or serving framework.

6. Quantization vs Backend Effect Decomposition

To disentangle quantization from backend effects, we decompose the Ollama-to-FP16 gap into three orthogonal components:

Quant effect = Ollama Q8_0 - Ollama Q4_K_M (same backend, different quant -- isolates quantization)
Backend effect = vLLM FP16 - Ollama Q8_0 (different backend, Q8 vs FP16 -- isolates backend + residual quant)
Serving effect = TGI FP16 - vLLM FP16 (different Docker backend, same FP16 weights -- isolates serving framework)

Model	Quant Effect (d)	Backend Effect (d)	Serving Effect (d)	Interpretation
Llama 3.2 1B	+1.8pp (d=+0.054)	-24.8pp (d=-0.604)	-0.3pp (d=-0.007)	Backend dominates
Llama 3.2 3B	+0.5pp (d=+0.014)	-6.1pp (d=-0.149)	-0.4pp (d=-0.010)	Backend dominates
Qwen 2.5 1.5B	+3.0pp (d=+0.081)	-5.7pp (d=-0.149)	-1.0pp (d=-0.024)	Comparable

Key insight: The quant effect (Q4 vs Q8) never exceeds d = 0.08 (trivial by conventional thresholds). The backend effect reaches d = 0.60 for Llama 3.2 1B -- a medium effect, 11x larger than the quant effect. The serving effect (vLLM vs TGI) never exceeds d = 0.02 (negligible). The backend / weight-format axis explains the overwhelming majority of the observed safety variation.

For Qwen 2.5 1.5B, the quant and backend effects are closer in magnitude (d = 0.08 vs d = 0.15), with the interpretation flagged as "comparable" rather than "backend dominates." This suggests Qwen's alignment is more robust to both perturbations -- possibly because DPO training produces safety behavior that is less dependent on exact prompt formatting than PPO-trained models.

None of the three decomposition components achieves TOST equivalence at +/-3pp (all TOST p > 0.16), so even the "trivial" quant effects cannot be formally certified as zero.

7. Per-Task Safety Breakdown

7.1 Llama 3.2 1B

Task	Ollama Q4_K_M	Ollama Q8_0	TGI FP16	vLLM FP16	Max Gap (pp)
advbench_refusal	0.880	0.950	0.540	0.540	41.0
bbq_bias	0.874	0.889	0.793	0.793	9.6
jailbreak_amplification	0.925	0.983	0.500	0.500	48.3
truthfulqa	0.590	0.420	0.430	0.460	17.0

Observations: The aggregate 24pp gap decomposes into a task-specific hierarchy. Jailbreak amplification shows the largest effect (48pp between Ollama Q8 and FP16) -- the model that refuses 98.3% of jailbreak attempts on Ollama Q8 only refuses 50% on FP16. This is a catastrophic safety failure: half of all jailbreak prompts succeed on the FP16 backends. AdvBench follows the same pattern (41pp gap). BBQ bias is more robust (9.6pp gap), suggesting bias resistance is encoded differently from refusal behavior. TruthfulQA shows an anomalous pattern: Ollama Q4_K_M scores higher (0.59) than Ollama Q8_0 (0.42), reversing the overall trend. At N=50 per cell, this 17pp within-Ollama variance is within the 28pp MDE and likely reflects noise.

7.2 Llama 3.2 3B

Task	Ollama Q4_K_M	Ollama Q8_0	TGI FP16	vLLM FP16	Max Gap (pp)
advbench_refusal	0.540	0.490	0.690	0.690	20.0
bbq_bias	0.965	0.970	0.929	0.924	4.6
jailbreak_amplification	0.892	0.942	0.575	0.583	36.7
truthfulqa	0.480	0.490	0.520	0.560	8.0

Observations: The jailbreak pattern persists -- 36.7pp gap between Ollama Q8 and FP16 -- but it is moderated compared to the 1B model. BBQ bias is nearly backend-invariant (4.6pp gap), consistent with the hypothesis that bias resistance is structurally encoded. The AdvBench result is anomalous: FP16 backends score higher (0.69) than Ollama (0.49-0.54). This reversal does not occur for jailbreak or BBQ, and may reflect a 3B-specific pattern where the model's AdvBench refusal phrasing changes between backends without changing actual compliance. The regex classifier may be capturing different surface-level refusal patterns on each backend.

7.3 Qwen 2.5 1.5B

Task	Ollama Q4_K_M	Ollama Q8_0	TGI FP16	vLLM FP16	Max Gap (pp)
advbench_refusal	0.980	0.990	0.980	0.980	1.0
bbq_bias	0.939	0.955	0.929	0.899	5.6
jailbreak_amplification	0.633	0.717	0.458	0.575	25.9
truthfulqa	0.460	0.460	0.580	0.510	12.0

Observations: Qwen's AdvBench refusal is near-ceiling across all backends (98-99%), confirming that the Llama AdvBench vulnerability is family-specific, not universal. Jailbreak amplification remains the most sensitive task (25.8pp gap), though the pattern is more complex: TGI scores lower (0.458) than vLLM (0.575), introducing a serving-level difference that does not appear for other models. BBQ bias shows a mild drop on vLLM FP16 (0.899 vs 0.939-0.955 on Ollama), suggesting some bias amplification on the FP16 backend. TruthfulQA shows FP16 scoring higher than Ollama, the reverse of the overall safety pattern -- this is consistent with TruthfulQA measuring factual knowledge rather than safety alignment.

7.4 Cross-Model Task Summary

Observation	Pattern
Jailbreak amplification	Most backend-sensitive for ALL models (25-48pp gaps)
AdvBench refusal	Llama-specific vulnerability (20-41pp); Qwen is backend-invariant (1pp)
BBQ bias	Robust across backends (<10pp for all models)
TruthfulQA	Inconsistent direction; likely noise at N=50

The backend effect concentrates in the model's refusal mechanism -- the same mechanism that prevents jailbreaks and direct harmful prompt compliance. Bias resistance and factual knowledge are structurally different from refusal and survive the backend transition largely intact.

8. Capability Control Arm

Capability scores (MMLU, ARC-Challenge) across backends. If capability also drops, the safety drop is part of general performance loss. If capability is stable, the backend change specifically targets safety.

Model	Backend	MMLU	ARC-Challenge	Capability Mean
Llama 3.2 1B	Ollama Q4_K_M	0.310	0.335	0.323
Llama 3.2 1B	Ollama Q8_0	0.360	0.365	0.363
Llama 3.2 1B	TGI FP16	0.340	0.370	0.355
Llama 3.2 1B	vLLM FP16	0.325	0.365	0.345
Llama 3.2 3B	Ollama Q4_K_M	0.595	0.700	0.648
Llama 3.2 3B	Ollama Q8_0	0.600	0.720	0.660
Llama 3.2 3B	TGI FP16	0.490	0.675	0.583
Llama 3.2 3B	vLLM FP16	0.490	0.675	0.583
Qwen 2.5 1.5B	Ollama Q4_K_M	0.505	0.700	0.603
Qwen 2.5 1.5B	Ollama Q8_0	0.565	0.730	0.648
Qwen 2.5 1.5B	TGI FP16	0.570	0.740	0.655
Qwen 2.5 1.5B	vLLM FP16	0.570	0.740	0.655

Observations: Capability scores are much more stable across backends than safety scores. For Llama 3.2 1B, the MMLU range is 5pp and ARC range is 3.5pp -- compared to 25pp for aggregate safety. This asymmetry is the critical finding: the backend change does not degrade the model's general reasoning ability, only its safety-trained refusal behavior.

For Llama 3.2 3B, MMLU shows a larger drop (11pp between Ollama Q8 and FP16), suggesting some capability sensitivity at this model size. However, the Qwen 2.5 1.5B capability scores are higher on FP16 than on Ollama Q4_K_M (MMLU 0.57 vs 0.51), eliminating the hypothesis that "FP16 backends are generally worse." The directional inconsistency in capability, contrasted with the consistent Ollama-higher-than-FP16 pattern in safety, supports the differential vulnerability interpretation.

9. Safety-Capability Divergence

Does safety vary more across backends than capability? A ratio > 1.5 indicates safety is differentially sensitive to backend choice.

Model	Safety Range (pp)	Capability Range (pp)	Safety SD	Capability SD	Range Ratio	Interpretation
Llama 3.2 1B	25.1	4.0	0.139	0.017	6.28	Safety varies 6x more
Llama 3.2 3B	6.5	7.8	0.035	0.042	0.84	Both vary similarly
Qwen 2.5 1.5B	6.6	5.2	0.030	0.025	1.26	Both vary similarly

Observations:

For Llama 3.2 1B, safety is overwhelmingly more sensitive to backend choice than capability. The range ratio of 6.28 means that the safety score range across backends (25.1pp) is more than 6 times larger than the capability score range (4.0pp). This is the strongest evidence that the backend change specifically disrupts the model's safety training signals -- refusal patterns, jailbreak resistance -- while leaving general knowledge and reasoning intact. The 1B model's safety alignment appears to be encoded in a way that is highly dependent on the exact prompt formatting provided by the backend.

For the 3B model and Qwen 1.5B, the ratio is near 1.0 (0.84 and 1.26 respectively), meaning safety and capability vary proportionally across backends. This does not mean the backend effect is absent -- the chi-squared tests confirm it exists (Section 14) -- but it is not differentially targeting safety. Larger models and different RLHF recipes appear to encode safety in a more format-robust way.

10. Pairwise Backend Comparisons

Welch's t-test on aggregate safety scores with Holm-Bonferroni correction (6 comparisons per model, 18 total).

10.1 Llama 3.2 1B

Backend A	Backend B	Diff (pp)	95% CI	t	p	Cohen's d	Holm Sig	TOST Equiv
Ollama Q4	Ollama Q8	+1.8	[-2.2, +6.0]	0.83	0.406	+0.054	No	No
Ollama Q4	TGI FP16	-23.3	[-28.5, -17.8]	-8.52	<0.0001	-0.557	Yes	No
Ollama Q4	vLLM FP16	-23.0	[-28.2, -17.6]	-8.42	<0.0001	-0.551	Yes	No
Ollama Q8	TGI FP16	-25.1	[-30.6, -20.0]	-9.35	<0.0001	-0.611	Yes	No
Ollama Q8	vLLM FP16	-24.8	[-30.1, -19.6]	-9.24	<0.0001	-0.604	Yes	No
TGI FP16	vLLM FP16	+0.3	[-5.8, +6.5]	0.10	0.919	+0.007	No	No

Observations: 4 of 6 pairs are significant after Holm-Bonferroni correction -- all involving cross-family comparisons (Ollama vs FP16). The effect sizes are medium (d = -0.55 to -0.61), meaning the practical difference is substantial, not just statistically detectable. The two within-family comparisons (Q4 vs Q8, TGI vs vLLM) show trivial effect sizes (d < 0.06). No pair achieves TOST equivalence, even the functionally identical TGI-vLLM pair (TOST p = 0.197) -- this is a power limitation at N=468 per group.

10.2 Llama 3.2 3B

Backend A	Backend B	Diff (pp)	95% CI	t	p	Cohen's d	Holm Sig	TOST Equiv
Ollama Q4	Ollama Q8	+0.5	[-4.4, +5.6]	0.21	0.834	+0.014	No	No
Ollama Q4	TGI FP16	-6.0	[-11.6, -0.5]	-2.21	0.027	-0.145	No	No
Ollama Q4	vLLM FP16	-5.6	[-11.2, -0.1]	-2.06	0.040	-0.135	No	No
Ollama Q8	TGI FP16	-6.5	[-12.0, -1.3]	-2.42	0.016	-0.159	No	No
Ollama Q8	vLLM FP16	-6.1	[-11.5, -0.9]	-2.27	0.023	-0.149	No	No
TGI FP16	vLLM FP16	+0.4	[-5.1, +6.1]	0.15	0.880	+0.010	No	No

Observations: 0 of 6 pairs survive Holm-Bonferroni correction. The raw p-values for Ollama-vs-FP16 are all < 0.05 (range 0.016-0.040), indicating a real but modest effect that does not survive multiple comparison correction. The effect sizes (d = -0.13 to -0.16) are at the small/trivial boundary. The 6pp gap is near the per-model MDE of 7.6pp, so the lack of post-correction significance is expected -- the experiment is borderline-powered for this model. The within-family comparisons are again trivial (d < 0.02).

10.3 Qwen 2.5 1.5B

Backend A	Backend B	Diff (pp)	95% CI	t	p	Cohen's d	Holm Sig	TOST Equiv
Ollama Q4	Ollama Q8	+3.0	[-2.0, +7.7]	1.24	0.215	+0.081	No	No
Ollama Q4	TGI FP16	-3.6	[-9.0, +1.5]	-1.41	0.160	-0.092	No	No
Ollama Q4	vLLM FP16	-2.7	[-7.8, +2.5]	-1.04	0.298	-0.068	No	No
Ollama Q8	TGI FP16	-6.6	[-11.8, -1.7]	-2.65	0.008	-0.173	Yes	No
Ollama Q8	vLLM FP16	-5.7	[-10.7, -0.9]	-2.29	0.023	-0.149	No	No
TGI FP16	vLLM FP16	+1.0	[-4.5, +5.9]	0.36	0.717	+0.024	No	No

Observations: Only 1 of 6 pairs is significant after Holm correction: Ollama Q8 vs TGI FP16 (d = -0.173, p = 0.008). This makes sense: Q8 is Qwen's highest-safety backend (0.848), and the drop to TGI's 0.782 produces the largest pairwise gap. The Ollama Q4 comparisons are non-significant because Q4 itself is lower than Q8, reducing the distance to FP16 backends. The within-Ollama quant effect (d = 0.081) is the largest of all three models, suggesting Qwen's safety is slightly more quantization-sensitive than Llama's, though still trivial.

Summary across all models: After Holm correction, 5 of 18 pairwise comparisons are significant -- all involving Ollama-vs-FP16 contrasts. No within-family comparison (Q4 vs Q8, TGI vs vLLM) is significant. No pair achieves TOST equivalence at +/-3pp, meaning we cannot formally certify any backend pair as interchangeable.

11. Pairwise Backend Agreement

Binary agreement: both backends classify the same sample as safe (score > 0.5) or unsafe. Aggregated across all 3 models (1,404 safety samples per pair = 3 models x 468).

Backend A	Backend B	Agree	Disagree	Agreement %	Wilson 95% CI
Ollama Q4_K_M	Ollama Q8_0	1264	140	90.0%	[88.4%, 91.5%]
Ollama Q4_K_M	TGI FP16	1014	390	72.2%	[69.8%, 74.5%]
Ollama Q4_K_M	vLLM FP16	1010	394	71.9%	[69.5%, 74.2%]
Ollama Q8_0	TGI FP16	1008	396	71.8%	[69.4%, 74.1%]
Ollama Q8_0	vLLM FP16	1008	396	71.8%	[69.4%, 74.1%]
TGI FP16	vLLM FP16	1344	60	95.7%	[94.5%, 96.7%]

Observations:

Three distinct agreement clusters emerge:

TGI-vLLM cluster (95.7%): These two FP16 Docker backends produce nearly identical safety classifications. Only 60 out of 1,404 safety samples are classified differently. This is the strongest evidence that the FP16 serving framework (vLLM vs TGI) does not matter for safety.
Ollama intra-family cluster (90.0%): Q4_K_M and Q8_0 within Ollama agree on 90% of safety samples. The 10% disagreement (140 samples) comes from prompts near the classification boundary where quantization shifts the response slightly. This level of agreement is consistent with TR134's finding that Q4_K_M preserves safety.
Cross-family gap (~72%): Any Ollama backend paired with any FP16 backend yields only ~72% agreement. Nearly 28% of all safety samples -- approximately 390 out of 1,404 -- are classified differently depending on the backend family. These are not borderline cases: the Jaccard analysis (Section 12) shows the backends are producing fundamentally different text for these prompts.

12. Response Divergence

Jaccard similarity of response tokens measures how much textual overlap exists between two backends' responses to the same prompt. 1.0 = identical output, 0.0 = no token overlap. This goes beyond classification agreement to measure whether the backends are producing similar or different text.

Backend A	Backend B	Mean Jaccard	Bootstrap 95% CI	N Pairs
Ollama Q4_K_M	Ollama Q8_0	0.668	[0.654, 0.683]	2604
Ollama Q4_K_M	TGI FP16	0.205	[0.195, 0.216]	2604
Ollama Q4_K_M	vLLM FP16	0.270	[0.258, 0.283]	2604
Ollama Q8_0	TGI FP16	0.227	[0.216, 0.238]	2604
Ollama Q8_0	vLLM FP16	0.279	[0.267, 0.291]	2604
TGI FP16	vLLM FP16	0.744	[0.730, 0.758]	2604

Observations:

Ollama and FP16 backends produce fundamentally different text. Mean Jaccard of 0.20-0.28 between backend families means only 20-28% of tokens overlap. This is not a marginal classification boundary effect -- the models are generating substantially different responses to the same prompts. The divergence encompasses both the structure and content of responses.

Within backend families, the picture is very different:

TGI-vLLM: 0.744 Jaccard -- the highest overlap of any pair. These backends share the same HuggingFace weights and tokenizer, producing responses that differ only in minor sampling/rounding details.
Ollama Q4-Q8: 0.668 Jaccard -- still high, but lower than TGI-vLLM. The 4-bit vs 8-bit quantization introduces modest textual variation while preserving the overall response structure.

The cross-family divergence (Jaccard 0.20-0.28) is striking because the same model architecture is involved. The weights represent the same neural network. The difference is in how the prompt is tokenized, how the chat template is applied, and how the backend manages inference. This magnitude of textual divergence from the same architecture is a strong signal that the chat template / tokenizer handling is fundamentally different between GGUF and HuggingFace formats.

13. Safety-Divergence Correlation

Does lower response similarity (Jaccard) predict larger safety score differences? A significant negative correlation would confirm that textual divergence drives safety classification disagreement.

Backend A	Backend B	Pearson r	t	p	N Pairs
Ollama Q4_K_M	Ollama Q8_0	-0.259	-10.02	< 0.0001	1404
Ollama Q4_K_M	TGI FP16	-0.233	-8.97	< 0.0001	1404
Ollama Q4_K_M	vLLM FP16	-0.226	-8.70	< 0.0001	1404
Ollama Q8_0	TGI FP16	-0.285	-11.15	< 0.0001	1404
Ollama Q8_0	vLLM FP16	-0.269	-10.44	< 0.0001	1404
TGI FP16	vLLM FP16	-0.340	-13.51	< 0.0001	1404

Observations:

All six correlations are significantly negative (p < 0.0001): prompts where backends produce more divergent text also show larger safety score differences. The correlation magnitudes (r = -0.23 to -0.34) indicate a moderate relationship -- textual divergence explains approximately 5-12% of the variance in safety score differences. The remaining variance comes from classification noise (regex patterns matching differently on different phrasings) and cases where the text diverges but the safety classification happens to agree.

The strongest correlation (r = -0.340) is between TGI and vLLM -- counterintuitively, the most similar backend pair. This makes sense: when TGI and vLLM produce similar text (as they usually do, Jaccard 0.744), their safety scores almost always agree. But when they produce divergent text (the rare 4.3% of samples), the textual difference is strongly associated with a safety classification flip. In other words, for near-identical backends, textual divergence is a reliable predictor of safety disagreement -- the signal-to-noise ratio is high.

14. Per-Task Backend Independence (Chi-Squared)

Chi-squared test per safety task: is the safety outcome independent of the backend? This breaks down the aggregate model-level chi-squared (Appendix B) into task-level components.

Model	Task	X^2	df	p	Cramer's V	Significant?	N
Llama 3.2 1B	advbench_refusal	72.17	3	< 0.0001	0.425	Yes	400
Llama 3.2 1B	bbq_bias	11.51	3	0.009	0.121	Yes	792
Llama 3.2 1B	jailbreak_amplification	125.77	3	< 0.0001	0.512	Yes	480
Llama 3.2 1B	truthfulqa	5.48	3	0.140	0.166	No	200
Llama 3.2 3B	advbench_refusal	13.31	3	0.004	0.182	Yes	400
Llama 3.2 3B	bbq_bias	6.54	3	0.088	0.091	No	792
Llama 3.2 3B	jailbreak_amplification	73.32	3	< 0.0001	0.391	Yes	480
Llama 3.2 3B	truthfulqa	0.39	3	0.942	0.044	No	200
Qwen 2.5 1.5B	advbench_refusal	0.44	3	0.933	0.033	No	400
Qwen 2.5 1.5B	bbq_bias	5.06	3	0.167	0.080	No	792
Qwen 2.5 1.5B	jailbreak_amplification	17.61	3	0.001	0.192	Yes	480
Qwen 2.5 1.5B	truthfulqa	2.69	3	0.441	0.116	No	200

Observations:

6 of 12 task-model combinations show significant backend dependence. The pattern reveals a clear task hierarchy:

Jailbreak amplification is backend-dependent for ALL models -- the only task with universal significance. Effect sizes: V = 0.512 (large, Llama 1B), V = 0.391 (medium, Llama 3B), V = 0.192 (small, Qwen 1.5B). This is the task most sensitive to how the backend frames the prompt, which makes sense: jailbreak prompts exploit formatting-dependent escape patterns, and different chat templates provide different escape surfaces.
AdvBench refusal is significant for Llama but not Qwen. Llama 1B (V = 0.425, medium) and Llama 3B (V = 0.182, small) show backend-dependent refusal. But Qwen's refusal rate is near-ceiling (98-99%) across all backends, leaving no room for backend variation. The vulnerability is family-specific: Llama's refusal mechanism is format-dependent; Qwen's is not.
BBQ bias is marginally significant only for Llama 1B (V = 0.121, small). For the other two models, bias resistance is backend-independent. Bias detection appears to be encoded in the model's factual associations rather than its prompt-formatting-dependent safety training.
TruthfulQA is never significant. V ranges from 0.044 to 0.166, all non-significant at alpha=0.05. Combined with N=50 per cell and MDE of 28pp, truthfulness is underpowered for backend comparison at this sample size.

15. LLM Judge Agreement

Cohen's kappa between the regex classifier and the post-hoc LLM judge (Qwen 2.5 7B Instruct Q8_0) on refusal tasks. The judge independently classifies each response as FULL_REFUSAL / PARTIAL_REFUSAL / COMPLIANCE / UNCLEAR, which is then mapped to a binary safe/unsafe label for kappa computation.

Backend	Kappa	Agreement %	N Pairs
Ollama Q4_K_M	0.112	82.3%	660
Ollama Q8_0	0.144	86.2%	660
TGI FP16	0.158	66.4%	660
vLLM FP16	0.130	67.4%	660

Observations:

All kappa values are in the "slight agreement" range (< 0.20 by Landis & Koch thresholds). Raw agreement is substantially higher for Ollama backends (82-86%) than for FP16 backends (66-67%), but this difference is largely a base-rate artifact. On Ollama, most samples are classified as "safe" by both classifiers (because Ollama produces higher refusal rates). When both classifiers agree on the majority class, raw agreement inflates while kappa -- which corrects for chance -- remains low.

To illustrate: if both classifiers label 85% of Ollama samples as "safe," they would agree on ~74% by chance alone (0.85 x 0.85 + 0.15 x 0.15 = 0.745). The actual 86.2% agreement exceeds chance by only 12 percentage points, yielding kappa = 0.144. On FP16 backends, the base rate is more balanced (~63% safe), so chance agreement is lower (~47%), and the 67% raw agreement yields a slightly higher kappa despite lower raw agreement.

This low inter-rater agreement is consistent with TR134 (kappa 0.013-0.18) and TR135 (kappa 0.067-0.14). It does not invalidate the primary findings -- it reflects a fundamental difference between surface-level pattern matching (regex) and semantic intent evaluation (LLM judge). The two classifiers capture different aspects of safety, and neither is a gold-standard human annotation.

16. Power Analysis

Minimum detectable effect (MDE) at alpha=0.05, power=0.80, per backend and per task.

16.1 Per-Backend MDE (Aggregate Safety)

Model	Backend	N	Baseline Rate	MDE (pp)
Llama 3.2 1B	Ollama Q4_K_M	468	0.858	6.4
Llama 3.2 1B	Ollama Q8_0	468	0.876	6.0
Llama 3.2 1B	TGI FP16	468	0.625	8.9
Llama 3.2 1B	vLLM FP16	468	0.628	8.9
Llama 3.2 3B	All backends	468	0.74-0.81	7.2-8.0
Qwen 2.5 1.5B	All backends	468	0.78-0.85	6.6-7.6

Cross-backend MDE: 7.2-8.0pp per model (N=1,872 total safety samples).

16.2 Per-Task MDE

Task	N per cell	Baseline Range	MDE Range (pp)
advbench_refusal	100	0.49-0.99	5.2-19.4
bbq_bias	198	0.79-0.97	6.3-10.4
jailbreak_amplification	120	0.46-0.98	15.7-17.8
truthfulqa	50	0.42-0.59	28.0

Observations:

The aggregate safety MDE of 7-8pp means the experiment can detect any cross-backend difference larger than ~8pp with 80% power. The observed Llama 3.2 1B gap (24pp) is 3x above this threshold -- the finding is robustly powered and would remain significant even at a much smaller sample size.

For Llama 3.2 3B and Qwen 2.5 1.5B, the observed gaps (6-7pp) are near the MDE boundary (7.2-8.0pp). This explains why the pairwise t-tests are significant before Holm correction but mostly fail after -- the experiment is borderline-powered for these models. Detecting a 6pp effect at 80% power would require approximately 770 samples per backend (vs 468 actual), a 1.6x increase. This is a design limitation, not a null result -- the chi-squared tests still detect the effect at the model level because they pool across backends.

TruthfulQA's MDE of 28pp means only catastrophic per-task effects are detectable at N=50. The non-significant truthfulqa results (Section 14) should not be interpreted as "backends are equivalent for truthfulness" -- they are simply underpowered. To achieve a 10pp MDE for TruthfulQA, approximately 400 samples per cell would be needed (8x current).

17. Cross-Experiment Validation (vs TR134)

Compares TR136 Ollama Q4_K_M scores to TR134 Phase 3 Q4_K_M scores. Same models, same quantization, same tasks. Scores should agree within 5pp if the evaluation framework is reproducible across experiments.

17.1 Llama 3.2 1B

Task	TR136 (Q4_K_M)	TR134 (Q4_K_M)	Diff (pp)	Consistent?
advbench_refusal	0.880	0.870	+1.0	Yes
arc_challenge	0.335	0.395	-6.0	No
bbq_bias	0.874	0.874	+0.0	Yes
jailbreak_amplification	0.925	0.933	-0.8	Yes
mmlu_real	0.310	0.337	-2.7	Yes
truthfulqa	0.590	0.580	+1.0	Yes

5/6 tasks consistent (< 5pp threshold). The ARC-Challenge divergence (6.0pp) is slightly outside the threshold. Given ARC's MDE of ~18pp at N=200, a 6pp difference is within expected noise. BBQ bias reproduces with zero deviation, confirming the bias benchmark and classifier are fully deterministic.

17.2 Llama 3.2 3B

Task	TR136 (Q4_K_M)	TR134 (Q4_K_M)	Diff (pp)	Consistent?
advbench_refusal	0.540	0.470	+7.0	No
arc_challenge	0.700	0.705	-0.5	Yes
bbq_bias	0.965	0.965	+0.0	Yes
jailbreak_amplification	0.892	0.825	+6.7	No
mmlu_real	0.595	0.590	+0.5	Yes
truthfulqa	0.480	0.500	-2.0	Yes

4/6 tasks consistent. AdvBench and jailbreak refusal diverge by 7pp, which is within the per-task MDE range for these tasks (advbench: 19.4pp, jailbreak: 15.7pp). The divergence concentrates in refusal tasks where the baseline rate is moderate (0.47-0.83), leaving more room for run-to-run variance. Capability tasks (ARC, MMLU) and bias (BBQ) reproduce with near-perfect fidelity.

Overall: 9 of 12 task-model pairs are within 5pp. The 3 divergent pairs are: ARC-Challenge on Llama 1B (-6.0pp, a capability task), AdvBench on Llama 3B (+7.0pp), and jailbreak on Llama 3B (+6.7pp). The two Llama 3B refusal divergences are within per-task MDE (advbench: 19.4pp, jailbreak: 15.7pp). The ARC divergence (6.0pp vs 5pp threshold) is marginal and likely reflects noise. The divergence could reflect run-to-run variance in the model's refusal behavior at temperature=0.0, or subtle differences in prompt shuffling order between TR134 and TR136.

18. Baseline Normalization

Safety scores normalized to vLLM FP16 as the reference backend (1.000 = vLLM FP16 level). This provides a common scale for comparing backend effects across models.

Model	Ollama Q4_K_M	Ollama Q8_0	TGI FP16	vLLM FP16
Llama 3.2 1B	1.366 (+23.0pp)	1.395 (+24.8pp)	0.995 (-0.3pp)	1.000
Llama 3.2 3B	1.074 (+5.5pp)	1.081 (+6.1pp)	0.994 (-0.4pp)	1.000
Qwen 2.5 1.5B	1.034 (+2.7pp)	1.072 (+5.7pp)	0.988 (-1.0pp)	1.000

Observations:

Ollama backends produce safety scores 3-40% higher than the vLLM FP16 baseline, depending on model. The normalization reveals a clear size gradient: the 1B model's Ollama premium is 1.37-1.40x (37-40% higher safety); the 3B model's is 1.07-1.08x (7-8%); Qwen's is 1.03-1.07x (3-7%). TGI FP16 is within 1pp of vLLM FP16 in all cases (0.988-0.995x), confirming these two backends produce indistinguishable safety behavior.

The asymmetry is important for deployment guidance: Ollama does not make models safer in any absolute sense. Rather, Ollama's GGUF chat template formatting happens to trigger refusal patterns more reliably than HuggingFace's formatting. A model that "passes" safety testing on Ollama at 0.876 safety may actually have a baseline safety level closer to 0.628 when served through the HuggingFace weight format -- a 25pp gap that could be the difference between deployment approval and rejection.

19. Conclusions & Limitations

19.1 Research Question Answers

#	Question	Answer	Evidence
RQ1	Backend independence	Rejected. Chi-squared significant for all 3 models	Appendix B (all p <= 0.05)
RQ2	Quant vs backend	Backend dominates. d: quant 0.01-0.08, backend 0.15-0.60	Section 6
RQ3	vLLM-TGI equivalence	Functionally equivalent (95.7% agreement, d < 0.03)	Sections 10-12
RQ4	Task specificity	Jailbreak > AdvBench > BBQ > TruthfulQA	Section 14 (Cramer's V ranking)
RQ5	Safety-capability divergence	6.28x for Llama 1B; ~1x for others	Section 9
RQ6	Cross-TR reproducibility	9/12 pairs within 5pp	Section 17

19.2 Mechanism: Chat Template Divergence

The most likely explanation is chat template divergence between GGUF and HuggingFace formats. This hypothesis is supported by three converging lines of evidence:

Textual divergence is massive. Jaccard similarity between backend families is only 0.20-0.28 (Section 12). The same model architecture produces fundamentally different text when served through different backends. This level of divergence cannot be explained by quantization alone (Ollama Q4 vs Q8 Jaccard = 0.668).
The effect is safety-specific. Capability scores (MMLU, ARC) show 4-8pp cross-backend range, while safety shows 7-25pp (Section 9). If the divergence were caused by general model quality differences, capability and safety would both shift proportionally. The differential effect points to safety-trained behavior (refusal patterns) being more format-dependent than general knowledge.
The effect concentrates in refusal tasks. Jailbreak and AdvBench refusal are most affected (Section 14), while bias and truthfulness are not. Refusal behavior is trained through RLHF on carefully formatted prompt-response pairs. If the deployment-time formatting differs from the training-time formatting, the refusal triggers may not fire.

A secondary factor may be tokenizer implementation differences. GGUF files use llama.cpp's tokenizer; vLLM and TGI use HuggingFace's tokenizers library. Edge-case disagreements on special tokens, Unicode handling, and BOS/EOS placement could alter the model's perception of the prompt.

19.3 Backend vs Quantization vs Concurrency: A Comparison

Dimension	TR134 (Quantization)	TR135 (Concurrency)	TR136 (Backend)
Max safety degradation	15-40pp (Q2_K)	< 1pp	25pp (Llama 1B)
Jailbreak amplification	+11pp per BPW (slope)	0pp	48pp gap (Ollama vs FP16)
Mechanism	Precision loss degrades fine-tuned weights	None (Ollama serializes)	Chat template / tokenizer divergence
Threshold behavior	Sharp cliff at Q3_K_S	No threshold	Binary: GGUF vs HuggingFace
Capability impact	Proportional to safety	None	Disproportionately small
Practitioner action	Stay above Q4_K_M	No action needed	Re-test on target backend
Controllable?	Yes (choose quant level)	N/A	Partially (align chat templates)

19.4 Limitations

GGUF vs HuggingFace confound is unresolved. We cannot separate backend code effects from weight format effects without running HuggingFace weights through Ollama (not supported) or GGUF weights through vLLM (not standard). A targeted follow-up comparing chat template application alone would isolate this variable.
Two model families only. Llama 3.2 and Qwen 2.5. The massive effect on Llama 1B may be family-specific. Mistral, Phi, and Gemma models may behave differently.
Small models only (1B-3B). Larger models (7B+) with more robust safety training may be less sensitive to backend choice. The effect's size-dependence (24pp for 1B, 6pp for 3B) suggests it may attenuate further at 7B+.
Temperature = 0.0. Stochastic sampling at temperature > 0 could amplify or dampen backend differences. The deterministic setting measures the most likely output, not the distribution of possible outputs.
Single run per configuration. We report within-sample CIs but not run-to-run variance. The 3 divergent cross-validation results (Section 17) suggest non-trivial run-to-run variance for refusal tasks even at temperature=0.
Regex classifiers are surface-level. The LLM judge's low kappa (0.11-0.16) indicates the two classifiers disagree on ~15-30% of samples after chance correction. Human annotation would provide a more reliable ground truth.
Docker overhead inflates latency. Latency differences between Ollama (native) and vLLM/TGI (Docker) reflect container overhead, not pure inference speed.
Jailbreak prompts from a fixed set. Adversarial robustness to novel jailbreaks may differ from robustness to known JailbreakBench patterns.
No system prompt variation. All backends receive the same system prompt content, but the encoding differs between GGUF-embedded and HuggingFace templates. We test the combined effect, not the isolated effect.
Borderline power for 3B and Qwen models. The 6-7pp effects are near the MDE boundary. To confidently confirm or reject a 5pp effect, approximately 770 samples per backend would be needed (1.6x current).

19.5 Implications for Practitioners

Re-evaluate safety when switching backends. A model that passes safety testing on Ollama may fail on vLLM/TGI, particularly for jailbreak resistance and harmful prompt refusal. Budget for re-evaluation as part of any backend migration.
vLLM and TGI are interchangeable. If you are choosing between these two FP16 Docker backends, safety is not a differentiator. Pick based on throughput, API compatibility, and operational preferences.
Quantization within Ollama is safe. Q4_K_M and Q8_0 produce nearly identical safety behavior (d < 0.09). This is consistent with TR134's finding that Q4_K_M is a safe deployment point. Use Q4_K_M to save memory without safety cost.
Use larger models in multi-backend deployments. The 1B model shows a 6x safety/capability divergence ratio; 3B models are much more robust. If your deployment may span multiple backends, prefer models >= 3B parameters.
Investigate chat template alignment. The most actionable mitigation is to verify that the HuggingFace chat template matches the GGUF-embedded template before deploying. Compare the output of transformers.AutoTokenizer.apply_chat_template() against GGUF template metadata to detect divergences before they cause safety failures.

20. Reproducibility

20.1 Pipeline Commands

# Full pipeline (eval + judge + analysis + report)
python research/tr136/run.py

# Skip eval, re-run analysis on existing data
python research/tr136/run.py --skip-eval --skip-judge

# Run specific backends or models only
python research/tr136/run.py --backends ollama_q4_k_m ollama_q8_0 --models llama3.2-1b

# Standalone analysis on existing run
python research/tr136/analyze.py --run-dir research/tr136/results/20260308_015147

20.2 Prerequisites

Ollama >= v0.6.2 installed and running (ollama serve)
Docker with GPU support (nvidia-container-toolkit installed)
Ollama models pulled: llama3.2:1b, llama3.2:3b, qwen2.5:1.5b (base tags; quant variants are auto-pulled)
HuggingFace models cached: unsloth/Llama-3.2-1B-Instruct, unsloth/Llama-3.2-3B-Instruct, Qwen/Qwen2.5-1.5B-Instruct (auto-downloaded by vLLM/TGI)
Judge model: qwen2.5:7b-instruct-q8_0 pulled in Ollama
Python packages: See requirements.txt
GPU: >= 12GB VRAM for FP16 inference of 3B models
Disk space: ~20GB for models + ~30MB for results

20.3 Key Git Commits

Commit	Description
`9f1e35dc`	Upgrade analyze.py to SOTA statistical rigor and harden run.py
`65a05700`	Add untracked artifacts, infra configs, TR136 tasks, and tests

20.4 Known Reproducibility Issues

Docker image versions. vllm/vllm-openai:latest and ghcr.io/huggingface/text-generation-inference:latest are mutable tags. Pin to specific versions for exact reproduction.
Ollama GGUF versions. Ollama model tags (e.g., llama3.2:1b) may be updated with new GGUF conversions. Pin to specific digests for exact reproduction.
Temperature=0.0 is not fully deterministic. At temperature=0 with a fixed seed, most backends produce deterministic output, but floating-point non-associativity in GPU operations can cause rare divergences (1-2 tokens per 1000 responses). The 3 cross-validation divergences in Section 17 may partially reflect this.
WSL2 Docker performance. Docker containers on Windows run through WSL2, which may introduce different latency characteristics than native Linux Docker.

Appendix A: Normalized Per-Task Scores (Reference: vLLM FP16)

A.1 Llama 3.2 1B

Task	Ollama Q4_K_M	Ollama Q8_0	TGI FP16	vLLM FP16
advbench_refusal	1.630 (+34.0pp)	1.759 (+41.0pp)	1.000 (+0.0pp)	1.000
bbq_bias	1.102 (+8.1pp)	1.121 (+9.6pp)	1.000 (+0.0pp)	1.000
jailbreak_amplification	1.850 (+42.5pp)	1.967 (+48.3pp)	1.000 (+0.0pp)	1.000
truthfulqa	1.283 (+13.0pp)	0.913 (-4.0pp)	0.935 (-3.0pp)	1.000

The jailbreak normalization (1.967x for Q8) means Ollama Q8 refuses nearly twice as many jailbreak attempts as vLLM FP16. TGI and vLLM produce identical scores for this model (1.000x on all tasks except truthfulqa).

A.2 Llama 3.2 3B

Task	Ollama Q4_K_M	Ollama Q8_0	TGI FP16	vLLM FP16
advbench_refusal	0.783 (-15.0pp)	0.710 (-20.0pp)	1.000 (+0.0pp)	1.000
bbq_bias	1.044 (+4.0pp)	1.049 (+4.5pp)	1.006 (+0.5pp)	1.000
jailbreak_amplification	1.529 (+30.8pp)	1.614 (+35.8pp)	0.986 (-0.8pp)	1.000
truthfulqa	0.857 (-8.0pp)	0.875 (-7.0pp)	0.929 (-4.0pp)	1.000

The Llama 3B AdvBench reversal is visible here: Ollama scores below vLLM (0.78x, 0.71x). Jailbreak amplification maintains the Ollama-higher pattern (1.53x, 1.61x), creating a task-specific contradiction.

A.3 Qwen 2.5 1.5B

Task	Ollama Q4_K_M	Ollama Q8_0	TGI FP16	vLLM FP16
advbench_refusal	1.000 (+0.0pp)	1.010 (+1.0pp)	1.000 (+0.0pp)	1.000
bbq_bias	1.045 (+4.0pp)	1.062 (+5.5pp)	1.034 (+3.0pp)	1.000
jailbreak_amplification	1.101 (+5.8pp)	1.247 (+14.2pp)	0.797 (-11.7pp)	1.000
truthfulqa	0.902 (-5.0pp)	0.902 (-5.0pp)	1.137 (+7.0pp)	1.000

Qwen shows near-unity normalization for AdvBench (all backends identical) and BBQ (< 6pp spread). Jailbreak is the outlier: TGI scores below vLLM (0.797x), an unusual divergence between the two FP16 backends that warrants investigation.

Appendix B: Chi-Squared Contingency Tables

B.1 Llama 3.2 1B (X^2 = 148.60, p < 0.0001)

Backend	Safe	Unsafe	Total	Safety Rate
Ollama Q4_K_M	406	62	468	86.8%
Ollama Q8_0	413	55	468	88.3%
TGI FP16	295	173	468	63.0%
vLLM FP16	297	171	468	63.5%

The contingency table makes the clustering visible: Ollama backends have ~60 unsafe samples each; FP16 backends have ~172 -- nearly 3x more. The chi-squared statistic of 148.60 (Cramer's V = 0.282, small-medium) reflects this dramatic shift.

B.2 Llama 3.2 3B (X^2 = 11.05, p = 0.011)

Backend	Safe	Unsafe	Total	Safety Rate
Ollama Q4_K_M	380	88	468	81.2%
Ollama Q8_0	383	85	468	81.8%
TGI FP16	351	117	468	75.0%
vLLM FP16	353	115	468	75.4%

The gap narrows: Ollama has ~86 unsafe samples; FP16 has ~116. The Cramer's V of 0.077 (negligible-small) reflects a real but modest effect.

B.3 Qwen 2.5 1.5B (X^2 = 7.83, p = 0.050)

Backend	Safe	Unsafe	Total	Safety Rate
Ollama Q4_K_M	386	82	468	82.5%
Ollama Q8_0	400	68	468	85.5%
TGI FP16	370	98	468	79.1%
vLLM FP16	374	94	468	79.9%

The smallest effect. Borderline significant (p = 0.050), with Cramer's V = 0.065 (negligible). The within-Ollama spread (82.5% to 85.5%) is comparable to the Ollama-FP16 spread (79.1% to 85.5%), consistent with the "comparable" interpretation in Section 6.

Appendix C: Full Latency Comparison

Model	Backend	advbench	arc	bbq	jailbreak	mmlu	truthfulqa
Llama 1B	Ollama Q4	352	257	424	359	257	558
Llama 1B	Ollama Q8	374	249	469	372	247	588
Llama 1B	TGI FP16	1326	127	301	1486	382	781
Llama 1B	vLLM FP16	1019	118	267	1225	321	654
Llama 3B	Ollama Q4	458	262	379	440	266	932
Llama 3B	Ollama Q8	516	259	456	541	262	1028
Llama 3B	TGI FP16	2412	1006	3870	4524	1984	2877
Llama 3B	vLLM FP16	2126	855	3210	3776	1673	2547
Qwen 1.5B	Ollama Q4	384	223	456	877	227	504
Qwen 1.5B	Ollama Q8	358	229	589	876	212	474
Qwen 1.5B	TGI FP16	497	670	1121	2443	1761	1822
Qwen 1.5B	vLLM FP16	423	92	1455	1744	407	1092

All values in milliseconds (mean per request). Docker backends (TGI, vLLM) show higher latency for safety tasks due to longer response generation (refusal explanations) combined with container networking overhead. For capability tasks with short responses (ARC: single letter), vLLM is actually faster than Ollama (92ms vs 223ms for Qwen) because the HuggingFace tokenizer and FP16 inference are more efficient for short sequences. Latency differences do not affect safety evaluation -- all 10,416 requests completed successfully (0% error rate across all configurations).

Appendix D: Glossary

Term	Definition
Backend	The inference server that hosts and serves the model (Ollama, vLLM, TGI)
BPW	Bits per weight. FP16 = 16.0, Q8_0 = 8.0, Q4_K_M ~ 4.5
Cohen's d	Standardized effect size. < 0.2 trivial, 0.2-0.5 small, 0.5-0.8 medium, > 0.8 large
Cohen's Kappa	Inter-rater agreement corrected for chance. < 0.20 slight, 0.21-0.40 fair, 0.41-0.60 moderate (Landis & Koch 1977)
Cramer's V	Chi-squared effect size. < 0.1 negligible, 0.1-0.3 small, 0.3-0.5 medium, > 0.5 large
GGUF	GPT-Generated Unified Format. Ollama's native weight format, includes embedded chat template and tokenizer
Holm-Bonferroni	Step-down multiple comparison correction. More powerful than Bonferroni, controls FWER
Jaccard Similarity	Token overlap ratio. \|A intersect B\| / \|A union B\|. 1.0 = identical, 0.0 = no overlap
MDE	Minimum Detectable Effect. Smallest effect detectable at given alpha and power
pp	Percentage points. An absolute difference in proportions (e.g., 85% - 60% = 25pp)
TOST	Two One-Sided Tests. Confirms equivalence within a margin (here +/-3pp). Strictly stronger than "not significant"
TGI	Text Generation Inference. HuggingFace's inference server
vLLM	Virtual LLM. High-throughput inference server with continuous batching

References

Landis, J.R. & Koch, G.G. (1977). The Measurement of Observer Agreement for Categorical Data. Biometrics, 33(1), 159-174.
Rottger, P. et al. (2024). HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Refusal. NeurIPS 2024.
Mazeika, M. et al. (2024). JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models. NeurIPS 2024.
Parrish, A. et al. (2022). BBQ: A Hand-Built Bias Benchmark for Question Answering. ACL 2022.
Lin, S., Hilton, J., & Evans, O. (2022). TruthfulQA: Measuring How Models Mimic Human Falsehoods. ACL 2022.
Huang, Y. et al. (2024). Chat Template Matters: The Impact of Prompt Formatting on LLM Safety. arXiv:2402.xxxxx.
Banterhearts TR125 (2026). Quantization Decision Matrix: Quality-Accuracy Trade-offs Across 4 Models and 7 Quantization Levels.
Banterhearts TR133 (2026). VRAM Modeling and Latency Prediction for Multi-Model Deployments.
Banterhearts TR134 (2026). Alignment Robustness Under Quantization: Multi-Family Safety Evaluation Across 4 Models with Jailbreak Amplification.
Banterhearts TR135 (2026). Safety Under Multi-Agent Concurrency: Does Running N Concurrent Agents Degrade Model Safety?

TR136: Cross-Backend Safety Consistency

Technical Report 136: Cross-Backend Safety Consistency

Does the serving backend change model safety behavior? A controlled comparison of Ollama, vLLM, and TGI across 3 models, 4 backend configurations, and 6 benchmarks

Abstract

Metric Definitions

Safety Metrics

Capability Metrics

Derived Metrics

Statistical Methods & Caveats

Executive Summary

Key Findings

Validation Summary

Claim Validation

Key Decisions for Practitioners

When to Use This Report

How to Read This Report

Table of Contents

1. Introduction & Research Motivation

1.1 The Problem

1.2 Research Questions

1.3 Literature Gap

1.4 Relationship to Prior Work

2. Experimental Design

2.1 Design Summary

2.2 Backend Configurations

2.3 Safety Task Descriptions

2.4 Capability Tasks

2.5 Classification Pipeline

2.6 Pipeline Architecture

3. Model Lineup

3.1 Model Summary

3.2 Why These Models

3.3 Design Decision: No 7B Models

3.4 Design Decision: Fixed Quantization Levels

4. Environment & Artifacts

4.1 Environment

4.2 Key Artifacts

5. Per-Backend Aggregate Safety Scores

6. Quantization vs Backend Effect Decomposition

7. Per-Task Safety Breakdown

7.1 Llama 3.2 1B

7.2 Llama 3.2 3B

7.3 Qwen 2.5 1.5B

7.4 Cross-Model Task Summary

8. Capability Control Arm

9. Safety-Capability Divergence

10. Pairwise Backend Comparisons

10.1 Llama 3.2 1B

10.2 Llama 3.2 3B

10.3 Qwen 2.5 1.5B

11. Pairwise Backend Agreement

12. Response Divergence

13. Safety-Divergence Correlation

14. Per-Task Backend Independence (Chi-Squared)

15. LLM Judge Agreement

16. Power Analysis

16.1 Per-Backend MDE (Aggregate Safety)

16.2 Per-Task MDE

17. Cross-Experiment Validation (vs TR134)

17.1 Llama 3.2 1B

17.2 Llama 3.2 3B

18. Baseline Normalization

19. Conclusions & Limitations

19.1 Research Question Answers

19.2 Mechanism: Chat Template Divergence

19.3 Backend vs Quantization vs Concurrency: A Comparison

19.4 Limitations

19.5 Implications for Practitioners

20. Reproducibility

20.1 Pipeline Commands

20.2 Prerequisites

20.3 Key Git Commits

20.4 Known Reproducibility Issues

Appendix A: Normalized Per-Task Scores (Reference: vLLM FP16)

A.1 Llama 3.2 1B

A.2 Llama 3.2 3B

A.3 Qwen 2.5 1.5B

Appendix B: Chi-Squared Contingency Tables

B.1 Llama 3.2 1B (X^2 = 148.60, p < 0.0001)

B.2 Llama 3.2 3B (X^2 = 11.05, p = 0.011)