TR123-TR133 Decision Whitepaper

Executive guidance for deployment leaders

Field	Value
Project	Banterhearts LLM Performance Research
Date	2026-02-28
Version	1.0
Report Type	Decision whitepaper
Audience	Decision makers, product leaders, ops leaders
Scope	TR123, TR124, TR125, TR126, TR127, TR128, TR129, TR130, TR131, TR132, TR133
Primary Source	`PublishReady/reports/Technical_Report_Conclusive_123-133.md`

Abstract

This whitepaper distills eleven technical reports (TR123-TR133) and approximately 70,000 measurements into deployment policy for local-first LLM inference on consumer hardware. The central question: given 70,000+ measurements on a consumer GPU, what model, quantization level, backend, and serving stack should we run -- and when does each choice break? Outcome: six shippable decisions that cover the full deployment space from single-agent chat to eight-agent orchestration, backed by artifact-level provenance and tested against their own falsification criteria.

Boundary conditions (do not skip)

This guidance is valid only under the measured boundary:

Fixed hardware class: NVIDIA RTX 4080 Laptop GPU, 12 GB VRAM, 432 GB/s bandwidth, AD104
Same inference stack versions: PyTorch 2.x, CUDA 12.x, Ollama 0.6.x, vLLM 0.7.x, TGI
Same measurement definitions: phase-split (prefill vs decode), end-to-end, kernel-level
Models at or below 8B parameters (GPT-2 124M through LLaMA-3.1-8B)

If any of these change, re-run the core measurement matrix and re-validate artifacts before applying these decisions.

Six decisions you can ship now

Default backend (single-agent): Ollama with Q4_K_M quantization. Highest throughput per dollar at negligible quality loss (within -4.1pp of FP16 on benchmark accuracy). Compile overhead, ONNX export complexity, and raw transformers startup costs are not justified for N=1.
Default backend (multi-agent, N >= 4): vLLM with FP16 weights. Continuous batching amortizes GPU memory bandwidth by 4.7-5.8x, yielding a 2.25x throughput advantage over Ollama at N=8. TGI provides equivalent amortization but lower absolute throughput.
Compile policy: torch.compile prefill only, on Linux, with Inductor+Triton backend. Never compile decode -- it crashes in all tested modes and provides no speedup even when it works (+2.2%, not significant). On Windows, torch.compile is a no-op (aot_eager fallback). Eager decode always.
Quantization policy: Q4_K_M is the universal default across all tested models (llama3.2-1b, qwen2.5-1.5b, phi-2, llama3.2-3b, llama3.1-8b). Use Q8_0 for quality-critical workloads requiring accuracy within 2pp of baseline. Never deploy Q2_K -- it produces near-random accuracy across all models. Q3_K_S is acceptable only for phi-2 and llama3.1-8b.
Context budget: Ollama for contexts exceeding 4K tokens on 12 GB VRAM. HuggingFace FP16 is limited to 4-8K tokens before VRAM spillover causes 25-105x latency cliffs. Monitor VRAM utilization; alert at 85%, hard-stop at 90%.
Capacity planning tool: Use chimeraforge plan for configuration search across the (model, quantization, backend, N-agents) space. Empirical lookup tables validated at R-squared >= 0.859 for throughput, 0.968 for VRAM. Do not use M/D/1 queueing theory -- it deviates up to 20.4x from reality.

Decision matrix (one-glance policy)

Condition	Backend	Quantization	Compile	Streaming
N=1, decode-heavy	Ollama	Q4_K_M	N/A	Always on
N=1, prefill-heavy (Linux)	Compiled HF	FP16	Prefill only	N/A
N=2-3, any workload	Ollama or vLLM (benchmark your mix)	Q4_K_M or FP16	Per backend default	Always on
N >= 4, any workload	vLLM FP16	FP16	N/A (vLLM manages internally)	Always on
Quality-critical	phi-2 or llama3.1-8b	Q8_0	No	Always on
VRAM-constrained (12 GB)	Ollama	Q4_K_M; Q3_K_S for phi-2 only	No	Always on

Key findings (decision-grade)

Quantization is the dominant lever. Q4_K_M saves 30-67% cost versus FP16 while losing at most -4.1pp accuracy. The effect compounds across cost, VRAM budget, and concurrency ceiling simultaneously. Q2_K is universally unacceptable (all models >11pp loss; qwen2.5-1.5b collapses -40.6pp).
Backend choice does not affect quality. TR124 tested 7 quality metrics across 5 models and 2 backends at temp=0; none showed statistically significant differences. The cheapest backend is the best backend for a given quality level.
The GPU, not the serving stack, is the scaling bottleneck. TR131 overturned TR130: PyTorch Direct (no serving stack) degrades 86.4% at N=8, worse than Ollama's 82.1%. Memory bandwidth stress increases +74% at N=8, statistically confirmed with very high confidence (sole Holm-surviving test).
Continuous batching amortizes the bandwidth bottleneck. TR132 proved the mechanism: 77-80% kernel count reduction and 79-83% bandwidth-per-token reduction at N=8. Amortization ratio is 4.7-5.8x (59-72% of the theoretical 8:1 maximum). vLLM and TGI use the identical mechanism.
VRAM spillover, not quadratic attention, is the practical bottleneck for long context. TR127 observed 25-105x latency cliffs when KV-cache pushes VRAM past capacity. Below spillover, Ollama prefill scaling is clean and sub-linear (b = 0.083-0.158). GQA models sustain 3-11x longer contexts than MHA models.
NUM_PARALLEL is a no-op. TR128 tested 30 pairwise comparisons; 0/30 significant (mean absolute change 4.0%). M/D/1 queueing theory deviates up to 20.4x from observed latency. Streaming adds zero overhead (0/9 tests significant).
Multi-agent throughput plateaus at N=2. TR129 fits Amdahl's Law with serial fractions s = 0.39-0.54 (R-squared > 0.97). Per-agent throughput at N=8 is 17-20% of solo throughput. Fairness remains excellent (Jain's index >= 0.997).
Consumer hardware is 95.4% cheaper than cloud. TR123 TCO at 1B tokens/month: $153/yr consumer versus $2,880/yr AWS on-demand (GPT-2/compile). Break-even for an RTX 4080 occurs at 0.3-2.7 months at 10M requests/month.

Operational recommendations (policy statements)

Backend selection

Policy: Default to Ollama Q4_K_M for single-agent serving (N=1).
Policy: Switch to vLLM FP16 at N >= 4 agents. N=2-3 is the crossover zone; benchmark your workload mix.
Policy: Use TTFT as the tiebreaker in the crossover zone -- vLLM/TGI deliver 6-8x faster time-to-first-token (22-35ms versus 163-194ms).

Quantization

Policy: Q4_K_M is the universal default for all models tested.
Policy gate: Use Q8_0 only when composite quality >= 0.60 is required and Q4_K_M falls below the threshold.
Policy ban: Never deploy Q2_K in any quality-sensitive context. Q3_K_S is banned for llama3.2-1b, llama3.2-3b, and qwen2.5-1.5b (9.5-12.2pp loss).

Compile

Policy: Compile prefill only, on Linux, with Inductor+Triton. Speedup range: 1.3x (qwen2.5-3b) to 2.5x (gpt2-25m).
Policy gate: Never compile decode -- 100% crash rate in all tested modes. Eager decode always.
Policy: All Windows torch.compile results are invalid (aot_eager fallback). Do not trust them.

Context budget

Policy: VRAM budget = model_weights + KV_cost_per_token x context_length. GQA models get 3-11x more context per GB.
Policy: Use Ollama for contexts exceeding 4K tokens on 12 GB VRAM. HF FP16 collapses (95% throughput loss) at the spillover boundary.
Policy gate: Alert at 85% VRAM utilization; hard-stop at 90% to avoid 25-105x latency cliffs.

Capacity planning

Policy: Use chimeraforge plan for all configuration search and sizing decisions. Runtime < 1 second, zero GPU required.
Policy ban: Do not use M/D/1 queueing theory for latency prediction (20.4x deviation from reality).
Policy: Cap utilization at 70% of measured peak throughput to absorb burst traffic (TTFT amplifies 29.9x at saturation).

Economic impact

Cost levers (ranked by magnitude)

Infrastructure choice: Consumer GPU versus cloud on-demand saves 95.4%. This is the single largest cost lever and dwarfs all other optimizations combined.
Quantization: Q4_K_M saves 30-67% versus FP16. Cost range spans 10x across the quantization-model matrix ($0.020 to $0.198 per 1M tokens).
Serving stack at scale: vLLM delivers 2.25x throughput at N=8, halving the per-request cost for multi-agent workloads.
Compile (prefill-heavy, Linux only): 1.3-2.5x prefill speedup for small models. Diminishes at larger sizes.

Dollar figures

Best cost: $0.013/1M tokens (GPT-2/compile, chat blend)
Best cost above 1B parameters: $0.047/1M tokens (LLaMA-3.2-1B/compile)
Consumer TCO at 1B tokens/month: $153/yr (GPT-2/compile) to $561/yr (LLaMA-3.2-1B/compile)
AWS TCO at 1B tokens/month: $2,880/yr (GPT-2) to $8,584/yr (LLaMA-3.2-1B); consumer hardware saves 95.4%
Break-even: RTX 4080 ($1,200) pays for itself in 0.3 months at 10M requests/month (llama3.2-1b/Ollama), 2.7 months at 1M requests/month

Production throughput ceiling

Single-agent peak: 1.17 req/s (Ollama, llama3.2-1b)
Multi-agent peak: 559 tok/s total (vLLM, llama3.2-1b, N=8)
Per-agent at N=8: 17-20% of solo throughput (Ollama); 56% efficiency retained (vLLM)

Implementation plan (30-day view)

Days 1-7: reproduce and validate

Re-run the core scenario matrix on your hardware and workload mix.
Validate artifacts against the chain-of-custody (Appendix B of the conclusive report).
Capture full manifests: GPU model, driver, CUDA, PyTorch, Ollama/vLLM versions, OS.

Days 8-14: translate into your planning numbers

Recompute $/token and capacity with your electricity rate and hardware amortization schedule.
Run chimeraforge plan with your specific (model, quantization, N-agents, SLA, quality floor, budget ceiling) inputs.
If compiling on Linux: verify Triton kernel generation (check torch._inductor logs for generated kernels, not aot_eager fallback).

Days 15-30: enforce policies

Ship backend routing rules: Ollama for N <= 2, vLLM for N >= 4. Benchmark your workload at N=3.
Deploy Q4_K_M as the default quantization; require explicit approval for Q8_0 or Q3_K_S exceptions.
Set VRAM monitoring with 85%/90% alert/hard-stop thresholds.
Cap utilization at 70% of measured peak; set TTFT alarms at your SLA boundary.
Add re-run-required triggers to change management for any hardware, driver, or stack version change.

Risks, limitations, invalidation triggers

Limitations

Single hardware baseline (RTX 4080 Laptop, 12 GB VRAM). Not portable across GPU classes without re-run.
Models tested at or below 8B parameters. Larger models may exhibit different scaling, quantization tolerance, and VRAM behavior.
$/token values are planning proxies (shadow prices), not exact TCO. Apply your own rates.
Quality metrics are automated (BERTScore, ROUGE-L, MMLU, ARC). No human evaluation performed.
Formal equivalence testing could not confirm with full statistical power at a tight +/-3pp margin, so quality tiers are based on observed accuracy differences rather than formal equivalence proof.
Consumer WDDM driver blocks direct kernel-level metrics (ncu null); bandwidth attribution relies on back-of-envelope calculation.
Scaling model is the weakest predictor (R-squared = 0.647); Amdahl captures the trend but misses backend interaction effects.

What invalidates this guidance

GPU, driver, or CUDA version change without re-run
PyTorch major version upgrade (kernel paths, compile behavior, CUDA graph handling)
Ollama or vLLM version upgrade (serving stack behavior, scheduler, quantization engine)
Workload mix shift (prompt/decode ratio, model families, context lengths) without revalidation
Moving to multi-GPU without extending the measurement matrix

Evidence anchors (audit-ready)

Decision	Artifact	Validation
KV-cached decode cheaper than uncached	`results/eval/tr123/processed/cost_summary.json`	Phase-split $/token tables, 30/30 KV formula matches
Backend quality equivalence	`results/eval/tr124_phase1/anova_results.json`	7-metric ANOVA + Holm-Bonferroni
Q4_K_M preserves quality	`results/eval/tr125/phase2_analysis.json`	Wilson CIs + 4-tier classification
Compile prefill speedup on Linux	`results/eval/tr126/phase2_compile_analysis.json`	Welch's t + Cohen's d = -0.59
VRAM spillover dominates context scaling	`results/eval/tr127/context_scaling_analysis.json`	Two-regime fit + cliff detection
NUM_PARALLEL is a no-op	`results/eval/tr128/phase2_concurrency.json`	30 pairwise tests + Holm correction
Amdahl serial fractions s = 0.39-0.54	`results/eval/tr129/scaling_analysis.json`	Amdahl fit R-squared > 0.97
GPU bandwidth is the bottleneck	`results/eval/tr131/profiling_analysis.json`	PyTorch Direct control + Mann-Whitney
Continuous batching amortizes bandwidth	`results/eval/tr132/kernel_analysis.json`	Kernel count + Holm 8/8 significant
Lookup tables meet all validation targets	`results/eval/tr133/validation_results.json`	4/4 targets + 10/10 spot checks

References

Conclusive report: PublishReady/reports/Technical_Report_Conclusive_123-133.md
TR123: PublishReady/reports/Technical_Report_123.md (KV-Cache Production Economics)
TR124: PublishReady/reports/Technical_Report_124.md (Quality & Accuracy Baseline)
TR125: PublishReady/reports/Technical_Report_125.md (Quantization Decision Matrix)
TR126: PublishReady/reports/Technical_Report_126.md (Docker/Linux + Triton Validation)
TR127: PublishReady/reports/Technical_Report_127.md (Long-Context Performance Characterization)
TR128: PublishReady/reports/Technical_Report_128.md (Production Workload Characterization)
TR129: PublishReady/reports/Technical_Report_129.md (N-Agent Scaling Laws)
TR130: PublishReady/reports/Technical_Report_130.md (Serving Stack Benchmarking)
TR131: PublishReady/reports/Technical_Report_131.md (GPU Kernel Profiling -- Root-Cause Analysis)
TR132: PublishReady/reports/Technical_Report_132.md (In-Container GPU Kernel Profiling)
TR133: PublishReady/reports/Technical_Report_133.md (Predictive Capacity Planner)

Optional upgrades (board-ready polish)

Add a 1-page Decision Card at the front: the six-row matrix + boundary conditions + three invalidation triggers.
Add a single architecture diagram: request arrives, classify (N, workload shape), route to backend, apply quant/compile policy, monitor VRAM/TTFT, audit.
Add a Change Control clause: any hardware/stack upgrade triggers re-run of the core scenario matrix and re-validation of the chimeraforge lookup tables.

Phase 2 Decision Whitepaper