TR123-TR133 Decision Whitepaper
Executive guidance for deployment leaders
| Field | Value |
|---|---|
| Project | Banterhearts LLM Performance Research |
| Date | 2026-02-28 |
| Version | 1.0 |
| Report Type | Decision whitepaper |
| Audience | Decision makers, product leaders, ops leaders |
| Scope | TR123, TR124, TR125, TR126, TR127, TR128, TR129, TR130, TR131, TR132, TR133 |
| Primary Source | PublishReady/reports/Technical_Report_Conclusive_123-133.md |
Abstract
This whitepaper distills eleven technical reports (TR123-TR133) and approximately 70,000 measurements into deployment policy for local-first LLM inference on consumer hardware. The central question: given 70,000+ measurements on a consumer GPU, what model, quantization level, backend, and serving stack should we run -- and when does each choice break? Outcome: six shippable decisions that cover the full deployment space from single-agent chat to eight-agent orchestration, backed by artifact-level provenance and tested against their own falsification criteria.
Boundary conditions (do not skip)
This guidance is valid only under the measured boundary:
- Fixed hardware class: NVIDIA RTX 4080 Laptop GPU, 12 GB VRAM, 432 GB/s bandwidth, AD104
- Same inference stack versions: PyTorch 2.x, CUDA 12.x, Ollama 0.6.x, vLLM 0.7.x, TGI
- Same measurement definitions: phase-split (prefill vs decode), end-to-end, kernel-level
- Models at or below 8B parameters (GPT-2 124M through LLaMA-3.1-8B)
If any of these change, re-run the core measurement matrix and re-validate artifacts before applying these decisions.
Six decisions you can ship now
-
Default backend (single-agent): Ollama with Q4_K_M quantization. Highest throughput per dollar at negligible quality loss (within -4.1pp of FP16 on benchmark accuracy). Compile overhead, ONNX export complexity, and raw transformers startup costs are not justified for N=1.
-
Default backend (multi-agent, N >= 4): vLLM with FP16 weights. Continuous batching amortizes GPU memory bandwidth by 4.7-5.8x, yielding a 2.25x throughput advantage over Ollama at N=8. TGI provides equivalent amortization but lower absolute throughput.
-
Compile policy: torch.compile prefill only, on Linux, with Inductor+Triton backend. Never compile decode -- it crashes in all tested modes and provides no speedup even when it works (+2.2%, not significant). On Windows, torch.compile is a no-op (aot_eager fallback). Eager decode always.
-
Quantization policy: Q4_K_M is the universal default across all tested models (llama3.2-1b, qwen2.5-1.5b, phi-2, llama3.2-3b, llama3.1-8b). Use Q8_0 for quality-critical workloads requiring accuracy within 2pp of baseline. Never deploy Q2_K -- it produces near-random accuracy across all models. Q3_K_S is acceptable only for phi-2 and llama3.1-8b.
-
Context budget: Ollama for contexts exceeding 4K tokens on 12 GB VRAM. HuggingFace FP16 is limited to 4-8K tokens before VRAM spillover causes 25-105x latency cliffs. Monitor VRAM utilization; alert at 85%, hard-stop at 90%.
-
Capacity planning tool: Use
chimeraforge planfor configuration search across the (model, quantization, backend, N-agents) space. Empirical lookup tables validated at R-squared >= 0.859 for throughput, 0.968 for VRAM. Do not use M/D/1 queueing theory -- it deviates up to 20.4x from reality.
Decision matrix (one-glance policy)
| Condition | Backend | Quantization | Compile | Streaming |
|---|---|---|---|---|
| N=1, decode-heavy | Ollama | Q4_K_M | N/A | Always on |
| N=1, prefill-heavy (Linux) | Compiled HF | FP16 | Prefill only | N/A |
| N=2-3, any workload | Ollama or vLLM (benchmark your mix) | Q4_K_M or FP16 | Per backend default | Always on |
| N >= 4, any workload | vLLM FP16 | FP16 | N/A (vLLM manages internally) | Always on |
| Quality-critical | phi-2 or llama3.1-8b | Q8_0 | No | Always on |
| VRAM-constrained (12 GB) | Ollama | Q4_K_M; Q3_K_S for phi-2 only | No | Always on |
Key findings (decision-grade)
-
Quantization is the dominant lever. Q4_K_M saves 30-67% cost versus FP16 while losing at most -4.1pp accuracy. The effect compounds across cost, VRAM budget, and concurrency ceiling simultaneously. Q2_K is universally unacceptable (all models >11pp loss; qwen2.5-1.5b collapses -40.6pp).
-
Backend choice does not affect quality. TR124 tested 7 quality metrics across 5 models and 2 backends at temp=0; none showed statistically significant differences. The cheapest backend is the best backend for a given quality level.
-
The GPU, not the serving stack, is the scaling bottleneck. TR131 overturned TR130: PyTorch Direct (no serving stack) degrades 86.4% at N=8, worse than Ollama's 82.1%. Memory bandwidth stress increases +74% at N=8, statistically confirmed with very high confidence (sole Holm-surviving test).
-
Continuous batching amortizes the bandwidth bottleneck. TR132 proved the mechanism: 77-80% kernel count reduction and 79-83% bandwidth-per-token reduction at N=8. Amortization ratio is 4.7-5.8x (59-72% of the theoretical 8:1 maximum). vLLM and TGI use the identical mechanism.
-
VRAM spillover, not quadratic attention, is the practical bottleneck for long context. TR127 observed 25-105x latency cliffs when KV-cache pushes VRAM past capacity. Below spillover, Ollama prefill scaling is clean and sub-linear (b = 0.083-0.158). GQA models sustain 3-11x longer contexts than MHA models.
-
NUM_PARALLEL is a no-op. TR128 tested 30 pairwise comparisons; 0/30 significant (mean absolute change 4.0%). M/D/1 queueing theory deviates up to 20.4x from observed latency. Streaming adds zero overhead (0/9 tests significant).
-
Multi-agent throughput plateaus at N=2. TR129 fits Amdahl's Law with serial fractions s = 0.39-0.54 (R-squared > 0.97). Per-agent throughput at N=8 is 17-20% of solo throughput. Fairness remains excellent (Jain's index >= 0.997).
-
Consumer hardware is 95.4% cheaper than cloud. TR123 TCO at 1B tokens/month: $153/yr consumer versus $2,880/yr AWS on-demand (GPT-2/compile). Break-even for an RTX 4080 occurs at 0.3-2.7 months at 10M requests/month.
Operational recommendations (policy statements)
Backend selection
- Policy: Default to Ollama Q4_K_M for single-agent serving (N=1).
- Policy: Switch to vLLM FP16 at N >= 4 agents. N=2-3 is the crossover zone; benchmark your workload mix.
- Policy: Use TTFT as the tiebreaker in the crossover zone -- vLLM/TGI deliver 6-8x faster time-to-first-token (22-35ms versus 163-194ms).
Quantization
- Policy: Q4_K_M is the universal default for all models tested.
- Policy gate: Use Q8_0 only when composite quality >= 0.60 is required and Q4_K_M falls below the threshold.
- Policy ban: Never deploy Q2_K in any quality-sensitive context. Q3_K_S is banned for llama3.2-1b, llama3.2-3b, and qwen2.5-1.5b (9.5-12.2pp loss).
Compile
- Policy: Compile prefill only, on Linux, with Inductor+Triton. Speedup range: 1.3x (qwen2.5-3b) to 2.5x (gpt2-25m).
- Policy gate: Never compile decode -- 100% crash rate in all tested modes. Eager decode always.
- Policy: All Windows torch.compile results are invalid (aot_eager fallback). Do not trust them.
Context budget
- Policy: VRAM budget = model_weights + KV_cost_per_token x context_length. GQA models get 3-11x more context per GB.
- Policy: Use Ollama for contexts exceeding 4K tokens on 12 GB VRAM. HF FP16 collapses (95% throughput loss) at the spillover boundary.
- Policy gate: Alert at 85% VRAM utilization; hard-stop at 90% to avoid 25-105x latency cliffs.
Capacity planning
- Policy: Use
chimeraforge planfor all configuration search and sizing decisions. Runtime < 1 second, zero GPU required. - Policy ban: Do not use M/D/1 queueing theory for latency prediction (20.4x deviation from reality).
- Policy: Cap utilization at 70% of measured peak throughput to absorb burst traffic (TTFT amplifies 29.9x at saturation).
Economic impact
Cost levers (ranked by magnitude)
- Infrastructure choice: Consumer GPU versus cloud on-demand saves 95.4%. This is the single largest cost lever and dwarfs all other optimizations combined.
- Quantization: Q4_K_M saves 30-67% versus FP16. Cost range spans 10x across the quantization-model matrix ($0.020 to $0.198 per 1M tokens).
- Serving stack at scale: vLLM delivers 2.25x throughput at N=8, halving the per-request cost for multi-agent workloads.
- Compile (prefill-heavy, Linux only): 1.3-2.5x prefill speedup for small models. Diminishes at larger sizes.
Dollar figures
- Best cost: $0.013/1M tokens (GPT-2/compile, chat blend)
- Best cost above 1B parameters: $0.047/1M tokens (LLaMA-3.2-1B/compile)
- Consumer TCO at 1B tokens/month: $153/yr (GPT-2/compile) to $561/yr (LLaMA-3.2-1B/compile)
- AWS TCO at 1B tokens/month: $2,880/yr (GPT-2) to $8,584/yr (LLaMA-3.2-1B); consumer hardware saves 95.4%
- Break-even: RTX 4080 ($1,200) pays for itself in 0.3 months at 10M requests/month (llama3.2-1b/Ollama), 2.7 months at 1M requests/month
Production throughput ceiling
- Single-agent peak: 1.17 req/s (Ollama, llama3.2-1b)
- Multi-agent peak: 559 tok/s total (vLLM, llama3.2-1b, N=8)
- Per-agent at N=8: 17-20% of solo throughput (Ollama); 56% efficiency retained (vLLM)
Implementation plan (30-day view)
Days 1-7: reproduce and validate
- Re-run the core scenario matrix on your hardware and workload mix.
- Validate artifacts against the chain-of-custody (Appendix B of the conclusive report).
- Capture full manifests: GPU model, driver, CUDA, PyTorch, Ollama/vLLM versions, OS.
Days 8-14: translate into your planning numbers
- Recompute $/token and capacity with your electricity rate and hardware amortization schedule.
- Run
chimeraforge planwith your specific (model, quantization, N-agents, SLA, quality floor, budget ceiling) inputs. - If compiling on Linux: verify Triton kernel generation (check
torch._inductorlogs for generated kernels, not aot_eager fallback).
Days 15-30: enforce policies
- Ship backend routing rules: Ollama for N <= 2, vLLM for N >= 4. Benchmark your workload at N=3.
- Deploy Q4_K_M as the default quantization; require explicit approval for Q8_0 or Q3_K_S exceptions.
- Set VRAM monitoring with 85%/90% alert/hard-stop thresholds.
- Cap utilization at 70% of measured peak; set TTFT alarms at your SLA boundary.
- Add re-run-required triggers to change management for any hardware, driver, or stack version change.
Risks, limitations, invalidation triggers
Limitations
- Single hardware baseline (RTX 4080 Laptop, 12 GB VRAM). Not portable across GPU classes without re-run.
- Models tested at or below 8B parameters. Larger models may exhibit different scaling, quantization tolerance, and VRAM behavior.
- $/token values are planning proxies (shadow prices), not exact TCO. Apply your own rates.
- Quality metrics are automated (BERTScore, ROUGE-L, MMLU, ARC). No human evaluation performed.
- Formal equivalence testing could not confirm with full statistical power at a tight +/-3pp margin, so quality tiers are based on observed accuracy differences rather than formal equivalence proof.
- Consumer WDDM driver blocks direct kernel-level metrics (ncu null); bandwidth attribution relies on back-of-envelope calculation.
- Scaling model is the weakest predictor (R-squared = 0.647); Amdahl captures the trend but misses backend interaction effects.
What invalidates this guidance
- GPU, driver, or CUDA version change without re-run
- PyTorch major version upgrade (kernel paths, compile behavior, CUDA graph handling)
- Ollama or vLLM version upgrade (serving stack behavior, scheduler, quantization engine)
- Workload mix shift (prompt/decode ratio, model families, context lengths) without revalidation
- Moving to multi-GPU without extending the measurement matrix
Evidence anchors (audit-ready)
| Decision | Artifact | Validation |
|---|---|---|
| KV-cached decode cheaper than uncached | results/eval/tr123/processed/cost_summary.json |
Phase-split $/token tables, 30/30 KV formula matches |
| Backend quality equivalence | results/eval/tr124_phase1/anova_results.json |
7-metric ANOVA + Holm-Bonferroni |
| Q4_K_M preserves quality | results/eval/tr125/phase2_analysis.json |
Wilson CIs + 4-tier classification |
| Compile prefill speedup on Linux | results/eval/tr126/phase2_compile_analysis.json |
Welch's t + Cohen's d = -0.59 |
| VRAM spillover dominates context scaling | results/eval/tr127/context_scaling_analysis.json |
Two-regime fit + cliff detection |
| NUM_PARALLEL is a no-op | results/eval/tr128/phase2_concurrency.json |
30 pairwise tests + Holm correction |
| Amdahl serial fractions s = 0.39-0.54 | results/eval/tr129/scaling_analysis.json |
Amdahl fit R-squared > 0.97 |
| GPU bandwidth is the bottleneck | results/eval/tr131/profiling_analysis.json |
PyTorch Direct control + Mann-Whitney |
| Continuous batching amortizes bandwidth | results/eval/tr132/kernel_analysis.json |
Kernel count + Holm 8/8 significant |
| Lookup tables meet all validation targets | results/eval/tr133/validation_results.json |
4/4 targets + 10/10 spot checks |
References
- Conclusive report:
PublishReady/reports/Technical_Report_Conclusive_123-133.md - TR123:
PublishReady/reports/Technical_Report_123.md(KV-Cache Production Economics) - TR124:
PublishReady/reports/Technical_Report_124.md(Quality & Accuracy Baseline) - TR125:
PublishReady/reports/Technical_Report_125.md(Quantization Decision Matrix) - TR126:
PublishReady/reports/Technical_Report_126.md(Docker/Linux + Triton Validation) - TR127:
PublishReady/reports/Technical_Report_127.md(Long-Context Performance Characterization) - TR128:
PublishReady/reports/Technical_Report_128.md(Production Workload Characterization) - TR129:
PublishReady/reports/Technical_Report_129.md(N-Agent Scaling Laws) - TR130:
PublishReady/reports/Technical_Report_130.md(Serving Stack Benchmarking) - TR131:
PublishReady/reports/Technical_Report_131.md(GPU Kernel Profiling -- Root-Cause Analysis) - TR132:
PublishReady/reports/Technical_Report_132.md(In-Container GPU Kernel Profiling) - TR133:
PublishReady/reports/Technical_Report_133.md(Predictive Capacity Planner)
Optional upgrades (board-ready polish)
- Add a 1-page Decision Card at the front: the six-row matrix + boundary conditions + three invalidation triggers.
- Add a single architecture diagram: request arrives, classify (N, workload shape), route to backend, apply quant/compile policy, monitor VRAM/TTFT, audit.
- Add a Change Control clause: any hardware/stack upgrade triggers re-run of the core scenario matrix and re-validation of the chimeraforge lookup tables.