Phase 1 Decision Whitepaper

Executive guidance for language, architecture, runtime, and model selection.

Project: Banterhearts LLM Performance Research
Date: 2025-12-28
Version: 1.0
Report Type: Decision whitepaper
Audience: Decision makers, engineering leads, ops leaders
Scope: TR108, TR109, TR110, TR111_v2, TR112_v2, TR113, TR114_v2, TR115_v2, TR116
Primary Source: PublishReady/reports/ (individual TR reports)

Abstract

This whitepaper distills TR108-TR116 into deployment policy for multi-agent LLM inference on consumer-grade hardware. The program answers one central question:

Should we use Rust or Python for multi-agent LLM workloads on consumer hardware, and what architecture, runtime, and model choices maximize efficiency?

Outcome: Rust with dual Ollama instances, Tokio-default runtime, and Gemma 3 achieves 99.4% multi-agent parallel efficiency versus Python's structural ceiling of ~86%. The single highest-impact change is architectural (dual Ollama), not linguistic (Rust vs Python).


Boundary conditions (do not skip)

This guidance is valid only under the measured boundary:

  • NVIDIA RTX 4080 Laptop GPU (12 GB VRAM), Intel i9-13980HX, 32 GB DDR5-4800
  • Windows 11 Pro
  • Ollama inference backend (single and dual instance configurations)
  • Gemma 3 4.3B Q4_K_M, Llama 3.1 8B instruct Q4_0, Qwen 2.5 7B
  • 2-agent concurrent workloads (baseline-vs-chimera, chimera-homo, chimera-hetero)
  • Measurement definitions: tok/s, TTFT, parallel efficiency, contention rate

If any of these change, re-run the core benchmark matrix and re-validate all findings.


Executive summary: six decisions you can ship now

  1. Language: Rust for production. 15.2% faster throughput, 67% less memory, 83% faster startup than Python.
  2. Architecture: Dual Ollama mandatory for multi-agent. Contention drops from 63% to 0.74%; efficiency jumps from 82.2% to 99.4%.
  3. Runtime: Tokio-default (#[tokio::main]). 98.72% mean efficiency, 1.21pp standard deviation. No custom configuration needed.
  4. Model: Gemma 3 for scaling efficiency (99.2%). Llama 3.1 for reasoning-heavy tasks (96.5%).
  5. Python ceiling: Structural event loop limit at ~86% multi-agent efficiency. Rust is mandatory for high-throughput concurrent workloads.
  6. Configuration transfer failure is systematic: Single-inference optimal settings do not transfer to agent workflows, and agent-optimal settings do not transfer to multi-agent. Validate per deployment mode.

If you follow these rules, you avoid the four common deployment failures: wrong language at scale, single Ollama bottleneck, wrong runtime, transferred configs.


Definitions (one-time)

  • TTFT: Time to first token (ms). Measures cold-start and prompt processing latency.
  • tok/s: Tokens per second. Primary throughput metric for decode-phase generation.
  • Parallel efficiency: (concurrent throughput / sum of individual throughputs) x 100. 100% = zero overhead. See the sketch after this list.
  • Contention rate: Percentage of runs where concurrent agents degrade each other's throughput below baseline.
  • Dual Ollama: Two independent Ollama server instances on separate ports, each serving one agent. Eliminates server-level request serialization.
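
Both derived metrics are simple ratios over measured throughput. A minimal Rust sketch; the function names and example numbers are illustrative, not taken from the TR harnesses:

```rust
/// Parallel efficiency: concurrent throughput as a percentage of the sum
/// of each agent's solo throughput. 100% means zero coordination overhead.
fn parallel_efficiency(concurrent_tok_s: f64, solo_tok_s: &[f64]) -> f64 {
    concurrent_tok_s / solo_tok_s.iter().sum::<f64>() * 100.0
}

/// Contention rate: percentage of runs in which a concurrent agent's
/// throughput fell below its solo baseline.
fn contention_rate(run_degraded: &[bool]) -> f64 {
    let contended = run_degraded.iter().filter(|&&d| d).count();
    contended as f64 / run_degraded.len() as f64 * 100.0
}

fn main() {
    // Illustrative numbers only: two agents at 57 tok/s solo,
    // 113.3 tok/s combined while running concurrently.
    println!("{:.1}%", parallel_efficiency(113.3, &[57.0, 57.0])); // 99.4%
    println!("{:.2}%", contention_rate(&[false, false, true, false])); // 25.00%
}
```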

Decision matrix (one-glance policy)

Condition | Language | Architecture | Runtime | Model
Multi-agent production (>95% efficiency required) | Rust | Dual Ollama | Tokio-default | Gemma 3
Multi-agent, reasoning-heavy | Rust | Dual Ollama | Tokio-default | Llama 3.1
Single-agent production | Rust | Single Ollama | Tokio-default | Model-dependent
Prototyping / research | Python | Dual Ollama | asyncio | Any
Binary size constrained (<5 MB) | Rust | Dual Ollama | smol-1KB | Gemma 3

Key findings (decision-grade)

  • Rust single-agent advantage is real: +15.2% throughput (114.54 vs 99.34 tok/s), 46% lower run-to-run variation (2.6% vs 4.8% CV), 58% faster cold-start TTFT. Evidence: TR112_v2, 111 runs.
  • Dual Ollama is the single highest-impact change: Transforms multi-agent from 82.2% efficiency / 63% contention (TR113, single Ollama) to 99.4% efficiency / 0.74% contention (TR114_v2, dual Ollama). This is an architectural fix, not a code fix.
  • Tokio-default wins on consistency, not peak: All four working runtimes hit ~100% peak (99.87-99.99%), but Tokio-default has the tightest distribution (1.21pp sigma vs 4.03pp for localset, 4.87pp for smol). Async-std is unusable (50% serialization). Evidence: TR115_v2, 150 runs.
  • Gemma 3 is the efficiency champion: 99.2% chimera-homo efficiency in Rust, 97.3% overall multi-agent mean. Llama 3.1 close behind at 96.5%. Qwen 2.5 trails at 90.0% with heavier coordination overhead. Evidence: TR116, 60 runs.
  • Python cannot exceed ~86% multi-agent efficiency: Best Python result across all models is 83.8% (Llama 3.1). This is a structural event loop ceiling, not a tuning gap. Rust achieves +12-17pp higher efficiency for every model tested.
  • Config transfer failure is systematic: TR108 optimal (GPU=999, CTX=4096) fails for agent workflows (TR109 optimal: GPU=60-80, CTX=512-1024). Agent-optimal fails for multi-agent (TR110/TR114_v2 optimal: GPU=80, CTX=512-2048). Each deployment mode requires independent validation.

Operational recommendations (policy statements)

Language selection

  • Policy: Default to Rust for all production workloads.
  • Policy: Permit Python only for prototyping, research, and development-velocity-critical paths.
  • Gate: If multi-agent efficiency >90% is required, Rust is mandatory.

Architecture

  • Policy: Always deploy dual Ollama instances for multi-agent workloads (see the dual-instance sketch after this list).
  • Policy: Monitor contention rate; alert if contention exceeds 1% of runs.
  • Policy: Single Ollama is acceptable only for single-agent deployments.
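
For reference, a minimal sketch of the dual-instance pattern, assuming two Ollama servers already listen on separate ports (Ollama's OLLAMA_HOST environment variable controls the bind address) and using Ollama's /api/generate endpoint via reqwest. The ports, prompts, and model tag are illustrative, not the TR harness code:

```rust
use serde_json::json;

// One agent per Ollama instance; the second port is illustrative.
const AGENT_ENDPOINTS: [&str; 2] = ["http://127.0.0.1:11434", "http://127.0.0.1:11435"];

async fn run_agent(endpoint: &str, prompt: &str) -> Result<String, reqwest::Error> {
    // stream=false makes /api/generate return a single JSON object.
    let resp = reqwest::Client::new()
        .post(format!("{endpoint}/api/generate"))
        .json(&json!({ "model": "gemma3", "prompt": prompt, "stream": false }))
        .send()
        .await?
        .json::<serde_json::Value>()
        .await?;
    Ok(resp["response"].as_str().unwrap_or_default().to_owned())
}

#[tokio::main] // Tokio-default, per the runtime policy below.
async fn main() -> Result<(), reqwest::Error> {
    // Each agent owns a server, so the two requests never serialize
    // behind a single Ollama scheduler.
    let (a, b) = tokio::join!(
        run_agent(AGENT_ENDPOINTS[0], "Agent A task prompt"),
        run_agent(AGENT_ENDPOINTS[1], "Agent B task prompt"),
    );
    println!("agent A: {}\nagent B: {}", a?, b?);
    Ok(())
}
```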

Runtime

  • Policy: Use #[tokio::main] with no custom thread pool configuration (see the sketch after this list).
  • Policy: Never use async-std (50% serialization failure).
  • Policy: Smol-1KB is acceptable only when binary size is the binding constraint (<5 MB).
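
Concretely, the first policy is just the attribute macro with nothing else; a sketch, with the ruled-out builder pattern left as a comment:

```rust
// Policy-compliant: Tokio's default multi-threaded runtime, no tuning.
#[tokio::main]
async fn main() {
    // agent orchestration goes here
}

// Ruled out by the policy above: custom thread pool configuration, e.g.
// tokio::runtime::Builder::new_multi_thread()
//     .worker_threads(4)
//     .enable_all()
//     .build()
```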

Model selection

  • Policy: Default to Gemma 3 (4.3B Q4_K_M) for multi-agent efficiency.
  • Policy: Use Llama 3.1 8B for reasoning-heavy tasks where quality outweighs throughput.
  • Policy: Avoid Qwen 2.5 for multi-agent workloads (90% Rust, 77.6% Python; 7-13pp below alternatives).

Configuration

  • Policy: Multi-agent baseline: GPU=80, CTX=512-2048, TEMP=0.6-0.8 (mapped to Ollama request options in the sketch after this list).
  • Policy: Never transfer single-inference configs to agent or multi-agent workloads.
  • Policy: Validate every configuration change with a minimum 5-run benchmark.
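
As a sketch, the baseline translates into Ollama's per-request options; mapping GPU, CTX, and TEMP to Ollama's num_gpu, num_ctx, and temperature option names is an assumption here, not something the reports spell out:

```rust
use serde_json::json;

fn main() {
    // Multi-agent baseline from the policy above.
    let options = json!({
        "num_gpu": 80,       // GPU=80: model layers offloaded to the GPU
        "num_ctx": 2048,     // CTX: validated range 512-2048; choose per workload
        "temperature": 0.7,  // TEMP: validated range 0.6-0.8
    });
    // Attach per request: POST /api/generate with
    // {"model": "...", "prompt": "...", "options": options}.
    println!("{options}");
}
```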

Economic impact (what changes your spend)

Memory and resource efficiency:

  • Memory: Rust uses 65-90 MB vs Python's 300-350 MB (67% reduction). Enables 3x concurrent capacity per host.
  • Throughput: 15.2% higher single-agent tok/s reduces cost per token proportionally.
  • Startup: 0.2s vs 1.5s (83% faster). Enables rapid scaling and serverless-compatible cold starts.

Infrastructure savings:

  • Dual Ollama at 99.4% efficiency: 2 instances serve the work of 2 agents with <1% overhead.
  • Python at 86% ceiling: Requires ~15% more wall time, translating to 15% more compute cost.
  • Instance reduction: At scale, Rust + dual Ollama requires 2 instances where Python requires 4 for equivalent throughput. 50% infrastructure cost reduction.

Annual savings estimate:

  • Small scale (10K req/month): ~$1,440/year (memory + throughput savings)
  • Medium scale (100K req/month): ~$3,040/year (50% instance reduction)
  • Large scale (1M req/month): ~$7,680/year (compounding throughput + instance savings)
  • Break-even on Rust migration: 12.7-20 months depending on development overhead ($3-5K estimated); see the worked example after this list.
  • 5-year TCO: 26% lower with Rust + dual Ollama stack
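
The break-even row is plain division of migration cost by monthly savings; a worked sketch using illustrative midpoints of the figures above:

```rust
fn main() {
    // Illustrative midpoints, not measured values.
    let migration_cost_usd = 4_000.0; // middle of the $3-5K estimate
    let annual_savings_usd = 3_040.0; // medium scale (100K req/month)
    let break_even_months = migration_cost_usd / (annual_savings_usd / 12.0);
    // Prints ~15.8, inside the reported 12.7-20 month range.
    println!("break-even: {break_even_months:.1} months");
}
```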

Implementation plan (30-day view)

Days 1-7: reproduce and validate

  • Re-run TR112_v2 single-agent benchmarks on your hardware (Rust vs Python, 18 configs each).
  • Confirm Rust throughput advantage and consistency characteristics match within 10% of reported values.

Days 8-14: deploy dual Ollama and benchmark multi-agent

  • Stand up dual Ollama instances on separate ports.
  • Run TR114_v2 multi-agent matrix (baseline-vs-chimera, chimera-homo, chimera-hetero).
  • Validate >95% parallel efficiency and <1% contention rate.
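
A minimal pass/fail gate for these exit criteria; the function name and example values are illustrative:

```rust
/// Exit gate for the multi-agent phase: both thresholds must hold.
fn passes_multi_agent_gate(efficiency_pct: f64, contention_pct: f64) -> bool {
    efficiency_pct > 95.0 && contention_pct < 1.0
}

fn main() {
    assert!(passes_multi_agent_gate(99.4, 0.74));  // TR114_v2-level result passes
    assert!(!passes_multi_agent_gate(82.2, 63.0)); // TR113 single-Ollama numbers fail
    println!("multi-agent gate passed");
}
```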

Days 15-21: select runtime and model

  • Run the TR115_v2 runtime comparison (at minimum Tokio-default vs smol-1KB) with 30 runs each.
  • Run TR116 cross-model comparison (Gemma 3, Llama 3.1, Qwen 2.5) with 10 runs each.
  • Confirm Tokio-default consistency and Gemma 3 efficiency leadership.

Days 22-30: enforce policies and monitor

  • Ship language, architecture, runtime, and model policies from the operational recommendations above.
  • Deploy contention monitoring with <1% alert threshold.
  • Document invalidation triggers in change management.
  • Set re-run-required gates on hardware, OS, Ollama, model, and toolchain changes.

Risks, limitations, invalidation triggers

Limitations:

  • Single hardware baseline (RTX 4080 Laptop). Not portable across GPU classes without re-run.
  • Windows 11 only. Linux/macOS may exhibit different async runtime characteristics.
  • 2-agent concurrency limit. Scaling to 3+ agents is unmeasured.
  • Shadow pricing (estimated $/token), not actual cloud TCO.
  • Quality not measured in multi-agent mode. Pair efficiency decisions with quality evaluation.
  • All models tested at Q4 quantization. Different quantization levels may shift rankings.

Invalidation triggers (any one of the following invalidates this guidance):

  • Hardware change (different GPU, CPU, or memory tier)
  • OS change (Linux, macOS, or major Windows update)
  • Ollama version upgrade (inference scheduling may change)
  • Model update (new quantization, architecture revision, or fine-tune)
  • Rust toolchain major version change (async runtime behavior may shift)
  • Agent count change (3+ agents introduces new contention dynamics)

Evidence anchors (audit-ready)

  • Rust vs Python single-agent throughput: TR112_v2, 111 runs, 37 configs (PublishReady/reports/Technical_Report_112_v2.md)
  • Single Ollama multi-agent contention: TR113, 19 configs, 63% contention rate (PublishReady/reports/Technical_Report_113.md)
  • Dual Ollama multi-agent efficiency: TR114_v2, 135 runs, 27 configs, 99.4% peak (PublishReady/reports/Technical_Report_114_v2.md)
  • Runtime ranking and consistency: TR115_v2, 150 runs, 5 runtimes (PublishReady/reports/Technical_Report_115_v2.md)
  • Cross-model multi-agent validation: TR116, 60 runs, 3 models x 2 runtimes (PublishReady/reports/Technical_Report_116.md)
  • Python efficiency ceiling: TR116, Python never exceeds 86% across all models (PublishReady/reports/Technical_Report_116.md)
  • Config transfer failure: TR108 vs TR109 optimization inversion (PublishReady/reports/Technical_Report_108.md, PublishReady/reports/Technical_Report_109.md)

References

  • TR108: PublishReady/reports/Technical_Report_108.md (Ollama model benchmarking, 158 configs)
  • TR109: PublishReady/reports/Technical_Report_109.md (Python agent workflow optimization)
  • TR110: PublishReady/reports/Technical_Report_110.md (Python multi-agent, 150 runs)
  • TR111_v2: PublishReady/reports/Technical_Report_111_v2.md (Rust single-agent, 57 runs)
  • TR112_v2: PublishReady/reports/Technical_Report_112_v2.md (Rust vs Python comparison, 111 runs)
  • TR113: PublishReady/reports/Technical_Report_113.md (Rust multi-agent, single Ollama)
  • TR114_v2: PublishReady/reports/Technical_Report_114_v2.md (Rust multi-agent, dual Ollama, 135 runs)
  • TR115_v2: PublishReady/reports/Technical_Report_115_v2.md (Rust runtime deep dive, 150 runs)
  • TR116: PublishReady/reports/Technical_Report_116.md (Cross-model multi-agent, 60 runs)