Conclusive Report 108-116: From Python Baseline to Rust Production --- Language, Architecture, Runtime, and Model Selection for Multi-Agent LLM Systems
Synthesis of single-agent optimization, cross-language migration, multi-agent coordination, runtime analysis, and cross-model validation
| Field | Value |
|---|---|
| Project | Banterhearts LLM Performance Research |
| Date | 2025-12-28 |
| Author | Research Team |
| Report Type | Conclusive synthesis across TR108-TR116 (artifact-backed) |
| Scope | TR108, TR109, TR110, TR111_v2, TR112_v2, TR113, TR114_v2, TR115_v2, TR116 |
| Primary Sources | PublishReady/reports/Technical_Report_108.mdPublishReady/reports/Technical_Report_109.mdPublishReady/reports/Technical_Report_110.mdPublishReady/reports/Technical_Report_111_v2.mdPublishReady/reports/Technical_Report_112_v2.mdPublishReady/reports/Technical_Report_113.mdPublishReady/reports/Technical_Report_114_v2.mdPublishReady/reports/Technical_Report_115_v2.mdPublishReady/reports/Technical_Report_116.md |
Abstract
This report synthesizes nine technical reports (TR108 through TR116) into a single decision-grade narrative for multi-agent LLM inference on consumer hardware. The research arc begins with single-agent parameter optimization on Ollama-served models (TR108), discovers that agent workflow optimization diverges fundamentally from single-inference tuning (TR109), and establishes a Python multi-agent gold standard at 99.25% parallel efficiency with dual Ollama instances (TR110). The program then migrates to Rust with full workflow parity, demonstrating 15.2% higher throughput, 58% faster time-to-first-token, and 67% lower memory consumption at the single-agent level (TR111_v2, TR112_v2). An initial multi-agent deployment in Rust reveals a single-Ollama bottleneck producing 63% resource contention (TR113), which is resolved by adopting dual Ollama architecture to achieve 99.4% peak efficiency and surpass the Python baseline (TR114_v2). A systematic evaluation of five async runtimes identifies tokio-default as the production winner on consistency grounds (98.72% mean, 1.21pp standard deviation), not peak performance (TR115_v2). Cross-model validation across Gemma 3, Llama 3.1 8B, and Qwen 2.5 7B confirms that Rust's 12-17 percentage-point efficiency advantage over Python is structural and model-independent, with Gemma 3 achieving 99.2% chimera-homo efficiency as the scaling champion and Llama 3.1 offering 96.5% for reasoning-intensive workloads (TR116). Across 903+ benchmark runs, the synthesis establishes six decisions that can be shipped: Rust for production language, dual Ollama for multi-agent architecture, tokio-default for async runtime, Gemma 3 or Llama 3.1 for model selection, and agent-specific configuration parameters (GPU=60-80, CTX=256-512) that diverge from single-inference optima. A structural Python efficiency ceiling at approximately 86% is identified and attributed to asyncio event-loop contention under concurrent LLM workloads, making Rust mandatory for deployments targeting sustained throughput above 100 tokens per second. This work is decision-grade, not universal: all conclusions are bounded to the measured hardware baseline (RTX 4080 Laptop GPU, i9-13980HX, 32 GB DDR5) and the Ollama inference stack.
Executive Summary
This report closes the loop started in TR108 and makes the TR108-TR116 research program decision-grade. We moved from a single-model parameter sweep through a complete language migration, an architecture discovery, a runtime selection, and a cross-model validation. The outcome is a set of stable, artifact-backed conclusions about what language, architecture, runtime, and model to deploy for multi-agent LLM workloads on consumer hardware.
The synthesis in one line
The dominant driver of multi-agent efficiency is runtime architecture (Rust vs Python) and server topology (dual vs single Ollama), not model choice or parameter tuning --- and the configuration transfer problem means single-inference benchmarks cannot predict agent workflow behavior.
Claim status (cross-report, reviewer-proof)
| Claim ID | Claim | Evidence Base | Status |
|---|---|---|---|
| C1 | Rust is the correct production language for multi-agent LLM workloads (15.2% faster, 67% less memory, 83% faster startup) | TR111_v2, TR112_v2, TR114_v2, TR116 | Supported (consistent across 4 reports, 363+ runs, 3 models) |
| C2 | Dual Ollama architecture is mandatory for multi-agent deployments (reduces contention from 63% to 0.74%) | TR110, TR113, TR114_v2 | Supported (single Ollama produces 63% contention in TR113; dual Ollama achieves 99.4% in TR114_v2; Python gold standard in TR110 also uses dual Ollama) |
| C3 | Tokio-default (#[tokio::main]) is the optimal async runtime for multi-agent LLM workloads | TR115_v2 | Supported (best consistency at 1.21pp sigma across 150 runs; all 4 working runtimes achieve ~100% peak, so consistency is the differentiator) |
| C4 | Gemma 3 is the best model for multi-agent scaling (99.2% efficiency); Llama 3.1 is best for reasoning-heavy agents (96.5%) | TR116 | Supported (60 runs across 3 models x 2 runtimes; Gemma 3 dominates in chimera-homo; Llama 3.1 trades throughput for reasoning quality) |
| C5 | Python has a structural efficiency ceiling at ~86% for multi-agent LLM workloads; Rust is mandatory for >100 tok/s sustained concurrent throughput | TR110, TR116 | Supported (Python never exceeds 86% efficiency in TR116 across 3 models; Python peak in TR110 is 99.25% but under ideal single-model conditions only; cross-model average ceiling is ~84%) |
| C6 | Agent workflow configurations (GPU=60-80, CTX=256-512) differ from single-inference optima (GPU=999, CTX=4096); the configuration transfer problem is real | TR108, TR109, TR111_v2 | Supported (TR108 finds GPU=999 optimal for single inference; TR109 finds inverse relationship with smaller contexts outperforming; TR111_v2 confirms in Rust) |
Program trajectory (TR108 -> TR116)
The research program is intentionally sequential. Each report closes a failure mode or opens a new question from the previous one:
-
TR108 establishes the single-agent performance baseline across 158+ configurations, identifying Gemma 3 as the throughput leader (102.85 tok/s vs Llama 3.1 at 76.59 tok/s) and GPU layer allocation as the critical parameter. This is the starting benchmark.
-
TR109 falsifies the assumption that single-inference optima transfer to agent workflows. Context size shows an inverse relationship (512-1024 beats 4096), GPU layers have a sweet spot at 60-80 (not 999), and multi-step workflows expose fundamentally different optimization surfaces. This is the configuration transfer discovery.
-
TR110 constructs the Python multi-agent gold standard: 99.25% parallel efficiency at 1.985x speedup using dual Ollama instances, homogeneous Chimera agents, GPU=80, CTX=2048, TEMP=1.0. This establishes the target that any alternative language must match or exceed. This is the Python ceiling.
-
TR111_v2 achieves full Rust-Python workflow parity (file I/O, multi-stage LLM calls, metric collection) and reveals Rust is 15.2% faster in single-agent throughput (114.54 vs 99.34 tok/s). This reverses the earlier incorrect assumption of Python superiority. This is the Rust parity proof.
-
TR112_v2 provides the comprehensive cross-language comparison: Rust wins on throughput (+15.2%), consistency (CV 2.6% vs 4.8%), cold-start TTFT (-58%), memory (-67%), and startup time (-83%). Python retains advantages in development velocity and peak optimization variance. This is the language decision.
-
TR113 deploys Rust multi-agent with a single Ollama instance and discovers catastrophic performance: 82.2% peak efficiency, 63% resource contention rate, and a -17.0 percentage-point gap versus Python. The root cause is server-level request serialization, not a Rust runtime deficiency. This is the architecture failure.
-
TR114_v2 fixes the architecture with dual Ollama instances, achieving 99.4% peak efficiency (99.396% config average vs Python's 99.25%), 98.28% overall mean across 135 runs, and confirming that Rust matches or exceeds Python when infrastructure parity is maintained. This is the architecture fix.
-
TR115_v2 tests five async runtimes (tokio-default, tokio-localset, async-std, smol, smol-1KB) across 150 runs and discovers that all four working runtimes achieve ~100% peak efficiency (99.87-99.99%), making consistency the only differentiator. Tokio-default wins with 1.21pp standard deviation; async-std fails catastrophically at 50% due to ecosystem lock-in. This is the runtime decision.
-
TR116 validates all prior findings across three model architectures (Gemma 3, Llama 3.1 8B, Qwen 2.5 7B) and confirms that Rust's 12-17pp efficiency advantage over Python is structural and model-independent. Gemma 3 achieves 99.2% chimera-homo efficiency; Python never exceeds 86%. This is the cross-model confirmation.
This ordering is deliberate: you cannot migrate languages before establishing a baseline; you cannot compare multi-agent before achieving workflow parity; you cannot select a runtime before understanding architecture constraints; you cannot generalize before cross-model validation.
Bottom-line decisions you can ship
-
Language decision: Deploy Rust for production multi-agent LLM workloads. The 15.2% throughput advantage, 67% memory reduction, and 83% faster startup are consistent across all tested configurations. Python remains appropriate for prototyping and single-agent exploratory work.
-
Architecture decision: Use dual Ollama instances (one per agent) for all multi-agent deployments. A single Ollama instance produces 63% resource contention and caps efficiency at 82%. Dual instances reduce contention to 0.74% and enable 99%+ efficiency.
-
Runtime decision: Use tokio-default (#[tokio::main]) with no custom configuration. All four working Rust async runtimes achieve near-identical peak performance (99.87-99.99%); the differentiator is consistency. Tokio-default provides the best consistency (1.21pp sigma) and requires zero configuration.
-
Model decision (multi-agent): Use Gemma 3 (gemma3:latest) for high-concurrency agent swarms requiring maximum scaling efficiency (99.2% chimera-homo). Use Llama 3.1 8B (llama3.1:8b-instruct-q4_0) for reasoning-heavy agent tasks where throughput is secondary (96.5% efficiency, lower raw tok/s but near-equivalent scaling). Avoid Qwen 2.5 7B for multi-agent workloads (90.0% efficiency, 13-19pp below Gemma/Llama).
-
Python limitation: Do not deploy Python for production multi-agent systems targeting sustained throughput above 100 tok/s or requiring more than two concurrent agents. Python's structural efficiency ceiling at ~86% imposes a 15-20% wall-time penalty that compounds with agent count.
-
Configuration policy: Do not transfer single-inference optima to agent workflows. Use GPU=60-80 (not 999), CTX=256-512 (not 4096), and TEMP=0.6-1.0 for agent deployments. The configuration transfer problem is real and produces suboptimal results when ignored.
Operational Defaults (Decision Card)
Valid under stated boundary conditions (RTX 4080 Laptop GPU, i9-13980HX, 32 GB DDR5, Ollama inference stack, Windows 11).
- Production language: Rust (tokio async runtime, reqwest HTTP client)
- Async runtime: tokio-default via
#[tokio::main]--- no LocalSet, no smol, no async-std - Server architecture: Dual Ollama instances (ports 11434/11435), one per agent
- Model (scaling): gemma3:latest (4.3B, Q4_K_M, ~100 tok/s single-agent)
- Model (reasoning): llama3.1:8b-instruct-q4_0 (8B, Q4_0, ~68 tok/s single-agent)
- Agent configuration: GPU layers = 60-80, context size = 256-512 tokens, temperature = 0.6-1.0
- Warmup policy: Run at least 1 warmup request per Ollama instance before timing; exclude warmup from metrics
- Process isolation: Use forced model unloads (
ollama stop) between benchmark runs; not required for production steady-state - Statistical minimum: 3 runs per configuration for directional results; 5 runs per configuration for publishable claims
- Contention threshold: Flag runs where TTFT exceeds 3x baseline as resource contention events
Manifest requirements (minimum):
- GPU model + driver version
- CUDA version
- Rust toolchain version
- Ollama version (both instances)
- Model identifier and quantization level
- OS version and build number
- Git commit SHAs for benchmark harness and analysis scripts
Invalidates this report:
- GPU hardware change (different VRAM, different architecture)
- Ollama major version upgrade (scheduling behavior may change)
- Model weight updates (Gemma 3, Llama 3.1 new releases)
- OS-level scheduler changes (Windows power plan, CPU governor)
- Any change to dual Ollama port assignments or binding configuration
Table of Contents
Executive Summary Operational Defaults (Decision Card)
- Introduction and Research Questions
- Background and Related Work
- Methods and Measurement Framework
- Decision Impact Matrix (TR108-TR116)
- Results by Report (TR108-TR116) 5.1 TR108: Single-Agent Baseline and the Throughput Landscape 5.2 TR109: Configuration Transfer Failure and Agent Workflow Discovery 5.3 TR110: Python Multi-Agent Gold Standard 5.4 TR111_v2: Rust Workflow Parity and Single-Agent Advantage 5.5 TR112_v2: Cross-Language Decision Framework 5.6 TR113: Single-Ollama Bottleneck and Architecture Failure 5.7 TR114_v2: Dual Ollama Fix and Rust Multi-Agent Parity 5.8 TR115_v2: Runtime Selection and Consistency as the Differentiator 5.9 TR116: Cross-Model Validation and Python Ceiling Confirmation
- Cross-Report Synthesis by Decision Axis
- Economics and Capacity (Agent-First and Scaling-First)
- Operational Doctrine and Risk Controls
- Threats to Validity and Scope Limits
- Limitations by Report and Mitigations
- Future Work
- Conclusive Statement
- References
- Appendix A: Key Formulas and Metric Definitions
- Appendix B: Claim-to-Artifact Chain-of-Custody
- Appendix C: Per-TR Key Numbers (Extracted)
- Appendix D: Glossary and Definitions
- Appendix E: Operational Checklists
- Appendix F: Configuration Transfer Problem --- Worked Examples
- Appendix G: Runtime Comparison Data Tables
- Appendix H: Cross-Model Efficiency Matrices
- Appendix I: Statistical Notes and Validation
- Appendix J: Traceability Map (TR108-TR116 to Decisions)
- Appendix K: Extended Literature Review
- Appendix L: Measurement Boundary Catalog
- Appendix M: Detailed Methods by Report
- Appendix N: Expanded Discussion and Implications
- Appendix O: Extended Results Narratives
- Appendix P: Decision-Grade Reporting Rubric
- Appendix Q: Architecture Evolution Narrative (Single Ollama to Dual)
- Appendix R: Async Runtime Failure Mode Catalog
- Appendix S: Production Deployment Playbook
- Appendix T: Extended Risk Register
- Appendix U: Program Evolution Narrative
- Appendix V: Cost Modeling and Break-Even Analysis
Supplemental material remains mirrored in PublishReady/reports/Technical_Report_Conclusive_108-116_Extended_Appendices.md.
1. Introduction and Research Questions
1.1 Motivation
As open-weight LLMs become deployable on consumer hardware, the engineering decisions that determine deployment quality — language runtime, serving architecture, async framework, model selection — need empirical grounding rather than community folklore. This phase establishes that grounding for multi-agent LLM workloads on consumer GPUs.
The migration from Python to Rust for LLM agent workloads is a question that recurs across the industry, but it is rarely answered with controlled, artifact-backed evidence under fixed hardware conditions. Most comparisons rely on micro-benchmarks (single HTTP calls, trivial async tasks) that do not capture the interaction between language runtime, inference server topology, model architecture, and multi-step agent workflow complexity. The result is that teams make language and architecture decisions based on intuition, community sentiment, or synthetic benchmarks that do not transfer to production agent workloads.
The TR108-TR116 research program was constructed to close this gap. The central challenge is not "which language is faster" in isolation, but whether the performance characteristics observed in single-agent inference benchmarks survive the transition to concurrent multi-agent workflows --- and if not, what architectural interventions restore or exceed the expected performance.
The research begins from a practical engineering need: the Banterhearts project requires real-time multi-agent LLM coordination for gaming applications, where latency, throughput, memory efficiency, and scaling behavior directly affect user experience. The hardware is fixed (a consumer-grade RTX 4080 Laptop GPU with 12 GB VRAM), the inference stack is fixed (Ollama), and the question is how to maximize concurrent agent efficiency under these constraints.
Three surprises drive the narrative. First, optimal configurations for single-inference tasks do not transfer to agent workflows (TR109), which means the entire parameter optimization literature for LLM inference must be re-evaluated in the agent context. Second, Rust's theoretical concurrency advantages do not materialize when the inference server architecture creates a serialization bottleneck (TR113), which means language-level benchmarks are misleading without infrastructure-level parity. Third, Python has a structural efficiency ceiling at approximately 86% under concurrent multi-agent workloads that is model-independent (TR116), which means the language decision is not about marginal throughput but about a hard architectural limit.
These three surprises --- the configuration transfer problem, the architecture bottleneck, and the Python ceiling --- form the empirical backbone of this synthesis. Each required a dedicated investigation to discover, quantify, and resolve. The resulting decision framework is not a performance ranking; it is a deployment policy grounded in measured failure modes.
1.2 Research questions (program level)
This conclusive report answers the following cross-cutting questions:
-
Language selection: Does Rust provide a measurable, consistent advantage over Python for multi-agent LLM workloads on consumer hardware, and if so, what is the magnitude and under what conditions does it hold?
-
Architecture design: What inference server topology (single vs dual Ollama) is required for multi-agent deployments, and what is the quantitative cost of the wrong architecture?
-
Runtime selection: Among available Rust async runtimes (tokio-default, tokio-localset, async-std, smol, smol-1KB), which provides the best production characteristics for LLM agent workloads, and what is the correct selection criterion?
-
Model selection: How does model architecture (Gemma 3, Llama 3.1, Qwen 2.5) interact with runtime and concurrency, and is model choice orthogonal to the language decision?
-
Configuration transfer: Do single-inference performance optima transfer to multi-step agent workflows, and if not, what agent-specific configuration policy should be adopted?
-
Scaling limits: What is the theoretical and measured efficiency ceiling for Python multi-agent workloads, and at what throughput threshold does Rust become mandatory?
1.3 Contributions of this synthesis
This synthesis contributes six decision-grade deliverables:
-
Language decision framework: A controlled, artifact-backed comparison of Rust and Python across single-agent, multi-agent, and cross-model scenarios, with explicit measurement of throughput, latency, memory, startup time, consistency, and parallel efficiency. The framework identifies when each language is appropriate and when it is not.
-
Architecture discovery and resolution: The identification and quantification of the single-Ollama serialization bottleneck (63% contention, 82.2% peak efficiency) and its resolution through dual Ollama instances (0.74% contention, 99.4% peak efficiency). This is a concrete infrastructure design decision, not a theoretical concern.
-
Runtime selection methodology: A consistency-first selection framework for async runtimes that rejects peak-performance comparisons in favor of standard deviation, minimum efficiency, and failure mode analysis. The methodology is transferable to any runtime selection problem where all candidates achieve similar peaks.
-
Cross-model validation: A systematic measurement of how model architecture interacts with multi-agent coordination within the tested configurations, demonstrating that the Rust-Python efficiency gap is structural (12-17pp) and model-independent, while model choice affects absolute efficiency within a given runtime.
-
Configuration transfer analysis: Empirical evidence that single-inference optima (GPU=999, CTX=4096) produce suboptimal results in agent workflows, with specific agent-optimized parameters (GPU=60-80, CTX=256-512) that improve performance. This is a direct challenge to the common practice of applying inference benchmarks to agent deployment.
-
Python ceiling quantification: The measurement and attribution of Python's structural efficiency ceiling at approximately 86% for concurrent multi-agent LLM workloads, providing a concrete threshold for when Rust migration is necessary rather than optional.
1.4 Document structure and reading guide
Reading guide.
If you need deployment decisions: read the Executive Summary, the Operational Defaults (Decision Card), Section 4 (Decision Impact Matrix), Section 6 (Cross-Report Synthesis by Decision Axis), and Section 8 (Operational Doctrine and Risk Controls).
If you need method defensibility: read Section 3 (Methods and Measurement Framework) plus the limitation gates in Section 10, then Appendix B (Claim-to-Artifact Chain-of-Custody).
If you need to understand the architecture discovery: read Sections 5.6 (TR113) and 5.7 (TR114_v2) for the single-Ollama failure and dual-Ollama resolution, then Appendix Q for the full architecture evolution narrative.
If you need runtime selection rationale: read Section 5.8 (TR115_v2) and Appendix R (Async Runtime Failure Mode Catalog) for failure mode analysis across all five runtimes.
If you need cross-model data: read Section 5.9 (TR116) and Appendix H (Cross-Model Efficiency Matrices) for per-model, per-runtime, per-scenario breakdowns.
If you are auditing the Python ceiling claim: read Sections 5.3 (TR110), 5.9 (TR116), and 6.5 (Scaling Limits axis) for the evidence chain, then Section 9 (Threats to Validity) for the scope boundaries on this claim.
If you need the configuration transfer problem explained: read Sections 5.1 (TR108), 5.2 (TR109), and Appendix F (Configuration Transfer Problem --- Worked Examples) for before/after parameter comparisons.
Notation conventions. Throughout this report, "pp" denotes percentage points (absolute difference between percentages), "CV" denotes coefficient of variation (standard deviation divided by mean, expressed as a percentage), and "sigma" denotes standard deviation. Efficiency is defined as (speedup / N) x 100% where N is the number of concurrent agents. Contention rate is the fraction of runs where TTFT exceeded 3x the baseline value.
2. Background and Related Work
2.1 LLM inference on consumer GPUs
Local LLM inference on consumer-grade GPUs has become viable with quantized model formats (GGUF, GPTQ) and optimized serving stacks (Ollama, llama.cpp, vLLM). The RTX 4080 Laptop GPU used in this study represents the upper tier of consumer hardware: 12 GB GDDR6X VRAM, 9,728 CUDA cores, 232 fourth-generation Tensor Cores, and 432 GB/s memory bandwidth at a 175W thermal design power. This hardware can serve 4-8B parameter models at quantization levels Q4_0 through Q4_K_M with full GPU offload, achieving 68-115 tok/s depending on model architecture and quantization.
The critical constraint for multi-agent workloads is not raw throughput but concurrent throughput: when two or more agents issue simultaneous inference requests, the GPU must time-share or batch across requests. Ollama's architecture exposes this constraint through its request scheduling behavior, which becomes a central variable in this study. Unlike cloud-hosted inference with dedicated hardware per request, consumer-grade deployment forces all agents through a shared GPU, making server topology and request scheduling first-order concerns.
The broader literature on consumer GPU inference focuses on single-request optimization: quantization selection, KV-cache management, batch size tuning, and context length scaling. This study extends that work to the multi-agent regime, where concurrent request scheduling, runtime-level async coordination, and inter-agent resource contention become the dominant performance factors.
2.2 Async runtimes in systems programming
Asynchronous programming models for I/O-bound workloads differ fundamentally between Rust and Python. Python's asyncio provides a single-threaded event loop with cooperative multitasking; coroutines yield control at await points, and the Global Interpreter Lock (GIL) is released during I/O operations, allowing OS threads to proceed in parallel for I/O-bound work. Rust's async ecosystem offers multiple runtime implementations with different scheduling strategies.
Tokio is the dominant Rust async runtime, providing a multi-threaded work-stealing scheduler by default. The work-stealing model distributes tasks across a thread pool and migrates tasks between threads to balance load. This introduces thread-migration overhead that is negligible for CPU-bound work but can become significant for I/O-wait-dominated workloads (such as waiting for Ollama responses), where the scheduling overhead exceeds the actual computation. Tokio also offers a LocalSet variant that pins tasks to a single thread, eliminating migration overhead at the cost of losing parallelism.
Alternative Rust runtimes include async-std (which provides a Tokio-like API with a different scheduler) and smol (a minimal runtime with a simple thread-per-core model). Each runtime makes different tradeoffs between scheduling sophistication and overhead. TR115_v2 evaluates all five configurations and discovers that the tradeoffs are irrelevant at the peak level (all achieve ~100%) but decisive at the consistency level, where smol's simplicity produces pathological failures (72.80% minimum) and async-std's Tokio bridge conflict produces catastrophic serialization (50% efficiency).
The systems programming literature on async runtimes emphasizes throughput under load, but the LLM agent use case inverts the usual assumptions: the bottleneck is not CPU computation or network I/O latency but GPU inference time (tens to hundreds of milliseconds per request), making the runtime's scheduling overhead a measurable fraction of the total request time.
2.3 Multi-agent coordination and parallel efficiency
Multi-agent systems that coordinate through shared infrastructure face a fundamental tension between parallelism and contention. The theoretical maximum speedup for N independent agents on N independent resources is Nx (linear scaling). In practice, shared resources (GPU, memory bus, inference server) introduce contention that reduces the effective speedup.
Parallel efficiency is defined in this study as:
efficiency = (2 * slower_agent_time) / (slower_agent_time + faster_agent_time) * 100%
for two-agent concurrent execution, which normalizes to speedup/2 * 100%. An efficiency of 100% means perfect parallelism (2x speedup); 50% means complete serialization (1x speedup, equivalent to sequential execution).
The multi-agent coordination literature typically assumes homogeneous tasks and focuses on communication overhead. LLM agent workloads introduce an additional complication: the inference server is a shared, stateful resource with its own scheduling policy. Ollama, for example, can serve multiple concurrent requests but may serialize them internally depending on model loading state, KV-cache availability, and VRAM pressure. This server-level behavior is opaque to the agent runtime and can dominate the efficiency equation regardless of how well the agents themselves are coordinated.
TR110 and TR114_v2 demonstrate this concretely: the Python multi-agent system achieves 99.25% efficiency because it uses dual Ollama instances (eliminating server-level contention), while the Rust multi-agent system initially achieves only 82.2% efficiency (TR113) because it uses a single Ollama instance. The architecture fix (TR114_v2) is not a Rust improvement but an infrastructure improvement, and it raises Rust's efficiency above Python's. This finding underscores that multi-agent performance research must control for infrastructure topology, not just language and runtime.
2.4 Rust vs Python for systems-level workloads
The Rust-versus-Python debate in systems programming is well-documented but rarely quantified under controlled conditions for LLM inference workloads. Rust offers zero-cost abstractions, compile-time memory safety, deterministic resource management, and single-binary deployment. Python offers rapid development, a rich ML ecosystem, dynamic typing, and extensive library support for LLM tooling.
For CPU-bound workloads, Rust's performance advantage is well-established (typically 10-50x over Python for pure computation). For I/O-bound workloads, the gap narrows because both languages spend most of their time waiting for external resources. LLM agent workloads occupy an intermediate regime: the agent runtime performs I/O (HTTP requests to Ollama), CPU work (parsing responses, managing state, file I/O), and coordination (scheduling concurrent requests, collecting metrics). The relative weight of these components determines whether Rust's advantages are realized.
TR112_v2 provides the first controlled measurement in this regime with full workflow parity: identical multi-step agent workflows (file system scanning, data ingestion, multi-stage LLM calls, metric collection) implemented in both languages against the same hardware and model. The result is not the 10-50x advantage typical of CPU-bound comparisons, but a consistent 15.2% throughput advantage, 58% TTFT reduction, 67% memory reduction, and 83% startup time reduction. These are operationally significant but not transformative for single-agent workloads; they become decisive at the multi-agent level, where the compounding effect of runtime overhead and the Python efficiency ceiling make Rust mandatory for scaling.
2.5 Ollama architecture and local inference
Ollama is a local LLM inference server that wraps llama.cpp with an HTTP API, model management, and automatic GPU offloading. Its architecture is relevant to this study because Ollama's request scheduling behavior is the root cause of the single-Ollama bottleneck discovered in TR113.
When Ollama receives concurrent requests for the same model, it must manage shared GPU resources: VRAM allocation, KV-cache state, and compute scheduling. In the single-instance configuration, concurrent requests are internally serialized or time-shared, producing contention that manifests as TTFT inflation and throughput degradation. The dual-instance configuration (ports 11434 and 11435) runs two independent Ollama processes, each managing its own model copy and KV-cache. This eliminates server-level contention at the cost of doubling model memory consumption (from ~3.3 GB to ~6.6 GB for Gemma 3).
On the RTX 4080 with 12 GB VRAM, dual Gemma 3 instances consume approximately 55% of available VRAM, leaving headroom for KV-cache growth and other GPU workloads. For larger models (Llama 3.1 8B at ~5.5 GB, Qwen 2.5 7B at ~5 GB), dual instances consume 85-92% of VRAM, approaching the hardware limit. This VRAM constraint is a hard boundary on the dual-Ollama architecture: it works for models up to approximately 5.5 GB per instance on this hardware.
Ollama exposes timing metrics (prompt evaluation time, evaluation time, load time) through its API response, which the benchmark harness captures and transforms into throughput (tok/s), TTFT (ms), and parallel efficiency metrics. These server-reported timings are the primary measurement substrate for single-agent benchmarks, while wall-clock concurrent execution time is the primary measurement for multi-agent parallel efficiency.
2.6 Model architectures: Gemma, Llama, Qwen
The three model families tested in TR116 represent different design philosophies and architectural tradeoffs.
Gemma 3 (gemma3:latest, 4.3B parameters, Q4_K_M quantization) is Google's compact model optimized for efficiency. At 3.3 GB disk footprint, it achieves approximately 100 tok/s on the RTX 4080, the highest single-agent throughput of the three models tested. Its small size and efficient attention implementation make it the best candidate for multi-agent scaling, as confirmed by TR116's 99.2% chimera-homo efficiency in Rust.
Llama 3.1 8B (llama3.1:8b-instruct-q4_0, 8B parameters, Q4_0 quantization) is Meta's general-purpose instruction-tuned model. At approximately 5.5 GB disk footprint, it achieves approximately 68 tok/s, 32% slower than Gemma 3. Despite lower raw throughput, Llama 3.1 achieves 96.5% multi-agent efficiency in Rust (TR116), only 0.8pp below Gemma 3 in the baseline-vs-chimera scenario. This near-equivalent scaling suggests that model architecture matters less than model size for multi-agent coordination overhead.
Qwen 2.5 7B (qwen2.5:7b, 7B parameters, Q4_K_M quantization) is Alibaba's multilingual model. At approximately 5 GB disk footprint, it achieves approximately 76 tok/s. Qwen shows consistently lower multi-agent efficiency (90.0% in Rust, 77.6% in Python for baseline-vs-chimera) despite a parameter count between Gemma and Llama. TR116 attributes this to heavier KV-cache pressure or different attention patterns that increase contention under concurrent access.
2.7 Benchmarking methodology for agent workloads
Agent workload benchmarking differs from single-inference benchmarking in three critical ways. First, the measurement boundary must include the full agent lifecycle (initialization, data ingestion, multi-step LLM interaction, output generation), not just a single inference call. Second, the concurrent execution timing must capture wall-clock elapsed time for the entire multi-agent run, not aggregated per-request metrics. Third, statistical confidence requires multiple runs per configuration (at minimum 3 for directional results, 5 for publishable claims) because agent workflows introduce variance from file I/O, model loading state, and inter-agent timing jitter.
The benchmarking literature emphasizes that measurement boundaries are inseparable from interpretation. A benchmark that measures only inference latency will miss agent-level overhead; a benchmark that measures only wall-clock time will miss the decomposition into inference, I/O, and coordination components. The TR108-TR116 program addresses this by measuring at multiple levels: per-inference metrics (throughput, TTFT) from Ollama, per-agent metrics (total generation time, token counts), and per-run metrics (wall-clock time, speedup, efficiency, contention).
2.8 The configuration transfer problem: single-inference vs agent workflows
A recurring theme in applied ML is the assumption that optimal hyperparameters transfer across deployment contexts. In the LLM inference domain, this manifests as the belief that parameters optimized for single-request throughput (e.g., maximum GPU layer offload, large context windows) will also optimize multi-step agent workflows.
TR108 and TR109 provide a controlled test of this assumption. TR108 finds that GPU=999 (maximum offload) and CTX=4096 (large context) maximize single-inference throughput. TR109, using the same hardware and model but an agent workflow (multi-step report generation), finds the opposite: GPU=60-80 and CTX=512-1024 outperform the single-inference optima. The mechanism is that agent workflows issue many short requests rather than one long request, and the overhead of maintaining a large context window and managing full GPU offload across rapid sequential calls exceeds the throughput benefit.
This configuration transfer failure is not a minor numerical discrepancy; it inverts the parameter rankings. A team that deploys agent workloads using single-inference benchmarks will select suboptimal parameters and may not realize the error because agent throughput is harder to measure than single-request throughput. The TR108-TR116 program treats this as a first-class finding and provides agent-specific configuration recommendations that differ from single-inference optima.
3. Methods and Measurement Framework
3.1 Artifact-first methodology
Every claim in the TR108-TR116 program is backed by reproducible artifacts. This includes:
- Raw benchmark logs: CSV files containing per-run metrics (throughput, TTFT, eval count, prompt eval count, total duration, concurrency speedup, parallel efficiency, resource contention flags) for every configuration and run.
- Configuration records: YAML or JSON files specifying exact parameters (model identifier, GPU layers, context size, temperature, Ollama port assignments) for every test configuration.
- Analysis scripts: Python and Rust code that transforms raw logs into summary statistics, comparison tables, and visualizations with deterministic transformations.
- Report generation pipeline: Automated scripts that produce the published technical reports from raw data, ensuring that no manual data manipulation occurs between measurement and publication.
The artifact-first methodology serves two purposes. First, it enables reproducibility: any claim can be traced back to specific raw data files, specific analysis code, and specific configuration parameters. Second, it enables falsification: if a later report contradicts an earlier finding (as TR113 contradicts the assumption that Rust would automatically outperform Python in multi-agent scenarios), the raw data from both reports can be compared directly to identify the root cause.
3.2 Efficiency metric definitions
The primary metrics used across the program are defined as follows:
Parallel efficiency (multi-agent):
efficiency = (speedup / N) * 100%
where speedup = sequential_estimated_time / concurrent_wall_time and N is the number of concurrent agents (N=2 throughout this study). An efficiency of 100% indicates perfect parallelism; 50% indicates complete serialization.
Concurrency speedup (multi-agent):
speedup = (agent_1_sequential_time + agent_2_sequential_time) / max(agent_1_concurrent_time, agent_2_concurrent_time)
The denominator uses the slower agent's wall-clock time because the concurrent run completes when the last agent finishes. Theoretical maximum is 2.0x for two agents.
Contention rate (multi-agent):
contention_rate = count(runs where TTFT > 3 * baseline_TTFT) / total_runs
A TTFT anomaly exceeding 3x the baseline value is classified as a resource contention event. This threshold was empirically determined from TR110 and TR113 data distributions.
Throughput (single-agent):
throughput = eval_count / eval_duration (tok/s)
where eval_count is the number of tokens generated and eval_duration is the time spent generating them, as reported by Ollama.
Time-to-first-token (TTFT) (single-agent):
TTFT = load_time + prompt_eval_time (ms)
This includes model loading overhead (if not already loaded) and prompt evaluation. Cold-start TTFT can be orders of magnitude larger than warm TTFT.
Coefficient of variation (CV):
CV = (standard_deviation / mean) * 100%
Used to compare consistency across configurations, runtimes, and languages. Lower CV indicates more predictable performance.
3.3 Statistical framework
The statistical framework across TR108-TR116 prioritizes robustness over parametric elegance.
Central tendency: Means are used for throughput comparisons (where the distribution is approximately normal) and medians are used for TTFT comparisons (where heavy-tailed cold-start events can distort means). When both are reported, the mean-median gap is noted as a distribution shape indicator.
Dispersion: Standard deviation and CV are the primary dispersion measures. For multi-agent efficiency, standard deviation in percentage points (pp sigma) is the key consistency metric, as demonstrated by TR115_v2's runtime selection based on 1.21pp sigma for tokio-default versus 4.87pp sigma for smol.
Multi-run validation: All publishable claims require a minimum of n=3 runs per configuration. TR110 and TR115_v2 use n=5 for all configurations. TR113 uses n=1 (exploratory) and is explicitly flagged as directional rather than conclusive. When n is small, rank-order comparisons are preferred over parametric tests.
Percentile reporting: P50 (median), P95, and P99 latencies are reported where tail behavior is relevant (TTFT, cold-start analysis). This prevents mean-dominated summaries from hiding operationally significant tail events.
Cross-report comparisons: When comparing results across reports (e.g., TR110 Python efficiency versus TR114_v2 Rust efficiency), the synthesis uses only configurations that share identical hardware, model, and measurement methodology. Configuration-sweep comparisons are explicitly labeled and separated from baseline-to-baseline comparisons.
3.4 Hardware baseline
All measurements in the TR108-TR116 program are conducted on a single, fixed hardware configuration:
| Component | Specification |
|---|---|
| GPU | NVIDIA GeForce RTX 4080 Laptop GPU |
| VRAM | 12 GB GDDR6X |
| CUDA Cores | 9,728 |
| Tensor Cores | 232 (4th Generation) |
| Memory Bus | 256-bit |
| Memory Bandwidth | 432 GB/s |
| TDP | 175W (laptop configuration) |
| CPU | Intel Core i9-13980HX |
| CPU Cores | 24 (8 Performance + 16 Efficiency) |
| CPU Threads | 32 |
| CPU Base Clock | 2.2 GHz |
| CPU Boost Clock | 5.6 GHz |
| System Memory | 32 GB DDR5-4800 |
| Storage | NVMe SSD |
| OS | Windows 11 Pro |
This hardware configuration is not representative of cloud or datacenter deployments. All conclusions are bounded to this specific hardware class (consumer-grade laptop GPU with shared thermal envelope). The thermal constraint is particularly relevant: the RTX 4080 Laptop GPU operates under a 175W power limit with shared cooling, which can introduce thermal throttling during sustained multi-agent workloads. Benchmark runs are short enough (minutes, not hours) that thermal steady-state is typically maintained, but this is an explicit boundary condition.
3.5 Scenario definitions
The program defines three primary benchmark scenarios, each designed to isolate a specific aspect of performance:
Scenario 1: Single-agent (TR108, TR109, TR111_v2, TR112_v2) A single agent issues sequential LLM requests to a single Ollama instance. This scenario measures raw inference throughput, TTFT, and configuration sensitivity without concurrent access complications. It is the baseline for all cross-language comparisons.
Scenario 2: Agent workflow (TR109, TR111_v2) A single agent performs a multi-step workflow (data ingestion, analysis, report generation) involving multiple sequential LLM calls, file I/O, and state management. This scenario captures the configuration transfer problem: parameters optimal for Scenario 1 may not be optimal here.
Scenario 3: Multi-agent concurrent (TR110, TR113, TR114_v2, TR115_v2, TR116) Two agents execute simultaneously, each communicating with its own Ollama instance (in the dual-Ollama configuration) or sharing a single instance (in the single-Ollama configuration). This scenario measures parallel efficiency, contention rate, and the interaction between runtime, architecture, and model under concurrent load. Three sub-scenarios are defined:
- baseline-vs-chimera: One agent uses Ollama defaults, one uses Chimera-optimized parameters. Measures heterogeneous deployment overhead.
- chimera-hetero: Both agents use Chimera parameters, but with different values. Measures asymmetric optimization impact.
- chimera-homo: Both agents use identical Chimera-optimized parameters. Measures peak achievable efficiency.
3.6 Warmup policy and process isolation protocol
Warmup policy: Each benchmark configuration begins with at least one warmup request that is excluded from timing measurements. This policy was established in response to TR108 and TR109 observations that the first request to a cold Ollama instance incurs model loading overhead (often 1-5 seconds) that dominates subsequent request times (tens of milliseconds TTFT). Without warmup exclusion, mean metrics are distorted by cold-start events.
Process isolation: Between benchmark runs (not between requests within a run), the protocol calls ollama stop to force model unloading. This ensures that each run begins from a known cold state, preventing warm-cache effects from confounding cross-configuration comparisons. Within a run, the model remains loaded to reflect production steady-state behavior.
Inter-run cooling: A minimum 5-second pause between runs is maintained to allow GPU thermal and memory state to stabilize. This prevents cumulative thermal effects from biasing later runs within a configuration sweep.
3.7 Decision-grade reporting protocol
Decision-grade reporting requires three layers, each building on the previous:
-
Measurement validity: The measurement captures what it claims to capture. The timing boundary is explicit, the metric definition is unambiguous, and the raw data is available for audit. This layer is established by the artifact-first methodology (Section 3.1) and the per-TR measurement boundaries (Section 3.10).
-
Attribution correctness: The measured effect is attributed to the correct cause. A throughput difference between Rust and Python is attributed to the language runtime only when the workflow, model, hardware, and server configuration are held constant. This layer is established by the workflow parity validation (TR111_v2, TR112_v2) and the architecture correction (TR113 to TR114_v2).
-
Decision translation: The measured and attributed effect is translated into an actionable deployment decision with explicit scope boundaries. A 15.2% throughput advantage becomes "deploy Rust for production" only when the advantage is consistent across configurations (TR112_v2), multi-agent scenarios (TR114_v2), and model architectures (TR116). This layer is established by the cross-report synthesis (Section 6) and the operational doctrine (Section 8).
A report that is valid but incorrectly attributed (e.g., claiming Rust is slower in multi-agent because of language runtime when the real cause is single-Ollama contention) produces invalid recommendations. A report that is correctly attributed but not translated produces numbers without decisions. The TR108-TR116 program constructs these three layers sequentially, and this synthesis makes the layer boundaries explicit.
3.8 Cross-language comparison methodology
Cross-language comparisons in the program follow strict parity requirements:
Workflow parity: Both Rust and Python implementations perform identical operations: file system scanning, CSV/JSON/Markdown parsing, multi-stage LLM calls (analysis and report generation), metric collection and output. TR111_v2 documents the upgrade that brought the Rust implementation to full parity with the Python implementation described in TR109.
Identical prompts and files: Both implementations process the same input data (benchmark CSVs from the same directory) and issue the same prompts to Ollama. The prompts are not language-specific; they are read from shared prompt files.
Identical model and hardware: All cross-language comparisons use the same model (gemma3:latest unless otherwise specified), the same hardware (Section 3.4), and the same Ollama version and configuration.
Identical measurement methodology: Both implementations measure the same metrics (throughput, TTFT, eval count, wall-clock time) using the same timing boundaries (Ollama-reported timings for single-agent, wall-clock for multi-agent).
Baseline-to-baseline comparison: The primary cross-language comparison uses Ollama default configurations (no Chimera optimization) to ensure that optimization strategy differences do not confound the language comparison. Configuration-sweep comparisons are reported separately and explicitly labeled.
3.9 Multi-agent measurement: dual Ollama, concurrent execution timing
Multi-agent measurements use the following protocol:
Dual Ollama deployment: Two independent Ollama instances run on ports 11434 and 11435. Each agent is assigned to one instance. This eliminates server-level request serialization, which TR113 demonstrated is the dominant contention source for single-instance deployments.
Concurrent execution: Agents are launched simultaneously using asyncio.gather() (Python) or tokio::join!() (Rust). The wall-clock start time is recorded before launch; the wall-clock end time is recorded when the last agent completes. This captures the true concurrent elapsed time including any scheduling overhead.
Sequential baseline estimation: The sequential baseline is computed as the sum of individual agent execution times if they were run sequentially. This is measured by running each agent independently and summing the durations. In some configurations, the sequential time is estimated from the concurrent run by summing agent-reported durations (which does not include concurrent scheduling overhead).
Speedup and efficiency computation: Speedup = sequential_time / concurrent_time. Efficiency = speedup / 2 * 100%. These are computed per-run and then averaged across runs for a given configuration.
3.10 Per-TR measurement boundaries
Each report defines a specific measurement boundary matched to its research question. The boundaries are not identical; they are chosen to isolate the relevant variables.
| Report | Measurement Boundary | What Is Timed | What Is Excluded |
|---|---|---|---|
| TR108 | Single-inference, Ollama API | Per-request throughput, TTFT, eval metrics | Agent workflow overhead, file I/O, concurrent access |
| TR109 | Agent workflow, single-agent | Full workflow wall-clock time, per-LLM-call metrics | Concurrent agent interaction, server contention |
| TR110 | Multi-agent concurrent, Python | Wall-clock concurrent time, per-agent metrics, contention | Rust comparison, runtime variation |
| TR111_v2 | Agent workflow, Rust single-agent | Full workflow wall-clock time, per-LLM-call metrics | Cross-language comparison (deferred to TR112_v2) |
| TR112_v2 | Cross-language single-agent | Throughput, TTFT, memory, startup, consistency (Rust vs Python) | Multi-agent scenarios (deferred to TR114_v2) |
| TR113 | Multi-agent concurrent, Rust, single Ollama | Wall-clock concurrent time, contention rate, efficiency | Dual Ollama (not yet deployed), runtime variation |
| TR114_v2 | Multi-agent concurrent, Rust, dual Ollama | Wall-clock concurrent time, efficiency, cross-language comparison | Runtime variation (deferred to TR115_v2) |
| TR115_v2 | Multi-agent concurrent, Rust, 5 runtimes | Efficiency, consistency (sigma), min/max, failure modes per runtime | Cross-model variation (deferred to TR116) |
| TR116 | Multi-agent concurrent, 3 models, 2 runtimes | Cross-model efficiency, runtime gap per model, model-specific scaling | Runtime variation beyond tokio-default, configuration sweep |
This boundary catalog ensures that each report's claims are scoped to its measurement capability and that cross-report synthesis respects the boundary differences.
4. Decision Impact Matrix (TR108-TR116)
This table anchors each technical report to its primary decision impact. The objective is not merely to summarize findings but to demonstrate how each report changes what a practitioner should do when deploying local-first LLM inference in production. Every row carries a risk column that makes explicit the cost of ignoring the report's evidence -- not as rhetoric, but as a quantified or structurally grounded consequence derived from the measured data.
| Report | Primary question answered | Key artifact(s) | Decision impact | Risk if ignored |
|---|---|---|---|---|
| TR108 | Which model and configuration is fastest for single-agent inference on consumer GPU? | 158+ configuration benchmark runs across 6 model variants (Llama3.1 q4_0/q5_K_M/q8_0, Gemma3 latest/270m/1b-qat) | Baseline performance matrix establishing model selection and GPU offload strategy; Gemma3:latest identified as throughput champion at 102.85 tok/s | Suboptimal model choice wastes 2-4x throughput; blind full-offload assumption misses partial-offload advantage for Llama architecture |
| TR109 | Do single-inference optimal configurations transfer to agent workflows? | 20+ agent workflow configurations across 3 experimental phases with statistical validation | Agent-specific configuration required (GPU=60-80, CTX=256-512); inference-optimal configs are systematically wrong for agent context | Using TR108-optimal configs (GPU=999, CTX=4096) regresses TTFT by 323.7% in agent workflows; configuration transfer failure is structural, not noise |
| TR110 | Can Python multi-agent execution achieve near-perfect parallel efficiency? | 150 concurrent benchmark runs across 30 configurations and 3 deployment scenarios | Python gold standard established: 99.25% parallel efficiency at GPU=80, CTX=2048; 0.75% irreducible overhead defines physical limit | Over-engineering concurrency solutions when near-perfect efficiency is already achievable; deploying heterogeneous configs that drop efficiency to 89% |
| TR111_v2 | Can Rust match Python's agent workflow complexity with full feature parity? | 57 Rust workflow benchmarks (19 configurations x 3 runs) with complete file I/O and multi-stage LLM pipeline | Full workflow parity achieved; 114.54 tok/s baseline with exceptional cross-config consistency (CV=0.24%); near hardware saturation confirmed | Assuming Rust cannot handle complex agent workflows; missing the deployment advantages of single-binary distribution and 60% memory reduction |
| TR112_v2 | How does Rust compare to Python head-to-head under identical workflow conditions? | 111 cross-language benchmarks (57 Rust + 54 Python) with workflow parity validation | Rust delivers +15.2% throughput, -58% TTFT, -67% memory; 5-year TCO advantage of $14.9k; Rust worst-case still 14.9% faster than Python best-case | Missing 15.2% throughput improvement and 67% memory savings; deploying Python at 4x8GB instance cost when Rust requires only 2x4GB |
| TR113 | Does Rust's single-agent throughput advantage transfer to multi-agent concurrent execution? | 19 Rust multi-agent configurations with parallel efficiency measurement | Single Ollama instance bottleneck caps Rust multi-agent at 82.2% efficiency; serialization overhead negates Rust's single-agent advantage | Deploying Rust multi-agent on single Ollama without architecture fix; expecting linear scaling that the serving layer cannot deliver |
| TR114_v2 | Can dual Ollama instances fix the Rust multi-agent bottleneck identified in TR113? | 135 dual-Ollama benchmarks with contention measurement and efficiency tracking | Contention drops from 63% to 0.74%; Rust multi-agent reaches 99.4% parallel efficiency; dual-instance architecture validated | Running multi-agent workloads on single Ollama instance and accepting 17.8% efficiency loss; infrastructure under-provisioning for concurrent agents |
| TR115_v2 | Which async runtime is best for Rust multi-agent LLM orchestration? | 150 runtime benchmarks across Tokio, async-std, and Smol with 5-run statistical validation | Tokio-default wins with 98.72% mean efficiency and lowest variance (1.21pp sigma); async-std catastrophically fails at 50% efficiency | Choosing the wrong async runtime and losing 50% parallel efficiency; deploying async-std in production where it exhibits pathological scheduling |
| TR116 | Does Rust's performance advantage hold across different model architectures and sizes? | 60 cross-model benchmarks across Gemma3:latest, Llama3.1:8b, and smaller variants | Rust maintains +12-17pp throughput advantage across all tested models; Gemma3:latest confirmed as cross-language champion | Making model-specific performance assumptions that do not generalize; selecting models without cross-language validation data |
4.1 How to read the decision impact matrix
The matrix is a bridge between measurement and deployment policy. Each row corresponds to a technical report, but more importantly, each row answers a single deployment-critical question. The "Decision impact" column states what changes in practice; the "Risk if ignored" column states the quantified or structurally grounded consequence of proceeding without the evidence.
This is not a summary table. It is a dependency contract. If a deployment decision requires evidence that a report does not provide, the policy must wait for that evidence or acknowledge the gap explicitly. For example, TR108 establishes model selection but cannot inform agent workflow configuration -- that evidence resides in TR109. A practitioner who reads TR108 alone and deploys GPU=999 with CTX=4096 into an agent pipeline will experience a 323.7% TTFT regression that TR109 would have prevented. The matrix makes such dependency chains visible.
The "Key artifact(s)" column serves a chain-of-custody function: every decision impact is traceable to a specific corpus of benchmark runs. This is the discipline that separates artifact-backed recommendation from anecdotal optimization advice. When the matrix states "99.25% parallel efficiency," that number traces to Test 108 in TR110's 150-run corpus, not to a single exploratory measurement.
4.2 Decision dependencies and ordering
The matrix implies a strict ordering that reflects the research program's logical progression from single-agent to multi-agent, from Python to Rust, and from single-instance to multi-instance architectures.
Phase 1: Single-Agent Foundation (TR108 -> TR109). TR108 establishes the performance ceiling for single-inference workloads. TR109 then tests whether that ceiling transfers to agent workflows -- and discovers it does not. This pair establishes a critical methodological lesson: benchmark results from isolated inference do not predict agent-level behavior. Any deployment decision must be validated at the workflow level, not merely at the inference level.
Phase 2: Python Multi-Agent Baseline (TR110). TR110 depends on TR109's agent-optimal configuration (GPU=60-80, CTX=256-512) and extends it to concurrent execution. The 99.25% parallel efficiency at GPU=80, CTX=2048 becomes the gold standard that all subsequent multi-agent implementations must match or explain why they cannot. Without TR110's baseline, there is no objective target for Rust multi-agent performance.
Phase 3: Rust Single-Agent Validation (TR111_v2 -> TR112_v2). TR111_v2 depends on TR109's workflow definition to ensure apples-to-apples comparison. TR112_v2 then synthesizes TR109 (Python) and TR111_v2 (Rust) into a direct cross-language comparison. The dependency is strict: TR112_v2's +15.2% throughput claim is valid only because TR111_v2 and TR109 share identical workflow complexity, validated through artifact inspection.
Phase 4: Rust Multi-Agent and Architecture (TR113 -> TR114_v2 -> TR115_v2). TR113 depends on TR110's Python gold standard to contextualize Rust's 82.2% multi-agent efficiency. TR114_v2 depends on TR113's bottleneck diagnosis to motivate the dual-Ollama architecture. TR115_v2 depends on TR114_v2's validated architecture to isolate the async runtime as the remaining variable. This chain is strictly sequential: each report addresses the failure mode identified by its predecessor.
Phase 5: Cross-Model Generalization (TR116). TR116 depends on the full chain (TR108 through TR115_v2) to validate that measured advantages are not artifacts of a single model. It closes the generalizability gap that would otherwise limit every preceding finding to Gemma3:latest.
The ordering also reveals a meta-pattern: the research program alternates between establishing a capability (TR108, TR110, TR111_v2, TR114_v2) and testing whether that capability survives a context shift (TR109, TR112_v2, TR113, TR115_v2, TR116). This alternation is the program's primary defense against premature generalization.
5. Results by Report (TR108-TR116)
This section presents each technical report as a self-contained result, following a consistent subsection structure: experimental design and artifact boundary, key findings, implications for the program, limitations and boundary conditions, and transferability. Each subsection concludes with an evidence snapshot table that anchors claims to measured data. The synthesis that follows in later sections draws exclusively from findings established here.
5.1 TR108: Single-Agent LLM Performance Analysis
Research question. Which model and configuration yields the highest throughput for single-agent inference on consumer GPU hardware, and how do quantization, GPU offload, and context size interact to determine performance?
TR108 is the foundational benchmark of the research program. It establishes a systematic performance matrix for local-first LLM inference on the RTX 4080 Laptop GPU, testing 158+ configurations across six model variants with gaming dialogue prompts representative of the Banterhearts application domain. The report's significance extends beyond its numerical findings: it defines the measurement methodology, the configuration parameter space, and the model selection criteria that every subsequent report inherits or explicitly modifies.
5.1.1 Experimental design and artifact boundary
The experiment tests six model variants spanning two architecture families and multiple quantization levels: Llama3.1:8b-instruct-q4_0 (8B parameters, Q4_0 quantization), Llama3.1:8b-instruct-q5_K_M (8B, Q5_K_M), Llama3.1:8b-instruct-q8_0 (8B, Q8_0), Gemma3:latest (4.3B, Q4_K_M), Gemma3:270m (270M, minimal quantization), and Gemma3:1b-qat (1B, QAT). All inference is served through Ollama on the RTX 4080 Laptop GPU (12 GB GDDR6X VRAM, 9,728 CUDA cores) with an Intel i9-13980HX CPU (24 cores, 32 threads) and 32 GB DDR5-4800 system memory under Windows 11.
The configuration parameter space covers three axes: GPU layer allocation (num_gpu: 0 to 999, where 999 indicates full offload), context window size (num_ctx: 512 to 4096 tokens), and sampling temperature (0.2 to 1.0). Each configuration is evaluated using gaming dialogue prompts -- multi-turn banter generation tasks representative of the Banterhearts application domain. The measurement boundary is the Ollama API response, which includes model loading, prompt processing, and token generation within a single HTTP call. This boundary choice captures the full user-facing latency but also includes initialization overhead that can distort aggregate statistics when cold starts are present.
The artifact structure stores per-run throughput (tokens per second), TTFT (time-to-first-token in milliseconds), total generation time, token counts, and configuration metadata. The 158+ benchmark runs are sufficient for identifying dominant performance factors but are deliberately not powered for fine-grained statistical inference across all configuration cells -- a limitation that TR109 and subsequent reports address through focused experimental designs.
5.1.2 Key findings
The throughput hierarchy across model variants is consistent and architecturally grounded:
Model throughput ranking (best configurations):
| Model | Best throughput (tok/s) | Architecture | Parameters | GPU offload preference |
|---|---|---|---|---|
| Gemma3:270m | 303.9 | Gemma (small) | 270M | Full offload (gpu=999) |
| Gemma3:latest | 102.85 | Gemma (Q4_K_M) | 4.3B | Full offload (gpu=999) |
| Gemma3:1b-qat | 187.2 | Gemma (QAT) | 1B | Full offload (gpu=999) |
| Llama3.1:q4_0 | 78.42 | Llama (Q4_0) | 8B | Partial offload (gpu=40) |
| Llama3.1:q5_K_M | 65.18 | Llama (Q5_K_M) | 8B | Partial offload (gpu=40) |
| Llama3.1:q8_0 | 47.8 | Llama (Q8_0) | 8B | Partial offload (gpu=40) |
Three findings carry direct decision weight:
First, GPU offload strategy is architecture-dependent. Llama3.1:8b (8B parameters) exceeds the 12 GB VRAM budget at full offload, forcing partial offload. The optimal GPU layer count for Llama is approximately 40 layers -- a partial offload regime where the model splits computation between GPU and CPU. Gemma3 variants, being smaller, fit entirely in VRAM and prefer gpu=999 (full offload). This finding means that GPU offload is not a single knob to maximize; it is an architecture-conditioned decision that depends on model size relative to available VRAM.
Second, quantization dominates throughput within an architecture. Within the Llama3.1 family, Q4_0 quantization delivers +17% throughput over Q5_K_M and +64% over Q8_0. This is a direct consequence of reduced memory bandwidth requirements: smaller quantized weights transfer faster through the GPU memory subsystem, and the RTX 4080's 432 GB/s memory bandwidth is the binding constraint for autoregressive decode. The throughput ordering (Q4_0 > Q5_K_M > Q8_0) is strict and monotonic, with no crossover at any tested configuration.
Third, context window size has minimal impact on throughput. Across all models, context size variation from 512 to 4096 tokens produces less than 1% throughput variation. This is expected for short gaming dialogue prompts where the KV cache remains well within VRAM: the decode phase is dominated by weight reads, not KV-cache reads, at these context lengths. However, this finding is bounded by the short prompt lengths used in TR108; TR127 later demonstrates that context scaling becomes a dominant factor at longer sequences where VRAM spillover occurs.
The quality ranking introduces a critical counterpoint to the throughput hierarchy: Llama3.1:q4_0 > Gemma3:latest > Llama3.1:q5_K_M for gaming dialogue coherence. Gemma3:latest achieves 34% higher throughput than Llama3.1:q4_0, but Llama3.1:q4_0 produces more coherent multi-turn dialogue. This throughput-quality tension is the first instance of a trade-off pattern that recurs throughout the research program and is formally resolved only in TR124's quality evaluation framework.
5.1.3 Implications for program
TR108 makes three program-level contributions that persist through TR116:
First, it establishes Gemma3:latest as the default benchmark model. The 102.85 tok/s throughput, reasonable quality, and full VRAM fit make it the natural choice for controlled experiments where model variation is not the independent variable. TR109 through TR115 all use Gemma3:latest as their primary model, enabling cross-report comparison without model-induced confounds.
Second, it identifies GPU layer allocation as the single most impactful configuration parameter -- a finding that TR109 subsequently qualifies by showing that the optimal GPU allocation differs between inference and agent contexts. The program learns, through TR108 and TR109 in sequence, that "most impactful" is context-dependent.
Third, it demonstrates that 158+ configurations are sufficient for factor identification but insufficient for statistical power on individual cells. This methodological lesson directly shapes TR109's 3-phase design (discovery, sweep, validation) and TR110's 5-run-per-configuration protocol.
5.1.4 Limitations and boundary conditions
TR108 is bounded by five explicit constraints:
-
Single-inference measurement boundary. The Ollama API boundary captures full user-facing latency but cannot separate prefill from decode or attribute overhead to specific pipeline stages. Later reports (TR123) introduce phase-split measurement.
-
Gaming dialogue prompts only. The prompt suite is representative of Banterhearts banter generation but does not include summarization, question answering, or code generation workloads. Findings may not transfer to workloads with fundamentally different input/output ratios.
-
No statistical validation protocol. TR108 does not enforce minimum run counts or report confidence intervals. The 158+ runs provide broad coverage but not statistical rigor on any individual comparison. TR109 Phase 3 introduces the 3-run minimum protocol that becomes standard.
-
Single hardware baseline. All measurements are on the RTX 4080 Laptop GPU. The throughput hierarchy (especially the GPU offload preference) may differ on GPUs with different VRAM capacities or memory bandwidth characteristics.
-
No quality evaluation framework. Quality is assessed informally through manual inspection of generated dialogue. TR124 later establishes formal quality metrics (ROUGE-L, BERTScore, coherence) that retroactively validate TR108's qualitative quality rankings.
5.1.5 Transferability
TR108's findings transfer under two conditions: the target hardware has a similar VRAM-to-model-size ratio (determining GPU offload strategy), and the workload involves short-prompt, single-inference generation (preserving the context-insensitivity finding). The throughput hierarchy is likely stable across consumer GPUs with 8-16 GB VRAM, but absolute numbers will scale with memory bandwidth and compute density.
The correct way to use TR108 in a new deployment is as a methodology template: replicate the configuration sweep on the target hardware, identify the architecture-conditioned GPU offload point, and validate that context sensitivity remains low for the target prompt distribution. The absolute numbers (102.85 tok/s, 78.42 tok/s) are not transferable; the relative ordering and factor structure are.
Evidence snapshot (artifact-backed):
| Metric | Value | Model | Config | Evidence source |
|---|---|---|---|---|
| Peak throughput (production model) | 102.85 tok/s | Gemma3:latest | gpu=999, ctx=4096 | TR108 benchmark corpus |
| Peak throughput (small model) | 303.9 tok/s | Gemma3:270m | gpu=999, ctx=512 | TR108 benchmark corpus |
| Llama best throughput | 78.42 tok/s | Llama3.1:q4_0 | gpu=40, ctx=1024 | TR108 benchmark corpus |
| Quantization throughput gain (q4 vs q8) | +64% | Llama3.1 family | Matched configs | TR108 cross-quant comparison |
| Context impact on throughput | <1% variation | All models | ctx=512 to ctx=4096 | TR108 context sweep |
| GPU offload (Gemma optimal) | gpu=999 (full) | Gemma3:latest | All contexts | TR108 GPU layer sweep |
| GPU offload (Llama optimal) | gpu=40 (partial) | Llama3.1:8b | All contexts | TR108 GPU layer sweep |
| Total configurations tested | 158+ | All 6 models | Full parameter space | TR108 artifact manifest |
5.2 TR109: Agent Workflow Optimization
Research question. Do single-inference optimal configurations (TR108) transfer to multi-step agent workflows, and if not, what agent-specific configuration is required?
TR109 is the first methodological correction in the research program. It tests the natural assumption that a configuration optimized for single-inference throughput will also be optimal when that inference call is embedded in a multi-step agent pipeline -- and discovers that this assumption is systematically false. The finding is not a minor calibration; it is a category-level failure of transfer that invalidates any deployment strategy based solely on inference-level benchmarks.
5.2.1 Experimental design and artifact boundary
The experiment uses a three-phase design that progressively narrows from discovery to validation:
Phase 1 (Configuration Transfer Test). The TR108-optimal configuration (GPU=999, CTX=4096) is deployed directly into an agent workflow consisting of file system scanning, data ingestion from CSV and JSON sources, multi-stage LLM analysis (analysis call followed by report generation call), and structured output production. The measurement boundary expands from TR108's single API call to the full agent pipeline execution time, including file I/O, data parsing, and both LLM invocations. The workflow processes approximately 101 files per run with 2 LLM calls, matching the Banterhearts agent use case.
Phase 2 (Parameter Sweep). Eighteen configurations are tested across GPU layer allocation (40, 60, 80, 100, 120), context window (256, 512, 1024, 2048), and temperature (0.6, 0.8, 1.0). Each configuration is evaluated on throughput, TTFT, and output quality. The sweep is designed to identify the agent-optimal region of the parameter space, not to achieve statistical significance on individual comparisons -- that is Phase 3's role.
Phase 3 (Statistical Validation). The top candidate configurations from Phase 2 are re-evaluated with n=3 runs per configuration to test replicability. This phase directly addresses TR108's methodological gap: single-run measurements are insufficient for agent workflows where inter-run variance is higher due to file I/O jitter, Ollama state variation, and multi-call orchestration overhead.
The model is Gemma3:latest (4.3B, Q4_K_M) on the same hardware baseline as TR108: RTX 4080 Laptop GPU (12 GB GDDR6X), Intel i9-13980HX, 32 GB DDR5-4800, Windows 11. The use of the same model and hardware ensures that any performance difference is attributable to the workflow context, not to hardware or model variation.
5.2.2 Key findings
The central finding is stark and structurally significant:
Phase 1: Configuration transfer failure. The TR108-optimal configuration (GPU=999, CTX=4096) produces a 323.7% TTFT regression when deployed in the agent workflow context. The workflow baseline (process-isolated, Ollama defaults) achieves 99.34 tok/s throughput with 1,437 +/- 75 ms TTFT. The TR108 "optimal" configuration, which maximizes single-inference throughput, dramatically increases agent-level latency because the larger context window and full GPU offload create memory pressure that compounds across the two sequential LLM calls within each agent run.
| Metric | Workflow baseline | TR108-optimal in agent context | Delta |
|---|---|---|---|
| Throughput (tok/s) | 99.34 | ~98.5 | -0.8% |
| TTFT (ms) | 1,437 | ~6,090 | +323.7% |
| Process isolation | Yes | No (shared Ollama) | Context-dependent |
Phase 2: Agent-optimal configuration region. The parameter sweep identifies a distinct optimal region for agent workflows: GPU=60-80, CTX=256-512, Temperature=0.8. The best single-run configuration (GPU=60, CTX=512, TEMP=0.8) achieves +2.2% throughput improvement over baseline with a 68% TTFT reduction. The agent-optimal region is characterized by lower GPU offload and smaller context windows than the inference-optimal configuration -- the opposite of what TR108 would predict.
The mechanism is clear: in a multi-step agent workflow, the Ollama server maintains model state between calls. Smaller context windows reduce per-call KV-cache allocation, and moderate GPU offload (60-80 layers rather than 999) leaves headroom for the system to manage memory across sequential calls without triggering VRAM pressure. Full offload with large context windows creates a memory fragmentation pattern that manifests as TTFT inflation rather than throughput degradation -- the model generates tokens at similar speed once started, but the time to produce the first token balloons.
Phase 3: Statistical validation and the replication gap. The +2.2% throughput improvement identified in Phase 2 does not survive statistical validation. Under 3-run replication, the improvement becomes -0.2% -- within noise. This finding is itself valuable: it establishes that small throughput gains (<3%) in agent workflows are unreliable without n>=3 validation, and that the primary optimization target for agent workflows is TTFT, not throughput. The 68% TTFT reduction does replicate, confirming that latency optimization is the actionable lever.
Quality trade-off. The agent-optimal configuration (GPU=60, CTX=512, TEMP=0.8) produces a 6.9% quality score reduction compared to the baseline default. This introduces a three-way tension between throughput, latency, and quality that the research program does not fully resolve until TR124's formal quality evaluation framework. For TR109's scope, the finding is reported as a boundary condition: optimizing for TTFT comes at a measurable quality cost.
5.2.3 Implications for program
TR109's configuration transfer failure has three program-level consequences:
First, it establishes the principle that every workflow context requires its own configuration validation. This principle is not a generalization from one failed transfer; it is a structural observation about the interaction between Ollama's memory management and multi-call orchestration. The program adopts this as a methodological policy: no configuration is "optimal" without specifying the workflow context in which it was measured.
Second, it shifts the primary optimization target from throughput to TTFT for agent workflows. Throughput differences across configurations are within noise (less than 3%), but TTFT differences are 2-5x. This reframing directly influences TR111_v2's focus on TTFT variance as the dominant Rust optimization axis.
Third, it introduces the 3-run minimum validation protocol that becomes standard for all subsequent reports. TR108's single-run methodology is sufficient for factor identification but insufficient for claiming improvements smaller than the inter-run noise floor. The program learns this through Phase 3's replication failure, not through theoretical argument.
5.2.4 Limitations and boundary conditions
TR109's scope is bounded by several constraints that subsequent reports must address:
-
Single model (Gemma3:latest). The configuration transfer failure may differ in magnitude for models with different VRAM footprints. TR116 later validates that performance patterns generalize across models, but TR109 alone cannot claim cross-model validity.
-
Two-call workflow only. The agent performs exactly two LLM calls per run (analysis + report generation). Workflows with more calls may exhibit different memory pressure patterns. TR110 extends to concurrent multi-agent but not to deeper call chains.
-
Informal quality assessment. The -6.9% quality trade-off is measured by a scoring heuristic, not by TR124-grade evaluation metrics. The magnitude and direction are indicative but not precise.
-
Windows 11 only. Ollama's memory management behavior may differ on Linux, where TR126 later demonstrates different compilation and serving characteristics.
-
Small sample size in Phase 3. Three runs per configuration provide a replication check but not statistical power for detecting small effects. The finding that +2.2% does not replicate is conservative: it may replicate at n=10 with a smaller true effect size.
5.2.5 Transferability
TR109's configuration transfer failure is likely to generalize to any system where a serving layer (Ollama, vLLM, TGI) maintains state between sequential calls within a workflow. The specific optimal values (GPU=60-80, CTX=256-512) are hardware-dependent, but the structural finding -- that agent-optimal configurations differ from inference-optimal configurations -- is architecturally grounded and expected to transfer.
The transferable output is a policy rule: avoid deploying an inference-benchmarked configuration into an agent workflow without agent-level validation. The specific configuration values are a starting point for the RTX 4080 + Gemma3:latest combination, not a universal prescription.
Evidence snapshot (artifact-backed):
| Metric | Value | Phase | Configuration | Evidence source |
|---|---|---|---|---|
| Workflow baseline throughput | 99.34 tok/s | Phase 1 | Ollama defaults | TR109 Phase 1 baseline |
| Workflow baseline TTFT | 1,437 +/- 75 ms | Phase 1 | Process-isolated | TR109 Phase 1 baseline |
| TTFT regression (TR108 config) | +323.7% | Phase 1 | GPU=999, CTX=4096 | TR109 Phase 1 transfer test |
| Best single-run throughput gain | +2.2% | Phase 2 | GPU=60, CTX=512, TEMP=0.8 | TR109 Phase 2 sweep |
| TTFT reduction (agent-optimal) | -68% | Phase 2 | GPU=60, CTX=512, TEMP=0.8 | TR109 Phase 2 sweep |
| Throughput gain after validation | -0.2% (not replicated) | Phase 3 | GPU=60, CTX=512, TEMP=0.8, n=3 | TR109 Phase 3 validation |
| Quality trade-off | -6.9% | Phase 2 | Agent-optimal vs baseline | TR109 quality scoring |
| Configs tested (sweep) | 18 | Phase 2 | Full parameter grid | TR109 Phase 2 artifact set |
| Agent-optimal GPU range | 60-80 layers | Phase 2 | Across all contexts | TR109 parameter sensitivity |
| Agent-optimal context range | 256-512 tokens | Phase 2 | Across all GPU settings | TR109 parameter sensitivity |
5.3 TR110: Python Multi-Agent Baseline
Research question. What is the maximum achievable parallel efficiency for concurrent Python multi-agent LLM execution, and what configuration parameters determine the contention boundary?
TR110 extends the research program from single-agent to concurrent multi-agent execution. Where TR108 and TR109 measure one agent at a time, TR110 deploys two agents simultaneously on separate Ollama instances and measures whether the per-agent performance degrades, remains constant, or -- in principle -- could improve through resource sharing effects. The report's primary deliverable is a gold-standard parallel efficiency number that establishes the ceiling for any language or architecture to match.
5.3.1 Experimental design and artifact boundary
The experiment deploys a two-agent system where Agent 1 (DataCollector) and Agent 2 (Insight) each communicate with a dedicated Ollama instance (ports 11434 and 11435 respectively). Both agents run Gemma3:latest (4.3B, Q4_K_M). The architecture uses process isolation: each agent runs in its own Python process with its own HTTP connection to its Ollama instance, eliminating GIL contention and shared-memory interference.
The configuration matrix covers 30 configurations with 5 runs each, producing 150 total benchmark runs. Three deployment scenarios are tested:
- Homogeneous Chimera (chimera_homo): Both agents use identical optimized configurations. This tests the ceiling of parallel efficiency when resource allocation is symmetric.
- Baseline vs Chimera (baseline_chimera): One agent uses Ollama defaults, the other uses Chimera-optimized settings. This tests whether asymmetric configuration degrades the optimized agent.
- Heterogeneous (hetero): Agents use different GPU layer, context, and temperature settings. This tests the worst-case scenario of resource competition between mismatched workloads.
Configuration parameters span GPU layers (60, 80, 100, 120), context size (512, 1024, 2048), and temperature (0.6, 0.8, 1.0). The measurement boundary is the wall-clock time from concurrent launch to both agents completing, with per-agent throughput and TTFT captured independently.
The hardware baseline is the RTX 4080 Laptop GPU (12 GB GDDR6X, 9,728 CUDA cores) with Intel i9-13980HX (24 cores, 32 threads), 32 GB DDR5-4800, and Windows 11 Pro. Dual Ollama instances (v0.1.17) share the GPU but are served on separate ports.
5.3.2 Key findings
The headline result is unambiguous: 99.25% parallel efficiency with 1.985x speedup at the optimal configuration (Test 108: GPU=80, CTX=2048, TEMP=1.0). This is within 0.75 percentage points of the theoretical maximum (2.0x speedup, 100% efficiency), establishing that the remaining overhead is an irreducible physical cost of process-isolated HTTP multi-agent execution on shared GPU hardware.
Efficiency by deployment scenario:
| Scenario | Best efficiency | Mean efficiency | Config | Contention rate |
|---|---|---|---|---|
| Homogeneous Chimera | 99.25% | 98.8% | GPU=80, CTX=2048, TEMP=1.0 | 0% at GPU>=80 |
| Baseline vs Chimera | 97.93% | ~95% | GPU=80, CTX=2048 | Low |
| Heterogeneous | 90.0% | 89-90% | Mixed configs | Elevated (resource starvation) |
Four structural findings emerge from the 150-run corpus:
First, GPU=80 is the minimum threshold for contention-free execution. Below GPU=80, the homogeneous Chimera scenario exhibits resource contention in 60% of baseline_vs_chimera runs at GPU=60. At GPU=80 and above, contention drops to zero in the chimera_homo scenario. The threshold is sharp, not gradual: there is no intermediate contention regime between 60 and 80. This suggests that the contention mechanism is a binary resource exhaustion event (likely VRAM allocation failure for the second model instance) rather than a continuous degradation.
Second, context scaling improves efficiency monotonically. Efficiency increases from 96.7% at CTX=512 to 98.9% at CTX=1024 to 99.2% at CTX=2048 in the homogeneous scenario. This is counterintuitive: larger contexts consume more VRAM per agent, which should increase resource pressure. The likely explanation is that larger contexts produce longer generation runs, which amortize the fixed overhead of HTTP round-trips and Ollama scheduling over more tokens. The efficiency gain from context scaling is a utilization effect, not a resource effect.
Third, temperature has a hidden concurrency cost. TEMP=1.0 produces 3.6x more efficiency variance than TEMP=0.6 in concurrent execution. While the mean efficiency at TEMP=1.0 (99.25%) is slightly higher than at TEMP=0.6 (98.93%), the variance is substantially larger. In production, this means TEMP=1.0 will occasionally produce efficiency below 95% even when the mean is near-perfect. The recommendation for production concurrent deployment is TEMP=0.6 for predictable efficiency, not TEMP=1.0 for maximum mean efficiency.
Fourth, the 0.75% irreducible overhead defines the physical limit. The gap between 99.25% and 100% efficiency represents the combined cost of HTTP serialization, Ollama request queuing, GPU context switching between two model instances, and Python process management overhead. This overhead cannot be eliminated within the process-isolated HTTP architecture; it can only be reduced by moving to shared-memory or native GPU scheduling approaches -- which is precisely what TR111_v2 (Rust) attempts.
5.3.3 Implications for program
TR110's 99.25% efficiency number becomes the Python gold standard against which all subsequent multi-agent implementations are measured. Its program-level impact is threefold:
First, it establishes that near-perfect parallel efficiency is achievable with process-isolated HTTP multi-agent on consumer GPU hardware. This finding prevents the program from investing in complex concurrency optimization frameworks when simple dual-instance Ollama already delivers 99.25% of the theoretical maximum.
Second, it identifies GPU=80 as the universal minimum for contention-free multi-agent execution. This threshold propagates through TR113 (Rust multi-agent), TR114_v2 (dual Ollama), and TR115_v2 (async runtime comparison) as a prerequisite configuration.
Third, it demonstrates that heterogeneous deployment degrades efficiency by 9-10 percentage points relative to homogeneous. This finding directly informs the program's production recommendation: deploy identical configurations across all concurrent agents, or accept a measurable efficiency penalty.
5.3.4 Limitations and boundary conditions
-
Two agents only. The 12 GB VRAM budget supports two concurrent Gemma3:latest instances at full offload. Three or more agents would require smaller models or reduced offload, introducing a qualitatively different contention regime that TR110 does not measure. TR129 later addresses N-agent scaling with different models and serving architectures.
-
Single model (Gemma3:latest). Parallel efficiency may differ for models with different VRAM footprints. A larger model (8B) might hit contention at lower GPU layer counts or fail to support dual instances entirely.
-
Process isolation assumption. The 99.25% efficiency depends on process isolation (separate Python processes, separate Ollama instances). Thread-based or coroutine-based concurrency within a single process would encounter different bottlenecks (GIL, connection pooling, event loop scheduling).
-
Short-duration benchmarks. The 150 runs capture steady-state behavior but may miss long-running thermal or memory fragmentation effects that emerge over hours of continuous operation.
-
Dual Ollama instances pre-configured. The experiment assumes dual Ollama is already running. The operational cost of managing dual instances (monitoring, restart handling, port configuration) is not measured.
5.3.5 Transferability
TR110's parallel efficiency findings transfer to any deployment where: (a) the GPU has sufficient VRAM for two concurrent model instances at the target offload level, (b) the serving layer supports multi-instance deployment with port-level isolation, and (c) the client uses process-level parallelism rather than thread-level. The 0.75% irreducible overhead is an architecture-level constant for process-isolated HTTP multi-agent and is expected to be similar across hardware with comparable PCIe and memory bandwidth.
The non-transferable elements are the specific GPU layer threshold (80), which depends on VRAM capacity, and the absolute throughput numbers, which depend on GPU compute and memory bandwidth. The transferable output is the methodology (dual-instance isolation, 5-run validation, scenario taxonomy) and the structural finding that near-perfect efficiency is achievable without exotic concurrency frameworks.
Evidence snapshot (artifact-backed):
| Metric | Value | Scenario | Configuration | Evidence source |
|---|---|---|---|---|
| Peak parallel efficiency | 99.25% | Chimera homogeneous | GPU=80, CTX=2048, TEMP=1.0 | TR110 Test 108 |
| Peak speedup | 1.985x | Chimera homogeneous | GPU=80, CTX=2048, TEMP=1.0 | TR110 Test 108 |
| Homogeneous mean efficiency | 98.8% | Chimera homogeneous | All configs, GPU>=80 | TR110 chimera_homo corpus |
| Baseline vs Chimera best | 97.93% | Baseline vs Chimera | GPU=80, CTX=2048 | TR110 Test 202 |
| Heterogeneous efficiency | 89-90% | Heterogeneous | Mixed configs | TR110 hetero corpus |
| Contention threshold | GPU=80 | Chimera homogeneous | Contention=0% at GPU>=80 | TR110 contention analysis |
| Context scaling (512 -> 2048) | 96.7% -> 99.2% | Chimera homogeneous | Fixed GPU=80 | TR110 context sweep |
| Temperature variance ratio | 3.6x | Chimera homogeneous | TEMP=1.0 vs TEMP=0.6 | TR110 variance analysis |
| Irreducible overhead | 0.75% | Chimera homogeneous | Best config | TR110 ceiling analysis |
| Total benchmark runs | 150 | All scenarios | 30 configs x 5 runs | TR110 artifact manifest |
5.4 TR111_v2: Rust Single-Agent Performance
Research question. Can a Rust-based agent implementation achieve full workflow parity with the Python agent (TR109) and what are the performance characteristics of Rust inference at the agent level?
TR111_v2 is the research program's first cross-language experiment. It replaces the superseded TR111 (which tested a simplified Rust implementation performing only single LLM calls without file I/O or multi-step workflows) with a production-grade Rust agent that matches the Python agent's full complexity: file system scanning, data ingestion from CSV, JSON, and Markdown sources, multi-stage LLM analysis (analysis + report generation), and comprehensive metric tracking. The upgrade, documented in TR115, ensures that any performance difference between Rust and Python is attributable to the language runtime, not to workflow simplification.
5.4.1 Experimental design and artifact boundary
The experiment tests 19 configurations (1 baseline + 18 parameter variations) with 3 runs per configuration, producing 57 total benchmark runs. The configuration space mirrors TR109's parameter ranges: GPU layers (40, 60, 80, 100), context size (256, 512, 1024), and temperature (0.6, 0.8, 1.0). Each run processes approximately 101 files and makes 2 LLM calls (analysis + report generation), matching the Python agent's workflow exactly.
The Rust agent is compiled with Rust 1.90.0 (x86_64-pc-windows-msvc) using release optimizations. The Ollama API communication uses the reqwest HTTP client with tokio async runtime. The measurement boundary matches TR109: full workflow execution time from file system scan to report output, with per-call throughput and TTFT captured at the Ollama API boundary.
Hardware and model are identical to TR109: Gemma3:latest (4.3B, Q4_K_M) on RTX 4080 Laptop GPU (12 GB GDDR6X, 9,728 CUDA cores), Intel i9-13980HX (24 cores, 32 threads), 32 GB DDR5-4800, Windows 11. This control eliminates hardware, model, and serving-layer variation, isolating the language runtime as the sole independent variable.
5.4.2 Key findings
TR111_v2 produces five findings that collectively establish Rust's performance profile for single-agent LLM workflows:
Baseline performance:
| Metric | Value | CV | Interpretation |
|---|---|---|---|
| Throughput (baseline, mean) | 114.54 tok/s | 2.6% (sigma=2.97) | Stable, near hardware saturation |
| TTFT (baseline, mean) | 603.53 ms | High (10-180%) | Highly variable, context-dependent |
| Cross-config throughput CV | 0.24% | -- | Exceptional consistency across configs |
| Best config throughput | 114.97 tok/s | -- | GPU=60, CTX=256, TEMP=0.6 |
| Worst config throughput | 113.99 tok/s | -- | Negligible spread (0.86%) |
First, throughput is configuration-insensitive. The cross-configuration coefficient of variation is 0.24%, meaning that all 19 tested configurations produce effectively identical throughput. The individual parameter impacts are negligible: GPU layers account for 0.14% throughput variation, context size for 0.07%, and temperature for 0.06%. This is a qualitatively different regime from Python (TR109), where configuration choice produces measurable throughput differences. The Rust runtime is so efficient that the Ollama serving layer -- not the client -- is the binding constraint on throughput.
Second, the throughput ceiling is near hardware saturation. The best configuration achieves 114.97 tok/s, only 0.4% above the baseline of 114.54 tok/s. This 0.4% gap represents the maximum improvement available through configuration tuning. The Rust client has effectively eliminated all software-side overhead, and the remaining throughput is determined by Ollama's model execution speed, which in turn is determined by GPU memory bandwidth and compute. This near-saturation finding means that Rust agent performance cannot be meaningfully improved by configuration tuning; improvements require changes to the serving layer or hardware.
Third, TTFT is 150x more variable than throughput. While throughput CV is 2.6% for the baseline configuration, TTFT CV ranges from 10% to 180% across configurations. This asymmetry is structurally significant: it means that TTFT is sensitive to configuration parameters and to Ollama internal state in ways that throughput is not. The TTFT baseline of 603.53 ms is already 58% faster than Python's 1,437 ms (comparison formalized in TR112_v2), but TTFT optimization remains the primary axis for Rust agent tuning.
Fourth, full workflow parity is confirmed. The Rust agent processes 101 files, executes 2 LLM calls per run, generates structured reports, and collects comprehensive metrics. The workflow parity validation compares output structure, file counts, and metric completeness between Rust and Python agents, confirming that performance differences are not artifacts of reduced functionality.
Fifth, production deployment advantages are quantifiable. The Rust agent compiles to a single binary (~15 MB), requires no runtime dependencies (no Python interpreter, no pip packages, no virtual environment), uses approximately 65-90 MB of process memory (vs. 300-350 MB for Python), and starts in approximately 0.2 seconds (vs. 1.5 seconds for Python). These are not benchmark metrics; they are operational characteristics that affect deployment complexity, container sizing, and cold-start behavior.
5.4.3 Implications for program
TR111_v2 establishes Rust as a production-viable alternative to Python for LLM agent workflows, shifting the program from a Python-only paradigm to a cross-language comparison. Three program-level implications follow:
First, the configuration insensitivity finding (CV=0.24%) means that Rust agents do not require configuration tuning -- the default configuration is effectively optimal. This simplifies deployment: a Rust agent can be deployed with Ollama defaults and achieve within 0.4% of the maximum possible throughput.
Second, the near-saturation finding (0.4% improvement ceiling) means that the Ollama serving layer is now the bottleneck, not the client. This shifts the optimization frontier from client-side code to serving-layer architecture -- a shift that motivates TR113 (Rust multi-agent), TR114_v2 (dual Ollama), and the later serving-stack investigations (TR130).
Third, the TTFT advantage (603 ms vs 1,437 ms for Python) provides the first quantitative evidence that Rust's lower runtime overhead translates to measurable latency improvements for LLM workflows. This advantage is distinct from throughput: it reflects the difference in process startup, HTTP connection establishment, and first-request handling between the two language runtimes.
5.4.4 Limitations and boundary conditions
-
Single model and serving layer. All measurements use Gemma3:latest on Ollama. Rust's advantages may differ with different models (especially larger models that stress VRAM) or different serving layers (vLLM, TGI).
-
Windows only. The Rust binary is compiled for Windows (x86_64-pc-windows-msvc). Performance characteristics may differ on Linux, where different system call paths and memory allocators are used. TR126 later confirms that Linux serving behavior differs materially.
-
19 configurations, 3 runs each. The 57-run corpus is sufficient for identifying the configuration-insensitivity pattern but provides limited statistical power for detecting small configuration effects (if any exist below the 0.24% CV noise floor).
-
TTFT variance not fully explained. The 150x TTFT-to-throughput variance ratio suggests that TTFT is driven by factors outside the Rust client's control (Ollama model loading, CUDA context initialization, OS scheduling). TR111_v2 identifies the pattern but does not attribute it to specific causes.
-
Two LLM calls per workflow. The workflow complexity matches TR109 but does not test deeper call chains (5+), longer contexts, or streaming generation patterns.
5.4.5 Transferability
TR111_v2's configuration insensitivity finding is expected to transfer to any deployment where the LLM serving layer (not the client) is the throughput bottleneck. This is likely true for most Rust HTTP clients communicating with local Ollama instances, regardless of hardware. The specific throughput (114.54 tok/s) and TTFT (603.53 ms) are hardware-dependent and will scale with GPU capability and model size.
The production advantages (single binary, low memory, fast startup) are inherent to Rust's compilation model and transfer unconditionally. The 65-90 MB memory footprint and 0.2-second startup are architectural properties, not configuration-dependent measurements.
Evidence snapshot (artifact-backed):
| Metric | Value | Sigma/CV | Configuration | Evidence source |
|---|---|---|---|---|
| Baseline throughput | 114.54 tok/s | sigma=2.97, CV=2.6% | Ollama defaults, n=3 | TR111_v2 baseline runs |
| Baseline TTFT | 603.53 ms | CV=10-180% (config-dependent) | Ollama defaults | TR111_v2 baseline runs |
| Cross-config throughput CV | 0.24% | -- | All 19 configs | TR111_v2 aggregate analysis |
| Best config throughput | 114.97 tok/s | -- | GPU=60, CTX=256, TEMP=0.6 | TR111_v2 config sweep |
| Worst config throughput | 113.99 tok/s | -- | -- | TR111_v2 config sweep |
| Improvement ceiling | 0.4% | -- | Best vs baseline | TR111_v2 optimization analysis |
| GPU layer impact | 0.14% | -- | Across GPU layer sweep | TR111_v2 parameter sensitivity |
| Context impact | 0.07% | -- | Across context sweep | TR111_v2 parameter sensitivity |
| Temperature impact | 0.06% | -- | Across temperature sweep | TR111_v2 parameter sensitivity |
| Total benchmark runs | 57 | -- | 19 configs x 3 runs | TR111_v2 artifact manifest |
| Process memory | 65-90 MB | -- | All configs | TR111_v2 resource monitoring |
| Startup time | ~0.2 s | -- | Single binary | TR111_v2 deployment metrics |
| Files processed per run | 101 | -- | Full workflow | TR111_v2 workflow validation |
5.5 TR112_v2: Rust vs Python Cross-Language Comparison
Research question. When Rust and Python agent implementations have verified workflow parity, what are the measurable performance, resource, and cost differences, and which language should be chosen for production deployment?
TR112_v2 is the decision-grade report of the first five technical reports. It synthesizes the Rust data from TR111_v2 (57 benchmarks, 19 configurations, 3 runs each) and the Python data from TR109 (54 benchmarks, 18 configurations, 3 runs each) into a head-to-head comparison under identical conditions. The report supersedes TR112 (v1), which compared an outdated Rust micro-benchmark (single LLM call, no file I/O) against Python's full workflow implementation -- a comparison that was structurally unfair and has been withdrawn. TR112_v2 is the first report in the program that produces a deployment recommendation with both statistical backing and economic analysis.
5.5.1 Experimental design and artifact boundary
The comparison draws on 111 total benchmarks: 57 Rust runs (TR111_v2) and 54 Python runs (TR109). Both corpora use: Gemma3:latest (4.3B, Q4_K_M) on the RTX 4080 Laptop GPU, identical file I/O workloads (~101 files), identical LLM call patterns (2 calls per run: analysis + report generation), identical Ollama API interface, and overlapping configuration parameter spaces (GPU layers, context size, temperature).
The comparison methodology enforces several fairness guarantees:
-
Baseline-default comparison. The primary throughput and TTFT comparisons use Ollama default configurations in both languages, eliminating configuration-induced variance. Configuration sweep comparisons are labeled explicitly as cross-configuration analysis.
-
Matched workflow complexity. Both agents perform file system scanning, multi-format data ingestion (CSV, JSON, Markdown), multi-stage LLM inference, and structured report output. The Rust agent upgrade (documented in TR115) specifically targeted workflow parity.
-
Same hardware, same model, same serving layer. The only independent variable is the language runtime: Python 3.13 with aiohttp/requests vs. Rust 1.90 with reqwest/tokio.
-
Statistical apparatus. Welch's t-test for throughput comparison (unequal variance assumed), Cohen's d for effect size, coefficient of variation for consistency comparison. The sample sizes (n=57 and n=54) provide sufficient power for detecting the observed effect sizes.
5.5.2 Key findings
The comparison produces a comprehensive performance profile across five axes: throughput, latency, consistency, resource efficiency, and economics.
Primary performance comparison (baseline defaults):
| Metric | Rust | Python | Delta | Statistical significance |
|---|---|---|---|---|
| Throughput (tok/s) | 114.54 | 99.34 | +15.2% | p < 0.001, Cohen's d = 6.82 |
| TTFT (ms) | 603 | 1,437 | -58.0% | p < 0.001 |
| Process memory (MB) | 65-90 | 300-350 | -67% | Direct measurement |
| Startup time (s) | 0.2 | 1.5 | -83% | Direct measurement |
| Cross-config CV | 0.24% | 1.8% | 7.5x better (Rust) | F-test on variances |
The Cohen's d of 6.82 for the throughput comparison is extraordinary by social science standards but expected in engineering benchmarks where the measurement noise is small relative to the effect. It means that the distributions of Rust and Python throughput measurements do not overlap: every Rust run is faster than every Python run, even accounting for configuration variation.
Cross-configuration dominance. The most striking finding is that Rust's worst configuration (113.99 tok/s) is still 14.9% faster than Python's best configuration (101.08 tok/s, achieved at GPU=80, CTX=512, TEMP=0.8 with n=1, noting that this peak did not replicate at n=3). This establishes absolute dominance: there is no configuration of the Python agent that matches any configuration of the Rust agent in throughput. The dominance holds even when comparing Rust's default with Python's most aggressively optimized single-run result.
Optimization pattern asymmetry:
| Optimization metric | Rust | Python |
|---|---|---|
| Optimization success rate | 72.2% | 38.9% |
| Mean improvement when successful | +0.4% | +2.2% |
| Consistency of improvement | High (narrow distribution) | Low (wide distribution) |
| Peak single-run improvement | +1.2% | +2.2% |
| Does peak replicate? | Yes (n=3) | No (-0.2% at n=3) |
Rust's 72.2% optimization success rate (percentage of configurations that improve over baseline) versus Python's 38.9% reveals a fundamental difference in the optimization landscape. Rust's landscape is flat near the optimum (0.4% mean improvement), so most directions are slightly uphill -- a signature of hardware saturation. Python's landscape is noisy, with occasional peaks (+2.2%) that do not replicate, indicating that Python's higher variance creates occasional favorable conditions that are not reproducible. For production deployment, Rust's reliable small gains are preferable to Python's unreliable occasional peaks.
Resource efficiency and deployment characteristics:
| Characteristic | Rust | Python | Implication |
|---|---|---|---|
| Binary/runtime size | ~15 MB single binary | ~500 MB (interpreter + packages) | Rust: simpler containers, faster deployment |
| Memory footprint | 65-90 MB | 300-350 MB | Rust: 2x4GB instances; Python: 4x8GB instances |
| Startup time | 0.2 s | 1.5 s | Rust: viable for serverless cold starts |
| Dependencies | 0 runtime deps | pip packages, venv | Rust: no dependency conflicts |
| Concurrent capacity (12 GB VRAM) | ~3 instances | ~1 instance (with headroom) | Rust: higher per-node density |
5-year economic analysis:
| Economic metric | Rust | Python | Delta |
|---|---|---|---|
| 5-year TCO | $42,600 | $57,500 | -$14,900 (-26%) |
| Annual infrastructure cost | $7,200 | $9,600 | -$2,400/yr |
| Break-even on Rust migration | 12.7 months | -- | Accounting for ~$5k Rust dev overhead |
| Instance sizing | 2 x 4 GB | 4 x 8 GB | 50% fewer, 50% smaller instances |
| Requests/month at parity | 1M+ | 1M+ | Cost advantage scales with volume |
The TCO model assumes 1M+ requests per month, amortized hardware costs, and the instance sizing differences derived from the memory footprint measurements. The $5,000 Rust development overhead (estimated from the TR115 upgrade effort) is recovered in 12.7 months through infrastructure savings. At higher request volumes, the break-even point shortens proportionally because Rust's higher throughput serves more requests per unit of compute time.
5.5.3 Implications for program
TR112_v2 establishes Rust as the recommended production language for LLM agent workflows on the Banterhearts hardware baseline. This recommendation is grounded in five mutually reinforcing advantages:
First, throughput dominance (+15.2%) directly translates to serving capacity. A Rust deployment serves 15.2% more requests per GPU-hour than an equivalent Python deployment, compounding into thousands of dollars of infrastructure savings annually.
Second, latency advantage (-58% TTFT) directly affects user experience. The 603 ms first-token latency in Rust is below the 1-second threshold commonly used for interactive application responsiveness, while Python's 1,437 ms exceeds it. For the Banterhearts gaming dialogue use case, this difference is perceptible and consequential.
Third, resource efficiency (-67% memory) enables higher per-node agent density. On a 12 GB VRAM system, Rust's 65-90 MB process footprint allows approximately 3 concurrent agent instances where Python's 300-350 MB allows approximately 1. This multiplier directly reduces the number of nodes required for a given concurrency target.
Fourth, consistency advantage (7.5x better CV) simplifies capacity planning. Rust's near-deterministic throughput (CV=0.24%) means that provisioning can use the mean as the planning number without safety margins. Python's higher variance (CV=1.8%) requires provisioning to the p5 or p10 level, effectively over-provisioning by 5-10% to maintain SLA guarantees.
Fifth, deployment simplicity (single binary, zero dependencies) reduces operational overhead. Container images are smaller, deployment pipelines are simpler, and there are no dependency conflicts or virtual environment management costs.
The program-level consequence is clear: TR113 through TR116 focus exclusively on Rust multi-agent performance, treating the single-agent language decision as settled.
5.5.4 Limitations and boundary conditions
-
Single model, single serving layer. The +15.2% throughput advantage is measured with Gemma3:latest on Ollama. Models with different sizes, architectures, or serving requirements (e.g., models requiring custom tokenizers or preprocessing) may exhibit different language-dependent overhead profiles.
-
GPU-bound regime. In the GPU-bound regime tested here, the Ollama serving layer is the primary throughput constraint, and Rust's advantage comes from lower client-side overhead (HTTP management, memory allocation, process scheduling). In CPU-bound or network-bound regimes, the language advantage may shrink or disappear because the bottleneck shifts away from client-side code.
-
Development velocity not measured. The economic analysis includes a $5,000 Rust development overhead estimate but does not measure ongoing development velocity differences. For teams without Rust expertise, the maintenance and iteration costs may shift the break-even point beyond 12.7 months.
-
Windows-specific compilation. The Rust binary is compiled for Windows (MSVC toolchain). Cross-compilation to Linux may reveal different performance characteristics, particularly for async I/O and memory allocation patterns.
-
Two-call workflow depth. The comparison uses a 2-call workflow (analysis + report). Deeper call chains may amplify or attenuate the per-call overhead differences between Rust and Python.
-
Ollama version sensitivity. The measurements use a specific Ollama version. Ollama updates that change HTTP handling, request queuing, or model loading behavior could alter the absolute numbers. The relative ordering (Rust faster than Python) is expected to be robust to Ollama updates because it reflects language-level overhead, not serving-layer behavior.
5.5.5 Transferability
TR112_v2's Rust advantage is expected to transfer to any deployment where: (a) the LLM serving layer communicates via HTTP API, (b) the client performs non-trivial work between API calls (file I/O, parsing, orchestration), and (c) the deployment is on hardware where client-side overhead is measurable relative to model inference time. The +15.2% throughput advantage may shrink for very large models where inference time dominates client overhead, or grow for very small models where client overhead is a larger fraction of total time.
The absolute numbers ($42.6k vs $57.5k TCO, 603 ms vs 1,437 ms TTFT) are hardware- and pricing-specific. The relative advantages (+15.2% throughput, -58% TTFT, -67% memory, -83% startup) are architecturally grounded in language-level differences and are expected to be directionally stable across hardware platforms, though the magnitudes will vary.
The transferable output is the decision framework: compare baseline defaults (not optimized peaks), validate workflow parity before claiming language advantages, use Cohen's d to assess practical significance, and compute break-even periods that account for migration costs. This framework can be applied to any language comparison (e.g., Go vs Python, C++ vs Rust) on any hardware baseline.
Evidence snapshot (artifact-backed):
| Metric | Value | Comparison | Statistical test | Evidence source |
|---|---|---|---|---|
| Rust throughput (baseline) | 114.54 tok/s | vs Python 99.34 tok/s | p < 0.001, d = 6.82 | TR111_v2 + TR109 baselines |
| Throughput advantage | +15.2% | Rust over Python | Welch's t-test | TR112_v2 primary comparison |
| TTFT advantage | -58% | Rust 603 ms vs Python 1,437 ms | p < 0.001 | TR112_v2 latency comparison |
| Memory advantage | -67% | Rust 65-90 MB vs Python 300-350 MB | Direct measurement | TR112_v2 resource profiling |
| Startup advantage | -83% | Rust 0.2 s vs Python 1.5 s | Direct measurement | TR112_v2 deployment metrics |
| Cross-config CV ratio | 7.5x | Rust 0.24% vs Python 1.8% | F-test | TR112_v2 consistency analysis |
| Rust worst vs Python best | +14.9% | 113.99 vs 101.08 tok/s | Configuration-level comparison | TR112_v2 dominance analysis |
| Optimization success rate | 72.2% vs 38.9% | Rust vs Python | Proportion test | TR112_v2 optimization analysis |
| 5-year TCO delta | -$14,900 (-26%) | Rust $42.6k vs Python $57.5k | Cost model | TR112_v2 economic analysis |
| Break-even period | 12.7 months | $5k dev overhead vs $2.4k/yr savings | Financial analysis | TR112_v2 economic analysis |
| Total benchmarks | 111 | 57 Rust + 54 Python | -- | TR111_v2 + TR109 corpora |
| Instance sizing (Rust) | 2 x 4 GB | vs Python 4 x 8 GB | Deployment model | TR112_v2 infrastructure analysis |
5.6 TR113: Single Ollama Bottleneck Discovery
5.6.1 Experimental design and artifact boundary
TR113 is an exploratory diagnostic report, not a production optimization study. Its purpose is to answer a pointed question: why does the Rust multi-agent harness underperform the Python multi-agent harness (TR110) by 17 percentage points at the efficiency ceiling, despite Rust's demonstrated single-agent throughput advantage (TR112)? The experimental design deliberately constrains one variable -- the Ollama instance topology -- while sweeping configuration space across 19 parameter combinations in a single-run exploratory sweep. Total wall-clock time was approximately 45 minutes.
The artifact boundary is the dual-agent Rust orchestrator operating against a single Ollama instance on the default port (11434). Both agents issue concurrent HTTP requests to the same endpoint. The measurement boundary therefore includes not only inference latency but also request queuing, model loading serialization, and any contention at the Ollama HTTP server layer. This is a deliberate choice: the objective is to measure the system-level bottleneck, not the model-level throughput.
The 19 configurations span three scenario types (baseline-vs-chimera, chimera-homogeneous, chimera-heterogeneous) with GPU layer allocations from 60 to 140, context sizes from 512 to 2048, and temperatures of 0.6 and 0.8. Each configuration was executed once, which is sufficient for bottleneck identification but insufficient for variance estimation. TR113 is therefore a hypothesis-generating study, not a hypothesis-confirming one. The confirmation comes in TR114_v2.
The hardware baseline is fixed: RTX 4080 Laptop GPU (12 GB GDDR6X, 9728 CUDA cores), Intel i9-13980HX, 32 GB DDR5-4800, Windows 11 Pro. All Ollama instances use the same model weights and quantization level. The only independent variable under test is the concurrency topology: one Ollama instance serving two agents simultaneously.
5.6.2 Key findings
The headline result is that no configuration achieved 95% multi-agent efficiency under single-Ollama topology. The best observed efficiency was 82.2% (homogeneous scenario, GPU=80, context=1024, temperature=0.6), and the mean across all 19 configurations was 72.4% with a standard deviation of 5.5 percentage points. This is not a marginal shortfall; it represents a systematic ceiling that no amount of parameter tuning within the existing topology can overcome.
Contention was pervasive. Of the 19 configurations tested, 14 (74%) exhibited measurable contention signatures, defined as periods where one agent's inference was delayed by the other agent's active request. The baseline-vs-chimera scenario was the most severely affected, with a mean efficiency of 71.2% and a contention rate of 92% (contention detected in nearly every run within that scenario class).
The root cause is architectural, not parametric. A single Ollama instance serializes model loading and inference for concurrent requests. When Agent A is in mid-inference, Agent B's request enters an HTTP-level queue and waits. This serialization converts what should be overlapped GPU utilization into sequential execution with idle gaps. The estimated efficiency loss from serialization alone is 15-20 percentage points, which accounts for the gap between the observed 82.2% peak and the theoretical 100% utilization.
The comparison with TR110 is instructive. Python's multi-agent harness (TR110) achieved 99.25% peak efficiency using a dual-Ollama topology (ports 11434 and 11435), where each agent has a dedicated inference server. The 17 percentage-point gap between Python's peak (99.25%) and Rust's peak (82.2%) is therefore not attributable to the language runtime; it is attributable to the Ollama instance count. This is the central discovery: the bottleneck is the serving topology, not the client-side language.
A secondary finding concerns temperature sensitivity. Within the Rust single-Ollama configuration space, temperature 0.6 consistently outperformed temperature 0.8 by 8-14 percentage points. This is a Rust-specific interaction effect: lower temperature produces shorter, more deterministic completions, which reduces the time each agent holds the Ollama lock. In the Python dual-Ollama topology, temperature sensitivity is much weaker because the lock contention is eliminated.
Cross-language throughput comparison across the full configuration sweep shows Python achieving 18-25% higher speedup and a 17 percentage-point higher efficiency ceiling. However, this comparison is confounded by the topology difference. TR113's value is precisely in decomposing this confound: the language contributes a small fraction; the topology contributes the majority.
5.6.3 Implications for program
TR113 reframes the entire multi-agent optimization problem. Prior to this report, the working hypothesis was that Rust's lower per-request overhead would translate directly into higher multi-agent efficiency. TR113 refutes this hypothesis and replaces it with a structural one: multi-agent efficiency is dominated by the serving topology, and the client-side runtime is a second-order effect once the topology bottleneck is removed.
This has three program-level consequences. First, it motivates TR114_v2 as the direct fix: deploy dual Ollama instances for Rust multi-agent workloads and measure whether the topology change recovers the expected efficiency. Second, it establishes a diagnostic methodology: when multi-agent performance disappoints, the first investigation should target the serving topology, not the client code. Third, it provides a quantitative budget for the topology tax: 15-20 percentage points of efficiency are lost to single-instance serialization, which translates directly into wasted GPU-seconds and inflated cost per token.
5.6.4 Limitations and boundary conditions
TR113 is a single-run exploratory sweep. It has no variance estimates and cannot support statistical inference. The contention rates are point estimates, not confidence intervals. The 45-minute total execution time means that each configuration received limited runtime, and transient system effects (thermal throttling, background processes) could influence individual results.
The study is limited to two concurrent agents. The serialization penalty may scale differently with three or more agents, and the single-Ollama topology may exhibit different failure modes under higher concurrency. Additionally, the study uses a single model; models with different inference profiles (e.g., longer generation, heavier KV-cache usage) may exhibit different contention patterns.
The temperature sensitivity finding (0.6 advantage of 8-14pp) is observed but not mechanistically explained at the inference-engine level. The proposed mechanism (shorter completions reduce lock hold time) is plausible but not confirmed by internal Ollama instrumentation.
5.6.5 Transferability
The architectural finding -- that single-instance serving creates a serialization bottleneck for concurrent agents -- is highly transferable. Any inference server that serializes concurrent requests (whether Ollama, vLLM in single-model mode, or a custom HTTP wrapper around a single model instance) will exhibit similar behavior. The specific efficiency numbers (82.2% peak, 72.4% mean) are tied to the hardware baseline, model, and Ollama version, but the structural pattern is general.
The diagnostic methodology is also transferable: when multi-agent efficiency plateaus below expectations, the serving topology should be the first variable investigated, before client-side optimizations.
Evidence Snapshot (TR113)
| Metric | Value | Context |
|---|---|---|
| Configurations tested | 19 | Single-run exploratory sweep |
| Total wall-clock time | ~45 min | All 19 configs |
| Peak efficiency | 82.2% | homo_g80_c1024_t0.6 |
| Mean efficiency | 72.4% | Across all configs, sigma=5.5pp |
| Contention rate | 74% | 14/19 configs detected contention |
| Baseline-vs-chimera mean | 71.2% | 92% contention rate |
| Python (TR110) peak | 99.25% | Dual Ollama, for comparison |
| Efficiency gap (topology) | ~17pp | Single vs dual Ollama |
| Serialization penalty est. | 15-20pp | Single-instance serving tax |
| Temperature 0.6 advantage | +8-14pp | Over temperature 0.8, Rust-specific |
| Python speedup advantage | 18-25% | Confounded by topology difference |
5.7 TR114_v2: Dual Ollama Multi-Agent Validation
5.7.1 Experimental design and artifact boundary
TR114_v2 is the controlled intervention that follows TR113's diagnostic. Having identified single-Ollama serialization as the root cause of Rust's multi-agent efficiency shortfall, this report deploys the fix -- dual Ollama instances on ports 11434 and 11435, with each agent pinned to a dedicated instance -- and measures whether the topology change recovers the efficiency predicted by TR113's analysis.
The experimental design is rigorous where TR113 was exploratory. Twenty-seven configurations are tested, each with 5 independent runs, yielding 135 total benchmarks executed over 8+ hours of wall-clock time. This design supports both point estimation and variance analysis. The 27 configurations span three scenario categories: baseline-vs-chimera, chimera-heterogeneous, and chimera-homogeneous. GPU layer allocations range from 60 to 140, context sizes from 512 to 2048, and temperatures from 0.6 to 0.8.
The artifact boundary is the dual-agent Rust orchestrator operating against two independent Ollama instances. Each agent has exclusive access to its own inference server. The measurement boundary includes HTTP round-trip time, model inference, and any Ollama-internal overhead, but crucially excludes the cross-agent serialization that dominated TR113. This is the designed difference: the only change between TR113 and TR114_v2 is the Ollama topology.
The hardware baseline is identical to all prior reports in the program: RTX 4080 Laptop GPU, i9-13980HX, 32 GB DDR5-4800, Windows 11 Pro.
5.7.2 Key findings
The topology fix is decisive. Peak single-run efficiency reaches 99.992% (test108, run 1), and the best configuration average across 5 runs is 99.396% (test011: chimera-heterogeneous, GPU 120/140, context 512/1024). The overall mean efficiency across all 135 benchmarks is 98.281%, with a mean speedup of 1.969x (peak 2.000x, which represents perfect dual-agent scaling).
The contention rate collapses from 74% in TR113 to 0.74% in TR114_v2 -- a single run out of 135 exhibited measurable contention. This is the most dramatic single-variable improvement in the entire TR108-TR116 program. The efficiency gain of 17.2 percentage points at peak (82.2% to 99.4%) and 25.9 percentage points at mean (72.4% to 98.3%) confirms TR113's diagnosis: the bottleneck was entirely in the serving topology.
Scenario-level analysis reveals consistent performance across all three scenario types. Chimera-heterogeneous configurations achieve the highest mean efficiency at 98.79%, followed by chimera-homogeneous at 98.40%, and baseline-vs-chimera at 97.37%. The ordering is stable across runs, but the spread is narrow (1.42pp between best and worst scenario means), indicating that once the topology bottleneck is removed, scenario-level parameter choices become second-order.
The comparison with Python (TR110) is now favorable to Rust. Peak efficiency is +0.15 percentage points higher (99.396% vs 99.25%), mean efficiency is +2.48 percentage points higher (98.281% vs 95.8%), and the coefficient of variation is 2.5 percentage points lower (tighter distribution). Rust's advantage is not in peak performance -- both languages approach theoretical maximum -- but in consistency and floor behavior.
Two anti-patterns emerge. First, GPU=120 in the baseline-vs-chimera scenario produces 91.60% efficiency, which is 6.7 percentage points below the overall mean. This suggests that over-allocating GPU layers in a competitive scenario creates resource contention at the GPU scheduling level even when Ollama instances are separated. Second, low temperature (0.6) in the homogeneous scenario introduces instability, reversing the TR113 finding. This is a topology-dependent interaction: under single Ollama, lower temperature helped by reducing lock hold time; under dual Ollama, the lock is eliminated, and lower temperature's effect on completion length no longer confers an advantage.
5.7.3 Implications for program
TR114_v2 closes the architectural question opened by TR113. The combination of TR113 (diagnosis) and TR114_v2 (intervention) constitutes a complete causal argument: single-Ollama serialization causes the efficiency shortfall, and dual-Ollama deployment eliminates it. This is the highest-confidence causal claim in the TR108-TR116 program because it is backed by a controlled single-variable change with 135 measured outcomes.
For production deployment, the implication is unambiguous: any multi-agent Rust system must use per-agent Ollama instances. The cost of running two Ollama processes (minimal memory overhead, since model weights are shared via OS page cache) is negligible compared to the 17-26 percentage-point efficiency recovery.
TR114_v2 also establishes Rust as the preferred multi-agent runtime. With the topology bottleneck removed, Rust achieves higher mean efficiency, lower variance, and a higher floor than Python across all scenario types. This finding feeds directly into the language-choice synthesis in Section 6.1.
5.7.4 Limitations and boundary conditions
The study is limited to two concurrent agents. Scaling to three or more agents would require three or more Ollama instances, and the GPU memory and scheduling behavior under higher multiplicity is not characterized here. The RTX 4080's 12 GB VRAM may become a binding constraint as instance count increases, particularly with larger models or higher context sizes.
All 135 benchmarks use the same model. Cross-model validation of the dual-Ollama topology is deferred to TR116, which tests three models under the dual-Ollama Rust harness.
The anti-pattern findings (GPU=120 in baseline-vs-chimera, temperature=0.6 in homogeneous) are observed correlations, not mechanistically validated. The proposed explanations (GPU scheduling contention, completion-length interaction) are plausible but require targeted micro-benchmarks to confirm.
5.7.5 Transferability
The dual-Ollama topology pattern is directly transferable to any multi-agent system that uses HTTP-based inference servers. The principle -- dedicate one inference server per concurrent agent to eliminate request serialization -- generalizes beyond Ollama to vLLM, TGI, or any serving framework that serializes concurrent model invocations.
The specific efficiency numbers (99.4% peak, 98.3% mean) are bound to the hardware, model, and Ollama version. However, the magnitude of the topology effect (17-26pp recovery) is likely to be reproducible on other hardware configurations, since the serialization mechanism is in the HTTP server layer, not the GPU.
Evidence Snapshot (TR114_v2)
| Metric | Value | Context |
|---|---|---|
| Configurations tested | 27 | 5 runs each |
| Total benchmarks | 135 | 8+ hours wall-clock |
| Peak single-run efficiency | 99.992% | test108, run 1 |
| Best config average | 99.396% | test011, chimera-hetero |
| Overall mean efficiency | 98.281% | Across all 135 runs |
| Mean speedup | 1.969x | Peak: 2.000x |
| Contention rate | 0.74% | 1/135 runs |
| Baseline-vs-chimera mean | 97.37% | Lowest scenario |
| Chimera-heterogeneous mean | 98.79% | Highest scenario |
| Chimera-homogeneous mean | 98.40% | Middle scenario |
| vs Python peak | +0.15pp | 99.396% vs 99.25% |
| vs Python mean | +2.48pp | 98.281% vs 95.8% |
| vs Python CV | -2.5pp | Tighter distribution |
| GPU=120 anti-pattern | 91.60% | Baseline-vs-chimera only |
5.8 TR115_v2: Async Runtime Deep Dive
5.8.1 Experimental design and artifact boundary
TR115_v2 addresses a question that arises naturally from TR114_v2's success: given that dual-Ollama Rust multi-agent achieves near-perfect efficiency, does the choice of asynchronous runtime matter? The Rust ecosystem offers multiple async runtimes, each with different scheduling strategies, thread models, and I/O primitives. If the runtime choice can degrade multi-agent efficiency or introduce pathological failure modes, this must be characterized before a production recommendation can be made.
The experimental design is a full factorial cross of 6 configurations and 5 async runtimes, with 5 runs per cell, yielding 150 total benchmarks over 12+ hours of wall-clock time. The 5 runtimes tested are: tokio-default (work-stealing, multi-threaded), tokio-localset (single-threaded, pinned), async-std (community runtime with Tokio dependency chain), smol (minimal runtime, configurable buffer), and smol-1kb (smol with 1 KB HTTP buffer size). The 6 configurations span the same parameter space as TR114_v2.
The artifact boundary is the dual-agent Rust orchestrator with dual Ollama instances, identical to TR114_v2. The only independent variable is the async runtime used by the Rust orchestrator. This isolates the runtime effect from the topology effect (already resolved in TR114_v2) and from the model effect (held constant).
The hardware baseline remains unchanged: RTX 4080 Laptop GPU, i9-13980HX, 32 GB DDR5-4800, Windows 11 Pro.
5.8.2 Key findings
The central finding is that peak efficiency is not the differentiator; consistency is. All four working runtimes (excluding async-std) achieve peak efficiencies within a 0.12 percentage-point band: tokio-localset at 99.99%, smol-1kb at 99.94%, tokio-default at 99.89%, and smol at 99.87%. If peak performance were the only criterion, the runtime choice would be irrelevant. But it is not.
The runtime ranking by consistency (standard deviation of efficiency across runs) reveals a clear hierarchy:
| Runtime | Peak Eff. | Mean Eff. | Std. Dev. | Min Eff. | Verdict |
|---|---|---|---|---|---|
| tokio-default | 99.89% | 98.72% | 1.21pp | 94.80% | PRODUCTION |
| smol-1kb | 99.94% | 98.61% | 1.32pp | 94.98% | Alternative |
| tokio-localset | 99.99% | 97.95% | 4.03pp | 81.03% | Unstable |
| smol | 99.87% | 97.72% | 4.87pp | 72.80% | Avoid |
| async-std | 50.00% | 50.00% | 0.00pp | 49.99% | Unusable |
Tokio-default achieves the best combination of mean efficiency (98.72%) and consistency (1.21pp standard deviation). Its minimum observed efficiency of 94.80% is the highest floor among all runtimes. Smol-1kb is a viable alternative with marginally lower consistency (1.32pp), but tokio-default's work-stealing scheduler provides a structural advantage that smol's cooperative model cannot match.
The work-stealing mechanism is the key differentiator. When one agent completes its inference request and the other is still waiting, tokio-default's work-stealing scheduler immediately reassigns the idle thread to process I/O for the waiting agent. This prevents idle-thread accumulation and maintains consistent request pipelining. The effect is quantified on the ctx=2048 configuration, where tokio-default achieves 99.29% efficiency versus tokio-localset's 86.43% -- a 12.86 percentage-point advantage attributable entirely to work-stealing versus thread-pinning.
Async-std exhibits catastrophic failure. It achieves exactly 50.00% efficiency with zero variance across all runs. The root cause is a hard Tokio dependency in the reqwest HTTP client library. When async-std attempts to drive reqwest, the internal Tokio runtime serializes all HTTP I/O onto a single thread, producing perfect sequential execution of the two agents. The result is exactly half the theoretical throughput, with no stochastic variation. This is not a performance degradation; it is a complete architectural incompatibility that produces deterministic failure.
Smol exhibits a pathological failure mode. While its mean efficiency (97.72%) is acceptable, a single run dropped to 72.80% -- a 27 percentage-point deviation from peak. This is the signature of a cooperative runtime without work-stealing: when one task blocks longer than expected, no scheduler intervention redistributes work, and the entire pipeline stalls. This tail event makes smol unsuitable for production workloads with SLO requirements.
The HTTP buffer size hypothesis was tested via the smol-1kb variant (1 KB buffers versus smol's default 8 KB). Smol-1kb achieves +4 percentage points over smol's mean, supporting the hypothesis that smaller buffers reduce head-of-line blocking in the HTTP response pipeline. However, tokio-default still outperforms smol-1kb by 0.11 percentage points in mean and 0.11 percentage points in standard deviation, indicating that work-stealing is more impactful than buffer tuning.
5.8.3 Implications for program
TR115_v2 produces an unambiguous production recommendation: use #[tokio::main] with default configuration. No custom thread pool sizing, no manual LocalSet pinning, no alternative runtimes. The default Tokio configuration provides the best combination of mean efficiency, consistency, and failure resilience.
This recommendation is deliberately conservative. Tokio-localset achieves the highest single-run peak (99.99%), but its 4.03pp standard deviation and 81.03% floor make it unsuitable for any deployment with an SLO. The production recommendation prioritizes the worst case, not the best case, which is the correct framing for decision-grade inference systems.
The async-std finding has broader ecosystem implications. The reqwest library's hard Tokio dependency means that any Rust HTTP client built on reqwest is effectively locked into the Tokio ecosystem. This is not a theoretical concern; it manifests as a 50pp efficiency floor for any runtime that is not Tokio-compatible. The program therefore treats Tokio as a mandatory dependency, not a preference.
5.8.4 Limitations and boundary conditions
The study tests 5 runtimes across 6 configurations with 5 runs each. This is sufficient for ranking and failure-mode identification but may not capture rare tail events (e.g., the smol 72.80% drop might occur more or less frequently at higher sample sizes). The 150-benchmark design is a pragmatic compromise between coverage and wall-clock time.
The study is limited to dual-agent workloads. Higher agent counts may stress runtime schedulers differently, particularly for cooperative runtimes like smol. The work-stealing advantage observed for tokio-default may become even more pronounced under higher concurrency.
The async-std failure is specific to the reqwest HTTP client. If async-std were paired with a native async-std HTTP client, the result might differ. However, reqwest is the dominant HTTP client in the Rust ecosystem, and replacing it would require significant engineering effort with uncertain benefits.
5.8.5 Transferability
The work-stealing finding is transferable to any I/O-bound concurrent workload in Rust. The principle -- that work-stealing schedulers outperform cooperative schedulers in consistency, not just throughput -- applies to any system where task durations are variable and I/O latency is non-deterministic.
The async-std incompatibility is a concrete ecosystem constraint that any Rust project using reqwest must account for. This is transferable as a dependency-audit finding: before selecting an async runtime, verify that all transitive dependencies are runtime-agnostic or compatible with the chosen runtime.
The specific efficiency numbers are bound to the hardware, model, and Ollama version. However, the runtime ranking (tokio-default > smol-1kb > tokio-localset > smol >> async-std) is likely stable across hardware configurations, since the ranking is driven by scheduling strategy rather than hardware-specific effects.
Evidence Snapshot (TR115_v2)
| Metric | Value | Context |
|---|---|---|
| Total benchmarks | 150 | 6 configs x 5 runtimes x 5 runs |
| Total wall-clock time | 12+ hours | Full factorial |
| Tokio-default mean | 98.72% | sigma=1.21pp, min=94.80% |
| Smol-1kb mean | 98.61% | sigma=1.32pp, min=94.98% |
| Tokio-localset mean | 97.95% | sigma=4.03pp, min=81.03% |
| Smol mean | 97.72% | sigma=4.87pp, min=72.80% |
| Async-std mean | 50.00% | sigma=0.00pp, catastrophic |
| Peak spread (4 working) | 0.12pp | 99.87%-99.99% |
| Work-stealing advantage | +12.86pp | ctx=2048, tokio vs localset |
| Smol pathological drop | 72.80% | 27pp below peak, single run |
| HTTP buffer effect | +4pp | smol-1kb vs smol-8kb |
| Production recommendation | tokio-default | No custom configuration |
5.9 TR116: Cross-Model Multi-Agent Validation
5.9.1 Experimental design and artifact boundary
TR116 addresses the final open question in the multi-agent research arc: do the findings from TR113-TR115 generalize across models, or are they specific to the single model used in those studies? This is a validity question, not an optimization question. If the dual-Ollama topology, Rust runtime advantage, and tokio-default recommendation are model-specific, they cannot be promoted to production defaults.
The experimental design crosses 3 models (Gemma 3 at 4.3B parameters, Qwen 2.5 at 7B parameters, and Llama 3.1 at 8B parameters) with 2 runtimes (Rust with tokio-default, Python with asyncio) and 2 scenarios (baseline, homogeneous chimera), with 5 independent runs per cell. This yields 60 total benchmarks over 12+ hours of wall-clock time. Both runtimes use the dual-Ollama topology established in TR114_v2.
The model selection is deliberate. Gemma 3 (4.3B) is the smallest and most efficient model in the set, representing the fast-inference regime. Llama 3.1 (8B) is the largest, representing the memory-bound regime where KV-cache pressure is highest. Qwen 2.5 (7B) occupies an intermediate position but, as will be shown, exhibits a distinctive failure mode that makes it an informative outlier rather than a middle-of-the-pack entry.
The artifact boundary is the same dual-Ollama dual-agent setup used in TR114_v2 and TR115_v2. Both Rust and Python harnesses communicate with their respective Ollama instances over HTTP. The only new independent variables are the model identity and the client-side language runtime.
5.9.2 Key findings
The Rust efficiency advantage is consistent across all three models. The following tables present the core results.
Rust Multi-Agent Efficiency by Model:
| Model | Baseline Eff. | Homogeneous Eff. | Best Observed |
|---|---|---|---|
| Gemma 3 (4.3B) | 97.3% | 99.2% | 99.2% |
| Llama 3.1 (8B) | 96.5% | 98.5% | 98.5% |
| Qwen 2.5 (7B) | 90.0% | 89.4% | 90.0% |
Python Multi-Agent Efficiency by Model:
| Model | Baseline Eff. | Homogeneous Eff. | Best Observed |
|---|---|---|---|
| Llama 3.1 (8B) | 83.8% | 85.8% | 85.8% |
| Gemma 3 (4.3B) | 80.2% | 84.9% | 84.9% |
| Qwen 2.5 (7B) | 77.6% | 84.1% | 84.1% |
The Rust advantage ranges from +5.9 percentage points (Qwen baseline) to +14.3 percentage points (Gemma homogeneous). Across all model-scenario combinations, the mean Rust advantage is approximately 12-17 percentage points. This is a runtime characteristic, not a model-specific effect: Rust's lower per-request overhead, zero-cost abstractions, and work-stealing scheduler (tokio-default) produce consistently higher utilization of the dual-Ollama pipeline regardless of which model is being served.
The Python ceiling is the most striking cross-model finding. No Python configuration exceeds 85.8% efficiency, regardless of model. This 86% ceiling is consistent with the event-loop saturation hypothesis developed in the cross-report synthesis (Section 6.6): Python's single-threaded asyncio event loop becomes the bottleneck when total throughput exceeds approximately 100 tokens per second across both agents. The ceiling is not a parameter-tuning failure; it is a structural limitation of the CPython runtime.
Model rankings flip between runtimes. Under Rust, Gemma 3 achieves the highest efficiency (99.2%), followed by Llama 3.1 (98.5%), then Qwen 2.5 (90.0%). Under Python, Llama 3.1 leads (85.8%), followed by Gemma 3 (84.9%), then Qwen 2.5 (84.1%). The inversion of Gemma and Llama between Rust and Python is a consequence of their different throughput profiles interacting with the runtime overhead structure. Gemma's higher raw throughput benefits most from Rust's lower overhead; Llama's more moderate throughput is less sensitive to the overhead difference.
The Qwen 2.5 failure mode is distinctive and informative. Despite being intermediate in parameter count (7B, between Gemma's 4.3B and Llama's 8B), Qwen achieves the lowest efficiency in both runtimes. The root cause is a persistent throughput imbalance between agents: one agent consistently generates approximately 30 tokens per second faster than the other, creating an asymmetric pipeline where the faster agent finishes and idles while the slower agent is still generating. This throughput delta is attributable to Qwen's heavier KV-cache footprint, which creates memory-bandwidth contention when two instances share the GPU. The 90% ceiling in Rust is therefore not a runtime limitation but a model-specific resource contention pattern.
Variance analysis confirms Rust's consistency advantage. The coefficient of variation for Rust efficiency is below 2% for all three models, compared to 3-5% for Python. This means Rust not only achieves higher mean efficiency but also delivers more predictable performance, which is essential for SLO-governed production deployments.
A 5-year total cost of ownership projection, using the measured efficiency numbers and standard cloud GPU pricing, estimates Gemma 3 with Rust at $58,000 versus Gemma 3 with Python at $67,000. The $9,000 differential arises entirely from the efficiency gap: lower efficiency means more GPU-hours to serve the same request volume.
5.9.3 Implications for program
TR116 validates the generalizability of the TR113-TR115 findings. The dual-Ollama topology, Rust runtime advantage, and tokio-default recommendation are not model-specific; they hold across three architecturally distinct models spanning a 2x parameter range (4.3B to 8B). This is sufficient evidence to promote these findings to production defaults.
The Qwen failure mode introduces a new decision axis: model-specific resource contention. Not all models are equally suited to multi-agent deployment on fixed hardware. Models with heavy KV-cache requirements or asymmetric throughput characteristics should be tested under the target multi-agent topology before deployment, even if single-agent benchmarks are favorable.
The Python ceiling finding has strategic implications. For organizations evaluating whether to invest in a Rust multi-agent stack, TR116 provides the quantitative justification: a 12-17 percentage-point efficiency gain across all models, with a 5-year TCO saving of approximately $9,000 per deployment on a single GPU. For teams that cannot adopt Rust, the ceiling defines the maximum achievable performance and prevents over-investment in Python-side optimizations that cannot overcome the structural limitation.
5.9.4 Limitations and boundary conditions
The study tests three models, which provides cross-model validation but not exhaustive coverage. Models with significantly different architectures (e.g., mixture-of-experts, very large context windows, or multi-modal inputs) may exhibit different efficiency patterns. The 4.3B-8B parameter range is relevant for single-GPU deployment but does not characterize behavior for models that require multi-GPU serving.
The 5-year TCO projection uses simplified assumptions (fixed GPU pricing, constant request volume, no hardware refresh). It is intended as an order-of-magnitude estimate for decision support, not as a financial forecast.
The Qwen throughput imbalance finding is observed and attributed to KV-cache contention, but the mechanism is not confirmed by GPU memory profiling. A targeted memory-bandwidth study would be needed to validate this attribution.
Both runtimes use the dual-Ollama topology. The study does not characterize Python's behavior under alternative serving strategies (e.g., dedicated GPU partitioning, CUDA MPS) that might mitigate the event-loop bottleneck.
5.9.5 Transferability
The Rust advantage finding (+12-17pp) is transferable to any dual-agent system using HTTP-based inference servers, with the caveat that the magnitude may vary with hardware and model. The structural advantage of Rust's multi-threaded runtime over Python's single-threaded event loop is independent of the specific models tested.
The Python ceiling (~86%) is transferable to any Python asyncio-based multi-agent system operating at similar throughput levels. It is a property of the CPython runtime, not of the specific application code.
The Qwen failure mode (throughput imbalance from KV-cache contention) is transferable as a diagnostic pattern: any model with disproportionately heavy memory-bandwidth requirements may exhibit similar asymmetric efficiency degradation under multi-instance deployment on shared GPU hardware.
The model-ranking inversion (Gemma leads in Rust, Llama leads in Python) is a cautionary finding: model benchmarks conducted in one runtime may not predict performance in another. Production model selection must be validated under the target runtime.
Evidence Snapshot (TR116)
| Metric | Value | Context |
|---|---|---|
| Total benchmarks | 60 | 3 models x 2 runtimes x 2 scenarios x 5 runs |
| Total wall-clock time | 12+ hours | Full cross-model validation |
| Gemma 3 Rust best | 99.2% | Homogeneous scenario |
| Llama 3.1 Rust best | 98.5% | Homogeneous scenario |
| Qwen 2.5 Rust best | 90.0% | Baseline scenario |
| Llama 3.1 Python best | 85.8% | Homogeneous scenario |
| Gemma 3 Python best | 84.9% | Homogeneous scenario |
| Qwen 2.5 Python best | 84.1% | Homogeneous scenario |
| Rust advantage range | +12-17pp | Across all models |
| Python ceiling | ~86% | No config exceeds this |
| Rust CV (all models) | <2% | Consistency advantage |
| Qwen throughput delta | ~30 tok/s | Between agents, KV-cache contention |
| 5-year TCO (Gemma Rust) | $58,000 | vs $67,000 Python |
6. Cross-Report Synthesis by Decision Axis
The preceding sections present each technical report as a self-contained result. This section synthesizes findings across all reports in the TR108-TR116 program along six decision axes. Each axis represents a question that a production team must answer, and the synthesis provides the evidence-backed answer within the stated boundary conditions.
The six axes are: language choice (Section 6.1), serving architecture (Section 6.2), async runtime selection (Section 6.3), model selection (Section 6.4), configuration transfer (Section 6.5), and the Python performance ceiling (Section 6.6). These axes are not independent -- language choice interacts with architecture, runtime interacts with model selection -- but they are presented separately because each corresponds to a distinct deployment decision with different stakeholders and different change costs.
6.1 Language Choice: Rust vs Python Across All Workloads
The TR108-TR116 program provides three independent lines of evidence on language choice, spanning single-agent, multi-agent, and cross-model workloads. The cumulative weight of this evidence is unambiguous: Rust is the production-grade choice for LLM inference orchestration, and Python is appropriate only for prototyping, experimentation, or workloads where development velocity outweighs runtime efficiency.
Single-agent evidence (TR112). In isolated single-agent benchmarks, Rust achieves +15.2% higher throughput than Python for equivalent inference tasks, -58% lower time-to-first-token (TTFT), and -67% lower memory consumption. These advantages arise from Rust's zero-cost abstractions, lack of garbage collection pauses, and compiled-native execution. The throughput advantage is moderate in absolute terms but compounds under sustained load: at 1 million tokens per day, the 15.2% throughput difference translates to approximately 2.5 fewer GPU-hours, which is a material cost saving at scale.
Multi-agent evidence (TR114_v2 vs TR110). Under identical dual-Ollama topology, Rust multi-agent achieves 98.281% mean efficiency versus Python's 95.8%, a +2.48 percentage-point advantage. More importantly, Rust's coefficient of variation is 2.5 percentage points lower, meaning its worst-case performance is closer to its best case. The peak efficiency difference is small (+0.15pp), confirming that both languages can approach the theoretical maximum under ideal conditions. The separation is in the floor and the tail: Rust's minimum observed efficiency across 135 runs is consistently higher than Python's.
Cross-model evidence (TR116). When tested across three architecturally distinct models (Gemma 3, Llama 3.1, Qwen 2.5), Rust maintains a +12-17 percentage-point efficiency advantage over Python in every model-scenario combination. This advantage is not model-specific; it is a property of the runtime. Python never exceeds 85.8% efficiency on any model, while Rust achieves 98.5-99.2% on two of three models and 90.0% on the problematic Qwen (which is model-limited, not runtime-limited).
Synthesis. The language choice decision is not close. Rust's advantages are consistent, large, and structurally grounded. They are not artifacts of benchmarking methodology or configuration tuning; they reflect fundamental differences in runtime architecture (multi-threaded vs single-threaded, compiled vs interpreted, deterministic memory management vs garbage collection). The only scenario where Python is preferable is rapid prototyping, where development speed matters more than runtime efficiency and the inference load is low enough that the 12-17pp efficiency gap does not translate to material cost.
Decision rule. For any production deployment serving more than 10,000 requests per day, use Rust. For internal experimentation, ad-hoc analysis, or proof-of-concept systems with low concurrency, Python is acceptable. If the system will eventually be promoted to production, begin with Rust to avoid a costly rewrite.
6.2 Architecture: Single vs Dual Ollama
The serving topology decision is the single highest-impact architectural choice in the TR108-TR116 program. No other variable -- language, runtime, model, configuration -- produces a comparable effect size.
Single Ollama (TR113). Under single-Ollama topology, Rust multi-agent achieves a peak efficiency of 82.2% and a mean of 72.4%. Contention is detected in 74% of configurations (14 of 19). The serialization mechanism is well-understood: a single Ollama instance processes one inference request at a time, and concurrent requests are queued at the HTTP level. With two agents, this creates a strict alternation pattern where one agent is always waiting while the other is inferring, producing a theoretical efficiency ceiling of approximately 50% (if inference times are equal) to 85% (if inference times are highly asymmetric and overlap with I/O).
Dual Ollama (TR114_v2). Under dual-Ollama topology (one instance per agent), Rust multi-agent achieves a peak efficiency of 99.4% and a mean of 98.3%. Contention drops to 0.74% (1 of 135 runs). The improvement is 17.2 percentage points at peak and 25.9 percentage points at mean. The contention reduction is 62 percentage points (from 74% to 0.74%, expressed as a fraction of configurations exhibiting contention).
Causal confidence. This is the strongest causal finding in the program. TR113 and TR114_v2 share the same Rust orchestrator, the same model, the same hardware, and overlapping configuration spaces. The only change is the Ollama instance count. The 135-benchmark design of TR114_v2 (versus TR113's 19 single-run configurations) provides statistical power to confirm that the improvement is not a sampling artifact.
Cost of the fix. Running two Ollama instances on the same machine has minimal overhead. The model weights are memory-mapped and shared via the operating system's page cache, so the VRAM cost is approximately one additional KV-cache allocation (model-dependent, typically 100-500 MB). The CPU cost is one additional Ollama process (~50 MB RSS). Relative to the 17-26pp efficiency recovery, this cost is negligible.
Generalization. The serialization bottleneck is not Ollama-specific. Any inference server that processes requests sequentially (whether by design or by default configuration) will exhibit similar behavior under concurrent load. The dual-instance pattern is therefore a general architectural principle for multi-agent systems: dedicate one inference endpoint per concurrent agent unless the serving framework explicitly supports concurrent inference with documented parallelism guarantees.
Decision rule. For any multi-agent deployment with N concurrent agents, provision N Ollama instances (or N inference endpoints). Verify that the GPU can accommodate N concurrent KV-cache allocations within VRAM. If VRAM is insufficient, reduce model size or context length before reducing instance count; the efficiency penalty of instance reduction (15-20pp per missing instance) almost always exceeds the throughput penalty of smaller models.
6.3 Runtime Selection: Five Runtimes, Consistency Over Peak
TR115_v2 tests five Rust async runtimes and produces a clear production recommendation. The key insight is that peak efficiency is not the decision criterion; consistency is. All four working runtimes achieve peaks within 0.12 percentage points of each other (99.87-99.99%), but their standard deviations range from 1.21pp to 4.87pp, and their minimum observed efficiencies range from 72.80% to 94.98%.
The work-stealing principle. Tokio-default's work-stealing scheduler is the structural reason for its consistency advantage. In an I/O-bound multi-agent workload, task durations are variable: one HTTP request may complete in 50ms while another takes 200ms. A work-stealing scheduler detects idle threads and reassigns them to pending tasks, maintaining pipeline utilization even when task durations are asymmetric. A cooperative scheduler (smol) or a pinned scheduler (tokio-localset) cannot perform this rebalancing, leading to idle-thread accumulation and efficiency drops during asymmetric phases.
The work-stealing advantage is quantified at +12.86 percentage points on the ctx=2048 configuration (99.29% for tokio-default versus 86.43% for tokio-localset). This is the most demanding configuration in the test matrix, where longer context windows produce more variable inference times and thus more opportunities for work-stealing to compensate.
The async-std catastrophe. Async-std achieves exactly 50.00% efficiency with zero variance. This is not a performance degradation; it is a complete failure caused by reqwest's hard Tokio dependency. When async-std drives reqwest, the internal Tokio runtime serializes all HTTP I/O, converting dual-agent concurrent execution into strict sequential alternation. The result is mathematically exact: 50% of theoretical throughput, with no stochastic variation. This finding effectively excludes async-std from any Rust project that uses reqwest for HTTP communication, which includes the vast majority of Rust web clients.
The smol tail risk. Smol's pathological drop to 72.80% on a single run is the most concerning finding for production use. A 27 percentage-point deviation from peak, occurring unpredictably, would violate any reasonable SLO. The root cause is the absence of work-stealing: when one task blocks unexpectedly (e.g., due to a slow HTTP response or a GC pause in the Ollama process), no scheduler intervention recovers the pipeline. This makes smol unsuitable for latency-sensitive production workloads despite its otherwise acceptable mean performance.
Synthesis. The runtime decision reduces to a single recommendation: #[tokio::main] with default configuration. No thread pool customization, no LocalSet pinning, no alternative runtimes. This recommendation is robust across all configurations tested and is grounded in a structural advantage (work-stealing) rather than a tuning artifact.
Decision rule. Use tokio-default. If organizational constraints require an alternative runtime, smol-1kb is the only acceptable fallback, with the understanding that its tail behavior is worse (sigma 1.32pp vs 1.21pp, min 94.98% vs 94.80%). Never use async-std with reqwest. Never use tokio-localset for multi-agent workloads.
6.4 Model Selection: Gemma vs Llama vs Qwen
TR116 provides the first cross-model multi-agent comparison in the program. The results introduce a nuance that single-model benchmarks cannot capture: model ranking depends on the runtime, and the interaction between model characteristics and runtime architecture can invert performance orderings.
Rust rankings. Under Rust with dual Ollama, Gemma 3 (4.3B) achieves the highest efficiency at 99.2%, followed by Llama 3.1 (8B) at 98.5%, and Qwen 2.5 (7B) at 90.0%. Gemma's advantage reflects its smaller size and more efficient KV-cache utilization: with lower memory-bandwidth requirements, two Gemma instances coexist more comfortably on the RTX 4080's 12 GB VRAM. Llama's slightly lower efficiency is consistent with its larger parameter count and correspondingly higher KV-cache pressure, but the 1.7pp gap from Gemma is operationally negligible.
Python rankings. Under Python with dual Ollama, the ordering shifts: Llama 3.1 leads at 85.8%, followed by Gemma 3 at 84.9%, and Qwen 2.5 at 84.1%. The spread is much narrower (1.7pp total), reflecting the fact that Python's event-loop bottleneck equalizes model differences. When the runtime is the constraint, model-level throughput differences are compressed into the noise floor.
The Qwen anomaly. Qwen 2.5 underperforms in both runtimes despite having an intermediate parameter count. The root cause is a persistent throughput imbalance: one agent consistently generates approximately 30 tokens per second faster than the other. This asymmetry is attributable to Qwen's heavier KV-cache footprint, which creates memory-bandwidth contention when two instances share the GPU's memory bus. The result is that Qwen's multi-agent efficiency is model-limited, not runtime-limited: even Rust's advantages cannot compensate for the underlying resource contention.
The ranking inversion. The fact that Gemma leads in Rust but Llama leads in Python is not a contradiction; it is a predictable consequence of the interaction between model throughput and runtime overhead. Gemma's higher raw throughput benefits most from Rust's lower per-request overhead, because each millisecond of overhead saved translates to a larger fraction of Gemma's shorter inference time. Llama's longer inference time dilutes the overhead effect, making the Python-to-Rust improvement proportionally smaller. Under Python, where the event loop is the bottleneck, Llama's slower inference gives the event loop more time to process I/O between requests, producing slightly higher utilization.
Synthesis. Model selection for multi-agent deployment must account for the runtime. A model benchmarked under Python may not be the best choice for a Rust deployment. The correct procedure is: (1) select the target runtime first (Rust, per Section 6.1), (2) benchmark candidate models under that runtime with the target agent count and Ollama topology, and (3) select based on the efficiency and cost numbers from that specific configuration. Single-agent benchmarks are informative but not sufficient for multi-agent model selection.
Decision rule. For dual-agent Rust deployments on the RTX 4080, Gemma 3 is the default recommendation (99.2% efficiency, lowest 5-year TCO at $58,000). Llama 3.1 is the alternative when Gemma's reasoning capability is insufficient (98.5% efficiency, marginal cost premium). Avoid Qwen 2.5 for multi-agent deployment on this hardware unless the throughput imbalance is resolved at the model or serving level. Always validate model selection under the target runtime and topology.
6.5 Configuration Transfer: Single-Inference to Agent to Multi-Agent
One of the most persistent mistakes in LLM deployment is assuming that optimal single-inference parameters transfer to agentic or multi-agent workloads. The TR108-TR116 program provides direct evidence that this assumption is false, and that each deployment mode requires independent configuration validation.
Single-inference optimum (TR108). In single-inference benchmarks, the optimal configuration is GPU=999 (full offload), CTX=4096 (maximum context), with aggressive GPU utilization. This makes intuitive sense: with no competing workloads, the model should occupy all available GPU resources to minimize inference latency.
Agent workflow optimum (TR109). In single-agent agentic workflows, the optimal configuration shifts to GPU=60, CTX=512. The reduction in GPU allocation and context size reflects the overhead of agent orchestration: the agent framework consumes CPU and memory for state management, tool invocation, and response parsing, which competes with the inference engine for system resources. Over-allocating GPU layers in this regime does not improve end-to-end throughput because the agent framework, not the model, is the bottleneck during non-inference phases.
Multi-agent optimum (TR110, TR114_v2). In multi-agent deployments, the optimal configuration shifts again to GPU=80, CTX=2048. The increase from agent-workflow levels reflects the need to balance two competing inference instances: each needs sufficient GPU resources to maintain throughput, but over-allocation (GPU=120+) creates GPU scheduling contention that degrades both instances (the TR114_v2 anti-pattern at 91.60%).
The non-monotonicity. The sequence GPU=999 to GPU=60 to GPU=80 is non-monotonic: the multi-agent optimum is higher than the agent optimum but lower than the single-inference optimum. This non-monotonicity defeats any simple extrapolation from one deployment mode to another. It also defeats the common heuristic of "give the model as much GPU as possible": in multi-agent mode, this heuristic produces the GPU=120 anti-pattern and costs 6.7 percentage points of efficiency.
Cross-report evidence chain. The configuration transfer failure is documented across four reports:
- TR108: single-inference baseline (GPU=999, CTX=4096 optimal)
- TR109: single-agent workflow (GPU=60, CTX=512 optimal)
- TR110: Python multi-agent (GPU=80, CTX=1024 competitive)
- TR114_v2: Rust multi-agent (GPU=80, CTX=2048 competitive, GPU=120 anti-pattern)
Each transition involves a qualitative change in the system's bottleneck structure: from model inference alone, to model-plus-agent-framework, to multiple-models-plus-multiple-frameworks-plus-shared-GPU. These qualitative changes invalidate parameter extrapolation.
Synthesis. Configuration must be validated per deployment mode. There is no shortcut. A configuration that is optimal for single inference can be suboptimal or actively harmful in a multi-agent context. The validation cost is modest (a few hours of benchmarking per mode), and the cost of skipping validation is severe (up to 40 percentage points of efficiency loss, as demonstrated by the gap between TR108's single-inference GPU=999 and TR109's agent-workflow GPU=60).
Decision rule. When deploying a new mode (single-inference, single-agent, multi-agent), run a configuration sweep covering GPU layer allocation, context size, and temperature. Do not transfer parameters from a different deployment mode. Budget 2-4 hours for the sweep and treat it as a mandatory deployment gate.
6.6 The Python Ceiling: Event Loop Saturation
The most consequential cross-report finding for language strategy is the Python performance ceiling. Across all multi-agent experiments in the TR108-TR116 program, Python never exceeds approximately 86% multi-agent efficiency. This ceiling is not a configuration artifact, a model-specific effect, or a measurement error. It is a structural limitation of the CPython runtime's asyncio event loop.
Evidence chain. The ceiling is observed independently in three reports:
- TR110: Python multi-agent peak at 99.25%, but mean at 95.8% with high variance (this represents the dual-Ollama topology but with specific configurations that happened to avoid saturation)
- TR116 (Python, all models): no configuration exceeds 85.8% efficiency. Llama 3.1 achieves 85.8%, Gemma 3 achieves 84.9%, Qwen 2.5 achieves 84.1%. The clustering within a 1.7pp band across three architecturally distinct models confirms that the constraint is runtime-level, not model-level.
The ceiling becomes binding when aggregate throughput across both agents exceeds approximately 100 tokens per second. Below this threshold, the event loop has sufficient headroom to process HTTP I/O, JSON parsing, state management, and scheduling without introducing delays. Above this threshold, the single-threaded event loop cannot keep up, and tasks begin to queue.
Mechanism. Python's asyncio event loop is single-threaded. All I/O callbacks, coroutine scheduling, and application logic execute on a single OS thread. In a dual-agent multi-agent workload, the event loop must process:
- HTTP response parsing for Agent A's inference result
- JSON deserialization of the response payload
- State update and next-prompt construction for Agent A
- HTTP request dispatch for Agent A's next inference call
- The same four steps for Agent B
- Scheduling overhead for interleaving the above
At moderate throughput (50-80 tok/s aggregate), steps 1-6 complete within the inter-inference gap, and the event loop does not introduce visible delay. At high throughput (100+ tok/s aggregate), the cumulative processing time for steps 1-6 exceeds the inter-inference gap, and Agent B's next request is delayed while the event loop processes Agent A's callback. This delay creates GPU idle time on Agent B's Ollama instance, reducing measured efficiency.
Why code optimization cannot fix this. The bottleneck is not in any specific Python function; it is in the event loop's single-threaded execution model. Optimizing JSON parsing (e.g., switching to orjson) reduces per-callback time but does not change the fundamental constraint: all callbacks share one thread. Using uvloop (a C-based event loop implementation) can reduce scheduling overhead by approximately 20-30%, but this shifts the ceiling from ~86% to ~90%, not to 99%. The only way to approach Rust's 98-99% efficiency from Python would be to fundamentally restructure the concurrency model -- for example, using multiprocessing with dedicated event loops per agent -- which introduces IPC overhead, shared-state complexity, and effectively rebuilds the system architecture.
Cross-language comparison. Rust does not have this limitation because tokio-default runs a multi-threaded work-stealing executor. HTTP callbacks for Agent A and Agent B can execute on different OS threads simultaneously, eliminating the serialization that Python's event loop imposes. This is not a matter of implementation quality; it is a fundamental architectural difference between single-threaded cooperative multitasking (Python asyncio) and multi-threaded preemptive work-stealing (Tokio).
Quantified impact. The efficiency gap attributable to the Python ceiling is 12-17 percentage points across all models (TR116). At the Gemma 3 efficiency levels (99.2% Rust vs 84.9% Python), this represents a 14.3pp gap. Over a 5-year deployment horizon, this gap translates to approximately $9,000 in additional GPU costs per single-GPU deployment ($67,000 Python vs $58,000 Rust for Gemma 3), assuming continuous operation at measured throughput levels.
Scaling implications. The Python ceiling becomes more constraining as models become faster. As inference engines improve and models generate tokens more quickly, the event loop must process callbacks at a higher rate. A model that generates 150 tok/s per agent would push the ceiling below 80%. Conversely, very slow models (e.g., large models at 20 tok/s per agent) may not trigger the ceiling at all, making the language choice less consequential for those workloads.
Synthesis. The Python ceiling is the strongest argument for Rust adoption in the TR108-TR116 program. It is not a bug to be fixed but a structural property of the runtime. Organizations that plan to serve high-throughput multi-agent workloads (>100 tok/s aggregate) have no path to 95%+ efficiency in Python. Rust is not merely faster; it is architecturally necessary for this performance regime.
Decision rule. If aggregate multi-agent throughput will exceed 100 tok/s, Rust is mandatory. If throughput is below 50 tok/s and development velocity is the priority, Python is acceptable with the understanding that efficiency will not exceed ~86%. In the 50-100 tok/s range, the decision depends on cost sensitivity: at scale, the 12-17pp efficiency gap compounds into material budget differences; at small scale, it may be acceptable.
6.7 Summary of Decision Axes
The following table consolidates the six decision axes into a single reference for production teams.
| Decision Axis | Key Finding | Evidence Base | Effect Size | Recommendation |
|---|---|---|---|---|
| Language (Rust vs Python) | Rust superior across all modes | TR112, TR114_v2, TR116 | +12-17pp efficiency | Rust for production |
| Architecture (single vs dual Ollama) | Dual Ollama eliminates contention | TR113, TR114_v2 | +17-26pp efficiency | N instances for N agents |
| Runtime (async) | Work-stealing essential for consistency | TR115_v2 | sigma 1.21pp vs 4.87pp | tokio-default, no custom config |
| Model selection | Rankings depend on runtime | TR116 | Gemma leads Rust, Llama leads Python | Validate under target runtime |
| Configuration transfer | Optimal configs do not transfer across modes | TR108-TR110, TR114_v2 | Up to 40pp penalty | Sweep per deployment mode |
| Python ceiling | ~86% max multi-agent efficiency | TR110, TR116 | Structural, unfixable | Rust mandatory above 100 tok/s |
These six axes are ordered by descending effect size. The architecture decision (single vs dual Ollama) has the largest absolute impact (+17-26pp). The language decision has the largest cross-workload consistency (Rust advantage observed in every comparison). The runtime decision has the smallest absolute impact but the highest leverage on tail behavior and SLO compliance. Configuration transfer is a process discipline rather than a technology choice, but its neglect can produce the largest penalties.
The interaction structure is worth noting: architecture and language are largely independent (both matter, and fixing one does not eliminate the need to fix the other). Runtime and language are coupled (the runtime recommendation is Rust-specific). Model selection and language interact (rankings flip). Configuration transfer applies universally. The Python ceiling constrains the entire language axis and cannot be mitigated by any other decision.
Taken together, these six axes define the decision space for multi-agent LLM inference on the tested hardware baseline. A production team that adopts all six recommendations -- Rust, dual Ollama, tokio-default, model-validated, mode-specific configuration, and awareness of the Python ceiling -- will operate at 98-99% multi-agent efficiency for compatible models on this hardware. A team that violates any single axis risks a 5-27 percentage-point efficiency penalty, with the architecture axis carrying the highest individual risk.
7. Economics and Capacity
The economic analysis synthesizes measured performance deltas into infrastructure cost projections, capacity planning heuristics, and break-even timelines. All figures derive from the empirical measurements reported in TR108-TR116 and are bounded to the tested hardware baseline (RTX 4080 Laptop, 12GB VRAM, Intel i9-13980HX). Cost projections use shadow pricing anchored to comparable cloud instance rates; they are not drawn from actual billing data and should be treated as directional estimates within the stated boundary conditions.
7.1 Memory footprint and instance density
The most immediate economic lever is memory consumption. Rust agent processes operate at 65-90 MB resident memory, compared to 300-350 MB for equivalent Python agents (TR112_v2). The static binary footprint reinforces this advantage: a Rust release binary occupies approximately 15 MB, while the Python runtime with its dependency tree requires approximately 100 MB on disk.
| Metric | Python | Rust | Delta |
|---|---|---|---|
| Process RSS (typical) | 300-350 MB | 65-90 MB | -67% |
| Binary / runtime footprint | ~100 MB | ~15 MB | -85% |
| Per-instance savings | -- | ~250 MB | -- |
The operational consequence is instance density. On a fixed-memory host, a Rust deployment can run approximately twice as many agent processes as an equivalent Python deployment before exhausting available RAM. For a fleet provisioned at 4 x 8 GB RAM instances under Python (total: $800/month at $200/instance), the same workload fits on 2 x 4 GB RAM instances under Rust (total: $160/month at $80/instance). This is a 5x cost reduction on the memory-provisioning axis alone.
This advantage compounds under multi-agent deployment. TR110 and TR114_v2 demonstrate that multi-agent workloads require concurrent process memory for each agent. A dual-agent Python deployment consumes 600-700 MB; the equivalent Rust deployment consumes 130-180 MB. On hardware with 8 GB system RAM available to user processes, Python supports approximately 10-11 agents before memory pressure; Rust supports approximately 44-60. While the tested scope is limited to two concurrent agents, the memory headroom for scaling is structurally different.
7.2 Throughput advantage and effective capacity
Raw throughput is the second economic lever. TR112_v2 establishes a 15.2% single-agent throughput advantage for Rust (114.54 tok/s vs 99.34 tok/s). This translates directly into effective capacity: per unit time, a Rust agent serves 15.2% more tokens than a Python agent on identical hardware and model configuration.
| Deployment mode | Python | Rust | Delta |
|---|---|---|---|
| Single-agent throughput | 99.34 tok/s | 114.54 tok/s | +15.2% |
| Multi-agent peak efficiency | 99.25% | 99.396% | +0.15pp |
| Multi-agent mean efficiency | 95.8% | 98.281% | +2.48pp |
| Throughput CV (cross-config) | 1.8% | 0.24% | -87% |
The consistency advantage is economically significant even when absolute throughput differences are small. Rust's 0.24% cross-configuration coefficient of variation (TR112_v2) means that capacity planning can use tighter margins. Python's 1.8% CV requires provisioning additional headroom to absorb configuration-dependent variance. Under SLO-driven deployment, tighter variance translates to fewer excess instances, which translates to lower cost.
In multi-agent mode, the efficiency delta is smaller in absolute terms (98.281% vs 95.8%, a +2.48pp advantage for Rust) but compounds with throughput. The effective multi-agent throughput for Rust, accounting for both the single-agent speed advantage and the higher parallel efficiency, is approximately 17-18% greater than Python's per-instance output.
7.3 Cost projections at scale
Cost projections are computed under two scenarios: an optimistic deployment matching the measured performance directly, and a conservative estimate derived from TR112's 1M requests/month baseline.
Scenario A: Measured-performance deployment
| Cost component | Python | Rust |
|---|---|---|
| Instance count | 4 x 8 GB RAM | 2 x 4 GB RAM |
| Monthly instance cost | $800 | $160 |
| Monthly savings | -- | $640 |
| Annual savings | -- | $7,680 |
Scenario B: Conservative (1M requests/month, TR112 baseline)
At 1M requests per month with modest instance sizing, the savings range narrows to $120-640/month depending on whether instances are right-sized for memory or throughput. The conservative annual savings estimate is $1,440 at the lower bound and $3,040 at the midpoint.
5-year total cost of ownership:
| Deployment mode | Python TCO (5yr) | Rust TCO (5yr) | Delta |
|---|---|---|---|
| Single-agent | $57,500 | $42,600 | -26% |
| Multi-agent | $67,000 | $58,000 | -13% |
The multi-agent TCO gap is narrower because Rust's multi-agent advantage (+2.48pp efficiency) is smaller than its single-agent advantage (+15.2% throughput). The architectural cost of dual Ollama deployment is identical for both languages, so the infrastructure savings come primarily from memory density and throughput capacity.
7.4 Break-even analysis for Rust migration
The migration from Python to Rust carries a one-time development cost. Based on the complexity of the agent workflow (HTTP client, JSON parsing, file I/O, async coordination, metric instrumentation), the estimated additional development effort for a production-grade Rust agent is $11,000-$14,000, reflecting the higher engineering cost of Rust relative to Python for equivalent functionality.
| Metric | Optimistic | Conservative |
|---|---|---|
| Migration cost | $11,000 | $14,000 |
| Monthly savings | $640 | $120-253 |
| Break-even period | 12.7 months | 20+ months |
| 5-year net savings | $15,000+ | $9,000+ |
The break-even period is sensitive to request volume. At volumes below approximately 500,000 requests/month, the savings may not justify the migration cost within a 24-month window. At volumes exceeding 2M requests/month, break-even drops below 8 months. The decision to migrate should therefore be gated on projected request volume, not on raw performance numbers.
7.5 Contention as hidden cost
Resource contention imposes a cost that does not appear in instance pricing but manifests as degraded throughput and increased tail latency. TR113 demonstrated that Python multi-agent deployments on a single Ollama instance experience 10-15% throughput contention; TR114_v2 showed that Rust with dual Ollama reduces contention to 0.74%.
| Metric | Python (single Ollama) | Rust (dual Ollama) | Delta |
|---|---|---|---|
| Contention rate | 10-15% | 0.74% | -10-14pp |
| Efficiency floor | ~85% | ~95% | +10pp |
The hidden cost of contention is capacity waste: if 10% of throughput is lost to contention, the fleet must be 11% larger to serve the same load. Over the 5-year TCO horizon, this amounts to an additional $3,000-$6,000 in shadow cost for a Python deployment that does not adopt dual Ollama architecture.
7.6 Summary of economic findings
The economic case for Rust is strongest in high-volume, multi-agent, memory-constrained deployments. The case weakens for low-volume, single-agent, latency-insensitive workloads where Python's lower development cost and faster iteration speed dominate. The break-even analysis provides a quantitative gate: if monthly savings exceed $600, migration pays for itself within 18 months. Below that threshold, the decision should weigh operational benefits (consistency, memory safety, deployment simplicity) against the development cost premium.
8. Operational Doctrine
This section translates measured findings into prescriptive rules for production deployment. Each rule is stated as a conditional policy with explicit invalidation criteria. The doctrine is decision-grade within the measured stack and should not be applied outside the boundary conditions established in Sections 3 and 9.
8.1 Language selection rules
Rule 1 (Default). For all production multi-agent LLM workloads, default to Rust. The +2.48pp mean efficiency advantage (TR114_v2), -67% memory footprint (TR112_v2), and -87% throughput variance (TR112_v2) collectively make Rust the lower-risk production choice for concurrent agent systems.
Rule 2 (Python exception). Python is acceptable for prototyping, research pipelines, single-shot inference where the latency SLO is greater than 2 seconds, and development workflows where iteration speed outweighs runtime performance. Python should not be used in production multi-agent deployments unless the efficiency SLO is below 86%.
Rule 3 (Efficiency gate). If the target parallel efficiency exceeds 86%, Rust is mandatory. TR116 demonstrates that Python never exceeds 86% efficiency in multi-agent scenarios across all tested model-runtime combinations. Rust consistently achieves 90-99% efficiency under the same conditions. This is a hard ceiling, not a statistical artifact: it reflects structural differences in runtime scheduling and I/O coordination.
Invalidation gate. The language selection rules are invalidated if the Python runtime ecosystem acquires true multi-threading capabilities (e.g., removal of the GIL with full async I/O parity), if a new Python inference framework eliminates the measured coordination overhead, or if the workload shifts to a regime not tested in this program (e.g., batch inference, streaming with backpressure).
8.2 Architecture policy: dual Ollama
Rule. Always deploy dual Ollama instances for multi-agent workloads. The primary instance listens on port 11434; the secondary instance listens on port 11435. Each agent is assigned exclusively to one instance.
Evidence basis. TR113 demonstrated that single-Ollama multi-agent deployment produces 63% resource contention and 82.2% peak efficiency. TR114 showed that dual Ollama deployment reduces contention to 6% and raises efficiency to 95.7%. TR114_v2 further improved this to 98.281% mean efficiency with 0.74% contention. The architectural intervention (+17pp efficiency gain, 10x contention reduction) exceeds any language-level or runtime-level optimization measured in the program.
Exception. Single Ollama deployment is acceptable only for single-agent or strictly sequential workflows where no concurrent model inference occurs. If agents take turns rather than running in parallel, the serialization overhead of a single instance is irrelevant.
Invalidation gate. This rule is invalidated if Ollama implements native multi-model concurrency (i.e., true parallel inference on a single instance without server-level serialization), or if the deployment hardware supports sufficient VRAM to load multiple model instances within a single Ollama process.
8.3 Runtime policy: Tokio-default
Rule. Use #[tokio::main] with the default multi-threaded work-stealing scheduler for all Rust agent deployments. Do not override the default thread count or scheduler configuration unless profiling demonstrates a specific bottleneck.
Evidence basis. TR115_v2 tested five async runtime configurations across 150 benchmark runs. Tokio-default achieved 99.89% peak efficiency, 98.72% mean efficiency, and 1.21pp standard deviation, making it the most reliable configuration. Tokio-localset achieved a marginally higher peak (99.99%) but with 4.03pp standard deviation and an 81.03% minimum, making it unsuitable for production use where worst-case matters.
Prohibited configurations:
- async-std: Measured at 50% efficiency due to fundamental incompatibility with Tokio-based HTTP clients. This is not a tuning problem; it is a structural dependency conflict.
- smol without custom buffering: Performs equivalently to Tokio-default but offers no advantage. Unless binary size is constrained below 5 MB (where smol's smaller footprint is relevant), the additional dependency and testing burden is not justified.
Invalidation gate. Reconsider if Tokio introduces breaking changes to the work-stealing scheduler, if a new runtime demonstrates measurable advantages in LLM inference workloads with Ollama, or if binary size constraints below 5 MB become a deployment requirement.
8.4 Model routing policy
Model selection for multi-agent deployments should follow measured efficiency and throughput characteristics.
| Use case | Recommended model | Efficiency (Rust) | Rationale |
|---|---|---|---|
| Throughput / scaling | Gemma 3 (4.3B, Q4_K_M) | 99.2% | Highest efficiency, lowest TCO |
| Reasoning-heavy tasks | Llama 3.1 8B (Q4_0) | 98.5% | Near-Gemma efficiency with superior reasoning |
| Avoid for multi-agent | Qwen 2.5 7B | 90.0% | 7-9pp efficiency penalty, throughput imbalance |
Evidence basis. TR116 tested three model families across both Rust and Python runtimes. Gemma 3 achieved the highest multi-agent efficiency in both languages (97.3% Rust baseline-vs-chimera, 80.2% Python). Llama 3.1 8B scaled nearly as well as Gemma (96.5% Rust) despite lower absolute throughput (68 tok/s vs 100 tok/s). Qwen 2.5 7B exhibited 13-19pp lower efficiency than Gemma and Llama in multi-agent scenarios, likely due to heavier KV cache utilization or different attention patterns.
Routing rule. Default to Gemma 3 for all multi-agent workloads unless task-specific quality requirements favor Llama 3.1. Do not deploy Qwen 2.5 in multi-agent configurations unless the efficiency penalty is explicitly accepted and budgeted.
8.5 Configuration guidance by deployment mode
Optimal parameter configurations differ by deployment mode. TR108, TR109, and TR111_v2 demonstrate that configurations optimal for single-inference do not transfer to agent workflows, and TR110/TR114_v2 show that agent configurations do not transfer to multi-agent concurrent execution without revalidation.
| Deployment mode | GPU layers | Context (tokens) | Temperature | Notes |
|---|---|---|---|---|
| Single-inference | 40-999 | 1024-4096 | Model-dependent | GPU=999 for full offload on capable hardware |
| Agent workflow | 60 | 256-512 | 0.6-0.8 | Lower context reduces per-step latency |
| Multi-agent concurrent | 80 | 512-2048 | 0.8-1.0 | GPU>=80 required for contention-free execution |
Rule. Never transfer configurations between deployment modes without empirical validation. A configuration that achieves 99% efficiency in single-agent mode may produce 73% efficiency in multi-agent mode (TR110, Test 2: GPU=60, CTX=1024).
8.6 Invalidation conditions
The entire operational doctrine is conditional on the measured stack. The following changes invalidate guidance and require re-benchmarking:
-
Hardware change. Any change in GPU model, VRAM capacity, or CPU architecture. The measured results are specific to RTX 4080 Laptop (12GB VRAM) with Intel i9-13980HX. Desktop or server GPUs may exhibit different memory bandwidth, scheduling, and thermal characteristics.
-
Ollama version update. Ollama's inference engine, model loading, and API behavior can change between versions. Any version upgrade requires a validation benchmark before deploying under existing configuration guidance.
-
Rust toolchain major version change. While Rust's stability guarantees minimize runtime behavioral changes, a major toolchain version (e.g., edition change, async runtime ABI modifications) warrants validation.
-
New model family. The model routing policy (Section 8.4) is validated only for Gemma 3, Llama 3.1, and Qwen 2.5. A new model family (e.g., Mistral, Phi, DeepSeek) must be validated for multi-agent scaling before deployment.
-
Operating system change. All measurements were conducted on Windows 11. Linux and macOS may exhibit different process scheduling, memory management, and I/O behavior. The dual Ollama architecture may perform differently under Linux's process isolation model.
-
Agent count exceeding two. The program validates only dual-agent concurrent execution. Scaling to three or more agents introduces untested contention patterns, memory pressure, and scheduling dynamics.
9. Threats to Validity
This section enumerates the principal threats to the internal and external validity of the findings presented across TR108-TR116. Each threat is stated with its scope of impact and, where applicable, a mitigation or bounding argument.
9.1 Single hardware baseline
All 903+ benchmark runs were conducted on a single hardware configuration: NVIDIA GeForce RTX 4080 Laptop GPU (12GB VRAM) with Intel i9-13980HX CPU, running Windows 11. This is a laptop-class system, not a desktop workstation or server. Thermal throttling, power delivery limitations, and mobile GPU clock behavior may produce results that differ systematically from desktop or datacenter hardware. Absolute throughput numbers are not portable; relative rankings and architectural conclusions (e.g., dual Ollama advantage) are more likely to transfer, but this is not empirically verified.
9.2 Windows-only measurement environment
The entire benchmark corpus was collected on Windows 11. Process scheduling, memory management, I/O system calls, and GPU driver behavior differ between Windows and Linux. The asyncio and Tokio runtimes may exhibit different performance characteristics on Linux, where epoll-based I/O is structurally different from Windows IOCP. The dual Ollama architecture relies on process isolation, which is implemented differently across operating systems. Conclusions about Rust-vs-Python runtime efficiency may not hold on Linux, where Python's asyncio may benefit from lower system call overhead.
9.3 Limited concurrency scope
The multi-agent analysis is restricted to two concurrent agents. The scaling behavior from one to two agents does not necessarily predict behavior at three, four, or eight agents. Contention patterns, VRAM pressure, and scheduler behavior may exhibit phase transitions at higher agent counts. The efficiency numbers reported (98-99% for Rust, 95-99% for Python) are specific to the dual-agent regime and should not be extrapolated.
9.4 Single inference backend
Ollama is the sole inference backend tested for multi-agent workloads. Other backends (vLLM, TensorRT-LLM, llama.cpp server mode, HuggingFace Text Generation Inference) may produce different throughput, latency, and contention characteristics. The dual-instance architecture insight may not apply to backends that natively support concurrent model inference.
9.5 Limited model family coverage
Gemma 3 (4.3B) is the primary model for the majority of benchmarks. Llama 3.1 8B and Qwen 2.5 7B are tested in TR116 but with fewer configurations. Models with different architectures (mixture-of-experts, state-space models, linear attention) or different parameter counts (1B, 70B) may exhibit different scaling and contention behavior. The model routing policy is validated only for the three tested families within the 4-8B parameter range.
9.6 Quantization specificity
The primary quantization level is Q4_K_M for Gemma 3 and Q4_0 for Llama 3.1. Other quantization levels (Q2_K, Q5_K, Q8_0, FP16) were not systematically tested in the multi-agent context. Quantization affects memory footprint, compute intensity, and potentially contention behavior. The economic projections assume Q4-level quantization and may not hold for higher-precision deployments.
9.7 Quality not measured in multi-agent context
The program measures throughput, latency, efficiency, and contention. It does not measure output quality, accuracy, or task completion rate in multi-agent scenarios. It is possible that configurations which maximize throughput efficiency produce lower-quality outputs due to temperature, context truncation, or model saturation effects. The economic projections assume quality parity across configurations, which is not empirically validated.
9.8 Shadow pricing
Cost projections use shadow pricing derived from comparable cloud instance rates, not actual billing data. Shadow prices are approximations that may not reflect negotiated rates, spot pricing, reserved instance discounts, or provider-specific pricing structures. The 5-year TCO projections assume stable pricing, which is historically unrealistic for cloud compute. The projections should be treated as order-of-magnitude estimates for decision framing, not as financial forecasts.
9.9 Process isolation overhead
The dual Ollama architecture relies on operating system process isolation. The overhead of running two separate Ollama server processes (memory, context switching, port management) is measured on Windows but may differ on Linux or containerized environments. Docker-based deployments add an additional layer of isolation that could affect the measured contention and efficiency results.
10. Limitations by Report and Mitigations
This section consolidates the principal limitation of each report in the TR108-TR116 arc, paired with the mitigation strategy employed within the program and the residual risk that persists after mitigation.
| Report | Key Limitation | Mitigation | Residual Risk |
|---|---|---|---|
| TR108 | Single-inference only; no agent workflow or multi-step evaluation. Configuration recommendations assume isolated LLM calls. | TR109 extends to agent workflows; TR110 extends to multi-agent. Cross-validated in TR112_v2. | Optimal configs for single-inference (GPU=999, CTX=4096) are misleading if applied to agent or multi-agent modes without revalidation. |
| TR109 | Python-only agent workflows; Rust not yet implemented. 20 configurations with limited statistical replication. | TR111/TR111_v2 provides Rust parity. TR112_v2 performs cross-language comparison with matched methodology. | Python agent workflow baselines may not generalize to other languages or runtimes not tested. |
| TR110 | Python multi-agent tested on single hardware with dual Ollama. Peak efficiency (99.25%) may reflect favorable scheduling on tested OS/hardware. | TR114_v2 validates dual Ollama on Rust. TR116 cross-validates with multiple models. | Peak numbers not validated on Linux or server hardware. 150 runs provide statistical confidence but not cross-environment generality. |
| TR111 | Original version tested simplified Rust agent (single LLM call, no file I/O). Not comparable to Python's full workflow. | Superseded by TR111_v2 with full workflow parity (file scanning, multi-stage LLM, metric tracking). | v1 data should not be cited; v2 is the canonical reference. |
| TR112 | Original version compared mismatched implementations (Rust micro-benchmark vs Python full workflow). | Superseded by TR112_v2 with production-grade parity. 111 total runs (57 Rust + 54 Python). | v1 conclusions (Python faster) are incorrect and must not inform decisions. v2 is canonical. |
| TR113 | Single Ollama instance for Rust multi-agent. Identified serialization bottleneck but did not solve it. Efficiency capped at 82.2%. | TR114 and TR114_v2 adopt dual Ollama architecture, raising efficiency to 95.7% and 98.281% respectively. | TR113 data is diagnostic only. Production decisions must use TR114_v2 numbers. |
| TR114 | First dual Ollama attempt; 150 runs but predates corrected single-agent baselines (TR111_v2/TR112_v2). | Superseded by TR114_v2 with corrected baselines, full workflow parity, and 135 additional runs. | v1 efficiency numbers (89.3% mean) understate Rust's true capability. v2 (98.281%) is canonical. |
| TR115 | Runtime comparison with 150 runs but based on uncorrected baselines. Peak numbers (96.3%) reported from single favorable runs. | Superseded by TR115_v2 with corrected baselines and 150 runs. Tokio-default identified as most reliable. | Runtime differences are within noise (< 1pp). The key finding is negative: runtime choice is not a significant lever. |
| TR116 | Three model families tested but with fewer configurations per model (60 total runs). Qwen 2.5 underperformance may be model-specific, not architecture-general. | Cross-validated against TR114_v2 (Gemma 3) and TR110 (Python baselines). Multiple scenarios per model. | Model routing policy validated only for tested models. New model families require independent validation. |
10.1 Version supersession policy
Reports TR111, TR112, TR114, and TR115 each have v2 versions that supersede the originals. The v2 versions correct methodological issues (mismatched implementations, uncorrected baselines, simplified workflows) identified during the research arc. All operational decisions, economic projections, and doctrinal rules in this synthesis are derived from v2 data where available. Original versions are retained for provenance but should not be cited for decision-making.
10.2 Cumulative residual risk
The principal residual risk across the program is environmental specificity: all measurements are from one hardware configuration, one operating system, one inference backend, and one quantization level, tested with at most three model families. The mitigations reduce within-program risk (corrected baselines, full workflow parity, dual Ollama validation) but do not address cross-environment generalization. Any deployment on different hardware, a different operating system, or a different inference backend should treat the conclusions as hypotheses to be validated, not as established facts.
11. Future Directions
The TR108-TR116 arc establishes a foundation for production deployment decisions within its measured boundary conditions. Several research directions extend the scope, address identified limitations, and connect to the broader Banterhearts research program.
11.1 Continuation into TR117-TR122
The immediate continuation of this research is documented in TR117 through TR122, which shift focus from runtime and architecture optimization to measurement methodology, cost modeling, scaling laws, compiler attribution, and physical measurement:
- TR117: Baseline performance matrix with single-model scope, extending the benchmarking framework to additional models and scenarios.
- TR118_v2.2: Validation and artifact integrity, establishing reproducibility guarantees for the measurement pipeline.
- TR119v1: Cost and energy economics, translating throughput into monetary and environmental cost.
- TR120: Compiler attribution audit, separating framework overhead from model computation.
- TR121v1: Scaling regimes across model sizes, identifying where parameter count predicts throughput and where it does not.
- TR122: Physics-based measurement, defining measurement resolution limits and energy attribution boundaries.
These reports are synthesized in a separate conclusive document (Technical_Report_Conclusive_117-122.md).
11.2 Continuation into TR123-TR133
The second continuation extends into quantization economics, quality baselines, and capacity planning:
- TR123: KV-cache production economics across five model architectures (124M to 3.2B parameters), examining how attention mechanism (MHA vs GQA) affects cache memory costs.
- TR124: Quality and accuracy baselines across backends and quantization levels, providing the quality dimension absent from TR108-TR116.
- TR125: Quantization decision matrix with systematic evaluation of Q2_K through FP16 across four model families, addressing the quantization specificity limitation identified in Section 9.6.
- TR126-TR133: Extended investigations into backend equivalence, sampling variance, and production readiness criteria.
These reports are synthesized in a separate conclusive document (Technical_Report_Conclusive_123-133.md).
11.3 Suggested investigations not yet covered
The following investigations address specific gaps identified in this synthesis:
-
Three-or-more agent scaling. The program validates only dual-agent execution. A systematic study of 3, 4, and 8 concurrent agents would establish whether efficiency degrades linearly, exhibits phase transitions, or requires architectural changes beyond dual Ollama. This is the single largest gap in the current evidence base for production planning.
-
Linux validation. All measurements are Windows-only. A controlled replication on Linux (Ubuntu 22.04 or later) with identical hardware would establish whether the Rust-vs-Python efficiency rankings and the dual Ollama advantage transfer across operating systems. This is particularly important because production deployments overwhelmingly target Linux.
-
Server GPU validation. The RTX 4080 Laptop GPU has different memory bandwidth, thermal envelope, and clock behavior compared to datacenter GPUs (A100, H_100, L40S). A replication study on server-class hardware would determine whether the architectural conclusions (dual Ollama, Rust advantage) hold at higher VRAM and bandwidth tiers.
-
Quality-adjusted throughput. The program measures tokens per second but not quality per token. A combined metric that weights throughput by output quality (measured via task completion rate, factual accuracy, or human evaluation) would provide a more complete basis for model routing and configuration decisions. This directly connects to TR124's quality baseline work.
-
Alternative inference backends. Ollama is the sole backend for multi-agent testing. Evaluating vLLM, TensorRT-LLM, or llama.cpp in server mode would determine whether the dual-instance architecture insight is Ollama-specific or generalizes across inference engines.
-
Containerized deployment. Docker and Kubernetes introduce additional isolation layers that may affect the measured contention and efficiency results. A validation study under containerized deployment would bridge the gap between benchmark results and production infrastructure.
12. Conclusive Statement
This synthesis consolidates 903 or more individual benchmark runs across nine technical reports (TR108 through TR116) into a coherent, decision-grade framework for local-first LLM inference in multi-agent production systems. The research arc progresses from single-model, single-inference performance characterization (TR108) through agent workflow optimization (TR109), concurrent multi-agent analysis (TR110), cross-language comparison (TR111-TR112), architectural intervention (TR113-TR114), runtime optimization (TR115), and cross-model validation (TR116). Each report builds on its predecessors, and four reports (TR111, TR112, TR114, TR115) underwent v2 revisions to correct methodological issues discovered during the arc.
The program yields six shippable decisions: (1) default to Rust for production multi-agent workloads; (2) deploy dual Ollama instances for concurrent agent execution; (3) use Tokio-default as the async runtime; (4) route to Gemma 3 for throughput-critical workloads and Llama 3.1 for reasoning-heavy tasks; (5) set GPU layers to 80 or above for contention-free multi-agent operation; and (6) never transfer configurations between deployment modes without empirical revalidation.
The most consequential finding is architectural, not linguistic: the transition from single to dual Ollama instances produced a +17pp efficiency gain and a 10x contention reduction, exceeding the impact of the Python-to-Rust migration (+2.48pp multi-agent efficiency, +15.2% single-agent throughput). Language choice matters, but architecture matters more. This insight reframes the optimization priority for production systems: solve the serialization bottleneck first, then optimize the runtime.
This synthesis is decision-grade within the measured stack. It is not a universal performance claim. The conclusions are bounded to one hardware baseline, one operating system, one inference backend, three model families, and dual-agent concurrency. The portable output is the method, the gating rules, and the architectural insight. The absolute numbers are local to the measured conditions and will change under different deployment configurations.
13. References
13.1 Primary sources (TR108-TR116)
[1] Technical Report 108: Comprehensive LLM Performance Analysis -- Ollama Model Benchmarking and Optimization Study. October 2025. PublishReady/reports/Technical_Report_108.md
[2] Technical Report 109: Agent Workflow Performance Analysis -- Chimera Optimization for Multi-Step LLM Agent Tasks. October 2025. PublishReady/reports/Technical_Report_109.md
[3] Technical Report 110: Concurrent Multi-Agent Performance Analysis with Chimera Optimization -- Systematic Evaluation of Parallel Agent Execution. October 2025. PublishReady/reports/Technical_Report_110.md
[4] Technical Report 111 v2: Rust Agent Workflow Performance Analysis -- Comprehensive Parameter Optimization and Python Performance Parity Validation. November 2025. PublishReady/reports/Technical_Report_111_v2.md
[5] Technical Report 112 v2: Rust vs Python Agent Performance Comparison -- Cross-Language Comprehensive Analysis for Production LLM Deployments. November 2025. PublishReady/reports/Technical_Report_112_v2.md
[6] Technical Report 113: Rust Concurrent Multi-Agent Performance Analysis -- Cross-Language Comparison and Production Deployment Evaluation. November 2025. PublishReady/reports/Technical_Report_113.md
[7] Technical Report 114 v2: Rust Concurrent Multi-Agent Performance with Dual Ollama Architecture -- Comprehensive Cross-Language Analysis and Production Validation. November 2025. PublishReady/reports/Technical_Report_114_v2.md
[8] Technical Report 115 v2: Rust Async Runtime Performance Deep Dive -- Comprehensive Multi-Runtime Analysis for Multi-Agent LLM Workloads. November 2025. PublishReady/reports/Technical_Report_115_v2.md
[9] Technical Report 116: Cross-Model Benchmarks and Runtime Architecture Analysis -- Qwen 2.5 vs Gemma 3 vs Llama 3.1 8B: Comprehensive Multi-Agent Performance Study. November 2025. PublishReady/reports/Technical_Report_116.md
13.2 Superseded reports (retained for provenance)
[10] Technical Report 111 (v1): Rust Agent Performance Analysis. November 2025. PublishReady/reports/Technical_Report_111.md. Superseded by [4].
[11] Technical Report 112 (v1): Rust vs Python Agent Comparison. November 2025. PublishReady/reports/Technical_Report_112.md. Superseded by [5].
[12] Technical Report 114 (v1): Rust Concurrent Multi-Agent Performance with Dual Ollama Architecture. November 2025. PublishReady/reports/Technical_Report_114.md. Superseded by [7].
[13] Technical Report 115 (v1): Rust Async Runtime Optimization Analysis. November 2025. PublishReady/reports/Technical_Report_115.md. Superseded by [8].
13.3 Continuation reports
[14] Technical Report Conclusive 117-122: From Benchmarking to Decision-Grade Inference. December 2025. PublishReady/reports/Technical_Report_Conclusive_117-122.md
[15] Technical Report Conclusive 123-133: Quantization Economics, Quality Baselines, and Capacity Planning. 2026. PublishReady/reports/Technical_Report_Conclusive_123-133.md
13.4 Hardware and platform references
[16] NVIDIA Corporation. NVIDIA GeForce RTX 4080 Laptop GPU Specifications. 2023. [Online]. Available: https://www.nvidia.com/en-us/geforce/laptops/
[17] Intel Corporation. 13th Gen Intel Core i9-13980HX Processor Specifications. 2023. [Online]. Available: https://ark.intel.com/content/www/us/en/ark/products/232171/intel-core-i913980hx-processor-36m-cache-up-to-5-60-ghz.html
13.5 Runtime and framework references
[18] Tokio Contributors. Tokio: An Asynchronous Rust Runtime. 2024. [Online]. Available: https://docs.rs/tokio/latest/tokio/
[19] Tokio Contributors. Tokio Runtime Configuration and Work-Stealing Scheduler. 2024. [Online]. Available: https://docs.rs/tokio/latest/tokio/runtime/index.html
[20] The smol Contributors. smol: A Small and Fast Async Runtime for Rust. 2024. [Online]. Available: https://docs.rs/smol/latest/smol/
[21] Python Software Foundation. asyncio -- Asynchronous I/O. 2024. [Online]. Available: https://docs.python.org/3/library/asyncio.html
13.6 Inference backend references
[22] Ollama. Ollama: Run Large Language Models Locally. 2024. [Online]. Available: https://ollama.ai/
[23] Ollama. Ollama API Documentation. 2024. [Online]. Available: https://github.com/ollama/ollama/blob/main/docs/api.md
13.7 Model references
[24] Google DeepMind. Gemma 3: Open Models Based on Gemini Research and Technology. 2025. [Online]. Available: https://ai.google.dev/gemma
[25] Meta AI. Llama 3.1: Open Foundation and Instruction-Tuned Large Language Models. 2024. [Online]. Available: https://llama.meta.com/
[26] Alibaba Cloud. Qwen 2.5: A Series of Large Language Models. 2024. [Online]. Available: https://qwenlm.github.io/
13.8 Methodological references
[27] A. Vaswani et al., "Attention Is All You Need," in Advances in Neural Information Processing Systems, 2017. arXiv:1706.03762.
[28] T. Dao et al., "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness," 2022. arXiv:2205.14135.
[29] J. Kaplan et al., "Scaling Laws for Neural Language Models," 2020. arXiv:2001.08361.
[30] T. B. Brown et al., "Language Models are Few-Shot Learners," 2020. arXiv:2005.14165.
Appendix A: Key Formulas and Derivations
This appendix collects the principal quantitative formulas used throughout the TR108--TR116 program. Each formula is presented with its definition, units, and the context in which it was applied. All formulas assume steady-state measurement windows; warmup and cold-start periods are excluded unless explicitly noted.
A.1 Parallel Efficiency
Parallel efficiency quantifies how effectively a concurrent multi-agent system utilizes its theoretical capacity relative to sequential execution. A value of 100% indicates perfect parallelism with zero coordination overhead.
eta = (2 * T_slower) / (T_slower + T_faster) * 100%
Where:
T_sloweris the wall-clock time of the slower agent in a concurrent pair (seconds)T_fasteris the wall-clock time of the faster agent in a concurrent pair (seconds)etais expressed as a percentage; 100% denotes zero overhead
Application: TR110 (Python multi-agent, peak 99.25%), TR113 (Rust single-Ollama, peak 82.2%), TR114_v2 (Rust dual-Ollama, peak 99.992% single run, 99.396% config average), TR116 (cross-model, Gemma 3 at 99.2% chimera-homo in Rust).
Boundary Note: This formula assumes a two-agent system. For N agents, the generalized form replaces 2 with N and uses the slowest agent's time in the numerator: eta_N = (N * T_slowest) / sum(T_i) * 100%.
A.2 Concurrency Speedup
Concurrency speedup measures how much faster a concurrent deployment completes the same workload relative to sequential execution.
S = T_sequential / T_concurrent
Where:
T_sequentialis the total wall-clock time when agents execute sequentially (seconds)T_concurrentis the wall-clock time when agents execute in parallel (seconds)Sis dimensionless; theoretical maximum is N for N agents
Application: TR110 peak speedup 1.985x (theoretical max 2.0x), TR113 peak 1.64x, TR114_v2 peak 1.99x.
Relationship to Efficiency: For a two-agent system, S = 2 * eta / 100. A speedup of 1.985x corresponds to 99.25% efficiency.
A.3 Contention Rate
Contention rate quantifies the fraction of benchmark runs in which resource conflicts degrade performance beyond a configurable threshold. It serves as a binary health indicator for deployment stability.
C_rate = N_contended / N_total
Where:
N_contendedis the number of runs wherethroughput_delta > thresholdor resource conflict signals are detected (e.g., VRAM exhaustion, scheduling stalls)N_totalis the total number of benchmark runs in the measurement set- Default threshold: throughput degradation > 5% relative to single-agent baseline
Application: TR113 contention rate 63% (single Ollama), TR114_v2 contention rate 0.74% (dual Ollama). The transition from 63% to 0.74% is the primary evidence for the dual-Ollama mandatory decision (C2).
A.4 Coefficient of Variation
The coefficient of variation provides a dimensionless measure of run-to-run consistency, enabling comparison across metrics with different units and scales.
CV = (sigma / mu) * 100%
Where:
sigmais the sample standard deviationmuis the sample meanCVis expressed as a percentage; lower values indicate higher consistency
Application: TR111_v2 throughput CV 0.24% (extremely stable), TR112_v2 Rust CV 2.6% vs Python CV 4.8% (Rust 46% more consistent). TR115_v2 uses standard deviation in percentage-point units (pp) for efficiency metrics: tokio-default 1.21pp sigma, tokio-localset 4.03pp sigma.
Decision Threshold: CV < 5% is considered production-acceptable for throughput metrics. CV > 10% triggers investigation for root-cause instability.
A.5 Throughput
Throughput measures the rate at which the model generates output tokens during the decode phase, excluding prompt processing time.
throughput = tokens_generated / generation_time_seconds
Where:
tokens_generatedis the count of output tokens produced in the response (dimensionless)generation_time_secondsis the elapsed wall-clock time during the decode phase (seconds)- Result is expressed in tok/s (tokens per second)
Application: TR108 Gemma 3 baseline 102.85 tok/s, Gemma 270m peak 303.9 tok/s, Llama 3.1 q8_0 46.57 tok/s. TR111_v2 Rust agent 114.54 tok/s, TR112_v2 Python agent 99.34 tok/s.
Boundary Note: Throughput as measured in this program is decode-only. Prompt processing (prefill) time is captured separately in TTFT. Total end-to-end latency is TTFT + (tokens_generated / throughput).
A.6 Time to First Token (TTFT)
TTFT measures the latency from request submission to the arrival of the first output token, encompassing network round-trip, prompt processing (prefill), and scheduling overhead.
TTFT = t_first_token - t_request_sent
Where:
t_request_sentis the timestamp at which the HTTP request is dispatched to the Ollama APIt_first_tokenis the timestamp at which the first streaming token is received by the client- Result is expressed in milliseconds (ms)
Application: TR109 TTFT improvement -68% (optimized vs baseline), worst case +323.7% (wrong config transfer). TR112_v2 Rust 603ms vs Python 1437ms baseline (-58% advantage for Rust).
A.7 Cost per Million Tokens
This formula translates throughput into a monetary cost metric, enabling infrastructure budget comparisons across deployment configurations.
cost_per_1M = (1e6 / tok_per_s / 3600) * usd_per_hour
Where:
tok_per_sis the measured throughput (tokens per second)usd_per_houris the fully loaded hourly cost of the compute resource (USD/hour), including amortized hardware, electricity, and cooling- Result is expressed in USD per 1,000,000 tokens
Application: TR112_v2 estimates $3,040/year savings from Rust deployment at scale (1M requests/month). At $0.50/hour compute cost: Rust at 114.54 tok/s yields $1.21 per 1M tokens; Python at 99.34 tok/s yields $1.40 per 1M tokens.
A.8 Memory Efficiency
Memory efficiency normalizes throughput by VRAM consumption, enabling comparisons across models with different memory footprints.
mem_eff = tok_per_s / vram_gb
Where:
tok_per_sis measured throughput (tokens per second)vram_gbis GPU VRAM consumed during inference (gigabytes)- Result is expressed in tok/s/GB
Application: Gemma 3 (4.3B, Q4_K_M) at ~3.3 GB VRAM and 102.85 tok/s yields 31.2 tok/s/GB. Llama 3.1 8B at ~5.5 GB VRAM and 68 tok/s yields 12.4 tok/s/GB. Gemma 3 is 2.5x more memory-efficient.
A.9 Break-Even Months
Break-even months quantify the time required for operational savings to recoup the additional development cost of a higher-investment deployment path (e.g., Rust migration).
break_even = additional_dev_cost / monthly_savings
Where:
additional_dev_costis the one-time investment to migrate or re-implement (USD)monthly_savingsis the recurring cost reduction per month from the new deployment (USD/month)- Result is expressed in months
Application: TR112_v2 estimates $5,000 additional development cost for Rust migration with $253/month savings ($3,040/year / 12), yielding a break-even period of approximately 20 months.
Appendix B: Claim-to-Artifact Chain of Custody
This appendix establishes a formal evidence chain linking each of the six principal decisions to the specific technical reports, measurements, and artifacts that support them. The chain-of-custody methodology ensures that no decision rests on a single data point and that every claim is traceable to reproducible artifacts.
B.1 Chain-of-Custody Table
| Claim | Decision | Primary Evidence (TR) | Key Metric | Secondary Evidence | Artifact Type | Status |
|---|---|---|---|---|---|---|
| C1 | Rust for production inference | TR112_v2 (+15.2% throughput, -58% TTFT, -67% memory) | 114.54 vs 99.34 tok/s | TR111_v2 (0.24% CV), TR116 (Rust 90-99% efficiency) | Benchmark logs, CSV results, statistical analysis | Supported |
| C2 | Dual Ollama mandatory for multi-agent | TR113 (82.2% peak, 63% contention), TR114_v2 (99.4% peak, 0.74% contention) | Contention 63% to 0.74% | TR110 (Python dual-Ollama 99.25%) | Contention logs, per-run timing data | Supported |
| C3 | Tokio-default runtime for Rust async | TR115_v2 (98.72% mean, 1.21pp sigma) | Consistency rank #1 of 5 runtimes | TR114_v2 (uses tokio-default), TR116 (uses tokio-default) | Runtime benchmark CSV, per-run efficiency distributions | Supported |
| C4 | Gemma 3 for scaling and production | TR116 (99.2% chimera-homo Rust efficiency) | 97.3% baseline-vs-chimera efficiency | TR108 (102.85 tok/s baseline), TR110 (used Gemma 3) | Cross-model comparison CSV, efficiency rankings | Supported |
| C5 | Python ceiling approximately 86% | TR116 (Python never exceeds 86% multi-agent efficiency) | Max 83.8% (Llama), 80.2% (Gemma), 77.6% (Qwen) | TR113 vs TR110 (Python dual-Ollama 99.25% but only under ideal homogeneous config) | Per-model per-runtime efficiency distributions | Supported |
| C6 | Configuration does not transfer across workload modes | TR109 (single-inference config transfer failure, +323.7% TTFT regression) | Inverse relationship for context size | TR108 vs TR110 (different optima for single vs multi-agent) | Config sweep results, regression analysis | Supported |
B.2 Evidence Depth Assessment
Each claim is graded on the following dimensions:
| Claim | Replications | Distinct Configs | Statistical Power | Falsification Tested | Grade |
|---|---|---|---|---|---|
| C1 | 111 runs (57 Rust + 54 Python) | 37 configs | CV < 5% for primary metric | Python-favorable configs tested | A |
| C2 | 154 runs (19 TR113 + 135 TR114_v2) | 46 configs | Contention rate shift: 63% to 0.74% | Single-Ollama tested as control | A |
| C3 | 150 runs across 5 runtimes | 30 configs per runtime | sigma comparison across runtimes | 4 alternative runtimes tested | A |
| C4 | 60 runs across 3 models x 2 runtimes | 6 model-runtime combinations | Per-model efficiency distributions | 2 alternative models tested | B+ |
| C5 | 60 runs (30 Python) | 6 model-runtime combinations | Ceiling observed across all 3 models | Rust exceeds ceiling on same hardware | B+ |
| C6 | 20+ runs across transfer experiments | 3 experimental phases | Regression magnitude +323.7% | Optimized configs tested bidirectionally | B |
B.3 Artifact Locations
All benchmark artifacts are stored in the Banterhearts repository under the following paths:
- TR108:
research/tr108/data/(Gemma 3 and Llama 3.1 benchmark CSVs, analysis summaries) - TR109:
research/tr109/(agent workflow results, configuration sweep data) - TR110:
research/tr110/(Python multi-agent results, 150 runs, contention logs) - TR111_v2:
research/tr111/(Rust single-agent parameter sweep, 57 runs) - TR112_v2: Cross-reference of TR111_v2 and TR109 datasets
- TR113:
research/tr113/(Rust single-Ollama multi-agent, 19 configs) - TR114_v2:
research/tr114/(Rust dual-Ollama multi-agent, 135 runs) - TR115_v2:
research/tr115/(5-runtime comparison, 150 runs) - TR116:
research/tr116/(3-model cross-runtime, 60 runs)
Appendix C: Per-TR Key Numbers
This appendix provides a compact reference table summarizing the primary quantitative output of each technical report. The table is designed for rapid lookup and cross-reference during decision review.
C.1 Summary Table
| TR | Runs | Primary Metric | Best Value | Worst Value | CV / Spread | Key Finding |
|---|---|---|---|---|---|---|
| TR108 | 158+ | Throughput (tok/s) | 303.9 (Gemma 270m) | 46.57 (Llama q8_0) | <5% per config | Model size inversely correlated with throughput; GPU layers most critical parameter |
| TR109 | 20+ | TTFT improvement (%) | -68% (optimized) | +323.7% (wrong config) | varies by phase | Single-inference optima do not transfer to agent workflows |
| TR110 | 150 | Parallel efficiency (%) | 99.25% (homo chimera) | 72.7% (GPU=60 mixed) | 7.4pp spread | Homogeneous chimera with GPU>=80 achieves near-perfect parallelism |
| TR111_v2 | 57 | Throughput (tok/s) | 114.98 (best config) | 113.99 (worst config) | 0.24% CV | Rust throughput extremely stable; TTFT is primary optimization target |
| TR112_v2 | 111 | Throughput delta (%) | +15.2% (Rust over Python) | -- | Rust 2.6% CV, Python 4.8% CV | Rust faster, more consistent, lower memory; Python retains dev velocity |
| TR113 | 19 | Parallel efficiency (%) | 82.2% (homo chimera) | 62.6% (mixed baseline) | 5.5pp spread | Single Ollama serializes requests; 63% contention rate |
| TR114_v2 | 135 | Parallel efficiency (%) | 99.992% (single run) | 91.6% (worst config avg) | 5.0pp spread | Dual Ollama eliminates contention; Rust matches Python multi-agent |
| TR115_v2 | 150 | Runtime efficiency (%) | 99.99% (localset peak) | 50.0% (async-std) | varies by runtime | Tokio-default most reliable (1.21pp sigma); async-std catastrophic failure |
| TR116 | 60 | Cross-model efficiency (%) | 99.2% (Gemma Rust homo) | 77.6% (Qwen Python) | <2% within model | Gemma 3 scaling champion; Python ceiling ~86%; Rust dominates all models |
C.2 Aggregate Program Statistics
| Metric | Value |
|---|---|
| Total benchmark runs | 903+ |
| Distinct configurations tested | 200+ |
| Models evaluated | 6 (Gemma 3, Gemma 270m, Llama 3.1 8B, Llama 3.1 various quants, Qwen 2.5 7B, plus config variants) |
| Languages compared | 2 (Rust, Python) |
| Async runtimes evaluated | 5 (tokio-default, tokio-localset, async-std, smol, smol-1KB) |
| Deployment architectures | 4 (single-agent, multi-agent single-Ollama, multi-agent dual-Ollama, cross-model) |
| Calendar duration | October--November 2025 |
| Hardware platform | Single workstation (RTX 4080 Laptop 12 GB, i9-13980HX, 32 GB DDR5-4800) |
Appendix D: Glossary and Definitions
This appendix provides authoritative definitions for all technical terms used in the TR108--TR116 program. Terms are listed alphabetically. Where a term has a program-specific meaning that differs from general usage, the program-specific meaning is given and the distinction is noted.
Baseline. The default Ollama configuration (no parameter overrides) used as the control condition in parameter sweep experiments. In multi-agent tests, "baseline" may refer to the sequential execution time against which concurrency speedup is measured.
Chimera. The project-specific name for optimized LLM agent configurations that have been tuned beyond default parameters. A "Chimera agent" is an agent running with optimized GPU layer allocation, context size, and temperature settings. The name derives from the parent project, Chimera Heart.
Cold Start. The first inference request after model loading, during which KV cache allocation, CUDA kernel compilation, and memory mapping occur. Cold-start TTFT is typically 2--10x higher than warm-start TTFT. In the TR108--TR116 program, cold-start runs are excluded from steady-state measurements unless explicitly noted.
Coefficient of Variation (CV). A dimensionless measure of relative dispersion, defined as the ratio of the standard deviation to the mean, expressed as a percentage. See Appendix A.4.
Contention. A state in which two or more concurrent agents compete for a shared resource (GPU compute, VRAM, Ollama server queue) in a manner that degrades throughput or increases latency beyond the expected baseline. Contention is detected when throughput delta exceeds a configurable threshold (default 5%).
Decision-Grade. A report classification indicating that the data quality, statistical rigor, and evidence chain are sufficient to support a production deployment decision. A decision-grade report must satisfy: (1) minimum 3 replications per configuration, (2) CV < 10% for primary metrics, (3) documented measurement boundaries, (4) falsification testing of at least one alternative hypothesis.
Dual Ollama. A deployment architecture in which two independent Ollama server processes run on the same machine, each bound to a distinct port (default 11434 and 11435). Each concurrent agent is assigned to a dedicated Ollama instance, eliminating server-level request serialization. This architecture was introduced in TR114 and validated as mandatory for multi-agent Rust deployments.
Event Loop Ceiling. The maximum parallel efficiency achievable by a given async runtime's event loop implementation. In the TR108--TR116 program, Python's asyncio event loop exhibits a ceiling of approximately 86% multi-agent efficiency (TR116), while Rust's tokio runtime achieves 98--99%.
Heterogeneous Deployment. A multi-agent configuration in which agents use different parameter settings (e.g., Agent 1 at GPU=60, CTX=512 and Agent 2 at GPU=120, CTX=1024). Contrasted with homogeneous deployment.
Homogeneous Deployment. A multi-agent configuration in which all agents use identical parameter settings. Homogeneous chimera deployments consistently achieve the highest parallel efficiency in this program (TR110: 99.25%, TR114_v2: 99.4%).
KV Cache. The key-value cache maintained by transformer models during autoregressive generation. The KV cache stores the key and value tensors for all previously processed tokens, enabling O(1) per-token generation rather than O(n) recomputation. KV cache size scales linearly with sequence length, number of layers, and head dimension.
Parallel Efficiency. The ratio of achieved concurrency speedup to theoretical maximum speedup, expressed as a percentage. See Appendix A.1.
Process Isolation. The practice of running each Ollama server as an independent operating-system process with its own memory space and GPU resource allocation. Process isolation prevents cross-agent interference and is the mechanism by which dual Ollama achieves near-zero contention.
Q4_K_M Quantization. A 4-bit quantization scheme using the K-quant method with medium precision. The "K" refers to the k-quant family of quantization algorithms; "M" denotes medium quality (as opposed to "S" for small/low or "L" for large/high). Q4_K_M typically reduces model size by approximately 75% relative to FP16 while preserving 95--99% of output quality.
Speedup. The ratio of sequential execution time to concurrent execution time. See Appendix A.2.
Time to First Token (TTFT). The elapsed time from request submission to the arrival of the first generated output token. See Appendix A.6.
tok/s (Tokens per Second). The standard unit of LLM inference throughput in this program, measuring decode-phase output token generation rate. See Appendix A.5.
VRAM (Video Random Access Memory). The GPU-local memory used to store model weights, KV cache, and intermediate activations during inference. The test platform provides 12 GB of GDDR6X VRAM (RTX 4080 Laptop).
Warm Start. An inference request issued after the model is already loaded and the KV cache is initialized. Warm-start measurements are the primary basis for steady-state performance analysis in this program.
Work-Stealing Scheduler. A parallel task scheduling strategy in which idle worker threads "steal" tasks from the queues of busy threads. Tokio's default runtime uses a work-stealing scheduler. TR115_v2 found that work-stealing achieves the best consistency (1.21pp sigma) despite marginally lower peak efficiency than thread-pinned alternatives.
Appendix E: Operational Checklists
This appendix provides step-by-step operational checklists for deploying, migrating, and monitoring Rust multi-agent LLM systems based on the configurations validated in the TR108--TR116 program.
E.1 Pre-Deployment Checklist: Rust Multi-Agent System
Complete all items before enabling production traffic.
| # | Action | Validation Criterion | TR Source |
|---|---|---|---|
| 1 | Verify dual Ollama instances running | curl http://localhost:11434/api/version and curl http://localhost:11435/api/version both return 200 |
TR114_v2 |
| 2 | Verify Tokio-default runtime | Source contains #[tokio::main] (not LocalSet, not async-std) |
TR115_v2 |
| 3 | Set GPU layer allocation | num_gpu >= 80 per Ollama instance; total VRAM < 11 GB (leave 1 GB headroom) |
TR108, TR110 |
| 4 | Set context size | num_ctx in range 512--2048; use 1024 for agent workflows, 2048 for chat |
TR109, TR111_v2 |
| 5 | Run warmup sequence | Issue 3 sequential requests per Ollama instance; discard results | TR108, TR111_v2 |
| 6 | Validate parallel efficiency | Run 5-iteration benchmark; confirm efficiency > 95% | TR114_v2 |
| 7 | Validate contention rate | Confirm contention rate < 1% across warmup + benchmark runs | TR114_v2 |
| 8 | Set monitoring alerts | Efficiency < 90%, TTFT > 2000 ms, throughput < 80 tok/s, contention > 5% | TR112_v2, TR116 |
| 9 | Validate memory consumption | nvidia-smi confirms VRAM usage < 11 GB under dual-agent load |
TR110, TR114_v2 |
| 10 | Confirm model version | ollama list shows expected model hash and quantization level |
TR108 |
E.2 Migration Checklist: Python to Rust
Complete sequentially. Each step must pass its validation criterion before proceeding.
| # | Action | Validation Criterion | Expected Value | TR Source |
|---|---|---|---|---|
| 1 | Establish Python baseline | Record single-agent throughput, TTFT, memory | ~99 tok/s, ~1437 ms, ~250 MB | TR112_v2 |
| 2 | Ensure workflow parity | Rust agent performs same file I/O, LLM calls, output format | Identical output structure | TR111_v2 |
| 3 | Benchmark Rust single-agent | Run 3+ iterations with default Ollama config | ~114 tok/s (+15% over Python) | TR111_v2, TR112_v2 |
| 4 | Benchmark Rust multi-agent | Run 5+ iterations with dual Ollama, tokio-default | Efficiency > 98% | TR114_v2 |
| 5 | Validate quality equivalence | Compare output quality on 10 representative prompts | No degradation in coherence or accuracy | TR112_v2 |
| 6 | Deploy with monitoring | Enable all alerts from E.1, run 7-day burn-in period | No alert triggers during burn-in | All TRs |
| 7 | Compare cost metrics | Monthly cost (Rust) < Monthly cost (Python) | ~15% infrastructure savings | TR112_v2 |
| 8 | Decommission Python stack | Only after 30-day stable operation of Rust deployment | Zero incidents during 30-day window | -- |
E.3 Regression Response Checklist
Execute when monitoring alerts trigger during production operation.
| # | Condition | Immediate Action | Root Cause Investigation | TR Reference |
|---|---|---|---|---|
| 1 | Efficiency < 90% | Check Ollama process health on both ports | Verify dual-instance isolation; check VRAM pressure | TR113, TR114_v2 |
| 2 | TTFT > 2000 ms | Check for cold-start (model unloaded) | Verify keep-alive settings; check competing GPU processes | TR109, TR112_v2 |
| 3 | Throughput < 80 tok/s | Check GPU utilization and thermal throttling | Verify GPU clock speed; check for VRAM swap | TR108 |
| 4 | Contention > 5% | Verify agents are routing to separate Ollama instances | Check port assignment; verify process isolation | TR113, TR114_v2 |
| 5 | Memory > 11 GB VRAM | Reduce num_ctx or reduce num_gpu layers | Investigate KV cache size; check for memory leaks | TR108, TR110 |
Appendix F: Workload Taxonomy
This appendix classifies the workload types encountered in the TR108--TR116 program and identifies the optimization target for each class. Practitioners should use this taxonomy to select the appropriate configuration profile.
F.1 Workload Classification
| Workload Type | Phase Dominance | Optimization Target | Representative TR | Tested | Production Guidance |
|---|---|---|---|---|---|
| Single-agent chat | Decode-dominant | Maximize throughput (tok/s) | TR108 | Yes | GPU=80+, CTX=512-1024, TEMP=0.6-0.8 |
| Agent workflow (multi-step) | Mixed prefill + decode | Minimize TTFT and end-to-end latency | TR109, TR111_v2 | Yes | CTX=256-512 (inverse of single-inference), GPU=60-80 |
| Multi-agent concurrent (homogeneous) | Decode-dominant, parallel | Maximize parallel efficiency | TR110, TR114_v2 | Yes | Dual Ollama, GPU=80, CTX=1024-2048, TEMP=1.0 |
| Multi-agent concurrent (heterogeneous) | Mixed | Balance efficiency and specialization | TR110, TR114_v2 | Yes | Dual Ollama, per-agent config tuning required |
| Cross-model multi-agent | Model-dependent | Maximize efficiency at model-specific optimum | TR116 | Yes | Gemma 3 preferred; Llama 3.1 viable for reasoning |
| Batch inference (offline) | Prefill-dominant | Maximize throughput, latency irrelevant | -- | Not tested | Future work; likely benefits from larger CTX, higher batch sizes |
F.2 Configuration Non-Transferability
A critical finding of this program (TR109, claim C6) is that optimal configurations do not transfer across workload types:
- Single-inference optimal (TR108): GPU=999, CTX=4096, high temperature
- Agent workflow optimal (TR109): GPU=60-80, CTX=256-512, moderate temperature
- Multi-agent optimal (TR110): GPU=80, CTX=2048, TEMP=1.0
Applying single-inference configurations to agent workflows caused TTFT regressions of up to +323.7% (TR109 Phase 1). This non-transferability is the basis for decision C6.
Appendix G: Worked Examples
This appendix presents fully worked numerical examples that demonstrate how the formulas in Appendix A translate raw benchmark data into actionable capacity and cost projections.
G.1 Single-Agent Capacity Calculation
Scenario: Estimate requests per hour for a Rust single-agent deployment generating 500-token responses.
Given:
- Throughput: 114.54 tok/s (TR111_v2 baseline)
- Response length: 500 tokens
- TTFT: 603 ms (TR112_v2 Rust baseline)
Calculation:
Decode time = 500 / 114.54 = 4.365 seconds
Total latency = 0.603 + 4.365 = 4.968 seconds per request
Capacity = 3600 / 4.968 = 724.4 requests/hour
Python comparison:
Decode time = 500 / 99.34 = 5.033 seconds
Total latency = 1.437 + 5.033 = 6.470 seconds per request
Capacity = 3600 / 6.470 = 556.4 requests/hour
Rust advantage: 724.4 / 556.4 = +30.2% more requests per hour (includes both throughput and TTFT improvements).
G.2 Multi-Agent Throughput Calculation
Scenario: Estimate effective throughput for a 2-agent Rust deployment with dual Ollama.
Given:
- Single-agent throughput: 114.54 tok/s (TR111_v2)
- Parallel efficiency: 98.28% (TR114_v2 overall mean)
Calculation:
Effective throughput = 2 * 114.54 * 0.9828 = 225.1 tok/s
Python comparison:
- Single-agent throughput: 99.34 tok/s (TR112_v2)
- Parallel efficiency: 95.8% (estimated from TR110 adjusted for workflow complexity)
Effective throughput = 2 * 99.34 * 0.958 = 190.3 tok/s
Combined Rust advantage: 225.1 / 190.3 = +18.3% (compound of throughput + efficiency gains).
G.3 Break-Even Calculation
Scenario: Evaluate the economic case for Rust migration.
Given:
- Additional development cost: $5,000 (estimated Rust re-implementation)
- Infrastructure cost rate: $0.50/hour (amortized hardware + electricity)
- Rust capacity: 724.4 requests/hour (from G.1)
- Python capacity: 556.4 requests/hour (from G.1)
Calculation:
At 1M requests/month:
Python hours needed = 1,000,000 / 556.4 = 1,797.3 hours/month
Rust hours needed = 1,000,000 / 724.4 = 1,380.4 hours/month
Monthly savings = (1,797.3 - 1,380.4) * $0.50 = $208.45/month
Break-even = $5,000 / $208.45 = 24.0 months
Note: The TR112_v2 estimate of 20 months uses a slightly different cost model that includes memory savings and reduced instance count. The difference illustrates the sensitivity of break-even calculations to cost assumptions.
G.4 Contention Impact Calculation
Scenario: Quantify the throughput loss from single-Ollama contention.
Given:
- Dual-Ollama efficiency: 98.28% (TR114_v2)
- Single-Ollama efficiency: 82.2% (TR113 peak)
- Single-agent throughput: 114.54 tok/s
Calculation:
Dual-Ollama effective throughput: 2 * 114.54 * 0.9828 = 225.1 tok/s
Single-Ollama effective throughput: 2 * 114.54 * 0.822 = 188.3 tok/s
Throughput loss from contention: 225.1 - 188.3 = 36.8 tok/s (16.3% loss)
This 36.8 tok/s penalty is the quantitative basis for the dual-Ollama mandatory decision (C2).
Appendix H: Operational Playbooks
This appendix provides step-by-step procedures for common operational scenarios in production multi-agent LLM deployments.
H.1 Pre-Deployment Validation Playbook
Objective: Confirm system readiness before enabling production traffic.
Procedure:
- Start Ollama instance 1:
OLLAMA_HOST=0.0.0.0:11434 ollama serve - Start Ollama instance 2:
OLLAMA_HOST=0.0.0.0:11435 ollama serve - Verify model availability:
ollama liston both instances - Execute warmup: 3 sequential requests per instance (discard results)
- Execute validation benchmark: 5 concurrent dual-agent runs
- Compute parallel efficiency (target: > 95%)
- Compute contention rate (target: < 1%)
- Record baseline metrics for regression detection
- Enable production traffic with monitoring
Rollback trigger: Efficiency < 90% or contention > 5% during validation.
H.2 Monitoring and Alerting Playbook
Objective: Detect and respond to performance degradation in real time.
Metrics to Monitor:
| Metric | Collection Interval | Warning Threshold | Critical Threshold | Source |
|---|---|---|---|---|
| Throughput (tok/s) | Per request | < 90 tok/s | < 80 tok/s | Application logs |
| TTFT (ms) | Per request | > 1500 ms | > 2000 ms | Application logs |
| Parallel efficiency (%) | Per 5-minute window | < 93% | < 90% | Computed from agent pair timing |
| Contention rate (%) | Per 15-minute window | > 2% | > 5% | Computed from throughput deltas |
| VRAM usage (GB) | Every 30 seconds | > 10 GB | > 11 GB | nvidia-smi |
| GPU temperature (C) | Every 30 seconds | > 85 C | > 90 C | nvidia-smi |
H.3 Regression Response Playbook
Objective: Restore performance to baseline when degradation is detected.
Step 1: Triage (< 5 minutes)
- Identify which metric triggered the alert
- Check if both Ollama instances are running (
curlhealth check on both ports) - Check VRAM usage and GPU temperature
Step 2: Isolate (< 15 minutes)
- Run single-agent benchmark on each Ollama instance independently
- Compare to baseline (expect ~114 tok/s for Rust, > 95% efficiency for dual-agent)
- If single-agent performance is degraded, issue is hardware/model-level
- If single-agent is fine but multi-agent is degraded, issue is coordination/contention
Step 3: Remediate
- Hardware issue: Check thermal throttling, reduce load, verify VRAM headroom
- Contention issue: Verify port assignments, restart Ollama instances, confirm process isolation
- Model issue: Verify model hash (
ollama show <model> --modelfile), re-pull if corrupted
Appendix I: Statistical Notes
This appendix documents the statistical methodology and design decisions applied across the TR108--TR116 program.
I.1 Replication Protocol
All production-relevant measurements follow a minimum replication count:
| Context | Minimum Runs | Recommended Runs | Rationale |
|---|---|---|---|
| Exploratory sweep | 1 | 3 | Identify promising configurations |
| Configuration validation | 3 | 5 | Establish mean and variance |
| Production decision | 5 | 10 | Statistical power for CV < 5% confidence |
| Pathological investigation | 10+ | 20+ | Capture tail behavior and rare events |
Program adherence: TR111_v2 used 3 runs per config (57 total), TR114_v2 used 5 runs per config (135 total), TR115_v2 used 5 runs per config per runtime (150 total). TR108 used 2--5 runs per config across 158+ configurations.
I.2 Warmup Exclusion Policy
The first 1--3 requests after model loading are excluded from all steady-state metrics. This policy prevents cold-start artifacts (CUDA kernel compilation, memory mapping, KV cache initialization) from distorting throughput and latency distributions.
Exception: Cold-start behavior is explicitly characterized in TR112_v2 (Rust 603 ms vs Python 1437 ms TTFT) and TR109 (agent workflow cold-start analysis).
I.3 Process Isolation
To prevent inter-experiment contamination:
- Each benchmark run spawns a fresh agent process
- Ollama instances are not restarted between runs within a configuration (model stays warm)
- Ollama instances are restarted between configurations that change
num_gpuornum_ctx - GPU memory is verified via
nvidia-smibetween configuration changes
I.4 Significance Considerations
This program does not employ formal hypothesis testing (e.g., t-tests, ANOVA) because the research design is comparative rather than inferential. Instead, practical significance is established through:
- Effect size: Is the measured difference large enough to matter operationally? (threshold: > 5% throughput or > 3pp efficiency)
- Consistency: Is the effect reproducible across runs? (threshold: CV < 10%)
- Robustness: Does the effect persist across configurations? (threshold: observed in > 50% of tested configs)
Appendix J: Traceability Map (Decisions to TRs)
This appendix provides a compact bidirectional mapping from decisions to supporting evidence and from each TR to the decisions it informs.
J.1 Decision to TR Map
| Decision | Primary TR | Supporting TRs | Confirming TRs | Contradicting Evidence |
|---|---|---|---|---|
| D1: Rust for production | TR112_v2 | TR111_v2, TR116 | TR114_v2, TR115_v2 | TR113 (multi-agent gap, later resolved) |
| D2: Dual Ollama | TR114_v2 | TR113 (problem identification) | TR110 (Python dual-Ollama precedent) | None |
| D3: Tokio-default | TR115_v2 | TR114_v2 | TR116 (uses tokio-default) | Tokio-localset higher peak (unstable) |
| D4: Gemma 3 | TR116 | TR108 | TR110, TR114_v2 | Llama 3.1 viable for reasoning tasks |
| D5: Python ceiling | TR116 | TR110, TR113 | All Python multi-agent results | Python 99.25% in ideal TR110 config |
| D6: Config non-transfer | TR109 | TR108 | TR110 vs TR108 optima | None |
J.2 TR to Decision Map
| TR | Decisions Informed | Role |
|---|---|---|
| TR108 | D4, D6 | Baseline model performance; identified config sensitivity |
| TR109 | D6 | Proved config non-transferability |
| TR110 | D2, D5 | Established Python multi-agent baseline; demonstrated dual-Ollama viability |
| TR111_v2 | D1 | Characterized Rust single-agent performance |
| TR112_v2 | D1 | Quantified Rust vs Python performance gap |
| TR113 | D2 | Identified single-Ollama contention bottleneck |
| TR114_v2 | D1, D2 | Validated Rust multi-agent with dual Ollama |
| TR115_v2 | D3 | Evaluated 5 async runtimes; selected tokio-default |
| TR116 | D1, D4, D5 | Cross-model validation; confirmed Rust superiority and Python ceiling |
J.3 Contradictions and Resolutions
| Apparent Contradiction | TRs Involved | Resolution |
|---|---|---|
| TR113 showed Python better at multi-agent | TR113 vs TR114_v2 | TR113 used single Ollama (serialization bottleneck). Dual Ollama in TR114_v2 resolved the gap. |
| TR110 Python achieved 99.25% efficiency | TR110 vs TR116 | TR110 used ideal homogeneous chimera config. TR116's broader test matrix revealed Python ceiling of ~86% in realistic configurations. |
| TR111 v1 showed Rust slower than Python | TR111 v1 vs TR111_v2 | TR111 v1 used micro-benchmark without workflow parity. TR111_v2 corrected methodology; Rust is 15.2% faster. |
Appendix K: Extended Literature Context
This appendix situates the TR108--TR116 program within the broader landscape of LLM serving, async runtime design, and inference optimization research. It is intentionally concise; the goal is orientation, not survey.
K.1 Rust Async Ecosystem
The Rust async ecosystem is built on the Future trait and the async/await syntax stabilized in Rust 1.39 (2019). Unlike Go or Erlang, Rust does not bundle a runtime; the application selects a runtime at compile time. The major runtimes evaluated in TR115_v2 are:
- Tokio: The dominant production runtime, featuring a multi-threaded work-stealing scheduler. Used by Cloudflare, Discord, and AWS Lambda Rust runtime.
- async-std: An alternative modeled on the standard library's sync API. TR115_v2 found catastrophic 50% serialization due to ecosystem incompatibility with Tokio's HTTP stack.
- smol: A minimal runtime designed for small binaries. TR115_v2 found pathological failures at high context sizes.
K.2 Ollama Internals
Ollama serves LLM inference through a REST API backed by llama.cpp for GGUF model execution. Key architectural properties relevant to this program:
- Single-threaded request queue: A single Ollama instance serializes concurrent requests, explaining TR113's contention finding. Dual instances bypass this by running separate processes.
- GPU layer offloading: The
num_gpuparameter controls how many transformer layers are placed on the GPU. Remaining layers execute on CPU. Performance is highly sensitive to this parameter (TR108). - KV cache management: Ollama pre-allocates KV cache based on
num_ctx. Larger contexts consume proportionally more VRAM. The context size inversely affects agent workflow performance (TR109) because larger pre-allocated caches reduce available VRAM for dual-instance execution.
K.3 LLM Serving Patterns
The TR108--TR116 program addresses the single-machine, low-concurrency serving regime. This is distinct from:
- Disaggregated serving (e.g., vLLM, TensorRT-LLM): Optimized for high-throughput cloud deployments with continuous batching. Not applicable to the single-GPU, 2-agent scenario tested here.
- Speculative decoding: A technique for accelerating autoregressive generation using a draft model. Not tested in this program but relevant for future throughput improvements.
- Quantization-aware serving: The program uses pre-quantized models (Q4_K_M) but does not evaluate dynamic quantization or mixed-precision strategies.
K.4 Tokio Scheduler Design
Tokio's work-stealing scheduler distributes tasks across a thread pool (default size: number of CPU cores). When a worker thread's local queue is empty, it steals tasks from other workers' queues. This design optimizes throughput for heterogeneous task durations but introduces overhead from:
- Thread migration: Moving a task between threads invalidates CPU cache locality.
- Synchronization: The steal operation requires atomic compare-and-swap operations.
- Wake-up latency: Idle threads must be notified when new tasks arrive.
TR115_v2's finding that tokio-default achieves the best consistency (1.21pp sigma) despite these overheads suggests that work-stealing's load-balancing benefits outweigh its costs in I/O-bound LLM serving workloads.
Appendix L: Measurement Boundary Catalog
This appendix specifies exactly what is included and excluded from each TR's primary timing measurements. Precise boundary definitions prevent invalid cross-TR comparisons.
| TR | Measurement | Boundary Start | Boundary End | Included | Excluded |
|---|---|---|---|---|---|
| TR108 | Throughput | First decode token | Last decode token | Decode time, token counting | Prompt processing, HTTP overhead, model loading |
| TR108 | TTFT | HTTP request sent | First streaming token received | Network round-trip, prefill, scheduling | Decode, response parsing |
| TR109 | End-to-end | Workflow start | Workflow complete | All LLM calls, file I/O, processing | Agent startup, Python import time |
| TR109 | TTFT | Per-call request sent | Per-call first token | Same as TR108 per-call | Inter-call processing time |
| TR110 | Parallel efficiency | Both agents start | Both agents complete | Full concurrent execution | Sequential baseline (measured separately) |
| TR111_v2 | Throughput | Decode start | Decode end | Same as TR108 | Rust binary startup (excluded, ~0.2s) |
| TR112_v2 | Cross-language | Identical boundary per language | Identical boundary per language | Matched workflow scope | Language-specific startup (reported separately) |
| TR113 | Contention | Concurrent run start | Concurrent run end | Both agents' full execution | Warmup runs |
| TR114_v2 | Parallel efficiency | Concurrent run start | Concurrent run end | Full dual-Ollama execution | Ollama startup, model loading |
| TR115_v2 | Runtime efficiency | Concurrent run start | Concurrent run end | Full runtime execution | Runtime initialization |
| TR116 | Cross-model | Concurrent run start | Concurrent run end | Full execution per model | Model pull/loading time |
Appendix M: Detailed Methods by Report
This appendix supplements the main methodology sections with implementation details that enable reproduction.
M.1 TR108: Single-Inference Benchmarking
- Tool: Custom Python benchmark harness using
requestslibrary with streaming responses - Prompt set: Gaming banter prompts (10 unique prompts, rotated across runs)
- Token counting: Ollama API response
eval_countfield - Timing: Python
time.perf_counter()for wall-clock, Ollama APIeval_durationfor server-side - Parameter sweep: num_gpu in {33, 60, 80, 99, 120, 999}, num_ctx in {256, 512, 1024, 2048, 4096}, temperature in {0.4, 0.6, 0.8, 1.0}
M.2 TR109: Agent Workflow Testing
- Tool: Banterhearts agent framework (Python, multi-step workflow)
- Workflow: File scan -> data ingestion -> LLM analysis call -> LLM report generation call -> output
- Phases: Phase 1 (config transfer), Phase 2 (parameter sweep), Phase 3 (quality analysis)
- Replication: 3--5 runs per configuration
M.3 TR110: Python Multi-Agent
- Tool: Banterhearts Multi-Agent Orchestrator v2.0 (Python asyncio)
- Architecture: 2 agents (DataCollector + Insight) on dual Ollama instances (ports 11434/11435)
- Test matrix: 30 configurations x 5 runs = 150 benchmarks
- Concurrency: Python
asyncio.gather()for concurrent agent execution
M.4 TR111_v2 / TR112_v2: Rust Single-Agent and Cross-Language
- Tool: Custom Rust agent using
reqwestasync HTTP client +tokioruntime - Workflow parity: Identical to Python (file I/O, multi-stage LLM calls, metric tracking)
- Compilation: Release mode (
cargo build --release) - Replication: TR111_v2: 3 runs x 19 configs = 57; TR112_v2: combined 111 runs
M.5 TR113 / TR114_v2: Rust Multi-Agent
- Tool: Demo_rust_multiagent (Rust async/tokio)
- TR113: Single Ollama instance, 19 configurations
- TR114_v2: Dual Ollama instances, 27 configurations x 5 runs = 135
- Key change: TR114_v2 assigns each agent to a dedicated Ollama port
M.6 TR115_v2: Runtime Comparison
- Tool: Modified Demo_rust_multiagent with configurable runtime backend
- Runtimes tested: tokio-default, tokio-localset, async-std, smol, smol-1KB
- Replication: 30 configurations x 5 runtimes = 150 runs
- Compilation: Each runtime compiled as a separate binary with appropriate feature flags
M.7 TR116: Cross-Model
- Tool: Demo_rust_multiagent (tokio-default) + Python multi-agent orchestrator
- Models: gemma3:latest, llama3.1:8b-instruct-q4_0, qwen2.5:7b
- Matrix: 3 models x 2 runtimes x 2 scenarios x 5 runs = 60
Appendix N: Expanded Discussion
This appendix elaborates on the interpretive themes that emerge from the TR108--TR116 program as a whole.
N.1 The Contention Problem and Its Resolution
The most dramatic finding of the program is the contention story arc spanning TR110--TR114_v2. Python multi-agent testing (TR110) established 99.25% efficiency using dual Ollama, making it appear that concurrent LLM serving was a solved problem. TR113 then revealed that Rust multi-agent on a single Ollama instance achieved only 82.2% -- a 17-percentage-point gap that initially suggested Rust was unsuitable for concurrent workloads.
The resolution came in TR114_v2, which demonstrated that the bottleneck was Ollama's single-threaded request queue, not Rust's async runtime. With dual Ollama instances providing process-level isolation, Rust achieved 99.4% efficiency -- marginally exceeding Python's peak. This arc illustrates the danger of premature conclusions and the importance of systematic architectural debugging.
N.2 The Python Ceiling Phenomenon
TR116 established that Python never exceeds approximately 86% multi-agent efficiency across all models tested. This ceiling appears to be structural, rooted in Python's asyncio event loop overhead and GIL contention during non-I/O coordination tasks. The ceiling is observable even under favorable conditions (homogeneous configuration, dual Ollama, lightweight model).
The practical implication is that Python-based multi-agent systems carry an inherent 14+ percentage-point efficiency penalty relative to Rust. For low-concurrency deployments (2 agents), this translates to approximately 15--20% longer wall-clock time. For higher concurrency (4+ agents), the penalty compounds.
N.3 Configuration Sensitivity and Transferability
TR109 demonstrated that optimal configurations are workload-specific and cannot be transferred across modalities. This finding has broader implications for LLM operations:
- Configuration tuning must be performed per-workload, not per-model
- Auto-tuning systems must account for workload type as a first-class parameter
- Benchmark results from single-inference tests should not be applied to agent workflows without validation
Appendix O: Extended Results Narratives
O.1 TR108 Narrative
TR108 established the foundational performance landscape for local LLM inference on the RTX 4080 platform. The most consequential finding was not a single number but a relationship: model size and throughput are inversely correlated, and GPU layer allocation is the dominant parameter. Gemma 3's 34% throughput advantage over Llama 3.1 at comparable quality set the stage for its selection as the program's primary model. The 303.9 tok/s peak for Gemma 270m demonstrated that sub-1B models can achieve interactive speeds far beyond real-time requirements.
O.2 TR109 Narrative
TR109 disrupted the comfortable optimization narrative of TR108 by demonstrating that single-inference optima actively harm agent workflow performance. The +323.7% TTFT regression when transferring TR108's optimal configuration to agent tasks was a foundational negative result. It forced the program to adopt workload-specific optimization as a design principle.
O.3 TR110--TR114_v2 Narrative
The multi-agent arc (TR110 through TR114_v2) is the program's most complete investigation. It progressed from Python baseline (TR110), through Rust disappointment (TR113), to Rust redemption (TR114_v2). The critical insight -- that server-level serialization, not runtime overhead, was the bottleneck -- required architectural intervention (dual Ollama) rather than code optimization.
O.4 TR115_v2 Narrative
TR115_v2 resolved the runtime selection question by demonstrating that all four working runtimes achieve near-identical peak efficiency (99.87--99.99%), making consistency the selection criterion. Tokio-default's 1.21pp sigma -- the lowest of all runtimes -- established it as the production recommendation. The async-std catastrophic failure (50% serialization across all 150 runs) served as a cautionary tale about ecosystem compatibility.
O.5 TR116 Narrative
TR116 closed the program by varying the dimension that all prior reports held constant: model architecture. The finding that Gemma 3 achieves 99.2% chimera-homo efficiency in Rust -- the highest of any model -- while Python's best (Llama 3.1 at 83.8%) never breaks 86% efficiency, crystallized the program's central conclusion: for production multi-agent deployments, the Rust + Gemma 3 + dual Ollama + tokio-default stack is the empirically validated optimum.
Appendix P: Decision-Grade Reporting Rubric
This appendix defines the criteria by which a technical report is assessed as "decision-grade" -- that is, containing sufficient evidence, rigor, and translation to support production deployment decisions.
P.1 Rubric Criteria
| Criterion | Requirement | Grade if Met | Grade if Not Met |
|---|---|---|---|
| Replication | Minimum 3 runs per configuration | Required | Exploratory only |
| Statistical reporting | Mean, CV, min, max for all primary metrics | Required | Descriptive only |
| Measurement boundaries | Explicit inclusion/exclusion list for timing | Required | Non-reproducible |
| Falsification | At least one alternative hypothesis tested | Required | Confirmatory only |
| Workflow parity | Compared implementations perform identical work | Required for cross-language | N/A for single-language |
| Hardware documentation | Full hardware spec with GPU, CPU, RAM, OS version | Required | Non-reproducible |
| Effect size | Primary result exceeds practical significance threshold | Required for decision | Informational only |
| Artifact availability | Raw data and scripts available for re-analysis | Recommended | Acceptable without |
P.2 Program Report Grades
| TR | Grade | Limiting Factor |
|---|---|---|
| TR108 | B+ | Variable replication count (2--5 per config); no formal falsification |
| TR109 | B | Low total run count (20+); single-day collection |
| TR110 | A | 150 runs, 30 configs, 5 reps, systematic falsification |
| TR111_v2 | A | Full workflow parity, 3 reps, 19 configs, tight CV |
| TR112_v2 | A | Matched methodology, 111 runs, comprehensive comparison |
| TR113 | B+ | Low config count (19); identified bottleneck but not resolved |
| TR114_v2 | A | 135 runs, 27 configs, 5 reps, dual-Ollama validation |
| TR115_v2 | A | 150 runs, 5 runtimes, comprehensive failure analysis |
| TR116 | A- | 60 runs, 3 models x 2 runtimes; limited to 2 scenarios per combination |
Appendix Q: Decision Case Studies
This appendix presents realistic deployment scenarios and demonstrates how the six decisions (D1--D6) guide configuration choices.
Q.1 Case Study: Startup Deploying a Customer Support Agent Swarm
Context: A SaaS startup wants to deploy 2 concurrent LLM agents for customer support triage on a single GPU workstation (RTX 4080 class).
Applicable Decisions:
- D1 (Rust): Deploy agents in Rust for 15% higher throughput and 67% lower memory, enabling headroom for future scaling.
- D2 (Dual Ollama): Mandatory. Single Ollama would cause 63% contention rate, degrading customer experience.
- D3 (Tokio-default): Use
#[tokio::main]for consistent 98.7% efficiency with minimal configuration. - D4 (Gemma 3): Use gemma3:latest for best multi-agent scaling (99.2% efficiency). If reasoning depth is needed, Llama 3.1 8B is viable (96.5% efficiency).
- D6 (Config non-transfer): Do not reuse single-inference benchmarks for multi-agent tuning. Run multi-agent-specific parameter sweep.
Expected Performance: 2 agents x 114.54 tok/s x 98.28% = 225 tok/s effective throughput.
Q.2 Case Study: Research Lab Prototyping an Agent Pipeline
Context: A research lab needs to rapidly iterate on a multi-step agent pipeline (data ingestion, analysis, report generation).
Applicable Decisions:
- D1 (Python for prototyping): Use Python for development velocity; Rust migration after pipeline is stable.
- D6 (Config non-transfer): Optimize for agent workflow metrics (TTFT, end-to-end latency), not single-inference throughput. Use CTX=256-512 for agent workflows (TR109).
Expected Performance: ~99 tok/s single-agent (Python baseline). Accept the Python ceiling during prototyping.
Q.3 Case Study: Migrating a Python Agent to Rust for Production
Context: A team has a working Python agent and wants to migrate to Rust for production deployment.
Applicable Decisions:
- D1 (Rust): Expected gains: +15.2% throughput, -58% TTFT, -67% memory.
- D2 (Dual Ollama): Required for multi-agent mode.
- Follow Migration Checklist (Appendix E.2).
Timeline: Estimate 2--4 weeks for Rust re-implementation with workflow parity, 1 week for benchmarking, 1 week for burn-in. Break-even: ~20--24 months depending on request volume.
Q.4 Case Study: Selecting a Model for Multi-Agent Reasoning Tasks
Context: An application requires deep reasoning (multi-step logic, code generation) from concurrent agents.
Applicable Decisions:
- D4 (Model selection): Gemma 3 is the efficiency champion (99.2%), but Llama 3.1 8B provides stronger reasoning at 96.5% efficiency. The 2.7pp efficiency gap costs approximately 3% in effective throughput.
- Trade-off quantification: Gemma 3 at 97.3% x 102 tok/s = 99.3 effective tok/s per agent. Llama 3.1 at 96.5% x 68 tok/s = 65.6 effective tok/s per agent. For reasoning workloads, the absolute throughput difference may be less important than output quality.
Q.5 Case Study: Diagnosing a Production Performance Regression
Context: A production Rust multi-agent system reports efficiency dropping from 98% to 85%.
Applicable Decisions:
- D2 (Dual Ollama): First check: Are both Ollama instances healthy? A crashed instance causes agent routing failures.
- Use Regression Response Checklist (Appendix E.3).
- If contention is detected, verify process isolation. If TTFT is elevated, check for model unloading (keep-alive timeout).
Appendix R: Metric Definitions and Data Schema
This appendix defines the data schema used for benchmark result storage across the TR108--TR116 program.
R.1 Core Metric Schema
| Field | Type | Unit | Description | Collected In |
|---|---|---|---|---|
run_id |
string | -- | Unique identifier for the benchmark run | All TRs |
config_name |
string | -- | Configuration label (e.g., gpu80_ctx1024_temp0p6) |
All TRs |
model |
string | -- | Model identifier (e.g., gemma3:latest) |
All TRs |
language |
string | -- | Implementation language (rust or python) |
TR111+ |
runtime |
string | -- | Async runtime (e.g., tokio-default) |
TR115_v2 |
throughput_tok_s |
float | tok/s | Decode-phase throughput | All TRs |
ttft_ms |
float | ms | Time to first token | All TRs |
total_tokens |
integer | tokens | Total tokens generated | All TRs |
total_time_s |
float | seconds | Wall-clock execution time | All TRs |
parallel_efficiency_pct |
float | % | Parallel efficiency (multi-agent only) | TR110+ |
concurrency_speedup |
float | x | Speedup over sequential (multi-agent only) | TR110+ |
contention_detected |
boolean | -- | Whether contention threshold was exceeded | TR110+ |
num_gpu |
integer | layers | GPU layer allocation | All TRs |
num_ctx |
integer | tokens | Context window size | All TRs |
temperature |
float | -- | Sampling temperature | All TRs |
vram_mb |
float | MB | Peak VRAM usage during run | TR108, TR110 |
timestamp |
datetime | ISO 8601 | Run start timestamp | All TRs |
R.2 Derived Metrics
| Metric | Formula | Computed From |
|---|---|---|
cv_throughput |
std(throughput) / mean(throughput) * 100 |
Per-config run set |
efficiency_spread_pp |
max(efficiency) - min(efficiency) |
Per-config run set |
cost_per_1M_tokens |
(1e6 / throughput / 3600) * usd_per_hour |
Per-run throughput + cost rate |
memory_efficiency |
throughput / (vram_mb / 1024) |
Per-run throughput + VRAM |
Appendix S: Governance Templates
This appendix provides templates for decision governance in organizations adopting the TR108--TR116 findings.
S.1 Decision Record Template
DECISION RECORD
===============
Decision ID: DR-YYYY-NNN
Date: YYYY-MM-DD
Decision: [One-sentence summary]
Context: [Business context and constraints]
Options Evaluated:
1. [Option A] -- [Pros] / [Cons]
2. [Option B] -- [Pros] / [Cons]
Decision: [Selected option]
Evidence: [TR numbers and key metrics]
Risk: [Primary risk and mitigation]
Review Date: [Date for revisiting the decision]
Owner: [Responsible individual or team]
S.2 Benchmark Validation Template
BENCHMARK VALIDATION RECORD
============================
Validation ID: BV-YYYY-NNN
Date: YYYY-MM-DD
Hardware: [GPU, CPU, RAM, OS]
Model: [Model name and quantization]
Configuration: [num_gpu, num_ctx, temperature]
Runs Completed: [N]
Mean Throughput: [X tok/s]
CV: [Y%]
Mean Efficiency: [Z%] (if multi-agent)
Contention Rate: [W%] (if multi-agent)
Pass/Fail: [Pass if CV < 5% and efficiency > 95%]
Baseline Match: [Within 5% of TR reference value? Y/N]
Notes: [Any deviations or anomalies]
S.3 Change Approval Template
CHANGE APPROVAL REQUEST
========================
Change ID: CR-YYYY-NNN
Requested By: [Name]
Date: YYYY-MM-DD
Change: [Description: model change, config change, infrastructure change]
Impact Analysis:
- Expected throughput change: [+/- X%]
- Expected efficiency change: [+/- Y pp]
- Expected cost change: [+/- $Z/month]
Evidence: [TR numbers supporting the change]
Rollback Plan: [How to revert if change causes regression]
Approval: [Approver name and date]
Appendix T: Extended Risk Register
This appendix catalogs the principal risks associated with production deployment of the TR108--TR116 validated stack and provides likelihood, impact, and mitigation guidance for each.
T.1 Risk Register
| ID | Risk | Likelihood | Impact | Severity | Mitigation | TR Evidence |
|---|---|---|---|---|---|---|
| R01 | Ollama instance crash under load | Medium | High | High | Health-check polling every 30s; auto-restart via systemd/supervisord; failover to single-instance degraded mode | TR114_v2 |
| R02 | VRAM exhaustion with large contexts | Medium | High | High | Cap num_ctx at 2048; monitor VRAM with 1 GB headroom; alert at 10 GB / 12 GB | TR108, TR110 |
| R03 | Model update changes performance profile | Medium | Medium | Medium | Pin model version by hash; re-benchmark after any model update; maintain performance baseline record | TR108 |
| R04 | Tokio runtime version regression | Low | Medium | Low-Medium | Pin tokio version in Cargo.lock; benchmark after dependency updates; maintain CI performance gate | TR115_v2 |
| R05 | GPU thermal throttling under sustained load | Medium | Medium | Medium | Monitor GPU temperature; ensure adequate cooling; set thermal alert at 85C; reduce load at 90C | TR108 |
| R06 | Configuration drift from validated settings | High | Medium | High | Infrastructure-as-code for Ollama configs; automated validation on deployment; reject unvalidated configs | TR109 (C6) |
| R07 | Python-to-Rust migration introduces bugs | Medium | High | High | Comprehensive output comparison (10+ prompts); 7-day burn-in; parallel deployment during transition | TR112_v2 |
| R08 | Single-machine failure (no redundancy) | Low | Critical | Medium-High | Document recovery procedure; maintain cold standby; consider multi-machine deployment for critical workloads | All TRs |
| R09 | Contention return after infrastructure change | Medium | High | High | Re-run contention benchmark after any infrastructure change; maintain dual-Ollama architecture invariant | TR113, TR114_v2 |
| R10 | Model quality degradation at high concurrency | Low | Medium | Low-Medium | Periodic quality sampling; compare multi-agent output quality to single-agent reference | TR116 |
| R11 | Async runtime ecosystem fragmentation | Low | Low | Low | Standardize on tokio-default; avoid runtime-specific APIs; abstract HTTP client interface | TR115_v2 |
| R12 | Benchmark results not representative of production workload | Medium | High | High | Validate benchmarks against production traffic sample; adjust workload taxonomy (Appendix F) as needed | TR109, TR116 |
| R13 | Cold-start latency spikes after idle periods | High | Medium | Medium-High | Configure Ollama keep-alive to prevent model unloading; implement pre-warming on schedule | TR109, TR112_v2 |
| R14 | Network latency between agent and Ollama on localhost | Low | Low | Low | Use localhost binding (127.0.0.1); avoid network-attached Ollama unless necessary | All TRs |
| R15 | Quantization level mismatch between benchmark and production | Medium | Medium | Medium | Verify quantization via ollama show --modelfile; include quantization in benchmark metadata |
TR108 |
T.2 Risk Severity Matrix
Impact
Low Medium High Critical
Likelihood
High M M-H H C
Medium L-M M H H
Low L L-M M-H M-H
T.3 Top-5 Risks by Severity
- R06 (Configuration drift): High likelihood, medium impact. Mitigation: infrastructure-as-code.
- R01 (Ollama crash): Medium likelihood, high impact. Mitigation: health checks and auto-restart.
- R02 (VRAM exhaustion): Medium likelihood, high impact. Mitigation: context size cap and monitoring.
- R09 (Contention return): Medium likelihood, high impact. Mitigation: re-benchmark after changes.
- R13 (Cold-start spikes): High likelihood, medium impact. Mitigation: keep-alive configuration.
End of Appendices A through T.
Note: This work has not been externally peer reviewed. All findings represent the output of a single research program without independent expert verification.