Conclusive Report 108-116: Extended Appendices
Supplemental material extracted from the main conclusive report
| Field | Value |
|---|---|
| Report Type | Extended appendices |
| Scope | 108-116 |
| Status | Supplemental |
| Main Report | Technical_Report_Conclusive_108-116.md |
Appendix F: Workload Taxonomy
F.1 Overview
The TR108-TR116 research program evaluated LLM inference performance across four distinct workload classes. Each class exercises different system bottlenecks, rewards different optimization strategies, and produces different performance profiles. This taxonomy formalizes the workload classification that emerged empirically over nine technical reports and 903+ benchmark runs.
F.2 Workload Class 1: Single-Agent Chat Inference
Primary TR: TR108 Characteristics:
- Decode-dominant: GPU compute is the bottleneck; throughput scales linearly with decode speed.
- Single Ollama instance, single concurrent request.
- Prompt lengths: 50-200 tokens. Generation lengths: 100-500 tokens.
- Optimization target: raw throughput (tok/s) and TTFT (ms).
Key Metrics Observed:
- Gemma 3: 102.85 tok/s, TTFT ~150 ms (best configuration)
- Llama 3.1 Q4_0: 76.59 tok/s, TTFT ~200 ms
- GPU layer allocation is the single most impactful parameter (num_gpu).
- Context size optimization yields 15-20% throughput improvement.
Optimization Strategy:
- Maximize GPU offload (num_gpu = total model layers).
- Minimize context window to task requirements (512-1024 for short-form generation).
- Temperature has minimal impact on throughput; optimize for quality.
F.3 Workload Class 2: Agent Workflow
Primary TR: TR109 Characteristics:
- Multi-step execution: file I/O, data ingestion, iterative LLM calls, structured output generation.
- I/O-bound phases interleave with compute-bound LLM inference phases.
- Optimization target: end-to-end workflow latency, with TTFT as the critical sub-metric.
Key Metrics Observed:
- Smaller contexts (512-1024 tokens) outperform larger contexts (4096 tokens) for agent tasks.
- GPU layer sweet spot: 60-80 layers (vs. 999 for single inference).
- Configuration transfer from TR108 (single inference) fails: optimal single-inference parameters produce suboptimal agent workflow performance.
Optimization Strategy:
- Optimize TTFT aggressively; decode throughput is less critical because I/O phases dominate.
- Reduce context window to match actual agent state requirements.
- Avoid maximal GPU offload; moderate offload (60-80 layers) balances TTFT and throughput.
F.4 Workload Class 3: Multi-Agent Concurrent
Primary TRs: TR110, TR113, TR114 Characteristics:
- Two or more agents executing in parallel, each issuing LLM requests.
- Resource contention for GPU, VRAM, and Ollama server capacity.
- Optimization target: parallel efficiency (speedup / N agents) and contention rate.
Key Metrics Observed:
- Python (TR110): 99.25% efficiency with dual Ollama, homogeneous Chimera agents.
- Rust single Ollama (TR113): 82.2% efficiency, 63% contention rate.
- Rust dual Ollama (TR114): 99.396% peak config efficiency, contention reduced to <1%.
- GPU layers >= 80 required for contention-free concurrent execution on 12 GB VRAM.
Optimization Strategy:
- Dual Ollama instances are mandatory for contention-free multi-agent execution.
- Homogeneous agent configurations outperform heterogeneous configurations.
- Context size 2048 achieves highest speedups in homogeneous scenarios.
- Architecture (dual Ollama) matters more than language choice (Rust vs Python).
F.5 Workload Class 4: Cross-Model Multi-Agent
Primary TR: TR116 Characteristics:
- Identical multi-agent scenarios executed across different model architectures.
- Isolates model choice as an independent variable while holding runtime and infrastructure constant.
- Reveals model-specific scaling behavior under concurrent load.
Key Metrics Observed:
- Gemma 3 (Rust): 97.3% efficiency, 99.2% in chimera-homo -- the scaling champion.
- Llama 3.1 (Rust): 96.5% efficiency, 98.5% in chimera-homo.
- Qwen 2.5 (Rust): 90.0% efficiency -- heavier coordination overhead due to larger KV cache.
- Python ceiling: no model exceeds 86% efficiency regardless of model architecture.
Optimization Strategy:
- Model selection must account for multi-agent scaling characteristics, not just single-agent throughput.
- Gemma 3 is optimal for agent swarms; Llama 3.1 is viable for reasoning-heavy tasks.
- Qwen 2.5's lower efficiency is acceptable only for specialized reasoning workloads.
F.6 Workload Class 5: Batch Processing (Future Work)
Not tested in TR108-TR116. Hypothesized Characteristics:
- Offline processing of large prompt queues with no latency constraint.
- Optimization target: total throughput (requests/hour), not per-request latency.
- Expected to favor maximal GPU offload and large batch sizes.
- Dual Ollama likely beneficial for throughput doubling.
F.7 Workload Characteristics Summary
| Characteristic | Single-Agent | Agent Workflow | Multi-Agent | Cross-Model |
|---|---|---|---|---|
| Primary bottleneck | GPU decode | I/O + TTFT | Contention | Model-dependent |
| Optimal num_gpu | Maximum (999) | Moderate (60-80) | >= 80 | >= 80 |
| Optimal num_ctx | Task-matched | Minimal (256-512) | 2048 | 2048 |
| TTFT criticality | Medium | High | Medium | Medium |
| Throughput criticality | High | Medium | Medium | Medium |
| Efficiency target | N/A | N/A | > 95% | > 95% |
| Contention risk | None | None | High (single Ollama) | Model-dependent |
| Key TR | TR108 | TR109 | TR110, TR113, TR114 | TR116 |
| Config transferable? | Baseline | No (from TR108) | No (from TR109) | No (model-specific) |
Appendix H: Operational Playbooks
H.1 Pre-Deployment Playbook
Step 1: Dual Ollama Instance Setup
- Install Ollama (version pinned to the tested release, e.g., v0.1.17 or later validated version).
- Configure Instance 1 on default port 11434:
- Set
OLLAMA_HOST=127.0.0.1:11434 - Start:
ollama serve
- Set
- Configure Instance 2 on secondary port 11435:
- Set
OLLAMA_HOST=127.0.0.1:11435 - Start:
ollama serve
- Set
- Verify both instances respond:
curl http://127.0.0.1:11434/api/versioncurl http://127.0.0.1:11435/api/version
- Pull the target model on both instances:
- Instance 1:
OLLAMA_HOST=127.0.0.1:11434 ollama pull gemma3:latest - Instance 2:
OLLAMA_HOST=127.0.0.1:11435 ollama pull gemma3:latest
- Instance 1:
Step 2: Rust Binary Compilation and Deployment
- Ensure Rust toolchain is installed (rustup, stable channel >= 1.82.0).
- Build release binary:
cargo build --release - Verify binary size and linkage:
ls -la target/release/<binary_name> - Validate runtime:
./target/release/<binary_name> --version - Deploy binary to target directory alongside configuration files.
Step 3: Configuration Validation Procedure
- Load configuration file (YAML or TOML).
- Validate all required fields:
ollama_host,ollama_port,model,num_gpu,num_ctx,temperature. - Confirm dual Ollama ports are specified for multi-agent configurations.
- Verify GPU layer count does not exceed model layer count.
- Confirm VRAM budget: sum of model memory across both instances must be less than 12 GB (for RTX 4080 Laptop).
Step 4: Warmup Sequence
- Issue 3 sequential requests to Ollama Instance 1 with a short prompt (e.g., "Hello").
- Issue 3 sequential requests to Ollama Instance 2 with the same prompt.
- Discard warmup results; do not include in benchmark data.
- Rationale: First requests incur model loading and KV-cache allocation overhead. Three warmup requests ensure stable GPU state.
- Verify post-warmup TTFT is within expected range (< 500 ms for Gemma 3 on RTX 4080).
H.2 Monitoring Playbook
Key Metrics to Track
| Metric | Unit | Source | Collection Frequency |
|---|---|---|---|
| Parallel efficiency | % | Benchmark harness | Per-run |
| TTFT | ms | Ollama API response | Per-request |
| Throughput | tok/s | Ollama API response | Per-request |
| Contention rate | % | TTFT anomaly detection | Per-run |
| VRAM usage | MB | nvidia-smi | 5-second intervals |
| GPU utilization | % | nvidia-smi | 5-second intervals |
| CPU utilization | % | OS metrics | 5-second intervals |
| Memory (RSS) | MB | Process metrics | Per-run |
Alert Thresholds
| Metric | Warning | Critical | Action |
|---|---|---|---|
| Efficiency | < 95% | < 90% | Check Ollama instances, VRAM pressure |
| TTFT | > 1000 ms | > 2000 ms | Check model loading, GPU thermal state |
| Throughput | < 90 tok/s | < 80 tok/s | Check GPU throttling, model quantization |
| Contention rate | > 2% | > 5% | Verify dual Ollama, check port conflicts |
| VRAM usage | > 10 GB | > 11 GB | Reduce GPU layers, check for memory leaks |
| GPU temperature | > 80 C | > 85 C | Check cooling, reduce sustained load |
Dashboard Recommendations
- Primary panel: real-time efficiency trend (line chart, 1-minute resolution).
- Secondary panel: TTFT distribution (histogram, per-model breakdown).
- Tertiary panel: throughput time series with contention event overlay.
- Auxiliary panel: VRAM and GPU utilization gauges.
- Tool recommendation: Grafana with Prometheus metrics exporter, or custom JSON-based logging with dashboard frontend.
H.3 Regression Response Playbook
Symptom-Root Cause-Fix Mapping
| Symptom | Likely Root Cause | Diagnostic Command | Fix |
|---|---|---|---|
| Efficiency < 90% | Single Ollama instance | Check port 11435 availability | Restart second Ollama instance |
| Efficiency < 90% | VRAM pressure | nvidia-smi (check memory) |
Reduce num_gpu or switch to smaller model |
| Efficiency < 90% | Configuration drift | Diff current vs baseline config | Restore validated configuration |
| Throughput < 80 tok/s | Model not fully loaded to GPU | Check num_gpu setting | Increase GPU layer offload |
| Throughput < 80 tok/s | GPU thermal throttling | nvidia-smi -q -d TEMPERATURE |
Allow cooldown, improve airflow |
| Throughput < 80 tok/s | Background GPU workload | Check GPU process list | Terminate competing processes |
| TTFT > 2000 ms | Cold model load | Check Ollama logs for "loading model" | Run warmup sequence |
| TTFT > 2000 ms | Context window too large | Check num_ctx setting | Reduce to 512-1024 |
| Contention > 5% | Single Ollama serving both agents | Check agent-to-instance mapping | Ensure dedicated Ollama per agent |
| Contention > 5% | Port conflict | `netstat -an | grep 1143` |
Appendix J: Traceability Map (Decisions to TRs)
J.1 Decision 1: Adopt Rust for Production Agent Workflows
| Supporting TR | Evidence | Strength |
|---|---|---|
| TR111 v2 | Rust achieves full workflow parity with Python; 114.54 tok/s baseline throughput | Primary |
| TR112 v2 | Head-to-head comparison: +15.2% throughput, -58% TTFT, -67% memory, -83% startup | Primary |
| TR114 v2 | Rust multi-agent achieves 99.396% peak efficiency with dual Ollama | Confirmatory |
| TR116 | Rust advantage holds across all three models (Gemma, Llama, Qwen); +12-17pp over Python | Confirmatory |
Decision logic: TR112 v2 provides the definitive single-agent comparison. TR114 v2 confirms that Rust's advantages are preserved in multi-agent scenarios when architectural bottlenecks (single Ollama) are removed. TR116 confirms the advantage is model-independent.
J.2 Decision 2: Deploy Dual Ollama Architecture for Multi-Agent Workloads
| Supporting TR | Evidence | Strength |
|---|---|---|
| TR110 | Python dual Ollama achieves 99.25% efficiency (ports 11434/11435) | Foundational |
| TR113 | Single Ollama bottleneck: 82.2% efficiency, 63% contention rate in Rust | Problem identification |
| TR114 v2 | Dual Ollama fix: efficiency jumps from 82.2% to 99.396%, contention drops to <1% | Solution validation |
Decision logic: TR113 empirically demonstrated that single Ollama serializes concurrent requests, capping efficiency at 82.2%. TR114 v2 proved that adding a second Ollama instance eliminates the bottleneck entirely. TR110 had already established dual Ollama as viable in Python. The improvement (17+ percentage points) is unambiguous.
J.3 Decision 3: Use Tokio-Default as the Async Runtime
| Supporting TR | Evidence | Strength |
|---|---|---|
| TR115 v2 | 5-runtime benchmark: tokio-default achieves 98.72% mean efficiency with 1.21pp standard deviation (best consistency) | Primary |
Decision logic: All four working runtimes achieve approximately 100% peak efficiency (99.87-99.99%), making peak performance irrelevant for the decision. The differentiator is consistency: tokio-default (1.21pp sigma) and smol-1kb (1.32pp sigma) are the only production-viable options. Tokio-default wins on consistency, ecosystem maturity, and zero-configuration deployment (#[tokio::main]). Tokio-localset (4.03pp sigma, 81.03% minimum) is too variable. Smol (72.80% pathological failure) and async-std (50% perfect serialization) are disqualified.
J.4 Decision 4: Select Gemma 3 as the Default Production Model
| Supporting TR | Evidence | Strength |
|---|---|---|
| TR108 | Gemma 3 delivers 34% higher throughput than Llama 3.1 Q4_0 (102.85 vs 76.59 tok/s) | Foundational |
| TR116 | Gemma 3 achieves 99.2% chimera-homo efficiency (Rust), 97.3% baseline-vs-chimera -- highest of all models | Confirmatory |
Decision logic: Gemma 3 leads on both single-agent throughput (TR108) and multi-agent scaling efficiency (TR116). Llama 3.1 is a viable alternative for reasoning-heavy workloads (96.5% Rust efficiency) but trails on raw throughput. Qwen 2.5 is not recommended for multi-agent use (90.0% efficiency, heavier coordination overhead).
J.5 Decision 5: Acknowledge and Document the Python Efficiency Ceiling
| Supporting TR | Evidence | Strength |
|---|---|---|
| TR116 | Python never exceeds 86% efficiency across all three models; structural limitation of asyncio event loop | Primary |
| TR113 vs TR114 | Even after architectural fixes (dual Ollama), Rust outperforms Python by 12-17pp in multi-agent scenarios | Confirmatory |
Decision logic: TR116 establishes that the Python efficiency ceiling (~86%) is model-independent -- it appears with Gemma 3 (80.2%), Llama 3.1 (83.8%), and Qwen 2.5 (77.6%). This ceiling is structural: Python's single-threaded asyncio event loop saturates under concurrent LLM coordination overhead. No amount of Python optimization can exceed this ceiling; migration to Rust is the only path to >90% efficiency.
J.6 Decision 6: Configuration Transfer Between Workload Types Fails
| Supporting TR | Evidence | Strength |
|---|---|---|
| TR108 vs TR109 | Single-inference optimal (num_gpu=999, num_ctx=4096) degrades agent workflow performance; agent-optimal is num_gpu=60-80, num_ctx=256-512 | Primary |
| TR109 vs TR110 | Agent workflow optimal does not transfer to multi-agent concurrent; multi-agent requires num_ctx=2048, homogeneous configs | Confirmatory |
Decision logic: Each workload class has distinct optimization characteristics. Blindly applying single-inference parameters to agent workflows or multi-agent scenarios produces measurably suboptimal results. This finding has broad implications: any deployment must benchmark its specific workload type rather than relying on published benchmarks from different workload classes.
Appendix K: Extended Literature Context
K.1 Rust Async Ecosystem
Tokio Architecture. Tokio is a multi-threaded async runtime for Rust built on a work-stealing scheduler. Worker threads maintain local task queues; when a thread's queue is empty, it steals tasks from other threads' queues. This design minimizes contention on shared queues while maximizing CPU utilization. For LLM inference workloads, Tokio's work-stealing approach introduces measurable but small overhead compared to thread-pinning approaches (TR115 v2 quantifies this at < 2pp efficiency delta).
Reqwest HTTP Client. The Rust HTTP client used for Ollama API communication. Reqwest is built on Tokio and hyper, providing connection pooling, HTTP/1.1 and HTTP/2 support, and async streaming. In TR114 v2, reqwest's async buffering model was identified as a contributor to coordination overhead, adding latency compared to Python's httpx during concurrent request patterns.
K.2 Ollama Internals
Model Loading. Ollama loads GGUF-format models into GPU memory layer by layer. The num_gpu parameter controls how many transformer layers are offloaded to GPU (remaining layers execute on CPU). Full offload (all layers on GPU) maximizes throughput but increases VRAM pressure. For the RTX 4080 Laptop (12 GB), Gemma 3 (4.3B Q4_K_M, 3.3 GB base memory) can be fully offloaded with room for KV cache.
Concurrent Request Handling. A single Ollama instance serializes inference requests: concurrent requests are queued and processed sequentially. This is the root cause of the contention observed in TR113. Dual Ollama instances eliminate this serialization by providing independent inference pipelines.
GGUF Format. GGUF (GPT-Generated Unified Format) is the file format used by Ollama for quantized model weights. Q4_K_M quantization (4-bit with K-means clustering, medium quality) was used across all TR108-TR116 experiments, providing a balance of model size, inference speed, and output quality.
K.3 LLM Multi-Agent Patterns
Concurrent Coordination. Multi-agent LLM systems require coordination between agents sharing GPU resources. The key insight from TR110-TR114 is that infrastructure architecture (single vs dual Ollama) dominates over language runtime choice (Rust vs Python) in determining parallel efficiency. Coordination overhead in multi-agent LLM workloads is fundamentally I/O-bound (HTTP requests to Ollama), not compute-bound.
Parallel Efficiency Measurement. Parallel efficiency is defined as (speedup / N) x 100%, where speedup = sequential_time / concurrent_time and N is the number of agents. An efficiency of 100% indicates zero coordination overhead. Values above 95% are considered excellent for real-world systems with shared resources.
K.4 Python Asyncio Limitations
Single-Threaded Event Loop. Python's asyncio runs all coroutines on a single thread. While the GIL is released during I/O operations (enabling true concurrency for network calls), the event loop itself is single-threaded: task scheduling, callback dispatch, and coroutine resumption all execute sequentially on one core. Under high coordination load (many concurrent agents exchanging data), this single-threaded scheduler becomes the bottleneck, producing the ~86% efficiency ceiling observed in TR116.
GIL Implications. The Global Interpreter Lock prevents true parallel execution of Python bytecode. For I/O-bound LLM inference (where most time is spent waiting for Ollama responses), the GIL is largely irrelevant because it is released during network I/O. However, the GIL does affect CPU-bound coordination tasks (data serialization, result aggregation), contributing to the observed efficiency gap versus Rust.
K.5 Quantization Impact
Q4_K_M Characteristics. All experiments used Q4_K_M quantization, which represents model weights using 4-bit integers with K-means clustering for improved accuracy. This quantization level typically preserves 95-98% of full-precision model quality while reducing model size by approximately 4x and improving inference throughput by 2-3x. The quality-performance tradeoff was deemed acceptable for the gaming dialogue use case (TR108).
Appendix L: Measurement Boundary Catalog
L.1 Boundary Definitions
Each technical report defines an explicit measurement boundary that determines which operations are included in timing measurements and which are excluded. Maintaining clear boundaries is essential for reproducibility and cross-TR comparability.
L.2 Per-TR Boundary Table
| TR | Measurement Boundary | Included in Timing | Excluded from Timing |
|---|---|---|---|
| TR108 | Ollama API call (single inference) | Model inference, TTFT, token generation, response deserialization | Process startup, model loading, Ollama instance startup |
| TR109 | Agent workflow end-to-end | File I/O (scan, read), LLM calls (analysis + report), data processing, output writing | Process isolation overhead, Ollama instance management |
| TR110 | Concurrent execution window | Both agents' full workflows (I/O + LLM), asyncio coordination | Sequential baseline overhead, Ollama instance startup/teardown |
| TR111 | Rust workflow end-to-end | File system scan, data ingestion, LLM calls (analysis + report), metric collection | Binary compilation time, Ollama instance management |
| TR112 | Cross-language comparison | Identical workflow on identical inputs; file I/O, LLM calls, output generation | Language-specific setup (Python venv, Rust compilation), Ollama management |
| TR113 | Rust concurrent (single Ollama) | Both agents' full workflows, shared Ollama instance, contention events | Ollama instance startup, binary startup |
| TR114 | Rust concurrent (dual Ollama) | Both agents' full workflows, dedicated Ollama instances, coordination overhead | Ollama instance startup/teardown, binary compilation |
| TR115 | Runtime comparison | Identical multi-agent workflow across 5 Rust async runtimes | Runtime-specific initialization, binary compilation per runtime |
| TR116 | Cross-model comparison | Same multi-agent scenarios with Gemma 3, Llama 3.1, Qwen 2.5 | Model download, model format conversion, Ollama setup |
L.3 Boundary Rationale
The measurement boundaries were designed to isolate the variable under study in each TR:
- TR108-TR109: Isolate inference and workflow performance from infrastructure setup.
- TR110, TR113-TR114: Isolate concurrency behavior from sequential execution characteristics.
- TR112: Isolate language runtime effects by ensuring identical workflow and inputs.
- TR115: Isolate async runtime effects by ensuring identical workflow, model, and hardware.
- TR116: Isolate model architecture effects by ensuring identical runtime, infrastructure, and scenarios.
L.4 Cross-TR Comparability Notes
- TR108 and TR109 measurements are not directly comparable because TR109 includes file I/O and multi-step workflows that TR108 excludes.
- TR110 (Python) and TR114 (Rust) measurements are comparable because both use dual Ollama and identical agent workflow structure.
- TR113 and TR114 measurements differ in architecture (single vs dual Ollama) but are otherwise comparable, enabling the architecture impact analysis.
- TR115 measurements are internally comparable (same hardware, model, workflow; only runtime varies).
- TR116 measurements are internally comparable (same runtime, hardware, architecture; only model varies).
Appendix N: Expanded Discussion
N.1 The Python-to-Rust Migration Narrative
The migration from Python to Rust succeeded because of three factors that are often underestimated in language migration discussions. First, the LLM inference workload is I/O-bound at the application layer (HTTP calls to Ollama) but compute-bound at the server layer (GPU inference). This means Rust's compile-time guarantees and zero-cost abstractions provide operational benefits (memory safety, deployment simplicity, startup speed) without sacrificing the throughput that is determined by the GPU. Second, the Tokio async ecosystem provides mature, production-grade HTTP client and runtime libraries that match Python's asyncio/httpx ecosystem in functionality. Third, the workflow complexity (file I/O, data processing, LLM calls) translates naturally to Rust's ownership model without requiring complex lifetime annotations or unsafe code.
The 15.2% throughput improvement observed in TR112 v2 likely reflects Rust's more efficient HTTP request/response handling and reduced per-request overhead, not faster GPU inference (which is identical regardless of client language).
N.2 The Architecture Discovery
The most consequential finding of the TR108-TR116 program is that dual Ollama architecture matters more than language choice for multi-agent performance. TR113 demonstrated that Rust with single Ollama achieves only 82.2% efficiency -- worse than Python with dual Ollama (99.25%, TR110). TR114 proved that Rust with dual Ollama matches or exceeds Python (99.396% vs 99.25%). The implication is clear: architectural decisions (how many inference server instances to deploy) dominate over implementation decisions (which programming language to use) for multi-agent LLM workloads. This finding generalizes beyond the specific Rust-vs-Python comparison.
N.3 The Runtime Lesson: Consistency Beats Peak Performance
TR115 v2 demonstrated that all four working Rust async runtimes achieve approximately identical peak performance (99.87-99.99% efficiency). The differentiator is consistency: tokio-default (1.21pp sigma) provides the most reliable performance, while tokio-localset (4.03pp sigma, 81.03% minimum) shows dangerous variance. In production systems, a consistently good runtime is preferable to one that occasionally achieves 99.99% but drops to 81% unpredictably. This principle -- consistency over peak -- should guide runtime and configuration selection broadly.
N.4 The Model Selection Paradox
TR116 revealed that model performance rankings depend on the measurement context. Gemma 3 leads on throughput (102.85 tok/s) and multi-agent efficiency (99.2% chimera-homo). However, Llama 3.1, despite lower throughput (~68 tok/s in TR116 multi-agent testing), scales nearly as well under concurrency (98.5% chimera-homo) and may offer superior reasoning quality for complex agent tasks. Qwen 2.5, a 7B model, underperforms both smaller models in multi-agent efficiency (90.0%) due to heavier KV cache pressure. The paradox: the "best" model depends on whether you optimize for throughput, efficiency, reasoning quality, or a weighted combination.
N.5 Configuration Transfer Failure as a General Principle
The finding that inference-optimal, agent-optimal, and multi-agent-optimal configurations differ (TR108 vs TR109 vs TR110) is a specific instance of a general principle: benchmarks transfer poorly across workload types. This has implications far beyond the TR108-TR116 program. Any team deploying LLM-based systems should benchmark their specific workload type rather than relying on published benchmarks, vendor datasheets, or results from different workload classes.
N.6 Implications for the Broader LLM Serving Community
The TR108-TR116 findings suggest several implications for the community: (1) Multi-instance inference servers should be the default for concurrent workloads, not an optimization. (2) Language runtime choice matters less than architecture for I/O-bound coordination workloads. (3) Model selection for multi-agent systems requires multi-agent benchmarks, not single-agent benchmarks. (4) Async runtime consistency should be evaluated alongside peak performance.
Appendix O: Extended Results Narratives
O.1 TR108: Single-Agent Baseline Establishes Model Performance Hierarchy
TR108 established the foundational performance hierarchy through 158+ benchmark configurations. Gemma 3 (4.3B parameters, Q4_K_M) delivered 102.85 tok/s at optimal configuration, outperforming Llama 3.1 8B (Q4_0) at 76.59 tok/s by 34%. This result was attributed to Gemma 3's smaller parameter count and more efficient attention architecture, both of which reduce per-token compute requirements. The study identified GPU layer allocation (num_gpu) as the most impactful parameter, followed by context size (num_ctx). Temperature had minimal throughput impact but significant quality effects. The 158 configurations provided high statistical confidence in the rankings and identified the parameter sensitivity landscape that would inform all subsequent TRs.
O.2 TR109: Configuration Transfer Failure Reveals Workload-Specific Optimization
TR109 applied TR108's optimal single-inference configurations to multi-step agent workflows and observed measurable performance degradation. The critical discovery: maximal GPU offload (num_gpu=999) and large context windows (num_ctx=4096), which maximize single-inference throughput, are suboptimal for agent workflows where TTFT and I/O interleaving dominate. Agent-optimal configurations (num_gpu=60-80, num_ctx=256-512, temp=0.6) prioritize fast first-token delivery and moderate resource utilization. This finding invalidated the assumption that single-inference benchmarks are sufficient for production deployment planning and motivated the multi-agent studies that followed.
O.3 TR110: Python Achieves Near-Theoretical Multi-Agent Ceiling
TR110 demonstrated that Python asyncio with dual Ollama instances achieves 99.25% parallel efficiency for two concurrent agents -- within 0.75 percentage points of the theoretical maximum (2.0x speedup). The study used 150 benchmark runs across 30 configurations with Gemma 3 on dual Ollama instances (ports 11434/11435). Key findings included: (1) homogeneous agent configurations outperform heterogeneous, (2) context size of 2048 tokens maximizes concurrent efficiency, (3) temperature has negligible impact on concurrency (delta < 3%), and (4) GPU layers >= 80 are required for zero-contention operation. TR110 established the Python multi-agent baseline that all subsequent Rust comparisons referenced.
O.4 TR111: Rust Achieves Full Workflow Parity with Near-Hardware-Limit Throughput
TR111 v2 validated that the Rust agent implementation achieves full workflow parity with Python: identical operations (file system scan, data ingestion, multi-stage LLM analysis, report generation, metric tracking) executed with equivalent correctness. Performance: 114.54 tok/s baseline throughput (Rust) versus approximately 99.34 tok/s (Python), representing a 15.2% advantage. Throughput variation across 19 configurations was remarkably low (0.9%), indicating that the GPU inference speed, not client-side processing, determines throughput. TTFT showed 150x more variation than throughput, confirming it as the primary optimization target for agent workflows.
O.5 TR112: Head-to-Head Comparison Confirms Rust's Systematic Advantages
TR112 v2 provided the definitive cross-language comparison using 37 configurations (19 Rust, 18 Python) and 111 benchmark runs. Results across all metrics favored Rust: +15.2% throughput (114.54 vs 99.34 tok/s), -58% TTFT (603 vs 1437 ms cold start), -67% memory usage (75 vs 250 MB estimated), -83% startup time (0.2 vs 1.5 seconds). Rust also demonstrated higher optimization success rate (72.2% vs 38.9%) and lower variance (2.6% vs 4.8% CV). The study estimated direct GPU compute savings of approximately $444/year at 1M requests/month (single agent) with approximately 11-month break-even on a $5,000 migration investment (see Appendix V for detailed calculations).
O.6 TR113: Single Ollama Bottleneck Reveals Architectural, Not Language, Limitation
TR113 tested Rust multi-agent with a single Ollama instance and observed 82.2% peak efficiency with 63% contention rate -- dramatically worse than Python's 99.25% (TR110, dual Ollama). The initial interpretation was that Rust's async runtime introduced concurrency overhead. The correct interpretation, established by TR114, was that the single Ollama instance serialized concurrent requests regardless of client language. TR113's value was diagnostic: it identified the architectural bottleneck (Ollama server-level serialization) that TR114 would solve. The 63% contention rate and 82.2% efficiency ceiling became the baseline for measuring the dual Ollama improvement.
O.7 TR114: Dual Ollama Fix Transforms Rust Multi-Agent Performance
TR114 v2 deployed dual Ollama instances for Rust multi-agent and achieved 99.396% peak configuration efficiency (135 runs, 27 configurations). This represented a 17.2 percentage point improvement over TR113's single-Ollama results (82.2%) and matched Python's TR110 results (99.25%). The contention rate dropped from 63% to less than 1%. Overall average efficiency was 98.281% across all configurations -- robust and not dependent on specific parameter tuning. This TR confirmed the hypothesis from TR113: the bottleneck was architectural (single Ollama serialization), not language-related (Rust async overhead).
O.8 TR115: Runtime Deep Dive Shows Consistency Matters More Than Peak
TR115 v2 benchmarked five Rust async runtimes (tokio-default, tokio-localset, smol, smol-1kb, async-std) with 150 runs. All four functional runtimes achieved near-identical peak efficiency (99.87-99.99%), rendering peak performance meaningless for runtime selection. The decisive metric was consistency: tokio-default (98.72% mean, 1.21pp sigma) outperformed all alternatives on reliability. Tokio-localset achieved the highest single-run peak (99.99%) but exhibited dangerous variance (4.03pp sigma, 81.03% minimum). Smol suffered a pathological failure (72.80% on one run). Async-std achieved exactly 50% on all runs due to a Tokio HTTP bridge conflict causing perfect serialization. The conclusion: for production, choose the most consistent runtime, not the one with the highest peak.
O.9 TR116: Cross-Model Validation Confirms Universality
TR116 tested three models (Gemma 3, Llama 3.1, Qwen 2.5) across both runtimes (Rust/Tokio, Python/asyncio) with 60 benchmark runs. Rust outperformed Python for every model tested: Gemma 3 (97.3% vs 80.2%), Llama 3.1 (96.5% vs 83.8%), Qwen 2.5 (90.0% vs 77.6%). The Rust advantage ranged from 12 to 17 percentage points, confirming it as a structural runtime benefit rather than a model-specific artifact. Python never exceeded 86% efficiency regardless of model, establishing the asyncio ceiling as a universal constraint. Gemma 3 emerged as the scaling champion (99.2% chimera-homo efficiency in Rust), while Qwen 2.5 showed the highest coordination overhead, likely due to heavier KV cache requirements.
Appendix P: Decision-Grade Reporting Rubric
P.1 Checklist
A benchmark report qualifies as "decision-grade" when it satisfies all seven criteria below. Reports that fail any criterion should be treated as exploratory or provisional.
-
Measurement boundary explicitly defined. The report must state what operations are included in timing measurements and what operations are excluded. See Appendix L for the boundary catalog.
-
Artifact chain from raw data to conclusion. Every quantitative claim must be traceable to raw benchmark data files (CSV, JSON). The report must specify file paths or artifact identifiers for all referenced data.
-
Statistical validation (n >= 3 runs). Each configuration must be tested with a minimum of 3 independent runs. Reports must include mean, standard deviation, and coefficient of variation for all key metrics. Single-run results are insufficient for decision-making.
-
Limitation acknowledgment. The report must explicitly state hardware constraints, workload scope limitations, and conditions under which results may not generalize. Absence of limitations indicates incomplete analysis.
-
Decision translation. The report must translate numerical findings into actionable decisions. Raw numbers (e.g., "throughput is 114.54 tok/s") are insufficient; the report must state what decision the number supports (e.g., "Rust is recommended for production because its 15.2% throughput advantage reduces per-request cost").
-
Invalidation triggers stated. The report must identify conditions that would invalidate its conclusions. Examples: "Results may not hold for GPUs with less than 12 GB VRAM," or "Conclusions assume Ollama version 0.1.17; breaking changes in newer versions require re-validation."
-
Cross-validation with independent evidence. At least one key finding should be corroborated by evidence from a different TR or experimental setup. Single-TR conclusions are provisional; cross-validated conclusions are decision-grade.
P.2 Application to TR108-TR116
| TR | Criteria Met | Notes |
|---|---|---|
| TR108 | 7/7 | 158+ configs, extensive parameter sweeps, cross-validated by TR109 |
| TR109 | 7/7 | 20+ configs, 3 phases, directly cross-validates TR108 config transfer failure |
| TR110 | 7/7 | 150 runs, 30 configs, 5 runs each, cross-validated by TR114 |
| TR111 v2 | 7/7 | 57 runs, 19 configs, 3 runs each, cross-validated by TR112 v2 |
| TR112 v2 | 7/7 | 111 runs, 37 configs, direct Python-Rust comparison |
| TR113 | 6/7 | Missing cross-validation at time of publication; later validated by TR114 |
| TR114 v2 | 7/7 | 135 runs, 27 configs, 5 runs each, cross-validates TR113 |
| TR115 v2 | 7/7 | 150 runs, 5 runtimes, extensive statistical analysis |
| TR116 | 7/7 | 60 runs, 3 models x 2 runtimes, cross-validates TR114 |
Appendix Q: Decision Case Studies
Q.1 "Should we migrate our Python agent to Rust?"
Context: A team runs a single-agent LLM workflow (file processing + inference) on Python/asyncio with Ollama. They experience occasional throughput dips and high memory usage.
Evidence: TR112 v2 shows Rust delivers +15.2% throughput, -58% TTFT, -67% memory, -83% startup versus Python for identical workflows.
Decision: Yes, if throughput or multi-agent efficiency matters for the deployment. The migration investment breaks even in approximately 20 months at 1M requests/month (TR112 v2 cost analysis). If the team plans to scale to multi-agent, the migration becomes even more compelling (Python ceiling of ~86% vs Rust's 98%+).
Caveat: If the team prioritizes development velocity and the workload is single-agent with no scaling plans, Python may remain preferable.
Q.2 "We are deploying 2 concurrent agents -- single or dual Ollama?"
Context: A deployment runs two LLM agents concurrently on a single GPU (RTX 4080 or similar, 12 GB VRAM).
Evidence: TR113 shows single Ollama produces 82.2% efficiency with 63% contention. TR114 shows dual Ollama produces 99.396% efficiency with <1% contention. The improvement is 17+ percentage points.
Decision: Always dual Ollama. There is no scenario in which single Ollama is preferable for two concurrent agents. The only cost is the additional memory overhead of a second Ollama process (minimal, as model weights are the dominant memory consumer and are loaded per-instance into VRAM).
Caveat: Ensure sufficient VRAM for two model instances. For the RTX 4080 (12 GB), Gemma 3 Q4_K_M (3.3 GB per instance) fits comfortably. Larger models may require reducing GPU layer offload.
Q.3 "Which async runtime should we use?"
Context: A Rust team is choosing an async runtime for their LLM agent system.
Evidence: TR115 v2 benchmarked 5 runtimes across 150 runs. Tokio-default achieves 98.72% mean efficiency with 1.21pp standard deviation. All alternatives are either less consistent (tokio-localset: 4.03pp sigma) or exhibit pathological failures (smol: 72.80% minimum; async-std: 50% serialization).
Decision: Tokio-default, with no exceptions. It provides the best consistency, the most mature ecosystem, and requires no custom configuration. Use #[tokio::main] and the standard Tokio runtime.
Caveat: If binary size is the primary constraint and the team accepts slightly higher variance, smol-1kb (1.32pp sigma) is a viable alternative.
Q.4 "Gemma or Llama for our agent swarm?"
Context: A team is selecting a model for a multi-agent swarm (2+ concurrent agents) on consumer GPU hardware.
Evidence: TR116 shows Gemma 3 achieves 99.2% chimera-homo efficiency and 102.85 tok/s throughput. Llama 3.1 achieves 98.5% chimera-homo efficiency but only 68 tok/s throughput. Gemma leads on both metrics.
Decision: Gemma 3 for throughput-sensitive workloads (gaming dialogue, high-frequency queries). Llama 3.1 for reasoning-heavy workloads where output quality is more important than speed. The 34% throughput difference is substantial; the 0.7pp efficiency difference is negligible.
Caveat: Quality was not formally measured in TR108-TR116 (see Risk R10). If reasoning quality is mission-critical, conduct a quality evaluation before committing to Gemma 3.
Q.5 "Can we just optimize our Python code instead of migrating?"
Context: A team observes suboptimal multi-agent efficiency in Python and considers code optimization instead of Rust migration.
Evidence: TR116 shows Python never exceeds 86% efficiency regardless of model (Gemma: 80.2%, Llama: 83.8%, Qwen: 77.6%). This ceiling is structural: asyncio's single-threaded event loop saturates under concurrent coordination load. TR113 vs TR114 shows that even architectural fixes (dual Ollama) cannot push Python past this ceiling.
Decision: No. The ~86% ceiling is a runtime limitation, not a code quality issue. Optimizing Python code may improve performance within the ceiling (e.g., from 77% to 84%) but cannot break through it. Only a runtime migration (to Rust/Tokio or another multi-threaded runtime) enables >90% efficiency.
Caveat: If the workload is single-agent (no concurrency), the Python ceiling is irrelevant and Python optimization may be sufficient.
Appendix S: Governance Templates
S.1 Benchmark Report Template
# Technical Report [NNN]: [Title]
## [Subtitle]
**Date:** YYYY-MM-DD
**Hardware:** [GPU, CPU, RAM, OS]
**Total Runs:** [N configurations x M runs]
**Model(s):** [model:quantization]
**Related Work:** [List of prior TRs]
## Executive Summary
[3-5 sentence summary of findings and their implications]
## Methodology
- Measurement boundary: [What is included/excluded]
- Statistical protocol: [N runs per config, metrics collected]
- Hardware configuration: [Full spec table]
## Results
[Tables and analysis with mean, std dev, CV for all metrics]
## Limitations
[Hardware constraints, workload scope, generalization caveats]
## Conclusions and Decisions
[Actionable decisions with invalidation triggers]
## Appendices
[Raw data references, configuration files, artifact paths]
S.2 Configuration Change Request Template
# Configuration Change Request
**Requester:** [Name]
**Date:** YYYY-MM-DD
**Affected Component:** [Ollama, Rust binary, model, runtime]
## Current Configuration
[Parameter: current value]
## Proposed Configuration
[Parameter: proposed value]
## Justification
[Which TR supports this change? Specific evidence.]
## Risk Assessment
[What could go wrong? Rollback plan.]
## Validation Plan
[How will we verify the change improves performance?]
[Minimum N runs required for validation.]
## Approval
[ ] Benchmarked on target hardware
[ ] Reviewed against relevant TR findings
[ ] Rollback procedure documented
S.3 Performance Regression Report Template
# Performance Regression Report
**Date Detected:** YYYY-MM-DD
**Severity:** [Critical / Warning]
**Metric Affected:** [efficiency / throughput / TTFT / contention]
## Observed Behavior
[Current metric value vs expected baseline]
## Diagnostic Steps Taken
[Commands run, logs checked, hardware inspected]
## Root Cause
[Identified cause or "Under Investigation"]
## Resolution
[Fix applied or proposed]
## Prevention
[How to prevent recurrence]
S.4 Re-Validation Trigger Checklist
A re-validation benchmark run is required when any of the following conditions are met:
- Ollama version updated (any version change)
- GPU driver updated (major version change)
- Operating system updated (major version change)
- Model quantization changed (e.g., Q4_K_M to Q5_K_M)
- Hardware changed (different GPU, different VRAM capacity)
- Agent workflow logic modified (new LLM call stages, changed I/O patterns)
- Rust toolchain updated (major version change)
- Tokio runtime updated (major version change)
- Number of concurrent agents changed (e.g., 2 to 3)
- New model deployed (not previously benchmarked)
Appendix T: Extended Risk Register
| Risk ID | Risk Description | Likelihood | Impact | Mitigation Strategy | Owner | Status |
|---|---|---|---|---|---|---|
| R1 | Hardware-specific results do not generalize to different GPUs (e.g., RTX 3090, A100) | Medium | High | Re-benchmark on target hardware before deployment; document hardware-specific assumptions | Deployment Lead | Open |
| R2 | Ollama update breaks dual-instance configuration or changes serialization behavior | Low | High | Pin Ollama version; test new versions in staging before production upgrade | Infrastructure | Open |
| R3 | Tokio breaking change in major version alters work-stealing scheduler behavior | Low | Medium | Monitor Tokio release notes; maintain smol-1kb as fallback runtime; pin Tokio version in Cargo.lock | Development | Open |
| R4 | Gemma model quality regression in future releases (architecture changes, quantization artifacts) | Medium | Medium | Maintain quality benchmark suite; evaluate new model versions against baseline before adoption | Research | Open |
| R5 | Scaling beyond 2 agents introduces unknown contention patterns not covered by TR108-TR116 | High | Medium | Plan TR117+ investigations for 3+ agent scaling; do not extrapolate 2-agent results | Research | Open |
| R6 | Windows-specific behavior (GPU driver, process scheduling) does not reproduce on Linux | Medium | Medium | Conduct Linux validation study; document any OS-specific findings | Infrastructure | Open |
| R7 | VRAM pressure at scale: larger models or more agents exceed 12 GB budget | Medium | High | Monitor VRAM utilization in production; set GPU layer budgets per model; implement VRAM-aware scheduling | Operations | Open |
| R8 | Configuration drift: production configs diverge from benchmarked configs over time | High | Medium | Implement config-as-code with version control; automated validation against benchmark baselines | DevOps | Open |
| R9 | Shadow pricing inaccuracy: cost estimates in TR112 v2 may not reflect actual cloud pricing | Medium | Low | Validate cost estimates against actual cloud bills quarterly; update pricing model | Finance | Open |
| R10 | Quality unmeasured in multi-agent: efficiency metrics do not capture output quality degradation | High | Medium | Add quality metrics (coherence, relevance, factual accuracy) to benchmark suite in future TRs | Research | Open |
| R11 | Thermal throttling under sustained load reduces throughput below benchmarked levels | Medium | Medium | Monitor GPU temperature; implement cooling-aware load scheduling; benchmark sustained (>1hr) workloads | Operations | Open |
| R12 | Reqwest HTTP client update changes buffering behavior, altering coordination overhead | Low | Low | Pin reqwest version; benchmark after updates | Development | Open |
| R13 | Model context length requirements exceed benchmarked ranges (256-2048 tokens) | Medium | Medium | Benchmark with production-representative context lengths before deployment | Research | Open |
Appendix U: Program Evolution Narrative
U.1 Phase 1: Python Baseline (TR108-TR110)
The research program began with TR108, which established single-agent LLM performance baselines on the RTX 4080 Laptop. Through 158+ configurations, TR108 identified Gemma 3 as the throughput champion (102.85 tok/s) and mapped the parameter sensitivity landscape (GPU layers > context size > temperature). TR109 extended this to agent workflows and discovered the first major insight: configuration transfer fails. Parameters optimal for single inference (maximal GPU offload, large context) degrade agent workflow performance. TR110 pushed into multi-agent territory, achieving 99.25% parallel efficiency with Python asyncio and dual Ollama. At the end of Phase 1, the program had established a high Python performance baseline and identified the workload-specific nature of optimization.
U.2 Phase 2: Rust Migration (TR111-TR112)
Phase 2 asked whether Rust could match Python's performance while providing operational advantages (memory efficiency, startup speed, deployment simplicity). TR111 v2 demonstrated full workflow parity: the Rust agent performed identical operations to Python with 114.54 tok/s throughput. TR112 v2 provided the definitive comparison: Rust delivered +15.2% throughput, -58% TTFT, -67% memory, -83% startup. The migration was validated as beneficial for production deployment.
U.3 Phase 3: Architecture Discovery (TR113-TR114)
Phase 3 produced the program's most important finding. TR113 tested Rust multi-agent with a single Ollama instance and observed only 82.2% efficiency with 63% contention -- initially interpreted as a Rust async runtime limitation. TR114 proved this interpretation wrong: deploying dual Ollama instances eliminated the bottleneck entirely, achieving 99.396% efficiency. The insight was architectural: Ollama's single-instance serialization, not Rust's async runtime, was the bottleneck. This discovery reframed the program's understanding: infrastructure architecture dominates over language runtime for multi-agent LLM workloads.
U.4 Phase 4: Runtime Selection (TR115)
With the architectural question settled, TR115 optimized within the Rust ecosystem by comparing five async runtimes. The finding that all functional runtimes achieve approximately 100% peak efficiency (99.87-99.99%) shifted the selection criterion from "which is fastest" to "which is most consistent." Tokio-default won on consistency (1.21pp sigma), producing the production recommendation.
U.5 Phase 5: Cross-Model Validation (TR116)
TR116 validated whether the Rust advantage and Python ceiling were model-dependent artifacts. Testing three models (Gemma 3, Llama 3.1, Qwen 2.5) across both runtimes confirmed: Rust outperforms Python by 12-17 percentage points for every model tested, and Python never exceeds 86% efficiency regardless of model. These findings are structural and universal within the tested hardware and software configuration.
U.6 Program Arc Summary
The program evolved from "which model is fastest?" (TR108) to "which architecture is optimal?" (TR113-TR114) to "is the finding universal?" (TR116). Each phase built on the previous, and findings from later phases retroactively recontextualized earlier results (e.g., TR113's 82.2% was initially blamed on Rust but later attributed to single Ollama by TR114). The total evidence base -- 903+ runs across 9 reports -- provides high confidence in the six decisions derived from this program.
Appendix V: Cost Modeling Examples
V.1 Assumptions
- Rust throughput: 114.54 tok/s (TR112 v2 baseline)
- Python throughput: 99.34 tok/s (TR112 v2 baseline)
- Average request: 200 tokens generated
- Average request duration: Rust = 200/114.54 = 1.75s; Python = 200/99.34 = 2.01s
- GPU cost: $0.50/hour (consumer-grade GPU cloud instance estimate)
- Multi-agent efficiency: Rust = 98% (TR114 v2); Python = 82% (TR116 average)
V.2 Scenario 1: Small Team (100K requests/month)
| Metric | Python | Rust | Delta |
|---|---|---|---|
| Total inference time | 201,329 s (55.9 hr) | 174,659 s (48.5 hr) | -13.0% |
| GPU cost | $27.95/month | $24.25/month | -$3.70/month |
| Annual GPU cost | $335.40 | $291.00 | -$44.40/year |
Verdict: Savings are modest at this scale. Migration investment is not justified unless multi-agent scaling is planned.
V.3 Scenario 2: Medium Team (1M requests/month)
| Metric | Python | Rust | Delta |
|---|---|---|---|
| Total inference time | 2,013,285 s (559 hr) | 1,746,590 s (485 hr) | -13.0% |
| GPU cost | $279.50/month | $242.50/month | -$37.00/month |
| Annual GPU cost | $3,354.00 | $2,910.00 | -$444.00/year |
Verdict: Annual savings of $444 begin to justify migration investment (estimated $5,000 one-time). Break-even: approximately 11 months.
V.4 Scenario 3: Large Deployment (10M requests/month)
| Metric | Python | Rust | Delta |
|---|---|---|---|
| Total inference time | 20,132,853 s (5,593 hr) | 17,465,903 s (4,852 hr) | -13.0% |
| GPU cost | $2,796.50/month | $2,426.00/month | -$370.50/month |
| Annual GPU cost | $33,558.00 | $29,112.00 | -$4,446.00/year |
Verdict: Annual savings of $4,446 with break-even under 2 months. Migration is strongly recommended.
V.5 Scenario 4: Multi-Agent Swarm (2 agents, 1M requests/month)
| Metric | Python (82% eff.) | Rust (98% eff.) | Delta |
|---|---|---|---|
| Effective throughput (2 agents) | 2 x 99.34 x 0.82 = 162.92 tok/s | 2 x 114.54 x 0.98 = 224.50 tok/s | +37.8% |
| Time for 1M requests | 200M / 162.92 = 1,227,791 s (341 hr) | 200M / 224.50 = 891,314 s (248 hr) | -27.4% |
| GPU cost (dual GPU) | $341.00/month | $248.00/month | -$93.00/month |
| Annual GPU cost | $4,092.00 | $2,976.00 | -$1,116.00/year |
Verdict: Multi-agent amplifies the Rust advantage. The 37.8% effective throughput improvement and $1,116/year savings make migration compelling at medium scale.
V.6 Scenario 5: Break-Even Analysis
| Migration cost component | Estimate |
|---|---|
| Developer time (Rust rewrite) | $4,000 (80 hours at $50/hr) |
| Testing and validation | $750 (15 hours) |
| Deployment infrastructure | $250 (one-time) |
| Total migration cost | $5,000 |
| Scale | Annual Savings | Break-Even |
|---|---|---|
| 100K req/month (single agent) | $44 | 113 months (not justified) |
| 1M req/month (single agent) | $444 | 11 months |
| 1M req/month (multi-agent) | $1,116 | 4.5 months |
| 10M req/month (single agent) | $4,446 | 1.1 months |
Appendix W: Workload Taxonomy Extensions
W.1 Taxonomy Dimensions
Beyond the four workload classes defined in Appendix F, workloads can be further classified along the following dimensions:
Latency Sensitivity:
- Real-time (< 200 ms TTFT): gaming dialogue, interactive chat.
- Near-real-time (< 2s TTFT): developer tools, code completion.
- Batch (no TTFT constraint): document processing, summarization pipelines.
Concurrency Level:
- Single (1 agent): simplest deployment, no contention possible.
- Low (2-4 agents): dual Ollama sufficient, well-characterized by TR108-TR116.
- High (5+ agents): uncharacterized, requires TR117+ investigation.
Context Depth:
- Shallow (< 512 tokens): short prompts, single-turn queries.
- Moderate (512-2048 tokens): multi-turn chat, agent state.
- Deep (> 2048 tokens): document analysis, long-context reasoning.
Quality Criticality:
- Low: creative generation, brainstorming (temperature 0.8-1.0).
- Medium: structured output, report generation (temperature 0.6-0.8).
- High: factual extraction, code generation (temperature 0.0-0.4).
W.2 Classification Examples
| Workload | Latency | Concurrency | Context | Quality | Recommended Stack |
|---|---|---|---|---|---|
| Gaming banter | Real-time | Single | Shallow | Low | Rust + Gemma 3, num_gpu=max |
| Agent report gen | Near-real-time | Single | Moderate | Medium | Rust + Gemma 3, num_gpu=60-80 |
| Dual-agent analysis | Near-real-time | Low | Moderate | Medium | Rust + dual Ollama + Gemma 3 |
| Document processing | Batch | Low | Deep | High | Rust + Llama 3.1, num_ctx=4096+ |
Appendix X: Experiment Planning Template
# Experiment Plan: TR[NNN]
## Research Question
[Single, focused question this experiment answers]
## Hypothesis
[Predicted outcome with rationale]
## Variables
- Independent: [What is being varied]
- Dependent: [What is being measured]
- Controlled: [What is held constant]
## Configuration Matrix
| Config ID | Variable 1 | Variable 2 | ... |
|-----------|-----------|-----------|-----|
| C001 | value | value | ... |
## Sample Size
- Runs per configuration: [N >= 3, recommended 5]
- Total configurations: [M]
- Total runs: [N x M]
- Estimated duration: [hours]
## Hardware and Software
- GPU: [model, VRAM]
- CPU: [model, cores]
- RAM: [capacity, speed]
- OS: [version]
- Ollama: [version, instance count, ports]
- Model: [name:quantization]
- Runtime: [language, version, async runtime]
## Measurement Boundary
- Included: [operations within timing window]
- Excluded: [operations outside timing window]
## Success Criteria
[What result would confirm/refute the hypothesis?]
## Dependencies
[Prior TRs whose findings this experiment depends on]
## Artifacts
- Raw data: [file path pattern]
- Analysis scripts: [file path]
- Report: [output path]
Appendix Y: Extended Operational Playbook
Y.1 Model Hot-Swap Procedure
- Verify current model is idle (no active requests).
- On target Ollama instance:
ollama pull <new_model>:<quantization> - Issue a warmup request to load the new model into GPU memory.
- Run 3 warmup requests (discard results).
- Verify TTFT and throughput are within expected ranges.
- Update configuration file to reference new model.
- Log the model change with timestamp and reason.
Y.2 Emergency Rollback Procedure
- Stop all agent processes.
- Restore previous configuration file from version control.
- Restart both Ollama instances with previous model.
- Run 3 warmup requests per instance.
- Verify efficiency returns to baseline (> 95%).
- Document the incident in a Performance Regression Report (Appendix S.3).
Y.3 VRAM Budget Management
For RTX 4080 Laptop (12 GB VRAM), the budget allocation is:
| Component | VRAM Budget |
|---|---|
| Ollama Instance 1 (model weights) | 3.3 GB (Gemma 3 Q4_K_M) |
| Ollama Instance 1 (KV cache) | 1.0 GB |
| Ollama Instance 2 (model weights) | 3.3 GB |
| Ollama Instance 2 (KV cache) | 1.0 GB |
| OS and driver overhead | 1.0 GB |
| Safety margin | 2.4 GB |
| Total | 12.0 GB |
If VRAM usage approaches the safety margin, reduce num_gpu to offload some layers to CPU, or reduce num_ctx to shrink the KV cache.
Appendix Z: Efficiency-Quality Tradeoff Analysis
Z.1 Observations from TR109
TR109 established that quality and performance trade differently in agent workflows than in single inference:
- Temperature 0.6: Best quality-consistency balance. Structured outputs (reports, analysis) are more coherent. TTFT is 10-15% lower than temperature 1.0.
- Temperature 0.8: Moderate creativity. Acceptable for most agent tasks. Minimal throughput difference from 0.6.
- Temperature 1.0: Highest variance in output quality. Slightly higher throughput in some configurations. Not recommended for structured output tasks.
Z.2 Quality Proxies
TR108-TR116 did not formally measure output quality. The following proxies were observed:
- Output length stability: Lower temperature produces more consistent output lengths, suggesting more deterministic generation.
- Structural compliance: Agent-generated reports at temperature 0.6 more reliably followed the requested markdown structure.
- Repetition rate: Higher temperatures occasionally produced repetitive outputs (observed anecdotally, not quantified).
Z.3 Recommendations
For production deployments where both efficiency and quality matter:
- Use temperature 0.6 for structured output tasks (reports, analysis, code).
- Use temperature 0.8 for creative tasks (dialogue, brainstorming).
- Avoid temperature 1.0 unless output diversity is explicitly valued over consistency.
- Quality measurement (coherence scores, human evaluation) should be added to future TRs (see Risk R10).
Appendices AA-AO: Additional Deep-Dives
AA: Measurement Formula Catalog
Throughput:
throughput_tok_s = total_tokens_generated / generation_duration_seconds
Time to First Token (TTFT):
TTFT_ms = timestamp_first_token - timestamp_request_sent
Concurrency Speedup:
speedup = sequential_estimated_time / concurrent_wall_time
sequential_estimated_time = sum(agent_i_duration for all agents)
Parallel Efficiency:
efficiency_pct = (speedup / N_agents) * 100
Contention Rate:
contention_rate = count(runs_with_TTFT_anomaly) / total_runs
TTFT_anomaly = (TTFT_concurrent - TTFT_baseline) > threshold
Coefficient of Variation:
CV = (standard_deviation / mean) * 100
Throughput Improvement:
improvement_pct = ((throughput_new - throughput_baseline) / throughput_baseline) * 100
Cost Savings:
annual_savings = (python_gpu_hours - rust_gpu_hours) * gpu_hourly_rate * 12
AB: Phase-Specific Observations
TR108 Phases:
- Phase 1 (Llama 3.1): 3 quantizations tested (FP16, Q8_0, Q4_0). Q4_0 provided best throughput-to-quality ratio.
- Phase 2 (Gemma 3): 3 variants tested. Default (Q4_K_M) provided optimal balance.
- Phase 3 (Cross-model): Gemma 3 established as throughput champion.
TR109 Phases:
- Phase 1 (Config transfer): TR108 optimal configs tested on agent workflows. Transfer failure confirmed.
- Phase 2 (Parameter sweep): Systematic sweep identifies agent-specific optima.
- Phase 3 (Quality tradeoffs): Temperature impact on output quality evaluated.
TR110 Phases:
- Phase 1 (Baseline vs Chimera): Mixed agent pair testing.
- Phase 2 (Homogeneous): Identical Chimera agents achieve 99.25% efficiency.
- Phase 3 (Heterogeneous): Different-config agents show lower but acceptable efficiency.
AC: Detailed Model Comparison
| Metric | Gemma 3 (4.3B) | Llama 3.1 8B (Q4_0) | Qwen 2.5 7B |
|---|---|---|---|
| Parameters | 4.3B | 8B | 7B |
| Quantization | Q4_K_M | Q4_0 | Q4_K_M |
| Single-agent throughput | 102.85 tok/s | 76.59 tok/s | ~85 tok/s (est.) |
| VRAM footprint | 3.3 GB | 4.5 GB | 4.2 GB |
| Rust multi-agent eff. | 97.3% (b-v-c), 99.2% (homo) | 96.5% (b-v-c), 98.5% (homo) | 90.0% (b-v-c) |
| Python multi-agent eff. | 80.2% | 83.8% | 77.6% |
| Rust advantage | +17.1 pp | +12.7 pp | +12.4 pp |
| Recommended use | Throughput-critical swarms | Reasoning-heavy agents | Specialized tasks only |
Observations: Gemma 3's smaller size and efficient architecture make it the best all-around choice. Llama 3.1's larger context capacity and reasoning strength justify its lower throughput for complex tasks. Qwen 2.5 underperforms in multi-agent scenarios despite competitive parameter count, likely due to heavier KV cache requirements and different attention patterns.
AD: Extended Methodological Rationale
Why 5 runs per configuration? Statistical power analysis indicates that 5 runs provide sufficient power to detect a 2 percentage-point efficiency difference with 95% confidence, given the observed variance (1-5pp sigma). Three runs are the minimum acceptable; five provide robustness against outliers.
Why dual Ollama instead of batched inference? Ollama does not natively support batched inference (multiple prompts in a single request). Dual instances are the only mechanism for true concurrent inference on Ollama-served models. Alternative inference servers (vLLM, TGI) were out of scope for this consumer-hardware study.
Why Q4_K_M quantization? Q4_K_M provides the best balance of model size, inference speed, and output quality for the target workload (gaming dialogue). Higher quantizations (Q5, Q8) are unnecessarily large for the quality requirements; lower quantizations (Q2, Q3) produce noticeable quality degradation.
AE: Future Directions Beyond TR116
- 3+ Agent Scaling (TR117+): Characterize performance beyond 2 concurrent agents. Expected challenges: VRAM pressure, increased coordination overhead, potential need for 3+ Ollama instances.
- Linux Validation: Reproduce key findings (TR112, TR114, TR116) on Linux to confirm OS independence.
- Cloud GPU Validation: Test on datacenter GPUs (A100, H_100) to determine if consumer-GPU findings generalize.
- Quality Measurement Integration: Add automated quality metrics (BLEU, coherence scores, structured output compliance) to the benchmark suite.
- Long-Context Evaluation: Extend benchmarks to 4096+ token contexts for document processing workloads.
- Model Fine-Tuning Impact: Evaluate whether fine-tuned models exhibit different scaling characteristics than base/instruct models.
- vLLM/TGI Comparison: Compare Ollama serving against vLLM and Text Generation Inference for production scenarios.
AF: Annotated Literature Notes
- Tokio documentation (tokio.rs): Work-stealing scheduler design documented in Tokio's architecture guide. Key insight: work-stealing minimizes tail latency but adds per-steal overhead of approximately 100ns, negligible for LLM inference timescales.
- Ollama GitHub (github.com/ollama/ollama): Server architecture confirms single-request serialization at the model level. Concurrent requests to the same model are queued, not parallelized.
- GGUF specification (ggml.ai): Format specification for quantized model weights. Q4_K_M uses 4.5 bits per weight on average with K-means optimization for key layers.
- Python asyncio documentation (docs.python.org): Single-threaded event loop confirmed as fundamental design constraint. No plans for multi-threaded event loop in CPython roadmap.
AG: Extended Glossary
| Term | Definition |
|---|---|
| Chimera | The optimization framework used in Banterhearts for LLM agent configuration |
| TTFT | Time to First Token: latency from request submission to first generated token |
| Parallel efficiency | (speedup / N_agents) x 100%; measures how well concurrent execution utilizes available parallelism |
| Contention | Resource conflict when multiple agents compete for shared GPU/Ollama resources |
| num_gpu | Ollama parameter controlling how many model layers are offloaded to GPU |
| num_ctx | Ollama parameter controlling the context window size (in tokens) |
| Q4_K_M | Quantization format: 4-bit with K-means clustering, medium quality preset |
| Work-stealing | Scheduler design where idle threads steal tasks from busy threads' queues |
| Dual Ollama | Architecture deploying two independent Ollama instances on different ports |
| KV cache | Key-Value cache storing attention computations for previously processed tokens |
| Chimera-homo | Homogeneous configuration: both agents use identical optimized parameters |
| Chimera-hetero | Heterogeneous configuration: agents use different parameter settings |
| Baseline-vs-chimera | Scenario comparing default config agent against optimized (Chimera) agent |
| CV | Coefficient of Variation: standard deviation / mean, expressed as percentage |
| pp | Percentage points: absolute difference between two percentages |
AH: Artifact Inventory
| TR | Artifact Type | Path Pattern |
|---|---|---|
| TR108 | Raw benchmark data | research/tr108/data/gemma3/, research/tr108/data/llama3/ |
| TR108 | Published report | PublishReady/reports/Technical_Report_108.md |
| TR108 | Visualization exports | PublishReady/notebooks/exports/TR108_Comprehensive/ |
| TR109 | Published report | PublishReady/reports/Technical_Report_109.md |
| TR109 | Visualization exports | PublishReady/notebooks/exports/TR109_Agent_Workflow/ |
| TR110 | Published report | PublishReady/reports/Technical_Report_110.md |
| TR110 | Visualization exports | PublishReady/notebooks/exports/TR110_MultiAgent_Concurrent/ |
| TR111 | Published report (v2) | PublishReady/reports/Technical_Report_111_v2.md |
| TR111 | Artifacts | research/tr111/artifacts/ |
| TR112 | Published report (v2) | PublishReady/reports/Technical_Report_112_v2.md |
| TR112 | Artifacts | research/tr112/artifacts/ |
| TR113 | Published report | PublishReady/reports/Technical_Report_113.md |
| TR114 | Published report (v2) | PublishReady/reports/Technical_Report_114_v2.md |
| TR114 | Artifacts | research/tr114/artifacts/ |
| TR115 | Published report (v2) | PublishReady/reports/Technical_Report_115_v2.md |
| TR115 | Results | research/tr115/runtime_optimization/results_v2/ |
| TR116 | Published report | PublishReady/reports/Technical_Report_116.md |
AI: Artifact-to-Claim Examples
Claim: "Rust delivers +15.2% throughput over Python."
- Artifact:
PublishReady/reports/Technical_Report_112_v2.md, Section 4 (Throughput Analysis) - Raw data:
research/tr111/artifacts/(Rust runs),research/tr112/artifacts/(Python comparison) - Calculation: (114.54 - 99.34) / 99.34 x 100 = 15.28%
Claim: "Dual Ollama improves efficiency from 82.2% to 99.396%."
- Artifact:
PublishReady/reports/Technical_Report_113.md(82.2% figure),PublishReady/reports/Technical_Report_114_v2.md(99.396% figure) - Raw data:
research/tr114/artifacts/(dual Ollama results) - Calculation: 99.396 - 82.2 = 17.196 percentage points improvement
Claim: "Tokio-default achieves 98.72% mean with 1.21pp sigma."
- Artifact:
PublishReady/reports/Technical_Report_115_v2.md, Section 3 (Comprehensive Results) - Raw data:
research/tr115/runtime_optimization/results_v2/
AJ: Reproducibility Notes
Hardware Sensitivity: All TR108-TR116 results were obtained on a single hardware configuration (RTX 4080 Laptop, i9-13980HX, 32 GB DDR5-4800). Reproducing these results on different hardware may yield different absolute values but should preserve relative rankings and efficiency percentages. GPU thermal state, background processes, and driver version can introduce run-to-run variance of 1-3%.
Software Versions: Results depend on specific software versions. Ollama's internal scheduling and model loading behavior may change between versions. Tokio's work-stealing scheduler has been stable across minor versions but could change in major releases.
Warmup Protocol: Consistent warmup (3 requests per Ollama instance before measurement) is critical for reproducibility. Cold-start measurements include model loading time that can vary by 2-10x depending on model size and disk speed.
Statistical Protocol: Minimum 3 runs per configuration (5 preferred). Report mean, standard deviation, and CV. Discard clear outliers only if a hardware explanation is documented (e.g., thermal throttling event confirmed by GPU temperature log).
AK: Scenario-Specific Playbooks
Gaming Dialogue Agent Deployment:
- Model: Gemma 3 Q4_K_M. Runtime: Rust/Tokio-default. Ollama: single instance.
- Configuration: num_gpu=max, num_ctx=512, temp=0.8.
- Target: throughput > 100 tok/s, TTFT < 200 ms.
- Monitoring: track TTFT p99 for user experience; alert if > 300 ms.
Dual-Agent Research Pipeline:
- Model: Gemma 3 Q4_K_M. Runtime: Rust/Tokio-default. Ollama: dual instances (11434/11435).
- Configuration: num_gpu=80, num_ctx=2048, temp=0.6.
- Target: efficiency > 95%, contention < 2%.
- Monitoring: track parallel efficiency per batch; alert if < 90%.
AL: Scenario Taxonomy
| Scenario ID | Agents | Ollama | Runtime | Use Case |
|---|---|---|---|---|
| S1 | 1 | Single | Rust | Production single-agent |
| S2 | 1 | Single | Python | Development/prototyping |
| S3 | 2 | Dual | Rust | Production multi-agent |
| S4 | 2 | Dual | Python | Multi-agent prototyping |
| S5 | 2 | Single | Rust | Not recommended (TR113) |
| S6 | 3+ | N/A | Rust | Future work (TR117+) |
AM: Decision Heuristics
Heuristic 1: Language Selection
- If multi-agent efficiency > 90% required: Rust.
- If development velocity > performance: Python.
- If both: Rust with Python orchestrator (hybrid).
Heuristic 2: Ollama Architecture
- If agents = 1: single Ollama.
- If agents >= 2: dual Ollama (one per agent).
- If agents >= 3: one Ollama per agent (requires VRAM budget validation).
Heuristic 3: Model Selection
- If throughput-critical: Gemma 3.
- If reasoning-critical: Llama 3.1.
- If neither is dominant: Gemma 3 (default).
- Avoid Qwen 2.5 for multi-agent (90% efficiency ceiling).
Heuristic 4: Configuration Transfer
- Never assume single-inference optimal configs transfer to agent workflows.
- Never assume agent-optimal configs transfer to multi-agent.
- Always benchmark the specific workload type.
AN: Policy Decision Trees
Tree 1: Should We Use Rust?
Is multi-agent efficiency critical?
Yes --> Is the team capable of Rust development?
Yes --> Use Rust (Decision 1)
No --> Hire/train, then use Rust
No --> Is throughput > 15% improvement valuable?
Yes --> Use Rust
No --> Stay with Python
Tree 2: How Many Ollama Instances?
How many concurrent agents?
1 --> Single Ollama
2 --> Dual Ollama (Decision 2)
3+ --> One per agent (validate VRAM first)
VRAM sufficient? --> Deploy
VRAM insufficient? --> Reduce GPU layers or use smaller model
Tree 3: Which Model?
Is throughput the primary metric?
Yes --> Gemma 3 (Decision 4)
No --> Is reasoning quality the primary metric?
Yes --> Llama 3.1
No --> Is multi-agent efficiency critical?
Yes --> Gemma 3 (99.2% efficiency)
No --> Either Gemma 3 or Llama 3.1
AO: Extended Systems Glossary
| Term | Definition | Context |
|---|---|---|
| Ollama | Open-source LLM inference server supporting GGUF model format | All TRs |
| Tokio | Rust async runtime with work-stealing scheduler | TR111-TR116 |
| asyncio | Python standard library for asynchronous I/O | TR108-TR110, TR116 |
| reqwest | Rust HTTP client library built on Tokio and hyper | TR111-TR116 |
| httpx | Python async HTTP client library | TR108-TR110 |
| CUDA | NVIDIA parallel computing platform for GPU acceleration | All TRs |
| GGUF | GPT-Generated Unified Format for quantized model weights | All TRs |
| RTX 4080 | NVIDIA laptop GPU with 12 GB GDDR6X, 9728 CUDA cores | All TRs |
| i9-13980HX | Intel 24-core hybrid CPU (8P + 16E cores) | All TRs |
| DDR5-4800 | System memory standard, 4800 MHz transfer rate | All TRs |
| smol | Lightweight Rust async runtime | TR115 |
| async-std | Rust async runtime mirroring std library API | TR115 |
| hyper | Low-level HTTP library for Rust (used by reqwest) | TR111-TR116 |
| GIL | Global Interpreter Lock in CPython; prevents true parallel bytecode execution | TR116 discussion |
| VRAM | Video RAM; GPU-accessible memory for model weights and KV cache | All TRs |
| Chimera | Banterhearts optimization framework for LLM agent configuration tuning | All TRs |
| Work-stealing | Scheduler pattern where idle threads take tasks from other threads' queues | TR115 |
| Serialization (Ollama) | Behavior where concurrent requests to a single Ollama instance are processed sequentially | TR113, TR114 |
| Thermal throttling | GPU frequency reduction when temperature exceeds safe limits | Performance monitoring |
| Binary deployment | Distributing Rust application as a single compiled executable | TR112 |
End of Extended Appendices. Total appendices: F, H, J, K, L, N, O, P, Q, S, T, U, V, W, X, Y, Z, AA-AO. This document is supplemental to the main Conclusive Report 108-116.