Appendices

Phase 1 Extended Appendices

Supplemental material extracted from the Phase 1 conclusive report.

Table of Contents

Appendix F: Workload Taxonomy
F.1 Overview
F.2 Workload Class 1: Single-Agent Chat Inference
F.3 Workload Class 2: Agent Workflow
F.4 Workload Class 3: Multi-Agent Concurrent
F.5 Workload Class 4: Cross-Model Multi-Agent
F.6 Workload Class 5: Batch Processing (Future Work)
F.7 Workload Characteristics Summary
Appendix H: Operational Playbooks
H.1 Pre-Deployment Playbook
Step 1: Dual Ollama Instance Setup
Step 2: Rust Binary Compilation and Deployment
Step 3: Configuration Validation Procedure
Step 4: Warmup Sequence
H.2 Monitoring Playbook
Key Metrics to Track
Alert Thresholds
Dashboard Recommendations
H.3 Regression Response Playbook
Symptom-Root Cause-Fix Mapping
Appendix J: Traceability Map (Decisions to TRs)
J.1 Decision 1: Adopt Rust for Production Agent Workflows
J.2 Decision 2: Deploy Dual Ollama Architecture for Multi-Agent Workloads
J.3 Decision 3: Use Tokio-Default as the Async Runtime
J.4 Decision 4: Select Gemma 3 as the Default Production Model
J.5 Decision 5: Acknowledge and Document the Python Efficiency Ceiling
J.6 Decision 6: Configuration Transfer Between Workload Types Fails
Appendix K: Extended Literature Context
K.1 Rust Async Ecosystem
K.2 Ollama Internals
K.3 LLM Multi-Agent Patterns
K.4 Python Asyncio Limitations
K.5 Quantization Impact
Appendix L: Measurement Boundary Catalog
L.1 Boundary Definitions
L.2 Per-TR Boundary Table
L.3 Boundary Rationale
L.4 Cross-TR Comparability Notes
Appendix N: Expanded Discussion
N.1 The Python-to-Rust Migration Narrative
N.2 The Architecture Discovery
N.3 The Runtime Lesson: Consistency Beats Peak Performance
N.4 The Model Selection Paradox
N.5 Configuration Transfer Failure as a General Principle
N.6 Implications for the Broader LLM Serving Community
Appendix O: Extended Results Narratives
O.1 TR108: Single-Agent Baseline Establishes Model Performance Hierarchy
O.2 TR109: Configuration Transfer Failure Reveals Workload-Specific Optimization
O.3 TR110: Python Achieves Near-Theoretical Multi-Agent Ceiling
O.4 TR111: Rust Achieves Full Workflow Parity with Near-Hardware-Limit Throughput
O.5 TR112: Head-to-Head Comparison Confirms Rust's Systematic Advantages
O.6 TR113: Single Ollama Bottleneck Reveals Architectural, Not Language, Limitation
O.7 TR114: Dual Ollama Fix Transforms Rust Multi-Agent Performance
O.8 TR115: Runtime Deep Dive Shows Consistency Matters More Than Peak
O.9 TR116: Cross-Model Validation Confirms Universality
Appendix P: Decision-Grade Reporting Rubric
P.1 Checklist
P.2 Application to TR108-TR116
Appendix Q: Decision Case Studies
Q.1 "Should we migrate our Python agent to Rust?"
Q.2 "We are deploying 2 concurrent agents -- single or dual Ollama?"
Q.3 "Which async runtime should we use?"
Q.4 "Gemma or Llama for our agent swarm?"
Q.5 "Can we just optimize our Python code instead of migrating?"
Appendix S: Governance Templates
S.1 Benchmark Report Template
[Subtitle]
Executive Summary
Methodology
Results
Limitations
Conclusions and Decisions
Appendices
S.2 Configuration Change Request Template
Current Configuration
Proposed Configuration
Justification
Risk Assessment
Validation Plan
Approval
S.3 Performance Regression Report Template
Observed Behavior
Diagnostic Steps Taken
Root Cause
Resolution
Prevention
S.4 Re-Validation Trigger Checklist
Appendix T: Extended Risk Register
Appendix U: Program Evolution Narrative
U.1 Phase 1: Python Baseline (TR108-TR110)
U.2 Phase 2: Rust Migration (TR111-TR112)
U.3 Phase 3: Architecture Discovery (TR113-TR114)
U.4 Phase 4: Runtime Selection (TR115)
U.5 Phase 5: Cross-Model Validation (TR116)
U.6 Program Arc Summary
Appendix V: Cost Modeling Examples
V.1 Assumptions
V.2 Scenario 1: Small Team (100K requests/month)
V.3 Scenario 2: Medium Team (1M requests/month)
V.4 Scenario 3: Large Deployment (10M requests/month)
V.5 Scenario 4: Multi-Agent Swarm (2 agents, 1M requests/month)
V.6 Scenario 5: Break-Even Analysis
Appendix W: Workload Taxonomy Extensions
W.1 Taxonomy Dimensions
W.2 Classification Examples
Appendix X: Experiment Planning Template
Research Question
Hypothesis
Variables
Configuration Matrix
Sample Size
Hardware and Software
Measurement Boundary
Success Criteria
Dependencies
Artifacts
Appendix Y: Extended Operational Playbook
Y.1 Model Hot-Swap Procedure
Y.2 Emergency Rollback Procedure
Y.3 VRAM Budget Management
Appendix Z: Efficiency-Quality Tradeoff Analysis
Z.1 Observations from TR109
Z.2 Quality Proxies
Z.3 Recommendations
Appendices AA-AO: Additional Deep-Dives
AA: Measurement Formula Catalog
AB: Phase-Specific Observations
AC: Detailed Model Comparison
AD: Extended Methodological Rationale
AE: Future Directions Beyond TR116
AF: Annotated Literature Notes
AG: Extended Glossary
AH: Artifact Inventory
AI: Artifact-to-Claim Examples
AJ: Reproducibility Notes
AK: Scenario-Specific Playbooks
AL: Scenario Taxonomy
AM: Decision Heuristics
AN: Policy Decision Trees
AO: Extended Systems Glossary

Conclusive Report 108-116: Extended Appendices

Supplemental material extracted from the main conclusive report

Field	Value
Report Type	Extended appendices
Scope	108-116
Status	Supplemental
Main Report	Technical_Report_Conclusive_108-116.md

Appendix F: Workload Taxonomy

F.1 Overview

The TR108-TR116 research program evaluated LLM inference performance across four distinct workload classes; a fifth class, batch processing, is outlined in F.6 as future work. Each class exercises different system bottlenecks, rewards different optimization strategies, and produces different performance profiles. This taxonomy formalizes the workload classification that emerged empirically over nine technical reports and 903+ benchmark runs.

F.2 Workload Class 1: Single-Agent Chat Inference

Primary TR: TR108

Characteristics:

Decode-dominant: GPU compute is the bottleneck; throughput scales linearly with decode speed.
Single Ollama instance, single concurrent request.
Prompt lengths: 50-200 tokens. Generation lengths: 100-500 tokens.
Optimization target: raw throughput (tok/s) and TTFT (ms).

Key Metrics Observed:

Gemma 3: 102.85 tok/s, TTFT ~150 ms (best configuration)
Llama 3.1 Q4_0: 76.59 tok/s, TTFT ~200 ms
GPU layer allocation is the single most impactful parameter (num_gpu).
Context size optimization yields 15-20% throughput improvement.
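
The throughput figures above are reported in tok/s. As a minimal sketch, and not necessarily how the TR108 harness computed them, decode throughput can be derived from the `eval_count` and `eval_duration` fields of an Ollama generate response, where durations are given in nanoseconds. The function name and the sample values are illustrative assumptions.

```python
def decode_tokens_per_second(response: dict) -> float:
    """Decode throughput (tok/s) from an Ollama /api/generate response body."""
    eval_count = response["eval_count"]            # number of generated tokens
    eval_duration_ns = response["eval_duration"]   # decode time in nanoseconds
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return eval_count / (eval_duration_ns / 1e9)


# Illustrative values only, not measured data.
sample = {"eval_count": 200, "eval_duration": 2_000_000_000}
print(decode_tokens_per_second(sample))  # 100.0
```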

Optimization Strategy:

Maximize GPU offload (num_gpu = total model layers).
Minimize context window to task requirements (512-1024 for short-form generation).
Temperature has minimal impact on throughput; optimize for quality.
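
The parameters named above are passed to Ollama per request through the `options` object of the generate API. The following is a minimal sketch assuming a non-streaming call; the helper name, the prompt, and the temperature value are illustrative and are not the TR108 configuration.

```python
import json


def build_generate_request(model: str, prompt: str, num_gpu: int,
                           num_ctx: int, temperature: float) -> dict:
    """Build a request body for POST /api/generate."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {
            "num_gpu": num_gpu,          # layers offloaded to the GPU
            "num_ctx": num_ctx,          # context window in tokens
            "temperature": temperature,  # sampling temperature
        },
    }


# 999 requests maximal offload; 1024 is within the short-form range above.
body = build_generate_request("gemma3:latest", "Hello",
                              num_gpu=999, num_ctx=1024, temperature=0.7)
print(json.dumps(body, indent=2))
```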

F.3 Workload Class 2: Agent Workflow

Primary TR: TR109

Characteristics:

Multi-step execution: file I/O, data ingestion, iterative LLM calls, structured output generation.
I/O-bound phases interleave with compute-bound LLM inference phases.
Optimization target: end-to-end workflow latency, with TTFT as the critical sub-metric.

Key Metrics Observed:

Smaller contexts (512-1024 tokens) outperform larger contexts (4096 tokens) for agent tasks.
GPU layer sweet spot: 60-80 layers (vs. 999 for single inference).
Configuration transfer from TR108 (single inference) fails: optimal single-inference parameters produce suboptimal agent workflow performance.

Optimization Strategy:

Optimize TTFT aggressively; decode throughput is less critical because I/O phases dominate.
Reduce context window to match actual agent state requirements.
Avoid maximal GPU offload; moderate offload (60-80 layers) balances TTFT and throughput.

F.4 Workload Class 3: Multi-Agent Concurrent

Primary TRs: TR110, TR113, TR114

Characteristics:

Two or more agents executing in parallel, each issuing LLM requests.
Resource contention for GPU, VRAM, and Ollama server capacity.
Optimization target: parallel efficiency (speedup / N agents) and contention rate.

Key Metrics Observed:

Python (TR110): 99.25% efficiency with dual Ollama, homogeneous Chimera agents.
Rust single Ollama (TR113): 82.2% efficiency, 63% contention rate.
Rust dual Ollama (TR114): 99.396% peak config efficiency, contention reduced to <1%.
GPU layers >= 80 required for contention-free concurrent execution on 12 GB VRAM.

Optimization Strategy:

Dual Ollama instances are mandatory for contention-free multi-agent execution.
Homogeneous agent configurations outperform heterogeneous configurations.
Context size 2048 achieves highest speedups in homogeneous scenarios.
Architecture (dual Ollama) matters more than language choice (Rust vs Python).

F.5 Workload Class 4: Cross-Model Multi-Agent

Primary TR: TR116

Characteristics:

Identical multi-agent scenarios executed across different model architectures.
Isolates model choice as an independent variable while holding runtime and infrastructure constant.
Reveals model-specific scaling behavior under concurrent load.

Key Metrics Observed:

Gemma 3 (Rust): 97.3% efficiency, 99.2% in chimera-homo -- the scaling champion.
Llama 3.1 (Rust): 96.5% efficiency, 98.5% in chimera-homo.
Qwen 2.5 (Rust): 90.0% efficiency -- heavier coordination overhead due to larger KV cache.
Python ceiling: efficiency never exceeds 86%, regardless of model architecture.

Optimization Strategy:

Model selection must account for multi-agent scaling characteristics, not just single-agent throughput.
Gemma 3 is optimal for agent swarms; Llama 3.1 is viable for reasoning-heavy tasks.
Qwen 2.5's lower efficiency is acceptable only for specialized reasoning workloads.

F.6 Workload Class 5: Batch Processing (Future Work)

Not tested in TR108-TR116.

Hypothesized Characteristics:

Offline processing of large prompt queues with no latency constraint.
Optimization target: total throughput (requests/hour), not per-request latency.
Expected to favor maximal GPU offload and large batch sizes.
Dual Ollama likely beneficial for throughput doubling.

F.7 Workload Characteristics Summary

Characteristic	Single-Agent	Agent Workflow	Multi-Agent	Cross-Model
Primary bottleneck	GPU decode	I/O + TTFT	Contention	Model-dependent
Optimal num_gpu	Maximum (999)	Moderate (60-80)	>= 80	>= 80
Optimal num_ctx	Task-matched	Minimal (256-512)	2048	2048
TTFT criticality	Medium	High	Medium	Medium
Throughput criticality	High	Medium	Medium	Medium
Efficiency target	N/A	N/A	> 95%	> 95%
Contention risk	None	None	High (single Ollama)	Model-dependent
Key TR	TR108	TR109	TR110, TR113, TR114	TR116
Config transferable?	Baseline	No (from TR108)	No (from TR109)	No (model-specific)

Appendix H: Operational Playbooks

H.1 Pre-Deployment Playbook

Step 1: Dual Ollama Instance Setup

Install Ollama (version pinned to the tested release, e.g., v0.1.17 or later validated version).
Configure Instance 1 on default port 11434:
- Set OLLAMA_HOST=127.0.0.1:11434
- Start: ollama serve
Configure Instance 2 on secondary port 11435:
- Set OLLAMA_HOST=127.0.0.1:11435
- Start: ollama serve
Verify both instances respond:
- curl http://127.0.0.1:11434/api/version
- curl http://127.0.0.1:11435/api/version
Pull the target model on both instances:
- Instance 1: OLLAMA_HOST=127.0.0.1:11434 ollama pull gemma3:latest
- Instance 2: OLLAMA_HOST=127.0.0.1:11435 ollama pull gemma3:latest
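
The verification step can also be scripted. The sketch below probes the same `/api/version` endpoint used by the curl commands above, using only the Python standard library; the function names and the 2-second timeout are assumptions made for illustration.

```python
import json
import urllib.request

INSTANCES = ["127.0.0.1:11434", "127.0.0.1:11435"]


def version_url(host: str) -> str:
    """URL of the version endpoint for one Ollama instance."""
    return f"http://{host}/api/version"


def check_instance(host: str, timeout: float = 2.0):
    """Return the reported version string, or None if the instance is unreachable."""
    try:
        with urllib.request.urlopen(version_url(host), timeout=timeout) as resp:
            return json.load(resp).get("version")
    except (OSError, ValueError):
        # OSError covers refused connections and timeouts; ValueError covers bad JSON.
        return None


if __name__ == "__main__":
    for host in INSTANCES:
        version = check_instance(host)
        status = f"ok (version {version})" if version else "NOT RESPONDING"
        print(f"{host}: {status}")
```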

Step 2: Rust Binary Compilation and Deployment

Ensure Rust toolchain is installed (rustup, stable channel >= 1.82.0).
Build release binary: cargo build --release
Verify that the binary was produced and check its size: ls -la target/release/<binary_name>
Validate runtime: ./target/release/<binary_name> --version
Deploy binary to target directory alongside configuration files.

Step 3: Configuration Validation Procedure

Load configuration file (YAML or TOML).
Validate all required fields: ollama_host, ollama_port, model, num_gpu, num_ctx, temperature.
Confirm dual Ollama ports are specified for multi-agent configurations.
Verify GPU layer count does not exceed model layer count.
Confirm VRAM budget: sum of model memory across both instances must be less than 12 GB (for RTX 4080 Laptop).
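
The checks in this step can be expressed as a small validation function. This is a sketch under stated assumptions: the configuration is already parsed into a dictionary, `ollama_port` may be a list for multi-agent setups, and the model's layer count and per-instance memory are supplied by the caller. None of these schema details come from the original harness.

```python
REQUIRED_FIELDS = ("ollama_host", "ollama_port", "model",
                   "num_gpu", "num_ctx", "temperature")
VRAM_BUDGET_GB = 12.0  # RTX 4080 Laptop, per the playbook


def validate_config(cfg: dict, model_layers: int, model_memory_gb: float,
                    multi_agent: bool = False) -> list:
    """Return a list of problems; an empty list means the config passed."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in cfg]
    if problems:
        return problems
    port_value = cfg["ollama_port"]
    ports = port_value if isinstance(port_value, list) else [port_value]
    if multi_agent and len(set(ports)) < 2:
        problems.append("multi-agent configs need two distinct Ollama ports")
    # Applied literally: a sentinel such as num_gpu=999 would be flagged here.
    if cfg["num_gpu"] > model_layers:
        problems.append("num_gpu exceeds model layer count")
    # Assumes each instance holds its own copy of the model.
    if model_memory_gb * len(ports) >= VRAM_BUDGET_GB:
        problems.append("combined model memory does not fit the VRAM budget")
    return problems


cfg = {"ollama_host": "127.0.0.1", "ollama_port": [11434, 11435],
       "model": "gemma3:latest", "num_gpu": 80, "num_ctx": 2048, "temperature": 0.6}
# model_layers=100 is a placeholder; 3.3 GB is the Gemma 3 base memory cited in K.2.
print(validate_config(cfg, model_layers=100, model_memory_gb=3.3, multi_agent=True))
```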

Step 4: Warmup Sequence

Issue 3 sequential requests to Ollama Instance 1 with a short prompt (e.g., "Hello").
Issue 3 sequential requests to Ollama Instance 2 with the same prompt.
Discard warmup results; do not include in benchmark data.
Rationale: First requests incur model loading and KV-cache allocation overhead. Three warmup requests ensure stable GPU state.
Verify post-warmup TTFT is within expected range (< 500 ms for Gemma 3 on RTX 4080).
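
The warmup sequence can be sketched as follows. The request itself is abstracted behind a caller-supplied `send_prompt` function (an assumption, so the sketch stays independent of any HTTP client), and the extra probe request used to check post-warmup TTFT is one possible reading of the final step.

```python
from typing import Callable, Dict, List


def warm_up(send_prompt: Callable[[str, str], float], instances: List[str],
            prompt: str = "Hello", rounds: int = 3,
            max_ttft_ms: float = 500.0) -> Dict[str, bool]:
    """Warm each instance sequentially, then report whether its TTFT is in range.

    send_prompt(instance, prompt) issues one request and returns TTFT in ms.
    """
    results = {}
    for instance in instances:
        for _ in range(rounds):
            send_prompt(instance, prompt)           # warmup result is discarded
        probe_ttft = send_prompt(instance, prompt)  # post-warmup measurement
        results[instance] = probe_ttft < max_ttft_ms
    return results


# Stub sender for demonstration: the first call per instance is slow (cold load).
_seen = {}


def _stub_sender(instance: str, prompt: str) -> float:
    _seen[instance] = _seen.get(instance, 0) + 1
    return 2000.0 if _seen[instance] == 1 else 150.0


print(warm_up(_stub_sender, ["127.0.0.1:11434", "127.0.0.1:11435"]))
```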

H.2 Monitoring Playbook

Key Metrics to Track

Metric	Unit	Source	Collection Frequency
Parallel efficiency	%	Benchmark harness	Per-run
TTFT	ms	Ollama API response	Per-request
Throughput	tok/s	Ollama API response	Per-request
Contention rate	%	TTFT anomaly detection	Per-run
VRAM usage	MB	nvidia-smi	5-second intervals
GPU utilization	%	nvidia-smi	5-second intervals
CPU utilization	%	OS metrics	5-second intervals
Memory (RSS)	MB	Process metrics	Per-run
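
For the nvidia-smi rows, one way to collect samples is the CSV query mode. The sketch below assumes a single GPU and uses the `memory.used`, `utilization.gpu`, and `temperature.gpu` query fields; the dictionary keys are illustrative names.

```python
import subprocess

QUERY = ["nvidia-smi",
         "--query-gpu=memory.used,utilization.gpu,temperature.gpu",
         "--format=csv,noheader,nounits"]


def parse_gpu_sample(line: str) -> dict:
    """Parse one CSV line such as '3412, 87, 71' into named metrics."""
    memory_mb, util_pct, temp_c = (float(field) for field in line.split(","))
    return {"vram_mb": memory_mb, "gpu_util_pct": util_pct, "gpu_temp_c": temp_c}


def sample_gpu() -> dict:
    """Run nvidia-smi once and parse the first GPU's line."""
    output = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_sample(output.strip().splitlines()[0])
```

Calling `sample_gpu()` in a loop with `time.sleep(5)` between calls would match the 5-second collection interval in the table.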

Alert Thresholds

Metric	Warning	Critical	Action
Efficiency	< 95%	< 90%	Check Ollama instances, VRAM pressure
TTFT	> 1000 ms	> 2000 ms	Check model loading, GPU thermal state
Throughput	< 90 tok/s	< 80 tok/s	Check GPU throttling, model quantization
Contention rate	> 2%	> 5%	Verify dual Ollama, check port conflicts
VRAM usage	> 10 GB	> 11 GB	Reduce GPU layers, check for memory leaks
GPU temperature	> 80 C	> 85 C	Check cooling, reduce sustained load
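
The thresholds above map directly onto a small classifier. In this sketch the threshold values are copied from the table, while the metric key names and the `classify` function are illustrative.

```python
# metric -> (breach direction, warning limit, critical limit), from the table above.
THRESHOLDS = {
    "efficiency_pct":   ("below", 95.0, 90.0),
    "ttft_ms":          ("above", 1000.0, 2000.0),
    "throughput_tok_s": ("below", 90.0, 80.0),
    "contention_pct":   ("above", 2.0, 5.0),
    "vram_gb":          ("above", 10.0, 11.0),
    "gpu_temp_c":       ("above", 80.0, 85.0),
}


def classify(metric: str, value: float) -> str:
    """Return 'ok', 'warning', or 'critical' for one metric reading."""
    direction, warning, critical = THRESHOLDS[metric]

    def breached(limit: float) -> bool:
        return value < limit if direction == "below" else value > limit

    if breached(critical):
        return "critical"
    if breached(warning):
        return "warning"
    return "ok"


print(classify("efficiency_pct", 92.0))   # warning
print(classify("ttft_ms", 2500.0))        # critical
```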

Dashboard Recommendations

Primary panel: real-time efficiency trend (line chart, 1-minute resolution).
Secondary panel: TTFT distribution (histogram, per-model breakdown).
Tertiary panel: throughput time series with contention event overlay.
Auxiliary panel: VRAM and GPU utilization gauges.
Tool recommendation: Grafana with Prometheus metrics exporter, or custom JSON-based logging with dashboard frontend.

H.3 Regression Response Playbook

Symptom-Root Cause-Fix Mapping

Symptom	Likely Root Cause	Diagnostic Command	Fix
Efficiency < 90%	Single Ollama instance	Check port 11435 availability	Restart second Ollama instance
Efficiency < 90%	VRAM pressure	`nvidia-smi` (check memory)	Reduce num_gpu or switch to smaller model
Efficiency < 90%	Configuration drift	Diff current vs baseline config	Restore validated configuration
Throughput < 80 tok/s	Model not fully loaded to GPU	Check num_gpu setting	Increase GPU layer offload
Throughput < 80 tok/s	GPU thermal throttling	`nvidia-smi -q -d TEMPERATURE`	Allow cooldown, improve airflow
Throughput < 80 tok/s	Background GPU workload	Check GPU process list	Terminate competing processes
TTFT > 2000 ms	Cold model load	Check Ollama logs for "loading model"	Run warmup sequence
TTFT > 2000 ms	Context window too large	Check num_ctx setting	Reduce to 512-1024
Contention > 5%	Single Ollama serving both agents	Check agent-to-instance mapping	Ensure dedicated Ollama per agent
Contention > 5%	Port conflict	`netstat -an | grep 1143`	

Appendix J: Traceability Map (Decisions to TRs)

J.1 Decision 1: Adopt Rust for Production Agent Workflows

Supporting TR	Evidence	Strength
TR111 v2	Rust achieves full workflow parity with Python; 114.54 tok/s baseline throughput	Primary
TR112 v2	Head-to-head comparison: +15.2% throughput, -58% TTFT, -67% memory, -83% startup	Primary
TR114 v2	Rust multi-agent achieves 99.396% peak efficiency with dual Ollama	Confirmatory
TR116	Rust advantage holds across all three models (Gemma, Llama, Qwen); +12-17pp over Python	Confirmatory

Decision logic: TR112 v2 provides the definitive single-agent comparison. TR114 v2 confirms that Rust's advantages are preserved in multi-agent scenarios when architectural bottlenecks (single Ollama) are removed. TR116 confirms the advantage is model-independent.

J.2 Decision 2: Deploy Dual Ollama Architecture for Multi-Agent Workloads

Supporting TR	Evidence	Strength
TR110	Python dual Ollama achieves 99.25% efficiency (ports 11434/11435)	Foundational
TR113	Single Ollama bottleneck: 82.2% efficiency, 63% contention rate in Rust	Problem identification
TR114 v2	Dual Ollama fix: efficiency jumps from 82.2% to 99.396%, contention drops to <1%	Solution validation

Decision logic: TR113 empirically demonstrated that single Ollama serializes concurrent requests, capping efficiency at 82.2%. TR114 v2 proved that adding a second Ollama instance eliminates the bottleneck entirely. TR110 had already established dual Ollama as viable in Python. The improvement (17+ percentage points) is unambiguous.

J.3 Decision 3: Use Tokio-Default as the Async Runtime

Supporting TR	Evidence	Strength
TR115 v2	5-runtime benchmark: tokio-default achieves 98.72% mean efficiency with 1.21pp standard deviation (best consistency)	Primary

Decision logic: All four working runtimes reach nearly identical peak efficiency (99.87-99.99%), so peak performance does not discriminate between them. The differentiator is consistency: tokio-default (1.21pp sigma) and smol-1kb (1.32pp sigma) are the only production-viable options. Tokio-default wins on consistency, ecosystem maturity, and zero-configuration deployment (#[tokio::main]). Tokio-localset (4.03pp sigma, 81.03% minimum) is too variable. Smol (72.80%, a pathological failure) and async-std (50% efficiency, i.e., perfect serialization) are disqualified.
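
The consistency-first selection rule can be sketched as a ranking over per-run efficiency samples. The sample data below is synthetic and the 90% floor is an assumption borrowed from the critical alert threshold in H.2; neither is taken from TR115.

```python
from statistics import mean, stdev


def rank_by_consistency(runs: dict, min_floor: float = 90.0) -> list:
    """Rank runtimes by lowest standard deviation.

    Runtimes whose worst run falls below min_floor are dropped; ties on
    standard deviation are broken in favor of the higher mean.
    """
    stats = []
    for name, samples in runs.items():
        if min(samples) < min_floor:
            continue
        stats.append((stdev(samples), -mean(samples), name))
    return [name for _, _, name in sorted(stats)]


# Synthetic efficiencies (%), for illustration only.
runs = {
    "runtime_a": [98.0, 99.0, 99.5, 98.5],   # steady
    "runtime_b": [99.9, 81.0, 99.8, 99.9],   # high peak, one bad run
    "runtime_c": [97.0, 99.9, 95.0, 99.0],   # acceptable floor, more spread
}
print(rank_by_consistency(runs))  # ['runtime_a', 'runtime_c']
```

Here `runtime_b` has the highest peak but is excluded by its worst run, which mirrors the reasoning applied to tokio-localset.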

J.4 Decision 4: Select Gemma 3 as the Default Production Model

Supporting TR	Evidence	Strength
TR108	Gemma 3 delivers 34% higher throughput than Llama 3.1 Q4_0 (102.85 vs 76.59 tok/s)	Foundational
TR116	Gemma 3 achieves 99.2% chimera-homo efficiency (Rust), 97.3% baseline-vs-chimera -- highest of all models	Confirmatory

Decision logic: Gemma 3 leads on both single-agent throughput (TR108) and multi-agent scaling efficiency (TR116). Llama 3.1 is a viable alternative for reasoning-heavy workloads (96.5% Rust efficiency) but trails on raw throughput. Qwen 2.5 is not recommended for multi-agent use (90.0% efficiency, heavier coordination overhead).

J.5 Decision 5: Acknowledge and Document the Python Efficiency Ceiling

Supporting TR	Evidence	Strength
TR116	Python never exceeds 86% efficiency across all three models; structural limitation of asyncio event loop	Primary
TR113 vs TR114	Even after architectural fixes (dual Ollama), Rust outperforms Python by 12-17pp in multi-agent scenarios	Confirmatory

Decision logic: TR116 establishes that the Python efficiency ceiling (~86%) is model-independent -- it appears with Gemma 3 (80.2%), Llama 3.1 (83.8%), and Qwen 2.5 (77.6%). This ceiling is structural: Python's single-threaded asyncio event loop saturates under concurrent LLM coordination overhead. No amount of Python optimization can exceed this ceiling; migration to Rust is the only path to >90% efficiency.

J.6 Decision 6: Configuration Transfer Between Workload Types Fails

Supporting TR	Evidence	Strength
TR108 vs TR109	Single-inference optimal (num_gpu=999, num_ctx=4096) degrades agent workflow performance; agent-optimal is num_gpu=60-80, num_ctx=256-512	Primary
TR109 vs TR110	Agent workflow optimal does not transfer to multi-agent concurrent; multi-agent requires num_ctx=2048, homogeneous configs	Confirmatory

Decision logic: Each workload class has distinct optimization characteristics. Blindly applying single-inference parameters to agent workflows or multi-agent scenarios produces measurably suboptimal results. This finding has broad implications: any deployment must benchmark its specific workload type rather than relying on published benchmarks from different workload classes.

Appendix K: Extended Literature Context

K.1 Rust Async Ecosystem

Tokio Architecture. Tokio is a multi-threaded async runtime for Rust built on a work-stealing scheduler. Worker threads maintain local task queues; when a thread's queue is empty, it steals tasks from other threads' queues. This design minimizes contention on shared queues while maximizing CPU utilization. For LLM inference workloads, Tokio's work-stealing approach introduces measurable but small overhead compared to thread-pinning approaches (TR115 v2 quantifies this at < 2pp efficiency delta).

Reqwest HTTP Client. Reqwest is the Rust HTTP client used for Ollama API communication. It is built on Tokio and hyper, providing connection pooling, HTTP/1.1 and HTTP/2 support, and async streaming. In TR114 v2, reqwest's async buffering model was identified as a contributor to coordination overhead, adding latency compared to Python's httpx during concurrent request patterns.

K.2 Ollama Internals

Model Loading. Ollama loads GGUF-format models into GPU memory layer by layer. The num_gpu parameter controls how many transformer layers are offloaded to GPU (remaining layers execute on CPU). Full offload (all layers on GPU) maximizes throughput but increases VRAM pressure. For the RTX 4080 Laptop (12 GB), Gemma 3 (4.3B Q4_K_M, 3.3 GB base memory) can be fully offloaded with room for KV cache.

Concurrent Request Handling. A single Ollama instance serializes inference requests: concurrent requests are queued and processed sequentially. This is the root cause of the contention observed in TR113. Dual Ollama instances eliminate this serialization by providing independent inference pipelines.
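
A minimal sketch of the one-instance-per-agent mapping this implies is shown below. The base URLs follow the ports used in Appendix H; the function name and the decision to raise an error instead of sharing an instance are illustrative choices.

```python
OLLAMA_INSTANCES = ["http://127.0.0.1:11434", "http://127.0.0.1:11435"]


def assign_instances(agent_ids: list, instances: list = OLLAMA_INSTANCES) -> dict:
    """Give each agent a dedicated Ollama instance; refuse to share one."""
    if len(agent_ids) > len(instances):
        raise ValueError("not enough Ollama instances for a one-per-agent mapping")
    return dict(zip(agent_ids, instances))


print(assign_instances(["agent_1", "agent_2"]))
```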

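GGUF Format. GGUF (GPT-Generated Unified Format) is the file format used by Ollama for quantized model weights. Q4_K_M, a 4-bit "k-quant" scheme from llama.cpp (block-wise quantization with per-block scale factors; the M suffix denotes the medium variant), was the primary quantization across the TR108-TR116 experiments, providing a balance of model size, inference speed, and output quality. TR108 also reports a Llama 3.1 Q4_0 configuration.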

K.3 LLM Multi-Agent Patterns

Concurrent Coordination. Multi-agent LLM systems require coordination between agents sharing GPU resources. The key insight from TR110-TR114 is that infrastructure architecture (single vs dual Ollama) dominates over language runtime choice (Rust vs Python) in determining parallel efficiency. Coordination overhead in multi-agent LLM workloads is fundamentally I/O-bound (HTTP requests to Ollama), not compute-bound.

Parallel Efficiency Measurement. Parallel efficiency is defined as (speedup / N) x 100%, where speedup = sequential_time / concurrent_time and N is the number of agents. An efficiency of 100% indicates zero coordination overhead. Values above 95% are considered excellent for real-world systems with shared resources.
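
The definition above translates directly into code. The timings in this sketch are illustrative values chosen to show the two extremes for two agents, not measurements.

```python
def parallel_efficiency(sequential_s: float, concurrent_s: float, n_agents: int) -> float:
    """Parallel efficiency in percent: (speedup / N) x 100."""
    speedup = sequential_s / concurrent_s
    return speedup / n_agents * 100.0


# Two agents finishing in half the sequential time: ideal 2.0x speedup.
print(parallel_efficiency(100.0, 50.0, 2))    # 100.0
# Two agents taking as long as the sequential run: fully serialized.
print(parallel_efficiency(100.0, 100.0, 2))   # 50.0
```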

K.4 Python Asyncio Limitations

Single-Threaded Event Loop. Python's asyncio runs all coroutines on a single thread. While the GIL is released during I/O operations (enabling true concurrency for network calls), the event loop itself is single-threaded: task scheduling, callback dispatch, and coroutine resumption all execute sequentially on one core. Under high coordination load (many concurrent agents exchanging data), this single-threaded scheduler becomes the bottleneck, producing the ~86% efficiency ceiling observed in TR116.

GIL Implications. The Global Interpreter Lock prevents true parallel execution of Python bytecode. For I/O-bound LLM inference (where most time is spent waiting for Ollama responses), the GIL is largely irrelevant because it is released during network I/O. However, the GIL does affect CPU-bound coordination tasks (data serialization, result aggregation), contributing to the observed efficiency gap versus Rust.

K.5 Quantization Impact

Q4_K_M Characteristics. The experiments used Q4_K_M as the primary quantization format. Q4_K_M is a 4-bit k-quant scheme that stores weights in small blocks with per-block scale factors. This quantization level typically preserves 95-98% of full-precision model quality while reducing model size by approximately 4x and improving inference throughput by 2-3x. The quality-performance tradeoff was deemed acceptable for the gaming dialogue use case (TR108).

Appendix L: Measurement Boundary Catalog

L.1 Boundary Definitions

Each technical report defines an explicit measurement boundary that determines which operations are included in timing measurements and which are excluded. Maintaining clear boundaries is essential for reproducibility and cross-TR comparability.

L.2 Per-TR Boundary Table

TR	Measurement Boundary	Included in Timing	Excluded from Timing
TR108	Ollama API call (single inference)	Model inference, TTFT, token generation, response deserialization	Process startup, model loading, Ollama instance startup
TR109	Agent workflow end-to-end	File I/O (scan, read), LLM calls (analysis + report), data processing, output writing	Process isolation overhead, Ollama instance management
TR110	Concurrent execution window	Both agents' full workflows (I/O + LLM), asyncio coordination	Sequential baseline overhead, Ollama instance startup/teardown
TR111	Rust workflow end-to-end	File system scan, data ingestion, LLM calls (analysis + report), metric collection	Binary compilation time, Ollama instance management
TR112	Cross-language comparison	Identical workflow on identical inputs; file I/O, LLM calls, output generation	Language-specific setup (Python venv, Rust compilation), Ollama management
TR113	Rust concurrent (single Ollama)	Both agents' full workflows, shared Ollama instance, contention events	Ollama instance startup, binary startup
TR114	Rust concurrent (dual Ollama)	Both agents' full workflows, dedicated Ollama instances, coordination overhead	Ollama instance startup/teardown, binary compilation
TR115	Runtime comparison	Identical multi-agent workflow across 5 Rust async runtimes	Runtime-specific initialization, binary compilation per runtime
TR116	Cross-model comparison	Same multi-agent scenarios with Gemma 3, Llama 3.1, Qwen 2.5	Model download, model format conversion, Ollama setup

L.3 Boundary Rationale

The measurement boundaries were designed to isolate the variable under study in each TR:

TR108-TR109: Isolate inference and workflow performance from infrastructure setup.
TR110, TR113-TR114: Isolate concurrency behavior from sequential execution characteristics.
TR112: Isolate language runtime effects by ensuring identical workflow and inputs.
TR115: Isolate async runtime effects by ensuring identical workflow, model, and hardware.
TR116: Isolate model architecture effects by ensuring identical runtime, infrastructure, and scenarios.

L.4 Cross-TR Comparability Notes

TR108 and TR109 measurements are not directly comparable because TR109 includes file I/O and multi-step workflows that TR108 excludes.
TR110 (Python) and TR114 (Rust) measurements are comparable because both use dual Ollama and identical agent workflow structure.
TR113 and TR114 measurements differ in architecture (single vs dual Ollama) but are otherwise comparable, enabling the architecture impact analysis.
TR115 measurements are internally comparable (same hardware, model, workflow; only runtime varies).
TR116 measurements are internally comparable (same runtime, hardware, architecture; only model varies).

Appendix N: Expanded Discussion

N.1 The Python-to-Rust Migration Narrative

The migration from Python to Rust succeeded because of three factors that are often underestimated in language migration discussions. First, the LLM inference workload is I/O-bound at the application layer (HTTP calls to Ollama) but compute-bound at the server layer (GPU inference). This means Rust's compile-time guarantees and zero-cost abstractions provide operational benefits (memory safety, deployment simplicity, startup speed) without sacrificing the throughput that is determined by the GPU. Second, the Tokio async ecosystem provides mature, production-grade HTTP client and runtime libraries that match Python's asyncio/httpx ecosystem in functionality. Third, the workflow complexity (file I/O, data processing, LLM calls) translates naturally to Rust's ownership model without requiring complex lifetime annotations or unsafe code.

The 15.2% throughput improvement observed in TR112 v2 likely reflects Rust's more efficient HTTP request/response handling and reduced per-request overhead, not faster GPU inference (which is identical regardless of client language).

N.2 The Architecture Discovery

The most consequential finding of the TR108-TR116 program is that dual Ollama architecture matters more than language choice for multi-agent performance. TR113 demonstrated that Rust with single Ollama achieves only 82.2% efficiency -- worse than Python with dual Ollama (99.25%, TR110). TR114 proved that Rust with dual Ollama matches or exceeds Python (99.396% vs 99.25%). The implication is clear: architectural decisions (how many inference server instances to deploy) dominate over implementation decisions (which programming language to use) for multi-agent LLM workloads. This finding generalizes beyond the specific Rust-vs-Python comparison.

N.3 The Runtime Lesson: Consistency Beats Peak Performance

TR115 v2 demonstrated that all four working Rust async runtimes achieve approximately identical peak performance (99.87-99.99% efficiency). The differentiator is consistency: tokio-default (1.21pp sigma) provides the most reliable performance, while tokio-localset (4.03pp sigma, 81.03% minimum) shows dangerous variance. In production systems, a consistently good runtime is preferable to one that occasionally achieves 99.99% but drops to 81% unpredictably. This principle -- consistency over peak -- should guide runtime and configuration selection broadly.

N.4 The Model Selection Paradox

TR116 revealed that model performance rankings depend on the measurement context. Gemma 3 leads on throughput (102.85 tok/s) and multi-agent efficiency (99.2% chimera-homo). However, Llama 3.1, despite lower throughput (~68 tok/s in TR116 multi-agent testing), scales nearly as well under concurrency (98.5% chimera-homo) and may offer superior reasoning quality for complex agent tasks. Qwen 2.5, a 7B model, underperforms both other models in multi-agent efficiency (90.0%) due to heavier KV cache pressure. The paradox: the "best" model depends on whether you optimize for throughput, efficiency, reasoning quality, or a weighted combination.

N.5 Configuration Transfer Failure as a General Principle

The finding that inference-optimal, agent-optimal, and multi-agent-optimal configurations differ (TR108 vs TR109 vs TR110) is a specific instance of a general principle: benchmarks transfer poorly across workload types. This has implications far beyond the TR108-TR116 program. Any team deploying LLM-based systems should benchmark their specific workload type rather than relying on published benchmarks, vendor datasheets, or results from different workload classes.

N.6 Implications for the Broader LLM Serving Community

The TR108-TR116 findings suggest several implications for the community: (1) Multi-instance inference servers should be the default for concurrent workloads, not an optimization. (2) Language runtime choice matters less than architecture for I/O-bound coordination workloads. (3) Model selection for multi-agent systems requires multi-agent benchmarks, not single-agent benchmarks. (4) Async runtime consistency should be evaluated alongside peak performance.

Appendix O: Extended Results Narratives

O.1 TR108: Single-Agent Baseline Establishes Model Performance Hierarchy

TR108 established the foundational performance hierarchy through 158+ benchmark configurations. Gemma 3 (4.3B parameters, Q4_K_M) delivered 102.85 tok/s at optimal configuration, outperforming Llama 3.1 8B (Q4_0) at 76.59 tok/s by 34%. This result was attributed to Gemma 3's smaller parameter count and more efficient attention architecture, both of which reduce per-token compute requirements. The study identified GPU layer allocation (num_gpu) as the most impactful parameter, followed by context size (num_ctx). Temperature had minimal throughput impact but significant quality effects. The 158 configurations provided high statistical confidence in the rankings and identified the parameter sensitivity landscape that would inform all subsequent TRs.

O.2 TR109: Configuration Transfer Failure Reveals Workload-Specific Optimization

TR109 applied TR108's optimal single-inference configurations to multi-step agent workflows and observed measurable performance degradation. The critical discovery: maximal GPU offload (num_gpu=999) and large context windows (num_ctx=4096), which maximize single-inference throughput, are suboptimal for agent workflows where TTFT and I/O interleaving dominate. Agent-optimal configurations (num_gpu=60-80, num_ctx=256-512, temp=0.6) prioritize fast first-token delivery and moderate resource utilization. This finding invalidated the assumption that single-inference benchmarks are sufficient for production deployment planning and motivated the multi-agent studies that followed.

O.3 TR110: Python Achieves Near-Theoretical Multi-Agent Ceiling

TR110 demonstrated that Python asyncio with dual Ollama instances achieves 99.25% parallel efficiency for two concurrent agents -- within 0.75 percentage points of the theoretical maximum (2.0x speedup). The study used 150 benchmark runs across 30 configurations with Gemma 3 on dual Ollama instances (ports 11434/11435). Key findings included: (1) homogeneous agent configurations outperform heterogeneous, (2) context size of 2048 tokens maximizes concurrent efficiency, (3) temperature has negligible impact on concurrency (delta < 3%), and (4) GPU layers >= 80 are required for zero-contention operation. TR110 established the Python multi-agent baseline that all subsequent Rust comparisons referenced.

O.4 TR111: Rust Achieves Full Workflow Parity with Near-Hardware-Limit Throughput

TR111 v2 validated that the Rust agent implementation achieves full workflow parity with Python: identical operations (file system scan, data ingestion, multi-stage LLM analysis, report generation, metric tracking) executed with equivalent correctness. Performance: 114.54 tok/s baseline throughput (Rust) versus approximately 99.34 tok/s (Python), representing a 15.2% advantage. Throughput variation across 19 configurations was remarkably low (0.9%), indicating that the GPU inference speed, not client-side processing, determines throughput. TTFT showed 150x more variation than throughput, confirming it as the primary optimization target for agent workflows.

O.5 TR112: Head-to-Head Comparison Confirms Rust's Systematic Advantages

TR112 v2 provided the definitive cross-language comparison using 37 configurations (19 Rust, 18 Python) and 111 benchmark runs. Results across all metrics favored Rust: +15.2% throughput (114.54 vs 99.34 tok/s), -58% TTFT (603 vs 1437 ms cold start), -67% memory usage (75 vs 250 MB estimated), -83% startup time (0.2 vs 1.5 seconds). Rust also demonstrated higher optimization success rate (72.2% vs 38.9%) and lower variance (2.6% vs 4.8% CV). The study estimated direct GPU compute savings of approximately $444/year at 1M requests/month (single agent). Against a $5,000 migration investment, that corresponds to a break-even of roughly 11 years on GPU savings alone ($5,000 / $444 per year; see Appendix V for detailed calculations).

O.6 TR113: Single Ollama Bottleneck Reveals Architectural, Not Language, Limitation

TR113 tested Rust multi-agent with a single Ollama instance and observed 82.2% peak efficiency with 63% contention rate -- dramatically worse than Python's 99.25% (TR110, dual Ollama). The initial interpretation was that Rust's async runtime introduced concurrency overhead. The correct interpretation, established by TR114, was that the single Ollama instance serialized concurrent requests regardless of client language. TR113's value was diagnostic: it identified the architectural bottleneck (Ollama server-level serialization) that TR114 would solve. The 63% contention rate and 82.2% efficiency ceiling became the baseline for measuring the dual Ollama improvement.

O.7 TR114: Dual Ollama Fix Transforms Rust Multi-Agent Performance

TR114 v2 deployed dual Ollama instances for Rust multi-agent and achieved 99.396% peak configuration efficiency (135 runs, 27 configurations). This represented a 17.2 percentage point improvement over TR113's single-Ollama results (82.2%) and matched Python's TR110 results (99.25%). The contention rate dropped from 63% to less than 1%. Overall average efficiency was 98.281% across all configurations -- robust and not dependent on specific parameter tuning. This TR confirmed the hypothesis from TR113: the bottleneck was architectural (single Ollama serialization), not language-related (Rust async overhead).

O.8 TR115: Runtime Deep Dive Shows Consistency Matters More Than Peak

TR115 v2 benchmarked five Rust async runtimes (tokio-default, tokio-localset, smol, smol-1kb, async-std) with 150 runs. All four functional runtimes achieved near-identical peak efficiency (99.87-99.99%), rendering peak performance meaningless for runtime selection. The decisive metric was consistency: tokio-default (98.72% mean, 1.21pp sigma) outperformed all alternatives on reliability. Tokio-localset achieved the highest single-run peak (99.99%) but exhibited dangerous variance (4.03pp sigma, 81.03% minimum). Smol suffered a pathological failure (72.80% on one run). Async-std achieved exactly 50% on all runs due to a Tokio HTTP bridge conflict causing perfect serialization. The conclusion: for production, choose the most consistent runtime, not the one with the highest peak.
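The selection rule described above (prefer the lowest spread among runtimes that clear a minimum mean) can be sketched as follows. The per-run samples are hypothetical values shaped to resemble the reported summary statistics, not TR115 raw data, and the 95% floor is an assumption.

```python
from statistics import mean, stdev


def summarize(samples):
    """Mean, sample standard deviation, minimum and peak of per-run efficiencies (%)."""
    return {
        "mean": mean(samples),
        "sigma": stdev(samples),
        "min": min(samples),
        "peak": max(samples),
    }


def most_consistent(results, min_mean=95.0):
    """Pick the runtime with the smallest sigma among those whose mean clears min_mean.

    The floor excludes runtimes that are consistent but broken (for example a
    flat 50%). Raises ValueError if no runtime clears the floor.
    """
    stats = {name: summarize(samples) for name, samples in results.items()}
    eligible = [name for name, s in stats.items() if s["mean"] >= min_mean]
    return min(eligible, key=lambda name: stats[name]["sigma"])


def highest_peak(results):
    """Pick the runtime with the best single run, ignoring variance."""
    return max(results, key=lambda name: max(results[name]))


# Hypothetical per-run efficiencies (%), for illustration only.
runs = {
    "tokio-default": [98.9, 99.4, 97.1, 99.6, 98.6],
    "tokio-localset": [99.99, 81.03, 99.5, 98.7, 99.2],
    "async-std": [50.0, 50.0, 50.0, 50.0, 50.0],
}

print(most_consistent(runs))  # tokio-default
print(highest_peak(runs))     # tokio-localset
```

The two functions disagree on the winner, which is the point of the TR115 conclusion: ranking by peak and ranking by consistency select different runtimes.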

O.9 TR116: Cross-Model Validation Confirms Universality

TR116 tested three models (Gemma 3, Llama 3.1, Qwen 2.5) across both runtimes (Rust/Tokio, Python/asyncio) with 60 benchmark runs. Rust outperformed Python for every model tested: Gemma 3 (97.3% vs 80.2%), Llama 3.1 (96.5% vs 83.8%), Qwen 2.5 (90.0% vs 77.6%). The Rust advantage ranged from 12 to 17 percentage points, confirming it as a structural runtime benefit rather than a model-specific artifact. Python never exceeded 86% efficiency regardless of model, establishing the asyncio ceiling as a universal constraint. Gemma 3 emerged as the scaling champion (99.2% chimera-homo efficiency in Rust), while Qwen 2.5 showed the highest coordination overhead, likely due to heavier KV cache requirements.

Appendix P: Decision-Grade Reporting Rubric

P.1 Checklist

A benchmark report qualifies as "decision-grade" when it satisfies all seven criteria below. Reports that fail any criterion should be treated as exploratory or provisional.

Measurement boundary explicitly defined. The report must state what operations are included in timing measurements and what operations are excluded. See Appendix L for the boundary catalog.
Artifact chain from raw data to conclusion. Every quantitative claim must be traceable to raw benchmark data files (CSV, JSON). The report must specify file paths or artifact identifiers for all referenced data.
Statistical validation (n >= 3 runs). Each configuration must be tested with a minimum of 3 independent runs. Reports must include mean, standard deviation, and coefficient of variation for all key metrics. Single-run results are insufficient for decision-making.
Limitation acknowledgment. The report must explicitly state hardware constraints, workload scope limitations, and conditions under which results may not generalize. Absence of limitations indicates incomplete analysis.
Decision translation. The report must translate numerical findings into actionable decisions. Raw numbers (e.g., "throughput is 114.54 tok/s") are insufficient; the report must state what decision the number supports (e.g., "Rust is recommended for production because its 15.2% throughput advantage reduces per-request cost").
Invalidation triggers stated. The report must identify conditions that would invalidate its conclusions. Examples: "Results may not hold for GPUs with less than 12 GB VRAM," or "Conclusions assume Ollama version 0.1.17; breaking changes in newer versions require re-validation."
Cross-validation with independent evidence. At least one key finding should be corroborated by evidence from a different TR or experimental setup. Single-TR conclusions are provisional; cross-validated conclusions are decision-grade.
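A minimal sketch of how the seven criteria could be tracked per report, together with the summary statistics required by the statistical-validation criterion. The class and field names are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass, fields
from statistics import mean, stdev


@dataclass
class RubricCheck:
    """One boolean per decision-grade criterion, in the order listed above."""
    measurement_boundary_defined: bool
    artifact_chain_present: bool
    statistically_validated: bool
    limitations_acknowledged: bool
    decisions_translated: bool
    invalidation_triggers_stated: bool
    cross_validated: bool

    def criteria_met(self) -> int:
        return sum(bool(getattr(self, f.name)) for f in fields(self))

    def is_decision_grade(self) -> bool:
        # Decision-grade requires every criterion, not a majority.
        return self.criteria_met() == len(fields(self))


def summarize_runs(values):
    """Mean, standard deviation and CV for one configuration (n >= 3 required)."""
    if len(values) < 3:
        raise ValueError("at least 3 independent runs are required")
    m, s = mean(values), stdev(values)
    return {"n": len(values), "mean": m, "std_dev": s, "cv_pct": s / m * 100}


# A report that lacks independent corroboration scores 6/7 and stays provisional.
check = RubricCheck(True, True, True, True, True, True, cross_validated=False)
print(check.criteria_met(), check.is_decision_grade())  # 6 False
```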

P.2 Application to TR108-TR116

TR	Criteria Met	Notes
TR108	7/7	158+ configs, extensive parameter sweeps, cross-validated by TR109
TR109	7/7	20+ configs, 3 phases, directly cross-validates TR108 config transfer failure
TR110	7/7	150 runs, 30 configs, 5 runs each, cross-validated by TR114
TR111 v2	7/7	57 runs, 19 configs, 3 runs each, cross-validated by TR112 v2
TR112 v2	7/7	111 runs, 37 configs, direct Python-Rust comparison
TR113	6/7	Missing cross-validation at time of publication; later validated by TR114
TR114 v2	7/7	135 runs, 27 configs, 5 runs each, cross-validates TR113
TR115 v2	7/7	150 runs, 5 runtimes, extensive statistical analysis
TR116	7/7	60 runs, 3 models x 2 runtimes, cross-validates TR114

Appendix Q: Decision Case Studies

Q.1 "Should we migrate our Python agent to Rust?"

Context: A team runs a single-agent LLM workflow (file processing + inference) on Python/asyncio with Ollama. They experience occasional throughput dips and high memory usage.

Evidence: TR112 v2 shows Rust delivers +15.2% throughput, -58% TTFT, -67% memory, -83% startup versus Python for identical workflows.

Decision: Yes, if throughput or multi-agent efficiency matters for the deployment. On direct GPU savings alone, the migration investment takes roughly 11 years to break even at 1M requests/month ($5,000 / $444 per year, Appendix V), so the case rests on throughput, TTFT, and memory gains rather than cost recovery. If the team plans to scale to multi-agent, the migration becomes more compelling (Python ceiling of ~86% vs Rust's 98%+).

Caveat: If the team prioritizes development velocity and the workload is single-agent with no scaling plans, Python may remain preferable.

Q.2 "We are deploying 2 concurrent agents -- single or dual Ollama?"

Context: A deployment runs two LLM agents concurrently on a single GPU (RTX 4080 Laptop or similar, 12 GB VRAM).

Evidence: TR113 shows single Ollama produces 82.2% efficiency with 63% contention. TR114 shows dual Ollama produces 99.396% efficiency with <1% contention. The improvement is 17+ percentage points.

Decision: Always dual Ollama. There is no scenario in which single Ollama is preferable for two concurrent agents, provided the GPU has room for both instances. The cost is a second copy of the model weights in VRAM, because weights are loaded per instance (3.3 GB for Gemma 3 Q4_K_M), plus the small host-memory overhead of a second Ollama process.

Caveat: Ensure sufficient VRAM for two model instances. For the RTX 4080 Laptop (12 GB), Gemma 3 Q4_K_M (3.3 GB per instance) fits comfortably. Larger models may require reducing GPU layer offload.

Q.3 "Which async runtime should we use?"

Context: A Rust team is choosing an async runtime for their LLM agent system.

Evidence: TR115 v2 benchmarked 5 runtimes across 150 runs. Tokio-default achieves 98.72% mean efficiency with 1.21pp standard deviation. All alternatives are either less consistent (tokio-localset: 4.03pp sigma) or exhibit pathological failures (smol: 72.80% minimum; async-std: 50% serialization).

Decision: Tokio-default, with no exceptions. It provides the best consistency, the most mature ecosystem, and requires no custom configuration. Use #[tokio::main] and the standard Tokio runtime.

Caveat: If binary size is the primary constraint and the team accepts slightly higher variance, smol-1kb (1.32pp sigma) is a viable alternative.

Q.4 "Gemma or Llama for our agent swarm?"

Context: A team is selecting a model for a multi-agent swarm (2+ concurrent agents) on consumer GPU hardware.

Evidence: TR116 shows Gemma 3 achieves 99.2% chimera-homo efficiency, and TR108 measured 102.85 tok/s single-agent throughput. Llama 3.1 achieves 98.5% chimera-homo efficiency but only 76.59 tok/s (TR108). Gemma leads on both metrics.

Decision: Gemma 3 for throughput-sensitive workloads (gaming dialogue, high-frequency queries). Llama 3.1 for reasoning-heavy workloads where output quality is more important than speed. The 34% throughput difference is substantial; the 0.7pp efficiency difference is negligible.

Caveat: Quality was not formally measured in TR108-TR116 (see Risk R10). If reasoning quality is mission-critical, conduct a quality evaluation before committing to Gemma 3.

Q.5 "Can we just optimize our Python code instead of migrating?"

Context: A team observes suboptimal multi-agent efficiency in Python and considers code optimization instead of Rust migration.

Evidence: TR116 shows Python never exceeds 86% efficiency regardless of model (Gemma: 80.2%, Llama: 83.8%, Qwen: 77.6%). This ceiling is structural: asyncio's single-threaded event loop saturates under concurrent coordination load. TR113 vs TR114 shows that even architectural fixes (dual Ollama) cannot push Python past this ceiling.

Decision: No. The ~86% ceiling is a runtime limitation, not a code quality issue. Optimizing Python code may improve performance within the ceiling (e.g., from 77% to 84%) but cannot break through it. Only a runtime migration (to Rust/Tokio or another multi-threaded runtime) enables >90% efficiency.

Caveat: If the workload is single-agent (no concurrency), the Python ceiling is irrelevant and Python optimization may be sufficient.

Appendix S: Governance Templates

S.1 Benchmark Report Template

# Technical Report [NNN]: [Title]
## [Subtitle]

**Date:** YYYY-MM-DD
**Hardware:** [GPU, CPU, RAM, OS]
**Total Runs:** [N configurations x M runs]
**Model(s):** [model:quantization]
**Related Work:** [List of prior TRs]

## Executive Summary
[3-5 sentence summary of findings and their implications]

## Methodology
- Measurement boundary: [What is included/excluded]
- Statistical protocol: [N runs per config, metrics collected]
- Hardware configuration: [Full spec table]

## Results
[Tables and analysis with mean, std dev, CV for all metrics]

## Limitations
[Hardware constraints, workload scope, generalization caveats]

## Conclusions and Decisions
[Actionable decisions with invalidation triggers]

## Appendices
[Raw data references, configuration files, artifact paths]

S.2 Configuration Change Request Template

# Configuration Change Request

**Requester:** [Name]
**Date:** YYYY-MM-DD
**Affected Component:** [Ollama, Rust binary, model, runtime]

## Current Configuration
[Parameter: current value]

## Proposed Configuration
[Parameter: proposed value]

## Justification
[Which TR supports this change? Specific evidence.]

## Risk Assessment
[What could go wrong? Rollback plan.]

## Validation Plan
[How will we verify the change improves performance?]
[Minimum N runs required for validation.]

## Approval
[  ] Benchmarked on target hardware
[  ] Reviewed against relevant TR findings
[  ] Rollback procedure documented

S.3 Performance Regression Report Template

# Performance Regression Report

**Date Detected:** YYYY-MM-DD
**Severity:** [Critical / Warning]
**Metric Affected:** [efficiency / throughput / TTFT / contention]

## Observed Behavior
[Current metric value vs expected baseline]

## Diagnostic Steps Taken
[Commands run, logs checked, hardware inspected]

## Root Cause
[Identified cause or "Under Investigation"]

## Resolution
[Fix applied or proposed]

## Prevention
[How to prevent recurrence]

S.4 Re-Validation Trigger Checklist

A re-validation benchmark run is required when any of the following conditions are met:

Appendix T: Extended Risk Register

Risk ID	Risk Description	Likelihood	Impact	Mitigation Strategy	Owner	Status
R1	Hardware-specific results do not generalize to different GPUs (e.g., RTX 3090, A100)	Medium	High	Re-benchmark on target hardware before deployment; document hardware-specific assumptions	Deployment Lead	Open
R2	Ollama update breaks dual-instance configuration or changes serialization behavior	Low	High	Pin Ollama version; test new versions in staging before production upgrade	Infrastructure	Open
R3	Tokio breaking change in major version alters work-stealing scheduler behavior	Low	Medium	Monitor Tokio release notes; maintain smol-1kb as fallback runtime; pin Tokio version in Cargo.lock	Development	Open
R4	Gemma model quality regression in future releases (architecture changes, quantization artifacts)	Medium	Medium	Maintain quality benchmark suite; evaluate new model versions against baseline before adoption	Research	Open
R5	Scaling beyond 2 agents introduces unknown contention patterns not covered by TR108-TR116	High	Medium	Plan TR117+ investigations for 3+ agent scaling; do not extrapolate 2-agent results	Research	Open
R6	Windows-specific behavior (GPU driver, process scheduling) does not reproduce on Linux	Medium	Medium	Conduct Linux validation study; document any OS-specific findings	Infrastructure	Open
R7	VRAM pressure at scale: larger models or more agents exceed 12 GB budget	Medium	High	Monitor VRAM utilization in production; set GPU layer budgets per model; implement VRAM-aware scheduling	Operations	Open
R8	Configuration drift: production configs diverge from benchmarked configs over time	High	Medium	Implement config-as-code with version control; automated validation against benchmark baselines	DevOps	Open
R9	Shadow pricing inaccuracy: cost estimates in TR112 v2 may not reflect actual cloud pricing	Medium	Low	Validate cost estimates against actual cloud bills quarterly; update pricing model	Finance	Open
R10	Quality unmeasured in multi-agent: efficiency metrics do not capture output quality degradation	High	Medium	Add quality metrics (coherence, relevance, factual accuracy) to benchmark suite in future TRs	Research	Open
R11	Thermal throttling under sustained load reduces throughput below benchmarked levels	Medium	Medium	Monitor GPU temperature; implement cooling-aware load scheduling; benchmark sustained (>1hr) workloads	Operations	Open
R12	Reqwest HTTP client update changes buffering behavior, altering coordination overhead	Low	Low	Pin reqwest version; benchmark after updates	Development	Open
R13	Model context length requirements exceed benchmarked ranges (256-2048 tokens)	Medium	Medium	Benchmark with production-representative context lengths before deployment	Research	Open

Appendix U: Program Evolution Narrative

U.1 Phase 1: Python Baseline (TR108-TR110)

The research program began with TR108, which established single-agent LLM performance baselines on the RTX 4080 Laptop. Through 158+ configurations, TR108 identified Gemma 3 as the throughput champion (102.85 tok/s) and mapped the parameter sensitivity landscape (GPU layers > context size > temperature). TR109 extended this to agent workflows and discovered the first major insight: configuration transfer fails. Parameters optimal for single inference (maximal GPU offload, large context) degrade agent workflow performance. TR110 pushed into multi-agent territory, achieving 99.25% parallel efficiency with Python asyncio and dual Ollama. At the end of Phase 1, the program had established a high Python performance baseline and identified the workload-specific nature of optimization.

U.2 Phase 2: Rust Migration (TR111-TR112)

Phase 2 asked whether Rust could match Python's performance while providing operational advantages (memory efficiency, startup speed, deployment simplicity). TR111 v2 demonstrated full workflow parity: the Rust agent performed identical operations to Python with 114.54 tok/s throughput. TR112 v2 provided the definitive comparison: Rust delivered +15.2% throughput, -58% TTFT, -67% memory, -83% startup. The migration was validated as beneficial for production deployment.

U.3 Phase 3: Architecture Discovery (TR113-TR114)

Phase 3 produced the program's most important finding. TR113 tested Rust multi-agent with a single Ollama instance and observed only 82.2% efficiency with 63% contention -- initially interpreted as a Rust async runtime limitation. TR114 proved this interpretation wrong: deploying dual Ollama instances eliminated the bottleneck entirely, achieving 99.396% efficiency. The insight was architectural: Ollama's single-instance serialization, not Rust's async runtime, was the bottleneck. This discovery reframed the program's understanding: infrastructure architecture dominates over language runtime for multi-agent LLM workloads.

U.4 Phase 4: Runtime Selection (TR115)

With the architectural question settled, TR115 optimized within the Rust ecosystem by comparing five async runtimes. The finding that all functional runtimes achieve approximately 100% peak efficiency (99.87-99.99%) shifted the selection criterion from "which is fastest" to "which is most consistent." Tokio-default won on consistency (1.21pp sigma), producing the production recommendation.

U.5 Phase 5: Cross-Model Validation (TR116)

TR116 validated whether the Rust advantage and Python ceiling were model-dependent artifacts. Testing three models (Gemma 3, Llama 3.1, Qwen 2.5) across both runtimes confirmed: Rust outperforms Python by 12-17 percentage points for every model tested, and Python never exceeds 86% efficiency regardless of model. These findings are structural and universal within the tested hardware and software configuration.

U.6 Program Arc Summary

The program evolved from "which model is fastest?" (TR108) to "which architecture is optimal?" (TR113-TR114) to "is the finding universal?" (TR116). Each phase built on the previous, and findings from later phases retroactively recontextualized earlier results (e.g., TR113's 82.2% was initially blamed on Rust but later attributed to single Ollama by TR114). The total evidence base -- 903+ runs across 9 reports -- provides high confidence in the six decisions derived from this program.

Appendix V: Cost Modeling Examples

V.1 Assumptions

Rust throughput: 114.54 tok/s (TR112 v2 baseline)
Python throughput: 99.34 tok/s (TR112 v2 baseline)
Average request: 200 tokens generated
Average request duration: Rust = 200/114.54 = 1.75s; Python = 200/99.34 = 2.01s
GPU cost: $0.50/hour (consumer-grade GPU cloud instance estimate)
Multi-agent efficiency: Rust = 98% (TR114 v2); Python = 82% (TR116 average)
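The single-agent scenarios that follow all apply the same arithmetic to these assumptions. A minimal sketch is given here, with a hypothetical function name, so the figures can be recomputed for other request volumes or GPU rates.

```python
GPU_RATE_PER_HOUR = 0.50   # USD, from the assumptions above
TOKENS_PER_REQUEST = 200   # average tokens generated per request


def monthly_gpu_cost(requests_per_month, throughput_tok_s,
                     tokens_per_request=TOKENS_PER_REQUEST,
                     gpu_rate=GPU_RATE_PER_HOUR):
    """Inference seconds, GPU hours and GPU cost for one month of single-agent traffic."""
    seconds = requests_per_month * tokens_per_request / throughput_tok_s
    hours = seconds / 3600
    return {"seconds": seconds, "hours": hours, "cost_usd": hours * gpu_rate}


python = monthly_gpu_cost(100_000, 99.34)
rust = monthly_gpu_cost(100_000, 114.54)
print(round(python["hours"], 1), round(rust["hours"], 1))  # 55.9 48.5
```

Small differences from the tabulated figures can arise from rounding hours before multiplying by the hourly rate.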

V.2 Scenario 1: Small Team (100K requests/month)

Metric	Python	Rust	Delta
Total inference time	201,329 s (55.9 hr)	174,659 s (48.5 hr)	-13.2%
GPU cost	$27.95/month	$24.25/month	-$3.70/month
Annual GPU cost	$335.40	$291.00	-$44.40/year

Verdict: Savings are modest at this scale. Migration investment is not justified unless multi-agent scaling is planned.

V.3 Scenario 2: Medium Team (1M requests/month)

Metric	Python	Rust	Delta
Total inference time	2,013,285 s (559 hr)	1,746,590 s (485 hr)	-13.2%
GPU cost	$279.50/month	$242.50/month	-$37.00/month
Annual GPU cost	$3,354.00	$2,910.00	-$444.00/year

Verdict: Annual savings of $444 recover the estimated $5,000 one-time migration investment slowly. Break-even: approximately 11.3 years ($5,000 / $444 per year), so GPU savings alone do not justify migration at this scale.

V.4 Scenario 3: Large Deployment (10M requests/month)

Metric	Python	Rust	Delta
Total inference time	20,132,853 s (5,593 hr)	17,465,903 s (4,852 hr)	-13.2%
GPU cost	$2,796.50/month	$2,426.00/month	-$370.50/month
Annual GPU cost	$33,558.00	$29,112.00	-$4,446.00/year

Verdict: Annual savings of $4,446 with break-even in approximately 13.5 months ($5,000 / $4,446 per year = 1.1 years). Migration is strongly recommended.

V.5 Scenario 4: Multi-Agent Swarm (2 agents, 1M requests/month)

Metric	Python (82% eff.)	Rust (98% eff.)	Delta
Effective throughput (2 agents)	2 x 99.34 x 0.82 = 162.92 tok/s	2 x 114.54 x 0.98 = 224.50 tok/s	+37.8%
Time for 1M requests	200M / 162.92 = 1,227,791 s (341 hr)	200M / 224.50 = 891,314 s (248 hr)	-27.4%
GPU cost (dual GPU, 2 x $0.50/hr)	$341.00/month	$248.00/month	-$93.00/month
Annual GPU cost	$4,092.00	$2,976.00	-$1,116.00/year

Verdict: Multi-agent amplifies the Rust advantage. The 37.8% effective throughput improvement and $1,116/year savings make migration compelling at medium scale.
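The effective-throughput rows in this scenario follow from scaling per-agent throughput by agent count and parallel efficiency. A short sketch of that calculation, with a hypothetical function name:

```python
def effective_throughput(n_agents, per_agent_tok_s, efficiency):
    """Aggregate tok/s for n concurrent agents at a given parallel efficiency (0-1)."""
    return n_agents * per_agent_tok_s * efficiency


python_tps = effective_throughput(2, 99.34, 0.82)
rust_tps = effective_throughput(2, 114.54, 0.98)
gain_pct = (rust_tps / python_tps - 1) * 100

print(round(python_tps, 2), round(rust_tps, 2), round(gain_pct, 1))  # 162.92 224.5 37.8
```

The throughput gap (15%) and the efficiency gap (16 percentage points) multiply rather than add, which is why the combined advantage is larger than either alone.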

V.6 Scenario 5: Break-Even Analysis

Migration cost component	Estimate
Developer time (Rust rewrite)	$4,000 (80 hours at $50/hr)
Testing and validation	$750 (15 hours)
Deployment infrastructure	$250 (one-time)
Total migration cost	$5,000

Scale	Annual Savings	Break-Even ($5,000 / annual savings)
100K req/month (single agent)	$44	~113 years (not justified)
1M req/month (single agent)	$444	~11.3 years
1M req/month (multi-agent)	$1,116	~4.5 years
10M req/month (single agent)	$4,446	~1.1 years (about 13.5 months)
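Break-even time is the one-time migration cost divided by the annual savings, which yields years; multiplying by 12 converts to months. A minimal sketch using the $5,000 estimate and the annual savings figures above (the function name is hypothetical):

```python
MIGRATION_COST = 5_000  # USD, total from the cost component table


def break_even_years(migration_cost, annual_savings):
    """Years needed for cumulative savings to equal the one-time migration cost."""
    if annual_savings <= 0:
        return float("inf")  # never breaks even
    return migration_cost / annual_savings


scenarios = {
    "100K req/month (single agent)": 44,
    "1M req/month (single agent)": 444,
    "1M req/month (multi-agent)": 1_116,
    "10M req/month (single agent)": 4_446,
}

for label, savings in scenarios.items():
    years = break_even_years(MIGRATION_COST, savings)
    print(f"{label}: {years:.1f} years ({years * 12:.1f} months)")
```

Only GPU cost is modeled here; operational benefits such as lower memory use or faster startup are not priced in.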

Appendix W: Workload Taxonomy Extensions

W.1 Taxonomy Dimensions

Beyond the five workload classes defined in Appendix F (Class 5, batch processing, is future work), workloads can be further classified along the following dimensions:

Latency Sensitivity:

Real-time (< 200 ms TTFT): gaming dialogue, interactive chat.
Near-real-time (< 2s TTFT): developer tools, code completion.
Batch (no TTFT constraint): document processing, summarization pipelines.

Concurrency Level:

Single (1 agent): simplest deployment, no contention possible.
Low (2-4 agents): dual Ollama sufficient and well characterized for 2 agents by TR108-TR116; 3-4 agents are not yet benchmarked (see Risk R5).
High (5+ agents): uncharacterized, requires TR117+ investigation.

Context Depth:

Shallow (< 512 tokens): short prompts, single-turn queries.
Moderate (512-2048 tokens): multi-turn chat, agent state.
Deep (> 2048 tokens): document analysis, long-context reasoning.

Quality Criticality:

Low: creative generation, brainstorming (temperature 0.8-1.0).
Medium: structured output, report generation (temperature 0.6-0.8).
High: factual extraction, code generation (temperature 0.0-0.4).

W.2 Classification Examples

Workload	Latency	Concurrency	Context	Quality	Recommended Stack
Gaming banter	Real-time	Single	Shallow	Low	Rust + Gemma 3, num_gpu=max
Agent report gen	Near-real-time	Single	Moderate	Medium	Rust + Gemma 3, num_gpu=60-80
Dual-agent analysis	Near-real-time	Low	Moderate	Medium	Rust + dual Ollama + Gemma 3
Document processing	Batch	Low	Deep	High	Rust + Llama 3.1, num_ctx=4096+
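The first three taxonomy dimensions are threshold-based, so classification can be sketched as a small function. The function name is hypothetical, and treating a TTFT budget of 2 s or more as batch is an assumption, since the taxonomy only defines batch as having no TTFT constraint.

```python
def classify_workload(ttft_budget_ms, n_agents, context_tokens):
    """Map a workload onto the latency, concurrency and context dimensions of W.1.

    ttft_budget_ms is None when the workload has no TTFT constraint.
    """
    if ttft_budget_ms is None or ttft_budget_ms >= 2000:
        latency = "Batch"  # assumption: budgets of 2 s or more are treated as batch
    elif ttft_budget_ms < 200:
        latency = "Real-time"
    else:
        latency = "Near-real-time"

    if n_agents <= 1:
        concurrency = "Single"
    elif n_agents <= 4:
        concurrency = "Low"
    else:
        concurrency = "High"

    if context_tokens < 512:
        context = "Shallow"
    elif context_tokens <= 2048:
        context = "Moderate"
    else:
        context = "Deep"

    return {"latency": latency, "concurrency": concurrency, "context": context}


print(classify_workload(150, 1, 256))
print(classify_workload(None, 2, 4096))
```

Quality criticality is left out because it depends on the task rather than on a measurable threshold.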

Appendix X: Experiment Planning Template

# Experiment Plan: TR[NNN]

## Research Question
[Single, focused question this experiment answers]

## Hypothesis
[Predicted outcome with rationale]

## Variables
- Independent: [What is being varied]
- Dependent: [What is being measured]
- Controlled: [What is held constant]

## Configuration Matrix
| Config ID | Variable 1 | Variable 2 | ... |
|-----------|-----------|-----------|-----|
| C001      | value     | value     | ... |

## Sample Size
- Runs per configuration: [N >= 3, recommended 5]
- Total configurations: [M]
- Total runs: [N x M]
- Estimated duration: [hours]

## Hardware and Software
- GPU: [model, VRAM]
- CPU: [model, cores]
- RAM: [capacity, speed]
- OS: [version]
- Ollama: [version, instance count, ports]
- Model: [name:quantization]
- Runtime: [language, version, async runtime]

## Measurement Boundary
- Included: [operations within timing window]
- Excluded: [operations outside timing window]

## Success Criteria
[What result would confirm/refute the hypothesis?]

## Dependencies
[Prior TRs whose findings this experiment depends on]

## Artifacts
- Raw data: [file path pattern]
- Analysis scripts: [file path]
- Report: [output path]

Appendix Y: Extended Operational Playbook

Y.1 Model Hot-Swap Procedure

Verify current model is idle (no active requests).
On target Ollama instance: ollama pull <new_model>:<quantization>
Issue a warmup request to load the new model into GPU memory.
Run 3 warmup requests (discard results).
Verify TTFT and throughput are within expected ranges.
Update configuration file to reference new model.
Log the model change with timestamp and reason.
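The warmup and verification steps can be scripted. The sketch below assumes Ollama's `POST /api/generate` endpoint with a non-streaming request; the helper names, the model tag, and the 10% tolerance are illustrative assumptions, and `send` is any callable that delivers the request body to the target instance.

```python
import json
import urllib.request


def build_generate_request(model, prompt="warmup"):
    """JSON body for a non-streaming generate request."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")


def http_sender(base_url):
    """Return a send(body) callable that posts to <base_url>/api/generate."""
    def send(body):
        request = urllib.request.Request(
            base_url + "/api/generate",
            data=body,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(request) as response:
            return json.loads(response.read())
    return send


def warm_up(send, model, n_warmup=3):
    """Issue n_warmup requests and discard the results; returns the count sent."""
    for _ in range(n_warmup):
        send(build_generate_request(model))
    return n_warmup


def within_expected(value, baseline, tolerance_pct=10.0):
    """True if value is within tolerance_pct of baseline (tolerance is an assumption)."""
    return abs(value - baseline) <= baseline * tolerance_pct / 100


# Example wiring (not executed here, since it needs a running instance):
#   warm_up(http_sender("http://localhost:11434"), "gemma3:latest")
```

Passing `send` as a parameter keeps the procedure testable without a live server and lets the same code target either instance by port.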

Y.2 Emergency Rollback Procedure

Stop all agent processes.
Restore previous configuration file from version control.
Restart both Ollama instances with previous model.
Run 3 warmup requests per instance.
Verify efficiency returns to baseline (> 95%).
Document the incident in a Performance Regression Report (Appendix S.3).

Y.3 VRAM Budget Management

For RTX 4080 Laptop (12 GB VRAM), the budget allocation is:

Component	VRAM Budget
Ollama Instance 1 (model weights)	3.3 GB (Gemma 3 Q4_K_M)
Ollama Instance 1 (KV cache)	1.0 GB
Ollama Instance 2 (model weights)	3.3 GB
Ollama Instance 2 (KV cache)	1.0 GB
OS and driver overhead	1.0 GB
Safety margin	2.4 GB
Total	12.0 GB

If VRAM usage grows past 9.6 GB and starts consuming the safety margin, reduce num_gpu to offload some layers to CPU, or reduce num_ctx to shrink the KV cache.
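The budget table is a simple sum, so it can be recomputed for other models before deployment. A minimal sketch with a hypothetical function name, using the table's figures as defaults:

```python
def vram_budget(total_gb, weights_gb, instances=2, kv_cache_gb=1.0,
                overhead_gb=1.0, min_margin_gb=0.0):
    """Planned VRAM use and remaining safety margin for identical Ollama instances."""
    used = instances * (weights_gb + kv_cache_gb) + overhead_gb
    margin = total_gb - used
    return {"used_gb": used, "margin_gb": margin, "fits": margin >= min_margin_gb}


gemma = vram_budget(12.0, weights_gb=3.3)
print(round(gemma["used_gb"], 1), round(gemma["margin_gb"], 1))  # 9.6 2.4

# With a 4.5 GB footprint (the Llama 3.1 figure in the model comparison),
# two instances consume the entire 12 GB and leave no safety margin.
llama = vram_budget(12.0, weights_gb=4.5, min_margin_gb=1.0)
print(llama["fits"])  # False
```

The 1.0 GB KV cache and overhead defaults come from the table above; real usage depends on `num_ctx` and driver behavior, so measured values should replace them where available.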

Appendix Z: Efficiency-Quality Tradeoff Analysis

Z.1 Observations from TR109

TR109 established that quality and performance trade differently in agent workflows than in single inference:

Temperature 0.6: Best quality-consistency balance. Structured outputs (reports, analysis) are more coherent. TTFT is 10-15% lower than temperature 1.0.
Temperature 0.8: Moderate creativity. Acceptable for most agent tasks. Minimal throughput difference from 0.6.
Temperature 1.0: Highest variance in output quality. Slightly higher throughput in some configurations. Not recommended for structured output tasks.

Z.2 Quality Proxies

TR108-TR116 did not formally measure output quality. The following proxies were observed:

Output length stability: Lower temperature produces more consistent output lengths, suggesting more deterministic generation.
Structural compliance: Agent-generated reports at temperature 0.6 more reliably followed the requested markdown structure.
Repetition rate: Higher temperatures occasionally produced repetitive outputs (observed anecdotally, not quantified).

Z.3 Recommendations

For production deployments where both efficiency and quality matter:

Use temperature 0.6 for structured output tasks (reports, analysis, code).
Use temperature 0.8 for creative tasks (dialogue, brainstorming).
Avoid temperature 1.0 unless output diversity is explicitly valued over consistency.
Quality measurement (coherence scores, human evaluation) should be added to future TRs (see Risk R10).

Appendices AA-AO: Additional Deep-Dives

AA: Measurement Formula Catalog

Throughput:

throughput_tok_s = total_tokens_generated / generation_duration_seconds

Time to First Token (TTFT):

TTFT_ms = timestamp_first_token - timestamp_request_sent

Concurrency Speedup:

speedup = sequential_estimated_time / concurrent_wall_time
sequential_estimated_time = sum(agent_i_duration for all agents)

Parallel Efficiency:

efficiency_pct = (speedup / N_agents) * 100

Contention Rate:

contention_rate = count(runs_with_TTFT_anomaly) / total_runs
TTFT_anomaly = (TTFT_concurrent - TTFT_baseline) > threshold

Coefficient of Variation:

CV = (standard_deviation / mean) * 100

Throughput Improvement:

improvement_pct = ((throughput_new - throughput_baseline) / throughput_baseline) * 100

Cost Savings:

annual_savings = (python_gpu_hours_per_month - rust_gpu_hours_per_month) * gpu_hourly_rate * 12
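For reference, the catalog formulas translate directly into code. The sketch below is one possible transcription; function and parameter names are chosen for illustration, and the sample standard deviation is an assumption because the catalog does not say which estimator is used.

```python
from statistics import mean, stdev


def throughput_tok_s(total_tokens, generation_seconds):
    return total_tokens / generation_seconds


def speedup(agent_durations_s, concurrent_wall_s):
    """Estimated sequential time (sum of agent durations) over concurrent wall time."""
    return sum(agent_durations_s) / concurrent_wall_s


def efficiency_pct(speedup_value, n_agents):
    return speedup_value / n_agents * 100


def contention_rate(ttft_concurrent_ms, ttft_baseline_ms, threshold_ms):
    """Fraction of runs whose TTFT exceeds the baseline by more than threshold_ms."""
    anomalies = sum(1 for t in ttft_concurrent_ms if t - ttft_baseline_ms > threshold_ms)
    return anomalies / len(ttft_concurrent_ms)


def cv_pct(values):
    return stdev(values) / mean(values) * 100


def improvement_pct(new, baseline):
    return (new - baseline) / baseline * 100


# Two agents of 10 s each finishing together in 10 s is a perfect 2.0x speedup.
print(efficiency_pct(speedup([10.0, 10.0], 10.0), n_agents=2))  # 100.0
print(contention_rate([500, 900, 520, 1000], ttft_baseline_ms=500, threshold_ms=200))  # 0.5
```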

AB: Phase-Specific Observations

TR108 Phases:

Phase 1 (Llama 3.1): 3 quantizations tested (FP16, Q8_0, Q4_0). Q4_0 provided best throughput-to-quality ratio.
Phase 2 (Gemma 3): 3 variants tested. Default (Q4_K_M) provided optimal balance.
Phase 3 (Cross-model): Gemma 3 established as throughput champion.

TR109 Phases:

Phase 1 (Config transfer): TR108 optimal configs tested on agent workflows. Transfer failure confirmed.
Phase 2 (Parameter sweep): Systematic sweep identifies agent-specific optima.
Phase 3 (Quality tradeoffs): Temperature impact on output quality evaluated.

TR110 Phases:

Phase 1 (Baseline vs Chimera): Mixed agent pair testing.
Phase 2 (Homogeneous): Identical Chimera agents achieve 99.25% efficiency.
Phase 3 (Heterogeneous): Different-config agents show lower but acceptable efficiency.

AC: Detailed Model Comparison

Metric	Gemma 3 (4.3B)	Llama 3.1 8B (Q4_0)	Qwen 2.5 7B
Parameters	4.3B	8B	7B
Quantization	Q4_K_M	Q4_0	Q4_K_M
Single-agent throughput	102.85 tok/s	76.59 tok/s	~85 tok/s (est.)
VRAM footprint	3.3 GB	4.5 GB	4.2 GB
Rust multi-agent eff.	97.3% (b-v-c), 99.2% (homo)	96.5% (b-v-c), 98.5% (homo)	90.0% (b-v-c)
Python multi-agent eff.	80.2%	83.8%	77.6%
Rust advantage	+17.1 pp	+12.7 pp	+12.4 pp
Recommended use	Throughput-critical swarms	Reasoning-heavy agents	Specialized tasks only

Observations: Gemma 3's smaller size and efficient architecture make it the best all-around choice. Llama 3.1's larger context capacity and reasoning strength justify its lower throughput for complex tasks. Qwen 2.5 underperforms in multi-agent scenarios despite competitive parameter count, likely due to heavier KV cache requirements and different attention patterns.

AD: Extended Methodological Rationale

Why 5 runs per configuration? Statistical power analysis indicates that 5 runs provide sufficient power to detect a 2 percentage-point efficiency difference with 95% confidence, given the observed variance (1-5pp sigma). Three runs are the minimum acceptable; five provide robustness against outliers.

Why dual Ollama instead of batched inference? Ollama does not natively support batched inference (multiple prompts in a single request). Dual instances are the only mechanism for true concurrent inference on Ollama-served models. Alternative inference servers (vLLM, TGI) were out of scope for this consumer-hardware study.

Why Q4_K_M quantization? Q4_K_M provides the best balance of model size, inference speed, and output quality for the target workload (gaming dialogue). Higher-precision quantizations (Q5, Q8) are larger than the quality requirements justify; lower-precision quantizations (Q2, Q3) produce noticeable quality degradation.

AE: Future Directions Beyond TR116

3+ Agent Scaling (TR117+): Characterize performance beyond 2 concurrent agents. Expected challenges: VRAM pressure, increased coordination overhead, potential need for 3+ Ollama instances.
Linux Validation: Reproduce key findings (TR112, TR114, TR116) on Linux to confirm OS independence.
Cloud GPU Validation: Test on datacenter GPUs (A100, H100) to determine if consumer-GPU findings generalize.
Quality Measurement Integration: Add automated quality metrics (BLEU, coherence scores, structured output compliance) to the benchmark suite.
Long-Context Evaluation: Extend benchmarks to 4096+ token contexts for document processing workloads.
Model Fine-Tuning Impact: Evaluate whether fine-tuned models exhibit different scaling characteristics than base/instruct models.
vLLM/TGI Comparison: Compare Ollama serving against vLLM and Text Generation Inference for production scenarios.

AF: Annotated Literature Notes

Tokio documentation (tokio.rs): Work-stealing scheduler design documented in Tokio's architecture guide. Key insight: work-stealing minimizes tail latency but adds per-steal overhead of approximately 100ns, negligible for LLM inference timescales.
Ollama GitHub (github.com/ollama/ollama): Server architecture confirms single-request serialization at the model level. Concurrent requests to the same model are queued, not parallelized.
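GGUF specification (ggml.ai): Format specification for quantized model weights. Q4_K_M is a k-quant mix built on the 4-bit Q4_K type, which stores weights in super-blocks with per-block scales and minimums at roughly 4.5 bits per weight, and keeps selected tensors at higher precision.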
Python asyncio documentation (docs.python.org): Single-threaded event loop confirmed as fundamental design constraint. No plans for multi-threaded event loop in CPython roadmap.

AG: Extended Glossary

Term	Definition
Chimera	The optimization framework used in Banterhearts for LLM agent configuration
TTFT	Time to First Token: latency from request submission to first generated token
Parallel efficiency	(speedup / N_agents) x 100%; measures how well concurrent execution utilizes available parallelism
Contention	Resource conflict when multiple agents compete for shared GPU/Ollama resources
num_gpu	Ollama parameter controlling how many model layers are offloaded to GPU
num_ctx	Ollama parameter controlling the context window size (in tokens)
Q4_K_M	Quantization format: 4-bit k-quant (block-wise scales within super-blocks), medium mix preset
Work-stealing	Scheduler design where idle threads steal tasks from busy threads' queues
Dual Ollama	Architecture deploying two independent Ollama instances on different ports
KV cache	Key-Value cache storing attention computations for previously processed tokens
Chimera-homo	Homogeneous configuration: both agents use identical optimized parameters
Chimera-hetero	Heterogeneous configuration: agents use different parameter settings
Baseline-vs-chimera	Scenario comparing default config agent against optimized (Chimera) agent
CV	Coefficient of Variation: standard deviation / mean, expressed as percentage
pp	Percentage points: absolute difference between two percentages
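The parallel efficiency definition in the glossary can be written as a small helper. This is a minimal sketch that assumes speedup is measured as total sequential wall time divided by concurrent wall time; the function name and the timings in the usage line are illustrative, not measured values.

```python
def parallel_efficiency(sequential_s: float, concurrent_s: float, n_agents: int) -> float:
    """Return parallel efficiency in percent: (speedup / n_agents) * 100.

    Assumption: speedup = sequential wall time / concurrent wall time.
    """
    if concurrent_s <= 0 or n_agents < 1:
        raise ValueError("concurrent_s must be positive and n_agents >= 1")
    speedup = sequential_s / concurrent_s
    return speedup / n_agents * 100.0


# Hypothetical timings: two agents need 20 s back to back, 10.5 s concurrently.
efficiency = parallel_efficiency(20.0, 10.5, 2)
print(f"{efficiency:.1f}%")
```

An efficiency of 100% means the concurrent run finished in exactly 1/N of the sequential time; values below that quantify contention.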

AH: Artifact Inventory

TR	Artifact Type	Path Pattern
TR108	Raw benchmark data	`research/tr108/data/gemma3/`, `research/tr108/data/llama3/`
TR108	Published report	`PublishReady/reports/Technical_Report_108.md`
TR108	Visualization exports	`PublishReady/notebooks/exports/TR108_Comprehensive/`
TR109	Published report	`PublishReady/reports/Technical_Report_109.md`
TR109	Visualization exports	`PublishReady/notebooks/exports/TR109_Agent_Workflow/`
TR110	Published report	`PublishReady/reports/Technical_Report_110.md`
TR110	Visualization exports	`PublishReady/notebooks/exports/TR110_MultiAgent_Concurrent/`
TR111	Published report (v2)	`PublishReady/reports/Technical_Report_111_v2.md`
TR111	Artifacts	`research/tr111/artifacts/`
TR112	Published report (v2)	`PublishReady/reports/Technical_Report_112_v2.md`
TR112	Artifacts	`research/tr112/artifacts/`
TR113	Published report	`PublishReady/reports/Technical_Report_113.md`
TR114	Published report (v2)	`PublishReady/reports/Technical_Report_114_v2.md`
TR114	Artifacts	`research/tr114/artifacts/`
TR115	Published report (v2)	`PublishReady/reports/Technical_Report_115_v2.md`
TR115	Results	`research/tr115/runtime_optimization/results_v2/`
TR116	Published report	`PublishReady/reports/Technical_Report_116.md`

AI: Artifact-to-Claim Examples

Claim: "Rust delivers +15.2% throughput over Python."

Artifact: PublishReady/reports/Technical_Report_112_v2.md, Section 4 (Throughput Analysis)
Raw data: research/tr111/artifacts/ (Rust runs), research/tr112/artifacts/ (Python comparison)
Calculation: (114.54 - 99.34) / 99.34 x 100 = 15.30%

Claim: "Dual Ollama improves efficiency from 82.2% to 99.396%."

Artifact: PublishReady/reports/Technical_Report_113.md (82.2% figure), PublishReady/reports/Technical_Report_114_v2.md (99.396% figure)
Raw data: research/tr114/artifacts/ (dual Ollama results)
Calculation: 99.396 - 82.2 = 17.196 percentage points improvement

Claim: "Tokio-default achieves 98.72% mean with 1.21pp sigma."

Artifact: PublishReady/reports/Technical_Report_115_v2.md, Section 3 (Comprehensive Results)
Raw data: research/tr115/runtime_optimization/results_v2/
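The claims above mix two kinds of comparison: a relative gain (throughput) and a percentage-point difference (efficiency). The sketch below recomputes both from the figures quoted in this section; the helper names are hypothetical and the inputs are the rounded values shown above, so the results carry that rounding.

```python
def relative_gain_percent(new: float, baseline: float) -> float:
    """Relative change of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


def pp_difference(new_pct: float, old_pct: float) -> float:
    """Absolute difference between two percentages, in percentage points."""
    return new_pct - old_pct


# Figures quoted in the claims above (tok/s and efficiency percentages).
throughput_gain = relative_gain_percent(114.54, 99.34)
efficiency_gain_pp = pp_difference(99.396, 82.2)

print(f"Throughput gain: {throughput_gain:.2f}%")
print(f"Efficiency gain: {efficiency_gain_pp:.3f} pp")
```

Keeping the two helpers separate avoids reporting a percentage-point difference as a relative percentage, or the reverse.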

AJ: Reproducibility Notes

Hardware Sensitivity: All TR108-TR116 results were obtained on a single hardware configuration (RTX 4080 Laptop, i9-13980HX, 32 GB DDR5-4800). Reproducing these results on different hardware may yield different absolute values but should preserve relative rankings and efficiency percentages. GPU thermal state, background processes, and driver version can introduce run-to-run variance of 1-3%.

Software Versions: Results depend on specific software versions. Ollama's internal scheduling and model loading behavior may change between versions. Tokio's work-stealing scheduler has been stable across minor versions but could change in major releases.

Warmup Protocol: Consistent warmup (3 requests per Ollama instance before measurement) is critical for reproducibility. Cold-start measurements include model loading time that can vary by 2-10x depending on model size and disk speed.
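A warmup pass along these lines could be scripted as follows. This is a sketch under stated assumptions: it uses Ollama's `/api/generate` endpoint with streaming disabled, the prompt text and the function names `post_generate` and `warm_up` are illustrative, and the model tag is whatever the caller passes in. The `send` parameter is injectable so the loop can be exercised without a running server.

```python
import json
import urllib.request
from typing import Callable, List


def post_generate(base_url: str, model: str, prompt: str) -> dict:
    """POST one non-streaming request to Ollama's /api/generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)


def warm_up(
    base_urls: List[str],
    model: str,
    requests_per_instance: int = 3,
    send: Callable[[str, str, str], object] = post_generate,
) -> int:
    """Send the warmup requests to every instance; return how many were sent."""
    sent = 0
    for url in base_urls:
        for _ in range(requests_per_instance):
            send(url, model, "Hello")  # short illustrative prompt
            sent += 1
    return sent
```

Measurement should start only after `warm_up` returns for all instances, so that model loading time is excluded from the timed runs.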

Statistical Protocol: Minimum 3 runs per configuration (5 preferred). Report mean, standard deviation, and CV. Discard clear outliers only if a hardware explanation is documented (e.g., thermal throttling event confirmed by GPU temperature log).
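The reporting part of this protocol can be sketched with the standard library. The choice of the sample standard deviation (n - 1 denominator) is an assumption, since the source does not say which estimator was used, and the run values in the usage lines are hypothetical.

```python
import statistics
from typing import Dict, List


def summarize(runs: List[float], min_runs: int = 3) -> Dict[str, float]:
    """Return n, mean, sample standard deviation, and CV (percent) for a configuration."""
    if len(runs) < min_runs:
        raise ValueError(f"need at least {min_runs} runs, got {len(runs)}")
    mean = statistics.mean(runs)
    stdev = statistics.stdev(runs)  # sample standard deviation (assumption)
    return {
        "n": len(runs),
        "mean": mean,
        "stdev": stdev,
        "cv_percent": stdev / mean * 100.0,
    }


# Hypothetical efficiency values (percent) from five runs of one configuration.
summary = summarize([98.1, 99.0, 97.5, 98.8, 98.6])
print(summary)
```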

AK: Scenario-Specific Playbooks

Gaming Dialogue Agent Deployment:

Model: Gemma 3 Q4_K_M. Runtime: Rust/Tokio-default. Ollama: single instance.
Configuration: num_gpu=max, num_ctx=512, temp=0.8.
Target: throughput > 100 tok/s, TTFT < 200 ms.
Monitoring: track TTFT p99 for user experience; alert if > 300 ms.

Dual-Agent Research Pipeline:

Model: Gemma 3 Q4_K_M. Runtime: Rust/Tokio-default. Ollama: dual instances (11434/11435).
Configuration: num_gpu=80, num_ctx=2048, temp=0.6.
Target: efficiency > 95%, contention < 2%.
Monitoring: track parallel efficiency per batch; alert if < 90%.
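The two monitoring rules above (TTFT p99 above 300 ms, parallel efficiency below 90%) can be expressed as a simple check. This is a sketch: the nearest-rank percentile method and the function names are assumptions, and the thresholds default to the values stated in the playbooks.

```python
from typing import List, Optional


def p99(samples: List[float]) -> float:
    """Nearest-rank 99th percentile (assumed method; others differ slightly)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = (99 * len(ordered) + 99) // 100  # integer ceil(0.99 * n)
    return ordered[rank - 1]


def check_alerts(
    ttft_ms: List[float],
    efficiency_pct: Optional[float] = None,
    ttft_alert_ms: float = 300.0,
    efficiency_alert_pct: float = 90.0,
) -> List[str]:
    """Return a list of alert messages; an empty list means all clear."""
    alerts = []
    if ttft_ms and p99(ttft_ms) > ttft_alert_ms:
        alerts.append(f"TTFT p99 {p99(ttft_ms):.0f} ms exceeds {ttft_alert_ms:.0f} ms")
    if efficiency_pct is not None and efficiency_pct < efficiency_alert_pct:
        alerts.append(
            f"parallel efficiency {efficiency_pct:.1f}% below {efficiency_alert_pct:.0f}%"
        )
    return alerts
```

For the single-agent dialogue scenario only the TTFT samples are passed; for the dual-agent pipeline the per-batch efficiency is supplied as well.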

AL: Scenario Taxonomy

Scenario ID	Agents	Ollama	Runtime	Use Case
S1	1	Single	Rust	Production single-agent
S2	1	Single	Python	Development/prototyping
S3	2	Dual	Rust	Production multi-agent
S4	2	Dual	Python	Multi-agent prototyping
S5	2	Single	Rust	Not recommended (TR113)
S6	3+	N/A	Rust	Future work (TR117+)

AM: Decision Heuristics

Heuristic 1: Language Selection

If multi-agent efficiency > 90% is required: Rust.
If development velocity matters more than performance: Python.
If both matter: Rust with Python orchestrator (hybrid).

Heuristic 2: Ollama Architecture

If agents = 1: single Ollama.
If agents = 2: dual Ollama (one per agent).
If agents >= 3: one Ollama per agent (requires VRAM budget validation).

Heuristic 3: Model Selection

If throughput-critical: Gemma 3.
If reasoning-critical: Llama 3.1.
If neither is dominant: Gemma 3 (default).
Avoid Qwen 2.5 for multi-agent workloads (90% efficiency ceiling).

Heuristic 4: Configuration Transfer

Never assume single-inference optimal configs transfer to agent workflows.
Never assume agent-optimal configs transfer to multi-agent.
Always benchmark the specific workload type.

AN: Policy Decision Trees

Tree 1: Should We Use Rust?

Is multi-agent efficiency critical?
  Yes --> Is the team capable of Rust development?
    Yes --> Use Rust (Decision 1)
    No --> Hire/train, then use Rust
  No --> Is a throughput improvement of roughly 15% valuable?
    Yes --> Use Rust
    No --> Stay with Python

Tree 2: How Many Ollama Instances?

How many concurrent agents?
  1 --> Single Ollama
  2 --> Dual Ollama (Decision 2)
  3+ --> One per agent (validate VRAM first)
    VRAM sufficient? --> Deploy
    VRAM insufficient? --> Reduce GPU layers or use smaller model
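Tree 2 can be sketched as a function that also performs the VRAM validation step. The budget model is an assumption: each instance is taken to load its own copy of the weights, and a flat `headroom_gb` stands in for KV cache and runtime overhead, which the source does not quantify. The function name is hypothetical, and the check is applied to every agent count rather than only to 3+.

```python
from typing import Dict, Union


def plan_instances(
    n_agents: int,
    model_vram_gb: float,
    gpu_vram_gb: float,
    headroom_gb: float = 1.0,  # assumed allowance for KV cache and runtime
) -> Dict[str, Union[str, float, bool]]:
    """Pick an Ollama layout for n_agents and check it against the VRAM budget."""
    if n_agents < 1:
        raise ValueError("n_agents must be >= 1")
    if n_agents == 1:
        layout = "single"
    elif n_agents == 2:
        layout = "dual"
    else:
        layout = "one-per-agent"
    required_gb = n_agents * model_vram_gb + headroom_gb
    fits = required_gb <= gpu_vram_gb
    action = "deploy" if fits else "reduce GPU layers or use a smaller model"
    return {"layout": layout, "required_gb": required_gb, "fits": fits, "action": action}


# Example with the 3.3 GB Gemma 3 footprint on a 12 GB GPU.
print(plan_instances(3, 3.3, 12.0))
```

A real budget check would measure per-instance memory at the intended `num_ctx`, since KV cache size grows with context length.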

Tree 3: Which Model?

Is throughput the primary metric?
  Yes --> Gemma 3 (Decision 4)
  No --> Is reasoning quality the primary metric?
    Yes --> Llama 3.1
    No --> Is multi-agent efficiency critical?
      Yes --> Gemma 3 (99.2% efficiency)
      No --> Either Gemma 3 or Llama 3.1
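Tree 3 maps directly onto a short function. This is a sketch with hypothetical names; the metric labels `"throughput"` and `"reasoning"` are illustrative strings, not identifiers from the benchmark suite.

```python
def select_model(primary_metric: str, multi_agent_critical: bool = False) -> str:
    """Walk Tree 3 and return the suggested model family."""
    if primary_metric == "throughput":
        return "Gemma 3"
    if primary_metric == "reasoning":
        return "Llama 3.1"
    if multi_agent_critical:
        return "Gemma 3"
    return "Gemma 3 or Llama 3.1"


print(select_model("throughput"))
print(select_model("other", multi_agent_critical=True))
```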

AO: Extended Systems Glossary

Term	Definition	Context
Ollama	Open-source LLM inference server supporting GGUF model format	All TRs
Tokio	Rust async runtime with work-stealing scheduler	TR111-TR116
asyncio	Python standard library for asynchronous I/O	TR108-TR110, TR116
reqwest	Rust HTTP client library built on Tokio and hyper	TR111-TR116
httpx	Python async HTTP client library	TR108-TR110
CUDA	NVIDIA parallel computing platform for GPU acceleration	All TRs
GGUF	GPT-Generated Unified Format for quantized model weights	All TRs
RTX 4080	NVIDIA laptop GPU (RTX 4080 Laptop) with 12 GB GDDR6, 7424 CUDA cores	All TRs
i9-13980HX	Intel 24-core hybrid CPU (8P + 16E cores)	All TRs
DDR5-4800	System memory standard, 4800 MT/s transfer rate	All TRs
smol	Lightweight Rust async runtime	TR115
async-std	Rust async runtime mirroring std library API	TR115
hyper	Low-level HTTP library for Rust (used by reqwest)	TR111-TR116
GIL	Global Interpreter Lock in CPython; prevents true parallel bytecode execution	TR116 discussion
VRAM	Video RAM; GPU-accessible memory for model weights and KV cache	All TRs
Chimera	Banterhearts optimization framework for LLM agent configuration tuning	All TRs
Work-stealing	Scheduler pattern where idle threads take tasks from other threads' queues	TR115
Serialization (Ollama)	Behavior where concurrent requests to a single Ollama instance are processed sequentially	TR113, TR114
Thermal throttling	GPU frequency reduction when temperature exceeds safe limits	Performance monitoring
Binary deployment	Distributing Rust application as a single compiled executable	TR112

End of Extended Appendices. Total appendices: F, H, J, K, L, N, O, P, Q, S, T, U, V, W, X, Y, Z, AA-AO. This document is supplemental to the main Conclusive Report 108-116.