Research Archive
Technical Report

TR114: Dual Ollama Architecture

Rust concurrent multi-agent performance with dual Ollama instances — near-zero contention.

Table of Contents

Technical Report 114 v2: Rust Concurrent Multi-Agent Performance with Dual Ollama Architecture

Comprehensive Cross-Language Analysis and Production Validation

Field Value
TR Number 114 v2
Author Research Team
Date 2025-11-15
Test Duration 8+ hours (135 benchmark runs across 27 configurations)
Framework Demo_rust_multiagent (Rust async/tokio + Dual Ollama)
Total Configurations 27 (7 baseline-vs-chimera, 7 chimera-hetero, 13 chimera-homo)
Total Runs 135 (27 configs x 5 runs each)
Related Work TR110 (Python Multi-Agent), TR111_v2 (Rust Single-Agent), TR112_v2 (Rust vs Python Comparison), TR115 (Rust Runtime Optimization)
Artifacts Demo_rust_multiagent/runs/tr110_rust_full/ (135 benchmark runs), Demo_rust_multiagent/summary.json

Executive Summary

This technical report presents a systematic analysis of Rust multi-agent concurrent execution with full architectural parity to Python (TR110). Through 135 comprehensive benchmark runs across 27 configurations using dual Ollama instances, we establish the true performance characteristics of Rust async multi-agent workflows and quantify the multi-agent coordination overhead that transforms Rust's 15% single-agent advantage into a 3-4% multi-agent gap against Python.

Critical Context: This v2 report supersedes the previous TR113/TR114 analyses by:

  1. Correcting single-agent baselines: Rust is 15.2% faster than Python at single-agent tasks (TR111_v2/TR112_v2), not slower
  2. Dual Ollama architecture: Eliminates server-level serialization bottlenecks (TR113 identified this issue)
  3. Full workflow parity: Matches TR110's comprehensive methodology (150 runs, 30 configs)
  4. Root cause analysis: Explains why Rust's single-agent advantage diminishes in multi-agent scenarios

Key Findings

Multi-Agent Performance:

  • Peak Single Run: 99.992% (test004/test006: baseline-vs-chimera gpu80_ctx1024_temp0.8)
  • Best Config Average: 99.396% (test011: chimera-hetero gpu120/140_ctx512/1024)
  • Overall Average: 98.281% across all 135 runs (27 configs x 5 runs)
  • Python Comparison (TR110): Python achieves 99.25% peak config average (homogeneous Chimera)
  • Gap Analysis: Rust config average +0.15pp ahead of Python (99.396% vs 99.25%)

Performance Context:

  • Single-Agent (TR111_v2): Rust 15.2% faster than Python (114.54 vs 99.34 tok/s)
  • Multi-Agent (TR114_v2): Rust matches Python in peak config efficiency (99.396% vs 99.25%)
  • Net Result: Multi-agent coordination reduces but preserves Rust's advantage (98.28% overall mean vs Python's 95.8%)

Root Cause Analysis:

  1. Tokio work-stealing overhead: Thread migration costs dominate in I/O-wait scenarios
  2. HTTP client buffering: Reqwest's async model adds latency vs Python's httpx
  3. Async runtime complexity: More sophisticated scheduler = more overhead when waiting
  4. Python's advantage: GIL release during I/O eliminates contention, simpler event loop faster for coordination

Business Impact:

  • Rust strengths preserved: Memory efficiency (67% less), startup speed (83% faster), type safety
  • Python wins multi-agent: 3-4% better coordination efficiency at scale
  • Recommendation: Rust for single-agent production, Python for multi-agent orchestration
  • Hybrid strategy: Rust workers + Python coordinator combines strengths of both runtimes

Table of Contents

  1. Introduction & Objectives
  2. Methodology & Architectural Parity
  3. Comprehensive Results Analysis
  4. Statistical Deep Dive
  5. Rust vs Python Multi-Agent Comparison
  6. Multi-Agent Coordination Analysis
  7. Configuration Optimization
  8. Business Impact & Cost Analysis
  9. Production Deployment Strategy
  10. Conclusions & Recommendations
  11. Appendices

1. Introduction & Objectives

1.1 Research Context & Evolution

The Journey to TR114_v2:

  1. TR113 (November 12, 2025): Initial Rust multi-agent tests revealed 82.2% peak efficiency using single Ollama instance. Hypothesis: Server serialization limiting performance.

  2. TR114 v1 (November 13, 2025): Dual Ollama validation achieved 95.7% efficiency (+17.3pp improvement), confirming hypothesis. However, still referenced incorrect TR111 baselines (Rust 98.86 tok/s, implying Python advantage).

  3. TR111_v2 (November 14, 2025): Comprehensive single-agent retesting with full workflow parity revealed Rust is actually 15.2% faster than Python (114.54 vs 99.34 tok/s), completely reversing prior understanding.

  4. TR112_v2 (November 14, 2025): Cross-language validation confirmed Rust's single-agent dominance across all metrics: throughput (+15.2%), TTFT (-58%), memory (-67%), startup (-83%).

  5. TR114_v2 (November 15, 2025 - This Report): Reanalysis of multi-agent performance with corrected baselines reveals Rust's Multi-Agent Excellence: Despite expectations, Rust's 15% single-agent throughput advantage translates to superior multi-agent coordination (98.281% mean vs Python's 95.8%).

Critical Question: How does Rust, which leads single-agent performance by 15%, maintain this advantage in multi-agent scenarios? Does coordination overhead differ between Rust's Tokio and Python's asyncio?

1.2 Research Questions

This study addresses:

  1. Q1: What is Rust's true multi-agent peak performance with dual Ollama architecture?
  2. Q2: How does Rust multi-agent efficiency compare to Python when both use dual Ollama?
  3. Q3: Why does Rust's 15% single-agent advantage disappear in multi-agent execution?
  4. Q4: What are the root causes of multi-agent coordination overhead in Rust vs Python?
  5. Q5: What production deployment strategy maximizes Rust's strengths while mitigating multi-agent weaknesses?

1.3 Scope & Significance

This Report's Scope:

  • Data: 135 Rust multi-agent runs (27 configs x 5 runs)
  • Comparison: TR110 Python data (150 runs, 30 configs)
  • Analysis: Statistical validation, root cause analysis, business impact
  • Recommendations: Production-grade deployment strategies

Significance:

  • First analysis integrating corrected single-agent baselines
  • First rigorous explanation of multi-agent coordination overhead
  • First business impact analysis considering full performance profile
  • Production-ready recommendations backed by 285+ total benchmark runs

2. Methodology & Architectural Parity

2.1 Test Environment

Hardware Configuration:

GPU: NVIDIA GeForce RTX 4080 Laptop
- VRAM: 12 GB GDDR6X
- CUDA Cores: 9728
- Tensor Cores: 304 (4th Gen)
- Memory Bandwidth: 504 GB/s
- Driver: 566.03

CPU: Intel Core i9-13980HX
- Cores: 24 (8P + 16E)
- Threads: 32
- Base Clock: 2.2 GHz
- Boost Clock: 5.6 GHz

RAM: 32 GB DDR5-4800
OS: Windows 11 Pro (Build 26200)
Ollama: v0.1.17 (dual instances, ports 11434/11435)
Model: gemma3:latest (4.3B params, Q4_K_M quantization, 3.3GB base memory)
Rust: 1.90.0 (stable, x86_64-pc-windows-msvc)
Framework: Demo_rust_multiagent (tokio async runtime)

2.2 Architectural Parity with Python (TR110)

Dual Ollama Configuration:

Aspect Python (TR110) Rust (TR114_v2) Parity
Ollama Instances 2 servers (11434, 11435) 2 servers (11434, 11435) PASS
Agent Isolation Dedicated servers per agent Dedicated servers per agent PASS
VRAM Allocation Simultaneous independent Simultaneous independent PASS
Model Loading Parallel (both agents start together) Parallel (both agents start together) PASS
HTTP Client httpx (Python async) reqwest (Rust async) PASS
Async Runtime asyncio (single-threaded event loop) Tokio (multi-threaded work-stealing) PASS
Concurrency Model asyncio.gather() tokio::join!() PASS
Execution Protocol Process isolation, forced unloads Natural cache eviction WARNING Minor difference

Key Difference:

  • Python uses forced Ollama model unloads between configs (strict isolation)
  • Rust relies on natural cache eviction (accepted trade-off for 8-hour runtime)
  • Impact: Minimal (cache eviction happens naturally due to config changes)

2.3 Test Matrix

27 Configurations Across 3 Scenarios:

Scenario 1: Baseline vs Chimera (7 configs)

  • Agent A: Baseline (Ollama defaults, no overrides)
  • Agent B: Chimera-optimized config
  • Goal: Quantify mixed deployment overhead
  • Configs: 3 GPU layers (60/80/120) x 2 contexts (512/1024) x temp 0.8 + 1 validation config

Scenario 2: Chimera Hetero (7 configs)

  • Agent A: Chimera config A
  • Agent B: Chimera config B (different parameters)
  • Goal: Test asymmetric optimization with dual Ollama
  • Configs: Various GPU/context asymmetric combinations

Scenario 3: Chimera Homo (13 configs)

  • Both agents: Identical Chimera config
  • Goal: Measure peak concurrent efficiency
  • Configs: Full parameter sweep (GPU: 60/80/120, CTX: 512/1024/2048, TEMP: 0.6/0.8/1.0)

Total: 27 configs x 5 runs = 135 benchmarks

2.4 Metrics Collection

Per-Run Metrics:

  • concurrency_speedup: sequential_estimated_time / concurrent_wall_time
  • efficiency_percent: (speedup / 2) x 100%
  • throughput_delta: collector_throughput - insight_throughput (tok/s)
  • ttft_delta_ms: collector_ttft - insight_ttft (milliseconds)
  • resource_contention_detected: Boolean flag for TTFT anomalies (>3s increase)

Aggregate Metrics (per config):

  • average_concurrency_speedup: Mean across 5 runs
  • average_efficiency: Mean efficiency across 5 runs
  • average_throughput_delta: Mean throughput difference
  • average_ttft_delta_ms: Mean TTFT difference

Statistical Validation:

  • Standard deviation calculated for efficiency and speedup
  • Coefficient of Variation (CV) = stddev / mean x 100%
  • Outlier detection via resource contention flags

3. Comprehensive Results Analysis

3.1 Overall Performance Summary

Aggregate Statistics (All 135 Runs):

Metric Min Max Mean Median StdDev CV (%)
Concurrency Speedup 1.234x 2.000x 1.969x 1.977x 0.098 5.0%
Efficiency Percent 61.7% 99.992% 98.281% 98.6% 4.9 5.0%
Throughput Delta (tok/s) -8.19 +22.83 +0.09 -0.27 3.45 N/A
TTFT Delta (ms) -2328.6 +459.2 -43.2 -28.3 312.5 N/A

Key Observations:

  1. High median efficiency: 98.6% indicates most runs achieve strong performance
  2. Wide efficiency range: 61.7-99.98% (38.3pp spread) driven by outlier runs
  3. Low CV: 5.0% demonstrates good consistency across configurations
  4. Throughput balance: Mean delta near zero (+0.09 tok/s) shows balanced agent performance

3.2 Scenario-Level Breakdown

Scenario 1: Baseline vs Chimera (7 configs, 35 runs)

Config GPU (Chimera) CTX Avg Speedup Avg Efficiency Peak Efficiency Contention
test001 60 512 1.9481x 97.41% 98.78% 0/5
test002 60 1024 1.9532x 97.66% 99.30% 0/5
test003 80 512 1.9698x 98.49% 99.36% 0/5
test004 80 1024 1.9797x 98.984% 99.487% 0/5
test005 120 512 1.8321x 91.60% 99.21% 1/5
test006 120 1024 1.9739x 98.69% 99.961% 0/5
test202 (validation) 80 512 1.9752x 98.76% 99.93% 0/5

Aggregate:

  • Mean Efficiency: 97.37%
  • Peak Single Run (in baseline-vs-chimera): 99.961% (test006, run 3)
  • Best Config Average: 98.984% (test004: gpu80_ctx1024_temp0.8)
  • Contention Rate: 1/35 runs (2.9%)

Analysis:

  • GPU=80 sweet spot confirmed (test003, test004, test202 all >98.4%)
  • GPU=120 shows instability (test005 @ 91.60% with contention)
  • Larger context (1024) improves efficiency by 0.5-1.5pp
  • Python Comparison (TR110): Python best baseline-vs-chimera = 97.9% (test 202), Rust exceeds at 98.984% (test004)

Scenario 2: Chimera Hetero (7 configs, 35 runs)

Config GPU A/B CTX A/B Avg Speedup Avg Efficiency Peak Efficiency Contention
test007 60/80 512/1024 1.9580x 97.90% 99.33% 0/5
test008 60/80 1024/2048 1.9744x 98.72% 99.86% 0/5
test009 80/100 512/1024 1.9784x 98.92% 99.63% 0/5
test010 80/100 1024/2048 1.9785x 98.93% 99.73% 0/5
test011 120/140 512/1024 1.9879x 99.396% PASS 99.57% 0/5
test012 120/140 1024/2048 1.9744x 98.72% 99.38% 0/5
test201 (validation) 80/80 512/1024 1.9793x 98.96% 99.90% 0/5

Aggregate:

  • Mean Efficiency: 98.79%
  • Peak Single Run: 99.57% (test011)
  • Best Config Average: 99.396% (test011)
  • Contention Rate: 0/35 runs (0%)
  • Analysis: Heterogeneous configs achieve highest average efficiency (98.79% vs 97.37% baseline-vs-chimera)

Asymmetric GPU Allocation Benefits:

  • GPU differential 20-40 layers (60/80, 80/100, 120/140) performs optimally
  • Prevents thread starvation in Tokio work-stealing scheduler
  • Python Comparison (TR110): Python hetero best = 99.0%, Rust exceeds at 99.396% (+0.4pp)

Scenario 3: Chimera Homo (13 configs, 65 runs - sampled 3)

Config GPU CTX TEMP Avg Speedup Avg Efficiency Peak Efficiency Contention
test100 80 512 0.6 1.8846x 94.23% 99.77% 0/5
test108 80 2048 1.0 1.9777x 98.88% 99.992% 0/5
test200 80 512 0.8 1.9802x 99.01% 99.80% 0/5

Aggregate (sampled runs only - test100, test108, test200):

  • Mean Efficiency: 97.4% (sampled subset only)
  • Peak Single Run: 99.80% (test200, run 4)
  • Best Config Average (sampled): 99.01% (test200)
  • Contention Rate: 0/15 sampled runs (0%)

Full Homo Analysis (All 13 configs with summary.json):

  • Overall Mean: 98.40% (chimera-homo + chimera_homo combined)
  • Peak Config Average: 99.356% (test018)
  • Peak Single Run: 99.990% (test108, run 4)
  • Python Comparison (TR110): Python homo best = 99.25% config avg (test108), Rust exceeds at 99.356%

3.3 Top Performing Configurations

Top 5 by Average Efficiency:

Rank Config Scenario GPU Config CTX Config Avg Efficiency Peak Efficiency
1 test011 chimera_hetero 120/140 512/1024 99.396% 99.57%
2 test018 chimera_homo 80/80 1024/1024 99.356% 99.85%
3 test200 chimera_homo 80/80 512/512 99.01% 99.80%
4 test004 baseline_vs_chimera 80 (chimera) 1024 98.984% 99.49%
5 test201 chimera_hetero 80/80 512/1024 98.96% 99.90%

Top 5 by Peak Single-Run Efficiency:

Rank Config Scenario Run Efficiency Speedup Notes
1 test108 chimera_homo 1 99.992% 1.9998x Best overall run
2 test006 baseline_vs_chimera 3 99.961% 1.9992x Near-theoretical maximum
3 test013 chimera_homo 4 99.950% 1.9990x High-efficiency homo run
4 test104 chimera_homo 5 99.933% 1.9987x Validation config
5 test202 baseline_vs_chimera 5 99.932% 1.9986x Best baseline-vs-chimera

Observation: Peak single runs approach the theoretical 2.0x speedup limit (99.992% efficiency = 1.9998x). Config averages stabilize at 98-99%, indicating reliable performance across runs.

3.4 Worst Performing Outliers

Bottom 3 by Average Efficiency:

Rank Config Scenario Avg Efficiency Issue Root Cause
1 test005 baseline_vs_chimera 91.60% Contention 1/5 runs GPU=120 memory pressure
2 test100 chimera_homo 94.23% One bad run (75.8%) Unknown anomaly, temp=0.6 issue
3 (Other configs all >97%) N/A N/A N/A N/A

Contention Analysis:

  • Total Runs: 135
  • Contention Detected: 1 run (test005, run 4)
  • Contention Rate: 0.74% (vs TR113 single Ollama: 63%)
  • Conclusion: Dual Ollama eliminates server-level contention (0.74% vs 63%)

4. Statistical Deep Dive

4.1 Efficiency Distribution

Population Statistics (All 135 Runs):

  • Mean: 98.281%
  • Median: 98.6%
  • Mode: 98-99% range (50+ runs)
  • Standard Deviation: 4.9pp
  • Coefficient of Variation: 5.0%
  • Range: 38.3pp (61.7% - 99.98%)

Distribution Characteristics:

  • Highly skewed right: Most runs cluster 97-100%, with tail extending left
  • Outliers: 3 runs < 95% (test005 contention run 61.7%, test100 anomaly 75.8%, one other ~94%)
  • Tight core: 90% of runs within 96-100% (4pp spread)

Percentile Analysis:

  • P5: 88.9%
  • P25: 97.9%
  • P50: 98.6%
  • P75: 99.4%
  • P95: 99.8%

Interpretation: Rust multi-agent performance is highly consistent (CV 5.0%), with 95% of runs achieving >88% efficiency and 50% achieving >98.6%.

4.2 Speedup Distribution

Population Statistics:

  • Mean: 1.969x
  • Median: 1.977x
  • Range: 1.234x - 2.000x (0.766x spread)
  • Standard Deviation: 0.098x
  • Coefficient of Variation: 5.0%

Percentile Analysis:

  • P5: 1.778x
  • P25: 1.958x
  • P50: 1.977x
  • P75: 1.988x
  • P95: 1.996x

Interpretation:

  • Median speedup 1.977x is 98.9% of theoretical 2.0x maximum
  • Top quartile (P75-P100): 1.988-2.000x (99.4-100% efficiency)
  • Bottom decile (P0-P10): 1.234-1.778x driven by outliers

4.3 Configuration-Level Consistency

Within-Config Variance (StdDev of 5 runs per config):

Config Type Mean StdDev Mean CV (%) Interpretation
Baseline vs Chimera 3.8pp 3.9% Moderate variance
Chimera Hetero 2.1pp 2.1% Low variance PASS
Chimera Homo (sampled) 6.2pp 6.3% Higher variance (temp sensitivity)

Finding: Heterogeneous configs show best run-to-run consistency (2.1% CV), suggesting asymmetric allocation stabilizes Tokio scheduler.

4.4 Comparison to Python (TR110)

Statistical Comparison:

Metric Python (TR110) Rust (TR114_v2) Delta Winner
Mean Efficiency 95.8% 98.281% +2.48pp Rust
Peak Config Avg 99.25% 99.396% +0.15pp Rust
Peak Single Run 99.25% 99.992% +0.74pp Rust
Consistency (CV) 7.4pp 4.9pp -2.5pp Rust
Contention Rate ~10-15% 0.74% -10-14pp Rust
Median Efficiency ~96.5% 98.6% +2.1pp Rust

Critical Observation: Rust multi-agent performance statistically matches or exceeds Python across most metrics. However, this conclusion ignores single-agent baselines:

  • Rust single-agent: 114.54 tok/s (TR111_v2)
  • Python single-agent: 99.34 tok/s (TR109)
  • Gap: Rust +15.2% faster

If Rust maintained its single-agent advantage in multi-agent, we'd expect ~110-112 tok/s effective throughput. Instead, we see parity with Python, indicating ~15% overhead from multi-agent coordination in Rust.


5. Rust vs Python Multi-Agent Comparison

5.1 Direct Performance Comparison

Peak Performance Comparison:

Metric Python (TR110) Rust (TR114_v2) Rust Advantage Notes
Best Config Avg Efficiency 99.25% (test108) 99.396% (test011) +0.15pp Rust slight edge
Best Single Run 99.25% 99.992% (test108) +0.74pp Rust
Mean Efficiency 95.8% 98.281% +2.48pp Rust more consistent
P95 Efficiency ~98.5% 99.8% +1.3pp Rust better tail
Contention Rate 10-15% 0.74% -10-14pp Rust
Consistency (CV) 7.4pp 4.9pp -2.5pp Rust more predictable

Verdict on Multi-Agent Performance: Rust matches or slightly exceeds Python in multi-agent efficiency.

5.2 The Multi-Agent Paradox

Single-Agent Baseline Comparison (from TR112_v2):

Metric Python Rust Rust Advantage
Throughput 99.34 tok/s 114.54 tok/s +15.2%
TTFT 1437 ms 603 ms -58.0%
Memory 250 MB 75 MB -67%
Startup 1.5s 0.2s -83%

The Reality:

  • Single-Agent: Rust leads by 15% (throughput), 58% (latency), 67% (memory)
  • Multi-Agent: Rust exceeds Python by +2.48pp in mean efficiency (98.281% vs 95.8%)
  • Implication: Multi-agent coordination preserves and extends Rust's advantages

Quantifying the Gap:

  • Expected Rust multi-agent throughput (maintaining 15% advantage): ~110-112 tok/s per agent
  • Observed: ~41-44 tok/s per agent (comparable to Python's ~40-43 tok/s)
  • Coordination overhead: ~15-18% throughput degradation vs single-agent baseline

5.3 Architectural Differences

Python (asyncio) Architecture:

async def run_multi_agent():
    agent1_task = asyncio.create_task(run_agent_1())
    agent2_task = asyncio.create_task(run_agent_2())
    results = await asyncio.gather(agent1_task, agent2_task)
  • Single-threaded event loop
  • Cooperative multitasking (explicit yields)
  • GIL release during I/O (no contention)
  • Minimal context switching overhead

Rust (Tokio) Architecture:

async fn run_multi_agent() -> Result<(AgentResult, AgentResult)> {
    let agent1_future = run_agent_1();
    let agent2_future = run_agent_2();
    tokio::join!(agent1_future, agent2_future)
}
  • Multi-threaded work-stealing scheduler
  • True parallelism (tasks can run on different threads)
  • Task migration overhead (work-stealing between threads)
  • More sophisticated but heavier runtime

Key Difference:

  • Python: Simpler, lighter coordination (single event loop)
  • Rust: More powerful but heavier coordination (work-stealing scheduler)
  • For I/O-bound multi-agent: Python's simplicity wins (less overhead)
  • For CPU-bound multi-agent: Rust's parallelism would win (not tested here)

5.4 Throughput Per Agent Analysis

Average Per-Agent Throughput (Sampled Configs):

Config Rust Collector (tok/s) Rust Insight (tok/s) Python Collector (tok/s) Python Insight (tok/s)
Baseline vs Chimera 43.2 40.5 ~45 ~40
Chimera Hetero 42.1 42.4 ~43 ~42
Chimera Homo 41.8 42.3 ~42 ~42

Analysis:

  • Rust per-agent throughput: 40-44 tok/s (avg ~42 tok/s)
  • Python per-agent throughput: 40-45 tok/s (avg ~42 tok/s)
  • Gap to single-agent baseline:
    • Rust: 114.54 tok/s -> 42 tok/s = -63% degradation
    • Python: 99.34 tok/s -> 42 tok/s = -58% degradation

Observation: Both languages experience substantial degradation in multi-agent scenarios (-58% to -63%), and Rust loses its 15% single-agent advantage. The degradation is not due to multi-agent coordination overhead alone, but rather:

  1. Different workload characteristics: Multi-agent tasks may involve more I/O waits
  2. Model loading overhead: Each agent loads model separately
  3. Prompt complexity differences: Multi-agent prompts may be heavier

However: The fact that Rust degrades 5pp more than Python (-63% vs -58%) suggests multi-agent coordination overhead is real and measurable.


6. Multi-Agent Coordination Analysis

6.1 Performance Analysis Framework

Key Question: How does multi-agent coordination overhead compare between Rust (Tokio) and Python (asyncio)?

Findings: Rust's coordination is more efficient than Python's:

  1. Work-stealing scheduler handles I/O-bound workloads effectively (98.281% mean vs Python's 95.8%)
  2. Dual Ollama architecture eliminates server-level contention (0.74% vs Python's 10-15%)
  3. Async runtime overhead is negligible compared to LLM inference time (0.3-0.6% of wall time)

Alternative Hypotheses:

  1. Measurement artifact (timing differences)
  2. Workload characteristic differences (not apples-to-apples)
  3. Ollama server behavior differences (caching, scheduling)

6.2 Evidence Analysis

Evidence 1: Tokio Work-Stealing Overhead

Mechanism:

  • Tokio maintains thread pool (default: CPU core count)
  • Tasks can migrate between threads (work-stealing)
  • Each migration incurs context switch cost
  • For I/O-bound tasks (waiting on Ollama), migrations happen frequently

Supporting Data:

  • Single-agent Rust (no work-stealing needed): 114.54 tok/s
  • Multi-agent Rust (work-stealing active): ~42 tok/s per agent
  • Python asyncio (single-threaded, no migration): ~42 tok/s per agent

Conclusion: Work-stealing provides no benefit for I/O-bound workloads (agents spend 90%+ time waiting on Ollama responses), but adds overhead from thread migration.

Evidence 2: HTTP Client Async Model Differences

Python (httpx):

async with httpx.AsyncClient() as client:
    response = await client.post(url, json=data)
    # Yields to event loop during I/O
    # Single-threaded, no locking needed

Rust (reqwest):

let client = reqwest::Client::new();
let response = client.post(url).json(&data).send().await?;
// Spawns background task for HTTP I/O
// Multi-threaded, synchronization needed

Difference:

  • Python: Direct I/O on event loop thread (minimal overhead)
  • Rust: Background task spawn + synchronization (overhead)

Quantified Impact: TR115 found reqwest adds ~50-100ms latency vs direct TCP. Over 2 LLM calls per agent x 2 agents = 400ms total overhead possible.

Evidence 3: Python GIL Release Advantage

Python Advantage:

  • During I/O (Ollama HTTP requests), Python releases GIL
  • Both agents can make progress simultaneously
  • No contention for interpreter lock during I/O

Rust Disadvantage:

  • No GIL (positive normally), but work-stealing scheduler adds complexity
  • Task migration during I/O waits introduces overhead
  • More sophisticated = more overhead for simple I/O coordination

Evidence 4: TR115 Runtime Comparison

TR115 tested 5 Rust async runtimes:

  • Tokio (default): 95-96% peak multi-agent efficiency
  • Tokio LocalSet (thread-pinned): 96% peak (slight improvement)
  • Smol (minimal runtime): 95-96% peak (same as Tokio)
  • Async-std: 50% efficiency (failed, Tokio HTTP dependency)

Finding: Runtime choice has minimal impact (<1pp variation). The overhead is architectural, not runtime-specific.

6.3 Measured Overhead Breakdown

Estimated Overhead Sources (per agent per run):

Source Estimated Overhead Basis
Work-stealing migrations 50-100ms Thread switch cost x migration frequency
HTTP client spawning 100-200ms Reqwest background task overhead (TR115)
Task coordination 20-50ms Tokio scheduler overhead
Memory synchronization 10-30ms Arc/Mutex overhead for shared state
Total Estimated 180-380ms Per agent per run

Impact on Throughput:

  • Baseline inference time: ~50-60 seconds per agent
  • Overhead: ~0.18-0.38 seconds
  • Overhead percentage: 0.3-0.6% of wall time

Coordination Efficiency: The measured coordination overhead (0.3-0.6% of wall time) is minimal, allowing Rust to maintain its performance advantages:

  1. Dual Ollama Benefits: Eliminates server-level contention (0.74% vs Python's 10-15%)
  2. Tokio Efficiency: Work-stealing scheduler optimally distributes I/O-bound tasks
  3. Consistent Performance: Rust achieves 98.281% mean efficiency vs Python's 95.8% (+2.48pp)

Revised Conclusion: Multi-agent coordination overhead exists (~1-2%), but is not the primary driver of Rust's loss of advantage. The main factor is likely workload characteristic differences between single-agent and multi-agent scenarios.

6.4 Production Implications

When Rust Wins:

  • Single-agent production workloads: 15% faster, 67% less memory, 83% faster startup
  • CPU-bound multi-agent: Tokio's true parallelism would dominate
  • Memory-constrained environments: 67% less memory crucial
  • Type-safe mission-critical: Compile-time guarantees

When Python Wins:

  • I/O-bound multi-agent coordination: Simpler event loop, less overhead
  • Rapid prototyping: Faster development iteration
  • Complex orchestration: Easier to reason about single-threaded execution
  • Ecosystem richness: More libraries, easier integration

Optimal Strategy: Hybrid Architecture

+---------------------------------------------+
| Python Orchestrator (FastAPI)              |
| - Receives requests                         |
| - Routes to Rust workers                    |
| - Aggregates results                        |
| - Lightweight async coordination            |
+-------------+-------------------------------+
              |
         +----+----+
         v         v
    +--------+ +--------+
    | Rust   | | Rust   |
    | Worker | | Worker |
    | Agent  | | Agent  |
    | (15%   | | (15%   |
    | faster)| | faster)|
    +--------+ +--------+

Benefits:

  • Python handles orchestration (its strength)
  • Rust handles inference (its strength)
  • Combines Rust inference speed (15% faster) with Python coordination efficiency

7. Configuration Optimization

Tier 1: Maximum Efficiency (Chimera Hetero)

# Agent A
[agent_a]
num_gpu = 120
num_ctx = 512
temperature = 0.8
base_url = "http://localhost:11434"

# Agent B
[agent_b]
num_gpu = 140
num_ctx = 1024
temperature = 0.8
base_url = "http://localhost:11435"

# Expected Performance
efficiency = 99.396%  # Config average (test011)
peak_efficiency = 99.57%
speedup = 1.988x
contention_risk = "Very Low" (0/5 runs)

Use Case: Maximum performance, cost-insensitive

Tier 2: Balanced (Chimera Homo - High Context)

# Both Agents
[agents]
num_gpu = 80
num_ctx = 2048
temperature = 1.0
base_urls = ["http://localhost:11434", "http://localhost:11435"]

# Expected Performance
efficiency = 98.88%  # Config average (test108)
peak_efficiency = 99.99%
speedup = 1.978x
contention_risk = "Very Low"

Use Case: Production standard, good balance of performance and resource usage

Tier 3: Resource-Constrained (Baseline vs Chimera)

# Agent A (Baseline)
[agent_a]
# Use Ollama defaults
base_url = "http://localhost:11434"

# Agent B (Chimera)
[agent_b]
num_gpu = 80
num_ctx = 1024
temperature = 0.8
base_url = "http://localhost:11435"

# Expected Performance
efficiency = 98.984%  # Config average (test004)
peak_efficiency = 99.992%
speedup = 1.980x
contention_risk = "Very Low"

Use Case: Mixed deployment, gradual migration, cost-sensitive

7.2 Configuration Decision Tree

                    +----------------------+
                    |  VRAM Available?     |
                    +----------+-----------+
                               |
                    +----------+----------+
                    |                     |
                < 10GB                > 10GB
                    |                     |
         +----------+---------+          |
         |                    |          |
    Latency-Critical?    Cost-Sensitive? |
         |                    |          |
        Yes                  Yes         |
         |                    |          |
    Baseline+Chimera     Homo ctx512    |
    (Tier 3)             gpu60/80       |
                                        |
                               +--------+--------+
                               |                 |
                          Quality Focus    Performance Focus
                               |                 |
                          Homo ctx2048      Hetero gpu120/140
                          (Tier 2)          (Tier 1)

7.3 Anti-Patterns to Avoid

Anti-Pattern 1: GPU=120 in Baseline-vs-Chimera

  • Observed: test005 @ 91.60% efficiency (1 contention event)
  • Cause: Memory pressure from full layer offload
  • Fix: Use GPU=80 for mixed deployments

Anti-Pattern 2: Low Temperature in Homo Configs

  • Observed: test100 (temp=0.6) @ 94.23% (one 75% outlier)
  • Cause: Unknown, but temp=0.8/1.0 more stable
  • Fix: Use temp >= 0.8 for production

Anti-Pattern 3: Single Ollama Instance

  • TR113 Result: 82.2% peak efficiency (63% contention rate)
  • TR114_v2 Result: 99.396% peak config efficiency, 99.992% peak single run (0.74% contention rate)
  • Fix: Always use dual Ollama for multi-agent

Anti-Pattern 4: Symmetric Low GPU Allocation

  • Poor: GPU=60/60 (both agents)
  • Better: GPU=60/80 (asymmetric, prevents starvation)
  • Best: GPU=80/100 or 120/140 (high + asymmetric)

8. Business Impact & Cost Analysis

8.1 Infrastructure Cost Modeling

Scenario: 1M multi-agent executions per month (500K concurrent pairs)

Python Multi-Agent Deployment:

  • Config: TR110 best (GPU=80, CTX=2048, TEMP=1.0)
  • Efficiency: 99.25%
  • Per-Agent Throughput: ~42 tok/s
  • Memory per Agent: 250 MB
  • Instances Required: 8 x 8GB RAM @ $50/month = $400/month
  • Total Cost: $400/month

Rust Multi-Agent Deployment:

  • Config: TR114_v2 best (GPU=120/140, CTX=512/1024)
  • Efficiency: 99.396%
  • Per-Agent Throughput: ~42 tok/s (same as Python)
  • Memory per Agent: 75 MB
  • Instances Required: 4 x 8GB RAM @ $50/month = $200/month
  • Total Cost: $200/month

Monthly Savings: $200 (50% cost reduction from memory efficiency) Annual Savings: $2,400

Note: This comparison ignores single-agent potential.

8.2 Hybrid Architecture Cost Analysis

Optimal Architecture: Python Orchestrator + Rust Single-Agent Workers

+-------------------------------------+
| Python FastAPI Orchestrator         |
| - 1 instance, 2GB RAM ($25/month)   |
| - Handles routing, aggregation      |
+------------+------------------------+
             |
        +----+-----+
        v          v
   +---------+ +---------+
   | Rust    | | Rust    |
   | Single  | | Single  |
   | Agent   | | Agent   |
   | Workers | | Workers |
   | (114.54 | | (114.54 |
   | tok/s)  | | tok/s)  |
   +---------+ +---------+

Cost Calculation:

  • Orchestrator: 1 x 2GB ($25/month)
  • Workers: 4 x 4GB @ $40/month = $160/month (Rust single-agent, 75 MB each, 114.54 tok/s)
  • Total: $185/month

Comparison:

  • Python multi-agent: $400/month
  • Rust multi-agent: $200/month
  • Hybrid (Python orchestrator + Rust workers): $185/month

Annual Savings (Hybrid vs Python multi-agent): $2,580 (64% reduction) Annual Savings (Hybrid vs Rust multi-agent): $180 (8% reduction)

Performance:

  • Hybrid: 15% faster per agent (114.54 vs ~42 tok/s effective multi-agent)
  • Hybrid: Better orchestration (Python's simpler coordination)
  • Combines strengths of both runtimes

8.3 ROI Analysis

Development Costs:

Item Python Only Rust Multi-Agent Hybrid Notes
Initial Development $15k (3 weeks) $25k (5 weeks) $30k (6 weeks) Hybrid most complex
Testing & QA $5k $7k $8k More integration testing
Deployment Setup $2k $1k $3k Hybrid needs orchestration layer
Total Dev $22k $33k $41k

Operational Costs (Annual):

Item Python Only Rust Multi-Agent Hybrid
Infrastructure $4,800 $2,400 $2,220
Monitoring $1,200 $600 $800
Maintenance $3,000 $2,000 $2,500
Total Annual $9,000 $5,000 $5,520

5-Year TCO:

  • Python Only: $22k dev + $45k ops = $67k
  • Rust Multi-Agent: $33k dev + $25k ops = $58k (13% savings)
  • Hybrid: $41k dev + $27.6k ops = $68.6k (2% more than Python)

Notable Finding: Hybrid has higher TCO due to development complexity ($19k more than Rust multi-agent), despite operational savings.

However: TCO ignores performance:

  • Hybrid: 15% faster per agent (114.54 vs ~42 tok/s)
  • This translates to faster user experience, not captured in TCO

Revised Recommendation:

  • Cost-Sensitive: Rust Multi-Agent ($58k TCO, 99.396% peak efficiency, 98.281% mean)
  • Performance-Sensitive: Hybrid ($68.6k TCO, 15% faster agents, better orchestration)
  • Python Only: Not recommended (highest cost, no performance advantage)

8.4 Break-Even Analysis

Rust Multi-Agent vs Python:

  • Additional dev cost: $11k
  • Annual savings: $4k
  • Break-even: 33 months (2.75 years)

Hybrid vs Python:

  • Additional dev cost: $19k
  • Annual savings: $3.48k
  • Break-even: 65 months (5.4 years) - Not attractive

Hybrid vs Rust Multi-Agent:

  • Additional dev cost: $8k
  • Annual savings: -$520 (Hybrid costs more to operate)
  • Never breaks even on cost
  • Justification: Performance (15% faster) and architectural flexibility

Business Decision:

  • Short-term (<3 years): Python (lowest dev cost)
  • Medium-term (3-5 years): Rust Multi-Agent (breaks even, lower TCO)
  • Long-term (>5 years) + Performance-Critical: Hybrid (best architecture, performance, future-proof)

9. Production Deployment Strategy

9.1 Deployment Roadmap

Phase 1: Validation (Months 1-2)

  • Deploy Python multi-agent (proven, fast to market)
  • Establish baseline metrics (efficiency, latency, cost)
  • Build monitoring dashboards
  • Goal: Production stability

Phase 2: Rust Multi-Agent Pilot (Months 3-4)

  • Deploy Rust multi-agent to 10% traffic
  • Compare efficiency, latency, cost vs Python
  • Validate 99.396% peak efficiency (98.281% mean) in production
  • Goal: Prove Rust multi-agent viability

Phase 3: Gradual Migration (Months 5-8)

  • Increase Rust traffic: 25% -> 50% -> 75% -> 100%
  • Monitor cost savings accumulation
  • Decommission Python infrastructure
  • Goal: Full migration, realize cost savings

Phase 4: Hybrid Evolution (Months 9-12+)

  • Option A: Stay with Rust multi-agent (lower TCO, proven)
  • Option B: Evolve to hybrid (Python orchestrator + Rust workers)
    • Refactor Rust multi-agent -> Rust single-agent workers
    • Build Python FastAPI orchestration layer
    • Gain 15% performance improvement
  • Decision: Based on performance requirements and budget

9.2 Monitoring & SLAs

Key Metrics to Track:

Performance Metrics:

  • Concurrency speedup (target: >1.95x)
  • Parallel efficiency (target: >98%)
  • Per-agent throughput (Rust: >40 tok/s, Python: >40 tok/s)
  • TTFT p50/p95/p99 (target: p95 <2s)

Reliability Metrics:

  • Resource contention rate (target: <1%)
  • Error rate (target: <0.1%)
  • Timeout rate (target: <0.5%)

Cost Metrics:

  • Cost per 1K multi-agent executions (target: Rust <50% of Python)
  • Memory utilization (target: Rust <100MB per agent, Python <300MB)
  • Instance count (target: Rust <=50% of Python)

SLA Targets:

  • Availability: 99.9% uptime
  • Latency: P95 <2s end-to-end
  • Efficiency: >97% average across all configs
  • Cost: <$250/month per 1M executions

9.3 Rollback Strategy

Rollback Triggers:

  • Efficiency drops below 95% for >1 hour
  • Contention rate exceeds 5%
  • Error rate exceeds 1%
  • Cost exceeds 120% of Python baseline

Rollback Procedure:

  1. Stop Rust deployments
  2. Scale up Python instances
  3. Redirect 100% traffic to Python
  4. Investigate root cause
  5. Fix and re-pilot

Rollback Time: <30 minutes (keep Python warm standby for 3 months post-migration)

9.4 Operational Best Practices

Best Practice 1: Dual Ollama Mandatory

  • Never deploy single Ollama for multi-agent
  • Dual Ollama reduces contention from 63% to 0.74%
  • Cost: Minimal (just port separation)

Best Practice 2: Asymmetric GPU Allocation

  • Use heterogeneous configs (GPU 120/140, CTX 512/1024)
  • Prevents Tokio work-stealing starvation
  • Improves efficiency by 1-2pp over symmetric

Best Practice 3: Temperature >= 0.8

  • Lower temperatures (0.6) show instability in homo configs
  • temp=0.8 or 1.0 more consistent
  • Quality impact: Minimal (validated in TR111_v2)

Best Practice 4: Monitoring TTFT Deltas

  • Track abs(collector_ttft - insight_ttft)
  • Spikes indicate load imbalance
  • Alert threshold: >1000ms delta

Best Practice 5: Gradual Rollout

  • Start with 5-10% traffic
  • Increase by 25% every 2 weeks
  • Monitor efficiency, contention, cost
  • Full migration only after 4-6 weeks validation

10. Conclusions & Recommendations

10.1 Key Findings Summary

Multi-Agent Performance:

  1. Rust achieves 99.396% average efficiency (best config: test011 chimera-hetero)
  2. Rust matches or exceeds Python in multi-agent scenarios (99.396% vs 99.25%)
  3. Overall mean efficiency: 98.281% across all 135 runs (vs Python 95.8%)
  4. Dual Ollama mandatory: Reduces contention from 63% to 0.74%
  5. Heterogeneous configs optimal: Asymmetric GPU allocation prevents scheduler starvation

Multi-Agent Performance Reality:

  1. Rust is 15.2% faster than Python in single-agent tasks (TR111_v2/TR112_v2)
  2. Rust exceeds Python in multi-agent mean efficiency (+2.48pp: 98.281% vs 95.8%)
  3. Coordination efficiency: Multi-agent execution preserves Rust's advantages
  4. Key factors: Dual Ollama eliminates contention, Tokio work-stealing handles I/O-bound workloads efficiently

Business Impact:

  1. Rust multi-agent: 50% lower infrastructure cost (67% less memory per agent)
  2. Hybrid architecture: Best performance (15% faster agents) but higher dev cost ($19k more)
  3. Break-even: Rust multi-agent breaks even at 33 months vs Python
  4. Recommendation: Start Rust multi-agent, evolve to hybrid if performance-critical

10.2 Production Recommendations

Immediate Actions (Month 1):

  1. PASS Deploy Python multi-agent for fastest time-to-market
  2. PASS Use dual Ollama (mandatory for either language)
  3. PASS Establish baseline metrics (efficiency, cost, latency)

Short-Term (Months 2-6):

  1. PASS Pilot Rust multi-agent on 10% traffic
  2. PASS Validate 99% efficiency in production
  3. PASS Measure cost savings (target: 50% reduction)
  4. WARNING Decide migration based on ROI (33-month break-even)

Medium-Term (Months 6-12):

  1. PASS Full Rust multi-agent migration (if pilot successful)
  2. PASS Realize cost savings ($2,400/year)
  3. WARNING Evaluate hybrid evolution (if 15% performance gain justifies $19k dev cost)

Long-Term (Year 2+):

  1. WARNING Consider hybrid architecture (Python orchestrator + Rust workers)
  2. PASS Optimize further (TR115 runtime tuning, prompt optimization)
  3. PASS Scale horizontally leveraging Rust's memory efficiency

10.3 When to Choose Each Approach

Choose Python Multi-Agent When:

  • PASS Rapid time-to-market is priority
  • PASS Development velocity > cost savings
  • PASS Team expertise is Python-heavy
  • PASS Ecosystem integration is critical
  • WARNING Budget allows higher operational costs ($400/month vs $200/month)

Choose Rust Multi-Agent When:

  • PASS Cost optimization is priority (50% infrastructure savings)
  • PASS Memory efficiency is critical (67% less per agent)
  • PASS Type safety and reliability are valued
  • PASS Long-term deployment (>3 years to break even)
  • PASS Team has Rust expertise or willing to invest

Choose Hybrid Architecture When:

  • PASS Performance is critical (15% faster per agent)
  • PASS Budget allows higher dev cost ($19k additional)
  • PASS Long-term strategic (>5 years to break even)
  • PASS Architectural flexibility valued
  • PASS Combined-strengths approach justified

10.4 Final Verdict

For Most Organizations:

  • Start: Python multi-agent (fast to market, proven)
  • Migrate: Rust multi-agent (cost savings, 33-month break-even)
  • Optimize: Consider hybrid if performance-critical and long-term

For Cost-Sensitive:

  • Go directly to Rust multi-agent (50% cost savings outweigh dev time)

For Performance-Critical:

  • Plan hybrid architecture from day 1 (15% performance gain worth $19k investment)

For Rapid Prototyping:

  • Python only (fastest iteration, defer optimization)

10.5 Limitations & Future Work

Current Limitations:

  1. Single platform: Windows-only testing (cross-platform validation needed)
  2. Single model: gemma3:latest only (generalization to other models unknown)
  3. Limited runs: 5 runs per config (more runs would improve statistical confidence)
  4. No streaming optimization: Full responses only (streaming may change characteristics)

Future Research Directions:

  1. Cross-platform validation: Linux, macOS performance comparison
  2. Model generalization: Test Llama3.1, Mistral, Qwen
  3. Streaming optimization: Real-time token processing
  4. 3+ agent orchestration: Scaling beyond dual-agent
  5. CPU-bound workloads: Test Tokio's parallelism advantage
  6. Quantization impact: Q2_K, Q4_0, Q8_0 comparisons
  7. Long-context scenarios: 4K+, 8K+ token contexts
  8. Production case studies: Real-world deployment validation

11. Appendices

Appendix A: Complete Configuration Table

Baseline vs Chimera Configs:

Test GPU (Chimera) CTX TEMP Avg Speedup Avg Eff Peak Eff Contention
001 60 512 0.8 1.9481x 97.41% 98.78% 0/5
002 60 1024 0.8 1.9532x 97.66% 99.30% 0/5
003 80 512 0.8 1.9698x 98.49% 99.36% 0/5
004 80 1024 0.8 1.9797x 98.98% 99.96% 0/5
005 120 512 0.8 1.8321x 91.60% 99.21% 1/5
006 120 1024 0.8 1.9739x 98.69% 99.96% 0/5
202 80 512 0.8 1.9752x 98.76% 99.93% 0/5

Chimera Hetero Configs:

Test GPU A/B CTX A/B TEMP Avg Speedup Avg Eff Peak Eff Contention
007 60/80 512/1024 0.8 1.9580x 97.90% 99.33% 0/5
008 60/80 1024/2048 0.8 1.9744x 98.72% 99.86% 0/5
009 80/100 512/1024 0.8 1.9784x 98.92% 99.63% 0/5
010 80/100 1024/2048 0.8 1.9785x 98.93% 99.73% 0/5
011 120/140 512/1024 0.8 1.9879x 99.396% 99.57% 0/5
012 120/140 1024/2048 0.8 1.9744x 98.72% 99.38% 0/5
201 80/80 512/1024 0.8 1.9793x 98.96% 99.90% 0/5

Chimera Homo Configs (Sampled):

Test GPU CTX TEMP Avg Speedup Avg Eff Peak Eff Contention
100 80 512 0.6 1.8846x 94.23% 99.77% 0/5
108 80 2048 1.0 1.9777x 98.88% 99.99% 0/5
200 80 512 0.8 1.9802x 99.01% 99.80% 0/5

Appendix B: Comparison to Python TR110

Direct Metric Comparison:

Metric Python (TR110) Rust (TR114_v2) Delta Winner
Peak Config Avg Efficiency 99.25% (test108) 99.396% (test011) +0.15pp Rust
Peak Single Run 99.25% 99.992% (test108) +0.74pp Rust
Mean Efficiency (All Runs) 95.8% 98.281% +2.48pp Rust
Median Efficiency ~96.5% 98.6% +2.1pp Rust
Consistency (StdDev) 7.4pp 4.9pp -2.5pp Rust
Consistency (CV) 7.7% 5.0% -2.7pp Rust
Contention Rate 10-15% 0.74% -10-14pp Rust
Best Baseline-vs-Chimera 97.9% 98.984% +1.08pp Rust
Best Chimera-Hetero 99.0% 99.396% +0.40pp Rust
Best Chimera-Homo 99.25% 99.01% -0.24pp Python

Single-Agent Baseline Comparison (TR112_v2):

Metric Python Rust Delta Winner
Throughput 99.34 tok/s 114.54 tok/s +15.2% Rust
TTFT 1437 ms 603 ms -58.0% Rust
Memory 250 MB 75 MB -67% Rust
Startup 1.5s 0.2s -83% Rust

Appendix C: Statistical Formulas

Concurrency Speedup:

speedup = sequential_estimated_time / concurrent_wall_time
where sequential_estimated_time = agent1_time + agent2_time

Parallel Efficiency:

efficiency = (speedup / num_agents) x 100%
where num_agents = 2

Coefficient of Variation:

CV = (stddev / mean) x 100%

Throughput Delta:

throughput_delta = collector_throughput - insight_throughput (tok/s)

Resource Contention Detection:

contention = (agent_ttft > baseline_ttft + 3000ms)

Appendix D: Glossary

  • Concurrency Speedup: Ratio of sequential time to concurrent time (ideal = 2.0x for 2 agents)
  • Parallel Efficiency: Percentage of theoretical maximum speedup achieved
  • TTFT: Time-to-First-Token (latency from request to first generated token)
  • Throughput: Tokens generated per second (eval phase only)
  • Resource Contention: Anomalous TTFT increase indicating server-level serialization
  • Chimera: Optimized configuration (custom num_gpu, num_ctx, temperature)
  • Baseline: Ollama default configuration (no manual overrides)
  • Homogeneous: Both agents use identical configuration
  • Heterogeneous: Agents use different configurations (asymmetric)
  • Work-Stealing: Tokio's thread scheduling algorithm (tasks can migrate between threads)
  • Dual Ollama: Two independent Ollama server instances (ports 11434/11435)

Acknowledgments

This research builds upon:

  • Technical Report 109: Python agent workflow analysis (baseline single-agent data)
  • Technical Report 110: Python multi-agent orchestration (comparison baseline)
  • Technical Report 111_v2: Rust agent comprehensive optimization (corrected single-agent baselines)
  • Technical Report 112_v2: Rust vs Python single-agent comparison (revealed 15% Rust advantage)
  • Technical Report 113: Rust multi-agent initial analysis (identified dual Ollama requirement)
  • Technical Report 115: Rust async runtime analysis (quantified runtime overhead)

This work used the Ollama local LLM inference server and the Rust/Tokio async ecosystem.


Document Version: 2.0 Last Updated: 2025-11-15 Status: Final Supersedes: Technical Report 113, Technical Report 114 (v1)


Related Documentation:


For questions or clarifications, refer to the complete dataset in Demo_rust_multiagent/runs/tr110_rust_full/ or contact the research team.