Research Archive
Technical Report

TR129: N-Agent Scaling Laws

Closed-loop multi-agent throughput scaling on consumer GPU with Ollama.

Table of Contents

Technical Report 129: N-Agent Scaling Laws

Closed-Loop Multi-Agent Throughput Scaling on Consumer GPU with Ollama

Field Value
TR Number 129
Project Banterhearts LLM Performance Research
Date 2026-02-25
Author Research Team
Report Type N-agent scaling characterization (4-phase, 5,310 measurements)
Test Duration ~90 minutes
Status Complete --- All 4 phases delivered
Run ID 20260225_213619
Related Work TR114v2 (2-Agent Efficiency), TR128 (Production Workload Characterization)
Depends On TR128 (baseline throughput cross-validation)

Abstract

TR108--TR128 characterized LLM inference under single-request and open-loop arrival conditions. Real multi-agent systems operate in closed-loop: each agent sends a request, waits for the response, then sends another. The maximum number of in-flight requests is bounded by N (the agent count), unlike open-loop systems where arrivals are unbounded. TR129 fills this gap with 5,310 measurements across 4 phases, testing N={1, 2, 3, 4, 5, 6, 7, 8} concurrent closed-loop agents on 3 models --- llama3.2-1b (1.2B params), qwen2.5-1.5b (1.5B params), and llama3.2-3b (3.2B params) --- all served by a single Ollama instance on an RTX 4080 Laptop GPU (12 GB VRAM).

Core finding: Per-agent effective throughput degrades sharply with N, following Amdahl's Law with serial fractions s = 0.39--0.54. Total system throughput plateaus at approximately 1.4--1.6x single-agent throughput, meaning that adding agents beyond N=2 yields rapidly diminishing returns. The GPU serialization bottleneck --- not memory, not scheduling, not Ollama overhead --- is the fundamental constraint.

Per-agent efficiency (effective tok/s at N / effective tok/s at N=1):

Model eta(2) eta(4) eta(8)
llama3.2-1b 80.0% 40.4% 20.3%
llama3.2-3b 67.5% 34.2% 17.3%
qwen2.5-1.5b 72.7% 36.8% 18.6%

Amdahl's Law serial fractions (R-squared > 0.97 for all models):

Model Serial Fraction (s) Max Speedup (1/s) R-squared
llama3.2-1b 0.5391 1.85x 0.970
llama3.2-3b 0.3870 2.58x 0.993
qwen2.5-1.5b 0.4554 2.20x 0.985

Interpretation of serial fraction: The serial fraction s represents the portion of work that is inherently sequential and cannot benefit from adding more agents. For llama3.2-1b (s = 0.54), over half the work is serial --- GPU scheduler serialization, memory bus contention, Ollama HTTP handling, and CUDA kernel launch overhead. Even with infinite agents, total throughput cannot exceed 1/s = 1.85x the single-agent rate. This is why the throughput plateau is reached by N=2.

Think-time sweep (Phase 3, N=4): Inter-request delays improve per-request throughput (each individual LLM call returns faster due to reduced contention) but reduce sustained system throughput due to duty-cycle loss. At think=2000ms, per-request throughput recovers to near-baseline (eta >= 0.98), but agents spend 60--69% of their time idle. The practical implication: natural agent processing time (reasoning, tool calls) between requests is beneficial for per-request latency, but adding artificial delays is counterproductive for total system throughput.

Key findings: (1) Total system throughput plateaus at N=2 with <3% gain from N=2 to N=8. (2) Amdahl's Law with s = 0.39--0.54 predicts scaling with R-squared > 0.97. (3) Think-time improves per-request throughput but reduces sustained throughput --- a trade-off, not a free lunch. (4) Fairness is excellent (Jain's index >= 0.997 at N=8). (5) Heterogeneous model assignments show throughput differences, though Phase 4 confounds prevent isolating model-switching overhead.


Executive Summary

TR129 answers: how does per-agent throughput scale when N agents share a single Ollama GPU instance in closed-loop operation?

The answer is sobering: the RTX 4080 Laptop GPU, when serving a single model via Ollama, has a serial fraction of 39--54% under Amdahl's Law. This means that no matter how many agents you add, total system throughput cannot exceed 1.9--2.6x the single-agent rate. In practice, the plateau is reached by N=2 (for throughput) and diminishing returns are severe beyond N=3 (for efficiency).

This has immediate implications for multi-agent system design: on consumer hardware with a single GPU, the optimal architecture is 2--3 agents per GPU, not 8 or more. Agents beyond this count get worse per-request latency without meaningfully increasing total token production. The result is consistent with Amdahl's Law applied to GPU inference: the GPU scheduler, memory bus, and Ollama HTTP handling form a serial bottleneck that cannot be parallelized.

Key Findings

  1. Single-agent baseline: llama3.2-1b achieves 115.8 eff. tok/s solo (wall time 1,042 ms per request, producing 128 tokens). llama3.2-3b: 79.2 eff. tok/s (1,617 ms). qwen2.5-1.5b: 102.7 eff. tok/s (1,232 ms).
  2. Total throughput plateaus at N=2: llama3.2-1b goes from 117.8 at N=1 to 185.3 at N=2 to 187.9 at N=8 (1.6x overall, <1.4% gain from N=2 to N=8). Similar patterns for all models.
  3. Per-agent degradation is severe: At N=8, each agent gets only 17--20% of its solo throughput. The degradation is smooth and predictable, not a phase transition.
  4. Amdahl's Law fits well: Serial fractions s = 0.5391 (llama3.2-1b, R-squared=0.970), s = 0.3870 (llama3.2-3b, R-squared=0.993), s = 0.4554 (qwen2.5-1.5b, R-squared=0.985). The serial fraction is highest for the smallest model, which has the most overhead relative to compute.
  5. Saturation points: llama3.2-1b saturates (eta < 50%) at N=4. llama3.2-3b and qwen2.5-1.5b saturate at N=3. These thresholds provide concrete agent-count guidance.
  6. Fairness is excellent: Jain's index >= 0.997 at N=8 for all models. Ollama distributes work equitably even under heavy contention. No agent is starved.
  7. Think-time is a trade-off, not a free lunch: Inter-request delays of 100ms improve sustained total throughput by 2--45% (model-dependent), but delays of 2000ms reduce sustained throughput by 2--14% despite near-perfect per-request efficiency.
  8. qwen2.5-1.5b anomaly: This model shows a dramatic 45% sustained throughput improvement at think=100ms (175.7 -> 254.1 tok/s total), possibly related to the unexplained 66% decode throughput increase under sustained load observed in TR128.
  9. GPU tok/s is constant across N: GPU-side decode throughput (~207/162/114 tok/s) remains flat regardless of N, confirming that contention manifests as queue wait, not decode slowdown.
  10. Phase 4 confounded: homo_1b (233.9 tok/s) outperformed Phase 2 N=4 (187.0 tok/s) by 25%, but Ollama restart, warmup sequence, and thermal state changes prevent attributing this to OLLAMA_MAX_LOADED_MODELS alone.

Key Decisions for Multi-Agent Design

  1. Agent count: Total system throughput plateaus by N=2. Adding agents beyond N=3 wastes resources.
  2. Think-time: Inter-request delays improve per-request throughput but reduce sustained system throughput due to duty-cycle loss. Do not add artificial delays.
  3. Model heterogeneity: Prefer homogeneous assignments. Phase 4 confounds prevent isolating model-switching overhead, but the conservative default is same-model-per-GPU.
  4. Scaling prediction: Use eta(N) = 1 / (s + (1-s)*N) with per-model s from SS5 to predict throughput at untested N.
  5. Capacity planning: With average s = 0.46, max speedup is 2.2x. Budget 2 agents per GPU for optimal throughput/latency trade-off.

Claim Validation

# Claim Status Evidence
1 eta(N) monotonically non-increasing Violated (minor) Phase 2 N=1 yields eta=101.7% for llama3.2-1b (1.7% above baseline, within noise)
2 in_flight in [0, N-1] Confirmed Queue dynamics: max in-flight never reaches N
3 Jain's index = 1.0 at N=1 Confirmed J=1.0000 for all models at N=1
4 N=1 cross-validates with TR128 (within 10%) Partially 11--18% gap (wall-clock comparison); different prompt sizes explain the gap
5 GPU tok/s constant across N Confirmed ~207 / ~162 / ~114 tok/s consistent from N=1 to N=8
6 Total system throughput plateaus Confirmed <3% gain from N=2 to N=8 for all models
7 Think-time improves scheduling Confirmed (partial) Per-request: yes. Sustained: trade-off (duty-cycle loss)
8 Amdahl's Law describes scaling Confirmed R-squared 0.970--0.993, serial fraction s = 0.39--0.54

When to Use This Report

Scenario What You'll Find Key Sections
Sizing a multi-agent system How many agents can a single GPU serve before throughput saturates? What's the per-agent latency at each N? SS4, SS5, SS6
Predicting throughput at untested N Amdahl's Law formula with per-model serial fractions. Plug in N, get predicted eta. SS5 (formula + table)
Designing agent think-time Should agents pause between requests? How does delay affect latency vs throughput? SS7, SS8 (dual-perspective analysis)
Evaluating mixed-model deployments Does mixing models hurt? How does OLLAMA_MAX_LOADED_MODELS affect performance? SS9, SS10 (with confound analysis)
Validating fairness Do all agents get equal GPU time? Is any agent starved? SS6 (Jain's index)
Understanding the bottleneck Why does throughput plateau? What's the physical constraint? SS5 (serial fraction), SS12 (timeline)

How to Read This Report

  • Quick scan (5 min): Abstract + Executive Summary + SS17 (Design Guidance)
  • Decision-making (15 min): Add SS4 (throughput curves), SS5 (Amdahl), SS6 (saturation)
  • Full technical review (45 min): All sections, with focus on SS7--SS8 (think-time) and SS13--SS14 (validation)

Table of Contents

  1. SS1: Introduction & Research Motivation
  2. SS2: Methodology & Design
  3. SS3: Single-Agent Baseline
  4. SS4: N-Agent Throughput Scaling
  5. SS5: Efficiency & Scaling Laws
  6. SS6: Fairness & Saturation
  7. SS7: Think-Time Effects
  8. SS8: Optimal Think-Time
  9. SS9: Heterogeneous Model Analysis
  10. SS10: Model Switching Overhead
  11. SS11: VRAM & GPU Metrics
  12. SS12: Request Timeline & Serialization
  13. SS13: Cross-Validation with TR128
  14. SS14: Statistical Analysis
  15. SS15: Key Findings
  16. SS16: Conclusions
  17. SS17: Multi-Agent Design Guidance
  18. SS18: Limitations & Future Work
  19. SS19: Reproducibility
  20. Appendix A: Environment
  21. Appendix B: Configuration
  22. Appendix C: Data Summary
  23. Appendix D: Glossary
  24. References

Metric Definitions & Statistical Methods

TR129 reports two throughput metrics. Understanding the distinction is critical for interpreting all results.

Throughput Metrics

Metric Formula What It Measures When to Use
effective_tps (PRIMARY) completion_tokens / wall_ms * 1000 User-perceived throughput including queue wait, scheduling delay, and GPU compute Capacity planning, SLA estimation, per-agent latency
gpu_tokens_per_s (SECONDARY) completion_tokens / eval_ms * 1000 GPU-side decode speed only, excludes queue wait Cross-validation, GPU utilization, bottleneck identification

Why two metrics? At N=1, effective_tps < gpu_tokens_per_s because effective_tps includes Ollama overhead (HTTP round-trip, scheduling, prompt evaluation). As N grows, effective_tps degrades further (more queue wait) while gpu_tokens_per_s stays constant (the GPU decodes at the same rate per token regardless of how many agents are waiting). The growing gap between these metrics is the queue wait penalty --- the core phenomenon TR129 measures.

Think-Time Metrics (SS7--SS8 Only)

For think-time analysis, we additionally distinguish:

Metric Formula What It Measures
Per-request effective_tps completion_tokens / wall_ms * 1000 How fast each individual LLM call returns (excludes inter-request idle time)
Sustained effective_tps completion_tokens / (wall_ms + think_ms) * 1000 Actual token production rate including idle time
Duty cycle wall_ms / (wall_ms + think_ms) Fraction of time the agent is actively waiting for a response

Why does this matter? Without the sustained/per-request distinction, think-time analysis produces physically impossible numbers. At think=2000ms with N=4 agents, per-request effective_tps can reach ~114 tok/s per agent (near solo speed), but the agents are only active 35% of the time. Multiplying N * per_request_tps gives 454 tok/s total --- exceeding the GPU's physical decode rate of 207 tok/s. The sustained metric correctly reports 160 tok/s total, which is below the GPU ceiling.

Scaling Metrics

Metric Formula Range
eta(N) per_agent_eff_tps(N) / per_agent_eff_tps(1) [0, 1] for well-behaved scaling
Jain's fairness index (sum x_i)^2 / (N * sum x_i^2) [1/N, 1.0]; 1.0 = perfect fairness
Amdahl's Law eta(N) = 1 / (s + (1-s)*N) s in [0, 1]; s = serial fraction
N* Smallest N where eta < 50% Model-dependent saturation point

Statistical Tests

  • Confidence intervals: 95% CI via t-distribution
  • Normality: Shapiro-Wilk test (W statistic, p-value)
  • Effect sizes: Cohen's d (small=0.2, medium=0.5, large=0.8)
  • Curve fitting: Four candidates (power law, exponential, logistic, Amdahl); compared by R-squared. Amdahl is preferred for theoretical grounding despite not always winning R-squared.
  • Outlier detection: IQR-based (< Q1 - 1.5IQR or > Q3 + 1.5IQR)
  • Multiple comparison note: When comparing across multiple models or think-time values, we report per-comparison p-values without Bonferroni correction. Effect sizes (Cohen's d > 15 for the scaling effect) are so large that correction is unnecessary.

SS1: Introduction & Research Motivation

Multi-agent LLM systems are increasingly common: autonomous coding assistants deploy multiple parallel agents, research pipelines orchestrate chains of LLM calls, and customer support bots handle concurrent conversations --- all sharing a single inference backend. The critical question for deployment is: how does performance scale with agent count?

1.1 Research Questions

  1. At what agent count does per-agent throughput collapse? Is the degradation gradual or a phase transition?
  2. Does total system throughput plateau, continue growing, or decline with N?
  3. How does inter-request think-time (simulating agent reasoning/tool-call delays) change the scaling picture?
  4. Does model heterogeneity (agents using different models) affect scaling?
  5. What is the serial bottleneck fraction (Amdahl's Law) limiting parallelism on consumer GPU?

1.2 Why This Matters

Prior work in this series characterized single-request inference (TR108--TR122), quality/quantization trade-offs (TR124--TR125), cross-platform behavior (TR126), long-context scaling (TR127), and open-loop production workloads (TR128). TR114v2 tested N=2 agents and found ~98% efficiency (~42 tok/s per agent vs 114 solo). But nobody has characterized scaling beyond N=2 in closed-loop operation on consumer hardware.

The gap is not academic: production decisions about how many agents to deploy per GPU, whether to add think-time between requests, and whether to mix models across agents all depend on data that doesn't exist yet. Without scaling curves, engineers either over-provision (wasting GPU resources) or under-provision (creating latency bottlenecks).

1.3 Prior Work and Literature Gap

Amdahl's Law (1967) predicts parallel speedup limits based on the serial fraction of a workload. It has been applied to CPU parallelism, distributed computing, and GPU kernel scaling, but not to multi-agent LLM inference --- a workload that combines GPU parallelism (within each forward pass) with request-level serialization (across the GPU scheduler and serving stack).

The closest prior work is TR114v2 (in this series), which tested N=2 agents on the HuggingFace backend and found 98% efficiency. However, N=2 is insufficient to reveal scaling trends: two points cannot distinguish linear from exponential decay. Production deployments routinely run 4--8 concurrent agent sessions, and the scaling behavior in this regime is unstudied.

The vLLM and TGI papers characterize throughput under open-loop arrivals (request rate as independent variable), which corresponds to TR128's paradigm. Closed-loop scaling (agent count as independent variable, bounded concurrency) is a different workload model that these systems have not characterized.

TR129 fills this gap: it provides a systematic closed-loop scaling characterization from N=1 to N=8, with formal Amdahl analysis, fairness quantification, and think-time interaction on consumer hardware.

1.5 Key Distinction: Closed-Loop vs Open-Loop

Property Closed-Loop (TR129) Open-Loop (TR128)
Arrival pattern Agent sends next request after receiving response Requests arrive independently (Poisson)
Max concurrency Bounded by N Unbounded (can spike)
Queue growth Self-regulating (busy agent stops sending) Can grow without bound
Real-world model Agent reasoning, tool calling Web API, chat load
What it tests Sustained agent productivity Server burst handling

TR128's open-loop results (e.g., TTFT amplification at 2.0 req/s) cannot directly predict closed-loop behavior because closed-loop bounds max concurrency to N. The two paradigms answer different questions.

Practical example: A coding assistant framework runs 4 agents, each handling a different file. In open-loop (TR128), all 4 agents might fire requests simultaneously during a burst, then go idle. In closed-loop (TR129), each agent sends one request at a time and waits for the response before proceeding. The closed-loop model is more realistic for most multi-agent applications because agents can't do useful work (besides waiting) while their LLM request is in flight.

When open-loop applies instead: API servers facing external user traffic (e.g., a chat endpoint receiving requests from multiple users) are better modeled by TR128's open-loop paradigm, because users don't wait for other users' responses before sending their own requests.

1.7 Scope

  • In scope: N=1--8 agents, 3 models (1b--3b parameters), homogeneous and heterogeneous model assignments, think-time sweep, Ollama on RTX 4080 Laptop GPU (Windows)
  • Not in scope: N > 8, data-center GPUs, multi-GPU, vLLM/TGI backends, long-context effects (see TR127), quantization effects (see TR125)

SS2: Methodology & Design

2.1 Hardware & Software

Component Specification
GPU NVIDIA RTX 4080 Laptop (12 GB VRAM, Ada Lovelace, compute 8.9)
OS Windows 11 (Build 26200)
Inference backend Ollama (single instance, default configuration)
Python 3.13.1
PyTorch 2.8.0+cu128
CUDA 12.8

2.2 Models

Model Ollama Tag Parameters Architecture Base VRAM
llama3.2-1b llama3.2:1b 1.2B LLaMA-3.2 ~1.3 GB
qwen2.5-1.5b qwen2.5:1.5b 1.5B Qwen-2.5 ~1.5 GB
llama3.2-3b llama3.2:3b 3.2B LLaMA-3.2 ~2.4 GB

All models fit comfortably in 12 GB VRAM individually. Together they require ~5.2 GB (Phase 4 with OLLAMA_MAX_LOADED_MODELS=3), leaving ~6.8 GB headroom.

Why these models? They span a 2.7x parameter range (1.2B to 3.2B), use two architectures (LLaMA, Qwen), and are the same models tested in TR128 --- enabling direct cross-validation. All are small enough to avoid VRAM pressure, isolating GPU compute as the bottleneck.

2.3 Closed-Loop Protocol

Each agent runs an independent closed loop:

  1. Generate prompt (uniform random 100--300 tokens from a seed paragraph)
  2. Send POST to Ollama /api/generate (stream=false, max_tokens=128, temp=0.0, seed=42)
  3. Wait for complete response (no streaming)
  4. Record wall_ms (via time.perf_counter()), Ollama native timing (nanosecond precision), and in-flight count at submit time
  5. Optional think-time delay (Phase 3 only)
  6. Repeat from step 1

Implementation detail: All N agents start simultaneously via asyncio.create_task(). Each agent uses its own httpx.AsyncClient to avoid connection pool contention. The in_flight counter is a mutable [0] list, safe in single-threaded asyncio. Maximum in-flight requests at any moment = N (closed-loop invariant).

Timing extraction: Ollama returns timing in nanoseconds (eval_duration, prompt_eval_duration, total_duration, load_duration). These are converted to milliseconds. Wall-clock time is measured with time.perf_counter() wrapping the entire HTTP POST cycle. The difference between wall_ms and Ollama's total_duration reflects HTTP overhead + queue wait.

2.4 Phases

Phase Description Config Rows Purpose
1. Baseline N=1 serial, 50 requests per model 3 models 150 Reference throughput for eta(1)=1.0
2. Scaling N={1,2,3,4,5,6,7,8}, 30 req/agent 3 models x 8 N-levels = 24 configs 3,240 Core scaling curves
3. Think-Time N=4, think={0,100,500,2000}ms, 30 req/agent 3 models x 4 think-times = 12 configs 1,440 Think-time trade-off
4. Heterogeneous N=4, mixed model assignments 4 configs 480 Model switching overhead
Total 5,310

Phase transitions: 5-second cooldown between configurations within a phase. The cooldown allows Ollama's internal state (connection pools, request queues) to drain and prevents one configuration's tail latency from bleeding into the next. Before Phase 4, Ollama is restarted with OLLAMA_MAX_LOADED_MODELS=3 and all 3 models are warmed up (3 requests each). The restart is necessary because OLLAMA_MAX_LOADED_MODELS can only be set at Ollama startup on Windows.

Ordering within Phase 2: Configurations are run in ascending N order (N=1, 2, 3, ..., 8) for each model, and models are cycled in a fixed order (llama3.2-1b, qwen2.5-1.5b, llama3.2-3b). This creates 24 sequential configurations. The fixed ordering means temporal effects (GPU warming, OS state changes) are confounded with N and model --- but the effect sizes are so large (Cohen's d > 15) that temporal drift cannot explain the results.

Sample size justification: 30 requests per agent per configuration provides adequate statistical power. The scaling effect is so large (Cohen's d > 15) that even a single sample would suffice for detection at 80% power. The 30-request design provides precision for confidence intervals and outlier detection.

2.5 Prompt Design Choices

Prompts are generated by repeating a seed paragraph (318 characters, ~75 tokens) and truncating to a uniform random length in [100, 300] tokens. This creates diverse prompt lengths while keeping content consistent.

Why uniform random length? A fixed prompt length would create lock-step behavior where all N agents' requests take identical time, producing artificially high fairness scores. Uniform random variation creates realistic scheduling heterogeneity: some requests evaluate faster (short prompts), some slower (long prompts). The 3:1 length ratio (100--300 tokens) provides enough variation to stress-test the scheduler without making prompt evaluation the dominant cost (decode still dominates at 128 output tokens).

Why not real prompts? Real agent prompts (code snippets, tool outputs, conversation history) have variable structure that would add uncontrolled noise. Synthetic prompts isolate the variable of interest (N) from prompt-content effects. TR124 and TR125 tested real tasks; TR129 deliberately uses synthetic prompts for controlled scaling measurement.

Temperature and seed: All requests use temp=0.0 and seed=42. This eliminates sampling variance (identical prompts produce identical outputs). The throughput measurement is not affected by output quality; we only need consistent decode work per request.

2.6 What This Methodology Does NOT Test

  • Variable prompt lengths within a run: All prompts are 100--300 tokens. Real agents may send 50-token tool calls and 4,000-token context dumps in the same session. The 6:1 length ratio in real workloads could create more scheduling heterogeneity than our 3:1 ratio.
  • Multi-turn context accumulation: Each request is independent. Real agents accumulate context across turns, increasing prompt size (see TR128 Phase 5). This would increase prompt_eval time relative to decode, potentially changing the serial fraction.
  • Backend comparison: Only Ollama is tested. vLLM, TGI, and TensorRT-LLM may have different scaling behavior, especially those designed for high-concurrency serving with PagedAttention and continuous batching.
  • Model sizes > 3B: All models fit comfortably in 12 GB VRAM. At 7B+ (or with long contexts), VRAM pressure introduces a second bottleneck that interacts with the Amdahl scaling (see TR127).

SS3: Single-Agent Baseline

Phase 1 establishes the reference throughput for each model at N=1 (serial, no contention). These values define eta(1) = 1.0 for all efficiency calculations.

Model N Requests Mean Wall (ms) Eff. tok/s GPU tok/s 95% CI (eff) CV%
llama3.2-1b 50 1,042 115.8 206.8 [113.4, 118.2] 7.3%
llama3.2-3b 50 1,617 79.2 114.3 [78.9, 79.4] 1.2%
qwen2.5-1.5b 50 1,232 102.7 162.2 [101.6, 103.8] 3.7%

Observation 1 --- GPU tok/s consistently exceeds effective tok/s: The ratio is 1.58x (qwen2.5-1.5b) to 1.78x (llama3.2-1b). This gap represents Ollama's internal overhead at N=1: HTTP round-trip (~5ms), prompt evaluation (variable, depends on prompt length), model scheduling, and CUDA kernel launch. This overhead is the "baseline tax" even without contention, and its relative magnitude determines the serial fraction in Amdahl's Law (larger overhead relative to decode = higher s).

Observation 2 --- llama3.2-1b has the highest variability: CV=7.3% with 9 outliers (18%) detected by IQR. The small model's short decode time (~620ms based on 128 tokens at 207 tok/s) means that fixed overhead (HTTP, scheduling) contributes a larger fraction of wall_ms, amplifying relative noise. By contrast, llama3.2-3b's longer decode time (~1,120ms) dilutes the overhead, yielding CV=1.2%.

Observation 3 --- Throughput ordering matches model size: Smaller models decode faster (higher tok/s). This is expected: smaller models have fewer parameters to read from memory per token, and inference is memory-bandwidth-bound for autoregressive decode. The 1.46x throughput ratio (llama3.2-1b / llama3.2-3b = 115.8/79.2) is smaller than the 2.67x parameter ratio (3.2B/1.2B), consistent with sub-linear scaling of inference cost with parameters (see TR121v1).

Observation 4 --- Warmup protocol validation: The 3-request warmup protocol (applied before Phase 1 begins) is validated by the cold-start analysis in SS14: only 4 of 48 agent-phase combinations show significant cold-start effects, close to the 5% false-positive rate. For Phase 1 specifically, the first 3 data-collecting requests show no systematic difference from the remaining 47, confirming that the warmup is sufficient.

Observation 5 --- Baseline stability over 50 requests: Per-agent throughput does not drift over the 50-request Phase 1 run. A Spearman rank correlation between request_sequence and effective_tps yields p > 0.1 for all three models, indicating no temporal trend. The GPU is thermally stable and Ollama's internal state does not degrade over this timeframe.

Cross-validation with TR128: These baselines compare with TR128 Phase 1 within 11--18% (see SS13), with the gap attributable to different prompt/completion configurations.


SS4: N-Agent Throughput Scaling

The core Phase 2 result: how total and per-agent effective throughput change with N = {1, 2, 3, 4, 5, 6, 7, 8} concurrent closed-loop agents. Each configuration runs 30 requests per agent with 0ms think-time.

4.0 Phase 2 N=1 vs Phase 1 Baseline Consistency

Before analyzing scaling, we verify that Phase 2's N=1 data matches Phase 1's baseline:

Model Phase 1 eff. tok/s Phase 2 N=1 eff. tok/s Delta Consistent?
llama3.2-1b 115.8 117.8 +1.7% Yes (within CI)
llama3.2-3b 79.2 80.2 +1.3% Yes (within CI)
qwen2.5-1.5b 102.7 101.7 -1.0% Yes (within CI)

All three models are within 2% of their Phase 1 baselines, confirming measurement consistency across phases. The slight positive bias for llama3.2-1b and 3b reflects the warm-GPU effect (Phase 2 runs after Phase 1). qwen2.5-1.5b's slight negative bias is within sampling noise.

4.1 Per-Agent Effective Throughput

Model N=1 N=2 N=3 N=4 N=5 N=6 N=7 N=8
llama3.2-1b 117.8 92.7 62.0 46.8 37.9 31.3 26.9 23.5
llama3.2-3b 80.2 53.4 35.9 27.1 21.7 18.2 15.6 13.7
qwen2.5-1.5b 101.7 74.6 49.8 37.8 30.1 25.2 21.7 19.1

Observation: The degradation is smooth and monotonic for all models. There is no "cliff" or phase transition --- just steady, predictable decay. At N=8, each agent receives 17--20% of its solo throughput. From the agent's perspective, each request takes 4--6x longer at N=8 compared to solo.

Why is per-agent throughput degrading? The GPU processes one token at a time per active sequence (Ollama's default scheduling). When N agents have active requests, each agent's request must wait for the GPU to cycle through the other N-1 requests before generating its next token. The wall-clock time per request increases approximately linearly with N, so effective_tps decreases approximately as 1/N --- modified by the serial overhead fraction (Amdahl's Law).

4.2 Total System Throughput

Model N=1 N=2 N=3 N=4 N=5 N=6 N=7 N=8
llama3.2-1b 117.8 185.3 186.2 187.0 189.4 187.9 188.3 187.9
llama3.2-3b 80.2 106.8 107.7 108.2 108.6 109.1 109.1 109.4
qwen2.5-1.5b 101.7 149.2 149.4 151.1 150.6 151.1 151.7 152.8

The critical insight: Total system throughput reaches its plateau at N=2 for all models. The gain from N=2 to N=8 is negligible:

  • llama3.2-1b: 185.3 -> 187.9 (+1.4%)
  • llama3.2-3b: 106.8 -> 109.4 (+2.4%)
  • qwen2.5-1.5b: 149.2 -> 152.8 (+2.4%)

This means every agent added beyond N=2 increases per-agent latency without meaningfully increasing total token production. The GPU is effectively saturated at N=2 for these model sizes.

4.3 Why Does Total Throughput Exceed Single-Agent Rate?

At N=1, total system = 117.8 tok/s. At N=2, total = 185.3 tok/s (1.57x). This exceeds 1.0x because:

  1. Pipelining: While one agent's request is decoding, another agent's request can begin prompt evaluation or be scheduled. The GPU is never idle when there's a queued request.
  2. Continuous batching: Ollama can interleave token generation across multiple sequences in a single forward pass, producing more total tokens per second than a single sequence.
  3. Overhead amortization: HTTP and scheduling overhead is a smaller fraction of total time when the GPU is continuously busy.

The plateau at ~185 tok/s for llama3.2-1b (vs GPU decode ceiling of ~207 tok/s) suggests the GPU approaches its decode bandwidth limit. The remaining 11% gap is the irreducible serial overhead (HTTP, scheduling, prompt eval transitions).

4.4 Per-Model Observations

llama3.2-1b (1.2B): The smallest model shows the most dramatic absolute drop: 117.8 -> 23.5 tok/s per agent (5.0x reduction). But its total throughput plateau (185--189 tok/s) is the highest, close to the GPU decode ceiling. This makes sense: the small model has the least decode work per token, so overhead is the bottleneck.

llama3.2-3b (3.2B): The largest model shows the smallest relative gain from N=1 to N=2: 80.2 -> 106.8 (1.33x). This model's longer decode time per token means the GPU spends more time doing useful work (decode) and less time on overhead. The serial fraction is correspondingly lower (s = 0.387 vs 0.539 for 1b).

qwen2.5-1.5b (1.5B): Intermediate behavior. Notably, its total throughput at N=8 (152.8 tok/s) is 50% higher than its N=1 rate (101.7), the largest relative gain. This model may benefit more from pipelining due to its architecture (GQA with aggressive head reduction) allowing more efficient memory bandwidth utilization during batched decode.

Practical implication: For throughput-maximizing deployments, N=2 captures 92--97% of the achievable total throughput. Every agent beyond N=2 adds <1% total throughput while degrading per-agent latency. This is the strongest actionable result in TR129.

4.5 Comparison with TR114v2 (2-Agent Study)

TR114v2 tested N=2 agents with llama3.2-1b on the same hardware and found ~98.28% efficiency (~42 tok/s per agent vs ~43 solo via the HuggingFace backend). TR129's N=2 efficiency for llama3.2-1b is 80.0% (92.7 tok/s vs 115.8 solo via Ollama). The gap demands explanation.

Property TR114v2 (N=2) TR129 (N=2)
Backend HuggingFace Transformers Ollama
Per-agent tok/s ~42 92.7
Solo tok/s ~43 115.8
Efficiency 98.28% 80.0%
Model llama3.2-1b llama3.2-1b

Why the efficiency difference? Three factors explain why TR114v2 saw near-perfect efficiency while TR129 sees 80%:

  1. Backend architecture: TR114v2 used raw HuggingFace Transformers with Python-level async batching. Ollama adds an HTTP server layer, JSON serialization, and its own scheduling queue --- additional serial overhead that increases the Amdahl serial fraction.
  2. Throughput regime: TR114v2's solo rate (~43 tok/s) is 2.7x lower than TR129's (~116 tok/s). At lower throughput, the GPU has more idle time between tokens, so adding a second agent simply fills idle slots without contention. At TR129's higher throughput, the GPU is already well-utilized at N=1, so the second agent creates real contention.
  3. Measurement methodology: TR114v2 used a different timing methodology (Python-level wall clock around generate() calls vs Ollama's native nanosecond timing). Overhead attribution may differ.

Reconciliation: The TR114v2 result is not wrong --- it correctly shows that 2-agent HuggingFace inference on a lightly-loaded GPU achieves near-perfect scaling. TR129 shows that Ollama's additional serving stack introduces enough serial overhead to reduce 2-agent efficiency to 80%. Both results are consistent with Amdahl's Law: the serial fraction is simply different for different backends.

Implication: Backend choice matters for multi-agent scaling. A high-concurrency backend with lower per-request overhead (e.g., vLLM with PagedAttention) might achieve serial fractions lower than Ollama's 0.39--0.54, potentially extending the scaling plateau beyond N=2.

4.6 The Scaling Curve Shape

Plotting per-agent throughput vs N reveals a hyperbolic decay (consistent with 1/N modified by Amdahl). The curve is NOT linear --- the steepest drop is from N=1 to N=2 (20--33% loss), while N=7 to N=8 loses only 2--3%. This "front-loaded" degradation means the first additional agent costs the most in per-agent terms.

Model Drop N=1->2 Drop N=2->3 Drop N=3->4 Drop N=7->8
llama3.2-1b -21.3% -33.1% -24.5% -12.7%
llama3.2-3b -33.4% -32.7% -24.5% -12.2%
qwen2.5-1.5b -26.6% -33.3% -24.0% -12.1%

Observation: The percentage drops stabilize at higher N. From N=7 to N=8, all three models lose ~12% of their per-agent throughput. This convergence is predicted by Amdahl's Law: at high N, the scaling curve approaches 1/(s*N), and each additional agent adds a fixed fraction of degradation. The asymptotic regime begins around N=5--6.

Observation --- llama3.2-3b drops most at N=1->2: The 33.4% initial drop (vs 21.3% for 1b) seems counterintuitive given that 3b has a lower serial fraction (s=0.39 vs 0.54). The explanation: at N=1, 3b's longer decode time means the GPU is more fully utilized, so adding a second agent immediately creates contention. For 1b, the GPU has more inter-request idle time at N=1, so the second agent partially fills that idle time before contention sets in.


SS5: Efficiency & Scaling Laws

5.1 Efficiency Curves

eta(N) = per_agent_eff_tps(N) / per_agent_eff_tps(N=1, Phase 1 baseline):

Model Baseline eta(1) eta(2) eta(3) eta(4) eta(5) eta(6) eta(7) eta(8)
llama3.2-1b 115.8 101.7% 80.0% 53.6% 40.4% 32.7% 27.0% 23.2% 20.3%
llama3.2-3b 79.2 101.4% 67.5% 45.3% 34.2% 27.4% 23.0% 19.7% 17.3%
qwen2.5-1.5b 102.7 99.0% 72.7% 48.5% 36.8% 29.3% 24.5% 21.1% 18.6%

Note on eta(1) > 100%: Phase 2 N=1 yields slightly higher throughput than Phase 1 baseline for llama3.2-1b (117.8 vs 115.8) and llama3.2-3b (80.2 vs 79.2). This is a warm-up effect: Phase 2 runs after Phase 1, when the GPU caches are warm and Ollama's internal state is optimized. The 1--2% deviation is within measurement noise and does not invalidate the scaling analysis. This is why the monotonicity claim is "Violated (minor)" in the executive summary.

5.2 Scaling Model Comparison

Four candidate models were fit to each efficiency curve:

Model Power Law R-sq Exponential R-sq Logistic R-sq Amdahl R-sq Best Fit
llama3.2-1b 0.951 0.984 0.964 0.970 Exponential
llama3.2-3b 0.985 0.973 0.943 0.993 Amdahl
qwen2.5-1.5b 0.969 0.981 0.957 0.985 Amdahl

Observation: Amdahl's Law provides the best fit for 2 of 3 models and is competitive for the third (0.970 vs 0.984 for llama3.2-1b). The exponential model's slight advantage for llama3.2-1b may reflect the larger overhead fraction creating a faster-than-Amdahl decay at small N. Despite this, we adopt Amdahl's Law as the primary model for three reasons:

  1. Theoretical basis: GPU serialization is a genuine serial bottleneck. Amdahl's Law is the correct model for this physics.
  2. Parsimony: Amdahl has 1 parameter (s) vs 2 for exponential (beta, c). Fewer parameters = less overfitting risk.
  3. Interpretability: The serial fraction s has a direct physical meaning (fraction of work that is sequential), enabling predictions and comparisons across hardware.

5.3 Amdahl's Law Analysis

The Amdahl model eta(N) = 1 / (s + (1-s)*N) estimates the serial fraction s --- the portion of work that is inherently sequential and cannot benefit from request-level parallelism.

Model Serial Fraction (s) R-squared Max Speedup (1/s) Predicted eta(4) Predicted eta(8)
llama3.2-1b 0.5391 0.970 1.85x 42.0% 23.7%
llama3.2-3b 0.3870 0.993 2.58x 35.2% 18.9%
qwen2.5-1.5b 0.4554 0.985 2.20x 38.0% 20.8%

What physically constitutes the serial fraction? The serial portion includes:

  • GPU scheduler serialization: Ollama queues requests and dispatches them to the GPU. At the token level, the GPU generates one token per active sequence per forward pass.
  • Memory bus contention: Small models (1b) are memory-bandwidth-bound. Multiple concurrent requests compete for the same GDDR6 bandwidth (~256 GB/s).
  • Ollama HTTP overhead: Each request involves HTTP POST parsing, JSON serialization, and response construction. This is pure serial work.
  • CUDA kernel launch overhead: Each token generation requires launching GPU kernels. With more sequences, the scheduler has more work.

Why is llama3.2-1b's serial fraction highest (s = 0.54)? The smallest model has the shortest decode time per token (~4.8ms from 207 tok/s). This means the "parallelizable" portion (GPU decode) is small relative to the fixed overhead, making the serial fraction larger. By contrast, llama3.2-3b's longer decode time (~8.8ms) makes the parallelizable portion dominant, yielding a lower serial fraction (s = 0.39).

Validation: The predicted eta values closely match observations:

  • llama3.2-1b: predicted eta(4)=42.0% vs observed 40.4% (4% error)
  • llama3.2-3b: predicted eta(4)=35.2% vs observed 34.2% (3% error)
  • qwen2.5-1.5b: predicted eta(4)=38.0% vs observed 36.8% (3% error)

Practical implication: For capacity planning, use eta(N) = 1 / (s + (1-s)N) with the per-model s. For example, to predict llama3.2-3b throughput with N=10 agents: eta(10) = 1 / (0.387 + 0.61310) = 0.149, so each agent would get ~11.8 tok/s (15% of solo). Total system: 10 * 11.8 = 118 tok/s (vs 109.4 at N=8 --- only 8% more).

5.4 Residual Analysis: Where Does Amdahl Break Down?

The residual (observed eta minus predicted eta) reveals systematic patterns:

Model N=1 N=2 N=3 N=4 N=5 N=6 N=7 N=8
llama3.2-1b +0.017 +0.028 -0.019 -0.016 -0.011 -0.007 -0.006 -0.034
llama3.2-3b +0.014 +0.002 -0.009 -0.010 -0.004 -0.001 +0.001 -0.016
qwen2.5-1.5b -0.010 +0.008 -0.007 -0.012 -0.009 -0.004 -0.003 -0.022

Observation --- negative residuals at high N: All three models underperform Amdahl's prediction at N=8 by 1.6--3.4 percentage points. This suggests a secondary bottleneck emerges at high N that Amdahl's Law (with its single serial fraction) cannot capture. Candidates:

  1. HTTP connection pool saturation: 8 concurrent clients may hit httpx or Ollama's connection limits, adding latency beyond simple FIFO queueing.
  2. CUDA scheduler overhead: The GPU's hardware scheduler has finite slots for tracking concurrent contexts. At N=8, context-switching overhead may grow super-linearly.
  3. Memory bus contention: 8 concurrent request contexts mean 8 sets of KV-cache buffers competing for GDDR6 bandwidth. At small models (where the KV-cache is a larger fraction of total memory traffic), this contention is more pronounced --- consistent with llama3.2-1b's largest negative residual.

Observation --- positive residuals at N=2: All models slightly exceed Amdahl's prediction at N=2, particularly llama3.2-1b (+2.8pp). This suggests pipelining gains: at N=2, the two requests interleave (one in prompt_eval while the other decodes) more efficiently than the Amdahl model predicts. The simple 1/N scaling underestimates the benefit of GPU pipeline parallelism at low N.

Observation --- llama3.2-3b has the smallest residuals: Max absolute residual = 1.6pp (at N=8). The larger model's higher compute-to-overhead ratio makes it more "Amdahl-like" --- the serial fraction is genuinely constant across N. For the smaller models, the serial fraction effectively increases with N due to secondary bottlenecks.

5.5 Connection to GPU Architecture

The RTX 4080 Laptop (Ada Lovelace, SM 8.9) has 7,424 CUDA cores organized into 58 SMs, with 256 GB/s memory bandwidth. For autoregressive decode, the bottleneck is memory bandwidth: each token requires reading all model weights from VRAM.

Memory bandwidth utilization at N=1:

  • llama3.2-1b (1.2B params * 2 bytes FP16 = 2.4 GB per forward pass): At 207 tok/s, the model reads 207 * 2.4 = 497 GB/s. This exceeds the theoretical bandwidth (256 GB/s), which means Ollama's internal batching and CUDA caching are reducing the effective data movement per token. The 207 tok/s rate likely reflects cached weight access patterns, not raw bandwidth reads.
  • llama3.2-3b (3.2B params * 2 bytes = 6.4 GB): At 114 tok/s, effective read rate = 730 GB/s. Same caching effect.

Why this matters for scaling: The high effective bandwidth utilization at N=1 means there is little headroom for additional parallel work. When a second agent starts, its requests compete for the same memory bus and cache hierarchy. The resulting contention creates the serial fraction measured in SS5.3.

Prediction for different hardware: An A100 (2 TB/s memory bandwidth, 8x more than RTX 4080) would have proportionally more bandwidth headroom, potentially allowing the serial fraction to decrease. However, the HTTP/scheduling overhead (part of s) is independent of GPU bandwidth, so s cannot decrease to zero even with infinite bandwidth. This is why TR130 (GPU profiling) is needed: to decompose s into its bandwidth-dependent and bandwidth-independent components.


SS6: Fairness & Saturation

6.1 Jain's Fairness Index

J(N) = (sum x_i)^2 / (N * sum x_i^2). A value of 1.0 means all agents receive equal throughput; 1/N means one agent monopolizes the GPU.

Model N=1 N=2 N=3 N=4 N=5 N=6 N=7 N=8
llama3.2-1b 1.0000 1.0000 0.9992 0.9997 0.9989 0.9994 0.9989 0.9975
llama3.2-3b 1.0000 0.9999 0.9998 0.9997 0.9996 0.9994 0.9995 0.9994
qwen2.5-1.5b 1.0000 0.9996 0.9999 0.9998 0.9997 0.9997 0.9990 0.9987

All Jain's index values exceed 0.997 across all models and all N. This is an important positive finding: while per-agent throughput degrades severely with N, the degradation is shared equally. No agent is starved, even at N=8 with heavy contention.

Observation --- llama3.2-1b shows the most fairness degradation: J drops from 1.0000 at N=2 to 0.9975 at N=8. This is still excellent (J < 0.99 would indicate concern), but the smallest model's short request times create more scheduling opportunities for unfairness. Longer requests (llama3.2-3b) have more "smoothing" because each agent holds the GPU for longer, reducing the impact of scheduling jitter.

Why is fairness so high? Ollama uses a FIFO (first-in, first-out) scheduling policy. In closed-loop operation, agents that finish first re-submit first, creating a natural round-robin pattern. The FIFO discipline prevents priority inversion or starvation. This result would NOT necessarily hold for priority-based schedulers or backends with per-model queues.

6.2 Saturation Points

The saturation point N* is the smallest N where eta drops below 50%:

Model N* (50% threshold) eta at N=8 Implication
llama3.2-1b 4 20.3% 4+ agents: each gets <50% of solo throughput
llama3.2-3b 3 17.3% 3+ agents: severe diminishing returns
qwen2.5-1.5b 3 18.6% 3+ agents: severe diminishing returns

Design guidance: The 50% threshold is a practical "knee" beyond which adding agents is counterproductive for per-agent latency. For latency-sensitive applications (interactive chat, real-time agent reasoning), keep N <= N*: that's N <= 3 for 1.5b/3b models, N <= 4 for 1b.

For throughput-maximizing applications (batch processing, overnight jobs), N=2 is optimal: it captures >95% of the achievable total throughput (185/189 = 98% for llama3.2-1b) while giving each agent 80% of solo throughput. Going to N=8 gains only 1.4% more total throughput but gives each agent only 20% of solo throughput.

6.3 What Would Break Fairness?

The near-perfect fairness (J >= 0.997) holds under TR129's specific conditions. Several factors could degrade fairness in production:

  1. Priority scheduling: If Ollama or the backend implements request priorities (e.g., premium users get higher priority), fairness would decrease proportionally to the priority skew.
  2. Variable prompt lengths: TR129 uses 100--300 token prompts (2:1 range). If one agent sends 50-token prompts and another sends 4,000-token prompts, the long-prompt agent's requests would dominate GPU time during prompt evaluation, potentially starving short-prompt agents.
  3. Model switching: In heterogeneous configurations, model loading/unloading could create bursts of latency for agents whose model was evicted. This was partially tested in Phase 4 but confounded.
  4. Memory pressure: If VRAM fills (e.g., with 7B+ models or long contexts), OOM-triggered eviction could selectively delay some agents. TR127 showed VRAM spillover causes non-linear latency degradation.

The practical takeaway: Monitor Jain's index in production. If it drops below 0.99 (indicating >1% throughput variance across agents), investigate whether one of the above factors is creating systematic unfairness.

6.4 Efficiency vs Fairness: A Design Trade-Off

N Efficiency eta Fairness J Per-Agent tok/s (1b) Total tok/s (1b)
1 100% 1.000 115.8 115.8
2 80% 1.000 92.7 185.3
4 40% 0.9997 46.8 187.0
8 20% 0.9975 23.5 187.9

Reading the table: As N increases from 2 to 8, fairness stays nearly constant (excellent), but efficiency plummets from 80% to 20%. The system remains fair --- all agents suffer equally. The question for system designers is not "will some agents be starved?" (no) but "is 20% efficiency acceptable for the use case?" For batch processing where total throughput matters, N=2 is sufficient. For interactive agents where per-request latency matters, N=1 is optimal.


SS7: Think-Time Effects

Phase 3 tests N=4 agents with inter-request think times of {0, 100, 500, 2000} ms. Think-time simulates real agent behavior: after receiving a response, agents typically spend time reasoning, calling tools, parsing results, or updating context before sending the next request. The question is whether this natural idle time changes the scaling picture.

7.1 Why This Matters

In production multi-agent systems, agents rarely fire requests back-to-back. A coding assistant might spend 500ms analyzing code before its next LLM call. A research agent might spend 2 seconds fetching a web page. If this natural think-time reduces GPU contention, it means the Amdahl's Law serial fractions from SS5 (which assume zero think-time) are pessimistic --- real agents may scale better than the zero-delay model predicts.

But there's a catch: think-time also means agents are idle part of the time. Even if each individual request completes faster, the total token production rate may decrease because agents aren't always producing tokens. This is the duty-cycle trade-off.

7.2 Two Throughput Perspectives

To avoid physically impossible claims, we report two perspectives:

  • Per-request (completion_tokens / wall_ms): how fast each individual LLM call returns. This is what the agent experiences per call. At high think-time, agents rarely overlap, so each gets near-solo GPU access and per-request throughput approaches baseline.
  • Sustained (completion_tokens / (wall_ms + think_ms)): the actual token production rate including idle time. This is the real throughput over a sustained work session. It accounts for the duty-cycle loss.

Why does the distinction matter? At think=2000ms with N=4, per-request throughput can reach ~114 tok/s per agent (near solo). Naively, total = 4 * 114 = 456 tok/s. But the GPU's physical decode rate is only ~207 tok/s. The error: agents are only active 35% of the time, so the 456 figure counts each agent's burst throughput as if they were all producing tokens simultaneously. The sustained metric correctly reports ~160 tok/s total.

7.3 Results: llama3.2-1b (N=4)

Think (ms) Per-Req tok/s Sustained tok/s Duty Cycle eta (per-req) eta (sustained) Total Sustained tok/s Wall ms
0 46.8 46.8 100.0% 0.404 0.404 187.0 2,642
100 49.7 47.8 96.1% 0.429 0.413 191.3 2,494
500 59.3 47.8 80.6% 0.512 0.413 191.1 2,077
2000 113.6 40.1 35.3% 0.981 0.346 160.4 1,092

Observation --- per-request recovery: At think=2000ms, per-request throughput reaches 113.6 tok/s (eta=0.981), recovering to 98% of solo speed. This is because with 2s think-time, agents rarely overlap: each request effectively runs solo on the GPU. Wall_ms drops from 2,642ms (think=0) to 1,092ms (think=2000) --- a 2.4x improvement in per-call latency.

Observation --- sustained degradation: But sustained throughput drops from 46.8 to 40.1 tok/s per agent (-14%). The duty cycle of 35.3% means agents spend 64.7% of their time idle. Total sustained throughput drops from 187.0 to 160.4 (-14%).

7.4 Results: llama3.2-3b (N=4)

Think (ms) Per-Req tok/s Sustained tok/s Duty Cycle eta (per-req) eta (sustained) Total Sustained tok/s Wall ms
0 38.7 38.7 100.0% 0.489 0.489 154.8 3,321
100 40.6 39.3 96.9% 0.513 0.497 157.3 3,172
500 46.1 39.1 84.8% 0.582 0.493 156.2 2,778
2000 97.5 39.2 40.2% 1.232 0.495 156.7 1,344

Observation --- eta > 1.0 at think=2000ms: Per-request eta reaches 1.232, meaning throughput exceeds the Phase 1 baseline by 23%. This is not a measurement error. At think=2000ms, agents effectively run solo on a warm, uncontested GPU that has been operating for ~60 minutes. The Phase 1 baseline was measured earlier in the experiment with a cold-start penalty on some requests. The eta > 1.0 demonstrates that warm-GPU solo throughput exceeds cold-GPU baseline.

Observation --- sustained throughput is nearly flat: Total sustained throughput varies only from 154.8 to 157.3 tok/s across all think-times (1.6% range). For llama3.2-3b, the duty-cycle loss at high think-time is almost exactly offset by the per-request improvement. This is a coincidence of the specific N=4 + s=0.387 combination, not a general rule.

7.5 Results: qwen2.5-1.5b (N=4)

Think (ms) Per-Req tok/s Sustained tok/s Duty Cycle eta (per-req) eta (sustained) Total Sustained tok/s Wall ms
0 43.9 43.9 100.0% 0.428 0.428 175.7 3,069
100 66.8 63.5 95.0% 0.651 0.619 254.1 1,914
500 83.2 62.8 75.4% 0.811 0.611 251.1 1,535
2000 142.6 44.9 31.5% 1.389 0.437 179.4 918

Observation --- dramatic anomaly at think=100ms: qwen2.5-1.5b shows a 45% jump in sustained total throughput at think=100ms (175.7 -> 254.1 tok/s). This is far larger than the 2.3% and 1.6% improvements seen for the other models. The improvement persists at think=500ms (251.1 tok/s) before declining at think=2000ms (179.4 tok/s, near baseline).

Why is qwen2.5-1.5b so different? This model showed a similar anomaly in TR128 Phase 2: a 66% decode throughput increase during sustained load. We hypothesize that qwen2.5-1.5b's architecture (GQA with aggressive head reduction) has a specific interaction with Ollama's scheduler:

  1. At think=0ms, all 4 agents' requests overlap maximally, creating memory bus contention that disproportionately affects this model's attention pattern.
  2. At think=100ms, the slight stagger reduces peak contention. The model's memory access pattern becomes more efficient when requests don't all hit the same memory bank simultaneously.
  3. The improvement is not just "less contention" --- it's a non-linear interaction where slight staggering allows the hardware to serve this specific model much more efficiently.

This remains a hypothesis, not a confirmed mechanism. A controlled experiment with NVIDIA Nsight profiling (TR130) could test whether memory bank contention is the root cause.

Observation --- eta > 1.0 at think=2000ms: Per-request eta reaches 1.389, exceeding baseline by 39%. The same warm-GPU effect as llama3.2-3b, amplified because qwen2.5-1.5b is more sensitive to thermal/cache state.

7.6 Summary: The Think-Time Trade-Off

Model Best Per-Req Think Best Sustained Think Sustained vs Zero-Delay
llama3.2-1b 2000ms (eta=0.981) 100ms (191.3 tok/s) +2.3%
llama3.2-3b 2000ms (eta=1.232) 100ms (157.3 tok/s) +1.6%
qwen2.5-1.5b 2000ms (eta=1.389) 100ms (254.1 tok/s) +44.6%

The trade-off in one sentence: Think-time makes each individual LLM call faster but makes the total system produce fewer tokens per second --- except for qwen2.5-1.5b, which benefits dramatically from slight contention reduction.

Practical implication: Do NOT add artificial delays between requests. The duty-cycle loss is not worth the per-request improvement for llama3.2-1b and llama3.2-3b. However, if your agents naturally have 100--500ms of processing time between requests (reasoning, tool calls, UI updates), this is a hidden benefit: each LLM call will return faster than in a zero-delay pipeline. For qwen2.5-1.5b specifically, even a small natural delay yields outsized throughput gains.

7.7 The qwen2.5-1.5b Anomaly: A Deep Dive

The 45% sustained throughput improvement at think=100ms for qwen2.5-1.5b is TR129's most unexpected result. Let's quantify what makes it different.

The anomaly in numbers:

Metric llama3.2-1b llama3.2-3b qwen2.5-1.5b
Sustained gain at think=100ms +2.3% +1.6% +44.6%
Per-request gain at think=100ms +6.2% +4.9% +52.2%
Wall_ms reduction at think=100ms 148ms (5.6%) 149ms (4.5%) 1,155ms (37.6%)
Wall_ms at think=0 2,642ms 3,321ms 3,069ms
Wall_ms at think=100ms 2,494ms 3,172ms 1,914ms

The key observation: qwen2.5-1.5b's wall_ms drops by 1,155ms (38%) when think-time increases from 0 to 100ms. The other two models see only 148--149ms reductions (4--6%). Something specific to this model causes severe degradation under maximal contention that is relieved by even a brief stagger.

Hypothesis 1 --- GQA attention pattern and memory access: qwen2.5-1.5b uses Grouped Query Attention (GQA) with aggressive head reduction (fewer KV heads than Q heads). This means multiple query heads share the same KV cache entries. Under maximal contention (4 concurrent requests, think=0ms), all 4 requests may simultaneously access overlapping KV cache regions, creating memory bank conflicts in GDDR6. A 100ms stagger breaks this synchronization, allowing sequential rather than conflicting access.

Hypothesis 2 --- Ollama batch scheduling interaction: At think=0ms, Ollama sees 4 constant-in-flight requests and batches them maximally. At think=100ms, the natural stagger means Ollama occasionally sees 2--3 in-flight instead of always 4. If Ollama's batch implementation is suboptimal for this model's memory access pattern at batch_size=4 but efficient at batch_size=2--3, the stagger would help.

Connection to TR128: TR128 Phase 2 observed a 66% decode throughput increase for qwen2.5-1.5b during sustained load --- another anomalous performance improvement under specific timing conditions. Both anomalies point to the same underlying mechanism: this model's memory access pattern interacts non-linearly with the GPU's memory subsystem and Ollama's batching.

What would confirm the hypothesis? TR130 (NVIDIA Nsight profiling) could directly measure:

  • Memory bank conflict rates at N=4, think=0 vs think=100ms
  • Batch size distribution at each think-time
  • L2 cache hit rates across think-time settings
  • Per-SM utilization patterns showing whether some SMs stall on memory at think=0ms

Practical takeaway: If deploying qwen2.5-1.5b (or similar GQA models with aggressive head reduction) in multi-agent configurations, a 100--500ms inter-request delay may yield dramatic throughput improvements. This is model-specific and should be validated on each deployment.


SS8: Optimal Think-Time

Sustained total throughput (duty-cycle-corrected) at each think-time value:

Model Optimal Think-Time (ms) Max Sustained Total tok/s Improvement vs Zero
llama3.2-1b 100 191.3 +2.3%
llama3.2-3b 100 157.3 +1.6%
qwen2.5-1.5b 100 254.1 +44.6%

All three models achieve their best sustained throughput at think=100ms. But the magnitude differs enormously: 2% for llama/3b vs 45% for qwen.

8.1 Sustained vs Burst Comparison

Model Think (ms) Sustained Total tok/s Burst Total tok/s Duty Cycle
llama3.2-1b 0 187.0 187.0 100.0%
llama3.2-1b 100 191.3 198.9 96.1%
llama3.2-1b 500 191.1 237.2 80.6%
llama3.2-1b 2000 160.4 454.4 35.3%
llama3.2-3b 0 154.8 154.8 100.0%
llama3.2-3b 100 157.3 162.3 96.9%
llama3.2-3b 500 156.2 184.3 84.8%
llama3.2-3b 2000 156.7 389.9 40.2%
qwen2.5-1.5b 0 175.7 175.7 100.0%
qwen2.5-1.5b 100 254.1 267.4 95.0%
qwen2.5-1.5b 500 251.1 332.9 75.4%
qwen2.5-1.5b 2000 179.4 570.5 31.5%

Reading the table: "Burst total" is N * per-request effective_tps (excludes idle time). "Sustained total" is N * sustained_per_agent (includes idle time). At think=0ms, both are identical (100% duty cycle). At think=2000ms, burst rates of 454--570 tok/s are physically impossible to sustain because agents are active only 31--40% of the time.

Why burst rate at think=2000ms exceeds GPU decode rate: The burst rate measures per-request throughput during active periods only. At think=2000ms, agents rarely overlap, so each request gets near-solo GPU access. The "burst" of 454 tok/s for llama3.2-1b means that during the ~35% of time when requests are active, 4 agents each complete at near-solo speed. But during the other ~65%, the GPU is idle. The sustained rate of 160.4 tok/s is the actual time-averaged throughput.

8.2 What Optimal Think-Time Reveals About the Bottleneck

The fact that think=100ms is optimal for all three models (not think=0ms) reveals a subtle scheduling effect: even a brief stagger between agents' submissions improves throughput. At think=0ms, all 4 agents continuously compete for GPU time with zero breathing room. At think=100ms, agents desynchronize slightly --- when one agent pauses for 100ms, the other 3 get momentarily reduced contention.

The magnitude of this effect differs dramatically by model:

  • llama3.2-1b (+2.3%): The improvement is barely above noise. At 207 GPU tok/s, the 1b model decodes so fast that a 100ms stagger is too brief to meaningfully reduce contention.
  • llama3.2-3b (+1.6%): Similar to 1b. The longer decode time per token means the 100ms stagger is an even smaller fraction of the total request duration.
  • qwen2.5-1.5b (+44.6%): The dramatic improvement suggests that at think=0ms, this model has a specific performance pathology (hypothesized in SS7.7) that 100ms of stagger resolves. The optimal think-time for qwen is model-specific, not a general scheduling optimization.

Implication for batch scheduling systems: If building a batch scheduler that controls when agents submit requests, a 100ms stagger between submissions is effectively free (minimal duty-cycle loss at 96% duty cycle) and may provide modest throughput gains. For qwen2.5-1.5b-class models, the gains can be substantial. This is analogous to network traffic shaping, where small inter-packet gaps reduce collision rates.


SS9: Heterogeneous Model Analysis

Phase 4 tests N=4 agents with mixed model assignments, using OLLAMA_MAX_LOADED_MODELS=3 to pre-load all models into VRAM simultaneously (~5.2 GB total, within 12 GB budget).

9.1 Configuration Summary

Config Model Assignment Total eff. tok/s Jain's J OK Rate
homo_1b 4x llama3.2-1b 233.9 0.9998 100.0%
mixed_small 2x llama3.2-1b + 2x qwen2.5-1.5b 191.6 0.9987 100.0%
mixed_size 2x llama3.2-1b + 2x llama3.2-3b 178.0 0.9987 100.0%
all_different 2x llama3.2-1b + 1x qwen2.5-1.5b + 1x llama3.2-3b 188.5 0.9953 100.0%

Observation --- homo_1b achieves the highest throughput: 233.9 tok/s, notably exceeding Phase 2 N=4 llama3.2-1b (187.0 tok/s) by 25%. This comparison is confounded (see SS10), but the raw number is striking.

Observation --- fairness remains high in all configs: Even with mixed models of dramatically different sizes (1b + 3b), Jain's index stays above 0.995. Ollama's FIFO scheduling treats all requests equally regardless of which model they target.

9.2 Per-Model Breakdown (all_different config)

Model N Requests Mean eff. tok/s 95% CI
llama3.2-1b 60 44.2 [41.0, 47.5]
llama3.2-3b 30 48.1 [38.2, 58.0]
qwen2.5-1.5b 30 51.9 [50.1, 53.8]

Observation --- counterintuitive ordering: The 3b model (48.1 tok/s) outperforms the 1b model (44.2 tok/s) in the all_different configuration. This is unexpected because llama3.2-1b is faster in solo operation. The likely explanation: in a mixed-model configuration, the 3b model's longer decode time means it holds the GPU for longer stretches, reducing context-switching overhead. The 1b model's short requests create more scheduler transitions, adding overhead.

Observation --- llama3.2-3b's wide CI: The [38.2, 58.0] confidence interval is 3x wider than qwen2.5-1.5b's [50.1, 53.8]. With only 30 requests and potential model-switching latency spikes, the 3b model has high variance in this configuration.

9.3 Throughput Composition in Mixed Configurations

How does total throughput break down by model in mixed configurations? This matters because agents using slower models consume more GPU time per token, potentially dragging down agents using faster models.

mixed_small (2x 1b + 2x 1.5b): Total 191.6 tok/s

  • llama3.2-1b agents: ~55 tok/s each (110 total)
  • qwen2.5-1.5b agents: ~41 tok/s each (82 total)
  • The 1b agents get higher throughput because the model decodes faster per token, even in a mixed environment.

mixed_size (2x 1b + 2x 3b): Total 178.0 tok/s

  • This is the lowest total throughput among Phase 4 configs.
  • The 3b model's longer decode time means the GPU spends more time per token on 3b requests, reducing the total token production rate.
  • The 1b agents are "held back" by sharing the GPU with the slower 3b agents.

Implication for mixed deployments: In a mixed-model configuration, the total system throughput is bounded by the weighted average of individual model throughputs. Mixing a fast model (1b) with a slow model (3b) produces lower total throughput than running 4 agents on the fast model alone. Unless the slower model provides significantly better quality per token (see TR125), homogeneous fast-model deployment maximizes tokens per second.

9.4 Per-Model Breakdown (homo_1b config)

Model N Requests Mean eff. tok/s 95% CI
llama3.2-1b 120 58.5 [57.2, 59.8]

Observation --- 25% faster than Phase 2 N=4: homo_1b achieves 58.5 tok/s per agent vs 46.8 in Phase 2 N=4 llama3.2-1b. The tight CI [57.2, 59.8] suggests this is not noise. See SS10 for confound analysis.


SS10: Model Switching Overhead

10.1 The Comparison

Compares homo_1b (Phase 4, with OLLAMA_MAX_LOADED_MODELS=3) vs N=4 llama3.2-1b (Phase 2, default config):

Metric Phase 2 (N=4, default) Phase 4 (homo_1b, MAX_LOADED=3)
Mean eff. tok/s 46.8 58.5
95% CI [45.6, 47.9] [57.2, 59.8]
N samples 120 120
  • Difference: +25.1% (Phase 4 faster)
  • t-statistic: -13.58
  • p-value: < 0.0001
  • Cohen's d: -1.75 (large effect)

10.2 Confounds

This comparison has several confounds that prevent attributing the 25% improvement to OLLAMA_MAX_LOADED_MODELS alone:

  1. Ollama restart: Phase 4 begins with a fresh Ollama instance (restarted to set OLLAMA_MAX_LOADED_MODELS=3). Phase 2 runs on the original instance that has been serving requests for ~5 minutes. A fresh Ollama instance may have different memory allocation patterns, cleaner internal state, or different CUDA graph caches.

  2. Temporal ordering: Phase 4 runs ~60 minutes into the experiment (after Phases 1--3). GPU thermal state, system memory fragmentation, and OS scheduler state may differ. However, we saw no thermal throttling in TR128 (peak 66C), so thermal effects are unlikely.

  3. Warmup sequence: Phase 4 warms all 3 models (3 requests each) before starting. Phase 2 warms only the target model. The all-model warmup may pre-populate GPU caches or CUDA contexts differently.

  4. MAX_LOADED_MODELS itself: With 3 models loaded in VRAM, Ollama may make different memory allocation decisions (e.g., fixed VRAM partitioning vs dynamic allocation). This could improve or degrade performance depending on the model.

Conclusion: The 25% improvement is real and statistically significant, but its cause remains unknown. A controlled experiment (same phase position, same warmup, only MAX_LOADED_MODELS varied) would be needed to isolate the effect. Treat this as "suggestive but unproven."

10.3 Proposed Controlled Experiment Design

To isolate the MAX_LOADED_MODELS effect, a follow-up experiment should:

  1. Run both conditions consecutively (not 60 minutes apart): homo_1b with MAX_LOADED_MODELS=1 (default), then homo_1b with MAX_LOADED_MODELS=3, then repeat in reverse order (ABBA design to control for temporal drift).
  2. Use the same Ollama instance for both conditions (no restart between them), or restart for both.
  3. Match warmup protocols: Either warm all 3 models for both conditions, or warm only llama3.2-1b for both.
  4. Run longer: 100 requests per agent (instead of 30) to improve statistical power for detecting a potentially smaller effect after controlling confounds.
  5. Monitor GPU clock speed: If the improvement is caused by GPU boost clock differences (due to Ollama restart), clock speed monitoring would detect this.

SS11: VRAM & GPU Metrics

Phase Mean VRAM (MB) Max VRAM (MB) Std (MB) Interpretation
Phase 1 (Baseline) 6,505 6,509 1.6 Stable: single model loaded
Phase 2 (Scaling) 3,101 6,514 1,319 High variance: model cycling across 24 configs
Phase 3 (Think-Time) 4,181 4,920 778 Moderate: single model with variable load
Phase 4 (Heterogeneous) 6,525 6,528 3.4 Stable: all 3 models loaded simultaneously

Observation --- VRAM is NOT the bottleneck: Peak VRAM usage (6,528 MB) is only 51% of the 12 GB budget. Even with all 3 models loaded in Phase 4, there is 5.6 GB of headroom. The throughput plateau observed in SS4 is caused by GPU compute serialization, not memory pressure. This confirms that for models in this size range (1--3B), VRAM is not a constraint for multi-agent operation. Larger models (7B+) or longer contexts would change this picture (see TR127 for VRAM-limited scaling).

Observation --- Phase 2 VRAM variance: The high std (1,319 MB) reflects model cycling. Ollama loads and unloads models as the 24 configurations cycle through 3 different models at 8 N-levels each. The VRAM fluctuation does not affect performance because each configuration runs for ~30 seconds, long enough for VRAM to stabilize.

11.2 Thermal Behavior

GPU temperature across the ~90-minute experiment:

Phase Duration (approx) Temp Range Notes
Phase 1 (Baseline) ~10 min 55--60degC Light load (N=1, sequential)
Phase 2 (Scaling) ~40 min 58--66degC Sustained heavy load (N=1--8)
Phase 3 (Think-Time) ~25 min 60--65degC Moderate load (N=4 with idle periods)
Phase 4 (Heterogeneous) ~15 min 62--66degC Heavy load (N=4, multi-model)

Observation --- no thermal throttling: Peak temperature (66degC) is well below the RTX 4080 Laptop's thermal throttle point (~83--87degC). This confirms TR128's finding that sustained inference workloads on this GPU do not trigger thermal management. The ~8degC temperature increase from Phase 1 to Phase 2/4 reflects the transition from intermittent (N=1) to continuous (N >= 2) GPU utilization.

Observation --- temperature does not explain Phase 4's 25% speedup: Phase 4 runs at 62--66degC, similar to late Phase 2 (also 62--66degC). If anything, higher temperature would slightly reduce clock speed (thermal headroom), not increase it. The Phase 4 anomaly (SS10) must have a different explanation.

Implication: Thermal throttling is not a concern for multi-agent deployments on this hardware with models in the 1--3B range. Larger models or longer sustained runs might approach thermal limits --- TR122 characterized the thermal profile in detail.


SS12: Request Timeline & Serialization

Request timeline analysis examines how requests overlap in time as N increases. In a closed-loop system, each agent submits one request at a time, so max in-flight = N.

12.1 Overlap and Serialization Metrics

Model N Overlap Ratio Serialization Degree
llama3.2-1b 1 0.000 1.000
llama3.2-1b 2 0.982 0.018
llama3.2-1b 4 0.992 0.009
llama3.2-1b 8 0.995 0.005
llama3.2-3b 2 0.981 0.019
llama3.2-3b 4 0.991 0.009
llama3.2-3b 8 0.995 0.005
qwen2.5-1.5b 2 0.980 0.021
qwen2.5-1.5b 4 0.990 0.010
qwen2.5-1.5b 8 0.995 0.005

Interpretation --- the GPU is never idle at N >= 2: Overlap ratio = 0.98 at N=2 means that 98% of the experimental time window has multiple active requests. The GPU is continuously busy. Serialization degree = 0.018 means only 1.8% of the time does the GPU have just one active request.

Interpretation --- why throughput plateaus: At N=2, the GPU is already 98% utilized (0.98 overlap). Going to N=8 increases overlap to 99.5% --- a 1.5 percentage point gain. There is almost no idle time to recover at N >= 2. Adding more agents increases the queue depth (how many requests wait), not the GPU utilization (which is already saturated). This is the physical manifestation of Amdahl's bottleneck: the GPU processes tokens at a fixed rate, and more agents just mean longer queues.

Connection to Amdahl's Law: The serialization degree at N=8 (0.5%) closely tracks the predicted "parallelizable fraction" from Amdahl's model. With s ~ 0.46, the non-serial portion is 54%, and the GPU spends ~99.5% of time doing useful work. The remaining 0.5% is the brief gap between request completion and agent re-submission --- the only true idle time.

12.2 Queue Depth Analysis

The in_flight_at_submit metric records how many other agents had active requests when each new request was submitted. In a closed-loop system, this value is bounded by [0, N-1] (the submitting agent counts as "about to be in-flight").

Mean in-flight at submit time:

Model N=1 N=2 N=3 N=4 N=5 N=6 N=7 N=8
llama3.2-1b 0.0 0.98 1.97 2.97 3.97 4.97 5.97 6.97
llama3.2-3b 0.0 0.98 1.97 2.97 3.97 4.97 5.97 6.97
qwen2.5-1.5b 0.0 0.98 1.97 2.97 3.97 4.97 5.97 6.97

Observation --- mean in-flight ~ N-1: At every N, the average agent sees N-1 other active requests when it submits. This confirms the closed-loop invariant: when an agent finishes and immediately re-submits, the other N-1 agents are still waiting for their responses. The GPU queue is always nearly full.

Observation --- values are not exactly N-1: The ~0.03 deficit (e.g., 6.97 vs 7.0 at N=8) reflects brief windows where two agents finish nearly simultaneously. For the ~20ms between one agent's completion and the other agent's re-submission, one fewer request is in-flight. These brief dips are the only "slack" in the system --- and they're too short to meaningfully reduce queue wait for anyone.

What this means for queue wait time: At N=8, each new request joins a queue of ~7 active requests. If the GPU round-robins across all 8 (continuous batching), each request gets 1/8 of the GPU's compute per forward pass. The expected wait time is approximately (N-1) * decode_time_per_token / batch_efficiency. This is why wall_ms at N=8 is ~4.5x the N=1 value (not 8x --- continuous batching provides some parallelism).

12.3 The Gap Between wall_ms and Ollama Timing

Model N=1 wall_ms N=1 total_duration_ms Gap (ms) Gap at N=8 (ms)
llama3.2-1b 1,042 ~620 ~422 ~3,500
llama3.2-3b 1,617 ~1,120 ~497 ~5,700
qwen2.5-1.5b 1,232 ~790 ~442 ~3,900

Interpretation: The gap between wall_ms (measured by our client) and Ollama's internal total_duration grows dramatically with N. At N=1, the gap (~420--500ms) is mostly HTTP round-trip + Ollama's internal queueing. At N=8, the gap grows to 3,500--5,700ms --- this is pure queue wait time. The agent's request sits in Ollama's queue while the GPU processes other agents' requests.

This gap IS the Amdahl bottleneck made visible. The GPU's decode rate (eval_ms) stays constant regardless of N. The wall_ms grows because of queue wait, not because decoding slows down. The serial fraction s in Amdahl's Law captures this queue-induced overhead as a fraction of total work.


SS13: Cross-Validation with TR128

Compares TR129 Phase 1 (N=1, closed-loop baseline) with TR128 Phase 1 (N=1, open-loop baseline). Both use wall-clock-based effective throughput for fair comparison.

13.1 Results

Model TR129 eff. tok/s TR128 eff. tok/s delta% Within 10%? p-value
llama3.2-1b 115.8 140.7 -17.7% No < 0.0001
llama3.2-3b 79.2 89.2 -11.3% No < 0.0001
qwen2.5-1.5b 102.7 125.8 -18.4% No < 0.0001

13.2 Why the 11--18% Gap?

The gap has identifiable methodological causes:

  1. Prompt size difference: TR129 uses 100--300 token prompts. TR128 uses different prompt distributions (potentially shorter). Longer prompts increase prompt_eval time, inflating wall_ms and reducing effective_tps.
  2. Completion length: TR129 fixes max_tokens=128 for all requests. TR128 may use different completion lengths, affecting the ratio of prompt_eval to decode time.
  3. Temporal separation: TR128 and TR129 were run on different days with potentially different background system load, GPU thermal state, and Ollama internal state.
  4. Ollama version/config: Minor version or configuration differences between experiments.

13.3 Assessment

The gap is larger than the 10% target but has plausible explanations. The ordering is preserved (llama3.2-1b > qwen2.5-1.5b > llama3.2-3b in both experiments), and the relative differences are consistent. We consider this a partial validation: the measurements are in the right ballpark, confirming both experiments are measuring the same physical system. The 11--18% offset is a systematic bias from different experimental configurations, not random measurement error.

Impact on TR129 results: The cross-validation gap does not affect TR129's scaling analysis. The scaling curves (eta vs N) are computed relative to TR129's own Phase 1 baseline, so any systematic offset cancels out. The Amdahl serial fractions, fairness indices, and think-time effects are all internally consistent.

13.4 Cross-Report Impact Assessment

TR129's findings interact with and extend several earlier technical reports. This section maps how TR129 changes or confirms prior conclusions.

Prior Report Prior Finding TR129 Impact Status
TR114v2 2-agent efficiency = 98.28% (HF backend) Efficiency is backend-dependent: Ollama yields 80% at N=2. Backend serial overhead is the dominant factor. Extended
TR121v1 Inference cost scales sub-linearly with model size Confirmed: larger models have lower serial fractions (less overhead per useful compute), consistent with sub-linear cost scaling. Confirmed
TR125 Q4_K_M is recommended default across tested models for quality Unchanged. TR129 uses default quantization (Q4_0 via Ollama). Serial fraction may differ at other quant levels if decode speed changes. Unaffected
TR126 torch.compile wins prefill but crashes decode Relevant: if compiled decode worked, it might reduce the serial fraction by amortizing CUDA overhead. Compile's prefill benefit is irrelevant for multi-agent decode-bound workloads. Contextualized
TR127 VRAM thrashing above capacity causes non-linear degradation TR129 stays well within VRAM budget (max 51%). Long-context multi-agent scenarios (SS18 future work) would combine TR127's VRAM limits with TR129's scaling curves. Complementary
TR128 OLLAMA_NUM_PARALLEL is a no-op; streaming adds zero overhead Confirmed: TR129's closed-loop protocol (non-streaming) produces consistent results. NUM_PARALLEL being a no-op explains why adding agents doesn't unlock hidden parallelism. Confirmed
TR128 qwen2.5-1.5b: 66% decode throughput increase under sustained load Extended: TR129 finds 45% sustained throughput improvement at think=100ms for the same model. Both anomalies point to a GQA memory access pattern interaction. Connected
TR128 M/D/1 queueing deviates up to 20.4x at NP>1 Explained: TR128's open-loop M/D/1 model fails because real scheduling is neither Markovian nor deterministic under concurrent load. TR129's Amdahl model (closed-loop) provides a better fit. Superseded (for closed-loop)

Key synthesis: The serial fraction s = 0.39--0.54 is the most important new parameter discovered in TR129. Combined with TR121v1's scaling laws (how throughput changes with model size) and TR127's VRAM limits (how context length bounds performance), practitioners can now predict multi-agent throughput across the full (model_size, context_length, agent_count) space --- at least on consumer hardware with Ollama.


SS14: Statistical Analysis

14.1 Cold-Start Detection

Compared first 5 requests vs remaining requests for each agent-phase combination: 4 of 48 combinations show statistically significant cold-start effects (p < 0.05). This is close to the expected false-positive rate (5% of 48 = 2.4 expected), suggesting minimal cold-start contamination. The Phase 1 warmup protocol (3 warmup requests per model) is effective.

14.2 Outlier Rates (IQR Method)

Phase Model N Samples Outliers % Note
Phase 1 llama3.2-1b 50 9 18.0% High --- small model, system jitter
Phase 1 llama3.2-3b 50 0 0.0% Clean
Phase 1 qwen2.5-1.5b 50 4 8.0% Moderate
Phase 2 All models 3,240 0 0.0% Clean
Phase 3 llama3.2-1b 480 1 0.2% Clean
Phase 3 llama3.2-3b 480 119 24.8% See note below
Phase 3 qwen2.5-1.5b 480 82 17.1% See note below
Phase 4 llama3.2-1b 300 6 2.0% Acceptable
Phase 4 llama3.2-3b 90 0 0.0% Clean
Phase 4 qwen2.5-1.5b 90 7 7.8% Moderate

Phase 3 outlier rates explained: The 17--25% outlier rates for llama3.2-3b and qwen2.5-1.5b in Phase 3 are not a data quality concern --- they are an artifact of applying IQR across a deliberately heterogeneous dataset. Phase 3 sweeps think-time from 0 to 2000ms, creating a multimodal distribution: requests at think=0ms have throughput ~39 tok/s while requests at think=2000ms have throughput ~97 tok/s. The IQR method flags the lower-throughput think=0ms samples as outliers relative to the overall distribution. A proper analysis (as in SS7) stratifies by think-time.

Phase 1 llama3.2-1b outliers (18%): The small model's short request times amplify system jitter into IQR outliers. These are predominantly fast outliers (jitter causing requests to complete slightly faster than the median), not slow outliers (timeouts or errors). All requests have status="ok".

14.3 Power Analysis

Model Observed Cohen's d Effect Size N per Group N Required (80% power) Adequate?
llama3.2-1b 19.643 Very large 30 1 Yes
llama3.2-3b 19.301 Very large 30 1 Yes
qwen2.5-1.5b 15.194 Very large 30 1 Yes

Observation: The scaling effect is so large (Cohen's d > 15 for all models) that even a single sample per group would detect it at 80% power. This is not surprising: the per-agent throughput at N=8 is 5--6x lower than at N=1, a massive effect. The 30-request design provides far more power than needed, which is desirable for precise confidence intervals and outlier detection.

14.4 Distribution Shape

Phase Model Skewness Kurtosis Normal? (Shapiro)
Phase 1 llama3.2-1b -2.65 7.64 No (p < 0.001)
Phase 1 llama3.2-3b -0.23 -0.78 Yes (p = 0.452)
Phase 1 qwen2.5-1.5b -5.20 30.32 No (p < 0.001)
Phase 2 All models 1.89--2.41 3.38--6.12 No (p < 0.001)
Phase 3 All models 0.60--1.12 -0.91 to -0.48 No (p < 0.001)
Phase 4 All models 0.29--2.56 -0.25 to 11.00 No (p < 0.001)

Observation --- most distributions are non-normal: Phase 2 shows positive skew (right tail from scheduling outliers), Phase 1 shows negative skew (left tail from occasional fast requests). Phase 1 qwen2.5-1.5b has extreme kurtosis (30.32), indicating heavy tails from rare extreme values. The non-normality justifies using bootstrap or rank-based methods for future work, though the large sample sizes (n >= 30) ensure t-test robustness via CLT.

Observation --- Phase 2 has zero outliers: Despite 3,240 measurements across 24 configurations, Phase 2 has 0.0% IQR outliers for all models. This indicates highly stable scaling behavior with no scheduling anomalies, timeouts, or system interruptions during the core experiment.

14.5 What the Statistics Tell Us About Production Reliability

The statistical profile has direct implications for production SLA design:

  1. Predictability is high at N >= 2: Phase 2's zero outlier rate and positive skew (right tail only, no left-tail latency spikes) means that multi-agent throughput is highly predictable. Agents can reliably expect throughput >= mean - 2sigma. For SLA purposes, the 5th percentile throughput is a conservative bound.

  2. Phase 1 variability warns about single-agent deployments: The 18% outlier rate for llama3.2-1b at N=1 means that ~1 in 5 requests deviates significantly from the mean. For latency-sensitive single-agent applications, budget extra headroom. At N >= 2, the GPU is continuously busy and scheduling jitter is amortized across requests.

  3. Cohen's d > 15 means the effect is unmistakable: The scaling effect (throughput degradation with N) is so large that it would be detected even with terrible measurement methodology. This gives high confidence that the scaling curves are real physical phenomena, not artifacts of our measurement setup.

  4. Non-normality is benign: The positive skew in Phase 2 means occasional slow requests (right tail), but no systematic left-tail degradation. In production, this translates to: most requests meet their SLA, with rare stragglers. A simple retry policy handles the rare slow requests.


SS15: Key Findings

Finding 1: Total system throughput plateaus at N=2 --- adding agents beyond this yields negligible gains.

This is the most actionable result in TR129. From N=2 to N=8, total throughput increases by only 1.4--2.4%:

Model N=1 total N=2 total N=8 total N=2->N=8 gain
llama3.2-1b 117.8 185.3 187.9 +1.4%
llama3.2-3b 80.2 106.8 109.4 +2.4%
qwen2.5-1.5b 101.7 149.2 152.8 +2.4%

Why this matters: Every agent added beyond N=2 degrades per-agent latency (longer wait per request) without providing meaningful total throughput gain. An 8-agent system produces essentially the same total tokens per second as a 2-agent system, but each agent waits 4x longer per request.

Mechanism: The GPU is 98% utilized at N=2 (overlap ratio = 0.98). There is almost no idle GPU time to recover. Additional agents simply increase queue depth, not GPU throughput.

Finding 2: Efficiency follows Amdahl's Law with high serial fractions (s = 0.39--0.54).

The degradation is smooth, predictable, and well-described by a 1-parameter model:

Model Serial Fraction (s) R-squared Physical Meaning
llama3.2-1b 0.5391 0.970 54% of work is serial (overhead-dominated)
llama3.2-3b 0.3870 0.993 39% of work is serial (compute-dominated)
qwen2.5-1.5b 0.4554 0.985 46% of work is serial (intermediate)

Why this matters: The serial fraction provides a single number that captures the scaling ceiling for each model. With s = 0.54, llama3.2-1b can never achieve more than 1.85x speedup from adding agents. With s = 0.39, llama3.2-3b has a higher ceiling (2.58x) because its longer decode time makes overhead a smaller fraction.

Mechanism: The serial fraction represents GPU scheduler serialization, memory bus contention, Ollama HTTP handling, and CUDA kernel launch overhead. These are inherently sequential operations that cannot benefit from request-level parallelism.

Finding 3: Throughput allocation is fair across agents.

Jain's index exceeds 0.997 at N=8 for all models. No agent starvation occurs.

Model J at N=8
llama3.2-1b 0.9975
llama3.2-3b 0.9994
qwen2.5-1.5b 0.9987

Why this matters: In a multi-agent system, unfair scheduling could cause some agents to timeout or fall behind while others proceed. TR129 shows this is not a concern with Ollama's FIFO scheduler --- all agents receive equitable GPU time, even at N=8.

Finding 4: Think-time improves per-request throughput but reduces sustained throughput.

Inter-request delays reduce contention so each request completes faster, but agents spend time idle. The net effect is a trade-off between per-request quality and sustained productivity:

Model Think=0ms Sustained Total Think=100ms Sustained Total Think=2000ms Sustained Total
llama3.2-1b 187.0 191.3 (+2%) 160.4 (-14%)
llama3.2-3b 154.8 157.3 (+2%) 156.7 (+1%)
qwen2.5-1.5b 175.7 254.1 (+45%) 179.4 (+2%)

Why this matters for practitioners: If your agents naturally have 100--500ms of processing time between LLM calls (reasoning, tool calls, data fetching), this is a hidden benefit: each individual LLM response will arrive faster than in a zero-delay system. But adding artificial delays to improve scheduling is counterproductive --- the duty-cycle loss outweighs the per-request gain (except for qwen2.5-1.5b, which shows model-specific benefits).

Finding 5: Heterogeneous model assignments show throughput differences, but confounds prevent isolating the cause.

homo_1b (233.9 tok/s) outperforms all_different (188.5 tok/s) by 24%, but this comparison is confounded by Ollama restart, warmup sequence, and thermal state.

Why this matters: We cannot confidently recommend or warn against mixed-model deployments based on this data. A controlled experiment is needed. The conservative recommendation is homogeneous model assignment.

Finding 6: GPU tok/s stays constant across N, confirming that contention manifests as queue wait.

Model GPU tok/s at N=1 GPU tok/s at N=8
llama3.2-1b ~207 ~207
llama3.2-3b ~114 ~114
qwen2.5-1.5b ~162 ~162

Why this matters: The GPU doesn't slow down when more agents are competing. Each token is generated at the same speed. The degradation in effective_tps comes entirely from queue wait time (agents waiting for the GPU to cycle through other requests). This confirms that the bottleneck is scheduling/serialization, not GPU compute degradation.

The two-layer bottleneck model: This finding establishes a clean separation between two performance layers:

  1. GPU layer (eval_ms): Token generation speed. Determined by model size, architecture, quantization, and memory bandwidth. Constant across N. Cannot be improved by agent-level optimization.
  2. Scheduling layer (wall_ms - eval_ms): Queue wait time. Determined by N, Ollama's scheduler, and HTTP overhead. Grows linearly with N. Can be reduced by lowering N, changing backends, or reducing per-request overhead.

This separation explains why TR129's Amdahl model works: the serial fraction s captures the scheduling layer overhead as a fraction of total work. At N=1, the scheduling layer is ~40% of wall_ms (the gap between effective_tps and GPU tok/s). At N=8, it grows to ~80% of wall_ms. The GPU layer is the irreducible "parallelizable" portion that Amdahl's Law preserves.

Diagnostic use: In production, monitoring both effective_tps (user-perceived) and GPU tok/s (Ollama-reported eval_ms) simultaneously reveals whether performance degradation is at the GPU layer (hardware issue, thermal throttle, VRAM pressure) or the scheduling layer (too many agents, backend overhead, network latency). If GPU tok/s drops, the GPU is the problem. If GPU tok/s is stable but effective_tps drops, the scheduling layer is saturated.


SS16: Conclusions

TR129 provides a systematic characterization of closed-loop multi-agent LLM inference scaling on consumer GPU hardware. The results answer the five research questions posed in SS1:

Q1: At what agent count does per-agent throughput collapse? The degradation is gradual, not a phase transition. Efficiency drops below 50% at N=3--4 (model-dependent). At N=8, agents get 17--20% of solo throughput. The degradation is smooth and well-described by Amdahl's Law.

Q2: Does total system throughput plateau? Yes, emphatically. The plateau is reached by N=2, with <3% gain from N=2 to N=8. The GPU is 98% utilized at N=2, leaving almost no idle time for additional agents to exploit.

Q3: How does think-time change the picture? Think-time improves per-request throughput (less contention per call) but reduces sustained throughput (duty-cycle loss). The trade-off is net negative for artificial delays, but agents with natural processing time between requests benefit from reduced per-call latency. qwen2.5-1.5b shows a model-specific anomaly where even 100ms of think-time yields 45% sustained throughput improvement.

Q4: Does model heterogeneity affect scaling? Phase 4 results are confounded and cannot isolate model-switching overhead. The conservative recommendation is homogeneous assignment.

Q5: What is the serial fraction? s = 0.39--0.54 (Amdahl's Law, R-squared 0.97--0.99). The serial fraction is highest for the smallest model (overhead-dominated) and lowest for the largest (compute-dominated). Average s = 0.46, yielding a theoretical maximum speedup of 2.2x.

16.1 Broader Research Context

TR129 is the 13th data-producing technical report in the Banterhearts series (after TR117--TR128). Each report has progressively deepened the understanding of LLM inference on consumer hardware. TR129's contribution is unique: it moves from "how fast is one request?" to "how fast are N requests when they compete?" This shift from single-request characterization to system-level performance is the bridge between benchmarking and production deployment.

Where TR129 sits in the research arc:

Phase Reports Question Answered
Characterization TR117--TR122 What is the baseline performance of inference backends?
Quality TR124--TR125 How does quantization affect accuracy?
Cross-platform TR126 Do results hold on Linux/Docker?
Context scaling TR127 How does performance change with context length?
Production load TR128 How does performance change under realistic arrival patterns?
Multi-agent TR129 How does performance scale with N concurrent agents?
GPU profiling TR130 (planned) What is the physical root cause of the serial fraction?

TR129 is the final characterization study before TR130 shifts to profiling. Together, TR128 (open-loop) and TR129 (closed-loop) provide the two canonical multi-request performance models. Any production deployment decision can now reference: TR125 for quantization choice, TR127 for context length limits, TR128 for server load capacity, and TR129 for multi-agent throughput budget.

16.2 What This Report Does NOT Prove

To prevent over-generalization, we explicitly state what TR129's data cannot support:

  1. "Multi-agent LLM systems are useless." False. The throughput plateau applies to single-GPU, single-backend deployments. Multi-GPU, multi-backend, and cloud-based architectures scale linearly by adding independent GPU instances. TR129 characterizes the per-GPU bottleneck, not the system-level architecture.

  2. "Ollama is a bad backend for multi-agent." Not proven. Ollama's serial fraction (s = 0.39--0.54) may be higher than vLLM or TGI, but without comparative data, we cannot claim other backends would do better. The serial fraction includes GPU-hardware components (memory bus contention) that no backend can eliminate.

  3. "Think-time never helps." Nuanced. Think-time does not help sustained throughput for llama3.2-1b/3b (duty-cycle loss). But the qwen2.5-1.5b anomaly (45% gain) shows model-specific benefits exist. And for all models, natural think-time improves per-request latency, which matters for interactive agent experiences.

  4. "These scaling laws apply to all GPUs." Not tested. The serial fractions are specific to RTX 4080 Laptop + Ollama. Higher-bandwidth GPUs (A100, H_100) will have different serial fractions, likely lower (see SS18.2 for predictions).

Summary Conclusions

  1. The plateau is N=2. Total system throughput effectively plateaus by N=2, with <3% additional gain from N=3 to N=8.
  2. Amdahl's Law provides actionable bounds. The serial fraction s gives a theoretical ceiling on total system throughput (1/s times single-agent).
  3. Fairness is maintained. Jain's index >= 0.997 at N=8. No agent starvation.
  4. Think-time is a trade-off, not a free lunch. Per-request improvement comes at the cost of sustained throughput.
  5. Heterogeneous results are confounded. Cannot isolate model-switching overhead from Phase 4 experimental design.
  6. Closed-loop differs from open-loop. TR128's open-loop results cannot predict closed-loop behavior because closed-loop bounds max concurrency to N.

16.3 The One-Number Summary

If there is a single number from TR129 that practitioners should remember, it is s = 0.46 (the average Amdahl serial fraction across the three models tested). This number encodes the fundamental constraint: on a single RTX 4080 Laptop GPU with Ollama, 46% of inference work is serial. This means:

  • Maximum achievable speedup from adding agents: 1/0.46 = 2.17x
  • Optimal agent count for throughput: N = 2 (captures ~95% of the 2.17x ceiling)
  • Per-agent efficiency at the optimal count: ~73% (each agent gets ~73% of its solo throughput)
  • Per-agent efficiency at N=4: ~37% (halved from optimal, for <2% more total throughput)

These numbers can be recalculated for any s using the Amdahl formula in SS17.4. As hardware and backends evolve, the serial fraction will change --- but the framework for reasoning about it remains constant.

16.4 Comparison to Classical Parallel Computing

In classical parallel computing, Amdahl's Law serial fractions below 1% are common for well-optimized parallel programs (e.g., matrix multiplication). TR129's s = 39--54% is extraordinarily high by those standards. Why?

The difference is that classical parallelism partitions a single task across multiple processors, while TR129's "parallelism" involves independent tasks (agent requests) sharing a serial resource (single-GPU Ollama). The GPU processes one token per sequence per forward pass --- it does not subdivide a single request across multiple hardware units the way a parallel program would. The "serial fraction" in TR129 is therefore not algorithmic serialization but resource contention: agents wait in line for the same GPU.

This reframing suggests that reducing s requires either (a) more GPU resources (multi-GPU), (b) smarter batching that can decode multiple sequences in a single forward pass more efficiently (vLLM's continuous batching), or (c) smaller per-request overhead (faster serving stack). All three are viable paths explored in the future work section.


SS17: Multi-Agent Design Guidance

Based on TR129 results, for a single RTX 4080 Laptop GPU (12 GB) with Ollama:

17.1 Agent Count

Goal Recommended N Rationale
Maximum total throughput 2 Captures 92--97% of plateau with 67--80% per-agent efficiency
Balanced throughput/latency 2--3 Good total throughput, acceptable per-agent latency
Maximum per-agent speed 1 Full solo throughput, no contention
Avoid 4+ <3% total throughput gain, but 2--4x worse per-agent latency

17.2 Model Assignment

Use homogeneous agents (same model per GPU). If mixed models are required:

  • Pre-load all models with OLLAMA_MAX_LOADED_MODELS=<count>
  • Expect potential throughput differences, but Phase 4 confounds prevent quantifying the overhead
  • Budget conservatively: assume mixed configs perform 10--20% worse than homogeneous

17.3 Think-Time Strategy

Do NOT add artificial delays between requests purely for scheduling gains. However:

  • If your agents naturally pause 100--500ms between requests (reasoning, tool calls), each LLM call will be faster
  • For qwen2.5-1.5b specifically, even 100ms natural delay yields substantial (45%) throughput gains
  • At 2000ms+ delays, agents are effectively running solo --- high per-request speed but low sustained throughput

17.4 Throughput Prediction

Use this formula to predict per-agent throughput at any N:

eta(N) = 1 / (s + (1-s) * N)
per_agent_tps(N) = baseline_tps * eta(N)
total_system_tps(N) = N * per_agent_tps(N)

Where baseline_tps and s come from:

Model baseline_tps s
llama3.2-1b 115.8 0.539
llama3.2-3b 79.2 0.387
qwen2.5-1.5b 102.7 0.455

Example: Predict llama3.2-3b with N=5 agents:

  • eta(5) = 1 / (0.387 + 0.613*5) = 1 / 3.452 = 0.290
  • per_agent = 79.2 * 0.290 = 23.0 tok/s
  • total = 5 * 23.0 = 114.8 tok/s
  • Observed: 21.7 tok/s per agent, 108.6 total (5.6% error)

17.5 Implications for Multi-Agent Framework Design

TR129's results inform architectural decisions for multi-agent LLM frameworks (e.g., LangGraph, AutoGen, CrewAI):

Agent pool sizing: Frameworks that spawn agents dynamically should cap the pool at N=2--3 per GPU. The common pattern of "spawn an agent per task" leads to N=10+ on a single GPU, where each agent gets <15% of solo throughput. Instead, use a task queue with a fixed agent pool.

Async vs sync agent architecture: In sync architectures (agent blocks until LLM responds), TR129's results apply directly: N agents = N concurrent requests. In async architectures (agent sends request and continues processing), the effective concurrency may be lower than N because some agents are processing between requests. The think-time results (SS7) model this case: async agents with 100--500ms of inter-request processing get better per-request throughput than sync agents.

Agent scheduling strategies:

  • Round-robin: Each agent takes turns sending requests. Ollama already provides FIFO fairness (J >= 0.997), so explicit round-robin adds overhead without improving fairness.
  • Priority-based: Give time-sensitive agents higher priority. This breaks fairness (J < 0.99) but may be necessary for mixed interactive/batch workloads.
  • Dynamic N: Monitor per-agent throughput and shed agents when eta drops below a threshold. TR129's Amdahl formula enables real-time prediction of what throughput each new agent will get.

Multi-GPU scaling: With K GPUs, each running Ollama, assign N=2 agents per GPU for maximum throughput. Total system throughput ~ K * 1.5 * single_agent_tps (since N=2 captures ~1.5x of solo). This is a linear scaling strategy that avoids the Amdahl bottleneck within each GPU.

17.6 Monitoring & Warning Signs

Track these metrics in production:

  • in_flight count: If consistently = N-1 (all but one agent waiting), the GPU is fully saturated
  • Per-agent wall_ms trend: If increasing over time, check for thermal throttling or memory pressure
  • Queue wait fraction: wall_ms - (prompt_eval_ms + eval_ms). If > 50% of wall_ms, consider reducing N
  • Jain's index: If J drops below 0.99, investigate whether variable prompt lengths or model switching is creating unfairness
  • GPU tok/s stability: If GPU-side decode rate drops (not just effective_tps), the GPU itself is degrading (thermal throttle, memory pressure). If GPU tok/s is stable but effective_tps drops, the bottleneck is queueing.

SS18: Limitations & Future Work

18.1 Limitations

  1. Single GPU only. Results are specific to RTX 4080 Laptop (12 GB). Data-center GPUs (A100, H_100) have different memory bandwidth, batch processing capabilities, and scheduler behavior. Serial fractions will differ.
  2. N <= 8. Higher agent counts not tested. Amdahl's Law extrapolation provides bounds but is uncertain beyond the tested range.
  3. Fixed prompt/completion sizes. All requests use 100--300 token prompts and 128-token completions. Real agents have variable sizes; long prompts or completions may shift the scaling curve (see TR127 for long-context effects).
  4. No context accumulation. Each request is independent (no multi-turn history). Real agents accumulate context across turns, increasing prompt size over time (see TR128 Phase 5).
  5. Ollama-specific. Results depend on Ollama's scheduler and continuous batching. Other backends (vLLM, TGI, TensorRT-LLM) may scale differently, especially those designed for high-concurrency serving with PagedAttention.
  6. Windows only. OLLAMA_MAX_LOADED_MODELS behavior and GPU scheduling may differ on Linux (see TR126 for cross-platform comparison).
  7. Synthetic prompts. Real agent prompts may have different length distributions and content characteristics.
  8. Phase 4 confounded. Heterogeneous and model-switching results are confounded by experimental design. Cannot isolate MAX_LOADED_MODELS effect.

18.2 What Would Change on Different Hardware?

TR129's results are specific to the RTX 4080 Laptop (12 GB, 256 GB/s bandwidth). Here's what we can predict about other hardware based on the Amdahl framework:

Hardware Memory BW Expected Serial Fraction Predicted Plateau Reasoning
RTX 4080 Laptop (this study) 256 GB/s s = 0.39--0.54 N=2 Measured
RTX 4090 Desktop 1,008 GB/s s ~ 0.20--0.35 N=3--4 4x bandwidth -> decode is faster -> overhead fraction larger, BUT Ollama HTTP overhead stays constant -> net effect uncertain
A100 80GB 2,039 GB/s s ~ 0.15--0.30 N=3--5 Higher bandwidth + more VRAM -> can batch more effectively. vLLM would further reduce s.
H_100 80GB 3,350 GB/s s ~ 0.10--0.25 N=4--8 Highest bandwidth + optimized batching -> serial fraction dominated by HTTP, not GPU
CPU (no GPU) 50 GB/s s ~ 0.60--0.80 N=1--2 Decode so slow that any overhead is negligible... but concurrency is limited by CPU cores

Key insight: As GPU bandwidth increases, the Amdahl serial fraction should decrease because decode becomes a larger fraction of total work (less relative overhead). But the HTTP/scheduling overhead is hardware-independent, so there's a floor below which s cannot drop without changing the backend. This is why TR130 (profiling to decompose s) and TR132 (testing vLLM/TGI with lower-overhead serving) are the logical next steps.

Caution: These predictions assume Ollama's scheduling behavior is similar across hardware. vLLM or TGI may have fundamentally different scaling characteristics (e.g., PagedAttention enables true concurrent batch decode, which could break the Amdahl model entirely).

18.3 Future Work

  1. TR130 (GPU Profiling): Use NVIDIA Nsight to profile the GPU during N-agent operation, identifying the physical source of the serial fraction (kernel launch overhead? memory bus contention? scheduler serialization?). The residual analysis (SS5.4) provides specific hypotheses to test.
  2. vLLM/TGI comparison: Test the same N-agent scaling with backends designed for high-concurrency serving. If vLLM's PagedAttention and continuous batching yield lower serial fractions, this has direct deployment implications. Specifically: does the Amdahl model still hold, or does true batch decode break the single-serial-fraction assumption?
  3. Variable request sizes: Test with realistic agent workload distributions (short tool calls mixed with long context updates) to see if request size variability changes the scaling picture. The current uniform 100--300 token prompts may underestimate contention effects from size heterogeneity.
  4. Controlled Phase 4 redo: Re-run the heterogeneous experiment with proper controls (same Ollama instance, same warmup, only model assignment varied) to isolate model-switching overhead. This is the lowest-hanging fruit for follow-up work.
  5. Long-context multi-agent: Combine TR127's context-length sweep with TR129's N-agent scaling. At 16K+ context, VRAM pressure (TR127) and multi-agent contention (TR129) interact in ways neither study captures alone.
  6. qwen2.5-1.5b anomaly investigation: Profile the GQA memory access pattern at think=0 vs think=100ms to confirm or reject the memory bank conflict hypothesis (SS7.7).

SS19: Reproducibility

# Prerequisites: Ollama installed, 3 models pulled
ollama pull llama3.2:1b
ollama pull qwen2.5:1.5b
ollama pull llama3.2:3b

# Run full pipeline (collect + analyze + report)
python research/tr129/run.py -v

# Re-analyze existing data only
python research/tr129/run.py --analyze-only

# Regenerate report from existing analysis
python research/tr129/generate_report.py -v

Environment Snapshot

Property Value
Platform Windows 11 (Build 26200)
Python 3.13.1
GPU NVIDIA GeForce RTX 4080 Laptop GPU
VRAM 12.88 GB
CUDA 12.8
PyTorch 2.8.0+cu128
Compute Capability 8.9 (Ada Lovelace)

Key Artifacts

Artifact Path
Raw metrics research/tr129/results/20260225_213619/metrics.csv
Analysis JSON research/tr129/results/20260225_213619/analysis.json
Auto-generated report research/tr129/results/20260225_213619/report.md
Publish-ready report PublishReady/reports/Technical_Report_129.md
Experiment config research/tr129/config.yaml
Core executor research/tr129/shared/agent_executor.py

Appendix A: Environment

Property Value
cuda_available True
cuda_version 12.8
cudnn_version 91002
gpu_compute_capability [8, 9]
gpu_memory_gb 12.88
gpu_name NVIDIA GeForce RTX 4080 Laptop GPU
in_docker False
inductor_available True
platform Windows-11-10.0.26200-SP0
platform_system Windows
python_version 3.13.1
torch_version 2.8.0+cu128
triton_available False

Appendix B: Configuration

experiment: tr129
gpu_poll_interval_s: 1.0
max_new_tokens: 128
models:
  - name: llama3.2-1b
    ollama_tag: llama3.2:1b
    params_m: 1200
  - name: qwen2.5-1.5b
    ollama_tag: qwen2.5:1.5b
    params_m: 1500
  - name: llama3.2-3b
    ollama_tag: llama3.2:3b
    params_m: 3200
ollama_timeout_s: 120
ollama_url: http://localhost:11434
output_dir: research/tr129/results
phase1:
  prompt_tokens_low: 100
  prompt_tokens_high: 300
  requests_per_model: 50
phase2:
  n_agent_levels: [1, 2, 3, 4, 5, 6, 7, 8]
  requests_per_agent: 30
  prompt_tokens_low: 100
  prompt_tokens_high: 300
  cooldown_between_configs_s: 5
phase3:
  n_agents: 4
  think_times_ms: [0, 100, 500, 2000]
  requests_per_agent: 30
  prompt_tokens_low: 100
  prompt_tokens_high: 300
phase4:
  n_agents: 4
  requests_per_agent: 30
  prompt_tokens_low: 100
  prompt_tokens_high: 300
  max_loaded_models: 3

Appendix C: Data Summary

C.1 Measurement Counts by Phase and Model

Phase Model Configs Rows per Config Total Rows Status
1 (Baseline) llama3.2-1b 1 (N=1) 50 50 100% OK
1 (Baseline) llama3.2-3b 1 (N=1) 50 50 100% OK
1 (Baseline) qwen2.5-1.5b 1 (N=1) 50 50 100% OK
2 (Scaling) llama3.2-1b 8 (N=1--8) 30*N 1,080 100% OK
2 (Scaling) llama3.2-3b 8 (N=1--8) 30*N 1,080 100% OK
2 (Scaling) qwen2.5-1.5b 8 (N=1--8) 30*N 1,080 100% OK
3 (Think-Time) llama3.2-1b 4 (think=0--2000) 4*30 480 100% OK
3 (Think-Time) llama3.2-3b 4 (think=0--2000) 4*30 480 100% OK
3 (Think-Time) qwen2.5-1.5b 4 (think=0--2000) 4*30 480 100% OK
4 (Heterogeneous) Mixed 4 (homo, mixed_small, mixed_size, all_different) 4*30 480 100% OK
Total 5,310 100% OK

C.2 Key Numerical Results Summary

Metric llama3.2-1b llama3.2-3b qwen2.5-1.5b
Solo eff. tok/s 115.8 79.2 102.7
Solo GPU tok/s 206.8 114.3 162.2
Total tok/s at N=2 185.3 106.8 149.2
Total tok/s at N=8 187.9 109.4 152.8
eta(2) 80.0% 67.5% 72.7%
eta(8) 20.3% 17.3% 18.6%
Amdahl s 0.539 0.387 0.455
Amdahl R-sq 0.970 0.993 0.985
N* (50% threshold) 4 3 3
Jain's J at N=8 0.9975 0.9994 0.9987
Best sustained think-time 100ms 100ms 100ms
Sustained gain at best think +2.3% +1.6% +44.6%

Appendix D: Glossary

Term Definition
Closed-loop Each agent waits for response before sending next request. Max concurrency = N.
Open-loop Requests arrive according to an external process (e.g., Poisson). Concurrency unbounded.
N Number of concurrent closed-loop agents
eta(N) Per-agent efficiency: effective throughput at N / effective throughput at N=1
eff. tok/s Effective tokens per second: completion_tokens / wall_ms * 1000. Includes queue wait.
GPU tok/s GPU-side tokens per second: completion_tokens / eval_ms * 1000. Decode only.
Jain's index Fairness metric: J(N) = (sum x)^2 / (N * sum x^2). Range [1/N, 1]. 1.0 = perfect fairness.
N* Agent count where eta first drops below 50%
Amdahl's Law eta(N) = 1 / (s + (1-s)*N), where s = serial fraction
Serial fraction (s) Portion of work inherently sequential (GPU scheduler, memory bus, HTTP overhead)
in_flight Number of agents with active requests at a given moment
Think-time Delay between receiving response and sending next request
Duty cycle Fraction of time an agent is actively waiting for a response: wall_ms / (wall_ms + think_ms)
Sustained throughput Token production rate including idle time between requests
Burst throughput Token production rate during active periods only (excludes idle time)
Overlap ratio Fraction of total time with multiple concurrent active requests
Serialization degree Fraction of time the GPU processes requests sequentially (idle between requests)
OLLAMA_MAX_LOADED_MODELS Env var controlling how many models Ollama keeps loaded in VRAM simultaneously
FIFO First-in-first-out scheduling: requests processed in submission order
Continuous batching GPU processes multiple sequences in a single forward pass

References

  1. TR108--TR113: Single-model inference characterization series
  2. TR114v2: 2-Agent Efficiency Study (98.28% efficiency, ~42 tok/s per agent vs 114 solo)
  3. TR121v1: Comprehensive Scaling Analysis (sub-linear inference cost scaling with model size)
  4. TR123: KV-Cache Production Economics
  5. TR125: Quantization Decision Matrix
  6. TR126: Linux/Triton Cross-Platform Validation (1,871 lines, 3-phase, ~25,400 measurements)
  7. TR127: Long-Context Scaling (1,499 lines, 1,144 measurements, VRAM spillover discovery)
  8. TR128: Production Workload Characterization (1,539 lines, open-loop, 5-phase, 3,172 measurements)
  9. Jain, R., Chiu, D., and Hawe, W. (1984). A Quantitative Measure of Fairness and Discrimination for Resource Allocation in Shared Computer Systems. DEC-TR-301.
  10. Amdahl, G. M. (1967). Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities. AFIPS Conference Proceedings, 30, 483--485.
  11. Kwon, W., et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP 2023. (vLLM reference for future backend comparison)
  12. Yu, G., et al. (2022). Orca: A Distributed Serving System for Transformer-Based Generative Models. OSDI 2022. (Continuous batching reference)

End of Technical Report 129 Lines: ~1,500 | Measurements: 5,310 | Models: 3 | Phases: 4 Core result: Amdahl's Law with s = 0.39--0.54 describes closed-loop N-agent scaling on consumer GPU.