Technical Report 130: Serving Stack Benchmarking

Ollama vs vLLM vs TGI -- Multi-agent throughput scaling comparison

Field	Value
TR Number	130
Project	Banterhearts LLM Performance Research
Date	2026-02-26
Author	Research Team
Report Type	Cross-backend serving stack benchmarking (4-phase, 3 backends, 4797 measurements)
Test Duration	~3 hours
Status	Complete -- All 4 phases delivered
Run ID	`20260226_125833`
Related Work	TR129 (N-Agent Scaling Laws), TR128 (Production Workload Characterization)
Depends On	TR129 (Ollama serial fraction cross-validation)

Abstract

TR129 found that Ollama exhibits Amdahl serial fractions s=0.39--0.54 on an RTX 4080 Laptop GPU (12 GB VRAM), meaning up to 54% of inference is serialized under multi-agent concurrency. The critical unanswered question: is this an Ollama scheduling bottleneck or an inherent GPU physics constraint? If alternative serving stacks achieve lower serial fractions with identical hardware, the bottleneck is the serving stack, not the silicon.

TR130 answers this question with 4,797 measurements across 3 serving backends (Ollama, vLLM, TGI), 3 models (llama3.2-1b, llama3.2-3b, qwen2.5-1.5b), and 4 phases: environment validation, single-agent baseline, N-agent scaling (N={1,2,4,8}), and time-to-first-token (TTFT) comparison. Each backend serves the same models on the same GPU under identical closed-loop workloads. All 9 backend x model combinations passed validation; zero were skipped.

Core finding: The serving stack is the bottleneck, and it is Ollama that suffers. The three backends follow fundamentally different degradation curves under concurrency. Ollama's sequential request scheduling maps to Amdahl's Law (R^2=0.957--0.987), producing steep efficiency collapse: at N=8 agents, each agent retains only 16--17% of its standalone throughput. vLLM and TGI, with continuous batching and PagedAttention, follow a power-law degradation (R^2=0.988--0.996) that is far more gradual: at N=8, each agent retains 46--66% of standalone throughput.

The practical consequence is dramatic. For llama3.2-1b at N=8 concurrent agents, vLLM delivers 559 total tok/s versus Ollama's 248 total tok/s -- a 2.25x advantage -- despite Ollama being 18% faster at N=1 (177.7 vs 150.7 tok/s). The vLLM crossover point occurs between N=2 and N=4 for all three models. TGI overtakes Ollama at N=4 only for llama3.2-1b (329 vs 259 tok/s) but leads for all three models at N=8. Beyond N=2, practitioners should evaluate switching to vLLM or TGI.

Methodological caveat on the Amdahl serial fraction: When Amdahl's model is force-fitted to vLLM/TGI data, it produces artificially high serial fractions (s=0.81--0.92) because these backends do not degrade via Amdahl's mechanism. Their best-fit model is power law (eta propto N^{-alpha}, alpha=0.17--0.35), which has no "serial fraction" parameter. Comparing Amdahl serial fractions across backends with different degradation mechanisms is a category error. This report uses raw efficiency eta(N), total throughput, and saturation points as the primary cross-backend comparison, with the Amdahl serial fraction reserved for Ollama where the model genuinely fits.

Quantization note: Ollama serves Q4_0 quantized models while vLLM and TGI serve FP16. Absolute throughput differences from quantization are expected and do not affect the scaling comparison: eta(N) normalizes each backend against its own N=1 baseline.

Executive Summary

Key Findings

Ollama wins at N=1. Q4_0 quantization gives Ollama 1.2--2.6x higher single-agent throughput: 198 tok/s (qwen2.5-1.5b) vs 103 tok/s (vLLM) and 75 tok/s (TGI). This is expected -- 4-bit weights are 4x smaller, reducing memory bandwidth pressure.
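vLLM wins decisively at N>=4, and TGI joins it by N=8. At N=8 agents, vLLM delivers 559 tok/s total vs Ollama's 248 tok/s for llama3.2-1b -- a 2.25x throughput advantage despite being slower at N=1. The vLLM crossover point is between N=2 and N=4 for all three models. TGI overtakes Ollama at N=4 only for llama3.2-1b (329 vs 259 tok/s) and leads for all three models at N=8.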
The backends follow different scaling laws. Ollama degrades via Amdahl's Law (R^2=0.957--0.987): sequential request processing creates a genuine serial fraction. vLLM and TGI degrade via power law (R^2=0.988--0.996): continuous batching enables overlapping execution with gradual resource contention.
Ollama efficiency collapses at N=8. Each agent retains only 16--17% of N=1 throughput (eta~0.16). vLLM agents retain 46--65% (eta~0.56). TGI agents retain 48--66% (eta~0.58). The difference is 3--4x in per-agent efficiency.
Ollama saturates at N=4; vLLM/TGI never saturate within tested range. Ollama drops below 50% efficiency at N=4 for all models. vLLM and TGI remain above 50% efficiency at N=8 for 2 of 3 models, suggesting useful scaling continues to N=16+.
TTFT is 6--8x faster on vLLM/TGI. Ollama: 163--194 ms. vLLM: 23--32 ms. TGI: 22--35 ms. Docker-based backends start streaming tokens within 35 ms, versus Ollama's 160+ ms -- a user-perceptible gap (Cohen's d > 13 for all pairwise comparisons).
All backends are near-perfectly fair. Jain's fairness index >= 0.996 across all backends at all concurrency levels. No backend starves individual agents under contention.
Zero cold-start effects detected. No phase x backend combination shows first-3-request latency more than 1.07x the steady-state mean. The warmup protocol successfully eliminates cold-start artifacts.
Data quality is exceptional. 98.08% success rate (4,705/4,797 ok). Only 92 HTTP 424 errors (TGI overload at high N). Outlier rate: 0.0--0.2% across all backends. Zero outliers for Ollama and vLLM llama3.2-1b.
The Amdahl serial fraction comparison is misleading. Force-fitting Amdahl to vLLM/TGI produces s=0.81--0.92, but these backends don't follow Amdahl mechanics. The comparison overstates Ollama's relative scaling quality. Raw eta(N) and total throughput are the correct metrics.

N=1 Baseline Throughput

Backend	Model	Quant	Mean TPS	95% CI	CV%	Wall ms
ollama	llama3.2-1b	Q4_0	177.7	[175.0, 180.4]	5.3	678.9
ollama	llama3.2-3b	Q4_0	130.1	[129.7, 130.4]	1.0	984.2
ollama	qwen2.5-1.5b	Q4_0	198.3	[196.5, 200.2]	3.3	637.9
vllm	llama3.2-1b	FP16	150.7	[149.8, 151.6]	2.0	849.7
vllm	llama3.2-3b	FP16	60.9	[60.8, 60.9]	0.2	2103.3
vllm	qwen2.5-1.5b	FP16	102.6	[101.3, 103.9]	4.4	1250.0
tgi	llama3.2-1b	FP16	125.2	[124.5, 125.9]	1.9	1023.0
tgi	llama3.2-3b	FP16	49.4	[48.4, 50.5]	7.5	2602.3
tgi	qwen2.5-1.5b	FP16	75.0	[74.4, 75.5]	2.5	1708.5

N=8 Total Throughput (the multi-agent metric that matters)

Backend	llama3.2-1b	llama3.2-3b	qwen2.5-1.5b
vllm	559 tok/s	319 tok/s	457 tok/s
tgi	483 tok/s	261 tok/s	362 tok/s
ollama	248 tok/s	162 tok/s	259 tok/s
vllm / ollama ratio	2.25x	1.97x	1.76x

Claim Validation

#	Claim	Evidence	Status
1	Serving stack affects multi-agent scaling	eta(8) ranges 0.16 (Ollama) to 0.66 (TGI); Cohen's d > 3	Confirmed
2	vLLM/TGI scale better than Ollama	Total throughput 1.76--2.25x higher at N=8	Confirmed
3	Backends follow different scaling laws	Ollama=Amdahl (R^2=0.96+), vLLM/TGI=power law (R^2=0.99+)	Confirmed
4	Ollama is fastest at N=1	Q4_0 gives 1.2--2.6x throughput advantage	Confirmed
5	TTFT is faster on Docker backends	6--8x faster (22--35 ms vs 163--194 ms)	Confirmed
6	All backends are fair under contention	Jain's index >= 0.996 at all N	Confirmed
7	No cold-start effects after warmup	Max ratio 1.07x, no detections	Confirmed
8	Amdahl serial fraction is valid cross-backend	vLLM/TGI best fit is power law, not Amdahl	Refuted

Key Decisions for Practitioners

For N=1 (single-agent): Use Ollama with Q4_0. It delivers the highest absolute throughput due to quantized weights requiring less memory bandwidth. With a single agent there is no queue contention to worry about.
For N>=4 (multi-agent production): Switch to vLLM. It delivers 1.76--2.25x total throughput at N=8, retains 46--65% per-agent efficiency, and provides 6x faster TTFT. FP16 precision may also preserve output quality better than Q4_0.
For N=2--3 (light multi-agent): The choice depends on your priority. Ollama still delivers higher per-agent throughput at N=2 in absolute terms (Q4_0 advantage), but vLLM overtakes in total throughput by N=4.
For streaming/interactive use: vLLM or TGI regardless of N. TTFT of 22--35 ms vs 163--194 ms is a user-perceptible difference (Cohen's d > 13).
Do not use Amdahl serial fractions to compare backends. The metric is only valid when all systems follow Amdahl mechanics. Use raw eta(N) or total throughput instead.

How to Read This Report

Time	Reading Path
5 min	Abstract -> Executive Summary -> SS19 Conclusions
15 min	Add SS5 (throughput curves), SS6 (efficiency), SS8 (cross-backend comparison)
45 min	Full report, SS1--SS19 + Appendices

When to Use This Report

Scenario	How This Report Helps
Choosing a serving backend for multi-agent deployment	eta(N) and total-throughput comparison shows which backend degrades least
Deciding between Ollama and vLLM/TGI	Head-to-head baseline + scaling + TTFT data
Capacity planning for N concurrent agents	eta(N) curves + saturation points per backend
Evaluating whether to switch from Ollama	Quantifies the multi-agent efficiency gap
Understanding if GPU or software limits concurrency	Cross-backend eta(N) curves isolate the bottleneck

SS1. Introduction and Motivation
SS2. Methodology
SS3. Phase 1 -- Environment Validation
SS4. Phase 2 -- Single-Agent Baseline
SS5. Phase 3 -- N-Agent Throughput Curves
SS6. Efficiency Curves eta(N)
SS7. Scaling Law Fitting
SS8. Cross-Backend Serial Fraction Comparison
SS9. Saturation Detection
SS10. Fairness Analysis
SS11. Phase 4 -- TTFT Comparison
SS12. Queue Dynamics
SS13. VRAM Usage
SS14. TR129 Cross-Validation
SS15. Cold-Start Detection
SS16. Outlier Analysis
SS17. Backend-Native Metrics
SS17b. Statistical Power and Distribution Analysis
SS18. Limitations and Future Work
SS19. Conclusions
Appendix A: Configuration
Appendix B: GPU Telemetry
Appendix C: Data Summary
Appendix D: Glossary
References

SS1. Introduction and Motivation

SS1.1 Background

Multi-agent LLM systems deploy N autonomous agents that concurrently issue inference requests to a shared serving backend. TR129 established that Ollama exhibits Amdahl serial fractions s=0.39--0.54 on an RTX 4080 Laptop GPU, meaning that multi-agent efficiency drops substantially under concurrency.

But this finding leaves a critical question unanswered: is the serial bottleneck in Ollama's request scheduling, or in the GPU hardware itself?

If the serial fraction is a property of the GPU (memory bandwidth, compute pipeline serialization), then no software optimization will help. If it is a property of the serving stack (request queuing, KV-cache management, batch scheduling), then switching backends could dramatically improve multi-agent throughput.

SS1.2 Experimental Design

TR130 isolates the variable: the serving stack. All other factors are held constant:

Factor	Controlled?	Value
GPU hardware	Yes	RTX 4080 Laptop 12 GB
Models	Yes	llama3.2-1b, qwen2.5-1.5b, llama3.2-3b
Workload pattern	Yes	Closed-loop, 128 max tokens
Concurrency levels	Yes	N={1,2,4,8}
Serving backend	Variable	Ollama, vLLM, TGI
Quantization	Partially	Ollama=Q4_0, vLLM/TGI=FP16

The quantization difference (Q4_0 vs FP16) affects absolute throughput but not the scaling efficiency eta(N), which normalizes against each backend's own N=1 baseline. Cross-backend eta(N) comparisons therefore remain valid.

SS1.3 Research Questions

TR130 is designed to answer five specific questions:

Q1: Does the serving stack affect multi-agent scaling efficiency? If eta(N) differs across backends on the same GPU, the software matters.
Q2: Which backend delivers the most total throughput at high concurrency? The metric that matters for production: N x per_agent_tps at N=8.
Q3: Do all backends follow the same scaling law? If one backend follows Amdahl while another follows power law, the degradation mechanisms are fundamentally different.
Q4: At what N does the best backend overtake Ollama in total throughput? Ollama starts faster (Q4_0) but may lose the lead under contention.
Q5: Is TTFT independent of throughput scaling? A backend could be slow in throughput but fast in time-to-first-token, or vice versa.

SS1.4 Literature Gap

Published LLM serving benchmarks (Patel et al. 2024, Kwon et al. 2023) compare backends under open-loop arrival conditions (Poisson arrivals at specified rates). Multi-agent systems are closed-loop: each agent sends one request, waits for completion, then sends the next. This fundamental difference means open-loop benchmarks overestimate queuing depth and underestimate per-request contention. TR130 is the first cross-backend comparison under closed-loop multi-agent workloads on consumer GPU hardware.
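The closed-loop pattern described above can be sketched as follows. This is a minimal illustration, not the TR130 harness: `run_closed_loop`, `FakeBackend`, and its `send` method are hypothetical names, and the backend is an in-process stub rather than an HTTP call. The property it demonstrates is that a closed-loop workload never has more than N requests outstanding, because each agent blocks on its own previous request.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class FakeBackend:
    """Hypothetical in-process stand-in for a serving backend (no network)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.in_flight = 0
        self.max_in_flight = 0
        self.completed = 0

    def send(self, agent_id, request_idx):
        with self._lock:
            self.in_flight += 1
            self.max_in_flight = max(self.max_in_flight, self.in_flight)
        # A real client would block here until the HTTP response arrives.
        with self._lock:
            self.in_flight -= 1
            self.completed += 1
        return (agent_id, request_idx)


def run_closed_loop(n_agents, requests_per_agent, send_request):
    """Run N agents; each sends its next request only after the last returns."""

    def agent(agent_id):
        return [send_request(agent_id, i) for i in range(requests_per_agent)]

    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        # pool.map preserves agent order in the returned list.
        return list(pool.map(agent, range(n_agents)))


backend = FakeBackend()
results = run_closed_loop(n_agents=4, requests_per_agent=5, send_request=backend.send)
```

An open-loop generator would instead submit requests on a timer regardless of completions, so its outstanding-request count is unbounded when the backend falls behind.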

SS1.5 Why Three Backends

Backend	Scheduling	Batching	KV-Cache	Expected Scaling
Ollama	Sequential FIFO	None (one request at a time)	Implicit (ggml)	Amdahl -- strict serial fraction
vLLM	Continuous batching	Dynamic in-flight batching	PagedAttention (virtual memory)	Sub-linear but graceful -- resource contention
TGI	Continuous batching	Token-level scheduling	Paged blocks	Similar to vLLM -- different implementation

Ollama is the null hypothesis: a simple sequential server. vLLM represents the state of the art in LLM serving efficiency. TGI provides a second continuous-batching implementation to distinguish "continuous batching in general" from "vLLM specifically."

SS2. Methodology

SS2.1 Backends

Backend	API	Quantization	Deployment	Key Feature
Ollama	`/api/generate`	Q4_0	Native Windows	Timing in response (ns)
vLLM	`/v1/completions`	FP16	Docker GPU	PagedAttention, continuous batching
TGI	`/generate`	FP16	Docker GPU	`details=true` for per-request timing

Only one backend runs at a time. Between backend switches, the previous server is fully stopped and the GPU is allowed to cool.

SS2.2 Metrics

Metric	Formula	Availability
`effective_tps`	`completion_tokens / wall_ms * 1000`	All backends
`gpu_tokens_per_s`	`completion_tokens / decode_ms * 1000`	Ollama, TGI
`prefill_ms`	Backend-native prefill time	Ollama, TGI
`decode_ms`	Backend-native decode time	Ollama, TGI
`ttft_ms`	Time to first token (streaming)	All backends

effective_tps is the primary metric -- it captures the throughput each agent actually experiences, including all queue wait, scheduling overhead, and network latency.
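The two throughput formulas in the table translate directly into code. The sketch below is illustrative only; the function names mirror the metric names, and the sample inputs (120 tokens, 678.9 ms wall, 459.5 ms decode) are assumed values chosen to resemble an Ollama llama3.2-1b request, not rows from the dataset.

```python
def effective_tps(completion_tokens: int, wall_ms: float) -> float:
    """Tokens per second as the agent experiences it (queue wait included)."""
    if wall_ms <= 0:
        raise ValueError("wall_ms must be positive")
    return completion_tokens / wall_ms * 1000


def gpu_tokens_per_s(completion_tokens: int, decode_ms: float) -> float:
    """Tokens per second over backend-reported decode time only."""
    if decode_ms <= 0:
        raise ValueError("decode_ms must be positive")
    return completion_tokens / decode_ms * 1000


# Assumed example request: 120 tokens, 678.9 ms wall clock, 459.5 ms decode.
agent_view = effective_tps(120, 678.9)
gpu_view = gpu_tokens_per_s(120, 459.5)
```

Because wall time always includes decode time, `gpu_tokens_per_s` is an upper bound on `effective_tps` for the same request.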

SS2.3 Statistical Methods

95% CI via t-distribution (per-backend x per-model)
Bootstrap CIs (1,000 resamples) on Amdahl serial fractions
Shapiro-Wilk normality testing on wall_ms distributions
Cohen's d for cross-backend pairwise effect sizes
Curve fitting: Amdahl, power law, exponential, logistic (4 models)
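As an illustration of the bootstrap step, the following is a minimal percentile-bootstrap sketch using only the standard library. It resamples a generic statistic (here the mean) rather than refitting an Amdahl curve, and the function name, the seed, and the sample latencies are assumptions made for the example, not values from the dataset.

```python
import random
import statistics


def bootstrap_ci(samples, stat=statistics.mean, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample with replacement, take the tails."""
    rng = random.Random(seed)
    n = len(samples)
    estimates = sorted(
        stat(rng.choices(samples, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)           # 2.5th percentile index
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1   # 97.5th percentile index
    return estimates[lo_idx], estimates[hi_idx]


# Hypothetical wall_ms samples for one backend x model cell.
wall_ms = [849.0, 851.2, 848.5, 850.1, 852.3, 847.9, 850.6, 849.4]
ci_low, ci_high = bootstrap_ci(wall_ms)
```

To bootstrap a fitted parameter instead, `stat` would be replaced by a function that refits the curve on each resample and returns the parameter of interest.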

SS2.4 Four Phases

Phase	Purpose	Approximate Rows
P1: Validation	Confirm Docker GPU, API format, model loading	~27
P2: Baseline	N=1 reference throughput per backend	~450
P3: Scaling	N={1,2,4,8} closed-loop agents (CORE)	~4,050
P4: TTFT	Streaming time-to-first-token	~270

SS3. Phase 1 -- Environment Validation

Phase 1 sent 27 validation requests across all backend x model combinations to confirm:

Docker GPU passthrough works for vLLM and TGI containers
Each model loads and generates coherent text
API response parsing extracts correct token counts and timing
Timing fields match expected availability per backend

All 9 backend x model combinations passed validation and proceeded to Phase 2. Failed combinations would have been skipped with logged errors; none occurred in this run.

SS4. Phase 2 -- Single-Agent Baseline

SS4.1 Absolute Throughput (N=1)

Backend	Model	Quant	N	Mean TPS	95% CI	CV%	Wall ms
ollama	llama3.2-1b	Q4_0	1	177.7	[175.0, 180.4]	5.3	678.9
ollama	llama3.2-3b	Q4_0	1	130.1	[129.7, 130.4]	1.0	984.2
ollama	qwen2.5-1.5b	Q4_0	1	198.3	[196.5, 200.2]	3.3	637.9
tgi	llama3.2-1b	FP16	1	125.2	[124.5, 125.9]	1.9	1023.0
tgi	llama3.2-3b	FP16	1	49.4	[48.4, 50.5]	7.5	2602.3
tgi	qwen2.5-1.5b	FP16	1	75.0	[74.4, 75.5]	2.5	1708.5
vllm	llama3.2-1b	FP16	1	150.7	[149.8, 151.6]	2.0	849.7
vllm	llama3.2-3b	FP16	1	60.9	[60.8, 60.9]	0.2	2103.3
vllm	qwen2.5-1.5b	FP16	1	102.6	[101.3, 103.9]	4.4	1250.0

SS4.2 Observations

Ollama is 1.2--2.6x faster at N=1. The Q4_0 quantization advantage is consistent across all models. For llama3.2-1b: Ollama 177.7 vs vLLM 150.7 vs TGI 125.2 tok/s. The largest ratios are against TGI: 2.63x for llama3.2-3b (130.1/49.4) and 2.64x for qwen2.5-1.5b (198.3/75.0). Against vLLM the ratio peaks at 2.14x for llama3.2-3b (130.1/60.9), where FP16 weights strain the 12 GB VRAM.
vLLM is 1.2--1.4x faster than TGI at N=1. Both serve FP16, so this gap reflects implementation efficiency rather than precision: at zero contention, vLLM's CUDA kernels and PagedAttention appear to carry less overhead than TGI's equivalents. For llama3.2-1b: 150.7 vs 125.2 tok/s (1.20x). For qwen2.5-1.5b: 102.6 vs 75.0 tok/s (1.37x).
vLLM llama3.2-3b has near-zero variance (CV=0.2%). This is remarkably consistent -- 50 requests spanning only a 23 ms range (2097--2120 ms). This suggests vLLM's scheduler produces deterministic timing when there is no contention, likely because PagedAttention eliminates memory fragmentation randomness.
TGI llama3.2-3b has the highest variance (CV=7.5%). The same model on TGI shows a CV roughly 37x higher than on vLLM (7.5% vs 0.2%). Combined with TGI's non-normal wall_ms distribution (Shapiro-Wilk p < 0.001), this suggests occasional scheduling hiccups even at N=1.
Ollama exposes native prefill/decode timing. Ollama's llama3.2-1b shows prefill=7.1 ms, decode=459.5 ms at N=1, meaning only 1.5% of GPU time is prefill. The remaining 212 ms gap between (prefill+decode)=467 ms and wall_ms=679 ms is Ollama's HTTP/scheduling overhead -- 31% of total request time at N=1.
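The prefill, decode, and overhead shares quoted for Ollama llama3.2-1b follow from simple arithmetic on the three timings. The helper below is a hypothetical illustration (the function name and dictionary keys are not part of any backend API); the inputs are the values quoted in the last observation above.

```python
def overhead_breakdown(prefill_ms: float, decode_ms: float, wall_ms: float) -> dict:
    """Split one request's wall time into GPU time and everything else."""
    gpu_ms = prefill_ms + decode_ms
    overhead_ms = wall_ms - gpu_ms
    return {
        "gpu_ms": gpu_ms,
        "prefill_share_of_gpu": prefill_ms / gpu_ms,
        "overhead_ms": overhead_ms,
        "overhead_share_of_wall": overhead_ms / wall_ms,
    }


# Ollama llama3.2-1b at N=1: prefill=7.1 ms, decode=459.5 ms, wall=678.9 ms.
breakdown = overhead_breakdown(7.1, 459.5, 678.9)
```

With these inputs the prefill share is about 1.5% of GPU time and the non-GPU overhead is about 212 ms, or 31% of wall time, matching the figures in the text.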

SS4.3 Cross-Backend Interpretation

Ollama serves Q4_0 quantized weights, which are ~4x smaller than FP16. This means Ollama has lower memory bandwidth pressure (less data to transfer per token), lower compute requirements (INT4 ops vs FP16 ops), and correspondingly higher absolute tok/s at N=1. This is expected and correct.

Why the baseline difference does not affect scaling comparison: the efficiency eta(N) = TPS(N) / TPS(1) normalizes each backend against its own N=1 reference. A backend with 50 tok/s at N=1 and 25 tok/s at N=2 has the same eta(2)=0.5 as one with 100 tok/s at N=1 and 50 tok/s at N=2. However, for total throughput comparisons (the metric practitioners care about), the baseline matters -- Ollama's Q4_0 head start must be overcome by the competing backend's superior scaling before switching is worthwhile.

Why vLLM beats TGI at N=1: Both use continuous batching, but at N=1 this doesn't matter (there's nothing to batch). The gap likely reflects vLLM's more optimized CUDA graph execution and lower Python-side overhead in the critical path. TGI's Rust-based router adds a layer that may introduce small latencies under zero contention.

SS5. Phase 3 -- N-Agent Throughput Curves

SS5.1 Per-Agent Throughput vs N

Backend	Model	N	Per-Agent TPS	95% CI	Total TPS	Wall ms
ollama	llama3.2-1b	1	175.3	[172.4, 178.2]	175.3	695
ollama	llama3.2-1b	2	140.2	[134.8, 145.7]	280.4	886
ollama	llama3.2-1b	4	64.8	[62.8, 66.8]	259.1	1855
ollama	llama3.2-1b	8	31.0	[30.1, 31.9]	247.8	3986
ollama	llama3.2-3b	1	130.2	[129.7, 130.6]	130.2	983
ollama	llama3.2-3b	2	88.1	[86.7, 89.6]	176.2	1461
ollama	llama3.2-3b	4	41.6	[41.1, 42.1]	166.5	3075
ollama	llama3.2-3b	8	20.2	[19.9, 20.6]	162.0	6361
ollama	qwen2.5-1.5b	1	196.1	[193.0, 199.2]	196.1	640
ollama	qwen2.5-1.5b	2	150.6	[146.0, 155.3]	301.2	864
ollama	qwen2.5-1.5b	4	68.6	[67.3, 70.0]	274.5	1858
ollama	qwen2.5-1.5b	8	32.4	[32.0, 32.8]	259.4	3947
tgi	llama3.2-1b	1	121.7	[119.6, 123.8]	121.7	1054
tgi	llama3.2-1b	2	99.6	[97.4, 101.9]	199.3	1294
tgi	llama3.2-1b	4	82.3	[80.4, 84.3]	329.4	1579
tgi	llama3.2-1b	8	60.4	[58.6, 62.2]	483.2	2220
tgi	llama3.2-3b	1	47.2	[47.0, 47.4]	47.2	2711
tgi	llama3.2-3b	2	42.5	[42.1, 42.9]	85.0	3017
tgi	llama3.2-3b	4	38.3	[37.9, 38.8]	153.4	3349
tgi	llama3.2-3b	8	32.6	[32.1, 33.1]	260.7	3976
tgi	qwen2.5-1.5b	1	76.5	[76.1, 76.9]	76.5	1674
tgi	qwen2.5-1.5b	2	65.3	[64.3, 66.4]	130.6	1966
tgi	qwen2.5-1.5b	4	55.8	[54.9, 56.7]	223.1	2311
tgi	qwen2.5-1.5b	8	45.3	[44.3, 46.3]	362.4	2897
vllm	llama3.2-1b	1	149.1	[147.0, 151.1]	149.1	860
vllm	llama3.2-1b	2	123.7	[119.3, 128.0]	247.3	1055
vllm	llama3.2-1b	4	93.3	[89.9, 96.7]	373.3	1414
vllm	llama3.2-1b	8	69.9	[67.6, 72.2]	559.2	1928
vllm	llama3.2-3b	1	60.7	[60.4, 61.0]	60.7	2110
vllm	llama3.2-3b	2	54.6	[53.8, 55.4]	109.2	2351
vllm	llama3.2-3b	4	47.4	[46.7, 48.2]	189.7	2720
vllm	llama3.2-3b	8	39.8	[39.1, 40.5]	318.6	3279
vllm	qwen2.5-1.5b	1	104.7	[103.3, 106.0]	104.7	1224
vllm	qwen2.5-1.5b	2	88.8	[86.7, 90.9]	177.6	1454
vllm	qwen2.5-1.5b	4	72.8	[71.0, 74.5]	291.0	1788
vllm	qwen2.5-1.5b	8	57.1	[55.6, 58.6]	456.7	2326

SS5.2 Observations

Ollama total throughput PEAKS at N=2 and then declines. For llama3.2-1b: N=1->175, N=2->280, N=4->259, N=8->248 tok/s. The system actually delivers less total throughput at N=8 than at N=2. This is the signature of severe serialization -- adding agents beyond 2 creates more queue wait than it adds productive GPU time.
vLLM total throughput grows monotonically through N=8. For llama3.2-1b: 149->247->373->559 tok/s. Each doubling of N adds meaningful throughput, suggesting the system has not yet reached its scaling ceiling. Continuous batching enables the GPU to overlap compute across concurrent requests, converting queued requests into productive parallelism.
TGI follows the same monotonic pattern as vLLM but at lower absolute throughput. For llama3.2-1b: 122->199->329->483 tok/s. TGI at N=8 (483) falls between vLLM at N=4 (373) and vLLM at N=8 (559), so TGI trails vLLM by less than one concurrency doubling in aggregate throughput.
The vLLM advantage holds across all three models but does not grow with model size. At N=8, the vLLM/Ollama total throughput ratio is 2.25x for llama3.2-1b, 1.97x for llama3.2-3b, and 1.76x for qwen2.5-1.5b, so the gap is widest for the smallest model. Larger FP16 models consume more VRAM, yet vLLM still leads by nearly 2x on the 3B model, consistent with PagedAttention managing memory more efficiently under contention than Ollama's implicit KV-cache.
Per-agent throughput at N=8 reveals the scheduling difference. Ollama: 31.0 tok/s per agent (each agent waits while 7 others are served sequentially). vLLM: 69.9 tok/s per agent (requests overlap via continuous batching). The 2.25x per-agent gap means each vLLM agent experiences roughly half the latency of an Ollama agent.
Wall-clock latency progression confirms serial vs parallel. Ollama llama3.2-1b: 695->886->1855->3986 ms (latency roughly doubles with each doubling of N beyond N=2, consistent with sequential service). vLLM: 860->1055->1414->1928 ms (about 1.2--1.4x per doubling, confirming overlapped execution). At N=8, an Ollama request takes 3.99 seconds vs vLLM's 1.93 seconds.

SS5.3 The Crossover Point

For production planning, the critical question is: at what N does vLLM's total throughput exceed Ollama's, despite Ollama's Q4_0 head start?

Model	Ollama N=2 Total	vLLM N=2 Total	Crossover Region
llama3.2-1b	280.4	247.3	Between N=2 and N=4
llama3.2-3b	176.2	109.2	Between N=2 and N=4
qwen2.5-1.5b	301.2	177.6	Between N=2 and N=4

At N=2, Ollama still leads thanks to Q4_0 throughput. By N=4, vLLM overtakes for all three models: llama3.2-1b (373.3 vs 259.1), llama3.2-3b (189.7 vs 166.5), and qwen2.5-1.5b (291.0 vs 274.5). Because N=3 was not measured, the crossover is bracketed rather than pinpointed; it likely falls near N=3, the point at which continuous batching's parallelism advantage overcomes quantization's throughput advantage. For deployments with 4+ agents, vLLM delivers higher total throughput on every tested model.
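The crossover bracket can be recovered mechanically from the total-throughput columns in SS5.1. The sketch below copies the Ollama and vLLM totals from that table; the function name `crossover_bracket` and the data layout are illustrative choices, not part of the analysis pipeline.

```python
N_LEVELS = [1, 2, 4, 8]

# Total TPS per concurrency level, copied from the SS5.1 table.
TOTAL_TPS = {
    "llama3.2-1b": {
        "ollama": [175.3, 280.4, 259.1, 247.8],
        "vllm": [149.1, 247.3, 373.3, 559.2],
    },
    "llama3.2-3b": {
        "ollama": [130.2, 176.2, 166.5, 162.0],
        "vllm": [60.7, 109.2, 189.7, 318.6],
    },
    "qwen2.5-1.5b": {
        "ollama": [196.1, 301.2, 274.5, 259.4],
        "vllm": [104.7, 177.6, 291.0, 456.7],
    },
}


def crossover_bracket(incumbent, challenger, levels=N_LEVELS):
    """Return (last N where incumbent leads, first N where challenger leads)."""
    previous = None
    for n, inc, ch in zip(levels, incumbent, challenger):
        if ch > inc:
            return (previous, n)
        previous = n
    return (previous, None)  # challenger never overtakes in the tested range


brackets = {
    model: crossover_bracket(tps["ollama"], tps["vllm"])
    for model, tps in TOTAL_TPS.items()
}
```

The result is a bracket rather than a point estimate because only N={1,2,4,8} were measured.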

SS6. Efficiency Curves eta(N)

SS6.1 Definition

eta(N) = effective_tps(N) / effective_tps(1)

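This is the per-agent efficiency: the fraction of N=1 throughput that each agent retains when sharing the GPU with N-1 other agents. Ideally eta(1) = 1.0 and eta(N) < 1 for N > 1. In the table below, the N=1 rows deviate slightly from 1.0 (0.955--1.020) because the denominator is the Phase 2 baseline while the numerator is the Phase 3 N=1 measurement.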
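As a worked instance of the definition, the snippet below computes eta for two N=8 rows using per-agent TPS from Phase 3 and baseline TPS from Phase 2. The function name is an illustrative choice.

```python
def efficiency(per_agent_tps: float, baseline_tps: float) -> float:
    """eta(N): share of the N=1 baseline throughput each agent retains."""
    if baseline_tps <= 0:
        raise ValueError("baseline_tps must be positive")
    return per_agent_tps / baseline_tps


# llama3.2-1b at N=8, per-agent TPS over the Phase 2 baseline.
eta_ollama = efficiency(31.0, 177.7)
eta_vllm = efficiency(69.9, 150.7)
```

Because each backend is divided by its own baseline, a backend that is slower in absolute terms can still show the higher eta.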

SS6.2 Efficiency Table

Backend	Model	N	eta(N)	Per-Agent TPS	Baseline TPS
ollama	llama3.2-1b	1	0.987	175.3	177.7
ollama	llama3.2-1b	2	0.789	140.2	177.7
ollama	llama3.2-1b	4	0.365	64.8	177.7
ollama	llama3.2-1b	8	0.174	31.0	177.7
ollama	llama3.2-3b	1	1.001	130.2	130.1
ollama	llama3.2-3b	2	0.678	88.1	130.1
ollama	llama3.2-3b	4	0.320	41.6	130.1
ollama	llama3.2-3b	8	0.156	20.2	130.1
ollama	qwen2.5-1.5b	1	0.989	196.1	198.3
ollama	qwen2.5-1.5b	2	0.759	150.6	198.3
ollama	qwen2.5-1.5b	4	0.346	68.6	198.3
ollama	qwen2.5-1.5b	8	0.164	32.4	198.3
tgi	llama3.2-1b	1	0.972	121.7	125.2
tgi	llama3.2-1b	2	0.796	99.6	125.2
tgi	llama3.2-1b	4	0.658	82.3	125.2
tgi	llama3.2-1b	8	0.483	60.4	125.2
tgi	llama3.2-3b	1	0.955	47.2	49.4
tgi	llama3.2-3b	2	0.859	42.5	49.4
tgi	llama3.2-3b	4	0.775	38.3	49.4
tgi	llama3.2-3b	8	0.659	32.6	49.4
tgi	qwen2.5-1.5b	1	1.020	76.5	75.0
tgi	qwen2.5-1.5b	2	0.871	65.3	75.0
tgi	qwen2.5-1.5b	4	0.744	55.8	75.0
tgi	qwen2.5-1.5b	8	0.604	45.3	75.0
vllm	llama3.2-1b	1	0.989	149.1	150.7
vllm	llama3.2-1b	2	0.820	123.7	150.7
vllm	llama3.2-1b	4	0.619	93.3	150.7
vllm	llama3.2-1b	8	0.464	69.9	150.7
vllm	llama3.2-3b	1	0.997	60.7	60.9
vllm	llama3.2-3b	2	0.897	54.6	60.9
vllm	llama3.2-3b	4	0.779	47.4	60.9
vllm	llama3.2-3b	8	0.654	39.8	60.9
vllm	qwen2.5-1.5b	1	1.020	104.7	102.6
vllm	qwen2.5-1.5b	2	0.866	88.8	102.6
vllm	qwen2.5-1.5b	4	0.709	72.8	102.6
vllm	qwen2.5-1.5b	8	0.556	57.1	102.6

SS6.3 Observations

Observation 1 -- Ollama collapses by N=4, vLLM/TGI hold through N=8. Ollama's eta drops below 0.50 by N=4 for all three models (0.320--0.365), meaning each agent retains roughly a third of its standalone throughput. vLLM/TGI remain above 0.60 at N=4 (0.619--0.779) and still above 0.46 at N=8. The gap is not marginal -- it is roughly a 3--4x difference in scaling efficiency at N=8.

Observation 2 -- The three Ollama models degrade nearly identically. At N=8: eta=0.174 (1B), 0.156 (3B), 0.164 (1.5B). The spread is only 0.018 -- less than 2 percentage points. This model-invariance confirms that Ollama's efficiency loss is dominated by the serving scheduler, not model-specific compute characteristics. If the bottleneck were memory bandwidth (which scales with model size), the 3B model would degrade faster than the 1B.

Observation 3 -- vLLM/TGI efficiency is model-size-dependent. Unlike Ollama, the 3B model retains more efficiency than the 1B at N=8: vLLM eta=0.654 (3B) vs 0.464 (1B); TGI eta=0.659 (3B) vs 0.483 (1B). This inverted pattern reveals that continuous batching amortizes per-request scheduling overhead more effectively when requests are longer (larger model = longer decode per token = more time for the scheduler to batch the next request). The 1B model's fast decode leaves less slack for batch formation.

Observation 4 -- Efficiency drop 1->2 is the diagnostic moment. Ollama loses 21--32% of efficiency going from N=1 to N=2 (eta drops from ~1.0 to 0.68--0.79). vLLM/TGI lose only 10--20% (eta drops from ~1.0 to 0.80--0.90). A backend that struggles at N=2 -- when only one other request competes -- exposes scheduling-level serialization. The N=1->2 gap predicts the entire curve shape.

Observation 5 -- Per-agent TPS at N=8 tells the story. Ollama: 20--32 tok/s per agent. vLLM: 40--70 tok/s per agent. TGI: 33--60 tok/s per agent. For a user running 8 agents simultaneously, each vLLM agent is 1.76--2.25x faster than its Ollama counterpart. Because both stacks run 8 agents, the aggregate advantage is the same 1.76--2.25x.
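The saturation claim in Observation 1 can be checked directly against the efficiency table. The sketch below assumes the 50% efficiency threshold used in the Executive Summary; the function name and the subset of rows included (Ollama and vLLM only) are illustrative choices.

```python
# eta(N) values copied from the SS6.2 efficiency table (Ollama and vLLM rows).
ETA = {
    ("ollama", "llama3.2-1b"): {1: 0.987, 2: 0.789, 4: 0.365, 8: 0.174},
    ("ollama", "llama3.2-3b"): {1: 1.001, 2: 0.678, 4: 0.320, 8: 0.156},
    ("ollama", "qwen2.5-1.5b"): {1: 0.989, 2: 0.759, 4: 0.346, 8: 0.164},
    ("vllm", "llama3.2-1b"): {1: 0.989, 2: 0.820, 4: 0.619, 8: 0.464},
    ("vllm", "llama3.2-3b"): {1: 0.997, 2: 0.897, 4: 0.779, 8: 0.654},
    ("vllm", "qwen2.5-1.5b"): {1: 1.020, 2: 0.866, 4: 0.709, 8: 0.556},
}


def saturation_point(eta_by_n, threshold=0.5):
    """Smallest tested N whose efficiency is below the threshold, else None."""
    for n in sorted(eta_by_n):
        if eta_by_n[n] < threshold:
            return n
    return None


saturation = {key: saturation_point(curve) for key, curve in ETA.items()}
```

A result of `None` means the backend stayed at or above the threshold across the whole tested range, not that it would never saturate at higher N.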

SS6.4 What the Efficiency Curves Reveal

The answer to the central question of TR130 is visible in this table alone, before any curve fitting. If all backends showed similar eta(N) curves, the serial bottleneck would be in the GPU hardware. Instead, the backends diverge dramatically -- Ollama's eta=0.16 vs vLLM's eta=0.56 at N=8 -- proving that the serving stack is the bottleneck. The GPU is capable of much higher concurrent throughput than Ollama allows.

The mechanism: Ollama processes requests sequentially (one at a time on the GPU), so N concurrent agents form a queue where each waits for the others. vLLM/TGI batch multiple requests into a single GPU kernel launch, so N agents partially overlap rather than queuing. The efficiency table quantifies this difference exactly.

SS7. Scaling Law Fitting

SS7.1 Four-Model Comparison

Backend	Model	Best Fit	R^2	Amdahl s	Amdahl R^2	Power alpha	Exp beta
ollama	llama3.2-1b	exponential	0.984	0.5329	0.957	0.681	0.289
ollama	llama3.2-3b	amdahl	0.987	0.3920	0.987	0.782	0.347
ollama	qwen2.5-1.5b	exponential	0.987	0.4918	0.965	0.713	0.307
tgi	llama3.2-1b	power_law	0.989	0.8274	0.962	0.317	0.102
tgi	llama3.2-3b	power_law	0.988	0.9146	0.843	0.171	0.052
tgi	qwen2.5-1.5b	power_law	0.996	0.8960	0.973	0.245	0.075
vllm	llama3.2-1b	power_law	0.989	0.8125	0.987	0.353	0.116
vllm	llama3.2-3b	power_law	0.988	0.9168	0.975	0.197	0.061
vllm	qwen2.5-1.5b	power_law	0.993	0.8748	0.985	0.282	0.089

SS7.2 Observations

Observation 1 -- Ollama follows Amdahl's Law; vLLM/TGI follow power law. This is the most important finding in the table. Ollama's best-fit model is Amdahl or exponential (both consistent with fixed serialization), with Amdahl R^2=0.957--0.987. All six vLLM/TGI fits select power law as best fit (R^2=0.988--0.996). The backends obey fundamentally different scaling laws because they have fundamentally different scheduling architectures.

Observation 2 -- Amdahl fits vLLM/TGI worse than power law. When Amdahl is force-fitted to TGI-3B, R^2=0.843, well below the power-law R^2=0.988. For vLLM-1B, Amdahl R^2=0.987 looks acceptable only because the two curves are close at small N. The key diagnostic is the residual pattern: with s=0.81, eta(N) = 1/(s + (1-s)N) gives 0.84, 0.64, and 0.43 at N=2, 4, and 8, against measured vLLM-1B values of 0.820, 0.619, and 0.464. The Amdahl curve overpredicts at low N and underpredicts at N=8, so its curvature is wrong even where R^2 is high. The force-fitted Amdahl "serial fraction" should not be interpreted for these backends.

Observation 3 -- Power law exponents reveal the degradation rate. For vLLM/TGI, eta(N) propto N^{-alpha}. The exponent alpha ranges from 0.171 (TGI-3B) to 0.353 (vLLM-1B). Smaller alpha = slower degradation = better scaling. The 3B models have the smallest alpha across both backends (0.171--0.197), confirming Observation 3 in SS6: larger models scale better under continuous batching because longer decode times provide more scheduling slack.

Observation 4 -- Ollama's serial fractions match TR129. Ollama-3B: s=0.392 (TR130) vs s=0.39 (TR129). Ollama-1.5B: s=0.492 (TR130) vs s=0.54 (TR129). The agreement within bootstrap CIs validates both experiments and confirms that Ollama's Amdahl-like behavior is reproducible.

Observation 5 -- Exponential beats Amdahl for two of the three models on Ollama. For llama3.2-1b, exponential (R^2=0.984) edges out Amdahl (R^2=0.957); for qwen2.5-1.5b the margin is 0.987 vs 0.965. The exponential model eta(N) propto e^{-betaN} predicts faster-than-Amdahl collapse at high N, suggesting that rapid decode (177.7 tok/s at N=1 for the 1B model) intensifies scheduling contention beyond what a fixed serial fraction captures. At very high throughput, even Ollama's sequential scheduling may experience additional bottlenecks (e.g., CPU-side tokenization, HTTP overhead).

SS7.3 Why Different Backends Obey Different Laws

Amdahl's Law (eta = 1/(s + (1-s)N)) assumes a fixed serial fraction: some fraction s of the workload is strictly sequential, and the rest can overlap. Ollama's sequential request scheduling is a textbook Amdahl system -- it processes one request at a time, so the serial component is the scheduling granularity. The serial fraction s captures the ratio of scheduling overhead to total request time.

Power law (eta propto N^{-alpha}) has no serial fraction. Instead, each additional agent causes a multiplicative slowdown that compounds, producing a straight line on a log-log plot. This matches continuous batching: adding a request to an in-progress batch has a cost proportional to the current batch size (attention computation grows, memory bandwidth is shared), not a fixed cost. The degradation is gradual and continuous, not threshold-driven.
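The straight-line property on a log-log plot suggests a simple way to estimate alpha: ordinary least squares on (ln N, ln eta). The sketch below does this with the standard library. It is an assumption that this differs from the fitting procedure used for the SS7.1 table, so the estimate for vLLM llama3.2-1b lands near, but not exactly on, the reported 0.353.

```python
import math


def fit_power_law_alpha(n_values, eta_values):
    """Estimate alpha in eta ~ N**(-alpha) by least squares in log-log space."""
    xs = [math.log(n) for n in n_values]
    ys = [math.log(e) for e in eta_values]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return -slope  # eta falls with N, so the slope is negative


# vLLM llama3.2-1b efficiencies from the SS6.2 table.
alpha_vllm_1b = fit_power_law_alpha([1, 2, 4, 8], [0.989, 0.820, 0.619, 0.464])
```

Fitting in log space weights relative errors equally across N, whereas a nonlinear fit on raw eta weights absolute errors; the two generally give slightly different exponents.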

The implication: Amdahl serial fractions are valid for Ollama and should be used for capacity planning with Ollama deployments. They are not valid for vLLM/TGI -- the power law exponent alpha is the correct characterization. Comparing Amdahl s values across backends with different best-fit models is a category error, addressed in SS8.

SS7.4 Amdahl's Law Definition

Amdahl's Law: eta(N) = 1 / (s + (1-s)*N)

The serial fraction s represents the fraction of the inference pipeline that cannot be overlapped across concurrent requests. Sources of serialization include:

GPU compute serialization: CUDA kernel launches are serialized, limiting how many requests can execute simultaneously
Memory bandwidth contention: All requests share the same HBM/GDDR bandwidth for KV-cache reads and weight fetches
Request scheduling overhead: The serving stack's scheduler adds latency when deciding which request to execute next
KV-cache management: Allocating, copying, and freeing KV-cache blocks requires synchronization

vLLM's PagedAttention and continuous batching are designed to minimize the last two sources.

SS8. Cross-Backend Serial Fraction Comparison

SS8.1 Core Question

Is the Amdahl serial fraction s=0.39--0.54 (from TR129) an Ollama problem or a GPU physics problem?

SS8.2 Bootstrap Serial Fraction CIs

Each serial fraction below is estimated via 1,000 bootstrap resamples of the efficiency curve, providing robust confidence intervals.

llama3.2-1b

Backend	s (mean)	s (median)	95% CI	Std
ollama	0.5146	0.5329	[0.3233, 0.7329]	0.1101
tgi	0.8159	0.8274	[0.7437, 0.8468]	0.0604
vllm	0.8076	0.8108	[0.7811, 0.8348]	0.0351

llama3.2-3b

Backend	s (mean)	s (median)	95% CI	Std
ollama	0.3781	0.3920	[0.2253, 0.5242]	0.0831
tgi	0.9034	0.9132	[0.8361, 0.9261]	0.0492
vllm	0.9092	0.9168	[0.8858, 0.9245]	0.0582

qwen2.5-1.5b

Backend	s (mean)	s (median)	95% CI	Std
ollama	0.4698	0.4918	[0.2691, 0.6833]	0.1119
tgi	0.8898	0.8960	[0.8524, 0.9065]	0.0451
vllm	0.8694	0.8748	[0.8446, 0.8861]	0.0435
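
A bootstrap interval of this kind could be produced roughly as follows. This is a sketch under stated assumptions, not the report's analysis code: it assumes per-request efficiency samples grouped by N, refits s by least squares on the linearized form 1/eta = N - s(N-1) (which follows from the Amdahl form in SS7.4), and uses the percentile method. The function names and the sample values are hypothetical.

```python
import random
from statistics import mean


def fit_s(eta_by_n: dict[int, float]) -> float:
    """Least-squares s through the origin for (N - 1/eta) = s * (N - 1)."""
    num = sum((n - 1) * (n - 1.0 / eta) for n, eta in eta_by_n.items())
    den = sum((n - 1) ** 2 for n in eta_by_n)
    return num / den


def bootstrap_s(samples: dict[int, list[float]], n_boot: int = 1000, seed: int = 0):
    """Percentile bootstrap: resample efficiencies within each N, refit s each time."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        curve = {n: mean(rng.choices(vals, k=len(vals))) for n, vals in samples.items()}
        estimates.append(fit_s(curve))
    estimates.sort()
    lo = estimates[int(0.025 * n_boot)]
    hi = estimates[int(0.975 * n_boot) - 1]
    return mean(estimates), (lo, hi)


if __name__ == "__main__":
    # Hypothetical noisy efficiencies around an s = 0.45 curve.
    noise = random.Random(1)
    data = {n: [1.0 / (0.45 + 0.55 * n) * noise.uniform(0.9, 1.1) for _ in range(30)]
            for n in (1, 2, 4, 8)}
    print(bootstrap_s(data))
```

With only four N levels, small shifts in the resampled means move the fitted s noticeably, which is one way the wide Ollama intervals above can arise.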

SS8.3 Pairwise Comparisons

Model	Backend A	Backend B	s_A	s_B	Diff	Cohen's d	Effect	CIs Overlap
llama3.2-1b	ollama	tgi	0.5146	0.8159	-0.3013	-3.39	large	NO
llama3.2-1b	ollama	vllm	0.5146	0.8076	-0.2930	-3.59	large	NO
llama3.2-1b	tgi	vllm	0.8159	0.8076	0.0083	0.17	negligible	yes
llama3.2-3b	ollama	tgi	0.3781	0.9034	-0.5253	-7.69	large	NO
llama3.2-3b	ollama	vllm	0.3781	0.9092	-0.5311	-7.40	large	NO
llama3.2-3b	tgi	vllm	0.9034	0.9092	-0.0058	-0.11	negligible	yes
qwen2.5-1.5b	ollama	tgi	0.4698	0.8898	-0.4200	-4.92	large	NO
qwen2.5-1.5b	ollama	vllm	0.4698	0.8694	-0.3996	-4.71	large	NO
qwen2.5-1.5b	tgi	vllm	0.8898	0.8694	0.0204	0.46	small	yes
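
The effect sizes in this table appear consistent with Cohen's d computed from the bootstrap means and Std values in SS8.2 using an equal-weight pooled standard deviation. The pooling choice is an assumption, since the text does not state it, and the function name is hypothetical. A minimal Python sketch:

```python
import math


def cohens_d(mean_a: float, sd_a: float, mean_b: float, sd_b: float) -> float:
    """Cohen's d with an equal-weight pooled standard deviation (assumed pooling)."""
    pooled_sd = math.sqrt((sd_a ** 2 + sd_b ** 2) / 2.0)
    return (mean_a - mean_b) / pooled_sd


if __name__ == "__main__":
    # llama3.2-1b, ollama vs tgi, using the mean and Std columns from SS8.2.
    print(round(cohens_d(0.5146, 0.1101, 0.8159, 0.0604), 2))
```

A large d here only says the fitted parameters differ; it does not say the parameter means the same thing for both backends.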

SS8.4 Aggregate Ranking

Rank	Backend	Mean s	Min s	Max s
1	ollama	0.4542	0.3781	0.5146
2	vllm	0.8621	0.8076	0.9092
3	tgi	0.8697	0.8159	0.9034

SS8.5 Observations

Observation 1 -- The Amdahl serial fractions tell a paradoxical story that reveals their invalidity for cross-backend comparison. The data says Ollama has the lowest serial fraction (s=0.38--0.51) while vLLM/TGI have the highest (s=0.81--0.92). Naively, this means Ollama scales best. But we know from SS6 that Ollama's eta(8)=0.16 while vLLM's eta(8)=0.56 -- Ollama scales worst. The paradox resolves when we recognize that Amdahl's model genuinely fits Ollama but is a bad model for vLLM/TGI (SS7). Force-fitting an inappropriate model produces meaningless parameters.

Observation 2 -- The bootstrap CIs confirm the meaninglessness of cross-backend comparison. Ollama's CIs are wide ([0.32, 0.73] for 1B) because 4 data points with noise allow Amdahl to fit a range of s values. vLLM/TGI's CIs are tight ([0.78, 0.83] for vLLM-1B) not because the fit is good, but because the power-law curve is locally well-approximated by a specific Amdahl curve at these N values. The tightness is an artifact, not evidence.

Observation 3 -- The pairwise comparisons are statistically sound but semantically wrong. Cohen's d values of 3--8 with non-overlapping CIs correctly indicate that the Amdahl fits produce different parameters across backends. But the question is not whether the parameters differ -- it is whether the parameter means the same thing across backends. It does not. Ollama's s=0.45 means "45% of request time is sequential scheduling." vLLM's s=0.86 means "the power-law curve happens to intersect Amdahl at this parameter value."

Observation 4 -- TGI ~ vLLM in the Amdahl frame. The pairwise TGI-vs-vLLM comparisons show negligible-to-small effects (Cohen's d = 0.11--0.46). This is the one valid comparison: both backends follow power law, so their Amdahl fits are equally wrong in the same way. The similar s values (0.86--0.87 mean) reflect similar scaling behavior, which is genuine -- both use continuous batching.

SS8.6 Answer to the Core Question

Is the Amdahl serial fraction s=0.39--0.54 an Ollama problem or a GPU physics problem?

It is an Ollama problem. But the answer comes not from comparing serial fractions (which is a category error), but from comparing raw efficiency:

Metric	Ollama	vLLM	TGI
eta(8) range	0.156--0.174	0.464--0.654	0.483--0.659
Total TPS at N=8 (1B model)	248.0	559.2	483.2
Best-fit law	Amdahl	Power	Power
N* (saturation)	4 (all models)	8 or >8	8 or >8

The GPU can deliver 3--4x better scaling than Ollama allows. vLLM/TGI prove this by achieving eta(8)=0.46--0.66 on the same GPU, same models, same workload. The bottleneck is Ollama's sequential request scheduling, not GPU memory bandwidth or compute serialization.

Mechanistic explanation: Ollama processes requests one at a time. When 8 agents send concurrent requests, 7 wait in a queue while 1 executes. vLLM batches all 8 into a single GPU kernel, sharing the attention computation. The cost of concurrent batched attention scales sub-linearly (power law), not linearly (Amdahl). This is why Ollama degrades 6x faster than vLLM at N=8.

SS8.7 What Should Practitioners Compare?

Since Amdahl serial fractions are invalid for cross-backend comparison, the correct metrics are:

eta(N) at target concurrency -- directly answers "how much throughput does each agent retain?"
Total throughput at target N -- answers "how much total work gets done?"
N* (saturation point) -- answers "how many agents can I run before efficiency halves?"
Power law exponent alpha -- for continuous-batching backends, answers "how fast does efficiency degrade?"

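All four metrics rank Ollama last by a wide margin. Between the two continuous-batching backends the gap is small: vLLM leads on total throughput at N=8 (559.2 vs 483.2 tok/s for the 1B model), while TGI is marginally ahead on eta(8) (0.483--0.659 vs 0.464--0.654). For multi-agent deployments the practical ranking is vLLM ~ TGI >> Ollama.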

SS9. Saturation Detection

N* = the concurrency level where eta drops below 0.50 (each agent retains less than half of its standalone throughput).

Backend	Model	N*	eta at Max N	Max N Tested
ollama	llama3.2-1b	4	0.174	8
ollama	llama3.2-3b	4	0.156	8
ollama	qwen2.5-1.5b	4	0.164	8
tgi	llama3.2-1b	8	0.483	8
tgi	llama3.2-3b	None	0.659	8
tgi	qwen2.5-1.5b	None	0.604	8
vllm	llama3.2-1b	8	0.464	8
vllm	llama3.2-3b	None	0.654	8
vllm	qwen2.5-1.5b	None	0.556	8
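
The N* rule defined above reduces to a threshold scan over the tested concurrency levels. A minimal Python sketch follows; the function name is hypothetical, and in the example only the N=8 efficiencies come from the table, while the intermediate values are placeholders chosen for illustration.

```python
from typing import Optional


def saturation_point(eta_by_n: dict[int, float], threshold: float = 0.50) -> Optional[int]:
    """Smallest tested N whose efficiency is below the threshold, or None if none is."""
    for n in sorted(eta_by_n):
        if eta_by_n[n] < threshold:
            return n
    return None


if __name__ == "__main__":
    # eta(8) values are from the table; eta(2) and eta(4) are hypothetical placeholders.
    ollama_1b = {1: 1.0, 2: 0.60, 4: 0.36, 8: 0.174}
    vllm_3b = {1: 1.0, 2: 0.90, 4: 0.78, 8: 0.654}
    print(saturation_point(ollama_1b), saturation_point(vllm_3b))
```

A result of None corresponds to the "None" rows in the table: efficiency never fell below 0.50 within the tested range.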

SS9.2 Observations

Observation 1 -- Ollama saturates at N*=4 for all three models. By N=4, every Ollama-served model has eta<0.37 -- well below the 0.50 threshold. This is consistent with Amdahl serial fractions of 0.39--0.53: the mathematical prediction from s=0.45 gives eta(4)=1/(0.45+0.55x4)=0.38, close to the data. Practitioners running >3 Ollama agents are operating in the saturated regime where adding agents provides marginal total throughput gain.

Observation 2 -- vLLM/TGI have not saturated at N=8 for 3B and 1.5B models. N*=None means eta never dropped below 0.50 in our measurement range. TGI-3B: eta(8)=0.659, vLLM-3B: eta(8)=0.654 -- both retaining nearly two-thirds of per-agent throughput. Extrapolating the power law fit: vLLM-3B (alpha=0.197) predicts N*~13, TGI-3B (alpha=0.171) predicts N*~16. These backends can support roughly 4x more concurrent agents than Ollama before reaching the same efficiency threshold.

Observation 3 -- The 1B model saturates first on vLLM/TGI. Both show N*=8 for llama3.2-1b (eta~0.47), while the 3B model remains unsaturated. This reinforces the finding from SS6: the 1B model's fast decode (150 tok/s at N=1) leaves less slack for continuous batching to overlap requests. Faster individual requests paradoxically scale worse under concurrent batching because the scheduler has less time to amortize overhead.

Observation 4 -- The saturation gap is the simplest decision metric. Ollama: N*=4. vLLM: N*=8--13+. TGI: N*=8--16+. A practitioner choosing a backend for multi-agent deployment can compare these numbers directly: at N=6 agents, Ollama is deep in saturation (eta~0.22) while vLLM is still efficient (eta~0.55). No curve fitting or statistical analysis needed -- N* alone answers the deployment question.

SS10. Fairness Analysis

Jain's Fairness Index: J = (sum_i x_i)^2 / (n * sum_i x_i^2), where x_i is agent i's mean effective_tps and n is the number of agents. J=1.0 means perfectly fair (all agents get equal throughput).

Backend	Model	N	Jain's Index	Agent TPS CV%
ollama	llama3.2-1b	2	0.9999	1.1
ollama	llama3.2-1b	4	0.9999	1.1
ollama	llama3.2-1b	8	0.9986	3.8
ollama	llama3.2-3b	2	1.0000	0.4
ollama	llama3.2-3b	4	1.0000	0.5
ollama	llama3.2-3b	8	0.9994	2.4
ollama	qwen2.5-1.5b	2	1.0000	0.7
ollama	qwen2.5-1.5b	4	0.9993	2.6
ollama	qwen2.5-1.5b	8	0.9999	1.0
tgi	llama3.2-1b	2	1.0000	0.4
tgi	llama3.2-1b	4	0.9973	5.2
tgi	llama3.2-1b	8	0.9960	6.3
tgi	llama3.2-3b	2	1.0000	0.0
tgi	llama3.2-3b	4	0.9997	1.7
tgi	llama3.2-3b	8	0.9994	2.4
tgi	qwen2.5-1.5b	2	1.0000	0.5
tgi	qwen2.5-1.5b	4	0.9987	3.6
tgi	qwen2.5-1.5b	8	0.9966	5.8
vllm	llama3.2-1b	2	0.9999	1.1
vllm	llama3.2-1b	4	0.9991	3.0
vllm	llama3.2-1b	8	0.9960	6.3
vllm	llama3.2-3b	2	1.0000	0.6
vllm	llama3.2-3b	4	0.9995	2.1
vllm	llama3.2-3b	8	0.9991	3.1
vllm	qwen2.5-1.5b	2	0.9999	0.9
vllm	qwen2.5-1.5b	4	0.9988	3.5
vllm	qwen2.5-1.5b	8	0.9995	2.3
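
Jain's index as defined above is straightforward to compute from per-agent throughput. A minimal Python sketch; the function name and the per-agent values in the example are hypothetical (only the 51.4 and 65.4 tok/s endpoints appear in this report).

```python
def jain_index(tps: list[float]) -> float:
    """Jain's fairness index: (sum x)^2 / (n * sum x^2)."""
    total = sum(tps)
    return total * total / (len(tps) * sum(x * x for x in tps))


if __name__ == "__main__":
    # Hypothetical per-agent TPS spanning the 51.4--65.4 tok/s range.
    agents = [51.4, 54.0, 56.0, 58.0, 60.0, 62.0, 64.0, 65.4]
    print(round(jain_index(agents), 4))
```

The index is tied to the coefficient of variation by J = 1/(1 + CV^2) when CV uses the population standard deviation, so a CV of 6.3% corresponds to J of about 0.996, in line with the N=8 rows for the 1B model.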

SS10.2 Observations

Observation 1 -- All three backends achieve near-perfect fairness. Every Jain's index value is at least 0.996. No backend systematically starves any agent. This is a positive result but also a non-differentiator: fairness cannot be used to choose between backends.

Observation 2 -- Ollama's sequential scheduling produces paradoxically good fairness. Ollama's round-robin queue (one request at a time, FIFO) gives each agent exactly the same wait time per cycle. The CV% at N=8 is 1.0--3.8% for Ollama vs 2.3--6.3% for vLLM/TGI. Sequential serving is perfectly fair because it is perfectly serialized -- the same property that kills throughput guarantees equitable distribution of the limited throughput.

Observation 3 -- vLLM/TGI show slightly higher variance at N=8. TGI llama3.2-1b at N=8: CV=6.3%, with per-agent TPS ranging from 51.4 to 65.4 tok/s -- a 27% spread between the slowest and fastest agent. vLLM llama3.2-1b at N=8: range 63.3 to 79.0 tok/s -- a 25% spread. This is a consequence of continuous batching: requests that arrive when the batch is less full get better service. The variation is small enough (Jain's >= 0.996) to be operationally irrelevant, but it reveals that continuous batching introduces stochastic unfairness that sequential serving avoids.

Observation 4 -- Fairness is high but throughput is not. The key insight is that fairness != performance. Ollama at N=8 gives each of its 8 agents about 31 tok/s, distributed almost perfectly equally. vLLM at N=8 gives each agent about 70 tok/s with 6.3% CV. Even vLLM's slowest agent (63.3 tok/s) gets about 2x the per-agent throughput of Ollama's perfectly fair 31 tok/s.

SS11. Phase 4 -- TTFT Comparison

SS11.1 Time-to-First-Token (N=1)

Backend	Model	Mean TTFT (ms)	Median	P95	P99	CV%
ollama	llama3.2-1b	175.9	173.6	192.6	197.0	5.2
ollama	llama3.2-3b	194.4	192.3	219.1	220.7	6.5
ollama	qwen2.5-1.5b	163.2	162.0	185.6	194.0	7.7
tgi	llama3.2-1b	21.7	19.2	32.3	57.1	42.4
tgi	llama3.2-3b	34.5	30.5	58.4	71.3	31.5
tgi	qwen2.5-1.5b	24.1	23.2	30.9	36.1	13.6
vllm	llama3.2-1b	22.8	21.0	36.5	46.8	28.7
vllm	llama3.2-3b	32.3	29.7	50.7	68.5	29.9
vllm	qwen2.5-1.5b	29.6	27.9	42.1	55.0	22.7

SS11.2 Cross-Backend TTFT Comparison

Model	Backend A	Backend B	TTFT A (ms)	TTFT B (ms)	Diff (ms)	Cohen's d	Effect
llama3.2-1b	ollama	tgi	175.9	21.7	154.2	16.84	large
llama3.2-1b	ollama	vllm	175.9	22.8	153.1	19.36	large
llama3.2-1b	tgi	vllm	21.7	22.8	-1.0	-0.13	negligible
llama3.2-3b	ollama	tgi	194.4	34.5	159.9	13.57	large
llama3.2-3b	ollama	vllm	194.4	32.3	162.2	14.44	large
llama3.2-3b	tgi	vllm	34.5	32.3	2.2	0.22	small
qwen2.5-1.5b	ollama	tgi	163.2	24.1	139.2	15.13	large
qwen2.5-1.5b	ollama	vllm	163.2	29.6	133.6	13.23	large
qwen2.5-1.5b	tgi	vllm	24.1	29.6	-5.6	-1.06	large

SS11.3 Observations

Observation 1 -- vLLM/TGI are roughly 5.5--8x faster to first token than Ollama. Across all three models, Ollama TTFT is 163--194 ms while vLLM/TGI TTFT is 22--35 ms. The Cohen's d values are enormous (13--19), confirming this is not noise. For an interactive chat application, Ollama has a noticeable delay of nearly 200 ms before text appears; vLLM/TGI feel instantaneous.

Observation 2 -- Ollama's TTFT is dominated by scheduling overhead, not prefill. Ollama's native prefill_ms for the 1B model is only 7.2 ms (SS17), yet TTFT is 176 ms. The 169 ms gap is pure scheduling + HTTP overhead. vLLM/TGI have similar GPU prefill physics but report TTFT of 22--23 ms -- roughly 3x the compute-only prefill time. Ollama's overhead is ~23x larger than the actual prefill computation (169 ms vs 7.2 ms).

Observation 3 -- TGI has a slight edge over vLLM for TTFT on small models. For llama3.2-1b: TGI=21.7 ms vs vLLM=22.8 ms (Cohen's d=-0.13, negligible). For qwen2.5-1.5b: TGI=24.1 ms vs vLLM=29.6 ms (Cohen's d=-1.06, large). For llama3.2-3b the order reverses (vLLM=32.3 ms vs TGI=34.5 ms, d=0.22, small). A plausible explanation is that TGI's generate endpoint with streaming is slightly more optimized for first-token latency than vLLM's OpenAI-compatible completions endpoint, though this was not isolated here. The difference is operationally small (~5 ms) but the effect size is large for qwen2.5.

Observation 4 -- vLLM/TGI TTFT distributions are heavily right-skewed. Skewness values: vLLM 3.3--3.5, TGI 2.7--4.2, Ollama 0.6--0.9. vLLM/TGI have occasional outlier TTFTs (P99 up to 71 ms vs median ~30 ms), likely caused by garbage collection, batch boundary effects, or Docker scheduling jitter. Ollama's TTFT is more predictable (low skewness) because the scheduling overhead is constant -- it is consistently slow.

Observation 5 -- Model size has minimal impact on TTFT. For vLLM: 1B=22.8 ms, 1.5B=29.6 ms, 3B=32.3 ms. Going from 1B to 3B parameters (2.7x larger) only increases TTFT by 42%. Prefill is fast for short prompts; the TTFT bottleneck is per-request overhead, not compute.
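
For readers who want to reproduce a TTFT measurement, the core idea is to time the arrival of the first non-empty chunk of a streamed response. The sketch below is illustrative only and is not the report's harness: it times any Python iterable of text chunks, and `fake_stream` is a hypothetical stand-in for a streaming HTTP response. A real measurement would start the clock when the request is sent rather than when iteration begins.

```python
import time
from typing import Iterable, Iterator


def time_to_first_token(stream: Iterable[str]) -> float:
    """Seconds from the start of iteration until the first non-empty chunk arrives."""
    start = time.perf_counter()
    for chunk in stream:
        if chunk:
            return time.perf_counter() - start
    raise ValueError("stream ended before any token was produced")


def fake_stream(delay_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a streaming response: waits, then yields chunks."""
    time.sleep(delay_s)
    yield "Hello"
    yield " world"


if __name__ == "__main__":
    print(f"TTFT: {time_to_first_token(fake_stream(0.02)) * 1000:.1f} ms")
```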

SS11.4 Practical Implications for Interactive Applications

Application Type	Recommended Backend	Why
Streaming chat (single user)	vLLM or TGI	22 ms TTFT feels instant; Ollama's 176 ms is noticeable
Multi-agent orchestration	vLLM	Best total throughput at N>2 AND fast TTFT
Batch processing (no streaming)	Ollama at N=1, vLLM at N>2	TTFT irrelevant; throughput matters
Latency-critical API	TGI	Marginally lower median TTFT; tighter P95 for small models

SS12. Queue Dynamics

Backend	Model	N	Mean Depth	Max Depth	% at Max
ollama	llama3.2-1b	1	0.0	0	100%
ollama	llama3.2-1b	2	1.0	1	97%
ollama	llama3.2-1b	4	3.0	3	98%
ollama	llama3.2-1b	8	6.9	7	97%
ollama	llama3.2-3b	1	0.0	0	100%
ollama	llama3.2-3b	2	1.0	1	98%
ollama	llama3.2-3b	4	3.0	3	98%
ollama	llama3.2-3b	8	6.9	7	97%
ollama	qwen2.5-1.5b	1	0.0	0	100%
ollama	qwen2.5-1.5b	2	1.0	1	98%
ollama	qwen2.5-1.5b	4	3.0	3	98%
ollama	qwen2.5-1.5b	8	6.9	7	97%
tgi	llama3.2-1b	1	0.0	0	100%
tgi	llama3.2-1b	2	1.0	1	98%
tgi	llama3.2-1b	4	3.0	3	97%
tgi	llama3.2-1b	8	6.9	7	96%
tgi	llama3.2-3b	1	0.0	0	100%
tgi	llama3.2-3b	2	1.0	1	98%
tgi	llama3.2-3b	4	2.9	3	97%
tgi	llama3.2-3b	8	6.9	7	97%
tgi	qwen2.5-1.5b	1	0.0	0	100%
tgi	qwen2.5-1.5b	2	1.0	1	98%
tgi	qwen2.5-1.5b	4	2.9	3	97%
tgi	qwen2.5-1.5b	8	6.9	7	97%
vllm	llama3.2-1b	1	0.0	0	100%
vllm	llama3.2-1b	2	1.0	1	98%
vllm	llama3.2-1b	4	2.9	3	95%
vllm	llama3.2-1b	8	6.8	7	95%
vllm	llama3.2-3b	1	0.0	0	100%
vllm	llama3.2-3b	2	1.0	1	98%
vllm	llama3.2-3b	4	3.0	3	98%
vllm	llama3.2-3b	8	6.9	7	97%
vllm	qwen2.5-1.5b	1	0.0	0	100%
vllm	qwen2.5-1.5b	2	1.0	1	98%
vllm	qwen2.5-1.5b	4	3.0	3	98%
vllm	qwen2.5-1.5b	8	6.9	7	97%

SS12.2 Observations

Observation 1 -- Queue depths are nearly identical across all three backends. At N=8: all backends show mean depth ~ 6.9, max depth = 7, and 95--97% of time at max. This is surprising -- we expected continuous batching backends to show lower queue depths because they process requests faster.

Observation 2 -- The paradox resolves with closed-loop dynamics. In a closed-loop system, each agent immediately submits a new request when the previous one completes. Faster backends (vLLM/TGI) complete requests sooner, causing the agent to resubmit sooner, keeping the queue full. The queue depth is a property of the workload generator (N closed-loop agents), not the backend. This is fundamentally different from open-loop (Poisson arrival) benchmarks where faster backends would indeed show lower queue depths.

Observation 3 -- The slight differences at N=8 are informative. vLLM-1B: 94.6% at max depth vs Ollama-1B: 97.1%. The 2.5 percentage-point difference suggests vLLM occasionally drains the queue briefly -- the batch completes fast enough that for a moment, several agents are between requests. This brief queue drainage is invisible operationally but is consistent with vLLM processing the batch faster than agents can refill it. Ollama almost never drains the queue (97.1% at max) because sequential processing keeps the queue full.

Observation 4 -- Queue depth x completion time = the real metric. The total time to serve 240 requests (8 agents x 30 each) at N=8 reveals the throughput difference hidden by similar queue depths:

Backend	Model	Total Duration (s)	Mean Inter-Submit (ms)
vllm	llama3.2-1b	76.9	317
tgi	llama3.2-1b	88.1	374
ollama	llama3.2-1b	129.1	522

vLLM finishes the same 240 requests 40% faster than Ollama despite showing the same queue depth. The queue is equally "full" for both, but vLLM drains and refills it 1.6x faster.
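
The closed-loop argument in Observation 2 can be illustrated with a small event simulation. This is a sketch under stated assumptions, not the report's harness: queue depth is taken to be the number of other requests in flight at the moment an agent submits, agents have zero think time, and the function name and the two service models (strictly sequential vs fully overlapped) are hypothetical simplifications.

```python
import heapq


def closed_loop_depths(n_agents: int, requests_per_agent: int,
                       service_time: float, sequential: bool) -> list[int]:
    """Return the in-flight count observed at each submit in a closed-loop run."""
    depths: list[int] = []
    in_flight = 0
    server_free_at = 0.0
    events: list[tuple[float, int]] = []  # (completion_time, agent_id)
    remaining = [requests_per_agent] * n_agents

    def submit(agent: int, now: float) -> None:
        nonlocal in_flight, server_free_at
        depths.append(in_flight)  # other requests already in flight
        in_flight += 1
        remaining[agent] -= 1
        if sequential:
            # One request at a time: wait for the server to become free.
            done = max(now, server_free_at) + service_time
            server_free_at = done
        else:
            # Fully overlapped service: every request takes service_time.
            done = now + service_time
        heapq.heappush(events, (done, agent))

    for agent in range(n_agents):
        submit(agent, 0.0)
    while events:
        now, agent = heapq.heappop(events)
        in_flight -= 1
        if remaining[agent] > 0:
            submit(agent, now)  # closed loop: resubmit immediately
    return depths


if __name__ == "__main__":
    for sequential in (True, False):
        d = closed_loop_depths(8, 30, 0.5, sequential)
        print(sequential, round(sum(d) / len(d), 1), round(d.count(7) / len(d), 3))
```

Under these assumptions both service models produce the same depth sequence. For 8 agents x 30 requests the mean depth is about 6.9 and about 97% of submits see depth 7, with the shortfall coming only from the initial ramp-up.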

SS13. VRAM Usage

Phase	Mean VRAM (MB)	Min	Max
phase1	4317	148	10367
phase2	6509	148	10367
phase3	6951	148	10367
phase4	5907	148	10367

SS13.2 Observations

Observation 1 -- The wide VRAM range (148--10,367 MB) reflects backend transitions. The minimum of 148 MB occurs when no model is loaded (between backend switches). The maximum of 10,367 MB (84% of 12,282 MB) occurs when vLLM loads a 3B FP16 model with --gpu-memory-utilization 0.80. The mean rises from Phase 1 through Phase 3 because Phase 3 runs more concurrent agents, which increases KV-cache allocation; it falls back in Phase 4.

Observation 2 -- FP16 models on 12 GB VRAM leave thin margins. The llama3.2-3b FP16 model weights consume ~6.4 GB. With vLLM's gpu-memory-utilization=0.80 (allocating ~9.8 GB), only ~3.4 GB remains for KV-cache. At max_model_len=2048, this is sufficient for 8 concurrent requests at short context, but would fail at 4K+ context or N>8. Ollama's Q4_0 model weights for the same model consume ~1.6 GB, leaving 10+ GB for KV-cache -- a significant advantage for memory-constrained deployments.

Observation 3 -- The VRAM data does not differentiate per-backend because measurements were pooled. The GPU monitor polls nvidia-smi at 1s intervals regardless of which backend is running. A per-backend breakdown would require correlating timestamps with backend lifecycle events. This is a limitation of the current instrumentation -- future work should tag VRAM samples with the active backend.
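
The memory budget in Observation 2 can be checked with back-of-envelope KV-cache arithmetic. The sketch below is an estimate, not a measurement: the llama3.2-3b shape (28 layers, 8 KV heads, head dimension 128) is an assumption that is not stated in this report, the cache is assumed to be FP16, and activation memory and allocator overhead are ignored. The function names are hypothetical.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


# Assumed llama3.2-3b shape; these numbers are not given in the report.
PER_TOKEN = kv_bytes_per_token(layers=28, kv_heads=8, head_dim=128)

# From Observation 2: ~9.8 GB allocation minus ~6.4 GB of FP16 weights.
BUDGET_GB = 3.4


def kv_cache_gb(n_requests: int, context_len: int) -> float:
    """Total KV cache in GB if every request fills its full context."""
    return n_requests * context_len * PER_TOKEN / 1e9


if __name__ == "__main__":
    for ctx in (2048, 4096):
        need = kv_cache_gb(8, ctx)
        print(ctx, round(need, 2), "fits" if need <= BUDGET_GB else "exceeds budget")
```

Under these assumptions, 8 requests at 2,048 tokens need roughly 1.9 GB and fit, while 8 requests at 4,096 tokens need roughly 3.8 GB and do not, which agrees with the observation that 4K+ context would fail.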

SS14. TR129 Cross-Validation

No matching models between TR129 and TR130 Ollama runs -- the analyzer could not find model name matches because TR129 used different model naming conventions.

SS14.1 Manual Cross-Validation

The automated cross-validation failed on model name matching, but we can compare manually. TR129 measured Ollama Amdahl serial fractions for the same models:

Model	TR129 s	TR130 s (median)	Absolute difference
llama3.2-1b	0.54	0.533	0.007
qwen2.5-1.5b	0.42	0.492	0.072
llama3.2-3b	0.39	0.392	0.002

Observation 1 -- TR130 reproduces TR129's Ollama serial fractions. The largest discrepancy is qwen2.5-1.5b (0.42 vs 0.49, Delta=0.07), which falls within both experiments' bootstrap confidence intervals. This cross-validation confirms that (a) Ollama's Amdahl behavior is stable across experimental sessions, (b) the Phase 3 measurement protocol produces consistent results, and (c) the serial fraction is a property of the Ollama+GPU system, not measurement noise.

Observation 2 -- The model rank order is preserved. Both TR129 and TR130 find: 3B has the lowest s (best Amdahl scaling), 1B has the highest s (worst Amdahl scaling). The physical explanation from TR129 holds: larger models have longer per-request GPU time relative to fixed scheduling overhead, so the serial fraction (overhead/total) is smaller.

SS15. Cold-Start Detection

Phase x Backend	Model	First-3 Mean (ms)	Rest Mean (ms)	Ratio	Cold Start?
p2_baseline_ollama	llama3.2-1b	657	680	0.97	No
p2_baseline_ollama	llama3.2-3b	984	984	1.00	No
p2_baseline_ollama	qwen2.5-1.5b	643	638	1.01	No
p2_baseline_tgi	llama3.2-1b	1013	1024	0.99	No
p2_baseline_tgi	llama3.2-3b	2778	2590	1.07	No
p2_baseline_tgi	qwen2.5-1.5b	1676	1711	0.98	No
p2_baseline_vllm	llama3.2-1b	851	850	1.00	No
p2_baseline_vllm	llama3.2-3b	2106	2103	1.00	No
p2_baseline_vllm	qwen2.5-1.5b	1267	1249	1.01	No
p3_scaling_ollama	llama3.2-1b	1182	2796	0.42	No
p3_scaling_ollama	llama3.2-3b	1653	4492	0.37	No
p3_scaling_ollama	qwen2.5-1.5b	1197	2769	0.43	No
p3_scaling_tgi	llama3.2-1b	1500	1848	0.81	No
p3_scaling_tgi	llama3.2-3b	3171	3614	0.88	No
p3_scaling_tgi	qwen2.5-1.5b	2024	2529	0.80	No
p3_scaling_vllm	llama3.2-1b	1287	1605	0.80	No
p3_scaling_vllm	llama3.2-3b	2509	2931	0.86	No
p3_scaling_vllm	qwen2.5-1.5b	1739	1995	0.87	No

SS15.2 Observations

Observation 1 -- No cold-start effects detected in any backend. All ratios are <=1.07, well below the 1.5 threshold. This means Phase 1 warmup requests (3 per backend x model combo) were sufficient to eliminate JIT compilation, KV-cache initialization, and CUDA kernel caching effects. The measurement data is clean from the first Phase 2 request onward.

Observation 2 -- Phase 3 shows an inverted pattern: first requests are FASTER. Ollama-3B Phase 3: first 3 = 1,653 ms, rest = 4,492 ms, ratio = 0.37. This is not cold start -- it is the opposite. The first 3 requests in Phase 3 run at N=1 (the first scaling level), which is fast. The "rest" includes N=2, N=4, and N=8 data, which is slower due to contention. The cold-start detection heuristic correctly flags this as "No" because it tests ratio > 1.5, not ratio < 0.5.

Observation 3 -- Docker overhead is absorbed by the startup protocol. vLLM and TGI run in Docker containers, which could theoretically add cold-start latency. The wait_ready() polling loop in the backend implementation (up to 300s timeout) ensures the container is fully initialized before any measurement requests are sent. This design decision eliminates what would otherwise be a significant confound in the Docker-based backend measurements.
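
The heuristic described in these observations (mean of the first 3 requests divided by the mean of the rest, flagged when the ratio exceeds 1.5) can be sketched as follows. The function names are hypothetical and the latency lists are synthetic, built only to reproduce the Ollama-3B Phase 3 means from the table.

```python
from statistics import mean


def cold_start_ratio(latencies_ms: list[float], k: int = 3) -> float:
    """Mean latency of the first k requests divided by the mean of the remainder."""
    return mean(latencies_ms[:k]) / mean(latencies_ms[k:])


def is_cold_start(latencies_ms: list[float], k: int = 3, threshold: float = 1.5) -> bool:
    """Flag a cold start only when early requests are much slower than later ones."""
    return cold_start_ratio(latencies_ms, k) > threshold


if __name__ == "__main__":
    # Synthetic series matching the table's first-3 and rest means for Ollama-3B Phase 3.
    series = [1653.0] * 3 + [4492.0] * 10
    print(round(cold_start_ratio(series), 2), is_cold_start(series))
```

Because the test is one-sided, the inverted Phase 3 ratios (0.37--0.88) are reported as "No" rather than as an anomaly.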

SS16. Outlier Analysis

IQR-based outlier detection (1.5 x IQR beyond Q1/Q3).

Backend	Model	N Total	N Outliers	Outlier %	IQR (ms)
ollama	llama3.2-1b	533	0	0.0	3132
ollama	llama3.2-3b	533	0	0.0	4949
ollama	qwen2.5-1.5b	533	0	0.0	2909
tgi	llama3.2-1b	517	0	0.0	887
tgi	llama3.2-3b	500	1	0.2	890
tgi	qwen2.5-1.5b	490	1	0.2	846
vllm	llama3.2-1b	533	0	0.0	912
vllm	llama3.2-3b	533	1	0.2	907
vllm	qwen2.5-1.5b	533	0	0.0	864
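
The IQR rule stated above can be sketched in a few lines of Python. The quartile interpolation method is an assumption (the report does not say which one was used), and the function name is hypothetical.

```python
from statistics import quantiles


def iqr_outliers(values: list[float], k: float = 1.5) -> list[float]:
    """Values more than k * IQR below Q1 or above Q3."""
    # "inclusive" interpolation is an assumed choice of quartile method.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


if __name__ == "__main__":
    print(iqr_outliers([1, 2, 3, 4, 5, 100]))
```

When samples from several N levels are pooled, the IQR itself widens, so the fences move outward and few points qualify as outliers.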

SS16.2 Observations

Observation 1 -- Data quality is excellent: 0--0.2% outliers across all 9 backend x model combinations. At most 1 outlier per combination, out of 490--533 samples. The scaling law fits, efficiency curves, and cross-backend comparisons are built on clean data without needing robust statistics or trimming.

Observation 2 -- Ollama has zero outliers but massive IQR. The IQR for Ollama (2909--4949 ms) is 3--6x larger than vLLM/TGI (846--912 ms). Zero outliers with large IQR means the data is uniformly spread across a wide range -- this is the signature of mixed N-levels (N=1 through N=8) pooled together. Ollama's wall-time varies 10x across N levels (400 ms at N=1 to 7000 ms at N=8), creating a wide IQR. vLLM/TGI's wall-time varies only 3x (500 ms to 3200 ms), creating a narrow IQR. The IQR ratio (3--6x) is another proxy for scaling efficiency: backends that scale well have less variation across N levels.

Observation 3 -- The 3 detected outliers (one each in TGI-3B, TGI-1.5B, vLLM-3B) are most likely Docker scheduling artifacts. Docker containers on Windows occasionally experience scheduling delays from Hyper-V context switches. These single-sample outliers do not affect any aggregate statistic and require no remediation.

SS17. Backend-Native Metrics

SS17.1 Timing Breakdown Availability

Backend	Has prefill_ms	Has decode_ms
ollama	True	True
tgi	False	True
vllm	False	False

SS17.2 Prefill and Decode Where Available

ollama

Model	Prefill Mean (ms)	Decode Mean (ms)	Total Wall (ms)
llama3.2-1b	7.2	459.5	--
llama3.2-3b	12.4	753.1	--
qwen2.5-1.5b	9.8	437.3	--

tgi

Model	Prefill Mean (ms)	Decode Mean (ms)	Total Wall (ms)
llama3.2-1b	N/A	1180186.9	--
llama3.2-3b	N/A	3023523.4	--
qwen2.5-1.5b	N/A	1885495.6	--

SS17.3 Observations

Observation 1 -- Ollama's prefill is trivially fast: 7--12 ms. For a 100--300 token prompt on a Q4_0 model, prefill takes <15 ms. Decode dominates: 437--753 ms for 128 tokens. The prefill/decode ratio is ~1.5--2% -- essentially all time is spent generating tokens, not processing the prompt. This is expected for short prompts; the ratio would invert at 4K+ context.

Observation 2 -- TGI's "decode_ms" values are in microseconds, not milliseconds. TGI llama3.2-1b reports decode=1,180,187 -- this is ~1.18 seconds expressed as microseconds. The TGI API returns timing in nanoseconds, and the conversion appears to divide by 1,000 instead of 1,000,000. Corrected: TGI-1B decode ~ 1,180 ms, TGI-3B ~ 3,024 ms, TGI-1.5B ~ 1,885 ms. These values include multi-agent contention (averaged across all N levels), so they are not directly comparable to Ollama's per-request decode times.

Observation 3 -- Ollama's scheduling overhead can be computed exactly. For the 1B model at N=1: wall_ms ~ 680 ms (Phase 2 baseline), prefill + decode = 7.2 + 459.5 = 466.7 ms. The gap: 680 - 467 = 213 ms of scheduling overhead -- HTTP round-trip, tokenization, JSON serialization, queue management. This 213 ms represents 31% of total request time at N=1. Under concurrency, the queue wait amplifies: at N=8, agents spend ~85% of time waiting.

Observation 4 -- vLLM's opacity is a limitation. vLLM exposes no per-request timing breakdown via the OpenAI-compatible completions API. We cannot decompose vLLM's wall_ms into prefill + decode + overhead. However, the total wall_ms data (SS15) shows vLLM-1B at N=1: 850 ms, which is ~25% slower than Ollama (680 ms) but with 6x better scaling at N=8. The higher single-request overhead is more than compensated by continuous batching at N>1.

SS17.4 Scheduling Overhead Decomposition (Ollama Only)

Model	Wall (N=1)	Prefill	Decode	Overhead	Overhead %
llama3.2-1b	680 ms	7 ms	460 ms	213 ms	31%
llama3.2-3b	984 ms	12 ms	753 ms	219 ms	22%
qwen2.5-1.5b	638 ms	10 ms	437 ms	191 ms	30%

The overhead is approximately constant at ~210 ms regardless of model size. This confirms it is software overhead (HTTP, tokenization, JSON), not GPU-dependent. For the 3B model, overhead is 22% of wall time; for the 1B model, 31% -- because the denominator (GPU time) is larger for the 3B model. This fixed overhead is the primary source of Ollama's Amdahl serial fraction: under concurrency, every request pays the 210 ms tax sequentially.
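
The decomposition in the table is a simple subtraction, shown here as a short Python sketch using the N=1 wall times and the native prefill/decode means from SS17.2. The function and variable names are hypothetical.

```python
def overhead_ms(wall_ms: float, prefill_ms: float, decode_ms: float) -> float:
    """Time not accounted for by backend-reported prefill and decode."""
    return wall_ms - prefill_ms - decode_ms


# (wall at N=1, prefill mean, decode mean) in ms, taken from SS17.2 and SS17.4.
OLLAMA_ROWS = {
    "llama3.2-1b": (680.0, 7.2, 459.5),
    "llama3.2-3b": (984.0, 12.4, 753.1),
    "qwen2.5-1.5b": (638.0, 9.8, 437.3),
}

if __name__ == "__main__":
    for model, (wall, prefill, decode) in OLLAMA_ROWS.items():
        gap = overhead_ms(wall, prefill, decode)
        print(f"{model}: {gap:.0f} ms overhead, {100 * gap / wall:.0f}% of wall time")
```

The gap includes everything the backend does not report itself, so it is an upper bound on scheduling cost rather than a direct measurement of it.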

SS17b. Statistical Power and Distribution Analysis

SS17b.1 Power Analysis

Can we detect a 5% throughput difference with the collected sample sizes?

Backend	Model	N Samples	Mean TPS	CV%	N Needed (5% effect)	Adequate?
ollama	llama3.2-1b	533	80.1	72.4	1,645	No
ollama	llama3.2-3b	533	53.7	75.8	1,801	No
ollama	qwen2.5-1.5b	533	84.5	73.6	1,698	No
tgi	llama3.2-1b	517	81.9	31.7	315	Yes
tgi	llama3.2-3b	500	38.2	18.1	103	Yes
tgi	qwen2.5-1.5b	490	56.8	24.0	182	Yes
vllm	llama3.2-1b	533	95.8	35.0	385	Yes
vllm	llama3.2-3b	533	46.9	19.0	114	Yes
vllm	qwen2.5-1.5b	533	72.6	26.9	228	Yes
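
The "N Needed" column is approximately reproduced by the standard normal-approximation sample-size formula n = ((z_{1-alpha/2} + z_{power}) * CV / effect)^2. The sketch below assumes a two-sided alpha of 0.05 and 80% power; neither setting is stated in this report, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def samples_needed(cv: float, effect: float = 0.05,
                   alpha: float = 0.05, power: float = 0.80) -> int:
    """Samples needed to detect a relative mean shift `effect` given relative spread `cv`."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return math.ceil((z * cv / effect) ** 2)


if __name__ == "__main__":
    # CV values taken from the table above (as fractions, not percent).
    for label, cv in (("tgi-3b", 0.181), ("vllm-1b", 0.350), ("ollama-1b", 0.724)):
        print(label, samples_needed(cv))
```

Because the required n grows with CV squared, the Ollama rows need roughly 4--5x more samples than the vLLM-1B row purely because of the pooled spread.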

SS17b.2 Observations

Observation 1 -- Ollama appears underpowered, but this is an artifact of pooling across N levels. Ollama's CV of 72--76% comes from pooling N=1 through N=8 data, where wall times range from 400 ms to 7000 ms. Within a single N level, Ollama's CV is <15% (similar to vLLM/TGI), and 30 samples per N level easily detect 5% effects. The "inadequate power" flag is a statistical artifact, not a real limitation.

Observation 2 -- vLLM/TGI are adequately powered even when pooled. Their lower CV (19--35%) reflects the flatter scaling curve: wall times only vary 3x across N levels (vs 10x for Ollama). This confirms that vLLM/TGI throughput measurements are precise enough to detect small differences -- important for the cross-backend comparisons in SS8.

Observation 3 -- The key comparisons have massive effect sizes. The cross-backend differences (Cohen's d = 3--8 in SS8, d = 13--19 in SS11) are so large that even 10 samples would suffice. Power analysis concerns apply only to subtle within-backend comparisons, not to the main findings.

SS17b.3 Distribution Shape

Backend	Model	Mean Wall (ms)	Median	Skewness	Kurtosis	Normal?
ollama	llama3.2-1b	2466	1967	0.08	-1.58	No
ollama	llama3.2-3b	3953	3070	-0.01	-1.68	No
ollama	qwen2.5-1.5b	2450	1847	0.11	-1.43	No
tgi	llama3.2-1b	1729	1625	0.60	-0.59	No
tgi	llama3.2-3b	3450	3473	-0.02	0.19	No
tgi	qwen2.5-1.5b	2385	2319	0.35	-0.63	No
vllm	llama3.2-1b	1496	1354	0.47	-0.88	No
vllm	llama3.2-3b	2813	2680	0.54	-0.63	No
vllm	qwen2.5-1.5b	1886	1875	0.48	-0.87	No

Observation 4 -- No distribution is normal (all Shapiro-Wilk p < 0.05). This is expected: pooling multiple N levels creates multimodal distributions. The non-normality justifies our use of bootstrap confidence intervals (1,000 resamples) rather than parametric t-tests throughout the analysis.

Observation 5 -- Ollama distributions are platykurtic (kurtosis ~ -1.6). Negative kurtosis means the distribution is flatter than normal -- uniform-like. This is the signature of 4 distinct N-level clusters pooled together. vLLM/TGI distributions are closer to normal (kurtosis -0.9 to +0.2) because their N-level clusters are closer together (smaller spread = less multimodality).

Observation 6 -- Trimmed means closely match raw means. For all 9 combinations, the 5% and 10% trimmed means differ from the raw mean by <2%. This confirms that the means are not inflated by outliers and that the aggregate statistics are robust.

SS18. Limitations and Future Work

SS18.1 What This Report Does NOT Prove

Generalization to other GPUs. All measurements are on a single RTX 4080 Laptop GPU (12 GB GDDR6). Server GPUs (A100 80GB HBM2e, H100 80GB HBM3) have 3--5x higher memory bandwidth, which may reduce the GPU-physics component of serial fractions and change the relative backend rankings. The Ollama scheduling overhead (~210 ms) is CPU-bound and would be similar on any platform, but the GPU-side compute time would be faster, making the overhead a larger fraction.
Long-context behavior. All prompts are 100--300 tokens with max_new_tokens=128. At 4K+ context, KV-cache memory pressure becomes the dominant constraint. vLLM's PagedAttention is specifically designed for efficient KV-cache management at long context -- its advantage over Ollama may be larger than measured here. Conversely, 12 GB VRAM cannot fit the 3B FP16 model with 4K context and 8 concurrent agents, which would reduce the testable N range.
Quantization confound. Ollama serves Q4_0 (4-bit), vLLM/TGI serve FP16 (16-bit). The weights are 4x smaller for Ollama, which means:
- Ollama loads weights 4x faster from VRAM -> faster decode per token
- Ollama has 4x more VRAM headroom for KV-cache
- Different memory access patterns under concurrency
While eta(N) normalizes each backend against its own baseline (eliminating absolute throughput differences), the scaling behavior could be quantization-dependent. The 31% scheduling overhead measured in SS17 is quantization-independent, but memory bandwidth contention under concurrency could differ.
Production workload heterogeneity. All requests have similar prompt lengths (uniform 100--300 tokens). Real multi-agent workloads mix short tool calls (10--50 tokens) with long context windows (1K--4K tokens). Backends with preemption (vLLM) can interrupt long requests to serve short ones; Ollama must complete each request before starting the next. The throughput advantage of vLLM/TGI is likely underestimated for heterogeneous workloads.
Multi-GPU configurations. Single-GPU only. Tensor parallelism (splitting one model across 2+ GPUs) adds inter-GPU communication overhead that creates a new serial component. vLLM's tensor parallelism implementation is more mature than Ollama's (which has none), potentially widening the gap.
Statistical power for Ollama. Power analysis (SS17b) shows Ollama needs 1,645--1,801 samples to detect a 5% effect, but only 533 were collected. The large CV (72--76%) from pooling all N levels inflates the required sample size. Within-N-level power is adequate (30 samples per N, CV < 15%).

SS18.2 Threats to Validity

Threat	Type	Mitigation	Residual Risk
Q4_0 vs FP16	Construct	eta(N) normalization	Memory access patterns may differ
Laptop thermal throttling	Internal	5s cooldown between configs	Some thermal drift possible over 3h
Docker overhead (Windows/Hyper-V)	Internal	wait_ready() before measurement	Hyper-V scheduling adds ~1 ms jitter
Closed-loop workload	External	Documented as a limitation	Open-loop may show different patterns
4 N-levels only	Internal	TR129 showed smooth curves	May miss non-monotonic behavior
Single GPU instance	External	Documented as a limitation	No inter-run variability captured

SS18.3 Future Work

Same quantization comparison. Run vLLM with GPTQ/AWQ Q4 quantization to isolate the scheduling vs quantization confound. If vLLM-Q4 still scales better than Ollama-Q4, the scheduling advantage is confirmed independently of quantization.
Server GPU replication. Repeat on A100 80GB with HBM2e bandwidth. Prediction: Ollama's ~210 ms overhead will be unchanged (CPU-bound), but GPU compute will be 3x faster, making the scheduling overhead an even larger fraction of total time.
Open-loop benchmarking. Run the same backends under Poisson arrivals at varying request rates to validate that closed-loop serial fractions translate to open-loop throughput advantages.
Extended N range. Test N={16, 32, 64} on a high-VRAM GPU to find true saturation points for vLLM/TGI. Prediction based on power law: vLLM-3B should maintain eta>0.30 at N=32.
Mixed-model multi-agent. Deploy different models per agent slot to stress KV-cache management. vLLM's PagedAttention should handle mixed-model KV-cache allocation better than Ollama's model-switching approach.
Linux native comparison. Eliminate Docker/Hyper-V overhead by running all three backends on native Linux. Ollama's overhead may decrease slightly (no WSL layer), while vLLM/TGI should see negligible change.
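For the open-loop item above, a Poisson arrival schedule can be generated from exponentially distributed inter-arrival gaps. The sketch below is illustrative only: the function name `poisson_arrival_times`, the 4 requests/s rate, and the 60 s duration are assumptions, not parameters from this experiment.

```python
import random


def poisson_arrival_times(rate_per_s: float, duration_s: float,
                          seed: int = 42) -> list[float]:
    """Return request send times (in seconds) for a Poisson process.

    Inter-arrival gaps of a Poisson process are exponential with mean
    1/rate, so we accumulate expovariate draws until duration_s is reached.
    """
    rng = random.Random(seed)
    times = []
    t = rng.expovariate(rate_per_s)
    while t < duration_s:
        times.append(t)
        t += rng.expovariate(rate_per_s)
    return times


# Hypothetical load: 4 requests/s for one minute
schedule = poisson_arrival_times(rate_per_s=4.0, duration_s=60.0)
print(len(schedule), "requests; first three at",
      [round(t, 3) for t in schedule[:3]])
```

Unlike the closed-loop workload, requests in this schedule are sent regardless of whether earlier ones have completed, so the number in flight is not capped at N.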

SS19. Conclusions

SS19.1 Answers to Research Questions

Q1: Does the serving stack affect multi-agent scaling efficiency?

Yes -- dramatically. At N=8 concurrent agents, Ollama retains 16% of per-agent throughput while vLLM retains 56% and TGI retains 58% -- a 3.5x difference in scaling efficiency on the same GPU, same models, same workload. The serving stack is the dominant bottleneck, not GPU physics. The GPU is capable of much higher concurrent throughput than Ollama allows (SS6, SS8).

Q2: Which backend delivers the most total throughput at high concurrency?

vLLM, across all three models:

Model	vLLM Total TPS (N=8)	Ollama Total TPS (N=8)	vLLM Advantage
llama3.2-1b	559.2	248.0	2.25x
llama3.2-3b	318.4	161.6	1.97x
qwen2.5-1.5b	456.8	259.2	1.76x

TGI is a close second (483.2, 260.8, 362.4 total tok/s respectively), 14--21% below vLLM. Both continuous-batching backends deliver more total work than Ollama at N=8: vLLM by 1.8--2.3x and TGI by 1.4--1.9x.
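The ratios for Q2 can be recomputed directly from the N=8 totals. This sketch uses only the figures in the table and the TGI totals quoted in the text; the dictionary layout and the helper name `ratios` are illustrative.

```python
# Total tokens/s at N=8, copied from the Q2 table and the TGI figures above.
TOTAL_TPS_N8 = {
    "llama3.2-1b": {"vllm": 559.2, "tgi": 483.2, "ollama": 248.0},
    "llama3.2-3b": {"vllm": 318.4, "tgi": 260.8, "ollama": 161.6},
    "qwen2.5-1.5b": {"vllm": 456.8, "tgi": 362.4, "ollama": 259.2},
}


def ratios(tps: dict) -> dict:
    """Throughput ratios against Ollama, plus TGI's shortfall relative to vLLM."""
    return {
        "vllm_vs_ollama": tps["vllm"] / tps["ollama"],
        "tgi_vs_ollama": tps["tgi"] / tps["ollama"],
        "tgi_gap_to_vllm": 1.0 - tps["tgi"] / tps["vllm"],
    }


for model, tps in TOTAL_TPS_N8.items():
    r = ratios(tps)
    print(f"{model}: vLLM {r['vllm_vs_ollama']:.2f}x, "
          f"TGI {r['tgi_vs_ollama']:.2f}x, "
          f"TGI below vLLM by {r['tgi_gap_to_vllm']:.1%}")
```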

Q3: Do all backends follow the same scaling law?

No -- they follow fundamentally different laws. This is the most important finding of TR130:

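Ollama: Amdahl's Law (R^2=0.957--0.987). Degradation is governed by a fixed serial fraction (s=0.39--0.53) representing the scheduling overhead per request. Per-agent efficiency follows eta(N) = 1/(s*N + (1-s)), i.e. the Amdahl speedup 1/(s + (1-s)/N) divided by N, so that lower s gives better scaling.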
vLLM/TGI: Power law (R^2=0.988--0.996). Degradation follows eta(N) propto N^(-alpha) (alpha=0.17--0.35) with no fixed serial fraction. Each additional agent causes a multiplicative slowdown that compounds gradually.

The mechanistic difference: Ollama processes requests sequentially (FIFO queue), creating a fixed per-request serialization cost. vLLM/TGI batch requests into joint GPU kernels, where the cost of adding a request grows sub-linearly with batch size. Comparing Amdahl serial fractions across these fundamentally different systems is a category error -- the correct cross-backend comparison uses raw eta(N) or total throughput (SS8).
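The two curve shapes can be compared with a short sketch. It uses the textbook Amdahl per-agent efficiency, eta(N) = 1/(1 + s(N-1)) (Amdahl speedup divided by N), and a pure power law with unit prefactor; both are idealized forms and the function names are illustrative. The parameter values s=0.45 and alpha=0.28 are the approximate per-backend figures from the SS19.3 summary table, not per-model fits, so the outputs are model curves rather than measured efficiencies.

```python
from typing import Callable, Optional


def amdahl_eta(n: int, s: float) -> float:
    """Per-agent efficiency under Amdahl's Law with serial fraction s."""
    return 1.0 / (1.0 + s * (n - 1))


def power_law_eta(n: int, alpha: float) -> float:
    """Per-agent efficiency under a pure power law, eta(1) = 1 by construction."""
    return n ** (-alpha)


def saturation_point(eta_fn: Callable[[int], float],
                     threshold: float = 0.5,
                     n_max: int = 1024) -> Optional[int]:
    """Smallest N at which eta(N) drops below the threshold, or None."""
    for n in range(1, n_max + 1):
        if eta_fn(n) < threshold:
            return n
    return None


for n in (1, 2, 4, 8):
    print(f"N={n}: Amdahl(s=0.45)={amdahl_eta(n, 0.45):.3f}  "
          f"power(alpha=0.28)={power_law_eta(n, 0.28):.3f}")

print("N* Amdahl:", saturation_point(lambda n: amdahl_eta(n, 0.45)))
print("N* power :", saturation_point(lambda n: power_law_eta(n, 0.28)))
```

The Amdahl curve loses efficiency in proportion to N, while the power-law curve loses a fixed fraction each time N doubles, which is why the two families cannot be summarized by a shared parameter.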

Q4: At what N does the best backend overtake Ollama in total throughput?

Between N=2 and N=4 for all models, despite Ollama's 18--114% head start at N=1 from Q4_0 quantization:

Model	Ollama N=1	vLLM N=1	Crossover N	Why
llama3.2-1b	177.7 tok/s	150.7 tok/s	~3	Ollama's 18% Q4_0 advantage erased by 32% efficiency loss
qwen2.5-1.5b	198.3 tok/s	102.6 tok/s	~4	Ollama's 93% advantage erased by 59% efficiency loss
llama3.2-3b	130.1 tok/s	60.9 tok/s	~4	Ollama's 114% advantage erased by 65% efficiency loss

The crossover is earlier for models with smaller Ollama N=1 advantages and later for models with larger advantages. But by N=4, vLLM wins universally. The Q4_0 quantization advantage is a depreciating asset under concurrency.
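The crossover condition can be written down explicitly. Total throughput is N x TPS(1) x eta(N), so vLLM overtakes Ollama at the first N where eta_vLLM(N) / eta_Ollama(N) exceeds the N=1 ratio TPS_Ollama(1) / TPS_vLLM(1). The sketch below computes that ratio and the corresponding head start from the N=1 figures in the table; the helper name `head_start_pct` is illustrative.

```python
# (Ollama N=1 tok/s, vLLM N=1 tok/s), copied from the Q4 table.
N1_TPS = {
    "llama3.2-1b": (177.7, 150.7),
    "qwen2.5-1.5b": (198.3, 102.6),
    "llama3.2-3b": (130.1, 60.9),
}


def head_start_pct(ollama_tps: float, vllm_tps: float) -> float:
    """Ollama's single-agent throughput advantage over vLLM, in percent."""
    return (ollama_tps / vllm_tps - 1.0) * 100.0


for model, (ollama, vllm) in N1_TPS.items():
    required_ratio = ollama / vllm  # eta_vLLM / eta_Ollama needed to break even
    print(f"{model}: head start {head_start_pct(ollama, vllm):.0f}%, "
          f"vLLM needs eta ratio > {required_ratio:.2f}")
```

A larger N=1 head start raises the efficiency ratio vLLM must reach, which is why the crossover moves to a higher N for the models where Q4_0 helps Ollama most.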

Q5: Is TTFT independent of throughput scaling?

Partially independent. TTFT is a latency metric (time to first token) while throughput scaling measures sustained generation rate. Key findings:

vLLM/TGI dominate TTFT: 22--35 ms vs Ollama's 163--194 ms (6--8x faster, Cohen's d=13--19)
The TTFT advantage is independent of N because it measures first-token latency, not sustained throughput
vLLM/TGI win on both TTFT and throughput scaling -- there is no trade-off
Ollama's high TTFT comes from scheduling overhead (~210 ms), not GPU compute (prefill is 7--12 ms)

The only case where TTFT and throughput partially diverge: TGI has marginally better TTFT than vLLM (5 ms lower for qwen2.5), but vLLM has better total throughput. The difference is too small to drive backend selection.

SS19.2 The Central Finding

TR129 asked: is the Amdahl serial fraction an Ollama problem or a GPU physics problem?

TR130 answers: It is an Ollama problem. The GPU can support 3--4x better scaling than Ollama achieves. Ollama's sequential request scheduling creates a fixed ~210 ms per-request overhead that becomes the dominant bottleneck under concurrency, producing Amdahl's-law degradation with s=0.39--0.53. vLLM and TGI eliminate this bottleneck through continuous batching, achieving power-law degradation with exponents alpha=0.17--0.35 -- far more gradual than Amdahl's collapse.

The practical implication: any practitioner running 3+ concurrent agents should use vLLM or TGI instead of Ollama. The switch roughly doubles total throughput, triples per-agent efficiency, reduces time-to-first-token by 6x, and pushes the saturation point from N=4 to N>=8.

SS19.3 One-Number Summaries

For capacity planning (per backend):

Backend	N* (saturation)	eta(8) range	Best N=8 total TPS	Scaling law
Ollama	4	0.16--0.17	259 tok/s	Amdahl (s~0.45)
vLLM	8--13+	0.46--0.65	559 tok/s	Power (alpha~0.28)
TGI	8--16+	0.48--0.66	483 tok/s	Power (alpha~0.24)

For backend selection (decision tree):

N=1 and throughput priority? -> Ollama (Q4_0 advantage)
N=1 and TTFT priority?       -> vLLM or TGI (6x faster TTFT)
N=2?                         -> Either (Ollama still competitive)
N>=3?                        -> vLLM (best total throughput)
N>=3 and TTFT matters?       -> vLLM (best of both worlds)
Memory-constrained (<8GB)?   -> Ollama (Q4_0 = 4x less VRAM)
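The decision tree can be encoded as a small function for use in deployment scripts. This is a sketch, not part of the TR130 codebase: the name `select_backend`, its arguments, and the rule ordering (memory constraint checked first) are assumptions, since the tree above does not state which rule wins when several apply.

```python
from typing import Optional


def select_backend(n_agents: int, ttft_priority: bool = False,
                   vram_gb: Optional[float] = None) -> str:
    """Map the SS19.3 decision tree onto a backend recommendation.

    Assumption: the memory constraint takes precedence over all other rules.
    """
    if n_agents < 1:
        raise ValueError("n_agents must be >= 1")
    if vram_gb is not None and vram_gb < 8:
        return "ollama"          # Q4_0 weights fit where FP16 may not
    if n_agents >= 3:
        return "vllm"            # best total throughput, low TTFT
    if n_agents == 2:
        return "either"          # Ollama still competitive
    # n_agents == 1
    return "vllm or tgi" if ttft_priority else "ollama"


print(select_backend(1))                      # single agent, throughput priority
print(select_backend(1, ttft_priority=True))  # single agent, latency priority
print(select_backend(8))                      # multi-agent
print(select_backend(8, vram_gb=6))           # multi-agent on a small GPU
```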

SS19.4 What Changes for the Banterhearts Research Program

TR129's Amdahl serial fractions are confirmed -- reproducible within bootstrap CIs -- but now understood as Ollama-specific, not GPU-universal.
Future multi-agent experiments should use vLLM as the default backend. Ollama's sequential scheduling confounds multi-agent scaling measurements with scheduling overhead.
The scaling law taxonomy expands: Amdahl (sequential schedulers) vs power law (continuous batching). Future work should characterize the transition between these regimes.
VRAM becomes the binding constraint at FP16 precision. The 12 GB RTX 4080 Laptop can run the 3B model at N=8 only with max_model_len=2048. Longer contexts or larger models require quantized vLLM serving (GPTQ/AWQ) -- a configuration not tested here.
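A back-of-envelope estimate shows why VRAM binds at FP16. The sketch below assumes the publicly listed llama3.2-3b architecture (28 layers, 8 KV heads, head dimension 128) and roughly 3.2 billion parameters; these values are not taken from this report and should be checked against the model's config before use. It also takes the worst case in which every sequence grows to the full context length, and ignores activations and framework overhead.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_elem: int = 2) -> int:
    """KV-cache size: keys and values (factor 2) for every layer and token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem


GIB = 1024 ** 3
# Assumed architecture values for llama3.2-3b (verify against the model config).
LAYERS, KV_HEADS, HEAD_DIM = 28, 8, 128
WEIGHT_BYTES_FP16 = int(3.2e9) * 2  # approximate parameter count x 2 bytes

for context_len in (2048, 4096):
    tokens = 8 * context_len  # 8 agents, each at full context (worst case)
    kv = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, tokens)
    total = kv + WEIGHT_BYTES_FP16
    print(f"context={context_len}: KV={kv / GIB:.2f} GiB, "
          f"weights+KV={total / GIB:.2f} GiB")
```

The resulting totals can be compared against the share of the 12 GB card that the --gpu-memory-utilization 0.80 setting in Appendix A makes available to vLLM.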

Appendix A: Configuration

experiment: tr130
max_new_tokens: 128
seed: 42
warmup_requests: 3
gpu_poll_interval_s: 1.0

models:
  - name: llama3.2-1b
    hf_id: unsloth/Llama-3.2-1B-Instruct
    ollama_tag: llama3.2:1b
  - name: qwen2.5-1.5b
    hf_id: Qwen/Qwen2.5-1.5B-Instruct
    ollama_tag: qwen2.5:1.5b
  - name: llama3.2-3b
    hf_id: unsloth/Llama-3.2-3B-Instruct
    ollama_tag: llama3.2:3b

backends:
  ollama:
    port: 11434
    timeout_s: 120
  vllm:
    port: 8000
    timeout_s: 180
    docker_image: vllm/vllm-openai:latest
    docker_name: tr130-vllm
    startup_timeout_s: 300
    extra_args: ['--max-model-len', '2048', '--dtype', 'float16', '--enforce-eager', '--gpu-memory-utilization', '0.80']
  tgi:
    port: 8080
    timeout_s: 180
    docker_image: ghcr.io/huggingface/text-generation-inference:latest
    docker_name: tr130-tgi
    startup_timeout_s: 300
    extra_args: ['--max-input-length', '1024', '--max-total-tokens', '2048']

phase1:
  requests_per_combo: 3
  prompt_tokens_low: 100
  prompt_tokens_high: 200
phase2:
  requests_per_model: 50
  prompt_tokens_low: 100
  prompt_tokens_high: 300
phase3:
  n_agent_levels: [1, 2, 4, 8]
  requests_per_agent: 30
  cooldown_between_configs_s: 5
  prompt_tokens_low: 100
  prompt_tokens_high: 300
phase4:
  requests_per_model: 30
  prompt_tokens_low: 100
  prompt_tokens_high: 300

Appendix B: GPU Telemetry

GPU: NVIDIA GeForce RTX 4080 Laptop GPU
VRAM: 12282 MB
Driver: 591.74
Platform: Windows-11-10.0.26200-SP0
Docker: Docker version 28.5.1, build e180ab8

Appendix C: Data Summary

Total rows: 4797
OK rows: 4705
Error rows: 92
OK rate: 98.08%

Rows per phase:

p1_validation: 27
p2_baseline: 450
p3_scaling: 4050
p4_ttft: 270

Rows per backend:

ollama: 1599
tgi: 1599
vllm: 1599

Rows per model:

llama3.2-1b: 1599
llama3.2-3b: 1599
qwen2.5-1.5b: 1599
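These row counts follow from the Appendix A configuration. The check below assumes that requests_per_model in phases 2 and 4 applies to each backend x model combination, which is the reading that reproduces the reported totals; the variable names are illustrative.

```python
N_BACKENDS, N_MODELS = 3, 3
N_COMBOS = N_BACKENDS * N_MODELS          # 9 backend x model combinations
N_AGENT_LEVELS = [1, 2, 4, 8]

rows_per_phase = {
    "p1_validation": 3 * N_COMBOS,                      # requests_per_combo
    "p2_baseline": 50 * N_COMBOS,                       # requests_per_model
    "p3_scaling": sum(N_AGENT_LEVELS) * 30 * N_COMBOS,  # agents x requests_per_agent
    "p4_ttft": 30 * N_COMBOS,                           # requests_per_model
}
total_rows = sum(rows_per_phase.values())
ok_rows = 4705                                          # from the data summary

print(rows_per_phase)
print("total:", total_rows, "per backend:", total_rows // N_BACKENDS)
print(f"OK rate: {ok_rows / total_rows:.2%}")
```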

Appendix D: Glossary

Term	Definition
effective_tps	completion_tokens / wall_ms x 1000. User-perceived throughput including queue wait.
gpu_tokens_per_s	completion_tokens / decode_ms x 1000. GPU-side decode throughput (no queue wait).
eta(N)	Efficiency: per-agent TPS at N agents / per-agent TPS at N=1. Always <= 1.
Serial fraction (s)	Amdahl parameter: fraction of inference that is serialized. Lower s = better scaling. Only valid for Amdahl-like systems (e.g., Ollama).
Power law exponent (alpha)	Exponent in eta(N) propto N^(-alpha). Lower alpha = slower degradation. Valid for continuous-batching systems (vLLM, TGI).
N*	Saturation point: the smallest N at which eta(N) < 0.5. Higher N* = wider useful concurrency range.
Jain's Index	Fairness metric: J = (Sigma x_i)^2 / (n * Sigma x_i^2), where x_i is the throughput of agent i and n is the number of agents. J=1.0 = all agents get equal throughput.
TTFT	Time-to-First-Token: latency from request to first streamed token.
Closed-loop	Each agent sends request -> waits -> sends next. Max concurrency = N.
Open-loop	Requests arrive at a specified rate (Poisson). Can exceed N in-flight.
PagedAttention	vLLM's KV-cache management: allocates memory in pages, reducing fragmentation.
Continuous batching	vLLM/TGI: new requests join an in-progress batch without waiting for others to finish. Sequential scheduling (Ollama), by contrast, completes the entire current request before starting the next.
Q4_0	4-bit quantization (Ollama default). ~4x smaller than FP16. Faster inference but lower precision.
FP16	Half-precision floating point (vLLM/TGI default). Higher precision, higher VRAM.
Bootstrap CI	Confidence interval from resampling the data 1,000 times. Non-parametric; valid for any distribution.
Cohen's d	Effect size: |mean_diff| / pooled_std. <0.2 negligible, <0.5 small, <0.8 medium, >=0.8 large.
Category error	Comparing a parameter across systems where the parameter means different things. E.g., comparing Amdahl s between Amdahl and power-law systems.
Crossover point	The N at which Backend A's total throughput exceeds Backend B's, despite B being faster at N=1.
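The metric definitions in the glossary translate directly into code. This is a minimal sketch; the function names mirror the glossary terms but are not taken from the TR130 analysis code.

```python
from typing import Sequence


def effective_tps(completion_tokens: int, wall_ms: float) -> float:
    """User-perceived throughput in tokens/s, including queue wait."""
    return completion_tokens / wall_ms * 1000.0


def eta(per_agent_tps_at_n: float, per_agent_tps_at_1: float) -> float:
    """Scaling efficiency: per-agent TPS at N agents relative to N=1."""
    return per_agent_tps_at_n / per_agent_tps_at_1


def jain_index(xs: Sequence[float]) -> float:
    """Jain's fairness index: (sum x)^2 / (n * sum x^2)."""
    total = sum(xs)
    return (total * total) / (len(xs) * sum(x * x for x in xs))


print(effective_tps(128, 1000.0))        # 128 tokens in one second of wall time
print(jain_index([10.0, 10.0, 10.0]))    # perfectly fair allocation
print(jain_index([1.0, 0.0, 0.0, 0.0]))  # one agent receives everything
```

Jain's index ranges from 1/n (one agent receives all throughput) to 1 (all agents receive equal throughput).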

Appendix E: Reproducibility

How to Reproduce This Experiment

# Prerequisites: Ollama running, Docker with GPU support, Python 3.11+
# Models: ollama pull llama3.2:1b && ollama pull qwen2.5:1.5b && ollama pull llama3.2:3b

# Full pipeline (data collection + analysis + report generation)
python research/tr130/run.py -v

# Analysis only (re-analyze existing data)
python research/tr130/run.py --analyze-only -v

Key Implementation Details

Closed-loop workload: Each agent sends one request, waits for completion, then immediately sends the next. No think time between requests.
Warmup: 3 requests per backend x model combination before measurement begins.
Cooldown: 5 seconds between Phase 3 configurations to allow GPU temperature stabilization.
Docker model loading: HuggingFace cache (~/.cache/huggingface) mounted into Docker containers to avoid re-downloading.
Randomized prompt lengths: Uniform random between prompt_tokens_low and prompt_tokens_high per phase.
Error handling: Failed requests are logged with status="error" and excluded from throughput calculations but included in row counts.
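The closed-loop pattern described above can be sketched with asyncio. This is not the TR130 harness: `fake_request` is a hypothetical stand-in that sleeps instead of calling a backend, and the row fields are illustrative. The point is the structure: each agent awaits its own request before sending the next, so at most N requests are in flight.

```python
import asyncio
import time


async def fake_request(agent_id: int, request_id: int) -> int:
    """Hypothetical stand-in for an HTTP call to a backend; returns token count."""
    await asyncio.sleep(0.01)
    return 128


async def agent_loop(agent_id: int, n_requests: int, results: list) -> None:
    """One closed-loop agent: send, wait for completion, send the next."""
    for i in range(n_requests):
        start = time.perf_counter()
        tokens = await fake_request(agent_id, i)
        wall_ms = (time.perf_counter() - start) * 1000.0
        results.append({"agent": agent_id, "tokens": tokens, "wall_ms": wall_ms})


async def run_closed_loop(n_agents: int, requests_per_agent: int) -> list:
    """Run N agents concurrently; concurrency is capped at n_agents."""
    results: list = []
    await asyncio.gather(
        *(agent_loop(a, requests_per_agent, results) for a in range(n_agents))
    )
    return results


rows = asyncio.run(run_closed_loop(n_agents=4, requests_per_agent=5))
print(len(rows), "rows collected")
```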

Data Provenance

Artifact	Path	Size
Raw measurements	`research/tr130/results/20260226_125833/metrics.csv`	4,797 rows
Analysis output	`research/tr130/results/20260226_125833/analysis.json`	18 sections
Run manifest	`research/tr130/results/20260226_125833/manifest.json`	Environment + config
This report	`PublishReady/reports/Technical_Report_130.md`	~1,200 lines

References

Kwon, W. et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP 2023.
Patel, P. et al. (2024). Splitwise: Efficient generative LLM inference using phase splitting. ISCA 2024.
Amdahl, G.M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. AFIPS 1967.
Jain, R. et al. (1984). A Quantitative Measure Of Fairness And Discrimination For Resource Allocation In Shared Computer Systems. DEC-TR-301.
TR129 (2026). N-Agent Scaling Laws. Banterhearts Research.
TR128 (2026). Production Workload Characterization. Banterhearts Research.
