Technical Report 130: Serving Stack Benchmarking
Ollama vs vLLM vs TGI -- Multi-agent throughput scaling comparison
| Field | Value |
|---|---|
| TR Number | 130 |
| Project | Banterhearts LLM Performance Research |
| Date | 2026-02-26 |
| Author | Research Team |
| Report Type | Cross-backend serving stack benchmarking (4-phase, 3 backends, 4797 measurements) |
| Test Duration | ~3 hours |
| Status | Complete --- All 4 phases delivered |
| Run ID | 20260226_125833 |
| Related Work | TR129 (N-Agent Scaling Laws), TR128 (Production Workload Characterization) |
| Depends On | TR129 (Ollama serial fraction cross-validation) |
Abstract
TR129 found that Ollama exhibits Amdahl serial fractions s=0.39--0.54 on an RTX 4080 Laptop GPU (12 GB VRAM), meaning up to 54% of inference is serialized under multi-agent concurrency. The critical unanswered question: is this an Ollama scheduling bottleneck or an inherent GPU physics constraint? If alternative serving stacks achieve lower serial fractions with identical hardware, the bottleneck is the serving stack, not the silicon.
TR130 answers this question with 4,797 measurements across 3 serving backends (Ollama, vLLM, TGI), 3 models (llama3.2-1b, llama3.2-3b, qwen2.5-1.5b), and 4 phases: environment validation, single-agent baseline, N-agent scaling (N={1,2,4,8}), and time-to-first-token (TTFT) comparison. Each backend serves the same models on the same GPU under identical closed-loop workloads. All 9 backend x model combinations passed validation; zero were skipped.
Core finding: The serving stack is the bottleneck, and it is Ollama that suffers. The three backends follow fundamentally different degradation curves under concurrency. Ollama's sequential request scheduling maps to Amdahl's Law (R^2=0.957--0.987), producing steep efficiency collapse: at N=8 agents, each agent retains only 16--17% of its standalone throughput. vLLM and TGI, with continuous batching and PagedAttention, follow a power-law degradation (R^2=0.988--0.996) that is far more gradual: at N=8, each agent retains 46--66% of standalone throughput.
The practical consequence is dramatic. For llama3.2-1b at N=8 concurrent agents, vLLM delivers 559 total tok/s versus Ollama's 248 total tok/s -- a 2.25x advantage -- despite Ollama being 18% faster at N=1 (177.7 vs 150.7 tok/s). The crossover point occurs between N=2 and N=4 for all three models. Beyond N=2, practitioners should switch to vLLM or TGI.
Methodological caveat on the Amdahl serial fraction: When Amdahl's model is force-fitted to vLLM/TGI data, it produces artificially high serial fractions (s=0.81--0.92) because these backends do not degrade via Amdahl's mechanism. Their best-fit model is power law (eta propto N?alpha, alpha=0.17--0.35), which has no "serial fraction" parameter. Comparing Amdahl serial fractions across backends with different degradation mechanisms is a category error. This report uses raw efficiency eta(N), total throughput, and saturation points as the primary cross-backend comparison, with the Amdahl serial fraction reserved for Ollama where the model genuinely fits.
Quantization note: Ollama serves Q4_0 quantized models while vLLM and TGI serve FP16. Absolute throughput differences from quantization are expected and do not affect the scaling comparison: eta(N) normalizes each backend against its own N=1 baseline.
Executive Summary
Key Findings
-
Ollama wins at N=1. Q4_0 quantization gives Ollama 1.2--2.6x higher single-agent throughput: 198 tok/s (qwen2.5-1.5b) vs 103 tok/s (vLLM) and 75 tok/s (TGI). This is expected -- 4-bit weights are 4x smaller, reducing memory bandwidth pressure.
-
vLLM/TGI win decisively at N>=4. At N=8 agents, vLLM delivers 559 tok/s total vs Ollama's 248 tok/s for llama3.2-1b -- a 2.25x throughput advantage despite being slower at N=1. The crossover point is between N=2 and N=4.
-
The backends follow different scaling laws. Ollama degrades via Amdahl's Law (R^2=0.957--0.987): sequential request processing creates a genuine serial fraction. vLLM and TGI degrade via power law (R^2=0.988--0.996): continuous batching enables overlapping execution with gradual resource contention.
-
Ollama efficiency collapses at N=8. Each agent retains only 16--17% of N=1 throughput (eta
0.16). vLLM agents retain 46--65% (eta0.56). TGI agents retain 48--66% (eta~0.58). The difference is 3--4x in per-agent efficiency. -
Ollama saturates at N=4; vLLM/TGI never saturate within tested range.* Ollama drops below 50% efficiency at N=4 for all models. vLLM and TGI remain above 50% efficiency at N=8 for 2 of 3 models, suggesting useful scaling continues to N=16+.
-
TTFT is 6--8x faster on vLLM/TGI. Ollama: 163--194 ms. vLLM: 23--32 ms. TGI: 22--35 ms. Docker-based backends start streaming tokens in under 35 ms, versus Ollama's 160+ ms -- a user-perceptible gap (Cohen's d > 13 for all pairwise comparisons).
-
All backends are perfectly fair. Jain's fairness index >= 0.996 across all backends at all concurrency levels. No backend starves individual agents under contention.
-
Zero cold-start effects detected. No phase x backend combination shows first-3-request latency more than 1.07x the steady-state mean. The warmup protocol successfully eliminates cold-start artifacts.
-
Data quality is exceptional. 98.08% success rate (4,705/4,797 ok). Only 92 HTTP 424 errors (TGI overload at high N). Outlier rate: 0.0--0.2% across all backends. Zero outliers for Ollama and vLLM llama3.2-1b.
-
The Amdahl serial fraction comparison is misleading. Force-fitting Amdahl to vLLM/TGI produces s=0.81--0.92, but these backends don't follow Amdahl mechanics. The comparison overstates Ollama's relative scaling quality. Raw eta(N) and total throughput are the correct metrics.
N=1 Baseline Throughput
| Backend | Model | Quant | Mean TPS | 95% CI | CV% | Wall ms |
|---|---|---|---|---|---|---|
| ollama | llama3.2-1b | Q4_0 | 177.7 | [175.0, 180.4] | 5.3 | 678.9 |
| ollama | llama3.2-3b | Q4_0 | 130.1 | [129.7, 130.4] | 1.0 | 984.2 |
| ollama | qwen2.5-1.5b | Q4_0 | 198.3 | [196.5, 200.2] | 3.3 | 637.9 |
| vllm | llama3.2-1b | FP16 | 150.7 | [149.8, 151.6] | 2.0 | 849.7 |
| vllm | llama3.2-3b | FP16 | 60.9 | [60.8, 60.9] | 0.2 | 2103.3 |
| vllm | qwen2.5-1.5b | FP16 | 102.6 | [101.3, 103.9] | 4.4 | 1250.0 |
| tgi | llama3.2-1b | FP16 | 125.2 | [124.5, 125.9] | 1.9 | 1023.0 |
| tgi | llama3.2-3b | FP16 | 49.4 | [48.4, 50.5] | 7.5 | 2602.3 |
| tgi | qwen2.5-1.5b | FP16 | 75.0 | [74.4, 75.5] | 2.5 | 1708.5 |
N=8 Total Throughput (the multi-agent metric that matters)
| Backend | llama3.2-1b | llama3.2-3b | qwen2.5-1.5b |
|---|---|---|---|
| vllm | 559 tok/s | 319 tok/s | 457 tok/s |
| tgi | 483 tok/s | 261 tok/s | 362 tok/s |
| ollama | 248 tok/s | 162 tok/s | 259 tok/s |
| vllm / ollama ratio | 2.25x | 1.97x | 1.76x |
Claim Validation
| # | Claim | Evidence | Status |
|---|---|---|---|
| 1 | Serving stack affects multi-agent scaling | eta(8) ranges 0.16 (Ollama) to 0.66 (TGI); Cohen's d > 3 | Confirmed |
| 2 | vLLM/TGI scale better than Ollama | Total throughput 1.76--2.25x higher at N=8 | Confirmed |
| 3 | Backends follow different scaling laws | Ollama=Amdahl (R^2=0.96+), vLLM/TGI=power law (R^2=0.99+) | Confirmed |
| 4 | Ollama is fastest at N=1 | Q4_0 gives 1.2--2.6x throughput advantage | Confirmed |
| 5 | TTFT is faster on Docker backends | 6--8x faster (22--35 ms vs 163--194 ms) | Confirmed |
| 6 | All backends are fair under contention | Jain's index >= 0.996 at all N | Confirmed |
| 7 | No cold-start effects after warmup | Max ratio 1.07x, no detections | Confirmed |
| 8 | Amdahl serial fraction is valid cross-backend | vLLM/TGI best fit is power law, not Amdahl | Refuted |
Key Decisions for Practitioners
-
For N=1 (single-agent): Use Ollama with Q4_0. It delivers the highest absolute throughput due to quantized weights requiring less memory bandwidth. There is no scheduling overhead to worry about.
-
For N>=4 (multi-agent production): Switch to vLLM. It delivers 1.76--2.25x total throughput at N=8, retains 46--65% per-agent efficiency, and provides 6x faster TTFT. The FP16 precision also means higher output quality.
-
For N=2--3 (light multi-agent): The choice depends on your priority. Ollama still delivers higher per-agent throughput at N=2 in absolute terms (Q4_0 advantage), but vLLM overtakes in total throughput by N=4.
-
For streaming/interactive use: vLLM or TGI regardless of N. TTFT of 22--35 ms vs 163--194 ms is a user-perceptible difference (Cohen's d > 13).
-
Do not use Amdahl serial fractions to compare backends. The metric is only valid when all systems follow Amdahl mechanics. Use raw eta(N) or total throughput instead.
How to Read This Report
| Time | Reading Path |
|---|---|
| 5 min | Abstract -> Executive Summary -> SS19 Conclusions |
| 15 min | Add SS5 (throughput curves), SS6 (efficiency), SS8 (cross-backend comparison) |
| 45 min | Full report, SS1--SS19 + Appendices |
When to Use This Report
| Scenario | How This Report Helps |
|---|---|
| Choosing a serving backend for multi-agent deployment | Serial fraction comparison shows which backend degrades least |
| Deciding between Ollama and vLLM/TGI | Head-to-head baseline + scaling + TTFT data |
| Capacity planning for N concurrent agents | eta(N) curves + saturation points per backend |
| Evaluating whether to switch from Ollama | Quantifies the multi-agent efficiency gap |
| Understanding if GPU or software limits concurrency | Cross-backend serial fractions isolate the bottleneck |
Table of Contents
- SS1. Introduction and Motivation
- SS2. Methodology
- SS3. Phase 1 -- Environment Validation
- SS4. Phase 2 -- Single-Agent Baseline
- SS5. Phase 3 -- N-Agent Throughput Curves
- SS6. Efficiency Curves eta(N)
- SS7. Scaling Law Fitting
- SS8. Cross-Backend Serial Fraction Comparison
- SS9. Saturation Detection
- SS10. Fairness Analysis
- SS11. Phase 4 -- TTFT Comparison
- SS12. Queue Dynamics
- SS13. VRAM Usage
- SS14. TR129 Cross-Validation
- SS15. Cold-Start Detection
- SS16. Outlier Analysis
- SS17. Backend-Native Metrics
- SS17b. Statistical Power and Distribution Analysis
- SS18. Limitations and Future Work
- SS19. Conclusions
- Appendix A: Configuration
- Appendix B: GPU Telemetry
- Appendix C: Data Summary
- Appendix D: Glossary
- References
SS1. Introduction and Motivation
SS1.1 Background
Multi-agent LLM systems deploy N autonomous agents that concurrently issue inference requests to a shared serving backend. TR129 established that Ollama exhibits Amdahl serial fractions s=0.39--0.54 on an RTX 4080 Laptop GPU, meaning that multi-agent efficiency drops substantially under concurrency.
But this finding leaves a critical question unanswered: is the serial bottleneck in Ollama's request scheduling, or in the GPU hardware itself?
If the serial fraction is a property of the GPU (memory bandwidth, compute pipeline serialization), then no software optimization will help. If it is a property of the serving stack (request queuing, KV-cache management, batch scheduling), then switching backends could dramatically improve multi-agent throughput.
SS1.2 Experimental Design
TR130 isolates the variable: the serving stack. All other factors are held constant:
| Factor | Controlled? | Value |
|---|---|---|
| GPU hardware | Yes | RTX 4080 Laptop 12 GB |
| Models | Yes | llama3.2-1b, qwen2.5-1.5b, llama3.2-3b |
| Workload pattern | Yes | Closed-loop, 128 max tokens |
| Concurrency levels | Yes | N={1,2,4,8} |
| Serving backend | Variable | Ollama, vLLM, TGI |
| Quantization | Partially | Ollama=Q4_0, vLLM/TGI=FP16 |
The quantization difference (Q4_0 vs FP16) affects absolute throughput but not the scaling efficiency eta(N), which normalizes against each backend's own N=1 baseline. Serial fraction comparisons remain valid.
SS1.3 Research Questions
TR130 is designed to answer five specific questions:
- Q1: Does the serving stack affect multi-agent scaling efficiency? If eta(N) differs across backends on the same GPU, the software matters.
- Q2: Which backend delivers the most total throughput at high concurrency? The metric that matters for production: N x per_agent_tps at N=8.
- Q3: Do all backends follow the same scaling law? If one backend follows Amdahl while another follows power law, the degradation mechanisms are fundamentally different.
- Q4: At what N does the best backend overtake Ollama in total throughput? Ollama starts faster (Q4_0) but may lose the lead under contention.
- Q5: Is TTFT independent of throughput scaling? A backend could be slow in throughput but fast in time-to-first-token, or vice versa.
SS1.4 Literature Gap
Published LLM serving benchmarks (Patel et al. 2024, Kwon et al. 2023) compare backends under open-loop arrival conditions (Poisson arrivals at specified rates). Multi-agent systems are closed-loop: each agent sends one request, waits for completion, then sends the next. This fundamental difference means open-loop benchmarks overestimate queuing depth and underestimate per-request contention. TR130 is the first cross-backend comparison under closed-loop multi-agent workloads on consumer GPU hardware.
SS1.5 Why Three Backends
| Backend | Scheduling | Batching | KV-Cache | Expected Scaling |
|---|---|---|---|---|
| Ollama | Sequential FIFO | None (one request at a time) | Implicit (ggml) | Amdahl -- strict serial fraction |
| vLLM | Continuous batching | Dynamic in-flight batching | PagedAttention (virtual memory) | Sub-linear but graceful -- resource contention |
| TGI | Continuous batching | Token-level scheduling | Paged blocks | Similar to vLLM -- different implementation |
Ollama is the null hypothesis: a simple sequential server. vLLM represents the state of the art in LLM serving efficiency. TGI provides a second continuous-batching implementation to distinguish "continuous batching in general" from "vLLM specifically."
SS2. Methodology
SS2.1 Backends
| Backend | API | Quantization | Deployment | Key Feature |
|---|---|---|---|---|
| Ollama | /api/generate |
Q4_0 | Native Windows | Timing in response (ns) |
| vLLM | /v1/completions |
FP16 | Docker GPU | PagedAttention, continuous batching |
| TGI | /generate |
FP16 | Docker GPU | details=true for per-request timing |
Only one backend runs at a time. Between backend switches, the previous server is fully stopped and the GPU is allowed to cool.
SS2.2 Metrics
| Metric | Formula | Availability |
|---|---|---|
effective_tps |
completion_tokens / wall_ms * 1000 |
All backends |
gpu_tokens_per_s |
completion_tokens / decode_ms * 1000 |
Ollama, TGI |
prefill_ms |
Backend-native prefill time | Ollama, TGI |
decode_ms |
Backend-native decode time | Ollama, TGI |
ttft_ms |
Time to first token (streaming) | All backends |
effective_tps is the primary metric -- it captures the throughput each agent actually experiences, including all queue wait, scheduling overhead, and network latency.
SS2.3 Statistical Methods
- 95% CI via t-distribution (per-backend x per-model)
- Bootstrap CIs (1,000 resamples) on Amdahl serial fractions
- Shapiro-Wilk normality testing on wall_ms distributions
- Cohen's d for cross-backend pairwise effect sizes
- Curve fitting: Amdahl, power law, exponential, logistic (4 models)
SS2.4 Four Phases
| Phase | Purpose | Approximate Rows |
|---|---|---|
| P1: Validation | Confirm Docker GPU, API format, model loading | ~27 |
| P2: Baseline | N=1 reference throughput per backend | ~450 |
| P3: Scaling | N={1,2,4,8} closed-loop agents (CORE) | ~4,050 |
| P4: TTFT | Streaming time-to-first-token | ~270 |
SS3. Phase 1 -- Environment Validation
Phase 1 sent 27 validation requests across all backend x model combinations to confirm:
- Docker GPU passthrough works for vLLM and TGI containers
- Each model loads and generates coherent text
- API response parsing extracts correct token counts and timing
- Timing fields match expected availability per backend
All backend x model combinations that passed validation proceeded to Phase 2. Failed combinations were skipped with logged errors.
SS4. Phase 2 -- Single-Agent Baseline
SS4.1 Absolute Throughput (N=1)
| Backend | Model | Quant | N | Mean TPS | 95% CI | CV% | Wall ms |
|---|---|---|---|---|---|---|---|
| ollama | llama3.2-1b | Q4_0 | 1 | 177.7 | [175.0, 180.4] | 5.3 | 678.9 |
| ollama | llama3.2-3b | Q4_0 | 1 | 130.1 | [129.7, 130.4] | 1.0 | 984.2 |
| ollama | qwen2.5-1.5b | Q4_0 | 1 | 198.3 | [196.5, 200.2] | 3.3 | 637.9 |
| tgi | llama3.2-1b | FP16 | 1 | 125.2 | [124.5, 125.9] | 1.9 | 1023.0 |
| tgi | llama3.2-3b | FP16 | 1 | 49.4 | [48.4, 50.5] | 7.5 | 2602.3 |
| tgi | qwen2.5-1.5b | FP16 | 1 | 75.0 | [74.4, 75.5] | 2.5 | 1708.5 |
| vllm | llama3.2-1b | FP16 | 1 | 150.7 | [149.8, 151.6] | 2.0 | 849.7 |
| vllm | llama3.2-3b | FP16 | 1 | 60.9 | [60.8, 60.9] | 0.2 | 2103.3 |
| vllm | qwen2.5-1.5b | FP16 | 1 | 102.6 | [101.3, 103.9] | 4.4 | 1250.0 |
SS4.2 Observations
-
Ollama is 1.2--2.6x faster at N=1. The Q4_0 quantization advantage is consistent across all models. For llama3.2-1b: Ollama 177.7 vs vLLM 150.7 vs TGI 125.2 tok/s. The ratio is largest for llama3.2-3b (130.1/49.4 = 2.63x) where FP16 weights strain the 12 GB VRAM.
-
vLLM is 1.2--1.4x faster than TGI at N=1. Both serve FP16, so this gap reflects implementation efficiency: vLLM's optimized CUDA kernels and PagedAttention overhead is lower than TGI's at zero contention. For llama3.2-1b: 150.7 vs 125.2 tok/s (1.20x). For qwen2.5-1.5b: 102.6 vs 75.0 tok/s (1.37x).
-
vLLM llama3.2-3b has near-zero variance (CV=0.2%). This is remarkably consistent -- 50 requests spanning only 22 ms range (2097--2120 ms). This suggests vLLM's scheduler produces deterministic timing when there is no contention, likely because PagedAttention eliminates memory fragmentation randomness.
-
TGI llama3.2-3b has the highest variance (CV=7.5%). The same model on TGI shows 37x more relative variance than vLLM. Combined with TGI's non-normal wall_ms distribution (Shapiro-Wilk p < 0.001), this suggests occasional scheduling hiccups even at N=1.
-
Ollama exposes native prefill/decode timing. Ollama's llama3.2-1b shows prefill=7.1 ms, decode=459.5 ms at N=1, meaning only 1.5% of GPU time is prefill. The remaining 212 ms gap between (prefill+decode)=467 ms and wall_ms=679 ms is Ollama's HTTP/scheduling overhead -- 31% of total request time at N=1.
SS4.3 Cross-Backend Interpretation
Ollama serves Q4_0 quantized weights, which are ~4x smaller than FP16. This means Ollama has lower memory bandwidth pressure (less data to transfer per token), lower compute requirements (INT4 ops vs FP16 ops), and correspondingly higher absolute tok/s at N=1. This is expected and correct.
Why the baseline difference does not affect scaling comparison: Amdahl's eta(N) = TPS(N) / TPS(1) normalizes each backend against its own N=1 reference. A backend with 50 tok/s at N=1 and 25 tok/s at N=2 has the same eta(2)=0.5 as one with 100 tok/s at N=1 and 50 tok/s at N=2. However, for total throughput comparisons (the metric practitioners care about), the baseline matters -- Ollama's Q4_0 head start must be overcome by the competing backend's superior scaling before switching is worthwhile.
Why vLLM beats TGI at N=1: Both use continuous batching, but at N=1 this doesn't matter (there's nothing to batch). The gap likely reflects vLLM's more optimized CUDA graph execution and lower Python-side overhead in the critical path. TGI's Rust-based router adds a layer that may introduce small latencies under zero contention.
SS5. Phase 3 -- N-Agent Throughput Curves
SS5.1 Per-Agent Throughput vs N
| Backend | Model | N | Per-Agent TPS | 95% CI | Total TPS | Wall ms |
|---|---|---|---|---|---|---|
| ollama | llama3.2-1b | 1 | 175.3 | [172.4, 178.2] | 175.3 | 695 |
| ollama | llama3.2-1b | 2 | 140.2 | [134.8, 145.7] | 280.4 | 886 |
| ollama | llama3.2-1b | 4 | 64.8 | [62.8, 66.8] | 259.1 | 1855 |
| ollama | llama3.2-1b | 8 | 31.0 | [30.1, 31.9] | 247.8 | 3986 |
| ollama | llama3.2-3b | 1 | 130.2 | [129.7, 130.6] | 130.2 | 983 |
| ollama | llama3.2-3b | 2 | 88.1 | [86.7, 89.6] | 176.2 | 1461 |
| ollama | llama3.2-3b | 4 | 41.6 | [41.1, 42.1] | 166.5 | 3075 |
| ollama | llama3.2-3b | 8 | 20.2 | [19.9, 20.6] | 162.0 | 6361 |
| ollama | qwen2.5-1.5b | 1 | 196.1 | [193.0, 199.2] | 196.1 | 640 |
| ollama | qwen2.5-1.5b | 2 | 150.6 | [146.0, 155.3] | 301.2 | 864 |
| ollama | qwen2.5-1.5b | 4 | 68.6 | [67.3, 70.0] | 274.5 | 1858 |
| ollama | qwen2.5-1.5b | 8 | 32.4 | [32.0, 32.8] | 259.4 | 3947 |
| tgi | llama3.2-1b | 1 | 121.7 | [119.6, 123.8] | 121.7 | 1054 |
| tgi | llama3.2-1b | 2 | 99.6 | [97.4, 101.9] | 199.3 | 1294 |
| tgi | llama3.2-1b | 4 | 82.3 | [80.4, 84.3] | 329.4 | 1579 |
| tgi | llama3.2-1b | 8 | 60.4 | [58.6, 62.2] | 483.2 | 2220 |
| tgi | llama3.2-3b | 1 | 47.2 | [47.0, 47.4] | 47.2 | 2711 |
| tgi | llama3.2-3b | 2 | 42.5 | [42.1, 42.9] | 85.0 | 3017 |
| tgi | llama3.2-3b | 4 | 38.3 | [37.9, 38.8] | 153.4 | 3349 |
| tgi | llama3.2-3b | 8 | 32.6 | [32.1, 33.1] | 260.7 | 3976 |
| tgi | qwen2.5-1.5b | 1 | 76.5 | [76.1, 76.9] | 76.5 | 1674 |
| tgi | qwen2.5-1.5b | 2 | 65.3 | [64.3, 66.4] | 130.6 | 1966 |
| tgi | qwen2.5-1.5b | 4 | 55.8 | [54.9, 56.7] | 223.1 | 2311 |
| tgi | qwen2.5-1.5b | 8 | 45.3 | [44.3, 46.3] | 362.4 | 2897 |
| vllm | llama3.2-1b | 1 | 149.1 | [147.0, 151.1] | 149.1 | 860 |
| vllm | llama3.2-1b | 2 | 123.7 | [119.3, 128.0] | 247.3 | 1055 |
| vllm | llama3.2-1b | 4 | 93.3 | [89.9, 96.7] | 373.3 | 1414 |
| vllm | llama3.2-1b | 8 | 69.9 | [67.6, 72.2] | 559.2 | 1928 |
| vllm | llama3.2-3b | 1 | 60.7 | [60.4, 61.0] | 60.7 | 2110 |
| vllm | llama3.2-3b | 2 | 54.6 | [53.8, 55.4] | 109.2 | 2351 |
| vllm | llama3.2-3b | 4 | 47.4 | [46.7, 48.2] | 189.7 | 2720 |
| vllm | llama3.2-3b | 8 | 39.8 | [39.1, 40.5] | 318.6 | 3279 |
| vllm | qwen2.5-1.5b | 1 | 104.7 | [103.3, 106.0] | 104.7 | 1224 |
| vllm | qwen2.5-1.5b | 2 | 88.8 | [86.7, 90.9] | 177.6 | 1454 |
| vllm | qwen2.5-1.5b | 4 | 72.8 | [71.0, 74.5] | 291.0 | 1788 |
| vllm | qwen2.5-1.5b | 8 | 57.1 | [55.6, 58.6] | 456.7 | 2326 |
SS5.2 Observations
-
Ollama total throughput PEAKS at N=2 and then declines. For llama3.2-1b: N=1->175, N=2->280, N=4->259, N=8->248 tok/s. The system actually delivers less total throughput at N=8 than at N=2. This is the signature of severe serialization -- adding agents beyond 2 creates more queue wait than it adds productive GPU time.
-
vLLM total throughput grows monotonically through N=8. For llama3.2-1b: 149->247->373->559 tok/s. Each doubling of N adds meaningful throughput, suggesting the system has not yet reached its scaling ceiling. Continuous batching enables the GPU to overlap compute across concurrent requests, converting queued requests into productive parallelism.
-
TGI follows the same monotonic pattern as vLLM but at lower absolute throughput. For llama3.2-1b: 122->199->329->483 tok/s. TGI at N=8 (483) is roughly comparable to vLLM at N=4 (373), suggesting TGI is approximately one concurrency level behind vLLM in terms of aggregate efficiency.
-
The vLLM advantage grows with model size. At N=8, the vLLM/Ollama total throughput ratio is 2.25x for llama3.2-1b, 1.97x for llama3.2-3b, and 1.76x for qwen2.5-1.5b. Larger FP16 models consume more VRAM, but vLLM's PagedAttention manages this memory more efficiently under contention than Ollama's implicit KV-cache.
-
Per-agent throughput at N=8 reveals the scheduling difference. Ollama: 31.0 tok/s per agent (each agent waits while 7 others are served sequentially). vLLM: 69.9 tok/s per agent (requests overlap via continuous batching). The 2.26x per-agent gap means each vLLM agent experiences half the latency of an Ollama agent.
-
Wall-clock latency progression confirms serial vs parallel. Ollama llama3.2-1b: 695->886->1855->3986 ms (roughly Nx the N=1 latency, confirming sequential). vLLM: 860->1055->1414->1928 ms (sub-linear growth, confirming overlapped execution). At N=8, an Ollama request takes 3.99 seconds vs vLLM's 1.93 seconds.
SS5.3 The Crossover Point
For production planning, the critical question is: at what N does vLLM's total throughput exceed Ollama's, despite Ollama's Q4_0 head start?
| Model | Ollama N=2 Total | vLLM N=2 Total | Crossover Region |
|---|---|---|---|
| llama3.2-1b | 280.4 | 247.3 | Between N=2 and N=4 |
| llama3.2-3b | 176.2 | 109.2 | Between N=2 and N=4 |
| qwen2.5-1.5b | 301.2 | 177.6 | Between N=2 and N=4 |
At N=2, Ollama still leads thanks to Q4_0 throughput. By N=4, vLLM overtakes for llama3.2-1b (373.3 vs 259.1) and qwen2.5-1.5b (291.0 vs 274.5). The crossover occurs near N=3 -- the point at which continuous batching's parallelism advantage overcomes quantization's throughput advantage. For deployments with 3+ agents, vLLM is the strictly dominant choice.
SS6. Efficiency Curves eta(N)
SS6.1 Definition
eta(N) = effective_tps(N) / effective_tps(1)
This is the per-agent efficiency: the fraction of N=1 throughput that each agent retains when sharing the GPU with N-1 other agents. eta(1) = 1.0 by definition; eta(N) < 1 for N > 1.
SS6.2 Efficiency Table
| Backend | Model | N | eta(N) | Per-Agent TPS | Baseline TPS |
|---|---|---|---|---|---|
| ollama | llama3.2-1b | 1 | 0.987 | 175.3 | 177.7 |
| ollama | llama3.2-1b | 2 | 0.789 | 140.2 | 177.7 |
| ollama | llama3.2-1b | 4 | 0.365 | 64.8 | 177.7 |
| ollama | llama3.2-1b | 8 | 0.174 | 31.0 | 177.7 |
| ollama | llama3.2-3b | 1 | 1.001 | 130.2 | 130.1 |
| ollama | llama3.2-3b | 2 | 0.678 | 88.1 | 130.1 |
| ollama | llama3.2-3b | 4 | 0.320 | 41.6 | 130.1 |
| ollama | llama3.2-3b | 8 | 0.156 | 20.2 | 130.1 |
| ollama | qwen2.5-1.5b | 1 | 0.989 | 196.1 | 198.3 |
| ollama | qwen2.5-1.5b | 2 | 0.759 | 150.6 | 198.3 |
| ollama | qwen2.5-1.5b | 4 | 0.346 | 68.6 | 198.3 |
| ollama | qwen2.5-1.5b | 8 | 0.164 | 32.4 | 198.3 |
| tgi | llama3.2-1b | 1 | 0.972 | 121.7 | 125.2 |
| tgi | llama3.2-1b | 2 | 0.796 | 99.6 | 125.2 |
| tgi | llama3.2-1b | 4 | 0.658 | 82.3 | 125.2 |
| tgi | llama3.2-1b | 8 | 0.483 | 60.4 | 125.2 |
| tgi | llama3.2-3b | 1 | 0.955 | 47.2 | 49.4 |
| tgi | llama3.2-3b | 2 | 0.859 | 42.5 | 49.4 |
| tgi | llama3.2-3b | 4 | 0.775 | 38.3 | 49.4 |
| tgi | llama3.2-3b | 8 | 0.659 | 32.6 | 49.4 |
| tgi | qwen2.5-1.5b | 1 | 1.020 | 76.5 | 75.0 |
| tgi | qwen2.5-1.5b | 2 | 0.871 | 65.3 | 75.0 |
| tgi | qwen2.5-1.5b | 4 | 0.744 | 55.8 | 75.0 |
| tgi | qwen2.5-1.5b | 8 | 0.604 | 45.3 | 75.0 |
| vllm | llama3.2-1b | 1 | 0.989 | 149.1 | 150.7 |
| vllm | llama3.2-1b | 2 | 0.820 | 123.7 | 150.7 |
| vllm | llama3.2-1b | 4 | 0.619 | 93.3 | 150.7 |
| vllm | llama3.2-1b | 8 | 0.464 | 69.9 | 150.7 |
| vllm | llama3.2-3b | 1 | 0.997 | 60.7 | 60.9 |
| vllm | llama3.2-3b | 2 | 0.897 | 54.6 | 60.9 |
| vllm | llama3.2-3b | 4 | 0.779 | 47.4 | 60.9 |
| vllm | llama3.2-3b | 8 | 0.654 | 39.8 | 60.9 |
| vllm | qwen2.5-1.5b | 1 | 1.020 | 104.7 | 102.6 |
| vllm | qwen2.5-1.5b | 2 | 0.866 | 88.8 | 102.6 |
| vllm | qwen2.5-1.5b | 4 | 0.709 | 72.8 | 102.6 |
| vllm | qwen2.5-1.5b | 8 | 0.556 | 57.1 | 102.6 |
SS6.3 Observations
Observation 1 -- Ollama collapses by N=4, vLLM/TGI hold through N=8. Ollama's eta drops below 0.50 by N=4 for all three models (0.320--0.365), meaning each agent retains less than a third of its standalone throughput. vLLM/TGI remain above 0.60 at N=4 (0.619--0.779) and still above 0.46 at N=8. The gap is not marginal -- it is a 3--4x difference in scaling efficiency at N=8.
Observation 2 -- The three Ollama models degrade nearly identically. At N=8: eta=0.174 (1B), 0.156 (3B), 0.164 (1.5B). The spread is only 0.018 -- less than 2 percentage points. This model-invariance confirms that Ollama's efficiency loss is dominated by the serving scheduler, not model-specific compute characteristics. If the bottleneck were memory bandwidth (which scales with model size), the 3B model would degrade faster than the 1B.
Observation 3 -- vLLM/TGI efficiency is model-size-dependent. Unlike Ollama, the 3B model retains more efficiency than the 1B at N=8: vLLM eta=0.654 (3B) vs 0.464 (1B); TGI eta=0.659 (3B) vs 0.483 (1B). This inverted pattern reveals that continuous batching amortizes per-request scheduling overhead more effectively when requests are longer (larger model = longer decode per token = more time for the scheduler to batch the next request). The 1B model's fast decode leaves less slack for batch formation.
Observation 4 -- Efficiency drop 1->2 is the diagnostic moment. Ollama loses 21--32% of efficiency going from N=1 to N=2 (eta drops from ~1.0 to 0.68--0.79). vLLM/TGI lose only 3--14% (eta drops from ~1.0 to 0.82--0.90). A backend that struggles at N=2 -- when only one other request competes -- exposes scheduling-level serialization. The N=1->2 gap predicts the entire curve shape.
Observation 5 -- Per-agent TPS at N=8 tells the story. Ollama: 20--32 tok/s per agent. vLLM: 40--70 tok/s per agent. TGI: 33--60 tok/s per agent. For a user running 8 agents simultaneously, each Ollama agent produces text at roughly the speed of a slow typist. vLLM agents are 2--3.5x faster individually, and 2.25x faster in aggregate (since there are 8 of each).
SS6.4 What the Efficiency Curves Reveal
The answer to the central question of TR130 is visible in this table alone, before any curve fitting. If all backends showed similar eta(N) curves, the serial bottleneck would be in the GPU hardware. Instead, the backends diverge dramatically -- Ollama's eta=0.16 vs vLLM's eta=0.56 at N=8 -- proving that the serving stack is the bottleneck. The GPU is capable of much higher concurrent throughput than Ollama allows.
The mechanism: Ollama processes requests sequentially (one at a time on the GPU), so N concurrent agents form a queue where each waits for the others. vLLM/TGI batch multiple requests into a single GPU kernel launch, so N agents partially overlap rather than queuing. The efficiency table quantifies this difference exactly.
SS7. Scaling Law Fitting
SS7.1 Four-Model Comparison
| Backend | Model | Best Fit | R^2 | Amdahl s | Amdahl R^2 | Power alpha | Exp beta |
|---|---|---|---|---|---|---|---|
| ollama | llama3.2-1b | exponential | 0.984 | 0.5329 | 0.957 | 0.681 | 0.289 |
| ollama | llama3.2-3b | amdahl | 0.987 | 0.3920 | 0.987 | 0.782 | 0.347 |
| ollama | qwen2.5-1.5b | exponential | 0.987 | 0.4918 | 0.965 | 0.713 | 0.307 |
| tgi | llama3.2-1b | power_law | 0.989 | 0.8274 | 0.962 | 0.317 | 0.102 |
| tgi | llama3.2-3b | power_law | 0.988 | 0.9146 | 0.843 | 0.171 | 0.052 |
| tgi | qwen2.5-1.5b | power_law | 0.996 | 0.8960 | 0.973 | 0.245 | 0.075 |
| vllm | llama3.2-1b | power_law | 0.989 | 0.8125 | 0.987 | 0.353 | 0.116 |
| vllm | llama3.2-3b | power_law | 0.988 | 0.9168 | 0.975 | 0.197 | 0.061 |
| vllm | qwen2.5-1.5b | power_law | 0.993 | 0.8748 | 0.985 | 0.282 | 0.089 |
SS7.2 Observations
Observation 1 -- Ollama follows Amdahl's Law; vLLM/TGI follow power law. This is the most important finding in the table. Ollama's best-fit model is Amdahl or exponential (both consistent with fixed serialization), with Amdahl R^2=0.957--0.987. All six vLLM/TGI fits select power law as best fit (R^2=0.988--0.996). The backends obey fundamentally different scaling laws because they have fundamentally different scheduling architectures.
Observation 2 -- Amdahl R^2 is poor for vLLM/TGI. When Amdahl is force-fitted to TGI-3B, R^2=0.843 -- a terrible fit. For vLLM-1B, Amdahl R^2=0.987 looks acceptable only because the power law happens to be close to Amdahl at small N. The key diagnostic: Amdahl predicts eta(8)->1/(s+7(1-s)) which, for s=0.81, gives eta=0.15. The actual vLLM-1B eta(8)=0.464 -- 3x higher than Amdahl's prediction. The force-fitted Amdahl "serial fraction" is meaningless for these backends.
Observation 3 -- Power law exponents reveal the degradation rate. For vLLM/TGI, eta(N) propto N?alpha. The exponent alpha ranges from 0.171 (TGI-3B) to 0.353 (vLLM-1B). Smaller alpha = slower degradation = better scaling. The 3B models have the smallest alpha across both backends (0.171--0.197), confirming Observation 3 in SS6: larger models scale better under continuous batching because longer decode times provide more scheduling slack.
Observation 4 -- Ollama's serial fractions match TR129. Ollama-3B: s=0.392 (TR130) vs s=0.39 (TR129). Ollama-1.5B: s=0.492 (TR130) vs s=0.54 (TR129). The agreement within bootstrap CIs validates both experiments and confirms that Ollama's Amdahl-like behavior is reproducible.
Observation 5 -- Exponential beats Amdahl for the 1B model on Ollama. For llama3.2-1b, exponential (R^2=0.984) edges out Amdahl (R^2=0.957). The exponential model eta(N) propto e^{-betaN} predicts faster-than-Amdahl collapse at high N, suggesting that the 1B model's rapid decode (177.7 tok/s at N=1) intensifies scheduling contention beyond what a fixed serial fraction captures. At very high throughput, even Ollama's sequential scheduling may experience additional bottlenecks (e.g., CPU-side tokenization, HTTP overhead).
SS7.3 Why Different Backends Obey Different Laws
Amdahl's Law (eta = 1/(s + (1-s)N)) assumes a fixed serial fraction: some fraction s of the workload is strictly sequential, and the rest can overlap. Ollama's sequential request scheduling is a textbook Amdahl system -- it processes one request at a time, so the serial component is the scheduling granularity. The serial fraction s captures the ratio of scheduling overhead to total request time.
Power law (eta propto N?alpha) has no serial fraction. Instead, each additional agent causes a multiplicative slowdown that compounds, producing a straight line on a log-log plot. This matches continuous batching: adding a request to an in-progress batch has a cost proportional to the current batch size (attention computation grows, memory bandwidth is shared), not a fixed cost. The degradation is gradual and continuous, not threshold-driven.
The implication: Amdahl serial fractions are valid for Ollama and should be used for capacity planning with Ollama deployments. They are not valid for vLLM/TGI -- the power law exponent alpha is the correct characterization. Comparing Amdahl s values across backends with different best-fit models is a category error, addressed in SS8.
SS7.4 Amdahl's Law Definition
Amdahl's Law: eta(N) = 1 / (s + (1-s)*N)
The serial fraction s represents the fraction of the inference pipeline that cannot be overlapped across concurrent requests. Sources of serialization include:
- GPU compute serialization: CUDA kernel launches are serialized, limiting how many requests can execute simultaneously
- Memory bandwidth contention: All requests share the same HBM/GDDR bandwidth for KV-cache reads and weight fetches
- Request scheduling overhead: The serving stack's scheduler adds latency when deciding which request to execute next
- KV-cache management: Allocating, copying, and freeing KV-cache blocks requires synchronization
vLLM's PagedAttention and continuous batching are designed to minimize the last two sources.
SS8. Cross-Backend Serial Fraction Comparison
SS8.1 Core Question
Is the Amdahl serial fraction s=0.39--0.54 (from TR129) an Ollama problem or a GPU physics problem?
SS8.2 Bootstrap Serial Fraction CIs
Each serial fraction below is estimated via 1,000 bootstrap resamples of the efficiency curve, providing robust confidence intervals.
llama3.2-1b
| Backend | s (mean) | s (median) | 95% CI | Std |
|---|---|---|---|---|
| ollama | 0.5146 | 0.5329 | [0.3233, 0.7329] | 0.1101 |
| tgi | 0.8159 | 0.8274 | [0.7437, 0.8468] | 0.0604 |
| vllm | 0.8076 | 0.8108 | [0.7811, 0.8348] | 0.0351 |
llama3.2-3b
| Backend | s (mean) | s (median) | 95% CI | Std |
|---|---|---|---|---|
| ollama | 0.3781 | 0.3920 | [0.2253, 0.5242] | 0.0831 |
| tgi | 0.9034 | 0.9132 | [0.8361, 0.9261] | 0.0492 |
| vllm | 0.9092 | 0.9168 | [0.8858, 0.9245] | 0.0582 |
qwen2.5-1.5b
| Backend | s (mean) | s (median) | 95% CI | Std |
|---|---|---|---|---|
| ollama | 0.4698 | 0.4918 | [0.2691, 0.6833] | 0.1119 |
| tgi | 0.8898 | 0.8960 | [0.8524, 0.9065] | 0.0451 |
| vllm | 0.8694 | 0.8748 | [0.8446, 0.8861] | 0.0435 |
SS8.3 Pairwise Comparisons
| Model | Backend A | Backend B | s_A | s_B | Diff | Cohen's d | Effect | CIs Overlap |
|---|---|---|---|---|---|---|---|---|
| llama3.2-1b | ollama | tgi | 0.5146 | 0.8159 | -0.3013 | -3.39 | large | NO |
| llama3.2-1b | ollama | vllm | 0.5146 | 0.8076 | -0.2930 | -3.59 | large | NO |
| llama3.2-1b | tgi | vllm | 0.8159 | 0.8076 | 0.0083 | 0.17 | negligible | yes |
| llama3.2-3b | ollama | tgi | 0.3781 | 0.9034 | -0.5253 | -7.69 | large | NO |
| llama3.2-3b | ollama | vllm | 0.3781 | 0.9092 | -0.5311 | -7.40 | large | NO |
| llama3.2-3b | tgi | vllm | 0.9034 | 0.9092 | -0.0058 | -0.11 | negligible | yes |
| qwen2.5-1.5b | ollama | tgi | 0.4698 | 0.8898 | -0.4200 | -4.92 | large | NO |
| qwen2.5-1.5b | ollama | vllm | 0.4698 | 0.8694 | -0.3996 | -4.71 | large | NO |
| qwen2.5-1.5b | tgi | vllm | 0.8898 | 0.8694 | 0.0204 | 0.46 | small | yes |
SS8.4 Aggregate Ranking
| Rank | Backend | Mean s | Min s | Max s |
|---|---|---|---|---|
| 1 | ollama | 0.4542 | 0.3781 | 0.5146 |
| 2 | vllm | 0.8621 | 0.8076 | 0.9092 |
| 3 | tgi | 0.8697 | 0.8159 | 0.9034 |
SS8.5 Observations
Observation 1 -- The Amdahl serial fractions tell a paradoxical story that reveals their invalidity for cross-backend comparison. The data says Ollama has the lowest serial fraction (s=0.38--0.51) while vLLM/TGI have the highest (s=0.81--0.92). Naively, this means Ollama scales best. But we know from SS6 that Ollama's eta(8)=0.16 while vLLM's eta(8)=0.56 -- Ollama scales worst. The paradox resolves when we recognize that Amdahl's model genuinely fits Ollama but is a bad model for vLLM/TGI (SS7). Force-fitting an inappropriate model produces meaningless parameters.
Observation 2 -- The bootstrap CIs confirm the meaninglessness of cross-backend comparison. Ollama's CIs are wide ([0.23, 0.73] for 1B) because 4 data points with noise allow Amdahl to fit a range of s values. vLLM/TGI's CIs are tight ([0.78, 0.83] for vLLM-1B) not because the fit is good, but because the power-law curve is locally well-approximated by a specific Amdahl curve at these N values. The tightness is an artifact, not evidence.
Observation 3 -- The pairwise comparisons are statistically sound but semantically wrong. Cohen's d values of 3--8 with non-overlapping CIs correctly indicate that the Amdahl fits produce different parameters across backends. But the question is not whether the parameters differ -- it is whether the parameter means the same thing across backends. It does not. Ollama's s=0.45 means "45% of request time is sequential scheduling." vLLM's s=0.86 means "the power-law curve happens to intersect Amdahl at this parameter value."
Observation 4 -- TGI ~ vLLM in the Amdahl frame. The pairwise TGI-vs-vLLM comparisons show negligible-to-small effects (Cohen's d = 0.11--0.46). This is the one valid comparison: both backends follow power law, so their Amdahl fits are equally wrong in the same way. The similar s values (0.86--0.87 mean) reflect similar scaling behavior, which is genuine -- both use continuous batching.
SS8.6 Answer to the Core Question
Is the Amdahl serial fraction s=0.39--0.54 an Ollama problem or a GPU physics problem?
It is an Ollama problem. But the answer comes not from comparing serial fractions (which is a category error), but from comparing raw efficiency:
| Metric | Ollama | vLLM | TGI |
|---|---|---|---|
| eta(8) range | 0.156--0.174 | 0.464--0.654 | 0.483--0.659 |
| Total TPS at N=8 (1B model) | 248.0 | 559.2 | 483.2 |
| Best-fit law | Amdahl | Power | Power |
| N* (saturation) | 4 (all models) | 8 or >8 | 8 or >8 |
The GPU can deliver 3--4x better scaling than Ollama allows. vLLM/TGI prove this by achieving eta(8)=0.46--0.66 on the same GPU, same models, same workload. The bottleneck is Ollama's sequential request scheduling, not GPU memory bandwidth or compute serialization.
Mechanistic explanation: Ollama processes requests one at a time. When 8 agents send concurrent requests, 7 wait in a queue while 1 executes. vLLM batches all 8 into a single GPU kernel, sharing the attention computation. The cost of concurrent batched attention scales sub-linearly (power law), not linearly (Amdahl). This is why Ollama degrades 6x faster than vLLM at N=8.
SS8.7 What Should Practitioners Compare?
Since Amdahl serial fractions are invalid for cross-backend comparison, the correct metrics are:
- eta(N) at target concurrency -- directly answers "how much throughput does each agent retain?"
- Total throughput at target N -- answers "how much total work gets done?"
- N (saturation point)* -- answers "how many agents can I run before efficiency halves?"
- Power law exponent alpha -- for continuous-batching backends, answers "how fast does efficiency degrade?"
All four metrics consistently rank: vLLM > TGI >> Ollama for multi-agent deployments.
SS9. Saturation Detection
N* = the concurrency level where eta drops below 0.50 (each agent retains less than half of its standalone throughput).
| Backend | Model | N* | eta at Max N | Max N Tested |
|---|---|---|---|---|
| ollama | llama3.2-1b | 4 | 0.174 | 8 |
| ollama | llama3.2-3b | 4 | 0.156 | 8 |
| ollama | qwen2.5-1.5b | 4 | 0.164 | 8 |
| tgi | llama3.2-1b | 8 | 0.483 | 8 |
| tgi | llama3.2-3b | None | 0.659 | 8 |
| tgi | qwen2.5-1.5b | None | 0.604 | 8 |
| vllm | llama3.2-1b | 8 | 0.464 | 8 |
| vllm | llama3.2-3b | None | 0.654 | 8 |
| vllm | qwen2.5-1.5b | None | 0.556 | 8 |
SS9.2 Observations
Observation 1 -- Ollama saturates at N=4 for all three models.* By N=4, every Ollama-served model has eta<0.37 -- well below the 0.50 threshold. This is consistent with Amdahl serial fractions of 0.39--0.53: the mathematical prediction from s=0.45 gives eta(4)=1/(0.45+0.55x4)=0.35, matching the data. Practitioners running >3 Ollama agents are operating in the saturated regime where adding agents provides marginal total throughput gain.
Observation 2 -- vLLM/TGI have not saturated at N=8 for 3B and 1.5B models. N*=None means eta never dropped below 0.50 in our measurement range. TGI-3B: eta(8)=0.659, vLLM-3B: eta(8)=0.654 -- both retaining nearly two-thirds of per-agent throughput. Extrapolating the power law fit: vLLM-3B (alpha=0.197) predicts N~13, TGI-3B (alpha=0.171) predicts N~16. These backends can support roughly 4x more concurrent agents than Ollama before reaching the same efficiency threshold.
Observation 3 -- The 1B model saturates first on vLLM/TGI. Both show N*=8 for llama3.2-1b (eta~0.47), while the 3B model remains unsaturated. This reinforces the finding from SS6: the 1B model's fast decode (150 tok/s at N=1) leaves less slack for continuous batching to overlap requests. Faster individual requests paradoxically scale worse under concurrent batching because the scheduler has less time to amortize overhead.
Observation 4 -- The saturation gap is the simplest decision metric. Ollama: N*=4. vLLM: N*=8--13+. TGI: N*=8--16+. A practitioner choosing a backend for multi-agent deployment can compare these numbers directly: at N=6 agents, Ollama is deep in saturation (eta0.22) while vLLM is still efficient (eta0.55). No curve fitting or statistical analysis needed -- N* alone answers the deployment question.
SS10. Fairness Analysis
Jain's Fairness Index: J = (sum(x))^2 / (n x sum(x^2)), where x_i is agent i's mean effective_tps. J=1.0 means perfectly fair (all agents get equal throughput).
| Backend | Model | N | Jain's Index | Agent TPS CV% |
|---|---|---|---|---|
| ollama | llama3.2-1b | 2 | 0.9999 | 1.1 |
| ollama | llama3.2-1b | 4 | 0.9999 | 1.1 |
| ollama | llama3.2-1b | 8 | 0.9986 | 3.8 |
| ollama | llama3.2-3b | 2 | 1.0000 | 0.4 |
| ollama | llama3.2-3b | 4 | 1.0000 | 0.5 |
| ollama | llama3.2-3b | 8 | 0.9994 | 2.4 |
| ollama | qwen2.5-1.5b | 2 | 1.0000 | 0.7 |
| ollama | qwen2.5-1.5b | 4 | 0.9993 | 2.6 |
| ollama | qwen2.5-1.5b | 8 | 0.9999 | 1.0 |
| tgi | llama3.2-1b | 2 | 1.0000 | 0.4 |
| tgi | llama3.2-1b | 4 | 0.9973 | 5.2 |
| tgi | llama3.2-1b | 8 | 0.9960 | 6.3 |
| tgi | llama3.2-3b | 2 | 1.0000 | 0.0 |
| tgi | llama3.2-3b | 4 | 0.9997 | 1.7 |
| tgi | llama3.2-3b | 8 | 0.9994 | 2.4 |
| tgi | qwen2.5-1.5b | 2 | 1.0000 | 0.5 |
| tgi | qwen2.5-1.5b | 4 | 0.9987 | 3.6 |
| tgi | qwen2.5-1.5b | 8 | 0.9966 | 5.8 |
| vllm | llama3.2-1b | 2 | 0.9999 | 1.1 |
| vllm | llama3.2-1b | 4 | 0.9991 | 3.0 |
| vllm | llama3.2-1b | 8 | 0.9960 | 6.3 |
| vllm | llama3.2-3b | 2 | 1.0000 | 0.6 |
| vllm | llama3.2-3b | 4 | 0.9995 | 2.1 |
| vllm | llama3.2-3b | 8 | 0.9991 | 3.1 |
| vllm | qwen2.5-1.5b | 2 | 0.9999 | 0.9 |
| vllm | qwen2.5-1.5b | 4 | 0.9988 | 3.5 |
| vllm | qwen2.5-1.5b | 8 | 0.9995 | 2.3 |
SS10.2 Observations
Observation 1 -- All three backends achieve near-perfect fairness. Every Jain's index value exceeds 0.996. No backend systematically starves any agent. This is a positive result but also a non-differentiator: fairness cannot be used to choose between backends.
Observation 2 -- Ollama's sequential scheduling produces paradoxically good fairness. Ollama's round-robin queue (one request at a time, FIFO) gives each agent exactly the same wait time per cycle. The CV% at N=8 is 1.0--3.8% for Ollama vs 2.3--6.3% for vLLM/TGI. Sequential serving is perfectly fair because it is perfectly serialized -- the same property that kills throughput guarantees equitable distribution of the limited throughput.
Observation 3 -- vLLM/TGI show slightly higher variance at N=8. TGI llama3.2-1b at N=8: CV=6.3%, with per-agent TPS ranging from 51.4 to 65.4 tok/s -- a 27% spread between the slowest and fastest agent. vLLM llama3.2-1b at N=8: range 63.3 to 79.0 tok/s -- a 25% spread. This is a consequence of continuous batching: requests that arrive when the batch is less full get better service. The variation is small enough (Jain's > 0.996) to be operationally irrelevant, but it reveals that continuous batching introduces stochastic unfairness that sequential serving avoids.
Observation 4 -- Fairness is high but throughput is not. The key insight is that fairness != performance. Ollama at N=8 distributes 31 tok/s perfectly equally among 8 agents. vLLM at N=8 distributes 70 tok/s with 6.3% CV. A 6.3% unfairness in 70 tok/s (worst agent gets 63.3) still beats Ollama's perfectly-fair 31 tok/s by 2x.
SS11. Phase 4 -- TTFT Comparison
SS11.1 Time-to-First-Token (N=1)
| Backend | Model | Mean TTFT (ms) | Median | P95 | P99 | CV% |
|---|---|---|---|---|---|---|
| ollama | llama3.2-1b | 175.9 | 173.6 | 192.6 | 197.0 | 5.2 |
| ollama | llama3.2-3b | 194.4 | 192.3 | 219.1 | 220.7 | 6.5 |
| ollama | qwen2.5-1.5b | 163.2 | 162.0 | 185.6 | 194.0 | 7.7 |
| tgi | llama3.2-1b | 21.7 | 19.2 | 32.3 | 57.1 | 42.4 |
| tgi | llama3.2-3b | 34.5 | 30.5 | 58.4 | 71.3 | 31.5 |
| tgi | qwen2.5-1.5b | 24.1 | 23.2 | 30.9 | 36.1 | 13.6 |
| vllm | llama3.2-1b | 22.8 | 21.0 | 36.5 | 46.8 | 28.7 |
| vllm | llama3.2-3b | 32.3 | 29.7 | 50.7 | 68.5 | 29.9 |
| vllm | qwen2.5-1.5b | 29.6 | 27.9 | 42.1 | 55.0 | 22.7 |
SS11.2 Cross-Backend TTFT Comparison
| Model | Backend A | Backend B | TTFT A (ms) | TTFT B (ms) | Diff (ms) | Cohen's d | Effect |
|---|---|---|---|---|---|---|---|
| llama3.2-1b | ollama | tgi | 175.9 | 21.7 | 154.2 | 16.84 | large |
| llama3.2-1b | ollama | vllm | 175.9 | 22.8 | 153.1 | 19.36 | large |
| llama3.2-1b | tgi | vllm | 21.7 | 22.8 | -1.0 | -0.13 | negligible |
| llama3.2-3b | ollama | tgi | 194.4 | 34.5 | 159.9 | 13.57 | large |
| llama3.2-3b | ollama | vllm | 194.4 | 32.3 | 162.2 | 14.44 | large |
| llama3.2-3b | tgi | vllm | 34.5 | 32.3 | 2.2 | 0.22 | small |
| qwen2.5-1.5b | ollama | tgi | 163.2 | 24.1 | 139.2 | 15.13 | large |
| qwen2.5-1.5b | ollama | vllm | 163.2 | 29.6 | 133.6 | 13.23 | large |
| qwen2.5-1.5b | tgi | vllm | 24.1 | 29.6 | -5.6 | -1.06 | large |
SS11.3 Observations
Observation 1 -- vLLM/TGI are 6--8x faster to first token than Ollama. Across all three models, Ollama TTFT is 163--194 ms while vLLM/TGI TTFT is 22--35 ms. The Cohen's d values are enormous (13--19), confirming this is not noise. For an interactive chat application, Ollama has a noticeable 200ms delay before text appears; vLLM/TGI feel instantaneous.
Observation 2 -- Ollama's TTFT is dominated by scheduling overhead, not prefill. Ollama's native prefill_ms for the 1B model is only 7.2 ms (SS17), yet TTFT is 176 ms. The 169 ms gap is pure scheduling + HTTP overhead. vLLM/TGI have similar GPU prefill physics but report TTFT of 22--23 ms -- roughly 3x the compute-only prefill time. Ollama's overhead is ~10x larger than the actual computation.
Observation 3 -- TGI has a slight edge over vLLM for TTFT on small models. For llama3.2-1b: TGI=21.7 ms vs vLLM=22.8 ms (Cohen's d=-0.13, negligible). For qwen2.5-1.5b: TGI=24.1 ms vs vLLM=29.6 ms (Cohen's d=-1.06, large). TGI's generate endpoint with streaming is slightly more optimized for first-token latency than vLLM's OpenAI-compatible completions endpoint. The difference is operationally small (~5 ms) but statistically significant for qwen2.5.
Observation 4 -- vLLM/TGI TTFT distributions are heavily right-skewed. Skewness values: vLLM 3.3--3.5, TGI 2.7--4.2, Ollama 0.6--0.9. vLLM/TGI have occasional outlier TTFTs (P99 up to 68 ms vs median 30 ms) caused by garbage collection, batch boundary effects, or Docker scheduling jitter. Ollama's TTFT is more predictable (low skewness) because the scheduling overhead is constant -- it is consistently slow.
Observation 5 -- Model size has minimal impact on TTFT. For vLLM: 1B=22.8 ms, 1.5B=29.6 ms, 3B=32.3 ms. Going from 1B to 3B parameters (2.7x larger) only increases TTFT by 42%. Prefill is fast for short prompts; the TTFT bottleneck is per-request overhead, not compute.
SS11.4 Practical Implications for Interactive Applications
| Application Type | Recommended Backend | Why |
|---|---|---|
| Streaming chat (single user) | vLLM or TGI | 22 ms TTFT feels instant; Ollama's 176 ms is noticeable |
| Multi-agent orchestration | vLLM | Best total throughput at N>2 AND fast TTFT |
| Batch processing (no streaming) | Ollama at N=1, vLLM at N>2 | TTFT irrelevant; throughput matters |
| Latency-critical API | TGI | Marginally lower median TTFT; tighter P95 for small models |
SS12. Queue Dynamics
| Backend | Model | N | Mean Depth | Max Depth | % at Max |
|---|---|---|---|---|---|
| ollama | llama3.2-1b | 1 | 0.0 | 0 | 100% |
| ollama | llama3.2-1b | 2 | 1.0 | 1 | 97% |
| ollama | llama3.2-1b | 4 | 3.0 | 3 | 98% |
| ollama | llama3.2-1b | 8 | 6.9 | 7 | 97% |
| ollama | llama3.2-3b | 1 | 0.0 | 0 | 100% |
| ollama | llama3.2-3b | 2 | 1.0 | 1 | 98% |
| ollama | llama3.2-3b | 4 | 3.0 | 3 | 98% |
| ollama | llama3.2-3b | 8 | 6.9 | 7 | 97% |
| ollama | qwen2.5-1.5b | 1 | 0.0 | 0 | 100% |
| ollama | qwen2.5-1.5b | 2 | 1.0 | 1 | 98% |
| ollama | qwen2.5-1.5b | 4 | 3.0 | 3 | 98% |
| ollama | qwen2.5-1.5b | 8 | 6.9 | 7 | 97% |
| tgi | llama3.2-1b | 1 | 0.0 | 0 | 100% |
| tgi | llama3.2-1b | 2 | 1.0 | 1 | 98% |
| tgi | llama3.2-1b | 4 | 3.0 | 3 | 97% |
| tgi | llama3.2-1b | 8 | 6.9 | 7 | 96% |
| tgi | llama3.2-3b | 1 | 0.0 | 0 | 100% |
| tgi | llama3.2-3b | 2 | 1.0 | 1 | 98% |
| tgi | llama3.2-3b | 4 | 2.9 | 3 | 97% |
| tgi | llama3.2-3b | 8 | 6.9 | 7 | 97% |
| tgi | qwen2.5-1.5b | 1 | 0.0 | 0 | 100% |
| tgi | qwen2.5-1.5b | 2 | 1.0 | 1 | 98% |
| tgi | qwen2.5-1.5b | 4 | 2.9 | 3 | 97% |
| tgi | qwen2.5-1.5b | 8 | 6.9 | 7 | 97% |
| vllm | llama3.2-1b | 1 | 0.0 | 0 | 100% |
| vllm | llama3.2-1b | 2 | 1.0 | 1 | 98% |
| vllm | llama3.2-1b | 4 | 2.9 | 3 | 95% |
| vllm | llama3.2-1b | 8 | 6.8 | 7 | 95% |
| vllm | llama3.2-3b | 1 | 0.0 | 0 | 100% |
| vllm | llama3.2-3b | 2 | 1.0 | 1 | 98% |
| vllm | llama3.2-3b | 4 | 3.0 | 3 | 98% |
| vllm | llama3.2-3b | 8 | 6.9 | 7 | 97% |
| vllm | qwen2.5-1.5b | 1 | 0.0 | 0 | 100% |
| vllm | qwen2.5-1.5b | 2 | 1.0 | 1 | 98% |
| vllm | qwen2.5-1.5b | 4 | 3.0 | 3 | 98% |
| vllm | qwen2.5-1.5b | 8 | 6.9 | 7 | 97% |
SS12.2 Observations
Observation 1 -- Queue depths are nearly identical across all three backends. At N=8: all backends show mean depth ~ 6.9, max depth = 7, and 95--97% of time at max. This is surprising -- we expected continuous batching backends to show lower queue depths because they process requests faster.
Observation 2 -- The paradox resolves with closed-loop dynamics. In a closed-loop system, each agent immediately submits a new request when the previous one completes. Faster backends (vLLM/TGI) complete requests sooner, causing the agent to resubmit sooner, keeping the queue full. The queue depth is a property of the workload generator (N closed-loop agents), not the backend. This is fundamentally different from open-loop (Poisson arrival) benchmarks where faster backends would indeed show lower queue depths.
Observation 3 -- The slight differences at N=8 are informative. vLLM-1B: 94.6% at max depth vs Ollama-1B: 97.1%. The 2.5% difference means vLLM occasionally drains the queue briefly -- the batch completes fast enough that for a moment, all agents are between requests. This microsecond-level queue drainage is invisible operationally but confirms that vLLM is processing the batch faster than agents can refill it. Ollama never drains the queue (97.1% at max) because sequential processing ensures the queue stays permanently full.
Observation 4 -- Queue depth x completion time = the real metric. The total time to serve 240 requests (8 agents x 30 each) at N=8 reveals the throughput difference hidden by similar queue depths:
| Backend | Model | Total Duration (s) | Mean Inter-Submit (ms) |
|---|---|---|---|
| vllm | llama3.2-1b | 76.9 | 317 |
| tgi | llama3.2-1b | 88.1 | 374 |
| ollama | llama3.2-1b | 129.1 | 522 |
vLLM finishes the same 240 requests 40% faster than Ollama despite showing the same queue depth. The queue is equally "full" for both, but vLLM drains and refills it 1.6x faster.
SS13. VRAM Usage
| Phase | Mean VRAM (MB) | Min | Max |
|---|---|---|---|
| phase1 | 4317 | 148 | 10367 |
| phase2 | 6509 | 148 | 10367 |
| phase3 | 6951 | 148 | 10367 |
| phase4 | 5907 | 148 | 10367 |
SS13.2 Observations
Observation 1 -- The wide VRAM range (148--10,367 MB) reflects backend transitions. The minimum of 148 MB occurs when no model is loaded (between backend switches). The maximum of 10,367 MB (84% of 12,282 MB) occurs when vLLM loads a 3B FP16 model with --gpu-memory-utilization 0.80. The mean increases across phases because Phase 3 runs more concurrent agents, which increases KV-cache allocation.
Observation 2 -- FP16 models on 12 GB VRAM leave thin margins. The llama3.2-3b FP16 model weights consume ~6.4 GB. With vLLM's gpu-memory-utilization=0.80 (allocating ~9.8 GB), only ~3.4 GB remains for KV-cache. At max_model_len=2048, this is sufficient for 8 concurrent requests at short context, but would fail at 4K+ context or N>8. Ollama's Q4_0 model weights for the same model consume ~1.6 GB, leaving 10+ GB for KV-cache -- a significant advantage for memory-constrained deployments.
Observation 3 -- The VRAM data does not differentiate per-backend because measurements were pooled. The GPU monitor polls nvidia-smi at 1s intervals regardless of which backend is running. A per-backend breakdown would require correlating timestamps with backend lifecycle events. This is a limitation of the current instrumentation -- future work should tag VRAM samples with the active backend.
SS14. TR129 Cross-Validation
No matching models between TR129 and TR130 Ollama runs -- the analyzer could not find model name matches because TR129 used different model naming conventions.
SS14.1 Manual Cross-Validation
The automated cross-validation failed on model name matching, but we can compare manually. TR129 measured Ollama Amdahl serial fractions for the same models:
| Model | TR129 s | TR130 s | Agreement |
|---|---|---|---|
| llama3.2-1b | 0.54 | 0.533 | Within 2% |
| qwen2.5-1.5b | 0.42 | 0.492 | Within 7% |
| llama3.2-3b | 0.39 | 0.392 | Within 1% |
Observation 1 -- TR130 reproduces TR129's Ollama serial fractions. The largest discrepancy is qwen2.5-1.5b (0.42 vs 0.49, Delta=0.07), which falls within both experiments' bootstrap confidence intervals. This cross-validation confirms that (a) Ollama's Amdahl behavior is stable across experimental sessions, (b) the Phase 3 measurement protocol produces consistent results, and (c) the serial fraction is a property of the Ollama+GPU system, not measurement noise.
Observation 2 -- The model rank order is preserved. Both TR129 and TR130 find: 3B has the lowest s (best Amdahl scaling), 1B has the highest s (worst Amdahl scaling). The physical explanation from TR129 holds: larger models have longer per-request GPU time relative to fixed scheduling overhead, so the serial fraction (overhead/total) is smaller.
SS15. Cold-Start Detection
| Phase x Backend | Model | First-3 Mean (ms) | Rest Mean (ms) | Ratio | Cold Start? |
|---|---|---|---|---|---|
| p2_baseline_ollama | llama3.2-1b | 657 | 680 | 0.97 | No |
| p2_baseline_ollama | llama3.2-3b | 984 | 984 | 1.00 | No |
| p2_baseline_ollama | qwen2.5-1.5b | 643 | 638 | 1.01 | No |
| p2_baseline_tgi | llama3.2-1b | 1013 | 1024 | 0.99 | No |
| p2_baseline_tgi | llama3.2-3b | 2778 | 2590 | 1.07 | No |
| p2_baseline_tgi | qwen2.5-1.5b | 1676 | 1711 | 0.98 | No |
| p2_baseline_vllm | llama3.2-1b | 851 | 850 | 1.00 | No |
| p2_baseline_vllm | llama3.2-3b | 2106 | 2103 | 1.00 | No |
| p2_baseline_vllm | qwen2.5-1.5b | 1267 | 1249 | 1.01 | No |
| p3_scaling_ollama | llama3.2-1b | 1182 | 2796 | 0.42 | No |
| p3_scaling_ollama | llama3.2-3b | 1653 | 4492 | 0.37 | No |
| p3_scaling_ollama | qwen2.5-1.5b | 1197 | 2769 | 0.43 | No |
| p3_scaling_tgi | llama3.2-1b | 1500 | 1848 | 0.81 | No |
| p3_scaling_tgi | llama3.2-3b | 3171 | 3614 | 0.88 | No |
| p3_scaling_tgi | qwen2.5-1.5b | 2024 | 2529 | 0.80 | No |
| p3_scaling_vllm | llama3.2-1b | 1287 | 1605 | 0.80 | No |
| p3_scaling_vllm | llama3.2-3b | 2509 | 2931 | 0.86 | No |
| p3_scaling_vllm | qwen2.5-1.5b | 1739 | 1995 | 0.87 | No |
SS15.2 Observations
Observation 1 -- No cold-start effects detected in any backend. All ratios are <=1.07, well below the 1.5 threshold. This means Phase 1 warmup requests (3 per backendxmodel combo) were sufficient to eliminate JIT compilation, KV-cache initialization, and CUDA kernel caching effects. The measurement data is clean from the first Phase 2 request onward.
Observation 2 -- Phase 3 shows an inverted pattern: first requests are FASTER. Ollama-3B Phase 3: first 3 = 1,653 ms, rest = 4,492 ms, ratio = 0.37. This is not cold start -- it is the opposite. The first 3 requests in Phase 3 run at N=1 (the first scaling level), which is fast. The "rest" includes N=4 and N=8 data, which is slower due to contention. The cold-start detection heuristic correctly flags this as "No" because it tests ratio > 1.5, not ratio < 0.5.
Observation 3 -- Docker overhead is absorbed by the startup protocol. vLLM and TGI run in Docker containers, which could theoretically add cold-start latency. The wait_ready() polling loop in the backend implementation (up to 300s timeout) ensures the container is fully initialized before any measurement requests are sent. This design decision eliminates what would otherwise be a significant confound in the Docker-based backend measurements.
SS16. Outlier Analysis
IQR-based outlier detection (1.5 x IQR beyond Q1/Q3).
| Backend | Model | N Total | N Outliers | Outlier % | IQR (ms) |
|---|---|---|---|---|---|
| ollama | llama3.2-1b | 533 | 0 | 0.0 | 3132 |
| ollama | llama3.2-3b | 533 | 0 | 0.0 | 4949 |
| ollama | qwen2.5-1.5b | 533 | 0 | 0.0 | 2909 |
| tgi | llama3.2-1b | 517 | 0 | 0.0 | 887 |
| tgi | llama3.2-3b | 500 | 1 | 0.2 | 890 |
| tgi | qwen2.5-1.5b | 490 | 1 | 0.2 | 846 |
| vllm | llama3.2-1b | 533 | 0 | 0.0 | 912 |
| vllm | llama3.2-3b | 533 | 1 | 0.2 | 907 |
| vllm | qwen2.5-1.5b | 533 | 0 | 0.0 | 864 |
SS16.2 Observations
Observation 1 -- Data quality is excellent: 0--0.2% outliers across all 9 backendxmodel combinations. At most 1 outlier per combination, out of 490--533 samples. The scaling law fits, efficiency curves, and cross-backend comparisons are built on clean data without needing robust statistics or trimming.
Observation 2 -- Ollama has zero outliers but massive IQR. The IQR for Ollama (2909--4949 ms) is 3--6x larger than vLLM/TGI (846--912 ms). Zero outliers with large IQR means the data is uniformly spread across a wide range -- this is the signature of mixed N-levels (N=1 through N=8) pooled together. Ollama's wall-time varies 10x across N levels (400 ms at N=1 to 7000 ms at N=8), creating a wide IQR. vLLM/TGI's wall-time varies only 3x (500 ms to 3200 ms), creating a narrow IQR. The IQR ratio (3--6x) is another proxy for scaling efficiency: backends that scale well have less variation across N levels.
Observation 3 -- The 3 detected outliers (one each in TGI-3B, TGI-1.5B, vLLM-3B) are Docker scheduling artifacts. Docker containers on Windows occasionally experience scheduling delays from Hyper-V context switches. These single-sample outliers do not affect any aggregate statistic and require no remediation.
SS17. Backend-Native Metrics
SS17.1 Timing Breakdown Availability
| Backend | Has prefill_ms | Has decode_ms |
|---|---|---|
| ollama | True | True |
| tgi | False | True |
| vllm | False | False |
SS17.2 Prefill and Decode Where Available
ollama
| Model | Prefill Mean (ms) | Decode Mean (ms) | Total Wall (ms) |
|---|---|---|---|
| llama3.2-1b | 7.2 | 459.5 | -- |
| llama3.2-3b | 12.4 | 753.1 | -- |
| qwen2.5-1.5b | 9.8 | 437.3 | -- |
tgi
| Model | Prefill Mean (ms) | Decode Mean (ms) | Total Wall (ms) |
|---|---|---|---|
| llama3.2-1b | N/A | 1180186.9 | -- |
| llama3.2-3b | N/A | 3023523.4 | -- |
| qwen2.5-1.5b | N/A | 1885495.6 | -- |
SS17.3 Observations
Observation 1 -- Ollama's prefill is trivially fast: 7--12 ms. For a 100--200 token prompt on a Q4_0 model, prefill takes <15 ms. Decode dominates: 437--753 ms for 128 tokens. The prefill/decode ratio is ~1.5% -- essentially all time is spent generating tokens, not processing the prompt. This is expected for short prompts; the ratio would invert at 4K+ context.
Observation 2 -- TGI's "decode_ms" values are in microseconds, not milliseconds. TGI llama3.2-1b reports decode=1,180,187 -- this is ~1.18 seconds expressed as microseconds. The TGI API returns timing in nanoseconds, and the conversion appears to divide by 1,000 instead of 1,000,000. Corrected: TGI-1B decode ~ 1,180 ms, TGI-3B ~ 3,024 ms, TGI-1.5B ~ 1,885 ms. These values include multi-agent contention (averaged across all N levels), so they are not directly comparable to Ollama's per-request decode times.
Observation 3 -- Ollama's scheduling overhead can be computed exactly. For the 1B model at N=1: wall_ms ~ 680 ms (Phase 2 baseline), prefill + decode = 7.2 + 459.5 = 466.7 ms. The gap: 680 - 467 = 213 ms of scheduling overhead -- HTTP round-trip, tokenization, JSON serialization, queue management. This 213 ms represents 31% of total request time at N=1. Under concurrency, the queue wait amplifies: at N=8, agents spend ~85% of time waiting.
Observation 4 -- vLLM's opacity is a limitation. vLLM exposes no per-request timing breakdown via the OpenAI-compatible completions API. We cannot decompose vLLM's wall_ms into prefill + decode + overhead. However, the total wall_ms data (SS12) shows vLLM-1B at N=1: 850 ms, which is ~25% slower than Ollama (680 ms) but with 6x better scaling at N=8. The higher single-request overhead is more than compensated by continuous batching at N>1.
SS17.4 Scheduling Overhead Decomposition (Ollama Only)
| Model | Wall (N=1) | Prefill | Decode | Overhead | Overhead % |
|---|---|---|---|---|---|
| llama3.2-1b | 680 ms | 7 ms | 460 ms | 213 ms | 31% |
| llama3.2-3b | 984 ms | 12 ms | 753 ms | 219 ms | 22% |
| qwen2.5-1.5b | 638 ms | 10 ms | 437 ms | 191 ms | 30% |
The overhead is approximately constant at ~210 ms regardless of model size. This confirms it is software overhead (HTTP, tokenization, JSON), not GPU-dependent. For the 3B model, overhead is 22% of wall time; for the 1B model, 31% -- because the denominator (GPU time) is larger for the 3B model. This fixed overhead is the primary source of Ollama's Amdahl serial fraction: under concurrency, every request pays the 210 ms tax sequentially.
SS17b. Statistical Power and Distribution Analysis
SS17b.1 Power Analysis
Can we detect a 5% throughput difference with the collected sample sizes?
| Backend | Model | N Samples | Mean TPS | CV% | N Needed (5% effect) | Adequate? |
|---|---|---|---|---|---|---|
| ollama | llama3.2-1b | 533 | 80.1 | 72.4 | 1,645 | No |
| ollama | llama3.2-3b | 533 | 53.7 | 75.8 | 1,801 | No |
| ollama | qwen2.5-1.5b | 533 | 84.5 | 73.6 | 1,698 | No |
| tgi | llama3.2-1b | 517 | 81.9 | 31.7 | 315 | Yes |
| tgi | llama3.2-3b | 500 | 38.2 | 18.1 | 103 | Yes |
| tgi | qwen2.5-1.5b | 490 | 56.8 | 24.0 | 182 | Yes |
| vllm | llama3.2-1b | 533 | 95.8 | 35.0 | 385 | Yes |
| vllm | llama3.2-3b | 533 | 46.9 | 19.0 | 114 | Yes |
| vllm | qwen2.5-1.5b | 533 | 72.6 | 26.9 | 228 | Yes |
SS17b.2 Observations
Observation 1 -- Ollama appears underpowered, but this is an artifact of pooling across N levels. Ollama's CV of 72--76% comes from pooling N=1 through N=8 data, where wall times range from 400 ms to 7000 ms. Within a single N level, Ollama's CV is <15% (similar to vLLM/TGI), and 30 samples per N level easily detect 5% effects. The "inadequate power" flag is a statistical artifact, not a real limitation.
Observation 2 -- vLLM/TGI are adequately powered even when pooled. Their lower CV (19--35%) reflects the flatter scaling curve: wall times only vary 3x across N levels (vs 10x for Ollama). This confirms that vLLM/TGI throughput measurements are precise enough to detect small differences -- important for the cross-backend comparisons in SS8.
Observation 3 -- The key comparisons have massive effect sizes. The cross-backend differences (Cohen's d = 3--8 in SS8, d = 13--19 in SS11) are so large that even 10 samples would suffice. Power analysis concerns apply only to subtle within-backend comparisons, not to the main findings.
SS17b.3 Distribution Shape
| Backend | Model | Mean Wall (ms) | Median | Skewness | Kurtosis | Normal? |
|---|---|---|---|---|---|---|
| ollama | llama3.2-1b | 2466 | 1967 | 0.08 | -1.58 | No |
| ollama | llama3.2-3b | 3953 | 3070 | -0.01 | -1.68 | No |
| ollama | qwen2.5-1.5b | 2450 | 1847 | 0.11 | -1.43 | No |
| tgi | llama3.2-1b | 1729 | 1625 | 0.60 | -0.59 | No |
| tgi | llama3.2-3b | 3450 | 3473 | -0.02 | 0.19 | No |
| tgi | qwen2.5-1.5b | 2385 | 2319 | 0.35 | -0.63 | No |
| vllm | llama3.2-1b | 1496 | 1354 | 0.47 | -0.88 | No |
| vllm | llama3.2-3b | 2813 | 2680 | 0.54 | -0.63 | No |
| vllm | qwen2.5-1.5b | 1886 | 1875 | 0.48 | -0.87 | No |
Observation 4 -- No distribution is normal (all Shapiro-Wilk p < 0.05). This is expected: pooling multiple N levels creates multimodal distributions. The non-normality justifies our use of bootstrap confidence intervals (1,000 resamples) rather than parametric t-tests throughout the analysis.
Observation 5 -- Ollama distributions are platykurtic (kurtosis ~ -1.6). Negative kurtosis means the distribution is flatter than normal -- uniform-like. This is the signature of 4 distinct N-level clusters pooled together. vLLM/TGI distributions are closer to normal (kurtosis -0.6 to +0.2) because their N-level clusters are closer together (smaller spread = less multimodality).
Observation 6 -- Trimmed means closely match raw means. For all 9 combinations, the 5% and 10% trimmed means differ from the raw mean by <2%. This confirms that the means are not inflated by outliers and that the aggregate statistics are robust.
SS18. Limitations and Future Work
SS18.1 What This Report Does NOT Prove
-
Generalization to other GPUs. All measurements are on a single RTX 4080 Laptop GPU (12 GB GDDR6). Server GPUs (A100 80GB HBM2e, H_100 80GB HBM3) have 3--5x higher memory bandwidth, which may reduce the GPU-physics component of serial fractions and change the relative backend rankings. The Ollama scheduling overhead (~210 ms) is CPU-bound and would be similar on any platform, but the GPU-side compute time would be faster, making the overhead a larger fraction.
-
Long-context behavior. All prompts are 100--300 tokens with max_new_tokens=128. At 4K+ context, KV-cache memory pressure becomes the dominant constraint. vLLM's PagedAttention is specifically designed for efficient KV-cache management at long context -- its advantage over Ollama may be larger than measured here. Conversely, 12 GB VRAM cannot fit the 3B FP16 model with 4K context and 8 concurrent agents, which would reduce the testable N range.
-
Quantization confound. Ollama serves Q4_0 (4-bit), vLLM/TGI serve FP16 (16-bit). The weights are 4x smaller for Ollama, which means:
- Ollama loads weights 4x faster from VRAM -> faster decode per token
- Ollama has 4x more VRAM headroom for KV-cache
- Different memory access patterns under concurrency
While eta(N) normalizes each backend against its own baseline (eliminating absolute throughput differences), the scaling behavior could be quantization-dependent. The 31% scheduling overhead measured in SS17 is quantization-independent, but memory bandwidth contention under concurrency could differ.
-
Production workload heterogeneity. All requests have similar prompt lengths (uniform 100--300 tokens). Real multi-agent workloads mix short tool calls (10--50 tokens) with long context windows (1K--4K tokens). Backends with preemption (vLLM) can interrupt long requests to serve short ones; Ollama must complete each request before starting the next. The throughput advantage of vLLM/TGI is likely underestimated for heterogeneous workloads.
-
Multi-GPU configurations. Single-GPU only. Tensor parallelism (splitting one model across 2+ GPUs) adds inter-GPU communication overhead that creates a new serial component. vLLM's tensor parallelism implementation is more mature than Ollama's (which has none), potentially widening the gap.
-
Statistical power for Ollama. Power analysis (SS16) shows Ollama needs 1,645--1,801 samples to detect a 5% effect, but only 533 were collected. The large CV (72--76%) from pooling all N levels inflates the required sample size. Within-N-level power is adequate (30 samples per N, CV < 15%).
SS18.2 Threats to Validity
| Threat | Type | Mitigation | Residual Risk |
|---|---|---|---|
| Q4_0 vs FP16 | Construct | eta(N) normalization | Memory access patterns may differ |
| Laptop thermal throttling | Internal | 5s cooldown between configs | Some thermal drift possible over 3h |
| Docker overhead (Windows/Hyper-V) | Internal | wait_ready() before measurement | Hyper-V scheduling adds ~1 ms jitter |
| Closed-loop workload | External | Documents limitation | Open-loop may show different patterns |
| 4 N-levels only | Internal | TR129 proved smooth curves | May miss non-monotonic behavior |
| Single GPU instance | External | Documents limitation | No inter-run variability captured |
SS18.3 Future Work
- Same quantization comparison. Run vLLM with GPTQ/AWQ Q4 quantization to isolate the scheduling vs quantization confound. If vLLM-Q4 still scales better than Ollama-Q4, the scheduling advantage is confirmed independently of quantization.
- Server GPU replication. Repeat on A100 80GB with HBM2e bandwidth. Prediction: Ollama's ~210 ms overhead will be unchanged (CPU-bound), but GPU compute will be 3x faster, making the scheduling overhead an even larger fraction of total time.
- Open-loop benchmarking. Run the same backends under Poisson arrivals at varying request rates to validate that closed-loop serial fractions translate to open-loop throughput advantages.
- Extended N range. Test N={16, 32, 64} on a high-VRAM GPU to find true saturation points for vLLM/TGI. Prediction based on power law: vLLM-3B should maintain eta>0.30 at N=32.
- Mixed-model multi-agent. Deploy different models per agent slot to stress KV-cache management. vLLM's PagedAttention should handle mixed-model KV-cache allocation better than Ollama's model-switching approach.
- Linux native comparison. Eliminate Docker/Hyper-V overhead by running all three backends on native Linux. Ollama's overhead may decrease slightly (no WSL layer), while vLLM/TGI should see negligible change.
SS19. Conclusions
SS19.1 Answers to Research Questions
Q1: Does the serving stack affect multi-agent scaling efficiency?
Yes -- dramatically. At N=8 concurrent agents, Ollama retains 16% of per-agent throughput while vLLM retains 56% and TGI retains 58% -- a 3.5x difference in scaling efficiency on the same GPU, same models, same workload. The serving stack is the dominant bottleneck, not GPU physics. The GPU is capable of much higher concurrent throughput than Ollama allows (SS6, SS8).
Q2: Which backend delivers the most total throughput at high concurrency?
vLLM, across all three models:
| Model | vLLM Total TPS (N=8) | Ollama Total TPS (N=8) | vLLM Advantage |
|---|---|---|---|
| llama3.2-1b | 559.2 | 248.0 | 2.25x |
| llama3.2-3b | 318.4 | 161.6 | 1.97x |
| qwen2.5-1.5b | 456.8 | 259.2 | 1.76x |
TGI is a close second (483.2, 260.8, 362.4 total tok/s respectively), typically within 15% of vLLM. Both continuous-batching backends deliver roughly 2x more total work than Ollama at N=8.
Q3: Do all backends follow the same scaling law?
No -- they follow fundamentally different laws. This is the most important finding of TR130:
- Ollama: Amdahl's Law (R^2=0.957--0.987). Degradation is governed by a fixed serial fraction (s=0.39--0.53) representing the scheduling overhead per request. Efficiency follows eta(N) = 1/(s + (1-s)N).
- vLLM/TGI: Power law (R^2=0.988--0.996). Degradation follows eta(N) propto N?alpha (alpha=0.17--0.35) with no fixed serial fraction. Each additional agent causes a multiplicative slowdown that compounds gradually.
The mechanistic difference: Ollama processes requests sequentially (FIFO queue), creating a fixed per-request serialization cost. vLLM/TGI batch requests into joint GPU kernels, where the cost of adding a request grows sub-linearly with batch size. Comparing Amdahl serial fractions across these fundamentally different systems is a category error -- the correct cross-backend comparison uses raw eta(N) or total throughput (SS8).
Q4: At what N does the best backend overtake Ollama in total throughput?
Between N=2 and N=4 for all models, despite Ollama's 18--93% head start at N=1 from Q4_0 quantization:
| Model | Ollama N=1 | vLLM N=1 | Crossover N | Why |
|---|---|---|---|---|
| llama3.2-1b | 177.7 tok/s | 150.7 tok/s | ~3 | Ollama's 18% Q4_0 advantage erased by 32% efficiency loss |
| qwen2.5-1.5b | 198.3 tok/s | 102.6 tok/s | ~4 | Ollama's 93% advantage erased by 59% efficiency loss |
| llama3.2-3b | 130.1 tok/s | 60.9 tok/s | ~4 | Ollama's 114% advantage erased by 65% efficiency loss |
The crossover is earlier for models with smaller Ollama N=1 advantages and later for models with larger advantages. But by N=4, vLLM wins universally. The Q4_0 quantization advantage is a depreciating asset under concurrency.
Q5: Is TTFT independent of throughput scaling?
Partially independent. TTFT is a latency metric (time to first token) while throughput scaling measures sustained generation rate. Key findings:
- vLLM/TGI dominate TTFT: 22--35 ms vs Ollama's 163--194 ms (6--8x faster, Cohen's d=13--19)
- The TTFT advantage is independent of N because it measures first-token latency, not sustained throughput
- vLLM/TGI win on both TTFT and throughput scaling -- there is no trade-off
- Ollama's high TTFT comes from scheduling overhead (~210 ms), not GPU compute (prefill is 7--12 ms)
The only case where TTFT and throughput partially diverge: TGI has marginally better TTFT than vLLM (5 ms lower for qwen2.5), but vLLM has better total throughput. The difference is too small to drive backend selection.
SS19.2 The Central Finding
TR129 asked: is the Amdahl serial fraction an Ollama problem or a GPU physics problem?
TR130 answers: It is an Ollama problem. The GPU can support 3--4x better scaling than Ollama achieves. Ollama's sequential request scheduling creates a fixed ~210 ms per-request overhead that becomes the dominant bottleneck under concurrency, producing Amdahl's-law degradation with s=0.39--0.53. vLLM and TGI eliminate this bottleneck through continuous batching, achieving power-law degradation with exponents alpha=0.17--0.35 -- far more gradual than Amdahl's collapse.
The practical implication: any practitioner running 3+ concurrent agents should use vLLM or TGI instead of Ollama. The switch doubles total throughput, triples per-agent efficiency, reduces time-to-first-token by 6x, and pushes the saturation point from N=4 to N>8.
SS19.3 One-Number Summaries
For capacity planning (per backend):
| Backend | N* (saturation) | eta(8) range | Best N=8 total TPS | Scaling law |
|---|---|---|---|---|
| Ollama | 4 | 0.16--0.17 | 259 tok/s | Amdahl (s~0.45) |
| vLLM | 8--13+ | 0.46--0.65 | 559 tok/s | Power (alpha~0.28) |
| TGI | 8--16+ | 0.48--0.66 | 483 tok/s | Power (alpha~0.24) |
For backend selection (decision tree):
N=1 and throughput priority? -> Ollama (Q4_0 advantage)
N=1 and TTFT priority? -> vLLM or TGI (6x faster TTFT)
N=2? -> Either (Ollama still competitive)
N>=3? -> vLLM (best total throughput)
N>=3 and TTFT matters? -> vLLM (best of both worlds)
Memory-constrained (<8GB)? -> Ollama (Q4_0 = 4x less VRAM)
SS19.4 What Changes for the Banterhearts Research Program
- TR129's Amdahl serial fractions are confirmed -- reproducible within bootstrap CIs -- but now understood as Ollama-specific, not GPU-universal.
- Future multi-agent experiments should use vLLM as the default backend. Ollama's sequential scheduling confounds multi-agent scaling measurements with scheduling overhead.
- The scaling law taxonomy expands: Amdahl (sequential schedulers) vs power law (continuous batching). Future work should characterize the transition between these regimes.
- VRAM becomes the binding constraint at FP16 precision. The 12 GB RTX 4080 Laptop can run the 3B model at N=8 only with max_model_len=2048. Longer contexts or larger models require quantized vLLM serving (GPTQ/AWQ) -- a configuration not tested here.
Appendix A: Configuration
experiment: tr130
max_new_tokens: 128
seed: 42
warmup_requests: 3
gpu_poll_interval_s: 1.0
models:
- name: llama3.2-1b
hf_id: unsloth/Llama-3.2-1B-Instruct
ollama_tag: llama3.2:1b
- name: qwen2.5-1.5b
hf_id: Qwen/Qwen2.5-1.5B-Instruct
ollama_tag: qwen2.5:1.5b
- name: llama3.2-3b
hf_id: unsloth/Llama-3.2-3B-Instruct
ollama_tag: llama3.2:3b
backends:
ollama:
port: 11434
timeout_s: 120
vllm:
port: 8000
timeout_s: 180
docker_image: vllm/vllm-openai:latest
docker_name: tr130-vllm
startup_timeout_s: 300
extra_args: ['--max-model-len', '2048', '--dtype', 'float16', '--enforce-eager', '--gpu-memory-utilization', '0.80']
tgi:
port: 8080
timeout_s: 180
docker_image: ghcr.io/huggingface/text-generation-inference:latest
docker_name: tr130-tgi
startup_timeout_s: 300
extra_args: ['--max-input-length', '1024', '--max-total-tokens', '2048']
phase1:
requests_per_combo: 3
prompt_tokens_low: 100
prompt_tokens_high: 200
phase2:
requests_per_model: 50
prompt_tokens_low: 100
prompt_tokens_high: 300
phase3:
n_agent_levels: [1, 2, 4, 8]
requests_per_agent: 30
cooldown_between_configs_s: 5
prompt_tokens_low: 100
prompt_tokens_high: 300
phase4:
requests_per_model: 30
prompt_tokens_low: 100
prompt_tokens_high: 300
Appendix B: GPU Telemetry
- GPU: NVIDIA GeForce RTX 4080 Laptop GPU
- VRAM: 12282 MB
- Driver: 591.74
- Platform: Windows-11-10.0.26200-SP0
- Docker: Docker version 28.5.1, build e180ab8
Appendix C: Data Summary
- Total rows: 4797
- OK rows: 4705
- Error rows: 92
- OK rate: 98.08%
Rows per phase:
- p1_validation: 27
- p2_baseline: 450
- p3_scaling: 4050
- p4_ttft: 270
Rows per backend:
- ollama: 1599
- tgi: 1599
- vllm: 1599
Rows per model:
- llama3.2-1b: 1599
- llama3.2-3b: 1599
- qwen2.5-1.5b: 1599
Appendix D: Glossary
| Term | Definition |
|---|---|
| effective_tps | completion_tokens / wall_ms x 1000. User-perceived throughput including queue wait. |
| gpu_tokens_per_s | completion_tokens / decode_ms x 1000. GPU-side decode throughput (no queue wait). |
| eta(N) | Efficiency: per-agent TPS at N agents / per-agent TPS at N=1. Always <= 1. |
| Serial fraction (s) | Amdahl parameter: fraction of inference that is serialized. Lower s = better scaling. Only valid for Amdahl-like systems (e.g., Ollama). |
| Power law exponent (alpha) | Exponent in eta(N) propto N?alpha. Lower alpha = slower degradation. Valid for continuous-batching systems (vLLM, TGI). |
| N* | Saturation point: N where eta < 0.5. Higher N* = wider useful concurrency range. |
| Jain's Index | Fairness metric: J = (Sigmax)^2 / (n*Sigmax^2). J=1.0 = all agents get equal throughput. |
| TTFT | Time-to-First-Token: latency from request to first streamed token. |
| Closed-loop | Each agent sends request -> waits -> sends next. Max concurrency = N. |
| Open-loop | Requests arrive at a specified rate (Poisson). Can exceed N in-flight. |
| PagedAttention | vLLM's KV-cache management: allocates memory in pages, reducing fragmentation. |
| Continuous batching | vLLM/TGI: new requests join an in-progress batch without waiting for others to finish. Sequential batching (Ollama) completes the entire current request before starting the next. |
| Q4_0 | 4-bit quantization (Ollama default). ~4x smaller than FP16. Faster inference but lower precision. |
| FP16 | Half-precision floating point (vLLM/TGI default). Higher precision, higher VRAM. |
| Bootstrap CI | Confidence interval from resampling the data 1,000 times. Non-parametric; valid for any distribution. |
| Cohen's d | Effect size: |mean_diff| / pooled_std. <0.2 negligible, <0.5 small, <0.8 medium, >=0.8 large. |
| Category error | Comparing a parameter across systems where the parameter means different things. E.g., comparing Amdahl s between Amdahl and power-law systems. |
| Crossover point | The N at which Backend A's total throughput exceeds Backend B's, despite B being faster at N=1. |
Appendix E: Reproducibility
How to Reproduce This Experiment
# Prerequisites: Ollama running, Docker with GPU support, Python 3.11+
# Models: ollama pull llama3.2:1b && ollama pull qwen2.5:1.5b && ollama pull llama3.2:3b
# Full pipeline (data collection + analysis + report generation)
python research/tr130/run.py -v
# Analysis only (re-analyze existing data)
python research/tr130/run.py --analyze-only -v
Key Implementation Details
- Closed-loop workload: Each agent sends one request, waits for completion, then immediately sends the next. No think time between requests.
- Warmup: 3 requests per backendxmodel combination before measurement begins.
- Cooldown: 5 seconds between Phase 3 configurations to allow GPU temperature stabilization.
- Docker model loading: HuggingFace cache (
~/.cache/huggingface) mounted into Docker containers to avoid re-downloading. - Randomized prompt lengths: Uniform random between prompt_tokens_low and prompt_tokens_high per phase.
- Error handling: Failed requests are logged with status="error" and excluded from throughput calculations but included in row counts.
Data Provenance
| Artifact | Path | Size |
|---|---|---|
| Raw measurements | research/tr130/results/20260226_125833/metrics.csv |
4,797 rows |
| Analysis output | research/tr130/results/20260226_125833/analysis.json |
18 sections |
| Run manifest | research/tr130/results/20260226_125833/manifest.json |
Environment + config |
| This report | PublishReady/reports/Technical_Report_130.md |
~1,200 lines |
References
- Kwon, W. et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP 2023.
- Patel, P. et al. (2024). Splitwise: Efficient generative LLM inference using phase splitting. ISCA 2024.
- Amdahl, G.M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. AFIPS 1967.
- Jain, R. et al. (1984). A Quantitative Measure Of Fairness And Discrimination For Resource Allocation In Shared Computer Systems. DEC-TR-301.
- TR129 (2026). N-Agent Scaling Laws. Banterhearts Research.
- TR128 (2026). Production Workload Characterization. Banterhearts Research.