Technical Report 131: GPU Kernel Profiling
Root-Cause Analysis of Multi-Agent Throughput Degradation via NVIDIA Nsight Systems
| Field | Value |
|---|---|
| TR Number | 131 |
| Project | Banterhearts LLM Performance Research |
| Date | 2026-02-26 |
| Author | Research Team |
| Report Type | Hardware-level root-cause analysis (6-phase, 2 backends, 26 profiled runs) |
| Test Duration | ~71 minutes |
| Status | Complete -- GPU memory physics identified as primary degradation mechanism |
| Run ID | 20260226_174224 |
| Related Work | TR129 (N-Agent Scaling Laws), TR130 (Serving Stack Benchmarking) |
| Depends On | TR129 (degradation measurement), TR130 (serving stack comparison) |
Abstract
TR129 established that Ollama per-agent throughput drops 63% under 8-agent concurrency (Amdahl serial fraction s=0.39--0.54) on an RTX 4080 Laptop GPU. TR130 compared three serving stacks and found vLLM/TGI scale 3--4x better than Ollama, concluding: "The serving stack is the bottleneck." But neither study opened the GPU black box. Without kernel-level traces, the attribution was correlational, not causal.
TR131 provides the causal test. Using NVIDIA Nsight Systems (nsys) and Nsight Compute (ncu), we capture GPU kernel timelines, memory operations, and execution traces for 2 LLaMA 3.2 models (1B, 3B) across 26 profiled runs in 4 experimental conditions: Ollama N=1, Ollama N=8, PyTorch Direct N=1, PyTorch Direct N=8. PyTorch Direct eliminates the entire serving stack -- no HTTP server, no Go runtime, no request queuing, no token streaming -- and calls model.generate() directly via HuggingFace Transformers. If the degradation persists without the serving stack, the cause is GPU physics, not software.
The central finding overturns the TR130 hypothesis. PyTorch Direct degrades 86.4% from N=1 to N=8 -- worse than Ollama's 82.1% (both p < 0.003, Cohen's d > 4, confirmed by Mann-Whitney U). The share of the degradation attributable to Ollama is therefore -4.3 percentage points (82.1% minus 86.4%): removing the serving stack does not reduce the degradation, so it is intrinsic to running 8 concurrent inference streams on a single GPU with shared memory bandwidth. Ollama's Q4_0 quantization actually helps under concurrency by reducing per-request bandwidth pressure, growing the Ollama-vs-PyTorch advantage from 3.0x at N=1 to 3.9x at N=8.
The strongest mechanistic evidence comes from memory bandwidth analysis: Ollama memory operation time increases 74.4% from N=1 to N=8 (p=6.4x10^-5, Cohen's d=3.81) -- the only hypothesis test surviving Holm step-down correction across 6 tests at family-wise alpha=0.05. Back-of-envelope bandwidth calculations show N=8 demand exceeds the RTX 4080 Laptop's peak 432 GB/s by 78--130% depending on precision, forcing the memory controller to serialize weight reads.
Five hypotheses were tested. H1 (bandwidth saturation): partially confirmed. H2 (Ollama serialization): serialization exists but is GPU-level, not Ollama-level (max_concurrent_kernels = 1 in all conditions including PyTorch). H3 (context switching): rejected. H4 (CPU scheduling): insufficient data. H5 (KV-cache pressure): rejected. Welch's t-tests, Cohen's d effect sizes, 95% CIs, Mann-Whitney U robustness checks, and Holm correction provide statistical rigor matching TR126 standards.
Quantization caveat: Ollama serves Q4_0 (0.5 bytes/parameter); PyTorch runs FP16 (2 bytes/parameter). Absolute TPS is non-comparable. However, the N=1->N=8 degradation ratio is the relevant attribution metric, and FP16's strictly higher memory pressure per parameter makes PyTorch's worse degradation a conservative bound on GPU physics effects.
Executive Summary
Key Findings
- GPU physics dominates. PyTorch Direct degrades 86.4% (N=1->N=8), exceeding Ollama's 82.1% -- the serving stack is not the primary bottleneck. The degradation is intrinsic to single-GPU memory bandwidth under concurrent weight reads.
- Both backends show massive, highly significant degradation. Ollama: 128.5 -> 23.0 TPS (p=0.0006, d=4.19); PyTorch: 42.9 -> 5.8 TPS (p=0.002, d=4.17). All effects are far above the minimum detectable d=1.29 at our sample sizes.
- Ollama is 3--4x faster than PyTorch at both concurrency levels. At N=1: 128.5 vs 42.9 TPS (3.0x, p=0.001, d=3.12). At N=8: 23.0 vs 5.8 TPS (3.9x, p=0.0008, d=3.44). The advantage grows under concurrency because Q4_0's lower bandwidth demand compounds when bandwidth is scarce.
- Memory bandwidth stress is the only statistically significant mechanism. Ollama memory operation time grows +74.4% at N=8 (p=6.4x10^-5, d=3.81) -- the only test surviving Holm correction (rank 1 of 6, threshold=0.0083).
- Kernel serialization is GPU-level, not Ollama-level. Max concurrent kernels = 1 in all 26 runs across both backends. Cohen's d = 0 for every concurrency comparison. The GPU hardware enforces serial kernel execution regardless of software.
- Context switching (H3) is rejected. Inter-kernel gap metrics show zero variance between N=1 and N=8 across both backends. No evidence of CUDA context switching overhead.
- KV-cache pressure (H5) is rejected. Memory allocation counts are unchanged between N=1 and N=8 (p=1.0, d=0). No evidence of memory pressure affecting GPU utilization.
- Nsight Compute data was limited by WDDM. ncu captured kernel names but returned null metrics for SM occupancy and DRAM throughput on Windows WDDM driver. Direct bandwidth measurement was not possible.
- Degradation is model-size independent. LLaMA-1B: -82.1% (Ollama), -86.2% (PyTorch). LLaMA-3B: -82.2% (Ollama), -87.1% (PyTorch). Near-identical patterns regardless of parameter count.
- This revises TR130's conclusion. The "serving stack bottleneck" is actually a "GPU memory physics bottleneck." vLLM/TGI's better scaling comes from continuous batching and PagedAttention reducing bandwidth waste per token, not merely better HTTP request scheduling.
Summary Tables
Per-Agent Throughput (TPS) with Full Statistics
| Backend | Model | N=1 Mean | N=1 95% CI | N=1 CV% | N=8 Mean | N=8 95% CI | N=8 CV% | Degradation | p-value | Cohen's d |
|---|---|---|---|---|---|---|---|---|---|---|
| Ollama | LLaMA-1B | 160.44 | [159.29, 161.60] | 0.29 | 28.80 | [24.93, 32.66] | 5.40 | -82.1% | 1.1x10^-5 | 114.64 |
| Ollama | LLaMA-3B | 96.48 | [96.06, 96.89] | 0.17 | 17.19 | [14.00, 20.39] | 7.47 | -82.2% | 6.8x10^-5 | 86.54 |
| PyTorch | LLaMA-1B | 52.02 | [50.11, 53.92] | 1.48 | 7.18 | [6.84, 7.53] | 1.93 | -86.2% | 6.1x10^-5 | 81.26 |
| PyTorch | LLaMA-3B | 29.33 | [28.25, 30.41] | 0.41 | 3.79 | [3.66, 3.92] | 0.37 | -87.1% | 1.8x10^-3 | 298.35 |
Hypothesis Verdicts
| H | Hypothesis | Verdict | Key Evidence | Holm-Corrected | Confidence |
|---|---|---|---|---|---|
| H1 | GPU bandwidth saturation | PARTIALLY CONFIRMED | Mem time +74%, p=6.4x10^-5, d=3.81 | Significant | HIGH |
| H2 | Ollama request serialization | GPU-LEVEL (not Ollama) | Max concurrent=1 everywhere, d=0 | NaN (no variance) | HIGH |
| H3 | CUDA context switching | REJECTED | Gap metrics identical N=1 vs N=8 | Not significant | MEDIUM |
| H4 | CPU thread scheduling | INSUFFICIENT DATA | OS runtime data unavailable | N/A | LOW |
| H5 | KV-cache memory pressure | REJECTED | Alloc counts unchanged, p=1.0, d=0 | Not significant | LOW |
Claim Validation
| # | Claim | Evidence | Status |
|---|---|---|---|
| 1 | Serving stack causes 82% degradation (TR130) | PyTorch Direct degrades 86.4% without any serving stack | Overturned |
| 2 | GPU bandwidth is stressed at N=8 | Mem time +74.4%, p=6.4x10^-5, survives Holm correction | Confirmed |
| 3 | Ollama serializes GPU access | Max concurrent=1 in both backends; GPU hardware, not Ollama | Reattributed |
| 4 | Q4_0 quantization helps concurrency | Ollama advantage grows 3.0x -> 3.9x from N=1 to N=8 | Confirmed |
| 5 | Context switches cause overhead at N=8 | Zero variance in gap metrics across all conditions | Rejected |
| 6 | KV-cache pressure reduces occupancy | Alloc counts unchanged; ncu SM data null | Rejected |
| 7 | Profiling overhead distorts timing | Ollama N=1 TPS matches TR129 unprofiled (160.4 vs ~160) | Negligible |
Key Decisions for Practitioners
- Do not blame Ollama for N=8 degradation. GPU memory bandwidth is the fundamental limit. Switching serving stacks alone will not solve the 82% per-agent throughput collapse -- the GPU physics enforces it.
- Continuous batching is the real differentiator. vLLM/TGI's better scaling (TR130) comes from batching multiple sequences into single kernel launches, amortizing the weight-read bandwidth cost. This is fundamentally different from "better scheduling" -- it is a bandwidth optimization.
- Quantization is critical for concurrency. Ollama's Q4_0 models maintain 3--4x higher absolute TPS than FP16 PyTorch because Q4_0 weights are 4x smaller, proportionally reducing memory bandwidth demand. The advantage compounds under contention: 3.0x at N=1 grows to 3.9x at N=8.
- For multi-agent workloads, use multiple GPUs or reduce agent count. 8 agents on a single RTX 4080 Laptop is fundamentally bandwidth-bound regardless of software stack. Plan for 2--3 agents per GPU for acceptable latency.
- Profile before optimizing. This study demonstrates that intuitive attributions (blaming the serving stack) can be wrong even when supported by strong correlational evidence (TR130). Hardware profiling is the only way to distinguish correlation from causation.
How to Read This Report
| Time | Reading Path |
|---|---|
| 2 min | Abstract -> Executive Summary -> SS15 Hypothesis Verdicts table |
| 10 min | Add SS2 (Methodology) + SS7 (PyTorch vs Ollama) + SS15 (Verdicts) |
| 30 min | Full report, SS1--SS19 + Appendices |
When to Use This Report
| Scenario | How This Report Helps |
|---|---|
| Diagnosing multi-agent throughput collapse | Attribution table shows GPU physics is primary cause |
| Deciding whether to switch from Ollama | Switching to another sequential server won't help; need continuous batching |
| Capacity planning for concurrent agents | Bandwidth demand table shows when GPU saturates |
| Evaluating if quantization helps concurrency | Q4_0 advantage grows under bandwidth pressure (3.0x -> 3.9x) |
| Understanding why vLLM scales better than Ollama | Not scheduling -- continuous batching amortizes bandwidth |
| Planning GPU profiling for your workload | Methodology section provides nsys/ncu recipe |
Table of Contents
- SS1. Introduction and Motivation
- SS2. Methodology
- SS3. Phase 1 -- Environment Validation
- SS4. Phase 2 -- Ollama N=1 Baseline
- SS5. Phase 3 -- Ollama N=8 Concurrent
- SS6. Phases 4--5 -- PyTorch Direct N=1 and N=8
- SS7. The Core Comparison -- Ollama vs PyTorch Degradation
- SS8. Kernel Profile Comparison
- SS9. GPU Utilization Analysis
- SS10. Memory Bandwidth Analysis (H1)
- SS11. Serialization Analysis (H2)
- SS12. Context Switch Analysis (H3)
- SS13. Memory Allocation Analysis (H5)
- SS14. Phase 6 -- Nsight Compute Targeted Profiling
- SS15. Hypothesis Verdicts and Degradation Attribution
- SS16. Statistical Power and Data Quality
- SS17. Profiling Overhead Assessment
- SS18. Limitations and Future Work
- SS19. Conclusions
- Appendix A: Configuration
- Appendix B: Environment
- Appendix C: Statistical Methods
- Appendix D: Glossary
- Appendix E: Reproducibility
- References
SS1. Introduction and Motivation
SS1.1 Background
Multi-agent LLM systems deploy N autonomous agents that concurrently issue inference requests to a shared GPU. TR129 established that Ollama per-agent throughput degrades following Amdahl's Law with serial fraction s=0.39--0.54: at N=8 agents on an RTX 4080 Laptop GPU, each agent retains only 16--17% of its standalone throughput. The practical consequence is severe -- adding agents beyond N=2 yields diminishing total throughput.
TR130 isolated the serving stack variable by comparing Ollama, vLLM, and TGI on identical hardware. vLLM retained 46--65% of per-agent throughput at N=8 (vs Ollama's 16--17%), and total throughput was 2x higher. The conclusion: "The serving stack is the bottleneck, and it is Ollama that suffers."
But this conclusion rests on a correlational argument. TR130 showed that different serving stacks produce different degradation curves. It did not show that Ollama's scheduling causes the degradation. An alternative explanation: the GPU's memory bandwidth is the fundamental constraint, and vLLM/TGI's continuous batching reduces bandwidth demand (by batching weight reads across sequences), while Ollama's sequential execution does not. Under this alternative, the root cause is GPU physics, and the serving stack's role is bandwidth efficiency, not scheduling quality.
Distinguishing these explanations requires opening the GPU black box. If we remove the serving stack entirely -- calling model.generate() directly via PyTorch -- and the degradation persists, then the cause is GPU physics, not software. If it disappears, the cause is indeed Ollama's scheduling.
SS1.2 Experimental Design
TR131 introduces a PyTorch Direct control condition that eliminates the entire serving stack. There is no HTTP server, no Go runtime, no request queuing, no token streaming, no Ollama process. A Python script loads the model via HuggingFace Transformers and calls model.generate() directly. For N=8, a ThreadPoolExecutor with 8 workers runs concurrent inference -- GPU operations release the GIL, enabling true CUDA-level concurrency.
| Factor | Controlled? | Value |
|---|---|---|
| GPU hardware | Yes | RTX 4080 Laptop 12 GB, GDDR6, 432 GB/s peak |
| Models | Yes | LLaMA-3.2-1B, LLaMA-3.2-3B |
| Concurrency levels | Yes | N=1 (baseline), N=8 (concurrent) |
| Max new tokens | Yes | 128 |
| Profiler | Yes | NVIDIA Nsight Systems 2025.5.1 wrapping target |
| Repetitions | Yes | 3 per condition |
| Inference backend | Variable | Ollama (Q4_0) vs PyTorch Direct (FP16) |
| Quantization | Partially | Ollama=Q4_0, PyTorch=FP16 |
The quantization difference (Q4_0 vs FP16) affects absolute throughput but not the N=1->N=8 degradation ratio, which is the metric used for causal attribution. FP16 models place strictly more memory pressure per parameter (4x), making PyTorch's degradation a conservative bound on GPU physics effects.
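As a rough illustration of that 4x factor, the back-of-envelope bandwidth demand quoted in the abstract can be approximated as N x parameters x bytes per parameter x single-agent tokens per second. This is a sketch under a stated assumption, not the report's own calculation: it assumes every generated token re-reads the full weight set, uses the ~1.2B parameter count quoted in SS4.3 for LLaMA-1B, and the function names are hypothetical.

```python
PEAK_GBPS = 432.0  # RTX 4080 Laptop peak bandwidth, from the factor table above


def demand_gbps(n_agents, params, bytes_per_param, tps_single):
    """Aggregate weight-read demand in GB/s.

    Assumption: each generated token re-reads all model weights once,
    and all agents try to run at their N=1 token rate.
    """
    return n_agents * params * bytes_per_param * tps_single / 1e9


def excess_over_peak(demand, peak=PEAK_GBPS):
    """Fraction by which demand exceeds peak (0.78 means 78% over)."""
    return demand / peak - 1.0


# LLaMA-1B, N=8: Q4_0 at the Ollama N=1 rate, FP16 at the PyTorch N=1 rate.
q4_demand = demand_gbps(8, 1.2e9, 0.5, 160.44)
fp16_demand = demand_gbps(8, 1.2e9, 2.0, 52.02)

for label, d in (("Q4_0", q4_demand), ("FP16", fp16_demand)):
    print(f"{label}: {d:.0f} GB/s demand, {excess_over_peak(d) * 100:.0f}% over peak")
```

Under these assumptions LLaMA-1B lands at roughly 78% (Q4_0) and 131% (FP16) over peak, in line with the 78--130% range quoted in the abstract.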
SS1.3 Research Questions
TR131 is designed to answer five specific questions:
- Q1: Does GPU memory bandwidth saturate under N=8 concurrency? If memory operation time increases significantly at N=8, bandwidth contention is a mechanism.
- Q2: Does Ollama serialize GPU kernel execution compared to direct PyTorch? If Ollama shows lower max concurrent kernels than PyTorch, it is adding serialization.
- Q3: Do CUDA context switches increase measurably at N=8? If inter-kernel gaps widen at N=8, context switching is a mechanism.
- Q4: Is GPU-level degradation intrinsic (hardware) or extrinsic (software)? If PyTorch Direct degrades comparably to Ollama, the cause is hardware.
- Q5: What fraction of the 82% degradation is attributable to Ollama's serving stack vs GPU physics? The difference between PyTorch and Ollama degradation ratios is the serving stack contribution.
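The attribution arithmetic behind Q4 and Q5 can be sketched as follows. The per-run TPS values are copied from the result tables in SS4--SS6 (the two failed LLaMA-3B rep0 runs are excluded); the function and variable names are hypothetical.

```python
from statistics import mean


def degradation_pct(n1_runs, n8_runs):
    """Percent drop in mean per-agent TPS going from N=1 to N=8."""
    return (1.0 - mean(n8_runs) / mean(n1_runs)) * 100.0


# Per-run TPS, both models pooled per backend.
ollama_n1 = [159.98, 160.44, 160.91, 96.30, 96.50, 96.63]
ollama_n8 = [28.86, 27.21, 30.32, 16.58, 16.33, 18.67]
pytorch_n1 = [52.46, 52.46, 51.13, 29.24, 29.41]
pytorch_n8 = [7.03, 7.30, 7.22, 3.80, 3.78]

ollama_deg = degradation_pct(ollama_n1, ollama_n8)
pytorch_deg = degradation_pct(pytorch_n1, pytorch_n8)

# Serving-stack contribution in percentage points: what Ollama adds
# on top of the degradation seen with no serving stack at all.
serving_stack_pp = ollama_deg - pytorch_deg

print(f"Ollama: {ollama_deg:.1f}%  PyTorch: {pytorch_deg:.1f}%  "
      f"attributable to serving stack: {serving_stack_pp:+.1f} pp")
```

A negative result means the stack-free control degrades more than Ollama does, which is the case here.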
SS1.4 Literature Gap
Published LLM serving benchmarks (Kwon et al. 2023, Patel et al. 2024) compare backends under open-loop arrival conditions. Multi-agent systems are closed-loop. TR130 provided the first closed-loop cross-backend comparison. TR131 goes one step further: by removing the serving stack entirely, it isolates the GPU physics component that no prior study has measured. This is the first kernel-level profiling of multi-agent inference degradation in the Banterhearts research series.
SS1.5 Five Hypotheses
| H | Hypothesis | Key nsys/ncu Data | Confirm If... |
|---|---|---|---|
| H1 | GPU memory bandwidth saturates | Memory op time, ncu DRAM throughput | Bandwidth >80% of peak at N=8 |
| H2 | Ollama serializes GPU requests | Kernel exec trace timeline | Ollama N=8 has zero kernel overlap; PyTorch N=8 has overlap |
| H3 | CUDA context switching overhead | --gpuctxsw=true, inter-kernel gaps | Context switches at N=8 >> N=1 |
| H4 | CPU thread scheduling bottleneck | OS runtime summary | CPU thread wait >20% of wall time at N=8 |
| H5 | KV-cache memory pressure | SM occupancy, mem alloc counts | SM occupancy drops significantly at N=8 |
Expected primary cause based on TR130 evidence: H2 (Ollama serialization). This expectation will be tested.
SS1.6 Why Nsight Systems + Nsight Compute
| Tool | Purpose | Data Captured | When Used |
|---|---|---|---|
| Nsight Systems (nsys) | System-wide timeline | CUDA API calls, kernel launches, memory ops, inter-kernel gaps, OS runtime | Phases 1--5 (all runs) |
| Nsight Compute (ncu) | Per-kernel deep dive | SM occupancy, DRAM throughput, compute utilization | Phase 6 (targeted) |
nsys wraps the entire target process, capturing all CUDA activity with minimal overhead (validated in SS17). ncu profiles individual kernel launches with detailed hardware counter data but can only profile a few launches at a time due to replay overhead.
SS2. Methodology
SS2.1 Profiling Architecture
Ollama (Phases 2--3): nsys wraps ollama serve, capturing all CUDA activity from Ollama's ggml inference engine. A separate Python thread (running outside the nsys process tree) sends HTTP requests to localhost:11434 after the server is ready. This ensures profiling overhead does not affect request timing. The nsys --kill=true flag terminates Ollama after the profile duration expires.
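One way to assemble the nsys invocation for an Ollama phase is sketched below. The --trace=cuda, --kill=true, and --gpuctxsw=true values are the ones this report quotes; using --duration and --output to set the profile window and trace name is an assumption about how the runs were launched, and the helper name is hypothetical. Accepted values for --kill differ between nsys releases, so check `nsys profile --help` on your version.

```python
def nsys_profile_argv(target, output, duration_s, gpu_ctx_switch=False):
    """Build an argv list for `nsys profile` wrapping `target`.

    target: command to profile, e.g. ["ollama", "serve"]
    output: trace file name without extension (assumed naming scheme)
    duration_s: profile window in seconds (30 for N=1, 60 for N=8 here)
    """
    argv = [
        "nsys", "profile",
        "--trace=cuda",               # as quoted in this report
        f"--duration={duration_s}",   # assumption: window set via --duration
        "--kill=true",                # as quoted; stops the target at expiry
        f"--output={output}",         # assumption: explicit trace name
    ]
    if gpu_ctx_switch:
        argv.append("--gpuctxsw=true")  # context-switch tracing for H3
    return argv + list(target)


cmd = nsys_profile_argv(["ollama", "serve"], "ollama_n1_rep0", 30)
print(" ".join(cmd))
```

The list can be handed to `subprocess.Popen` while a separate thread, outside the profiled process tree, sends the HTTP requests.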
PyTorch Direct (Phases 4--5): nsys wraps a Python script that loads the model via HuggingFace Transformers with torch.float16 and calls model.generate() directly. For N=8, a ThreadPoolExecutor with 8 workers dispatches concurrent inference calls. GPU operations release Python's GIL, enabling true CUDA-level concurrency from multiple threads. The inference script runs inside the nsys process tree (unavoidable), but profiling overhead is symmetric across N=1 and N=8 conditions, preserving the degradation ratio comparison.
Nsight Compute (Phase 6): ncu wraps a Python script that loads the model and performs 5 generate() calls, allowing ncu to capture detailed per-kernel metrics for the top kernels. Only 5 launches are profiled per model (ncu replays each kernel multiple times for counter collection).
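The PyTorch Direct N=8 driver described above can be sketched as a thread pool that times each request independently. This is a minimal sketch, not the study's script: `fake_generate` is a hypothetical stand-in that sleeps instead of touching a GPU so the example runs anywhere; in a real run it would wrap a call to model.generate() on the FP16 model and return the number of new tokens.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_agent(generate_fn, prompt, max_new_tokens):
    """Time one request and return per-agent tokens per second."""
    start = time.perf_counter()
    n_tokens = generate_fn(prompt, max_new_tokens)
    wall_s = time.perf_counter() - start
    return n_tokens / wall_s


def run_concurrent(generate_fn, prompt, n_agents, max_new_tokens=128):
    """Dispatch n_agents simultaneous requests, one worker thread each."""
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        futures = [
            pool.submit(run_agent, generate_fn, prompt, max_new_tokens)
            for _ in range(n_agents)
        ]
        return [f.result() for f in futures]


def fake_generate(prompt, max_new_tokens):
    """Hypothetical stand-in for model.generate(); returns tokens produced."""
    time.sleep(0.01)
    return max_new_tokens


per_agent_tps = run_concurrent(fake_generate, "hello", n_agents=8)
print(f"mean per-agent TPS: {sum(per_agent_tps) / len(per_agent_tps):.1f}")
```

Timing inside each worker, rather than around the whole pool, is what yields a per-agent figure comparable to the N=1 baseline.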
SS2.2 Metrics
| Metric | Source | Formula | Primary? |
|---|---|---|---|
| Per-agent TPS | Request driver | tokens_generated / wall_time_s | Yes |
| Kernel count | nsys cuda_gpu_kern_sum | Total CUDA kernel launches | Yes |
| GPU time (ms) | nsys cuda_gpu_kern_sum | Sum of kernel durations | Yes |
| Memory op time (ms) | nsys cuda_gpu_mem_time_sum | Sum of memcpy/memset durations | Yes |
| Max concurrent kernels | nsys cuda_kern_exec_trace | Peak overlapping kernel count (sweep-line) | Yes |
| GPU utilization % | nsys cuda_kern_exec_trace | active_time / total_trace_time x 100 | Secondary |
| Inter-kernel gap (µs) | nsys cuda_kern_exec_trace | Mean gap between consecutive launches | Secondary |
| SM occupancy % | ncu | Achieved / theoretical warp occupancy | Phase 6 only |
| DRAM throughput % | ncu | Achieved / peak memory bandwidth | Phase 6 only |
Per-agent TPS is the primary metric -- it captures the throughput each agent actually experiences, including all scheduling, queue wait, and memory contention overhead. Kernel count, GPU time, and memory op time are the primary trace metrics used for hypothesis testing.
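The three trace-derived metrics from cuda_kern_exec_trace can be sketched as follows, taking kernels as (start, end) pairs in a common time unit. This is an illustrative implementation under assumptions: the function names are hypothetical, and the gap is interpreted here as the idle time between one kernel's end and the next kernel's start.

```python
def max_concurrent(intervals):
    """Peak number of overlapping [start, end) intervals via a sweep line."""
    events = []
    for start, end in intervals:
        events.append((start, 1))
        events.append((end, -1))
    # At equal timestamps, process ends (-1) before starts (+1) so that
    # back-to-back kernels are not counted as overlapping.
    events.sort(key=lambda e: (e[0], e[1]))
    peak = active = 0
    for _, delta in events:
        active += delta
        peak = max(peak, active)
    return peak


def utilization_pct(intervals, trace_start, trace_end):
    """Union of busy time divided by the trace window, as a percentage."""
    busy = 0.0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                busy += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        busy += cur_end - cur_start
    return busy / (trace_end - trace_start) * 100.0


def mean_gap(intervals):
    """Mean idle gap between consecutive kernels (clamped at zero)."""
    ordered = sorted(intervals)
    gaps = [max(0.0, b[0] - a[1]) for a, b in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps) if gaps else 0.0


serial = [(0, 10), (10, 20), (25, 30)]
print(max_concurrent(serial), utilization_pct(serial, 0, 50), mean_gap(serial))
```

Because utilization divides by the whole trace window, a short burst of kernels inside a long profile reads as near-zero utilization even when the GPU is saturated during each request.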
SS2.3 Statistical Methods
- Welch's t-test (unequal variance) for all pairwise comparisons -- does not assume equal group sizes or variance
- Cohen's d (pooled standard deviation) for standardized effect size -- classified as negligible (<0.2), small (0.2--0.5), medium (0.5--0.8), or large (>0.8)
- Mann-Whitney U (non-parametric) as robustness check on every significant Welch's result
- 95% confidence intervals via t-distribution with n-1 degrees of freedom
- Holm step-down correction for multiple hypothesis tests (k=6 tests, family-wise alpha=0.05)
- Power analysis: minimum detectable Cohen's d given sample sizes: d_min = t_crit(alpha/2, df=2n-2) x sqrt(2/n)
- IQR outlier detection via Tukey fences (1.5x IQR below Q1 or above Q3)
- Descriptive statistics: mean, median, std, p90, p95, p99, min, max per group
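The effect-size, Welch, power, and Holm calculations listed above can be sketched with the standard library alone. The Ollama per-run TPS values come from SS4 and SS5; the t critical value 2.228 (two-sided alpha=0.05, df=10) is a standard table value supplied by hand; the extra p-values in the Holm example are hypothetical placeholders, and all function names are illustrative.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d with the degrees-of-freedom-weighted pooled SD."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def min_detectable_d(t_crit, n):
    """d_min = t_crit x sqrt(2/n) for two groups of size n."""
    return t_crit * sqrt(2.0 / n)


def holm_reject(p_values, alpha=0.05):
    """Holm step-down: one reject flag per input p-value."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])
    reject = [False] * k
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (k - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject


ollama_n1 = [159.98, 160.44, 160.91, 96.30, 96.50, 96.63]
ollama_n8 = [28.86, 27.21, 30.32, 16.58, 16.33, 18.67]

d = cohens_d(ollama_n1, ollama_n8)
t, df = welch_t(ollama_n1, ollama_n8)
d_min = min_detectable_d(2.228, 6)
# First p-value is the memory-op-time test; the rest are hypothetical.
flags = holm_reject([6.4e-5, 0.02, 0.04, 0.3, 0.6, 1.0])
print(f"d={d:.2f}  t={t:.2f}  d_min={d_min:.2f}  holm={flags}")
```

With k=6 the smallest p-value is compared against 0.05/6 = 0.0083, the threshold quoted in the Executive Summary.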
SS2.4 Six Phases
| Phase | Description | Runs | Profile Duration | Expected Traces |
|---|---|---|---|---|
| P1 | Validation: nsys captures Ollama CUDA kernels | 1 | 30s | ~1 MB |
| P2 | Ollama N=1: 2 models x 3 reps | 6 | 30s | ~1 MB each |
| P3 | Ollama N=8: 2 models x 3 reps | 6 | 60s | ~4 MB each |
| P4 | PyTorch Direct N=1: 2 models x 3 reps | 6 | 60s | ~40--80 MB each |
| P5 | PyTorch Direct N=8: 2 models x 3 reps | 6 | 120s | ~160--270 MB each |
| P6 | Nsight Compute: 2 models x 5 kernel launches | 2 | ~2--4 min | CSV output |
| Total | | 27 | ~71 min | ~1.6 GB |
Each phase requires a fresh start of the target process (Ollama or PyTorch script) under nsys. Between runs, the previous process is fully stopped and a 3-second cooldown allows GPU temperature stabilization.
SS3. Phase 1 -- Environment Validation
Phase 1 is the critical gate: if nsys cannot capture CUDA kernels from Ollama's process tree, the entire experiment fails. Ollama spawns child processes for GPU inference (the ggml CUDA backend runs in a separate subprocess), so nsys must follow the process tree.
SS3.1 Validation Results
| Check | Result | Status |
|---|---|---|
| nsys reachable | NVIDIA Nsight Systems 2025.5.1.121-255136380782v0 | Pass |
| CUDA API calls captured | 26,185 | Pass |
| GPU kernels captured | 1,871 | Pass |
| GPU time | 44.7 ms | Pass |
| Ollama HTTP requests | 3/3 OK (162.2 TPS) | Pass |
| Reports extracted | cuda_api_sum, cuda_gpu_kern_sum, cuda_gpu_mem_time_sum, cuda_kern_exec_trace, osrt_sum | Pass |
| Trace file size | 0.9 MB | Expected |
SS3.2 Observations
Observation 1 -- nsys captures ggml CUDA kernels through the Ollama process tree. The 1,871 kernel launches and 26,185 CUDA API calls confirm full visibility into Ollama's GPU activity. This is not guaranteed -- some profilers cannot follow forked child processes on Windows. nsys's --trace=cuda flag with --kill=true successfully wraps ollama serve and its child processes.
Observation 2 -- The dominant kernel is mul_mat_q<ggml_type=8>. This is ggml's quantized matrix multiplication kernel for Q4_0 (type 8 in the ggml enum). It performs dequantization and matrix multiplication in a single fused kernel, avoiding the bandwidth cost of a separate dequantize pass. The top 3 kernels are all mul_mat_q variants with different tile sizes (64, 80) and stream-k fixup kernels.
Observation 3 -- Validation TPS (162.2) matches TR129 unprofiled data (~160 TPS for LLaMA-1B). This provides early evidence that nsys profiling overhead is negligible for Ollama runs (further validated in SS17).
Observation 4 -- GPU utilization reads 0% but this is a denominator artifact. The 44.7 ms of GPU time within a 30-second profile window means the GPU is active for only 0.15% of the total profile duration. Between requests, the GPU is idle. This metric is misleading for bursty workloads (discussed further in SS9).
Observation 5 -- Max concurrent kernels = 1 even during validation. This is the first hint that kernel serialization is a GPU-level phenomenon, not specific to multi-agent concurrency. A single sequential request already executes kernels one at a time because each ggml kernel occupies the full GPU.
Gate: PASSED. All 5 nsys report types extracted successfully. Proceeding to profiled phases.
SS4. Phase 2 -- Ollama N=1 Baseline
SS4.1 Per-Request Results
| Model | Rep | TPS | Wall (ms) | Trace (MB) | Kernels | GPU Time (ms) |
|---|---|---|---|---|---|---|
| LLaMA-1B | 0 | 159.98 | 800.4 | 0.8 | 2,257 | 45.9 |
| LLaMA-1B | 1 | 160.44 | 797.8 | 0.8 | 2,257 | 45.9 |
| LLaMA-1B | 2 | 160.91 | 795.6 | 0.8 | 2,257 | 45.9 |
| LLaMA-3B | 0 | 96.30 | 1,329.5 | 1.0 | 3,949 | 117.4 |
| LLaMA-3B | 1 | 96.50 | 1,326.3 | 1.0 | 3,949 | 118.3 |
| LLaMA-3B | 2 | 96.63 | 1,324.7 | 1.0 | 3,949 | 121.0 |
SS4.2 Descriptive Statistics
| Model | Mean TPS | 95% CI | Std | CV% | Median | p90 |
|---|---|---|---|---|---|---|
| LLaMA-1B | 160.44 | [159.29, 161.60] | 0.47 | 0.29% | 160.44 | 160.82 |
| LLaMA-3B | 96.48 | [96.06, 96.89] | 0.17 | 0.17% | 96.50 | 96.60 |
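The mean, standard deviation, CV, and 95% CI in this table can be reproduced from the per-request TPS values in SS4.1. The sketch below hard-codes the two-sided 95% t critical value for df=2 (4.303, a standard table value) rather than calling a statistics library; the function name is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

T_CRIT_DF2 = 4.303  # two-sided 95% t critical value for n=3 (df=2)


def summarize(samples, t_crit):
    """Mean, sample std, CV%, and t-based 95% CI for a small sample."""
    m, s = mean(samples), stdev(samples)
    half_width = t_crit * s / sqrt(len(samples))
    return {
        "mean": m,
        "std": s,
        "cv_pct": s / m * 100.0,
        "ci95": (m - half_width, m + half_width),
    }


stats_1b = summarize([159.98, 160.44, 160.91], T_CRIT_DF2)
stats_3b = summarize([96.30, 96.50, 96.63], T_CRIT_DF2)

for name, st in (("LLaMA-1B", stats_1b), ("LLaMA-3B", stats_3b)):
    lo, hi = st["ci95"]
    print(f"{name}: mean={st['mean']:.2f} CI=[{lo:.2f}, {hi:.2f}] CV={st['cv_pct']:.2f}%")
```

With only three repetitions the critical value is large, so the tight intervals here come from the very small run-to-run standard deviation.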
SS4.3 Observations
Observation 1 -- Ollama N=1 throughput is extremely deterministic. LLaMA-1B: CV=0.29%, 95% CI width = 2.31 TPS. LLaMA-3B: CV=0.17%, 95% CI width = 0.83 TPS. This sub-1% variance means that even 3 repetitions produce tight confidence intervals. The determinism comes from sequential request processing: with no contention, every request follows the same code path through ggml -> CUDA -> memory controller.
Observation 2 -- LLaMA-1B is 1.66x faster than LLaMA-3B (160.4 vs 96.5 TPS). The parameter ratio is 2.67x (3.2B / 1.2B), but the throughput ratio is only 1.66x -- sublinear scaling. The gap is smaller than expected because per-token overhead (CUDA launch, memory allocation, HTTP round-trip) is approximately constant regardless of model size. For the 1B model, this fixed overhead is a larger fraction of total time, compressing the ratio.
Observation 3 -- Kernel counts are identical across repetitions for the same model. LLaMA-1B: exactly 2,257 kernels in all 3 reps. LLaMA-3B: exactly 3,949 in all 3 reps. This perfect reproducibility means the ggml execution graph is fully deterministic for a given prompt length and model architecture -- no dynamic kernel dispatch.
Observation 4 -- GPU time is a tiny fraction of wall time. LLaMA-1B: 45.9 ms GPU time vs 798 ms wall time -- the GPU is actively computing for only 5.7% of request time. The remaining 94.3% is Ollama overhead: HTTP handling, tokenization, JSON serialization, queue management, and inter-kernel gaps. This matches TR130's finding of ~210 ms scheduling overhead per request.
Observation 5 -- Trace sizes are tiny (~0.8--1.0 MB per 30s profile). This confirms that CUDA activity for 5 sequential Q4_0 inference requests is minimal. The small trace sizes also mean nsys stats extraction is fast (<5 seconds per run), avoiding the timeout issues that will affect large PyTorch traces (SS6).
SS4.4 Kernel Architecture (N=1)
| Metric | LLaMA-1B | LLaMA-3B | Interpretation |
|---|---|---|---|
| Kernel instances | 2,257 | 3,949 | 1.75x -- proportional to layer count (16 vs 28 layers) |
| GPU time (ms) | 45.9 | 118.9 | 2.59x -- closer to parameter ratio (2.67x) |
| Attention kernel % | 8.4% | 8.5% | Near-identical -- attention fraction is architecture-independent |
| GEMM kernel % | 0% | 0% | Expected -- ggml uses fused mul_mat_q, not cuBLAS GEMM |
| Memory op time (ms) | 109.3 | 245.4 | 2.25x -- tracks weight size ratio |
The 0% GEMM fraction deserves explanation. Standard PyTorch inference dispatches separate cuBLAS GEMM calls for matrix multiplications. Ollama's ggml backend uses custom mul_mat_q kernels that fuse dequantization and matrix multiplication into a single kernel. This fusion eliminates the intermediate dequantized tensor, halving memory bandwidth for each matmul operation. The ggml kernel names (void mul_mat_q<(ggml_type)8, (int)64, (bool)0>) confirm Q4_0 (type=8) with tile sizes 64 and 80.
SS5. Phase 3 -- Ollama N=8 Concurrent
SS5.1 Per-Request Results
| Model | Rep | Per-Agent TPS | Wall (ms) | Trace (MB) | Kernels | GPU Time (ms) |
|---|---|---|---|---|---|---|
| LLaMA-1B | 0 | 28.86 | 4,833 | 3.3 | 10,975 | 216.1 |
| LLaMA-1B | 1 | 27.21 | 4,971 | 3.3 | 10,975 | 216.1 |
| LLaMA-1B | 2 | 30.32 | 4,856 | 3.3 | 10,975 | 216.1 |
| LLaMA-3B | 0 | 16.58 | 8,729 | 3.7 | 18,492 | 512.9 |
| LLaMA-3B | 1 | 16.33 | 8,716 | 3.8 | 19,745 | 514.9 |
| LLaMA-3B | 2 | 18.67 | 8,689 | 4.1 | 19,745 | 521.0 |
SS5.2 Descriptive Statistics
| Model | Mean TPS | 95% CI | Std | CV% | Median | Degradation from N=1 |
|---|---|---|---|---|---|---|
| LLaMA-1B | 28.80 | [24.93, 32.66] | 1.56 | 5.4% | 28.86 | -82.1% |
| LLaMA-3B | 17.19 | [14.00, 20.39] | 1.28 | 7.5% | 16.58 | -82.2% |
SS5.3 Observations
Observation 1 -- 82.1% per-agent degradation, highly significant. Overall: 128.46 -> 22.99 TPS, Welch's t=7.25, p=0.0006, Cohen's d=4.19. Mann-Whitney U=36.0, p=0.002, confirming non-parametric robustness. The effect size of 4.19 is far above the minimum detectable d=1.29 at our sample sizes (SS16).
Observation 2 -- Degradation is near-identical across models. LLaMA-1B: -82.1% (131.6 TPS lost). LLaMA-3B: -82.2% (79.3 TPS lost). The degradation percentage is model-independent even though the absolute TPS loss differs by 1.66x. This means the degradation mechanism is proportional to baseline throughput -- consistent with bandwidth contention, which scales with weight size.
Observation 3 -- Variance increases at N=8 (CV: 0.29% -> 5.4% for 1B, 0.17% -> 7.5% for 3B). Under contention, the nondeterminism of GPU memory controller scheduling introduces variability. The 95% CI width grows from 2.3 to 7.7 TPS for 1B and from 0.8 to 6.4 TPS for 3B. This increased variance is expected when multiple processes contend for the same memory bus.
Observation 4 -- Kernel counts increase ~5x for 8x workload. LLaMA-1B: 2,257 -> 10,975 (4.86x). LLaMA-3B: 3,949 -> 19,327 (4.89x). The sub-8x scaling means that not all 8 agents complete all 3 requests within the 60-second profile window -- some requests are cut short by nsys's --kill termination. The kernel count still reflects the actual GPU work done.
Observation 5 -- GPU time increases ~4.7x while kernel count increases ~4.9x. LLaMA-1B: 45.9 -> 216.1 ms (4.71x GPU time), 2,257 -> 10,975 kernels (4.86x count). The GPU time per kernel is approximately constant (20.3 µs at N=1, 19.7 µs at N=8), suggesting that individual kernel execution time is not affected by concurrency -- the slowdown comes from increased total work competing for memory bandwidth, not from individual kernels running slower.
Observation 6 -- Trace sizes increase ~4x, not 8x. LLaMA-1B: 0.8 -> 3.3 MB (4.1x). The sub-8x scaling mirrors the kernel count -- the profile duration captures 4.9x more kernels than N=1, proportionally increasing trace size. The modest trace sizes (3.3--4.1 MB) mean nsys stats extraction remains fast for Ollama runs.
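The scaling ratios in Observations 4 and 5 follow directly from the LLaMA-1B rows of the SS4.1 and SS5.1 tables. A short worked check, with hypothetical variable names:

```python
def per_kernel_us(gpu_time_ms, kernel_count):
    """Average GPU time per kernel in microseconds."""
    return gpu_time_ms * 1000.0 / kernel_count


# LLaMA-1B totals per profiled run (identical across the three reps).
n1 = {"kernels": 2257, "gpu_ms": 45.9}
n8 = {"kernels": 10975, "gpu_ms": 216.1}

kernel_ratio = n8["kernels"] / n1["kernels"]
gpu_time_ratio = n8["gpu_ms"] / n1["gpu_ms"]
us_n1 = per_kernel_us(n1["gpu_ms"], n1["kernels"])
us_n8 = per_kernel_us(n8["gpu_ms"], n8["kernels"])

print(f"kernels x{kernel_ratio:.2f}, GPU time x{gpu_time_ratio:.2f}, "
      f"per-kernel {us_n1:.1f} -> {us_n8:.1f} us")
```

Because both totals grow by nearly the same factor, the per-kernel average stays close to 20 µs at both concurrency levels.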
SS5.4 Interpretation -- Ollama Degradation Mechanism
The data suggests the following mechanism: at N=8, Ollama receives 8 concurrent HTTP requests and queues them. Because ggml processes one request at a time (max_concurrent_kernels = 1 at the GPU level), each agent waits while the other 7 are served. The wall-clock time per request grows from ~800 ms to ~4,900 ms -- approximately 6x rather than 8x, because some overlap in HTTP processing and tokenization occurs while the GPU handles the previous request.
The key question is whether this serialization is Ollama's fault (it could batch requests) or the GPU's constraint (the hardware can only run one full-width kernel at a time). Phases 4--5 answer this by testing PyTorch Direct, which has no request queue and uses threads for concurrency.
SS6. Phases 4--5 -- PyTorch Direct N=1 and N=8
SS6.1 PyTorch N=1 Results
| Model | Rep | TPS | Wall (ms) | Trace (MB) | Kernels | GPU Time (ms) |
|---|---|---|---|---|---|---|
| LLaMA-1B | 0 | 52.46 | 2,440 | 42.4 | 903,323 | 10,344 |
| LLaMA-1B | 1 | 52.46 | 2,440 | 42.3 | 903,323 | 10,358 |
| LLaMA-1B | 2 | 51.13 | 2,506 | 42.3 | 903,323 | 10,359 |
| LLaMA-3B | 0 | -- | -- | 21.3 | 452,567 | 6,950 |
| LLaMA-3B | 1 | 29.24 | 4,378 | 76.6 | 1,628,896 | 25,352 |
| LLaMA-3B | 2 | 29.41 | 4,354 | 76.6 | 1,628,896 | 25,376 |
Note: LLaMA-3B rep0 returned 0 complete requests -- model loading consumed the entire 60-second profile duration. The 452,567 kernels represent partial model loading. Excluded from all TPS analysis; n=2 for LLaMA-3B N=1.
SS6.2 PyTorch N=8 Results
| Model | Rep | Per-Agent TPS | Wall (ms) | Trace (MB) | Kernels | GPU Time (ms) |
|---|---|---|---|---|---|---|
| LLaMA-1B | 0 | 7.03 | 18,289 | 161.9 | 3,165,433 | 36,484 |
| LLaMA-1B | 1 | 7.30 | 17,641 | 161.2 | 3,165,433 | 36,526 |
| LLaMA-1B | 2 | 7.22 | 17,863 | 161.8 | 3,165,433 | 36,604 |
| LLaMA-3B | 0 | -- | -- | 172.7 | 3,384,704 | 55,177 |
| LLaMA-3B | 1 | 3.80 | 33,791 | 271.8 | 5,262,724 | 82,327 |
| LLaMA-3B | 2 | 3.78 | 33,945 | 271.6 | 5,274,779 | 82,606 |
Note: LLaMA-3B rep0 again returned 0 complete requests. Excluded from TPS analysis.
SS6.3 Descriptive Statistics
| Backend | Model | N | Mean TPS | 95% CI | Std | CV% |
|---|---|---|---|---|---|---|
| PyTorch | LLaMA-1B | 1 | 52.02 | [50.11, 53.92] | 0.77 | 1.5% |
| PyTorch | LLaMA-3B | 1 | 29.33 | [28.25, 30.41] | 0.12 | 0.4% |
| PyTorch | LLaMA-1B | 8 | 7.18 | [6.84, 7.53] | 0.14 | 1.9% |
| PyTorch | LLaMA-3B | 8 | 3.79 | [3.66, 3.92] | 0.01 | 0.4% |
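The rows above can be reproduced from the per-rep TPS values in SS6.1 and SS6.2. The sketch below assumes a sample standard deviation (n-1 denominator) and a two-sided 95% Student-t interval; the report does not state its method, but these assumptions match the LLaMA-1B N=1 row. `describe` and `T_CRIT_95` are illustrative names.

```python
import math
import statistics

# Two-sided 95% Student-t critical values, keyed by degrees of freedom (n - 1).
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571}


def describe(samples):
    """Mean, sample std, CV%, and t-based 95% CI for a small sample."""
    n = len(samples)
    mean = statistics.mean(samples)
    std = statistics.stdev(samples)  # sample std, n-1 denominator
    half_width = T_CRIT_95[n - 1] * std / math.sqrt(n)
    return {
        "mean": mean,
        "std": std,
        "cv_pct": 100.0 * std / mean,
        "ci95": (mean - half_width, mean + half_width),
    }


# Per-rep TPS for PyTorch LLaMA-1B at N=1 (SS6.1).
stats_1b = describe([52.46, 52.46, 51.13])
print({k: v for k, v in stats_1b.items()})
```

With only 2 or 3 reps the t critical value is large (12.706 for n=2), which is why the intervals are wide relative to the standard deviations.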
SS6.4 Observations
Observation 1 -- Massive trace sizes reveal PyTorch's eager execution model. PyTorch N=1 generates 42--77 MB traces vs Ollama's 0.8--1.0 MB. PyTorch N=8 reaches 162--272 MB. The reason: PyTorch's eager-mode execution dispatches individual CUDA kernels for every operation -- each nn.Linear, softmax, layer norm, and attention computation launches separate kernels. Ollama's ggml fuses these into large mul_mat_q blocks.
Observation 2 -- PyTorch launches roughly 270--410x more kernels than Ollama. At N=1: PyTorch LLaMA-1B launches 903,323 kernels vs Ollama's 2,257 (400x). LLaMA-3B: 1,628,896 vs 3,949 (412x). At N=8 the gap narrows to 288x for LLaMA-1B (3,165,433 vs 10,975) and 272x for LLaMA-3B (5,262,724 vs 19,327). This gap reflects the difference between eager execution (individual ops -> individual kernels) and fused execution (ggml combines dequant + matmul + element-wise into single kernels). Despite launching 400x more kernels, PyTorch is only 3x slower in TPS: each PyTorch kernel is smaller and shorter-running, but the per-launch overhead accumulates across hundreds of thousands of launches.
Observation 3 -- PyTorch N=1 is 3.0x slower than Ollama N=1. LLaMA-1B: 52.0 vs 160.4 TPS (3.08x). LLaMA-3B: 29.3 vs 96.5 TPS (3.29x). The primary driver is FP16 vs Q4_0: FP16 weights are 4x larger, requiring 4x more memory bandwidth per token. The sub-4x ratio reflects that not all time is memory-bound -- some is compute-bound (attention) and some is fixed overhead (kernel launch).
Observation 4 -- PyTorch GPU time is vastly higher. LLaMA-1B N=1: 10,350 ms GPU time vs Ollama's 45.9 ms (225x). This is partly the 400x kernel count difference and partly the FP16 weight reads. But note: PyTorch's GPU time exceeds its wall time (10,350 ms GPU vs 2,440 ms wall). This means kernels overlap on the GPU timeline -- the CUDA runtime pipelines kernel execution even within a single stream, but the wall-clock time reflects that the GPU is near-fully utilized during inference.
Observation 5 -- LLaMA-3B rep0 failed in both N=1 and N=8. The 3B FP16 model requires ~6.4 GB VRAM for weights alone. Combined with PyTorch's memory overhead (activation caching, CUDA context), model loading takes longer than the profile duration on the first run. Subsequent reps benefit from cached CUDA context. This is a startup artifact, not a measurement issue -- excluded from analysis.
Observation 6 -- PyTorch N=8 variance is remarkably low. LLaMA-1B N=8: CV=1.9% (7.18 +/- 0.14 TPS). LLaMA-3B N=8: CV=0.4% (3.79 +/- 0.01 TPS). The low variance under 8-thread concurrency suggests that GPU memory controller scheduling is deterministic at steady state -- all 8 threads get equal, predictable bandwidth slices.
Observation 7 -- nsys stats extraction timed out for large PyTorch N=8 traces. The three LLaMA-3B N=8 traces (173--272 MB, 3.4--5.3M kernels) caused the cuda_kern_exec_trace stats extraction to exceed the 120-second timeout. This is why some per-model GPU utilization data is missing for PyTorch N=8 LLaMA-3B. The missing data does not affect the primary TPS comparison.
SS6.5 Interpretation -- The Bombshell Finding
PyTorch Direct N=8 degrades 86.4% -- worse than Ollama's 82.1%. This is the opposite of what TR130 would predict. If Ollama's serving stack were the bottleneck, removing it (PyTorch Direct has no HTTP, no Go, no Ollama) should reduce degradation. Instead, it increases.
This means the degradation is not caused by Ollama's request scheduling. It is caused by the GPU memory bandwidth constraint. With 8 concurrent threads all calling model.generate() on the same GPU, the CUDA runtime must serialize kernel execution (max_concurrent=1) and share memory bandwidth across 8 concurrent weight-read streams. The result is the same throughput collapse -- actually worse, because FP16 weights require 4x more bandwidth per parameter than Q4_0.
SS7. The Core Comparison -- Ollama vs PyTorch Degradation
This section presents the central analysis: comparing degradation ratios between Ollama and PyTorch Direct to attribute the throughput collapse.
SS7.1 Aggregate Comparison
| Metric | Ollama (n=6) | PyTorch (n=5) | Interpretation |
|---|---|---|---|
| N=1 Mean TPS | 128.46 [91.69, 165.23] | 42.94 [27.49, 58.39] | Ollama 3.0x faster (Q4_0) |
| N=8 Mean TPS | 22.99 [16.19, 29.80] | 5.83 [3.52, 8.14] | Ollama 3.9x faster |
| N=1->N=8 Degradation | -82.1% | -86.4% | PyTorch degrades MORE |
| p-value (degradation) | 0.0006 | 0.002 | Both highly significant |
| Cohen's d | 4.19 | 4.17 | Both massive effects |
| Mann-Whitney p | 0.002 | 0.012 | Non-parametric confirms |
SS7.2 Per-Model Breakdown
| Model | Backend | N=1 TPS | N=8 TPS | Degradation | p-value | Cohen's d |
|---|---|---|---|---|---|---|
| LLaMA-1B | Ollama | 160.44 | 28.80 | -82.1% | 1.1x10^-5 | 114.64 |
| LLaMA-1B | PyTorch | 52.02 | 7.18 | -86.2% | 6.1x10^-5 | 81.26 |
| LLaMA-3B | Ollama | 96.48 | 17.19 | -82.2% | 6.8x10^-5 | 86.54 |
| LLaMA-3B | PyTorch | 29.33 | 3.79 | -87.1% | 1.8x10^-3 | 298.35 |
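The degradation and effect-size columns can be recomputed from the per-rep TPS values in SS6.1 and SS6.2. The sketch assumes Cohen's d with a pooled sample standard deviation; the report does not name its formula, but this choice reproduces the two PyTorch rows. Function names are illustrative.

```python
import math
import statistics


def degradation_pct(tps_n1, tps_n8):
    """Relative change in mean TPS from N=1 to N=8, in percent (negative = slower)."""
    m1, m8 = statistics.mean(tps_n1), statistics.mean(tps_n8)
    return 100.0 * (m8 - m1) / m1


def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation of groups a and b."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var)


# PyTorch per-rep TPS; LLaMA-3B rep0 is excluded as in the report.
pt_1b_n1, pt_1b_n8 = [52.46, 52.46, 51.13], [7.03, 7.30, 7.22]
pt_3b_n1, pt_3b_n8 = [29.24, 29.41], [3.80, 3.78]

for name, n1, n8 in [("1B", pt_1b_n1, pt_1b_n8), ("3B", pt_3b_n1, pt_3b_n8)]:
    print(name, round(degradation_pct(n1, n8), 1), round(cohens_d(n1, n8), 2))
```

The very large d for LLaMA-3B follows directly from the formula: with two nearly identical reps per group, the pooled standard deviation is under 0.1 TPS while the means differ by about 25 TPS.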
SS7.3 Degradation Ratio Comparison
| Model | Ollama Deg. | PyTorch Deg. | Difference | Interpretation |
|---|---|---|---|---|
| LLaMA-1B | -82.1% | -86.2% | -4.1 pp | PyTorch 4.1 pp worse |
| LLaMA-3B | -82.2% | -87.1% | -4.9 pp | PyTorch 4.9 pp worse |
| Overall | -82.1% | -86.4% | -4.3 pp | PyTorch worse overall |
SS7.4 Ollama Advantage Growth Under Contention
| Model | N=1 Ollama/PyTorch Ratio | N=8 Ollama/PyTorch Ratio | Growth |
|---|---|---|---|
| LLaMA-1B | 3.08x | 4.01x | +0.93x |
| LLaMA-3B | 3.29x | 4.54x | +1.25x |
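The ratios in this table follow from the mean TPS values in SS7.2. A short sketch of the arithmetic (`advantage_growth` is an illustrative name):

```python
def advantage_growth(ollama_n1, pytorch_n1, ollama_n8, pytorch_n8):
    """Return (N=1 ratio, N=8 ratio, growth) of Ollama TPS over PyTorch TPS."""
    ratio_n1 = ollama_n1 / pytorch_n1
    ratio_n8 = ollama_n8 / pytorch_n8
    return ratio_n1, ratio_n8, ratio_n8 - ratio_n1


# Mean TPS values from the per-model breakdown (SS7.2).
for model, args in {
    "LLaMA-1B": (160.44, 52.02, 28.80, 7.18),
    "LLaMA-3B": (96.48, 29.33, 17.19, 3.79),
}.items():
    r1, r8, growth = advantage_growth(*args)
    print(f"{model}: {r1:.2f}x -> {r8:.2f}x (+{growth:.2f}x)")
```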
SS7.5 Degradation Attribution
| Source | Attribution | Derivation |
|---|---|---|
| GPU memory physics | 86.4% | PyTorch Direct baseline (no serving stack) |
| Ollama serving stack | -4.3% | Ollama degrades less than PyTorch |
| Net observed (Ollama) | 82.1% | 86.4% + (-4.3%) |
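The attribution is a simple subtraction: the PyTorch Direct degradation is treated as the stack-free baseline, and whatever remains of the Ollama figure is assigned to the serving stack. A minimal sketch of that bookkeeping (`attribute` is an illustrative name, and treating the baseline this way is the report's modelling choice, not a general rule):

```python
def attribute(observed_deg_pct, baseline_deg_pct):
    """Split an observed degradation into a stack-free baseline and a residual."""
    return {
        "gpu_memory_physics_pct": baseline_deg_pct,
        "serving_stack_pct": observed_deg_pct - baseline_deg_pct,
        "net_observed_pct": observed_deg_pct,
    }


# Aggregate degradation magnitudes from SS7.1: Ollama 82.1%, PyTorch Direct 86.4%.
print(attribute(observed_deg_pct=82.1, baseline_deg_pct=86.4))
```

A negative residual means the observed backend degrades less than the baseline. Because the two backends also differ in weight precision (Q4_0 vs FP16), the residual captures that difference as well as any serving-stack effect.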
SS7.6 Observations
Observation 1 -- PyTorch degrades 4.3 percentage points more than Ollama, overturning TR130. TR130 concluded: "The serving stack is the bottleneck." If true, PyTorch Direct (no serving stack) should degrade less. It degrades more. The attribution table shows -4.3% for Ollama's serving stack -- a negative contribution, meaning Ollama's stack slightly reduces degradation, probably because Q4_0 quantization reduces per-request bandwidth pressure.
Observation 2 -- The Ollama advantage grows from 3.0x to 3.9x under contention. At N=1, Q4_0's 4x smaller weights give Ollama a 3.0x throughput advantage. At N=8, this grows to 3.9x because bandwidth becomes the binding constraint. Q4_0's advantage compounds: at N=1, estimated weight traffic is 22--43% of peak bandwidth (SS10.4); at N=8, the undegraded demand would exceed peak. The backend that uses less bandwidth per token suffers less.
Observation 3 -- Degradation is nearly model-size independent. Within each backend, LLaMA-1B and 3B degrade by almost the same amount: 82.1% vs 82.2% for Ollama (0.1 pp apart) and 86.2% vs 87.1% for PyTorch (0.9 pp apart). This means the degradation mechanism scales proportionally with model size -- consistent with bandwidth saturation, which is proportional to weight reads per token.
Observation 4 -- The LLaMA-3B Ollama/PyTorch advantage grows MORE than 1B (4.54x vs 4.01x at N=8). Larger models have higher bandwidth demand, so Q4_0's bandwidth savings compound more. The 3B FP16 model requires ~6.4 GB of weight reads per full forward pass vs ~1.6 GB for Q4_0. The 4x bandwidth gap amplifies under contention.
Observation 5 -- All 4 per-model degradation tests are significant at p < 0.002. Even with only n=2 for PyTorch LLaMA-3B, the effect size is d=298 (the enormous d reflects near-zero within-group variance for both N=1 and N=8). The statistical conclusion is unambiguous: the degradation is real, large, and reproducible.
Observation 6 -- This reframes TR130's finding. TR130 is correct that vLLM/TGI scale better than Ollama. But the reason is not "better scheduling" -- it is continuous batching's bandwidth efficiency. vLLM batches multiple sequences into single kernel launches, reading the model weights once for multiple tokens. Ollama reads the weights once per token per request. The difference is bandwidth amortization, not request scheduling.
SS8. Kernel Profile Comparison
SS8.1 Aggregate Statistics
| Phase | Mean Kernels | 95% CI | Mean GPU Time (ms) | 95% CI |
|---|---|---|---|---|
| Ollama N=1 | 3,103 | [2,130, 4,076] | 82.4 | [40.4, 124.4] |
| Ollama N=8 | 15,151 | [10,322, 19,979] | 366.2 | [193.6, 538.8] |
| PyTorch N=1 | 1,070,055 | [580,226, 1,559,883] | 14,790 | [6,083, 23,496] |
| PyTorch N=8 | 3,903,084 | [2,789,369, 5,016,799] | 54,954 | [31,341, 78,567] |
SS8.2 Statistical Tests
| Comparison | Mean A | Mean B | Delta % | Cohen's d | t-stat | p-value | M-W p |
|---|---|---|---|---|---|---|---|
| Ollama kernels N=1->N=8 | 3,103 | 15,151 | +388% | 3.63 | -6.29 | 0.001 | 0.004 |
| PyTorch kernels N=1->N=8 | 1,070,055 | 3,903,084 | +265% | 3.46 | -5.99 | 0.0006 | 0.004 |
SS8.3 Per-Model Kernel Breakdown
| Model | Phase | Kernels | GPU Time (ms) | GPU Time/Kernel (µs) | GEMM % | Attention % |
|---|---|---|---|---|---|---|
| LLaMA-1B | Ollama N=1 | 2,257 | 45.9 | 20.3 | 0% | 8.4% |
| LLaMA-1B | Ollama N=8 | 10,975 | 216.1 | 19.7 | 0% | 8.4% |
| LLaMA-1B | PyTorch N=1 | 903,323 | 10,353 | 11.5 | 71.6% | 0.6% |
| LLaMA-1B | PyTorch N=8 | 3,165,433 | 36,538 | 11.5 | 71.9% | 0.6% |
| LLaMA-3B | Ollama N=1 | 3,949 | 118.9 | 30.1 | 0% | 8.5% |
| LLaMA-3B | Ollama N=8 | 19,327 | 516.3 | 26.7 | 0% | 8.5% |
| LLaMA-3B | PyTorch N=1 | 1,236,786 | 19,226 | 15.5 | 71.6% | 0.6% |
| LLaMA-3B | PyTorch N=8 | 4,640,736 | 73,370 | 15.8 | 71.9% | 0.6% |
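The per-kernel column is derived from the two columns before it. A short sketch of the conversion, using the LLaMA-1B rows (`us_per_kernel` is an illustrative name):

```python
def us_per_kernel(gpu_time_ms, kernel_count):
    """Mean GPU time per kernel in microseconds."""
    return gpu_time_ms * 1000.0 / kernel_count


# (kernels, GPU time in ms) for the LLaMA-1B rows of the table above.
rows = {
    "Ollama N=1": (2_257, 45.9),
    "Ollama N=8": (10_975, 216.1),
    "PyTorch N=1": (903_323, 10_353),
    "PyTorch N=8": (3_165_433, 36_538),
}
for phase, (kernels, gpu_ms) in rows.items():
    print(f"{phase}: {us_per_kernel(gpu_ms, kernels):.1f} us/kernel")
```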
SS8.4 Observations
Observation 1 -- Per-kernel execution time is essentially unchanged between N=1 and N=8. LLaMA-1B Ollama: 20.3 µs/kernel at N=1, 19.7 µs at N=8. PyTorch: 11.5 µs at both. LLaMA-3B shows the same pattern (Ollama 30.1 -> 26.7 µs, PyTorch 15.5 -> 15.8 µs). Individual kernels do not run slower under contention -- the slowdown comes from more total work competing for the same memory bandwidth, not from individual kernel degradation. This is consistent with bandwidth contention rather than compute saturation.
Observation 2 -- Ollama and PyTorch have inverted GEMM/Attention profiles. Ollama: 0% GEMM, 8.5% attention. PyTorch: 71.6% GEMM, 0.6% attention. The inversion reflects different kernel implementations. Ollama's ggml fuses matmul into mul_mat_q (not reported as cuBLAS GEMM). PyTorch dispatches separate cuBLAS GEMM calls for each linear layer, making GEMM the dominant kernel class. The attention difference reflects ggml's custom attention kernel vs PyTorch's decomposed scaled_dot_product_attention.
Observation 3 -- Kernel count gap is 345x at N=1 but narrows to 258x at N=8. Ollama: 3,103 -> 15,151 (4.88x). PyTorch: 1,070,055 -> 3,903,084 (3.65x). PyTorch's kernel count scales less than Ollama's because some kernels are shared across threads (memory allocation, context management), while Ollama's kernel count scales nearly linearly with the number of concurrent requests processed.
Observation 4 -- The attention/GEMM fractions are invariant with N. Both backends maintain identical attention and GEMM percentages at N=1 and N=8 (within 0.3 pp). Concurrency does not change the kernel mix -- it scales all kernel types proportionally. This rules out a hypothesis where attention kernels become disproportionately expensive under contention.
SS9. GPU Utilization Analysis
SS9.1 Results
GPU utilization from kernel exec trace analysis reads 0.0% for all 24 runs across all 4 conditions. This counterintuitive result requires careful interpretation.
SS9.2 Why Utilization Reads Zero
The utilization metric is computed as: active_kernel_time / total_profile_duration x 100. For Ollama N=1: 45.9 ms of kernel activity within a 30,000 ms profile window = 0.15%. The metric rounds to 0% because inference requests occupy a tiny fraction of the total profiling window -- between requests, the GPU is idle.
This metric is misleading for bursty workloads. During active inference, the GPU is near-fully utilized -- evidenced by max_concurrent_kernels = 1, meaning the GPU has no idle SMs during kernel execution. The correct interpretation is:
- Instantaneous utilization during inference: near 100% (GPU is the bottleneck)
- Time-averaged utilization over profile window: <1% (most time is between requests)
For the multi-agent comparison, the relevant metric is how individual kernel execution and bandwidth are affected by concurrency -- captured by GPU time, memory op time, and kernel count in SS8 and SS10.
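The gap between the two readings can be illustrated with the utilization formula from SS9.2. The sketch below uses hypothetical kernel intervals that sum to the 45.9 ms of Ollama N=1 kernel activity; the interval boundaries are invented for illustration, and `busy_time_ms` and `utilization_pct` are illustrative names.

```python
def busy_time_ms(intervals):
    """Total time covered by the union of (start_ms, end_ms) kernel intervals."""
    total = 0.0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def utilization_pct(intervals, window_start_ms, window_end_ms):
    """active_kernel_time / window_duration x 100, for intervals inside the window."""
    return 100.0 * busy_time_ms(intervals) / (window_end_ms - window_start_ms)


# Hypothetical back-to-back kernels totalling 45.9 ms inside a 30,000 ms profile.
kernels = [(100.0, 120.0), (120.0, 145.9)]
print(utilization_pct(kernels, 0.0, 30_000.0))   # time-averaged over the profile
print(utilization_pct(kernels, 100.0, 145.9))    # restricted to the active burst
```

The same kernel activity gives a fraction of a percent over the full profile window and close to 100% over the burst itself, which is the distinction drawn in the two bullets above.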
SS9.3 Inter-Kernel Gap Analysis
All inter-kernel gap comparisons (N=1 vs N=8, Ollama vs PyTorch) showed Cohen's d = 0 and NaN p-values. The zero variance means the nsys aggregated gap metric does not differentiate between conditions. Fine-grained gap distributions would require timeline-level analysis of the raw .nsys-rep files (outside scope of automated analysis).
SS10. Memory Bandwidth Analysis (H1)
SS10.1 Memory Operation Time
| Phase | Mean Mem Time (ms) | 95% CI | Std |
|---|---|---|---|
| Ollama N=1 | 177.3 | -- | -- |
| Ollama N=8 | 309.3 | -- | -- |
| PyTorch N=1 | 398.0 | -- | -- |
| PyTorch N=8 | 488.7 | -- | -- |
SS10.2 Statistical Tests
| Comparison | Delta (ms) | Delta % | Cohen's d | t-stat | p-value | Significant | M-W p |
|---|---|---|---|---|---|---|---|
| Ollama mem N=1->N=8 | +131.9 | +74.4% | 3.81 | -6.61 | 6.4x10^-5 | Yes | 0.002 |
| PyTorch mem N=1->N=8 | +90.8 | +22.8% | 0.37 | -0.64 | 0.54 | No | 0.18 |
SS10.3 Observations
Observation 1 -- Ollama memory time increases 74.4% at N=8 (p=6.4x10^-5, d=3.81). This is the strongest statistical signal among the hypothesis tests. The large effect size (d=3.81) indicates a massive shift in memory operation duration under concurrency. Both Welch's t-test (p=6.4x10^-5) and the Mann-Whitney U test (p=0.002) confirm the result. This is the only test that survives Holm correction (rank 1 of 6, threshold=0.0083).
Observation 2 -- PyTorch's memory time increase is non-significant (p=0.54, d=0.37). At first glance, this seems to contradict the bandwidth hypothesis. But the explanation is clear: PyTorch's baseline memory time is already 2.24x higher than Ollama's (398 vs 177 ms) because FP16 weights require 4x more memory operations. The absolute increase (+91 ms) is comparable to Ollama's (+132 ms), but the relative increase (22.8% vs 74.4%) is smaller because PyTorch starts from a higher base. The high variance in PyTorch memory time (from the large trace sizes and timeout issues) also inflates the standard error, reducing significance.
Observation 3 -- The asymmetry supports the bandwidth hypothesis. Ollama's sharp increase suggests its memory subsystem moves from comfortable (22% of peak at N=1) to stressed (undegraded N=8 demand would be >100% of peak). PyTorch's memory subsystem is already under more pressure at N=1 (29% of peak), so the additional N=8 stress causes a proportionally smaller relative change. This is what bandwidth saturation would be expected to look like: diminishing marginal stress increase as you approach the ceiling.
SS10.4 Bandwidth Demand Calculation
The RTX 4080 Laptop GPU has a peak memory bandwidth of ~432 GB/s (GDDR6, 192-bit bus).
Q4_0 LLaMA-1B (Ollama):
- Model weight size: 1.2B params x 0.5 bytes/param = 0.6 GB
- Per-token bandwidth: 0.6 GB x 1 read per token = 0.6 GB/token
- N=1 at 160 TPS: 0.6 x 160 = 96 GB/s (22% of peak)
- N=8 at 29 TPS per agent: 0.6 x 29 x 8 = 139 GB/s (32% of peak total)
- N=8 if no degradation (160x8): 0.6 x 160 x 8 = 768 GB/s (178% of peak -- impossible)
FP16 LLaMA-1B (PyTorch):
- Model weight size: 1.2B params x 2 bytes/param = 2.4 GB
- N=1 at 52 TPS: 2.4 x 52 = 125 GB/s (29% of peak)
- N=8 if no degradation (52x8): 2.4 x 52 x 8 = 998 GB/s (231% of peak -- impossible)
Q4_0 LLaMA-3B (Ollama):
- Model weight size: 3.2B x 0.5 = 1.6 GB
- N=1 at 96 TPS: 1.6 x 96 = 154 GB/s (36% of peak)
- N=8 if no degradation: 1.6 x 96 x 8 = 1,229 GB/s (284% of peak)
FP16 LLaMA-3B (PyTorch):
- Model weight size: 3.2B x 2 = 6.4 GB
- N=1 at 29 TPS: 6.4 x 29 = 186 GB/s (43% of peak)
- N=8 if no degradation: 6.4 x 29 x 8 = 1,485 GB/s (344% of peak)
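The four calculations above share one formula: weight size times tokens per second times number of agents, compared against peak bandwidth. A minimal sketch under the same assumption as the text (one full weight read per generated token, and nothing else counted); the function names are illustrative.

```python
PEAK_GB_S = 432.0  # peak memory bandwidth used in this section


def weight_gb(params_billion, bytes_per_param):
    """Model weight size in GB (1B params x 1 byte = 1 GB)."""
    return params_billion * bytes_per_param


def demand_gb_s(weights_gb, tps_per_agent, n_agents=1):
    """Weight traffic in GB/s, assuming one full weight read per generated token."""
    return weights_gb * tps_per_agent * n_agents


def pct_of_peak(gb_s, peak_gb_s=PEAK_GB_S):
    return 100.0 * gb_s / peak_gb_s


# (label, params in billions, bytes per param, N=1 TPS) as used in the text.
configs = [
    ("Q4_0 LLaMA-1B", 1.2, 0.5, 160),
    ("FP16 LLaMA-1B", 1.2, 2.0, 52),
    ("Q4_0 LLaMA-3B", 3.2, 0.5, 96),
    ("FP16 LLaMA-3B", 3.2, 2.0, 29),
]
for label, params, bpp, tps in configs:
    w = weight_gb(params, bpp)
    n1 = demand_gb_s(w, tps)
    n8 = demand_gb_s(w, tps, n_agents=8)  # hypothetical: no per-agent slowdown
    print(f"{label}: N=1 {n1:.0f} GB/s ({pct_of_peak(n1):.0f}%), "
          f"undegraded N=8 {n8:.0f} GB/s ({pct_of_peak(n8):.0f}%)")
```

Passing a measured N=8 per-agent TPS instead of the N=1 value gives the actual (degraded) traffic estimate, for example `demand_gb_s(0.6, 29, 8)` for Ollama LLaMA-1B.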
SS10.5 Interpretation
The bandwidth demand calculations reveal why degradation is so severe. At N=8, the theoretical bandwidth demand (without degradation) exceeds peak bandwidth by 78--244% across all 4 configurations. The GPU memory controller must serialize weight reads, creating the observed throughput collapse.
The actual N=8 bandwidth demand is lower because per-agent TPS drops. For Ollama 1B at N=8: 0.6 GB x 29 x 8 = 139 GB/s (32% of peak). This is feasible -- the memory controller achieves it by time-slicing weight reads across the 8 concurrent streams. But the time-slicing means each agent waits for the others, producing the ~82% per-agent degradation.
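Why Ollama degrades less than PyTorch: Q4_0 weights are 4x smaller, so each generated token costs 0.6 GB of weight reads instead of 2.4 GB (LLaMA-1B). Applying the SS10.4 formula to the measured N=8 rates, PyTorch 1B moves 2.4 GB x 7.18 x 8 = ~138 GB/s, about the same aggregate traffic as Ollama's 139 GB/s, but Ollama produces roughly 4x as many tokens from it. The lower per-token demand is consistent with Ollama's 4.3 percentage-point advantage in degradation ratio.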
SS11. Serialization Analysis (H2)
SS11.1 Max Concurrent Kernels
| Phase | Max Concurrent | Std | n | All Identical? |
|---|---|---|---|---|
| Ollama N=1 | 1 | 0 | 6 | Yes |
| Ollama N=8 | 1 | 0 | 6 | Yes |
| PyTorch N=1 | 1 | 0 | 6 | Yes |
| PyTorch N=8 | 1 | 0 | 6 | Yes |
SS11.2 Statistical Tests
All pairwise comparisons (Ollama N=1 vs N=8, PyTorch N=1 vs N=8, Ollama N=8 vs PyTorch N=8) return Cohen's d = 0 and NaN p-values. Zero variance in both groups makes parametric testing impossible -- which is itself the strongest possible finding.
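For reference, a maximum-concurrency figure of this kind can be derived from kernel start and end timestamps with a sweep over sorted events. The sketch below is one common way to do it and is an assumption about the metric's definition, not a description of how the report's pipeline extracted it; `max_concurrent` is an illustrative name.

```python
def max_concurrent(intervals):
    """Largest number of (start, end) intervals that overlap at any instant."""
    events = []
    for start, end in intervals:
        events.append((start, 1))    # kernel begins
        events.append((end, -1))     # kernel ends
    # At equal timestamps, process ends (-1) before starts (+1) so that
    # back-to-back kernels are not counted as overlapping.
    events.sort(key=lambda e: (e[0], e[1]))
    active = peak = 0
    for _, delta in events:
        active += delta
        peak = max(peak, active)
    return peak


# Back-to-back kernels never overlap; the second pair does.
print(max_concurrent([(0, 10), (10, 20), (20, 30)]))
print(max_concurrent([(0, 10), (5, 15)]))
```

A value of 1 on every run means no two kernel intervals ever overlapped in the trace, regardless of how many host threads were submitting work.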
SS11.3 Observations
Observation 1 -- Kernel serialization is universal and GPU-level. Every nsys run in all 4 conditions (n=6 per condition, SS11.1) shows max_concurrent_kernels = 1. This is not Ollama imposing serialization -- PyTorch Direct with 8 concurrent threads shows the same result. The CUDA runtime on a single consumer GPU (without NVIDIA MPS) processes kernels from different threads sequentially on the same SM array.
Observation 2 -- H2's original framing was wrong. H2 hypothesized: "Ollama serializes GPU requests even under concurrency." The evidence shows: all backends serialize, because the GPU hardware enforces it. The correct reframing: serialization exists (confirmed), but it is not Ollama-specific (not confirmed). The serialization is a property of the CUDA scheduling model on consumer GPUs, not a software deficiency.
Observation 3 -- This explains why vLLM/TGI scale better. If all backends face the same kernel serialization on a single GPU, why do vLLM/TGI achieve higher throughput at N=8 (TR130)? Because continuous batching reduces the number of kernel launches per total token. vLLM batches N sequences into a single kernel launch, reading model weights once for N tokens. Ollama reads weights once per token per request. The serialization constraint (max_concurrent=1) is the same, but vLLM does more useful work per kernel.
SS11.4 The Batching Insight
Consider LLaMA-1B generating 128 tokens for 8 agents:
- Ollama: 8 x 128 = 1,024 separate inference passes, each reading 0.6 GB of weights -> 614 GB total weight traffic
- vLLM (batched): ~128 batched inference passes, each reading 0.6 GB but producing 8 tokens -> 77 GB total weight traffic
- Traffic ratio: 614 / 77 = ~8x less weight traffic with continuous batching
This 8x bandwidth reduction explains vLLM's 2.25x throughput advantage at N=8 (TR130). The remaining gap (8x / 2.25x ~ 3.6x) reflects overhead from variable sequence lengths, attention mask computation, and KV-cache management in batched execution.
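The arithmetic behind the comparison can be written as a small function. This is an idealized sketch: it assumes all sequences have the same length, stay in lockstep, and that a batched pass reads the weights exactly once. `weight_traffic_gb` and its parameters are illustrative names.

```python
import math


def weight_traffic_gb(weights_gb, tokens_per_agent, n_agents, batch_size):
    """Total weight bytes read (GB) to generate tokens_per_agent tokens per agent.

    Each forward pass reads the full weights once and yields one token for
    every sequence in the batch.
    """
    passes_per_token_step = math.ceil(n_agents / batch_size)
    return weights_gb * tokens_per_agent * passes_per_token_step


unbatched = weight_traffic_gb(0.6, tokens_per_agent=128, n_agents=8, batch_size=1)
batched = weight_traffic_gb(0.6, tokens_per_agent=128, n_agents=8, batch_size=8)
print(f"unbatched: {unbatched:.0f} GB, batched: {batched:.0f} GB, "
      f"ratio: {unbatched / batched:.1f}x")
```

Intermediate batch sizes interpolate between the two cases; for example, `batch_size=4` halves the batched advantage because each token step then needs two passes.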
SS12. Context Switch Analysis (H3)
SS12.1 Results
The nsys --gpuctxsw=true flag, which captures CUDA GPU context switches, requires administrator privileges on Windows WDDM drivers. This flag was disabled in our configuration to avoid profiling failures. Inter-kernel gap counts were used as a proxy metric.
All inter-kernel gap comparisons showed:
- Cohen's d = 0 for both Ollama and PyTorch N=1 vs N=8
- NaN p-values (zero variance in both groups)
- No measurable difference in any gap metric
SS12.2 Observations
Observation 1 -- No evidence of context switching overhead. The proxy metrics show no difference between N=1 and N=8. With max_concurrent_kernels = 1 and sequential execution, context switches between threads happen at the CUDA driver level with overhead below the nsys resolution (~100 ns). The driver multiplexes GPU access transparently.
Observation 2 -- This is expected for WDDM consumer GPUs. Unlike compute-only driver setups (Linux, or TCC mode on Windows) that can run multiple CUDA contexts simultaneously, WDDM serializes all GPU access through the Windows display driver. Context switches are handled by the WDDM scheduler, which preempts at kernel boundaries with minimal overhead. The single-GPU, single-context execution model on WDDM means there are no measurable context switch costs -- the driver simply queues kernels from all threads and dispatches them sequentially.
SS12.3 Verdict
H3 REJECTED. No evidence of CUDA context switching overhead at N=8. The GPU processes kernels sequentially from a single queue regardless of the number of requesting threads, and the WDDM scheduler handles multiplexing with negligible overhead.
SS13. Memory Allocation Analysis (H5)
SS13.1 Results
Memory allocation count comparisons between N=1 and N=8 showed:
- Cohen's d = 0 for both backends
- p-value = 1.0 (no difference detectable)
- Zero variance in allocation counts within each condition
SS13.2 Observations
Observation 1 -- No evidence of KV-cache memory pressure. Memory allocation patterns are identical at N=1 and N=8. For Ollama, this is expected: ggml pre-allocates KV-cache memory at model load time, not per-request. For PyTorch, the HuggingFace generate() function manages KV-cache internally with a fixed allocation pattern per sequence.
Observation 2 -- ncu SM occupancy was null, preventing direct H5 confirmation. Nsight Compute returned null values for SM occupancy and compute throughput metrics on Windows WDDM (see SS14). Without SM occupancy data, we cannot measure whether KV-cache expansion at N=8 reduces the GPU's ability to schedule warps. The H5 rejection is based on the absence of memory allocation changes, not on direct occupancy measurement.
SS13.3 Verdict
H5 REJECTED with low confidence. The evidence is absence-of-change rather than measured-no-effect. Future work on a non-WDDM setup (Linux, or a TCC-mode GPU) should re-test with ncu SM occupancy data.
SS14. Phase 6 -- Nsight Compute Targeted Profiling
SS14.1 Results
| Model | Wall Time (s) | Kernel Launches | Kernels Captured | SM Occupancy | DRAM Throughput | Compute Throughput |
|---|---|---|---|---|---|---|
| LLaMA-1B | 113.2 | 5 | 2 | null | null | null |
| LLaMA-3B | 251.3 | 5 | 2 | null | null | null |
SS14.2 Observations
Observation 1 -- ncu captured kernel launches but returned null metrics. Both models show 2 captured kernels (out of 5 launches), but SM occupancy, DRAM throughput, and compute throughput are all null. The kernel names were recorded as "unknown" -- the ncu CSV parser could not match kernel names from the output format.
Observation 2 -- WDDM is the likely cause. Nsight Compute on Windows WDDM has known limitations for hardware counter collection. The WDDM driver intercepts GPU access for display compositing, preventing ncu from getting exclusive hardware counter access. On Linux, or on Windows GPUs running in TCC (Tesla Compute Cluster) mode, ncu has direct access to performance counters and can measure SM occupancy, DRAM throughput, and compute utilization accurately.
Observation 3 -- The 2-kernel capture suggests incomplete profiling. ncu should capture all 5 kernel launches (configured via --kernel-launch-count=5), but only 2 were recorded. This may be due to kernel replay failures on WDDM -- ncu replays each kernel multiple times to collect different counter sets, and the WDDM scheduler may interfere with replay.
Observation 4 -- This is the primary limitation of TR131. Direct DRAM throughput measurement would conclusively confirm or refute H1 (bandwidth saturation). Without it, H1 relies on memory operation time (SS10) and bandwidth demand calculations (SS10.4). Future work should re-run Phase 6 on Linux, where the WDDM limitation does not apply.
SS15. Hypothesis Verdicts and Degradation Attribution
SS15.1 Evidence Matrix
| Hypothesis | Test | p-value | Cohen's d | Effect | Supports H? |
|---|---|---|---|---|---|
| H1 | Ollama mem time N=1->N=8 | 6.4x10^-5 | 3.81 | large | Yes |
| H1 | ncu DRAM throughput | null | -- | -- | No data |
| H2 | Ollama max concurrent N=1->N=8 | NaN | 0 | negligible | No (zero variance) |
| H2 | Ollama vs PyTorch N=8 concurrency | NaN | 0 | negligible | No (zero variance) |
| H3 | Gap count N=1->N=8 | NaN | 0 | negligible | No |
| H3 | Mean gap N=1->N=8 | NaN | 0 | negligible | No |
| H4 | OS runtime | -- | -- | -- | No data |
| H5 | Alloc count N=1->N=8 | 1.0 | 0 | negligible | No |
SS15.2 Verdicts
H1: GPU Memory Bandwidth Saturation -- PARTIALLY CONFIRMED (Confidence: HIGH)
Memory operation time increases 74.4% from N=1 to N=8 (p=6.4x10^-5, d=3.81), surviving Holm correction. Bandwidth demand calculations show that undegraded N=8 demand would exceed peak RTX 4080 Laptop bandwidth by 78--244%. However, direct DRAM throughput measurement was not possible (ncu null on WDDM). "Partially confirmed" because the statistical evidence is strong but indirect -- we measure the consequence (increased memory time) rather than the mechanism (DRAM utilization percentage).
H2: Ollama Request Serialization -- REATTRIBUTED TO GPU HARDWARE (Confidence: HIGH)
Serialization exists (max_concurrent = 1) but occurs equally in Ollama and PyTorch Direct. Cohen's d = 0 for all cross-backend comparisons. The serialization is a fundamental property of single-GPU CUDA execution on consumer hardware, not an Ollama scheduling deficiency. vLLM/TGI avoid the throughput consequences of serialization through continuous batching (more work per kernel), not by achieving kernel concurrency.
H3: CUDA Context Switching -- REJECTED (Confidence: MEDIUM)
Zero variance in gap metrics across all conditions. No evidence of context switching overhead. Limited by inability to use --gpuctxsw=true on WDDM, hence "medium" confidence rather than "high."
H4: CPU Thread Scheduling -- INSUFFICIENT DATA (Confidence: LOW)
OS runtime summary data was not extracted for PyTorch N=8 runs due to stats extraction timeouts on large traces. Cannot evaluate CPU-side bottlenecks.
H5: KV-Cache Memory Pressure -- REJECTED (Confidence: LOW)
Memory allocation counts unchanged (p=1.0, d=0). No evidence of increased memory pressure. Low confidence because ncu SM occupancy data was null, preventing direct measurement of warp scheduling effects.
SS15.3 Holm Step-Down Correction
| Rank | Test | p-value | Holm Threshold (alpha/(k-i+1)) | Significant After Correction |
|---|---|---|---|---|
| 1 | H1: Ollama mem time | 6.4x10^-5 | 0.05/6 = 0.0083 | Yes (6.4x10^-5 < 0.0083) |
| 2 | H2: Concurrency comparison | NaN | 0.05/5 = 0.0100 | No |
| 3 | H2: Ollama max concurrent | NaN | 0.05/4 = 0.0125 | No |
| 4 | H3: Gap count | NaN | 0.05/3 = 0.0167 | No |
| 5 | H3: Mean gap | NaN | 0.05/2 = 0.0250 | No |
| 6 | H5: Alloc count | 1.0 | 0.05/1 = 0.0500 | No |
After Holm correction for 6 simultaneous tests, only H1 (memory bandwidth) remains significant. The NaN p-values for H2/H3 tests reflect zero variance in the underlying metrics -- the tests are mathematically undefined because there is no within-group variation to estimate standard error. This is itself informative: the lack of variation means the GPU imposes uniform behavior regardless of concurrency level or backend.
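The step-down procedure in the table can be sketched as follows. One assumption is labeled explicitly: NaN p-values are sorted last and never rejected, which differs from the table's ranking of the NaN rows but gives the same outcome, since the procedure stops at the first non-rejection. `holm_step_down` is an illustrative name.

```python
import math


def holm_step_down(p_values, alpha=0.05):
    """Return one boolean per input p-value: True if rejected after Holm correction."""
    k = len(p_values)

    def sort_key(i):
        p = p_values[i]
        # Assumption: undefined (NaN) tests sort after all defined p-values.
        return (1, 0.0) if math.isnan(p) else (0, p)

    order = sorted(range(k), key=sort_key)
    rejected = [False] * k
    for rank, idx in enumerate(order, start=1):
        threshold = alpha / (k - rank + 1)
        p = p_values[idx]
        if math.isnan(p) or p > threshold:
            break  # step-down: stop at the first test that fails its threshold
        rejected[idx] = True
    return rejected


# p-values in the order of the table above (H1, H2, H2, H3, H3, H5).
p_vals = [6.4e-5, float("nan"), float("nan"), float("nan"), float("nan"), 1.0]
print(holm_step_down(p_vals))
```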
SS15.4 Degradation Attribution Table
| Source | Attribution | Evidence |
|---|---|---|
| GPU memory bandwidth physics | 86.4% | PyTorch Direct baseline degradation |
| Ollama serving stack overhead | -4.3% | Ollama degrades less (Q4_0 bandwidth advantage) |
| Net observed (Ollama N=1->N=8) | 82.1% | Aggregate across 2 models, 3 reps each |
The negative attribution to Ollama's serving stack means that Ollama's Q4_0 quantization provides a net benefit under bandwidth-limited concurrency. The serving stack is not a bottleneck -- it is a slight advantage, because Q4_0 requires less bandwidth per parameter, leaving more headroom for concurrent weight reads.
SS16. Statistical Power and Data Quality
SS16.1 Power Analysis
| Phase | n per group | Min Detectable d | Interpretation | Adequate for Observed d? |
|---|---|---|---|---|
| Ollama (2 models x 3 reps) | 6 | 1.286 | Can detect large effects | Yes (observed d=4.19) |
| PyTorch (2 models x 3 reps) | 5 | 1.458 | Can detect large effects | Yes (observed d=4.17) |
| Per-model Ollama | 3 | 2.484 | Can detect very large effects | Yes (observed d>80) |
| Per-model PyTorch 1B | 3 | 2.484 | Can detect very large effects | Yes (observed d=81) |
| Per-model PyTorch 3B | 2 | 6.314 | Can detect massive effects | Yes (observed d=298) |
SS16.2 Observations
Observation 1 -- All observed effects far exceed minimum detectable sizes. The weakest power (n=2, d_min=6.31 for PyTorch LLaMA-3B) still detects the massive d=298 effect. The primary comparison (overall degradation, n=5--6, d_min=1.29--1.46) easily detects the observed d=4.17--4.19. Power is not a concern for this experiment.
Observation 2 -- Small sample sizes produce wide CIs but don't affect significance. The wide 95% CIs for aggregate stats (e.g., Ollama N=1: 128.5 [91.7, 165.2]) reflect pooling 2 models with very different TPS values (160 vs 96). Within-model CIs are much tighter (LLaMA-1B N=1: 160.44 [159.29, 161.60], width = 2.3 TPS). The wide aggregate CIs do not undermine the degradation tests, which compare matched conditions.
Observation 3 -- Zero outliers detected across all conditions. IQR outlier detection (Tukey fence, 1.5x IQR) found zero outliers in any group. The data is remarkably clean, consistent with the deterministic nature of GPU inference (no network jitter, no disk I/O, no thermal throttling during short profiles).
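A Tukey-fence check of the kind described in Observation 3 can be sketched as below. The quartile convention is an assumption (linear interpolation between order statistics); the report does not say which one it used, and with n=2 or n=3 per group the choice of convention matters. Function names are illustrative.

```python
def quantile(sorted_vals, q):
    """Quantile by linear interpolation between order statistics (0 <= q <= 1)."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1.0 - frac) + sorted_vals[hi] * frac


def tukey_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    data = sorted(values)
    q1, q3 = quantile(data, 0.25), quantile(data, 0.75)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# PyTorch LLaMA-1B N=1 TPS (SS6.1): no value falls outside the fences.
print(tukey_outliers([52.46, 52.46, 51.13]))
# A contrived sample with one obvious outlier, for comparison.
print(tukey_outliers([1, 2, 3, 4, 100]))
```

With groups this small the fences are a weak filter, so a zero-outlier result mainly confirms that no single rep is wildly off rather than certifying the distribution.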
SS16.3 Data Quality Summary
| Metric | Value |
|---|---|
| Total profiled runs | 26 (+ 1 validation) |
| Runs with 0 requests (excluded) | 2 (both PyTorch 3B rep0) |
| Analyzable runs | 24 |
| Outlier rate | 0.0% (zero outliers detected) |
| Missing trace data | 3 runs (PyTorch N=8 3B -- stats timeout) |
| Significant comparisons | 9/18 major tests (50%) |
| Tests surviving Holm correction | 1/6 hypothesis tests (H1 only) |
SS17. Profiling Overhead Assessment
SS17.1 Cross-Validation with TR129
| Model | TR129 Unprofiled TPS | TR131 Profiled TPS (N=1) | Overhead |
|---|---|---|---|
| LLaMA-1B (Ollama) | ~160 | 160.44 | <1% |
| LLaMA-3B (Ollama) | ~97 | 96.48 | <1% |
SS17.2 Observations
Observation 1 -- nsys profiling overhead is negligible for Ollama. The profiled TPS matches TR129's unprofiled data within measurement noise. This is expected because nsys uses hardware-based instrumentation (GPU performance counters) rather than software instrumentation, and the HTTP request driver runs outside the nsys process tree.
Observation 2 -- PyTorch profiling overhead cannot be independently validated. PyTorch Direct was not benchmarked unprofiled in prior TRs. However, nsys overhead is symmetric across N=1 and N=8 conditions, so the degradation ratio is unaffected even if absolute TPS is slightly depressed. The attribution analysis uses ratios, not absolute values.
Observation 3 -- Trace file sizes suggest higher overhead for PyTorch. PyTorch traces (42--272 MB) are 50--340x larger than Ollama traces (0.8--4.1 MB), reflecting a kernel launch count that is ~345x higher. While nsys's per-kernel instrumentation overhead is small (~10 ns per kernel), at 3--5 million kernels this accumulates to 30--50 ms total -- still <0.5% of the 10--80 second GPU time. The overhead is negligible even for PyTorch.
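The overhead estimate in Observation 3 is a product of kernel count and per-kernel cost. A short sketch of that arithmetic, taking the ~10 ns per-kernel figure from the text as given rather than as a measured value (`tracing_overhead` is an illustrative name):

```python
def tracing_overhead(kernel_count, per_kernel_ns, gpu_time_ms):
    """Return (overhead in ms, overhead as % of GPU time) for a traced run."""
    overhead_ms = kernel_count * per_kernel_ns / 1e6
    return overhead_ms, 100.0 * overhead_ms / gpu_time_ms


# Largest trace in the data set: PyTorch LLaMA-3B N=8 rep 2 (SS6.2).
overhead_ms, overhead_pct = tracing_overhead(5_274_779, 10, 82_606)
print(f"{overhead_ms:.1f} ms, {overhead_pct:.3f}% of GPU time")
```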
SS18. Limitations and Future Work
SS18.1 What This Report Does NOT Prove
- We did not measure peak DRAM throughput directly. ncu returned null metrics on Windows WDDM, preventing direct confirmation that bandwidth exceeds 80% of peak at N=8. The bandwidth argument relies on calculation (SS10.4), not measurement. The calculation assumes 1 full weight read per token, which is correct for autoregressive decode but may overestimate prefill bandwidth (where computation is more memory-efficient due to batched token processing).
- FP16 vs Q4_0 confound persists. Ollama serves Q4_0 (0.5 bytes/parameter); PyTorch serves FP16 (2 bytes/parameter). This affects absolute TPS but not the degradation ratio used for attribution. However, Q4_0 and FP16 may have different memory access patterns (quantized reads require dequantization logic, which could affect cache line utilization), potentially introducing a subtle confound in the bandwidth comparison.
- N=8 threads may not achieve true GPU concurrency. Python's GIL is released for CUDA operations, and our ThreadPoolExecutor dispatches 8 workers. But the CUDA context serializes kernel execution regardless (max_concurrent=1). Whether the threads achieve true memory-level concurrency (8 concurrent weight reads) or whether the memory controller serializes reads too is unclear from nsys data alone.
- Only consumer GPU tested. The RTX 4080 Laptop with GDDR6 and WDDM driver is not representative of server deployments (A100/H100 with HBM2e/HBM3 and TCC driver). Server GPUs have 3--5x higher memory bandwidth, which may reduce the bandwidth saturation severity. However, the same mechanism (N concurrent weight reads exceeding bandwidth) would still apply at larger N.
- OS runtime data was incomplete. Large PyTorch N=8 traces (>270 MB, ~5M kernels) caused nsys stats extraction to exceed the 120-second timeout for cuda_kern_exec_trace and osrt_sum reports. This prevented evaluation of H4 (CPU scheduling) and reduced per-model GPU utilization data for PyTorch 3B N=8.
- Only 2 models tested. Both are decoder-only LLaMA 3.2 variants. MoE models (Mixtral), encoder-decoder models (T5), and models with different attention mechanisms (GQA vs MHA) may show different degradation patterns. The 1B and 3B sizes are also relatively small -- larger models (8B, 70B) would change the compute/memory ratio.
SS18.2 Threats to Validity
| Threat | Type | Severity | Mitigation | Residual Risk |
|---|---|---|---|---|
| Profiling overhead distorts TPS | Internal | Low | Ollama TPS matches TR129 unprofiled (<1% delta) | PyTorch overhead unknown |
| Small sample size (n=2--3 per model) | Internal | Medium | Power analysis: d_min=1.29. All observed d>4 | Tight per-model CIs but adequate |
| WDDM vs TCC driver differences | External | Medium | Results are conservative (WDDM adds overhead) | Linux replication needed |
| Q4_0 vs FP16 confound | Construct | Medium | Focus on degradation ratios, not absolute TPS | Memory access pattern differences |
| Missing data (2 runs, 0 requests) | Internal | Low | Excluded; remaining n>=2 still significant | Slightly reduced power for 3B |
| PyTorch 3B stats timeout | Internal | Low | Primary TPS data unaffected | Missing trace-level metrics |
| Thermal drift over 71 min | Internal | Low | 3s cooldown between runs; GPU at <70degC | Negligible on short profiles |
SS18.3 Future Work
- Linux profiling without WDDM. Run the identical experiment on Linux, where the driver has no WDDM layer, to enable ncu hardware counter collection. Prediction: ncu will show DRAM throughput >80% of peak at N=8, directly confirming H_1. SM occupancy should show warp scheduling saturation.
- vLLM kernel profiling. Profile vLLM under nsys to understand how continuous batching reduces bandwidth demand at N=8. Prediction: vLLM launches fewer, larger kernels with multiple sequences batched per launch, reading model weights once for N tokens instead of once per token.
- Multi-GPU N=8 test. Split 8 agents across 2 GPUs (4 each) and measure per-agent degradation. Prediction: per-agent degradation drops from ~82% to ~50% because each GPU handles half the bandwidth demand. This would directly confirm that bandwidth is the binding constraint.
- Same-quantization comparison. Run PyTorch with Q4_0 (via GPTQ/bitsandbytes) to eliminate the Q4_0/FP16 confound. Prediction: PyTorch-Q4_0 will degrade ~82% (matching Ollama), confirming that the 4.3 pp gap is quantization-driven, not serving-stack-driven.
- CUDA MPS testing. Enable NVIDIA Multi-Process Service to allow concurrent kernel execution from 8 processes. Prediction: marginal improvement (<5%) because bandwidth, not compute, is the bottleneck. MPS allows kernel overlap but does not increase memory bandwidth.
- Continuous batching prototype. Implement batched inference in the PyTorch Direct code path (batch 8 sequences per forward pass) and re-measure N=8 degradation. Prediction: degradation drops from 86.4% to ~40--50%, matching vLLM's behavior from TR130, because batching amortizes the weight-read bandwidth across sequences.
SS19. Conclusions
SS19.1 Answers to Research Questions
Q1: Does GPU memory bandwidth saturate under N=8 concurrency?
Partially yes -- strong evidence for bandwidth stress, but direct measurement was unavailable. Memory operation time increases 74.4% at N=8 (p=6.4x10^-5, d=3.81), the only test surviving Holm correction across 6 hypothesis tests. Bandwidth demand calculations show that sustaining N=1 per-agent throughput across 8 agents would require 178--344% of peak RTX 4080 bandwidth, depending on model and precision. Because that demand cannot be met, the GPU memory controller must serialize weight reads, which explains the per-agent throughput collapse. Direct DRAM utilization percentage was not measurable (ncu null on WDDM), so "saturation" is inferred rather than directly observed. The evidence is strong but indirect.
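The 178--344% range can be reproduced with a short calculation. This sketch assumes, as the endpoints suggest, that demand is weight size times N=1 tokens per second times 8 agents, with one full weight read per decoded token. Parameter counts come from Appendix A, N=1 TPS values from SS19.3, and peak bandwidth from Appendix B; the function names are illustrative and not taken from the report's code.

```python
PEAK_BANDWIDTH_GBPS = 432.0  # RTX 4080 Laptop peak, from Appendix B


def weight_size_gb(params_millions: float, bytes_per_param: float) -> float:
    """Model weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_millions * 1e6 * bytes_per_param / 1e9


def demand_pct_of_peak(params_millions: float, bytes_per_param: float,
                       tps_n1: float, n_agents: int = 8,
                       peak_gbps: float = PEAK_BANDWIDTH_GBPS) -> float:
    """Bandwidth needed, as % of peak, if every agent kept its N=1 token rate."""
    demand_gbps = weight_size_gb(params_millions, bytes_per_param) * tps_n1 * n_agents
    return 100.0 * demand_gbps / peak_gbps


# (label, params in millions, bytes per parameter, per-agent TPS at N=1)
cases = [
    ("Q4_0 LLaMA-1B", 1200, 0.5, 160),
    ("Q4_0 LLaMA-3B", 3200, 0.5, 96),
    ("FP16 LLaMA-1B", 1200, 2.0, 52),
    ("FP16 LLaMA-3B", 3200, 2.0, 29),
]
for label, params_m, bpp, tps in cases:
    print(f"{label}: {demand_pct_of_peak(params_m, bpp, tps):.0f}% of peak")
```

The smallest and largest results correspond to the Q4_0 1B and FP16 3B configurations.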
Q2: Does Ollama serialize GPU kernel execution compared to direct PyTorch?
No -- serialization is universal. Max concurrent kernels = 1 in all 26 runs across both backends. Cohen's d = 0 for every cross-backend comparison. Both Ollama and PyTorch Direct face the same GPU-level kernel serialization. The CUDA runtime on a single consumer GPU dispatches kernels sequentially from a single hardware queue regardless of how many software threads submit them. Ollama does not add serialization beyond what the GPU hardware imposes.
Q3: Do CUDA context switches increase measurably at N=8?
No. Inter-kernel gap metrics show zero variance between N=1 and N=8 across both backends. The WDDM driver multiplexes GPU access at kernel boundaries with overhead below nsys resolution. Context switching is not a contributor to multi-agent degradation.
Q4: Is GPU-level degradation intrinsic (hardware) or extrinsic (software)?
Intrinsic. PyTorch Direct eliminates the entire serving stack -- no HTTP server, no Go runtime, no request queuing, no Ollama process -- and degrades 86.4% vs Ollama's 82.1%. The serving stack is not the cause. The degradation is a fundamental property of running 8 concurrent inference streams on a single GPU with finite memory bandwidth.
Q5: What fraction of the 82% degradation is attributable to Ollama's serving stack vs GPU physics?
GPU physics: 86.4%. Ollama serving stack: -4.3 percentage points (82.1% - 86.4%). The serving stack attribution is negative -- Ollama's Q4_0 quantization provides a net benefit under bandwidth-limited concurrency. The 82.1% net degradation is entirely explained by GPU memory bandwidth contention, with Q4_0 quantization providing a 4.3 percentage-point reduction by lowering per-request bandwidth demand.
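The attribution is a subtraction of two degradation percentages. A minimal sketch, assuming degradation is defined as the relative drop in per-agent TPS from N=1 to N=8; the function names are illustrative.

```python
def degradation_pct(tps_n1: float, tps_n8: float) -> float:
    """Per-agent throughput loss from N=1 to N=8, in percent."""
    return 100.0 * (1.0 - tps_n8 / tps_n1)


def serving_stack_attribution_pp(stack_degradation_pct: float,
                                 direct_degradation_pct: float) -> float:
    """Percentage points of degradation added by the serving stack.

    The direct (no serving stack) degradation is treated as the GPU-only
    baseline; a negative result means the full stack degrades less than
    the baseline.
    """
    return stack_degradation_pct - direct_degradation_pct


# Degradation figures reported for Ollama and PyTorch Direct.
attribution = serving_stack_attribution_pp(82.1, 86.4)
print(f"serving stack attribution: {attribution:+.1f} pp")
```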
SS19.2 The Central Finding
TR129 asked: what causes the 63% per-agent degradation? (Later measured as 82% in profiled conditions with 128 tokens.) TR130 answered: the serving stack. TR131 overturns this: GPU memory physics.
The 82% per-agent throughput degradation under N=8 concurrency is a GPU memory bandwidth phenomenon, not a serving stack scheduling deficiency. The elimination test shows this: PyTorch Direct, with zero serving stack overhead, degrades more than Ollama (86.4% vs 82.1%). The only hypothesis test surviving multiple comparison correction is H_1 (memory bandwidth), and bandwidth demand calculations show that 8 agents running at their N=1 rates would exceed peak RTX 4080 bandwidth by 78--244%.
TR130's conclusion was correct in its recommendation but wrong in its mechanism. vLLM > TGI > Ollama for multi-agent scaling -- this ranking holds. But the reason is not "Ollama's scheduling is bad." The reason is continuous batching amortizes memory bandwidth by reading model weights once for multiple sequences per kernel launch. Ollama reads weights once per token per request. The 8x bandwidth reduction from batching (SS11.4) explains vLLM's 2.25x total throughput advantage at N=8.
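The amortization argument can be made concrete with a toy model. This sketch assumes one forward pass streams the full weights once and emits one token per batched sequence. It ignores KV-cache and activation traffic, so it is an idealized upper bound on the saving, not a model of vLLM's actual behavior. The function name is illustrative.

```python
def weight_read_gb_per_token(weight_gb: float, sequences_per_pass: int) -> float:
    """Weight bytes (GB) streamed from DRAM per generated token.

    Idealized: one forward pass reads all weights once and produces one
    token for each sequence in the batch.
    """
    if sequences_per_pass < 1:
        raise ValueError("sequences_per_pass must be >= 1")
    return weight_gb / sequences_per_pass


weight_gb = 0.6  # Q4_0 LLaMA-1B: 1.2B parameters x 0.5 bytes/parameter
unbatched = weight_read_gb_per_token(weight_gb, 1)  # one request per pass
batched = weight_read_gb_per_token(weight_gb, 8)    # 8 sequences per pass
print(f"reduction factor: {unbatched / batched:.1f}x")
```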
SS19.3 One-Number Summaries
For capacity planning -- Bandwidth Demand per Configuration:
| Configuration | Per-Agent TPS (N=1) | Per-Agent TPS (N=8) | Degradation | N=8 Total BW Demand (at measured N=8 TPS) |
|---|---|---|---|---|
| Q4_0 LLaMA-1B | 160 | 29 | -82% | 139 GB/s (32% peak) |
| Q4_0 LLaMA-3B | 96 | 17 | -82% | 218 GB/s (50% peak) |
| FP16 LLaMA-1B | 52 | 7.2 | -86% | 138 GB/s (32% peak) |
| FP16 LLaMA-3B | 29 | 3.8 | -87% | 195 GB/s (45% peak) |
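The degradation and bandwidth columns can be recomputed from the TPS columns. The sketch below applies the glossary definition of bandwidth demand (weight size x tokens per second x concurrent agents) at the measured N=8 rates, with parameter counts from Appendix A and the ~432 GB/s peak from Appendix B. The names are illustrative.

```python
PEAK_BANDWIDTH_GBPS = 432.0  # from Appendix B

# (label, params in millions, bytes per parameter, TPS at N=1, TPS at N=8)
CONFIGS = [
    ("Q4_0 LLaMA-1B", 1200, 0.5, 160.0, 29.0),
    ("Q4_0 LLaMA-3B", 3200, 0.5, 96.0, 17.0),
    ("FP16 LLaMA-1B", 1200, 2.0, 52.0, 7.2),
    ("FP16 LLaMA-3B", 3200, 2.0, 29.0, 3.8),
]


def degradation_pct(tps_n1: float, tps_n8: float) -> float:
    """Per-agent throughput loss from N=1 to N=8, in percent."""
    return 100.0 * (1.0 - tps_n8 / tps_n1)


def total_bandwidth_gbps(params_millions: float, bytes_per_param: float,
                         per_agent_tps: float, n_agents: int) -> float:
    """Weight-read bandwidth in GB/s, assuming one full weight read per token."""
    weight_gb = params_millions * 1e6 * bytes_per_param / 1e9
    return weight_gb * per_agent_tps * n_agents


for label, params_m, bpp, tps1, tps8 in CONFIGS:
    bw = total_bandwidth_gbps(params_m, bpp, tps8, n_agents=8)
    print(f"{label}: -{degradation_pct(tps1, tps8):.0f}%, "
          f"{bw:.0f} GB/s ({100 * bw / PEAK_BANDWIDTH_GBPS:.0f}% peak)")
```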
For backend selection -- Decision Tree:
| If you need... | Then... | Why |
|---|---|---|
| >30 TPS per agent with 8 agents | Multiple GPUs | Bandwidth-bound on single GPU at N>=4 |
| Best single-GPU multi-agent throughput | vLLM with Q4 quantization | Continuous batching + low bandwidth demand |
| Lowest per-agent latency at N=1 | Ollama with Q4_0 | Highest single-agent TPS (160 tok/s) |
| Better than 82% per-agent degradation | Continuous batching server | vLLM/TGI amortize bandwidth across sequences |
| Root-cause diagnosis for your workload | Profile with nsys first | Intuitive attribution can be wrong (this report demonstrates it) |
SS19.4 What Changes for the Banterhearts Research Program
- TR130's recommendation stands, but the reasoning changes. vLLM > Ollama for multi-agent deployment. The advantage is not "better scheduling" but "continuous batching reduces bandwidth demand per token." This distinction matters because it suggests that Ollama could achieve similar scaling by implementing request batching -- the GPU hardware is not the limit; the software's bandwidth efficiency is.
- TR129's Amdahl serial fraction is reinterpreted. The s=0.39--0.54 serial fraction measured for Ollama reflects GPU memory bandwidth serialization, not Ollama scheduling serialization. The same serial fraction would apply to any sequential-serving backend on the same GPU. vLLM/TGI avoid this by batching, not by scheduling.
- Quantization is a concurrency optimization, not just a compression technique. Q4_0's 4x smaller weights provide 4x less bandwidth demand per token. At N=1 this translates to ~3x faster inference; at N=8 it translates to 3.9x faster. Quantization should be viewed as a bandwidth efficiency technique, especially for multi-agent workloads.
- Future multi-agent experiments should profile GPU bandwidth. The intuition that "better software = better scaling" led TR130 to a correct recommendation with an incorrect mechanism. Hardware profiling (nsys/ncu) should accompany any multi-agent benchmark to distinguish software bottlenecks from hardware physics.
Appendix A: Configuration
experiment: tr131
models:
  - name: llama3.2-1b
    ollama_tag: "llama3.2:1b"
    hf_id: "unsloth/Llama-3.2-1B-Instruct"
    params_m: 1200
  - name: llama3.2-3b
    ollama_tag: "llama3.2:3b"
    hf_id: "unsloth/Llama-3.2-3B-Instruct"
    params_m: 3200
nsys:
  trace: cuda
  gpuctxsw: false            # Requires admin on Windows WDDM
  gpu_metrics_set: ""        # Requires admin on Windows WDDM
  gpu_metrics_frequency: 0
  sample: none               # No CPU sampling (reduces trace size)
  cpuctxsw: none             # No CPU context switches (reduces trace size)
max_new_tokens: 128
seed: 42
warmup_requests: 3
prompt_tokens_low: 100
prompt_tokens_high: 200
phase1:  # Validation
  requests: 3
  profile_duration_s: 30
phase2:  # Ollama N=1
  n_agents: 1
  requests_per_agent: 5
  repetitions: 3
  profile_duration_s: 30
phase3:  # Ollama N=8
  n_agents: 8
  requests_per_agent: 3
  repetitions: 3
  profile_duration_s: 60
phase4:  # PyTorch N=1
  n_threads: 1
  requests_per_thread: 5
  repetitions: 3
  profile_duration_s: 60
phase5:  # PyTorch N=8
  n_threads: 8
  requests_per_thread: 3
  repetitions: 3
  profile_duration_s: 120
phase6:  # Nsight Compute
  kernel_launch_count: 5
Appendix B: Environment
| Component | Version / Specification |
|---|---|
| GPU | NVIDIA GeForce RTX 4080 Laptop GPU |
| VRAM | 12,282 MB GDDR6 |
| Peak Memory Bandwidth | ~432 GB/s |
| Bus Width | 192-bit |
| Driver | 591.74 |
| OS | Windows 11 10.0.26200 (WDDM) |
| Python | 3.13.1 |
| Nsight Systems | 2025.5.1.121-255136380782v0 |
| Nsight Compute | 2025.3.1.0 (build 36398880) |
| CUDA | Via driver (no standalone CUDA toolkit required) |
| Ollama | Latest (Q4_0 quantization for llama3.2 models) |
| PyTorch | Via HuggingFace Transformers (torch.float16, CUDA) |
| Architecture | AMD64 |
Appendix C: Statistical Methods
Welch's t-test
Used for all pairwise comparisons. Does not assume equal variance or equal sample sizes between groups. Degrees of freedom estimated via Welch-Satterthwaite approximation: df ~ (s1^2/n1 + s2^2/n2)^2 / [(s1^2/n1)^2/(n1-1) + (s2^2/n2)^2/(n2-1)]. When both groups have zero variance (e.g., max_concurrent = 1 everywhere), the test returns NaN -- which is itself informative.
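As a worked illustration of the formula above, here is a minimal standard-library sketch of the Welch statistic and its Welch-Satterthwaite degrees of freedom. The sample values are hypothetical, the function name is illustrative, and the NaN return for two zero-variance groups mirrors the behavior described in the text rather than any particular library.

```python
import math
from statistics import mean, variance


def welch_t(a: list[float], b: list[float]) -> tuple[float, float]:
    """Return (t statistic, Welch-Satterthwaite df) for two independent samples."""
    n1, n2 = len(a), len(b)
    v1, v2 = variance(a), variance(b)  # sample variances (n-1 denominator)
    se_sq = v1 / n1 + v2 / n2          # squared standard error of the difference
    if se_sq == 0:
        # Both groups constant: the statistic is undefined.
        return math.nan, math.nan
    t = (mean(a) - mean(b)) / math.sqrt(se_sq)
    df = se_sq ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df


# Hypothetical samples with unequal variance.
t_stat, df = welch_t([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(f"t = {t_stat:.3f}, df = {df:.3f}")

# Zero variance in both groups (e.g. max_concurrent = 1 everywhere) gives NaN.
print(welch_t([1.0, 1.0, 1.0], [1.0, 1.0, 1.0]))
```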
Cohen's d (pooled)
Effect size computed as: d = (mean_a - mean_b) / pooled_std, where pooled_std = sqrt(((n_a-1) * var_a + (n_b-1) * var_b) / (n_a + n_b - 2)). Interpretation thresholds: |d| < 0.2 negligible, 0.2--0.5 small, 0.5--0.8 medium, >0.8 large. Values exceeding d=10 (observed frequently in this study) indicate effect sizes so large that they are visible in individual data points without statistical testing.
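A minimal standard-library sketch of the pooled Cohen's d follows. The handling of zero pooled variance (0 when the means are equal, signed infinity otherwise) is an assumption chosen to be consistent with the d = 0 reported for the constant max_concurrent metric; the report's own code may treat that case differently. Sample values are hypothetical.

```python
import math
from statistics import mean, variance


def cohens_d(a: list[float], b: list[float]) -> float:
    """Pooled-standard-deviation Cohen's d for two independent samples."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    diff = mean(a) - mean(b)
    if pooled_var == 0:
        # Assumed convention: identical constant groups have no effect.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / math.sqrt(pooled_var)


def interpret(d: float) -> str:
    """Map |d| onto the thresholds listed above."""
    size = abs(d)
    if size < 0.2:
        return "negligible"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"


d = cohens_d([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # hypothetical samples
print(f"d = {d:.3f} ({interpret(d)})")
```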
Mann-Whitney U
Non-parametric rank-based test. Used as robustness check on every significant Welch's result. Two-sided alternative hypothesis. Does not assume normal distributions. Particularly important for n=2--3 groups where normality cannot be verified.
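For groups this small, the exact null distribution of U can be enumerated directly. The sketch below is an illustrative standard-library implementation of an exact two-sided test, not the report's code, and the sample values are hypothetical. Note that the smallest attainable exact p-value is set by the group sizes: 2/20 = 0.1 for 3 vs 3 and 2/924 (about 0.002) for 6 vs 6.

```python
from itertools import combinations


def mann_whitney_u(a: list[float], b: list[float]) -> float:
    """U statistic for sample a: pairs with a > b count 1, ties count 0.5."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)


def exact_two_sided_p(a: list[float], b: list[float]) -> float:
    """Exact two-sided p-value by enumerating every relabelling of the pooled data."""
    pooled = list(a) + list(b)
    n1, n2 = len(a), len(b)
    center = n1 * n2 / 2  # expected U under the null hypothesis
    observed = abs(mann_whitney_u(a, b) - center)
    extreme = total = 0
    for idx in combinations(range(len(pooled)), n1):
        chosen = set(idx)
        group_a = [pooled[i] for i in idx]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        if abs(mann_whitney_u(group_a, group_b) - center) >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Hypothetical, completely separated groups.
print(exact_two_sided_p([1, 2, 3], [4, 5, 6]))
print(exact_two_sided_p([1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12]))
```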
Holm Step-Down Correction
For k hypothesis tests at family-wise alpha: sort p-values ascending, test rank i against threshold alpha/(k-i+1). Reject H_i if p_i < threshold AND all lower-ranked tests were also rejected. More powerful than Bonferroni (which uses alpha/k for all tests) while still controlling family-wise error rate.
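The step-down procedure described above can be sketched in a few lines. In the usage example only the first p-value (6.4e-5) comes from this report; the other five are hypothetical placeholders chosen to illustrate how the procedure stops at the first failure.

```python
def holm_reject(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Holm step-down: return a reject flag for each p-value, in input order."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])  # indices, ascending p
    reject = [False] * k
    for rank, i in enumerate(order, start=1):
        threshold = alpha / (k - rank + 1)
        if p_values[i] < threshold:
            reject[i] = True
        else:
            break  # every larger p-value is retained as well
    return reject


# First value is the reported memory-operation p; the rest are hypothetical.
p_values = [6.4e-5, 0.02, 0.03, 0.2, 0.5, 0.9]
print(holm_reject(p_values))
```

In this example the second-smallest p-value (0.02) is compared against 0.05/5 = 0.01 and fails, so only one test survives.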
Power Analysis
Minimum detectable effect size for a two-sample t-test with equal n: d_min = t_crit(alpha/2, df=2n-2) * sqrt(2/n). At n=6: d_min=1.286 (requires "large" effect). At n=3: d_min=2.267 (requires "very large" effect). All observed effects (d > 4) are well above detection thresholds, confirming adequate statistical power despite small sample sizes.
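The formula can be evaluated without external dependencies. This sketch hard-codes two-sided 5% critical values from a standard Student's t table (an assumption made to keep it self-contained; `scipy.stats.t.ppf(0.975, df)` would supply the same values for arbitrary df). The function name is illustrative.

```python
import math

# Two-sided alpha = 0.05 critical values of Student's t, keyed by degrees of freedom.
T_CRIT_975 = {4: 2.776, 10: 2.228}


def minimum_detectable_d(n_per_group: int) -> float:
    """d_min = t_crit(alpha/2, df=2n-2) * sqrt(2/n) for two equal-size groups."""
    df = 2 * n_per_group - 2
    if df not in T_CRIT_975:
        raise KeyError(f"no tabulated critical value for df={df}")
    return T_CRIT_975[df] * math.sqrt(2 / n_per_group)


for n in (3, 6):
    print(f"n={n}: d_min={minimum_detectable_d(n):.3f}")
```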
IQR Outlier Detection
Tukey fences: outlier if x < Q1 - 1.5xIQR or x > Q3 + 1.5xIQR, where IQR = Q3 - Q1. Applied to all metric groups. Zero outliers detected in this study.
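A minimal sketch of the fence check using the standard library follows. The report does not state which quartile convention was used; `method="inclusive"` (linear interpolation over the sample) is an assumption, and other conventions can move the fences slightly for very small groups. Sample values are hypothetical.

```python
from statistics import quantiles


def tukey_outliers(values: list[float], k: float = 1.5) -> list[float]:
    """Return the values lying outside the Tukey fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < lower or x > upper]


print(tukey_outliers([1, 2, 3, 4, 100]))       # hypothetical data with one extreme value
print(tukey_outliers([160.1, 160.4, 160.9]))   # hypothetical tight TPS group
```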
Appendix D: Glossary
| Term | Definition |
|---|---|
| TPS | Tokens per second -- tokens_generated / wall_time. User-perceived throughput. |
| N=K | K concurrent agents/threads sending inference requests to the same GPU. |
| Q4_0 | 4-bit quantization format (0.5 bytes/parameter). Used by Ollama via ggml. |
| FP16 | Half-precision floating point (2 bytes/parameter). Used by PyTorch/HuggingFace. |
| nsys | NVIDIA Nsight Systems -- system-wide GPU profiler using hardware counters. |
| ncu | NVIDIA Nsight Compute -- per-kernel profiler with detailed hardware metrics. |
| SM | Streaming Multiprocessor -- the GPU's compute unit. RTX 4080 Laptop has 58 SMs. |
| WDDM | Windows Display Driver Model -- Windows GPU driver framework. Serializes GPU access for display compositing. |
| TCC | Tesla Compute Cluster -- NVIDIA's compute-only driver mode on Windows for supported data-center and workstation GPUs (Linux drivers have no WDDM layer). Allows exclusive compute access and ncu hardware counter collection. |
| MPS | Multi-Process Service -- CUDA feature enabling concurrent kernel execution from multiple processes. Not available on consumer WDDM GPUs. |
| GIL | Global Interpreter Lock -- Python's thread serialization mechanism. Released during CUDA operations, so multiple Python threads can submit GPU work concurrently. |
| ggml | C library for ML inference used by Ollama/llama.cpp. Features fused quantized kernels. |
| mul_mat_q | ggml's quantized matrix multiply CUDA kernel. Fuses dequantization + matmul. |
| Continuous batching | Technique where multiple sequences are processed in a single kernel launch, reading model weights once for N tokens. Used by vLLM and TGI. |
| PagedAttention | vLLM's memory management: allocates KV-cache in pages to reduce fragmentation. |
| Cohen's d | Standardized mean difference: |mean_diff| / pooled_std. <0.2 negligible, 0.2--0.5 small, 0.5--0.8 medium, >=0.8 large. |
| Holm correction | Step-down multiple comparison correction. Controls family-wise error rate more powerfully than Bonferroni. |
| Welch's t-test | t-test for unequal variance. Standard for comparing two independent groups. |
| Category error | Comparing a metric across systems where the metric means different things. E.g., attributing GPU-level serialization to software scheduling. |
| Bandwidth demand | Memory bandwidth required per second: model_weight_size x tokens_per_second x concurrent_agents. |
Appendix E: Reproducibility
How to Reproduce This Experiment
# Prerequisites:
# - NVIDIA Nsight Systems 2025.5.1 at default install path
# - NVIDIA Nsight Compute 2025.3.1 at default install path
# - Ollama installed: ollama pull llama3.2:1b && ollama pull llama3.2:3b
# - Python 3.11+ with: torch, transformers, numpy, scipy, pyyaml, requests
# - Close Ollama tray app before running (avoids process conflicts)
# Full pipeline (all 6 phases + analysis)
python research/tr131/run.py -v
# Expected runtime: ~71 minutes
# Expected disk usage: ~1.6 GB traces
Key Implementation Details
- Ollama profiling: nsys wraps ollama serve; HTTP driver runs in separate thread outside nsys process tree
- PyTorch profiling: nsys wraps Python script; ThreadPoolExecutor for N=8 concurrency (GIL released for CUDA)
- Warmup: 3 requests per model before measurement begins
- Cooldown: 3 seconds between runs for GPU temperature stabilization
- Error handling: Runs with 0 complete requests are logged and excluded from analysis
- Stats extraction timeout: 120 seconds per nsys stats report; large traces may timeout
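The thread-based dispatch used for the N=8 PyTorch runs can be sketched as follows. This is an illustrative harness and not the contents of run_pytorch_direct.py: `fake_generate` is a hypothetical stand-in that sleeps instead of calling model.generate(), so the sketch runs without a GPU. TPS follows the glossary definition (tokens_generated / wall_time).

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_generate(max_new_tokens: int) -> int:
    """Hypothetical stand-in for model.generate(); returns the token count produced."""
    time.sleep(0.01)  # placeholder for GPU work
    return max_new_tokens


def run_agent(requests: int, max_new_tokens: int) -> float:
    """Send requests sequentially from one agent and return its tokens per second."""
    start = time.perf_counter()
    tokens = sum(fake_generate(max_new_tokens) for _ in range(requests))
    return tokens / (time.perf_counter() - start)


def run_concurrent(n_threads: int, requests_per_thread: int,
                   max_new_tokens: int = 128) -> list[float]:
    """Run n_threads agents at once and return each agent's TPS."""
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = [pool.submit(run_agent, requests_per_thread, max_new_tokens)
                   for _ in range(n_threads)]
        return [f.result() for f in futures]


per_agent_tps = run_concurrent(n_threads=8, requests_per_thread=3)
print(f"{len(per_agent_tps)} agents, mean per-agent TPS {sum(per_agent_tps) / len(per_agent_tps):.0f}")
```

With a real model, each worker would call generate on a shared model instance, and the per-agent TPS values at N=8 would be compared against the N=1 run.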
Data Provenance
| Artifact | Path | Size |
|---|---|---|
| Raw traces | research/tr131/results/20260226_174224/traces/ | ~1.6 GB (27 .nsys-rep files) |
| Exported CSVs | research/tr131/results/20260226_174224/exports/ | ~50 MB |
| Phase results | research/tr131/results/20260226_174224/p{2,3,4,5}_*_results.json | ~10 KB each |
| Validation | research/tr131/results/20260226_174224/validation.json | ~2 KB |
| ncu results | research/tr131/results/20260226_174224/p6_ncu_results.json | ~1 KB |
| Analysis | research/tr131/results/20260226_174224/analysis.json | ~80 KB (12 sections) |
| Manifest | research/tr131/results/20260226_174224/manifest.json | ~3 KB |
| This report | PublishReady/reports/Technical_Report_131.md | ~1,200 lines |
Implementation Files
| File | Purpose | Lines |
|---|---|---|
| research/tr131/run.py | Orchestrator -- runs all 6 phases sequentially | ~130 |
| research/tr131/run_validation.py | Phase 1 -- validates nsys captures Ollama | ~220 |
| research/tr131/run_ollama_profiled.py | Phases 2--3 -- profiles Ollama at N=1 and N=8 | ~210 |
| research/tr131/run_pytorch_direct.py | Phases 4--5 -- profiles PyTorch Direct at N=1 and N=8 | ~250 |
| research/tr131/run_ncu_targeted.py | Phase 6 -- nsight compute per-kernel profiling | ~150 |
| research/tr131/analyze.py | 12-section statistical analysis -> analysis.json | ~680 |
| research/tr131/shared/statistics.py | Statistical utilities (Welch's t, Cohen's d, Holm, etc.) | ~300 |
| research/tr131/shared/nsys_driver.py | NsysDriver class -- wraps nsys profile/stats/export | ~200 |
| research/tr131/shared/trace_parser.py | Parse nsys CSV exports into analysis-ready dicts | ~350 |
| research/tr131/shared/request_driver.py | HTTP request sender for Ollama (outside nsys) | ~200 |
| research/tr131/shared/pytorch_inference.py | Direct HuggingFace model loading + generate | ~250 |
| research/tr131/shared/utils.py | Paths, constants, prompt generation | ~100 |
References
- Kwon, W. et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP 2023.
- Patel, P. et al. (2024). Splitwise: Efficient generative LLM inference using phase splitting. ISCA 2024.
- Amdahl, G.M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. AFIPS 1967.
- NVIDIA (2025). Nsight Systems User Guide. NVIDIA Developer Documentation.
- NVIDIA (2025). Nsight Compute Documentation. NVIDIA Developer Documentation.
- TR129 (2026). N-Agent Scaling Laws. Banterhearts Research.
- TR130 (2026). Serving Stack Benchmarking -- Ollama vs vLLM vs TGI. Banterhearts Research.
- TR126 (2026). Docker/Linux + Triton Validation. Banterhearts Research (statistical methodology reference).