Technical Report 131: GPU Kernel Profiling

Root-Cause Analysis of Multi-Agent Throughput Degradation via NVIDIA Nsight Systems

Field	Value
TR Number	131
Project	Banterhearts LLM Performance Research
Date	2026-02-26
Author	Research Team
Report Type	Hardware-level root-cause analysis (6-phase, 2 backends, 26 profiled runs)
Test Duration	~71 minutes
Status	Complete -- GPU memory physics identified as primary degradation mechanism
Run ID	`20260226_174224`
Related Work	TR129 (N-Agent Scaling Laws), TR130 (Serving Stack Benchmarking)
Depends On	TR129 (degradation measurement), TR130 (serving stack comparison)

Abstract

TR129 established that Ollama per-agent throughput drops 63% under 8-agent concurrency (Amdahl serial fraction s=0.39--0.54) on an RTX 4080 Laptop GPU. TR130 compared three serving stacks and found vLLM/TGI scale 3--4x better than Ollama, concluding: "The serving stack is the bottleneck." But neither study opened the GPU black box. Without kernel-level traces, the attribution was correlational, not causal.

TR131 provides the causal test. Using NVIDIA Nsight Systems (nsys) and Nsight Compute (ncu), we capture GPU kernel timelines, memory operations, and execution traces for 2 LLaMA 3.2 models (1B, 3B) across 26 profiled runs in 4 experimental conditions: Ollama N=1, Ollama N=8, PyTorch Direct N=1, PyTorch Direct N=8. PyTorch Direct eliminates the entire serving stack -- no HTTP server, no Go runtime, no request queuing, no token streaming -- and calls model.generate() directly via HuggingFace Transformers. If the degradation persists without the serving stack, the cause is GPU physics, not software.

The central finding overturns the TR130 hypothesis. PyTorch Direct degrades 86.4% from N=1 to N=8 -- worse than Ollama's 82.1% (both p < 0.003, Cohen's d > 4, confirmed by Mann-Whitney U). The -4.3 percentage-point attributable-to-Ollama figure means the degradation is intrinsic to running 8 concurrent inference streams on a single GPU with shared memory bandwidth. Ollama's Q4_0 quantization actually helps under concurrency by reducing per-request bandwidth pressure, growing the Ollama-vs-PyTorch advantage from 3.0x at N=1 to 3.9x at N=8.

The strongest mechanistic evidence comes from memory bandwidth analysis: Ollama memory operation time increases 74.4% from N=1 to N=8 (p=6.4x10^-5, Cohen's d=3.81) -- the only hypothesis test surviving Holm step-down correction across 6 tests at family-wise alpha=0.05. Back-of-envelope bandwidth calculations show N=8 demand exceeds the RTX 4080's peak 432 GB/s by 78--130% depending on precision, forcing the memory controller to serialize weight reads.

Five hypotheses were tested. H_1 (bandwidth saturation): partially confirmed. H2 (Ollama serialization): serialization exists but is GPU-level, not Ollama-level (max_concurrent_kernels = 1 in all conditions including PyTorch). H3 (context switching): rejected. H4 (CPU scheduling): insufficient data. H5 (KV-cache pressure): rejected. Welch's t-tests, Cohen's d effect sizes, 95% CIs, Mann-Whitney U robustness checks, and Holm correction provide statistical rigor matching TR126 standards.

Quantization caveat: Ollama serves Q4_0 (0.5 bytes/parameter); PyTorch runs FP16 (2 bytes/parameter). Absolute TPS is non-comparable. However, the N=1->N=8 degradation ratio is the relevant attribution metric, and FP16's strictly higher memory pressure per parameter makes PyTorch's worse degradation a conservative bound on GPU physics effects.

Executive Summary

Key Findings

GPU physics dominates. PyTorch Direct degrades 86.4% (N=1->N=8), exceeding Ollama's 82.1% -- the serving stack is not the primary bottleneck. The degradation is intrinsic to single-GPU memory bandwidth under concurrent weight reads.
Both backends show massive, highly significant degradation. Ollama: 128.5 -> 23.0 TPS (p=0.0006, d=4.19); PyTorch: 42.9 -> 5.8 TPS (p=0.002, d=4.17). All effects are far above the minimum detectable d=1.29 at our sample sizes.
Ollama is 3--4x faster than PyTorch at both concurrency levels. At N=1: 128.5 vs 42.9 TPS (3.0x, p=0.001, d=3.12). At N=8: 23.0 vs 5.8 TPS (3.9x, p=0.0008, d=3.44). The advantage grows under concurrency because Q4_0's lower bandwidth demand compounds when bandwidth is scarce.
Memory bandwidth stress is the only statistically significant mechanism. Ollama memory operation time grows +74.4% at N=8 (p=6.4x10^-5, d=3.81) -- the only test surviving Holm correction (rank 1 of 6, threshold=0.0083).
Kernel serialization is GPU-level, not Ollama-level. Max concurrent kernels = 1 in all 26 runs across both backends. Cohen's d = 0 for every concurrency comparison. The GPU hardware enforces serial kernel execution regardless of software.
Context switching (H3) is rejected. Inter-kernel gap metrics show zero variance between N=1 and N=8 across both backends. No evidence of CUDA context switching overhead.
KV-cache pressure (H5) is rejected. Memory allocation counts are unchanged between N=1 and N=8 (p=1.0, d=0). No evidence of memory pressure affecting GPU utilization.
Nsight Compute data was limited by WDDM. ncu captured kernel names but returned null metrics for SM occupancy and DRAM throughput on Windows WDDM driver. Direct bandwidth measurement was not possible.
Degradation is model-size independent. LLaMA-1B: -82.1% (Ollama), -86.2% (PyTorch). LLaMA-3B: -82.2% (Ollama), -87.1% (PyTorch). Near-identical patterns regardless of parameter count.
This revises TR130's conclusion. The "serving stack bottleneck" is actually a "GPU memory physics bottleneck." vLLM/TGI's better scaling comes from continuous batching and PagedAttention reducing bandwidth waste per token, not merely better HTTP request scheduling.

Summary Tables

Per-Agent Throughput (TPS) with Full Statistics

Backend	Model	N=1 Mean	N=1 95% CI	CV%	N=8 Mean	N=8 95% CI	CV%	Degradation	p-value	Cohen's d
Ollama	LLaMA-1B	160.44	[159.29, 161.60]	0.29	28.80	[24.93, 32.66]	5.40	-82.1%	1.1x10^-5	114.64
Ollama	LLaMA-3B	96.48	[96.06, 96.89]	0.17	17.19	[14.00, 20.39]	7.47	-82.2%	6.8x10^-5	86.54
PyTorch	LLaMA-1B	52.02	[50.11, 53.92]	1.48	7.18	[6.84, 7.53]	1.93	-86.2%	6.1x10^-5	81.26
PyTorch	LLaMA-3B	29.33	[28.25, 30.41]	0.41	3.79	[3.66, 3.92]	0.37	-87.1%	1.8x10^-3	298.35

Hypothesis Verdicts

H	Hypothesis	Verdict	Key Evidence	Holm-Corrected	Confidence
H_1	GPU bandwidth saturation	PARTIALLY CONFIRMED	Mem time +74%, p=6.4x10^-5, d=3.81	Significant	HIGH
H2	Ollama request serialization	GPU-LEVEL (not Ollama)	Max concurrent=1 everywhere, d=0	NaN (no variance)	HIGH
H3	CUDA context switching	REJECTED	Gap metrics identical N=1 vs N=8	Not significant	MEDIUM
H4	CPU thread scheduling	INSUFFICIENT DATA	OS runtime data unavailable	N/A	LOW
H5	KV-cache memory pressure	REJECTED	Alloc counts unchanged, p=1.0, d=0	Not significant	LOW

Claim Validation

#	Claim	Evidence	Status
1	Serving stack causes 82% degradation (TR130)	PyTorch Direct degrades 86.4% without any serving stack	Overturned
2	GPU bandwidth is stressed at N=8	Mem time +74.4%, p=6.4x10^-5, survives Holm correction	Confirmed
3	Ollama serializes GPU access	Max concurrent=1 in both backends; GPU hardware, not Ollama	Reattributed
4	Q4_0 quantization helps concurrency	Ollama advantage grows 3.0x -> 3.9x from N=1 to N=8	Confirmed
5	Context switches cause overhead at N=8	Zero variance in gap metrics across all conditions	Rejected
6	KV-cache pressure reduces occupancy	Alloc counts unchanged; ncu SM data null	Rejected
7	Profiling overhead distorts timing	Ollama N=1 TPS matches TR129 unprofiled (160.4 vs ~160)	Negligible

Key Decisions for Practitioners

Do not blame Ollama for N=8 degradation. GPU memory bandwidth is the fundamental limit. Switching serving stacks alone will not solve the 82% per-agent throughput collapse -- the GPU physics enforces it.
Continuous batching is the real differentiator. vLLM/TGI's better scaling (TR130) comes from batching multiple sequences into single kernel launches, amortizing the weight-read bandwidth cost. This is fundamentally different from "better scheduling" -- it is a bandwidth optimization.
Quantization is critical for concurrency. Ollama's Q4_0 models maintain 3--4x higher absolute TPS than FP16 PyTorch because Q4_0 weights are 4x smaller, proportionally reducing memory bandwidth demand. The advantage compounds under contention: 3.0x at N=1 grows to 3.9x at N=8.
For multi-agent workloads, use multiple GPUs or reduce agent count. 8 agents on a single RTX 4080 Laptop is fundamentally bandwidth-bound regardless of software stack. Plan for 2--3 agents per GPU for acceptable latency.
Profile before optimizing. This study demonstrates that intuitive attributions (blaming the serving stack) can be wrong even when supported by strong correlational evidence (TR130). Hardware profiling is the only way to distinguish correlation from causation.

How to Read This Report

Time	Reading Path
2 min	Abstract -> Executive Summary -> SS15 Hypothesis Verdicts table
10 min	Add SS3 (Methodology) + SS14 (PyTorch vs Ollama) + SS15 (Verdicts)
30 min	Full report, SS1--SS19 + Appendices

When to Use This Report

Scenario	How This Report Helps
Diagnosing multi-agent throughput collapse	Attribution table shows GPU physics is primary cause
Deciding whether to switch from Ollama	Switching to another sequential server won't help; need continuous batching
Capacity planning for concurrent agents	Bandwidth demand table shows when GPU saturates
Evaluating if quantization helps concurrency	Q4_0 advantage grows under bandwidth pressure (3.0x -> 3.9x)
Understanding why vLLM scales better than Ollama	Not scheduling -- continuous batching amortizes bandwidth
Planning GPU profiling for your workload	Methodology section provides nsys/ncu recipe

SS1. Introduction and Motivation
SS2. Methodology
SS3. Phase 1 -- Environment Validation
SS4. Phase 2 -- Ollama N=1 Baseline
SS5. Phase 3 -- Ollama N=8 Concurrent
SS6. Phases 4--5 -- PyTorch Direct N=1 and N=8
SS7. The Core Comparison -- Ollama vs PyTorch Degradation
SS8. Kernel Profile Comparison
SS9. GPU Utilization Analysis
SS10. Memory Bandwidth Analysis (H_1)
SS11. Serialization Analysis (H2)
SS12. Context Switch Analysis (H3)
SS13. Memory Allocation Analysis (H5)
SS14. Phase 6 -- Nsight Compute Targeted Profiling
SS15. Hypothesis Verdicts and Degradation Attribution
SS16. Statistical Power and Data Quality
SS17. Profiling Overhead Assessment
SS18. Limitations and Future Work
SS19. Conclusions
Appendix A: Configuration
Appendix B: Environment
Appendix C: Statistical Methods
Appendix D: Glossary
Appendix E: Reproducibility
References

SS1. Introduction and Motivation

SS1.1 Background

Multi-agent LLM systems deploy N autonomous agents that concurrently issue inference requests to a shared GPU. TR129 established that Ollama per-agent throughput degrades following Amdahl's Law with serial fraction s=0.39--0.54: at N=8 agents on an RTX 4080 Laptop GPU, each agent retains only 16--17% of its standalone throughput. The practical consequence is severe -- adding agents beyond N=2 yields diminishing total throughput.

TR130 isolated the serving stack variable by comparing Ollama, vLLM, and TGI on identical hardware. vLLM retained 46--65% of per-agent throughput at N=8 (vs Ollama's 16--17%), and total throughput was 2x higher. The conclusion: "The serving stack is the bottleneck, and it is Ollama that suffers."

But this conclusion rests on a correlational argument. TR130 showed that different serving stacks produce different degradation curves. It did not show that Ollama's scheduling causes the degradation. An alternative explanation: the GPU's memory bandwidth is the fundamental constraint, and vLLM/TGI's continuous batching reduces bandwidth demand (by batching weight reads across sequences), while Ollama's sequential execution does not. Under this alternative, the root cause is GPU physics, and the serving stack's role is bandwidth efficiency, not scheduling quality.

Distinguishing these explanations requires opening the GPU black box. If we remove the serving stack entirely -- calling model.generate() directly via PyTorch -- and the degradation persists, then the cause is GPU physics, not software. If it disappears, the cause is indeed Ollama's scheduling.

SS1.2 Experimental Design

TR131 introduces a PyTorch Direct control condition that eliminates the entire serving stack. There is no HTTP server, no Go runtime, no request queuing, no token streaming, no Ollama process. A Python script loads the model via HuggingFace Transformers and calls model.generate() directly. For N=8, a ThreadPoolExecutor with 8 workers runs concurrent inference -- GPU operations release the GIL, enabling true CUDA-level concurrency.

Factor	Controlled?	Value
GPU hardware	Yes	RTX 4080 Laptop 12 GB, GDDR6, 432 GB/s peak
Models	Yes	LLaMA-3.2-1B, LLaMA-3.2-3B
Concurrency levels	Yes	N=1 (baseline), N=8 (concurrent)
Max new tokens	Yes	128
Profiler	Yes	NVIDIA Nsight Systems 2025.5.1 wrapping target
Repetitions	Yes	3 per condition
Inference backend	Variable	Ollama (Q4_0) vs PyTorch Direct (FP16)
Quantization	Partially	Ollama=Q4_0, PyTorch=FP16

The quantization difference (Q4_0 vs FP16) affects absolute throughput but not the N=1->N=8 degradation ratio, which is the metric used for causal attribution. FP16 models place strictly more memory pressure per parameter (4x), making PyTorch's degradation a conservative bound on GPU physics effects.

SS1.3 Research Questions

TR131 is designed to answer five specific questions:

Q1: Does GPU memory bandwidth saturate under N=8 concurrency? If memory operation time increases significantly at N=8, bandwidth contention is a mechanism.
Q2: Does Ollama serialize GPU kernel execution compared to direct PyTorch? If Ollama shows lower max concurrent kernels than PyTorch, it is adding serialization.
Q3: Do CUDA context switches increase measurably at N=8? If inter-kernel gaps widen at N=8, context switching is a mechanism.
Q4: Is GPU-level degradation intrinsic (hardware) or extrinsic (software)? If PyTorch Direct degrades comparably to Ollama, the cause is hardware.
Q5: What fraction of the 82% degradation is attributable to Ollama's serving stack vs GPU physics? The difference between PyTorch and Ollama degradation ratios is the serving stack contribution.

SS1.4 Literature Gap

Published LLM serving benchmarks (Kwon et al. 2023, Patel et al. 2024) compare backends under open-loop arrival conditions. Multi-agent systems are closed-loop. TR130 provided the first closed-loop cross-backend comparison. TR131 goes one step further: by removing the serving stack entirely, it isolates the GPU physics component that no prior study has measured. This is the first kernel-level profiling of multi-agent inference degradation in the Banterhearts research series.

SS1.5 Five Hypotheses

H	Hypothesis	Key nsys/ncu Data	Confirm If...
H_1	GPU memory bandwidth saturates	Memory op time, ncu DRAM throughput	Bandwidth >80% of peak at N=8
H2	Ollama serializes GPU requests	Kernel exec trace timeline	Ollama N=8 has zero kernel overlap; PyTorch N=8 has overlap
H3	CUDA context switching overhead	`--gpuctxsw=true`, inter-kernel gaps	Context switches at N=8 >> N=1
H4	CPU thread scheduling bottleneck	OS runtime summary	CPU thread wait >20% of wall time at N=8
H5	KV-cache memory pressure	SM occupancy, mem alloc counts	SM occupancy drops significantly at N=8

Expected primary cause based on TR130 evidence: H2 (Ollama serialization). This expectation will be tested.

SS1.6 Why Nsight Systems + Nsight Compute

Tool	Purpose	Data Captured	When Used
Nsight Systems (nsys)	System-wide timeline	CUDA API calls, kernel launches, memory ops, inter-kernel gaps, OS runtime	Phases 1--5 (all runs)
Nsight Compute (ncu)	Per-kernel deep dive	SM occupancy, DRAM throughput, compute utilization	Phase 6 (targeted)

nsys wraps the entire target process, capturing all CUDA activity with minimal overhead (validated in SS17). ncu profiles individual kernel launches with detailed hardware counter data but can only profile a few launches at a time due to replay overhead.

SS2. Methodology

SS2.1 Profiling Architecture

Ollama (Phases 2--3): nsys wraps ollama serve, capturing all CUDA activity from Ollama's ggml inference engine. A separate Python thread (running outside the nsys process tree) sends HTTP requests to localhost:11434 after the server is ready. This ensures profiling overhead does not affect request timing. The nsys --kill=true flag terminates Ollama after the profile duration expires.

PyTorch Direct (Phases 4--5): nsys wraps a Python script that loads the model via HuggingFace Transformers with torch.float16 and calls model.generate() directly. For N=8, a ThreadPoolExecutor with 8 workers dispatches concurrent inference calls. GPU operations release Python's GIL, enabling true CUDA-level concurrency from multiple threads. The inference script runs inside the nsys process tree (unavoidable), but profiling overhead is symmetric across N=1 and N=8 conditions, preserving the degradation ratio comparison.

Nsight Compute (Phase 6): ncu wraps a Python script that loads the model and performs 5 generate() calls, allowing ncu to capture detailed per-kernel metrics for the top kernels. Only 5 launches are profiled per model (ncu replays each kernel multiple times for counter collection).

SS2.2 Metrics

Metric	Source	Formula	Primary?
Per-agent TPS	Request driver	`tokens_generated / wall_time_s`	Yes
Kernel count	nsys `cuda_gpu_kern_sum`	Total CUDA kernel launches	Yes
GPU time (ms)	nsys `cuda_gpu_kern_sum`	Sum of kernel durations	Yes
Memory op time (ms)	nsys `cuda_gpu_mem_time_sum`	Sum of memcpy/memset durations	Yes
Max concurrent kernels	nsys `cuda_kern_exec_trace`	Peak overlapping kernel count (sweep-line)	Yes
GPU utilization %	nsys `cuda_kern_exec_trace`	`active_time / total_trace_time x 100`	Secondary
Inter-kernel gap (mus)	nsys `cuda_kern_exec_trace`	Mean gap between consecutive launches	Secondary
SM occupancy %	ncu	Achieved / theoretical warp occupancy	Phase 6 only
DRAM throughput %	ncu	Achieved / peak memory bandwidth	Phase 6 only

Per-agent TPS is the primary metric -- it captures the throughput each agent actually experiences, including all scheduling, queue wait, and memory contention overhead. Kernel count, GPU time, and memory op time are the primary trace metrics used for hypothesis testing.

SS2.3 Statistical Methods

Welch's t-test (unequal variance) for all pairwise comparisons -- does not assume equal group sizes or variance
Cohen's d (pooled standard deviation) for standardized effect size -- classified as negligible (<0.2), small (0.2--0.5), medium (0.5--0.8), or large (>0.8)
Mann-Whitney U (non-parametric) as robustness check on every significant Welch's result
95% confidence intervals via t-distribution with n-1 degrees of freedom
Holm step-down correction for multiple hypothesis tests (k=6 tests, family-wise alpha=0.05)
Power analysis: minimum detectable Cohen's d given sample sizes: d_min = t_crit(alpha/2, df=2n-2) x sqrt(2/n)
IQR outlier detection via Tukey fences (1.5x IQR below Q1 or above Q3)
Descriptive statistics: mean, median, std, p90, p95, p99, min, max per group

SS2.4 Six Phases

Phase	Description	Runs	Profile Duration	Expected Traces
P1	Validation: nsys captures Ollama CUDA kernels	1	30s	~1 MB
P2	Ollama N=1: 2 models x 3 reps	6	30s	~1 MB each
P3	Ollama N=8: 2 models x 3 reps	6	60s	~4 MB each
P4	PyTorch Direct N=1: 2 models x 3 reps	6	60s	~40--80 MB each
P5	PyTorch Direct N=8: 2 models x 3 reps	6	120s	~160--270 MB each
P6	Nsight Compute: 2 models x 5 kernel launches	2	~2--4 min	CSV output
Total		27	~71 min	~1.6 GB

Each phase requires a fresh start of the target process (Ollama or PyTorch script) under nsys. Between runs, the previous process is fully stopped and a 3-second cooldown allows GPU temperature stabilization.

SS3. Phase 1 -- Environment Validation

Phase 1 is the critical gate: if nsys cannot capture CUDA kernels from Ollama's process tree, the entire experiment fails. Ollama spawns child processes for GPU inference (the ggml CUDA backend runs in a separate subprocess), so nsys must follow the process tree.

SS3.1 Validation Results

Check	Result	Status
nsys reachable	NVIDIA Nsight Systems 2025.5.1.121-255136380782v0	Pass
CUDA API calls captured	26,185	Pass
GPU kernels captured	1,871	Pass
GPU time	44.7 ms	Pass
Ollama HTTP requests	3/3 OK (162.2 TPS)	Pass
Reports extracted	cuda_api_sum, cuda_gpu_kern_sum, cuda_gpu_mem_time_sum, cuda_kern_exec_trace, osrt_sum	Pass
Trace file size	0.9 MB	Expected

SS3.2 Observations

Observation 1 -- nsys captures ggml CUDA kernels through the Ollama process tree. The 1,871 kernel launches and 26,185 CUDA API calls confirm full visibility into Ollama's GPU activity. This is not guaranteed -- some profilers cannot follow forked child processes on Windows. nsys's --trace=cuda flag with --kill=true successfully wraps ollama serve and its child processes.

Observation 2 -- The dominant kernel is mul_mat_q<ggml_type=8>. This is ggml's quantized matrix multiplication kernel for Q4_0 (type 8 in the ggml enum). It performs dequantization and matrix multiplication in a single fused kernel, avoiding the bandwidth cost of a separate dequantize pass. The top 3 kernels are all mul_mat_q variants with different tile sizes (64, 80) and stream-k fixup kernels.

Observation 3 -- Validation TPS (162.2) matches TR129 unprofiled data (~160 TPS for LLaMA-1B). This provides early evidence that nsys profiling overhead is negligible for Ollama runs (further validated in SS17).

Observation 4 -- GPU utilization reads 0% but this is a denominator artifact. The 44.7 ms of GPU time within a 30-second profile window means the GPU is active for only 0.15% of the total profile duration. Between requests, the GPU is idle. This metric is misleading for bursty workloads (discussed further in SS9).

Observation 5 -- Max concurrent kernels = 1 even during validation. This is the first hint that kernel serialization is a GPU-level phenomenon, not specific to multi-agent concurrency. A single sequential request already executes kernels one at a time because each ggml kernel occupies the full GPU.

Gate: PASSED. All 5 nsys report types extracted successfully. Proceeding to profiled phases.

SS4. Phase 2 -- Ollama N=1 Baseline

SS4.1 Per-Request Results

Model	Rep	TPS	Wall (ms)	Trace (MB)	Kernels	GPU Time (ms)
LLaMA-1B	0	159.98	800.4	0.8	2,257	45.9
LLaMA-1B	1	160.44	797.8	0.8	2,257	45.9
LLaMA-1B	2	160.91	795.6	0.8	2,257	45.9
LLaMA-3B	0	96.30	1,329.5	1.0	3,949	117.4
LLaMA-3B	1	96.50	1,326.3	1.0	3,949	118.3
LLaMA-3B	2	96.63	1,324.7	1.0	3,949	121.0

SS4.2 Descriptive Statistics

Model	Mean TPS	95% CI	Std	CV%	Median	p90
LLaMA-1B	160.44	[159.29, 161.60]	0.47	0.29%	160.44	160.82
LLaMA-3B	96.48	[96.06, 96.89]	0.17	0.17%	96.50	96.60

SS4.3 Observations

Observation 1 -- Ollama N=1 throughput is extremely deterministic. LLaMA-1B: CV=0.29%, 95% CI width = 2.31 TPS. LLaMA-3B: CV=0.17%, 95% CI width = 0.83 TPS. This sub-1% variance means that even 3 repetitions produce tight confidence intervals. The determinism comes from sequential request processing: with no contention, every request follows the same code path through ggml -> CUDA -> memory controller.

Observation 2 -- LLaMA-1B is 1.66x faster than LLaMA-3B (160.4 vs 96.5 TPS). The parameter ratio is 2.67x (3.2B / 1.2B), but the throughput ratio is only 1.66x -- sublinear scaling. The gap is smaller than expected because per-token overhead (CUDA launch, memory allocation, HTTP round-trip) is approximately constant regardless of model size. For the 1B model, this fixed overhead is a larger fraction of total time, compressing the ratio.

Observation 3 -- Kernel counts are identical across repetitions for the same model. LLaMA-1B: exactly 2,257 kernels in all 3 reps. LLaMA-3B: exactly 3,949 in all 3 reps. This perfect reproducibility means the ggml execution graph is fully deterministic for a given prompt length and model architecture -- no dynamic kernel dispatch.

Observation 4 -- GPU time is a tiny fraction of wall time. LLaMA-1B: 45.9 ms GPU time vs 798 ms wall time -- the GPU is actively computing for only 5.7% of request time. The remaining 94.3% is Ollama overhead: HTTP handling, tokenization, JSON serialization, queue management, and inter-kernel gaps. This matches TR130's finding of ~210 ms scheduling overhead per request.

Observation 5 -- Trace sizes are tiny (~0.8--1.0 MB per 30s profile). This confirms that CUDA activity for 5 sequential Q4_0 inference requests is minimal. The small trace sizes also mean nsys stats extraction is fast (<5 seconds per run), avoiding the timeout issues that will affect large PyTorch traces (SS6).

SS4.4 Kernel Architecture (N=1)

Metric	LLaMA-1B	LLaMA-3B	Interpretation
Kernel instances	2,257	3,949	1.75x -- proportional to layer count (16 vs 28 layers)
GPU time (ms)	45.9	118.9	2.59x -- closer to parameter ratio (2.67x)
Attention kernel %	8.4%	8.5%	Identical -- attention fraction is architecture-independent
GEMM kernel %	0%	0%	Expected -- ggml uses fused `mul_mat_q`, not cuBLAS GEMM
Memory op time (ms)	109.3	245.4	2.25x -- tracks weight size ratio

The 0% GEMM fraction deserves explanation. Standard PyTorch inference dispatches separate cuBLAS GEMM calls for matrix multiplications. Ollama's ggml backend uses custom mul_mat_q kernels that fuse dequantization and matrix multiplication into a single kernel. This fusion eliminates the intermediate dequantized tensor, halving memory bandwidth for each matmul operation. The ggml kernel names (void mul_mat_q<(ggml_type)8, (int)64, (bool)0>) confirm Q4_0 (type=8) with tile sizes 64 and 80.

SS5. Phase 3 -- Ollama N=8 Concurrent

SS5.1 Per-Request Results

Model	Rep	Per-Agent TPS	Wall (ms)	Trace (MB)	Kernels	GPU Time (ms)
LLaMA-1B	0	28.86	4,833	3.3	10,975	216.1
LLaMA-1B	1	27.21	4,971	3.3	10,975	216.1
LLaMA-1B	2	30.32	4,856	3.3	10,975	216.1
LLaMA-3B	0	16.58	8,729	3.7	18,492	512.9
LLaMA-3B	1	16.33	8,716	3.8	19,745	514.9
LLaMA-3B	2	18.67	8,689	4.1	19,745	521.0

SS5.2 Descriptive Statistics

Model	Mean TPS	95% CI	Std	CV%	Median	Degradation from N=1
LLaMA-1B	28.80	[24.93, 32.66]	1.56	5.4%	28.86	-82.1%
LLaMA-3B	17.19	[14.00, 20.39]	1.28	7.5%	16.58	-82.2%

SS5.3 Observations

Observation 1 -- 82.1% per-agent degradation, highly significant. Overall: 128.46 -> 22.99 TPS, Welch's t=7.25, p=0.0006, Cohen's d=4.19. Mann-Whitney U=36.0, p=0.002, confirming non-parametric robustness. The effect size of 4.19 is far above the minimum detectable d=1.29 at our sample sizes (SS16).

Observation 2 -- Degradation is near-identical across models. LLaMA-1B: -82.1% (131.6 TPS lost). LLaMA-3B: -82.2% (79.3 TPS lost). The degradation percentage is model-independent even though the absolute TPS loss differs by 1.66x. This means the degradation mechanism is proportional to baseline throughput -- consistent with bandwidth contention, which scales with weight size.

Observation 3 -- Variance increases at N=8 (CV: 0.29% -> 5.4% for 1B, 0.17% -> 7.5% for 3B). Under contention, the nondeterminism of GPU memory controller scheduling introduces variability. The 95% CI width grows from 2.3 to 7.7 TPS for 1B and from 0.8 to 6.4 TPS for 3B. This increased variance is expected when multiple processes contend for the same memory bus.

Observation 4 -- Kernel counts increase ~5x for 8x workload. LLaMA-1B: 2,257 -> 10,975 (4.86x). LLaMA-3B: 3,949 -> 19,327 (4.89x). The sub-8x scaling means that not all 8 agents complete all 3 requests within the 60-second profile window -- some requests are cut short by nsys's --kill termination. The kernel count still reflects the actual GPU work done.

Observation 5 -- GPU time increases ~4.7x while kernel count increases ~4.9x. LLaMA-1B: 45.9 -> 216.1 ms (4.71x GPU time), 2,257 -> 10,975 kernels (4.86x count). The GPU time per kernel is approximately constant (20.3 mus at N=1, 19.7 mus at N=8), suggesting that individual kernel execution time is not affected by concurrency -- the slowdown comes from increased total work competing for memory bandwidth, not from individual kernels running slower.

Observation 6 -- Trace sizes increase ~4x, not 8x. LLaMA-1B: 0.8 -> 3.3 MB (4.1x). The sub-8x scaling mirrors the kernel count -- the profile duration captures 4.9x more kernels than N=1, proportionally increasing trace size. The modest trace sizes (3.3--4.1 MB) mean nsys stats extraction remains fast for Ollama runs.

SS5.4 Interpretation -- Ollama Degradation Mechanism

The data suggests the following mechanism: at N=8, Ollama receives 8 concurrent HTTP requests and queues them. Because ggml processes one request at a time (max_concurrent_kernels = 1 at the GPU level), each agent waits while the other 7 are served. The wall-clock time per request grows from ~800 ms to ~4,900 ms -- approximately 6x rather than 8x, because some overlap in HTTP processing and tokenization occurs while the GPU handles the previous request.

The key question is whether this serialization is Ollama's fault (it could batch requests) or the GPU's constraint (the hardware can only run one full-width kernel at a time). Phases 4--5 answer this by testing PyTorch Direct, which has no request queue and uses threads for concurrency.

SS6. Phases 4--5 -- PyTorch Direct N=1 and N=8

SS6.1 PyTorch N=1 Results

Model	Rep	TPS	Wall (ms)	Trace (MB)	Kernels	GPU Time (ms)
LLaMA-1B	0	52.46	2,440	42.4	903,323	10,344
LLaMA-1B	1	52.46	2,440	42.3	903,323	10,358
LLaMA-1B	2	51.13	2,506	42.3	903,323	10,359
LLaMA-3B	0	--	--	21.3	452,567	6,950
LLaMA-3B	1	29.24	4,378	76.6	1,628,896	25,352
LLaMA-3B	2	29.41	4,354	76.6	1,628,896	25,376

Note: LLaMA-3B rep0 returned 0 complete requests -- model loading consumed the entire 60-second profile duration. The 452,567 kernels represent partial model loading. Excluded from all TPS analysis; n=2 for LLaMA-3B N=1.

SS6.2 PyTorch N=8 Results

Model	Rep	Per-Agent TPS	Wall (ms)	Trace (MB)	Kernels	GPU Time (ms)
LLaMA-1B	0	7.03	18,289	161.9	3,165,433	36,484
LLaMA-1B	1	7.30	17,641	161.2	3,165,433	36,526
LLaMA-1B	2	7.22	17,863	161.8	3,165,433	36,604
LLaMA-3B	0	--	--	172.7	3,384,704	55,177
LLaMA-3B	1	3.80	33,791	271.8	5,262,724	82,327
LLaMA-3B	2	3.78	33,945	271.6	5,274,779	82,606

Note: LLaMA-3B rep0 again returned 0 complete requests. Excluded from TPS analysis.

SS6.3 Descriptive Statistics

Backend	Model	N	Mean TPS	95% CI	Std	CV%
PyTorch	LLaMA-1B	1	52.02	[50.11, 53.92]	0.77	1.5%
PyTorch	LLaMA-3B	1	29.33	[28.25, 30.41]	0.12	0.4%
PyTorch	LLaMA-1B	8	7.18	[6.84, 7.53]	0.14	1.9%
PyTorch	LLaMA-3B	8	3.79	[3.66, 3.92]	0.01	0.4%

SS6.4 Observations

Observation 1 -- Massive trace sizes reveal PyTorch's eager execution model. PyTorch N=1 generates 42--77 MB traces vs Ollama's 0.8--1.0 MB. PyTorch N=8 reaches 162--272 MB. The reason: PyTorch's eager-mode execution dispatches individual CUDA kernels for every operation -- each nn.Linear, softmax, layer norm, and attention computation launches separate kernels. Ollama's ggml fuses these into large mul_mat_q blocks.

Observation 2 -- PyTorch launches 100--400x more kernels than Ollama. At N=1: PyTorch LLaMA-1B launches 903,323 kernels vs Ollama's 2,257 (400x). LLaMA-3B: 1,628,896 vs 3,949 (412x). This massive gap reflects the difference between eager execution (individual ops -> individual kernels) and fused execution (ggml combines dequant + matmul + element-wise into single kernels). Despite launching 400x more kernels, PyTorch is only 3x slower -- each PyTorch kernel is much smaller and faster to launch, but the launch overhead adds up.

Observation 3 -- PyTorch N=1 is 3.0x slower than Ollama N=1. LLaMA-1B: 52.0 vs 160.4 TPS (3.08x). LLaMA-3B: 29.3 vs 96.5 TPS (3.29x). The primary driver is FP16 vs Q4_0: FP16 weights are 4x larger, requiring 4x more memory bandwidth per token. The sub-4x ratio reflects that not all time is memory-bound -- some is compute-bound (attention) and some is fixed overhead (kernel launch).

Observation 4 -- PyTorch GPU time is vastly higher. LLaMA-1B N=1: 10,350 ms GPU time vs Ollama's 45.9 ms (225x). This is partly the 400x kernel count difference and partly the FP16 weight reads. But note: PyTorch's GPU time exceeds its wall time (10,350 ms GPU vs 2,440 ms wall). This means kernels overlap on the GPU timeline -- the CUDA runtime pipelines kernel execution even within a single stream, but the wall-clock time reflects that the GPU is near-fully utilized during inference.

Observation 5 -- LLaMA-3B rep0 failed in both N=1 and N=8. The 3B FP16 model requires ~6.4 GB VRAM for weights alone. Combined with PyTorch's memory overhead (activation caching, CUDA context), model loading takes longer than the profile duration on the first run. Subsequent reps benefit from cached CUDA context. This is a startup artifact, not a measurement issue -- excluded from analysis.

Observation 6 -- PyTorch N=8 variance is remarkably low. LLaMA-1B N=8: CV=1.9% (7.18 +/- 0.14 TPS). LLaMA-3B N=8: CV=0.4% (3.79 +/- 0.01 TPS). The low variance under 8-thread concurrency suggests that GPU memory controller scheduling is deterministic at steady state -- all 8 threads get equal, predictable bandwidth slices.

Observation 7 -- nsys stats extraction timed out for large PyTorch N=8 traces. Three LLaMA-3B N=8 traces (~272 MB each, ~5M kernels) caused the cuda_kern_exec_trace stats extraction to exceed the 120-second timeout. This is why some per-model GPU utilization data is missing for PyTorch N=8 LLaMA-3B. The missing data does not affect the primary TPS comparison.

SS6.5 Interpretation -- The Bombshell Finding

PyTorch Direct N=8 degrades 86.4% -- worse than Ollama's 82.1%. This is the opposite of what TR130 would predict. If Ollama's serving stack were the bottleneck, removing it (PyTorch Direct has no HTTP, no Go, no Ollama) should reduce degradation. Instead, it increases.

This means the degradation is not caused by Ollama's request scheduling. It is caused by the GPU memory bandwidth constraint. With 8 concurrent threads all calling model.generate() on the same GPU, the CUDA runtime must serialize kernel execution (max_concurrent=1) and share memory bandwidth across 8 concurrent weight-read streams. The result is the same throughput collapse -- actually worse, because FP16 weights require 4x more bandwidth per parameter than Q4_0.

SS7. The Core Comparison -- Ollama vs PyTorch Degradation

This section presents the central analysis: comparing degradation ratios between Ollama and PyTorch Direct to attribute the throughput collapse.

SS7.1 Aggregate Comparison

Metric	Ollama (n=6)	PyTorch (n=5)	Interpretation
N=1 Mean TPS	128.46 [91.69, 165.23]	42.94 [27.49, 58.39]	Ollama 3.0x faster (Q4_0)
N=8 Mean TPS	22.99 [16.19, 29.80]	5.83 [3.52, 8.14]	Ollama 3.9x faster
N=1->N=8 Degradation	-82.1%	-86.4%	PyTorch degrades MORE
p-value (degradation)	0.0006	0.002	Both highly significant
Cohen's d	4.19	4.17	Both massive effects
Mann-Whitney p	0.002	0.012	Non-parametric confirms

SS7.2 Per-Model Breakdown

Model	Backend	N=1 TPS	N=8 TPS	Degradation	p-value	Cohen's d
LLaMA-1B	Ollama	160.44	28.80	-82.1%	1.1x10^-5	114.64
LLaMA-1B	PyTorch	52.02	7.18	-86.2%	6.1x10^-5	81.26
LLaMA-3B	Ollama	96.48	17.19	-82.2%	6.8x10^-5	86.54
LLaMA-3B	PyTorch	29.33	3.79	-87.1%	1.8x10^-3	298.35

SS7.3 Degradation Ratio Comparison

Model	Ollama Deg.	PyTorch Deg.	Difference	Interpretation
LLaMA-1B	-82.1%	-86.2%	-4.1 pp	PyTorch 4.1 pp worse
LLaMA-3B	-82.2%	-87.1%	-4.9 pp	PyTorch 4.9 pp worse
Overall	-82.1%	-86.4%	-4.3 pp	PyTorch worse overall

SS7.4 Ollama Advantage Growth Under Contention

Model	N=1 Ollama/PyTorch Ratio	N=8 Ollama/PyTorch Ratio	Growth
LLaMA-1B	3.08x	4.01x	+0.93x
LLaMA-3B	3.29x	4.54x	+1.25x

SS7.5 Degradation Attribution

Source	Attribution	Derivation
GPU memory physics	86.4%	PyTorch Direct baseline (no serving stack)
Ollama serving stack	-4.3%	Ollama degrades less than PyTorch
Net observed (Ollama)	82.1%	86.4% + (-4.3%)

SS7.6 Observations

Observation 1 -- PyTorch degrades 4.3 percentage points more than Ollama, overturning TR130. TR130 concluded: "The serving stack is the bottleneck." If true, PyTorch Direct (no serving stack) should degrade less. It degrades more. The attribution table shows -4.3% for Ollama's serving stack -- a negative contribution, meaning Ollama's stack slightly reduces degradation, probably because Q4_0 quantization reduces per-request bandwidth pressure.

Observation 2 -- The Ollama advantage grows from 3.0x to 3.9x under contention. At N=1, Q4_0's 4x smaller weights give Ollama a 3.0x throughput advantage. At N=8, this grows to 3.9x because bandwidth becomes the binding constraint. Q4_0's advantage compounds: at N=1, bandwidth is 22--38% of peak (SS10); at N=8, it exceeds peak. The backend that uses less bandwidth per token suffers less.

Observation 3 -- Degradation is model-size independent. LLaMA-1B and 3B degrade within 0.1 percentage points of each other (82.1% vs 82.2% for Ollama; 86.2% vs 87.1% for PyTorch). This means the degradation mechanism scales proportionally with model size -- consistent with bandwidth saturation, which is proportional to weight reads per token.

Observation 4 -- The LLaMA-3B Ollama/PyTorch advantage grows MORE than 1B (4.54x vs 4.01x at N=8). Larger models have higher bandwidth demand, so Q4_0's bandwidth savings compound more. The 3B FP16 model requires ~6.4 GB of weight reads per full forward pass vs ~1.6 GB for Q4_0. The 4x bandwidth gap amplifies under contention.

Observation 5 -- All 4 per-model degradation tests are significant at p < 0.002. Even with only n=2 for PyTorch LLaMA-3B, the effect size is d=298 (the enormous d reflects near-zero within-group variance for both N=1 and N=8). The statistical conclusion is unambiguous: the degradation is real, large, and reproducible.

Observation 6 -- This reframes TR130's finding. TR130 is correct that vLLM/TGI scale better than Ollama. But the reason is not "better scheduling" -- it is continuous batching's bandwidth efficiency. vLLM batches multiple sequences into single kernel launches, reading the model weights once for multiple tokens. Ollama reads the weights once per token per request. The difference is bandwidth amortization, not request scheduling.

SS8. Kernel Profile Comparison

SS8.1 Aggregate Statistics

Phase	Mean Kernels	95% CI	Mean GPU Time (ms)	95% CI
Ollama N=1	3,103	[2,130, 4,076]	82.4	[40.4, 124.4]
Ollama N=8	15,151	[10,322, 19,979]	366.2	[193.6, 538.8]
PyTorch N=1	1,070,055	[580,226, 1,559,883]	14,790	[6,083, 23,496]
PyTorch N=8	3,903,084	[2,789,369, 5,016,799]	54,954	[31,341, 78,567]

SS8.2 Statistical Tests

Comparison	Mean A	Mean B	Delta %	Cohen's d	t-stat	p-value	M-W p
Ollama kernels N=1->N=8	3,103	15,151	+388%	3.63	-6.29	0.001	0.004
PyTorch kernels N=1->N=8	1,070,055	3,903,084	+265%	3.46	-5.99	0.0006	0.004

SS8.3 Per-Model Kernel Breakdown

Model	Phase	Kernels	GPU Time (ms)	GPU Time/Kernel (mus)	GEMM %	Attention %
LLaMA-1B	Ollama N=1	2,257	45.9	20.3	0%	8.4%
LLaMA-1B	Ollama N=8	10,975	216.1	19.7	0%	8.4%
LLaMA-1B	PyTorch N=1	903,323	10,353	11.5	71.6%	0.6%
LLaMA-1B	PyTorch N=8	3,165,433	36,538	11.5	71.9%	0.6%
LLaMA-3B	Ollama N=1	3,949	118.9	30.1	0%	8.5%
LLaMA-3B	Ollama N=8	19,327	516.3	26.7	0%	8.5%
LLaMA-3B	PyTorch N=1	1,236,786	19,226	15.5	71.6%	0.6%
LLaMA-3B	PyTorch N=8	4,640,736	73,370	15.8	71.9%	0.6%

SS8.4 Observations

Observation 1 -- Per-kernel execution time is constant across N=1 and N=8. LLaMA-1B Ollama: 20.3 mus/kernel at N=1, 19.7 mus at N=8. PyTorch: 11.5 mus at both. Individual kernels do not run slower under contention -- the slowdown comes from more total work competing for the same memory bandwidth, not from individual kernel degradation. This is consistent with bandwidth contention rather than compute saturation.

Observation 2 -- Ollama and PyTorch have inverted GEMM/Attention profiles. Ollama: 0% GEMM, 8.5% attention. PyTorch: 71.6% GEMM, 0.6% attention. The inversion reflects different kernel implementations. Ollama's ggml fuses matmul into mul_mat_q (not reported as cuBLAS GEMM). PyTorch dispatches separate cuBLAS GEMM calls for each linear layer, making GEMM the dominant kernel class. The attention difference reflects ggml's custom attention kernel vs PyTorch's decomposed scaled_dot_product_attention.

Observation 3 -- Kernel count gap is 345x at N=1 but narrows to 258x at N=8. Ollama: 3,103 -> 15,151 (4.88x). PyTorch: 1,070,055 -> 3,903,084 (3.65x). PyTorch's kernel count scales less than Ollama's because some kernels are shared across threads (memory allocation, context management), while Ollama's kernel count scales nearly linearly with the number of concurrent requests processed.

Observation 4 -- The attention/GEMM fractions are invariant with N. Both backends maintain identical attention and GEMM percentages at N=1 and N=8 (within 0.3 pp). Concurrency does not change the kernel mix -- it scales all kernel types proportionally. This rules out a hypothesis where attention kernels become disproportionately expensive under contention.

SS9. GPU Utilization Analysis

SS9.1 Results

GPU utilization from kernel exec trace analysis reads 0.0% for all 24 runs across all 4 conditions. This counterintuitive result requires careful interpretation.

SS9.2 Why Utilization Reads Zero

The utilization metric is computed as: active_kernel_time / total_profile_duration x 100. For Ollama N=1: 45.9 ms of kernel activity within a 30,000 ms profile window = 0.15%. The metric rounds to 0% because inference requests occupy a tiny fraction of the total profiling window -- between requests, the GPU is idle.

This metric is misleading for bursty workloads. During active inference, the GPU is near-fully utilized -- evidenced by max_concurrent_kernels = 1, meaning the GPU has no idle SMs during kernel execution. The correct interpretation is:

Instantaneous utilization during inference: near 100% (GPU is the bottleneck)
Time-averaged utilization over profile window: <1% (most time is between requests)

For the multi-agent comparison, the relevant metric is how individual kernel execution and bandwidth are affected by concurrency -- captured by GPU time, memory op time, and kernel count in SS8 and SS10.

SS9.3 Inter-Kernel Gap Analysis

All inter-kernel gap comparisons (N=1 vs N=8, Ollama vs PyTorch) showed Cohen's d = 0 and NaN p-values. The zero variance means the nsys aggregated gap metric does not differentiate between conditions. Fine-grained gap distributions would require timeline-level analysis of the raw .nsys-rep files (outside scope of automated analysis).

SS10. Memory Bandwidth Analysis (H_1)

SS10.1 Memory Operation Time

Phase	Mean Mem Time (ms)	95% CI	Std
Ollama N=1	177.3	--	--
Ollama N=8	309.3	--	--
PyTorch N=1	398.0	--	--
PyTorch N=8	488.7	--	--

SS10.2 Statistical Tests

Comparison	Delta (ms)	Delta %	Cohen's d	t-stat	p-value	Significant	M-W p
Ollama mem N=1->N=8	+131.9	+74.4%	3.81	-6.61	6.4x10^-5	Yes	0.002
PyTorch mem N=1->N=8	+90.8	+22.8%	0.37	-0.64	0.54	No	0.18

SS10.3 Observations

Observation 1 -- Ollama memory time increases 74.4% at N=8 (p=6.4x10^-5, d=3.81). This is the strongest statistical signal in the entire analysis. The large effect size (d=3.81) indicates a massive shift in memory operation duration under concurrency. Both the Welch's t-test (p=6.4x10^-5) and Mann-Whitney U (p=0.002) confirm the result. This is the only test that survives Holm correction (rank 1 of 6, threshold=0.0083).

Observation 2 -- PyTorch's memory time increase is non-significant (p=0.54, d=0.37). At first glance, this seems to contradict the bandwidth hypothesis. But the explanation is clear: PyTorch's baseline memory time is already 2.24x higher than Ollama's (398 vs 177 ms) because FP16 weights require 4x more memory operations. The absolute increase (+91 ms) is comparable to Ollama's (+132 ms), but the relative increase (22.8% vs 74.4%) is smaller because PyTorch starts from a higher base. The high variance in PyTorch memory time (from the large trace sizes and timeout issues) also inflates the standard error, reducing significance.

Observation 3 -- The asymmetry supports the bandwidth hypothesis. Ollama's sharp increase suggests its memory subsystem transitions from comfortable (22% of peak at N=1) to stressed (>100% at N=8). PyTorch's memory subsystem is already under pressure at N=1 (29% of peak), so the additional N=8 stress causes a proportionally smaller relative change. This is exactly what bandwidth saturation looks like: diminishing marginal stress increase as you approach the ceiling.

SS10.4 Bandwidth Demand Calculation

The RTX 4080 Laptop GPU has a peak memory bandwidth of ~432 GB/s (GDDR6, 256-bit bus).

Q4_0 LLaMA-1B (Ollama):

Model weight size: 1.2B params x 0.5 bytes/param = 0.6 GB
Per-token bandwidth: 0.6 GB x 1 read per token = 0.6 GB/token
N=1 at 160 TPS: 0.6 x 160 = 96 GB/s (22% of peak)
N=8 at 29 TPS per agent: 0.6 x 29 x 8 = 139 GB/s (32% of peak total)
N=8 if no degradation (160x8): 0.6 x 160 x 8 = 768 GB/s (178% of peak -- impossible)

FP16 LLaMA-1B (PyTorch):

Model weight size: 1.2B params x 2 bytes/param = 2.4 GB
N=1 at 52 TPS: 2.4 x 52 = 125 GB/s (29% of peak)
N=8 if no degradation (52x8): 2.4 x 52 x 8 = 998 GB/s (231% of peak -- impossible)

Q4_0 LLaMA-3B (Ollama):

Model weight size: 3.2B x 0.5 = 1.6 GB
N=1 at 96 TPS: 1.6 x 96 = 154 GB/s (36% of peak)
N=8 if no degradation: 1.6 x 96 x 8 = 1,229 GB/s (285% of peak)

FP16 LLaMA-3B (PyTorch):

Model weight size: 3.2B x 2 = 6.4 GB
N=1 at 29 TPS: 6.4 x 29 = 186 GB/s (43% of peak)
N=8 if no degradation: 6.4 x 29 x 8 = 1,485 GB/s (344% of peak)

SS10.5 Interpretation

The bandwidth demand calculations reveal why degradation is so severe. At N=8, the theoretical bandwidth demand (without degradation) exceeds peak bandwidth by 78--244% across all 4 configurations. The GPU memory controller must serialize weight reads, creating the observed throughput collapse.

The actual N=8 bandwidth demand is lower because per-agent TPS drops. For Ollama 1B at N=8: 0.6 GB x 29 x 8 = 139 GB/s (32% of peak). This is feasible -- the memory controller achieves it by time-slicing weight reads across the 8 concurrent streams. But the time-slicing means each agent waits for the others, producing the ~82% per-agent degradation.

Why Ollama degrades less than PyTorch: Q4_0 weights are 4x smaller. At N=8, Ollama demands 139 GB/s total vs PyTorch's ~240+ GB/s. The lower demand means less contention per agent, explaining Ollama's 4.3 percentage-point advantage in degradation ratio.

SS11. Serialization Analysis (H2)

SS11.1 Max Concurrent Kernels

Phase	Max Concurrent	n	All Identical?
Ollama N=1	1	6	Yes
Ollama N=8	1	6	Yes
PyTorch N=1	1	6	Yes
PyTorch N=8	1	6	Yes

SS11.2 Statistical Tests

All pairwise comparisons (Ollama N=1 vs N=8, PyTorch N=1 vs N=8, Ollama N=8 vs PyTorch N=8) return Cohen's d = 0 and NaN p-values. Zero variance in both groups makes parametric testing impossible -- which is itself the strongest possible finding.

SS11.3 Observations

Observation 1 -- Kernel serialization is universal and GPU-level. Every single run across all 26 profiled conditions shows max_concurrent_kernels = 1. This is not Ollama imposing serialization -- PyTorch Direct with 8 concurrent threads shows the same result. The CUDA runtime on a single consumer GPU (without NVIDIA MPS) processes kernels from different threads sequentially on the same SM array.

Observation 2 -- H2's original framing was wrong. H2 hypothesized: "Ollama serializes GPU requests even under concurrency." The evidence shows: all backends serialize, because the GPU hardware enforces it. The correct reframing: serialization exists (confirmed), but it is not Ollama-specific (not confirmed). The serialization is a property of the CUDA scheduling model on consumer GPUs, not a software deficiency.

Observation 3 -- This explains why vLLM/TGI scale better. If all backends face the same kernel serialization on a single GPU, why do vLLM/TGI achieve higher throughput at N=8 (TR130)? Because continuous batching reduces the number of kernel launches per total token. vLLM batches N sequences into a single kernel launch, reading model weights once for N tokens. Ollama reads weights once per token per request. The serialization constraint (max_concurrent=1) is the same, but vLLM does more useful work per kernel.

SS11.4 The Batching Insight

Consider LLaMA-1B generating 128 tokens for 8 agents:

Ollama: 8 x 128 = 1,024 separate inference passes, each reading 0.6 GB of weights -> 614 GB total bandwidth
vLLM (batched): ~128 batched inference passes, each reading 0.6 GB but producing 8 tokens -> 77 GB total bandwidth
Bandwidth ratio: 614 / 77 = 8x less bandwidth with continuous batching

This 8x bandwidth reduction explains vLLM's 2.25x throughput advantage at N=8 (TR130). The remaining gap (8x / 2.25x ~ 3.6x) reflects overhead from variable sequence lengths, attention mask computation, and KV-cache management in batched execution.

SS12. Context Switch Analysis (H3)

SS12.1 Results

The nsys --gpuctxsw=true flag, which captures CUDA GPU context switches, requires administrator privileges on Windows WDDM drivers. This flag was disabled in our configuration to avoid profiling failures. Inter-kernel gap counts were used as a proxy metric.

All inter-kernel gap comparisons showed:

Cohen's d = 0 for both Ollama and PyTorch N=1 vs N=8
NaN p-values (zero variance in both groups)
No measurable difference in any gap metric

SS12.2 Observations

Observation 1 -- No evidence of context switching overhead. The proxy metrics show no difference between N=1 and N=8. With max_concurrent_kernels = 1 and sequential execution, context switches between threads happen at the CUDA driver level with overhead below the nsys resolution (~100 ns). The driver multiplexes GPU access transparently.

Observation 2 -- This is expected for WDDM consumer GPUs. Unlike TCC (Linux server) drivers that can run multiple CUDA contexts simultaneously, WDDM serializes all GPU access through the Windows display driver. Context switches are handled by the WDDM scheduler, which preempts at kernel boundaries with minimal overhead. The single-GPU, single-context execution model on WDDM means there are no measurable context switch costs -- the driver simply queues kernels from all threads and dispatches them sequentially.

SS12.3 Verdict

H3 REJECTED. No evidence of CUDA context switching overhead at N=8. The GPU processes kernels sequentially from a single queue regardless of the number of requesting threads, and the WDDM scheduler handles multiplexing with negligible overhead.

SS13. Memory Allocation Analysis (H5)

SS13.1 Results

Memory allocation count comparisons between N=1 and N=8 showed:

Cohen's d = 0 for both backends
p-value = 1.0 (no difference detectable)
Zero variance in allocation counts within each condition

SS13.2 Observations

Observation 1 -- No evidence of KV-cache memory pressure. Memory allocation patterns are identical at N=1 and N=8. For Ollama, this is expected: ggml pre-allocates KV-cache memory at model load time, not per-request. For PyTorch, the HuggingFace generate() function manages KV-cache internally with a fixed allocation pattern per sequence.

Observation 2 -- ncu SM occupancy was null, preventing direct H5 confirmation. Nsight Compute returned null values for SM occupancy and compute throughput metrics on Windows WDDM (see SS14). Without SM occupancy data, we cannot measure whether KV-cache expansion at N=8 reduces the GPU's ability to schedule warps. The H5 rejection is based on the absence of memory allocation changes, not on direct occupancy measurement.

SS13.3 Verdict

H5 REJECTED with low confidence. The evidence is absence-of-change rather than measured-no-effect. Future work on Linux TCC should re-test with ncu SM occupancy data.

SS14. Phase 6 -- Nsight Compute Targeted Profiling

SS14.1 Results

Model	Wall Time (s)	Kernel Launches	Kernels Captured	SM Occupancy	DRAM Throughput	Compute Throughput
LLaMA-1B	113.2	5	2	null	null	null
LLaMA-3B	251.3	5	2	null	null	null

SS14.2 Observations

Observation 1 -- ncu captured kernel launches but returned null metrics. Both models show 2 captured kernels (out of 5 launches), but SM occupancy, DRAM throughput, and compute throughput are all null. The kernel names were recorded as "unknown" -- the ncu CSV parser could not match kernel names from the output format.

Observation 2 -- WDDM is the likely cause. Nsight Compute on Windows WDDM has known limitations for hardware counter collection. The WDDM driver intercepts GPU access for display compositing, preventing ncu from getting exclusive hardware counter access. On Linux TCC (Tesla Compute Cluster) drivers, ncu has direct access to performance counters and can measure SM occupancy, DRAM throughput, and compute utilization accurately.

Observation 3 -- The 2-kernel capture suggests incomplete profiling. ncu should capture all 5 kernel launches (configured via --kernel-launch-count=5), but only 2 were recorded. This may be due to kernel replay failures on WDDM -- ncu replays each kernel multiple times to collect different counter sets, and the WDDM scheduler may interfere with replay.

Observation 4 -- This is the primary limitation of TR131. Direct DRAM throughput measurement would conclusively confirm or refute H_1 (bandwidth saturation). Without it, H_1 relies on memory operation time (SS10) and bandwidth demand calculations (SS10.4). Future work on Linux should re-run Phase 6 with TCC driver.

SS15. Hypothesis Verdicts and Degradation Attribution

SS15.1 Evidence Matrix

Hypothesis	Test	p-value	Cohen's d	Effect	Supports H?
H_1	Ollama mem time N=1->N=8	6.4x10^-5	3.81	large	Yes
H_1	ncu DRAM throughput	null	--	--	No data
H2	Ollama max concurrent N=1->N=8	NaN	0	negligible	No (zero variance)
H2	Ollama vs PyTorch N=8 concurrency	NaN	0	negligible	No (zero variance)
H3	Gap count N=1->N=8	NaN	0	negligible	No
H3	Mean gap N=1->N=8	NaN	0	negligible	No
H4	OS runtime	--	--	--	No data
H5	Alloc count N=1->N=8	1.0	0	negligible	No

SS15.2 Verdicts

H_1: GPU Memory Bandwidth Saturation -- PARTIALLY CONFIRMED (Confidence: HIGH)

Memory operation time increases 74.4% from N=1 to N=8 (p=6.4x10^-5, d=3.81), surviving Holm correction. Bandwidth demand calculations show N=8 exceeds peak RTX 4080 bandwidth by 78--244%. However, direct DRAM throughput measurement was not possible (ncu null on WDDM). "Partially confirmed" because the statistical evidence is strong but indirect -- we measure the consequence (increased memory time) rather than the mechanism (DRAM utilization percentage).

H2: Ollama Request Serialization -- REATTRIBUTED TO GPU HARDWARE (Confidence: HIGH)

Serialization exists (max_concurrent = 1) but occurs equally in Ollama and PyTorch Direct. Cohen's d = 0 for all cross-backend comparisons. The serialization is a fundamental property of single-GPU CUDA execution on consumer hardware, not an Ollama scheduling deficiency. vLLM/TGI avoid the throughput consequences of serialization through continuous batching (more work per kernel), not by achieving kernel concurrency.

H3: CUDA Context Switching -- REJECTED (Confidence: MEDIUM)

Zero variance in gap metrics across all conditions. No evidence of context switching overhead. Limited by inability to use --gpuctxsw=true on WDDM, hence "medium" confidence rather than "high."

H4: CPU Thread Scheduling -- INSUFFICIENT DATA (Confidence: LOW)

OS runtime summary data was not extracted for PyTorch N=8 runs due to stats extraction timeouts on large traces. Cannot evaluate CPU-side bottlenecks.

H5: KV-Cache Memory Pressure -- REJECTED (Confidence: LOW)

Memory allocation counts unchanged (p=1.0, d=0). No evidence of increased memory pressure. Low confidence because ncu SM occupancy data was null, preventing direct measurement of warp scheduling effects.

SS15.3 Holm Step-Down Correction

Rank	Test	p-value	Holm Threshold (alpha/(k-i+1))	Significant After Correction
1	H_1: Ollama mem time	6.4x10^-5	0.05/6 = 0.0083	Yes (6.4x10^-5 < 0.0083)
2	H2: Concurrency comparison	NaN	0.05/5 = 0.0100	No
3	H2: Ollama max concurrent	NaN	0.05/4 = 0.0125	No
4	H3: Gap count	NaN	0.05/3 = 0.0167	No
5	H3: Mean gap	NaN	0.05/2 = 0.0250	No
6	H5: Alloc count	1.0	0.05/1 = 0.0500	No

After Holm correction for 6 simultaneous tests, only H_1 (memory bandwidth) remains significant. The NaN p-values for H2/H3 tests reflect zero variance in the underlying metrics -- the tests are mathematically undefined because there is no within-group variation to estimate standard error. This is itself informative: the lack of variation means the GPU imposes uniform behavior regardless of concurrency level or backend.

SS15.4 Degradation Attribution Table

Source	Attribution	Evidence
GPU memory bandwidth physics	86.4%	PyTorch Direct baseline degradation
Ollama serving stack overhead	-4.3%	Ollama degrades less (Q4_0 bandwidth advantage)
Net observed (Ollama N=1->N=8)	82.1%	Aggregate across 2 models, 3 reps each

The negative attribution to Ollama's serving stack means that Ollama's Q4_0 quantization provides a net benefit under bandwidth-limited concurrency. The serving stack is not a bottleneck -- it is a slight advantage, because Q4_0 requires less bandwidth per parameter, leaving more headroom for concurrent weight reads.

SS16. Statistical Power and Data Quality

SS16.1 Power Analysis

Phase	n per group	Min Detectable d	Interpretation	Adequate for Observed d?
Ollama (2 models x 3 reps)	6	1.286	Can detect large effects	Yes (observed d=4.19)
PyTorch (2 models x 3 reps)	5	1.458	Can detect large effects	Yes (observed d=4.17)
Per-model Ollama	3	2.484	Can detect very large effects	Yes (observed d>80)
Per-model PyTorch 1B	3	2.484	Can detect very large effects	Yes (observed d=81)
Per-model PyTorch 3B	2	6.314	Can detect massive effects	Yes (observed d=298)

SS16.2 Observations

Observation 1 -- All observed effects far exceed minimum detectable sizes. The weakest power (n=2, d_min=6.31 for PyTorch LLaMA-3B) still detects the massive d=298 effect. The primary comparison (overall degradation, n=5--6, d_min=1.29--1.46) easily detects the observed d=4.17--4.19. Power is not a concern for this experiment.

Observation 2 -- Small sample sizes produce wide CIs but don't affect significance. The wide 95% CIs for aggregate stats (e.g., Ollama N=1: 128.5 [91.7, 165.2]) reflect pooling 2 models with very different TPS values (160 vs 96). Within-model CIs are much tighter (LLaMA-1B N=1: 160.44 [159.29, 161.60], width = 2.3 TPS). The wide aggregate CIs do not undermine the degradation tests, which compare matched conditions.

Observation 3 -- Zero outliers detected across all conditions. IQR outlier detection (Tukey fence, 1.5x IQR) found zero outliers in any group. The data is remarkably clean, consistent with the deterministic nature of GPU inference (no network jitter, no disk I/O, no thermal throttling during short profiles).

SS16.3 Data Quality Summary

Metric	Value
Total profiled runs	26 (+ 1 validation)
Runs with 0 requests (excluded)	2 (both PyTorch 3B rep0)
Analyzable runs	24
Outlier rate	0.0% (zero outliers detected)
Missing trace data	3 runs (PyTorch N=8 3B -- stats timeout)
Significant comparisons	9/18 major tests (50%)
Tests surviving Holm correction	1/6 hypothesis tests (H_1 only)

SS17. Profiling Overhead Assessment

SS17.1 Cross-Validation with TR129

Model	TR129 Unprofiled TPS	TR131 Profiled TPS (N=1)	Overhead
LLaMA-1B (Ollama)	~160	160.44	<1%
LLaMA-3B (Ollama)	~97	96.48	<1%

SS17.2 Observations

Observation 1 -- nsys profiling overhead is negligible for Ollama. The profiled TPS matches TR129's unprofiled data within measurement noise. This is expected because nsys uses hardware-based instrumentation (GPU performance counters) rather than software instrumentation, and the HTTP request driver runs outside the nsys process tree.

Observation 2 -- PyTorch profiling overhead cannot be independently validated. PyTorch Direct was not benchmarked unprofiled in prior TRs. However, nsys overhead is symmetric across N=1 and N=8 conditions, so the degradation ratio is unaffected even if absolute TPS is slightly depressed. The attribution analysis uses ratios, not absolute values.

Observation 3 -- Trace file sizes suggest higher overhead for PyTorch. PyTorch traces (42--272 MB) are 50--340x larger than Ollama traces (0.8--4.1 MB), reflecting the 345x more kernel launches. While nsys's per-kernel instrumentation overhead is small (~10 ns per kernel), at 3--5 million kernels this accumulates to 30--50 ms total -- still <0.5% of the 10--80 second GPU time. The overhead is negligible even for PyTorch.

SS18. Limitations and Future Work

SS18.1 What This Report Does NOT Prove

We did not measure peak DRAM throughput directly. ncu returned null metrics on Windows WDDM, preventing direct confirmation that bandwidth exceeds 80% of peak at N=8. The bandwidth argument relies on calculation (SS10.4), not measurement. The calculation assumes 1 full weight read per token, which is correct for autoregressive decode but may overestimate prefill bandwidth (where computation is more memory-efficient due to batched token processing).
FP16 vs Q4_0 confound persists. Ollama serves Q4_0 (0.5 bytes/parameter); PyTorch serves FP16 (2 bytes/parameter). This affects absolute TPS but not the degradation ratio used for attribution. However, Q4_0 and FP16 may have different memory access patterns (quantized reads require dequantization logic, which could affect cache line utilization), potentially introducing a subtle confound in the bandwidth comparison.
N=8 threads may not achieve true GPU concurrency. Python's GIL is released for CUDA operations, and our ThreadPoolExecutor dispatches 8 workers. But the CUDA context serializes kernel execution regardless (max_concurrent=1). Whether the threads achieve true memory-level concurrency (8 concurrent weight reads) or whether the memory controller serializes reads too is unclear from nsys data alone.
Only consumer GPU tested. The RTX 4080 Laptop with GDDR6 and WDDM driver is not representative of server deployments (A100/H_100 with HBM2e/HBM3 and TCC driver). Server GPUs have 3--5x higher memory bandwidth, which may reduce the bandwidth saturation severity. However, the same mechanism (N concurrent weight reads exceeding bandwidth) would still apply at larger N.
OS runtime data was incomplete. Large PyTorch N=8 traces (>270 MB, ~5M kernels) caused nsys stats extraction to exceed the 120-second timeout for cuda_kern_exec_trace and osrt_sum reports. This prevented evaluation of H4 (CPU scheduling) and reduced per-model GPU utilization data for PyTorch 3B N=8.
Only 2 models tested. Both are decoder-only LLaMA 3.2 variants. MoE models (Mixtral), encoder-decoder models (T5), and models with different attention mechanisms (GQA vs MHA) may show different degradation patterns. The 1B and 3B sizes are also relatively small -- larger models (8B, 70B) would change the compute/memory ratio.

SS18.2 Threats to Validity

Threat	Type	Severity	Mitigation	Residual Risk
Profiling overhead distorts TPS	Internal	Low	Ollama TPS matches TR129 unprofiled (<1% delta)	PyTorch overhead unknown
Small sample size (n=2--3 per model)	Internal	Medium	Power analysis: d_min=1.29. All observed d>4	Tight per-model CIs but adequate
WDDM vs TCC driver differences	External	Medium	Results are conservative (WDDM adds overhead)	Linux replication needed
Q4_0 vs FP16 confound	Construct	Medium	Focus on degradation ratios, not absolute TPS	Memory access pattern differences
Missing data (2 runs, 0 requests)	Internal	Low	Excluded; remaining n>=2 still significant	Slightly reduced power for 3B
PyTorch 3B stats timeout	Internal	Low	Primary TPS data unaffected	Missing trace-level metrics
Thermal drift over 71 min	Internal	Low	3s cooldown between runs; GPU at <70degC	Negligible on short profiles

SS18.3 Future Work

Linux TCC driver profiling. Run identical experiment on Linux with TCC driver to enable ncu hardware counter collection. Prediction: ncu will show DRAM throughput >80% of peak at N=8, directly confirming H_1. SM occupancy should show warp scheduling saturation.
vLLM kernel profiling. Profile vLLM under nsys to understand how continuous batching reduces bandwidth demand at N=8. Prediction: vLLM launches fewer, larger kernels with multiple sequences batched per launch, reading model weights once for N tokens instead of once per token.
Multi-GPU N=8 test. Split 8 agents across 2 GPUs (4 each) and measure per-agent degradation. Prediction: per-agent degradation drops from ~82% to ~50% because each GPU handles half the bandwidth demand. This would directly confirm that bandwidth is the binding constraint.
Same-quantization comparison. Run PyTorch with Q4_0 (via GPTQ/bitsandbytes) to eliminate the Q4_0/FP16 confound. Prediction: PyTorch-Q4_0 will degrade ~82% (matching Ollama), confirming that the 4.3 pp gap is quantization-driven, not serving-stack-driven.
CUDA MPS testing. Enable NVIDIA Multi-Process Service to allow concurrent kernel execution from 8 processes. Prediction: marginal improvement (<5%) because bandwidth, not compute, is the bottleneck. MPS allows kernel overlap but does not increase memory bandwidth.
Continuous batching prototype. Implement batched inference in the PyTorch Direct code path (batch 8 sequences per forward pass) and re-measure N=8 degradation. Prediction: degradation drops from 86.4% to ~40--50%, matching vLLM's behavior from TR130, because batching amortizes the weight-read bandwidth across sequences.

SS19. Conclusions

SS19.1 Answers to Research Questions

Q1: Does GPU memory bandwidth saturate under N=8 concurrency?

Partially yes -- strong evidence for bandwidth stress, but direct measurement was unavailable. Memory operation time increases 74.4% at N=8 (p=6.4x10^-5, d=3.81), the only test surviving Holm correction across 6 hypothesis tests. Bandwidth demand calculations show N=8 requires 178--344% of peak RTX 4080 bandwidth depending on model and precision. The GPU memory controller must serialize weight reads, explaining the per-agent throughput collapse. Direct DRAM utilization percentage was not measurable (ncu null on WDDM), so "saturation" is inferred rather than directly observed. The evidence is strong but indirect.

Q2: Does Ollama serialize GPU kernel execution compared to direct PyTorch?

No -- serialization is universal. Max concurrent kernels = 1 in all 26 runs across both backends. Cohen's d = 0 for every cross-backend comparison. Both Ollama and PyTorch Direct face the same GPU-level kernel serialization. The CUDA runtime on a single consumer GPU dispatches kernels sequentially from a single hardware queue regardless of how many software threads submit them. Ollama does not add serialization beyond what the GPU hardware imposes.

Q3: Do CUDA context switches increase measurably at N=8?

No. Inter-kernel gap metrics show zero variance between N=1 and N=8 across both backends. The WDDM driver multiplexes GPU access at kernel boundaries with overhead below nsys resolution. Context switching is not a contributor to multi-agent degradation.

Q4: Is GPU-level degradation intrinsic (hardware) or extrinsic (software)?

Intrinsic. PyTorch Direct eliminates the entire serving stack -- no HTTP server, no Go runtime, no request queuing, no Ollama process -- and degrades 86.4% vs Ollama's 82.1%. The serving stack is not the cause. The degradation is a fundamental property of running 8 concurrent inference streams on a single GPU with finite memory bandwidth.

Q5: What fraction of the 82% degradation is attributable to Ollama's serving stack vs GPU physics?

GPU physics: 86.4%. Ollama serving stack: -4.3%. The serving stack attribution is negative -- Ollama's Q4_0 quantization provides a net benefit under bandwidth-limited concurrency. The 82.1% net degradation is entirely explained by GPU memory bandwidth contention, with Q4_0 quantization providing a 4.3 percentage-point reduction by lowering per-request bandwidth demand.

SS19.2 The Central Finding

TR129 asked: what causes the 63% per-agent degradation? (Later measured as 82% in profiled conditions with 128 tokens.) TR130 answered: the serving stack. TR131 overturns this: GPU memory physics.

The 82% per-agent throughput degradation under N=8 concurrency is a GPU memory bandwidth phenomenon, not a serving stack scheduling deficiency. This is proven by the elimination test: PyTorch Direct, with zero serving stack overhead, degrades more than Ollama (86.4% vs 82.1%). The only hypothesis test surviving multiple comparison correction is H_1 (memory bandwidth), and bandwidth demand calculations show N=8 exceeds peak RTX 4080 bandwidth by 78--244%.

TR130's conclusion was correct in its recommendation but wrong in its mechanism. vLLM > TGI > Ollama for multi-agent scaling -- this ranking holds. But the reason is not "Ollama's scheduling is bad." The reason is continuous batching amortizes memory bandwidth by reading model weights once for multiple sequences per kernel launch. Ollama reads weights once per token per request. The 8x bandwidth reduction from batching (SS11.4) explains vLLM's 2.25x total throughput advantage at N=8.

SS19.3 One-Number Summaries

For capacity planning -- Bandwidth Demand per Configuration:

Configuration	Per-Agent TPS (N=1)	Per-Agent TPS (N=8)	Degradation	N=8 Total BW Demand
Q4_0 LLaMA-1B	160	29	-82%	139 GB/s (32% peak)
Q4_0 LLaMA-3B	96	17	-82%	218 GB/s (50% peak)
FP16 LLaMA-1B	52	7.2	-86%	138 GB/s (32% peak)
FP16 LLaMA-3B	29	3.8	-87%	195 GB/s (45% peak)

For backend selection -- Decision Tree:

If you need...	Then...	Why
>30 TPS per agent with 8 agents	Multiple GPUs	Bandwidth-bound on single GPU at N>=4
Best single-GPU multi-agent throughput	vLLM with Q4 quantization	Continuous batching + low bandwidth demand
Lowest per-agent latency at N=1	Ollama with Q4_0	Highest single-agent TPS (160 tok/s)
Better than 82% per-agent degradation	Continuous batching server	vLLM/TGI amortize bandwidth across sequences
Root-cause diagnosis for your workload	Profile with nsys first	Intuitive attribution can be wrong (this report demonstrates it)

SS19.4 What Changes for the Banterhearts Research Program

TR130's recommendation stands, but the reasoning changes. vLLM > Ollama for multi-agent deployment. The advantage is not "better scheduling" but "continuous batching reduces bandwidth demand per token." This distinction matters because it suggests that Ollama could achieve similar scaling by implementing request batching -- the GPU hardware is not the limit; the software's bandwidth efficiency is.
TR129's Amdahl serial fraction is reinterpreted. The s=0.39--0.54 serial fraction measured for Ollama reflects GPU memory bandwidth serialization, not Ollama scheduling serialization. The same serial fraction would apply to any sequential-serving backend on the same GPU. vLLM/TGI avoid this by batching, not by scheduling.
Quantization is a concurrency optimization, not just a compression technique. Q4_0's 4x smaller weights provide 4x less bandwidth demand per token. At N=1 this translates to ~3x faster inference; at N=8 it translates to 3.9x faster. Quantization should be viewed as a bandwidth efficiency technique, especially for multi-agent workloads.
Future multi-agent experiments should profile GPU bandwidth. The intuition that "better software = better scaling" led TR130 to a correct recommendation with an incorrect mechanism. Hardware profiling (nsys/ncu) should accompany any multi-agent benchmark to distinguish software bottlenecks from hardware physics.

Appendix A: Configuration

experiment: tr131
models:
  - name: llama3.2-1b
    ollama_tag: "llama3.2:1b"
    hf_id: "unsloth/Llama-3.2-1B-Instruct"
    params_m: 1200
  - name: llama3.2-3b
    ollama_tag: "llama3.2:3b"
    hf_id: "unsloth/Llama-3.2-3B-Instruct"
    params_m: 3200

nsys:
  trace: cuda
  gpuctxsw: false        # Requires admin on Windows WDDM
  gpu_metrics_set: ""     # Requires admin on Windows WDDM
  gpu_metrics_frequency: 0
  sample: none            # No CPU sampling (reduces trace size)
  cpuctxsw: none          # No CPU context switches (reduces trace size)

max_new_tokens: 128
seed: 42
warmup_requests: 3
prompt_tokens_low: 100
prompt_tokens_high: 200

phase1:  # Validation
  requests: 3
  profile_duration_s: 30

phase2:  # Ollama N=1
  n_agents: 1
  requests_per_agent: 5
  repetitions: 3
  profile_duration_s: 30

phase3:  # Ollama N=8
  n_agents: 8
  requests_per_agent: 3
  repetitions: 3
  profile_duration_s: 60

phase4:  # PyTorch N=1
  n_threads: 1
  requests_per_thread: 5
  repetitions: 3
  profile_duration_s: 60

phase5:  # PyTorch N=8
  n_threads: 8
  requests_per_thread: 3
  repetitions: 3
  profile_duration_s: 120

phase6:  # Nsight Compute
  kernel_launch_count: 5

Appendix B: Environment

Component	Version / Specification
GPU	NVIDIA GeForce RTX 4080 Laptop GPU
VRAM	12,282 MB GDDR6
Peak Memory Bandwidth	~432 GB/s
Bus Width	256-bit
Driver	591.74
OS	Windows 11 10.0.26200 (WDDM)
Python	3.13.1
Nsight Systems	2025.5.1.121-255136380782v0
Nsight Compute	2025.3.1.0 (build 36398880)
CUDA	Via driver (no standalone CUDA toolkit required)
Ollama	Latest (Q4_0 quantization for llama3.2 models)
PyTorch	Via HuggingFace Transformers (torch.float16, CUDA)
Architecture	AMD64

Appendix C: Statistical Methods

Welch's t-test

Used for all pairwise comparisons. Does not assume equal variance or equal sample sizes between groups. Degrees of freedom estimated via Welch-Satterthwaite approximation: df ~ (s1^2/n1 + s2^2/n2)^2 / [(s1^2/n1)^2/(n1-1) + (s2^2/n2)^2/(n2-1)]. When both groups have zero variance (e.g., max_concurrent = 1 everywhere), the test returns NaN -- which is itself informative.

Cohen's d (pooled)

Effect size computed as: d = (mean_a - mean_b) / pooled_std, where pooled_std = sqrt(((n_a-1)xvar_a + (n_b-1)xvar_b) / (n_a + n_b - 2)). Interpretation thresholds: |d| < 0.2 negligible, 0.2--0.5 small, 0.5--0.8 medium, >0.8 large. Values exceeding d=10 (observed frequently in this study) indicate effect sizes so large that they are visible in individual data points without statistical testing.

Mann-Whitney U

Non-parametric rank-based test. Used as robustness check on every significant Welch's result. Two-sided alternative hypothesis. Does not assume normal distributions. Particularly important for n=2--3 groups where normality cannot be verified.

Holm Step-Down Correction

For k hypothesis tests at family-wise alpha: sort p-values ascending, test rank i against threshold alpha/(k-i+1). Reject H_i if p_i < threshold AND all lower-ranked tests were also rejected. More powerful than Bonferroni (which uses alpha/k for all tests) while still controlling family-wise error rate.

Power Analysis

Minimum detectable effect size for a two-sample t-test with equal n: d_min = t_crit(alpha/2, df=2n-2) x sqrt(2/n). At n=6: d_min=1.286 (requires "large" effect). At n=3: d_min=2.484 (requires "very large" effect). All observed effects (d > 4) are well above detection thresholds, confirming adequate statistical power despite small sample sizes.

IQR Outlier Detection

Tukey fences: outlier if x < Q1 - 1.5xIQR or x > Q3 + 1.5xIQR, where IQR = Q3 - Q1. Applied to all metric groups. Zero outliers detected in this study.

Appendix D: Glossary

Term	Definition
TPS	Tokens per second -- `tokens_generated / wall_time`. User-perceived throughput.
N=K	K concurrent agents/threads sending inference requests to the same GPU.
Q4_0	4-bit quantization format (0.5 bytes/parameter). Used by Ollama via ggml.
FP16	Half-precision floating point (2 bytes/parameter). Used by PyTorch/HuggingFace.
nsys	NVIDIA Nsight Systems -- system-wide GPU profiler using hardware counters.
ncu	NVIDIA Nsight Compute -- per-kernel profiler with detailed hardware metrics.
SM	Streaming Multiprocessor -- the GPU's compute unit. RTX 4080 Laptop has 58 SMs.
WDDM	Windows Display Driver Model -- Windows GPU driver framework. Serializes GPU access for display compositing.
TCC	Tesla Compute Cluster -- Linux/server GPU driver mode. Allows exclusive compute access and ncu hardware counter collection.
MPS	Multi-Process Service -- CUDA feature enabling concurrent kernel execution from multiple processes. Not available on consumer WDDM GPUs.
GIL	Global Interpreter Lock -- Python's thread serialization mechanism. Released for CUDA operations, enabling true GPU concurrency from Python threads.
ggml	C library for ML inference used by Ollama/llama.cpp. Features fused quantized kernels.
`mul_mat_q`	ggml's quantized matrix multiply CUDA kernel. Fuses dequantization + matmul.
Continuous batching	Technique where multiple sequences are processed in a single kernel launch, reading model weights once for N tokens. Used by vLLM and TGI.
PagedAttention	vLLM's memory management: allocates KV-cache in pages to reduce fragmentation.
Cohen's d	Standardized mean difference: \|mean_diff\| / pooled_std. <0.2 negligible, 0.2--0.5 small, 0.5--0.8 medium, >=0.8 large.
Holm correction	Step-down multiple comparison correction. Controls family-wise error rate more powerfully than Bonferroni.
Welch's t-test	t-test for unequal variance. Standard for comparing two independent groups.
Category error	Comparing a metric across systems where the metric means different things. E.g., attributing GPU-level serialization to software scheduling.
Bandwidth demand	Memory bandwidth required per second: `model_weight_size x tokens_per_second x concurrent_agents`.

Appendix E: Reproducibility

How to Reproduce This Experiment

# Prerequisites:
# - NVIDIA Nsight Systems 2025.5.1 at default install path
# - NVIDIA Nsight Compute 2025.3.1 at default install path
# - Ollama installed: ollama pull llama3.2:1b && ollama pull llama3.2:3b
# - Python 3.11+ with: torch, transformers, numpy, scipy, pyyaml, requests
# - Close Ollama tray app before running (avoids process conflicts)

# Full pipeline (all 6 phases + analysis)
python research/tr131/run.py -v

# Expected runtime: ~71 minutes
# Expected disk usage: ~1.6 GB traces

Key Implementation Details

Ollama profiling: nsys wraps ollama serve; HTTP driver runs in separate thread outside nsys process tree
PyTorch profiling: nsys wraps Python script; ThreadPoolExecutor for N=8 concurrency (GIL released for CUDA)
Warmup: 3 requests per model before measurement begins
Cooldown: 3 seconds between runs for GPU temperature stabilization
Error handling: Runs with 0 complete requests are logged and excluded from analysis
Stats extraction timeout: 120 seconds per nsys stats report; large traces may timeout

Data Provenance

Artifact	Path	Size
Raw traces	`research/tr131/results/20260226_174224/traces/`	~1.6 GB (27 .nsys-rep files)
Exported CSVs	`research/tr131/results/20260226_174224/exports/`	~50 MB
Phase results	`research/tr131/results/20260226_174224/p{2,3,4,5}_*_results.json`	~10 KB each
Validation	`research/tr131/results/20260226_174224/validation.json`	~2 KB
ncu results	`research/tr131/results/20260226_174224/p6_ncu_results.json`	~1 KB
Analysis	`research/tr131/results/20260226_174224/analysis.json`	~80 KB (12 sections)
Manifest	`research/tr131/results/20260226_174224/manifest.json`	~3 KB
This report	`PublishReady/reports/Technical_Report_131.md`	~1,200 lines

Implementation Files

File	Purpose	Lines
`research/tr131/run.py`	Orchestrator -- runs all 6 phases sequentially	~130
`research/tr131/run_validation.py`	Phase 1 -- validates nsys captures Ollama	~220
`research/tr131/run_ollama_profiled.py`	Phases 2--3 -- profiles Ollama at N=1 and N=8	~210
`research/tr131/run_pytorch_direct.py`	Phases 4--5 -- profiles PyTorch Direct at N=1 and N=8	~250
`research/tr131/run_ncu_targeted.py`	Phase 6 -- nsight compute per-kernel profiling	~150
`research/tr131/analyze.py`	12-section statistical analysis -> analysis.json	~680
`research/tr131/shared/statistics.py`	Statistical utilities (Welch's t, Cohen's d, Holm, etc.)	~300
`research/tr131/shared/nsys_driver.py`	NsysDriver class -- wraps nsys profile/stats/export	~200
`research/tr131/shared/trace_parser.py`	Parse nsys CSV exports into analysis-ready dicts	~350
`research/tr131/shared/request_driver.py`	HTTP request sender for Ollama (outside nsys)	~200
`research/tr131/shared/pytorch_inference.py`	Direct HuggingFace model loading + generate	~250
`research/tr131/shared/utils.py`	Paths, constants, prompt generation	~100

References

Kwon, W. et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP 2023.
Patel, P. et al. (2024). Splitwise: Efficient generative LLM inference using phase splitting. ISCA 2024.
Amdahl, G.M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. AFIPS 1967.
NVIDIA (2025). Nsight Systems User Guide. NVIDIA Developer Documentation.
NVIDIA (2025). Nsight Compute Documentation. NVIDIA Developer Documentation.
TR129 (2026). N-Agent Scaling Laws. Banterhearts Research.
TR130 (2026). Serving Stack Benchmarking -- Ollama vs vLLM vs TGI. Banterhearts Research.
TR126 (2026). Docker/Linux + Triton Validation. Banterhearts Research (statistical methodology reference).

TR131: GPU Kernel Profiling