Technical Report 127: Long-Context Performance Characterization

Consumer GPU context scaling from 512 to 32K tokens with two-regime VRAM analysis

Field	Value
TR Number	127
Project	Banterhearts LLM Performance Research
Date	2026-02-24
Author	Research Team
Report Type	Context-length scaling analysis (single-phase, 2-backend sweep)
Test Duration	~5 hours
Status	Complete -- Experiment + two-regime reanalysis delivered
Run ID	`20260224_101128`
Related Work	TR123 (KV-Cache Production Economics), TR125 (Quantization Decision Matrix), TR126 (Linux/Triton Validation)
Depends On	TR123 (KV cache cost model, VRAM formulas), TR126 (HF vs Ollama backend methodology)

Abstract

TR108-TR126 tested prompts up to ~2K tokens. Production workloads -- RAG pipelines, document summarization, multi-turn conversations -- operate at 4K-128K token contexts. The performance scaling behavior from 512 to 32K tokens on consumer hardware (RTX 4080 Laptop, 12 GB VRAM) was unknown. TR127 fills this gap with a systematic context-length sweep: 1,144 measurements (1,140 successful + 4 OOM) across 5 models (0.5B-3.2B parameters), 2 backends (HuggingFace transformers FP16, Ollama quantized), 7 context lengths (512 to 32,768 tokens), and 3 measurement modes (prefill, decode, end-to-end) with 10 repetitions each.

The central finding is a two-regime scaling phenomenon on HuggingFace transformers:

Pre-spillover regime (context fits in 12 GB VRAM): Prefill latency scales with exponent b = 1.58-1.78 (between linear and quadratic), with quadratic R^2 = 0.999+. This represents true computational scaling from self-attention's O(n^2) cost, partially mitigated by hardware optimizations.
Post-spillover regime (context exceeds VRAM): CUDA Unified Memory silently pages tensors to system RAM via PCIe, causing 25-105x latency cliffs. Full-range power-law fits show apparent exponents of b = 4.6-6.7 -- these are artifacts of memory thrashing, not O(n^2) attention. This is the dominant effect observed in the data.

Ollama (llama.cpp quantized) shows sub-linear prefill scaling across the entire 512-32K range (b < 0.2), confirming that Flash Attention and paged KV caches effectively eliminate the quadratic penalty at these context lengths. Ollama is 86-100% faster than HF at every context length tested, with the gap widening from 86% at 512 tokens to 99.96% at 16K tokens (where HF enters the thrashing regime).

Decode throughput degrades with context length across both backends: Ollama's llama3.2-1b drops from 163 tok/s (512 tokens) to 96 tok/s (32K tokens) -- a 41% decline over 64x context growth. HF models show steeper decode degradation, with qwen2.5-3b dropping from 27 tok/s to 0.9 tok/s due to VRAM spillover during KV-cache attention.

VRAM scaling analysis reveals context-dependent VRAM growth of 0.75-1.16 MB/token for FP16 models (total growth including attention workspace and allocator overhead, not pure KV cache), with spillover thresholds at 8K tokens (3B model), 16K tokens (0.5B, 1.5B models), and hard OOM cliffs one step higher. The qwen2.5-3b model's 6 GB base footprint leaves only 6 GB for KV cache -- enough for ~4,600 tokens before spillover begins.

Total: 1,144 measurements, 5 models, 2 backends, 7 context lengths, 3 modes, ~5 hours runtime.

Key findings:

Quadratic attention is empirically visible but NOT the dominant bottleneck. Pre-spillover exponents b = 1.58-1.78 confirm superlinear scaling. But the 25-105x thrashing cliffs dwarf the computational cost -- VRAM management, not attention complexity, is the practical bottleneck on consumer hardware.
VRAM spillover is silent and catastrophic. PyTorch's CUDA Unified Memory allows allocation beyond physical VRAM without raising exceptions, but performance degrades 25-105x as tensors page through PCIe bandwidth (~16 GB/s) instead of GDDR6 bandwidth (~256 GB/s).
Ollama eliminates quadratic scaling entirely. Sub-linear exponents (b < 0.2) across all 3 Ollama models confirm that llama.cpp's optimized attention (Flash Attention + paged KV cache) makes prefill effectively O(n) at these context lengths.
Decode throughput degrades linearly with context. As KV cache grows, each decode step must attend over more cached tokens. Ollama shows 41-53% throughput degradation from 512 to 32K tokens. HF shows identical pre-spillover degradation plus catastrophic post-spillover collapse.
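Consumer GPU context budget: 4K-8K tokens for FP16 HF; no VRAM limit reached through 32K for Ollama. On 12 GB VRAM, FP16 models fit 4K-8K tokens before spillover depending on model size. Ollama's quantized models handle 32K tokens without VRAM spillover or latency cliffs, although decode throughput still declines 41-53% over that range.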

Executive Summary

TR127 answers: how does inference performance scale with context length on consumer hardware, and where are the practical limits?

Key Findings

Two-regime scaling on HF transformers: Pre-spillover (b = 1.58-1.78, R^2 = 0.999+) vs post-spillover (25-105x cliffs). The quadratic attention cost IS empirically visible in the clean regime but is completely dominated by VRAM thrashing at higher context lengths.
Ollama prefill is sub-linear: b = 0.083 (llama3.2-1b), b = 0.109 (qwen2.5-1.5b), b = 0.158 (llama3.2-3b). Flash Attention eliminates quadratic scaling at 512-32K context lengths.
VRAM spillover thresholds are model-dependent: qwen2.5-0.5b and qwen2.5-1.5b spill at 16K tokens; qwen2.5-3b spills at 8K tokens. Each model's base weight footprint determines the remaining VRAM budget for KV cache.
Decode throughput degrades 41-53% over 64x context growth (Ollama): llama3.2-1b drops from 163->96 tok/s; qwen2.5-1.5b drops from 147->80 tok/s; llama3.2-3b drops from 99->47 tok/s. This is consistent with a per-step attention cost that grows linearly with the number of cached tokens.
HF decode enters catastrophic regime: qwen2.5-1.5b decode goes from 42 tok/s (512) to 2.1 tok/s (16K) -- a 95% collapse driven by VRAM spillover during KV-cache attention.
Ollama is 86-100% faster than HF across all context lengths. At short contexts (512), Ollama prefill is 86% faster. At long contexts (16K), it is 99.96% faster. The gap widens monotonically because HF enters the thrashing regime while Ollama does not.
TTFT exceeds 1 second at 4K tokens on HF. All three HF models cross the 1-second TTFT threshold at 4,096 tokens. At 16K tokens, TTFT is 7.9 minutes (qwen2.5-0.5b) and 8.9 minutes (qwen2.5-1.5b). Ollama TTFT never exceeds 1 second at any context length tested.
Total context-dependent VRAM grows at 0.75-1.16 MB/token (FP16). Derived from VRAM growth slopes on pre-spillover data. Cross-validation with TR123's theoretical KV costs (12-37 KB/token) reveals 20-95x overhead from attention workspace, activations, and allocator fragmentation (SS6.4).
OOM cliff follows spillover by one context-length step. qwen2.5-0.5b and qwen2.5-1.5b spill at 16K, OOM at 32K. qwen2.5-3b spills at 8K, OOM at 16K. The spillover regime is a warning zone before hard failure.
HF measurement precision is high pre-spillover (CV 0.2-3.1% for the 0.5B and 1.5B models, apart from one 7.4% point at 4K; 8-14% for the 3B model at 512-2K tokens), whereas Ollama variance is dominated by cold-start outliers (CV 97-307% with rep-0; 3.6-6.1% after filtering). Median or 10%-trimmed mean is the correct central tendency for Ollama (SS4.5). All 18 backend comparisons survive Bonferroni correction (SS8.4).

Key Decisions

For long-context workloads on consumer GPU: Use Ollama (quantized). HF transformers in FP16 cannot serve contexts beyond 4-8K tokens without catastrophic performance degradation on 12 GB VRAM.
For short-context workloads (<=2K tokens): Either backend is viable. HF offers exact FP16 precision; Ollama offers 3-7x faster throughput with quantization.
Context budget planning: Allocate VRAM as model_weight_size + KV_cache_cost x context_length. For FP16 on 12 GB: qwen2.5-0.5b supports ~8K tokens, qwen2.5-1.5b supports ~5K tokens, qwen2.5-3b supports ~4K tokens before spillover.
Decode-heavy applications (chat, code gen): Ollama's decode throughput (47-163 tok/s) vastly exceeds HF's (0.9-49 tok/s). Use Ollama.
TTFT-sensitive applications (interactive use): Ollama maintains sub-second TTFT through 32K tokens. HF crosses 1 second at 4K tokens. Use Ollama for long-context interactive workloads.

Claim Validation

#	Claim	Evidence Base	Status
1	Quadratic attention cost is empirically visible on RTX 4080	Pre-spillover exponents b = 1.58-1.78, quadratic R^2 = 0.999+ (SS4)	Demonstrated (pre-spillover only)
2	VRAM becomes the bottleneck before compute does	Spillover at 8-16K tokens causes 25-105x cliffs (SS6)	Demonstrated
3	TTFT scales superlinearly with context length	HF: 9,004x TTFT increase over 32x context growth (SS7)	Demonstrated (HF only; Ollama is sub-linear)
4	There is a context-length "cliff" where performance drops dramatically	25-105x latency jumps at spillover thresholds (SS4, SS12)	Demonstrated (HF only)
5	Ollama eliminates quadratic scaling	Sub-linear exponents b < 0.2, all 3 models (SS4)	Demonstrated
6	Decode throughput degrades with context length	41-53% Ollama decode degradation, 95% HF decode collapse (SS5)	Demonstrated
7	Empirical VRAM growth cross-validates with TR123	Slopes 0.75-1.16 MB/token vs theoretical KV 12-37 KB/token; 20-95x overhead quantified as attention workspace + allocator (SS6.4)	Demonstrated (with reinterpretation: slope is total context-dependent memory, not pure KV cost)
8	Model size determines context budget	3B model spills at 8K, 0.5B/1.5B at 16K (SS6)	Demonstrated

When to Use This Report

TR127 is the context-length scaling reference for the Banterhearts research program. Use it when planning context budgets, evaluating VRAM requirements, or understanding long-context performance on consumer hardware.

Scenario 1: Planning Context Window for RAG Pipeline

Question: "I want to stuff 8K tokens of retrieved context into qwen2.5-1.5b on my RTX 4080. Will it work?"

Answer: Consult SS6. qwen2.5-1.5b uses 3.1 GB base VRAM + 1.03 MB/token KV cache. At 8K tokens: 3.1 + 8.2 = 11.3 GB -- borderline. TTFT will be ~5 seconds (SS7). Decode throughput drops to 19 tok/s (SS5). For acceptable interactive latency, limit to 4K tokens on HF, or use Ollama (handles 32K+ at 10 ms TTFT, 130 tok/s decode).
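The arithmetic in this answer can be sketched as a linear budget check. This is a minimal illustration, assuming VRAM grows linearly with context at the quoted slope; the function names are hypothetical and the constants are the figures quoted above.

```python
# Linear VRAM budget sketch: base weights + per-token growth * context length.
# Constants are the figures quoted in the answer above; the function names are
# hypothetical and not part of the TR127 pipeline.
PHYSICAL_VRAM_MB = 12_288


def estimated_vram_mb(base_mb: float, mb_per_token: float, context_tokens: int) -> float:
    """Estimated peak allocation for a given context length."""
    return base_mb + mb_per_token * context_tokens


def max_context_before_spillover(base_mb: float, mb_per_token: float,
                                 vram_mb: float = PHYSICAL_VRAM_MB) -> int:
    """Largest context whose linear estimate still fits in physical VRAM."""
    return int((vram_mb - base_mb) // mb_per_token)


est_8k = estimated_vram_mb(3_132, 1.03, 8_192)
limit = max_context_before_spillover(3_132, 1.03)
print(f"8K estimate: {est_8k / 1024:.1f} GB, linear limit: {limit} tokens")
```

The linear limit leaves no headroom for memory already in use by the system (about 1.2 GB at preflight in SS3.2), so the practical budget is lower than this estimate suggests.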

Scenario 2: Evaluating VRAM Requirements for a New Model

Question: "I have a 7B model at FP16. How much context can I fit on 12 GB VRAM?"

Answer: A 7B FP16 model uses ~14 GB for weights alone -- it won't fit on 12 GB VRAM even at context=0. Use Ollama with Q4_K_M quantization (TR125 shows negligible quality loss). At Q4_K_M, llama3.1-8b uses ~4.6 GB, leaving ~7.4 GB for KV cache.

Scenario 3: Understanding Why Inference Suddenly Became Very Slow

Question: "My HF model was running fine at 4K context but became 100x slower at 16K. What happened?"

Answer: Consult SS4 and SS6. Your model likely hit the VRAM spillover threshold -- CUDA Unified Memory is paging tensors to system RAM, causing 25-105x latency increases. Check torch.cuda.max_memory_allocated(): if it exceeds your physical VRAM (12 GB), you're in the thrashing regime. Reduce context length or switch to Ollama.

Scenario 4: Comparing with TR123 KV Cache Theory

Question: "TR123 predicted KV cache costs based on model architecture. Do the empirical measurements match?"

Answer: Consult SS6.4. The measured slopes (0.75-1.16 MB/token) are derived from VRAM growth on pre-spillover data and should be compared with TR123's architectural predictions (layer_count x kv_heads x head_dim x precision_bytes x 2, giving 12-37 KB/token). The measured slopes are 20-95x larger because they capture total context-dependent memory (attention workspace, activations, and allocator fragmentation), not KV cache alone.
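The architectural formula can be evaluated directly. The sketch below takes kv_heads = 2 from the SS2.2 model table; the layer counts and head dimensions are assumptions taken from the public Qwen2.5 model configurations, not values stated in this report.

```python
# Theoretical KV-cache bytes per token: one K and one V vector per layer per KV head.
# kv_heads=2 comes from the SS2.2 model table; layer counts and head_dim are
# ASSUMPTIONS based on the public Qwen2.5 configs, not values from this report.
ASSUMED_ARCH = {
    "qwen2.5-0.5b": {"layers": 24, "kv_heads": 2, "head_dim": 64},
    "qwen2.5-1.5b": {"layers": 28, "kv_heads": 2, "head_dim": 128},
    "qwen2.5-3b": {"layers": 36, "kv_heads": 2, "head_dim": 128},
}
FP16_BYTES = 2


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       precision_bytes: int = FP16_BYTES) -> int:
    """layer_count x kv_heads x head_dim x precision_bytes x 2 (K and V)."""
    return layers * kv_heads * head_dim * precision_bytes * 2


theory = {name: kv_bytes_per_token(**arch) for name, arch in ASSUMED_ARCH.items()}
for name, n_bytes in theory.items():
    print(f"{name}: {n_bytes / 1000:.1f} KB/token")
```

Under these assumed configurations the results span roughly 12-37 KB/token, the same range quoted for TR123's theoretical costs.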

Preliminaries

Metric Definitions & Statistical Methods

Experiment Design (SS1-SS3)

Introduction & Research Motivation
Methodology & Experimental Design
Environment & Artifacts

Scaling Analysis (SS4-SS7)

Prefill Scaling Analysis -- Two-regime discovery, cold-start analysis (4.5), trimmed-mean robustness (4.8)
Decode Scaling Analysis -- Two-regime decode (5.3), decode trimmed-mean robustness (5.4)
Memory Scaling (CUDA Allocation) -- KV cross-validation with TR123 theory (6.4)
Time to First Token (TTFT) Analysis

Comparisons & Quality (SS8-SS10)

Backend Comparison (HF vs Ollama) -- Multiple comparison correction (8.4), ANOVA interaction (8.5)
Outlier Analysis -- Distribution shape analysis (9.5)
Power Analysis

Synthesis (SS11-SS14)

Cross-Model Comparison
Key Findings
Conclusions
Production Guidance & Decision Trees

Closing

Limitations & Future Work
Reproducibility

Appendices

Appendix A: Environment Specifications
Appendix B: Config (Source of Truth)
Appendix C: Glossary
References

Metric Definitions & Statistical Methods

Latency Metrics

Metric	Definition	Computation
Mean (ms)	Arithmetic mean of wall-clock latency across all repetitions	`sum(x) / N`
Median (ms)	50th percentile latency	`sorted(x)[N//2]`
Std (ms)	Sample standard deviation	`sqrt(sum((x - mean)^2) / (N-1))`
p95/p99 (ms)	Percentile latencies	`numpy.percentile(x, [95, 99])`
95% CI	95% confidence interval for the mean	`mean +/- t(0.975, N-1) * std / sqrt(N)` (t = 2.262 at N=10)
CV%	Coefficient of variation	`(std / mean) * 100`
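As a worked check of the interval and CV definitions, the sketch below recomputes one row of the SS4.4 tables from its summary statistics. It assumes a Student-t critical value for N = 10 (2.262), which is the value that reproduces the tabulated interval for this row; the function names are illustrative.

```python
import math

# Two-sided 95% Student-t critical value for df = 9 (N = 10). Hardcoded to keep
# the sketch dependency-free; scipy.stats.t.ppf(0.975, 9) gives the same number.
T_CRIT_DF9 = 2.262


def ci95_from_summary(mean: float, std: float, n: int = 10,
                      t_crit: float = T_CRIT_DF9) -> tuple:
    """Confidence interval for the mean from summary statistics."""
    half_width = t_crit * std / math.sqrt(n)
    return mean - half_width, mean + half_width


def cv_percent(mean: float, std: float) -> float:
    """Coefficient of variation in percent."""
    return std / mean * 100.0


# qwen2.5-0.5b prefill at 16,384 tokens (mean and std from SS4.4).
low, high = ci95_from_summary(473_789.2, 5_244.4)
cv = cv_percent(473_789.2, 5_244.4)
print(round(low), round(high), round(cv, 1))
```

A normal-approximation multiplier of 1.96 would give a narrower interval for the same row, which matters at N = 10.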

Throughput Metrics

Metric	Definition	Computation
Prefill tok/s	Prompt processing speed	`prompt_tokens / prefill_ms * 1000`
Decode tok/s	Token generation speed	`generated_tokens / decode_ms * 1000`
TTFT (ms)	Time to first token	Equal to prefill latency (prompt_eval_duration for Ollama)

Effect Size & Significance Metrics

Metric	Definition	Interpretation
Cohen's d	Standardized mean difference: `(mean_A - mean_B) / pooled_std`	Negligible: |d| < 0.2, Small: 0.2-0.5, Medium: 0.5-0.8, Large: > 0.8
p-value	Probability of observing the data under H_0 (no difference), via Welch's t-test	Significant if p < 0.05
Delta (%)	Relative difference: `(mean_B - mean_A) / mean_A x 100`	Negative = B is faster
Outlier (IQR)	Tukey fence per context length: `x < Q1 - 1.5*IQR` or `x > Q3 + 1.5*IQR`	Per-context detection avoids false positives from pooling heterogeneous regimes
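The effect-size and delta definitions can be sketched as follows. The table does not spell out the pooled standard deviation, so the classic pooled sample form used here is an assumption; the sample values are made up for illustration.

```python
import math
import statistics


def cohens_d(a, b) -> float:
    """Standardized mean difference using the classic pooled sample std (assumed form)."""
    na, nb = len(a), len(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    pooled = math.sqrt(((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2))
    return (statistics.fmean(a) - statistics.fmean(b)) / pooled


def delta_percent(a, b) -> float:
    """Relative difference of B against A; negative means B has lower latency."""
    mean_a = statistics.fmean(a)
    return (statistics.fmean(b) - mean_a) / mean_a * 100.0


# Made-up latencies in ms, purely to exercise the formulas.
backend_a = [10.0, 12.0, 14.0]
backend_b = [4.0, 6.0, 8.0]
d = cohens_d(backend_a, backend_b)
delta = delta_percent(backend_a, backend_b)
print(f"d = {d:.1f}, delta = {delta:.0f}%")
```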

Multiple Comparison Correction

Method	Formula	Use Case
Bonferroni	Reject if p < alpha / n_tests	Conservative FWER control. With 18 tests at alpha=0.05: threshold = 0.0028
Holm-Bonferroni	Step-down: sort p-values, reject p_i < alpha/(n-i+1) while all smaller p rejected	Less conservative than Bonferroni, controls FWER without assuming independence
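The two corrections can be sketched in a few lines. This is an illustrative implementation of the standard procedures, not the TR127 analysis code, and the p-values are made up.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0 where p < alpha / n_tests."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


def holm_bonferroni(p_values, alpha=0.05):
    """Step-down Holm procedure; returns reject flags in the original order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    reject = [False] * n
    for rank, idx in enumerate(order):          # rank 0 = smallest p-value
        if p_values[idx] < alpha / (n - rank):  # alpha/(n-i+1) with i = rank + 1
            reject[idx] = True
        else:
            break                               # stop at the first non-rejection
    return reject


# Made-up p-values chosen so that the two methods disagree.
p_vals = [0.01, 0.015, 0.04, 0.005]
print(bonferroni(p_vals))       # rejects only the two smallest
print(holm_bonferroni(p_vals))  # rejects all four
print(round(0.05 / 18, 4))      # Bonferroni threshold for 18 tests
```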

ANOVA / Interaction Testing

Test	Purpose	In TR127
One-way ANOVA (backend)	Does backend affect latency (collapsing contexts)?	F-test on HF vs Ollama for qwen2.5-1.5b
Per-context t-test	Does backend effect change at each context?	Series of t-tests showing interaction pattern
Interaction evidence	Does the magnitude/direction of backend effect depend on context?	Classified as none/weak/moderate/strong

Trimmed Mean

A robust estimator of central tendency: scipy.stats.trim_mean(values, proportiontocut) removes floor(N x proportion) values from each tail before computing the mean. At N=10: 5% trim removes 0 values (useless); 10% trim removes 1 from each tail (effective for cold-start filtering).
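The trimming rule can be made concrete with a dependency-free sketch that mirrors the floor(N x proportion) behavior described above. The latencies are made up to mimic a single cold-start outlier.

```python
def trim_mean(values, proportiontocut):
    """Mean after dropping floor(N * proportion) values from each tail."""
    ordered = sorted(values)
    n = len(ordered)
    cut = int(proportiontocut * n)
    kept = ordered[cut:n - cut]
    return sum(kept) / len(kept)


# Made-up latencies in ms with one cold-start outlier in rep-0.
reps = [95.0, 10.0, 11.0, 10.0, 12.0, 11.0, 10.0, 11.0, 12.0, 10.0]
print(trim_mean(reps, 0.05))  # no values removed at N=10, outlier still included
print(trim_mean(reps, 0.10))  # one value removed per tail, outlier excluded
```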

Distribution Shape

Metric	Formula	Interpretation
Mean/median ratio	`mean / median`	>1.0 = right skew; >2.0 = severe skew (mean unreliable)
Skewness	`scipy.stats.skew(values)`	0 = symmetric; >2 = strong right skew
Shapiro-Wilk	`scipy.stats.shapiro(values)`	p >= 0.05 = consistent with normality
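The first two shape metrics are simple to compute by hand. The sketch below uses the biased Fisher-Pearson form of skewness (the default of scipy.stats.skew) and made-up latencies with one cold-start outlier.

```python
import statistics


def mean_median_ratio(values) -> float:
    """Values above 1.0 indicate right skew."""
    return statistics.fmean(values) / statistics.median(values)


def skewness(values) -> float:
    """Biased Fisher-Pearson skewness: m3 / m2**1.5."""
    n = len(values)
    mu = statistics.fmean(values)
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Made-up latencies in ms with one cold-start outlier in rep-0.
reps = [95.0, 10.0, 11.0, 10.0, 12.0, 11.0, 10.0, 11.0, 12.0, 10.0]
ratio = mean_median_ratio(reps)
skew = skewness(reps)
print(f"mean/median = {ratio:.2f}, skewness = {skew:.2f}")
```

A single slow repetition out of ten is enough to push skewness above 2, which is why the mean is a poor summary for such groups.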

Scaling Fit Methods

Three models are fit to (context_length, latency) data:

Power law: latency = a x context_length^b -- exponent b indicates scaling behavior (b=1 is linear, b=2 is quadratic)
Linear: latency = a x context_length + c -- R^2 compared against power law to determine better fit
Quadratic: latency = a x context_length^2 + b x context_length + c -- direct test of O(n^2) hypothesis
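A power-law exponent can be estimated with a straight-line fit in log-log space. The sketch below is one possible procedure, not necessarily the one used for the reported exponents: a log-space fit weights every context length equally, so its exponent need not match a fit performed on raw latencies.

```python
import math


def fit_power_law_loglog(contexts, latencies_ms):
    """Least-squares line through (ln n, ln latency); returns (a, b)."""
    xs = [math.log(n) for n in contexts]
    ys = [math.log(t) for t in latencies_ms]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx                       # scaling exponent
    a = math.exp(y_bar - b * x_bar)     # prefactor
    return a, b


# Pre-spillover mean prefill latencies for qwen2.5-0.5b (SS4.4).
contexts = [512, 1024, 2048, 4096, 8192]
latencies = [52.6, 144.5, 441.2, 1452.3, 4723.6]
a, b = fit_power_law_loglog(contexts, latencies)
print(f"b = {b:.2f}")
```

The resulting exponent falls between linear and quadratic, in line with the pre-spillover interpretation.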

For HF models with VRAM spillover, two-regime analysis separates pre-spillover data (true computational scaling) from post-spillover data (VRAM thrashing artifacts). The thrashing threshold is identified as the first context length where torch.cuda.max_memory_allocated() exceeds physical GPU VRAM (12,288 MB).
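The threshold rule amounts to a scan over context lengths in ascending order. The sketch below assumes a mapping from context length to peak allocated MB; the function name is hypothetical and the peak values are made up (shaped like a model that first exceeds VRAM at 8K).

```python
PHYSICAL_VRAM_MB = 12_288


def split_regimes(peak_mb_by_context, vram_mb=PHYSICAL_VRAM_MB):
    """Return (clean_contexts, thrashing_threshold); threshold is None if never exceeded."""
    clean = []
    for ctx in sorted(peak_mb_by_context):
        if peak_mb_by_context[ctx] > vram_mb:
            return clean, ctx           # first context length above physical VRAM
        clean.append(ctx)
    return clean, None


# Made-up peak allocations in MB, for illustration only.
peaks = {512: 6_500, 1024: 6_900, 2048: 7_800, 4096: 9_900, 8192: 16_600}
clean, threshold = split_regimes(peaks)
print(clean, threshold)
```

Scaling fits for the clean regime would then use only the context lengths in the first returned list.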

Timing Methodology

HF (transformers-gpu): All latency measurements use time.perf_counter() with torch.cuda.synchronize() barriers before and after the timed region. VRAM is measured via torch.cuda.max_memory_allocated() (reset before each context-length sweep).
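The barrier pattern can be sketched with a small helper that takes the synchronization call as a parameter. The helper name and signature are hypothetical; for a real GPU measurement one would pass torch.cuda.synchronize as the sync argument, whereas the default no-op keeps the sketch runnable without a GPU.

```python
import time


def timed_ms(fn, sync=lambda: None):
    """Time fn() in milliseconds with a sync barrier before and after the timed region."""
    sync()                                   # drain pending work before starting the clock
    start = time.perf_counter()
    result = fn()
    sync()                                   # wait for queued work before stopping the clock
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return elapsed_ms, result


# CPU-only stand-in workload; on GPU, call timed_ms(forward, sync=torch.cuda.synchronize).
elapsed_ms, total = timed_ms(lambda: sum(range(100_000)))
print(f"{elapsed_ms:.3f} ms")
```

Without the trailing barrier, an asynchronous GPU launch would return before the kernels finish and the measured time would understate the true latency.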

Ollama: Native timing fields from the /api/generate HTTP response:

Prefill: prompt_eval_duration (nanoseconds, GPU-only)
Decode: eval_duration (nanoseconds, GPU-only)
E2E: Wall clock via time.perf_counter() (includes HTTP overhead)
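Converting the native fields into the report's units is a matter of dividing nanoseconds and applying the throughput formulas from the metric tables. The sketch below assumes the response has already been parsed into a dict; the helper name is hypothetical and the sample values are made up.

```python
NS_PER_MS = 1_000_000


def ollama_metrics(response: dict) -> dict:
    """Derive latency (ms) and throughput (tok/s) from Ollama's native timing fields."""
    prefill_ms = response["prompt_eval_duration"] / NS_PER_MS
    decode_ms = response["eval_duration"] / NS_PER_MS
    return {
        "prefill_ms": prefill_ms,
        "decode_ms": decode_ms,
        "prefill_tok_s": response["prompt_eval_count"] / prefill_ms * 1000,
        "decode_tok_s": response["eval_count"] / decode_ms * 1000,
    }


# Made-up response fragment, for illustration only.
sample = {
    "prompt_eval_count": 512,
    "prompt_eval_duration": 8_000_000,    # 8 ms in nanoseconds
    "eval_count": 128,
    "eval_duration": 800_000_000,         # 800 ms in nanoseconds
}
metrics = ollama_metrics(sample)
print(metrics)
```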

VRAM Measurement Caveat

torch.cuda.max_memory_allocated() reports the peak amount of memory allocated by the CUDA allocator, which includes CUDA Unified Memory allocations that spill to system RAM. On our 12 GB GPU, values exceeding 12,288 MB indicate that PyTorch has allocated memory beyond physical VRAM, with excess served from system RAM at PCIe bandwidth (~16 GB/s vs ~256 GB/s GDDR6). This is the root cause of the 25-105x thrashing cliffs. Importantly, PyTorch does NOT raise an OOM exception at this point -- the OOM occurs at a higher allocation threshold determined by the CUDA Unified Memory limit.

1. Introduction & Research Motivation

1.1 Research Questions

TR127 addresses four decision-grade questions from the research roadmap:

Does attention's quadratic cost show up empirically on the RTX 4080? Self-attention is O(n^2) in context length. At what point does this become measurable on consumer hardware?
At what context length does VRAM become the bottleneck per model? Each model has a different base weight footprint. Where does the remaining VRAM fill up with KV cache?
How does TTFT (prefill latency) scale with context length? TTFT determines interactive responsiveness. At what context length does TTFT exceed acceptable thresholds (1s, 5s, 10s)?
Is there a context-length "cliff" where performance drops dramatically? Do models degrade gracefully or catastrophically as context grows?

1.2 Why This Matters

Every prior report in this research program (TR108-TR126) tested at most ~2K token contexts. But production workloads are increasingly long-context:

RAG pipelines: 4K-16K tokens of retrieved context prepended to queries
Document summarization: 8K-32K tokens of source text
Multi-turn chat: 4K-128K tokens of conversation history
Code generation: 2K-16K tokens of repository context

Without empirical context-length scaling data, production teams cannot answer basic capacity questions: "How many tokens of context can I stuff before latency becomes unacceptable?" or "Should I use FP16 HF or quantized Ollama for my 8K-context RAG pipeline?"

TR127 fills this gap with a systematic context-length sweep on consumer hardware within this research program. It also reveals a phenomenon -- VRAM spillover via CUDA Unified Memory -- that was invisible in short-context experiments and would silently degrade production deployments.

1.3 Scope

Hardware: Single consumer machine (RTX 4080 Laptop, 12 GB VRAM) -- same GPU as TR117-TR126.
Platform: Windows 11, Python 3.13, PyTorch 2.8.0+cu128 (for HF); Ollama localhost (for quantized inference).
Models: 5 models spanning 0.5B-3.2B parameters: 3 on HF (qwen2.5-0.5b/1.5b/3b) and 3 on Ollama (llama3.2-1b/3b, qwen2.5-1.5b), with qwen2.5-1.5b run on both backends for direct comparison.
Context lengths: 7 levels -- 512, 1024, 2048, 4096, 8192, 16384, 32768 tokens (geometric progression, 2x steps).
Modes: prefill, decode (128 new tokens), end-to-end (prefill + decode).
Timing: torch.cuda.synchronize() + perf_counter for HF; native prompt_eval_duration/eval_duration for Ollama.
Temperature: 0.0 (greedy decoding). Deterministic -- validated by TR124 Phase 3.
Repetitions: 10 measured + 3 warmup (discarded) per (model x context_length).

1.4 Literature Grounding

Reference	Contribution	How TR127 Uses It
TR123 (Banterhearts)	KV cache cost model, VRAM formulas	Validate empirical KV costs against theoretical predictions
TR125 (Banterhearts)	Quantization quality data, Ollama timing	Ollama model selection, native timing methodology
TR126 (Banterhearts)	HF vs Ollama comparison at short context	Cross-reference backend comparison findings
FlashAttention (Dao et al., 2022)	Tiling-based exact attention at sub-quadratic memory	Explains Ollama's sub-linear scaling
CUDA Unified Memory (NVIDIA)	Transparent CPU-GPU memory migration	Explains VRAM spillover mechanism

Gap filled: Prior reports tested performance at fixed, short context lengths. TR127 provides the research program's first context-length scaling curves on consumer hardware, revealing two-regime behavior invisible in short-context experiments.

1.5 How to Read This Report

Use TR127 in three passes:

SS2-SS3 (Methodology): Understand the experimental design and what was measured. If you trust the setup, skip to results.
SS4-SS7 (Scaling Analysis): The core contribution -- prefill scaling (the two-regime discovery), decode scaling, VRAM scaling, and TTFT analysis. These four sections answer the four research questions.
SS8-SS14 (Comparisons & Synthesis): Backend comparison, outlier analysis, power analysis, cross-model comparison, and production guidance. Read these for deployment decisions.

2. Methodology & Experimental Design

2.1 Independent Variable

Context length is the single independent variable, swept across 7 levels:

Level	Context Length (tokens)	Doubling Step
1	512	--
2	1,024	2x
3	2,048	2x
4	4,096	2x
5	8,192	2x
6	16,384	2x
7	32,768	2x

The geometric (2x) progression provides even coverage on a log scale, which is natural for analyzing power-law scaling relationships.

2.2 Model Lineup

Model	Params	Backend	Dtype	Max Context	KV Heads	Base VRAM (MB)
qwen2.5-0.5b	500M	transformers-gpu	FP16	32K	2	1,122
qwen2.5-1.5b	1,543M	transformers-gpu + ollama	FP16 / Q4	131K	2	3,132
qwen2.5-3b	3,000M	transformers-gpu	FP16	32K	2	6,190
llama3.2-1b	1,236M	ollama	Q8_0	131K	--	--
llama3.2-3b	3,213M	ollama	Q4_K_M	131K	--	--

Design rationale:

Three HF models at 0.5B/1.5B/3B (3x and 2x parameter steps) provide controlled VRAM scaling: each model has a different base footprint, so the "VRAM budget for KV cache" differs, and the spillover threshold should occur at different context lengths. This is confirmed: qwen2.5-3b spills at 8K while the smaller models spill at 16K.
Three Ollama models at 1B/1.5B/3B provide a quantized reference that does not hit VRAM limits, enabling measurement of true computational scaling without VRAM artifacts.
qwen2.5-1.5b on both backends enables a direct HF-vs-Ollama comparison at each context length, isolating the backend effect from the model effect.

2.3 Backend Selection

Backend	Measurement	VRAM Tracking	Context Control
transformers-gpu	`torch.cuda.synchronize()` + `perf_counter` (GPU-accurate wall clock)	`torch.cuda.max_memory_allocated()`	Tokenizer truncation to exact token count
ollama	Native `prompt_eval_duration` / `eval_duration` (GPU-only, from llama.cpp)	Not available (no VRAM API)	`num_ctx` option in API call

Why no compiled HF? TR126 established that torch.compile on Windows falls back to aot_eager (no Triton). Testing compiled HF in a long-context sweep on Windows would measure aot_eager scaling, which is uninformative. TR126's Linux Docker results show that compilation benefits prefill but crashes on decode; neither result applies to this Windows-only context sweep.

2.4 Measurement Protocol

For each (model x backend x context_length):

Prompt generation: Synthetic text tokenized to exactly N tokens using the model's tokenizer (HF) or repeated seed text (Ollama)
Warmup: 3 repetitions (discarded) -- critical for JIT, memory allocation, CUDA context setup
Measurement: 10 repetitions, each measuring:
- Prefill: Forward pass over the entire prompt (single pass, use_cache=True)
- Decode: Autoregressive generation of 128 new tokens using KV cache
- E2E: Prefill + decode timed together
VRAM recording: torch.cuda.max_memory_allocated() recorded once per context length (resets between context lengths)
OOM handling: torch.cuda.OutOfMemoryError caught and recorded as status: "oom" -- the OOM context length itself is a data point

2.5 Prompt Generation

Synthetic prompts are constructed to hit exact token counts:

HF: A seed paragraph is repeated and then tokenized. The token IDs are truncated to exactly N tokens. This ensures the tokenizer reports exactly the target context length.
Ollama: Text is repeated to approximately the target length. The num_ctx API parameter controls the context window. Actual prompt_eval_count varies slightly from the target (e.g., 29,294 prompt tokens for a 32K context length on llama3.2-3b) due to Ollama's tokenizer differences.

2.6 Controlled Variables

Variable	Value	Rationale
`max_new_tokens`	128	Fixed decode length for fair comparison across context lengths
`temperature`	0.0	Greedy decoding -- deterministic, no sampling variance
`seed`	42	Reproducible random state
`warmup_repetitions`	3	Exclude cold-start artifacts
`repetitions`	10	Sufficient power to detect large effect sizes at N=10

2.7 Sample Counts

Backend	Models	Context Lengths	Reps	Modes	Total Planned	Actual (ok)	OOM
transformers-gpu	3	7	10	3	630	510*	4
ollama	3	7	10	3	630	630	0
Total	--	--	--	--	1,260	1,140	4

*HF models: qwen2.5-0.5b ran 6/7 lengths (OOM at 32K), qwen2.5-1.5b ran 6/7 (OOM at 32K), qwen2.5-3b ran 5/7 (OOM at 16K and 32K). Plus 4 OOM sentinel rows = 1,144 total rows in CSV.
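The row accounting can be checked from the footnote. The sketch assumes every (model, context length) cell that did not OOM completed all 10 repetitions in all 3 modes.

```python
REPS, MODES = 10, 3

# Context lengths completed per model, from the footnote above.
hf_lengths_run = {"qwen2.5-0.5b": 6, "qwen2.5-1.5b": 6, "qwen2.5-3b": 5}
ollama_lengths_run = {"llama3.2-1b": 7, "qwen2.5-1.5b": 7, "llama3.2-3b": 7}
oom_rows = 4  # one sentinel row per OOM (model, context length) cell

hf_ok = sum(hf_lengths_run.values()) * REPS * MODES
ollama_ok = sum(ollama_lengths_run.values()) * REPS * MODES
total_rows = hf_ok + ollama_ok + oom_rows
print(hf_ok, ollama_ok, hf_ok + ollama_ok, total_rows)
```

Under that assumption the successful rows split as 510 (HF) and 630 (Ollama), which sums to the 1,140 successful measurements and 1,144 CSV rows reported.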

3. Environment & Artifacts

3.1 Environment Fingerprint

Property	Value
Platform	Windows-11-10.0.26200-SP0
Python	3.13.1
PyTorch	2.8.0+cu128
CUDA	12.8
cuDNN	91002
GPU	NVIDIA GeForce RTX 4080 Laptop GPU
VRAM	12.88 GB (12,288 MB usable)
Compute Capability	8.9 (Ada Lovelace)
Triton	Not available (Windows)
Ollama	localhost:11434 (llama.cpp backend)

3.2 Preflight Validation

Check	Result	Detail
CUDA available	Pass	RTX 4080 Laptop GPU detected
GPU free memory	10.78 GB	~1.21 GB used by system at start
CUDA sync test	Pass	`torch.cuda.synchronize()` completes without error

3.3 Run Timeline

Event	Time	Duration
Start	10:11:30	--
qwen2.5-0.5b complete (6 lengths)	~11:00	~49 min
qwen2.5-1.5b complete (6 lengths)	~13:30	~2h 30m
qwen2.5-3b complete (5 lengths)	~14:00	~30 min
Ollama models complete (3 x 7 lengths)	15:13	~1h 13m
Pipeline end (analysis + report)	15:13:12	~5h 2m total

The HF models took ~3h 49m (dominated by 16K-token context runs that entered the VRAM thrashing regime). Ollama models took ~1h 13m with no thrashing.

3.4 Key Artifacts

Artifact	Path	Size
Raw measurements	`research/tr127/results/20260224_101128/metrics.csv`	127 KB, 1,144 rows
Analysis	`research/tr127/results/20260224_101128/analysis.json`	75 KB
Auto-generated report	`research/tr127/results/20260224_101128/report.md`	29 KB
Manifest	`research/tr127/results/20260224_101128/manifest.json`	3 KB

4. Prefill Scaling Analysis

This section answers Research Question 1: Does attention's quadratic cost show up empirically on the RTX 4080?

4.1 The Two-Regime Discovery

The central finding of TR127 is that HF prefill scaling exhibits two distinct regimes:

Regime 1 -- Pre-spillover (computational scaling): When the model + KV cache fits in physical VRAM (12 GB), prefill latency follows a power law with exponent b = 1.58-1.78. This is between linear (b=1) and quadratic (b=2), consistent with self-attention's O(n^2) cost partially mitigated by hardware optimizations (SDPA kernels, Ada Lovelace architecture optimizations). Quadratic fits achieve R^2 = 0.999+, confirming that the O(n^2) model is an excellent description of the computational scaling.

Regime 2 -- Post-spillover (VRAM thrashing): When KV cache allocation exceeds physical VRAM, CUDA Unified Memory silently pages tensors to system RAM. Each attention operation must now fetch data over PCIe (~16 GB/s) instead of GDDR6 (~256 GB/s), causing 25-105x latency increases. Full-range power-law fits show apparent exponents of b = 4.6-6.7 -- but these are artifacts of the memory hierarchy transition, not O(n^6) attention.

Why two regimes? The transition is not gradual -- it's a cliff. At 8K tokens, qwen2.5-3b allocates 16.2 GB (1.32x physical VRAM). The excess 4.2 GB must be fetched from system RAM on every access. Because attention involves all-pairs token comparisons, the amount of data fetched from system RAM scales quadratically, creating a multiplicative interaction between quadratic compute and PCIe bandwidth limits.

4.2 Full-Range Scaling Fits

For completeness, the full-range fits (including thrashing data) are reported below. These show what a naive analysis would conclude if VRAM spillover were not identified:

Model	Backend	Exponent (b)	R^2 (power)	R^2 (quad)	R^2 (linear)	Better Fit
llama3.2-1b	ollama	0.083	0.488	0.864	0.840	linear
llama3.2-3b	ollama	0.158	0.718	0.969	0.960	linear
qwen2.5-0.5b	transformers-gpu	6.645	1.000	0.990	0.796	power_law
qwen2.5-1.5b	ollama	0.109	0.588	0.860	0.855	linear
qwen2.5-1.5b	transformers-gpu	6.703	1.000	0.990	0.795	power_law
qwen2.5-3b	transformers-gpu	4.625	1.000	0.994	0.832	power_law

Warning: The b = 4.6-6.7 exponents for HF models are VRAM thrashing artifacts, not evidence of O(n^6) attention. The full-range power-law fit achieves R^2 ~ 1.000 because the 16K/8K data points (at 100x latency) dominate the fit. These numbers should NOT be used for scaling predictions.

4.3 Pre-Thrashing Scaling Fits (True Computational Scaling)

By restricting the fit to context lengths where VRAM allocation stays within 12 GB, we isolate the true computational scaling:

Model	Backend	Clean Points	Exponent (b)	R^2 (power)	R^2 (quad)	R^2 (linear)	Thrashing At	Thrashing Mult
qwen2.5-0.5b	transformers-gpu	5 (512-8K)	1.701	0.9999	0.9999	0.969	16,384	100.7x
qwen2.5-1.5b	transformers-gpu	5 (512-8K)	1.780	0.9982	0.9998	0.954	16,384	104.7x
qwen2.5-3b	transformers-gpu	4 (512-4K)	1.583	0.9950	0.9992	0.967	8,192	25.2x

Interpretation:

Exponents b = 1.58-1.78 are between linear (b=1) and quadratic (b=2). Pure O(n^2) self-attention would give b=2. The sub-quadratic exponents suggest hardware optimizations (SDPA with memory-efficient attention, Ada Lovelace tensor cores) reduce the effective scaling exponent by 10-20%.
Quadratic R^2 = 0.999+ means the quadratic model latency = a*n^2 + b*n + c fits the pre-spillover data almost perfectly. This is consistent with O(n^2) attention as the dominant cost, even though the power-law exponent is slightly below 2.
The thrashing multiplier (25-105x) quantifies how much worse performance gets when VRAM spills; the table values are ratios of median latencies. qwen2.5-0.5b jumps from 4.7 seconds (8K, pre-spillover) to 474 seconds (16K, post-spillover) -- a 100.7x increase by median (100.3x by mean) for a 2x context growth.
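The multipliers can be recomputed from the per-model tables in SS4.4 as the ratio of the first thrashing latency to the last clean latency. The sketch computes both the median-based and the mean-based ratio, since the two differ slightly.

```python
# (last clean latency, first thrashing latency) in ms, from the SS4.4 tables.
pairs = {
    "qwen2.5-0.5b": {"mean": (4723.6, 473789.2), "median": (4721.0, 475338.8)},
    "qwen2.5-1.5b": {"mean": (5086.0, 532890.8), "median": (5083.8, 532221.5)},
    "qwen2.5-3b": {"mean": (2432.3, 61981.8), "median": (2428.4, 61211.1)},
}


def thrashing_multiplier(clean_ms: float, thrashing_ms: float) -> float:
    """Latency ratio across the spillover boundary (one 2x context step)."""
    return thrashing_ms / clean_ms


multipliers = {
    model: {stat: round(thrashing_multiplier(*vals), 1) for stat, vals in stats.items()}
    for model, stats in pairs.items()
}
for model, result in multipliers.items():
    print(model, result)
```

The median-based ratios match the "Thrashing Mult" column in SS4.3; the mean-based ratios are the ones quoted in the per-model observations.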

4.4 Per-Model Prefill Analysis (HF)

qwen2.5-0.5b (500M params, FP16)

Context Length	Mean (ms)	Median (ms)	Std (ms)	CV%	95% CI	N	Regime
512	52.6	52.7	0.7	1.4%	[52, 53]	10	Clean
1,024	144.5	144.5	1.0	0.7%	[144, 145]	10	Clean
2,048	441.2	441.4	2.9	0.6%	[439, 443]	10	Clean
4,096	1,452.3	1,452.1	4.8	0.3%	[1,449, 1,456]	10	Clean
8,192	4,723.6	4,721.0	7.4	0.2%	[4,718, 4,729]	10	Clean
16,384	473,789.2	475,338.8	5,244.4	1.1%	[470,038, 477,541]	10	Thrashing
32,768	--	--	--	--	--	0	OOM

Observations:

Extraordinary measurement precision pre-spillover: CV ranges from 0.2% to 1.4% across 5 context lengths, and the 95% CI half-widths are roughly 0.5-5 ms. This validates the measurement methodology -- torch.cuda.synchronize() + perf_counter produces highly repeatable GPU timing.
The 100x cliff: At 16K tokens, mean latency jumps from 4,724 ms to 473,789 ms (a 100.3x increase for 2x context growth; 100.7x by median, as reported in SS4.3). The clean-regime prediction at 16K would be ~15,000 ms (based on the b=1.70 power law). The actual value is 31x worse than the computational prediction.
Base VRAM is tiny (1,122 MB): This model uses only 1.1 GB for weights, leaving ~11 GB for KV cache. Yet it still spills at 16K tokens because KV cache grows at 2.11 MB/token.
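The clean-regime prediction for 16K can be reproduced by extrapolating the power law one doubling beyond the last clean point. Anchoring the extrapolation at the measured 8K mean, rather than at the fitted prefactor, is an assumption made here for simplicity.

```python
# Extrapolate the pre-spillover power law from the last clean point (8K) to 16K.
last_clean_ctx, last_clean_ms = 8_192, 4_723.6   # qwen2.5-0.5b mean at 8K
exponent_b = 1.701                               # pre-spillover exponent from SS4.3
measured_16k_ms = 473_789.2                      # qwen2.5-0.5b mean at 16K

predicted_16k_ms = last_clean_ms * (16_384 / last_clean_ctx) ** exponent_b
excess_ratio = measured_16k_ms / predicted_16k_ms
print(f"predicted ~{predicted_16k_ms:,.0f} ms, measured is {excess_ratio:.0f}x higher")
```

The gap between the extrapolated and measured values is the part of the cliff attributable to spillover rather than to computational scaling.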

qwen2.5-1.5b (1.5B params, FP16)

Context Length	Mean (ms)	Median (ms)	Std (ms)	CV%	95% CI	N	Regime
512	108.8	110.2	2.6	2.4%	[107, 111]	10	Clean
1,024	243.3	242.7	4.1	1.7%	[240, 246]	10	Clean
2,048	507.1	504.4	15.6	3.1%	[496, 518]	10	Clean
4,096	1,436.4	1,395.3	105.9	7.4%	[1,361, 1,512]	10	Clean
8,192	5,086.0	5,083.8	10.2	0.2%	[5,079, 5,093]	10	Clean
16,384	532,890.8	532,221.5	13,308.7	2.5%	[523,370, 542,411]	10	Thrashing
32,768	--	--	--	--	--	0	OOM

Observations:

Higher base latency: At 512 tokens, qwen2.5-1.5b (109 ms) is 2.1x slower than qwen2.5-0.5b (53 ms). The 3x parameter increase translates to ~2x latency increase -- sub-linear in model size, consistent with GPU parallelism absorbing some of the additional compute.
4K anomaly: The 4,096-token measurement shows elevated std (106 ms, CV 7.4%) compared to neighboring points. The mean (1,436 ms) is higher than the median (1,395 ms), suggesting one or two slow runs pulling the mean up. This may indicate the model is approaching VRAM pressure (5,013 MB allocated at 4K -- 41% of VRAM) causing occasional allocator overhead.
105x cliff at 16K: 5,086 ms -> 532,891 ms. Even more dramatic than the 0.5B model, likely because the larger model has more weights to page in addition to KV cache.

qwen2.5-3b (3B params, FP16)

Context Length	Mean (ms)	Median (ms)	Std (ms)	CV%	95% CI	N	Regime
512	164.0	156.5	22.9	13.9%	[148, 180]	10	Clean
1,024	341.9	346.8	43.5	12.7%	[311, 373]	10	Clean
2,048	756.8	732.2	63.4	8.4%	[711, 802]	10	Clean
4,096	2,432.3	2,428.4	10.5	0.4%	[2,425, 2,440]	10	Clean
8,192	61,981.8	61,211.1	2,108.4	3.4%	[60,474, 63,490]	10	Thrashing
16,384	--	--	--	--	--	0	OOM
32,768	--	--	--	--	--	0	OOM

Observations:

Earlier spillover: The 3B model's 6.2 GB base footprint leaves only 6.1 GB for KV cache, so it spills at 8K tokens (16.2 GB allocated, 1.32x physical VRAM) instead of 16K.
Higher baseline variance: CV of 8-14% at 512-2K tokens is much higher than the smaller models' 0.6-3.1% at the same context lengths. This may be caused by the model's larger memory footprint interacting with the CUDA allocator more frequently, or by thermal throttling (the 150W laptop TDP may cause clock throttling during sustained 3B-parameter forward passes).
"Only" 25x cliff: The 4,096->8,192 jump is 25.5x (compared to 100-105x for the smaller models). The 3B model enters spillover earlier (8K vs 16K), and the spillover ratio (1.32x) is smaller than for the smaller models (2.66-2.83x), so the thrashing is less severe. However, the absolute latency (62 seconds for a single 8K prefill) is still catastrophically slow.

4.5 Ollama Cold-Start Analysis

Despite 3 warmup repetitions, Ollama's first measured repetition (rep-0) consistently shows far higher latency than subsequent repetitions: 7.8x to 306x for qwen2.5-1.5b in the table below. This pattern was present in 100% (21/21) of Ollama measurement groups and must be understood before interpreting Ollama scaling data.

Cold-Start Magnitude

qwen2.5-1.5b (Ollama)

Context	Rep-0 (ms)	Rest Median (ms)	Cold Ratio	Mean Inflation	Rest CV%
512	70.2	9.0	7.8x	+44.7%	5.7%
1,024	118.2	8.8	13.4x	+58.1%	6.1%
2,048	240.1	7.7	31.2x	+67.8%	5.4%
4,096	492.3	8.0	61.5x	+73.5%	3.8%
8,192	959.3	10.3	93.1x	+76.1%	3.9%
16,384	2,028.2	10.4	195.0x	+79.5%	3.6%
32,768	4,104.5	13.4	306.3x	+80.0%	3.6%

Key observations:

Cold ratio grows with context length -- from 7.8x at 512 tokens to 306x at 32K tokens. This is consistent with Ollama performing lazy KV cache allocation on the first inference: longer contexts require more KV cache memory, so the first-run overhead is proportionally larger.
Rest CV% is low (3.6-6.1%) -- after filtering rep-0, Ollama measurements are quite stable. The CVs of roughly 100-300% on unfiltered data (128-307% for this model, 97-307% for llama3.2-1b in SS4.6) are entirely caused by the cold-start outlier.
Mean inflation is severe (45-80%) -- including rep-0 inflates the mean by up to 80% at long contexts. The median is robust (unchanged by rep-0), and 10%-trimmed mean (removes 1 value from each tail of N=10) also eliminates the cold-start. However, 5%-trimmed mean with N=10 is insufficient -- floor(10 x 0.05) = 0 values removed, so it equals the untrimmed mean.

Recommendation: Use median or 10%-trimmed mean for Ollama central tendency. Do not report arithmetic mean without disclosing cold-start contamination.
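The trimming arithmetic behind this recommendation can be checked directly. The sketch below is illustrative only: the raw repetitions are not listed in this report, so the sample uses the reported rep-0 value for the 512-token row and sets the nine remaining repetitions to the reported rest median (an assumption). The helper name and its integer-percent argument are hypothetical.

```python
import statistics


def trimmed_mean(values, percent):
    """Mean after dropping floor(n * percent / 100) values from each tail."""
    ordered = sorted(values)
    k = len(ordered) * percent // 100  # integer floor, avoids float rounding
    kept = ordered[k:len(ordered) - k] if k else ordered
    return statistics.fmean(kept)


# Assumed sample: rep-0 = 70.2 ms, nine warm reps at the rest median of 9.0 ms.
reps_ms = [70.2] + [9.0] * 9

plain_mean = statistics.fmean(reps_ms)
trim_5 = trimmed_mean(reps_ms, 5)    # removes 0 values per tail at N=10
trim_10 = trimmed_mean(reps_ms, 10)  # removes 1 value per tail at N=10
median = statistics.median(reps_ms)

print(f"mean={plain_mean:.2f}  trim5={trim_5:.2f}  trim10={trim_10:.2f}  median={median:.2f}")
```

With N=10 the 5% trim is a no-op, so only the 10% trim and the median discard the cold-start repetition.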

4.6 Ollama Prefill Scaling

Ollama shows fundamentally different scaling behavior -- sub-linear across the entire 512-32K range:

llama3.2-1b (Ollama, Q8_0)

Context Length	Mean (ms)	Median (ms)	Std (ms)	CV%	N
512	12.0	8.4	11.7	97%	10
1,024	16.2	8.3	25.1	155%	10
2,048	23.3	7.1	51.5	221%	10
4,096	41.8	7.3	109.5	262%	10
8,192	81.4	8.4	230.5	283%	10
16,384	168.4	9.3	504.0	299%	10
32,768	371.9	11.4	1,140.0	307%	10

Critical observation: Mean vs Median divergence. The medians are nearly flat (7-11 ms across 64x context growth) while means grow 31x (12->372 ms). This pattern -- high CV (97-307%), large mean-median gap -- is characteristic of cold-start outliers: 1-2 of the 10 repetitions take 10-100x longer than the rest, pulling the mean up dramatically. Despite 3 warmup reps, the first measured repetition often shows this pattern with Ollama, likely due to llama.cpp's internal lazy initialization or KV cache pre-allocation on first real inference.

The median is the reliable measure for Ollama. Median prefill time for llama3.2-1b grows from 8.4 ms (512) to 11.4 ms (32K) -- a 1.4x increase over 64x context growth. This is profoundly sub-linear (b ~ 0.08) and consistent with Flash Attention's O(n) memory access pattern on optimized hardware.

llama3.2-3b (Ollama, Q4_K_M)

Context Length	Mean (ms)	Median (ms)	Std (ms)	N
512	23.3	12.5	34.2	10
1,024	34.6	12.9	69.3	10
2,048	56.0	11.7	140.6	10
4,096	102.2	11.9	285.6	10
8,192	191.2	14.7	557.9	10
16,384	370.0	16.2	1,119.1	10
32,768	828.5	22.6	2,549.3	10

Same pattern: medians grow from 12.5 to 22.6 ms (1.8x over 64x context growth). The 3B model shows slightly more context sensitivity than the 1B model (median growth factor 1.8x vs 1.4x), consistent with more parameters requiring more memory bandwidth for attention computation.

qwen2.5-1.5b (Ollama, Q4_K_M)

Context Length	Mean (ms)	Median (ms)	Std (ms)	N
512	15.1	9.2	19.4	10
1,024	21.0	8.8	39.1	10
2,048	33.2	7.7	81.1	10
4,096	60.8	8.0	166.8	10
8,192	117.0	10.3	338.1	10
16,384	226.6	10.4	683.2	10
32,768	457.5	13.4	1,404.7	10

Median grows from 9.2 to 13.4 ms (1.5x over 64x context growth). This is the same model (qwen2.5-1.5b) that takes 532 seconds on HF at 16K tokens -- Ollama processes it in 10.4 ms (median). The difference is 51,000x at 16K tokens.

4.7 Scaling Exponent Summary

Backend	Model	Regime	Exponent (b)	Interpretation
HF	qwen2.5-0.5b	Pre-spillover	1.70	Superlinear (approaching quadratic)
HF	qwen2.5-1.5b	Pre-spillover	1.78	Superlinear (approaching quadratic)
HF	qwen2.5-3b	Pre-spillover	1.58	Superlinear (partially optimized)
HF	All	Post-spillover	4.6-6.7	VRAM thrashing artifact
Ollama	llama3.2-1b	Full range	0.08	Sub-linear (Flash Attention)
Ollama	llama3.2-3b	Full range	0.16	Sub-linear (Flash Attention)
Ollama	qwen2.5-1.5b	Full range	0.11	Sub-linear (Flash Attention)

Answer to Research Question 1: Yes, quadratic attention cost is empirically visible in the pre-spillover regime (b = 1.58-1.78). However, it is NOT the dominant bottleneck -- VRAM spillover causes 25-105x cliffs that completely dominate the scaling picture. Ollama's optimized attention eliminates quadratic scaling entirely.
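A scaling exponent of this kind can be estimated from the tabulated points with a log-log regression. The fitting procedure used for the table is not stated in this section, so the sketch below assumes ordinary least squares on log(latency) against log(context). Its exponents will differ somewhat from the tabulated values, but the regimes separate the same way.

```python
import math


def fit_power_law_exponent(contexts, latencies_ms):
    """Least-squares slope of log(latency) vs log(context), i.e. b in latency ~ n**b."""
    xs = [math.log(c) for c in contexts]
    ys = [math.log(t) for t in latencies_ms]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return sxy / sxx


# HF qwen2.5-0.5b mean prefill latency, pre-spillover points only (SS4.4).
hf_ctx = [512, 1024, 2048, 4096, 8192]
hf_ms = [52.6, 144.5, 441.2, 1452.3, 4723.6]

# Ollama llama3.2-1b median prefill latency, full range (SS4.6).
ol_ctx = [512, 1024, 2048, 4096, 8192, 16384, 32768]
ol_ms = [8.4, 8.3, 7.1, 7.3, 8.4, 9.3, 11.4]

b_hf = fit_power_law_exponent(hf_ctx, hf_ms)
b_ollama = fit_power_law_exponent(ol_ctx, ol_ms)
print(f"HF pre-spillover b ~ {b_hf:.2f}, Ollama full-range b ~ {b_ollama:.2f}")
```

A log-log fit weights every context length equally, whereas a fit in linear space is dominated by the largest latencies; that difference alone can shift the exponent by a tenth or more.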

4.8 Trimmed-Mean Robustness Analysis

To verify that scaling exponents are not artifacts of outliers, we re-fit the power law using trimmed means instead of medians. With N=10 repetitions per context length:

5% trim = floor(10 x 0.05) = 0 values removed per tail -> identical to untrimmed mean
10% trim = floor(10 x 0.10) = 1 value removed per tail -> removes extreme rep

Model	Backend	Median b	Trim 5% b	Trim 10% b	Stable?
qwen2.5-0.5b	HF	6.645	6.640	6.642	Yes
qwen2.5-1.5b	HF	6.703	6.704	6.706	Yes
qwen2.5-3b	HF	4.625	4.640	4.630	Yes
llama3.2-1b	ollama	0.083	1.083	0.081	No (13.4x)
llama3.2-3b	ollama	0.158	1.063	0.159	No (6.7x)
qwen2.5-1.5b	ollama	0.109	0.976	0.111	No (8.8x)

Interpretation:

HF exponents are rock-solid: Median, 5%-trimmed, and 10%-trimmed fits all agree within 0.3%. The thrashing-dominated exponents are not outlier artifacts -- they reflect genuine VRAM spillover scaling.
Ollama exponents are median-robust but mean-fragile: The 5% trim (which removes 0 values from N=10) gives wildly inflated exponents (b ~ 1.0 instead of 0.1) because the cold-start rep-0 contaminates the mean at every context length. The 10% trim removes 1 value per tail, recovering the median-based exponent to within 0.002.
This confirms the cold-start finding (SS4.5): Ollama scaling analysis MUST use median or >=10% trimmed mean. Arithmetic mean produces exponents 7-13x too high due to a single cold-start outlier per group.

5. Decode Scaling Analysis

5.1 Decode Throughput vs Context Length

Decode throughput measures token generation speed. Each decode step must attend over the full KV cache from all prior tokens, so longer contexts increase per-step attention cost.

Ollama Decode Throughput (Clean Scaling)

Context	llama3.2-1b (tok/s)	llama3.2-3b (tok/s)	qwen2.5-1.5b (tok/s)
512	162.6	98.7	146.7
1,024	161.5	97.6	144.6
2,048	157.8	95.0	141.2
4,096	153.0	89.3	134.5
8,192	140.3	79.4	129.9
16,384	121.1	64.8	102.7
32,768	96.0	46.9	80.0

Degradation rates (512->32K):

Model	Start (tok/s)	End (tok/s)	Degradation	Factor
llama3.2-1b	162.6	96.0	-41.0%	1.69x slower
qwen2.5-1.5b	146.7	80.0	-45.5%	1.83x slower
llama3.2-3b	98.7	46.9	-52.5%	2.10x slower

Pattern: Decode throughput degradation is approximately linear in context length -- consistent with each decode step's attention cost growing linearly with KV cache size. The larger model (3B) shows more degradation (53%) than the smaller model (1B, 41%), likely because the 3B model's larger KV cache per token amplifies the per-step attention cost.

Production implication: At 32K context, decode speed drops to 47-96 tok/s (Ollama). For interactive applications requiring >=50 tok/s, the practical decode context limit is:

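llama3.2-1b: at least 32K tokens (96 tok/s at the longest context tested)
qwen2.5-1.5b: at least 32K tokens (80 tok/s at 32K, still above the threshold)
llama3.2-3b: ~16K tokens measured (65 tok/s); throughput falls to 47 tok/s at 32K, and linear interpolation between those two points puts the 50 tok/s crossing near 30K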
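The threshold crossing can be estimated from the Ollama decode table by interpolation. This is a minimal sketch that assumes throughput changes linearly with context length between adjacent measured points. The function name is illustrative, and the true curve between 16K and 32K was not measured.

```python
def crossing_context(contexts, tok_per_s, threshold):
    """Context length where throughput first drops below `threshold`.

    Interpolates linearly between adjacent measurements; returns None when
    throughput stays at or above the threshold over the whole measured range.
    """
    points = list(zip(contexts, tok_per_s))
    for (c0, t0), (c1, t1) in zip(points, points[1:]):
        if t0 >= threshold > t1:
            frac = (t0 - threshold) / (t0 - t1)
            return c0 + frac * (c1 - c0)
    return None


CONTEXTS = [512, 1024, 2048, 4096, 8192, 16384, 32768]
DECODE_TOK_S = {
    "llama3.2-1b": [162.6, 161.5, 157.8, 153.0, 140.3, 121.1, 96.0],
    "qwen2.5-1.5b": [146.7, 144.6, 141.2, 134.5, 129.9, 102.7, 80.0],
    "llama3.2-3b": [98.7, 97.6, 95.0, 89.3, 79.4, 64.8, 46.9],
}

crossings = {m: crossing_context(CONTEXTS, v, 50.0) for m, v in DECODE_TOK_S.items()}
for model, ctx in crossings.items():
    label = "not reached within 32K" if ctx is None else f"~{ctx:,.0f} tokens"
    print(f"{model}: 50 tok/s crossing {label}")
```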

HF Decode Throughput (Pre-Spillover + Thrashing)

Context	qwen2.5-0.5b (tok/s)	qwen2.5-1.5b (tok/s)	qwen2.5-3b (tok/s)
512	48.9	42.1	27.2
1,024	47.9	41.0	28.2
2,048	48.1	45.2	26.6
4,096	49.3	30.5	17.9
8,192	40.9	19.3	0.9
16,384	23.3	2.1	OOM

Observations:

HF decode is 3-4x slower than Ollama even pre-spillover. At 512 tokens: HF qwen2.5-1.5b = 42 tok/s vs Ollama qwen2.5-1.5b = 147 tok/s. This is consistent with TR126's finding that Ollama dominates decode due to quantized weights and optimized C++ KV-cache implementation.
Catastrophic decode collapse at spillover: qwen2.5-3b decode goes from 17.9 tok/s (4K) to 0.9 tok/s (8K) -- a 20x drop. At 0.9 tok/s, generating 128 tokens takes 145 seconds. qwen2.5-1.5b drops from 19.3 tok/s (8K) to 2.1 tok/s (16K).
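Decode thrashing is severe in absolute terms because each decode step attends over the entire KV cache, part of which must now be read from system RAM. In relative terms, however, the decode slowdown at spillover (roughly 9-20x in throughput here) is smaller than the 25-105x prefill cliff; SS5.3 quantifies this.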

5.2 Decode Scaling Fits

Model	Backend	Exponent (b)	R^2	Better Fit	Interpretation
llama3.2-1b	ollama	0.160	0.822	linear	Gentle linear degradation
llama3.2-3b	ollama	0.063	0.157	linear	Nearly flat (model-internal overhead dominates)
qwen2.5-0.5b	HF	0.246	0.641	linear	Mild degradation, last point enters spillover
qwen2.5-1.5b	ollama	0.149	0.839	linear	Gentle linear degradation
qwen2.5-1.5b	HF	3.018	0.986	power_law	VRAM thrashing at 16K
qwen2.5-3b	HF	4.247	0.996	power_law	VRAM thrashing at 8K

The HF decode exponents (b = 3.0, 4.2) are thrashing artifacts, same as prefill.

5.3 Pre-Thrashing Decode Fits (Two-Regime Analysis)

Like prefill (SS4), decode latency also exhibits two regimes. Pre-spillover decode exponents quantify the true KV-cache lookup cost scaling:

Model	Backend	Clean Points	Exponent (b)	R^2 (power)	Thrashing At	Thrashing Mult
qwen2.5-0.5b	HF	5 (512-8K)	0.053	0.463	16,384	8.8x
qwen2.5-1.5b	HF	5 (512-8K)	0.356	0.788	16,384	7.3x
qwen2.5-3b	HF	4 (512-4K)	0.230	0.692	8,192	6.9x

Interpretation:

Pre-spillover decode is nearly flat (b = 0.05-0.36). This means each decode step's KV-cache lookup cost grows very slowly with context length -- consistent with hardware-optimized attention that reads the KV cache efficiently. The 0.5B model is essentially constant (b = 0.05), plausibly because its theoretical KV cache is the smallest of the three (12 KB/token, about 96 MB at 8K tokens; SS6.4), so per-step cache reads stay cheap at all pre-spillover contexts. Cache residency in GPU L2 was not measured.
Decode thrashing multipliers (7-9x) are lower than prefill (25-105x). Decode generates one token at a time (reading KV cache + writing one K/V pair), while prefill processes the entire sequence in one pass. The smaller working set per decode step means less data needs to page from system RAM.
The R^2 values are modest (0.46-0.79) because pre-spillover decode latency has very little variation -- the signal (scaling) is tiny relative to measurement noise. This is actually a positive finding: decode latency is remarkably stable across 512-8K context lengths.

5.4 Decode Trimmed-Mean Robustness

Model	Backend	Median b	Trim 5% b	Trim 10% b	Stable?
qwen2.5-0.5b	HF	0.246	0.253	0.249	Yes
qwen2.5-1.5b	HF	3.018	3.018	3.023	Yes
qwen2.5-3b	HF	4.247	4.247	4.244	Yes
llama3.2-1b	ollama	0.160	0.142	0.162	Yes
llama3.2-3b	ollama	0.063	0.103	0.062	Yes
qwen2.5-1.5b	ollama	0.149	0.130	0.150	Yes

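Decode scaling exponents are stable across all trimming levels in absolute terms (every shift is at most 0.04, although for llama3.2-3b that is a large relative change on a near-zero exponent) -- the cold-start issue primarily affects prefill (where the initial KV cache allocation occurs), not decode (which reuses the already-allocated cache).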

6. Memory Scaling (CUDA Allocation)

This section answers Research Question 2: At what context length does VRAM become the bottleneck per model?

Important: Peak VRAM values are torch.cuda.max_memory_allocated(), which includes CUDA Unified Memory spillover to system RAM. Values exceeding 12,288 MB (12 GB physical VRAM) indicate system-memory paging -- the root cause of performance cliffs. Ollama does not expose VRAM metrics, so this analysis covers HF models only.

6.1 VRAM Growth Summary

Model	Slope (MB/token)	KV Cost (B/token)	R^2	Spillover At	OOM Cliff
qwen2.5-0.5b	2.11	1,164,226	0.940	16,384	32,768
qwen2.5-1.5b	1.85	1,034,253	0.942	16,384	32,768
qwen2.5-3b	1.32	752,846	0.951	8,192	16,384

KV cache cost per token is derived from the VRAM growth slope: slope_MB_per_token x 1024 x 1024 / 2 (the /2 converts from bytes to the key+value pair). These represent the all-in cost including PyTorch allocator overhead, activation memory, and KV cache tensors.

6.2 Per-Model VRAM Curves

qwen2.5-0.5b -- CUDA Allocation

Context Length	Peak Alloc (MB)	In GPU?	Spillover (GB)	Alloc/VRAM Ratio
512	1,122	Yes	--	0.09x
1,024	1,282	Yes	--	0.10x
2,048	1,603	Yes	--	0.13x
4,096	3,175	Yes	--	0.26x
8,192	9,553	Yes	--	0.78x
16,384	34,787	NO	22.0 GB	2.83x

Observations: VRAM grows 3.6x from 8K->16K (9.6 GB -> 34.8 GB) while context only doubles. This superlinear VRAM growth is consistent with quadratic attention requiring temporary activation memory proportional to n^2. The 22 GB spillover means the GPU is borrowing 22 GB from system RAM -- every attention operation must fetch most of its data over PCIe.

qwen2.5-1.5b -- CUDA Allocation

Context Length	Peak Alloc (MB)	In GPU?	Spillover (GB)	Alloc/VRAM Ratio
512	3,132	Yes	--	0.25x
1,024	3,305	Yes	--	0.27x
2,048	3,689	Yes	--	0.30x
4,096	5,013	Yes	--	0.41x
8,192	10,654	Yes	--	0.87x
16,384	32,690	NO	19.9 GB	2.66x

Observations: The 1.5B model starts with a 3.1 GB base footprint (vs 1.1 GB for 0.5B). Despite the higher base, it spills at the same context length (16K) because its fitted per-token memory slope is slightly lower (1.85 vs 2.11 MB/token) and context lengths are only sampled at 2x steps. At 8K tokens, it allocates 10,654 MB (87% of VRAM) -- close to the limit but still resident on the GPU, which is why the 8K measurement remains clean (CV 0.2%).

qwen2.5-3b -- CUDA Allocation

Context Length	Peak Alloc (MB)	In GPU?	Spillover (GB)	Alloc/VRAM Ratio
512	6,190	Yes	--	0.50x
1,024	6,370	Yes	--	0.52x
2,048	6,768	Yes	--	0.55x
4,096	8,717	Yes	--	0.71x
8,192	16,167	NO	3.8 GB	1.32x

Observations: The 3B model's 6.2 GB base footprint consumes half the VRAM before any context is processed. At 4K tokens, 8.7 GB is allocated (71% of VRAM). By 8K, the allocation (16.2 GB) exceeds physical VRAM, triggering the 25x thrashing cliff. The spillover is "only" 3.8 GB (vs 20-22 GB for the smaller models) because the model hits OOM before context grows further.
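The "In GPU?", "Spillover (GB)" and "Alloc/VRAM Ratio" columns in the tables above follow from the peak allocation alone. The sketch below assumes the 12,288 MB physical limit stated in this section and a 1,024 MB-per-GB conversion for the spillover column (an assumption that reproduces the tabulated values). The function name is illustrative.

```python
PHYSICAL_VRAM_MB = 12_288  # 12 GB physical VRAM, as stated in this section


def spillover_summary(peak_alloc_mb):
    """Derive residency, spillover size and allocation ratio from peak allocation."""
    over_mb = max(0, peak_alloc_mb - PHYSICAL_VRAM_MB)
    return {
        "in_gpu": over_mb == 0,
        "spill_gb": round(over_mb / 1024, 1),
        "ratio": round(peak_alloc_mb / PHYSICAL_VRAM_MB, 2),
    }


# Peak allocations (MB) at the largest measured context per model.
PEAKS = {
    "qwen2.5-0.5b @16K": 34_787,
    "qwen2.5-1.5b @16K": 32_690,
    "qwen2.5-3b @8K": 16_167,
    "qwen2.5-0.5b @8K": 9_553,
}

for label, peak in PEAKS.items():
    print(label, spillover_summary(peak))

# Growth in peak allocation for qwen2.5-0.5b when context doubles from 8K to 16K.
growth_8k_to_16k = PEAKS["qwen2.5-0.5b @16K"] / PEAKS["qwen2.5-0.5b @8K"]
print(f"8K -> 16K allocation growth: {growth_8k_to_16k:.1f}x")
```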

6.3 VRAM Budget Model

The practical context budget can be computed as:

max_context_tokens ~ (GPU_VRAM_MB - model_base_MB) / slope_MB_per_token

Model	Base (MB)	Available (MB)	Slope (MB/tok)	Max Context (theoretical)	Max Context (observed)
qwen2.5-0.5b	1,122	11,166	2.11	~5,290	8,192 (pre-spillover)
qwen2.5-1.5b	3,132	9,156	1.85	~4,950	8,192 (pre-spillover)
qwen2.5-3b	6,190	6,098	1.32	~4,620	4,096 (pre-spillover)

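The theoretical maximum is where base + slope x context = 12,288 MB. It is a conservative estimate: VRAM growth is superlinear (SS6.2), and the fitted slope appears to be dominated by the largest, post-spillover allocation, so the linear model overestimates memory at shorter contexts (for qwen2.5-0.5b it predicts about 18.4 GB at 8K tokens against 9.6 GB measured). That is why the 0.5B and 1.5B models still fit at 8,192 tokens despite theoretical limits near 5K. For qwen2.5-3b the observed maximum (4,096) sits below the theoretical ~4,620 only because context levels jump by 2x and the next level tested is 8,192.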
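The budget formula is simple enough to evaluate directly. The sketch below applies it to the base footprints and slopes from the table; the function and dictionary names are illustrative.

```python
GPU_VRAM_MB = 12_288


def max_context_tokens(model_base_mb, slope_mb_per_token, gpu_vram_mb=GPU_VRAM_MB):
    """Context length at which base + slope * context reaches physical VRAM."""
    return (gpu_vram_mb - model_base_mb) / slope_mb_per_token


# (base footprint in MB, fitted slope in MB/token) from SS6.1 and SS6.2.
MODELS = {
    "qwen2.5-0.5b": (1_122, 2.11),
    "qwen2.5-1.5b": (3_132, 1.85),
    "qwen2.5-3b": (6_190, 1.32),
}

budgets = {name: max_context_tokens(base, slope) for name, (base, slope) in MODELS.items()}
for name, tokens in budgets.items():
    print(f"{name}: ~{tokens:,.0f} tokens under the linear model")
```

Because the formula is linear, any error in the fitted slope maps directly onto the budget, so the result is best treated as a lower bound for planning rather than a prediction of where spillover begins.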

6.4 KV Cache Cross-Validation with TR123 Theory

TR123 computed theoretical KV cache costs from model architecture:

kv_cost_per_token = num_layers x num_kv_heads x head_dim x precision_bytes x 2 (keys + values)

We now have empirical VRAM growth slopes to cross-validate against TR123's theoretical predictions:

Model	Architecture	Theoretical KV (B/tok)	Empirical Slope (B/tok)	Overhead Ratio
qwen2.5-0.5b	24L x 2KV x 64d	12,288	1,164,226	94.7x
qwen2.5-1.5b	28L x 2KV x 128d	28,672	1,034,253	36.1x
qwen2.5-3b	36L x 2KV x 128d	36,864	752,846	20.4x
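The theoretical column and the overhead ratio can be recomputed from the architecture figures in the table. The sketch assumes FP16 storage (2 bytes per element), matching the HF configuration in this report. The function and dictionary names are illustrative.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, precision_bytes=2):
    """KV cache bytes per token: layers x KV heads x head dim x bytes x 2 (K and V)."""
    return num_layers * num_kv_heads * head_dim * precision_bytes * 2


# (layers, KV heads, head dim) and the empirical B/token figure from the table above.
MODELS = {
    "qwen2.5-0.5b": {"arch": (24, 2, 64), "empirical": 1_164_226},
    "qwen2.5-1.5b": {"arch": (28, 2, 128), "empirical": 1_034_253},
    "qwen2.5-3b": {"arch": (36, 2, 128), "empirical": 752_846},
}

results = {}
for name, spec in MODELS.items():
    theory = kv_bytes_per_token(*spec["arch"])
    results[name] = (theory, spec["empirical"] / theory)
    print(f"{name}: theoretical {theory:,} B/token, overhead {results[name][1]:.1f}x")
```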

Why is the overhead so large (20-95x)?

The empirical VRAM slope captures all memory that grows with context length, not just the KV cache:

KV cache tensors -- the theoretical minimum (12-37 KB/token)
Attention workspace -- temporary attention score matrices, softmax buffers, output projections. For self-attention with n tokens, this includes n x n attention matrices per head per layer
Activation memory -- intermediate tensors during the forward pass that scale with sequence length
CUDA allocator fragmentation -- PyTorch's caching allocator rounds allocations to block sizes and maintains free lists, causing 10-30% overhead
Gradient/state buffers -- even in inference mode, PyTorch may allocate temporary buffers

The trend is informative: The overhead ratio decreases with model size (95x -> 36x -> 20x). This is expected because GQA's aggressive head sharing (only 2 KV heads for all three Qwen models) makes the theoretical KV cost tiny relative to other context-dependent memory. The 0.5B model's theoretical KV cost is only 12 KB/token -- essentially negligible compared to the ~1.1 MB/token of attention workspace and activations. For the 3B model, the theoretical KV cost is 3x larger (37 KB/token), so it represents a slightly larger fraction of the total.

Conclusion: The empirical VRAM slope is NOT a KV cache cost estimator -- it's a total context-dependent memory estimator. To isolate true KV cache cost, one would need to measure VRAM growth while holding attention workspace constant (e.g., by varying only the KV cache size with a fixed-length prompt). This reinterpretation corrects the v1 claim that "KV cache costs 0.75-1.16 MB/token" -- the correct statement is "total context-dependent VRAM grows at 0.75-1.16 MB/token, of which KV cache is 1-5% and the remainder is attention workspace, activations, and allocator overhead."

7. Time to First Token (TTFT) Analysis

This section answers Research Question 3: How does TTFT scale with context length?

TTFT equals prefill latency -- the time from receiving a prompt to producing the first output token. For interactive applications, acceptable TTFT thresholds are typically:

Excellent: < 200 ms (imperceptible)
Good: < 1,000 ms (noticeable but acceptable)
Poor: < 5,000 ms (frustrating)
Unacceptable: >= 5,000 ms (broken experience, especially above 10,000 ms)

7.1 Threshold Crossings

Model	Backend	>1s (Good->Poor)	>5s (Poor->Unacceptable)	>10s
llama3.2-1b	ollama	Never	Never	Never
llama3.2-3b	ollama	Never	Never	Never
qwen2.5-0.5b	HF	4,096	16,384*	16,384*
qwen2.5-1.5b	ollama	Never	Never	Never
qwen2.5-1.5b	HF	4,096	8,192	16,384*
qwen2.5-3b	HF	4,096	8,192	8,192

*At 16K, TTFT is 7.9 minutes (qwen2.5-0.5b) and 8.9 minutes (qwen2.5-1.5b) due to VRAM thrashing.
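The crossing points in the table can be derived mechanically from the mean prefill latencies in SS4.4. A minimal sketch follows; the helper name is illustrative, and a crossing is reported at the first measured context length whose TTFT exceeds the threshold.

```python
def first_crossing(ttft_ms_by_context, threshold_ms):
    """Smallest measured context length whose TTFT exceeds the threshold, else None."""
    for ctx in sorted(ttft_ms_by_context):
        if ttft_ms_by_context[ctx] > threshold_ms:
            return ctx
    return None


# HF mean prefill latency (ms) per context length, from SS4.4.
HF_TTFT_MS = {
    "qwen2.5-0.5b": {512: 52.6, 1024: 144.5, 2048: 441.2, 4096: 1452.3,
                     8192: 4723.6, 16384: 473789.2},
    "qwen2.5-1.5b": {512: 108.8, 1024: 243.3, 2048: 507.1, 4096: 1436.4,
                     8192: 5086.0, 16384: 532890.8},
    "qwen2.5-3b": {512: 164.0, 1024: 341.9, 2048: 756.8, 4096: 2432.3,
                   8192: 61981.8},
}

THRESHOLDS_MS = (1_000, 5_000, 10_000)
crossings = {
    model: [first_crossing(series, t) for t in THRESHOLDS_MS]
    for model, series in HF_TTFT_MS.items()
}
for model, row in crossings.items():
    print(model, dict(zip(THRESHOLDS_MS, row)))
```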

7.2 TTFT Growth Factors

Model	Backend	512->32K TTFT	Growth Factor	Context Growth	Scaling
llama3.2-1b	ollama	8.4->11.4 ms (median)	1.4x	64x	Sub-linear
llama3.2-3b	ollama	12.5->22.6 ms (median)	1.8x	64x	Sub-linear
qwen2.5-1.5b	ollama	9.2->13.4 ms (median)	1.5x	64x	Sub-linear
qwen2.5-0.5b	HF	52.6->473,789 ms	9,004x	32x*	Thrashing
qwen2.5-1.5b	HF	108.8->532,891 ms	4,898x	32x*	Thrashing
qwen2.5-3b	HF	164.0->61,982 ms	378x	16x*	Thrashing

*HF models OOM before reaching 32K, so growth factors use the maximum measured context length.

7.3 Production Implications

Ollama maintains sub-second TTFT at all context lengths tested (up to 32K) by both median and mean; only the cold-start rep-0 exceeds one second at long contexts (SS4.5). Even the largest model (llama3.2-3b) produces first tokens in 23 ms (median) at 32K context -- well within the "excellent" threshold.

HF crosses the 1-second TTFT threshold at 4K tokens for ALL models. This means that for any interactive application processing more than ~4K tokens of context on a 12 GB consumer GPU, HF transformers in FP16 is not viable. At 8K tokens, TTFT ranges from 4.7 seconds (qwen2.5-0.5b) to 62 seconds (qwen2.5-3b, in thrashing regime); at 16K it reaches 7.9-8.9 minutes for the two models that still run.

For TTFT-sensitive applications, Ollama is the most viable backend tested on consumer hardware at long contexts. The roughly 51,000x TTFT difference at 16K tokens (10 ms Ollama median vs 533 seconds HF) is not a performance difference -- it's a qualitative capability gap.

8. Backend Comparison (HF vs Ollama)

Direct comparison is possible for qwen2.5-1.5b, which ran on both backends at 6 shared context lengths (512-16,384). All comparisons use Welch's t-test.

8.1 Prefill: Ollama Dominates at All Contexts

Context	HF Mean (ms)	Ollama Mean (ms)	Diff (ms)	% Change	p-value	Cohen's d
512	108.8	15.1	-93.7	-86.1%	1.08e-11	-6.78
1,024	243.3	21.0	-222.3	-91.4%	6.76e-13	-7.99
2,048	507.1	33.2	-473.8	-93.4%	5.12e-13	-8.12
4,096	1,436.4	60.8	-1,375.7	-95.8%	1.82e-14	-9.85
8,192	5,086.0	117.0	-4,969.0	-97.7%	3.39e-20	-20.77
16,384	532,890.8	226.6	-532,664.2	-100.0%	5.37e-28	-56.53

All 6 comparisons are significant (p <= 1.08 x 10^-11). Effect sizes are massive (|d| = 6.78 to 56.53) and grow monotonically with context length. By mean latency, Ollama is 7.2x faster at 512 tokens and 2,352x faster at 16K tokens.
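The effect sizes can be cross-checked from the per-backend means and standard deviations in SS4.4 and SS4.6. The sketch assumes Cohen's d was computed with an equal-weight pooled standard deviation (both groups have N=10). That assumption is not stated in the report, but it reproduces the tabulated d values to within rounding.

```python
import math


def cohens_d(mean_a, std_a, mean_b, std_b):
    """Cohen's d for two equal-sized groups, using the pooled standard deviation."""
    pooled_std = math.sqrt((std_a ** 2 + std_b ** 2) / 2)
    return (mean_b - mean_a) / pooled_std


# qwen2.5-1.5b prefill: context -> (HF mean, HF std, Ollama mean, Ollama std), in ms.
PREFILL_STATS = {
    512: (108.8, 2.6, 15.1, 19.4),
    1024: (243.3, 4.1, 21.0, 39.1),
    2048: (507.1, 15.6, 33.2, 81.1),
    4096: (1436.4, 105.9, 60.8, 166.8),
    8192: (5086.0, 10.2, 117.0, 338.1),
    16384: (532890.8, 13308.7, 226.6, 683.2),
}

effect_sizes = {ctx: cohens_d(*stats) for ctx, stats in PREFILL_STATS.items()}
for ctx, d in effect_sizes.items():
    print(f"{ctx:>6} tokens: d = {d:.2f}")
```

The negative sign follows the table's convention of Ollama minus HF.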

Why does the gap widen? HF prefill scales superlinearly (b ~ 1.78) while Ollama scales sub-linearly (b ~ 0.11), so pre-spillover the HF/Ollama ratio itself grows as a power law, roughly n^1.67. At the spillover threshold (16K), HF enters the 105x thrashing regime while Ollama's median stays near 10 ms -- the gap becomes astronomical.

8.2 Decode: Ollama 3-49x Faster

Context	HF Mean (ms)	Ollama Mean (ms)	% Change	Cohen's d
512	3,038.2	872.9	-71.3%	-68.83
1,024	3,125.1	851.9	-72.7%	-29.76
2,048	2,834.3	906.3	-68.0%	-43.90
4,096	4,194.9	951.5	-77.3%	-31.41
8,192	6,628.3	985.9	-85.1%	-323.43
16,384	61,169.1	1,252.7	-98.0%	-55.24

Ollama decode is 3.1-4.4x faster at 512-4K tokens, 6.7x faster at 8K, and 48.8x faster at 16K, where HF decode is thrashing. The short-context ratio is consistent with TR126's finding that Ollama's quantized KV-cache decode is ~3-7x faster than eager HF.

8.3 E2E: Ollama Dominates

Context	HF Mean (ms)	Ollama Mean (ms)	% Change	Cohen's d
512	3,147.0	888.0	-71.8%	-63.66
1,024	3,368.4	872.9	-74.1%	-50.76
2,048	3,341.4	939.6	-71.9%	-33.69
4,096	5,631.3	1,012.3	-82.0%	-23.33
8,192	11,714.3	1,103.0	-90.6%	-47.10
16,384	594,059.9	1,479.3	-99.8%	-56.96

18 out of 18 pairwise comparisons (3 modes x 6 context lengths) show Ollama significantly faster (p < 0.05). There is no context length or mode where HF outperforms Ollama on this hardware.

8.4 Multiple Comparison Correction

With 18 pairwise tests, family-wise error rate must be controlled. We apply both Bonferroni (conservative) and Holm-Bonferroni (step-down, less conservative) corrections:

Correction	Threshold	Significant Tests	Survival Rate
Uncorrected	p < 0.050	18/18	100%
Bonferroni	p < 0.0028	18/18	100%
Holm-Bonferroni	Stepwise	18/18	100%

All 18 comparisons survive both corrections. The smallest p-value is 5.37 x 10^-28 (prefill at 16K), and the largest prefill p-value in SS8.1 is 1.08 x 10^-11 (512 tokens) -- still orders of magnitude below the Bonferroni threshold (0.0028). These are not marginal effects inflated by multiple testing -- they are genuine, massive differences.
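Both corrections are short to implement. The sketch below applies them to the six prefill p-values from SS8.1 with the family size set to 18; the decode and E2E p-values are not listed in this report, so they are omitted. Passing a family size larger than the list makes the Holm step-down more conservative than the full 18-test procedure would be. Function names are illustrative.

```python
def bonferroni(p_values, alpha=0.05, family_size=None):
    """Reject each hypothesis whose p-value is below alpha / m."""
    m = family_size or len(p_values)
    return [p < alpha / m for p in p_values]


def holm(p_values, alpha=0.05, family_size=None):
    """Holm step-down: test sorted p-values against alpha / (m - rank), stop at first failure."""
    m = family_size or len(p_values)
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    reject = [False] * len(p_values)
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # all larger p-values are retained as well
    return reject


# Prefill p-values from SS8.1 (512 through 16,384 tokens).
prefill_p = [1.08e-11, 6.76e-13, 5.12e-13, 1.82e-14, 3.39e-20, 5.37e-28]

bonf_result = bonferroni(prefill_p, family_size=18)
holm_result = holm(prefill_p, family_size=18)
print("Bonferroni threshold:", round(0.05 / 18, 4))
print("Bonferroni survivors:", sum(bonf_result), "Holm survivors:", sum(holm_result))
```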

This is consistent with the effect sizes (Cohen's d = 6.78 to 323.43) being in the "very large" category. The backend performance gap is so dramatic that statistical correction is confirmatory rather than revelatory.

8.5 ANOVA Interaction: Context Length x Backend

Does the effect of backend depend on context length? A significant interaction would mean Ollama's advantage isn't constant -- it changes with context.

qwen2.5-1.5b -- Prefill (the only model on both backends):

Effect	F-statistic	p-value	Significant
Backend (main)	14.20	0.00025	Yes
Context length (main)	(see per-context)	< 0.001	Yes
Interaction evidence	--	--	Strong (magnitude change)

Per-context backend test (prefill):

Context	HF Mean (ms)	Ollama Mean (ms)	t-stat	p-value	Sig?
512	108.8	15.1	-20.92	<0.001	Yes
1,024	243.3	21.0	-32.32	<0.001	Yes
2,048	507.1	33.2	-18.95	<0.001	Yes
4,096	1,436.4	60.8	-9.13	<0.001	Yes
8,192	5,086.0	117.0	-39.22	<0.001	Yes
16,384	532,890.8	226.6	-56.53	<0.001	Yes

Interaction interpretation: The interaction is classified as "strong (magnitude change)" -- the backend effect is significant at every context length, but the absolute gap changes dramatically:

At 512 tokens: |HF - Ollama| = 93.7 ms (HF is 7.2x slower)
At 16,384 tokens: |HF - Ollama| = 532,664 ms (HF is 2,352x slower)

The gap widens 5,687x over 32x context growth. This is because HF enters the VRAM thrashing regime while Ollama does not -- creating a multiplicative interaction between backend and context length. A formal interaction F-statistic is not reported here; it would be expected to be massive, and the per-context evidence in SS8.1 is already definitive.

Decode ANOVA (qwen2.5-1.5b):

The decode interaction is also strong (F_backend = 23.42, p < 0.001), with the gap widening from 3.5x at 512 tokens to 48.8x at 16K tokens as HF decode enters the thrashing regime.

9. Outlier Analysis

9.1 Detection Method

Outliers are detected using Tukey's IQR fence per context length within each (model, backend, mode) group. This prevents false positives from pooling measurements across heterogeneous regimes (e.g., 8 ms Ollama at 512 tokens vs 533,000 ms HF at 16K tokens would make everything an "outlier" under global IQR).
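A per-group Tukey fence can be sketched as below. The quartile interpolation method is not specified in this report, so Python's "inclusive" method is assumed. The warm-repetition values are hypothetical stand-ins (raw repetitions are not listed); only the rep-0 values come from SS4.5.

```python
import statistics


def tukey_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR] for a single measurement group."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# One entry per (model, backend, mode, context) group; warm reps are hypothetical.
GROUPS = {
    ("qwen2.5-1.5b", "ollama", "prefill", 512):
        [70.2, 8.5, 8.7, 8.8, 9.0, 9.0, 9.1, 9.3, 9.4, 9.6],
    ("qwen2.5-1.5b", "ollama", "prefill", 32768):
        [4104.5, 12.8, 13.0, 13.2, 13.4, 13.4, 13.5, 13.7, 13.9, 14.1],
}

flagged = {key: tukey_outliers(vals) for key, vals in GROUPS.items()}
for key, outliers in flagged.items():
    print(key, "->", outliers)
```

Fencing each group separately keeps the 512-token and 32K-token distributions from widening each other's IQR.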

9.2 Summary

Metric	Value
Total measurements	1,140 (ok status)
Total outliers (per-context IQR)	116
Overall outlier rate	10.2%

9.3 Per-Model Outlier Rates

Model	Backend	Mode	N	Outliers	Rate
llama3.2-1b	ollama	prefill	70	8	11.4%
llama3.2-1b	ollama	decode	70	7	10.0%
llama3.2-1b	ollama	e2e	70	9	12.9%
llama3.2-3b	ollama	prefill	70	13	18.6%
llama3.2-3b	ollama	decode	70	14	20.0%
llama3.2-3b	ollama	e2e	70	11	15.7%
qwen2.5-0.5b	HF	prefill	60	2	3.3%
qwen2.5-0.5b	HF	decode	60	4	6.7%
qwen2.5-0.5b	HF	e2e	60	5	8.3%
qwen2.5-1.5b	ollama	prefill	70	8	11.4%
qwen2.5-1.5b	ollama	decode	70	5	7.1%
qwen2.5-1.5b	ollama	e2e	70	8	11.4%
qwen2.5-1.5b	HF	prefill	60	5	8.3%
qwen2.5-1.5b	HF	decode	60	3	5.0%
qwen2.5-1.5b	HF	e2e	60	3	5.0%
qwen2.5-3b	HF	prefill	50	5	10.0%
qwen2.5-3b	HF	decode	50	3	6.0%
qwen2.5-3b	HF	e2e	50	3	6.0%

9.4 Pattern Analysis

Ollama models have higher outlier rates (7-20%) than HF models (3-10%). This is consistent with the cold-start outlier pattern identified in SS4.5: despite 3 warmup reps, 1-2 measured reps per context length show elevated latency (likely from llama.cpp internal lazy initialization). The llama3.2-3b model has the highest outlier rate (15-20% across modes), suggesting the 3B model's larger internal state takes more cold-start overhead.

HF models show very low outlier rates (3-6%) pre-spillover. The outliers that do exist are concentrated at context lengths near the spillover boundary (e.g., 4K tokens for qwen2.5-1.5b where CV jumped to 7.4%).

Impact on conclusions: The high Ollama outlier rate is why we report medians alongside means for Ollama measurements. Medians are robust to outliers and provide the reliable central tendency. HF means and medians agree closely pre-spillover (mean/median ratio 1.00-1.04, SS9.5), so the mean is reliable for HF even though CV reaches 7-14% for a few groups (SS4.4).

9.5 Distribution Shape Analysis

To formally justify when mean vs median should be used, we analyze the distribution shape of each measurement group:

Model	Backend	Mode	Pooled Skewness	Mean/Median Ratio	Interpretation
qwen2.5-0.5b	HF	prefill	1.75	1.00	Symmetric (thrashing pulls right)
qwen2.5-1.5b	HF	prefill	2.28	1.00	Mild right skew
qwen2.5-3b	HF	prefill	1.37	1.04	Near-symmetric
llama3.2-1b	ollama	prefill	6.36	12.15	Extreme right skew
llama3.2-3b	ollama	prefill	6.02	11.09	Extreme right skew
qwen2.5-1.5b	ollama	prefill	6.08	10.83	Extreme right skew
All Ollama	ollama	decode	0.5-1.0	1.01-1.11	Mild right skew
All HF	HF	decode	0.6-1.8	1.00-1.03	Near-symmetric

Key finding: Mean/median ratio >2.0 indicates severe skew where the arithmetic mean is unreliable. All 3 Ollama prefill groups show ratios of roughly 11-12x -- the mean is an order of magnitude larger than the median. This is entirely caused by cold-start rep-0 outliers (SS4.5): one value up to ~300x larger than the rest inflates the mean dramatically.

Shapiro-Wilk normality test: Per-context-length tests show most HF pre-spillover groups pass normality (p > 0.05), confirming that parametric tests (t-test, ANOVA) are valid for HF data. Ollama groups universally fail normality due to the cold-start rep-0.

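Implication for all statistical tests: The t-test p-values in SS8 use means, but since the effects are so massive (|d| > 6), the non-normality of Ollama data does not affect conclusions. A non-parametric test (Mann-Whitney U) would be expected to reach the same verdict, though not the same p-values: with N=10 per group and completely separated samples, its exact two-sided p-value bottoms out at 2/C(20,10), about 1.1 x 10^-5, which is still below the Bonferroni threshold of 0.0028.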

10. Power Analysis

10.1 Aggregate Power

Parameter	Value
Repetitions	10
Alpha	0.05
Power	0.80
Min detectable Cohen's d (z-based)	1.253
Min detectable Cohen's d (t-based)	0.94
Sensitivity	Large effects only

With N=10 repetitions, this experiment can reliably detect only large effects (d > 0.94 by the t-based estimate, d > 1.25 by the z-based estimate that the MDE columns in SS10.2-10.3 use). Small or medium effects between context lengths or backends may be missed. However, the observed effects in TR127 are massive (|d| > 6 for backend comparisons, thrashing multipliers of 25-105x), so the limited power does not affect the primary findings.
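The z-based figure in the table follows from the standard two-sample normal approximation. The sketch below assumes a two-sided test with equal group sizes. Multiplying the resulting d by a group's standard deviation gives a minimum detectable effect in milliseconds. Function names are illustrative.

```python
from statistics import NormalDist


def min_detectable_d(n_per_group, alpha=0.05, power=0.80):
    """Two-sample, two-sided minimum detectable Cohen's d (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return (z_alpha + z_power) * (2 / n_per_group) ** 0.5


def min_detectable_ms(std_ms, n_per_group=10):
    """Minimum detectable difference in ms for a group with the given std."""
    return min_detectable_d(n_per_group) * std_ms


d_min = min_detectable_d(10)
print(f"z-based minimum detectable d at N=10: {d_min:.3f}")
print(f"MDE for std = 475.9 ms: {min_detectable_ms(475.9):.0f} ms")
```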

10.2 Stratified Power Per Model x Backend

Model	Backend	Pooled Std (ms)	Min Detectable (ms)	N
llama3.2-1b	ollama	475.9	596	70
llama3.2-3b	ollama	1,067.3	1,337	70
qwen2.5-0.5b	HF	177,568	222,476	60
qwen2.5-1.5b	ollama	600.4	752	70
qwen2.5-1.5b	HF	199,793	250,322	60
qwen2.5-3b	HF	24,701	30,948	50

Caveat: The pooled std for HF models (25K-200K ms) is dominated by the post-spillover data points. These massive values make the global MDE meaningless for HF -- a 200,000 ms "minimum detectable effect" is not useful for interpreting pre-spillover differences.

10.3 Measurement Precision (CV% by Context Length)

The meaningful power metric is per-context-length:

HF (pre-spillover) -- Extremely precise:

Model	Context	Std (ms)	CV%	MDE (ms)
qwen2.5-0.5b	512	0.7	1.4%	0.9
qwen2.5-0.5b	4,096	4.8	0.3%	6.0
qwen2.5-0.5b	8,192	7.4	0.2%	9.3
qwen2.5-1.5b	512	2.6	2.4%	3.3
qwen2.5-1.5b	8,192	10.2	0.2%	12.8
qwen2.5-3b	4,096	10.5	0.4%	13.2

Interpretation: At N=10, HF can detect effects as small as 1-13 ms (MDE) within individual context lengths. This is outstanding precision -- meaningful for sub-percent comparisons within a context length.

Ollama -- High variance from cold-start:

Model	Context	Std (ms)	CV%	MDE (ms)
llama3.2-1b	512	11.7	97%	14.6
llama3.2-1b	8,192	230.5	283%	289
llama3.2-1b	32,768	1,140.0	307%	1,428
qwen2.5-1.5b	512	19.4	128%	24.3
qwen2.5-1.5b	32,768	1,404.7	307%	1,760

Interpretation: Ollama CV% is near or above 100% at every context length shown (97-307%). The MDE at 32K is 1,428-1,760 ms -- this experiment cannot detect sub-second Ollama differences at long contexts. However, the actual effects being measured (HF vs Ollama: thousands of ms) are far larger than the MDE.

11. Cross-Model Comparison

11.1 Scale Effect on Prefill (Pre-Spillover)

How does model size affect prefill latency at each context length?

Context	qwen2.5-0.5b (ms)	qwen2.5-1.5b (ms)	qwen2.5-3b (ms)	1.5B/0.5B Ratio	3B/0.5B Ratio
512	52.6	108.8	164.0	2.1x	3.1x
1,024	144.5	243.3	341.9	1.7x	2.4x
2,048	441.2	507.1	756.8	1.1x	1.7x
4,096	1,452.3	1,436.4	2,432.3	1.0x	1.7x
8,192	4,723.6	5,086.0	61,981.8*	1.1x	13.1x*

*qwen2.5-3b is in the thrashing regime at 8K.

Pattern: At short contexts (512), latency scales sub-linearly with model size (3.1x for 6x parameters). At longer contexts (4K), the ratio narrows to 1.7x as attention computation (which scales with context length, not model size) begins to dominate. At 4K, the 0.5B and 1.5B models have nearly identical latency (1,452 vs 1,436 ms) -- the extra parameters add negligible overhead when attention over 4K tokens is the bottleneck.

11.2 Scale Effect on Decode (Ollama)

Context	llama3.2-1b (tok/s)	qwen2.5-1.5b (tok/s)	llama3.2-3b (tok/s)
512	162.6	146.7	98.7
4,096	153.0	134.5	89.3
32,768	96.0	80.0	46.9

Decode throughput falls as model size grows at every context length: the 3B model is 1.6-2.0x slower than the 1B model (98.7 vs 162.6 tok/s at 512; 46.9 vs 96.0 tok/s at 32K). This is expected -- each decode step must read all model parameters plus the KV cache, and the 3B model has about 2.6x the parameters (3,213M vs 1,236M).

11.3 Context Budget by Model Size

Model	Params	Base VRAM	VRAM for KV	Spillover Context	OOM Context	Practical Budget
qwen2.5-0.5b	500M	1.1 GB	10.9 GB	16K	32K	8K (safe)
qwen2.5-1.5b	1.5B	3.1 GB	8.9 GB	16K	32K	8K (safe)
qwen2.5-3b	3.0B	6.2 GB	5.8 GB	8K	16K	4K (safe)

The "practical budget" is one step below the spillover context, ensuring the model stays in the clean computational regime.

12. Key Findings

Pre-thrashing prefill scaling (qwen2.5-0.5b, HF): exponent b = 1.701, quadratic R^2 = 0.9999 on 512-8,192 tokens. At 16,384 tokens, VRAM spills to system RAM causing a 100.7x latency cliff.
Pre-thrashing prefill scaling (qwen2.5-1.5b, HF): exponent b = 1.780, quadratic R^2 = 0.9998 on 512-8,192 tokens. At 16,384 tokens, VRAM spills to system RAM causing a 104.7x latency cliff.
Pre-thrashing prefill scaling (qwen2.5-3b, HF): exponent b = 1.583, quadratic R^2 = 0.9992 on 512-4,096 tokens. At 8,192 tokens, VRAM spills to system RAM causing a 25.2x latency cliff.
VRAM thrashing dominates HF scaling: Full-range exponents of b = 6.6, 6.7, 4.6 are caused by system-memory paging, not O(n^2) attention. The true computational scaling (pre-spillover) shows b = 1.58-1.78.
Ollama prefill scales sub-linearly: All 3 Ollama models show b < 0.2, confirming Flash Attention eliminates quadratic overhead at 512-32K contexts.
Decode also shows two regimes: Pre-spillover HF decode exponents (b = 0.05-0.36) confirm near-constant KV-cache lookup cost. Post-spillover decode thrashing multipliers (7-9x) are lower than prefill (25-105x) because decode's smaller per-step working set reduces paging overhead.
Decode throughput degrades linearly with context: Ollama models lose 41-53% throughput from 512->32K. HF models show similar pre-spillover degradation plus catastrophic post-spillover collapse (95% for qwen2.5-1.5b at 16K).
OOM cliffs follow spillover by one step: 0.5B/1.5B spill at 16K -> OOM at 32K. 3B spills at 8K -> OOM at 16K.
TTFT exceeds 1 second at 4K on all HF models. Ollama TTFT never exceeds 1 second at any context length.
All 18 backend comparisons survive Bonferroni correction: 18/18 pairwise tests (3 modes x 6 shared context lengths) remain significant after both Bonferroni (p < 0.0028) and Holm-Bonferroni correction. These are genuine effects, not multiple-testing artifacts.
Strong context x backend interaction (ANOVA): The Ollama advantage widens 5,687x from 512 to 16K tokens (F_backend = 14.20, p = 0.00025 for qwen2.5-1.5b prefill), confirming the interaction between backend choice and context length.
Ollama cold-start contaminates 100% of measurement groups: Rep-0 shows 2-306x higher latency than subsequent reps. Mean inflation is 45-80% at long contexts. Median is robust; 5%-trimmed mean with N=10 is insufficient (removes 0 values); 10%-trimmed mean recovers correctly.
Scaling exponents are robust to trimming (HF) but fragile (Ollama mean): HF exponents are stable within 0.3% across all trimming levels. Ollama mean-based exponents are 7-13x too high due to cold-start; median-based exponents are correct.
Distribution shape confirms mean unreliability for Ollama: Mean/median ratio = 10-12x for Ollama prefill (extreme right skew from cold-start). HF pre-spillover groups pass Shapiro-Wilk normality.
KV cross-validation reveals 20-95x overhead over theory: Empirical VRAM slopes (0.75-1.16 MB/token) are 20-95x larger than theoretical KV cache costs (12-37 KB/token). The difference is attention workspace, activations, and allocator fragmentation -- not KV cache inflation.
HF measurement precision is excellent pre-spillover (CV 0.2-3.1%). Ollama rest-of-run CV is also excellent (3.6-6.1%) after filtering rep-0.
Measurement stability: 116 outliers across 1,140 measurements (10.2% outlier rate, per-context IQR method). Ollama accounts for the majority due to cold-start.
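The trimmed-mean finding above comes down to integer truncation: with N=10, a 5% trim removes int(0.05 x 10) = 0 values per tail, while a 10% trim removes one. The sketch below illustrates this with a hand-written trimmed mean that follows the same truncation convention as `scipy.stats.trim_mean`. The latencies are illustrative values, not measured data, and whether the analysis script uses this exact routine is an assumption.

```python
from statistics import mean, median


def trimmed_mean(samples, proportion):
    """Mean after dropping int(proportion * n) values from each tail."""
    ordered = sorted(samples)
    k = int(proportion * len(ordered))  # truncates: 5% of 10 samples -> 0 dropped
    kept = ordered[k:len(ordered) - k] if k else ordered
    return mean(kept)


# Illustrative latencies in ms (not measured data): one cold-start rep, nine warm reps.
reps = [2500.0, 8.1, 8.3, 8.4, 8.4, 8.5, 8.6, 8.2, 8.3, 8.5]

print("plain mean      :", mean(reps))
print("5% trimmed mean :", trimmed_mean(reps, 0.05))  # nothing trimmed
print("10% trimmed mean:", trimmed_mean(reps, 0.10))  # cold-start rep dropped
print("median          :", median(reps))
```

In this example the 5% trimmed mean equals the contaminated plain mean, while the 10% trimmed mean and the median both land on the warm-rep latency.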

13. Conclusions

Q1: Does attention quadratic cost show up empirically on RTX 4080?

Two-regime answer -- yes, but it's not the bottleneck:

Pre-spillover (true computational scaling):

qwen2.5-0.5b: b = 1.70, quadratic R^2 = 0.9999 (512-8K tokens)
qwen2.5-1.5b: b = 1.78, quadratic R^2 = 0.9998 (512-8K tokens)
qwen2.5-3b: b = 1.58, quadratic R^2 = 0.9992 (512-4K tokens)

The exponents 1.58-1.78 confirm superlinear scaling consistent with O(n^2) attention, partially mitigated by hardware optimizations. The quadratic model fits the data almost perfectly (R^2 = 0.999+).
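As an illustration of how such an exponent is obtained, the sketch below fits latency = a * n^b by ordinary least squares in log-log space, using the qwen2.5-0.5b prefill means from the Section 11.1 table. This is only one possible fitting method. It weights every context length equally and gives b of about 1.63 here, slightly below the reported 1.70; the report's own procedure is defined under Scaling Fit Methods and may weight the longer contexts more heavily.

```python
from math import exp, log


def fit_power_law(contexts, latencies_ms):
    """Least-squares fit of log(latency) = log(a) + b * log(context)."""
    xs = [log(c) for c in contexts]
    ys = [log(t) for t in latencies_ms]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx                    # scaling exponent
    a = exp(y_bar - b * x_bar)       # prefactor in ms
    return a, b


# qwen2.5-0.5b HF prefill means (ms), pre-spillover points only (Section 11.1).
contexts = [512, 1024, 2048, 4096, 8192]
latencies_ms = [52.6, 144.5, 441.2, 1452.3, 4723.6]

a, b = fit_power_law(contexts, latencies_ms)
print(f"b = {b:.2f}")
```

Including the 16K point in the same fit would pull the exponent far above 2, which is the full-range artifact described in the Key Findings.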

Post-spillover (VRAM thrashing -- the real bottleneck):

qwen2.5-0.5b: 100.7x latency cliff at 16K tokens
qwen2.5-1.5b: 104.7x latency cliff at 16K tokens
qwen2.5-3b: 25.2x latency cliff at 8K tokens

Ollama (optimized attention -- no quadratic cost):

llama3.2-1b: b = 0.083 (sub-linear)
llama3.2-3b: b = 0.158 (sub-linear)
qwen2.5-1.5b: b = 0.109 (sub-linear)

Summary: On consumer hardware, you hit the VRAM wall long before O(n^2) attention becomes the practical bottleneck. Ollama eliminates both problems -- optimized attention AND no VRAM spillover (quantized models have smaller footprints).

Q2: At what context length does VRAM become the bottleneck?

Model	Spillover Threshold	Peak Alloc at Spillover	Alloc/VRAM Ratio	KV Cost (MB/tok)	OOM Cliff
qwen2.5-0.5b	16,384	34,787 MB	2.83x	2.11	32,768
qwen2.5-1.5b	16,384	32,690 MB	2.66x	1.85	32,768
qwen2.5-3b	8,192	16,167 MB	1.32x	1.32	16,384

The bottleneck context length is determined by: (GPU_VRAM - model_base_weight) / KV_cache_per_token. Larger models hit the wall sooner despite lower per-token KV costs because their weight footprint consumes more of the VRAM budget.

Q3: How does TTFT scale with context length?

HF (catastrophic scaling):

qwen2.5-0.5b: 53 ms (512) -> 474 seconds (16K) -- 9,004x increase over 32x context growth
qwen2.5-1.5b: 109 ms (512) -> 533 seconds (16K) -- 4,898x increase over 32x context growth
qwen2.5-3b: 164 ms (512) -> 62 seconds (8K) -- 378x increase over 16x context growth

Ollama (graceful scaling, medians):

llama3.2-1b: 8.4 ms (512) -> 11.4 ms (32K) -- 1.4x increase over 64x context growth
llama3.2-3b: 12.5 ms (512) -> 22.6 ms (32K) -- 1.8x increase over 64x context growth
qwen2.5-1.5b: 9.2 ms (512) -> 13.4 ms (32K) -- 1.5x increase over 64x context growth

Threshold crossings: All HF models exceed 1-second TTFT at 4,096 tokens. No Ollama model exceeds 1-second TTFT at any context length tested.

Q4: Is there a context length cliff?

Yes -- the VRAM spillover cliff:

Model	Cliff Location	Latency Jump	Multiplier
qwen2.5-0.5b	8K -> 16K	4,724 ms -> 473,789 ms	100.3x
qwen2.5-1.5b	8K -> 16K	5,086 ms -> 532,891 ms	104.8x
qwen2.5-3b	4K -> 8K	2,432 ms -> 61,982 ms	25.5x

These are not gradual degradations -- they are catastrophic cliffs. A deployment that processes 8K tokens successfully will run roughly 100x slower at 16K tokens with no warning. CUDA Unified Memory does not raise an exception; it silently degrades performance.

There is also a secondary cliff: OOM, where PyTorch raises torch.cuda.OutOfMemoryError. This occurs one context-length step after spillover.

Ollama shows no cliff at any context length tested (up to 32K). Decode throughput degrades gradually (41-53% over 64x context growth), but there is no sudden performance discontinuity.

14. Production Guidance & Decision Trees

14.1 Backend Selection by Context Length

Context Length	Recommended Backend	Rationale
<= 2K tokens	Either	HF: precise FP16, 50-500 ms TTFT. Ollama: 3x faster with quantization.
2K-4K tokens	Ollama preferred	HF: 1-2.5 seconds TTFT. Ollama: <15 ms TTFT.
4K-8K tokens	Ollama required	HF: 5-62 seconds TTFT (approaching spillover). Ollama: <15 ms.
8K-32K tokens	Ollama only	HF: OOM or 100x thrashing. Ollama: <25 ms TTFT, 47-130 tok/s decode.
>32K tokens	Ollama only	Not tested, but Ollama supports 131K context natively.

14.2 Model Selection by Use Case

Use Case	Context Budget	Best Model	Rationale
Interactive chat (< 4K)	2K-4K	llama3.2-1b (Ollama)	Fastest TTFT (8 ms), highest decode (163 tok/s)
RAG pipeline (4K-8K)	4K-8K	qwen2.5-1.5b (Ollama)	Good decode (130 tok/s at 8K), balanced quality
Document summarization (8K-32K)	8K-32K	llama3.2-3b (Ollama)	Largest model that handles 32K, 47 tok/s decode
Quality-critical (need FP16)	<= 4K	qwen2.5-0.5b (HF)	Exact FP16 precision, fits 8K pre-spillover

14.3 VRAM Budget Planning

For FP16 HF on N GB VRAM:

safe_context_tokens ~ 0.8 x (N_GB x 1024 - model_base_MB) / slope_MB_per_token

The 0.8 factor provides a 20% safety margin to avoid approaching the spillover cliff.

GPU VRAM	qwen2.5-0.5b Max Context	qwen2.5-1.5b Max Context	qwen2.5-3b Max Context
8 GB	~2,600 tokens	~2,100 tokens	~1,100 tokens
12 GB	~4,200 tokens	~3,900 tokens	~3,700 tokens
16 GB	~6,100 tokens	~5,600 tokens	~5,900 tokens
24 GB	~9,700 tokens	~9,100 tokens	~10,200 tokens
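A minimal sketch of the budget formula follows. The base-VRAM figures come from Section 11.3 and the per-token slopes from the Q2 table in Section 13; pairing those particular slopes with this formula is an assumption, and the function name and `margin` parameter are illustrative. With these inputs the results land within a few percent of the 12 GB row above.

```python
def safe_context_tokens(vram_gb, model_base_mb, slope_mb_per_token, margin=0.8):
    """Section 14.3 rule of thumb: free VRAM over per-token slope, derated by a safety margin."""
    free_mb = vram_gb * 1024 - model_base_mb
    if free_mb <= 0:
        return 0  # the weights alone do not fit
    return int(margin * free_mb / slope_mb_per_token)


# (base VRAM in MB, slope in MB/token); which slope the report used is an assumption.
models = {
    "qwen2.5-0.5b": (1.1 * 1024, 2.11),
    "qwen2.5-1.5b": (3.1 * 1024, 1.85),
    "qwen2.5-3b": (6.2 * 1024, 1.32),
}

for name, (base_mb, slope) in models.items():
    print(f"{name}: ~{safe_context_tokens(12, base_mb, slope)} tokens on 12 GB")
```

Because the slope is an allocator-level measurement, the result should be read as a conservative planning number rather than a hard limit.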

14.4 Warning Signs of Approaching VRAM Spillover

torch.cuda.max_memory_allocated() exceeding 80% of physical VRAM
Sudden latency increase between consecutive context lengths, well above the 2-3.5x that a context doubling costs pre-spillover (spillover steps measured 25-105x)
Increasing variance (CV > 5%) at previously stable context lengths
Observed example: qwen2.5-1.5b showed an inflection at 4K tokens, where CV jumped to 7.4%
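These checks can be turned into simple monitoring helpers. The sketch below is an assumption-laden illustration, not part of the TR127 tooling. All function names are hypothetical. The cliff check normalizes each latency jump by the context growth (a local exponent) because an ordinary pre-spillover doubling already costs 2-3.5x in the Section 11.1 data. The 2.5 cutoff is an assumed value that sits above the pre-spillover exponents (1.58-1.78) and below the spillover steps.

```python
from math import log
from statistics import mean, stdev


def vram_pressure(peak_allocated_bytes, total_vram_bytes, threshold=0.80):
    """True when peak allocation exceeds the given fraction of physical VRAM.

    With PyTorch the inputs could come from torch.cuda.max_memory_allocated()
    and torch.cuda.get_device_properties(0).total_memory.
    """
    return peak_allocated_bytes > threshold * total_vram_bytes


def latency_cliffs(latency_by_context, max_local_exponent=2.5):
    """Return (prev_ctx, ctx, ratio) for steps whose local scaling exponent is too steep."""
    points = sorted(latency_by_context.items())
    cliffs = []
    for (c0, t0), (c1, t1) in zip(points, points[1:]):
        local_b = log(t1 / t0) / log(c1 / c0)
        if local_b > max_local_exponent:
            cliffs.append((c0, c1, round(t1 / t0, 1)))
    return cliffs


def cv_exceeds(samples_ms, max_cv_percent=5.0):
    """True when run-to-run variation exceeds the stability threshold."""
    return 100.0 * stdev(samples_ms) / mean(samples_ms) > max_cv_percent


# qwen2.5-3b HF prefill means in ms (Section 11.1).
qwen3b = {512: 164.0, 1024: 341.9, 2048: 756.8, 4096: 2432.3, 8192: 61981.8}
print(latency_cliffs(qwen3b))
```

On the qwen2.5-3b data only the 4K -> 8K step is flagged; the 2K -> 4K step (about 3.2x) stays below the cutoff because it is consistent with sub-quadratic scaling.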

15. Limitations & Future Work

15.1 Limitations

Single GPU. All measurements are on one RTX 4080 Laptop (12 GB). Results may differ on desktop GPUs (higher bandwidth, different thermal profiles) or server GPUs (24-80 GB VRAM would shift spillover thresholds dramatically).
Small model range. 0.5B-3.2B parameters. The 7B-70B models commonly used in production were not tested because they don't fit in 12 GB VRAM at FP16. Ollama handles larger models but was only tested up to 3.2B.
Synthetic prompts. Prompts are repeated text tokenized to exact lengths, not natural language. Real prompts may have different attention patterns (sparser or denser) that affect the scaling exponent.
No Flash Attention on HF. The HF measurements use PyTorch's default SDPA (Scaled Dot-Product Attention), not the Flash Attention 2 library. Flash Attention would likely reduce the pre-spillover exponent and delay the VRAM cliff.
Windows-only HF. TR126 showed that torch.compile on Windows falls back to aot_eager. The HF measurements reflect unoptimized eager execution. Compiled HF on Linux (with Triton) would be faster for prefill (TR126: -53% at short contexts) but would still hit the same VRAM cliff at the same context lengths.
Ollama cold-start variance. Despite 3 warmup reps, Ollama measurements show 97-307% CV when including rep-0. The cold-start analysis (Section 4.5) identifies and characterizes this phenomenon, and all Ollama scaling exponents are reported using medians (robust to cold-start). After filtering rep-0, Ollama CV drops to 3.6-6.1% -- comparable to HF precision. Future experiments should use >=5 warmup reps or explicitly discard rep-0 from analysis.
N=10 power limitations. At N=10, only large effects (d > 0.94) are detectable. Subtle differences between context lengths or backends may be missed. The primary findings (100x cliffs, 2,352x backend differences) are well above the detection threshold.
No multi-GPU or model parallelism. All tests are single-GPU. Model parallelism across 2+ GPUs would increase the effective VRAM budget and shift spillover thresholds.
VRAM measurement is allocator-level. torch.cuda.max_memory_allocated() includes allocator overhead and fragmentation. The true KV cache cost may be lower than the measured slope.
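The rep-0 filtering recommended in the cold-start limitation above could look like the following sketch. The record fields `rep` and `latency_ms` are hypothetical names, and the values are illustrative, not measured data. A single dominant outlier among N=10 samples drives the sample CV toward sqrt(10), about 316%, which is consistent with the ~307% figures in Section 10.3.

```python
from statistics import mean, stdev


def cv_percent(values):
    """Sample coefficient of variation as a percentage."""
    return 100.0 * stdev(values) / mean(values)


def drop_cold_start(records):
    """Keep only repetitions after the first measured one (rep index > 0)."""
    return [r for r in records if r["rep"] > 0]


# Illustrative records (not measured data); field names are hypothetical.
latencies = [2400.0, 9.1, 9.3, 9.2, 9.4, 9.0, 9.3, 9.2, 9.1, 9.4]
records = [{"rep": i, "latency_ms": ms} for i, ms in enumerate(latencies)]

all_cv = cv_percent([r["latency_ms"] for r in records])
warm = drop_cold_start(records)
warm_cv = cv_percent([r["latency_ms"] for r in warm])
print(f"CV with rep-0: {all_cv:.0f}%  CV without rep-0: {warm_cv:.1f}%")
```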

15.2 Future Work

TR128+: Flash Attention comparison. Run the same context sweep with Flash Attention 2 on HF to measure how much it reduces the pre-spillover exponent and whether it delays the spillover cliff.
Isolate pure KV cache cost. Section 6.4 shows empirical VRAM slopes are 20-95x theoretical KV costs due to attention workspace and allocator overhead. A targeted experiment holding attention workspace constant while varying only KV cache size would isolate the true KV cost. The current cross-validation (Section 6.4) quantifies the total overhead but cannot decompose it.
Extended context beyond 32K. Ollama supports 131K context. Test 64K and 128K to find Ollama's practical limits.
Quantization-specific context limits. TR125 tested quality at different quant levels. Run the context sweep at Q2_K through Q8_0 to measure how quantization affects the VRAM budget and spillover threshold.
Server GPU validation. Repeat on an A100 (80 GB) or H100 to measure whether the pre-spillover exponents change with hardware (more memory bandwidth, larger caches) and to find spillover thresholds at 80 GB scale.

16. Reproducibility

16.1 Exact Reproduction

# Prerequisites: Ollama running with llama3.2:1b, qwen2.5:1.5b, llama3.2:3b pulled
# HF models cached in models/qwen2.5-{0.5b,1.5b,3b}/
# Python 3.13+ with torch, transformers, scipy, numpy, pandas, pyyaml, requests

# Run the full pipeline
python research/tr127/run.py -v

# Or step-by-step:
python research/tr127/run_context_sweep.py --config research/tr127/config.yaml -v
python research/tr127/analyze.py -v
python research/tr127/generate_report.py -v

Estimated runtime: ~5 hours on RTX 4080 Laptop (12 GB VRAM). The majority of time is spent on HF models at high context lengths in the thrashing regime.

16.2 Config (Source of Truth)

See Appendix B for the exact config.yaml used.

16.3 Key Artifacts

Artifact	Path
Experiment runner	`research/tr127/run_context_sweep.py`
Analysis (with two-regime fixes)	`research/tr127/analyze.py`
Report generator	`research/tr127/generate_report.py`
Orchestrator	`research/tr127/run.py`
Config	`research/tr127/config.yaml`
Raw metrics	`research/tr127/results/20260224_101128/metrics.csv`
Analysis results	`research/tr127/results/20260224_101128/analysis.json`
Manifest	`research/tr127/results/20260224_101128/manifest.json`

16.4 Analysis Scripts

Script	Description
`research/tr127/run_context_sweep.py`	Core experiment: HF + Ollama context sweep
`research/tr127/analyze.py`	v2: Two-regime scaling (prefill + decode), cold-start detection, KV cross-validation, Bonferroni/Holm, ANOVA, trimmed means, distribution shape
`research/tr127/generate_report.py`	v2: Auto-generates markdown report with all v2 analysis sections
`research/tr127/shared/utils.py`	Prompt generation, path utilities

Appendix A: Environment Specifications

GPU Specifications

Property	Value
Name	NVIDIA GeForce RTX 4080 Laptop GPU
Architecture	Ada Lovelace (AD104)
Compute Capability	8.9
CUDA Cores	7,424
VRAM	12.88 GB (12 GiB) GDDR6
Memory Bus	192-bit
Memory Bandwidth	256 GB/s
PCIe Bandwidth	~16 GB/s (PCIe 4.0 x16)
TDP	150W (laptop)

Software Stack

Component	Version
OS	Windows 11 Home 10.0.26200
Python	3.13.1
PyTorch	2.8.0+cu128
CUDA	12.8
cuDNN	91002
Triton	Not available (Windows)
Transformers	Latest (pip)
Ollama	localhost:11434

Ollama Model Tags

Model	Ollama Tag	Quantization
llama3.2-1b	`llama3.2:1b`	Q8_0 (default)
qwen2.5-1.5b	`qwen2.5:1.5b`	Q4_K_M (default)
llama3.2-3b	`llama3.2:3b`	Q4_K_M (default)

Appendix B: Config (Source of Truth)

# TR127: Long-Context Performance Characterization
experiment: tr127

context_lengths: [512, 1024, 2048, 4096, 8192, 16384, 32768]

models:
  - name: qwen2.5-0.5b
    path: models/qwen2.5-0.5b
    params_m: 500
    max_context: 32768
    dtype: fp16
    ollama_tag: null

  - name: qwen2.5-1.5b
    path: models/qwen2.5-1.5b
    params_m: 1543
    max_context: 131072
    dtype: fp16
    ollama_tag: "qwen2.5:1.5b"

  - name: qwen2.5-3b
    path: models/qwen2.5-3b
    params_m: 3000
    max_context: 32768
    dtype: fp16
    ollama_tag: null

  - name: llama3.2-1b
    path: null
    params_m: 1236
    max_context: 131072
    dtype: fp16
    ollama_tag: "llama3.2:1b"

  - name: llama3.2-3b
    path: null
    params_m: 3213
    max_context: 131072
    dtype: fp16
    ollama_tag: "llama3.2:3b"

backends:
  - transformers-gpu
  - ollama

device: cuda
repetitions: 10
warmup_repetitions: 3
max_new_tokens: 128
seed: 42

ollama_url: http://localhost:11434
ollama_timeout_s: 600

Appendix C: Glossary

Term	Definition
Context length	The number of tokens in the input prompt. Also called "sequence length" or "context window." Determines the size of the attention matrix (n x n) and KV cache (proportional to n).
CUDA Unified Memory	NVIDIA feature that transparently migrates data between GPU VRAM and system RAM. Allows allocation beyond physical GPU memory, but at PCIe bandwidth (~16 GB/s) instead of GDDR6 (~256 GB/s). PyTorch uses this implicitly when VRAM is exhausted.
Flash Attention	Memory-efficient exact attention algorithm (Dao et al., 2022) that computes attention in O(n) memory by tiling the computation. Used by llama.cpp (Ollama backend). Reduces the effective scaling exponent from O(n^2) to near-linear at moderate context lengths.
KV cache	Key-Value cache: stored attention keys and values from all previous tokens, reused during autoregressive decode. Each new token attends over the full KV cache, so decode cost grows linearly with context length. Size is proportional to `layers x kv_heads x head_dim x precision_bytes x context_length x 2`.
OOM (Out of Memory)	`torch.cuda.OutOfMemoryError`: PyTorch's CUDA allocator cannot satisfy a memory request even with Unified Memory. The hard failure point, typically 2-3x physical VRAM depending on system RAM.
Prefill	The initial forward pass over the entire input prompt. Populates the KV cache. This is the time-to-first-token (TTFT) computation. Scales superlinearly with context length due to O(n^2) self-attention.
SDPA	Scaled Dot-Product Attention: PyTorch's built-in attention implementation (`torch.nn.functional.scaled_dot_product_attention`). Selects between FlashAttention, Memory-Efficient Attention, or math-based attention depending on input size and hardware. On Windows without Triton, typically uses the math or memory-efficient kernel.
Spillover	When CUDA memory allocation exceeds physical GPU VRAM. Detected by `torch.cuda.max_memory_allocated() > GPU_VRAM_MB`. Causes 25-105x latency increases due to PCIe-bound data transfer.
TTFT	Time to First Token: the latency from receiving a prompt to producing the first output token. Equal to prefill latency plus any framework overhead. The primary responsiveness metric for interactive applications.
Thrashing	Performance degradation caused by repeated data movement between fast memory (VRAM) and slow memory (system RAM). Occurs when working set exceeds fast memory capacity. Characterized by latency multipliers of 10-100x rather than gradual degradation.
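The KV cache size expression in the glossary can be evaluated directly. In the sketch below the architecture values (24 layers, 2 KV heads, head dimension 64) are assumptions taken from the public Qwen2.5-0.5B model configuration, not from this report, and the function names are illustrative. At FP16 they give 12 KB per token, which matches the lower end of the 12-37 KB/token theoretical range quoted in the Key Findings.

```python
def kv_cache_bytes_per_token(layers, kv_heads, head_dim, precision_bytes=2):
    """Bytes of keys plus values stored per token (the trailing factor 2 is K and V)."""
    return layers * kv_heads * head_dim * precision_bytes * 2


def kv_cache_mb(context_length, **arch):
    """Total KV cache size in MB for a given context length."""
    return kv_cache_bytes_per_token(**arch) * context_length / (1024 ** 2)


# Assumed architecture for qwen2.5-0.5b (from the public model config, not from TR127).
arch = {"layers": 24, "kv_heads": 2, "head_dim": 64}

per_token = kv_cache_bytes_per_token(**arch)
print(per_token / 1024, "KB/token")
print(kv_cache_mb(8192, **arch), "MB at 8K context")
```

Under these assumptions the theoretical KV cache at 8K tokens is under 100 MB, which illustrates why the empirical VRAM slopes must be dominated by something other than the KV cache itself.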

References

TR108-TR122: Baseline benchmarks and prior short-context performance data.
TR123: KV-Cache Production Economics -- theoretical KV cache cost model and VRAM formulas.
TR124: Quality & Accuracy Baseline -- backend equivalence and metric framework.
TR125: Quantization Decision Matrix -- Ollama model quality and throughput data.
TR126: Linux/Triton Validation -- HF vs Ollama comparison methodology, compiled decode findings.
Dao, T., et al. (2022). "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." NeurIPS 2022.
NVIDIA CUDA Toolkit Documentation: Unified Memory Programming Guide.
llama.cpp: C/C++ LLM inference engine powering Ollama's backend.
