Research Archive
Technical Report

TR127: Long-Context Performance

Consumer GPU context scaling from 512 to 32K tokens with two-regime VRAM analysis.

Table of Contents

Technical Report 127: Long-Context Performance Characterization

Consumer GPU context scaling from 512 to 32K tokens with two-regime VRAM analysis

Field Value
TR Number 127
Project Banterhearts LLM Performance Research
Date 2026-02-24
Author Research Team
Report Type Context-length scaling analysis (single-phase, 2-backend sweep)
Test Duration ~5 hours
Status Complete -- Experiment + two-regime reanalysis delivered
Run ID 20260224_101128
Related Work TR123 (KV-Cache Production Economics), TR125 (Quantization Decision Matrix), TR126 (Linux/Triton Validation)
Depends On TR123 (KV cache cost model, VRAM formulas), TR126 (HF vs Ollama backend methodology)

Abstract

TR108-TR126 tested prompts up to ~2K tokens. Production workloads -- RAG pipelines, document summarization, multi-turn conversations -- operate at 4K-128K token contexts. The performance scaling behavior from 512 to 32K tokens on consumer hardware (RTX 4080 Laptop, 12 GB VRAM) was unknown. TR127 fills this gap with a systematic context-length sweep: 1,144 measurements (1,140 successful + 4 OOM) across 5 models (0.5B-3.2B parameters), 2 backends (HuggingFace transformers FP16, Ollama quantized), 7 context lengths (512 to 32,768 tokens), and 3 measurement modes (prefill, decode, end-to-end) with 10 repetitions each.

The central finding is a two-regime scaling phenomenon on HuggingFace transformers:

  1. Pre-spillover regime (context fits in 12 GB VRAM): Prefill latency scales with exponent b = 1.58-1.78 (between linear and quadratic), with quadratic R^2 = 0.999+. This represents true computational scaling from self-attention's O(n^2) cost, partially mitigated by hardware optimizations.

  2. Post-spillover regime (context exceeds VRAM): CUDA Unified Memory silently pages tensors to system RAM via PCIe, causing 25-105x latency cliffs. Full-range power-law fits show apparent exponents of b = 4.6-6.7 -- these are artifacts of memory thrashing, not O(n^2) attention. This is the dominant effect observed in the data.

Ollama (llama.cpp quantized) shows sub-linear prefill scaling across the entire 512-32K range (b < 0.2), confirming that Flash Attention and paged KV caches effectively eliminate the quadratic penalty at these context lengths. Ollama is 86-100% faster than HF at every context length tested, with the gap widening from 86% at 512 tokens to 99.96% at 16K tokens (where HF enters the thrashing regime).

Decode throughput degrades with context length across both backends: Ollama's llama3.2-1b drops from 163 tok/s (512 tokens) to 96 tok/s (32K tokens) -- a 41% decline over 64x context growth. HF models show steeper decode degradation, with qwen2.5-3b dropping from 27 tok/s to 0.9 tok/s due to VRAM spillover during KV-cache attention.

VRAM scaling analysis reveals per-token KV cache costs of 0.75-1.16 MB/token for FP16 models, with spillover thresholds at 8K tokens (3B model), 16K tokens (0.5B, 1.5B models), and hard OOM cliffs one step higher. The qwen2.5-3b model's 6 GB base footprint leaves only 6 GB for KV cache -- enough for ~4,600 tokens before spillover begins.

Total: 1,144 measurements, 5 models, 2 backends, 7 context lengths, 3 modes, ~5 hours runtime.

Key findings:

  • Quadratic attention is empirically visible but NOT the dominant bottleneck. Pre-spillover exponents b = 1.58-1.78 confirm superlinear scaling. But the 25-105x thrashing cliffs dwarf the computational cost -- VRAM management, not attention complexity, is the practical bottleneck on consumer hardware.
  • VRAM spillover is silent and catastrophic. PyTorch's CUDA Unified Memory allows allocation beyond physical VRAM without raising exceptions, but performance degrades 25-105x as tensors page through PCIe bandwidth (~16 GB/s) instead of GDDR6 bandwidth (~256 GB/s).
  • Ollama eliminates quadratic scaling entirely. Sub-linear exponents (b < 0.2) across all 3 Ollama models confirm that llama.cpp's optimized attention (Flash Attention + paged KV cache) makes prefill effectively O(n) at these context lengths.
  • Decode throughput degrades linearly with context. As KV cache grows, each decode step must attend over more cached tokens. Ollama shows 41-53% throughput degradation from 512 to 32K tokens. HF shows identical pre-spillover degradation plus catastrophic post-spillover collapse.
  • Consumer GPU context budget: 4K-8K tokens for FP16 HF, unlimited for Ollama. On 12 GB VRAM, FP16 models fit 4K-8K tokens before spillover depending on model size. Ollama's quantized models handle 32K+ without degradation.

Executive Summary

TR127 answers: how does inference performance scale with context length on consumer hardware, and where are the practical limits?

Key Findings

  1. Two-regime scaling on HF transformers: Pre-spillover (b = 1.58-1.78, R^2 = 0.999+) vs post-spillover (25-105x cliffs). The quadratic attention cost IS empirically visible in the clean regime but is completely dominated by VRAM thrashing at higher context lengths.
  2. Ollama prefill is sub-linear: b = 0.083 (llama3.2-1b), b = 0.109 (qwen2.5-1.5b), b = 0.158 (llama3.2-3b). Flash Attention eliminates quadratic scaling at 512-32K context lengths.
  3. VRAM spillover thresholds are model-dependent: qwen2.5-0.5b and qwen2.5-1.5b spill at 16K tokens; qwen2.5-3b spills at 8K tokens. Each model's base weight footprint determines the remaining VRAM budget for KV cache.
  4. Decode throughput degrades 41-53% over 64x context growth (Ollama): llama3.2-1b drops from 163->96 tok/s; qwen2.5-1.5b drops from 147->80 tok/s; llama3.2-3b drops from 99->47 tok/s. Linear KV-cache lookup cost.
  5. HF decode enters catastrophic regime: qwen2.5-1.5b decode goes from 42 tok/s (512) to 2.1 tok/s (16K) -- a 95% collapse driven by VRAM spillover during KV-cache attention.
  6. Ollama is 86-100% faster than HF across all context lengths. At short contexts (512), Ollama prefill is 86% faster. At long contexts (16K), it is 99.96% faster. The gap widens monotonically because HF enters the thrashing regime while Ollama does not.
  7. TTFT exceeds 1 second at 4K tokens on HF. All three HF models cross the 1-second TTFT threshold at 4,096 tokens. At 16K tokens, TTFT is 7.9 minutes (qwen2.5-0.5b) and 8.9 minutes (qwen2.5-1.5b). Ollama TTFT never exceeds 1 second at any context length tested.
  8. Total context-dependent VRAM grows at 0.75-1.16 MB/token (FP16). Derived from VRAM growth slopes on pre-spillover data. Cross-validation with TR123's theoretical KV costs (12-37 KB/token) reveals 20-95x overhead from attention workspace, activations, and allocator fragmentation (SS6.4).
  9. OOM cliff follows spillover by one context-length step. qwen2.5-0.5b and qwen2.5-1.5b spill at 16K, OOM at 32K. qwen2.5-3b spills at 8K, OOM at 16K. The spillover regime is a warning zone before hard failure.
  10. HF measurement precision is extremely high pre-spillover (CV 0.2-3.1%) but Ollama variance is dominated by cold-start outliers (CV 97-307% with rep-0; 3.6-6.1% after filtering). Median or 10%-trimmed mean is the correct central tendency for Ollama (SS4.5). All 18 backend comparisons survive Bonferroni correction (SS8.4).

Key Decisions

  • For long-context workloads on consumer GPU: Use Ollama (quantized). HF transformers in FP16 cannot serve contexts beyond 4-8K tokens without catastrophic performance degradation on 12 GB VRAM.
  • For short-context workloads (<=2K tokens): Either backend is viable. HF offers exact FP16 precision; Ollama offers 3-7x faster throughput with quantization.
  • Context budget planning: Allocate VRAM as model_weight_size + KV_cache_cost x context_length. For FP16 on 12 GB: qwen2.5-0.5b supports ~8K tokens, qwen2.5-1.5b supports ~5K tokens, qwen2.5-3b supports ~4K tokens before spillover.
  • Decode-heavy applications (chat, code gen): Ollama's decode throughput (47-163 tok/s) vastly exceeds HF's (0.9-49 tok/s). Use Ollama.
  • TTFT-sensitive applications (interactive use): Ollama maintains sub-second TTFT through 32K tokens. HF crosses 1 second at 4K tokens. Use Ollama for long-context interactive workloads.

Claim Validation

# Claim Evidence Base Status
1 Quadratic attention cost is empirically visible on RTX 4080 Pre-spillover exponents b = 1.58-1.78, quadratic R^2 = 0.999+ (SS4) Demonstrated (pre-spillover only)
2 VRAM becomes the bottleneck before compute does Spillover at 8-16K tokens causes 25-105x cliffs (SS6) Demonstrated
3 TTFT scales superlinearly with context length HF: 9,004x TTFT increase over 32x context growth (SS7) Demonstrated (HF only; Ollama is sub-linear)
4 There is a context-length "cliff" where performance drops dramatically 25-105x latency jumps at spillover thresholds (SS4, SS12) Demonstrated (HF only)
5 Ollama eliminates quadratic scaling Sub-linear exponents b < 0.2, all 3 models (SS4) Demonstrated
6 Decode throughput degrades with context length 41-53% Ollama decode degradation, 95% HF decode collapse (SS5) Demonstrated
7 Empirical VRAM growth cross-validates with TR123 Slopes 0.75-1.16 MB/token vs theoretical KV 12-37 KB/token; 20-95x overhead quantified as attention workspace + allocator (SS6.4) Demonstrated (with reinterpretation: slope is total context-dependent memory, not pure KV cost)
8 Model size determines context budget 3B model spills at 8K, 0.5B/1.5B at 16K (SS6) Demonstrated

When to Use This Report

TR127 is the context-length scaling reference for the Banterhearts research program. Use it when planning context budgets, evaluating VRAM requirements, or understanding long-context performance on consumer hardware.

Scenario 1: Planning Context Window for RAG Pipeline

Question: "I want to stuff 8K tokens of retrieved context into qwen2.5-1.5b on my RTX 4080. Will it work?"

Answer: Consult SS6. qwen2.5-1.5b uses 3.1 GB base VRAM + 1.03 MB/token KV cache. At 8K tokens: 3.1 + 8.2 = 11.3 GB -- borderline. TTFT will be ~5 seconds (SS7). Decode throughput drops to 19 tok/s (SS5). For acceptable interactive latency, limit to 4K tokens on HF, or use Ollama (handles 32K+ at 10 ms TTFT, 130 tok/s decode).

Scenario 2: Evaluating VRAM Requirements for a New Model

Question: "I have a 7B model at FP16. How much context can I fit on 12 GB VRAM?"

Answer: A 7B FP16 model uses ~14 GB for weights alone -- it won't fit on 12 GB VRAM even at context=0. Use Ollama with Q4_K_M quantization (TR125 shows negligible quality loss). At Q4_K_M, llama3.1-8b uses ~4.6 GB, leaving ~7.4 GB for KV cache.

Scenario 3: Understanding Why Inference Suddenly Became Very Slow

Question: "My HF model was running fine at 4K context but became 100x slower at 16K. What happened?"

Answer: Consult SS4 and SS6. Your model likely hit the VRAM spillover threshold -- CUDA Unified Memory is paging tensors to system RAM, causing 25-105x latency increases. Check torch.cuda.max_memory_allocated(): if it exceeds your physical VRAM (12 GB), you're in the thrashing regime. Reduce context length or switch to Ollama.

Scenario 4: Comparing with TR123 KV Cache Theory

Question: "TR123 predicted KV cache costs based on model architecture. Do the empirical measurements match?"

Answer: Consult SS6.4. Measured KV cache costs (0.75-1.16 MB/token) are derived from VRAM growth slopes on pre-spillover data. These should be compared with TR123's architectural predictions (layer_count x kv_heads x head_dim x precision_bytes x 2). Discrepancies may arise from PyTorch's memory allocator overhead.


Table of Contents

Preliminaries

Experiment Design (SS1-SS3)

  1. Introduction & Research Motivation
  2. Methodology & Experimental Design
  3. Environment & Artifacts

Scaling Analysis (SS4-SS7)

  1. Prefill Scaling Analysis -- Two-regime discovery, cold-start analysis (4.5), trimmed-mean robustness (4.8)
  2. Decode Scaling Analysis -- Two-regime decode (5.3), decode trimmed-mean robustness (5.4)
  3. Memory Scaling (CUDA Allocation) -- KV cross-validation with TR123 theory (6.4)
  4. Time to First Token (TTFT) Analysis

Comparisons & Quality (SS8-SS10)

  1. Backend Comparison (HF vs Ollama) -- Multiple comparison correction (8.4), ANOVA interaction (8.5)
  2. Outlier Analysis -- Distribution shape analysis (9.5)
  3. Power Analysis

Synthesis (SS11-SS14)

  1. Cross-Model Comparison
  2. Key Findings
  3. Conclusions
  4. Production Guidance & Decision Trees

Closing

  1. Limitations & Future Work
  2. Reproducibility

Appendices


Metric Definitions & Statistical Methods

Latency Metrics

Metric Definition Computation
Mean (ms) Arithmetic mean of wall-clock latency across all repetitions sum(x) / N
Median (ms) 50th percentile latency sorted(x)[N//2]
Std (ms) Sample standard deviation sqrt(sum((x - mean)^2) / (N-1))
p95/p99 (ms) Percentile latencies numpy.percentile(x, [95, 99])
95% CI 95% confidence interval for the mean mean +/- 1.96 * std / sqrt(N)
CV% Coefficient of variation (std / mean) * 100

Throughput Metrics

Metric Definition Computation
Prefill tok/s Prompt processing speed prompt_tokens / prefill_ms * 1000
Decode tok/s Token generation speed generated_tokens / decode_ms * 1000
TTFT (ms) Time to first token Equal to prefill latency (prompt_eval_duration for Ollama)

Effect Size & Significance Metrics

Metric Definition Interpretation
Cohen's d Standardized mean difference: (mean_A - mean_B) / pooled_std Negligible: |d| < 0.2, Small: 0.2-0.5, Medium: 0.5-0.8, Large: > 0.8
p-value Probability of observing the data under H_0 (no difference), via Welch's t-test Significant if p < 0.05
Delta (%) Relative difference: (mean_B - mean_A) / mean_A x 100 Negative = B is faster
Outlier (IQR) Tukey fence per context length: x < Q1 - 1.5*IQR or x > Q3 + 1.5*IQR Per-context detection avoids false positives from pooling heterogeneous regimes

Multiple Comparison Correction

Method Formula Use Case
Bonferroni Reject if p < alpha / n_tests Conservative FWER control. With 18 tests at alpha=0.05: threshold = 0.0028
Holm-Bonferroni Step-down: sort p-values, reject p_i < alpha/(n-i+1) while all smaller p rejected Less conservative than Bonferroni, controls FWER without assuming independence

ANOVA / Interaction Testing

Test Purpose In TR127
One-way ANOVA (backend) Does backend affect latency (collapsing contexts)? F-test on HF vs Ollama for qwen2.5-1.5b
Per-context t-test Does backend effect change at each context? Series of t-tests showing interaction pattern
Interaction evidence Does the magnitude/direction of backend effect depend on context? Classified as none/weak/moderate/strong

Trimmed Mean

A robust estimator of central tendency: scipy.stats.trim_mean(values, proportiontocut) removes floor(N x proportion) values from each tail before computing the mean. At N=10: 5% trim removes 0 values (useless); 10% trim removes 1 from each tail (effective for cold-start filtering).

Distribution Shape

Metric Formula Interpretation
Mean/median ratio mean / median >1.0 = right skew; >2.0 = severe skew (mean unreliable)
Skewness scipy.stats.skew(values) 0 = symmetric; >2 = strong right skew
Shapiro-Wilk scipy.stats.shapiro(values) p >= 0.05 = consistent with normality

Scaling Fit Methods

Three models are fit to (context_length, latency) data:

  1. Power law: latency = a x context_length^b -- exponent b indicates scaling behavior (b=1 is linear, b=2 is quadratic)
  2. Linear: latency = a x context_length + c -- R^2 compared against power law to determine better fit
  3. Quadratic: latency = a x context_length^2 + b x context_length + c -- direct test of O(n^2) hypothesis

For HF models with VRAM spillover, two-regime analysis separates pre-spillover data (true computational scaling) from post-spillover data (VRAM thrashing artifacts). The thrashing threshold is identified as the first context length where torch.cuda.max_memory_allocated() exceeds physical GPU VRAM (12,288 MB).

Timing Methodology

HF (transformers-gpu): All latency measurements use time.perf_counter() with torch.cuda.synchronize() barriers before and after the timed region. VRAM is measured via torch.cuda.max_memory_allocated() (reset before each context-length sweep).

Ollama: Native timing fields from the /api/generate HTTP response:

  • Prefill: prompt_eval_duration (nanoseconds, GPU-only)
  • Decode: eval_duration (nanoseconds, GPU-only)
  • E2E: Wall clock via time.perf_counter() (includes HTTP overhead)

VRAM Measurement Caveat

torch.cuda.max_memory_allocated() reports the peak amount of memory allocated by the CUDA allocator, which includes CUDA Unified Memory allocations that spill to system RAM. On our 12 GB GPU, values exceeding 12,288 MB indicate that PyTorch has allocated memory beyond physical VRAM, with excess served from system RAM at PCIe bandwidth (~16 GB/s vs ~256 GB/s GDDR6). This is the root cause of the 25-105x thrashing cliffs. Importantly, PyTorch does NOT raise an OOM exception at this point -- the OOM occurs at a higher allocation threshold determined by the CUDA Unified Memory limit.


1. Introduction & Research Motivation

1.1 Research Questions

TR127 addresses four decision-grade questions from the research roadmap:

  1. Does attention's quadratic cost show up empirically on the RTX 4080? Self-attention is O(n^2) in context length. At what point does this become measurable on consumer hardware?
  2. At what context length does VRAM become the bottleneck per model? Each model has a different base weight footprint. Where does the remaining VRAM fill up with KV cache?
  3. How does TTFT (prefill latency) scale with context length? TTFT determines interactive responsiveness. At what context length does TTFT exceed acceptable thresholds (1s, 5s, 10s)?
  4. Is there a context-length "cliff" where performance drops dramatically? Do models degrade gracefully or catastrophically as context grows?

1.2 Why This Matters

Every prior report in this research program (TR108-TR126) tested at most ~2K token contexts. But production workloads are increasingly long-context:

  • RAG pipelines: 4K-16K tokens of retrieved context prepended to queries
  • Document summarization: 8K-32K tokens of source text
  • Multi-turn chat: 4K-128K tokens of conversation history
  • Code generation: 2K-16K tokens of repository context

Without empirical context-length scaling data, production teams cannot answer basic capacity questions: "How many tokens of context can I stuff before latency becomes unacceptable?" or "Should I use FP16 HF or quantized Ollama for my 8K-context RAG pipeline?"

TR127 fills this gap with a systematic context-length sweep on consumer hardware within this research program. It also reveals a phenomenon -- VRAM spillover via CUDA Unified Memory -- that was invisible in short-context experiments and would silently degrade production deployments.

1.3 Scope

  • Hardware: Single consumer machine (RTX 4080 Laptop, 12 GB VRAM) -- same GPU as TR117-TR126.
  • Platform: Windows 11, Python 3.13, PyTorch 2.8.0+cu128 (for HF); Ollama localhost (for quantized inference).
  • Models: 5 models spanning 0.5B-3.2B parameters: 3 HF-only (qwen2.5-0.5b/1.5b/3b), 3 Ollama-only (llama3.2-1b/3b, qwen2.5-1.5b), with qwen2.5-1.5b on both backends for direct comparison.
  • Context lengths: 7 levels -- 512, 1024, 2048, 4096, 8192, 16384, 32768 tokens (geometric progression, 2x steps).
  • Modes: prefill, decode (128 new tokens), end-to-end (prefill + decode).
  • Timing: torch.cuda.synchronize() + perf_counter for HF; native prompt_eval_duration/eval_duration for Ollama.
  • Temperature: 0.0 (greedy decoding). Deterministic -- validated by TR124 Phase 3.
  • Repetitions: 10 measured + 3 warmup (discarded) per (model x context_length).

1.4 Literature Grounding

Reference Contribution How TR127 Uses It
TR123 (Banterhearts) KV cache cost model, VRAM formulas Validate empirical KV costs against theoretical predictions
TR125 (Banterhearts) Quantization quality data, Ollama timing Ollama model selection, native timing methodology
TR126 (Banterhearts) HF vs Ollama comparison at short context Cross-reference backend comparison findings
FlashAttention (Dao et al., 2022) Tiling-based exact attention at sub-quadratic memory Explains Ollama's sub-linear scaling
CUDA Unified Memory (NVIDIA) Transparent CPU-GPU memory migration Explains VRAM spillover mechanism

Gap filled: Prior reports tested performance at fixed, short context lengths. TR127 provides the first context-length scaling curves on consumer hardware, revealing two-regime behavior invisible in short-context experiments.

1.5 How to Read This Report

Use TR127 in three passes:

  1. SS2-SS3 (Methodology): Understand the experimental design and what was measured. If you trust the setup, skip to results.
  2. SS4-SS7 (Scaling Analysis): The core contribution -- prefill scaling (the two-regime discovery), decode scaling, VRAM scaling, and TTFT analysis. These four sections answer the four research questions.
  3. SS8-SS14 (Comparisons & Synthesis): Backend comparison, outlier analysis, power analysis, cross-model comparison, and production guidance. Read these for deployment decisions.

2. Methodology & Experimental Design

2.1 Independent Variable

Context length is the single independent variable, swept across 7 levels:

Level Context Length (tokens) Doubling Step
1 512 --
2 1,024 2x
3 2,048 2x
4 4,096 2x
5 8,192 2x
6 16,384 2x
7 32,768 2x

The geometric (2x) progression provides even coverage on a log scale, which is natural for analyzing power-law scaling relationships.

2.2 Model Lineup

Model Params Backend Dtype Max Context KV Heads Base VRAM (MB)
qwen2.5-0.5b 500M transformers-gpu FP16 32K 2 1,122
qwen2.5-1.5b 1,543M transformers-gpu + ollama FP16 / Q4 131K 2 3,132
qwen2.5-3b 3,000M transformers-gpu FP16 32K 2 6,190
llama3.2-1b 1,236M ollama Q8_0 131K -- --
llama3.2-3b 3,213M ollama Q4_K_M 131K -- --

Design rationale:

  • Three HF models at 0.5B/1.5B/3B (3x parameter intervals) provide controlled VRAM scaling: each model has a different base footprint, so the "VRAM budget for KV cache" differs, and the spillover threshold should occur at different context lengths. This is confirmed: qwen2.5-3b spills at 8K while the smaller models spill at 16K.
  • Three Ollama models at 1B/1.5B/3B provide a quantized reference that does not hit VRAM limits, enabling measurement of true computational scaling without VRAM artifacts.
  • qwen2.5-1.5b on both backends enables a direct HF-vs-Ollama comparison at each context length, isolating the backend effect from the model effect.

2.3 Backend Selection

Backend Measurement VRAM Tracking Context Control
transformers-gpu torch.cuda.synchronize() + perf_counter (GPU-accurate wall clock) torch.cuda.max_memory_allocated() Tokenizer truncation to exact token count
ollama Native prompt_eval_duration / eval_duration (GPU-only, from llama.cpp) Not available (no VRAM API) num_ctx option in API call

Why no compiled HF? TR126 established that torch.compile on Windows falls back to aot_eager (no Triton). Testing compiled HF in a long-context sweep on Windows would measure aot_eager scaling, which is uninformative. TR126's Linux Docker results show compilation benefits prefill but crashes on decode -- neither applicable to this Windows-only context sweep.

2.4 Measurement Protocol

For each (model x backend x context_length):

  1. Prompt generation: Synthetic text tokenized to exactly N tokens using the model's tokenizer (HF) or repeated seed text (Ollama)
  2. Warmup: 3 repetitions (discarded) -- critical for JIT, memory allocation, CUDA context setup
  3. Measurement: 10 repetitions, each measuring:
    • Prefill: Forward pass over the entire prompt (single pass, use_cache=True)
    • Decode: Autoregressive generation of 128 new tokens using KV cache
    • E2E: Prefill + decode timed together
  4. VRAM recording: torch.cuda.max_memory_allocated() recorded once per context length (resets between context lengths)
  5. OOM handling: torch.cuda.OutOfMemoryError caught and recorded as status: "oom" -- the OOM context length itself is a data point

2.5 Prompt Generation

Synthetic prompts are constructed to hit exact token counts:

  • HF: A seed paragraph is repeated and then tokenized. The token IDs are truncated to exactly N tokens. This ensures the tokenizer reports exactly the target context length.
  • Ollama: Text is repeated to approximately the target length. The num_ctx API parameter controls the context window. Actual prompt_eval_count varies slightly from the target (e.g., 29,294 prompt tokens for a 32K context length on llama3.2-3b) due to Ollama's tokenizer differences.

2.6 Controlled Variables

Variable Value Rationale
max_new_tokens 128 Fixed decode length for fair comparison across context lengths
temperature 0.0 Greedy decoding -- deterministic, no sampling variance
seed 42 Reproducible random state
warmup_repetitions 3 Exclude cold-start artifacts
repetitions 10 Sufficient for N=10 power at large effect sizes

2.7 Sample Counts

Backend Models Context Lengths Reps Modes Total Planned Actual (ok) OOM
transformers-gpu 3 7 10 3 630 540* 4
ollama 3 7 10 3 630 600 0
Total -- -- -- -- 1,260 1,140 4

*HF models: qwen2.5-0.5b ran 6/7 lengths (OOM at 32K), qwen2.5-1.5b ran 6/7 (OOM at 32K), qwen2.5-3b ran 5/7 (OOM at 16K and 32K). Plus 4 OOM sentinel rows = 1,144 total rows in CSV.


3. Environment & Artifacts

3.1 Environment Fingerprint

Property Value
Platform Windows-11-10.0.26200-SP0
Python 3.13.1
PyTorch 2.8.0+cu128
CUDA 12.8
cuDNN 91002
GPU NVIDIA GeForce RTX 4080 Laptop GPU
VRAM 12.88 GB (12,288 MB usable)
Compute Capability 8.9 (Ada Lovelace)
Triton Not available (Windows)
Ollama localhost:11434 (llama.cpp backend)

3.2 Preflight Validation

Check Result Detail
CUDA available Pass RTX 4080 Laptop GPU detected
GPU free memory 10.78 GB ~1.21 GB used by system at start
CUDA sync test Pass torch.cuda.synchronize() completes without error

3.3 Run Timeline

Event Time Duration
Start 10:11:30 --
qwen2.5-0.5b complete (6 lengths) ~11:00 ~49 min
qwen2.5-1.5b complete (6 lengths) ~13:30 ~2h 30m
qwen2.5-3b complete (5 lengths) ~14:00 ~30 min
Ollama models complete (3 x 7 lengths) 15:13 ~1h 13m
Pipeline end (analysis + report) 15:13:12 ~5h 2m total

The HF models took ~3h 49m (dominated by 16K-token context runs that entered the VRAM thrashing regime). Ollama models took ~1h 13m with no thrashing.

3.4 Key Artifacts

Artifact Path Size
Raw measurements research/tr127/results/20260224_101128/metrics.csv 127 KB, 1,144 rows
Analysis research/tr127/results/20260224_101128/analysis.json 75 KB
Auto-generated report research/tr127/results/20260224_101128/report.md 29 KB
Manifest research/tr127/results/20260224_101128/manifest.json 3 KB

4. Prefill Scaling Analysis

This section answers Research Question 1: Does attention's quadratic cost show up empirically on the RTX 4080?

4.1 The Two-Regime Discovery

The central finding of TR127 is that HF prefill scaling exhibits two distinct regimes:

Regime 1 -- Pre-spillover (computational scaling): When the model + KV cache fits in physical VRAM (12 GB), prefill latency follows a power law with exponent b = 1.58-1.78. This is between linear (b=1) and quadratic (b=2), consistent with self-attention's O(n^2) cost partially mitigated by hardware optimizations (SDPA kernels, Ada Lovelace architecture optimizations). Quadratic fits achieve R^2 = 0.999+, confirming that the O(n^2) model is an excellent description of the computational scaling.

Regime 2 -- Post-spillover (VRAM thrashing): When KV cache allocation exceeds physical VRAM, CUDA Unified Memory silently pages tensors to system RAM. Each attention operation must now fetch data over PCIe (~16 GB/s) instead of GDDR6 (~256 GB/s), causing 25-105x latency increases. Full-range power-law fits show apparent exponents of b = 4.6-6.7 -- but these are artifacts of the memory hierarchy transition, not O(n^6) attention.

Why two regimes? The transition is not gradual -- it's a cliff. At 8K tokens, qwen2.5-3b allocates 16.2 GB (1.32x physical VRAM). The excess 4.2 GB must be fetched from system RAM on every access. Because attention involves all-pairs token comparisons, the amount of data fetched from system RAM scales quadratically, creating a multiplicative interaction between quadratic compute and PCIe bandwidth limits.

4.2 Full-Range Scaling Fits

For completeness, the full-range fits (including thrashing data) are reported below. These show what a naive analysis would conclude if VRAM spillover were not identified:

Model Backend Exponent (b) R^2 (power) R^2 (quad) R^2 (linear) Better Fit
llama3.2-1b ollama 0.083 0.488 0.864 0.840 linear
llama3.2-3b ollama 0.158 0.718 0.969 0.960 linear
qwen2.5-0.5b transformers-gpu 6.645 1.000 0.990 0.796 power_law
qwen2.5-1.5b ollama 0.109 0.588 0.860 0.855 linear
qwen2.5-1.5b transformers-gpu 6.703 1.000 0.990 0.795 power_law
qwen2.5-3b transformers-gpu 4.625 1.000 0.994 0.832 power_law

Warning: The b = 4.6-6.7 exponents for HF models are VRAM thrashing artifacts, not evidence of O(n^6) attention. The full-range power-law fit achieves R^2 ~ 1.000 because the 16K/8K data points (at 100x latency) dominate the fit. These numbers should NOT be used for scaling predictions.

4.3 Pre-Thrashing Scaling Fits (True Computational Scaling)

By restricting the fit to context lengths where VRAM allocation stays within 12 GB, we isolate the true computational scaling:

Model Backend Clean Points Exponent (b) R^2 (power) R^2 (quad) R^2 (linear) Thrashing At Thrashing Mult
qwen2.5-0.5b transformers-gpu 5 (512-8K) 1.701 0.9999 0.9999 0.969 16,384 100.7x
qwen2.5-1.5b transformers-gpu 5 (512-8K) 1.780 0.9982 0.9998 0.954 16,384 104.7x
qwen2.5-3b transformers-gpu 4 (512-4K) 1.583 0.9950 0.9992 0.967 8,192 25.2x

Interpretation:

  • Exponents b = 1.58-1.78 are between linear (b=1) and quadratic (b=2). Pure O(n^2) self-attention would give b=2. The sub-quadratic exponents suggest hardware optimizations (SDPA with memory-efficient attention, Ada Lovelace tensor cores) reduce the effective scaling exponent by 10-20%.
  • Quadratic R^2 = 0.999+ means the quadratic model latency = a*n^2 + b*n + c fits the pre-spillover data almost perfectly. This is consistent with O(n^2) attention as the dominant cost, even though the power-law exponent is slightly below 2.
  • The thrashing multiplier (25-105x) quantifies how much worse performance gets when VRAM spills. qwen2.5-0.5b jumps from 4.7 seconds (8K, pre-spillover) to 474 seconds (16K, post-spillover) -- a 100.7x increase for a 2x context growth.

4.4 Per-Model Prefill Analysis (HF)

qwen2.5-0.5b (500M params, FP16)

Context Length Mean (ms) Median (ms) Std (ms) CV% 95% CI N Regime
512 52.6 52.7 0.7 1.4% [52, 53] 10 Clean
1,024 144.5 144.5 1.0 0.7% [144, 145] 10 Clean
2,048 441.2 441.4 2.9 0.6% [439, 443] 10 Clean
4,096 1,452.3 1,452.1 4.8 0.3% [1,449, 1,456] 10 Clean
8,192 4,723.6 4,721.0 7.4 0.2% [4,718, 4,729] 10 Clean
16,384 473,789.2 475,338.8 5,244.4 1.1% [470,038, 477,541] 10 Thrashing
32,768 -- -- -- -- -- 0 OOM

Observations:

  1. Extraordinary measurement precision pre-spillover: CV ranges from 0.2% to 1.4% across 5 context lengths. The 95% CIs are +/-1-4 ms wide. This validates the measurement methodology -- torch.cuda.synchronize() + perf_counter produces highly repeatable GPU timing.
  2. The 100x cliff: At 16K tokens, latency jumps from 4,724 ms to 473,789 ms (100.3x increase for 2x context growth). The clean-regime prediction at 16K would be ~15,000 ms (based on the b=1.70 power law). The actual value is 31x worse than the computational prediction.
  3. Base VRAM is tiny (1,122 MB): This model uses only 1.1 GB for weights, leaving ~11 GB for KV cache. Yet it still spills at 16K tokens because KV cache grows at 2.11 MB/token.

qwen2.5-1.5b (1.5B params, FP16)

Context Length Mean (ms) Median (ms) Std (ms) CV% 95% CI N Regime
512 108.8 110.2 2.6 2.4% [107, 111] 10 Clean
1,024 243.3 242.7 4.1 1.7% [240, 246] 10 Clean
2,048 507.1 504.4 15.6 3.1% [496, 518] 10 Clean
4,096 1,436.4 1,395.3 105.9 7.4% [1,361, 1,512] 10 Clean
8,192 5,086.0 5,083.8 10.2 0.2% [5,079, 5,093] 10 Clean
16,384 532,890.8 532,221.5 13,308.7 2.5% [523,370, 542,411] 10 Thrashing
32,768 -- -- -- -- -- 0 OOM

Observations:

  1. Higher base latency: At 512 tokens, qwen2.5-1.5b (109 ms) is 2.1x slower than qwen2.5-0.5b (53 ms). The 3x parameter increase translates to ~2x latency increase -- sub-linear in model size, consistent with GPU parallelism absorbing some of the additional compute.
  2. 4K anomaly: The 4,096-token measurement shows elevated std (106 ms, CV 7.4%) compared to neighboring points. The mean (1,436 ms) is higher than the median (1,395 ms), suggesting one or two warm runs. This may indicate the model is approaching VRAM pressure (5,013 MB allocated at 4K -- 41% of VRAM) causing occasional allocator overhead.
  3. 105x cliff at 16K: 5,086 ms -> 532,891 ms. Even more dramatic than the 0.5B model, likely because the larger model has more weights to page in addition to KV cache.

qwen2.5-3b (3B params, FP16)

Context Length Mean (ms) Median (ms) Std (ms) CV% 95% CI N Regime
512 164.0 156.5 22.9 13.9% [148, 180] 10 Clean
1,024 341.9 346.8 43.5 12.7% [311, 373] 10 Clean
2,048 756.8 732.2 63.4 8.4% [711, 802] 10 Clean
4,096 2,432.3 2,428.4 10.5 0.4% [2,425, 2,440] 10 Clean
8,192 61,981.8 61,211.1 2,108.4 3.4% [60,474, 63,490] 10 Thrashing
16,384 -- -- -- -- -- 0 OOM
32,768 -- -- -- -- -- 0 OOM

Observations:

  1. Earlier spillover: The 3B model's 6.2 GB base footprint leaves only 6.1 GB for KV cache, so it spills at 8K tokens (16.2 GB allocated, 1.32x physical VRAM) instead of 16K.
  2. Higher baseline variance: CV of 8-14% at 512-2K tokens is much higher than the smaller models' 0.2-3.1%. This may be caused by the model's larger memory footprint interacting with the CUDA allocator more frequently, or by thermal throttling (the 150W laptop TDP may cause clock throttling during sustained 3B-parameter forward passes).
  3. "Only" 25x cliff: The 4,096->8,192 jump is 25.5x (compared to 100x for the smaller models). The 3B model enters spillover earlier (8K vs 16K), and the spillover ratio (1.32x) is smaller than for the smaller models (2.66-2.83x), so the thrashing is less severe. However, the absolute latency (62 seconds for a single 8K prefill) is still catastrophically slow.

4.5 Ollama Cold-Start Analysis

Despite 3 warmup repetitions, Ollama's first measured repetition (rep-0) consistently shows 2-10x higher latency than subsequent repetitions. This pattern was present in 100% (21/21) of Ollama measurement groups and must be understood before interpreting Ollama scaling data.

Cold-Start Magnitude

qwen2.5-1.5b (Ollama)

Context Rep-0 (ms) Rest Median (ms) Cold Ratio Mean Inflation Rest CV%
512 70.2 9.0 7.8x +44.7% 5.7%
1,024 118.2 8.8 13.4x +58.1% 6.1%
2,048 240.1 7.7 31.2x +67.8% 5.4%
4,096 492.3 8.0 61.5x +73.5% 3.8%
8,192 959.3 10.3 93.1x +76.1% 3.9%
16,384 2,028.2 10.4 195.0x +79.5% 3.6%
32,768 4,104.5 13.4 306.3x +80.0% 3.6%

Key observations:

  1. Cold ratio grows with context length -- from 7.8x at 512 tokens to 306x at 32K tokens. This is consistent with Ollama performing lazy KV cache allocation on the first inference: longer contexts require more KV cache memory, so the first-run overhead is proportionally larger.
  2. Rest CV% is low (3.6-6.1%) -- after filtering rep-0, Ollama measurements are quite stable. The 97-307% CV reported in SS4.6 is entirely caused by the cold-start outlier.
  3. Mean inflation is severe (45-80%) -- including rep-0 inflates the mean by up to 80% at long contexts. The median is robust (unchanged by rep-0), and 10%-trimmed mean (removes 1 value from each tail of N=10) also eliminates the cold-start. However, 5%-trimmed mean with N=10 is insufficient -- floor(10 x 0.05) = 0 values removed, so it equals the untrimmed mean.

Recommendation: Use median or 10%-trimmed mean for Ollama central tendency. Do not report arithmetic mean without disclosing cold-start contamination.

4.6 Ollama Prefill Scaling

Ollama shows fundamentally different scaling behavior -- sub-linear across the entire 512-32K range:

llama3.2-1b (Ollama, Q8_0)

Context Length Mean (ms) Median (ms) Std (ms) CV% N
512 12.0 8.4 11.7 97% 10
1,024 16.2 8.3 25.1 155% 10
2,048 23.3 7.1 51.5 221% 10
4,096 41.8 7.3 109.5 262% 10
8,192 81.4 8.4 230.5 283% 10
16,384 168.4 9.3 504.0 299% 10
32,768 371.9 11.4 1,140.0 307% 10

Critical observation: Mean vs Median divergence. The medians are nearly flat (7-11 ms across 64x context growth) while means grow 31x (12->372 ms). This pattern -- high CV (97-307%), large mean-median gap -- is characteristic of cold-start outliers: 1-2 of the 10 repetitions take 10-100x longer than the rest, pulling the mean up dramatically. Despite 3 warmup reps, the first measured repetition often shows this pattern with Ollama, likely due to llama.cpp's internal lazy initialization or KV cache pre-allocation on first real inference.

The median is the reliable measure for Ollama. Median prefill time for llama3.2-1b grows from 8.4 ms (512) to 11.4 ms (32K) -- a 1.4x increase over 64x context growth. This is profoundly sub-linear (b ~ 0.08) and consistent with Flash Attention's O(n) memory access pattern on optimized hardware.

llama3.2-3b (Ollama, Q4_K_M)

Context Length Mean (ms) Median (ms) Std (ms) N
512 23.3 12.5 34.2 10
1,024 34.6 12.9 69.3 10
2,048 56.0 11.7 140.6 10
4,096 102.2 11.9 285.6 10
8,192 191.2 14.7 557.9 10
16,384 370.0 16.2 1,119.1 10
32,768 828.5 22.6 2,549.3 10

Same pattern: medians grow from 12.5 to 22.6 ms (1.8x over 64x context growth). The 3B model shows slightly more context sensitivity than the 1B model (median growth factor 1.8x vs 1.4x), consistent with more parameters requiring more memory bandwidth for attention computation.

qwen2.5-1.5b (Ollama, Q4_K_M)

Context Length Mean (ms) Median (ms) Std (ms) N
512 15.1 9.2 19.4 10
1,024 21.0 8.8 39.1 10
2,048 33.2 7.7 81.1 10
4,096 60.8 8.0 166.8 10
8,192 117.0 10.3 338.1 10
16,384 226.6 10.4 683.2 10
32,768 457.5 13.4 1,404.7 10

Median grows from 9.2 to 13.4 ms (1.5x over 64x context growth). This is the same model (qwen2.5-1.5b) that takes 532 seconds on HF at 16K tokens -- Ollama processes it in 10.4 ms (median). The difference is 51,000x at 16K tokens.

4.6 Scaling Exponent Summary

Backend Model Regime Exponent (b) Interpretation
HF qwen2.5-0.5b Pre-spillover 1.70 Superlinear (approaching quadratic)
HF qwen2.5-1.5b Pre-spillover 1.78 Superlinear (approaching quadratic)
HF qwen2.5-3b Pre-spillover 1.58 Superlinear (partially optimized)
HF All Post-spillover 4.6-6.7 VRAM thrashing artifact
Ollama llama3.2-1b Full range 0.08 Sub-linear (Flash Attention)
Ollama llama3.2-3b Full range 0.16 Sub-linear (Flash Attention)
Ollama qwen2.5-1.5b Full range 0.11 Sub-linear (Flash Attention)

Answer to Research Question 1: Yes, quadratic attention cost is empirically visible in the pre-spillover regime (b = 1.58-1.78). However, it is NOT the dominant bottleneck -- VRAM spillover causes 25-105x cliffs that completely dominate the scaling picture. Ollama's optimized attention eliminates quadratic scaling entirely.

4.8 Trimmed-Mean Robustness Analysis

To verify that scaling exponents are not artifacts of outliers, we re-fit the power law using trimmed means instead of medians. With N=10 repetitions per context length:

  • 5% trim = floor(10 x 0.05) = 0 values removed per tail -> identical to untrimmed mean
  • 10% trim = floor(10 x 0.10) = 1 value removed per tail -> removes extreme rep
Model Backend Median b Trim 5% b Trim 10% b Stable?
qwen2.5-0.5b HF 6.645 6.640 6.642 Yes
qwen2.5-1.5b HF 6.703 6.704 6.706 Yes
qwen2.5-3b HF 4.625 4.640 4.630 Yes
llama3.2-1b ollama 0.083 1.083 0.081 No (13.4x)
llama3.2-3b ollama 0.158 1.063 0.159 No (6.7x)
qwen2.5-1.5b ollama 0.109 0.976 0.111 No (8.8x)

Interpretation:

  • HF exponents are rock-solid: Median, 5%-trimmed, and 10%-trimmed fits all agree within 0.3%. The thrashing-dominated exponents are not outlier artifacts -- they reflect genuine VRAM spillover scaling.
  • Ollama exponents are median-robust but mean-fragile: The 5% trim (which removes 0 values from N=10) gives wildly inflated exponents (b ~ 1.0 instead of 0.1) because the cold-start rep-0 contaminates the mean at every context length. The 10% trim removes 1 value per tail, recovering the median-based exponent exactly.
  • This confirms the cold-start finding (SS4.5): Ollama scaling analysis MUST use median or >=10% trimmed mean. Arithmetic mean produces exponents 7-13x too high due to a single cold-start outlier per group.

5. Decode Scaling Analysis

5.1 Decode Throughput vs Context Length

Decode throughput measures token generation speed. Each decode step must attend over the full KV cache from all prior tokens, so longer contexts increase per-step attention cost.

Ollama Decode Throughput (Clean Scaling)

Context llama3.2-1b (tok/s) llama3.2-3b (tok/s) qwen2.5-1.5b (tok/s)
512 162.6 98.7 146.7
1,024 161.5 97.6 144.6
2,048 157.8 95.0 141.2
4,096 153.0 89.3 134.5
8,192 140.3 79.4 129.9
16,384 121.1 64.8 102.7
32,768 96.0 46.9 80.0

Degradation rates (512->32K):

Model Start (tok/s) End (tok/s) Degradation Factor
llama3.2-1b 162.6 96.0 -41.0% 1.69x slower
qwen2.5-1.5b 146.7 80.0 -45.5% 1.83x slower
llama3.2-3b 98.7 46.9 -52.5% 2.10x slower

Pattern: Decode throughput degradation is approximately linear in context length -- consistent with each decode step's attention cost growing linearly with KV cache size. The larger model (3B) shows more degradation (53%) than the smaller model (1B, 41%), likely because the 3B model's larger KV cache per token amplifies the per-step attention cost.

Production implication: At 32K context, decode speed drops to 47-96 tok/s (Ollama). For interactive applications requiring >=50 tok/s, the practical decode context limit is:

  • llama3.2-1b: ~32K tokens (96 tok/s)
  • qwen2.5-1.5b: ~26K tokens (estimated by interpolation)
  • llama3.2-3b: ~16K tokens (65 tok/s drops below 50 at ~20K)

HF Decode Throughput (Pre-Spillover + Thrashing)

Context qwen2.5-0.5b (tok/s) qwen2.5-1.5b (tok/s) qwen2.5-3b (tok/s)
512 48.9 42.1 27.2
1,024 47.9 41.0 28.2
2,048 48.1 45.2 26.6
4,096 49.3 30.5 17.9
8,192 40.9 19.3 0.9
16,384 23.3 2.1 OOM

Observations:

  1. HF decode is 3-4x slower than Ollama even pre-spillover. At 512 tokens: HF qwen2.5-1.5b = 42 tok/s vs Ollama qwen2.5-1.5b = 147 tok/s. This is consistent with TR126's finding that Ollama dominates decode due to quantized weights and optimized C++ KV-cache implementation.
  2. Catastrophic decode collapse at spillover: qwen2.5-3b decode goes from 17.9 tok/s (4K) to 0.9 tok/s (8K) -- a 20x drop. At 0.9 tok/s, generating 128 tokens takes 145 seconds. qwen2.5-1.5b drops from 19.3 tok/s (8K) to 2.1 tok/s (16K).
  3. Decode thrashing is even worse than prefill thrashing because decode involves per-step attention over the entire KV cache, and each step requires reading the full cache from system RAM.

5.2 Decode Scaling Fits

Model Backend Exponent (b) R^2 Better Fit Interpretation
llama3.2-1b ollama 0.160 0.822 linear Gentle linear degradation
llama3.2-3b ollama 0.063 0.157 linear Nearly flat (model-internal overhead dominates)
qwen2.5-0.5b HF 0.246 0.641 linear Mild degradation, last point enters spillover
qwen2.5-1.5b ollama 0.149 0.839 linear Gentle linear degradation
qwen2.5-1.5b HF 3.018 0.986 power_law VRAM thrashing at 16K
qwen2.5-3b HF 4.247 0.996 power_law VRAM thrashing at 8K

The HF decode exponents (b = 3.0, 4.2) are thrashing artifacts, same as prefill.

5.3 Pre-Thrashing Decode Fits (Two-Regime Analysis)

Like prefill (SS4), decode latency also exhibits two regimes. Pre-spillover decode exponents quantify the true KV-cache lookup cost scaling:

Model Backend Clean Points Exponent (b) R^2 (power) Thrashing At Thrashing Mult
qwen2.5-0.5b HF 5 (512-8K) 0.053 0.463 16,384 8.8x
qwen2.5-1.5b HF 5 (512-8K) 0.356 0.788 16,384 7.3x
qwen2.5-3b HF 4 (512-4K) 0.230 0.692 8,192 6.9x

Interpretation:

  • Pre-spillover decode is nearly flat (b = 0.05-0.36). This means each decode step's KV-cache lookup cost grows very slowly with context length -- consistent with hardware-optimized attention that reads the KV cache efficiently. The 0.5B model is essentially constant (b = 0.05) because its tiny KV cache fits entirely in GPU L2 cache at all pre-spillover contexts.
  • Decode thrashing multipliers (7-9x) are lower than prefill (25-105x). Decode generates one token at a time (reading KV cache + writing one K/V pair), while prefill processes the entire sequence in one pass. The smaller working set per decode step means less data needs to page from system RAM.
  • The R^2 values are modest (0.46-0.79) because pre-spillover decode latency has very little variation -- the signal (scaling) is tiny relative to measurement noise. This is actually a positive finding: decode latency is remarkably stable across 512-8K context lengths.

5.4 Decode Trimmed-Mean Robustness

Model Backend Median b Trim 5% b Trim 10% b Stable?
qwen2.5-0.5b HF 0.246 0.253 0.249 Yes
qwen2.5-1.5b HF 3.018 3.018 3.023 Yes
qwen2.5-3b HF 4.247 4.247 4.244 Yes
llama3.2-1b ollama 0.160 0.142 0.162 Yes
llama3.2-3b ollama 0.063 0.103 0.062 Yes
qwen2.5-1.5b ollama 0.149 0.130 0.150 Yes

Decode scaling exponents are stable across all trimming levels -- the cold-start issue primarily affects prefill (where the initial KV cache allocation occurs), not decode (which reuses the already-allocated cache).


6. Memory Scaling (CUDA Allocation)

This section answers Research Question 2: At what context length does VRAM become the bottleneck per model?

Important: Peak VRAM values are torch.cuda.max_memory_allocated(), which includes CUDA Unified Memory spillover to system RAM. Values exceeding 12,288 MB (12 GB physical VRAM) indicate system-memory paging -- the root cause of performance cliffs. Ollama does not expose VRAM metrics, so this analysis covers HF models only.

6.1 VRAM Growth Summary

Model Slope (MB/token) KV Cost (B/token) R^2 Spillover At OOM Cliff
qwen2.5-0.5b 2.11 1,164,226 0.940 16,384 32,768
qwen2.5-1.5b 1.85 1,034,253 0.942 16,384 32,768
qwen2.5-3b 1.32 752,846 0.951 8,192 16,384

KV cache cost per token is derived from the VRAM growth slope: slope_MB_per_token x 1024 x 1024 / 2 (the /2 converts from bytes to the key+value pair). These represent the all-in cost including PyTorch allocator overhead, activation memory, and KV cache tensors.

6.2 Per-Model VRAM Curves

qwen2.5-0.5b -- CUDA Allocation

Context Length Peak Alloc (MB) In GPU? Spillover (GB) Alloc/VRAM Ratio
512 1,122 Yes -- 0.09x
1,024 1,282 Yes -- 0.10x
2,048 1,603 Yes -- 0.13x
4,096 3,175 Yes -- 0.26x
8,192 9,553 Yes -- 0.78x
16,384 34,787 NO 22.0 GB 2.83x

Observations: VRAM grows 8.5x from 8K->16K (9.6 GB -> 34.8 GB) while context only doubles. This superlinear VRAM growth is consistent with quadratic attention requiring temporary activation memory proportional to n^2. The 22 GB spillover means the GPU is borrowing 22 GB from system RAM -- every attention operation must fetch most of its data over PCIe.

qwen2.5-1.5b -- CUDA Allocation

Context Length Peak Alloc (MB) In GPU? Spillover (GB) Alloc/VRAM Ratio
512 3,132 Yes -- 0.25x
1,024 3,305 Yes -- 0.27x
2,048 3,689 Yes -- 0.30x
4,096 5,013 Yes -- 0.41x
8,192 10,654 Yes -- 0.87x
16,384 32,690 NO 19.9 GB 2.66x

Observations: The 1.5B model starts with a 3.1 GB base footprint (vs 1.1 GB for 0.5B). Despite the higher base, it spills at the same context length (16K) because the per-token KV cost is slightly lower (1.85 vs 2.11 MB/token). At 8K tokens, it uses 10.7 GB -- just barely fitting in VRAM, which explains the high measurement precision at 8K (CV 0.2%).

qwen2.5-3b -- CUDA Allocation

Context Length Peak Alloc (MB) In GPU? Spillover (GB) Alloc/VRAM Ratio
512 6,190 Yes -- 0.50x
1,024 6,370 Yes -- 0.52x
2,048 6,768 Yes -- 0.55x
4,096 8,717 Yes -- 0.71x
8,192 16,167 NO 3.8 GB 1.32x

Observations: The 3B model's 6.2 GB base footprint consumes half the VRAM before any context is processed. At 4K tokens, 8.7 GB is allocated (71% of VRAM). By 8K, the allocation (16.2 GB) exceeds physical VRAM, triggering the 25x thrashing cliff. The spillover is "only" 3.8 GB (vs 20-22 GB for the smaller models) because the model hits OOM before context grows further.

6.3 VRAM Budget Model

The practical context budget can be computed as:

max_context_tokens ~ (GPU_VRAM_MB - model_base_MB) / slope_MB_per_token
Model Base (MB) Available (MB) Slope (MB/tok) Max Context (theoretical) Max Context (observed)
qwen2.5-0.5b 1,122 11,166 2.11 ~5,290 8,192 (pre-spillover)
qwen2.5-1.5b 3,132 9,156 1.85 ~4,950 8,192 (pre-spillover)
qwen2.5-3b 6,190 6,098 1.32 ~4,620 4,096 (pre-spillover)

The theoretical maximum is where base + slope x context = 12,288 MB. Observed pre-spillover maximums are higher because our context levels jump by 2x, so the model may fit at a given level even if the next level exceeds VRAM.

6.4 KV Cache Cross-Validation with TR123 Theory

TR123 computed theoretical KV cache costs from model architecture:

kv_cost_per_token = num_layers x num_kv_heads x head_dim x precision_bytes x 2 (keys + values)

We now have empirical VRAM growth slopes to cross-validate against TR123's theoretical predictions:

Model Architecture Theoretical KV (B/tok) Empirical Slope (B/tok) Overhead Ratio
qwen2.5-0.5b 24L x 2KV x 64d 12,288 1,164,226 94.7x
qwen2.5-1.5b 28L x 2KV x 128d 28,672 1,034,253 36.1x
qwen2.5-3b 36L x 2KV x 128d 36,864 752,846 20.4x

Why is the overhead so large (20-95x)?

The empirical VRAM slope captures all memory that grows with context length, not just the KV cache:

  1. KV cache tensors -- the theoretical minimum (12-37 KB/token)
  2. Attention workspace -- temporary attention score matrices, softmax buffers, output projections. For self-attention with n tokens, this includes n x n attention matrices per head per layer
  3. Activation memory -- intermediate tensors during the forward pass that scale with sequence length
  4. CUDA allocator fragmentation -- PyTorch's caching allocator rounds allocations to block sizes and maintains free lists, causing 10-30% overhead
  5. Gradient/state buffers -- even in inference mode, PyTorch may allocate temporary buffers

The trend is informative: The overhead ratio decreases with model size (95x -> 36x -> 20x). This is expected because GQA's aggressive head sharing (only 2 KV heads for all three Qwen models) makes the theoretical KV cost tiny relative to other context-dependent memory. The 0.5B model's theoretical KV cost is only 12 KB/token -- essentially negligible compared to the ~1.1 MB/token of attention workspace and activations. For the 3B model, the theoretical KV cost is 3x larger (37 KB/token), so it represents a slightly larger fraction of the total.

Conclusion: The empirical VRAM slope is NOT a KV cache cost estimator -- it's a total context-dependent memory estimator. To isolate true KV cache cost, one would need to measure VRAM growth while holding attention workspace constant (e.g., by varying only the KV cache size with a fixed-length prompt). This reinterpretation corrects the v1 claim that "KV cache costs 0.75-1.16 MB/token" -- the correct statement is "total context-dependent VRAM grows at 0.75-1.16 MB/token, of which KV cache is 1-5% and the remainder is attention workspace, activations, and allocator overhead."


7. Time to First Token (TTFT) Analysis

This section answers Research Question 3: How does TTFT scale with context length?

TTFT equals prefill latency -- the time from receiving a prompt to producing the first output token. For interactive applications, acceptable TTFT thresholds are typically:

  • Excellent: < 200 ms (imperceptible)
  • Good: < 1,000 ms (noticeable but acceptable)
  • Poor: < 5,000 ms (frustrating)
  • Unacceptable: > 10,000 ms (broken experience)

7.1 Threshold Crossings

Model Backend >1s (Good->Poor) >5s (Poor->Unacceptable) >10s
llama3.2-1b ollama Never Never Never
llama3.2-3b ollama Never Never Never
qwen2.5-0.5b HF 4,096 16,384* 16,384*
qwen2.5-1.5b ollama Never Never Never
qwen2.5-1.5b HF 4,096 8,192 16,384*
qwen2.5-3b HF 4,096 8,192 8,192

*At 16K, TTFT is 7.9 minutes (qwen2.5-0.5b) and 8.9 minutes (qwen2.5-1.5b) due to VRAM thrashing.

7.2 TTFT Growth Factors

Model Backend 512->32K TTFT Growth Factor Context Growth Scaling
llama3.2-1b ollama 8.4->11.4 ms (median) 1.4x 64x Sub-linear
llama3.2-3b ollama 12.5->22.6 ms (median) 1.8x 64x Sub-linear
qwen2.5-1.5b ollama 9.2->13.4 ms (median) 1.5x 64x Sub-linear
qwen2.5-0.5b HF 52.6->473,789 ms 9,004x 32x* Thrashing
qwen2.5-1.5b HF 108.8->532,891 ms 4,898x 32x* Thrashing
qwen2.5-3b HF 164.0->61,982 ms 378x 16x* Thrashing

*HF models OOM before reaching 32K, so growth factors use the maximum measured context length.

7.3 Production Implications

Ollama maintains sub-second TTFT at all context lengths tested (up to 32K). Even the largest model (llama3.2-3b) produces first tokens in 23 ms (median) at 32K context -- well within the "excellent" threshold.

HF crosses the 1-second TTFT threshold at 4K tokens for ALL models. This means that for any interactive application processing more than ~4K tokens of context on a 12 GB consumer GPU, HF transformers in FP16 is not viable. At 8K+ tokens, TTFT ranges from 5 seconds (qwen2.5-0.5b) to 62 seconds (qwen2.5-3b, in thrashing regime).

For TTFT-sensitive applications, Ollama is the most viable backend tested on consumer hardware at long contexts. The 50,000x TTFT difference at 16K tokens (10 ms Ollama vs 533 seconds HF) is not a performance difference -- it's a qualitative capability gap.


8. Backend Comparison (HF vs Ollama)

Direct comparison is possible for qwen2.5-1.5b, which ran on both backends at 6 shared context lengths (512-16,384). All comparisons use Welch's t-test.

8.1 Prefill: Ollama Dominates at All Contexts

Context HF Mean (ms) Ollama Mean (ms) Diff (ms) % Change p-value Cohen's d
512 108.8 15.1 -93.7 -86.1% 1.08e-11 -6.78
1,024 243.3 21.0 -222.3 -91.4% 6.76e-13 -7.99
2,048 507.1 33.2 -473.8 -93.4% 5.12e-13 -8.12
4,096 1,436.4 60.8 -1,375.7 -95.8% 1.82e-14 -9.85
8,192 5,086.0 117.0 -4,969.0 -97.7% 3.39e-20 -20.77
16,384 532,890.8 226.6 -532,664.2 -100.0% 5.37e-28 -56.53

All 6 comparisons are significant (p < 10^-11). Effect sizes are massive (d = -6.78 to -56.53) and grow monotonically with context length. At 512 tokens, Ollama is 7.2x faster. At 16K tokens, Ollama is 2,352x faster.

Why does the gap widen? HF prefill scales superlinearly (b ~ 1.78) while Ollama scales sub-linearly (b ~ 0.11). These opposing curves diverge exponentially. At the spillover threshold (16K), HF enters the 105x thrashing regime while Ollama continues at 10 ms -- the gap becomes astronomical.

8.2 Decode: Ollama 3-30x Faster

Context HF Mean (ms) Ollama Mean (ms) % Change Cohen's d
512 3,038.2 872.9 -71.3% -68.83
1,024 3,125.1 851.9 -72.7% -29.76
2,048 2,834.3 906.3 -68.0% -43.90
4,096 4,194.9 951.5 -77.3% -31.41
8,192 6,628.3 985.9 -85.1% -323.43
16,384 61,169.1 1,252.7 -98.0% -55.24

Ollama decode is 3.1-3.5x faster pre-spillover (512-4K) and 6.7-48.8x faster at higher contexts. The pre-spillover ratio is consistent with TR126's finding that Ollama's quantized KV-cache decode is ~3-7x faster than eager HF.

8.3 E2E: Ollama Dominates

Context HF Mean (ms) Ollama Mean (ms) % Change Cohen's d
512 3,147.0 888.0 -71.8% -63.66
1,024 3,368.4 872.9 -74.1% -50.76
2,048 3,341.4 939.6 -71.9% -33.69
4,096 5,631.3 1,012.3 -82.0% -23.33
8,192 11,714.3 1,103.0 -90.6% -47.10
16,384 594,059.9 1,479.3 -99.8% -56.96

18 out of 18 pairwise comparisons (3 modes x 6 context lengths) show Ollama significantly faster (p < 0.05). There is no context length or mode where HF outperforms Ollama on this hardware.

8.4 Multiple Comparison Correction

With 18 pairwise tests, family-wise error rate must be controlled. We apply both Bonferroni (conservative) and Holm-Bonferroni (step-down, less conservative) corrections:

Correction Threshold Significant Tests Survival Rate
Uncorrected p < 0.050 18/18 100%
Bonferroni p < 0.0028 18/18 100%
Holm-Bonferroni Stepwise 18/18 100%

All 18 comparisons survive both corrections. The maximum p-value across all tests is 5.37 x 10^-28 (prefill at 16K) -- orders of magnitude below even the strictest Bonferroni threshold (0.0028). These are not marginal effects inflated by multiple testing -- they are genuine, massive differences.

This is consistent with the effect sizes (Cohen's d = 6.78 to 323.43) being in the "very large" category. The backend performance gap is so dramatic that statistical correction is confirmatory rather than revelatory.

8.5 ANOVA Interaction: Context Length x Backend

Does the effect of backend depend on context length? A significant interaction would mean Ollama's advantage isn't constant -- it changes with context.

qwen2.5-1.5b -- Prefill (the only model on both backends):

Effect F-statistic p-value Significant
Backend (main) 14.20 0.00025 Yes
Context length (main) (see per-context) < 0.001 Yes
Interaction evidence -- -- Strong (magnitude change)

Per-context backend test (prefill):

Context HF Mean (ms) Ollama Mean (ms) t-stat p-value Sig?
512 108.8 15.1 -20.92 <0.001 Yes
1,024 243.3 21.0 -32.32 <0.001 Yes
2,048 507.1 33.2 -18.95 <0.001 Yes
4,096 1,436.4 60.8 -9.13 <0.001 Yes
8,192 5,086.0 117.0 -39.22 <0.001 Yes
16,384 532,890.8 226.6 -56.53 <0.001 Yes

Interaction interpretation: The interaction is classified as "strong (magnitude change)" -- the backend effect is significant at every context length, but the absolute gap changes dramatically:

  • At 512 tokens: |HF - Ollama| = 93.7 ms (HF is 7.2x slower)
  • At 16,384 tokens: |HF - Ollama| = 532,664 ms (HF is 2,352x slower)

The gap widens 5,687x over 32x context growth. This is because HF enters the VRAM thrashing regime while Ollama does not -- creating a multiplicative interaction between backend and context length. A formal two-way ANOVA would show a massive interaction F-statistic, but the visual evidence in SS8.1 is already definitive.

Decode ANOVA (qwen2.5-1.5b):

The decode interaction is also strong (F_backend = 23.42, p < 0.001), with the gap widening from 3.5x at 512 tokens to 48.8x at 16K tokens as HF decode enters the thrashing regime.


9. Outlier Analysis

9.1 Detection Method

Outliers are detected using Tukey's IQR fence per context length within each (model, backend, mode) group. This prevents false positives from pooling measurements across heterogeneous regimes (e.g., 8 ms Ollama at 512 tokens vs 533,000 ms HF at 16K tokens would make everything an "outlier" under global IQR).

9.2 Summary

Metric Value
Total measurements 1,140 (ok status)
Total outliers (per-context IQR) 116
Overall outlier rate 10.2%

9.3 Per-Model Outlier Rates

Model Backend Mode N Outliers Rate
llama3.2-1b ollama prefill 70 8 11.4%
llama3.2-1b ollama decode 70 7 10.0%
llama3.2-1b ollama e2e 70 9 12.9%
llama3.2-3b ollama prefill 70 13 18.6%
llama3.2-3b ollama decode 70 14 20.0%
llama3.2-3b ollama e2e 70 11 15.7%
qwen2.5-0.5b HF prefill 60 2 3.3%
qwen2.5-0.5b HF decode 60 4 6.7%
qwen2.5-0.5b HF e2e 60 5 8.3%
qwen2.5-1.5b ollama prefill 70 8 11.4%
qwen2.5-1.5b ollama decode 70 5 7.1%
qwen2.5-1.5b ollama e2e 70 8 11.4%
qwen2.5-1.5b HF prefill 60 5 8.3%
qwen2.5-1.5b HF decode 60 3 5.0%
qwen2.5-1.5b HF e2e 60 3 5.0%
qwen2.5-3b HF prefill 50 5 10.0%
qwen2.5-3b HF decode 50 3 6.0%
qwen2.5-3b HF e2e 50 3 6.0%

9.4 Pattern Analysis

Ollama models have higher outlier rates (7-20%) than HF models (3-10%). This is consistent with the cold-start outlier pattern identified in SS4.5: despite 3 warmup reps, 1-2 measured reps per context length show elevated latency (likely from llama.cpp internal lazy initialization). The llama3.2-3b model has the highest outlier rate (15-20% across modes), suggesting the 3B model's larger internal state takes more cold-start overhead.

HF models show very low outlier rates (3-6%) pre-spillover. The outliers that do exist are concentrated at context lengths near the spillover boundary (e.g., 4K tokens for qwen2.5-1.5b where CV jumped to 7.4%).

Impact on conclusions: The high Ollama outlier rate is why we report medians alongside means for Ollama measurements. Medians are robust to outliers and provide the reliable central tendency. HF means and medians are nearly identical pre-spillover (CV < 3%), so mean is reliable for HF.

9.5 Distribution Shape Analysis

To formally justify when mean vs median should be used, we analyze the distribution shape of each measurement group:

Model Backend Mode Pooled Skewness Mean/Median Ratio Interpretation
qwen2.5-0.5b HF prefill 1.75 1.00 Symmetric (thrashing pulls right)
qwen2.5-1.5b HF prefill 2.28 1.00 Mild right skew
qwen2.5-3b HF prefill 1.37 1.04 Near-symmetric
llama3.2-1b ollama prefill 6.36 12.15 Extreme right skew
llama3.2-3b ollama prefill 6.02 11.09 Extreme right skew
qwen2.5-1.5b ollama prefill 6.08 10.83 Extreme right skew
All Ollama ollama decode 0.5-1.0 1.01-1.11 Mild right skew
All HF HF decode 0.6-1.8 1.00-1.03 Near-symmetric

Key finding: Mean/median ratio >2.0 indicates severe skew where the arithmetic mean is unreliable. All 3 Ollama prefill groups show ratios of 10-12x -- the mean is 10-12x larger than the median. This is entirely caused by cold-start rep-0 outliers (SS4.5): one value 100-300x larger than the rest inflates the mean dramatically.

Shapiro-Wilk normality test: Per-context-length tests show most HF pre-spillover groups pass normality (p > 0.05), confirming that parametric tests (t-test, ANOVA) are valid for HF data. Ollama groups universally fail normality due to the cold-start rep-0.

Implication for all statistical tests: The t-test p-values in SS8 use means, but since the effects are so massive (d > 6), the non-normality of Ollama data does not affect conclusions. A non-parametric test (Mann-Whitney U) would yield identical significance levels for effects of this magnitude.


10. Power Analysis

10.1 Aggregate Power

Parameter Value
Repetitions 10
Alpha 0.05
Power 0.80
Min detectable Cohen's d (z-based) 1.253
Min detectable Cohen's d (t-based) 0.94
Sensitivity Large effects only

With N=10 repetitions, this experiment can reliably detect only large effects (d > 0.94). Small or medium effects between context lengths or backends may be missed. However, the observed effects in TR127 are massive (d > 6 for backend comparisons, thrashing multipliers of 25-105x), so the limited power does not affect the primary findings.

10.2 Stratified Power Per Model x Backend

Model Backend Pooled Std (ms) Min Detectable (ms) N
llama3.2-1b ollama 475.9 596 70
llama3.2-3b ollama 1,067.3 1,337 70
qwen2.5-0.5b HF 177,568 222,476 60
qwen2.5-1.5b ollama 600.4 752 70
qwen2.5-1.5b HF 199,793 250,322 60
qwen2.5-3b HF 24,701 30,948 50

Caveat: The pooled std for HF models (25K-200K ms) is dominated by the post-spillover data points. These massive values make the global MDE meaningless for HF -- a 200,000 ms "minimum detectable effect" is not useful for interpreting pre-spillover differences.

10.3 Measurement Precision (CV% by Context Length)

The meaningful power metric is per-context-length:

HF (pre-spillover) -- Extremely precise:

Model Context Std (ms) CV% MDE (ms)
qwen2.5-0.5b 512 0.7 1.4% 0.9
qwen2.5-0.5b 4,096 4.8 0.3% 6.0
qwen2.5-0.5b 8,192 7.4 0.2% 9.3
qwen2.5-1.5b 512 2.6 2.4% 3.3
qwen2.5-1.5b 8,192 10.2 0.2% 12.8
qwen2.5-3b 4,096 10.5 0.4% 13.2

Interpretation: At N=10, HF can detect effects as small as 1-13 ms (MDE) within individual context lengths. This is outstanding precision -- meaningful for sub-percent comparisons within a context length.

Ollama -- High variance from cold-start:

Model Context Std (ms) CV% MDE (ms)
llama3.2-1b 512 11.7 97% 14.6
llama3.2-1b 8,192 230.5 283% 289
llama3.2-1b 32,768 1,140.0 307% 1,428
qwen2.5-1.5b 512 19.4 128% 24.3
qwen2.5-1.5b 32,768 1,404.7 307% 1,760

Interpretation: Ollama CV% > 100% at every context length. The MDE at 32K is 1,428-1,760 ms -- this experiment cannot detect sub-second Ollama differences at long contexts. However, the actual effects being measured (HF vs Ollama: thousands of ms) are far larger than the MDE.


11. Cross-Model Comparison

11.1 Scale Effect on Prefill (Pre-Spillover)

How does model size affect prefill latency at each context length?

Context qwen2.5-0.5b (ms) qwen2.5-1.5b (ms) qwen2.5-3b (ms) 1.5B/0.5B Ratio 3B/0.5B Ratio
512 52.6 108.8 164.0 2.1x 3.1x
1,024 144.5 243.3 341.9 1.7x 2.4x
2,048 441.2 507.1 756.8 1.1x 1.7x
4,096 1,452.3 1,436.4 2,432.3 1.0x 1.7x
8,192 4,723.6 5,086.0 61,981.8* 1.1x 13.1x*

*qwen2.5-3b is in the thrashing regime at 8K.

Pattern: At short contexts (512), latency scales roughly linearly with model size (3.1x for 6x parameters). At longer contexts (4K), the ratio narrows to 1.7x as attention computation (which scales with context length, not model size) begins to dominate. At 4K, the 0.5B and 1.5B models have nearly identical latency (1,452 vs 1,436 ms) -- the extra parameters add negligible overhead when attention over 4K tokens is the bottleneck.

11.2 Scale Effect on Decode (Ollama)

Context llama3.2-1b (tok/s) qwen2.5-1.5b (tok/s) llama3.2-3b (tok/s)
512 162.6 146.7 98.7
4,096 153.0 134.5 89.3
32,768 96.0 80.0 46.9

Decode throughput scales inversely with model size across all context lengths: the 3B model is consistently ~2x slower than the 1B model. This is expected -- each decode step must read all model parameters plus the KV cache, and the 3B model has ~3x more parameters.

11.3 Context Budget by Model Size

Model Params Base VRAM VRAM for KV Spillover Context OOM Context Practical Budget
qwen2.5-0.5b 500M 1.1 GB 10.9 GB 16K 32K 8K (safe)
qwen2.5-1.5b 1.5B 3.1 GB 8.9 GB 16K 32K 8K (safe)
qwen2.5-3b 3.0B 6.2 GB 5.8 GB 8K 16K 4K (safe)

The "practical budget" is one step below the spillover context, ensuring the model stays in the clean computational regime.


12. Key Findings

  1. Pre-thrashing prefill scaling (qwen2.5-0.5b, HF): exponent b = 1.701, quadratic R^2 = 0.9999 on 512-8,192 tokens. At 16,384 tokens, VRAM spills to system RAM causing a 100.7x latency cliff.
  2. Pre-thrashing prefill scaling (qwen2.5-1.5b, HF): exponent b = 1.780, quadratic R^2 = 0.9998 on 512-8,192 tokens. At 16,384 tokens, VRAM spills to system RAM causing a 104.7x latency cliff.
  3. Pre-thrashing prefill scaling (qwen2.5-3b, HF): exponent b = 1.583, quadratic R^2 = 0.9992 on 512-4,096 tokens. At 8,192 tokens, VRAM spills to system RAM causing a 25.2x latency cliff.
  4. VRAM thrashing dominates HF scaling: Full-range exponents of b = 6.6, 6.7, 4.6 are caused by system-memory paging, not O(n^2) attention. The true computational scaling (pre-spillover) shows b = 1.58-1.78.
  5. Ollama prefill scales sub-linearly: All 3 Ollama models show b < 0.2, confirming Flash Attention eliminates quadratic overhead at 512-32K contexts.
  6. Decode also shows two regimes: Pre-spillover HF decode exponents (b = 0.05-0.36) confirm near-constant KV-cache lookup cost. Post-spillover decode thrashing multipliers (7-9x) are lower than prefill (25-105x) because decode's smaller per-step working set reduces paging overhead.
  7. Decode throughput degrades linearly with context: Ollama models lose 41-53% throughput from 512->32K. HF models show similar pre-spillover degradation plus catastrophic post-spillover collapse (95% at qwen2.5-1.5b 16K).
  8. OOM cliffs follow spillover by one step: 0.5B/1.5B spill at 16K -> OOM at 32K. 3B spills at 8K -> OOM at 16K.
  9. TTFT exceeds 1 second at 4K on all HF models. Ollama TTFT never exceeds 1 second at any context length.
  10. All 18 backend comparisons survive Bonferroni correction: 18/18 pairwise tests (3 modes x 6 shared context lengths) remain significant after both Bonferroni (p < 0.0028) and Holm-Bonferroni correction. These are genuine effects, not multiple-testing artifacts.
  11. Strong context x backend interaction (ANOVA): The Ollama advantage widens 5,687x from 512 to 16K tokens (F_backend = 14.20, p = 0.00025 for qwen2.5-1.5b prefill), confirming the interaction between backend choice and context length.
  12. Ollama cold-start contaminates 100% of measurement groups: Rep-0 shows 2-306x higher latency than subsequent reps. Mean inflation is 45-80% at long contexts. Median is robust; 5%-trimmed mean with N=10 is insufficient (removes 0 values); 10%-trimmed mean recovers correctly.
  13. Scaling exponents are robust to trimming (HF) but fragile (Ollama mean): HF exponents are stable within 0.3% across all trimming levels. Ollama mean-based exponents are 7-13x too high due to cold-start; median-based exponents are correct.
  14. Distribution shape confirms mean unreliability for Ollama: Mean/median ratio = 10-12x for Ollama prefill (extreme right skew from cold-start). HF pre-spillover groups pass Shapiro-Wilk normality.
  15. KV cross-validation reveals 20-95x overhead over theory: Empirical VRAM slopes (0.75-1.16 MB/token) are 20-95x larger than theoretical KV cache costs (12-37 KB/token). The difference is attention workspace, activations, and allocator fragmentation -- not KV cache inflation.
  16. HF measurement precision is excellent pre-spillover (CV 0.2-3.1%). Ollama rest-of-run CV is also excellent (3.6-6.1%) after filtering rep-0.
  17. Measurement stability: 116 outliers across 1,140 measurements (10.2% outlier rate, per-context IQR method). Ollama accounts for the majority due to cold-start.

13. Conclusions

Q1: Does attention quadratic cost show up empirically on RTX 4080?

Two-regime answer -- yes, but it's not the bottleneck:

Pre-spillover (true computational scaling):

  • qwen2.5-0.5b: b = 1.70, quadratic R^2 = 0.9999 (512-8K tokens)
  • qwen2.5-1.5b: b = 1.78, quadratic R^2 = 0.9998 (512-8K tokens)
  • qwen2.5-3b: b = 1.58, quadratic R^2 = 0.9992 (512-4K tokens)

The exponents 1.58-1.78 confirm superlinear scaling consistent with O(n^2) attention, partially mitigated by hardware optimizations. The quadratic model fits the data almost perfectly (R^2 = 0.999+).

Post-spillover (VRAM thrashing -- the real bottleneck):

  • qwen2.5-0.5b: 100.7x latency cliff at 16K tokens
  • qwen2.5-1.5b: 104.7x latency cliff at 16K tokens
  • qwen2.5-3b: 25.2x latency cliff at 8K tokens

Ollama (optimized attention -- no quadratic cost):

  • llama3.2-1b: b = 0.083 (sub-linear)
  • llama3.2-3b: b = 0.158 (sub-linear)
  • qwen2.5-1.5b: b = 0.109 (sub-linear)

Summary: On consumer hardware, you hit the VRAM wall long before O(n^2) attention becomes the practical bottleneck. Ollama eliminates both problems -- optimized attention AND no VRAM spillover (quantized models have smaller footprints).

Q2: At what context length does VRAM become the bottleneck?

Model Spillover Threshold Peak Alloc at Spillover Alloc/VRAM Ratio KV Cost (MB/tok) OOM Cliff
qwen2.5-0.5b 16,384 34,787 MB 2.83x 2.11 32,768
qwen2.5-1.5b 16,384 32,690 MB 2.66x 1.85 32,768
qwen2.5-3b 8,192 16,167 MB 1.32x 1.32 16,384

The bottleneck context length is determined by: (GPU_VRAM - model_base_weight) / KV_cache_per_token. Larger models hit the wall sooner despite lower per-token KV costs because their weight footprint consumes more of the VRAM budget.

Q3: How does TTFT scale with context length?

HF (catastrophic scaling):

  • qwen2.5-0.5b: 53 ms (512) -> 474 seconds (16K) -- 9,004x increase over 32x context growth
  • qwen2.5-1.5b: 109 ms (512) -> 533 seconds (16K) -- 4,898x increase over 32x context growth
  • qwen2.5-3b: 164 ms (512) -> 62 seconds (8K) -- 378x increase over 16x context growth

Ollama (graceful scaling, medians):

  • llama3.2-1b: 8.4 ms (512) -> 11.4 ms (32K) -- 1.4x increase over 64x context growth
  • llama3.2-3b: 12.5 ms (512) -> 22.6 ms (32K) -- 1.8x increase over 64x context growth
  • qwen2.5-1.5b: 9.2 ms (512) -> 13.4 ms (32K) -- 1.5x increase over 64x context growth

Threshold crossings: All HF models exceed 1-second TTFT at 4,096 tokens. No Ollama model exceeds 1-second TTFT at any context length tested.

Q4: Is there a context length cliff?

Yes -- the VRAM spillover cliff:

Model Cliff Location Latency Jump Multiplier
qwen2.5-0.5b 8K -> 16K 4,724 ms -> 473,789 ms 100.3x
qwen2.5-1.5b 8K -> 16K 5,086 ms -> 532,891 ms 104.8x
qwen2.5-3b 4K -> 8K 2,432 ms -> 61,982 ms 25.5x

These are not gradual degradations -- they are catastrophic cliffs. A deployment processing 8K tokens successfully will fail (100x slower) at 16K tokens with no warning. CUDA Unified Memory does not raise an exception; it silently degrades performance.

There is also a secondary cliff: OOM, where PyTorch raises torch.cuda.OutOfMemoryError. This occurs one context-length step after spillover.

Ollama shows no cliff at any context length tested (up to 32K). Decode throughput degrades gradually (41-53% over 64x context growth), but there is no sudden performance discontinuity.


14. Production Guidance & Decision Trees

14.1 Backend Selection by Context Length

Context Length Recommended Backend Rationale
<= 2K tokens Either HF: precise FP16, 50-500 ms TTFT. Ollama: 3x faster with quantization.
2K-4K tokens Ollama preferred HF: 1-2.5 seconds TTFT. Ollama: <15 ms TTFT.
4K-8K tokens Ollama required HF: 5-62 seconds TTFT (approaching spillover). Ollama: <15 ms.
8K-32K tokens Ollama only HF: OOM or 100x thrashing. Ollama: <25 ms TTFT, 47-130 tok/s decode.
>32K tokens Ollama only Not tested, but Ollama supports 131K context natively.

14.2 Model Selection by Use Case

Use Case Context Budget Best Model Rationale
Interactive chat (< 4K) 2K-4K llama3.2-1b (Ollama) Fastest TTFT (8 ms), highest decode (163 tok/s)
RAG pipeline (4K-8K) 4K-8K qwen2.5-1.5b (Ollama) Good decode (130 tok/s at 8K), balanced quality
Document summarization (8K-32K) 8K-32K llama3.2-3b (Ollama) Largest model that handles 32K, 47 tok/s decode
Quality-critical (need FP16) <= 4K qwen2.5-0.5b (HF) Exact FP16 precision, fits 8K pre-spillover

14.3 VRAM Budget Planning

For FP16 HF on N GB VRAM:

safe_context_tokens ~ (N_GB x 1024 - model_base_MB) / slope_MB_per_token x 0.8

The 0.8 factor provides a 20% safety margin to avoid approaching the spillover cliff.

GPU VRAM qwen2.5-0.5b Max Context qwen2.5-1.5b Max Context qwen2.5-3b Max Context
8 GB ~2,600 tokens ~2,100 tokens ~1,100 tokens
12 GB ~4,200 tokens ~3,900 tokens ~3,700 tokens
16 GB ~6,100 tokens ~5,600 tokens ~5,900 tokens
24 GB ~9,700 tokens ~9,100 tokens ~10,200 tokens

14.4 Warning Signs of Approaching VRAM Spillover

  1. torch.cuda.max_memory_allocated() exceeding 80% of physical VRAM
  2. Sudden latency increase (>3x between consecutive context lengths)
  3. Increasing variance (CV > 5%) at previously stable context lengths
  4. The 4K-token inflection point for qwen2.5-1.5b (CV jumped to 7.4%)

15. Limitations & Future Work

15.1 Limitations

  1. Single GPU. All measurements are on one RTX 4080 Laptop (12 GB). Results may differ on desktop GPUs (higher bandwidth, different thermal profiles) or server GPUs (24-80 GB VRAM would shift spillover thresholds dramatically).

  2. Small model range. 0.5B-3.2B parameters. The 7B-70B models commonly used in production were not tested because they don't fit in 12 GB VRAM at FP16. Ollama handles larger models but was only tested up to 3.2B.

  3. Synthetic prompts. Prompts are repeated text tokenized to exact lengths, not natural language. Real prompts may have different attention patterns (sparser or denser) that affect the scaling exponent.

  4. No Flash Attention on HF. The HF measurements use PyTorch's default SDPA (Scaled Dot-Product Attention), not the Flash Attention 2 library. Flash Attention would likely reduce the pre-spillover exponent and delay the VRAM cliff.

  5. Windows-only HF. TR126 showed that torch.compile on Windows falls back to aot_eager. The HF measurements reflect unoptimized eager execution. Compiled HF on Linux (with Triton) would be faster for prefill (TR126: -53% at short contexts) but would still hit the same VRAM cliff at the same context lengths.

  6. Ollama cold-start variance. Despite 3 warmup reps, Ollama measurements show 97-307% CV when including rep-0. The cold-start analysis (SS4.5) identifies and characterizes this phenomenon, and all Ollama scaling exponents are reported using medians (robust to cold-start). After filtering rep-0, Ollama CV drops to 3.6-6.1% -- comparable to HF precision. Future experiments should use >=5 warmup reps or explicitly discard rep-0 from analysis.

  7. N=10 power limitations. At N=10, only large effects (d > 0.94) are detectable. Subtle differences between context lengths or backends may be missed. The primary findings (100x cliffs, 2,352x backend differences) are well above the detection threshold.

  8. No multi-GPU or model parallelism. All tests are single-GPU. Model parallelism across 2+ GPUs would increase the effective VRAM budget and shift spillover thresholds.

  9. VRAM measurement is allocator-level. torch.cuda.max_memory_allocated() includes allocator overhead and fragmentation. The true KV cache cost may be lower than the measured slope.

15.2 Future Work

  1. TR128+: Flash Attention comparison. Run the same context sweep with Flash Attention 2 on HF to measure how much it reduces the pre-spillover exponent and whether it delays the spillover cliff.

  2. Isolate pure KV cache cost. SS6.4 shows empirical VRAM slopes are 20-95x theoretical KV costs due to attention workspace and allocator overhead. A targeted experiment holding attention workspace constant while varying only KV cache size would isolate the true KV cost. The current cross-validation (SS6.4) quantifies the total overhead but cannot decompose it.

  3. Extended context beyond 32K. Ollama supports 131K context. Test 64K and 128K to find Ollama's practical limits.

  4. Quantization-specific context limits. TR125 tested quality at different quant levels. Run the context sweep at Q2_K through Q8_0 to measure how quantization affects the VRAM budget and spillover threshold.

  5. Server GPU validation. Repeat on an A100 (80 GB) or H_100 to measure whether the pre-spillover exponents change with hardware (more memory bandwidth, larger caches) and to find spillover thresholds at 80 GB scale.


16. Reproducibility

16.1 Exact Reproduction

# Prerequisites: Ollama running with llama3.2:1b, qwen2.5:1.5b, llama3.2:3b pulled
# HF models cached in models/qwen2.5-{0.5b,1.5b,3b}/
# Python 3.13+ with torch, transformers, scipy, numpy, pandas, pyyaml, requests

# Run the full pipeline
python research/tr127/run.py -v

# Or step-by-step:
python research/tr127/run_context_sweep.py --config research/tr127/config.yaml -v
python research/tr127/analyze.py -v
python research/tr127/generate_report.py -v

Estimated runtime: ~5 hours on RTX 4080 Laptop (12 GB VRAM). The majority of time is spent on HF models at high context lengths in the thrashing regime.

16.2 Config (Source of Truth)

See Appendix B for the exact config.yaml used.

16.3 Key Artifacts

Artifact Path
Experiment runner research/tr127/run_context_sweep.py
Analysis (with two-regime fixes) research/tr127/analyze.py
Report generator research/tr127/generate_report.py
Orchestrator research/tr127/run.py
Config research/tr127/config.yaml
Raw metrics research/tr127/results/20260224_101128/metrics.csv
Analysis results research/tr127/results/20260224_101128/analysis.json
Manifest research/tr127/results/20260224_101128/manifest.json

16.4 Analysis Scripts

Script Description
research/tr127/run_context_sweep.py Core experiment: HF + Ollama context sweep
research/tr127/analyze.py v2: Two-regime scaling (prefill + decode), cold-start detection, KV cross-validation, Bonferroni/Holm, ANOVA, trimmed means, distribution shape
research/tr127/generate_report.py v2: Auto-generates markdown report with all v2 analysis sections
research/tr127/shared/utils.py Prompt generation, path utilities

Appendix A: Environment Specifications

GPU Specifications

Property Value
Name NVIDIA GeForce RTX 4080 Laptop GPU
Architecture Ada Lovelace (AD104)
Compute Capability 8.9
CUDA Cores 7,424
VRAM 12.88 GB GDDR6
Memory Bus 192-bit
Memory Bandwidth 256 GB/s
PCIe Bandwidth ~16 GB/s (PCIe 4.0 x16)
TDP 150W (laptop)

Software Stack

Component Version
OS Windows 11 Home 10.0.26200
Python 3.13.1
PyTorch 2.8.0+cu128
CUDA 12.8
cuDNN 91002
Triton Not available (Windows)
Transformers Latest (pip)
Ollama localhost:11434

Ollama Model Tags

Model Ollama Tag Quantization
llama3.2-1b llama3.2:1b Q8_0 (default)
qwen2.5-1.5b qwen2.5:1.5b Q4_K_M (default)
llama3.2-3b llama3.2:3b Q4_K_M (default)

Appendix B: Config (Source of Truth)

# TR127: Long-Context Performance Characterization
experiment: tr127

context_lengths: [512, 1024, 2048, 4096, 8192, 16384, 32768]

models:
  - name: qwen2.5-0.5b
    path: models/qwen2.5-0.5b
    params_m: 500
    max_context: 32768
    dtype: fp16
    ollama_tag: null

  - name: qwen2.5-1.5b
    path: models/qwen2.5-1.5b
    params_m: 1543
    max_context: 131072
    dtype: fp16
    ollama_tag: "qwen2.5:1.5b"

  - name: qwen2.5-3b
    path: models/qwen2.5-3b
    params_m: 3000
    max_context: 32768
    dtype: fp16
    ollama_tag: null

  - name: llama3.2-1b
    path: null
    params_m: 1236
    max_context: 131072
    dtype: fp16
    ollama_tag: "llama3.2:1b"

  - name: llama3.2-3b
    path: null
    params_m: 3213
    max_context: 131072
    dtype: fp16
    ollama_tag: "llama3.2:3b"

backends:
  - transformers-gpu
  - ollama

device: cuda
repetitions: 10
warmup_repetitions: 3
max_new_tokens: 128
seed: 42

ollama_url: http://localhost:11434
ollama_timeout_s: 600

Appendix C: Glossary

Term Definition
Context length The number of tokens in the input prompt. Also called "sequence length" or "context window." Determines the size of the attention matrix (n x n) and KV cache (proportional to n).
CUDA Unified Memory NVIDIA feature that transparently migrates data between GPU VRAM and system RAM. Allows allocation beyond physical GPU memory, but at PCIe bandwidth (~16 GB/s) instead of GDDR6 (~256 GB/s). PyTorch uses this implicitly when VRAM is exhausted.
Flash Attention Memory-efficient exact attention algorithm (Dao et al., 2022) that computes attention in O(n) memory by tiling the computation. Used by llama.cpp (Ollama backend). Reduces the effective scaling exponent from O(n^2) to near-linear at moderate context lengths.
KV cache Key-Value cache: stored attention keys and values from all previous tokens, reused during autoregressive decode. Each new token attends over the full KV cache, so decode cost grows linearly with context length. Size is proportional to layers x kv_heads x head_dim x precision_bytes x context_length x 2.
OOM (Out of Memory) torch.cuda.OutOfMemoryError: PyTorch's CUDA allocator cannot satisfy a memory request even with Unified Memory. The hard failure point, typically 2-3x physical VRAM depending on system RAM.
Prefill The initial forward pass over the entire input prompt. Populates the KV cache. This is the time-to-first-token (TTFT) computation. Scales superlinearly with context length due to O(n^2) self-attention.
SDPA Scaled Dot-Product Attention: PyTorch's built-in attention implementation (torch.nn.functional.scaled_dot_product_attention). Selects between FlashAttention, Memory-Efficient Attention, or math-based attention depending on input size and hardware. On Windows without Triton, typically uses the math or memory-efficient kernel.
Spillover When CUDA memory allocation exceeds physical GPU VRAM. Detected by torch.cuda.max_memory_allocated() > GPU_VRAM_MB. Causes 25-105x latency increases due to PCIe-bound data transfer.
TTFT Time to First Token: the latency from receiving a prompt to producing the first output token. Equal to prefill latency plus any framework overhead. The primary responsiveness metric for interactive applications.
Thrashing Performance degradation caused by repeated data movement between fast memory (VRAM) and slow memory (system RAM). Occurs when working set exceeds fast memory capacity. Characterized by latency multipliers of 10-100x rather than gradual degradation.

References

  1. TR108-TR122: Baseline benchmarks and prior short-context performance data.
  2. TR123: KV-Cache Production Economics -- theoretical KV cache cost model and VRAM formulas.
  3. TR124: Quality & Accuracy Baseline -- backend equivalence and metric framework.
  4. TR125: Quantization Decision Matrix -- Ollama model quality and throughput data.
  5. TR126: Linux/Triton Validation -- HF vs Ollama comparison methodology, compiled decode findings.
  6. Dao, T., et al. (2022). "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." NeurIPS 2022.
  7. NVIDIA CUDA Toolkit Documentation: Unified Memory Programming Guide.
  8. llama.cpp: C/C++ LLM inference engine powering Ollama's backend.