Technical Report 121 (Superseded Draft): Model Scaling Study
How inference behavior changes from ~5M to ~20B parameters (HF + Ollama)
| Field | Value |
|---|---|
| TR Number | 121 (Superseded Draft) |
| Project | Banterhearts LLM Performance Research |
| Date | 2025-12-23 |
| Author | Research Team |
| Report Type | Scaling-law analysis + bottleneck characterization (artifact-backed) |
| Related Work | TR117 (backend matrix), TR118_v2.2 (pipeline rigor), TR120 (compile paradox), TR122 (planned resource profiling deep dive) |
Status (read this first)
This document is retained as the original TR121 pipeline draft. It is superseded by the publish-ready report:
PublishReady/reports/Technical_Report_121v1.md
If you are looking for the definitive TR121 analysis (methodology + results + business impact + artifacts), use PublishReady/reports/Technical_Report_121v1.md.
Executive Summary (v1)
TR121 answers a production planning question:
As model size increases, what changes first: latency, tail risk, or feasibility (memory wall), and which runtime regime dominates?
This report builds the TR121 experimentation pipeline and establishes a first, reproducible scaling harness that:
- measures prefill and KV-cached decode separately (TR120-style), and
- spans model sizes from ~5M (local HF GPT-2 variants) to ~20B (Ollama models) on the same machine.
Status and publishing posture
TR121 v1 is pipeline-complete and ready for a full sweep, but it is not yet publish-ready as a definitive scaling claim until:
- the full matrix run is executed (not a smoke subset), and
- the report is updated with the resulting tables/plots + conclusions.
The purpose of this v1 document is to:
- make the harness and artifact structure stable,
- define scaling metrics precisely,
- and make the follow-on "full sweep -> publishable TR121" straightforward.
Table of Contents
- Research Context & Objectives
- Methodology (What Is Measured)
- Experiment Design
- Artifact Structure & Reproducibility
- Analysis (Scaling Fits)
- Limitations & Next Steps
1. Research Context & Objectives
TR117 established backend rankings on a single small model, with a known model skew. TR121 exists to answer:
- Are those rankings stable as model size grows?
- At what size does the system move from "launch/overhead dominated" to "kernel/throughput dominated" to "memory wall dominated"?
- When do "policy decisions" (quantization, batching, routing, compilation) become mandatory rather than optional?
1.1 Scope of v1
TR121 is explicitly a two-family scaling study:
- HF local models (5M-124M) measured via a controlled PyTorch harness.
- Ollama models (270M-20B) measured via Ollama's `/api/generate` counters.
These are not identical runtimes; TR121 v1 focuses on:
- consistent phase definitions (prefill vs decode),
- a consistent decode token budget (`gen_tokens`),
- artifact-backed measurements that can be extended.
2. Methodology (What Is Measured)
TR121 adopts TR120's phase split:
- Prefill: a single forward pass over the prompt with `use_cache=True`.
- KV decode: a fixed-length decode loop using the KV cache (`past_key_values`) for `gen_tokens` steps.
- End-to-end KV: prefill + KV decode.
2.1 HF (local) measurement
HF models are measured by running the model directly in eval/no-grad mode:
- Prefill: `model(input_ids, attention_mask, use_cache=True)`
- KV decode: repeated single-token steps using `past_key_values`
For CUDA devices, both wall-clock and CUDA-event timing are recorded.
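To make the phase split concrete, below is a minimal sketch of this kind of prefill/KV-decode measurement. It is illustrative only (the actual harness is `scripts/tr121/run_scaling.py`); the checkpoint path, prompt, and token budget are assumptions.

```python
# Illustrative HF-side measurement sketch; not the actual TR121 runner code.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "models/gpt2-small"   # illustrative local checkpoint path
gen_tokens = 64                    # decode token budget (from scaling.yaml)

tok = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path).eval()
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

enc = tok("Explain KV caching in one paragraph.", return_tensors="pt").to(device)

def now():
    # Synchronize before reading the clock so CUDA kernels are included.
    if device == "cuda":
        torch.cuda.synchronize()
    return time.perf_counter()

with torch.no_grad():
    # Prefill: single forward pass over the prompt, cache enabled.
    t0 = now()
    out = model(**enc, use_cache=True)
    prefill_s = now() - t0

    # KV decode: fixed-length greedy loop reusing past_key_values.
    past = out.past_key_values
    next_ids = out.logits[:, -1:].argmax(-1)
    t0 = now()
    for _ in range(gen_tokens):
        out = model(next_ids, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_ids = out.logits[:, -1:].argmax(-1)
    decode_s = now() - t0

print(f"prefill {prefill_s * 1e3:.1f} ms, kv decode {decode_s * 1e3:.1f} ms "
      f"({gen_tokens / decode_s:.1f} tok/s)")
```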
2.2 Ollama measurement
Ollama models are measured by calling:
`POST /api/generate` with `options.num_predict = gen_tokens`
The response includes:
- `prompt_eval_count`, `prompt_eval_duration` (prefill analog)
- `eval_count`, `eval_duration` (decode analog)
- `load_duration` (model load / cold-start component)
TR121 treats these counters as the phase source-of-truth for Ollama runs.
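A minimal sketch of such a call and the counters it returns is shown below; the host URL and model tag are illustrative assumptions. Note that Ollama reports these durations in nanoseconds.

```python
# Illustrative call to Ollama's generate endpoint and extraction of the
# phase counters listed above. Host and model tag are assumptions.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",        # default local Ollama endpoint
    json={
        "model": "gemma2:9b",                     # illustrative model tag
        "prompt": "Explain KV caching in one paragraph.",
        "stream": False,
        "options": {"num_predict": 64},           # gen_tokens budget
    },
    timeout=600,
).json()

# Durations are reported in nanoseconds.
prefill_ms = resp["prompt_eval_duration"] / 1e6
decode_ms = resp["eval_duration"] / 1e6
load_ms = resp["load_duration"] / 1e6
print(resp["prompt_eval_count"], prefill_ms,
      resp["eval_count"], decode_ms, load_ms)
```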
2.3 Throughput definition
TR121 reports `tokens_per_s` as `tokens_total / latency_s` for each mode.
This is valid within each runtime family (HF vs Ollama) given consistent token accounting.
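As a concrete illustration (numbers invented), the same definition applies to HF wall-clock seconds and to Ollama's nanosecond counters once converted:

```python
def tokens_per_s(tokens_total: int, latency_s: float) -> float:
    # Same definition for both runtime families: tokens_total / latency_s.
    return tokens_total / latency_s if latency_s > 0 else 0.0

# HF decode: gen_tokens over measured wall-clock decode seconds (illustrative).
print(tokens_per_s(64, 1.87))                   # ~34.2 tok/s

# Ollama decode: eval_count over eval_duration converted from nanoseconds.
print(tokens_per_s(64, 1_870_000_000 / 1e9))    # same value, from ns counters
```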
3. Experiment Design
3.1 Config (source of truth)
Default config: `scripts/tr121/configs/scaling.yaml`
Contains:
- scenarios (prompt strings)
- decode token budget (`gen_tokens`)
- repetitions and warmups
- HF model list (local paths) with `params_millions`
- Ollama model list with `params_millions`
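As an illustration of how the runner can expand this config into a measurement matrix, the sketch below assumes field names (`hf_models`, `ollama_models`, `scenarios`, `gen_tokens`, `repetitions`, `warmups`) that mirror the bullet list above; the actual schema in `scaling.yaml` may differ.

```python
# Illustrative expansion of the config into {model x scenario} measurement jobs.
# Key names are assumptions mirroring the bullet list above.
import itertools
import yaml

with open("scripts/tr121/configs/scaling.yaml") as f:
    cfg = yaml.safe_load(f)

models = (
    [dict(m, family="hf") for m in cfg["hf_models"]]
    + [dict(m, family="ollama") for m in cfg["ollama_models"]]
)

for model, scenario in itertools.product(models, cfg["scenarios"]):
    print(model["family"], model["params_millions"], scenario,
          cfg["gen_tokens"], cfg["repetitions"], cfg["warmups"])
```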
3.2 Output structure
Each run writes:
- `manifest.json` (environment + run config)
- `runs.jsonl` (one JSON record per measurement)
- `metrics.csv` (flat table of all records)
- `analysis/*` (generated by `scripts/tr121/analyze_scaling.py`)
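For orientation, a single `runs.jsonl` record could look like the sketch below; the field names are assumptions chosen to match the metrics defined in Section 2, not the exact schema emitted by the runner.

```python
# Illustrative shape of one runs.jsonl record (field names are assumptions).
import json

record = {
    "backend": "hf",                 # "hf" or "ollama"
    "model": "gpt2-small",
    "params_millions": 124,
    "scenario": "short_prompt",
    "mode": "kv_decode",             # "prefill", "kv_decode", or "e2e_kv"
    "gen_tokens": 64,
    "latency_ms": 1870.0,
    "tokens_per_s": 34.2,
}

with open("runs.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```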
4. Artifact Structure & Reproducibility
4.1 Run the sweep
- `python scripts/tr121/run_scaling.py --config scripts/tr121/configs/scaling.yaml`
- `python scripts/tr121/analyze_scaling.py --run-dir scripts/tr121/results/<RUN_ID>`
- `python scripts/tr121/generate_report.py --run-dir scripts/tr121/results/<RUN_ID>`
4.2 Key artifacts
- Runner: `scripts/tr121/run_scaling.py`
- Analysis: `scripts/tr121/analyze_scaling.py`
- Generated report draft: `scripts/tr121/generate_report.py`
- Config: `scripts/tr121/configs/scaling.yaml`
5. Analysis (Scaling Fits)
TR121 fits a log-log power law per {backend, mode, scenario}:
log10(latency_ms) = a + b * log10(params_millions)
The analysis outputs:
- `analysis/summary_by_model_backend_mode.csv`
- `analysis/scaling_fits.csv`
- `analysis/plots/scaling_*.png`
Interpretation rule:
- Slope `b` approximates the scaling exponent for the measured regime (prompt length + token budget + runtime).
- R^2 indicates how well a simple power law explains observed behavior.
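Below is a minimal sketch of this per-group fit, assuming `metrics.csv` exposes `backend`, `mode`, `scenario`, `params_millions`, and `latency_ms` columns; the actual implementation is `scripts/tr121/analyze_scaling.py`, and the results path is illustrative.

```python
# Illustrative log-log power-law fit per {backend, mode, scenario} group.
import numpy as np
import pandas as pd

df = pd.read_csv("scripts/tr121/results/RUN_ID/metrics.csv")   # illustrative path

rows = []
for (backend, mode, scenario), g in df.groupby(["backend", "mode", "scenario"]):
    x = np.log10(g["params_millions"].to_numpy(dtype=float))
    y = np.log10(g["latency_ms"].to_numpy(dtype=float))
    b, a = np.polyfit(x, y, 1)        # log10(latency_ms) = a + b * log10(params)
    resid = y - (a + b * x)
    r2 = 1.0 - resid.var() / y.var() if y.var() > 0 else float("nan")
    rows.append({"backend": backend, "mode": mode, "scenario": scenario,
                 "slope_b": b, "intercept_a": a, "r_squared": r2})

pd.DataFrame(rows).to_csv("scaling_fits.csv", index=False)
```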
6. Limitations & Next Steps
6.1 Limitations (v1)
- TR121 v1 is a pipeline + definitions milestone; it does not yet embed a full "publishable" sweep in this report.
- HF and Ollama use different runtimes; cross-family comparisons must be framed as "regime behavior," not apples-to-apples backend ranking.
- Resource profiling (VRAM, RSS, power) is not yet first-class in TR121; it is planned as TR122.
6.2 Next steps (to make TR121 publish-ready)
- Run the full matrix in `scripts/tr121/configs/scaling.yaml` (or an expanded config).
- Add a model-family coherence sweep (e.g., Gemma family on HF and Ollama) if weights are available locally.
- Add memory wall evidence (peak VRAM, paging) and correlate with latency inflections.
- Add a decode-length sweep (8/32/64/128/256) for at least one family to show scaling stability vs token budget.