TR117-TR122 Decision Whitepaper (Revised Draft v1.1)

Executive guidance for deployment leaders

Field	Value
Project	Banterhearts LLM Performance Research
Date	2025-12-26
Version	1.1 (editorial tightening)
Report Type	Decision whitepaper
Audience	Decision makers, product leaders, ops leaders
Scope	TR117, TR118_v2.2, TR119v1, TR120, TR121v1, TR122
Primary Source	`PublishReady/reports/Technical_Report_Conclusive_117-122.md`

Abstract

This whitepaper distills TR117-TR122 into deployment policy for local-first LLM inference on a fixed hardware baseline. The program answers one central question:

How do we convert performance measurements into safe operational policy without false attribution or false precision?

Outcome: a reusable decision method. Tokens/sec drives cost and capacity only when the measurement boundary is explicit, inference phases are separated (prefill vs decode), and instrumentation limits are respected. No universal scaling claims are made; results are bound to the measured stack.

Boundary conditions (do not skip)

This guidance is valid only under the measured boundary:

Fixed hardware class (same GPU/CPU/memory tier)
Same inference stack (driver/runtime/compiler versions)
Same measurement definitions (phase split + end-to-end boundaries)
Same instrumentation cadence (notably energy sampling)

If any of these change, re-run the core scenario matrix and re-validate artifacts.

Executive summary: five decisions you can ship now

Default backend (decode-heavy): onnxruntime-gpu
Route by workload shape: batch-heavy prefill can invert winners
Compile policy: compile prefill only, and only with compiler-real evidence + stable shapes
Warmup policy: cold-start is a separate regime -> warm pools + pre-routing warmups
Energy reporting policy: per-event energy requires valid sampling coverage; otherwise label as no_data

If you follow these rules, you avoid the common deployment failures: wrong backend, wrong phase, wrong attribution, false precision.

Definitions (one-time)

Prefill: prompt ingestion / KV cache construction
Decode: token generation loop reusing KV (often the dominant time at moderate+ generation lengths)
Cold-start: first-run / post-idle behavior where caches/allocations/compilation/warm kernels are not resident
Steady-state: warmed execution with stable caches and runtime state

Decision matrix (one-glance policy)

Condition (workload shape)	Operational classification	Default backend	Compile policy	Warmup policy
gen_tokens >= 64 or decode_fraction > 0.9	Decode-heavy	onnxruntime-gpu	Keep decode eager unless a measured decode win exists in your stack	Maintain warm pools; pre-warm per model tier
batch > 1 and gen_tokens minimal	Prefill-heavy	Route by measured prefill winner (can invert)	Compile prefill only if stable shapes + compiler-real evidence	Warmup the prefill path for expected batch shapes
Mixed / uncertain	Default-safe	onnxruntime-gpu (unless your matrix contradicts)	No compile by default	Track cold-start separately; do not average into steady-state

Key findings (decision-grade)

Decode dominates end-to-end at gen >= 64; prefill-only logic is unsafe for capacity planning.
Backend labels are not evidence; label-only compile claims were false in TR117 and corrected via TR120 attribution audit.
Throughput drives cost under typical shadow pricing; energy is secondary unless you are power-constrained or energy-priced.
Scaling is regime-dependent; small-model GPU latency can be dominated by overhead and depth effects.
100ms polling cannot attribute sub-100ms events; short prefill energy frequently becomes no_data without sufficient samples.

Operational recommendations (policy statements)

Backend selection

Policy: Default to onnxruntime-gpu for decode-heavy traffic.
Policy: Maintain a measured scenario matrix; route prefill-heavy cases by actual prefill winner.

Compile

Policy: Compile prefill only.
Policy gate: Enable compile only if you have compiler-real evidence (runtime proof, not labels) and stable shapes.
Policy: Keep decode eager unless you can demonstrate a decode win in your stack without shifting tail latency risk.

Warmup / cold-start

Policy: Treat cold-start as its own regime and track it as an SLO dimension.
Policy: Use warm pools for large models and phase-specific warmups aligned to routing paths.

Energy reporting

Policy gate: Report per-event energy only when there are >= 2 in-window samples; otherwise label as no_data.
Policy: Report energy for macro windows (batch/regime) when event attribution is below sampling resolution.

Economic impact (what changes your spend)

Tokens/sec is the budget lever:

seconds_per_1M = 1e6 / tokens_per_s
usd_per_1M = (seconds_per_1M / 3600) * usd_per_hour

Implications:

Backend choice is multiplicative: cheapest generate path can be ~3x cheaper than the most expensive GPU alternative within this boundary.
Model tiering becomes mandatory at scale: ~7B can be ~3x cheaper than ~20B at fixed decode length in this regime.
Capacity numbers are lower bounds: apply burst factors from real traffic distributions (p95/p99 mix, concurrency, long-tail prompts).

Implementation plan (30-day view)

Days 1-7: reproduce + validate

Re-run the core scenario matrix on your hardware + workload mix.
Validate artifacts to TR118 standard; capture manifests.

Days 8-14: translate into planning numbers

Recompute cost per token and capacity with your own pricing inputs.
If compiling: verify compiler-real evidence and identify stable shapes.

Days 15-30: enforce policies

Ship routing + compile policies based on phase splits.
Roll out warmup + cold-start monitoring; set SLO alarms and dashboards.
Add re-run required triggers to change management.

Risks, limitations, invalidation triggers

Limitations

Single hardware baseline; not portable across hardware classes without re-run.
$/token values are planning proxies (shadow prices), not exact TCO.
Quality is not measured here; pair cost decisions with quality evaluation.
Energy attribution is valid only when sampling coverage supports it.

Invalidates guidance

Driver/compiler/runtime/model upgrades without rerun
Workload mix shifts (prompt/gen/batch) without reweighting
Instrumentation cadence changes or energy pipeline changes

Evidence anchors (audit-ready)

Cost per 1M tokens + scenario inversions: scripts/tr119/results/tr119_matrix/processed/cost_energy_summary.json
Decode dominance at gen_tokens = 64: scripts/tr121/results/tr121_decode_sweep/20251224_002955/gen_64/metrics.csv
Compile attribution audit: scripts/tr120/results/tr120_compile_paradox/processed/summary_overall.csv
Warmup ratio distribution: scripts/tr121/results/20251223_230615/analysis/warmup_effect.csv
Energy gating outcomes: scripts/tr122/results/20251225_190043/generation_events.jsonl

References

Conclusive report: PublishReady/reports/Technical_Report_Conclusive_117-122.md
TR117: PublishReady/reports/Technical_Report_117.md
TR118_v2.2: PublishReady/reports/Technical_Report_118_v2.2.md
TR119v1: PublishReady/reports/Technical_Report_119v1.md
TR120: PublishReady/reports/Technical_Report_120.md
TR121v1: PublishReady/reports/Technical_Report_121v1.md
TR122: PublishReady/reports/Technical_Report_122.md

Optional upgrades (if you want it to feel board-ready)

Add a 1-page Decision Card at the front (matrix + three gates + rerun triggers).
Add a single diagram: traffic -> classify -> route backend -> warmup regime -> measure and audit.
Add a Change Control clause: any upgrade triggers rerun of Scenario Set S0-S3.

Phase 1.5 Decision Whitepaper