Technical Report 119 v1.1: Cost & Energy Analysis

Local-first inference TCO with telemetry (prefill + generate)

Field	Value
TR Number	119 v1.1
Project	Banterhearts LLM Performance Research
Date	2025-12-20
Author	Research Team
Report Type	Frontier cost/energy deep dive
Test Duration	2.0 hours
Status	Frontier Report (artifact-backed)
Version	1.1
Git SHA	`fa75edf8367dce1e12e6762d23f3319ffd8e97b5`
Related Work	TR117, TR118_v2.2

Abstract

TR119 converts benchmark latency/throughput and on-device telemetry into comparable dollars, kWh, and carbon per 1M tokens. Using GPT-2 on the target hardware, we run prefill (a single forward pass) and uncached generate (a full forward pass repeated for every new token) across five backends and five scenarios with seven repetitions each, then derive compute-hours, energy, carbon footprint, and dollars per 1M tokens under multiple pricing tiers. The outcome is a decision-ready ranking of backends by cost, latency, energy efficiency, and carbon footprint, backed by artifacts, validation, and statistical tests.

Measurement Definitions

This report follows TR118-style explicit definitions. These definitions matter because they control comparability across backends.

Prefill Mode

Latency (ms): wall time for one forward pass (warmups excluded; tokenization excluded).
Throughput (tok/s): tokens_processed / latency_s, where tokens_processed = batch_size * seq_len (padded length used in the forward pass).
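A minimal sketch of the prefill throughput definition above. The function name and the example numbers are illustrative assumptions, not taken from the TR119 harness.

```python
def prefill_throughput_tok_s(batch_size: int, seq_len: int, latency_ms: float) -> float:
    """Tokens per second for one prefill forward pass.

    seq_len is the padded length actually fed to the model, so padding
    tokens count toward tokens_processed, as in the definition above.
    """
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    tokens_processed = batch_size * seq_len
    return tokens_processed / (latency_ms / 1000.0)


# Illustrative values only: 4 prompts padded to 16 tokens, one 25 ms forward pass.
example_tok_s = prefill_throughput_tok_s(batch_size=4, seq_len=16, latency_ms=25.0)
print(f"{example_tok_s:.0f} tok/s")
```

Because padded length is used, a batch of short prompts padded to a long maximum will report more tokens than it contains useful prompt text.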

Generate Mode (Uncached)

Latency (ms): total wall time for an uncached greedy decoding loop generating up to max_new_tokens new tokens.
Throughput (tok/s): tokens_generated / total_time_s (tokens_generated may be < max_new_tokens if EOS appears).
Interpretation: uncached generate is intentionally pessimistic; production KV-cache decoding should be materially faster.
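The uncached loop can be sketched as follows. This is an illustration of the definition, not the TR119 harness: `next_token_fn` is a hypothetical stand-in for "run a full forward pass over the whole sequence and return the argmax token", and counting the EOS token as a generated token is an assumption.

```python
import time
from typing import Callable, List, Tuple


def uncached_greedy_generate(
    next_token_fn: Callable[[List[int]], int],
    prompt_ids: List[int],
    max_new_tokens: int,
    eos_id: int,
) -> Tuple[List[int], float]:
    """Greedy loop that re-feeds the full sequence at every step (no KV cache).

    Returns (generated_ids, tokens_per_second).
    """
    ids = list(prompt_ids)
    generated: List[int] = []
    start = time.perf_counter()
    for _ in range(max_new_tokens):
        token = next_token_fn(ids)  # full pass over len(ids) tokens each step
        ids.append(token)
        generated.append(token)
        if token == eos_id:  # assumption: EOS counts as a generated token
            break
    total_time_s = time.perf_counter() - start
    throughput = len(generated) / total_time_s if total_time_s > 0 else float("inf")
    return generated, throughput


def toy_model(ids: List[int]) -> int:
    # Toy stand-in: emits the current sequence length as the next token id.
    return len(ids)


# With a 3-token prompt and eos_id=6, the toy model stops on the 4th step.
tokens, tok_s = uncached_greedy_generate(toy_model, [0, 1, 2], max_new_tokens=8, eos_id=6)
print(tokens, len(tokens))
```

The sequence passed to the model grows by one token per step, which is why per-token cost rises with output length in this mode.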

Energy, Cost, and Carbon

Power (W): mean sampled power during the benchmark region (GPU for GPU backends; CPU package power for CPU backends).
Energy (kWh/1M tok): (power_w * seconds_per_1m) / 3.6e6 where seconds_per_1m = 1e6 / throughput_tok_s.
Infra cost (USD/1M tok): hours_per_1m * usd_per_hour (tier-specific).
Total cost (USD/1M tok): infra cost + energy cost.
Carbon (gCO2e/1M tok): energy_kwh_per_1m * carbon_intensity_gco2e_per_kwh.
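The definitions above can be combined into one small function. This is a sketch under the report's stated formulas; the class and function names are illustrative, and the example inputs are the onnxruntime-gpu prefill batch_medium row together with the pricing inputs listed in the methodology section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PerMillion:
    hours: float
    energy_kwh: float
    infra_usd: float
    energy_usd: float
    total_usd: float
    carbon_g: float


def per_million_tokens(
    throughput_tok_s: float,
    power_w: float,
    usd_per_hour: float,
    usd_per_kwh: float,
    carbon_g_per_kwh: float,
) -> PerMillion:
    """Cost, energy, and carbon for 1M tokens at a given throughput and mean power."""
    seconds_per_1m = 1e6 / throughput_tok_s
    hours = seconds_per_1m / 3600.0
    energy_kwh = power_w * seconds_per_1m / 3.6e6  # watt-seconds -> kWh
    infra_usd = hours * usd_per_hour
    energy_usd = energy_kwh * usd_per_kwh
    return PerMillion(
        hours=hours,
        energy_kwh=energy_kwh,
        infra_usd=infra_usd,
        energy_usd=energy_usd,
        total_usd=infra_usd + energy_usd,
        carbon_g=energy_kwh * carbon_g_per_kwh,
    )


# onnxruntime-gpu, prefill, batch_medium: 2850 tok/s at 14.61 W, on-demand pricing.
row = per_million_tokens(2850, 14.61, usd_per_hour=1.006, usd_per_kwh=0.2, carbon_g_per_kwh=500.0)
print(f"total ${row.total_usd:.5f}, {row.energy_kwh:.6f} kWh, {row.carbon_g:.4f} gCO2e")
```

With these inputs the sketch lands within rounding of the tabulated values for that row (about $0.0983, 0.00142 kWh, and 0.71 gCO2e per 1M tokens).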

Executive Summary

TR119 answers: which backend minimizes cost and energy for local-first inference once we include real telemetry and explicit pricing inputs? Across this matrix (5 backends x 5 scenarios x 7 repetitions x 2 modes = 350 runs), the ranking is throughput-driven: the fastest stable backend is also the cheapest per token under time-based pricing.

Key Findings

Best-cost backend (prefill, mean across scenarios, on-demand): onnxruntime-gpu at ~$0.1279 per 1M tokens.
Worst-cost backend (prefill): transformers-cpu at ~$0.971 per 1M tokens.
Best spot pricing (prefill): onnxruntime-gpu at ~$0.03868 per 1M tokens (69.8% savings vs on-demand).
Lowest carbon footprint (prefill): onnxruntime-gpu at ~1.0 gCO2e per 1M tokens.
Best energy efficiency (prefill): onnxruntime-gpu at ~503 million tokens/kWh (503,440,295).
Lowest on-demand provider (prefill, mean across scenarios): azure_nc_t4_v3/onnxruntime-gpu at ~$0.1144 per 1M tokens.
Best request-level cost: onnxruntime-gpu at ~$0.0001475 per request.
Runs: 350 total, 0 degraded (0.0%).
Best-latency backend (prefill, mean across scenarios): onnxruntime-gpu at ~16.5 ms.

Key Decision

If GPU is available, onnxruntime-gpu is the default recommendation on this hardware (best cost and best energy efficiency in this benchmark).
If CPU-only, prefer onnxruntime-cpu over transformers-cpu for materially lower $/token and kWh/token.

Table of Contents

1. Introduction & Research Motivation
2. Methodology & Experimental Design
3. Environment & Artifacts
4. Results & Analysis
5. Statistical Analysis
6. Synthesis & Decision Matrix
7. Reproducibility

1. Introduction & Research Motivation

TR117 established baseline latency and throughput across multiple inference backends. TR118 raised rigor: explicit measurement definitions, artifact pipelines, and reproducibility. TR119 extends that foundation by translating speed into $/token, kWh/token, and gCO2e/token so backend selection becomes a cost-and-energy decision, not just a latency chart.

1.1 Research Questions

Which backend minimizes dollars per 1M tokens for prefill and for generate?
How large is the pricing-tier lever (on-demand vs spot vs reserved) relative to backend choice?
Does energy meaningfully change rankings, or is throughput the dominant driver?
What is request-level cost for a representative prompt+generate mix?

1.2 Scope

Single target machine; results are hardware-specific.
Model: GPT-2 (as configured).
Generate mode is uncached (KV-cache disabled) to isolate raw compute; production decode will differ.

2. Methodology & Experimental Design

Metrics

Latency (ms), throughput (tokens/sec).
GPU power (W), temperature (deg C), memory (MB); CPU package power (W).

Benchmark Matrix

Backends: transformers-gpu-compile, transformers-gpu, transformers-cpu, onnxruntime-gpu, onnxruntime-cpu
Scenarios: single_short, single_medium, single_long, batch_short, batch_medium
Repetitions: 7 per backend/scenario/mode
Warmup runs: 2

Cost & Energy Model

Compute-hours per 1M tokens: gpu_hours = 1_000_000 / (throughput_tok_s * 3600). The tables report this as gpu_hours_per_1M_tok for every backend, including the CPU ones.
Energy per 1M tokens (kWh): energy_kwh = (power_w * seconds_per_1m) / 3.6e6.
Infra cost: gpu_hours * price_per_hour (per pricing tier).
Energy cost: energy_kwh * usd_per_kwh.
Total cost: infra cost + energy cost.
Carbon footprint: energy_kwh * carbon_intensity_gco2e_per_kwh.

Telemetry Collection

GPU metrics sampled via ResourceMonitor at the configured interval.
CPU package power captured from Windows Energy Meter (RAPL) counters when available.
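The "mean sampled power during the benchmark region" can be sketched as a simple window filter over timestamped samples. This is an illustration only: the sample layout and function name are assumptions and do not describe the actual ResourceMonitor interface.

```python
from typing import Iterable, Tuple


def mean_power_in_window(
    samples: Iterable[Tuple[float, float]], start_s: float, end_s: float
) -> float:
    """Mean of (timestamp_s, power_w) samples with start_s <= timestamp <= end_s."""
    in_window = [power for t, power in samples if start_s <= t <= end_s]
    if not in_window:
        raise ValueError("no telemetry samples inside the benchmark window")
    return sum(in_window) / len(in_window)


# Made-up samples at a 0.25 s interval; the benchmark region is 0.5 s to 1.0 s.
samples = [(0.0, 5.0), (0.25, 6.0), (0.5, 20.0), (0.75, 22.0), (1.0, 24.0), (1.25, 7.0)]
mean_w = mean_power_in_window(samples, start_s=0.5, end_s=1.0)
print(f"{mean_w:.1f} W")
```

A benchmark region much shorter than the sampling interval may contain no samples at all, which is why the sketch raises instead of silently returning zero.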

Pricing & Energy Inputs

On-demand rate: $1.006/hour
Spot rate: $0.302/hour
Reserved 1yr rate: $0.704/hour
Reserved 3yr rate: $0.503/hour
Energy price: $0.2/kWh
Carbon intensity: 500.0 gCO2e/kWh
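As a sketch of how these inputs turn into per-tier prices, the snippet below applies each hourly rate to the same throughput and power. Treating on-prem as energy cost only is an inference from the multi-tier table (where on_prem_usd equals the energy cost), not a documented rule; the names are illustrative.

```python
RATES_USD_PER_HOUR = {
    "on_demand": 1.006,
    "spot": 0.302,
    "reserved_1yr": 0.704,
    "reserved_3yr": 0.503,
    "on_prem": 0.0,  # assumption: on-prem is modeled as energy cost only
}
ENERGY_USD_PER_KWH = 0.2


def tier_costs_usd_per_1m(throughput_tok_s: float, power_w: float) -> dict:
    """Total USD per 1M tokens for every pricing tier."""
    seconds_per_1m = 1e6 / throughput_tok_s
    hours = seconds_per_1m / 3600.0
    energy_usd = power_w * seconds_per_1m / 3.6e6 * ENERGY_USD_PER_KWH
    return {tier: hours * rate + energy_usd for tier, rate in RATES_USD_PER_HOUR.items()}


# onnxruntime-gpu, prefill, batch_medium: 2850 tok/s at 14.61 W.
costs = tier_costs_usd_per_1m(2850, 14.61)
for tier, usd in costs.items():
    print(f"{tier:>12}: ${usd:.5f}")
```

Because only the hourly rate changes between tiers, the tier ratio is nearly the same for every backend; the small differences come from the fixed energy term.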

Request Token Mix

prompt_tokens: 256
generate_tokens: 128

3. Environment & Artifacts

Config: C:\Users\sahil\OneDrive\Documents\GitHub\Banterhearts\scripts\tr119\configs\matrix.yaml
Results root: C:\Users\sahil\OneDrive\Documents\GitHub\Banterhearts\scripts\tr119\results\tr119_matrix

Telemetry

Sample interval: 0.25 s
GPU telemetry: True
CPU telemetry: True

Environment

OS: Windows-11-10.0.26200-SP0
Python: 3.13.1 (tags/v3.13.1:0671451, Dec 3 2024, 19:06:28) [MSC v.1942 64 bit (AMD64)]
CPU: 13th Gen Intel(R) Core(TM) i9-13980HX
Prompt config: C:\Users\sahil\OneDrive\Documents\GitHub\Banterhearts\scripts\tr117\configs\matrix_tier3.yaml
GPU: NVIDIA GeForce RTX 4080 Laptop GPU (12282 MB, CC 8.9)
Modes observed: generate, prefill

4. Results & Analysis

This section summarizes observed performance, telemetry, and derived cost/energy metrics. Tables are artifact-backed.

Latency & Throughput Summary (Mean Across Scenarios)

Backend	Mode	lat_mean_ms	throughput_mean_tok_s	power_mean_w	degraded_runs
onnxruntime-cpu	generate	266.1	59.94	87.18	0/35
onnxruntime-cpu	prefill	26.21	1224	101.2	0/35
onnxruntime-gpu	generate	56.07	311.6	30.81	0/35
onnxruntime-gpu	prefill	16.54	2246	16.63	0/35
transformers-cpu	generate	815.4	20.11	74.87	0/35
transformers-cpu	prefill	88.82	377.5	64.24	0/35
transformers-gpu	generate	158	105.5	23.57	0/35
transformers-gpu	prefill	21.22	2071	16.37	0/35
transformers-gpu-compile	generate	166.2	116.8	30.71	0/35
transformers-gpu-compile	prefill	20.84	1817	20.39	0/35

Interpretation

Prefill measures a single forward pass; uncached generate repeats full forward passes per token and therefore has substantially lower throughput.
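Under time-based pricing, infra cost per token is inversely proportional to throughput within a scenario, so higher throughput almost always implies lower $/token; power differences matter only when power varies dramatically at similar throughput.
Because costs are averaged per scenario, mean throughput and mean cost can rank backends differently: transformers-gpu has higher mean prefill throughput than transformers-gpu-compile (2071 vs 1817 tok/s) but a higher mean prefill cost ($0.2605 vs $0.1995 per 1M tokens), driven largely by its slow single_short scenario (496.7 tok/s).
Treat CPU backends as fallbacks for when no GPU is available: onnxruntime-gpu is about 2.1x cheaper per token than onnxruntime-cpu in prefill and about 4.5x cheaper in generate.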

Latency, Throughput, and Telemetry (Per Backend/Scenario)

Backend	Mode	Scenario	lat_mean_ms	lat_ci_lower	lat_ci_upper	throughput_mean_tok_s	power_mean_w	gpu_temp_mean_c
onnxruntime-cpu	generate	batch_medium	436	427.8	444.3	73.44	1.738	50
onnxruntime-cpu	prefill	batch_medium	38.37	36.37	40.36	1987	1.955	48
onnxruntime-cpu	generate	batch_short	329.5	323.9	335.2	97.14	1.756	50
onnxruntime-cpu	prefill	batch_short	29.31	28.43	30.18	1504	1.924	48
onnxruntime-cpu	generate	single_long	213.1	208.9	217.3	37.55	1.755	50
onnxruntime-cpu	prefill	single_long	24.02	23.29	24.74	1125	2.691	48
onnxruntime-cpu	generate	single_medium	188.3	182	194.6	42.55	7.181	50
onnxruntime-cpu	prefill	single_medium	21.14	20.42	21.87	900	17.42	48
onnxruntime-cpu	generate	single_short	163.4	158.8	168	49.01	26.19	51.36
onnxruntime-cpu	prefill	single_short	18.23	17.54	18.93	604.3	18.98	48
onnxruntime-gpu	generate	batch_medium	63.12	60.8	65.44	508.7	36.3	52.43
onnxruntime-gpu	prefill	batch_medium	26.68	26.05	27.32	2850	14.61	48
onnxruntime-gpu	generate	batch_short	60.27	53.36	67.19	537.5	30.45	51.57
onnxruntime-gpu	prefill	batch_short	23.61	19.76	27.46	1940	15.94	48
onnxruntime-gpu	generate	single_long	47.04	44.73	49.35	170.6	27.64	51.29
onnxruntime-gpu	prefill	single_long	11.93	10.63	13.22	2292	16.81	48
onnxruntime-gpu	generate	single_medium	47.56	42.16	52.96	172.7	31.95	51.71
onnxruntime-gpu	prefill	single_medium	9.015	6.535	11.49	2315	23.32	48.43
onnxruntime-gpu	generate	single_short	62.38	23.51	101.2	168.4	27.69	51.14
onnxruntime-gpu	prefill	single_short	11.48	-2.496	25.45	1832	12.49	48.57
transformers-cpu	generate	batch_medium	1021	970.3	1072	31.45	1.847	49.9
transformers-cpu	prefill	batch_medium	113.3	102.7	124	679	2.032	47
transformers-cpu	generate	batch_short	910.7	875.4	946	35.23	1.859	49.57
transformers-cpu	prefill	batch_short	92.59	83.45	101.7	485.4	2.038	47
transformers-cpu	generate	single_long	721.4	641.5	801.4	11.27	1.841	49
transformers-cpu	prefill	single_long	81.4	70.35	92.45	339.2	1.947	47
transformers-cpu	generate	single_medium	709.1	661	757.2	11.35	1.824	49
transformers-cpu	prefill	single_medium	86.59	80.63	92.55	221.4	2.618	47
transformers-cpu	generate	single_short	714.7	662.6	766.7	11.27	6.342	49.04
transformers-cpu	prefill	single_short	70.15	58.52	81.79	162.2	15.52	47
transformers-gpu	generate	batch_medium	177.5	166.7	188.2	181.4	27.18	50
transformers-gpu	prefill	batch_medium	25.32	21.9	28.74	3102	20.19	47.14
transformers-gpu	generate	batch_short	181.5	170	193	177.3	22.61	49.57
transformers-gpu	prefill	batch_short	11.34	6.077	16.6	4798	18.46	47.14
transformers-gpu	generate	single_long	154.2	140.2	168.2	52.32	20.42	49
transformers-gpu	prefill	single_long	23.61	22.71	24.5	1150	14.16	47
transformers-gpu	generate	single_medium	142.8	130.3	155.3	56.52	19.46	49
transformers-gpu	prefill	single_medium	23.57	22.96	24.19	810.6	14.05	47
transformers-gpu	generate	single_short	134.2	128.3	140	60.17	28.16	49.89
transformers-gpu	prefill	single_short	22.25	21.93	22.57	496.7	14.98	47
transformers-gpu-compile	generate	batch_medium	269.1	195.7	342.5	133.1	29	50.43
transformers-gpu-compile	prefill	batch_medium	25.56	22.54	28.57	3038	19.99	47.16
transformers-gpu-compile	generate	batch_short	183	128.7	237.4	196.4	33.58	50.47
transformers-gpu-compile	prefill	batch_short	21.62	8.448	34.79	2441	15.75	47.66
transformers-gpu-compile	generate	single_long	169.4	131.9	206.8	51.95	23.85	48.77
transformers-gpu-compile	prefill	single_long	26.47	25.48	27.46	1022	17.75	47
transformers-gpu-compile	generate	single_medium	154.2	107.4	201	57.62	22.32	49.22
transformers-gpu-compile	prefill	single_medium	24.44	23.68	25.19	778.3	16.11	47.14
transformers-gpu-compile	generate	single_short	55.44	52.73	58.14	144.7	44.79	51.29
transformers-gpu-compile	prefill	single_short	6.1	5.836	6.364	1808	32.36	48.42

Note: for the CPU backends, power_mean_w in this table (for example 1.738 W for onnxruntime-cpu generate batch_medium) does not match the power used in the on-demand cost table (82.13 W for the same row), so it appears to be a GPU-side reading rather than CPU package power. The negative lat_ci_lower for onnxruntime-gpu prefill single_short (-2.496 ms) reflects a symmetric interval around a high-variance mean; latency itself cannot be negative.

Cost & Energy Summary (Mean Across Scenarios)

Backend	Mode	total_cost_usd_per_1M_tok	energy_cost_usd_per_1M_tok	energy_kwh_per_1M_tok	carbon_gco2e_per_1M_tok
onnxruntime-cpu	generate	5.37	0.09212	0.4606	230.3
onnxruntime-cpu	prefill	0.2748	0.005213	0.02607	13.03
onnxruntime-gpu	generate	1.204	0.007105	0.03553	17.76
onnxruntime-gpu	prefill	0.1279	0.0004174	0.002087	1.043
transformers-cpu	generate	18.47	0.2655	1.328	663.8
transformers-cpu	prefill	0.971	0.01186	0.05931	29.66
transformers-gpu	generate	3.626	0.01644	0.08222	41.11
transformers-gpu	prefill	0.2605	0.0007796	0.003898	1.949
transformers-gpu-compile	generate	3.154	0.01716	0.08582	42.91
transformers-gpu-compile	prefill	0.1995	0.0007668	0.003834	1.917

Interpretation

Prefill and uncached generate operate in different cost regimes because generate throughput is far lower under repeated full forward passes.
Energy and carbon scale linearly with mean power and runtime. At the configured rates, infra cost dominates total cost for all backends: energy accounts for roughly 0.4% to 1.7% of the total, depending on backend.
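The dominance of infra cost follows from the model itself: both terms scale with runtime, so their ratio depends only on power and the two prices. A short sketch (function names are illustrative) using the configured $0.2/kWh and $1.006/hour:

```python
def energy_to_infra_ratio(power_w: float, usd_per_kwh: float = 0.2, usd_per_hour: float = 1.006) -> float:
    """Energy cost divided by infra cost; runtime cancels out of the ratio."""
    return (power_w / 1000.0) * usd_per_kwh / usd_per_hour


def break_even_power_w(usd_per_kwh: float = 0.2, usd_per_hour: float = 1.006) -> float:
    """Mean power at which energy cost would equal infra cost."""
    return usd_per_hour / usd_per_kwh * 1000.0


# onnxruntime-gpu prefill mean power from the summary table: 16.63 W.
ratio = energy_to_infra_ratio(16.63)
break_even = break_even_power_w()
print(f"energy/infra = {ratio:.4%}, break-even power = {break_even:.0f} W")
```

At these prices energy would only match infra cost at about 5 kW of sustained draw, far above anything measured here, so power differences between backends cannot reorder the cost ranking.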

Cost & Energy per 1M Tokens (On-Demand Pricing)

Backend	Mode	Scenario	throughput_mean_tok_s	power_mean_w	gpu_hours_per_1M_tok	infra_cost_usd_per_1M_tok	energy_cost_usd_per_1M_tok	total_cost_usd_per_1M_tok
onnxruntime-cpu	generate	batch_medium	73.44	82.13	3.782	3.805	0.06213	3.867
onnxruntime-cpu	prefill	batch_medium	1987	119.7	0.1398	0.1406	0.003347	0.144
onnxruntime-cpu	generate	batch_short	97.14	86.29	2.86	2.877	0.04935	2.926
onnxruntime-cpu	prefill	batch_short	1504	102.1	0.1847	0.1858	0.003772	0.1896
onnxruntime-cpu	generate	single_long	37.55	90.5	7.397	7.441	0.1339	7.575
onnxruntime-cpu	prefill	single_long	1125	94.11	0.2468	0.2483	0.004646	0.253
onnxruntime-cpu	generate	single_medium	42.55	85.19	6.528	6.567	0.1112	6.678
onnxruntime-cpu	prefill	single_medium	900	104.6	0.3087	0.3105	0.006457	0.317
onnxruntime-cpu	generate	single_short	49.01	91.76	5.668	5.702	0.104	5.806
onnxruntime-cpu	prefill	single_short	604.3	85.34	0.4597	0.4625	0.007846	0.4703
onnxruntime-gpu	generate	batch_medium	508.7	36.3	0.5461	0.5494	0.003964	0.5533
onnxruntime-gpu	prefill	batch_medium	2850	14.61	0.09745	0.09804	0.0002847	0.09832
onnxruntime-gpu	generate	batch_short	537.5	30.45	0.5168	0.5199	0.003148	0.523
onnxruntime-gpu	prefill	batch_short	1940	15.94	0.1432	0.144	0.0004563	0.1445
onnxruntime-gpu	generate	single_long	170.6	27.64	1.629	1.638	0.009003	1.647
onnxruntime-gpu	prefill	single_long	2292	16.81	0.1212	0.1219	0.0004074	0.1223
onnxruntime-gpu	generate	single_medium	172.7	31.95	1.608	1.618	0.01027	1.628
onnxruntime-gpu	prefill	single_medium	2315	23.32	0.12	0.1207	0.0005596	0.1212
onnxruntime-gpu	generate	single_short	168.4	27.69	1.65	1.66	0.009136	1.669
onnxruntime-gpu	prefill	single_short	1832	12.49	0.1516	0.1526	0.0003788	0.1529
transformers-cpu	generate	batch_medium	31.45	81.49	8.833	8.886	0.144	9.03
transformers-cpu	prefill	batch_medium	679	71.92	0.4091	0.4116	0.005884	0.4174
transformers-cpu	generate	batch_short	35.23	76.93	7.886	7.933	0.1213	8.054
transformers-cpu	prefill	batch_short	485.4	69.12	0.5723	0.5757	0.007911	0.5836
transformers-cpu	generate	single_long	11.27	73.08	24.65	24.8	0.3603	25.16
transformers-cpu	prefill	single_long	339.2	60.09	0.8189	0.8238	0.009842	0.8337
transformers-cpu	generate	single_medium	11.35	69.21	24.48	24.63	0.3388	24.96
transformers-cpu	prefill	single_medium	221.4	59.44	1.254	1.262	0.01491	1.277
transformers-cpu	generate	single_short	11.27	73.66	24.64	24.79	0.3631	25.15
transformers-cpu	prefill	single_short	162.2	60.63	1.712	1.723	0.02076	1.743
transformers-gpu	generate	batch_medium	181.4	27.18	1.531	1.54	0.008325	1.549
transformers-gpu	prefill	batch_medium	3102	20.19	0.08956	0.09009	0.0003617	0.09046
transformers-gpu	generate	batch_short	177.3	22.61	1.567	1.577	0.007088	1.584
transformers-gpu	prefill	batch_short	4798	18.46	0.0579	0.05824	0.0002137	0.05846
transformers-gpu	generate	single_long	52.32	20.42	5.309	5.341	0.02168	5.363
transformers-gpu	prefill	single_long	1150	14.16	0.2416	0.243	0.0006839	0.2437
transformers-gpu	generate	single_medium	56.52	19.46	4.914	4.944	0.01912	4.963
transformers-gpu	prefill	single_medium	810.6	14.05	0.3427	0.3447	0.000963	0.3457
transformers-gpu	generate	single_short	60.17	28.16	4.617	4.644	0.026	4.67
transformers-gpu	prefill	single_short	496.7	14.98	0.5592	0.5626	0.001676	0.5642
transformers-gpu-compile	generate	batch_medium	133.1	29	2.087	2.1	0.0121	2.112
transformers-gpu-compile	prefill	batch_medium	3038	19.99	0.09144	0.09199	0.0003657	0.09236
transformers-gpu-compile	generate	batch_short	196.4	33.58	1.414	1.423	0.0095	1.432
transformers-gpu-compile	prefill	batch_short	2441	15.75	0.1138	0.1145	0.0003586	0.1148
transformers-gpu-compile	generate	single_long	51.95	23.85	5.347	5.379	0.0255	5.404
transformers-gpu-compile	prefill	single_long	1022	17.75	0.2719	0.2735	0.0009651	0.2745
transformers-gpu-compile	generate	single_medium	57.62	22.32	4.821	4.85	0.02152	4.871
transformers-gpu-compile	prefill	single_medium	778.3	16.11	0.3569	0.359	0.00115	0.3602
transformers-gpu-compile	generate	single_short	144.7	44.79	1.919	1.931	0.01719	1.948
transformers-gpu-compile	prefill	single_short	1808	32.36	0.1537	0.1546	0.0009946	0.1556

Multi-Tier Pricing Comparison (per 1M Tokens)

Backend	Mode	Scenario	on_demand_usd	spot_usd	reserved_usd	reserved_1yr_usd	reserved_3yr_usd	on_prem_usd
onnxruntime-cpu	generate	batch_medium	3.867	1.204	2.725	2.725	1.965	0.06213
onnxruntime-cpu	prefill	batch_medium	0.144	0.04557	0.1018	0.1018	0.07367	0.003347
onnxruntime-cpu	generate	batch_short	2.926	0.9129	2.062	2.062	1.488	0.04935
onnxruntime-cpu	prefill	batch_short	0.1896	0.05956	0.1338	0.1338	0.09668	0.003772
onnxruntime-cpu	generate	single_long	7.575	2.368	5.341	5.341	3.854	0.1339
onnxruntime-cpu	prefill	single_long	0.253	0.07919	0.1784	0.1784	0.1288	0.004646
onnxruntime-cpu	generate	single_medium	6.678	2.083	4.707	4.707	3.395	0.1112
onnxruntime-cpu	prefill	single_medium	0.317	0.09967	0.2238	0.2238	0.1617	0.006457
onnxruntime-cpu	generate	single_short	5.806	1.816	4.094	4.094	2.955	0.104
onnxruntime-cpu	prefill	single_short	0.4703	0.1467	0.3315	0.3315	0.2391	0.007846
onnxruntime-gpu	generate	batch_medium	0.5533	0.1689	0.3884	0.3884	0.2786	0.003964
onnxruntime-gpu	prefill	batch_medium	0.09832	0.02972	0.06889	0.06889	0.0493	0.0002847
onnxruntime-gpu	generate	batch_short	0.523	0.1592	0.367	0.367	0.2631	0.003148
onnxruntime-gpu	prefill	batch_short	0.1445	0.04369	0.1012	0.1012	0.07247	0.0004563
onnxruntime-gpu	generate	single_long	1.647	0.5009	1.156	1.156	0.8282	0.009003
onnxruntime-gpu	prefill	single_long	0.1223	0.03701	0.08574	0.08574	0.06138	0.0004074
onnxruntime-gpu	generate	single_medium	1.628	0.4959	1.142	1.142	0.8191	0.01027
onnxruntime-gpu	prefill	single_medium	0.1212	0.03679	0.08502	0.08502	0.0609	0.0005596
onnxruntime-gpu	generate	single_short	1.669	0.5074	1.171	1.171	0.839	0.009136
onnxruntime-gpu	prefill	single_short	0.1529	0.04618	0.1071	0.1071	0.07666	0.0003788
transformers-cpu	generate	batch_medium	9.03	2.811	6.362	6.362	4.587	0.144
transformers-cpu	prefill	batch_medium	0.4174	0.1294	0.2939	0.2939	0.2117	0.005884
transformers-cpu	generate	batch_short	8.054	2.503	5.673	5.673	4.088	0.1213
transformers-cpu	prefill	batch_short	0.5836	0.1807	0.4108	0.4108	0.2958	0.007911
transformers-cpu	generate	single_long	25.16	7.805	17.72	17.72	12.76	0.3603
transformers-cpu	prefill	single_long	0.8337	0.2572	0.5864	0.5864	0.4218	0.009842
transformers-cpu	generate	single_medium	24.96	7.731	17.57	17.57	12.65	0.3388
transformers-cpu	prefill	single_medium	1.277	0.3937	0.898	0.898	0.6459	0.01491
transformers-cpu	generate	single_short	25.15	7.805	17.71	17.71	12.76	0.3631
transformers-cpu	prefill	single_short	1.743	0.5379	1.226	1.226	0.8821	0.02076
transformers-gpu	generate	batch_medium	1.549	0.4707	1.086	1.086	0.7785	0.008325
transformers-gpu	prefill	batch_medium	0.09046	0.02741	0.06341	0.06341	0.04541	0.0003617
transformers-gpu	generate	batch_short	1.584	0.4804	1.11	1.11	0.7954	0.007088
transformers-gpu	prefill	batch_short	0.05846	0.0177	0.04097	0.04097	0.02934	0.0002137
transformers-gpu	generate	single_long	5.363	1.625	3.759	3.759	2.692	0.02168
transformers-gpu	prefill	single_long	0.2437	0.07364	0.1707	0.1707	0.1222	0.0006839
transformers-gpu	generate	single_medium	4.963	1.503	3.479	3.479	2.491	0.01912
transformers-gpu	prefill	single_medium	0.3457	0.1045	0.2422	0.2422	0.1733	0.000963
transformers-gpu	generate	single_short	4.67	1.42	3.276	3.276	2.348	0.026
transformers-gpu	prefill	single_short	0.5642	0.1706	0.3954	0.3954	0.283	0.001676
transformers-gpu-compile	generate	batch_medium	2.112	0.6424	1.481	1.481	1.062	0.0121
transformers-gpu-compile	prefill	batch_medium	0.09236	0.02798	0.06474	0.06474	0.04636	0.0003657
transformers-gpu-compile	generate	batch_short	1.432	0.4367	1.005	1.005	0.7209	0.0095
transformers-gpu-compile	prefill	batch_short	0.1148	0.03473	0.08048	0.08048	0.0576	0.0003586
transformers-gpu-compile	generate	single_long	5.404	1.64	3.79	3.79	2.715	0.0255
transformers-gpu-compile	prefill	single_long	0.2745	0.08308	0.1924	0.1924	0.1377	0.0009651
transformers-gpu-compile	generate	single_medium	4.871	1.477	3.415	3.415	2.446	0.02152
transformers-gpu-compile	prefill	single_medium	0.3602	0.1089	0.2524	0.2524	0.1807	0.00115
transformers-gpu-compile	generate	single_short	1.948	0.5968	1.368	1.368	0.9826	0.01719
transformers-gpu-compile	prefill	single_short	0.1556	0.0474	0.1092	0.1092	0.07828	0.0009946

Cloud Provider Cost Comparison (Mean Across Scenarios)

Provider	Backend	Mode	on_demand_usd	spot_usd	reserved_usd
aws_g5_xlarge	onnxruntime-cpu	generate	5.37	1.677	3.786
aws_g5_xlarge	onnxruntime-cpu	prefill	0.2748	0.08613	0.1938
aws_g5_xlarge	onnxruntime-gpu	generate	1.204	0.3665	0.8448
aws_g5_xlarge	onnxruntime-gpu	prefill	0.1279	0.03868	0.08961
aws_g5_xlarge	transformers-cpu	generate	18.47	5.731	13.01
aws_g5_xlarge	transformers-cpu	prefill	0.971	0.2998	0.6831
aws_g5_xlarge	transformers-gpu	generate	3.626	1.1	2.542
aws_g5_xlarge	transformers-gpu	prefill	0.2605	0.07875	0.1825
aws_g5_xlarge	transformers-gpu-compile	generate	3.154	0.9587	2.212
aws_g5_xlarge	transformers-gpu-compile	prefill	0.1995	0.06042	0.1398
azure_nc_t4_v3	onnxruntime-cpu	generate	4.814	1.509	3.24
azure_nc_t4_v3	onnxruntime-cpu	prefill	0.2464	0.07756	0.166
azure_nc_t4_v3	onnxruntime-gpu	generate	1.078	0.3284	0.721
azure_nc_t4_v3	onnxruntime-gpu	prefill	0.1144	0.03462	0.07643
azure_nc_t4_v3	transformers-cpu	generate	16.55	5.152	11.12
azure_nc_t4_v3	transformers-cpu	prefill	0.8699	0.2693	0.5839
azure_nc_t4_v3	transformers-gpu	generate	3.245	0.9851	2.169
azure_nc_t4_v3	transformers-gpu	prefill	0.2331	0.07049	0.1557
azure_nc_t4_v3	transformers-gpu-compile	generate	2.823	0.8589	1.888
azure_nc_t4_v3	transformers-gpu-compile	prefill	0.1785	0.0541	0.1193
gcp_a2_highgpu	onnxruntime-cpu	generate	6.388	1.981	4.29
gcp_a2_highgpu	onnxruntime-cpu	prefill	0.3267	0.1017	0.2196
gcp_a2_highgpu	onnxruntime-gpu	generate	1.435	0.4355	0.959
gcp_a2_highgpu	onnxruntime-gpu	prefill	0.1524	0.04602	0.1018
gcp_a2_highgpu	transformers-cpu	generate	21.98	6.781	14.74
gcp_a2_highgpu	transformers-cpu	prefill	1.156	0.3551	0.7746
gcp_a2_highgpu	transformers-gpu	generate	4.322	1.308	2.887
gcp_a2_highgpu	transformers-gpu	prefill	0.3106	0.09373	0.2073
gcp_a2_highgpu	transformers-gpu-compile	generate	3.758	1.14	2.511
gcp_a2_highgpu	transformers-gpu-compile	prefill	0.2378	0.07188	0.1588

Carbon Footprint per 1M Tokens

Backend	Mode	Scenario	energy_kwh_per_1M_tok	carbon_gco2e_per_1M_tok
onnxruntime-cpu	generate	batch_medium	0.3107	155.3
onnxruntime-cpu	prefill	batch_medium	0.01674	8.368
onnxruntime-cpu	generate	batch_short	0.2468	123.4
onnxruntime-cpu	prefill	batch_short	0.01886	9.43
onnxruntime-cpu	generate	single_long	0.6694	334.7
onnxruntime-cpu	prefill	single_long	0.02323	11.61
onnxruntime-cpu	generate	single_medium	0.5561	278.1
onnxruntime-cpu	prefill	single_medium	0.03228	16.14
onnxruntime-cpu	generate	single_short	0.5201	260
onnxruntime-cpu	prefill	single_short	0.03923	19.61
onnxruntime-gpu	generate	batch_medium	0.01982	9.911
onnxruntime-gpu	prefill	batch_medium	0.001423	0.7117
onnxruntime-gpu	generate	batch_short	0.01574	7.869
onnxruntime-gpu	prefill	batch_short	0.002281	1.141
onnxruntime-gpu	generate	single_long	0.04502	22.51
onnxruntime-gpu	prefill	single_long	0.002037	1.019
onnxruntime-gpu	generate	single_medium	0.05137	25.69
onnxruntime-gpu	prefill	single_medium	0.002798	1.399
onnxruntime-gpu	generate	single_short	0.04568	22.84
onnxruntime-gpu	prefill	single_short	0.001894	0.9469
transformers-cpu	generate	batch_medium	0.7198	359.9
transformers-cpu	prefill	batch_medium	0.02942	14.71
transformers-cpu	generate	batch_short	0.6066	303.3
transformers-cpu	prefill	batch_short	0.03956	19.78
transformers-cpu	generate	single_long	1.802	900.8
transformers-cpu	prefill	single_long	0.04921	24.6
transformers-cpu	generate	single_medium	1.694	847.1
transformers-cpu	prefill	single_medium	0.07456	37.28
transformers-cpu	generate	single_short	1.815	907.7
transformers-cpu	prefill	single_short	0.1038	51.91
transformers-gpu	generate	batch_medium	0.04163	20.81
transformers-gpu	prefill	batch_medium	0.001808	0.9042
transformers-gpu	generate	batch_short	0.03544	17.72
transformers-gpu	prefill	batch_short	0.001069	0.5343
transformers-gpu	generate	single_long	0.1084	54.21
transformers-gpu	prefill	single_long	0.003419	1.71
transformers-gpu	generate	single_medium	0.09561	47.81
transformers-gpu	prefill	single_medium	0.004815	2.408
transformers-gpu	generate	single_short	0.13	65
transformers-gpu	prefill	single_short	0.008378	4.189
transformers-gpu-compile	generate	batch_medium	0.06052	30.26
transformers-gpu-compile	prefill	batch_medium	0.001828	0.9142
transformers-gpu-compile	generate	batch_short	0.0475	23.75
transformers-gpu-compile	prefill	batch_short	0.001793	0.8964
transformers-gpu-compile	generate	single_long	0.1275	63.76
transformers-gpu-compile	prefill	single_long	0.004826	2.413
transformers-gpu-compile	generate	single_medium	0.1076	53.8
transformers-gpu-compile	prefill	single_medium	0.005749	2.875
transformers-gpu-compile	generate	single_short	0.08596	42.98
transformers-gpu-compile	prefill	single_short	0.004973	2.486

Prefill Deep Dive

Prefill is the prompt-processing phase. Under time-based pricing, the dominant driver of $/token is throughput (tokens/sec).

Best prefill backend by cost: onnxruntime-gpu at 0.1279 USD/1M.
onnxruntime-gpu throughput: ~2246 tok/s; mean power: ~16.6 W; energy share: ~0.33%.
Next-best: transformers-gpu-compile at 0.1995 USD/1M (56.0% higher).

Generate Deep Dive (Uncached)

Generate mode here is an uncached greedy decoding loop. Every step re-runs a full forward pass, so throughput collapses and cost per generated token rises.

Best generate backend by cost: onnxruntime-gpu at 1.204 USD/1M.
onnxruntime-gpu throughput: ~312 tok/s; mean power: ~30.8 W.
Next-best: transformers-gpu-compile at 3.154 USD/1M (162% higher).

Figures

Figure artifacts live in the upstream research repository and are not bundled with the web build. Each figure exists in two variants, a base name and a _generate suffix:

mean_latency_tr119, mean_latency_tr119_generate
throughput_tr119, throughput_tr119_generate
total_cost_per_1m_tokens_tr119, total_cost_per_1m_tokens_tr119_generate
cost_tiers_tr119, cost_tiers_tr119_generate
energy_efficiency_tr119, energy_efficiency_tr119_generate
carbon_footprint_tr119, carbon_footprint_tr119_generate
cost_vs_throughput_tr119, cost_vs_throughput_tr119_generate

Validation & Sanity Checks

Validation status: PASS

Cost & Energy Analysis

Cost Breakdown (Infra vs Energy, On-Demand, Mean Across Modes)

Backend	infra_usd_per_1M	energy_usd_per_1M	total_usd_per_1M	infra_pct	energy_pct
onnxruntime-gpu	0.6622	0.003761	0.666	99.44	0.5647
transformers-gpu-compile	1.668	0.008965	1.677	99.47	0.5348
transformers-gpu	1.934	0.008612	1.943	99.56	0.4432
onnxruntime-cpu	2.774	0.04867	2.823	98.28	1.724
transformers-cpu	9.583	0.1387	9.722	98.57	1.426

Energy Efficiency Ranking

Backends ranked by tokens per kWh, averaged over both modes (higher is better):

onnxruntime-gpu: 269,476,245 tokens/kWh
transformers-gpu: 218,787,639 tokens/kWh
transformers-gpu-compile: 175,331,811 tokens/kWh
onnxruntime-cpu: 22,477,886 tokens/kWh
transformers-cpu: 10,736,929 tokens/kWh
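The reported efficiency figures appear consistent with averaging per-scenario tokens/kWh and then averaging the two modes, rather than inverting the mean kWh. This is inferred from the tables, not confirmed against the pipeline code. A sketch using the onnxruntime-gpu rows of the carbon footprint table:

```python
from statistics import mean

# energy_kwh_per_1M_tok for onnxruntime-gpu, copied from the carbon footprint table.
PREFILL_KWH = [0.001423, 0.002281, 0.002037, 0.002798, 0.001894]
GENERATE_KWH = [0.01982, 0.01574, 0.04502, 0.05137, 0.04568]


def mean_tokens_per_kwh(kwh_per_1m: list) -> float:
    """Average of per-scenario tokens/kWh (assumed aggregation order)."""
    return mean(1e6 / kwh for kwh in kwh_per_1m)


prefill_eff = mean_tokens_per_kwh(PREFILL_KWH)
generate_eff = mean_tokens_per_kwh(GENERATE_KWH)
overall_eff = (prefill_eff + generate_eff) / 2
print(f"prefill {prefill_eff:,.0f}  generate {generate_eff:,.0f}  overall {overall_eff:,.0f}")
```

The results agree with the reported prefill and overall figures to within the rounding of the table inputs. Inverting the mean prefill energy instead (1e6 / 0.002087) would give about 479 million tokens/kWh, so the aggregation order matters.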

Carbon Footprint Comparison

Lowest carbon: onnxruntime-gpu (9.4 gCO2e/1M tokens, mean of prefill and generate)
Highest carbon: transformers-cpu (346.7 gCO2e/1M tokens, mean of prefill and generate)
Range: 337.3 gCO2e/1M tokens

ROI by Pricing Tier

Savings in total cost relative to on-demand when switching to spot or reserved (1-year) pricing:

transformers-gpu: Spot saves 69.7%, Reserved saves 29.9%
transformers-gpu-compile: Spot saves 69.6%, Reserved saves 29.9%
onnxruntime-gpu: Spot saves 69.6%, Reserved saves 29.9%
transformers-cpu: Spot saves 69.0%, Reserved saves 29.6%
onnxruntime-cpu: Spot saves 68.8%, Reserved saves 29.5%
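The savings percentages can be approximately reproduced from the provider table by averaging the generate and prefill costs per tier. That averaging step is an inference from the numbers, not a documented detail of the pipeline; the function name is illustrative.

```python
def savings_pct(on_demand_usd: float, alternative_usd: float) -> float:
    """Percentage saved by paying alternative_usd instead of on_demand_usd."""
    return (1.0 - alternative_usd / on_demand_usd) * 100.0


# onnxruntime-gpu rows of the aws_g5_xlarge provider table: (generate + prefill) / 2.
on_demand = (1.204 + 0.1279) / 2
spot = (0.3665 + 0.03868) / 2
reserved = (0.8448 + 0.08961) / 2

spot_savings = savings_pct(on_demand, spot)
reserved_savings = savings_pct(on_demand, reserved)
print(f"spot saves {spot_savings:.1f}%, reserved saves {reserved_savings:.1f}%")
```

Savings are slightly below the raw rate discounts (70.0% for spot, 30.0% for 1-year reserved) because the energy term is paid in full under every tier, which is also why the CPU backends, with their larger energy share, save marginally less.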

Request-Level Cost (Prompt+Generate Mix)

Assumptions: prompt_tokens=256, generate_tokens=128.

Backend	time_prefill_s	time_generate_s	energy_kwh_per_request	total_cost_usd_per_request
onnxruntime-gpu	0.114	0.4108	4.042e-06	0.0001475
transformers-gpu-compile	0.1409	1.096	1.015e-05	0.0003477
transformers-gpu	0.1236	1.213	8.502e-06	0.0003752
onnxruntime-cpu	0.2091	2.135	5.759e-05	0.0006667
transformers-cpu	0.6782	6.364	0.0001445	0.001997
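The request-level row for onnxruntime-gpu can be reproduced from the mean throughput and mean power in the summary table. The sketch below assumes on-demand pricing and uses illustrative names; it is not the pipeline's own code.

```python
def request_cost(
    prefill_tok_s: float,
    generate_tok_s: float,
    prefill_power_w: float,
    generate_power_w: float,
    prompt_tokens: int = 256,
    generate_tokens: int = 128,
    usd_per_hour: float = 1.006,
    usd_per_kwh: float = 0.2,
) -> dict:
    """Time, energy, and cost for one request with a fixed prompt/generate mix."""
    time_prefill_s = prompt_tokens / prefill_tok_s
    time_generate_s = generate_tokens / generate_tok_s
    energy_kwh = (prefill_power_w * time_prefill_s + generate_power_w * time_generate_s) / 3.6e6
    infra_usd = (time_prefill_s + time_generate_s) / 3600.0 * usd_per_hour
    return {
        "time_prefill_s": time_prefill_s,
        "time_generate_s": time_generate_s,
        "energy_kwh_per_request": energy_kwh,
        "total_cost_usd_per_request": infra_usd + energy_kwh * usd_per_kwh,
    }


# onnxruntime-gpu means: prefill 2246 tok/s at 16.63 W, generate 311.6 tok/s at 30.81 W.
req = request_cost(2246, 311.6, 16.63, 30.81)
print(req)
```

Generate accounts for roughly 78% of the request time here even though it produces half as many tokens as the prompt contains, so the uncached decode path drives request cost.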

TCO Summary

Assumptions: 1,000,000,000 tokens/month, 12 months, upfront $0. The per-token price is the on-demand total_usd_per_1M from the cost breakdown table above.

Backend	total_cost_usd	cost_per_month_usd	cost_per_1m_tokens_usd
onnxruntime-gpu	7992	666	0.666
transformers-gpu-compile	20120	1677	1.677
transformers-gpu	23320	1943	1.943
onnxruntime-cpu	33870	2823	2.823
transformers-cpu	116700	9722	9.722
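The TCO figures are a straight scale-up of cost per 1M tokens. A minimal sketch under the stated assumptions (function name is illustrative):

```python
def tco_usd(
    cost_per_1m_tokens_usd: float,
    tokens_per_month: float = 1_000_000_000,
    months: int = 12,
    upfront_usd: float = 0.0,
) -> dict:
    """Monthly and total cost for a constant monthly token volume."""
    cost_per_month = cost_per_1m_tokens_usd * tokens_per_month / 1e6
    return {
        "cost_per_month_usd": cost_per_month,
        "total_cost_usd": upfront_usd + cost_per_month * months,
    }


ort_gpu = tco_usd(0.666)
tf_cpu = tco_usd(9.722)
print(ort_gpu, tf_cpu)
```

Because the model is linear with no upfront cost, the backend ranking is identical to the per-token ranking; a non-zero upfront_usd would only shift totals, not reorder them, unless it differs per backend.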

5. Statistical Analysis

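We test whether observed cost differences are statistically significant across backends. Tests are run per mode when available. In the pairwise lists, "A vs B" reports the mean cost of B minus the mean cost of A in USD per 1M tokens (on-demand), with the percentage taken relative to A.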

Generate Mode

Hypothesis Testing

ANOVA: Significant differences detected across backends (F=12.85, p < 0.0001)

Significant Pairwise Comparisons (p < 0.05)

onnxruntime-cpu vs onnxruntime-gpu: -$4.1663 difference (-77.6%), p=0.0018, Cohen's d=-2.903
onnxruntime-cpu vs transformers-cpu: $13.1022 difference (244.0%), p=0.0134, Cohen's d=1.997
onnxruntime-gpu vs transformers-cpu: $17.2685 difference (1434.1%), p=0.0028, Cohen's d=2.686
onnxruntime-gpu vs transformers-gpu: $2.4215 difference (201.1%), p=0.0263, Cohen's d=1.720
transformers-cpu vs transformers-gpu: -$14.8470 difference (-80.4%), p=0.0072, Cohen's d=-2.265
transformers-cpu vs transformers-gpu-compile: -$15.3191 difference (-82.9%), p=0.0060, Cohen's d=-2.340
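
The first comparison can be reproduced from the per-scenario on-demand costs in the generate rows of the cost table. The sketch assumes the samples are the five scenario means per backend and that Cohen's d uses a pooled standard deviation; both are inferred from the fact that the numbers match, not from the pipeline code.

```python
from math import sqrt
from statistics import mean, stdev

# total_cost_usd_per_1M_tok, generate mode, one value per scenario (on-demand).
ORT_CPU = [3.867, 2.926, 7.575, 6.678, 5.806]
ORT_GPU = [0.5533, 0.523, 1.647, 1.628, 1.669]


def cohens_d(first: list, second: list) -> float:
    """Cohen's d for second minus first, using the pooled sample standard deviation."""
    n1, n2 = len(first), len(second)
    pooled_var = ((n1 - 1) * stdev(first) ** 2 + (n2 - 1) * stdev(second) ** 2) / (n1 + n2 - 2)
    return (mean(second) - mean(first)) / sqrt(pooled_var)


diff_usd = mean(ORT_GPU) - mean(ORT_CPU)
diff_pct = diff_usd / mean(ORT_CPU) * 100.0
d = cohens_d(ORT_CPU, ORT_GPU)
print(f"diff ${diff_usd:.4f} ({diff_pct:.1f}%), Cohen's d = {d:.3f}")
```

With only five samples per backend, effect sizes this large are needed to reach significance, which is consistent with several GPU-vs-GPU pairs not appearing in the list.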

Prefill Mode

Hypothesis Testing

ANOVA: Significant differences detected across backends (F=8.08, p=0.0005)

Significant Pairwise Comparisons (p < 0.05)

onnxruntime-cpu vs onnxruntime-gpu: -$0.1469 difference (-53.5%), p=0.0345, Cohen's d=-1.609
onnxruntime-cpu vs transformers-cpu: $0.6962 difference (253.4%), p=0.0230, Cohen's d=1.775
onnxruntime-gpu vs transformers-cpu: $0.8431 difference (659.4%), p=0.0082, Cohen's d=2.207
transformers-cpu vs transformers-gpu: -$0.7105 difference (-73.2%), p=0.0251, Cohen's d=-1.739
transformers-cpu vs transformers-gpu-compile: -$0.7715 difference (-79.5%), p=0.0141, Cohen's d=-1.978

6. Synthesis & Decision Matrix

6.1 What matters most

Throughput dominates $/token under the configured pricing inputs; small power differences rarely change rankings.
Pricing tier is a second lever: spot/reserved can shift total cost by ~2-3x for the same backend.
Uncached generate is an upper bound on cost: production KV-cache decoding should materially reduce generate cost per token.
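
The first point can be illustrated with a small sensitivity sketch. The energy term follows the Measurement Definitions (kWh per 1M tokens = power_w * seconds_per_1m / 3.6e6); the infrastructure term (hourly rate times compute-hours) and all input values below, including the 1.00 USD/hour rate and 0.15 USD/kWh electricity price, are assumptions for illustration rather than the report's configured pricing.

```python
def usd_per_1m_tokens(throughput_tok_s, power_w, hourly_rate_usd, usd_per_kwh):
    """Cost of 1M tokens as infrastructure cost plus energy cost."""
    seconds_per_1m = 1e6 / throughput_tok_s
    # ASSUMPTION: infrastructure is billed linearly by wall-clock compute-hours.
    infra_usd = hourly_rate_usd * seconds_per_1m / 3600.0
    # Energy term as defined in the Measurement Definitions.
    energy_kwh = power_w * seconds_per_1m / 3.6e6
    return infra_usd + energy_kwh * usd_per_kwh


# Hypothetical inputs, chosen only to show the relative sensitivity.
RATE_USD_PER_HOUR = 1.00
PRICE_USD_PER_KWH = 0.15

baseline = usd_per_1m_tokens(1000.0, 100.0, RATE_USD_PER_HOUR, PRICE_USD_PER_KWH)
double_throughput = usd_per_1m_tokens(2000.0, 100.0, RATE_USD_PER_HOUR, PRICE_USD_PER_KWH)
double_power = usd_per_1m_tokens(1000.0, 200.0, RATE_USD_PER_HOUR, PRICE_USD_PER_KWH)

print(f"baseline:          {baseline:.4f} USD/1M tok")
print(f"2x throughput:     {double_throughput:.4f} USD/1M tok")
print(f"2x power draw:     {double_power:.4f} USD/1M tok")
```

Under these inputs, doubling throughput halves the cost, while doubling power draw raises it by under 2%, because both terms scale with seconds_per_1m but the energy term is small next to the hourly rate.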

6.2 Deployment Recommendations (This Hardware)

Default GPU backend: onnxruntime-gpu (lowest cost and best energy efficiency in this benchmark).
CPU-only fallback: onnxruntime-cpu (better cost/energy than transformers-cpu).
Transformers backends: keep when you need maximum feature parity; expect higher $/token.

6.3 Decision Matrix (On-Demand, Mean Across Scenarios)

Backend	prefill_usd_per_1M	generate_usd_per_1M	generate/prefill
onnxruntime-cpu	0.2748	5.37	19.55
onnxruntime-gpu	0.1279	1.204	9.417
transformers-cpu	0.971	18.47	19.02
transformers-gpu	0.2605	3.626	13.92
transformers-gpu-compile	0.1995	3.154	15.81
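
The generate/prefill column is the ratio of the two cost columns, so it can be recomputed from the matrix. The sketch below copies the values from the table above and also ranks backends by generate cost; the variable names are illustrative, and small deviations from the printed ratios come from rounding in the table.

```python
# (prefill_usd_per_1M, generate_usd_per_1M), copied from the decision matrix above.
DECISION_MATRIX = {
    "onnxruntime-cpu": (0.2748, 5.37),
    "onnxruntime-gpu": (0.1279, 1.204),
    "transformers-cpu": (0.971, 18.47),
    "transformers-gpu": (0.2605, 3.626),
    "transformers-gpu-compile": (0.1995, 3.154),
}

# Ratio of uncached generate cost to prefill cost for each backend.
ratios = {name: gen / pre for name, (pre, gen) in DECISION_MATRIX.items()}

# Backends ordered from cheapest to most expensive generate cost.
ranked_by_generate = sorted(DECISION_MATRIX, key=lambda name: DECISION_MATRIX[name][1])

for name in ranked_by_generate:
    pre, gen = DECISION_MATRIX[name]
    print(f"{name:26s} prefill={pre:.4f} generate={gen:.3f} ratio={ratios[name]:.2f}")
```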

6.4 Operational Considerations

onnxruntime-gpu: best efficiency here, but requires ONNX export + runtime integration (engineering overhead is moderate but stable).
transformers-gpu: simplest integration path, but higher $/token in this benchmark.
transformers-gpu-compile: can improve throughput for some models, but compilation overhead and variability can complicate deployment.
CPU backends are viable for compatibility/fallback, not for cost-optimal throughput at scale.

6.5 Limitations & Next Steps

Single hardware system; replicate on your production GPU/CPU to pin down absolute costs.
Generate results are uncached and intentionally pessimistic; repeat with KV-cache for production planning.
Tokenization, batching overheads, and end-to-end serving stack are not included; integrate into TR123 for full-stack TCO.

7. Reproducibility

7.1 Run the pipeline

python scripts/tr119/run_experiment.py --config C:\Users\sahil\OneDrive\Documents\GitHub\Banterhearts\scripts\tr119\configs\matrix.yaml --device cuda

7.2 Key artifacts

Results root: C:\Users\sahil\OneDrive\Documents\GitHub\Banterhearts\scripts\tr119\results\tr119_matrix
Processed: C:\Users\sahil\OneDrive\Documents\GitHub\Banterhearts\scripts\tr119\results\tr119_matrix\processed
Report: C:\Users\sahil\OneDrive\Documents\GitHub\Banterhearts\reports\generated\Technical_Report_119.md
Manifest: C:\Users\sahil\OneDrive\Documents\GitHub\Banterhearts\scripts\tr119\results\tr119_matrix\processed\experiment_manifest_1766283570.json
