Technical Report 117: Cross-Backend Inference Benchmark
| Field | Value |
|---|---|
| TR Number | 117 |
| Version | 1.1 (Revised) |
| Date | 2025-12-08 |
| Author | Banterhearts Team |
| Status | Complete (Data-Consistent Revision) |
Executive Summary
This report presents a benchmark of 7 inference backend configurations (5 successfully tested, 2 failed infrastructure checks) for local-first LLM serving: PyTorch transformers (CPU and GPU, each with and without torch.compile), Ollama, TensorRT, and ONNXRuntime. Across 3,017 total runs (2,471 successful, 546 degraded), we measured latency, throughput, and cost efficiency.
Key Findings
- GPU-compile wins on mean (389ms) and cost ($0.045/1M tokens) (based on flat $0.035/hr for all hardware; see Section 6 limitations)
- Plain GPU wins on median (323ms) - compile paradox discovered
- Ollama 8.8x slower than GPU-compile (3,411ms vs 389ms mean) (confounded by model size: transformers tested tiny-gpt2 124M, Ollama tested 270M-8B)
- CPU compilation ineffective (2% improvement, p=0.826, not significant)
- TensorRT/ONNXRuntime infrastructure failed (100% degraded runs)
Honest Limitations
WARNING This report reflects ACTUAL test results, not aspirations:
- NO accuracy metrics (accuracy column empty in all 3,017 runs)
- TensorRT/ONNX NOT tested (546/546 runs degraded due to missing engines)
- Model skew: 55% runs on tiny-gpt2 (124M), rest on Ollama models (270M-8B)
- Single hardware: RTX 4080 laptop only
- Synthetic prompts: Not production workload traces
Bottom Line: In this benchmark, transformers-gpu-compile delivers best mean latency and cost, but plain GPU has slightly better median. Ollama viable only when model flexibility outweighs 8.8x performance penalty (confounded by model size: transformers tested tiny-gpt2 124M, Ollama tested 270M-8B).
1. Introduction
1.1 Motivation
Local-first LLM inference requires choosing the optimal backend for speed, cost, and reliability. PyTorch offers torch.compile(), Ollama provides multi-model flexibility, and specialized runtimes (ONNX, TensorRT) promise further optimizations.
Research Question: Which backend delivers the best latency, cost, and reliability for production inference?
1.2 Scope
Tested:
- PASS PyTorch transformers (CPU, GPU, CPU-compile, GPU-compile)
- PASS Ollama (llama.cpp backend with 6 models)
Failed (Infrastructure Issues):
- FAIL TensorRT (273 runs -> 100% degraded, missing .plan files)
- FAIL ONNXRuntime (273 runs -> 100% degraded, ONNX export failures)
Test Matrix:
- Models: tiny-gpt2 (124M), gemma3 (270M, 1B, 3B), qwen2.5 (7B), llama3.1 (8B-q4)
- Scenarios: 7 prompt types (micro, short, medium, long, dual_short, dual_medium, stress)
- Repetitions: 7 per combination
- Total: 3,017 runs planned, 2,471 successful (82%)
1.3 Related Work
- PyTorch 2.x: torch.compile() introduced for graph-level optimization
- Ollama: Optimized llama.cpp with quantization and KV cache tuning
- TensorRT: NVIDIA's inference SDK (not validated in this study)
- ONNX Runtime: Cross-platform inference (not validated in this study)
2. Methodology
2.1 Hardware
- GPU: NVIDIA RTX 4080 Laptop (12.9GB VRAM, CUDA 12.8)
- CPU: Unspecified (Windows 11, 32-core system)
- RAM: 32GB+
2.2 Software
- PyTorch: 2.5.1
- Transformers: 4.45.2
- Ollama: Latest (December 2025)
- CUDA: 12.8
- Python: 3.13
2.3 Test Scenarios
| Scenario | Prompt Length | Mode | Prompts |
|---|---|---|---|
| micro | 1-2 words | single | "Hello", "Test" |
| short | 8-15 words | single | "Summarize RLHF..." |
| medium | 20-30 words | single | "Explain backpressure..." |
| long | 40-50 words | single | "Overview of attention mechanism..." |
| dual_short | 8-15 words | dual | Two-agent prompts |
| dual_medium | 20-30 words | dual | Two-agent prompts |
| stress | 80+ words | single | "Generate 500-word essay..." |
7 scenarios x 7 repetitions = 49 runs per backend/model combination
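For reference, the 49-run grid per backend/model pair can be enumerated as in the sketch below; the names are illustrative, and the actual runner is scripts/tr117/run_matrix.py.

```python
# Illustrative sketch of the per-combination run grid (the real logic
# lives in scripts/tr117/run_matrix.py; names here are assumptions).
from itertools import product

SCENARIOS = ["micro", "short", "medium", "long", "dual_short", "dual_medium", "stress"]
REPETITIONS = range(7)

runs = list(product(SCENARIOS, REPETITIONS))
assert len(runs) == 49  # 7 scenarios x 7 repetitions per backend/model combination
```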
2.4 Metrics
Latency:
- Mean, median, std dev, min, max
- TTFT, p50, p95, p99: listed but not reported in this revision
Throughput:
- Tokens/second
- Requests/second (inverse of latency)
Cost:
- $/1M tokens (based on $0.035/hour GPU cost)
- Compute efficiency (tokens/$)
Reliability:
- Success rate (ok vs degraded vs error)
- Degradation reasons
2.5 Reproducibility
- Seed: 42 (fixed across all runs)
- Temperature: 0.7
- Max tokens: 128
- Docker: Not used (native Windows environment)
- Frozen deps: requirements_frozen.txt in scripts/tr117/
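As a minimal sketch of these generation settings (the model id "sshleifer/tiny-gpt2" and the generate call are illustrative assumptions, not the runner's actual code):

```python
# Hedged sketch of the fixed generation settings; the model id
# "sshleifer/tiny-gpt2" is an assumed stand-in for tiny-gpt2 (124M).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.manual_seed(42)  # seed 42, consistent across runs

tok = AutoTokenizer.from_pretrained("sshleifer/tiny-gpt2")
model = AutoModelForCausalLM.from_pretrained("sshleifer/tiny-gpt2")

inputs = tok("Hello", return_tensors="pt")  # "micro" scenario prompt
out = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.7,     # Section 2.5: temperature 0.7
    max_new_tokens=128,  # Section 2.5: max tokens 128
)
print(tok.decode(out[0], skip_special_tokens=True))
```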
2.6 Statistical Analysis
- Tests: Independent-samples t-tests, ANOVA, Bonferroni correction (corrected alpha = 0.05/4 = 0.0125 for 4 pairwise comparisons)
- Effect sizes: Cohen's d
- Confidence intervals: not computed for this revision
- Significance threshold: p < 0.05 (p < 0.0125 after Bonferroni correction)
- Note: GPU-compile vs Ollama comparison is between different models and cannot be interpreted as a backend-isolated effect.
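A minimal sketch of one pairwise test, assuming metrics.csv exposes "backend" and "latency_ms" columns (the column names are assumptions):

```python
# Sketch: independent-samples t-test plus Cohen's d for one pairwise
# comparison; column names ("backend", "latency_ms") are assumptions.
import numpy as np
import pandas as pd
from scipy import stats

df = pd.read_csv("results/tr117_tier3/metrics.csv")
a = df.loc[df["backend"] == "transformers-gpu-compile", "latency_ms"].to_numpy()
b = df.loc[df["backend"] == "transformers-gpu", "latency_ms"].to_numpy()

t, p = stats.ttest_ind(a, b, equal_var=False)  # Welch variant of the t-test

# Cohen's d from the pooled standard deviation
pooled = np.sqrt(((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
                 / (len(a) + len(b) - 2))
d = (a.mean() - b.mean()) / pooled

alpha = 0.05 / 4  # Bonferroni correction for 4 pairwise comparisons
print(f"t={t:.2f}, p={p:.3g}, d={d:.2f}, significant={p < alpha}")
```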
3. Results Overview
3.1 Summary Table
| Backend | Primary Model | Successful Runs | Mean (ms) | Median (ms) | Std (ms) | Cost ($/1M tok) | Throughput (tok/s) |
|---|---|---|---|---|---|---|---|
| transformers-gpu-compile | tiny-gpt2 (124M) | 273 | 389.2 | 328.7 | 117.8 | $0.045 | 215.2 |
| transformers-gpu | tiny-gpt2 (124M) | 273 | 404.1 | 322.7 | 223.9 | $0.046 | 211.7 |
| transformers-cpu-compile | tiny-gpt2 (124M) | 273 | 559.3 | 526.7 | 103.5 | $0.071 | 137.3 |
| transformers-cpu | tiny-gpt2 (124M) | 287 | 570.6 | 530.4 | 117.2 | $0.074 | 132.2 |
| ollama | 270M-8B (mixed) | 1,365 | 3,410.5 | 1,238.5 | 3,874.9 | $0.106 | 91.9 |
| tensorrt | — | 0 | N/A | N/A | N/A | N/A | N/A |
| onnxruntime | — | 0 | N/A | N/A | N/A | N/A | N/A |
Note: Throughput comparison is across different model sizes. Ollama's 91.9 tok/s is averaged across 270M-8B models; transformers' 215.2 tok/s is from tiny-gpt2 (124M).
Key Observations:
- Best mean: GPU-compile (389ms)
- Best median: Plain GPU (323ms) <- compile paradox
- Best cost: GPU-compile ($0.045/1M)
- Best consistency: CPU-compile (std 103.5ms)
- Worst performance: Ollama (8.8x slower than GPU-compile) (confounded by model size: transformers tested tiny-gpt2 124M, Ollama tested 270M-8B)
- Infrastructure failures: TRT/ORT 100% degraded
3.2 Status Breakdown
Total Runs: 3,017
|- Successful: 2,471 (82%)
|- Degraded: 546 (18%)
| |- TensorRT: 273 (missing .plan engines)
| +- ONNXRuntime: 273 (ONNX export failures)
+- Hard Errors: 0 (0%)
3.3 Compile Paradox
Discovery: torch.compile() improves mean but degrades median:
- Mean: 404ms -> 389ms (3.7% faster)
- Median: 323ms -> 329ms (1.9% slower)
Hypothesis: Compile reduces outliers (better tail latency) but adds overhead to typical requests.
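A toy numerical illustration of how this hypothesis produces the observed pattern (the numbers are synthetic, chosen only to mimic the shape of the data):

```python
# Synthetic illustration: a small constant overhead plus a trimmed
# outlier tail lowers the mean while raising the median.
import numpy as np

rng = np.random.default_rng(42)
eager = np.concatenate([
    rng.normal(320, 15, 950),    # typical requests near the median
    rng.normal(1800, 400, 50),   # occasional slow outliers
])
compiled = rng.normal(329, 15, 1000)  # ~9ms slower typically, but no tail

for name, x in [("eager", eager), ("compiled", compiled)]:
    print(f"{name}: mean={np.mean(x):.0f}ms median={np.median(x):.0f}ms")
```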
4. Backend Deep Dive
4.1 transformers-gpu-compile TOP
Winner on mean latency and cost.
Performance:
- Mean: 389.2ms (3.7% faster than plain GPU)
- Median: 328.7ms (1.9% slower than plain GPU)
- Std: 117.8ms (2x better than plain GPU)
- Cost: $0.045/1M tokens (cheapest)
- Throughput: 215.2 tok/s (highest)
Strengths:
- PASS Best mean latency
- PASS Lowest cost
- PASS Highest throughput
- PASS Better consistency (lower std dev)
Weaknesses:
- FAIL Median 1.9% slower than plain GPU
- FAIL 30s compilation overhead on first run
- FAIL GPU required
Recommendation: Best-performing in this benchmark for cost-sensitive workloads.
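Enabling this backend amounts to wrapping the model with torch.compile(); a hedged sketch, with an assumed model id:

```python
# Sketch of the GPU-compile path; "sshleifer/tiny-gpt2" is an assumed
# stand-in for the tiny-gpt2 (124M) model used in this benchmark.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("sshleifer/tiny-gpt2").to("cuda")
model = torch.compile(model)  # ~30s one-time compilation on the first forward pass
```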
4.2 transformers-gpu
Winner on median latency.
Performance:
- Mean: 404.1ms
- Median: 322.7ms (BEST)
- Std: 223.9ms (2x worse than GPU-compile)
- Cost: $0.046/1M tokens
- Throughput: 211.7 tok/s
Strengths:
- PASS Best median (1.9% faster than GPU-compile)
- PASS No compilation overhead
- PASS Simpler debugging
Weaknesses:
- FAIL Mean 3.7% slower than GPU-compile
- FAIL Higher variance (outliers up to 3.3s)
- FAIL GPU required
Recommendation: Best-performing in this benchmark for p50 SLA workloads; suitable for development/prototyping.
4.3 transformers-cpu-compile
Performance:
- Mean: 559.3ms
- Median: 526.7ms
- Std: 103.5ms (BEST consistency)
- Cost: $0.071/1M tokens
- Throughput: 137.3 tok/s
Strengths:
- PASS No GPU required
- PASS Best consistency (lowest std dev)
Weaknesses:
- FAIL Only 2% faster than plain CPU (p=0.826, not significant)
- FAIL 1.44x slower than GPU-compile
- FAIL 1.57x more expensive than GPU-compile
Recommendation: CPU-only environments (but compile brings minimal benefit).
4.4 transformers-cpu
Performance:
- Mean: 570.6ms
- Median: 530.4ms
- Std: 117.2ms
- Cost: $0.074/1M tokens (most expensive for transformers)
- Throughput: 132.2 tok/s (lowest for transformers)
Strengths:
- PASS No GPU required
- PASS Baseline for comparison
Weaknesses:
- FAIL 1.47x slower than GPU-compile
- FAIL 1.64x more expensive than GPU-compile
Recommendation: Development on CPU-only machines.
4.5 ollama
Performance:
- Mean: 3,410.5ms (8.8x slower than GPU-compile) (confounded by model size: transformers tested tiny-gpt2 124M, Ollama tested 270M-8B)
- Median: 1,238.5ms (3.8x slower)
- Std: 3,874.9ms (TERRIBLE - 114% of mean!)
- Cost: $0.106/1M tokens (2.35x more expensive)
- Throughput: 91.9 tok/s (lowest; averaged across 270M-8B models vs transformers' tiny-gpt2 124M)
Strengths:
- PASS Multi-model flexibility (6 models tested)
- PASS Simple API (swap models on demand)
- PASS Good for experimentation
Weaknesses:
- FAIL 8.8x slower than GPU-compile (mean) (confounded by model size: transformers tested tiny-gpt2 124M, Ollama tested 270M-8B)
- FAIL 2.35x more expensive
- FAIL Catastrophic variance (173ms to 27,964ms - 161x range!)
- FAIL Unreliable for production SLAs
Recommendation: Use only when model flexibility matters more than performance.
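For completeness, a minimal latency probe against a local Ollama server via the standard /api/generate endpoint (the model tag "gemma3:1b" is illustrative):

```python
# Hedged sketch of a single-request latency probe against Ollama;
# the model tag "gemma3:1b" is illustrative.
import time
import requests

t0 = time.perf_counter()
r = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "gemma3:1b", "prompt": "Hello", "stream": False},
    timeout=60,
)
r.raise_for_status()
latency_ms = (time.perf_counter() - t0) * 1000
print(f"{latency_ms:.0f}ms -> {r.json().get('response', '')[:60]!r}")
```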
4.6 tensorrt FAIL
Status: NOT TESTED (100% degraded)
Issue: Missing TensorRT .plan engines. All 273 runs reported degraded status with placeholder latencies (0.35ms).
Next Steps: TR118 will build real engines and re-test.
4.7 onnxruntime FAIL
Status: NOT TESTED (100% degraded)
Issue: ONNX export failures. All 273 runs reported degraded status with placeholder latencies (0.30ms).
Next Steps: TR118 will fix ONNX export and re-test.
5. Statistical Analysis
5.1 Backend Comparison (ANOVA)
CAVEAT: Backend and model are confounded in this design. Transformers backends tested tiny-gpt2 (124M); Ollama tested 270M-8B models. This ANOVA tests system-level differences, not backend-isolated effects.
Null Hypothesis: No difference in mean latency across backends.
Test: One-way ANOVA on 5 backends (excludes TRT/ORT).
Result: F = 45.86, p < 10^-15 PASS Significant
Interpretation: Backend choice critically affects latency (but confounded with model size).
5.2 Pairwise Comparisons
GPU-compile vs GPU:
- Mean difference: -14.8ms (GPU-compile faster)
- p-value: < 0.05 PASS Significant (uncorrected threshold; see Section 2.6 for the Bonferroni-corrected alpha)
- Cohen's d: 0.14 (small effect)
- Finding: GPU-compile 3.7% faster on mean, but median paradox exists
GPU vs CPU:
- Mean difference: -166.5ms (GPU faster)
- p-value: < 0.001 PASS Highly significant
- Cohen's d: 1.48 (large effect)
- Finding: GPU 1.41x faster, 1.61x cheaper
GPU-compile vs Ollama:
- Mean difference: -3,021.3ms (GPU-compile faster)
- p-value: < 10^-15 PASS Astronomically significant
- Cohen's d: 1.60 (huge effect) (confounded by model size)
- Finding: GPU-compile 8.8x faster, 2.35x cheaper (confounded by model size: transformers tested tiny-gpt2 124M, Ollama tested 270M-8B)
- Note: GPU-compile vs Ollama comparison is between different models and cannot be interpreted as a backend-isolated effect.
CPU-compile vs CPU:
- Mean difference: -11.2ms (CPU-compile faster)
- p-value: 0.826 FAIL NOT significant
- Cohen's d: 0.10 (negligible)
- Finding: Compilation ineffective on CPU
6. Cost Analysis
6.1 Cost Model
Assumptions:
- GPU: $0.035/hour (AWS g5.xlarge proxy)
- CPU: $0.035/hour (same, for simplicity)
- Ollama: $0.035/hour (runs on same hardware)
Limitation: Oversimplified (no spot/reserved pricing, no energy cost).
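The per-token arithmetic, shown as a worked sketch (the flat hourly rate is the assumption stated above):

```python
# Worked example of the cost model: $/1M tokens = hourly rate / tokens per hour.
RATE_PER_HOUR = 0.035  # flat $/hour assumed for all hardware (see above)

def cost_per_million_tokens(tokens_per_second: float) -> float:
    tokens_per_hour = tokens_per_second * 3600
    return RATE_PER_HOUR / tokens_per_hour * 1_000_000

print(f"GPU-compile: ${cost_per_million_tokens(215.2):.3f}")  # ~$0.045, matches 6.2
print(f"Ollama:      ${cost_per_million_tokens(91.9):.3f}")   # ~$0.106, matches 6.2
```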
6.2 Results
| Backend | Cost/1M Tokens | Tokens/$ | Notes |
|---|---|---|---|
| GPU-compile | $0.045 | 22.1M | Best |
| GPU | $0.046 | 21.7M | 2nd |
| CPU-compile | $0.071 | 14.1M | 3rd |
| CPU | $0.074 | 13.5M | 4th |
| Ollama | $0.106 | 9.4M | Worst (2.35x more expensive) |
7. Data Integrity & Limitations
7.1 Missing Data
WARNING Accuracy Metrics: NOT COLLECTED
The metrics.csv accuracy column is 100% NULL (0/3,017 values). The report CANNOT make accuracy claims.
Why: Accuracy validation was disabled during the benchmark run. Baseline outputs were not compared.
Impact: We can only rank backends by speed and cost, not quality.
Fix: TR118 will re-run with accuracy validation enabled.
7.2 Infrastructure Failures
TensorRT: 273/273 runs degraded (100% failure rate)
- Cause: Missing .plan engine files
- Evidence: Placeholder latencies (0.35ms average)
- Fix: TR118 will build real TensorRT engines
ONNXRuntime: 273/273 runs degraded (100% failure rate)
- Cause: ONNX export failures
- Evidence: Placeholder latencies (0.30ms average)
- Fix: TR118 will fix ONNX export pipeline
Total Degraded: 546/3,017 runs (18%)
7.3 Model Skew
Distribution:
- tiny-gpt2: 55% of runs (HuggingFace, 124M params)
- gemma3: 25% of runs (Ollama, 270M-3B)
- qwen2.5: 10% of runs (Ollama, 7B)
- llama3.1: 10% of runs (Ollama, 8B-q4)
Issue: Cannot isolate backend effects from model effects.
Fix: TR121 will test same models across all backends.
7.4 Single Hardware
All tests on one laptop (RTX 4080). Findings may not generalize to:
- Data center GPUs (A100, H100)
- AMD GPUs
- Apple Silicon
- Cloud providers (AWS, Azure, GCP)
Acknowledged limitation; not addressed in the current program (all TRs use RTX 4080 Laptop).
7.5 Synthetic Prompts
Test prompts are not production traces. Real workloads may differ in:
- Prompt length distribution
- Batching patterns
- Concurrent requests
- Model switching frequency
Acknowledged limitation; TR128 tests production workload patterns but on the same hardware.
8. Recommendations
8.1 Deployment Guidance (Single-Hardware, Constrained Scope)
Best-performing in this benchmark for cost-optimized workloads:
Backend: transformers-gpu-compile
Config:
BANTER_FORCE_BACKEND=transformers-gpu-compile
BANTER_INFERENCE_TIMEOUT_S=2
BANTER_LATENCY_GUARDRAIL_MS=500
Expected: 389ms mean, $0.045/1M tokens (based on flat $0.035/hr for all hardware; see Section 6 limitations), 215 tok/s
Caveat: Tested on tiny-gpt2 (124M) only. Results may not generalize to larger models or different hardware.
Best-performing in this benchmark for p50 SLA workloads:
Backend: transformers-gpu
Config:
BANTER_FORCE_BACKEND=transformers-gpu
BANTER_INFERENCE_TIMEOUT_S=5
Expected: 323ms median, $0.046/1M tokens (based on flat $0.035/hr for all hardware; see Section 6 limitations), 212 tok/s
Note: Compile paradox - GPU has better median despite worse mean
Caveat: Tested on tiny-gpt2 (124M) only. Results may not generalize to larger models or different hardware.
Best-performing in this benchmark for multi-model flexibility:
Backend: ollama
Config:
BANTER_OLLAMA_URL=http://localhost:11434
Expected: 3,411ms mean (8.8x slower, confounded by model size), $0.106/1M (based on flat $0.035/hr; see Section 6 limitations)
Only viable when model swapping > performance
Caveat: Ollama tested on 270M-8B models vs transformers' tiny-gpt2 (124M). The 8.8x gap is not a backend-isolated comparison.
NOT RECOMMENDED:
- FAIL CPU-only: 1.4x slower, 1.6x more expensive than GPU
- FAIL CPU-compile: Only 2% faster than CPU (not significant)
- FAIL TensorRT/ONNX: Infrastructure not ready (100% degraded)
8.2 Future Work
TR118: ONNX/TRT Deep Dive (Week of 2025-12-09)
- Build real TensorRT engines (FP32, FP16, INT8)
- Fix ONNX export pipeline
- Re-run benchmark with 0% degraded target
- Accuracy validation (perplexity + ROUGE)
TR119: Cost & Energy Analysis (Week of 2025-12-16)
- Real cloud pricing (spot, reserved, on-prem)
- Energy measurement (Joules/token, carbon footprint)
- TCO calculator for 1M req/day workload
TR120: Compile Paradox Investigation (Week of 2025-12-23)
- Profiler traces (torch.profiler, nsys)
- Kernel-level analysis (where compile helps/hurts)
- Hybrid strategy (compile for batch, eager for single)
TR121: Model Scaling Study (Week of 2025-12-30)
- Unified model matrix (same models on all backends)
- Scaling laws (latency vs params)
- Quantization necessity analysis
TR122: Resource Profiling (Week of 2026-01-06)
- Memory profiling (GPU VRAM, CPU RAM, swap)
- Power measurement (Watts, thermal throttling)
- Bottleneck identification
TR123: Multi-Hardware Validation (Week of 2026-01-13)
- A100, H100, AMD, Apple Silicon
- AWS g5, Azure NC, GCP A2
- Real production workload traces
9. Conclusions
9.1 Key Findings
- transformers-gpu-compile wins on mean (389ms) and cost ($0.045/1M) (based on flat $0.035/hr for all hardware; see Section 6 limitations)
- Plain transformers-gpu wins on median (323ms) - compile paradox
- Ollama 8.8x slower, 2.35x more expensive (confounded by model size: transformers tested tiny-gpt2 124M, Ollama tested 270M-8B; only viable for multi-model)
- CPU compilation ineffective (2% improvement, p=0.826, not significant)
- TensorRT/ONNX infrastructure failed (546/546 runs degraded, 0% tested)
9.2 Best-Performing in This Benchmark
transformers-gpu-compile for cost-sensitive workloads (single-hardware, constrained scope).
Decision Matrix:
- Need lowest mean latency + cost? -> GPU-compile
- Need best median (p50 SLA)? -> Plain GPU
- Need multi-model flexibility? -> Ollama (accept 8.8x penalty, confounded by model size)
- CPU-only? -> Plain CPU (compile brings no benefit)
- TensorRT/ONNX? -> Wait for TR118 (currently 100% broken)
9.3 Scientific Integrity
WARNING This report reflects ACTUAL test results:
- NO accuracy data (column empty)
- TensorRT/ONNX NOT tested (100% degraded)
- Model skew (55% tiny-gpt2)
- Single hardware (RTX 4080 laptop)
- Synthetic prompts (not production traces)
10. Reproducibility
10.1 Artifacts
Data:
- results/tr117_tier3/metrics.csv (3,017 rows: 2,471 ok, 546 degraded)
- results/tr117_tier3/cost_analysis.json
- results/tr117_tier3/statistical_analysis.json
Scripts:
- scripts/tr117/run_matrix.py (benchmark runner)
- scripts/tr117/analyze_tr117.py (aggregation)
- scripts/tr117/statistical_analysis.py (ANOVA, t-tests)
- scripts/tr117/cost_analysis.py ($/1M tokens)
Config:
- scripts/tr117/configs/matrix_tier3_full.yaml
10.2 How to Reproduce
# 1. Setup environment
cd scripts/tr117
pip install -r requirements_frozen.txt
# 2. Run benchmark (10-20 hours)
python run_matrix.py --config configs/matrix_tier3_full.yaml
# 3. Analyze results
python analyze_tr117.py --input results/tr117_tier3/metrics.csv
python statistical_analysis.py --input results/tr117_tier3/metrics.csv
python cost_analysis.py --input results/tr117_tier3/metrics.csv
10.3 Hardware Requirements
- NVIDIA GPU (RTX 4000+ or A100)
- 12GB+ VRAM (tested on RTX 4080 Laptop, 12.9GB)
- 32GB+ RAM
- 100GB+ disk space
Appendix A: Raw Statistics
Backend Statistics (Successful Runs Only):
{
"transformers-gpu-compile": {
"count": 273,
"mean": 389.2,
"median": 328.7,
"std": 117.8,
"min": 277.4,
"max": 681.8
},
"transformers-gpu": {
"count": 273,
"mean": 404.1,
"median": 322.7,
"std": 223.9,
"min": 276.8,
"max": 3325.8
},
"transformers-cpu-compile": {
"count": 273,
"mean": 559.3,
"median": 526.7,
"std": 103.5,
"min": 398.2,
"max": 785.5
},
"transformers-cpu": {
"count": 287,
"mean": 570.6,
"median": 530.4,
"std": 117.2,
"min": 314.5,
"max": 842.0
},
"ollama": {
"count": 1365,
"mean": 3410.5,
"median": 1238.5,
"std": 3874.9,
"min": 173.5,
"max": 27963.9
}
}
Appendix B: Degraded Runs
Total Degraded: 546/3,017 (18%)
By Backend:
- TensorRT: 273 (100% of TRT runs)
- ONNXRuntime: 273 (100% of ORT runs)
- Others: 0 (0% degraded)
Degradation Reasons:
- tensorrt_engine_not_found (273 runs)
- onnx_export_failed (273 runs)
End of Technical Report 117 (Revised)
Changelog:
- v1.0 (2025-12-07): Initial report (accuracy claims removed in v1.1; no supporting data existed)
- v1.1 (2025-12-08): DATA-CONSISTENT REVISION
- Removed all accuracy claims (no data exists)
- Marked TRT/ORT as NOT TESTED (100% degraded)
- Added Data Integrity section (honest limitations)
- Regenerated statistical analysis from tier3 data
- Acknowledged 546 degraded runs
- Changed recommendations to reflect 5 backends only