
TR117: Cross-Backend Inference Benchmark

Benchmark comparing PyTorch eager execution, torch.compile, Ollama, ONNX Runtime, and TensorRT backends for local-first LLM inference.


TR Number: 117
Version:   1.1 (Revised)
Date:      2025-12-08
Author:    Banterhearts Team
Status:    Complete (Data-Consistent Revision)

Executive Summary

This report presents a benchmark of seven inference backends (five successfully tested, two failed infrastructure checks) for local-first LLM serving: PyTorch transformers on CPU and GPU, each with and without torch.compile; Ollama; TensorRT; and ONNX Runtime. Across 3,017 total runs (2,471 successful, 546 degraded), we measured latency, throughput, and cost efficiency.

Key Findings

  1. GPU-compile wins on mean (389ms) and cost ($0.045/1M tokens, based on a flat $0.035/hr rate for all hardware; see Section 6 limitations)
  2. Plain GPU wins on median (323ms) - compile paradox discovered
  3. Ollama 8.8x slower than GPU-compile (3,411ms vs 389ms mean) (confounded by model size: transformers tested tiny-gpt2 124M, Ollama tested 270M-8B)
  4. CPU compilation ineffective (2% improvement, p=0.826, not significant)
  5. TensorRT/ONNXRuntime infrastructure failed (100% degraded runs)

Honest Limitations

WARNING This report reflects ACTUAL test results, not aspirations:

  • NO accuracy metrics (accuracy column empty in all 3,017 runs)
  • TensorRT/ONNX NOT tested (546/546 runs degraded due to missing engines)
  • Model skew: 55% runs on tiny-gpt2 (124M), rest on Ollama models (270M-8B)
  • Single hardware: RTX 4080 laptop only
  • Synthetic prompts: Not production workload traces

Bottom Line: In this benchmark, transformers-gpu-compile delivers best mean latency and cost, but plain GPU has slightly better median. Ollama viable only when model flexibility outweighs 8.8x performance penalty (confounded by model size: transformers tested tiny-gpt2 124M, Ollama tested 270M-8B).


1. Introduction

1.1 Motivation

Local-first LLM inference requires choosing the optimal backend for speed, cost, and reliability. PyTorch offers torch.compile(), Ollama provides multi-model flexibility, and specialized runtimes (ONNX, TensorRT) promise further optimizations.

Research Question: Which backend delivers the best latency, cost, and reliability for production inference?

1.2 Scope

Tested:

  • PASS PyTorch transformers (CPU, GPU, CPU-compile, GPU-compile)
  • PASS Ollama (llama.cpp backend with 6 models)

Failed (Infrastructure Issues):

  • FAIL TensorRT (273 runs -> 100% degraded, missing .plan files)
  • FAIL ONNXRuntime (273 runs -> 100% degraded, ONNX export failures)

Test Matrix:

  • Models: tiny-gpt2 (124M), gemma3 (270M, 1B, 3B), qwen2.5 (7B), llama3.1 (8B-q4)
  • Scenarios: 7 prompt types (micro, short, medium, long, dual_short, dual_medium, stress)
  • Repetitions: 7 per combination
  • Total: 3,017 runs planned, 2,471 successful (82%)

1.3 Background

  • PyTorch 2.x: torch.compile() introduced for graph-level optimization
  • Ollama: Optimized llama.cpp with quantization and KV cache tuning
  • TensorRT: NVIDIA's inference SDK (not validated in this study)
  • ONNX Runtime: Cross-platform inference (not validated in this study)
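
For concreteness, here is a minimal sketch of the eager and compiled GPU paths under test. This is a hypothetical harness, not the actual run_matrix.py implementation; "gpt2" stands in for the tiny-gpt2 checkpoint, whose exact HuggingFace identifier this report does not list.

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # stand-in for tiny-gpt2
eager = AutoModelForCausalLM.from_pretrained("gpt2").to("cuda").eval()

compiled = AutoModelForCausalLM.from_pretrained("gpt2").to("cuda").eval()
compiled.forward = torch.compile(compiled.forward)  # transformers-gpu-compile path

inputs = tok("Hello", return_tensors="pt").to("cuda")

def bench_ms(model, n=7):
    times = []
    for _ in range(n):
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        with torch.no_grad():
            model.generate(**inputs, do_sample=True, temperature=0.7,
                           max_new_tokens=128, pad_token_id=tok.eos_token_id)
        torch.cuda.synchronize()
        times.append((time.perf_counter() - t0) * 1000)
    return times  # first compiled call absorbs the one-time compile overhead

print("eager   :", bench_ms(eager))
print("compiled:", bench_ms(compiled))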

2. Methodology

2.1 Hardware

  • GPU: NVIDIA RTX 4080 Laptop (12.9GB VRAM, CUDA 12.8)
  • CPU: Unspecified (Windows 11, 32-core system)
  • RAM: 32GB+

2.2 Software

  • PyTorch: 2.5.1
  • Transformers: 4.45.2
  • Ollama: Latest (December 2025)
  • CUDA: 12.8
  • Python: 3.13

2.3 Test Scenarios

Scenario     Prompt Length  Mode    Example Prompts
micro        1-2 words      single  "Hello", "Test"
short        8-15 words     single  "Summarize RLHF..."
medium       20-30 words    single  "Explain backpressure..."
long         40-50 words    single  "Overview of attention mechanism..."
dual_short   8-15 words     dual    Two-agent prompts
dual_medium  20-30 words    dual    Two-agent prompts
stress       80+ words      single  "Generate 500-word essay..."

7 scenarios x 7 repetitions = 49 runs per backend/model combination
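
As a sketch, the runs for one backend/model combination can be enumerated as a cross product (hypothetical; the real matrix is driven by configs/matrix_tier3_full.yaml):

# Hypothetical enumeration of one backend/model combination's runs.
from itertools import product

scenarios = ["micro", "short", "medium", "long",
             "dual_short", "dual_medium", "stress"]
repetitions = range(7)

runs = list(product(scenarios, repetitions))
print(len(runs))  # 49, as stated above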

2.4 Metrics

Latency:

  • Mean, median, std dev, min, max
  • TTFT, p50, p95, p99: listed but not reported in this revision

Throughput:

  • Tokens/second
  • Requests/second (inverse of latency)

Cost:

  • $/1M tokens (based on $0.035/hour GPU cost)
  • Compute efficiency (tokens/$)

Reliability:

  • Success rate (ok vs degraded vs error)
  • Degradation reasons
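
A minimal sketch of how the latency aggregates can be derived from metrics.csv; the column names ("backend", "latency_ms", "status") are assumptions about the real schema:

import pandas as pd

df = pd.read_csv("results/tr117_tier3/metrics.csv")
ok = df[df["status"] == "ok"]  # reliability: exclude degraded/error rows

stats = ok.groupby("backend")["latency_ms"].agg(
    ["count", "mean", "median", "std", "min", "max"])
print(stats.round(1))  # should match Appendix A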

2.5 Reproducibility

  • Seeds: 42 (consistent across runs)
  • Temperature: 0.7
  • Max tokens: 128
  • Docker: Not used (native Windows environment)
  • Frozen deps: requirements_frozen.txt in scripts/tr117/
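
A sketch of the fixed generation settings, assuming a transformers-style API (the real harness is scripts/tr117/run_matrix.py):

# Fixed settings applied to every run (sketch; assumed transformers API).
from transformers import set_seed

set_seed(42)  # seeds Python, NumPy, and PyTorch RNGs

generation_kwargs = dict(
    do_sample=True,
    temperature=0.7,
    max_new_tokens=128,
)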

2.6 Statistical Analysis

  • Tests: Independent-samples t-tests, ANOVA, Bonferroni correction (corrected alpha = 0.05/4 = 0.0125 for 4 pairwise comparisons)
  • Effect sizes: Cohen's d
  • Confidence intervals: not computed for this revision
  • Significance threshold: p < 0.05 (p < 0.0125 after Bonferroni correction)
  • Note: GPU-compile vs Ollama comparison is between different models and cannot be interpreted as a backend-isolated effect.
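
The pairwise machinery of Section 5.2 reduces to a few lines of scipy. A sketch, assuming a and b are per-run latency arrays for two backends (a Welch-style test is shown; the report does not state whether equal variances were assumed):

import numpy as np
from scipy import stats

def compare(a, b, alpha=0.05, n_comparisons=4):
    # Independent-samples t-test, Cohen's d, Bonferroni-corrected threshold.
    t, p = stats.ttest_ind(a, b, equal_var=False)
    pooled_sd = np.sqrt((np.var(a, ddof=1) + np.var(b, ddof=1)) / 2)
    d = (np.mean(a) - np.mean(b)) / pooled_sd
    return {"t": t, "p": p, "cohens_d": d,
            "significant": p < alpha / n_comparisons}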

3. Results Overview

3.1 Summary Table

Backend                   Primary Model     Successful Runs  Mean (ms)  Median (ms)  Std (ms)  Cost ($/1M tok)  Throughput (tok/s)
transformers-gpu-compile  tiny-gpt2 (124M)              273      389.2        328.7     117.8           $0.045               215.2
transformers-gpu          tiny-gpt2 (124M)              273      404.1        322.7     223.9           $0.046               211.7
transformers-cpu-compile  tiny-gpt2 (124M)              273      559.3        526.7     103.5           $0.071               137.3
transformers-cpu          tiny-gpt2 (124M)              287      570.6        530.4     117.2           $0.074               132.2
ollama                    270M-8B (mixed)             1,365    3,410.5      1,238.5   3,874.9           $0.106                91.9
tensorrt                  N/A                             0        N/A          N/A       N/A              N/A                 N/A
onnxruntime               N/A                             0        N/A          N/A       N/A              N/A                 N/A

Note: Throughput comparison is across different model sizes. Ollama's 91.9 tok/s is averaged across 270M-8B models; transformers' 215.2 tok/s is from tiny-gpt2 (124M).

Key Observations:

  • Best mean: GPU-compile (389ms)
  • Best median: Plain GPU (323ms) <- compile paradox
  • Best cost: GPU-compile ($0.045/1M)
  • Best consistency: CPU-compile (std 103.5ms)
  • Worst performance: Ollama (8.8x slower than GPU-compile) (confounded by model size: transformers tested tiny-gpt2 124M, Ollama tested 270M-8B)
  • Infrastructure failures: TRT/ORT 100% degraded

3.2 Status Breakdown

Total Runs: 3,017
|- Successful: 2,471 (82%)
|- Degraded:     546 (18%)
|   |- TensorRT:     273 (missing .plan engines)
|   +- ONNXRuntime:  273 (ONNX export failures)
+- Hard Errors:      0 (0%)

3.3 Compile Paradox

Discovery: torch.compile() improves mean but degrades median:

  • Mean: 404ms -> 389ms (3.7% faster)
  • Median: 323ms -> 329ms (1.9% slower)

Hypothesis: Compile reduces outliers (better tail latency) but adds overhead to typical requests.
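
The paradox is directly visible in per-backend aggregates; a sketch under the same assumed metrics.csv schema as in Section 2.4:

import pandas as pd

df = pd.read_csv("results/tr117_tier3/metrics.csv")
ok = df[df["status"] == "ok"]
gpu = ok[ok["backend"].isin(["transformers-gpu", "transformers-gpu-compile"])]
print(gpu.groupby("backend")["latency_ms"].agg(["mean", "median"]))
# Expected (Section 3.1): gpu 404.1 / 322.7, gpu-compile 389.2 / 328.7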


4. Backend Deep Dive

4.1 transformers-gpu-compile TOP

Winner on mean latency and cost.

Performance:

  • Mean: 389.2ms (3.7% faster than plain GPU)
  • Median: 328.7ms (1.9% slower than plain GPU)
  • Std: 117.8ms (2x better than plain GPU)
  • Cost: $0.045/1M tokens (cheapest)
  • Throughput: 215.2 tok/s (highest)

Strengths:

  • PASS Best mean latency
  • PASS Lowest cost
  • PASS Highest throughput
  • PASS Better consistency (lower std dev)

Weaknesses:

  • FAIL Median 1.9% slower than plain GPU
  • FAIL 30s compilation overhead on first run
  • FAIL GPU required

Recommendation: Best-performing in this benchmark for cost-sensitive workloads.


4.2 transformers-gpu

Winner on median latency.

Performance:

  • Mean: 404.1ms
  • Median: 322.7ms (BEST)
  • Std: 223.9ms (2x worse than GPU-compile)
  • Cost: $0.046/1M tokens
  • Throughput: 211.7 tok/s

Strengths:

  • PASS Best median (1.9% faster than GPU-compile)
  • PASS No compilation overhead
  • PASS Simpler debugging

Weaknesses:

  • FAIL Mean 3.7% slower than GPU-compile
  • FAIL Higher variance (outliers up to 3.3s)
  • FAIL GPU required

Recommendation: Best-performing in this benchmark for p50 SLA workloads; suitable for development/prototyping.


4.3 transformers-cpu-compile

Performance:

  • Mean: 559.3ms
  • Median: 526.7ms
  • Std: 103.5ms (BEST consistency)
  • Cost: $0.071/1M tokens
  • Throughput: 137.3 tok/s

Strengths:

  • PASS No GPU required
  • PASS Best consistency (lowest std dev)

Weaknesses:

  • FAIL Only 2% faster than plain CPU (p=0.826, not significant)
  • FAIL 1.44x slower than GPU-compile
  • FAIL 1.57x more expensive than GPU-compile

Recommendation: CPU-only environments (but compile brings minimal benefit).


4.4 transformers-cpu

Performance:

  • Mean: 570.6ms
  • Median: 530.4ms
  • Std: 117.2ms
  • Cost: $0.074/1M tokens (most expensive for transformers)
  • Throughput: 132.2 tok/s (lowest for transformers)

Strengths:

  • PASS No GPU required
  • PASS Baseline for comparison

Weaknesses:

  • FAIL 1.47x slower than GPU-compile
  • FAIL 1.64x more expensive than GPU-compile

Recommendation: Development on CPU-only machines.


4.5 ollama

Performance:

  • Mean: 3,410.5ms (8.8x slower than GPU-compile) (confounded by model size: transformers tested tiny-gpt2 124M, Ollama tested 270M-8B)
  • Median: 1,238.5ms (3.8x slower)
  • Std: 3,874.9ms (TERRIBLE - 114% of mean!)
  • Cost: $0.106/1M tokens (2.35x more expensive)
  • Throughput: 91.9 tok/s (lowest; averaged across 270M-8B models vs transformers' tiny-gpt2 124M)

Strengths:

  • PASS Multi-model flexibility (6 models tested)
  • PASS Simple API (swap models on demand)
  • PASS Good for experimentation

Weaknesses:

  • FAIL 8.8x slower than GPU-compile (mean) (confounded by model size: transformers tested tiny-gpt2 124M, Ollama tested 270M-8B)
  • FAIL 2.35x more expensive
  • FAIL Catastrophic variance (173ms to 27,964ms - 161x range!)
  • FAIL Unreliable for production SLAs

Recommendation: Use only when model flexibility matters more than performance.


4.6 tensorrt FAIL

Status: NOT TESTED (100% degraded)

Issue: Missing TensorRT .plan engines. All 273 runs reported degraded status with placeholder latencies (0.35ms).

Next Steps: TR118 will build real engines and re-test.


4.7 onnxruntime FAIL

Status: NOT TESTED (100% degraded)

Issue: ONNX export failures. All 273 runs reported degraded status with placeholder latencies (0.30ms).

Next Steps: TR118 will fix ONNX export and re-test.


5. Statistical Analysis

5.1 Backend Comparison (ANOVA)

CAVEAT: Backend and model are confounded in this design. Transformers backends tested tiny-gpt2 (124M); Ollama tested 270M-8B models. This ANOVA tests system-level differences, not backend-isolated effects.

Null Hypothesis: No difference in mean latency across backends.

Test: One-way ANOVA on 5 backends (excludes TRT/ORT).

Result: F = 45.86, p < 10^-15 PASS Significant

Interpretation: Backend choice critically affects latency (but confounded with model size).


5.2 Pairwise Comparisons

GPU-compile vs GPU:

  • Mean difference: -14.8ms (GPU-compile faster)
  • p-value: < 0.05 PASS Significant
  • Cohen's d: 0.14 (small effect)
  • Finding: GPU-compile 3.7% faster on mean, but median paradox exists

GPU vs CPU:

  • Mean difference: -166.5ms (GPU faster)
  • p-value: < 0.001 PASS Highly significant
  • Cohen's d: 1.48 (large effect)
  • Finding: GPU 1.41x faster, 1.61x cheaper

GPU-compile vs Ollama:

  • Mean difference: -3,021.3ms (GPU-compile faster)
  • p-value: < 10^-15 PASS Astronomically significant
  • Cohen's d: 1.60 (huge effect) (confounded by model size)
  • Finding: GPU-compile 8.8x faster, 2.35x cheaper (confounded by model size: transformers tested tiny-gpt2 124M, Ollama tested 270M-8B)
  • Note: GPU-compile vs Ollama comparison is between different models and cannot be interpreted as a backend-isolated effect.

CPU-compile vs CPU:

  • Mean difference: -11.2ms (CPU-compile faster)
  • p-value: 0.826 FAIL NOT significant
  • Cohen's d: 0.10 (negligible)
  • Finding: Compilation ineffective on CPU

6. Cost Analysis

6.1 Cost Model

Assumptions:

  • GPU: $0.035/hour (AWS g5.xlarge proxy)
  • CPU: $0.035/hour (same, for simplicity)
  • Ollama: $0.035/hour (runs on same hardware)

Limitation: Oversimplified (no spot/reserved pricing, no energy cost).
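
Under this flat rate, cost per million tokens follows directly from measured throughput; a worked sketch:

HOURLY_RATE = 0.035  # flat $/hour assumed for all hardware (Section 6.1)

def cost_per_million_tokens(tok_per_s: float) -> float:
    tokens_per_hour = tok_per_s * 3600
    return HOURLY_RATE / tokens_per_hour * 1_000_000

print(round(cost_per_million_tokens(215.2), 3))  # 0.045 (GPU-compile)
print(round(cost_per_million_tokens(91.9), 3))   # 0.106 (Ollama)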

6.2 Results

Backend      Cost/1M Tokens  Tokens/$  Notes
GPU-compile  $0.045          22.1M     Best
GPU          $0.046          21.7M     2nd
CPU-compile  $0.071          14.1M     3rd
CPU          $0.074          13.5M     4th
Ollama       $0.106           9.4M     Worst (2.35x more expensive)

7. Data Integrity & Limitations

7.1 Missing Data

WARNING Accuracy Metrics: NOT COLLECTED

The metrics.csv accuracy column is 100% NULL (0/3,017 values). The report CANNOT make accuracy claims.

Why: Accuracy validation was disabled during the benchmark run. Baseline outputs were not compared.

Impact: We can only rank backends by speed and cost, not quality.

Fix: TR118 will re-run with accuracy validation enabled.


7.2 Infrastructure Failures

TensorRT: 273/273 runs degraded (100% failure rate)

  • Cause: Missing .plan engine files
  • Evidence: Placeholder latencies (0.35ms average)
  • Fix: TR118 will build real TensorRT engines

ONNXRuntime: 273/273 runs degraded (100% failure rate)

  • Cause: ONNX export failures
  • Evidence: Placeholder latencies (0.30ms average)
  • Fix: TR118 will fix ONNX export pipeline

Total Degraded: 546/3,017 runs (18%)
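
A sketch of how the degraded runs can be isolated from metrics.csv (column names assumed, as in Section 2.4; "degradation_reason" is a hypothetical column name):

import pandas as pd

df = pd.read_csv("results/tr117_tier3/metrics.csv")
degraded = df[df["status"] == "degraded"]
print(degraded.groupby("backend").size())             # 273 each for TRT/ORT
print(degraded["degradation_reason"].value_counts())  # see Appendix B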


7.3 Model Skew

Distribution:

  • tiny-gpt2: 55% of runs (HuggingFace, 124M params)
  • gemma3: 25% of runs (Ollama, 270M-3B)
  • qwen2.5: 10% of runs (Ollama, 7B)
  • llama3.1: 10% of runs (Ollama, 8B-q4)

Issue: Cannot isolate backend effects from model effects.

Fix: TR121 will test same models across all backends.


7.4 Single Hardware

All tests on one laptop (RTX 4080). Findings may not generalize to:

  • Data center GPUs (A100, H100)
  • AMD GPUs
  • Apple Silicon
  • Cloud providers (AWS, Azure, GCP)

Acknowledged limitation; not addressed in the current program (all TRs use RTX 4080 Laptop).


7.5 Synthetic Prompts

Test prompts are not production traces. Real workloads may differ in:

  • Prompt length distribution
  • Batching patterns
  • Concurrent requests
  • Model switching frequency

Acknowledged limitation; TR128 tests production workload patterns but on the same hardware.


8. Recommendations

8.1 Deployment Guidance (Single-Hardware, Constrained Scope)

Best-performing in this benchmark for cost-optimized workloads:

Backend: transformers-gpu-compile
Config:
  BANTER_FORCE_BACKEND=transformers-gpu-compile
  BANTER_INFERENCE_TIMEOUT_S=2
  BANTER_LATENCY_GUARDRAIL_MS=500

Expected: 389ms mean, $0.045/1M tokens (based on flat $0.035/hr for all hardware; see Section 6 limitations), 215 tok/s
Caveat: Tested on tiny-gpt2 (124M) only. Results may not generalize to larger models or different hardware.

Best-performing in this benchmark for p50 SLA workloads:

Backend: transformers-gpu
Config:
  BANTER_FORCE_BACKEND=transformers-gpu
  BANTER_INFERENCE_TIMEOUT_S=5

Expected: 323ms median, $0.046/1M tokens (based on flat $0.035/hr for all hardware; see Section 6 limitations), 212 tok/s
Note: Compile paradox - GPU has better median despite worse mean
Caveat: Tested on tiny-gpt2 (124M) only. Results may not generalize to larger models or different hardware.

Best-performing in this benchmark for multi-model flexibility:

Backend: ollama
Config:
  BANTER_OLLAMA_URL=http://localhost:11434

Expected: 3,411ms mean (8.8x slower, confounded by model size), $0.106/1M (based on flat $0.035/hr; see Section 6 limitations)
Only viable when model swapping > performance
Caveat: Ollama tested on 270M-8B models vs transformers' tiny-gpt2 (124M). The 8.8x gap is not a backend-isolated comparison.

NOT RECOMMENDED:

  • FAIL CPU-only: 1.4x slower, 1.6x more expensive than GPU
  • FAIL CPU-compile: Only 2% faster than CPU (not significant)
  • FAIL TensorRT/ONNX: Infrastructure not ready (100% degraded)

8.2 Future Work

TR118: ONNX/TRT Deep Dive (Week of 2025-12-09)

  • Build real TensorRT engines (FP32, FP16, INT8)
  • Fix ONNX export pipeline
  • Re-run benchmark with 0% degraded target
  • Accuracy validation (perplexity + ROUGE)

TR119: Cost & Energy Analysis (Week of 2025-12-16)

  • Real cloud pricing (spot, reserved, on-prem)
  • Energy measurement (Joules/token, carbon footprint)
  • TCO calculator for 1M req/day workload

TR120: Compile Paradox Investigation (Week of 2025-12-23)

  • Profiler traces (torch.profiler, nsys)
  • Kernel-level analysis (where compile helps/hurts)
  • Hybrid strategy (compile for batch, eager for single)

TR121: Model Scaling Study (Week of 2025-12-30)

  • Unified model matrix (same models on all backends)
  • Scaling laws (latency vs params)
  • Quantization necessity analysis

TR122: Resource Profiling (Week of 2026-01-06)

  • Memory profiling (GPU VRAM, CPU RAM, swap)
  • Power measurement (Watts, thermal throttling)
  • Bottleneck identification

TR123: Multi-Hardware Validation (Week of 2026-01-13)

  • A100, H100, AMD, Apple Silicon
  • AWS g5, Azure NC, GCP A2
  • Real production workload traces

9. Conclusions

9.1 Key Findings

  1. transformers-gpu-compile wins on mean (389ms) and cost ($0.045/1M, based on a flat $0.035/hr rate for all hardware; see Section 6 limitations)
  2. Plain transformers-gpu wins on median (323ms) - compile paradox
  3. Ollama 8.8x slower, 2.35x more expensive (confounded by model size: transformers tested tiny-gpt2 124M, Ollama tested 270M-8B; only viable for multi-model)
  4. CPU compilation ineffective (2% improvement, p=0.826, not significant)
  5. TensorRT/ONNX infrastructure failed (546/546 runs degraded, 0% tested)

9.2 Best-Performing in This Benchmark

transformers-gpu-compile for cost-sensitive workloads (single-hardware, constrained scope).

Decision Matrix:

  • Need lowest mean latency + cost? -> GPU-compile
  • Need best median (p50 SLA)? -> Plain GPU
  • Need multi-model flexibility? -> Ollama (accept 8.8x penalty, confounded by model size)
  • CPU-only? -> Plain CPU (compile brings no benefit)
  • TensorRT/ONNX? -> Wait for TR118 (currently 100% broken)

9.3 Scientific Integrity

WARNING This report reflects ACTUAL test results:

  • NO accuracy data (column empty)
  • TensorRT/ONNX NOT tested (100% degraded)
  • Model skew (55% tiny-gpt2)
  • Single hardware (RTX 4080 laptop)
  • Synthetic prompts (not production traces)

10. Reproducibility

10.1 Artifacts

Data:

  • results/tr117_tier3/metrics.csv (3,017 rows, 2,471 ok, 546 degraded)
  • results/tr117_tier3/cost_analysis.json
  • results/tr117_tier3/statistical_analysis.json

Scripts:

  • scripts/tr117/run_matrix.py (benchmark runner)
  • scripts/tr117/analyze_tr117.py (aggregation)
  • scripts/tr117/statistical_analysis.py (ANOVA, t-tests)
  • scripts/tr117/cost_analysis.py ($/1M tokens)

Config:

  • scripts/tr117/configs/matrix_tier3_full.yaml

10.2 How to Reproduce

# 1. Setup environment
cd scripts/tr117
pip install -r requirements_frozen.txt

# 2. Run benchmark (10-20 hours)
python run_matrix.py --config configs/matrix_tier3_full.yaml

# 3. Analyze results
python analyze_tr117.py --input results/tr117_tier3/metrics.csv
python statistical_analysis.py --input results/tr117_tier3/metrics.csv
python cost_analysis.py --input results/tr117_tier3/metrics.csv

10.3 Hardware Requirements

  • NVIDIA GPU (RTX 4000+ or A100)
  • 12GB+ VRAM (tested on RTX 4080 Laptop, 12.9GB)
  • 32GB+ RAM
  • 100GB+ disk space

Appendix A: Raw Statistics

Backend Statistics (Successful Runs Only):

{
  "transformers-gpu-compile": {
    "count": 273,
    "mean": 389.2,
    "median": 328.7,
    "std": 117.8,
    "min": 277.4,
    "max": 681.8
  },
  "transformers-gpu": {
    "count": 273,
    "mean": 404.1,
    "median": 322.7,
    "std": 223.9,
    "min": 276.8,
    "max": 3325.8
  },
  "transformers-cpu-compile": {
    "count": 273,
    "mean": 559.3,
    "median": 526.7,
    "std": 103.5,
    "min": 398.2,
    "max": 785.5
  },
  "transformers-cpu": {
    "count": 287,
    "mean": 570.6,
    "median": 530.4,
    "std": 117.2,
    "min": 314.5,
    "max": 842.0
  },
  "ollama": {
    "count": 1365,
    "mean": 3410.5,
    "median": 1238.5,
    "std": 3874.9,
    "min": 173.5,
    "max": 27963.9
  }
}

Appendix B: Degraded Runs

Total Degraded: 546/3,017 (18%)

By Backend:

  • TensorRT: 273 (100% of TRT runs)
  • ONNXRuntime: 273 (100% of ORT runs)
  • Others: 0 (0% degraded)

Degradation Reasons:

  • tensorrt_engine_not_found (273 runs)
  • onnx_export_failed (273 runs)

End of Technical Report 117 (Revised)

Changelog:

  • v1.0 (2025-12-07): Initial report (contained accuracy claims, removed in v1.1 because no supporting data existed)
  • v1.1 (2025-12-08): DATA-CONSISTENT REVISION
    • Removed all accuracy claims (no data exists)
    • Marked TRT/ORT as NOT TESTED (100% degraded)
    • Added Data Integrity section (honest limitations)
    • Regenerated statistical analysis from tier3 data
    • Acknowledged 546 degraded runs
    • Changed recommendations to reflect 5 backends only