Technical Report 117: Cross-Backend Inference Benchmark
| Field | Value |
|---|---|
| TR Number | 117 |
| Version | 1.1 (Revised) |
| Date | 2025-12-08 |
| Author | Banterhearts Team |
| Status | Complete (Data-Consistent Revision) |
Executive Summary
This report presents a benchmark of 7 inference backend configurations (5 successfully tested, 2 failed infrastructure checks) for local-first LLM serving: PyTorch transformers (CPU and GPU, each with and without torch.compile), Ollama, TensorRT, and ONNXRuntime. Across 3,017 total runs (2,471 successful, 546 degraded), we measured latency, throughput, and cost efficiency.
Key Findings
- GPU-compile wins on mean (389ms) and cost ($0.045/1M tokens) (based on flat $0.035/hr for all hardware; see Section 6 limitations)
- Plain GPU wins on median (323ms) - compile paradox discovered
- Ollama 8.8x slower than GPU-compile (3,411ms vs 389ms mean) (confounded by model size: transformers tested tiny-gpt2 124M, Ollama tested 270M-8B)
- CPU compilation ineffective (2% improvement, p=0.826, not significant)
- TensorRT/ONNXRuntime infrastructure failed (100% degraded runs)
Honest Limitations
WARNING This report reflects ACTUAL test results, not aspirations:
- NO accuracy metrics (accuracy column empty in all 3,017 runs)
- TensorRT/ONNX NOT tested (546/546 runs degraded due to missing engines)
- Model skew: 55% runs on tiny-gpt2 (124M), rest on Ollama models (270M-8B)
- Single hardware: RTX 4080 laptop only
- Synthetic prompts: Not production workload traces
Bottom Line: In this benchmark, transformers-gpu-compile delivers best mean latency and cost, but plain GPU has slightly better median. Ollama viable only when model flexibility outweighs 8.8x performance penalty (confounded by model size: transformers tested tiny-gpt2 124M, Ollama tested 270M-8B).
1. Introduction
1.1 Motivation
Local-first LLM inference requires choosing the optimal backend for speed, cost, and reliability. PyTorch offers torch.compile(), Ollama provides multi-model flexibility, and specialized runtimes (ONNX, TensorRT) promise further optimizations.
Research Question: Which backend delivers the best latency, cost, and reliability for production inference?
1.2 Scope
Tested:
- PASS PyTorch transformers (CPU, GPU, CPU-compile, GPU-compile)
- PASS Ollama (llama.cpp backend with 6 models)
Failed (Infrastructure Issues):
- FAIL TensorRT (273 runs -> 100% degraded, missing .plan files)
- FAIL ONNXRuntime (273 runs -> 100% degraded, ONNX export failures)
Test Matrix:
- Models: tiny-gpt2 (124M), gemma3 (270M, 1B, 3B), qwen2.5 (7B), llama3.1 (8B-q4)
- Scenarios: 7 prompt types (micro, short, medium, long, dual_short, dual_medium, stress)
- Repetitions: 7 per combination
- Total: 3,017 runs planned, 2,471 successful (82%)
1.3 Related Work
- PyTorch 2.x: torch.compile() introduced for graph-level optimization
- Ollama: Optimized llama.cpp with quantization and KV cache tuning
- TensorRT: NVIDIA's inference SDK (not validated in this study)
- ONNX Runtime: Cross-platform inference (not validated in this study)
2. Methodology
2.1 Hardware
- GPU: NVIDIA RTX 4080 Laptop (12.9GB VRAM, CUDA 12.8)
- CPU: Unspecified (Windows 11, 32-core system)
- RAM: 32GB+
2.2 Software
- PyTorch: 2.5.1
- Transformers: 4.45.2
- Ollama: Latest (December 2025)
- CUDA: 12.8
- Python: 3.13
2.3 Test Scenarios
| Scenario | Prompt Length | Mode | Prompts |
|---|---|---|---|
| micro | 1-2 words | single | "Hello", "Test" |
| short | 8-15 words | single | "Summarize RLHF..." |
| medium | 20-30 words | single | "Explain backpressure..." |
| long | 40-50 words | single | "Overview of attention mechanism..." |
| dual_short | 8-15 words | dual | Two-agent prompts |
| dual_medium | 20-30 words | dual | Two-agent prompts |
| stress | 80+ words | single | "Generate 500-word essay..." |
7 scenarios x 7 repetitions = 49 runs per backend/model combination
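For reference, the 49-run grid per backend/model pair can be enumerated as in the sketch below; the names are illustrative, and the actual runner is scripts/tr117/run_matrix.py.

```python
# Illustrative sketch of the per-combination run grid (the real logic
# lives in scripts/tr117/run_matrix.py; names here are assumptions).
from itertools import product

SCENARIOS = ["micro", "short", "medium", "long", "dual_short", "dual_medium", "stress"]
REPETITIONS = range(7)

runs = list(product(SCENARIOS, REPETITIONS))
assert len(runs) == 49  # 7 scenarios x 7 repetitions per backend/model combination
```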
2.4 Metrics
Latency:
- Mean, median, std dev, min, max
- TTFT, p50, p95, p99: listed but not reported in this revision
Throughput:
- Tokens/second
- Requests/second (inverse of latency)
Cost:
- $/1M tokens (based on $0.035/hour GPU cost)
- Compute efficiency (tokens/$)
Reliability:
- Success rate (ok vs degraded vs error)
- Degradation reasons
2.5 Reproducibility
- Seed: 42 (fixed across all runs)
- Temperature: 0.7
- Max tokens: 128
- Docker: Not used (native Windows environment)
- Frozen deps: requirements_frozen.txt in scripts/tr117/
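As a minimal sketch of these generation settings (the model id "sshleifer/tiny-gpt2" and the generate call are illustrative assumptions, not the runner's actual code):

```python
# Hedged sketch of the fixed generation settings; the model id
# "sshleifer/tiny-gpt2" is an assumed stand-in for tiny-gpt2 (124M).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.manual_seed(42)  # seed 42, consistent across runs

tok = AutoTokenizer.from_pretrained("sshleifer/tiny-gpt2")
model = AutoModelForCausalLM.from_pretrained("sshleifer/tiny-gpt2")

inputs = tok("Hello", return_tensors="pt")  # "micro" scenario prompt
out = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.7,     # Section 2.5: temperature 0.7
    max_new_tokens=128,  # Section 2.5: max tokens 128
)
print(tok.decode(out[0], skip_special_tokens=True))
```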
2.6 Statistical Analysis
- Tests: Independent-samples t-tests, ANOVA, Bonferroni correction (corrected alpha = 0.05/4 = 0.0125 for 4 pairwise comparisons)
- Effect sizes: Cohen's d
- Confidence intervals: not computed for this revision
- Significance threshold: p < 0.05 (p < 0.0125 after Bonferroni correction)
- Note: GPU-compile vs Ollama comparison is between different models and cannot be interpreted as a backend-isolated effect.
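A minimal sketch of one pairwise test, assuming metrics.csv exposes "backend" and "latency_ms" columns (the column names are assumptions):

```python
# Sketch: independent-samples t-test plus Cohen's d for one pairwise
# comparison; column names ("backend", "latency_ms") are assumptions.
import numpy as np
import pandas as pd
from scipy import stats

df = pd.read_csv("results/tr117_tier3/metrics.csv")
a = df.loc[df["backend"] == "transformers-gpu-compile", "latency_ms"].to_numpy()
b = df.loc[df["backend"] == "transformers-gpu", "latency_ms"].to_numpy()

t, p = stats.ttest_ind(a, b, equal_var=False)  # Welch variant of the t-test

# Cohen's d from the pooled standard deviation
pooled = np.sqrt(((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
                 / (len(a) + len(b) - 2))
d = (a.mean() - b.mean()) / pooled

alpha = 0.05 / 4  # Bonferroni correction for 4 pairwise comparisons
print(f"t={t:.2f}, p={p:.3g}, d={d:.2f}, significant={p < alpha}")
```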
3. Results Overview
3.1 Summary Table
| Backend | Primary Model | Successful Runs | Mean (ms) | Median (ms) | Std (ms) | Cost ($/1M tok) | Throughput (tok/s) |
|---|---|---|---|---|---|---|---|
| transformers-gpu-compile | tiny-gpt2 (124M) | 273 | 389.2 | 328.7 | 117.8 | $0.045 | 215.2 |
| transformers-gpu | tiny-gpt2 (124M) | 273 | 404.1 | 322.7 | 223.9 | $0.046 | 211.7 |
| transformers-cpu-compile | tiny-gpt2 (124M) | 273 | 559.3 | 526.7 | 103.5 | $0.071 | 137.3 |
| transformers-cpu | tiny-gpt2 (124M) | 287 | 570.6 | 530.4 | 117.2 | $0.074 | 132.2 |
| ollama | 270M-8B (mixed) | 1,365 | 3,410.5 | 1,238.5 | 3,874.9 | $0.106 | 91.9 |
| tensorrt | — | 0 | N/A | N/A | N/A | N/A | N/A |
| onnxruntime | — | 0 | N/A | N/A | N/A | N/A | N/A |
Note: Throughput comparison is across different model sizes. Ollama's 91.9 tok/s is averaged across 270M-8B models; transformers' 215.2 tok/s is from tiny-gpt2 (124M).
Key Observations:
- Best mean: GPU-compile (389ms)
- Best median: Plain GPU (323ms) <- compile paradox
- Best cost: GPU-compile ($0.045/1M)
- Best consistency: CPU-compile (std 103.5ms)
- Worst performance: Ollama (8.8x slower than GPU-compile) (confounded by model size: transformers tested tiny-gpt2 124M, Ollama tested 270M-8B)
- Infrastructure failures: TRT/ORT 100% degraded
3.2 Status Breakdown
Total Runs: 3,017
|- Successful: 2,471 (82%)
|- Degraded: 546 (18%)
| |- TensorRT: 273 (missing .plan engines)
| +- ONNXRuntime: 273 (ONNX export failures)
+- Hard Errors: 0 (0%)
3.3 Compile Paradox
Discovery: torch.compile() improves mean but degrades median:
- Mean: 404ms -> 389ms (3.7% faster)
- Median: 323ms -> 329ms (1.9% slower)
Hypothesis: Compile reduces outliers (better tail latency) but adds overhead to typical requests.
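A toy numerical illustration of how this hypothesis produces the observed pattern (the numbers are synthetic, chosen only to mimic the shape of the data):

```python
# Synthetic illustration: a small constant overhead plus a trimmed
# outlier tail lowers the mean while raising the median.
import numpy as np

rng = np.random.default_rng(42)
eager = np.concatenate([
    rng.normal(320, 15, 950),    # typical requests near the median
    rng.normal(1800, 400, 50),   # occasional slow outliers
])
compiled = rng.normal(329, 15, 1000)  # ~9ms slower typically, but no tail

for name, x in [("eager", eager), ("compiled", compiled)]:
    print(f"{name}: mean={np.mean(x):.0f}ms median={np.median(x):.0f}ms")
```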
4. Backend Deep Dive
4.1 transformers-gpu-compile TOP
Winner on mean latency and cost.
Performance:
- Mean: 389.2ms (3.7% faster than plain GPU)
- Median: 328.7ms (1.9% slower than plain GPU)
- Std: 117.8ms (2x better than plain GPU)
- Cost: $0.045/1M tokens (cheapest)
- Throughput: 215.2 tok/s (highest)
Strengths:
- PASS Best mean latency
- PASS Lowest cost
- PASS Highest throughput
- PASS Better consistency (lower std dev)
Weaknesses:
- FAIL Median 1.9% slower than plain GPU
- FAIL 30s compilation overhead on first run
- FAIL GPU required
Recommendation: Best-performing in this benchmark for cost-sensitive workloads.
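Enabling this backend amounts to wrapping the model with torch.compile(); a hedged sketch, with an assumed model id:

```python
# Sketch of the GPU-compile path; "sshleifer/tiny-gpt2" is an assumed
# stand-in for the tiny-gpt2 (124M) model used in this benchmark.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("sshleifer/tiny-gpt2").to("cuda")
model = torch.compile(model)  # ~30s one-time compilation on the first forward pass
```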
4.2 transformers-gpu
Winner on median latency.
Performance:
- Mean: 404.1ms
- Median: 322.7ms (BEST)
- Std: 223.9ms (2x worse than GPU-compile)
- Cost: $0.046/1M tokens
- Throughput: 211.7 tok/s
Strengths:
- PASS Best median (1.9% faster than GPU-compile)
- PASS No compilation overhead
- PASS Simpler debugging
Weaknesses:
- FAIL Mean 3.7% slower than GPU-compile
- FAIL Higher variance (outliers up to 3.3s)
- FAIL GPU required
Recommendation: Best-performing in this benchmark for p50 SLA workloads; suitable for development/prototyping.
4.3 transformers-cpu-compile
Performance:
- Mean: 559.3ms
- Median: 526.7ms
- Std: 103.5ms (BEST consistency)
- Cost: $0.071/1M tokens
- Throughput: 137.3 tok/s
Strengths:
- PASS No GPU required
- PASS Best consistency (lowest std dev)
Weaknesses:
- FAIL Only 2% faster than plain CPU (p=0.826, not significant)
- FAIL 1.44x slower than GPU-compile
- FAIL 1.57x more expensive than GPU-compile
Recommendation: CPU-only environments (but compile brings minimal benefit).
4.4 transformers-cpu
Performance:
- Mean: 570.6ms
- Median: 530.4ms
- Std: 117.2ms
- Cost: $0.074/1M tokens (most expensive for transformers)
- Throughput: 132.2 tok/s (lowest for transformers)
Strengths:
- PASS No GPU required
- PASS Baseline for comparison
Weaknesses:
- FAIL 1.47x slower than GPU-compile
- FAIL 1.64x more expensive than GPU-compile
Recommendation: Development on CPU-only machines.
4.5 ollama
Performance:
- Mean: 3,410.5ms (8.8x slower than GPU-compile) (confounded by model size: transformers tested tiny-gpt2 124M, Ollama tested 270M-8B)
- Median: 1,238.5ms (3.8x slower)
- Std: 3,874.9ms (TERRIBLE - 114% of mean!)
- Cost: $0.106/1M tokens (2.35x more expensive)
- Throughput: 91.9 tok/s (lowest; averaged across 270M-8B models vs transformers' tiny-gpt2 124M)
Strengths:
- PASS Multi-model flexibility (6 models tested)
- PASS Simple API (swap models on demand)
- PASS Good for experimentation
Weaknesses:
- FAIL 8.8x slower than GPU-compile (mean) (confounded by model size: transformers tested tiny-gpt2 124M, Ollama tested 270M-8B)
- FAIL 2.35x more expensive
- FAIL Catastrophic variance (173ms to 27,964ms - 161x range!)
- FAIL Unreliable for production SLAs
Recommendation: Use only when model flexibility matters more than performance.
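For completeness, a minimal latency probe against a local Ollama server via the standard /api/generate endpoint (the model tag "gemma3:1b" is illustrative):

```python
# Hedged sketch of a single-request latency probe against Ollama;
# the model tag "gemma3:1b" is illustrative.
import time
import requests

t0 = time.perf_counter()
r = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "gemma3:1b", "prompt": "Hello", "stream": False},
    timeout=60,
)
r.raise_for_status()
latency_ms = (time.perf_counter() - t0) * 1000
print(f"{latency_ms:.0f}ms -> {r.json().get('response', '')[:60]!r}")
```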
4.6 tensorrt FAIL
Status: NOT TESTED (100% degraded)
Issue: Missing TensorRT .plan engines. All 273 runs reported degraded status with placeholder latencies (0.35ms).
Next Steps: TR118 will build real engines and re-test.
4.7 onnxruntime FAIL
Status: NOT TESTED (100% degraded)
Issue: ONNX export failures. All 273 runs reported degraded status with placeholder latencies (0.30ms).
Next Steps: TR118 will fix ONNX export and re-test.
5. Statistical Analysis
5.1 Backend Comparison (ANOVA)
CAVEAT: Backend and model are confounded in this design. Transformers backends tested tiny-gpt2 (124M); Ollama tested 270M-8B models. This ANOVA tests system-level differences, not backend-isolated effects.
Null Hypothesis: No difference in mean latency across backends.
Test: One-way ANOVA on 5 backends (excludes TRT/ORT).
Result: F = 45.86, p < 10^-15 PASS Significant
Interpretation: Backend choice critically affects latency (but confounded with model size).
5.2 Pairwise Comparisons
GPU-compile vs GPU:
- Mean difference: -14.8ms (GPU-compile faster)
- p-value: < 0.05 PASS Significant (uncorrected threshold; see Section 2.6 for the Bonferroni-corrected alpha)
- Cohen's d: 0.14 (small effect)
- Finding: GPU-compile 3.7% faster on mean, but median paradox exists
GPU vs CPU:
- Mean difference: -166.5ms (GPU faster)
- p-value: < 0.001 PASS Highly significant
- Cohen's d: 1.48 (large effect)
- Finding: GPU 1.41x faster, 1.61x cheaper
GPU-compile vs Ollama:
- Mean difference: -3,021.3ms (GPU-compile faster)
- p-value: < 10^-15 PASS Astronomically significant
- Cohen's d: 1.60 (huge effect) (confounded by model size)
- Finding: GPU-compile 8.8x faster, 2.35x cheaper (confounded by model size: transformers tested tiny-gpt2 124M, Ollama tested 270M-8B)
- Note: GPU-compile vs Ollama comparison is between different models and cannot be interpreted as a backend-isolated effect.
CPU-compile vs CPU:
- Mean difference: -11.2ms (CPU-compile faster)
- p-value: 0.826 FAIL NOT significant
- Cohen's d: 0.10 (negligible)
- Finding: Compilation ineffective on CPU
6. Cost Analysis
6.1 Cost Model
Assumptions:
- GPU: $0.035/hour (AWS g5.xlarge proxy)
- CPU: $0.035/hour (same, for simplicity)
- Ollama: $0.035/hour (runs on same hardware)
Limitation: Oversimplified (no spot/reserved pricing, no energy cost).
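The per-token arithmetic, shown as a worked sketch (the flat hourly rate is the assumption stated above):

```python
# Worked example of the cost model: $/1M tokens = hourly rate / tokens per hour.
RATE_PER_HOUR = 0.035  # flat $/hour assumed for all hardware (see above)

def cost_per_million_tokens(tokens_per_second: float) -> float:
    tokens_per_hour = tokens_per_second * 3600
    return RATE_PER_HOUR / tokens_per_hour * 1_000_000

print(f"GPU-compile: ${cost_per_million_tokens(215.2):.3f}")  # ~$0.045, matches 6.2
print(f"Ollama:      ${cost_per_million_tokens(91.9):.3f}")   # ~$0.106, matches 6.2
```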
6.2 Results
| Backend | Cost/1M Tokens | Tokens/$ | Notes |
|---|---|---|---|
| GPU-compile | $0.045 | 22.1M | Best |
| GPU | $0.046 | 21.7M | 2nd |
| CPU-compile | $0.071 | 14.1M | 3rd |
| CPU | $0.074 | 13.5M | 4th |
| Ollama | $0.106 | 9.4M | Worst (2.35x more expensive) |
7. Data Integrity & Limitations
7.1 Missing Data
WARNING Accuracy Metrics: NOT COLLECTED
The metrics.csv accuracy column is 100% NULL (0/3,017 values). The report CANNOT make accuracy claims.
Why: Accuracy validation was disabled during the benchmark run. Baseline outputs were not compared.
Impact: We can only rank backends by speed and cost, not quality.
Fix: TR118 will re-run with accuracy validation enabled.
7.2 Infrastructure Failures
TensorRT: 273/273 runs degraded (100% failure rate)
- Cause: Missing .plan engine files
- Evidence: Placeholder latencies (0.35ms average)
- Fix: TR118 will build real TensorRT engines
ONNXRuntime: 273/273 runs degraded (100% failure rate)
- Cause: ONNX export failures
- Evidence: Placeholder latencies (0.30ms average)
- Fix: TR118 will fix ONNX export pipeline
Total Degraded: 546/3,017 runs (18%)
7.3 Model Skew
Distribution:
- tiny-gpt2: 55% of runs (HuggingFace, 124M params)
- gemma3: 25% of runs (Ollama, 270M-3B)
- qwen2.5: 10% of runs (Ollama, 7B)
- llama3.1: 10% of runs (Ollama, 8B-q4)
Issue: Cannot isolate backend effects from model effects.
Fix: TR121 will test same models across all backends.
7.4 Single Hardware
All tests on one laptop (RTX 4080). Findings may not generalize to:
- Data center GPUs (A100, H100)
- AMD GPUs
- Apple Silicon
- Cloud providers (AWS, Azure, GCP)
Acknowledged limitation; not addressed in the current program (all TRs use RTX 4080 Laptop).
7.5 Synthetic Prompts
Test prompts are not production traces. Real workloads may differ in:
- Prompt length distribution
- Batching patterns
- Concurrent requests
- Model switching frequency
Acknowledged limitation; TR128 tests production workload patterns but on the same hardware.
8. Recommendations
8.1 Deployment Guidance (Single-Hardware, Constrained Scope)
Best-performing in this benchmark for cost-optimized workloads:
Backend: transformers-gpu-compile
Config:
BANTER_FORCE_BACKEND=transformers-gpu-compile
BANTER_INFERENCE_TIMEOUT_S=2
BANTER_LATENCY_GUARDRAIL_MS=500
Expected: 389ms mean, $0.045/1M tokens (based on flat $0.035/hr for all hardware; see Section 6 limitations), 215 tok/s
Caveat: Tested on tiny-gpt2 (124M) only. Results may not generalize to larger models or different hardware.
Best-performing in this benchmark for p50 SLA workloads:
Backend: transformers-gpu
Config:
BANTER_FORCE_BACKEND=transformers-gpu
BANTER_INFERENCE_TIMEOUT_S=5
Expected: 323ms median, $0.046/1M tokens (based on flat $0.035/hr for all hardware; see Section 6 limitations), 212 tok/s
Note: Compile paradox - GPU has better median despite worse mean
Caveat: Tested on tiny-gpt2 (124M) only. Results may not generalize to larger models or different hardware.
Best-performing in this benchmark for multi-model flexibility:
Backend: ollama
Config:
BANTER_OLLAMA_URL=http://localhost:11434
Expected: 3,411ms mean (8.8x slower, confounded by model size), $0.106/1M (based on flat $0.035/hr; see Section 6 limitations)
Only viable when model swapping > performance
Caveat: Ollama tested on 270M-8B models vs transformers' tiny-gpt2 (124M). The 8.8x gap is not a backend-isolated comparison.
NOT RECOMMENDED:
- FAIL CPU-only: 1.4x slower, 1.6x more expensive than GPU
- FAIL CPU-compile: Only 2% faster than CPU (not significant)
- FAIL TensorRT/ONNX: Infrastructure not ready (100% degraded)
8.2 Future Work
TR118: ONNX/TRT Deep Dive (Week of 2025-12-09)
- Build real TensorRT engines (FP32, FP16, INT8)
- Fix ONNX export pipeline
- Re-run benchmark with 0% degraded target
- Accuracy validation (perplexity + ROUGE)
TR119: Cost & Energy Analysis (Week of 2025-12-16)
- Real cloud pricing (spot, reserved, on-prem)
- Energy measurement (Joules/token, carbon footprint)
- TCO calculator for 1M req/day workload
TR120: Compile Paradox Investigation (Week of 2025-12-23)
- Profiler traces (torch.profiler, nsys)
- Kernel-level analysis (where compile helps/hurts)
- Hybrid strategy (compile for batch, eager for single)
TR121: Model Scaling Study (Week of 2025-12-30)
- Unified model matrix (same models on all backends)
- Scaling laws (latency vs params)
- Quantization necessity analysis
TR122: Resource Profiling (Week of 2026-01-06)
- Memory profiling (GPU VRAM, CPU RAM, swap)
- Power measurement (Watts, thermal throttling)
- Bottleneck identification
TR123: Multi-Hardware Validation (Week of 2026-01-13)
- A100, H100, AMD, Apple Silicon
- AWS g5, Azure NC, GCP A2
- Real production workload traces
9. Conclusions
9.1 Key Findings
- transformers-gpu-compile wins on mean (389ms) and cost ($0.045/1M) (based on flat $0.035/hr for all hardware; see Section 6 limitations)
- Plain transformers-gpu wins on median (323ms) - compile paradox
- Ollama 8.8x slower, 2.35x more expensive (confounded by model size: transformers tested tiny-gpt2 124M, Ollama tested 270M-8B; only viable for multi-model)
- CPU compilation ineffective (2% improvement, p=0.826, not significant)
- TensorRT/ONNX infrastructure failed (546/546 runs degraded, 0% tested)
9.2 Best-Performing in This Benchmark
transformers-gpu-compile for cost-sensitive workloads (single-hardware, constrained scope).
Decision Matrix:
- Need lowest mean latency + cost? -> GPU-compile
- Need best median (p50 SLA)? -> Plain GPU
- Need multi-model flexibility? -> Ollama (accept 8.8x penalty, confounded by model size)
- CPU-only? -> Plain CPU (compile brings no benefit)
- TensorRT/ONNX? -> Wait for TR118 (currently 100% broken)
9.3 Scientific Integrity
WARNING This report reflects ACTUAL test results:
- NO accuracy data (column empty)
- TensorRT/ONNX NOT tested (100% degraded)
- Model skew (55% tiny-gpt2)
- Single hardware (RTX 4080 laptop)
- Synthetic prompts (not production traces)
10. Reproducibility
10.1 Artifacts
Data:
- results/tr117_tier3/metrics.csv (3,017 rows: 2,471 ok, 546 degraded)
- results/tr117_tier3/cost_analysis.json
- results/tr117_tier3/statistical_analysis.json
Scripts:
- scripts/tr117/run_matrix.py (benchmark runner)
- scripts/tr117/analyze_tr117.py (aggregation)
- scripts/tr117/statistical_analysis.py (ANOVA, t-tests)
- scripts/tr117/cost_analysis.py ($/1M tokens)
Config:
- scripts/tr117/configs/matrix_tier3_full.yaml
10.2 How to Reproduce
# 1. Setup environment
cd scripts/tr117
pip install -r requirements_frozen.txt
# 2. Run benchmark (10-20 hours)
python run_matrix.py --config configs/matrix_tier3_full.yaml
# 3. Analyze results
python analyze_tr117.py --input results/tr117_tier3/metrics.csv
python statistical_analysis.py --input results/tr117_tier3/metrics.csv
python cost_analysis.py --input results/tr117_tier3/metrics.csv
10.3 Hardware Requirements
- NVIDIA GPU (RTX 4000+ or A100)
- 12GB+ VRAM (tested on RTX 4080 Laptop, 12.9GB)
- 32GB+ RAM
- 100GB+ disk space
Appendix A: Raw Statistics
Backend Statistics (Successful Runs Only):
{
"transformers-gpu-compile": {
"count": 273,
"mean": 389.2,
"median": 328.7,
"std": 117.8,
"min": 277.4,
"max": 681.8
},
"transformers-gpu": {
"count": 273,
"mean": 404.1,
"median": 322.7,
"std": 223.9,
"min": 276.8,
"max": 3325.8
},
"transformers-cpu-compile": {
"count": 273,
"mean": 559.3,
"median": 526.7,
"std": 103.5,
"min": 398.2,
"max": 785.5
},
"transformers-cpu": {
"count": 287,
"mean": 570.6,
"median": 530.4,
"std": 117.2,
"min": 314.5,
"max": 842.0
},
"ollama": {
"count": 1365,
"mean": 3410.5,
"median": 1238.5,
"std": 3874.9,
"min": 173.5,
"max": 27963.9
}
}
Appendix B: Degraded Runs
Total Degraded: 546/3,017 (18%)
By Backend:
- TensorRT: 273 (100% of TRT runs)
- ONNXRuntime: 273 (100% of ORT runs)
- Others: 0 (0% degraded)
Degradation Reasons:
- tensorrt_engine_not_found (273 runs)
- onnx_export_failed (273 runs)
End of Technical Report 117 (Revised)
Changelog:
- v1.0 (2025-12-07): Initial report (accuracy claims removed in v1.1; no supporting data existed)
- v1.1 (2025-12-08): DATA-CONSISTENT REVISION
- Removed all accuracy claims (no data exists)
- Marked TRT/ORT as NOT TESTED (100% degraded)
- Added Data Integrity section (honest limitations)
- Regenerated statistical analysis from tier3 data
- Acknowledged 546 degraded runs
- Changed recommendations to reflect 5 backends only