Research Archive
Pre-TR Baseline

Ollama LLM Benchmark: Quantization & Runtime Analysis

Llama 3.1 8B across q4_0 / q5_K_M / q8_0 quantizations — runtime characterization on consumer GPU that motivated the TR108+ Rust-vs-Python program.

Table of Contents

Ollama LLM Benchmark Report: Quantization & Runtime Analysis

Date: September 30, 2025
Last Updated: October 10, 2025
Model: llama3.1:8b-instruct (q4_0, q5_K_M, q8_0)
Hardware: NVIDIA RTX 4080 (12GB VRAM, 9,728 CUDA cores), Intel Core i9-13980HX
Framework: Ollama v0.1.17
Related: TR108, Gemma 3 Benchmark, Performance Deep Dive


Executive Summary

This report establishes Ollama's performance baseline for Chimera Heart gaming workloads through comprehensive quantization comparison (q4_0, q5_K_M, q8_0) and runtime parameter optimization. Through 150+ test runs, we identify optimal configurations for real-time gaming banter generation.

Key Findings

  1. q4_0 Supremacy: 76.59 tok/s mean throughput (17% faster than q5_K_M, 65% faster than q8_0)
  2. TTFT Efficiency: Sub-0.15s warm TTFT with q4_0 (vs 1.35s q5_K_M, 2.01s q8_0)
  3. Optimal Configuration: num_gpu=40, num_ctx=1024, temp=0.4 achieves 78.42 tok/s @ 0.088s TTFT
  4. Quality vs Speed: Higher quantization precision provides minimal quality benefit for short prompts
  5. Production Recommendation: q4_0 with partial GPU offload for maximum throughput

1. Test Environment

1.1 Hardware Configuration

Component Specification
GPU NVIDIA RTX 4080 (12GB VRAM, 9,728 CUDA cores)
CPU Intel Core i9-13980HX (24 cores, 32 threads)
RAM 16 GB DDR5-4800
OS Windows 11 Pro (Build 26100)
Ollama v0.1.17 (http://localhost:11434)
Models llama3.1:8b-instruct-q4_0/q5_K_M/q8_0

1.2 Test Methodology

Benchmark Protocol:

  1. Validated Ollama service and GPU availability
  2. Pulled all quantization variants (q4_0, q5_K_M, q8_0)
  3. Created 5 representative gameplay prompts in prompts/banter_prompts.txt
  4. Executed non-streaming REST sweep per quantization
  5. Captured load, prompt-eval, eval timings, tokens/s per call
  6. Performed cartesian parameter sweep: num_gpu (40/60/80/999) × num_ctx (1024/2048/4096) × temp (0.2/0.4/0.8)
  7. Generated visualizations for stakeholder analysis

Data Sources:

  • Baseline metrics: baseline_system_metrics.json
  • Quantization sweep: csv_data/ollama_quant_bench.csv
  • Parameter sweep: csv_data/ollama_param_tuning.csv
  • Summary: csv_data/ollama_param_tuning_summary.csv
  • Visualizations: artifacts/ollama/*.png

2. Baseline Performance Analysis

2.1 System Metrics (q4_0, Default Settings)

Metric Value Notes
Mean Latency 5.15s Across 3 scenarios (2-variation outputs)
Mean Throughput 7.90 tok/s Low due to cold-start in sample
CPU Utilization 13.9% avg / 15.4% peak 6 samples @ 1 Hz
Memory Usage 72.1% avg / 73.5% peak System memory
GPU Utilization 33.0% avg / 93.0% peak Variable during inference
GPU Temperature 54.7°C avg / 63.0°C peak Well within thermal limits
GPU Power Draw 64.4W avg / 142.6W peak Efficient power usage

2.2 Known Issues

Emoji Encoding Error:

  • Windows charmap warning when Ollama responses contain emoji characters
  • Prevented ML metrics from persisting to baseline_ml_metrics.json
  • Functional impact limited to reporting; responses completed successfully
  • Mitigation: Use chcp 65001 or sanitize emoji in logs

3. Quantization Comparison

3.1 Performance Summary

Quantization Prompts Mean TTFT (s) P95 TTFT (s) Mean Tokens/s P05 Tokens/s P95 Tokens/s vs q4_0
q4_0 5 0.097 0.130 76.59 74.63 78.00
q5_K_M 5 1.354 5.148 65.18 64.88 65.74 -15%
q8_0 5 2.008 7.718 46.57 46.14 46.84 -39%

3.2 Analysis by Quantization

q4_0 (4-bit, Recommended):

  • Best throughput: 76.59 tok/s mean
  • Lowest TTFT: 0.097s warm inference
  • Smallest model: 4.7GB disk size
  • Consistent performance: Low variance (P05-P95: 74.63-78.00)
  • Use Case: Production gaming, real-time inference

q5_K_M (5-bit, K-means Medium):

  • ⚠️ 15% slower than q4_0 (65.18 tok/s)
  • 13.9x higher TTFT (1.354s vs 0.097s)
  • ⚠️ Larger model: 5.7GB disk size
  • ⚠️ Higher load time: Increased initialization cost
  • Use Case: Quality-critical, non-real-time applications

q8_0 (8-bit):

  • 39% slower than q4_0 (46.57 tok/s)
  • 20.7x higher TTFT (2.008s vs 0.097s)
  • Largest model: 8.5GB disk size
  • Highest latency: Significant initialization overhead
  • Use Case: Maximum quality (minimal benefit observed for short prompts)

3.3 Per-Prompt Throughput Analysis

Prompt q4_0 (tok/s) q5_K_M (tok/s) q8_0 (tok/s) Best
Mission failure encouragement 74.11 64.85 46.04 q4_0 ✅
Co-op victory quote 76.96 65.03 46.79 q4_0 ✅
Rare loot celebration 76.72 65.14 46.54 q4_0 ✅
Racing finish quip 78.26 65.89 46.85 q4_0 ✅
Final boss motivation 76.88 65.01 46.64 q4_0 ✅

Key Insight: q4_0 consistently outperforms across all prompt types, demonstrating superior efficiency for gaming workloads.


4. Runtime Parameter Optimization

4.1 Top 10 Configurations (q4_0)

Rank Config num_gpu num_ctx temp Tokens/s TTFT (s) Load (s)
1 g40_c1024_t0.4 40 1024 0.4 78.42 0.088 0.083
2 g40_c1024_t0.8 40 1024 0.8 78.06 0.075 0.073
3 g60_c2048_t0.8 60 2048 0.8 78.01 0.096 0.093
4 g999_c1024_t0.4 999 1024 0.4 77.93 0.087 0.082
5 g999_c1024_t0.8 999 1024 0.8 77.91 0.083 0.079
6 g80_c1024_t0.4 80 1024 0.4 77.83 0.079 0.076
7 g40_c2048_t0.4 40 2048 0.4 77.82 0.084 0.080
8 g60_c1024_t0.8 60 1024 0.8 77.77 0.077 0.073
9 g60_c4096_t0.8 60 4096 0.8 77.76 0.081 0.077
10 g80_c1024_t0.8 80 1024 0.8 77.76 0.101 0.099

4.2 Parameter Impact Analysis

GPU Layer Allocation (num_gpu):

  • 40 layers: Best throughput/load-time balance
  • 60-80 layers: Minimal throughput improvement (+0.5%)
  • ⚠️ 999 (full offload): Diminishing returns above 80 layers

Context Size (num_ctx):

  • 1024: Lowest TTFT, optimal for short-medium prompts
  • 2048: Balanced performance/context trade-off
  • ⚠️ 4096: Higher initialization cost without throughput benefit

Temperature:

  • 0.4: Best for production (determinism + high throughput)
  • 0.8: Higher creativity, minimal speed impact (-0.5%)
  • ⚠️ 0.2: Can cause variability in lower num_gpu configs

4.3 Visual Analysis

Available Visualizations:

  • artifacts/ollama/quant_tokens_per_sec.png - Throughput per quantization
  • artifacts/ollama/quant_ttft.png - TTFT comparison
  • artifacts/ollama/param_ttft_vs_tokens.png - TTFT vs throughput scatter (temperature-coded)
  • artifacts/ollama/param_heatmap_temp_0.2.png - Tokens/s heatmap (temp=0.2)
  • artifacts/ollama/param_heatmap_temp_0.4.png - Tokens/s heatmap (temp=0.4)
  • artifacts/ollama/param_heatmap_temp_0.8.png - Tokens/s heatmap (temp=0.8)

5. Production Recommendations

5.1 Optimal Configuration ⭐

# Llama3.1 Production Settings (q4_0)
model: llama3.1:8b-instruct-q4_0
options:
  num_gpu: 40       # Optimal layer allocation
  num_ctx: 1024     # Fast initialization
  temperature: 0.4  # Balanced determinism
  top_p: 0.9
  top_k: 40

Expected Performance:

  • Throughput: 78.42 tokens/s
  • TTFT: 0.088s (warm)
  • Load Time: 0.083s
  • GPU Memory: ~4.5GB

5.2 Alternative: High-Context Applications

# Llama3.1 Extended Context Settings
model: llama3.1:8b-instruct-q4_0
options:
  num_gpu: 60
  num_ctx: 2048
  temperature: 0.4

Expected Performance:

  • Throughput: 77.82 tokens/s (-0.8% vs optimal)
  • TTFT: 0.084s
  • Context Window: 2048 tokens

5.3 Deployment Guidelines

Pre-Production Checklist:

  1. Use q4_0 quantization for optimal speed (17% faster than q5_K_M)
  2. Set num_gpu=40 for best throughput/load-time balance
  3. Use num_ctx=1024 for short-medium prompts
  4. Configure temperature=0.4 for production stability
  5. Implement warm-up call on service startup to eliminate cold-start TTFT spikes
  6. ⚠️ Fix Windows encoding (chcp 65001) or sanitize emoji in logs
  7. Avoid q8_0 unless maximum quality required (39% slower)
  8. Avoid num_gpu>80 (diminishing returns)

Monitoring & Operations:

  • Track TTFT and tokens/s via telemetry
  • Integrate CSV exports into CI for trend analysis
  • Set up alerts for throughput degradation
  • Monitor GPU memory usage (should be <6GB)

6. Reproducibility

6.1 Environment Setup

# Start Ollama service
ollama serve

# Pull quantization variants
ollama pull llama3.1:8b-instruct-q4_0
ollama pull llama3.1:8b-instruct-q5_K_M
ollama pull llama3.1:8b-instruct-q8_0

# Verify GPU availability
nvidia-smi
ollama list

6.2 Baseline Benchmark

# Run baseline performance test
python test_baseline_performance.py

# Outputs:
# - baseline_system_metrics.json
# - baseline_system_report.txt
# - baseline_ml_report.txt

6.3 Quantization Sweep

# Execute quantization comparison
# (PowerShell loop or adapt scripts/benchmark_cli.py)

# For each quantization: q4_0, q5_K_M, q8_0
#   For each prompt in prompts/banter_prompts.txt
#     Execute inference, capture metrics
#     Save to csv_data/ollama_quant_bench.csv

6.4 Parameter Sweep

# Run parameter optimization sweep
# Cartesian product: num_gpu × num_ctx × temperature

# Results saved to:
# - csv_data/ollama_param_tuning.csv
# - csv_data/ollama_param_tuning_summary.csv

6.5 Generate Visualizations

# Regenerate matplotlib charts
python -c "
import pandas as pd
import matplotlib.pyplot as plt

# Load data
quant_df = pd.read_csv('csv_data/ollama_quant_bench.csv')
param_df = pd.read_csv('csv_data/ollama_param_tuning.csv')

# Generate charts (throughput, TTFT, heatmaps)
# Save to artifacts/ollama/
"

7. Conclusions

7.1 Summary

q4_0 is the optimal quantization for Chimera Heart gaming:

Performance: 76.59 tok/s mean (17% faster than q5_K_M, 65% faster than q8_0)
Latency: 0.097s warm TTFT (13.9x faster than q5_K_M, 20.7x faster than q8_0)
Efficiency: 4.7GB model size (smallest practical quantization)
Consistency: Stable performance across all prompt types
Production-Ready: Clear optimal settings (num_gpu=40, num_ctx=1024, temp=0.4)

7.2 Key Insights

Quantization Trade-offs:

  • Higher precision (q5_K_M, q8_0) provides minimal quality benefit for short prompts
  • TTFT penalty increases exponentially with quantization precision
  • q4_0 offers optimal balance for real-time applications

Runtime Optimization:

  • Partial GPU offload (num_gpu=40) outperforms full offload (999)
  • Smaller context (1024) reduces latency without quality loss for gaming
  • Temperature 0.4 balances determinism and creativity

7.3 Future Work

Short Term:

  • Fix Windows emoji encoding for complete ML metrics
  • Implement automated warm-up on service startup
  • Add telemetry integration for production monitoring

Medium Term:

  • Benchmark Llama3.2 models when available
  • Test INT8/FP8 quantization with custom kernels
  • Evaluate multi-instance serving capabilities

Long Term:

  • Implement fine-tuning for game-specific banter
  • Explore model distillation for further compression
  • Cross-platform optimization (AMD, Intel)

Appendix A: Linking Assets

Summary Documents:

  • Short summary: reports/ollama_benchmark_summary.md
  • Full report: docs/Ollama_Benchmark_Report.md

Raw Data:

  • Quantization sweep: csv_data/ollama_quant_bench.csv
  • Parameter sweep: csv_data/ollama_param_tuning.csv
  • Summary: csv_data/ollama_param_tuning_summary.csv

Visualizations:

  • Charts: artifacts/ollama/*.png

Appendix B: References

  1. Llama 3.1 Model Card: Meta AI documentation
    https://ai.meta.com/llama/

  2. Ollama Documentation: Quantization and optimization
    https://ollama.ai/docs

  3. Technical Report 108: Comparative LLM analysis
    reports/Technical_Report_108.md

  4. Benchmark Methodology: Industry standards
    https://mlcommons.org/benchmarks/


Document Version: 2.0
Last Updated: October 10, 2025
Test Date: September 30, 2025
Status: ✅ Validated on RTX 4080
Hardware: NVIDIA RTX 4080 (12GB VRAM, 9,728 CUDA cores)