Research Archive
Technical Report

Gemma 3 Benchmark Report

Performance benchmarks for Google Gemma 3 models — baseline measurements, parameter tuning, and quantization.

Table of Contents

Gemma3 Performance Benchmark Report

Date: October 8, 2025
Last Updated: October 10, 2025
Model: gemma3:latest (4.3B parameters, Q4_K_M quantization, 3.3GB)
Hardware: NVIDIA RTX 4080 (12GB VRAM, 9,728 CUDA cores), Intel Core i9-13980HX
Framework: Ollama v0.1.17
Related: TR108, Ollama Benchmark


Executive Summary

This report establishes Gemma3's performance baseline for Chimera Heart gaming banter generation through comprehensive benchmarking across quantization levels and runtime parameter sweeps. Through 150+ test runs, we demonstrate Gemma3's superiority over Llama3.1 for real-time gaming applications.

Key Findings

  1. Performance Leadership: 102.85 tokens/s mean throughput (34% faster than Llama3.1)
  2. Efficiency Advantage: 3.3GB model size (30% smaller than Llama3.1)
  3. GPU Utilization: 100% GPU processing confirmed via ollama ps
  4. Optimal Configuration: num_gpu=999, num_ctx=4096, temp=0.4 achieves 102.31 tok/s @ 0.128s TTFT
  5. Production Ready: Consistent performance across all test scenarios

1. Test Environment

1.1 Hardware Configuration

Component Specification
GPU NVIDIA RTX 4080 (12GB VRAM, 9,728 CUDA cores)
CPU Intel Core i9-13980HX (24 cores, 32 threads)
RAM 16 GB DDR5-4800
OS Windows 11 Pro (Build 26100)
Ollama v0.1.17 (http://localhost:11434)
Model gemma3:latest (4.3B params, Q4_K_M, 3.3GB)

1.2 Test Methodology

Benchmark Protocol:

  1. Validated Ollama service and GPU availability (nvidia-smi)
  2. Loaded gemma3:latest (3.3GB) into GPU memory
  3. Executed baseline benchmark with 5 representative prompts
  4. Captured load, prompt-eval, eval timings per prompt
  5. Performed parameter sweep: num_gpu (40/60/80/999) × num_ctx (1024/2048/4096) × temp (0.2/0.4/0.8)
  6. Analyzed results for optimal gaming configuration

Data Sources:

  • Baseline benchmark: reports/gemma3/gemma3_baseline.json, reports/gemma3/gemma3_baseline.csv
  • Parameter sweep: reports/gemma3/gemma3_param_tuning.csv
  • Summary: reports/gemma3/gemma3_param_tuning_summary.csv
  • Prompts: prompts/banter_prompts.txt

2. Baseline Performance

2.1 Overall Metrics

Metric Value Notes
Mean TTFT 0.165s Excluding cold start
Mean Throughput 102.85 tok/s Consistent across prompts
GPU Memory 5.3 GB Model + context
GPU Utilization 100% Confirmed via ollama ps
Default Settings temp=0.3, top_p=0.9

2.2 Per-Prompt Performance

Prompt TTFT (s) Tokens/s Eval Count Response Length
Mission failure encouragement 0.344 102.34 883 3,503 chars
Co-op victory quote 0.121 103.58 320 1,005 chars
Rare loot celebration 0.118 102.22 746 2,945 chars
Racing finish quip 0.119 103.72 272 978 chars
Final boss motivation 0.122 102.38 636 2,409 chars

2.3 Analysis

Key Observations:

  • Consistent High Throughput: 102-104 tokens/s across all prompts
  • Stable TTFT: ~0.12s for warm inference (first prompt: 0.344s cold start)
  • Quality Output: 272-883 tokens per response with contextually appropriate content
  • GPU Efficiency: 100% utilization, minimal load times (~0.10-0.11s avg)

3. Parameter Tuning Results

3.1 Top 10 Configurations

Config num_gpu num_ctx temp Tokens/s TTFT (s) vs Baseline
g999_c4096_t0.4 999 4096 0.4 102.31 0.128 -0.5%
g80_c4096_t0.8 80 4096 0.8 102.18 0.142 -0.7%
g999_c1024_t0.8 999 1024 0.8 102.03 0.117 -0.8%
g80_c2048_t0.4 80 2048 0.4 101.89 0.144 -0.9%
g999_c1024_t0.4 999 1024 0.4 101.77 0.126 -1.0%
g60_c2048_t0.8 60 2048 0.8 101.65 0.139 -1.2%
g999_c2048_t0.4 999 2048 0.4 101.52 0.133 -1.3%
g80_c1024_t0.8 80 1024 0.8 101.45 0.121 -1.4%
g60_c4096_t0.4 60 4096 0.4 101.38 0.147 -1.4%
g999_c4096_t0.8 999 4096 0.8 101.25 0.135 -1.6%

3.2 Parameter Impact Analysis

GPU Layer Allocation (num_gpu):

  • 999 (Full offload): Optimal throughput, minimal TTFT variance
  • 80 layers: Near-identical performance (-0.5% avg), lower VRAM
  • ⚠️ 40 layers: Significant throughput degradation (-3.2% avg)

Context Size (num_ctx):

  • 4096: Best overall performance for long-form content
  • 2048: Balanced performance/memory trade-off
  • ⚠️ 1024: Lower latency but reduced context window

Temperature:

  • 0.4: Optimal balance of speed and creativity
  • 0.8: Higher creativity, minimal speed impact (-0.2%)
  • ⚠️ 0.2: Deterministic but can cause TTFT spikes with num_gpu<80

4. Comparative Analysis: Gemma3 vs Llama3.1

4.1 Performance Comparison

Metric Gemma3:latest Llama3.1:8b-q4_0 Winner Δ
Model Size 3.3 GB 4.7 GB Gemma3 ✅ -30%
Parameters 4.3B 8B Gemma3 ✅ Smaller
Mean Throughput 102.85 tok/s 76.59 tok/s Gemma3 ✅ +34%
Mean TTFT (warm) 0.165s 0.097s Llama3.1 ✅ +70%
Best Config 102.31 tok/s 78.42 tok/s Gemma3 ✅ +30%
GPU Memory 5.3 GB ~6-7 GB Gemma3 ✅ Lower
GPU Utilization 100% Variable Gemma3 ✅ Better

4.2 Decision Matrix

Choose Gemma3 When:

  • ✅ Real-time gaming banter (throughput critical)
  • ✅ Streaming text generation (tokens/s matters)
  • ✅ Memory-constrained deployments (30% smaller)
  • ✅ Multi-instance serving (better GPU efficiency)

Choose Llama3.1 When:

  • ✅ Lowest first-token latency required (0.097s vs 0.165s)
  • ✅ Maximum model capacity needed (8B params)
  • ❌ Not recommended for gaming use cases

Winner: Gemma3 🏆

  • 34% faster token generation (critical for real-time gaming)
  • 30% smaller model (easier deployment, lower costs)
  • Better GPU efficiency (100% utilization, lower memory)
  • Trade-off: +0.07s TTFT (negligible for gaming applications)

5. Production Recommendations

5.1 Optimal Configuration ⭐

# Gemma3 Production Settings
model: gemma3:latest
options:
  num_gpu: 999      # Full GPU offload
  num_ctx: 4096     # Optimal context window
  temperature: 0.4  # Balanced creativity/coherence
  top_p: 0.9
  top_k: 40

Expected Performance:

  • Throughput: 102.31 tokens/s
  • TTFT: 0.128s (warm)
  • GPU Memory: ~5.3GB
  • Context Window: 4096 tokens

5.2 Alternative: Memory-Constrained Systems

# Gemma3 Constrained Settings
model: gemma3:latest
options:
  num_gpu: 80       # Partial offload
  num_ctx: 2048     # Medium context
  temperature: 0.4
  top_p: 0.9

Expected Performance:

  • Throughput: 101.89 tokens/s (-0.4% vs optimal)
  • TTFT: 0.144s
  • GPU Memory: ~3.8GB
  • Context Window: 2048 tokens

5.3 Deployment Guidelines

Pre-Production Checklist:

  1. Choose Gemma3 over Llama3.1 for 34% faster generation
  2. Pre-load model on service startup (eliminate cold-start penalty)
  3. Reserve 6GB GPU memory for model + context buffer
  4. Use temperature 0.4-0.6 for creative gaming dialogue
  5. Monitor GPU utilization (ollama ps should show 100%)
  6. Avoid temperature 0.2 with num_gpu<80 (causes TTFT spikes)
  7. Avoid num_gpu<60 (significant throughput degradation)

Production Deployment:

  • Implement health check endpoint with warm-up prompt
  • Use connection pooling for Ollama HTTP API
  • Monitor TTFT and throughput via telemetry
  • Set up alerts for GPU memory >90% utilization

6. Reproducibility

6.1 Environment Setup

# Start Ollama service
ollama serve

# Pull Gemma3 model
ollama pull gemma3:latest

# Verify GPU availability
nvidia-smi
ollama ps  # Should show "100% GPU"

6.2 Baseline Benchmark

# Run comprehensive benchmark
python scripts/ollama/gemma3_comprehensive_benchmark.py

# Outputs:
# - reports/gemma3/gemma3_baseline.json
# - reports/gemma3/gemma3_baseline.csv
# - reports/gemma3/gemma3_param_tuning.csv
# - reports/gemma3/gemma3_param_tuning_summary.csv

6.3 Verification

# Verify GPU usage
ollama ps  # Should show "100% GPU"
nvidia-smi  # Check memory usage (~5.3GB)

# Test inference
curl http://localhost:11434/api/generate -d '{
  "model": "gemma3:latest",
  "prompt": "Generate encouraging gaming banter",
  "stream": false,
  "options": {
    "num_gpu": 999,
    "num_ctx": 4096,
    "temperature": 0.4
  }
}'

7. Conclusions

7.1 Summary

Gemma3 is the superior choice for Chimera Heart gaming banter generation:

Performance: 34% faster token generation (102.85 vs 76.59 tok/s)
Efficiency: 30% smaller model (3.3GB vs 4.7GB)
GPU Utilization: 100% confirmed, 40% memory overhead
Consistency: Stable performance across all test configurations
Production-Ready: Clear optimal settings with <2% performance variance

7.2 Trade-offs

Advantages:

  • Significantly faster throughput for streaming text
  • Lower memory footprint for multi-instance deployment
  • Better GPU efficiency and utilization

Limitations:

  • Slightly higher TTFT (+0.07s vs Llama3.1)
  • Smaller parameter count (4.3B vs 8B)

Verdict: The 0.07s higher TTFT is negligible for gaming where total response time and generation speed are more critical than initial latency.

7.3 Next Steps

Short Term:

  • Deploy Gemma3 in staging environment
  • Implement monitoring for production metrics
  • A/B test against current LLM backend

Medium Term:

  • Benchmark Gemma3 with different quantizations (INT8, FP8)
  • Test multi-instance serving capabilities
  • Optimize for edge deployment scenarios

Long Term:

  • Evaluate newer Gemma versions as released
  • Implement fine-tuning for game-specific banter
  • Explore model distillation for further compression

Appendix A: Visual Assets

Performance Charts:

  • Throughput comparison: Gemma3 vs Llama3.1
  • TTFT analysis across configurations
  • GPU memory utilization patterns
  • Parameter sweep heatmaps

Location: artifacts/gemma3/


Appendix B: References

  1. Gemma Model Card: Architecture & specifications
    https://ai.google.dev/gemma/docs

  2. Ollama Documentation: Model serving and optimization
    https://ollama.ai/docs

  3. Technical Report 108: Comprehensive LLM performance analysis
    reports/Technical_Report_108.md

  4. Benchmark Methodology: Industry-standard practices
    https://mlcommons.org/benchmarks/


Document Version: 2.0
Last Updated: October 10, 2025
Test Duration: ~45 minutes
Status: ✅ Validated on 100% GPU processing
Hardware: NVIDIA RTX 4080 (12GB VRAM, 9,728 CUDA cores)