Q2_K is universally unacceptable for safety. Banned across 18+ models, 10+ families.
Edge LLM Inference Under Real-World Constraints
How fast can local inference get — and how safe is it at the edge? This research program answers both questions with CUDA event timing and controlled safety evaluations across model loading, quantization, TensorRT compilation, KV cache optimization, multi-agent coordination, and cross-backend safety consistency.
Independent research by Sahil Kadadekar
Start Here
Chimeraforge: High-Performance LLM Agent Orchestration
Rust vs. Python for production AI orchestration. A hybrid architecture and "Dual Ollama" pattern achieve 58% latency reduction and near-zero contention.
Read the Whitepaper →
Foundation synthesis: model loading, ONNX conversion, quantization baselines, and security analysis across 9 technical reports.
Read report →
Benchmarking synthesis: cross-backend inference parity, TensorRT compilation, and scaling laws across 6 reports.
Read report →
Optimization synthesis: KV cache tuning, INT8/FP8 quantization, context scaling, and deployment pipeline across 11 reports.
Read report →
Safety synthesis: alignment erosion under quantization, concurrency invariance, and backend template divergence across 4 reports.
Read report →
Attack-surface synthesis: batch perturbation, multi-turn jailbreaks, cross-architecture fragility, and composition effects across 306K+ samples.
Read report →
Key Findings
Concrete results pulled from the published reports. Numbers, not narrative.
Alignment type does not predict batch-induced safety fragility (RLHF, SFT, DPO, distilled — none differ).
Backend migration can cost 25 percentage points of safety. The cause is chat template divergence, not the framework.
Quality metrics are not safety proxies. Safety degrades 13.9× faster than quality at Q3_K_S.
Dual Ollama eliminates 99% of multi-agent contention. Architectural fix, not code fix.
GPU memory bandwidth is the multi-agent bottleneck — not the serving stack. Overturned the TR130 conclusion.
Continuous batching delivers 2.25× throughput at N=8 via 77-80% kernel reduction.
The universal quantization sweet spot. -4.1pp accuracy max across 5 models, 30-67% cost savings.
FP8 KV-cache produces no Holm-significant safety effect across 24K paired records on 3 models. Not pre-approved, not pre-banned — workload-specific paired eval required.
Cross-LLM judge agreement earns a "triangulate" verdict: single-judge labels are insufficient for safety classification. 68K judge rows over the TR145 safety subset. In addition, safety-specialist judges measure a different axis than general LLMs.
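The FP8 KV-cache finding above is stated in terms of Holm significance. As a reminder of what that correction does, here is a minimal sketch of the Holm step-down procedure. The function name and the p-values are hypothetical illustrations and are not taken from any report.

```python
def holm_reject(p_values, alpha=0.05):
    """Return one boolean per hypothesis: True where Holm's procedure rejects.

    Hypothetical helper for illustration; not the reports' analysis code.
    """
    m = len(p_values)
    # Visit hypotheses from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The p-value at 0-based rank k is compared against alpha / (m - k).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            # Step-down rule: after the first failure, no larger p-value can pass.
            break
    return reject


# Made-up p-values for four comparisons.
example = [0.010, 0.040, 0.030, 0.200]
print(holm_reject(example))  # [True, False, False, False]
```

In this made-up example, 0.030 and 0.040 would pass an uncorrected 0.05 threshold but do not survive the correction, which is the distinction a "Holm-significant" claim is drawing.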
Whitepapers
Executive-level decision documents. Start here if you need the bottom line.
Executive guidance for language, architecture, runtime, and model selection.
Read →
Executive guidance for deployment leaders: benchmarking phase synthesis.
Read →
Executive guidance for deployment leaders: optimization phase synthesis.
Read →
Executive guidance for safety-critical LLM deployment.
Read →
Executive guidance for safety attack-surface management in LLM inference.
Read →
Conclusive Reports & Appendices
Dissertation-style synthesis documents consolidating findings across multiple technical reports.
Dissertation-style synthesis — language, architecture, runtime, and model selection for multi-agent LLM systems.
Read →
Supplemental material extracted from the Phase 1 conclusive report.
Read →
Dissertation-style synthesis: performance, cost, scaling, compiler behavior, and physical limits.
Read →
Supplemental material extracted from the Phase 1.5 conclusive report.
Read →
Dissertation-style synthesis: economics, quantization, context scaling, serving stacks, and predictive modeling.
Read →
Supplemental material extracted from the Phase 2 conclusive report.
Read →
Dissertation-style synthesis: quantization-induced alignment erosion, concurrency invariance, backend template divergence, and safety taxonomy.
Read →
Supplemental material for the safety-critical deployment synthesis.
Read →
Safety attack-surface synthesis: 306,996 samples across batch perturbation, multi-turn jailbreaks, long-context exploitation, cross-architecture fragility, and composition effects.
Read →
Supplemental material for the safety attack-surface synthesis.
Read →
Technical Reports
Individual research reports with raw data, methodology, and findings.
FP8 KV-cache safety on standardized batteries and across the serving-state factorial — batch, prefix-caching, speculative decoding, and temperature.
7,578 records across 3 models × 4 standardized batteries (HarmBench, JailbreakBench, StrongREJECT, XSTest) × 2 KV-cache dtypes. Replicates TR145’s FP8 null on literature-comparable corpora; corrected paired-odds-ratio estimator. Local-only judging, $0 external API.
View Report →
FP8 KV-cache folded across batch size, prefix-caching, and temperature: 14,400 responses, 7,010 matched pairs. Harmful-prompt refusal invariant under every serving state; only footprint is a sub-percentage-point over-refusal lean on the Qwen family’s XSTest cells. A v1 local pilot, not the definitive Layer 5 result.
View Report →
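The report summaries above refer to matched pairs and a paired odds ratio. For readers unfamiliar with the idea, the sketch below shows the standard conditional estimator for matched binary outcomes, which uses only the discordant pairs. The 0.5 continuity correction is an assumption made here to keep the ratio finite when a count is zero; the summaries do not say which correction the reports apply, and the function name and data are hypothetical.

```python
def paired_odds_ratio(pairs, correction=0.5):
    """Conditional odds ratio for matched binary outcomes.

    pairs: iterable of (baseline_refused, fp8_refused) booleans, one per prompt.
    correction: assumed continuity correction added to both discordant counts.
    Returns (odds_ratio, b, c). Hypothetical helper, not the reports' estimator.
    """
    pairs = list(pairs)
    # b: refused under the baseline KV cache but not under FP8.
    b = sum(1 for base, fp8 in pairs if base and not fp8)
    # c: refused under FP8 but not under the baseline.
    c = sum(1 for base, fp8 in pairs if fp8 and not base)
    # Concordant pairs carry no information about the direction of the effect.
    return (b + correction) / (c + correction), b, c


# Made-up outcomes: 92 concordant pairs and 4 discordant pairs in each direction.
data = (
    [(True, True)] * 90
    + [(True, False)] * 4
    + [(False, True)] * 4
    + [(False, False)] * 2
)
print(paired_odds_ratio(data))  # (1.0, 4, 4)
```

An odds ratio near 1 with balanced discordant counts is what a null result looks like under this estimator; a significance test on the discordant counts would still be needed before drawing a conclusion.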