Independent Research

Edge LLM Inference Under Real-World Constraints

How fast can local inference get — and how safe is it at the edge? This research program answers both questions with CUDA event timing and controlled safety evaluations across model loading, quantization, TensorRT compilation, KV cache optimization, multi-agent coordination, and cross-backend safety consistency.

Independent research by Sahil Kadadekar

1,348,000+

Research Measurements

Technical Reports

Synthesis Whitepapers

Repositories

Start Here

Whitepaper

Chimeraforge: High-Performance LLM Agent Orchestration

Rust vs. Python for production AI orchestration. A hybrid architecture and "Dual Ollama" pattern achieve 58% latency reduction and near-zero contention.

Read the Whitepaper →

Phase 1 Whitepaper

Foundation synthesis — model loading, ONNX conversion, quantization baselines, and security analysis across 9 technical reports.

Read report →

Phase 2 Whitepaper

Benchmarking synthesis — cross-backend inference parity, TensorRT compilation, and scaling laws across 6 reports.

Read report →

Phase 3 Whitepaper

Optimization synthesis — KV cache tuning, INT8/FP8 quantization, context scaling, and deployment pipeline across 11 reports.

Read report →

Phase 4 Whitepaper

Safety-pivot synthesis — alignment erosion under quantization, concurrency invariance, and backend template divergence across 4 reports.

Read report →

Phase 5 Whitepaper

Attack-surface synthesis — batch perturbation, multi-turn jailbreaks, cross-architecture fragility, and composition effects across 306K+ samples. TR138 Study D batch-invariant-kernel ablation as standalone addendum.

Read report →

Phase 6 Whitepaper

Serving-state safety certification — measurement-validity substrate (judge triangulation, KV-cache safety null, speculative decoding null, mechanistic probing, portability validation) + FP8 KV-cache standardized batteries + serving-state factorial.

Read report →

Key Findings

Concrete results pulled from the published reports. Numbers, not narrative.

100% ASR

Q2_K is universally unacceptable for safety. Banned across 18+ models, 10+ families.

TR134 TR139

p = 0.942

Alignment type does not predict batch-induced safety fragility (RLHF, SFT, DPO, distilled — none differ).

TR141

25pp

Backend migration can cost 25 percentage points of safety. Chat template divergence, not the framework.

TR136

13.9×

Quality metrics are not safety proxies. Safety degrades 13.9× faster than quality at Q3_K_S.

TR142

99.4%

Dual Ollama eliminates 99% of multi-agent contention. Architectural fix, not code fix.

TR114

+74%

GPU memory bandwidth is the multi-agent bottleneck — not the serving stack. Overturned the TR130 conclusion.

TR131

2.25×

Continuous batching delivers 2.25× throughput at N=8 via 77-80% kernel reduction.

TR132

Q4_K_M

The universal quantization sweet spot. -4.1pp accuracy max across 5 models, 30-67% cost savings.

TR125

NULL

FP8 KV-cache produces no Holm-significant safety effect across 24K paired records on 3 models. Not pre-approved, not pre-banned — workload-specific paired eval required.

TR145

κ = 0.69

Cross-LLM judge agreement is "triangulate" — single-judge labels are insufficient for safety classification. 68K judge rows over the TR145 safety subset. Plus: safety-specialist judges measure a different axis than general LLMs.

TR148

Whitepapers

Executive-level decision documents. Start here if you need the bottom line.

Phase 1 Decision Whitepaper

Executive guidance for language, architecture, runtime, and model selection.

Edge LLM Inference Under Real-World Constraints

Start Here

Chimeraforge: High-Performance LLM Agent Orchestration

Key Findings

Whitepapers

Conclusive Reports & Appendices

Technical Reports