Edge LLM Inference Under Real-World Constraints
How fast can local inference get — and how safe is it at the edge? This research program answers both questions with CUDA event timing and controlled safety evaluations across model loading, quantization, TensorRT compilation, KV cache optimization, multi-agent coordination, and cross-backend safety consistency.
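The CUDA event timing mentioned above refers to the standard GPU-side latency measurement pattern. A minimal sketch, assuming PyTorch (the reports' actual harness is not shown here); the helper name `time_inference` and its parameters are illustrative, and the function falls back to a wall-clock timer when no GPU is present:

```python
# Sketch of CUDA-event latency timing, with a CPU wall-clock fallback.
# Assumes PyTorch; `time_inference` is a hypothetical helper, not the
# harness used in the reports.
import time
import torch

def time_inference(fn, warmup=3, iters=10):
    """Return the mean latency of fn() in milliseconds."""
    for _ in range(warmup):  # warm-up runs to exclude one-time setup costs
        fn()
    if torch.cuda.is_available():
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        torch.cuda.synchronize()       # drain pending work before timing
        start.record()
        for _ in range(iters):
            fn()
        end.record()
        torch.cuda.synchronize()       # wait for all kernels to finish
        return start.elapsed_time(end) / iters  # elapsed_time reports ms
    # No GPU: fall back to host-side timing.
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - t0) * 1000.0 / iters
```

Events are recorded on the GPU stream itself, so the measurement captures kernel execution time rather than Python-side launch overhead — the reason it is preferred over `time.perf_counter()` for inference benchmarks.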
Independent research by Sahil Kadadekar
Start Here
Chimeraforge: High-Performance LLM Agent Orchestration
Rust vs. Python for production AI orchestration. A hybrid architecture and "Dual Ollama" pattern achieve a 58% latency reduction and near-zero contention.
Read the Whitepaper →

Foundation synthesis — model loading, ONNX conversion, quantization baselines, and security analysis across 9 technical reports.
Read report →

Benchmarking synthesis — cross-backend inference parity, TensorRT compilation, and scaling laws across 6 reports.
Read report →

Optimization synthesis — KV cache tuning, INT8/FP8 quantization, context scaling, and deployment pipeline across 11 reports.
Read report →

Safety synthesis — alignment erosion under quantization, concurrency invariance, and backend template divergence across 4 reports.
Read report →

Attack-surface synthesis — batch perturbation, multi-turn jailbreaks, cross-architecture fragility, and composition effects across 306K+ samples.
Read report →

Whitepapers
Executive-level decision documents. Start here if you need the bottom line.
Executive guidance for language, architecture, runtime, and model selection.
Read →

Executive guidance for deployment leaders — benchmarking phase synthesis.
Read →

Executive guidance for deployment leaders — optimization phase synthesis.
Read →

Executive guidance for safety-critical LLM deployment.
Read →

Executive guidance for safety attack-surface management in LLM inference.
Read →

Conclusive Reports & Appendices
Dissertation-style synthesis documents consolidating findings across multiple technical reports.
Dissertation-style synthesis — language, architecture, runtime, and model selection for multi-agent LLM systems.
Read →

Supplemental material extracted from the Phase 1 conclusive report.
Read →

Dissertation-style synthesis — performance, cost, scaling, compiler behavior, and physical limits.
Read →

Supplemental material extracted from the Phase 1.5 conclusive report.
Read →

Dissertation-style synthesis — economics, quantization, context scaling, serving stacks, and predictive modeling.
Read →

Supplemental material extracted from the Phase 2 conclusive report.
Read →

Dissertation-style synthesis — quantization-induced alignment erosion, concurrency invariance, backend template divergence, and safety taxonomy.
Read →

Supplemental material for the safety-critical deployment synthesis.
Read →

Safety attack-surface synthesis — 306,996 samples across batch perturbation, multi-turn jailbreaks, long-context exploitation, cross-architecture fragility, and composition effects.
Read →

Supplemental material for the safety attack-surface synthesis.
Read →

Technical Reports
Individual research reports with raw data, methodology, and findings.
Alignment under quantization, batch perturbation, multi-turn jailbreaks, many-shot attacks, cross-architecture fragility, cross-request composition.
Multi-family safety evaluation across 4 models (1.2B–7.6B) with jailbreak amplification analysis.
View Report →

Does running N concurrent agents on a shared backend degrade model safety?
View Report →

Ollama vs. vLLM vs. TGI safety comparison across 3 models, 4 backends, and 6 benchmarks.
View Report →

Unified synthesis of quantization, concurrency, and backend effects on LLM safety — 74,254 samples.
View Report →

Audit-layer flip adjudication and a 7,257-sample replication with a corrected refusal detector.
View Report →

Conversational attack sweep — 10,600 conversations across 4 models, 6 quantization levels, and 8 attack strategies.
View Report →

15,000 scored samples across 4 models, 6 quantization levels, 5 shot counts, and 3 context-length profiles.
View Report →

127,224 records across 18 models, 10+ families, and 4 alignment types — batch-induced safety-flip asymmetry on a Blackwell GPU.
View Report →

Cross-referencing TR125 quality metrics with TR134 safety metrics — analysis only, no new experiments.
View Report →

14,250 records — batch composition effects on safety in multi-tenant vLLM inference.
View Report →