Batch Inference Safety Under Non-Determinism
Submitted. Phase 1 safety flips at ~0.58% vs. capability flips at ~0.14% under controlled batching; refusal-to-compliance is the dominant flip direction. Reduced true-batching validation reaches ~99.4% agreement with synchronized dispatch (measurement sketch below).
Target: AI4Good
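A minimal sketch of the paired comparison assumed behind these numbers: the same prompts are labeled under a reference (synchronized) and a perturbed (batched) dispatch mode, and agreement plus per-direction flip rates are computed. The labeling scheme and toy data are hypothetical illustrations, not the submission's harness.

```python
def flip_stats(reference, perturbed):
    """Compare per-prompt labels between a reference configuration
    (e.g. synchronized dispatch) and a perturbed one (e.g. true batching).

    Labels are 'refuse' or 'comply', index-aligned by prompt. Returns
    overall agreement and each flip direction separately, so the
    refusal-to-compliance rate can be reported on its own.
    """
    assert len(reference) == len(perturbed)
    n = len(reference)
    r_to_c = sum(1 for a, b in zip(reference, perturbed) if a == "refuse" and b == "comply")
    c_to_r = sum(1 for a, b in zip(reference, perturbed) if a == "comply" and b == "refuse")
    agree = sum(1 for a, b in zip(reference, perturbed) if a == b)
    return {
        "agreement": agree / n,
        "flip_rate": (r_to_c + c_to_r) / n,
        "refusal_to_compliance": r_to_c / n,
        "compliance_to_refusal": c_to_r / n,
    }

# Toy usage with 8 prompts; a real run would use the full paired prompt set.
synced  = ["refuse", "comply", "refuse", "refuse", "comply", "refuse", "comply", "refuse"]
batched = ["refuse", "comply", "comply", "refuse", "comply", "refuse", "comply", "refuse"]
print(flip_stats(synced, batched))
```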
Inference Optimization Is Not Safety-Neutral
Synthesis. Quantization drives 57% of total safety cost, backend choice 41%, and concurrency 2% (share arithmetic sketched below). Chat template divergence can induce larger safety shifts than numerical precision.
Target: TBD
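A tiny sketch of the share-of-total arithmetic as it is framed above, assuming each factor's safety cost is a paired flip-rate delta (in percentage points) against a reference serving configuration and its share is that delta over the summed total. The deltas below are placeholders, not the paper's measurements.

```python
factor_deltas = {              # hypothetical flip-rate deltas (percentage points)
    "quantization": 0.40,
    "backend": 0.25,
    "concurrency": 0.05,
}

total = sum(factor_deltas.values())
for factor, delta in factor_deltas.items():
    # Each factor's share of the total safety cost is its delta over the sum.
    print(f"{factor:>12s}: {delta / total:6.1%} of total safety cost")
```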
Empirical Capacity Planning for Local LLM Inference
Synthesis. Treats capacity planning as a fitted systems problem: backend choice, context length, and memory pressure all materially change the feasible operating regime. Planner quality should be judged by validation against explicit targets, not analytic elegance (fit-and-validate sketch below).
Target: Systems venue
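A minimal fit-and-validate sketch of this framing: fit a saturating throughput curve to measured aggregate tokens/sec versus concurrency, then read off the largest concurrency that still meets an explicit per-client throughput target. The measurements, the model form, and the target value are illustrative assumptions, not the paper's data.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical measurements: concurrent clients vs aggregate tokens/sec.
clients = np.array([1, 2, 4, 8, 16])
tok_per_s = np.array([42.0, 78.0, 130.0, 180.0, 205.0])

def saturating(n, t_max, k):
    # Simple saturation curve: aggregate throughput approaches t_max as n grows.
    return t_max * n / (n + k)

(t_max, k), _ = curve_fit(saturating, clients, tok_per_s, p0=[250.0, 4.0])

# Explicit target: every client should see at least 15 tokens/sec.
target_per_client = 15.0
for n in range(1, 33):
    per_client = saturating(n, t_max, k) / n
    if per_client < target_per_client:
        print(f"feasible operating regime: up to {n - 1} concurrent clients")
        break
```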
Multi-Agent Runtime Architecture
Synthesis. Recasts "which language wins" as "which system design preserves throughput." Python and Rust are near parity on throughput; architecture and concurrency strategy drive larger differences. A dual-Ollama setup achieves 99.4% multi-agent efficiency (dispatch sketch below).
Target: Systems venue
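A hedged sketch of the dual-Ollama pattern, not the paper's runtime: agents are assigned round-robin to two local Ollama servers, requests run concurrently, and multi-agent efficiency is defined here (as an assumption) as aggregate wall-clock throughput divided by N times a separately measured single-agent rate. Ports, the model tag, and the baseline number are placeholders.

```python
import asyncio
import time
import httpx

ENDPOINTS = ["http://localhost:11434", "http://localhost:11435"]  # two Ollama servers (assumed ports)
MODEL = "llama3.1:8b"            # assumed model tag
SINGLE_AGENT_TOK_S = 45.0        # assumed single-agent baseline (tokens/sec)

async def run_agent(client, agent_id, prompt):
    base = ENDPOINTS[agent_id % len(ENDPOINTS)]      # round-robin backend assignment
    resp = await client.post(f"{base}/api/generate",
                             json={"model": MODEL, "prompt": prompt, "stream": False})
    resp.raise_for_status()
    return resp.json()["eval_count"]                 # generated-token count reported by Ollama

async def main(n_agents=8):
    start = time.perf_counter()
    async with httpx.AsyncClient(timeout=300.0) as client:
        counts = await asyncio.gather(*[
            run_agent(client, i, f"Agent {i}: summarize your current task.")
            for i in range(n_agents)
        ])
    wall = time.perf_counter() - start
    aggregate = sum(counts) / wall                   # aggregate tokens/sec over wall clock
    efficiency = aggregate / (n_agents * SINGLE_AGENT_TOK_S)
    print(f"aggregate {aggregate:.1f} tok/s, multi-agent efficiency {efficiency:.1%}")

asyncio.run(main())
```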
Serving Stacks, Continuous Batching & the Physics of Throughput
Synthesis. LLM serving-stack differences are mechanistic, not benchmark trivia. Continuous batching drives throughput scaling with agent count via a 77-80% kernel reduction. Backends differ in their effective serial fraction, and the 2.25× throughput gain at N=8 is explained mechanistically (serial-fraction fit below).
Target: Systems venue
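A sketch of the effective-serial-fraction framing: normalize measured aggregate throughput at N agents by the N=1 value to get a speedup curve, then fit Amdahl's law, speedup(N) = 1 / (s + (1 - s) / N), per backend and compare the fitted s. The backend names and intermediate speedup values below are illustrative placeholders (the N=8 endpoint of 2.25× echoes the figure quoted above).

```python
import numpy as np
from scipy.optimize import curve_fit

def amdahl(n, s):
    # Amdahl's law: s is the effective serial fraction of the workload.
    return 1.0 / (s + (1.0 - s) / n)

agents = np.array([1, 2, 4, 8])

# Hypothetical speedups relative to N=1 for two serving backends.
backends = {
    "backend_a": np.array([1.00, 1.55, 2.00, 2.25]),   # saturates early, ~2.25x at N=8
    "backend_b": np.array([1.00, 1.80, 2.90, 4.10]),   # scales further before flattening
}

for name, speedup in backends.items():
    (s,), _ = curve_fit(amdahl, agents, speedup, p0=[0.3], bounds=(0.0, 1.0))
    print(f"{name}: effective serial fraction s ~= {s:.2f}")
```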
KV-Cache Quantization and Safety
In preparation. KV-cache quantization is a serving-layer perturbation that touches retained attention state. A 5-phase paired study of FP16 vs FP8 across 24K records and 3 models. The headline result is a null: no Holm-significant safety effect detectable at α=0.05 with 80% power. Operational rule: workload-specific paired evaluation, not pre-approval (paired-test sketch below).
Target: Workshop submission
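A sketch of the analysis shape described above, not the study's actual pipeline: per model, build the 2x2 table of paired refusal outcomes (FP16 KV cache vs FP8 KV cache on the same prompts), run an exact McNemar test on the discordant cells, then Holm-correct the p-values across models at α=0.05. The counts are made up.

```python
from statsmodels.stats.contingency_tables import mcnemar
from statsmodels.stats.multitest import multipletests

# Hypothetical paired counts per model:
# rows = FP16 (refuse, comply), cols = FP8 (refuse, comply).
paired_tables = {
    "model_a": [[3900, 45], [38, 4017]],
    "model_b": [[3720, 52], [61, 4167]],
    "model_c": [[4010, 29], [33, 3928]],
}

pvalues = {}
for model, table in paired_tables.items():
    result = mcnemar(table, exact=True)      # exact test on the discordant pairs
    pvalues[model] = result.pvalue

# Holm step-down correction across the per-model tests.
reject, corrected, _, _ = multipletests(list(pvalues.values()), alpha=0.05, method="holm")

for (model, raw), adj, sig in zip(pvalues.items(), corrected, reject):
    print(f"{model}: p={raw:.3f}, Holm-adjusted={adj:.3f}, significant={sig}")
```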