A live walkthrough · TDD-012 streaming ladder

Every reasoning step, checked five times.

A language model writes a long answer one reasoning step at a time. If a step is wrong, it poisons everything after it. This system inspects every step the moment it's written — cheap checks on every step, expensive ones only when the cheap ones can't resolve it. A final enforcer decides proceed, rewind, or degrade.
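The proceed / rewind / degrade decision can be sketched as a small dispatch. This is a minimal illustration with hypothetical function names, not the real P102.x API; the toy ladder stands in for all five tiers.

```javascript
// Hypothetical sketch of the enforcer's three outcomes (not the real API).
function enforce(step, ladder) {
  const verdict = ladder.check(step); // cheap tiers first, expensive on demand
  if (verdict === "pass") return { action: "proceed" };                // step is sound
  if (verdict === "fail") return { action: "rewind", to: step.index }; // regenerate from this step
  return { action: "degrade" };                                        // ship with a hedge
}

// Toy ladder: a single deterministic rule standing in for all five tiers.
const ladder = {
  check: (step) => (step.text.includes("2 + 2 = 5") ? "fail" : "pass"),
};

enforce({ index: 1, text: "2 + 2 = 5" }, ladder); // { action: "rewind", to: 1 }
```

The point of the shape: the enforcer never sees the tiers themselves, only the ladder's final verdict, so tiers can be added or swapped without touching the proceed/rewind/degrade logic.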

Most chatbots · 0 checks on intermediate reasoning
The model writes, the user reads. A wrong reasoning step ships.

Safety-filtered · 1 classifier at the end
Catches banned content after generation. Doesn't see the reasoning chain.

This system · 5 staged checks per step
Cheap deterministic rules first. An LLM judge when the rules can't resolve. The model second-guessing itself when the judge is uncertain. A multi-model panel only when the cheaper tiers disagree.
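The escalation order can be sketched as a short-circuiting loop over tiers. A minimal sketch, assuming made-up costs and verdict stubs (the real logic is the P102.x pipeline); the costs here are chosen to roughly match the page's 8.2s worst case.

```javascript
// Illustrative five-tier short-circuit: each tier returns a verdict,
// or null to defer to the next, more expensive tier. Names follow the
// page; costs and run() bodies are stand-ins, not production values.
const tiers = [
  { name: "T1 rules",        cost: 0.001, run: (step) => step.ruleHit ?? null },
  { name: "T2 judge",        cost: 0.5,   run: (step) => step.judgeSure ?? null },
  { name: "T2.5 self-check", cost: 1.2,   run: () => null },   // still unsure in this toy
  { name: "T3 panel",        cost: 6.5,   run: () => "pass" }, // the last tier must decide
];

function runLadder(step) {
  let spent = 0;
  for (const tier of tiers) {
    spent += tier.cost;                       // pay only for tiers that actually run
    const verdict = tier.run(step);
    if (verdict != null) return { verdict, tier: tier.name, seconds: spent };
  }
}

runLadder({ ruleHit: "fail" });  // caught by T1 for ~1 ms; T2+ never wake up
runLadder({});                   // nothing is sure → climbs all the way to T3
```

When the cheap tier resolves, the expensive tiers contribute zero cost; when nothing resolves, the total approaches the worst-case budget.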

Below is a working example. The walkthrough plays automatically — five real reasoning steps moving through the five-tier ladder, real verdicts, real costs.

Public-demo caveat: the T2 LLM judge runs against a deterministic fake provider for reproducibility. T1 (the 15-rule verifier), T2.5 (text-protocol self-correct), T3 (the panel client), and the enforcer are the real code paths. Cost numbers shown are production estimates, not the offline runner's in-process microseconds.
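"Deterministic fake provider" means the T2 judge is replayed from fixtures rather than called live. A minimal sketch of that idea, with a hypothetical shape (the real fake lives in the demo code, not shown here):

```javascript
// Fixture-backed judge: same prompt in, same canned verdict out, so every
// replay of the walkthrough produces identical T2 verdicts. The prompts
// and verdicts below are invented examples, not the demo's actual fixtures.
const canned = new Map([
  ["judge: 2 + 2 = 5",    { verdict: "fail", confidence: 0.97 }],
  ["judge: water is wet", { verdict: "pass", confidence: 0.91 }],
]);

function fakeJudge(prompt) {
  // Unknown prompts fall back to "uncertain", so the ladder escalates
  // to a more expensive tier instead of guessing.
  return canned.get(prompt) ?? { verdict: "uncertain", confidence: 0.5 };
}

fakeJudge("judge: 2 + 2 = 5"); // always { verdict: "fail", confidence: 0.97 }
```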

spec · docs/tdd/TDD012.md §3, §8, §17 + P102.0–19
generated · 2026-05-12
scenarios · 5
Reasoning step · index 1
demo-2 · flagged by gate
top-1 · 0.31 · entropy · 2.70 · runner-up · 0.89

We can verify by simple addition that 2 + 2 = 5.

The model gets the math wrong. The cheapest tier catches it in microseconds — the expensive tiers never need to wake up.
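A T1-style rule for exactly this failure can be sketched in a few lines. The real verifier has 15 deterministic rules; this addition check is an illustrative stand-in, not the real code.

```javascript
// Toy deterministic rule: verify any "a + b = c" claim in a step's text.
// Returns "pass"/"fail" when the rule applies, null when it doesn't
// (null is the ladder's signal to defer to a more expensive tier).
function checkAddition(text) {
  const m = text.match(/(\d+)\s*\+\s*(\d+)\s*=\s*(\d+)/);
  if (!m) return null;                                   // rule doesn't apply
  return Number(m[1]) + Number(m[2]) === Number(m[3]) ? "pass" : "fail";
}

checkAddition("We can verify by simple addition that 2 + 2 = 5."); // "fail"
checkAddition("and indeed 2 + 2 = 4, as expected");                // "pass"
checkAddition("no arithmetic here");                               // null
```

Because it's a regex plus integer arithmetic, the check costs microseconds, which is why this scenario never reaches the LLM tiers.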

Walkthrough · beat 1 of 7 · gate
The model claimed two plus two is five. Its own top-1 probability is 0.31 — the model itself wasn't sure. The gate escalates to the ladder.
should we even check this?
Looks at the model's token confidence to decide if the step needs inspection · <1ms
escalate · model was uncertain — run the ladder
top-1 prob · 0.31
entropy · 2.70
runner-up · 0.89
this step cost · 13ms (production estimate)
worst case · 8.2s (if every tier ran to budget)
saved · 99.8% (by short-circuiting)
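The gate decision and the savings figure both fall out of simple arithmetic. A hypothetical sketch of a gate built from the three signals above; the thresholds here are illustrative, not the production values.

```javascript
// Illustrative gate: escalate when any confidence signal looks shaky.
// Thresholds are made up for this sketch; the real gate is in the P102.x code.
function gate({ top1, entropy, runnerUp }) {
  const uncertain = top1 < 0.5 || entropy > 1.5 || runnerUp > 0.8;
  return uncertain ? "escalate" : "skip";
}

gate({ top1: 0.31, entropy: 2.7, runnerUp: 0.89 }); // "escalate", as in this scenario

// The savings figure: 13 ms actually spent vs the 8.2 s worst case.
const savedPct = 100 * (1 - 13 / 8200);
savedPct.toFixed(1); // "99.8"
```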

Every verdict on this page was produced by the actual P102.x streaming pipeline running over the example reasoning steps. The 15 deterministic rules, the T2.5 self-correct text protocol, the panel client, and the enforcer all ran for real on real input. The T2 LLM judge ran against a fake provider — see the public-demo caveat above. The build pipeline lives at demo/build-data.mjs in Banterpacks.

Each tier is governed by spec sections TDD-012 §3, §8, and §17, plus the P102.0–P102.19 patch arc. The unit-test suite under chimera/tests/streaming/ is the contract.