A live walkthrough · TDD-012 streaming ladder

Every reasoning step, checked five times.

A language model writes a long answer one reasoning step at a time. If a step is wrong, it poisons everything after it. This system inspects every step the moment it's written — cheap checks on every step, expensive ones only when the cheap ones can't resolve it. A final enforcer decides proceed, rewind, or degrade.
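The proceed / rewind / degrade decision can be sketched as a small dispatch. This is a minimal illustration with hypothetical function names, not the real P102.x API; the toy ladder stands in for all five tiers.

```javascript
// Hypothetical sketch of the enforcer's three outcomes (not the real API).
function enforce(step, ladder) {
  const verdict = ladder.check(step); // cheap tiers first, expensive on demand
  if (verdict === "pass") return { action: "proceed" };                // step is sound
  if (verdict === "fail") return { action: "rewind", to: step.index }; // regenerate from this step
  return { action: "degrade" };                                        // ship with a hedge
}

// Toy ladder: a single deterministic rule standing in for all five tiers.
const ladder = {
  check: (step) => (step.text.includes("2 + 2 = 5") ? "fail" : "pass"),
};

enforce({ index: 1, text: "2 + 2 = 5" }, ladder); // { action: "rewind", to: 1 }
```

The point of the shape: the enforcer never sees the tiers themselves, only the ladder's final verdict, so tiers can be added or swapped without touching the proceed/rewind/degrade logic.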

Most chatbots · 0 checks on intermediate reasoning
The model writes, the user reads. A wrong reasoning step ships.

Safety-filtered · 1 classifier at the end
Catches banned content after generation. Doesn't see the reasoning chain.

This system · 5 staged checks per step
Cheap deterministic rules first. An LLM judge when the rules can't resolve. The model second-guessing itself when the judge is uncertain. A multi-model panel only when the cheaper tiers disagree.
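The escalation order can be sketched as a short-circuiting loop over tiers. A minimal sketch, assuming made-up costs and verdict stubs (the real logic is the P102.x pipeline); the costs here are chosen to roughly match the page's 8.2s worst case.

```javascript
// Illustrative five-tier short-circuit: each tier returns a verdict,
// or null to defer to the next, more expensive tier. Names follow the
// page; costs and run() bodies are stand-ins, not production values.
const tiers = [
  { name: "T1 rules",        cost: 0.001, run: (step) => step.ruleHit ?? null },
  { name: "T2 judge",        cost: 0.5,   run: (step) => step.judgeSure ?? null },
  { name: "T2.5 self-check", cost: 1.2,   run: () => null },   // still unsure in this toy
  { name: "T3 panel",        cost: 6.5,   run: () => "pass" }, // the last tier must decide
];

function runLadder(step) {
  let spent = 0;
  for (const tier of tiers) {
    spent += tier.cost;                       // pay only for tiers that actually run
    const verdict = tier.run(step);
    if (verdict != null) return { verdict, tier: tier.name, seconds: spent };
  }
}

runLadder({ ruleHit: "fail" });  // caught by T1 for ~1 ms; T2+ never wake up
runLadder({});                   // nothing is sure → climbs all the way to T3
```

When the cheap tier resolves, the expensive tiers contribute zero cost; when nothing resolves, the total approaches the worst-case budget.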

Below is a working example. The walkthrough plays automatically — five real reasoning steps moving through the five-tier ladder, real verdicts, real costs.

Public-demo caveat: the T2 LLM judge runs against a deterministic fake provider for reproducibility. T1 (the 15-rule verifier), T2.5 (text-protocol self-correct), T3 (the panel client), and the enforcer are the real code paths. Cost numbers shown are production estimates, not the offline runner's in-process microseconds.
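"Deterministic fake provider" means the T2 judge is replayed from fixtures rather than called live. A minimal sketch of that idea, with a hypothetical shape (the real fake lives in the demo code, not shown here):

```javascript
// Fixture-backed judge: same prompt in, same canned verdict out, so every
// replay of the walkthrough produces identical T2 verdicts. The prompts
// and verdicts below are invented examples, not the demo's actual fixtures.
const canned = new Map([
  ["judge: 2 + 2 = 5",    { verdict: "fail", confidence: 0.97 }],
  ["judge: water is wet", { verdict: "pass", confidence: 0.91 }],
]);

function fakeJudge(prompt) {
  // Unknown prompts fall back to "uncertain", so the ladder escalates
  // to a more expensive tier instead of guessing.
  return canned.get(prompt) ?? { verdict: "uncertain", confidence: 0.5 };
}

fakeJudge("judge: 2 + 2 = 5"); // always { verdict: "fail", confidence: 0.97 }
```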

spec · docs/tdd/TDD012.md §3, §8, §17 + P102.0–19
generated · 2026-05-12
scenarios · 5
Reasoning step · index 1
demo-2 · flagged by gate
top-1 · 0.31 · entropy · 2.70 · runner-up · 0.89

We can verify by simple addition that 2 + 2 = 5.

The model gets the math wrong. The cheapest tier catches it in microseconds — the expensive tiers never need to wake up.
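A T1-style rule for exactly this failure can be sketched in a few lines. The real verifier has 15 deterministic rules; this addition check is an illustrative stand-in, not the real code.

```javascript
// Toy deterministic rule: verify any "a + b = c" claim in a step's text.
// Returns "pass"/"fail" when the rule applies, null when it doesn't
// (null is the ladder's signal to defer to a more expensive tier).
function checkAddition(text) {
  const m = text.match(/(\d+)\s*\+\s*(\d+)\s*=\s*(\d+)/);
  if (!m) return null;                                   // rule doesn't apply
  return Number(m[1]) + Number(m[2]) === Number(m[3]) ? "pass" : "fail";
}

checkAddition("We can verify by simple addition that 2 + 2 = 5."); // "fail"
checkAddition("and indeed 2 + 2 = 4, as expected");                // "pass"
checkAddition("no arithmetic here");                               // null
```

Because it's a regex plus integer arithmetic, the check costs microseconds, which is why this scenario never reaches the LLM tiers.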

Walkthrough · beat 1 of 7 · gate
The model claimed two plus two is five. Its own top-1 probability is 0.31 — the model itself wasn't sure. The gate escalates to the ladder.
should we even check this?
Looks at the model's token confidence to decide if the step needs inspection · <1ms
escalate · model was uncertain — run the ladder
top-1 prob · 0.31
entropy · 2.70
runner-up · 0.89
this step cost · 13ms (production estimate)
worst case · 8.2s (if every tier ran to budget)
saved · 99.8% (by short-circuiting)
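The gate decision and the savings figure both fall out of simple arithmetic. A hypothetical sketch of a gate built from the three signals above; the thresholds here are illustrative, not the production values.

```javascript
// Illustrative gate: escalate when any confidence signal looks shaky.
// Thresholds are made up for this sketch; the real gate is in the P102.x code.
function gate({ top1, entropy, runnerUp }) {
  const uncertain = top1 < 0.5 || entropy > 1.5 || runnerUp > 0.8;
  return uncertain ? "escalate" : "skip";
}

gate({ top1: 0.31, entropy: 2.7, runnerUp: 0.89 }); // "escalate", as in this scenario

// The savings figure: 13 ms actually spent vs the 8.2 s worst case.
const savedPct = 100 * (1 - 13 / 8200);
savedPct.toFixed(1); // "99.8"
```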

Every verdict on this page was produced by the actual P102.x streaming pipeline running over the example reasoning steps. The 15 deterministic rules, the T2.5 self-correct text protocol, the panel client, and the enforcer all ran for real on real input. The T2 LLM judge ran against a fake provider — see the public-demo caveat above. The build pipeline lives at demo/build-data.mjs in Banterpacks.

Each tier is governed by spec sections TDD-012 §3, §8, and §17, plus the P102.0–P102.19 patch arc. The unit-test suite under chimera/tests/streaming/ is the contract.