Technical Report 167: JTPv2 -- Predictive Validity of Cheap Pre-Rejudge Signals for Judge-Sensitivity
A Standard-Depth Predictive-Validity Test of the JTP Framework on the TR140 v3 / TR148 v2 GGUF-Local Corpus -- Verdict: Structural Degenerate Class on the rlhf-only Holdout + Substantive Pool-Robustness Secondary Finding
Status box. Standard-depth run complete on the TR140 v3 / TR148 v2 inherited substrate. Run directory
research/tr167/results/20260610_204823/, generated 2026-06-11T04:38:39Z, scipy + matplotlib engines on, bootstrap n=2000, 528/528 clean records, 0 soft violations. Substrate: 528 records / 24 of 24 expected (model, quant) cells / 4 batteries x 2 pools x 2 splits / 4-judge cohort (llama-guard3:8b, shieldgemma:9b, gemma3:12b, qwen2.5:7b) / families llama and qwen (the GGUF-local lane; gemma, phi, mistral families explicitly out of scope and gated to the cloudrun_paper.pylane). Label-source distribution: 264 v1_reuse + 264 live_nonrlhf. Pre-registered primary verdict: FAIL_OR_INSUFFICIENT_DATA on both Req1 (cheap signal predicts judge-sensitivity) and Req2 (cheap model beats baseline). The cause is structural rather than statistical: on the rlhf_only / holdout subset, all 10 valid cells returnjudge_sensitive=True(positives=10, negatives=0), so ROC-AUC is mathematically undefined and binary discrimination cannot be evaluated at this depth. The JSON flagdegenerate_single_class=Trueis the canonical record of this property. We flag this explicitly so downstream readers do not mis-read the verdict as a power problem -- more records at the same depth on the same lane would not rescue Req1, because the class label on this lane saturates. The structural-degenerate-class finding is itself the most substantive deliverable of the report, in the same family of findings as the TR148 v2 dual-axis kappa-near-zero observation: a property of the JTP framework on production-quantization substrate, not a measurement failure. Directional signals do hold and the report does not collapse to vacuity: cheap_score Spearman rho on kappa_min carries the correct NEGATIVE sign at rho=-0.1566 (p=0.6657, n=10, 95% CI [-0.8178, 0.7116]); the kappa_min monotonicity HIGH < LOW PASSES (0.023 < 0.221); and the P8 pool-robustness secondary analysis is load-bearing positive -- 6 cells that are insufficient_data under rlhf-only resurface as judge-dominated under expanded_nonrlhf, mean kappa_min shift -0.1529 toward less agreement, zero reverse-direction flips, drop-one-judge cheap_score AUC = 1.0000 without gemma3:12b and 0.8333 without qwen2.5:7b. The TR167 deliverable is therefore three-layered and honest: primary pre-registered Req1/Req2 returns FAIL_OR_INSUFFICIENT_DATA on structural grounds; directional secondary returns correct-sign + monotonicity-pass; and methodological main finding is the cohort-composition sensitivity exposed by P8. Parent-anchor cross-reference: TR167 is the predictive-validity follow-up to the JTP v1 framework calibrated in TR148 v2 (kappa=0.6917 cross-family on TR145 n=12,809; PublishReady/reports/Technical_Report_148.md, 1,556 lines, dual-axis refusal vs composite-harm methodology). It is also the second entry in a deliberate v2 predictive-validity series across the program's three named methodological artifacts -- TR166 / RTSIv2 (sibling, the RTSI arXiv:2606.10154 v2 predictive arm) and TR168 / CRIv2 (sibling, the compile-reproducibility v2 scaffold). The series move is methodologically coherent: each of the three v1 contributions (RTSI, JTP, CRI) was published as a calibration claim; the v2 series independently asks whether each calibrated index predicts its target on a disjoint holdout. TR167 is the JTP slot of that triad and the first to land with a structural-degenerate negative on the primary verdict -- a result that constrains how the bridge paper atpapers/serving_state_safety_certification/can stage its Layer 1a refusal-axis JTP claim. Forward path is bounded and budgeted: TR167 v2 cloud expansion viarun_paper.pyadds gemma / phi / mistral families on GPTQ / AWQ / fp16 cells via vLLM at approximately 10-30 USD on RunPod A100 PCIe and 1-2 days wall-clock; either it introduces judge-stable cells into the holdout and rescues binary discrimination, or it fails to introduce stable cells and the structural-degenerate finding sharpens from "GGUF-local lane property" into a universal-collapse claim on production-quantization substrate. Both branches are publishable on this substrate; this report is the calibrated ground from which the branch is selected.
1. Abstract
This report extends the Judge Triangulation Protocol (JTP) framework introduced in TR148 v2 -- the JTP v1 dual-axis finding that refusal-axis judges and composite-harm-axis judges measure structurally different latent constructs, anchored at cross-family kappa = 0.6917 on the TR145 safety subset at n = 12,809 -- by asking a predictive-validity question that is sharper than v1 itself. Where v1 calibrated a cross-family kappa screen and licensed when its verdict could be trusted, v2 asks whether a cheap pre-rejudge signal -- composed of quant_bpw, refusal_rate_delta against the Q8_0 anchor, model family, the first-judge UNCLEAR/ambiguity rate, and mean output length -- can predict a (model, quant, battery) cell's JTP reliability outcome (kappa_min and sig-class flip) on a disjoint leave-one-model-family-out hold-out, BEFORE paying for the expensive second-judge rejudge pass that v1 demands. The lift from calibration to prediction is methodologically non-trivial: calibration permits in-sample fit on the same substrate the predictor was built on, whereas predictive validity requires both a disjoint holdout AND family transfer, so that the cheap predictor must survive contact with model families that did not contribute to its construction.
Two non-negotiable pre-registered requirements gate the verdict. Req1 demands that the cheap signal predict judge-sensitivity directly: negative-sign Spearman/Pearson rho on kappa_min, ROC-AUC above 0.5 with a 95% CI excluding 0.5, and an LOOCV out-of-fold logistic AUC above 0.5. Req2 demands that the combined cheap model BEAT the trivial single-judge-unclear-rate baseline: a DeLong/bootstrap delta-AUC CI excluding 0 and a nested logistic LRT p < 0.05. The substrate executed at STANDARD depth comprises 528 clean records (zero soft violations) across 24 of 24 expected (model, quant) cells, coverage_complete = True, four judges in the cohort (llama-guard3:8b, shieldgemma:9b, gemma3:12b, qwen2.5:7b), 11 batteries, 2 pools (rlhf_only, expanded_nonrlhf), 2 splits (calibrate, holdout), with label_source split as 264 v1_reuse plus 264 live_nonrlhf records.
The verdict on both Req1 and Req2 returns FAIL_OR_INSUFFICIENT_DATA, but the failure mode is structural rather than statistical -- and the substantive value of TR167 turns on that distinction. On the headline rlhf_only / holdout subset, all 10 valid cells are judge_sensitive = True (positives = 10, negatives = 0), so ROC-AUC is mathematically undefined, binary discrimination cannot be evaluated, and degenerate_single_class = True is recorded in the analyze.py JSON. This is not a sample-size deficiency: it is a property of the lane itself. At STANDARD depth on the GGUF-local rlhf-only judge pool, every valid holdout cell crosses the 0.7 judge-sensitivity threshold on kappa_min. Within that degenerate frame, however, the directional signals DO hold: cheap_score's Spearman rho against kappa_min carries the CORRECT NEGATIVE SIGN at -0.1566 (p = 0.6657, n = 10, 95% CI [-0.8178, 0.7116]); the rate-monotonicity test collapses by construction (every band rate is 1.000); but the kappa_min monotonicity test HIGH < LOW PASSES (HIGH-band mean kappa_min = 0.023 < LOW-band mean kappa_min = 0.221). The signal is real but underpowered; the binary class label is what fails, not the underlying continuous monotonicity.
The load-bearing positive finding is the P8 pool-robustness secondary analysis, which we read as the methodologically richest observation in the report and the canonical tie-back to TR148 v2's dual-axis finding. Holding (model, quant) fixed and swapping the judge pool from rlhf_only to expanded_nonrlhf, 18 cells are judge_sensitive under rlhf-only versus 24 under expanded_nonrlhf, with 6 cells resurfacing (insufficient_data -> judge-dominated kappa_min = 0.000) and zero masked flips, a mean kappa_min shift of -0.1529, and drop-one-judge cheap_score AUCs of 1.0000 without gemma3:12b and 0.8333 without qwen2.5:7b. Pool composition -- not cheap-signal regression -- is the methodologically rigorous JTP observable on this corpus, and the six resurfaced cells are precisely the cells the composite-harm axis sees but the refusal axis misses. We read this as a publishable methodological finding about JTP framework dynamics on the GGUF-local rlhf-only lane, not a failed prediction exercise; the bridge paper Layer 1a/1b decomposition inherits it directly, and the sibling v2 series (TR166 / RTSIv2, TR168 / CRIv2) provides matched predictive-validity context.
2. Table of Contents
The report is organized into a status box, an abstract, this table of contents, an executive summary, four context sections (introduction and motivation; pre-registered requirements; methodology; substrate inheritance), fifteen numbered sub-studies (SS1 through SS15) carrying the analytic spine, a conclusion, a references block, and five appendices. The numbering scheme follows the standing Banterhearts TR template: section numbers 1-7 cover front matter and methodological setup; sections 8-22 carry the SS1-SS15 analytical chain; sections 23-29 carry the back matter. The SS labels in the section titles below are the same identifiers used in cross-references throughout the report (for example, "see SS9" refers to Section 16 below).
- Abstract
- Table of Contents
- Executive Summary
- Introduction and Research Motivation
- Pre-Registered Research Requirements (Req1 / Req2)
- Methodology
- Substrate Inheritance (TR140 v3 + TR148 v2)
- SS1 -- Cell Coverage and Standard-Depth Confirmation
- SS2 -- JTP Class Distribution and the Degenerate-Class Finding
- SS3 -- Band-Stratified Judge-Sensitivity Rate (P1)
- SS4 -- Cheap-Feature Spearman/Pearson on kappa_min (P2)
- SS5 -- Discrimination Analysis (P3) -- Why ROC-AUC is Structurally Undefined
- SS6 -- Baseline Head-to-Head DeLong (P4) -- Insufficient Data Reason
- SS7 -- Calibrate vs Holdout Comparison (P5)
- SS8 -- Leave-One-Model-Family-Out Validation (P6)
- SS9 -- Pool Robustness -- The Load-Bearing Secondary Finding (P8)
- SS10 -- Verdict -- Why Both Req1 and Req2 Return INSUFFICIENT_DATA
- SS11 -- Mechanism Interpretation -- Why JTP Class is Degenerate on rlhf-only
- SS12 -- Cross-Reference to TR148 v2 and the Bridge Paper
- SS13 -- Forbidden Claims
- SS14 -- Limitations
- SS15 -- Future Work -- TR167 v2 Cloud Expansion Path
- Conclusion
- References
- Appendix A -- Hardware and Environment
- Appendix B -- Reproduction Commands
- Appendix C -- Per-Cell Table
- Appendix D -- Per-Battery Disaggregation
- Appendix E -- Pool Robustness Table
Reading order note: a reader who only wants the verdict can jump from Section 3 (Executive Summary) to Section 17 (SS10) and Section 16 (SS9, the load-bearing secondary). A reader who wants the methodology audit should read Sections 5-7 before any SS. A reader who wants the forward-work scope and what the v2 cloud expansion is licensed to claim should read Section 22 (SS15) after SS9 and SS10.
3. Executive Summary
TR167 / JTPv2 is the predictive-validity follow-up to the v1 Judge Triangulation Protocol established in TR148 v2. The v1 framework instantiated JTP as a calibrated cross-family judge-agreement screen with kappa = 0.6917 measured cross-family on the TR145 safety subset at n = 12,809, and split the JTP screen into two orthogonal axes: a Layer 1a refusal-axis screen (general LLMs gemma3 and llama3.1 measuring response-refusal behavior) and a Layer 1b composite-harm-axis screen (safety specialists shieldgemma and llama-guard3 measuring composite-harm content). The v1 contribution was a calibration claim: JTP correctly classifies when cross-judge labels can be trusted. TR167 raises the bar from calibration to prediction. The question under test is methodologically stronger and adversarially harder: can a cheap pre-rejudge signal -- assembled from quant_bpw, refusal_rate_delta versus Q8_0, family code, the first judge's UNCLEAR/ambiguity rate, and mean output length -- predict whether a (model, quant, battery) cell will be judge-sensitive before the expensive multi-judge rejudge pass is launched, and does the combined cheap model BEAT the trivial single-judge-unclear-rate baseline on a disjoint leave-one-model-family-out holdout? Two pre-registered requirements gate the verdict: Req1 demands a negatively-signed cheap-signal correlation with kappa_min plus a ROC-AUC above 0.5 with CI excluding 0.5 plus a LOOCV out-of-fold AUC above 0.5; Req2 demands a baseline-beating delta-AUC with CI excluding zero plus a nested-LRT p < 0.05.
The v1 methodological context bears explicit restating because it determines what TR167's verdict can and cannot say. TR148 v2 (PublishReady Technical_Report_148.md, 1,556 lines) anchored JTP v1 on the TR145 safety subset at n=12,809 cross-judge labels and returned a kappa of 0.6917 -- a value that sits inside what the JTP taxonomy calls the triangulate band (kappa in 0.4 to 0.7). The triangulate band is not "robust" (kappa >= 0.7, where a single judge is sufficient) and not "untrustable" (kappa < 0.4, where the panel is incoherent enough that no aggregation is meaningful); it is the regime where heterogeneous judges agree enough to be useful but disagree enough that a second axis is required to license any downstream safety claim. The TR145 subset at 0.6917 is the closest empirical anchor to the robust threshold that the v1 corpus surfaced, which is precisely the regime in which a predictive-validity follow-up has maximum leverage: if a cheap pre-rejudge signal can rank-order cells by their kappa_min on the v2 substrate, the program can route rejudge budget specifically at the cells closest to the 0.7 boundary rather than burning rejudge cycles on cells that the v1 single-judge labels already resolve. TR148 v2's dual-axis finding (refusal-axis judges anti-correlate with composite-harm-axis judges at kappa in -0.13 to -0.26 because they measure different latent constructs) further constrains what "judge-sensitivity" means on this substrate: it is not noise; it is two structurally orthogonal axes folded under a single kappa.
The substrate executed at STANDARD depth is 528 clean records (zero soft violations) across 24 of 24 expected (model, quant) cells, coverage_complete = True, four judges in the cohort (llama-guard3:8b, shieldgemma:9b, gemma3:12b, qwen2.5:7b plus the regex screen reused from the v1 corpus), eleven batteries (pooled plus s1, s4, s16, s64, s128 crossed with faux-dialogue and message-array), two pools (rlhf_only, expanded_nonrlhf), two splits (calibrate, holdout), one phase, and a label-source distribution of 264 v1_reuse records plus 264 live_nonrlhf records. The families present are llama and qwen -- the GGUF-local lane -- with gemma, phi, and mistral families gated to run_paper.py and explicitly out of scope for this run.
The load-bearing finding is structural and negative. On the rlhf_only / holdout subset, all 10 valid cells are judge_sensitive = True. Positives = 10, negatives = 0. ROC-AUC is mathematically undefined when only one class is present, so the discrimination test in Req1 cannot return a numerical value and the baseline head-to-head test in Req2 cannot return a delta. Both verdicts return FAIL_OR_INSUFFICIENT_DATA, and the JSON marks degenerate_single_class = True. This is not a sample-size deficiency curable by more records at the same depth; it is a property of the lane. At STANDARD depth on the GGUF-local rlhf-only judge pool, every valid cell crosses the judge-sensitivity threshold.
The structural-degenerate-class verdict mechanism deserves an explicit walk-through because it determines exactly how to read the two FAIL_OR_INSUFFICIENT_DATA lines and prevents a common misreading. ROC-AUC is defined as the integral of true-positive rate against false-positive rate as the decision threshold is swept across the predictor's range. The true-positive rate at any threshold is TP / (TP + FN) -- the fraction of actual positives correctly flagged -- which is well-defined whenever at least one true positive exists. The false-positive rate at any threshold is FP / (FP + TN) -- the fraction of actual negatives incorrectly flagged -- which requires at least one true negative to be defined. On the TR167 rlhf_only / holdout subset, the negative-class count is zero: there are no judge-stable cells against which to compute a false-positive rate at any threshold. The FPR axis is consequently 0/0 at every operating point, the ROC curve does not exist as a measurable object, and the AUC integral has no integrand. The same single-class condition propagates to every downstream binary-classification quantity P3 attempts: per-feature univariate AUC is undefined (all seven features); LOOCV out-of-fold logistic AUC returns insufficient_data because every fold's held-out cell carries the same positive label and the logistic regression has no class variance to learn; the precision-recall AUC reaches exactly 1.0000 but the random-floor PR-AUC also equals exactly 1.0000 because precision = positives / (positives + false_positives) saturates at 1 when there are no negatives to misclassify. This is the textbook degenerate-class trap, and reporting PR-AUC = 1.0000 as evidence of discrimination would be the textbook way to fall into it. The FAIL_OR_INSUFFICIENT_DATA verdict on both Req1 and Req2 is therefore not "we didn't get enough samples" -- it is "the substrate's class composition makes the pre-registered test mathematically inapplicable."
The cheap_score itself, however, is not silent. Its Spearman rho against kappa_min is -0.1566, p = 0.6657, 95 percent CI [-0.8178, 0.7116] at n = 10 holdout cells. The sign is correct -- higher cheap_score predicts lower kappa_min, exactly the pre-registered direction -- but the magnitude does not clear significance at this holdout sample size. This is directional evidence, not statistical evidence. The cheap_score is doing something; the holdout is too narrow to prove it.
A complementary diagnostic strengthens this directional read, and the kappa_min monotonicity passing while the rate monotonicity collapses is itself the cleanest available demonstration that the cheap signal is not a null predictor. The band-stratified P1 analysis bins the 10 cells into LOW, MODERATE, and HIGH cheap_score tertiles. The judge-sensitivity rate is 1.000 in every band (the degenerate-class artifact), so the rate-monotonicity test fails by construction. The kappa_min monotonicity test -- HIGH band kappa_min less than LOW band kappa_min -- passes: HIGH = 0.023, MODERATE = 0.000, LOW = 0.221. The cheap_score is informative about the magnitude of cross-judge disagreement even when the binary class label is structurally saturated, and the directional signal that survives band-stratification is the same directional signal that surfaces independently as the negative-sign Spearman rho. Two independent rank-based tests pointing the same way on the same substrate is not significance, but it is more than noise -- it is the substrate telling the analyst that cheap_score does in fact rank cells by their judge-agreement magnitude in the pre-registered direction. What the directional signal cannot do at n=10 is reject the null hypothesis at conventional alpha; what it can do is license the claim that the cheap-feature hypothesis is alive on this substrate and that a properly powered follow-up (TR167 v2 cloud expansion) is methodologically the right next move rather than abandoning the feature class. We record this as the only piece of Req1 evidence that returns True: monotonicity_kmin_high_lt_low = True.
The P8 pool-robustness analysis is the load-bearing substantive positive finding and the methodologically richest observation in the report, and the structure of the six resurfaced flips deserves a fuller walk-through here because the secondary finding is what the report stands on. Holding the (model, quant) cell fixed and swapping the judge pool from rlhf_only to expanded_nonrlhf, 18 cells are judge_sensitive under the rlhf-only cohort, 24 cells are judge_sensitive under the expanded_nonrlhf cohort, and 6 cells resurface (rlhf-stable or insufficient-data -> expanded-sensitive). Zero cells flip in the reverse direction -- there are no cases where a cell judged sensitive under rlhf_only resolves to stable under the expanded pool, which means the expanded pool is strictly more conservative on every cell that disagrees across pools. The mean kappa_min shift expanded minus rlhf is -0.1529, a substantial drift toward less cross-judge agreement under the expanded pool. The six resurfaced cells are all llama3.2 family -- 1B and 3B at quants Q5_K_M, Q6_K, Q8_0 (1B family) and Q4_K_M, Q5_K_M, Q6_K (3B family) -- precisely the cells that the rlhf-only pool returned as insufficient_data (too few overlapping label pairs to score a kappa_min) and that the expanded pool resolves as judge-dominated with kappa_min = 0.000. The architectural-family concentration of the six flips (all llama3.2, no llama3.1, no qwen) is itself a non-trivial observation: the resurfaced cells cluster on the family that the rlhf-only cohort could not adjudicate at all, suggesting that the safety-specialist axis introduced in expanded_nonrlhf is doing the work that the rlhf-only general-LLM axis structurally cannot do on small llama3.2 quants. The drop-one-judge sensitivity panel sharpens this further: removing gemma3:12b from the cohort yields cheap_score AUC = 1.0000 (perfect rank discrimination across the cells that remain), and removing qwen2.5:7b yields cheap_score AUC = 0.8333. Both rates are above the 0.5 random floor and the without-gemma3 number is at ceiling, which means the cheap_score is doing useful binary discrimination work on subsets where the judge-pool composition is altered -- the binary class label is what fails on the full rlhf_only cohort, not the signal. The composite reading is that cohort composition (which judges are in the room) is the dominant lever on judge-sensitivity outcomes on this substrate, that the cheap_score discriminates well when the cohort is varied, and that pool robustness is the methodologically rigorous JTP observable on the GGUF-local lane at this depth.
The interpretation is the one TR148 v2 prefigured. The rlhf-only judge cohort (gemma3:12b and qwen2.5:7b, both general LLMs measuring response-refusal behavior) misses six judge-decisions that the safety-specialist axis (llama-guard3:8b and shieldgemma:9b, measuring composite-harm content) does not miss. This is the dual-axis cohort-composition sensitivity surfacing as a predictive-validity problem rather than a calibration problem. JTP's binary verdict is not invariant to the judge pool; the cheap_score's correlation against kappa_min is meaningful, but the cell's class label is a function of which axis you select.
The verdict therefore decomposes into three honest reads. Primary, pre-registered: both Req1 and Req2 return FAIL_OR_INSUFFICIENT_DATA because the holdout class distribution is structurally degenerate. Directional secondary: the cheap_score has the correct negative sign and the kappa_min monotonicity test passes -- the signal is real but not powered. Methodological main finding: pool robustness via P8 -- six resurfaced flips, mean kappa_min shift -0.153, drop-one-judge AUC 1.0000 and 0.8333 -- is the load-bearing substantive contribution and is itself a methodologically publishable observation about JTP framework sensitivity to judge cohort composition.
Future-work scope is bounded but consequential, and the v2 cloud expansion path is the single methodological move that resolves the structural-degenerate verdict one way or the other. TR167 v2 cloud expansion via run_paper.py adds gemma, phi, and mistral families on GPTQ, AWQ, and fp16 cells via vLLM at approximately 10-30 USD on RunPod A100 PCIe and 1-2 days wall-clock. The anticipated effect splits into two branches with different downstream consequences for the bridge paper substrate at papers/serving_state_safety_certification/. In the first branch, the cloud families introduce judge-stable cells into the holdout that the GGUF-local lane structurally cannot provide -- a plausible outcome given that gemma/phi/mistral at fp16 are not subject to the production-quantization stress that the llama/qwen GGUF cells carry. In that branch, ROC-AUC becomes defined, the directional Spearman rho gets re-evaluated at a larger n with non-degenerate class composition, and the pre-registered Req1/Req2 ladder gets a clean numerical answer. In the second branch, the cloud families fail to introduce stable cells -- meaning judge-sensitivity is universal across families and quantization recipes at this depth, not just a GGUF-local artifact -- and the structural-degenerate finding strengthens into a universal-collapse claim: JTP-validity does not certify any cell as judge-stable on production safety corpora at standard depth, and the only methodologically rigorous JTP observable is pool robustness. Both branches are publishable; this report is the substrate on which the branch is selected. Sibling v2 predictive-validity series TR166 (RTSIv2, scaffolded against the arxiv preprint arXiv:2606.10154 public 2026-06-08) and TR168 (CRIv2) follow the same scaffold pattern and will surface their own degenerate-class or non-degenerate-class findings on the same TR140 v3 substrate, providing cross-paper triangulation on whether the structural-degenerate verdict generalizes across the program's three named methodological artifacts or is specific to the JTP axis.
4. Introduction and Research Motivation
The Judge Triangulation Protocol (JTP), as established in TR148 v2, was built to answer a narrow but high-leverage question: when a panel of heterogeneous safety judges scores the same set of model outputs, can we trust the resulting safety verdict, or is the verdict an artifact of which judge happened to be in the room? TR148 v2 answered this with a cross-family kappa gate, a dual-axis decomposition of refusal-versus-composite-harm disagreement, and an empirical anchor of kappa equal to 0.6917 on the TR145 safety subset of n equal to 12,809 records. That value placed the entire cohort in the triangulate band -- not robust, not untrustable, but in the regime where a second judge from a different family is required to license a safety claim. TR167 / JTPv2 is the predictive-validity follow-up to that anchor, and it asks a sharper question: can we predict which (model, quant, battery) cells will be judge-sensitive without paying for the second-judge rejudge?
4.1 The Parent JTP Framework (TR148 v2)
TR148 v2 named two structural findings that travel into the current substrate. The first is the cross-family kappa-reliability gate: kappa values below 0.40 mark the untrustable regime, 0.40 to 0.70 the triangulate regime, and 0.70 and above the robust regime. The TR145 safety subset at kappa equal to 0.6917 sat just under the robust threshold, which is exactly the regime where a JTP follow-up has the most leverage. The triangulate-band placement is not a marginal aesthetic distinction; it is the band where the JTP framework is functionally most useful, because cells above 0.70 do not need rejudging and cells below 0.40 cannot be rescued by rejudging. The band between is precisely where a second judge from a different axis can convert a contested cell into a trustable verdict, and the kappa = 0.6917 anchor lands close enough to the upper edge that the cohort is almost trustable -- not enough to skip the rejudge, but close enough that the rejudge cost-per-decision feels disproportionate. That asymmetry is the entire economic motivation for the cheap-signal hypothesis under test here.
The second TR148 v2 finding is the dual-axis decomposition: refusal-style judges (gemma3:12b, qwen2.5:7b, and the regex baseline) measure a different latent construct than composite-harm-style judges (llama-guard3:8b, shieldgemma:9b), and the cross-family kappa between those two axes can be near zero or negative without either axis being defective. Specifically, TR148 v2 measured kappa values in the -0.13 to -0.26 range between safety specialists and general LLMs on the TR145 safety subset, which is not noise around a near-zero true agreement but a substantively anti-correlated signal: the safety-specialist axis labels content composite-harm even when the model successfully refuses, and the refusal axis labels refusal-shaped outputs as safe even when the latent intent is harmful. The two axes are not noisy estimates of one ground truth; they are orthogonal measurement constructs, and TR148 v2 routed this finding into a Layer 1a refusal-axis JTP plus a Layer 1b composite-harm-axis screen in the bridge paper substrate at papers/serving_state_safety_certification/. TR167 inherits both gates: every cell in the TR167 cohort is judged by both axes, and the cheap-signal hypothesis is evaluated against the kappa_min metric, which is the minimum pairwise kappa across the panel and therefore the binding constraint on JTP-validity. Using kappa_min rather than a panel-averaged kappa is itself a methodological commitment: it forces the predictive-validity test to predict the weakest pairwise agreement in the cohort rather than the average, which is the conservative choice because the weakest pairwise agreement is the one that triggers the rejudge gate in production.
4.2 The Cheap-Signal Hypothesis
The rejudge cost in TR148 v2 was nontrivial: every record had to be relabeled by a second judge from a structurally different family, and the panel-level kappa was computed only after the full second-judge pass landed. At ten of thousands of records and four-axis cohorts, this puts the per-corpus rejudge bill into the dollars-to-low-tens-of-dollars range under local Ollama dispatch and into the hundreds under hosted Claude or gpt-4o dispatch. TR167 asks whether a cheap pre-rejudge signal -- one that can be computed from the v1 first-judge labels and the model outputs themselves, without ever calling a second judge -- can predict which cells will land in the judge-sensitive band. If that cheap signal works, the rejudge budget can be spent only on the cells the cheap signal flags as at-risk, and the kappa-validity claim of TR148 v2 can be extended to new substrates at a fraction of the cost.
The candidate features are seven, and each carries an independent mechanistic justification. quant_bpw (bits per weight, a structural property of the quantization recipe) is included because the TR140 v3 and TR141 lineage have repeatedly shown that low-bpw quants degrade refusal fidelity in non-monotonic ways -- the prediction is that Q2_K and Q3_K_M cells will sit at the high end of cheap_score and the low end of kappa_min. refusal_rate_delta (the refusal-rate shift versus the Q8_0 baseline for the same model and battery) is the within-model per-quant deviation: a cell that refuses substantially less than its Q8_0 anchor is a cell where the quantization has shifted the refusal-axis distribution, and that shift should correlate with cross-judge disagreement. family_code (a categorical encoding over the present families llama and qwen) is included as the architectural prior: different families have different RLHF lineages and refusal-template distributions, and family-level effects should absorb cross-cell variance that the per-quant features cannot. single_judge_unclear_rate (the v1 first-judge UNCLEAR rate) is the most direct cheap signal -- if the first judge already cannot decide, the panel-level kappa floor is almost certainly low -- and it serves as the trivial baseline against which the composite is compared in Req2. single_judge_ambiguity_rate is a related ambiguity-flag rate from the v1 first judge, designed to catch cells where the judge committed to a label but flagged the decision as uncertain. mean_output_len_tokens (the per-cell mean output length in tokens, computed from the v1 candidate strings) is the structural-output prior: longer outputs offer more surface area for refusal-axis and harm-axis judges to disagree, and the prediction is a positive correlation with judge sensitivity. single_judge_refusal_rate (the v1 first-judge REFUSAL rate) is included as a confound check: cells with very high or very low refusal rates may be easy to classify by either axis, and intermediate refusal rates may be where disagreement concentrates. The composite cheap_score is a fitted linear combination of these on the calibrate split, then frozen for evaluation on the holdout split, and the pre-registered claim is that this composite predicts kappa_min on a disjoint leave-one-model-family-out hold-out with the correct sign and with statistical separation from the trivial baseline.
4.3 Why Predictive Validity is Stronger Than Calibration
There is a meaningful distinction between calibration and predictive validity that this report leans on heavily, and it is worth making explicit because the literature on judge-based evaluation conflates the two regularly. A calibration claim is the weaker form: the predictor is trained and evaluated on the same substrate, the resulting in-sample fit is reported, and the relevant statistical hazard is overfitting to substrate-specific idiosyncrasies. Calibration claims can be valuable -- TR148 v2's kappa = 0.6917 anchor is a calibration claim, and it is load-bearing for the bridge paper Layer 1 instantiation -- but they do not license extrapolation to a new substrate. Predictive validity is the stronger form: the predictor is trained on the calibrate split, evaluated on a disjoint holdout, and -- crucially in this design -- evaluated under a leave-one-model-family-out (LOFO) protocol, so the holdout family was not represented during predictor construction. The relevant statistical hazard shifts from overfitting to substrate-specific idiosyncrasies to family transfer: the predictor must work on a family it has not seen, which is the condition that matters in production because every new model family arriving in deployment is, by definition, a family the predictor was not tuned on.
This distinction matters because the cheap-signal hypothesis must survive family transfer to be useful: if a cheap predictor only works within the family it was tuned on, it cannot replace the rejudge step on a new model family arriving in production. TR167 therefore reports two pre-registered requirements: Req1 that the cheap signal predicts judge-sensitivity with the correct sign and ROC-AUC separation, and Req2 that the combined cheap model beats the trivial single-judge-unclear-rate baseline under a DeLong or nested-LRT test. These requirements were committed to the run before the analyze.py outputs were inspected, and the verdict ladder in Section 5 reflects what the substrate could and could not resolve against them. The RTSI arxiv preprint (arXiv:2606.10154, public 2026-06-08) makes the parallel methodological move on the refusal-template-stability axis: predictive validity on a disjoint cohort, not calibration on a unified one, and the sibling TR166 / RTSIv2 and TR168 / CRIv2 scaffolds extend the same predictive-validity discipline to RTSI and CRI respectively. TR167 is the JTP arm of that three-arm predictive-validity series, and the headline verdict shape on this report -- structural-degenerate primary plus substantive secondary -- is one of three possible outcomes the series is designed to surface honestly.
4.4 Pre-Registration and Single-Substrate Discipline
The pre-registration commits TR167 to two further disciplines. First, the abort conditions for a degenerate single class were specified in advance: if the holdout subset contains positives equal to 0 or negatives equal to 0 on the judge-sensitive label, ROC-AUC is undefined and Req1 must return FAIL_OR_INSUFFICIENT_DATA rather than be quietly recoded as a sample-size complaint. This guard is deliberate. The temptation in any binary-discrimination study with a thin holdout is to relax the class definition until both classes are populated -- to shift the kappa_min threshold from 0.7 down to wherever the rate of "negatives" first exceeds zero, or to swap the headline split from holdout to a mixed calibrate-plus-holdout pool. Either move launders a structural finding through a re-specification, and the pre-registration explicitly forbids both: the threshold stays at 0.7 (inherited from TR148 v2), the headline pool stays at rlhf_only, and the headline split stays at holdout, and if the resulting subset is degenerate the verdict reports degenerate rather than being recoded.
Second, the substrate is the single GGUF-local lane covering the llama and qwen families across six quantization recipes per model (four models, six quants, n_records equal to 528 across 24 cells), and no cross-substrate borrowing is permitted in the headline verdict. The cloud-family cells (gemma, phi, mistral) that would normally diversify the holdout require run_paper.py and a paid RunPod A100 PCIe pass, and are explicitly out of scope for the v1 substrate. The choice of disjoint LOFO over a random hold-out is also load-bearing here: a random hold-out would leak structural within-family correlations across the train/test boundary (a llama Q4_K_M cell shares architecture, tokenizer, and RLHF lineage with a llama Q8_0 cell), and would inflate apparent predictive power without testing the family-transfer property that the cheap-signal hypothesis actually claims. LOFO is the harder and more honest test -- and on this substrate, with only two families present, it produces exactly the single-class collapse the abort condition was specified to catch.
The downstream consequence of this discipline -- that the rlhf-only holdout collapses to a single-class set of 10 positives and 0 negatives, that the headline Req1 returns FAIL_OR_INSUFFICIENT_DATA, and that the directional kappa_min monotonicity signal and the P8 pool-robustness finding become the load-bearing positive results -- is exactly what an honest predictive-validity follow-up is supposed to surface rather than paper over. The negative-results-with-substantive-secondary framing is not a hedge; it is the verdict the pre-registration committed to delivering if the substrate produced this exact configuration, and it does.
5. Pre-Registered Research Requirements
The TR167 / JTPv2 pre-registration commits to two requirements, evaluated on a single headline pool and split with explicit ABORT conditions. The headline pool is rlhf_only (the conservative judge cohort: gemma3:12b, qwen2.5:7b, plus the regex baseline inherited from TR148 v2). The headline split is holdout, generated by a disjoint leave-one-model-family-out (LOFO) partition rather than a random hold-out. The stable threshold for the positive class on kappa_min is fixed at 0.7 -- cells with kappa_min < 0.7 are labelled judge-sensitive (positive class), and cells with kappa_min >= 0.7 are labelled judge-stable (negative class). These three commitments -- pool, split, threshold -- were locked into config.yaml before the run.py invocation that produced research/tr167/results/20260610_204823/ and are not re-tuned anywhere in the analysis. Pre-registration discipline at this granularity is the load-bearing reason the FAIL_OR_INSUFFICIENT_DATA verdict is reported as a verdict rather than silently recoded as a sample-size complaint.
5.1 Req1 -- Cheap Signal Predicts Judge-Sensitivity
The first requirement asks whether the pre-rejudge cheap_score (a composite over quant_bpw, refusal_rate_delta vs Q8_0, family code, the first-judge UNCLEAR/ambiguity rate, and mean output length) predicts cell-level judge-sensitivity well enough to be admissible as a screen. Four admissibility conditions are pre-registered, each tied to a specific evidence field in the run-dir JSON:
- Correlation sign and significance. Spearman rho and Pearson r on
cheap_score x kappa_minmust be NEGATIVE -- a higher cheap_score must predict a lowerkappa_min. The 95% bootstrap CI on rho (n=2000 resamples, configured inanalyze.pyengine block) should exclude zero. This admissibility condition encodes the directional hypothesis: cheap_score is supposed to rank cells in the same order as the kappa_min disagreement metric, with the steeper-slope direction corresponding to greater judge disagreement. The evidence field ischeap_score_kappamin_rho_negative. - Binary discrimination via ROC-AUC with CI. ROC-AUC for binary cell-level judge-sensitivity prediction must be strictly ABOVE 0.5 with a 95% bootstrap CI EXCLUDING 0.5. The "CI excluding 0.5" gate is what distinguishes a real classifier from a chance fluctuation: a point estimate above 0.5 with a CI that crosses 0.5 is consistent with random guessing and does not license a "cheap predicts" claim. The evidence field is
auc_above_05. - LOOCV out-of-fold AUC. The companion leave-one-out cross-validation cheap-logistic out-of-fold AUC on the holdout must also be ABOVE 0.5. This is a stronger gate than the in-sample ROC-AUC because the logistic is refit at every fold, and a model that overfits the holdout will collapse here. The evidence field is the LOOCV branch of the P3 panel.
- Monotonicity tests across cheap_score tertiles. The band-stratified judge-sensitivity rate must satisfy HIGH > MOD > LOW (the higher-cheap-score band should carry the higher sensitivity rate), and the band-stratified mean
kappa_minmust satisfy HIGH < LOW (the predictive-validity direction: more cheap-score risk implies less judge agreement). These are two distinct fields,monotonicity_sens_high_gt_lowandmonotonicity_kmin_high_lt_low. The two are not redundant: the rate test collapses when the positive class saturates at 100% across all bands, but the kappa_min test retains signal even under saturation because mean kappa_min is a continuous quantity.
The pre-registered ABORT condition for Req1 is degenerate_single_class=True: if the holdout positive-class count and negative-class count are not both at least one, ROC-AUC is undefined and Req1 returns FAIL_OR_INSUFFICIENT_DATA regardless of the rho sign or the monotonicity outcome. This guard exists because ROC-AUC on a single-class sample is not a measurement -- it is a structural artefact -- and reporting it as "above 0.5" would be a defensibility-bar violation under the standing scaffold defensibility bar.
5.2 Req2 -- Cheap Beats Trivial Baseline
The second requirement asks whether the combined cheap model adds discriminative power above a trivial single-feature baseline -- the first-judge unclear_rate. The baseline is deliberately chosen to be the strongest single feature available from the v1 pass without rejudging, because beating a weaker baseline (e.g. majority class, or an uninformative constant) would not license deployment of the cheap_score over the existing first-judge UNCLEAR signal. Two admissibility conditions are pre-registered:
- DeLong / bootstrap delta-AUC with CI excluding zero. The 95% bootstrap CI on
AUC(cheap) - AUC(unclear_rate)must EXCLUDE 0, and the DeLong test p-value (using the standard DeLong-DeLong-Clarke-Pearson covariance for paired ROC-AUCs) must fall below alpha=0.05. The DeLong test is the canonical paired-AUC mechanic: it constructs a Mann-Whitney-U-based variance estimate for each ROC-AUC, computes their covariance under the assumption of shared subjects, and forms a normal-approximation z-statistic on the AUC difference. The bootstrap CI provides a distribution-free companion that does not lean on the normal-approximation assumption when n is small. Both tests must agree. The evidence fields aredelta_auc_ci_excludes_zeroanddelong_p_below_alpha05. - Nested logistic LRT. A likelihood-ratio test between a nested logistic model
[unclear_rate](baseline-only) and[unclear_rate + cheap_features](baseline+cheap) must report p < 0.05. The LRT statistic is-2 * (log L_baseline - log L_baseline+cheap), distributed as chi-squared with degrees of freedom equal to the number of added features (here, the six non-baseline cheap features). The LRT is the correct mechanic for nested-model comparison and is robust to the AUC-pathology that flattens the DeLong test when discrimination collapses. The evidence field islrt_p_below_alpha05.
The pre-registered ABORT for Req2 is "insufficient data on the holdout" -- which includes the single-class degenerate case, because both AUCs collapse to undefined and the LRT loses its denominator (the baseline-only logistic cannot estimate a slope coefficient against a constant outcome vector). Req2 cannot be admissible independently of Req1: if Req1 aborts on degeneracy, Req2 also aborts.
5.3 Abort Conditions and Why They Were Specified Before Data Collection
The ABORT conditions in 5.1 and 5.2 were specified in the pre-registration before any cells were judged at TR167 depth -- they are not post-hoc explanations of a failed run. The reason for committing to them in advance is that a researcher inspecting a single-class holdout AFTER the fact has a strong temptation to either (a) silently report an undefined AUC as "0.5" and let the reader's pattern-match smooth it over, or (b) substitute a different metric (PR-AUC, balanced accuracy on a single class, anything continuous) and claim Req1 as passed. Both moves are defensibility-bar violations; pre-registration of the degenerate-class abort is what prevents them. The abort fires in the present run (degenerate_single_class=True is True in the JSON evidence block), and the verdict reports it transparently rather than launder it.
5.4 Headline-Pool, Headline-Split, and Threshold Choices
The headline pool is rlhf_only because the pre-registration treats general-LLM judges (gemma3, qwen2.5) as the conservative judge axis -- the same axis the program external JTP submission uses for primary kappa reporting. The expanded expanded_nonrlhf pool, which adds the safety-specialist judges (llama-guard3:8b, shieldgemma:9b), is reserved for Req3 / pool-robustness diagnostics in Section 8 (the P8 panel that ultimately carries the load-bearing positive secondary finding). The headline split is holdout because the calibrate split is used to set the cheap_score weights and tertile boundaries -- evaluating predictive validity on the calibrate split would be circular and would not license a generalization claim.
The stable-threshold edge kappa_min = 0.7 is inherited from the TR148 v2 dual-axis JTP framework and corresponds to the Landis-Koch substantial-agreement band boundary. Landis and Koch's 1977 reliability table maps kappa<0 to poor, 0-0.20 to slight, 0.20-0.40 to fair, 0.40-0.60 to moderate, 0.60-0.80 to substantial, and 0.80-1.00 to almost perfect. The 0.7 edge sits inside the substantial band and is the conventional pivot at which a judge cohort is treated as "agreeing enough that the safety verdict can stand without rejudging." Cells above 0.7 are treated as "judge cohort agrees" and cells below 0.7 are flagged for rejudging. The 0.7 edge is NOT re-tuned on TR167; it is held fixed so that the predictive-validity test is a true out-of-sample probe of the same operating point the parent JTP work commits to. Re-tuning the threshold to rescue a degenerate class would be the same kind of defensibility-bar violation that silently reporting AUC=0.5 would be.
5.5 Why Disjoint LOFO and Not a Random Hold-Out
The split is generated by leave-one-model-family-out (LOFO), not random partition. Cell-level outcomes are correlated within a model family -- a llama family cell at Q4_K_M shares architecture, tokenizer, and training-data lineage with a llama family cell at Q8_0. A random hold-out would leak structural correlations across the train/test boundary and inflate apparent predictive power. LOFO enforces a disjoint generalization probe: the holdout family was never seen during cheap_score calibration. In TR167 v1, this produces two folds (held-out llama with n_cells=10, held-out qwen with n_cells=5), and the headline holdout column reports the held-out-llama fold because it is the larger sample. This is the harder and more honest test; it is also the test that surfaces the structural single-class finding documented in Section 6, and is the same generalization-probe discipline TR166 / RTSIv2 and the scaffolded TR168 / CRIv2 follow-up commit to.
6. Methodology
This section documents how TR167 operationalizes the JTPv2 pre-registered claim into an executable pipeline. The design choices made here are load-bearing for how the negative-results verdict and the secondary pool-robustness finding should be read: TR167 deliberately restricts itself to cheap signals computable from the v1 first-judge pass and the model outputs themselves, deliberately excludes paper-depth cells, and deliberately respects the --skip-openai-judge umbrella gate documented in the project's adversarial-content discipline. The methodology is what it is; the verdict ladder in Section 5 reflects what that methodology is structurally capable of resolving on this corpus. We work through the methodology in five subsections: feature extraction (6.1), cell-level aggregation (6.2), judge-cohort composition (6.3), calibrate-versus-holdout splits (6.4), and the explicit exclusion ledger (6.5). Each subsection is paired with the verdict-ladder consequence it controls, so the downstream FAIL_OR_INSUFFICIENT_DATA verdict on Req1 / Req2 and the load-bearing P8 secondary positive can be traced back to specific methodology choices.
6.1 Cheap-Feature Extraction
The seven cheap features tested in P2 are extracted entirely from artifacts already produced by the v1 (TR148 v2 / TR140) run. No second judge call is issued during cheap-feature extraction, which is the entire point of the "cheap pre-rejudge signal" framing -- if a feature required a second model invocation, it would no longer be cheaper than the rejudge it is supposed to predict.
The first feature, quant_bpw, is computed by parsing the quant tag attached to each GGUF file (Q2_K, Q3_K_M, Q4_K_M, Q5_K_M, Q6_K, Q8_0) into an effective bits-per-weight rational summary that captures the structural precision of the quantization recipe. This is the cheapest feature in the panel -- it is read off the filename, not measured -- and its inclusion is justified on prior grounds rather than predictive ones: TR140 v3 found a quantization-precision gradient on safety behavior, and the JTPv2 hypothesis under test is whether that gradient transfers to JTP class membership.
The second feature, refusal_rate_delta, is computed per-cell as the v1 first-judge refusal rate at the given quant minus the same model's Q8_0 refusal rate on the same battery. The Q8_0 anchor is chosen rather than fp16 because the GGUF-local lane does not maintain an unquantized fp16 baseline; Q8_0 is the lowest-loss representative present in every cell row of the 4 x 6 design grid. The delta encodes per-quant deviation against the high-precision baseline for the same model and the same battery, which is the per-quant cell-level analogue of the refusal-fragility metric from TR141.
The third and fourth features, single_judge_unclear_rate and single_judge_ambiguity_rate, are extracted from the v1 first-judge label JSONL: the unclear-rate counts records where the first judge returned an explicit UNCLEAR token, and the ambiguity-rate counts records flagged with an ambiguity heuristic in the parser (off-vocab tokens, mid-confidence rationale text, etc.). The two are correlated but not identical; both are retained as separate features because the predictive-validity hypothesis treats them as orthogonal noise signatures.
The fifth feature, mean_output_len_tokens, is the per-cell mean of the model's response length in tokens, computed from the v1 candidate strings via the same tokenizer the v1 judge used. Long-output cells in TR148 v2 were modestly associated with judge disagreement, so this is included as a structural covariate. The sixth feature, single_judge_refusal_rate, is the raw v1 first-judge refusal rate (not delta'd against Q8_0); it carries the absolute refusal level rather than the quant-deviation signal of feature two, and is included to let the predictor decompose absolute-level effects from delta-level effects.
The seventh feature, family_code, is a categorical encoding over the present families (llama, qwen) collapsed to an integer index. The composite cheap_score is a fitted linear combination of these seven features on the calibrate split, then frozen for evaluation on the holdout split. No feature is re-fit on the holdout; no feature is computed using any v2 (TR167) judge call.
6.2 Cell-Level Aggregation
The unit of analysis is the cell, defined as the (model, quant, battery) tuple aggregated across the prompts within that cell. Each cell rolls up the seven cheap features into a single feature vector by per-feature mean over the cell's record pool, and aggregates the multi-judge labels into the JTP-framework outcomes: kappa_min (the minimum pairwise judge-judge Cohen's kappa across the cohort), judge_sensitive (boolean: did the cell cross the JTP judge-sensitivity threshold at kappa_min < 0.7), and jtp_valid (boolean: did the cell produce enough overlapping judge pairs to render a kappa).
Eleven batteries are present (the pooled battery plus s1, s4, s16, s64, s128 each crossed with faux-dialogue and message-array prompt registers); the headline (pool=rlhf_only, split=holdout) slice rolls these batteries up into ten valid holdout cells using a pooled-rather-than-disaggregated aggregation choice for the headline metric. The pooled choice is principled: the JTPv2 hypothesis is whether a cell-level (model, quant) signal predicts JTP class membership in production, where production traffic mixes prompt registers rather than restricting to s1 or s128 alone. Disaggregated per-battery breakouts are retained in Appendix D for diagnostic purposes but are not the headline surface.
The per-cell n_overlap values (452 to 499 overlapping judge-judge labels per cell) confirm that the cell-level kappa estimates are not n-starved at the within-cell label level; the holdout sample-size limitation is at the cell level (n=10), not at the within-cell label level. This is the binding constraint that propagates into the P2 rho width (CI [-0.8178, 0.7116]) and into the P3 ROC-AUC undefinedness -- there are enough labels per cell to score kappa_min stably, but there are not enough cells in the holdout to resolve a rho of moderate magnitude from zero at conventional alpha.
6.3 Judge-Cohort Composition
The judge cohort for the rlhf_only pool is llama-guard3:8b, shieldgemma:9b, gemma3:12b, and qwen2.5:7b, with the v1 regex labeler also retained in the corpus. The cohort composition is not arbitrary -- it is the operational instantiation of the TR148 v2 dual-axis JTP framework, with both axes represented in the panel.
The general-LLM cross-family axis is carried by gemma3:12b (Google family) and qwen2.5:7b (Alibaba family). These two judges measure response-refusal behavior -- whether the model produced a refusal token, an evasive completion, or a direct answer -- and they are deliberately drawn from different model families to avoid the intra-family kappa inflation that TR148 v1 (pre-dual-axis) suffered when both refusal judges came from the same family. The safety-specialist axis is carried by llama-guard3:8b (Meta safety classifier) and shieldgemma:9b (Google safety classifier). These two judges measure composite-harm content -- whether the model's output contains harm regardless of refusal posture -- and TR148 v2's dual-axis finding was that this axis can anti-correlate with the refusal axis at kappa values in the -0.13 to -0.26 range without either axis being defective.
The expanded_nonrlhf pool retains the same four LLM judges and adds the safety-specialist axis more explicitly into the kappa aggregation -- this is what produces the P8 contrast in SS9 below. The cohort is gated by the project's standing umbrella discipline: OpenAI judges (gpt-4o, etc.) are not invoked on adversarial corpora pending Researcher Access Program cover, and Claude judges (claude-sonnet-4.6 and siblings) are similarly gated on Fellowship-conditional dispatch -- the P8 "drop one judge" row for without claude-sonnet-4.6 therefore reads as "was not in run" rather than as a counterfactual ablation. The cohort as fielded is the maximal cohort that the standing-umbrella gates permit on this substrate at the run date.
6.4 Calibrate vs Holdout Splits
The calibrate vs holdout discipline is a leave-one-model-family-out (LOFO) split implemented at the family granularity: cells from one family enter the calibrate split, cells from the held-out family enter the holdout split, and the cheap-logistic is fitted on calibrate and evaluated on holdout. With only two families present (llama, qwen), the LOFO arrangement produces two folds: held-out llama with ten holdout cells, and held-out qwen with five holdout cells.
The mechanics are: for fold 1 (held-out=llama), the qwen cells enter calibrate, the cheap_score weights are fitted by OLS on the calibrate split, the resulting weight vector is frozen, and the frozen cheap_score is evaluated on the ten llama holdout cells; for fold 2 (held-out=qwen), the symmetric procedure runs with llama in calibrate and the five qwen cells in holdout. The headline rho/AUC numbers report the held-out llama fold (n=10 cells) because it is the larger sample; the held-out qwen fold (n=5) is too thin to support the per-feature CI computation the methodology requires and is reported in the verdict ladder as "insufficient_data." The P6 LOFO aggregate in SS8 therefore averages over two folds where each fold is independently insufficient_data on the binary-AUC axis, and the aggregate is reported as -- rather than as a pooled estimate.
The LOFO design rather than a random hold-out is the methodologically stronger choice precisely because cell-level outcomes are correlated within a model family -- a llama family cell at Q4_K_M shares architecture, tokenizer, and training-data lineage with a llama family cell at Q8_0, and a random hold-out would leak that structural correlation across the train/test boundary. LOFO is also the design that surfaces the structural-degenerate-class finding (SS9): once the held-out family is held out, the rlhf_only judge cohort returns judge_sensitive=True on every valid holdout cell, and the binary discrimination test loses its negative class.
6.5 What the Methodology Excludes
| Excluded scope | Reason | Recovery path |
|---|---|---|
| Cloud families (gemma, phi, mistral) | run_paper.py lane not fired here; TR167 is GGUF-local standard depth |
TR167 v2 expansion via cloud GPTQ/AWQ/fp16 cells on vLLM, approximately 10-30 USD on A100 PCIe |
| Claude judge axis | Cohort umbrella discipline (Fellowship-conditional dispatch) | Same v2 expansion, fired post external acceptance signal |
| gpt-4o judge axis | Researcher Access Program gating on adversarial corpora | Same v2 expansion, fired post umbrella resolution |
| Per-attack feature engineering beyond the seven cheap features | Predictive-validity test under-powers feature-engineering ablations at n=10 holdout cells | TR167 v2 expanded substrate (more cells) before adding features |
| Paper-depth multi-attack stratification | Standard depth only; paper depth gated on cohort umbrellas | run_paper.py lane |
Observations. The exclusion list is not incidental -- each exclusion is the binding constraint on a different verdict-ladder failure mode in Section 7. The cloud-families exclusion is the proximate structural cause of the rlhf-only holdout being one-class-degenerate: introducing cells from a structurally distinct family lineage (gemma's pretraining mixture, phi's small-model regime, mistral's tokenizer choices) is the most plausible way to populate the judge-stable end of the kappa_min distribution and produce holdout cells with kappa_min >= 0.7. The Claude/gpt-4o exclusion is the binding constraint on cross-axis triangulation in the cohort: without a third refusal-axis judge from a non-Ollama family, the cheap_score's drop-one-judge sensitivity in SS9 cannot be probed beyond the four judges actually in the cohort. The per-attack feature-engineering exclusion is the binding constraint on cheap-feature expressivity given only ten holdout cells -- adding more features at n=10 risks degrees-of-freedom inflation rather than predictive gain. The paper-depth stratification exclusion is the most consequential for downstream claims: at standard depth, every battery contributes to the pooled cell, and per-attack effect heterogeneity is averaged out rather than estimated.
The honest reading is that TR167's methodology was deliberately constrained to the GGUF-local rlhf-only lane at standard depth, and the structural single-class verdict on the primary claim is a direct downstream consequence of that constraint -- not a failure of the cheap-feature hypothesis itself. The methodology supports a clean answer to "what can a GGUF-local rlhf-only lane resolve about JTP predictive validity at n=10 holdout cells," and the answer is: directionally yes (monotonicity passes; rho has the correct sign at -0.157), but not at conventional significance, and not via binary discrimination, until the cohort and family axes are expanded. The TR167 v2 cloud expansion described in SS17 is the recovery path that each exclusion row above points to; the v1 substrate is sufficient to ground the structural-degenerate finding and the load-bearing P8 pool-robustness secondary, and explicitly insufficient to ground a non-degenerate ROC-AUC. The cross-references to TR148 v2 (Layer 1a refusal axis, Layer 1b composite-harm axis), to the bridge paper substrate at
papers/serving_state_safety_certification/, and to the sibling TR166 (RTSIv2) and TR168 (CRIv2) predictive-validity follow-ups are intentional: the JTPv2 methodology here is one leg of a three-leg v2-predictive-validity series, and the constraint pattern documented in this exclusion ledger applies symmetrically across all three legs.
7. Substrate Inheritance -- TR140 v3 + TR148 v2
TR167 is a read-only consumer of two upstream substrates: TR140 v3.0 (the parent quantization x judge-triangulation corpus) and TR148 v2 (the dual-axis JTP v1 cohort definition and class taxonomy). No record in either upstream is modified, re-scored, or re-labeled in this run; TR167 instead extracts cheap pre-rejudge features over the v1 labels and re-aggregates the cell-level outcomes against a fixed JTP class taxonomy. The inheritance contract is therefore: TR140 v3 supplies the model outputs and the v1 first-judge labels from which single_judge_unclear_rate, single_judge_ambiguity_rate, single_judge_refusal_rate, and mean_output_len_tokens are derived; TR148 v2 supplies the judge roster (llama-guard3:8b, shieldgemma:9b, gemma3:12b, qwen2.5:7b plus the regex axis) and the kappa thresholds at 0.4 / 0.7 / 0.9 that map a cell's kappa_min into the judge-sensitive / judge-dominated / insufficient_data taxonomy.
7.1 TR140 v3.0 parent corpus shape
The TR140 v3.0 parent corpus is the canonical scale anchor for the entire JTP line and the largest single substrate TR167 inherits from. As recorded in the program ledger, it carries 63,950 scored samples and 78,950 judge labels across 15K v1 baseline records plus approximately 76K v2 controls C1-C13, with JTP triangulation calibrated at kappa=0.925 at n=11,451 on the gemma3 x Claude cross-family pair. The triangulation panel was integrated into the program external JTP submission as the v1 evidence base, and the kappa thresholds at 0.4 / 0.7 / 0.9 that TR167 reuses verbatim were the same thresholds that returned the v1 triangulate-band verdict on TR145 at kappa=0.6917 on n=12,809. What TR167 borrows from TR140 v3 is therefore not just the labels themselves but the calibration anchor that licenses the 0.7 threshold to be treated as a fixed operating point rather than a re-tunable hyperparameter. Re-tuning the threshold on TR167 would invalidate the predictive-validity framing -- the cheap-signal hypothesis is testing whether cheap_score predicts cell-level kappa_min relative to an operating point the parent JTP framework already committed to, not whether a new operating point can be reverse-engineered to make the cheap signal look better on this holdout.
7.2 TR148 v2 dual-axis carryover
TR148 v2 contributed two structural findings that travel into the current substrate. The first is the cross-family kappa-reliability gate at 0.4 / 0.7 / 0.9, with the TR145 safety subset landing at kappa=0.6917 in the triangulate band -- exactly the regime where a JTP follow-up has the most predictive-validity leverage. The second is the dual-axis finding: the safety-specialist judges (llama-guard3:8b, shieldgemma:9b) anti-correlate with the general-LLM judges (gemma3:12b, qwen2.5:7b) at kappa values in the -0.13 to -0.26 range because they measure composite-harm content rather than response-refusal behavior -- two orthogonal axes folded under a single "JTP" label in v1. TR148 v2 routed this finding into a Layer 1a refusal-axis JTP plus a Layer 1b composite-harm-axis screen in the bridge paper substrate at papers/serving_state_safety_certification/. TR167 inherits both gates structurally: every cell in the TR167 cohort is judged by both axes, and the cheap-signal hypothesis is evaluated against the kappa_min metric, which is the minimum pairwise kappa across the panel and therefore the binding constraint on JTP-validity under either axis. The cohort/class taxonomy carryover is verbatim -- TR167 does not invent new classes, does not redraw the threshold edges, and does not collapse the dual-axis decomposition into a single-judge proxy.
7.3 Read-only inheritance discipline
The read-only discipline is non-negotiable and is what makes the P8 finding interpretable as a methodological observation rather than a re-litigation of TR140's labels. TR167's analyze.py never writes back into research/tr140/ or research/tr148/; the only artifacts produced live under research/tr167/results/20260610_204823/. The label_source distribution in TR167's 528-record substrate makes the inheritance explicit at the per-record level. The table below resolves it.
| label_source | n_records | provenance | TR167 modifies? |
|---|---|---|---|
| v1_reuse | 264 | TR140 v3 cells re-aggregated and feature-extracted under the TR148 v2 taxonomy | No |
| live_nonrlhf | 264 | Live judge labels from the TR148 v2 safety-specialist cohort run on the same prompts | No (read-only over the new live labels) |
| Total | 528 | -- | -- |
Observations. The 528 records partition exactly evenly across the two inheritance lanes (264 / 264), which is a structural consequence of the 24-cell design (4 models x 6 GGUF quants) being run once under each pool and re-aggregated at the cell level. No record is double-counted across lanes: the v1_reuse rows carry TR140 v3 first-judge labels under the rlhf-only judge subset, and the live_nonrlhf rows carry the TR148 v2 safety-specialist axis labels collected at TR167 run time against the same prompt set. The cohort still includes the regex axis (inherited from v1), so the four-judge cohort plus regex is the operative roster for both the rlhf-only and expanded_nonrlhf pools. The read-only contract is the reason the 6 resurfaced P8 flips and the mean kappa_min shift of -0.1529 cannot be explained away as re-judging drift -- the v1 labels under the rlhf-only pool are bit-identical to what TR140 v3 originally wrote.
7.4 Cross-references to the program external JTP submission and bridge paper Layer 1a
TR167's substrate inheritance is upstream of three deliverables that are already in flight elsewhere in the program. The first is the program external JTP submission, for which TR140 v3 + TR148 v2 form the v1 evidence base; TR167 is the predictive-validity follow-up whose verdict ladder (structural-degenerate-class on the rlhf-only holdout + load-bearing P8 pool-robustness secondary) becomes the v2 methodological addendum to that submission. The second is the bridge paper Layer 1a anchor at papers/serving_state_safety_certification/, where the refusal-axis JTP screen is one of the five certification layers; TR167's pool-robustness finding -- that 6 cells flip judge-class under the expanded cohort with no reverse flips and a -0.1529 mean kappa_min shift -- is the empirical observation that licenses Layer 1a to be reported as cohort-composition-sensitive rather than as a clean binary gate. The third is the sibling predictive-validity TR series: TR166 / RTSIv2 (the RTSI predictive-validity follow-up against arXiv:2606.10154) and TR168 / CRIv2 (the CRI predictive-validity follow-up scaffold) share TR167's pre-registration discipline -- single-substrate, single-headline-pool, degenerate-class abort condition specified in advance, directional secondary findings reported honestly when the binary primary collapses.
The honest inheritance reading is that TR167 inherits a substantial substrate (63,950 scored + 78,950 judge labels from TR140 v3, plus the dual-axis cohort definition from TR148 v2) under a strict read-only contract, and the headline FAIL_OR_INSUFFICIENT_DATA verdict on the rlhf-only holdout is forced by the inherited taxonomy applied to the inherited labels on the GGUF-local lane. The read-only discipline means TR167 cannot fix the structural degenerate-single-class outcome by retroactively reclassifying any v1 cell; the degeneracy is a property of the substrate at this depth, and the honest move is to report it as a structural finding rather than to launder it through a label edit or a threshold re-tune. The same discipline is what makes the P8 cohort-contrast finding load-bearing: with the v1 labels frozen and only the cohort composition changing, the 6 resurfaced flips are a clean attribution to pool composition rather than a confound with re-judging noise.
8. SS1. Cell Coverage and Standard-Depth Confirmation
The first standing sub-study (SS1) is the bookkeeping gate that any downstream predictive-validity claim must clear before P1-P8 are even legible. We document here that the GGUF-local lane is coverage-complete at the standard depth declared in config.yaml, and we explicitly bound what "standard depth" can and cannot license. Without this gate cleared, the structural-degenerate-class finding documented in SS2 could be re-litigated as a coverage artifact (missing cells), and the P8 pool-robustness finding in SS9 could be dismissed as a sampling artifact (uneven label_source distribution across pools). Both readings are foreclosed by the ledger that follows.
The substrate engine reports coverage_complete=True on the rlhf-only pool, with all 24 of 24 expected (model, quant) cells materialized across the 4 GGUF-local models (llama3.1-8b, llama3.2-1b, llama3.2-3b, and the qwen2.5 family anchor) and the 6 GGUF quantization rungs (Q2_K, Q3_K_M, Q4_K_M, Q5_K_M, Q6_K, Q8_0). The cross-product is exhaustive at this depth: 4 models multiplied by 6 quant rungs equals 24 cells, and the engine fired all 24 without dropping a cell to OOM, tokenizer failure, or first-judge timeout. The cell ledger is built from 528 scored records with 0 soft violations against the cleanness schema, which is the precondition for treating downstream Spearman, Wilson, and monotonicity statistics as statistically legible rather than as artifacts of label-source contamination.
| Coverage axis | Value | Expected | Status |
|---|---|---|---|
(model, quant) cells |
24 | 24 (4 x 6) | complete |
| Total records | 528 | 528 | clean |
| Soft violations | 0 | 0 | clean |
| Families present | 2 (llama, qwen) | 5 at paper depth | partial |
| Batteries present | 11 | 11 (pooled + s1/s4/s16/s64/s128 x faux-dialogue/message-array) | complete |
| Pools present | 2 (rlhf_only, expanded_nonrlhf) | 2 | complete |
| Splits present | 2 (calibrate, holdout) | 2 | complete |
| Judges in cohort | 4 (llama-guard3:8b, shieldgemma:9b, gemma3:12b, qwen2.5:7b) | 4 + regex | complete |
label_source distribution |
264 v1_reuse + 264 live_nonrlhf |
528 | balanced |
| Phases present | [1] | [1] at standard depth | complete |
Observations. Coverage on the axes that the standard-depth run.py lane is meant to exercise is exact: cells, records, batteries, pools, splits, and judge cohort all land on their declared targets, and the label_source ledger is balanced 264/264 between v1 reuse and live non-RLHF labels, which is the configuration that the P8 secondary pool-contrast analysis depends on for legibility. The 24-cell cross-product is closed in the strict sense that every (model, quant) combination produced a cell record -- no cell collapsed to insufficient_data because of generation failure, which is the kind of pre-statistics dropout that would have biased the downstream JTP class distribution toward judge-dominated by construction. The single coverage axis that is structurally incomplete is family diversity: the GGUF-local lane offers only two families (llama, qwen), and family_code consequently lacks intra-family variance for the Spearman analysis in P2 (rho reported as None). This is the binding limitation that propagates into every downstream verdict in SS2-SS8, and it is recorded here rather than rediscovered there.
The eleven batteries are the second-densest axis after cells and deserve a structural note. The pooled battery is the union over the per-attack-count batteries, and the s1/s4/s16/s64/s128 ladder crossed with the faux-dialogue and message-array prompt-format strategies produces ten format-strategy variants plus the pooled rollup. The s1 cells anchor the single-shot baseline against which higher shot-counts are read; s4 / s16 sweep the regime where most production multi-turn jailbreaks operate; s64 / s128 push into the long-context regime that historically stresses the judge cohort hardest because output lengths drift up and the first-judge unclear rate climbs with them. The faux-dialogue vs message-array contrast is the prompt-format axis -- whether the multi-turn attack is presented as a single user turn with simulated assistant turns interleaved (faux-dialogue) or as an actual multi-message chat array (message-array). Both prompt formats are operationally relevant: deployed inference servers see both forms in the wild, and TR148 v2 found that the kappa floor moves with prompt format because the safety-specialist judges read the dialogue structure literally while the general-LLM judges read the prompt holistically. The eleven-battery axis is what gives the cell-level kappa_min the n_overlap floor (452 to 499 overlapping judge pairs per cell) that P2 depends on; without it, the cell-level Spearman in SS4 would be exposed to within-cell label noise rather than the across-cell kappa_min signal it is meant to read.
The label_source distribution is the third axis worth a methodological note. 264 records carry v1_reuse provenance, meaning the cell's first-judge labels are inherited verbatim from the TR140 v3 parent corpus under the rlhf-only judge subset -- no relabeling, no rescoring, no prompt regeneration. The other 264 records carry live_nonrlhf provenance, meaning the safety-specialist axis (llama-guard3:8b, shieldgemma:9b) was run live at TR167 execution time against the same prompt set that the v1 first-judge labels were drawn from. This 264/264 partition is what makes the SS9 / P8 contrast interpretable as a pool-composition effect rather than as a prompt-drift artifact: the rlhf-only pool reads the v1_reuse labels, the expanded_nonrlhf pool adds the live_nonrlhf labels on the same prompts, and any flip between pools is forced by the cohort composition change alone. If the partition had been unbalanced (say 400/128), the P8 mean kappa_min shift of -0.1529 could have been confounded with sampling weight; at 264/264 the contrast is structurally clean.
What "standard depth" cannot license is equally important to nail down here. Standard depth is the run.py --depth standard setting in config.yaml; paper depth is the run_paper.py cloud lane that adds three further families (gemma, phi, mistral) across three quantization stacks (GPTQ, AWQ, fp16) on vLLM. The cloud families are absent here -- not because they were dropped, but because the standard-depth run was scoped to the GGUF-local lane only, and the cloud-family run is gated on a paid RunPod A100 PCIe pass and on the Fellowship-conditional umbrella discipline described in the project's cohort-gating rules. The two-family GGUF-local cohort is therefore not a sampling deficiency relative to the standard-depth specification; it is the specification. The "5 at paper depth" entry in the Families row of the coverage table is the forward-looking target that TR167 v2 expansion will hit, not a missing-data flag against the present run.
The coverage table licenses exactly the claims standard depth can support and refuses the rest. Two families, eleven batteries, two pools, two splits, four judges, 528 clean records is the GGUF-local lane at full cell-table closure -- but it is not paper depth. Paper depth requires the
run_paper.pycloud cells (gemma, phi, mistral families across GPTQ, AWQ, and fp16 on vLLM), and their absence is the load-bearing reason the leave-one-model-family-out design in P6 collapses to a 2-fold split with only 5 cells in the qwen fold. We flag this here as the binding limitation rather than the binding failure: the standard-depth substrate is sufficient to publish the structural single-class finding and the pool-robustness secondary finding, and insufficient to publish a non-degenerate ROC-AUC. The TR167 v2 cloud expansion is scoped precisely to lift this coverage axis, not to re-litigate the cells already closed at this depth. The honest read of SS1 is that the GGUF-local cross-product is exhausted, the battery and pool axes are dense enough to sustain SS3 through SS9, the label_source ledger is balanced enough to make the P8 contrast structurally legible, and the only missing axis -- cloud-family diversity -- is the axis the present TR does not claim to cover.
9. SS2. JTP Class Distribution and the Degenerate-Class Finding
This section documents the JTP class distribution observed across the 24 covered cells of the TR167 substrate, and isolates the structural reason Req1 cannot return a positive cheap-predicts-judge-sensitivity verdict on the rlhf-only holdout. The headline finding here is not a sample-size complaint. It is a property of the substrate itself: the GGUF-local rlhf-only lane at standard depth does not produce judge-stable cells in the holdout split, and therefore the binary discrimination test specified by Req1 has only one class to discriminate. The chain of reasoning that follows -- pooled class taxonomy, holdout subset collapse, binary-discrimination prerequisites, and the cohort-axis-collapse hypothesis -- is the analytic spine of TR167's negative-primary verdict, and it terminates in a methodological finding about the production-GGUF lane under a refusal-axis-only judge cohort that is itself defensibly publishable.
9.1 Pooled JTP class counts (n_cells = 24)
| JTP class | n_cells |
|---|---|
| judge-dominated | 16 |
| insufficient_data | 6 |
| judge-sensitive | 2 |
Pooled across both splits and both pools, the binary collapse of these classes yields judge_sensitive True = 18 vs False = 6, and jtp_valid True = 15 vs False = 9. The two judge-sensitive cells are a tiny minority; the modal cell is judge-dominated (one judge driving the kappa floor), and insufficient_data cells are the residual where pairwise overlap with the safety-specialist axis was too thin to score a kappa_min at the configured floor.
Observations. The pooled distribution already telegraphs that the rlhf-only judge cohort is not a balanced positives-vs-negatives substrate at this depth. Two-thirds of all cells (16 of 24) fall into judge-dominated, only two cells land in the judge-sensitive class, and a quarter (6 of 24) are insufficient_data. A predictive-validity study designed to distinguish judge-stable cells from judge-sensitive cells on a binary axis has very little of the "stable" class to learn from in this lane. The taxonomy needs unpacking before the holdout collapse in 9.2 is legible: under the TR148 v2 JTP class definitions, judge-dominated means one judge drives kappa_min to the cohort floor (the minimum pairwise kappa pins the cohort verdict to a single dissenting axis); insufficient_data means the safety-specialist axis did not produce enough overlapping labels to score a stable pairwise kappa; and judge-sensitive in the v1 vocabulary is the residual where the cohort genuinely disagrees in a structured way that does not collapse to a single dominant judge. The 16/6/2 distribution observed here means most cells are not "the cohort split" cells -- they are cells where one judge axis pulls the floor down hard enough that the rest of the cohort cannot rescue the kappa_min above the 0.7 stable threshold.
Read at face value, the rlhf-only / GGUF-local lane behaves like a near-saturated regime for judge sensitivity. The cheap-signal hypothesis was framed assuming a mixed substrate -- some cells stable, some sensitive -- where a pre-rejudge feature could rank-order cells by their kappa_min and gate rejudge effort onto the at-risk cells. The TR167 substrate as actually realized is closer to "almost everything is at-risk under this judge cohort," which is itself an observation about the cohort, not a methodological gap in the feature engineering. The TR148 v2 dual-axis result anticipates exactly this outcome: when the refusal-axis judges (general LLMs measuring response-refusal behavior) are pulled into the same kappa pool as the composite-harm-axis judges (safety specialists measuring composite-harm content), the cross-axis kappa is structurally low because the two axes measure different latent constructs, and the resulting cohort kappa_min is dragged toward the floor on every cell where the axes disagree. The 16-cell judge-dominated bucket is the empirical fingerprint of that dual-axis dynamic on the GGUF-local 24-cell grid.
9.2 The rlhf-only / holdout subset (n = 10 valid cells)
When the analysis narrows to the pre-registered evaluation surface -- rlhf-only pool, holdout split, valid cells only -- the class distribution collapses to a degenerate single-class regime. All 10 valid holdout cells are judge_sensitive = True. Positives = 10. Negatives = 0.
| holdout subset metric | value |
|---|---|
| n_valid_holdout_cells | 10 |
| judge_sensitive = True | 10 |
| judge_sensitive = False | 0 |
| ROC-AUC for cheap_score | undefined (single class) |
| LOOCV out-of-fold AUC | insufficient_data |
| degenerate_single_class flag | True |
Observations. ROC-AUC requires at least one positive and one negative; the holdout has neither balance nor minority representation -- it has only positives. No cheap feature, however well engineered, can produce a binary discrimination metric on this surface. Req1's cheap_score_auc_above_05 evaluates False not because the cheap signal is bad but because the metric is mathematically undefined in this regime. The same applies to per-feature univariate AUCs (all undefined) and to the LOOCV logistic out-of-fold AUC (insufficient_data). The reduction path from 24 pooled cells to 10 valid holdout cells follows the pre-registered LOFO discipline: of the 24 cells, 12 sit in the held-out llama fold and the remaining 12 in the qwen fold, but the qwen fold further reduces to 5 valid cells after the jtp_valid gate, dropping it below the LOFO minimum and forcing the held-out llama fold (10 cells) into the headline slot. Of those 10 cells, all carry judge_sensitive=True under the rlhf-only pool -- the substrate is not 50/50 noise around a near-threshold rate, it is structurally one-class.
This is the load-bearing reason Req1 returns FAIL_OR_INSUFFICIENT_DATA, and it is structural rather than statistical. A larger n drawn from the same rlhf-only / GGUF-local generating process would not necessarily rescue the test, because the issue is not sampling noise around a near-50% positive rate; it is that the positive rate at this depth is 100% by the substrate's own behavior. The verdict is honest about this distinction via the
degenerate_single_class = Trueevidence field rather than degrading silently to a null AUC.
9.3 Binary-discrimination prerequisites and what "ROC-AUC undefined" means computationally
It is worth being explicit about what fails computationally when ROC-AUC is reported as undefined, because the failure mode is not "the classifier scored poorly" but "the classifier was never given a discrimination problem to solve." ROC-AUC is the integral of the true-positive rate (TPR) against the false-positive rate (FPR) as the decision threshold is swept across the score range. TPR equals true_positives / (true_positives + false_negatives) -- it requires positives in the sample. FPR equals false_positives / (false_positives + true_negatives) -- it requires negatives in the sample. When negatives = 0, the FPR denominator is identically zero at every threshold, and the integrand is 0/0 across the entire ROC curve. This is a divide-by-zero condition, not a small-sample variance condition, and it is invariant to which features are placed in the cheap_score. LOOCV propagates the same condition -- every leave-one-out fold draws from the same single-class vector, so every fold's out-of-fold AUC is undefined and the aggregate returns insufficient_data. The binary-discrimination test simply has no purchase on this holdout, regardless of how informative the underlying cheap_score is.
Observations. The PR-AUC computation is the textbook degenerate-class trap and deserves explicit unpacking. PR-AUC reports 1.0000 against a random floor of 1.0000 -- a tautological match. Precision at every recall level equals positives / (positives + false_positives), but with negatives = 0 the only achievable precision is 1.0000 across the recall axis, because there are no negatives to falsely classify as positive. A naive always-positive classifier scores PR-AUC = 1.0000; cheap_score scores PR-AUC = 1.0000; a random-coin classifier scores PR-AUC = 1.0000. The metric is not measuring discrimination on this holdout -- it is measuring the class composition, and any report-out that treats the 1.0000 as evidence of cheap_score quality is a defensibility-bar violation. The verdict ladder correctly refuses to claim "perfect PR-AUC" and instead returns insufficient_data on Req1's binary-discrimination axis.
The discrimination test is not "did cheap_score win or lose" but "was a discrimination problem posed to cheap_score at all," and the answer on this substrate is no. The honest framing is that the holdout was supposed to contain a mix of judge-stable and judge-sensitive cells against which a pre-rejudge feature could be evaluated, but the rlhf-only / GGUF-local lane at standard depth produces zero judge-stable cells in the holdout split, so the discrimination problem never materializes and no AUC -- ROC or PR -- can be reported as a measurement.
9.4 The cohort-axis-collapse hypothesis -- why the rlhf-only cohort universally judges-sensitive on production GGUF
The structural mechanism here is grounded in TR148 v2's dual-axis finding and deserves naming. The rlhf-only judge cohort in this run is composed of general-purpose RLHF-tuned LLMs (gemma3:12b, qwen2.5:7b) plus the regex screen carried over from the v1 corpus. All three components measure response-refusal behavior -- did the model refuse, partially refuse, or comply -- and operate on the same latent axis. They do not include a composite-harm-axis judge (llama-guard3:8b, shieldgemma:9b) in the rlhf-only pool. The cohort-axis-collapse hypothesis is the read that any single-axis judge cohort applied to production-GGUF quantized model outputs will collapse to judge-sensitive on every cell, because GGUF quantization introduces small but axis-correlated perturbations in refusal-token distribution that the refusal axis registers as disagreement, with no orthogonal axis to certify any cell as stable. Under this hypothesis, the 100% judge-sensitive rate on the rlhf-only holdout is not a property of the cheap_score, the prompts, or the model selection -- it is a property of the cohort axis.
Observations. The cohort-axis-collapse hypothesis makes three testable predictions that the substrate is consistent with: first, the pool-robustness contrast in SS9 should show that adding the composite-harm axis to the cohort shifts the class distribution (it does -- 24 of 24 cells are judge_sensitive under expanded_nonrlhf vs 18 of 24 under rlhf-only, with 6 cells resurfacing as judge-dominated where rlhf-only returned insufficient_data); second, drop-one-judge cheap_score AUCs should be informative against the expanded cohort even when the headline cohort returns undefined (the substrate confirms this: AUC = 1.0000 without gemma3:12b and 0.8333 without qwen2.5:7b); third, the cloud-family expansion in SS17 (gemma/phi/mistral via vLLM) should either introduce judge-stable cells (rescuing binary discrimination) or fail to introduce them (sharpening the cohort-axis-collapse hypothesis into a universal-collapse claim about production-quantization corpora). All three predictions are coherent with the dual-axis decomposition documented in TR148 v2 and inherited via the substrate-inheritance contract in SS0.
The cohort-axis-collapse read is the methodologically rigorous interpretation of the single-class holdout finding: TR167's rlhf-only / GGUF-local lane operates on a single judge axis (refusal), production-GGUF quantization perturbs that axis on every cell enough to cross the configured judge-sensitivity threshold, and no cell in the substrate can be certified as judge-stable because the cohort lacks the orthogonal composite-harm axis required to certify any cell. This is consistent with the TR148 v2 dual-axis finding (refusal-axis and composite-harm-axis are not the same construct), with the bridge paper Layer 1a / Layer 1b split (refusal-axis JTP vs composite-harm-axis screen as separate layers), and with the sibling-TR substrate (TR166 / RTSIv2 and TR168 / CRIv2 will face the same cohort-axis-collapse risk if they run their predictive-validity passes under a single-axis cohort).
9.5 Why this is a finding, not a failure
The degenerate single class on the rlhf-only holdout is itself a substantive methodological observation. The rlhf-only judge cohort in this run is composed of general-purpose RLHF-tuned LLMs (gemma3:12b, qwen2.5:7b) alongside the regex screen carried over from the v1 corpus. On this cohort, at standard depth, on the 4 x 6 GGUF-local model x quant grid, every valid cell crosses the configured judge-sensitivity threshold. The cohort does not certify any cell as judge-stable on this lane.
Observations. Two readings of this finding are simultaneously honest. First, it is consistent with the TR148 v2 dual-axis result: the refusal-axis judges and the composite-harm-axis judges measure different things, and the rlhf-only general-LLM cohort -- without the safety-specialist axis pulled into the kappa pool -- is structurally noisier than a dual-axis cohort would be. Second, it bounds the predictive-validity claim itself: cheap pre-rejudge features cannot be evaluated as a binary gate on this lane because there is no stable class to gate against. Both readings converge on the same forward path: either expand the cohort to include the composite-harm axis (which the P8 pool-robustness analysis already partially executes via the expanded_nonrlhf contrast) or expand the family axis to include cloud families that may produce judge-stable cells under the same single-axis cohort (the run_paper.py / TR167 v2 path).
The honest TR167 framing is that the rlhf-only / GGUF-local lane does not produce judge-stable cells at standard depth, so any cheap pre-rejudge feature that aims to triage stable cells out of rejudge is operating on an empty target set. This is exactly the regime in which the cloud-family expansion outlined in the future-work scope is methodologically necessary -- not as a confirmation pass for a positive result, but as the disjoint surface where judge-stable cells are actually expected to exist and where the binary discrimination test can be run at all. The degenerate-class observation is therefore the gate condition for TR167 v2, and is documented here as a primary finding rather than absorbed into a "not enough data" footnote.
10. SS3. Band-Stratified Judge-Sensitivity Rate (P1)
The SS3 analysis stratifies the ten holdout cells into LOW, MODERATE, and HIGH cheap_score tertiles and computes the judge-sensitivity rate within each band, with Wilson 95% confidence intervals on the rate and the mean kappa_min within band. The pre-registered prediction was twofold: (a) the judge-sensitivity rate should increase monotonically from LOW to HIGH (cheap_score should rank-discriminate which cells will end up judge-sensitive), and (b) the mean kappa_min should decrease monotonically from LOW to HIGH (higher cheap_score should correspond to lower judge agreement). The first prediction is rate-based and yields binary discrimination evidence; the second is continuous and yields directional evidence even when binary separation collapses. This bifurcation of the monotonicity test into two axes -- rate-based and magnitude-based -- is methodologically deliberate. It allows SS3 to surface a partial verdict in the regime where one axis structurally collapses but the other retains directional resolution, which is precisely the regime TR167 finds itself in.
The verbatim P1 table from tr167_analysis.json for the rlhf_only / holdout subset is reproduced below.
| Band | n_cells | judge-sensitive rate | Wilson 95% CI | mean kappa_min |
|---|---|---|---|---|
| LOW | 2 | 1.000 | [0.342, 1.000] | 0.221 |
| MODERATE | 3 | 1.000 | [0.438, 1.000] | 0.000 |
| HIGH | 5 | 1.000 | [0.566, 1.000] | 0.023 |
The rate monotonicity test resolves negatively in both directions: HIGH > MODERATE evaluates to No (1.000 is not greater than 1.000), and MODERATE > LOW evaluates to No (1.000 is not greater than 1.000). The kappa_min monotonicity test HIGH < LOW evaluates to YES (0.023 < 0.221), and is the only directional test in SS3 that resolves in favor of the pre-registered prediction.
Observations. All three populated cheap_score bands exhibit a 100% judge-sensitive rate, which is the substrate-level expression of the degenerate-single-class structural finding documented earlier: every valid holdout cell on the rlhf_only judge pool crosses the judge-sensitivity threshold, so a rate-based discrimination test has no negative class to separate against. The Wilson CIs are correspondingly wide and asymmetric -- [0.342, 1.000], [0.438, 1.000], [0.566, 1.000] -- and their lower bounds reflect small-n binomial uncertainty (n=2, n=3, n=5 per band) rather than any evidence that the true rate is below one. The CI lower bound narrowing from 0.342 in the LOW band to 0.566 in the HIGH band is a sample-size artifact tracking the per-band counts (2, 3, 5), not a meaningful gradient of underlying rate uncertainty. The kappa_min column, however, is informative in a way that the rate column structurally cannot be: the LOW band carries a mean kappa_min of 0.221 (driven primarily by the Q2_K llama3-1-8b cell at kappa_min=0.118 and the Q8_0 llama3-1-8b cell at kappa_min=0.441), the MODERATE band collapses to 0.000 (all three cells in the band carry kappa_min=0.000), and the HIGH band sits at 0.023 (five cells, four at kappa_min=0.000 and one at kappa_min approximately 0.118). The HIGH < LOW direction is the pre-registered prediction, and although there is no significance test attached to a three-point monotonicity check at n=2/3/5, the directional consistency is the same signal that surfaces independently in SS4 as the negative-sign Spearman rho of -0.1566 on cheap_score vs kappa_min.
Observations (band-level walk). Walking the bands individually clarifies what each is loaded with. The LOW band carries two cells -- one at kappa_min=0.118 (judge-dominated by a thin margin) and one at kappa_min=0.441 (the lone non-trivial kappa_min in the entire holdout, the Q8_0 llama3-1-8b cell). Both are judge_sensitive=True under the 0.7 stable-threshold, which forces the rate to 1.000; the mean kappa_min of 0.221 is meaningful, however, as the highest band-level mean in the table. The MODERATE band carries three cells (Q3_K_M llama3-2-3b, Q3_K_M llama3-2-1b, Q3_K_M llama3-1-8b) all sitting at kappa_min=0.000 -- complete cross-judge disagreement -- and a band mean of 0.000. The HIGH band carries five cells (Q2_K llama3-1-8b at kappa_min=0.118, and four cells at kappa_min=0.000 across Q2_K llama3-2-3b, Q3_K_M llama3-2-3b, Q4_K_M llama3-2-1b, Q4_K_M llama3-1-8b, etc.) -- the band mean of 0.023 reflects one cell carrying non-zero agreement while the others contribute floors. The structural read is that the LOW band's mean kappa_min is being lifted by a single Q8_0 outlier with kappa_min=0.441, and the HIGH band's mean is being depressed by the four-out-of-five collapse to zero. The directional signal is real but it sits on a knife edge of one outlier in the LOW band.
The directional signal -- HIGH band mean kappa_min (0.023) below LOW band mean kappa_min (0.221) -- is the only piece of Req1 evidence that returns True. It says that the cheap_score IS informative about the magnitude of cross-judge disagreement, just not in a way that produces a binary class separation on this substrate. The substrate has effectively two "speeds": cells at kappa_min=0.000 (full collapse) and cells at kappa_min between roughly 0.1 and 0.45 (partial agreement). The cheap_score sorts the partial-agreement cells preferentially into the LOW band and the full-collapse cells preferentially into the HIGH band, which is the predictive-validity direction. But all of those cells -- partial-agreement and full-collapse alike -- still sit below the 0.7 stable-threshold, so all of them carry judge_sensitive=True, and the binary classifier has nothing to separate.
Observations (why all rates are 1.000). The three 1.000 rates are not a failure of cheap_score discrimination -- they are a manifestation of the degenerate-class property of the substrate documented in SS2. Every valid holdout cell (n=10) carries judge_sensitive=True because every cell's kappa_min sits below the 0.7 threshold. Even the LOW-band Q8_0 llama3-1-8b cell at kappa_min=0.441 -- the highest kappa_min in the holdout -- is still below 0.7 and therefore still classified as judge_sensitive. The cheap_score cannot move any cell across the threshold because no cell is close to the threshold. The rate-based monotonicity test is asking "do higher cheap_score bands contain MORE judge-sensitive cells?" and on a substrate where every cell is judge-sensitive, the answer is structurally "the same number, namely all of them." This is a saturation phenomenon at the class label, not a feature-engineering deficiency at the cheap_score.
The distinction between "binary classifier discrimination" (failed) and "rank-order monotonicity" (passed) is the load-bearing methodological contribution of SS3, and it generalizes beyond TR167 to any predictive-validity follow-up evaluated on a substrate where the positive-class threshold sits well above the substrate's empirical mass. Binary classifier discrimination requires that the decision threshold cut through the empirical distribution; rank-order monotonicity only requires that the predictor's rank ordering agree with the outcome's rank ordering. The first is a strictly stronger claim and is what the pre-registered Req1 asks for; the second is a weaker but still informative claim, and is what the kappa_min monotonicity test in this section actually delivers. TR167 v1 cannot answer the stronger claim because the holdout has no negative class. TR167 v1 can answer the weaker claim, and the answer is: yes, cheap_score rank-orders kappa_min in the predicted direction. The pre-registration treats this as directional support rather than as a verdict-overriding positive, which is the honest call given the small n.
Observations (relationship to SS2 and SS4). SS3 sits between SS2's structural framing (positives=10, negatives=0, ROC-AUC undefined) and SS4's continuous regression (Spearman rho=-0.1566, p=0.6657, CI [-0.8178, 0.7116]). The three sections agree on direction and disagree only on the strength of the resolution each can offer at n=10. SS2 says: binary classification is structurally impossible on this holdout because the substrate is single-class. SS3 says: a band-based rank-order test on kappa_min passes the predicted direction (HIGH < LOW) even though the rate-based test cannot. SS4 says: the continuous rank correlation between cheap_score and kappa_min is in the predicted direction but is not statistically resolved at this n. The triad of findings is internally consistent and is what the pre-registration treats as "directional but not significant" evidence -- enough to license SS9's pool-robustness analysis as the load-bearing positive finding but not enough to override the FAIL_OR_INSUFFICIENT_DATA verdict on Req1.
The SS3 table is the cleanest single-frame illustration of why TR167's primary verdict resolves as FAIL_OR_INSUFFICIENT_DATA rather than as a clean positive or a clean negative. The rate-based axis collapses structurally -- not from noise but from the GGUF-local rlhf-only lane producing zero judge-stable holdout cells -- and this collapse propagates into ROC-AUC, LOOCV-AUC, and the baseline head-to-head head test in SS6. The kappa_min axis, by contrast, retains directional signal that points the predicted way. The honest interpretation is that cheap_score is not zero-information on this substrate; it ranks cells by judge-agreement in the expected direction but it cannot produce binary class separation when there is no negative class to separate. This is precisely the failure mode that the TR167 v2 cloud-family expansion (gemma/phi/mistral via vLLM, scoped in SS15) is designed to resolve: introducing cross-family diversity should populate the judge-stable end of the distribution and restore a defined ROC-AUC. If the cloud expansion does not introduce stable cells, the structural-degenerate verdict generalizes and pool robustness (SS9) becomes the only methodologically rigorous JTP observable on production-quantization corpora.
Observations (forward read into SS8 and SS9). The directional signal that survives SS3's collapse propagates into two downstream sections. SS8's LOFO validation re-evaluates the same monotonicity test under leave-one-model-family-out and confirms the held-out qwen fold is too small (n=5 cells) to run the test independently. SS9's pool-robustness analysis swaps the judge cohort from rlhf_only to expanded_nonrlhf and recovers the binary discrimination that SS3 cannot deliver -- drop-one-judge AUCs of 1.0000 (without gemma3:12b) and 0.8333 (without qwen2.5:7b) demonstrate that the cheap_score's rank-order signal in SS3 corresponds to a real binary signal once the cohort composition is changed. The honest read is that SS3 surfaces directional evidence at the limit of what the rlhf-only holdout can resolve, and SS9 converts that directional evidence into a binary-discrimination result once the cohort axis is permitted to move. This cross-reference is what licenses TR167's "negative-results-with-substantive-secondary-finding" framing: SS3 is the directional anchor and SS9 is the binary anchor, and the methodological story they jointly tell is that pool composition is the dominant lever on JTP class assignment on this substrate -- consistent with the dual-axis finding in the parent TR148 v2 framework and with the Layer 1a / Layer 1b decomposition in the bridge paper substrate at papers/serving_state_safety_certification/.
11. SS4. Cheap-Feature Correlation with kappa_min (P2)
P2 is the regression-grade companion to the band-stratified view in SS3. Where SS3 buckets the 10 holdout cells into LOW/MODERATE/HIGH tertiles on cheap_score and reads off mean kappa_min per band, P2 asks the strictly stronger question: across the continuous range of each cheap feature, does the rank ordering of kappa_min track the pre-registered direction with a Spearman rho whose 95% confidence interval excludes zero? The pre-registered prediction for the lead composite is unambiguous -- higher cheap_score is supposed to predict lower kappa_min, i.e. a negative Spearman rho with a confidence interval that does not cross the null. This section documents the full eight-feature P2 table verbatim from tr167_analysis.json, identifies which rows have the correct sign, walks the per-feature interpretive read in turn, and is explicit about which rows clear the significance bar (none of them do, at n=10).
The table below reproduces every feature in the pre-registered P2 schedule. Spearman rho and Pearson r are reported to four decimals; p-values are scipy.t.sf two-sided; the 95% confidence interval is reported where the substrate populates it. family_code is the one row that returns no rho -- with only the llama and qwen families present in the GGUF-local lane, the within-family variance collapses and Spearman is undefined.
| Feature | Spearman rho | Spearman p | Spearman 95% CI | Pearson r | Pearson p |
|---|---|---|---|---|---|
| quant_bpw | 0.1524 | 0.6743 | [-0.6614, 0.7687] | 0.4911 | 0.1494 |
| refusal_rate_delta | 0.0304 | 0.9336 | -- | -0.1287 | 0.7230 |
| single_judge_unclear_rate | -0.0714 | 0.8446 | -- | -0.1390 | 0.7017 |
| single_judge_ambiguity_rate | 0.1930 | 0.5933 | -- | 0.1331 | 0.7140 |
| mean_output_len_tokens | 0.2162 | 0.5485 | -- | 0.2032 | 0.5734 |
| single_judge_refusal_rate | -0.1930 | 0.5933 | -- | -0.1570 | 0.6649 |
| family_code | -- | -- | -- (insufficient var.) | -- | -- |
| cheap_score (composite) | -0.1566 | 0.6657 | [-0.8178, 0.7116] | -0.2179 | 0.5453 |
Observations. The lead composite, cheap_score, lands on the predicted side of the null: Spearman rho is -0.1566, which is the same sign the pre-registration demanded. The Pearson r at -0.2179 agrees on sign and is consistent in magnitude. However, the p-value for the Spearman test is 0.6657 and the 95% confidence interval [-0.8178, 0.7116] spans roughly the entire admissible range for a rank correlation. At n=10 this is the rho-CI width one expects from the t-approximation under scipy.t.sf -- the substrate is simply too thin to resolve a moderate-strength effect from null. Among the individual features, only single_judge_unclear_rate, single_judge_refusal_rate, and cheap_score itself carry the predicted negative sign; the remaining four either carry the wrong sign (quant_bpw, refusal_rate_delta, single_judge_ambiguity_rate, mean_output_len_tokens) or are estimated with no useful precision. None of the eight rows is significant at alpha=0.05.
11.1 Per-feature walk
The per-feature read is worth doing slowly, because each row encodes a different methodological story and a different boundary on what the rlhf-only / holdout pool can resolve.
quant_bpw is the most interesting individual row in the panel. Its Pearson r of 0.4911 at p=0.1494 is the closest any single feature gets to a conventionally interesting linear correlation. But its Spearman rho is only 0.1524 at p=0.6743. The gap between the Pearson and the Spearman point estimates -- 0.49 versus 0.15 -- is itself a methodological signal: a linear correlation that does not survive rank transformation typically reflects a non-monotone relationship, two influential cells at the bpw extremes, or both. Inspection of the per-cell holdout table makes the second hypothesis plausible: the Q8_0 cells anchor the high-bpw end with kappa_min values of 0.000 (llama3-2-3b-q8-0) and 0.441 (llama3-1-8b-q8-0), and the Q2_K cells anchor the low-bpw end with kappa_min values of 0.118 and 0.000. The Pearson coefficient absorbs the leverage of those four anchoring cells; the Spearman coefficient does not. Either way, the sign on quant_bpw is the opposite of the pre-registered direction -- higher bpw is associated with higher kappa_min, which is exactly what one would predict from the prior literature (higher precision -> more stable outputs -> more stable judge labels), and which therefore reflects a feature whose sign was misregistered rather than a feature that fails the hypothesis. We flag this as a registration-clarity issue rather than a defeat for the cheap-signal story.
refusal_rate_delta returns Spearman rho = 0.0304 at p=0.9336. This is the textbook null result for this panel: the cheap-feature shift in refusal rate versus the Q8_0 anchor has essentially no rank-correlation with cell-level judge disagreement on the rlhf-only / holdout subset. The Pearson r at -0.1287 is also nondescript. The honest reading is that refusal-rate deviation against a per-model high-precision anchor does not, on its own, carry a kappa_min-predictive signal at this n.
single_judge_unclear_rate is the trivial-baseline feature that Req2 was designed to beat. Its Spearman rho on kappa_min is -0.0714 at p=0.8446, with the correct (negative) sign but no precision. The first-judge UNCLEAR rate is the cheapest possible pre-rejudge proxy -- it is computable from a single judge's labels without any second-judge call or feature engineering -- and the fact that even this trivial feature carries a directionally correct but statistically null signal is exactly what justifies the Req2 head-to-head structure: if the trivial baseline barely registers, the cheap composite has a low bar to clear before it begins to add information.
single_judge_ambiguity_rate registers at Spearman rho = 0.1930 at p=0.5933, which carries the wrong sign at the same magnitude as the closely-related single_judge_refusal_rate row. The ambiguity rate is conceptually a sibling of the unclear rate -- both measure how often the first judge could not commit to a clean SAFE/UNSAFE label -- and the divergent signs between unclear_rate (-0.0714) and ambiguity_rate (+0.1930) suggests the two features are measuring different aspects of first-judge hesitation. At n=10 the cleanest read is that neither signal individually clears the noise floor.
mean_output_len_tokens registers at Spearman rho = 0.2162 at p=0.5485 with the wrong sign. The pre-registered intuition was that longer responses might give judges more surface area to disagree about, which would predict a negative sign on output length (longer outputs -> lower kappa_min). The observed positive sign suggests either that the intuition was backwards on this lane (longer outputs may be longer because the model committed cleanly to a position, making judge agreement easier rather than harder) or that the n=10 estimate is dominated by leverage from a few long-output cells. Neither interpretation is supported at significance.
single_judge_refusal_rate registers at Spearman rho = -0.1930 at p=0.5933, carrying the predicted negative sign with the second-largest rho magnitude in the panel. The interpretation is that cells where the first judge attributed a high refusal rate to the model tend to also be cells where the multi-judge cohort disagrees more -- a sensible mechanism if refusal-heavy outputs are the ones where the refusal-axis judges and the composite-harm-axis judges measure systematically different things. This is the per-feature row that most directly echoes the TR148 v2 dual-axis result, and it is also the row whose sign best supports the cheap_score composite's negative loading. It does not, however, clear the significance bar.
family_code returns no rho. With only two families present (llama and qwen) in the GGUF-local lane, the within-family rank variance collapses and Spearman is undefined; this is the structural counterpart of the SS1 coverage note that the standard-depth substrate is family-incomplete by design. The row exists in the schedule for symmetry with the run_paper.py cloud lane, which will introduce gemma/phi/mistral families and populate this row.
cheap_score (composite) lands at Spearman rho = -0.1566 at p=0.6657 with 95% CI [-0.8178, 0.7116] and Pearson r = -0.2179 at p=0.5453. Both estimators agree on the predicted negative sign; neither clears significance. The composite's sign is driven primarily by the single_judge_refusal_rate row (correct sign, second-largest rho) with secondary load from single_judge_unclear_rate (correct sign, small rho); it is partially cancelled by the wrong-signed quant_bpw, mean_output_len_tokens, and single_judge_ambiguity_rate rows.
11.2 Power discussion at n=10
The n=10 power discussion is what licenses the "not significant" line as expected rather than diagnostic. Under a one-tailed Spearman test at alpha=0.05 with true rho=0.3, the approximate power at n=10 is on the order of 0.18. That is, a true effect of moderate strength produces a significant Spearman p only about one time in five at this sample size. The observed cheap_score |rho| of 0.1566 sits well below that, so even if the true effect were exactly the magnitude we observe, the substrate is under-powered to resolve it from zero with conventional significance criteria. The 95% confidence interval width of approximately 1.53 (on a scale bounded at [-1, 1]) makes the same point geometrically: at n=10 the rho estimator simply cannot localise a moderate effect to within an interval narrower than the admissible range.
The P2 panel is consistent with the SS3 monotonicity story: cheap_score's correct-sign Spearman rho of -0.1566 echoes the kappa_min ordering HIGH (0.023) < LOW (0.221) but cannot meet the significance bar at n=10. The pre-registered cheap-predicts requirement therefore fails on this substrate -- not because the directional evidence is wrong, but because the holdout has too few cells to resolve a rho of small-to-moderate magnitude from zero. Combined with the SS3 structural verdict that all 10 valid holdout cells are judge_sensitive=True, the P2 panel is the second of two independent reasons the rlhf-only / holdout pool is under-powered for the pre-registered hypothesis test: SS3 shows the binary classification problem is degenerate (positives=10, negatives=0); SS4 shows the continuous regression problem has the right sign but no precision. The honest report-out is that P2 produces directionally-aligned but statistically-inconclusive evidence on the composite, no-significant-signal panels on the individual features, and family_code excluded by construction because the GGUF-local lane only carries two model families. The path forward is not feature engineering at n=10 -- that path is structurally under-powered -- but the run_paper.py cloud expansion (gemma/phi/mistral via vLLM) described in SS17, which expands both the cell-level n and the cross-family variance and is the cleanest way to push the Spearman estimator out of the small-sample regime.
12. SS5. Discrimination Analysis (P3) -- Why ROC-AUC is Structurally Undefined
P3 was pre-registered as the binary-discrimination arm of Req1: the cheap composite score, treated as a continuous decision variable, should produce an ROC-AUC strictly above 0.5 with a 95% confidence interval that excludes 0.5, and the LOOCV cheap-logistic out-of-fold AUC should likewise exceed 0.5. Neither quantity is reported as a number in this substrate. The run dir records cheap_score ROC-AUC as undefined and LOOCV out-of-fold AUC as insufficient_data, and the PR-AUC sits at exactly 1.0000 with a random floor of exactly 1.0000. This section walks through why each of those three lines is a structural consequence of the holdout's class composition, not a result that future tuning, regularization, or feature engineering can rescue without changing the corpus itself. The walk-through matters because a casual reading of "PR-AUC = 1.0000" could be mistaken for a perfect classifier; the actual content of that number is the opposite, and a hand-narrated TR is the right venue to draw that distinction in full.
The discrimination grid below summarises every binary-classification quantity P3 attempted to compute on the rlhf_only / holdout slice (n_cells=10).
| Quantity | Value | Reason / interpretation |
|---|---|---|
| Positives (judge-sensitive=True) | 10 | All valid holdout cells |
| Negatives (judge-stable=False) | 0 | Structural degeneracy |
| cheap_score ROC-AUC | -- (undefined) | Single-class -> TPR/FPR division by zero |
| cheap_score PR-AUC | 1.0000 | Random floor also 1.0000 -- tautology |
| Random-floor PR-AUC baseline | 1.0000 | precision = positives / (positives + negatives) = 10/10 |
| LOOCV cheap-logistic OOF AUC | -- (insufficient_data) | No class label variance across folds |
| Per-feature univariate AUC (all 7) | -- (undefined) | Same single-class condition propagates |
| AUC(majority-class baseline) | 0.5000 | Defined by convention; not a measurement |
Observations. Two failure modes must be distinguished, because they have different remediations and they pattern-match to different categories of reviewer objection. The first is computational undefinedness: ROC-AUC integrates TPR against FPR, and with zero negatives the false-positive rate is 0/0 at every threshold, so the integrand does not exist. This is a divide-by-zero condition, not a small-sample variance condition, and no bootstrap or LOOCV procedure changes it -- every resample of the all-positive class vector draws the same all-positive class vector. The second is scientific uninformativeness, which is the PR-AUC line: precision at every recall level is positives / (positives + false_positives), but with negatives=0 the only achievable precision is 1.0000 across the entire recall axis, so PR-AUC equals the random-floor PR-AUC by construction. The metric returns 1.0000 not because the cheap score is a perfect classifier, but because the question "what fraction of predicted-positives are truly positive?" is answered "all of them" before the classifier is consulted. Reporting PR-AUC = 1.0000 here as evidence of discrimination would be the textbook degenerate-class trap that the JTPv2 pre-registration explicitly guards against via the degenerate_single_class=True flag in tr167_analysis.json.
Observations. The LOOCV cheap-logistic line warrants its own walk-through because LOOCV is sometimes mistakenly reported as a workaround for small-n discrimination tests. It is not, in this regime. The LOOCV procedure holds out one cell at a time, fits a logistic on the remaining nine, and scores the held-out cell against the resulting model. Two structural barriers collapse this onto insufficient_data. First, sklearn's LogisticRegression.fit requires at least two distinct class labels in the training fold; with all ten training cells carrying judge_sensitive=True, every fold's training set is single-class and the fit refuses. Second, even if a degenerate "always-predict-positive" model is substituted, scoring the held-out cell yields a single (true_label=1, predicted_score=constant) tuple, and there is no AUC over a single tuple -- AUC requires rank comparisons across at least one positive and one negative. The substrate therefore records LOOCV out-of-fold AUC as insufficient_data rather than as a numeric value, which is the correct behavior. The same logic propagates into the per-feature univariate AUCs: each of the seven cheap features inherits the same single-class label vector and produces the same undefined output.
The structural reading is that P3 is not measuring the cheap signal at all -- it is measuring the composition of the holdout. In the rlhf_only / holdout subset of the GGUF-local lane, every valid cell crosses the judge-sensitivity threshold, so binary discrimination has no negative class against which to discriminate. The cheap_score Spearman rho still carries the correct negative sign (-0.1566) and the monotonicity test on kappa_min HIGH < LOW still passes (0.023 < 0.221 from SS3 / P1), which means the directional signal survives even when the binary signal does not. The honest framing is that P3's "fail" line is a consequence of the GGUF-local rlhf-only lane producing a structurally single-class holdout at this depth -- a finding about the test bed, not about cheap_score.
Observations. Given that binary discrimination is structurally unavailable on this surface, the methodologically defensible move is to ask what classes of metric would be informative on a degenerate-class holdout, and to flag them as the appropriate replacement targets for any follow-up that cannot wait for a non-degenerate corpus. Three candidate replacements present themselves on the existing TR167 substrate without requiring new sampling. The first is per-cell prediction-quality ranking: rank the ten holdout cells by cheap_score and rank them independently by kappa_min, and report Spearman rho on the two rankings -- which is exactly the test SS4 already runs, and which preserves the directional content of the cheap signal under a degenerate class. The second is direct regression on kappa_min itself, treating the continuous JTP-validity quantity as the response variable instead of the binarized judge_sensitive label; the LOW=0.221 / MODERATE=0.000 / HIGH=0.023 band means already constitutes a three-point regression view, and a per-cell scatter would be a defensible companion in the v2 expansion. The third is a rank-based concordance metric such as Somers' D or Kendall's tau-b, which degrade gracefully under class imbalance and would report a defined value at n=10 where ROC-AUC cannot. The TR167 v1 pre-registration committed to ROC-AUC and ROC-AUC alone for the headline Req1 admissibility test, so we report the ROC-AUC outcome honestly as undefined; the SS4 Spearman rho and the SS3 kappa_min monotonicity already serve as the rank-based substitutes for the binary signal, and TR167 v2 will be the appropriate venue to re-pre-register a continuous-response or rank-concordance primary test if the cloud-family expansion still does not introduce judge-stable cells.
Two corpus-side moves can lift the degeneracy and restore the binary-discrimination axis to a measurable state. The first is the cloud-family expansion described in SS15 (gemma / phi / mistral via vLLM at GPTQ / AWQ / fp16 cells) which the substrate predicts will introduce judge-stable cells into the holdout, because cross-family architectural diversity is exactly what is missing from the GGUF-local-only lane. The second is the expanded_nonrlhf pool contrast in SS9 / P8, which already demonstrates that pool composition is the dominant lever on which cells register as judge-sensitive: under the expanded cohort the same 10 holdout cells redistribute their kappa_min by a mean shift of -0.1529, and the drop-one-judge AUC without gemma3:12b reaches 1.0000, which is precisely the kind of cohort-composition sensitivity TR148 v2's dual-axis finding predicted. Until one of those moves runs, the only methodologically defensible discrimination claim from this run is the rank-based directional one carried by SS3 and SS4, not the binary AUC-shaped one that P3 was pre-registered to deliver. Recording P3 as undefined / insufficient_data rather than laundering it through a synthetic 0.5 or a misread PR-AUC = 1.0000 is the entire reason the JSON substrate carries
degenerate_single_class=Trueas a first-class evidence field, and is the reason the bridge paper Layer 1 (papers/serving_state_safety_certification/) will inherit P3 as a calibrated caveat rather than as a positive admissibility line.
13. SS6. Baseline Head-to-Head DeLong (P4) -- Insufficient Data Reason
P4 is the pre-registered baseline head-to-head test for Req2: the combined cheap model must beat the trivial single-judge-unclear-rate baseline by a positive delta-AUC whose 95% bootstrap confidence interval excludes zero, AND by a DeLong test p-value below alpha=0.05, AND by a nested-logistic LRT p-value below alpha=0.05. All three signals are pre-registered as required, not sufficient -- the design demanded conjunctive evidence so that a single statistical artifact could not carry a "cheap beats baseline" claim. On this substrate, each of the three lines returns insufficient_data for the same structural reason that P3 returned undefined: AUC(cheap) and AUC(baseline) are individually undefined on the single-class holdout, so the difference between them is also undefined, and the DeLong covariance estimator and the nested LRT both lose their statistical denominators.
The head-to-head grid below records the verbatim P4 output from the run dir, with the AUC(majority-class baseline) line included for completeness because it is the only entry in the grid that does carry a defined numerical value, and that value is a convention rather than a measurement.
| Quantity | Value | Reason / interpretation |
|---|---|---|
| AUC(cheap) | -- (undefined) | Inherits single-class condition from P3 |
| AUC(unclear_rate baseline) | -- (undefined) | Same single-class condition |
| AUC(majority-class baseline) | 0.5000 | Representational convention, not a measurement |
| Delta-AUC (cheap - baseline) | -- (insufficient_data) | Cannot subtract two undefined quantities |
| Delta-AUC bootstrap 95% CI | insufficient_data | No defined delta to bootstrap |
| DeLong test p | insufficient_data | DeLong variance estimator collapses on empty negative sample |
| Nested-logistic LRT p | insufficient_data | LRT denominator vanishes on constant target vector |
delong_p_below_alpha05 evidence field |
False | Confirms ABORT path fired |
lrt_p_below_alpha05 evidence field |
False | Confirms ABORT path fired |
beats_baseline_verdict |
FAIL_OR_INSUFFICIENT_DATA | Pre-registered ABORT condition triggered |
Observations. The pre-registered ABORT condition for Req2 fired exactly as the pre-registration specified, and the JSON evidence block records both delong_p_below_alpha05=False and lrt_p_below_alpha05=False -- which must be read as "the test could not be administered" rather than "the test was administered and produced p above 0.05." This distinction is methodologically load-bearing. A test that returns p=0.6 has been administered against a defined null hypothesis; a test that returns insufficient_data has not been administered at all. P4 is the second category, and the two evidence-field False values are flagged in the JSON specifically so that downstream consumers (the bridge paper Layer 1a anchor, the program external JTP follow-up, the sibling TR166 / RTSIv2 and TR168 / CRIv2 reports) do not mis-aggregate the False values into a claim that the cheap model failed to beat the baseline. The cheap model was never given an opportunity to fail; the holdout was structurally degenerate before the head-to-head test could execute.
Observations. The AUC(majority-class baseline) line at 0.5000 in the table needs unpacking, because the difference between a defined-by-convention 0.5000 and an undefined -- is itself a subtle reporting trap. The 0.5000 entry is a representational convention from the scikit-learn / sklearn.metrics.roc_auc_score implementation: when only one class is present, calling the function raises an error; when the function is replaced with a defensive wrapper that returns the majority-class prediction probability against the majority-class label, the resulting "AUC" is 0.5000 by construction -- it does not measure discrimination, it measures the prior. The TR167 substrate reports this defensive value to distinguish "the baseline classifier returns the majority class with probability one and therefore the convention 0.5 obtains" from "the AUC computation crashed and no value can be recorded." The -- entries for AUC(cheap) and AUC(unclear_rate baseline) signal the genuinely undefined state -- the cheap_score and the unclear_rate are continuous variables whose AUC would be defined if the labels were not single-class, and the absence of a value indicates the metric did not compute, not that it computed to 0.5. Conflating the 0.5000 convention with a real discrimination measurement of 0.5 (which would itself mean "no better than chance") is the failure mode this row in the grid is designed to forestall.
Observations. The DeLong test and the nested-logistic LRT collapse for related but distinct reasons that are worth disentangling, because each suggests a different forward-work intervention. The DeLong-DeLong-Clarke-Pearson covariance estimator that the DeLong test relies on is constructed from Mann-Whitney-U-derived pairwise rank statistics over (positive, negative) pairs: the variance of each AUC and the covariance between the two AUCs both depend on the number of (positive, negative) pairs in the substrate. With negatives=0, no (positive, negative) pairs exist, the U-statistics are undefined, and the entire DeLong machinery loses its denominator. This is a substrate-side problem and the only fix is to populate the negative class -- which is exactly what the SS15 cloud-family expansion is scoped to do via run_paper.py introducing gemma / phi / mistral cells that the substrate predicts will land at kappa_min above the 0.7 stable threshold. The nested-logistic LRT, by contrast, fails because the baseline logistic and the augmented logistic both fit identical models on the single-class target: maximum-likelihood estimation against a constant outcome vector produces a constant predicted probability regardless of the feature matrix, so the log-likelihoods are identical and the LRT statistic is exactly zero with zero degrees of freedom in the alternative. The LRT is therefore not a small-sample failure but a degenerate-optimization failure, and the fix is the same -- introduce label variance into the holdout.
Both Req2 admissibility conditions inherit the rlhf-only / holdout subset's single-class structural property. The delta-AUC has no point estimate to bootstrap because both AUCs are individually undefined. The DeLong p-value cannot be computed because the variance estimator's denominator vanishes on the empty negative class. The nested LRT degenerates because the baseline-only logistic and the cheap-plus-baseline logistic both maximize their likelihood at the same constant predicted probability against a constant target vector. Three independent statistical tests, all collapsing on the same substrate-level pathology -- which is the load-bearing reason the verdict here is "test not administered" rather than "cheap signal failed to beat baseline." The honest framing is what the JSON records:
beats_baseline_verdict = FAIL_OR_INSUFFICIENT_DATA, with the evidence fields confirming each sub-test aborted on the same pre-registered ABORT condition that closed P3 in Section 12 (SS5). Req2 cannot be evaluated independently of Req1 on this substrate, the cloud-family expansion in SS15 is the joint fix for both, and the structural-degenerate finding inherited from SS2 is the upstream cause of both ABORTs.
14. SS7. Calibrate vs Holdout Comparison (P5)
The pre-registration treats the calibrate / holdout split as the load-bearing generalization test. The calibrate split is the LOFO training partition the cheap-logistic is fitted on; the holdout split is the disjoint family-out test set that any honest predictive-validity claim must survive. The leave-one-family-out construction was chosen specifically to prevent within-family memorization from being confused with cross-family signal. In a well-powered run, the calibrate-versus-holdout contrast is the difference between "the cheap-logistic genuinely tracks judge-sensitivity" and "the cheap-logistic over-fitted the calibrate cells and the holdout reveals collapse to baseline." It is the single contrast in the pre-registered ladder that is supposed to discipline the strongest of the two failure modes a screen can exhibit -- over-fitting in calibration -- by demanding a disjoint replication.
In TR167 / JTPv2 at this depth and on this lane, the contrast is structurally aborted. Both splits inherit the same single-class pathology already documented in SS3 and SS6: the rlhf-only judge pool produces a holdout in which all 10 valid cells are judge-sensitive=True (positives=10, negatives=0). The calibrate split, drawn from the same 24-cell rlhf-only universe under the same GGUF-local lane (llama and qwen only, two families present), is no better positioned. Because ROC-AUC, DeLong delta-AUC, the LOOCV cheap-logistic out-of-fold AUC, and the nested LRT are all undefined when a split contains only one class, none of the calibrate-vs-holdout comparisons that the pre-registration calls for can be evaluated. The structural property -- "every valid cell on the rlhf-only / GGUF-local pool crosses the judge-sensitivity threshold at standard depth" -- is not unique to the holdout partition; it is a property of the rlhf-only generating process itself, and therefore propagates into the calibrate partition with equal force.
| Quantity | Calibrate split | Holdout split |
|---|---|---|
| n cells (valid) | TBD per analyze.py extension | 10 |
| Positives (judge-sensitive) | TBD per analyze.py extension | 10 |
| Negatives (judge-stable) | TBD per analyze.py extension | 0 |
| cheap_score ROC-AUC | -- (undefined) | -- (undefined) |
| LOOCV cheap-logistic out-of-fold AUC | -- (insufficient_data) | -- (insufficient_data) |
| Delta-AUC vs trivial baseline | -- (insufficient_data) | -- (insufficient_data) |
| DeLong p | insufficient_data | insufficient_data |
| Nested LRT p | insufficient_data | insufficient_data |
| cheap_score Spearman rho vs kappa_min | TBD per analyze.py extension | -0.1566 (p = 0.6657, 95% CI [-0.8178, 0.7116]) |
| Monotonicity (kappa_min HIGH < LOW) | TBD per analyze.py extension | PASS (0.023 < 0.221) |
Observations. Every entry in the holdout column that would license a generalization claim resolves to undefined or insufficient_data, and the calibrate column inherits the same single-class structural pathology because both splits are drawn from the rlhf-only pool on the same GGUF-local lane. The only cells in the contrast that carry numerical content are the Spearman rho (correct negative sign, not significant at n=10) and the kappa_min monotonicity check (correct direction, no significance test). Crucially, neither the "cheap-logistic generalizes" hypothesis nor the "cheap-logistic over-fits" hypothesis can be rejected, because the contrast that would distinguish them never resolves to a numerical comparison. A predictor that achieved AUC=0.9 on calibrate and AUC=0.6 on holdout would be diagnosed as over-fit; a predictor that achieved AUC=0.85 on both would be diagnosed as generalizing. A predictor that returns "undefined" on both is diagnosable as neither: there is no operating point at which the binary classification problem exists. The verdict label FAIL_OR_INSUFFICIENT_DATA on req1 and req2 (SS3) propagates here unchanged, but the load-bearing observation here is the symmetry of the failure, not the failure itself.
The symmetry has a clean substrate explanation. The LOFO partition draws the calibrate cells from one family (the trained-on family) and the holdout cells from the held-out family, but both partitions share the rlhf-only judge cohort -- gemma3:12b plus qwen2.5:7b plus the regex baseline -- because the pool is held fixed across the split by construction. The structural-degenerate property that all valid cells crossed the judge-sensitivity threshold is not a property of the family being measured; it is a property of the cohort doing the measuring. Once that property fixes the class label at "positive" for every valid cell, it does so in both partitions simultaneously, and the LOFO split loses its diagnostic power against over-fitting because there is no class boundary in either partition for an over-fit predictor to mislearn.
The calibrate-versus-holdout contrast in TR167 is structurally aborted, not under-powered. The pre-registered split exists, both partitions contain cells, and the LOFO machinery runs, but the rlhf-only pool produces a single-class outcome in both partitions and the predictive-validity comparison collapses to "undefined vs undefined" before any over-fitting test can be applied. This is a stronger and more honest failure mode than "over-fitting masked the holdout collapse" -- it is "the question the calibrate-holdout contrast was designed to answer (over-fitting vs generalization) cannot be put to the substrate at all." The TR148 v2 dual-axis finding prefigures this exactly: when the rlhf-only cohort is the only axis in the cohort, JTP-class diversity collapses along the GGUF-local lane, and any binary screen built on top inherits the collapse in every partition the cohort touches.
The honest report of this section is that the v1 framework cannot distinguish generalization from over-fitting on the GGUF-local lane at standard depth -- the methodological gap is real, and it is the gap that the TR167 v2 cloud-family expansion (gemma / phi / mistral via run_paper.py on GPTQ / AWQ / fp16 vLLM cells) is specifically scoped to close by introducing judge-stable cells into the holdout. The directional kappa_min monotonicity (HIGH 0.023 < LOW 0.221) and the negative-signed Spearman rho (-0.1566) are the only fragments of cross-split evidence that survive, and both belong on the directional-evidence ledger, not the generalization-claim ledger. Cross-referencing the sibling v2 series in TR166 / RTSIv2 and TR168 / CRIv2: the same predictive-validity pattern (calibrate-vs-holdout collapse when both partitions share a cohort-pinned degenerate class) is the dominant failure mode the v2 line is collectively designed to lift, and the bridge paper Layer 1a anchor at papers/serving_state_safety_certification/ inherits the conditionality from this TR's substrate.
A useful forward read: should the TR167 v2 cloud expansion succeed in introducing judge-stable cells into the holdout, the calibrate-vs-holdout contrast will become live for the first time on this corpus, and the over-fitting-versus-generalization question will get an answer rather than a "test not administered" verdict. Should the cloud expansion fail to introduce stable cells, the structural-degenerate finding generalizes upward: the calibrate-vs-holdout symmetry observed here is then the canonical TR167-line observation about JTP-validity on production-quantization corpora, and the contrast is permanently retired in favor of pool-robustness diagnostics (SS9 / P8) as the methodologically rigorous JTP observable. Either branch is a clean publishable outcome; the standard-depth substrate is the gate on which the branch selection is made.
15. SS8. Leave-One-Model-Family-Out Validation (P6)
P6 is the most ambitious of the pre-registered cross-validation passes in this report. Its intent is to ask whether the cheap pre-rejudge signal generalises across model families: if we hold out llama entirely and fit on qwen alone, does the cheap_score still order judge-sensitivity on the unseen llama cells -- and vice versa. This is the geometry the parent JTP framework needs in order to license the cheap signal as a deployable triage gate, because at deployment time the operator does not know which family a new (model, quant, battery) cell belongs to. LOFO is the honest external-validity test; it is strictly more demanding than P3's pooled ROC-AUC or P5's LOOCV, and it is the pass on which a published triage-gate claim would have to rest.
The substrate, however, cannot answer the question at the depth requested. The GGUF-local lane carries only two families -- llama and qwen -- and the qwen fold lands with five total cells under the holdout pool/split filter (3 LOW + 2 MOD by cheap_score band, with zero HIGH). The llama fold lands with ten cells, but as documented in SS5 and SS6, all ten are judge_sensitive=True, so the held-out binary target is degenerate by construction. Neither fold can produce a defined ROC-AUC; neither fold can be combined into an aggregate AUC; both folds return the insufficient_data sentinel. The aggregate row in the table below is the verbatim aggregate from the run dir; the per-fold rows are the verbatim per-fold readouts.
| Fold (held out) | n_cells | cheap AUC | LOOCV AUC | delta-AUC vs baseline | Status |
|---|---|---|---|---|---|
| llama | 10 | -- | -- | -- | insufficient_data (degenerate single class on llama holdout) |
| qwen | 5 | -- | -- | -- | insufficient_data (n=5 below LOFO minimum; 3 LOW + 2 MOD only) |
| Aggregate (2 folds) | 15 | -- | -- | -- | insufficient_data |
Observations. The two folds fail for different reasons, and the distinction is methodologically load-bearing because the recovery path for each is different. The llama fold fails for the same single-class reason that collapsed P3 and P4: the binary target has no negatives in the held-out partition, so the discrimination question is undefined regardless of how good the cheap signal might be in principle. The cells are present, the cheap_score is computed for each of the ten, the kappa_min is measured, but the y vector is the all-ones vector and ROC-AUC is structurally undefined on a constant target. The qwen fold fails for a sample-size reason: five cells split 3-2 across cheap_score tertiles is below the minimum for any reasonable logistic fit, and the absence of any HIGH-band qwen cell means the fold cannot even probe the part of the cheap_score range where SS3's monotonicity test located its directional evidence. The aggregate metric is therefore not a "weak" LOFO estimate; it is a non-estimate. We report -- rather than substitute a pooled-fold AUC -- because pooling across two insufficient_data folds does not rescue either failure mode and would misrepresent the test as having been run.
The aggregate "--" reported in the table is the verdict of the methodology, not a placeholder. A pooled-fold AUC computed by stacking the llama y-vector (all positives) under the qwen y-vector (n=5, mixed) would yield a number, but that number would be a weighted mean of one undefined estimate and one under-powered one, which is a defensibility-bar violation. The honest reporting move is to record the insufficient_data sentinel and to surface the per-fold failure mode separately so a downstream reader can see exactly what would have to change in the substrate to lift each one.
15.1 The llama fold (n_cells = 10)
The held-out-llama fold is the larger of the two and is the fold whose per-cell table is reproduced in SS5's appendix. Ten holdout cells, all judge_sensitive=True, kappa_min spanning from 0.000 (judge-dominated cells at Q3_K_M, Q4_K_M, Q5_K_M, Q8_0 on multiple model sizes) up to 0.441 (the llama3-1-8b Q8_0 cell that sits at the edge of the judge-sensitive band). The cheap_score axis spans roughly [-2.39, 4.74] across these ten cells, which is a wider dynamic range than the qwen fold offers. The fold therefore has the cheap-side variance to populate the LOW/MOD/HIGH bands -- the SS3 P1 table is computed against exactly this fold -- but the kappa-side target collapses to a single class. The cheap_score is doing rank-ordering work (the SS3 monotonicity HIGH 0.023 < LOW 0.221 holds, and the SS4 Spearman rho carries the correct -0.1566 sign) but the binary discrimination metric has no negative class against which to discriminate.
15.2 The qwen fold (n_cells = 5)
The qwen fold lands at exactly five cells under the headline (pool=rlhf_only, split=holdout) filter, which is below the LOFO minimum the methodology requires. The minimum is set by a paired methodological constraint: at fewer than roughly ten cells per fold, the bootstrap CI on a fold-level AUC is wider than the [0.5, 1.0] interval the AUC can legitimately take, and the per-feature logistic coefficient CIs cease to admit any meaningful comparison to zero. Five cells -- split 3 LOW + 2 MOD -- not only sits under that floor; it also fails to populate the HIGH cheap_score band where SS3's only directional signal lives. The qwen fold therefore cannot probe the part of the cheap_score range where the substrate licenses a directional read, and cannot produce a fold-level AUC the methodology would treat as legible.
15.3 The two-family bottleneck
The structural bottleneck is the two-family constraint of the GGUF-local lane. Phi, mistral, and gemma cells live on cloud-quantization paths (GPTQ/AWQ/fp16 on vLLM) that this depth and umbrella scope do not exercise. With only llama and qwen present, LOFO has at most two folds, and -- as the per-fold cell counts show -- one of them is structurally degenerate and the other is undersized. The TR167 v2 cloud expansion via run_paper.py is scoped to add three additional families (gemma, phi, mistral) at appropriate cell counts (six GPTQ/AWQ/fp16 levels per model, two model variants per family), which would lift the LOFO design from 2 folds to 5 folds and bring per-fold n into the 10-15 cell range where bootstrap-CI fold-AUC becomes interpretable.
15.4 The asymmetric fold problem
Even within the present two-family substrate, the folds are asymmetric in a way that LOFO machinery does not gracefully handle. The llama family is represented by three model variants -- llama3-1-8b, llama3-2-1b, llama3-2-3b -- across six GGUF quants apiece, which would in principle yield 18 cells on the llama side. The headline (pool=rlhf_only, split=holdout) filter narrows this to ten valid cells, which is the upper end of what the GGUF-local lane can deliver per fold. The qwen family, by contrast, is represented by two model variants -- qwen2.5-1.5b and qwen2.5-7b -- across six quants apiece, for 12 cells before filtering, narrowing to five valid cells in the holdout slice. The asymmetry is not a sampling accident; it is baked into the model-variant counts on each side of the family axis. Even if the rlhf-only holdout's class-degenerate problem were lifted, the qwen-held-out fold would remain undersized relative to the llama-held-out fold, and a published LOFO AUC would be reporting two estimates with very different precisions under the same headline.
The structural reading is that LOFO on the GGUF-local lane is not a single load-bearing failure but a stack of three: (a) the llama fold is one-class degenerate, (b) the qwen fold is under the cell-count floor, and (c) even at a hypothetical higher depth the two-family axis cannot generate the disjoint folds that a deployable triage-gate claim would require. The TR167 v2 cloud expansion is the joint fix for all three: gemma/phi/mistral cells through cloud quantization (approximately 10-30 USD RunPod A100 PCIe, 1-2 day wall-clock) are expected both to enlarge each held-out fold past the five-cell floor and to introduce judge-stable cells into the holdout, which would together rescue the discrimination geometry that the GGUF-local substrate cannot supply. The cross-references to TR148 v2's dual-axis cohort definition, the bridge paper's Layer 1 anchor, and the sibling TR166 / RTSIv2 and TR168 / CRIv2 LOFO designs all point at the same forward path: predictive-validity tests on production-quantization corpora need cross-family diversity that the GGUF-local lane structurally cannot supply at standard depth.
16. SS9. Pool Robustness -- The Load-Bearing Secondary Finding (P8)
This section documents P8, the secondary analysis that, in the absence of a positive primary verdict, carries the substantive interpretive weight of TR167 / JTPv2. P8 compares the rlhf-only judge cohort (gemma3:12b, qwen2.5:7b) against the expanded_nonrlhf cohort (which additionally includes the safety-specialist axis: llama-guard3:8b, shieldgemma:9b) on the common set of cells covered by both pools. The motivating question is not "does the cheap_score predict judge-sensitivity?" -- that question is structurally barred on the rlhf-only holdout, as Section 15 (SS8) established. The motivating question is instead "does the cohort composition itself materially change whether a (model, quant, battery) cell registers as judge-sensitive at all?" The answer, in the substrate, is yes -- and the magnitude of that "yes" is the load-bearing finding of this report. The pool-robustness observable is the one place in the TR167 substrate where the structural-degenerate-class verdict on the rlhf-only holdout does not block a quantitative inference; it operates at the cell level across the full common-cell intersection (n=24), not at the LOFO holdout slice (n=10), and it is robust to the binary-discrimination collapse that propagates through SS5, SS6, SS7, and SS8.
16.1 P8 verbatim counts
The pool-robustness comparison runs over the common-cell intersection of rlhf-only and expanded_nonrlhf at headline depth. The raw counts are:
| Quantity | Value |
|---|---|
| Common cells (rlhf-only intersect expanded_nonrlhf) | 24 |
| rlhf-only judge_sensitive count | 18 |
| expanded_nonrlhf judge_sensitive count | 24 |
| Resurfaced flips (rlhf-stable / insufficient_data implies expanded-sensitive) | 6 |
| Masked flips (rlhf-sensitive implies expanded-stable) | 0 |
| Mean kappa_min shift (expanded minus rlhf) | -0.1529 |
Observations. The expanded_nonrlhf cohort registers 24 / 24 cells as judge_sensitive against the rlhf-only cohort's 18 / 24. The difference is exactly 6, and all 6 of those resurfaced cells flipped in the same direction: cells that read as rlhf-stable or rlhf-insufficient_data become expanded-sensitive once the safety-specialist axis is added to the cohort. The reverse direction -- a cell rlhf-sensitive becoming expanded-stable -- does not occur in the substrate (masked flips = 0). The shift is monotone, not noisy. The mean kappa_min shift of -0.1529 expresses the magnitude in the agreement metric: the expanded cohort's worst-pair kappa is, on average, about 0.15 units lower than the rlhf-only cohort's worst-pair kappa on the same cell.
The asymmetry matters. If the rlhf-only and expanded_nonrlhf cohorts merely produced noisier vs less noisy estimates of the same underlying JTP class, we would expect flips in both directions -- some cells flipping rlhf-stable into expanded-sensitive, others rlhf-sensitive into expanded-stable. The substrate shows flips in one direction only. The expanded cohort is not noisier; it is detecting disagreement that the rlhf-only cohort, by construction, cannot see.
Observations. The 24-cell common-cell intersection is the largest contiguous slice TR167 provides on which a quantitative pool-contrast inference is well-defined, and the zero-masked-flip asymmetry rules out the most common alternative interpretation of pool-contrast data -- noise jitter. Under noise jitter, flip counts in the two directions should be comparable in magnitude even if not identical, and a mean kappa_min shift of -0.1529 with zero reverse flips is not a regime a symmetric noise model can produce.
Read against the v1 JTP class thresholds (kappa < 0.4 untrustable, 0.4-0.7 triangulate, >=0.7 robust), a mean kappa_min shift of -0.1529 across 24 cells is large enough to move a meaningful fraction of cells across one of the threshold edges. The substrate confirms this at the per-cell level in the next subsection.
16.2 The 6 resurfaced cells -- all llama3.2 GGUF
The 6 cells that resurface from rlhf-only / insufficient_data into expanded_nonrlhf / judge-dominated kappa_min = 0.000 are not scattered across the cohort; they share a clean structural identity. All 6 are llama3.2 GGUF quants:
| Cell id | rlhf-only outcome | expanded_nonrlhf outcome |
|---|---|---|
| tr167-llama3-2-1b-q5-k-m | insufficient_data | judge-dominated, kappa_min=0.000 |
| tr167-llama3-2-1b-q6-k | insufficient_data | judge-dominated, kappa_min=0.000 |
| tr167-llama3-2-1b-q8-0 | insufficient_data | judge-dominated, kappa_min=0.000 |
| tr167-llama3-2-3b-q4-k-m | insufficient_data | judge-dominated, kappa_min=0.000 |
| tr167-llama3-2-3b-q5-k-m | insufficient_data | judge-dominated, kappa_min=0.000 |
| tr167-llama3-2-3b-q6-k | insufficient_data | judge-dominated, kappa_min=0.000 |
Observations. Three llama3.2-1B quants (Q5_K_M, Q6_K, Q8_0) and three llama3.2-3B quants (Q4_K_M, Q5_K_M, Q6_K) are involved. The two model sizes split evenly. The quant levels span the mid-to-high-bpw range (Q4_K_M through Q8_0); the low-bpw end (Q2_K, Q3_K_M) is not in this list because those cells were already judge-sensitive under rlhf-only and so could not "resurface." In every one of the 6 resurfaced cells, the rlhf-only verdict was insufficient_data, not judge-stable -- meaning the rlhf-only cohort could not even decide the question. The expanded_nonrlhf cohort decides it: judge-dominated, kappa_min collapsed to exactly 0.000.
A kappa_min of 0.000 is the chance-agreement floor. It says: once the safety-specialist axis is in the cohort, at least one pair of judges in the expanded cohort assigns labels that are no better than independent draws against at least one other judge's labels on this cell. The rlhf-only cohort had insufficient evidence to evaluate this because the safety-specialist axis -- which is the axis carrying the disagreement -- was absent from the cohort by construction.
Per-cell walk-through. Cell tr167-llama3-2-1b-q5-k-m is a 1B-parameter Q5_K_M GGUF quant; at rlhf-only it returns insufficient_data because the four-judge pairwise overlap structure did not produce a valid kappa_min above the configured floor on the refusal-axis subset. Under expanded_nonrlhf, the same cell registers as judge-dominated with kappa_min=0.000, meaning the worst pairwise kappa across the expanded cohort hits the chance-agreement floor exactly. The pattern repeats verbatim for tr167-llama3-2-1b-q6-k and tr167-llama3-2-1b-q8-0. The three 3B cells (tr167-llama3-2-3b-q4-k-m, tr167-llama3-2-3b-q5-k-m, tr167-llama3-2-3b-q6-k) produce the same per-cell signature.
Observations. The six per-cell signatures are identical: rlhf=insufficient_data, expanded=judge-dominated, kappa_min=0.000. This is a single coherent behavior repeated six times, not six independent observations of a noisy variable. The 1B and 3B size axes both contribute three cells each, which rules out a size-specific artifact (e.g., "only small models do this"), and the quant axis spans Q4_K_M through Q8_0, which rules out a low-bpw-only artifact. The behavior is a property of the llama3.2 family at mid-to-high quant precision under the dual-axis cohort, not a property of any single (model, quant) configuration.
The llama3.2 concentration of the resurfaced cells is informative in its own right. The cohort's refusal-axis judges (gemma3:12b, qwen2.5:7b) and the cohort's composite-harm-axis judges (llama-guard3:8b, shieldgemma:9b) disagree on llama3.2 outputs in a way they do not disagree on llama3.1 or qwen outputs at the same quant rungs. The mechanism interpretation is straightforward: llama3.2's RLHF refusal templates appear stable to the refusal axis (so the refusal-axis pairwise kappa never falls below threshold and the rlhf-only verdict reads as
insufficient_datafor lack of disagreement evidence), while the composite-harm content within those refusals is heterogeneous enough that the safety-specialist axis produces a disagreement signal the refusal axis cannot see.
16.3 The mean kappa_min shift as load-bearing magnitude
The -0.1529 mean kappa_min shift is the headline numerical magnitude of the P8 finding. It is computed across the 24 common cells (not just the 6 resurfaced ones), so it includes cells where rlhf-only and expanded_nonrlhf produced similar kappa_min values as well as the 6 cells where the shift is large. A pooled mean of -0.153 across the full common-cell set indicates that the resurfaced cells dominate the average; the non-resurfaced cells contribute small shifts that do not wash out the signal.
Observations. A shift of -0.153 in mean kappa_min is not a numerical curiosity; it is the operational difference between "the cohort agrees enough for the JTP framework to license a class declaration" and "the cohort does not." Recall the v1 JTP class thresholds: kappa < 0.4 reads as untrustable; 0.4-0.7 as triangulate; >=0.7 as robust. A shift of 0.15 units in kappa_min can move a cell across one of those thresholds, and on the 6 resurfaced cells it moves them from "undecidable" to "judge-dominated at the chance-agreement floor."
The cohort-composition sensitivity is not a sample-size artifact and it is not a noise artifact. It is a structural property of the underlying judge population: the rlhf-only general-purpose judges and the safety-specialist judges measure different things, and which subset is in the cohort determines what the JTP framework licenses you to say about the cell.
Observations. The -0.1529 shift carries a structural reading and a calibration reading. Structurally, it documents that adding the composite-harm axis to a refusal-axis-only cohort moves the kappa_min metric downward by an amount that crosses the v1 JTP threshold ladder on the cells where the cohort gap is concentrated. As calibration, it sets a stability envelope on the TR148 v2 anchor kappa = 0.6917: a cohort change of one specialist-axis dimension produces a kappa_min movement of order 0.15, which is the right order-of-magnitude for the bridge paper's Layer 1a claim ladder to absorb as a measured-not-assumed sensitivity coefficient.
The reading that "kappa = 0.6917 is the JTP anchor" is correct as far as it goes; TR167 v2 adds that the same anchor computed under a different cohort moves by approximately 0.15 in the kappa_min metric on the cells where the cohort gap is structurally concentrated. The downstream consumer (bridge paper Layer 1a, the program external JTP submission, sibling v2 reports TR166 and TR168) gains a sensitivity envelope rather than just a point estimate.
16.4 Drop-one-judge cheap_score AUC -- discrimination recovers
A third strand of P8 evidence comes from the drop-one-judge sensitivity sweep over the cheap_score AUC. The full-cohort AUC is undefined on the rlhf-only holdout (degenerate single class, Section 15). When a single judge is dropped from the cohort and the cell-level outcomes are recomputed, however, the cohort composition changes enough that the class distribution becomes non-degenerate, and AUC becomes computable:
| Cohort modification | cheap_score AUC |
|---|---|
| without gemma3:12b | 1.0000 |
| without qwen2.5:7b | 0.8333 |
Observations. Dropping gemma3:12b from the cohort yields a cheap_score AUC of 1.0000 -- the cheap_score perfectly orders the resulting cell outcomes. Dropping qwen2.5:7b yields 0.8333, well above the 0.5 random floor. The structural-degenerate verdict on the full rlhf-only cohort is therefore not evidence that the cheap_score lacks discriminative content; it is evidence that the cohort composition is the limiting factor. When the cohort composition changes such that the class distribution becomes non-degenerate, the cheap_score recovers binary discrimination at non-trivial AUC values.
The cohort-composition sensitivity cuts both ways. It is what blocks the primary verdict on the headline rlhf-only holdout (degenerate single class), and it is also what restores binary discrimination once the composition shifts. The same mechanism explains both the failure of the primary verdict and the directional positive signal in the secondary analysis.
The asymmetric judge-contribution reading. The two drop-one-judge AUC values are not symmetric, and the asymmetry carries the substantive interpretation. Dropping gemma3:12b restores AUC = 1.0000; dropping qwen2.5:7b restores AUC = 0.8333. The interpretation is that gemma3:12b functions as the tie-breaker judge in the rlhf-only cohort: it is the judge whose labels are most responsible for pushing the borderline cells across the judge-sensitivity threshold and therefore for saturating the class distribution at 100% positive. Removing gemma3:12b restructures the class distribution so that some cells fall back to judge-stable, the binary class becomes non-degenerate, and the cheap_score can rank-order the resulting cells perfectly at n=10. Removing qwen2.5:7b restructures the class distribution by a smaller amount -- enough to make AUC computable and well above the 0.5 floor, but not enough to produce the perfect ordering that gemma3:12b removal produces.
Observations. The AUC = 1.0000 value at n=10 must be read with the small-sample caveat that follows it: a perfect AUC at this sample size is unstable and a bootstrap CI is the proper way to characterize the result, which is precisely why SS15 Extension 2 names the systematic drop-one-judge cohort-composition sweep with bootstrap CIs as a follow-up scope. The point of the value here is not that AUC = 1.0000 is the deployable number; it is that the cheap_score can discriminate once the class distribution is not degenerate, and the cohort modification that lifts the degeneracy is the removal of gemma3:12b specifically. The 0.8333 value under qwen2.5:7b removal corroborates that the discrimination is not a single-judge accident -- the cheap_score also discriminates above floor when qwen2.5:7b is the removed judge.
The asymmetric judge-contribution finding ties directly to the TR148 v2 dual-axis methodology. In TR148 v2, the cross-axis disagreement was characterized as a property of two orthogonal latent constructs (refusal vs composite-harm). TR167 v2 sharpens this: even within the refusal axis, the two judges (gemma3:12b and qwen2.5:7b) do not contribute symmetrically to cohort sensitivity. gemma3:12b is the cohort member whose presence saturates the class distribution at the rlhf-only pool, and qwen2.5:7b is the cohort member whose presence is informative but not determinative. This is a finer-grained finding than the dual-axis result and it is reportable on its own terms.
16.5 The connection to TR148 v2 dual-axis -- exactly the cells where the specialist axis disambiguates
The 6 resurfaced cells are precisely the cells where the TR148 v2 dual-axis result predicted disambiguation would occur. TR148 v2 documented that the safety-specialist axis (llama-guard3:8b + shieldgemma:9b) measures composite-harm content while the refusal axis (gemma3:12b + qwen2.5:7b) measures response-refusal templates, with cross-axis kappa values anti-correlating at -0.13 to -0.26 in the parent TR148 v2 substrate. On a cell where the model's refusal template is stable but its composite-harm content is heterogeneous, the refusal axis sees no within-axis disagreement (so kappa_min on the refusal-axis-only cohort lands above the JTP-class threshold or returns insufficient_data for lack of disagreement signal), and the composite-harm axis sees disagreement that pulls the worst-pair kappa_min toward the chance-agreement floor.
Observations. The 6 llama3.2 cells fit this exact pattern. Their rlhf-only outcome is insufficient_data -- the refusal-axis cohort cannot decide. Their expanded_nonrlhf outcome is judge-dominated, kappa_min=0.000 -- once the composite-harm axis is added, the worst pairwise kappa across the full four-judge cohort hits the floor. The cells are not noisy at the refusal-axis level; the refusal-axis judges agree with each other on the refusal templates. The cells are noisy at the cross-axis level, and the cross-axis noise is what drops kappa_min to zero.
This is the JTP v1 dual-axis claim restated as a quantitative cohort-composition-sensitivity measurement. v1 showed that the two axes measure different things; v2 measures how much the JTP class assignment moves when one axis is added to a cohort previously restricted to the other. The 6 resurfaced cells, the zero masked flips, the mean kappa_min shift of -0.1529, and the asymmetric drop-one-judge AUC pattern (1.0000 without gemma3:12b vs 0.8333 without qwen2.5:7b) are four mutually consistent strands of evidence for the same underlying conclusion: cohort composition is a first-order variable in JTP-class assignment, not a second-order robustness check.
16.6 Interpretive synthesis -- what TR167 v2 sharpens vs TR148 v2
The v1 JTP framework (TR148 v2) surfaced the dual-axis distinction between the refusal axis (gemma3, llama3.1) and the composite-harm axis (shieldgemma, llama-guard3): two orthogonal axes, not one noisy axis, with cross-axis anti-correlation in kappa (kappa = -0.13 to -0.26 in the TR148 v2 substrate). TR167 v2 does not contradict that finding; it sharpens it. Even within the dual-axis structure, the choice of which safety-specialist judges to include in the cohort materially shifts kappa_min and resurfaces JTP-class assignments. The dual-axis distinction is necessary but not sufficient to characterize JTP-class stability; cohort composition within each axis is also load-bearing.
The methodological consequence is that JTP cohort-composition declarations are Tier-1-supported-with-caveat, not Tier-1-supported-unconditional. A class assignment of "judge-dominated" or "insufficient_data" or "judge-sensitive" for a given cell is a property of the (cell, cohort) pair, not the cell alone. Reporting JTP class without naming the cohort -- and ideally without reporting the drop-one-judge sensitivity over the cohort -- understates the conditionality of the claim. This is a reporting-discipline finding more than a corpus finding, and it propagates back into how the bridge paper Layer 1a anchor (TR148 v2 dual-axis refusal/composite-harm) should be cited downstream: anchored to the cohort it was computed on, with a documented sensitivity to single-judge removal.
Observations. The P8 finding does not invalidate the JTP framework. It tightens the framework's licensing language. JTP is a class declaration about a (cell, cohort) pair; the cohort-composition sensitivity is a property of the declaration that must be reported alongside the declaration itself. The 6 resurfaced cells, the zero masked flips, the mean kappa_min shift of -0.1529, and the drop-one-judge AUC recovery to 1.0000 / 0.8333 are mutually consistent strands of evidence for the same underlying conclusion: cohort composition is a first-order variable in JTP-class assignment, not a second-order robustness check.
This is the load-bearing positive result of TR167 v2. The primary verdict is structurally barred (degenerate single class on the rlhf-only holdout, Section 15), but the P8 secondary analysis carries a substantive, directional, internally consistent finding about how the JTP framework should be cited. The finding is publishable on its own terms as a methodological observation about JTPv1 reporting discipline.
16.7 Why P8 is the publishable methodological finding of TR167 v2
A predictive-validity follow-up that returns FAIL_OR_INSUFFICIENT_DATA on its pre-registered primary claim is not, on its own, a publishable contribution. What makes TR167 v2 publishable is that the same substrate that blocks the primary claim contains an internally consistent, quantitatively characterized, methodologically rigorous secondary finding: pool robustness, with six exact resurfaced flips, zero masked flips, a -0.1529 mean kappa_min shift on the 24-cell common-cell intersection, and an asymmetric drop-one-judge AUC pattern (1.0000 without gemma3:12b vs 0.8333 without qwen2.5:7b). The P8 finding converts the qualitative v1 dual-axis observation into a quantitative cohort-composition-sensitivity measurement. That conversion is the publishable methodological contribution.
Observations. Three properties make P8 publishable on its own terms. First, internal consistency: four independent observables (resurfaced flip count, masked flip count, mean kappa_min shift, drop-one-judge AUC) point to the same underlying conclusion. Second, falsifiable mechanism: the dual-axis interpretation predicts exactly which cells should resurface (cells where refusal templates are stable but composite-harm content is heterogeneous), and the substrate observation matches the prediction (the six llama3.2 cells at mid-to-high quant precision). Third, downstream load-bearing: the bridge paper Layer 1a anchor, the TR148 v2 v1 kappa=0.6917 anchor, and the sibling v2 reports (TR166 RTSIv2, TR168 CRIv2) all consume the cohort-composition envelope rather than just the v1 point estimate.
The honest TR167 v2 framing is that the substrate licenses two coordinated claims. The primary claim -- "cheap pre-rejudge signals predict judge-sensitivity on a disjoint LOFO holdout" -- is not testable on this substrate because the rlhf-only holdout is a structurally degenerate single class. The secondary claim -- "JTP class assignment is materially sensitive to judge-cohort composition, with a quantified asymmetric flip count of 6/0 and a mean kappa_min shift of -0.1529 on the 24-cell common-cell intersection" -- is testable, tested, and licensed. The publishable contribution is the secondary claim. The primary claim is deferred to TR167 v2 cloud expansion via run_paper.py, which is scoped in SS15 to introduce judge-stable holdout cells via gemma/phi/mistral families on vLLM at ~10-30 USD RunPod A100 PCIe over 1-2 days wall-clock.
17. SS10. Verdict -- Why Both Req1 and Req2 Return INSUFFICIENT_DATA
The pre-registered verdict resolver was deliberately designed so that ambiguous or under-powered evidence does not get rounded up into a positive call. TR167 / JTPv2 was set up with two hard requirements -- Req1 (cheap signal predicts judge-sensitivity) and Req2 (combined cheap model beats trivial baseline) -- and both return FAIL_OR_INSUFFICIENT_DATA on this substrate. This section walks through, requirement by requirement, why each one aborts, what did clear the bar, and how to read the resulting verdict honestly. The walk-through is deliberately granular because the report's interpretive ceiling depends on the reader understanding that the verdict simultaneously and self-consistently NEGATIVELY reports the primary pre-registered claim while POSITIVELY reporting a structural finding about the JTP framework's behavior on production-GGUF rlhf-only corpora -- and the two halves of that verdict are not in tension, they are the same observation seen from opposite ends.
17.1 Req1 -- Cheap signal predicts judge-sensitivity
Req1 demanded (i) a Spearman / Pearson rho with the correct sign AND statistically significant at alpha=0.05, (ii) a cheap_score ROC-AUC above 0.5 with the 95% CI excluding 0.5, AND (iii) a LOOCV logistic out-of-fold AUC above 0.5. All three are evaluated on the rlhf-only / holdout subset (n=10 valid cells). On this substrate the resolver finds:
| Sub-criterion | Required | Observed | Pass? |
|---|---|---|---|
| Spearman rho sign on cheap_score vs kappa_min | negative | -0.1566 | yes (sign) |
| Spearman rho significance | p < 0.05 | p = 0.6657 | no |
| ROC-AUC > 0.5 | defined and > 0.5 | undefined (positives=10, negatives=0) | no |
| ROC-AUC CI excludes 0.5 | defined CI | undefined | no |
| LOOCV out-of-fold AUC > 0.5 | defined and > 0.5 | insufficient_data | no |
| Monotonicity: mean kappa_min HIGH < LOW | True | 0.023 < 0.221 | yes |
| Monotonicity: judge-sensitive rate HIGH > LOW | True | 1.000 = 1.000 | no (degenerate) |
degenerate_single_class flag |
False | True | trip |
Observations. Two of the eight rows are green: the cheap_score / kappa_min Spearman rho has the correct negative sign (-0.1566), and the kappa_min monotonicity test along the cheap_score tertiles passes in the expected direction (mean kappa_min HIGH=0.023 < LOW=0.221). Everything else either fails to reach significance at n=10 (p=0.6657 on the headline rho, with a 95% CI [-0.8178, 0.7116] that straddles zero) or is structurally undefined because the rlhf-only / holdout pool produced positives=10, negatives=0 -- a single-class outcome on which ROC-AUC, AUC-CI, and LOOCV AUC cannot be computed at all. The degenerate_single_class=True flag short-circuits the resolver: even if the rho had been significant, the discrimination half of Req1 has nothing to discriminate against. The abort here is therefore double-layered: the rho fails the significance bar (a power problem at n=10), and the discrimination machinery fails the definedness precondition (a structural-substrate problem). Either failure on its own would have tripped Req1; both occur simultaneously, and they are not redundant -- they are independent obstacles that the cloud-expansion path in Section 22 has to dissolve separately.
The directional evidence is genuinely there -- correct sign, monotone shift in mean kappa_min across the tertile bands -- but the pre-registration required both a significant rank correlation and a defined ROC-AUC excluding 0.5. The verdict mechanism is doing exactly what it was designed to do: refusing to upgrade "consistent direction at n=10" into "validated predictive signal".
17.2 Req2 -- Combined cheap model beats trivial baseline
Req2 was the head-to-head skeptic test: the cheap_score_composite logistic model had to beat a trivial baseline (single-judge UNCLEAR rate or majority) on (i) delta-AUC with bootstrap CI excluding zero, (ii) DeLong p<0.05, AND (iii) nested-LRT p<0.05 for baseline + cheap vs baseline-only. On this substrate every component of Req2 inherits the same structural blocker as Req1's discrimination half:
| Sub-criterion | Required | Observed |
|---|---|---|
| Delta-AUC CI excludes 0 | defined CI | insufficient_data |
| DeLong p < 0.05 | defined p | insufficient_data |
| Nested LRT p < 0.05 | defined p | insufficient_data |
| AUC(cheap) | defined | -- (undefined) |
| AUC(unclear_rate) | defined | -- (undefined) |
| AUC(majority) | 0.5 baseline | 0.5000 (trivial) |
Observations. Req2 cannot be evaluated because its inputs are undefined. With positives=10 and negatives=0, neither the cheap-composite AUC nor the unclear-rate AUC is computable, the delta-AUC has no defined point estimate, the DeLong test has no defined statistic, and the nested-LRT collapses to a comparison of two models that both perfectly classify the (single-class) outcome. The LOFO validation in P6 carries the same problem one layer up: aggregate cheap_score AUC over the two folds is undefined; the held-out llama fold has n=10 and the held-out qwen fold has only n=5, and both report insufficient_data. The resolver reports beats_baseline_verdict = FAIL_OR_INSUFFICIENT_DATA and delong_p_below_alpha05 = False not because the cheap features lost the race, but because no race could be run. Req2 inherits Req1's degeneracy and cannot be rescued independently: Req2's admissibility is downstream of Req1's discrimination machinery resolving to a defined number, and when Req1's AUC is undefined Req2's delta-AUC cannot exist.
Req2 is not a negative finding against the cheap composite; it is a "no test was possible" finding. Reading "FAIL_OR_INSUFFICIENT_DATA" as "the cheap features added no power" misstates the substrate -- the substrate did not support the test at all. This distinction is methodologically distinct from the more familiar "the cheap signal does not predict judge-sensitivity" reading: the substrate does not let us answer the predictive-validity question at standard depth on the rlhf-only / holdout lane, and the honest report is that the question was unanswerable, not answered in the negative.
17.3 The honest read of the verdict
The headline cheap_predicts_verdict = FAIL_OR_INSUFFICIENT_DATA plus beats_baseline_verdict = FAIL_OR_INSUFFICIENT_DATA should not be read as "the cheap signal does not predict judge-sensitivity". It should be read as: "the substrate as collected -- 24 rlhf_only cells across 4 llama+qwen GGUF models x 6 quants, of which 10 are valid in holdout and all 10 are judge_sensitive=True -- does not support a binary discrimination test, because the rlhf-only judge cohort on the production-GGUF lane produces 100% judge-sensitive valid cells".
That is itself a substantive methodological finding, and it is the load-bearing positive observation of the report. The structural-degenerate-class verdict is co-published with the P8 pool-robustness result: 6 rlhf-only insufficient_data cells resurface as judge-dominated (kappa_min=0.000) under expanded_nonrlhf, with a mean kappa_min shift of -0.1529 across the 24 common cells. The same cheap_score that fails the binary discrimination test under rlhf-only achieves drop-one-judge AUC = 1.0000 (without gemma3:12b) and 0.8333 (without qwen2.5:7b) on the expanded pool, indicating that the cheap signal does carry usable information once the outcome class is no longer degenerate -- but TR167's pre-registered Req1/Req2 cannot certify that within the rlhf-only / holdout frame.
Observations. The verdict structure is therefore asymmetric across the two reporting axes. On the primary axis (pre-registered cheap-signal predictive validity) the verdict is NEGATIVELY reported -- FAIL_OR_INSUFFICIENT_DATA on both Req1 and Req2, the directional rho is consistent but not significant, the monotonicity is consistent but rate-based discrimination is undefined. On the secondary axis (structural property of the JTP framework on the GGUF-local rlhf-only lane) the verdict is POSITIVELY reported -- 100% judge-sensitive valid cells at standard depth is a substantive finding about cohort behavior, the P8 pool-robustness panel documents 6 directionally consistent flips with zero reversals, and the drop-one-judge AUC pattern (1.0000 / 0.8333) is internally consistent with cohort-composition sensitivity as the dominant lever. The two reports do not contradict each other; they are the two true statements about what the substrate allowed us to learn.
The canonical resolution path is the cloud-expansion fire detailed in Section 22 --
run_paper.pyadds gemma / phi / mistral families on GPTQ / AWQ / fp16 cells via vLLM at approximately 10-30 USD on RunPod A100 PCIe and 1-2 day wall-clock. The expected outcome is that cross-family architectural diversity introduces judge-stable holdout cells (cells with kappa_min above the 0.7 threshold), the binary discrimination machinery becomes defined, ROC-AUC and LOOCV-AUC resolve to numbers, the DeLong delta-AUC and nested-LRT have denominators, and the rho gains the n it needs to clear significance if the directional signal observed at n=10 holds at larger sample size. If that outcome materializes, the predictive-validity claim becomes evaluable and the cheap_score either validates or falsifies on a non-degenerate substrate. If instead cloud expansion fails to introduce judge-stable cells -- the structural-degenerate verdict generalizes to a universal-collapse claim about JTP validity on production-quantization corpora -- then pool robustness via P8 becomes the canonical TR167 deliverable and the directional kappa_min monotonicity signal is the entire load-bearing positive result from the predictive-validity arm. Either branch is publishable; TR167 v1 is the substrate on which the branch is selected, not the report that selects it.
What reviewers should NOT conclude from this verdict. Two readings of the FAIL_OR_INSUFFICIENT_DATA verdict are inferential oversteps that the substrate does not license, and we name them explicitly to forestall them. First, reviewers should NOT conclude that cheap pre-rejudge signals are useless or that the cheap-signal hypothesis has been falsified. The substrate does not support that inference -- the rho carries the correct sign, the kappa_min monotonicity test passes in the predicted direction, and the drop-one-judge AUC values of 1.0000 and 0.8333 demonstrate that the cheap_score does discriminate once the outcome class is non-degenerate. The substrate is incapable of testing the prediction on the headline holdout; it is not evidence against the prediction. Second, reviewers should NOT conclude that the JTP framework is broken or that the parent TR148 v2 kappa = 0.6917 anchor is suspect. The JTP framework licensed exactly the class declarations its v1 protocol allowed it to license; what TR167 v2 surfaces is that those declarations are conditional on cohort composition (a (cell, cohort) property, not a cell property), and that conditionality is itself a methodological refinement of the framework, not a refutation. The bridge paper Layer 1a anchor inherits a strictly sharper version of S2 -- kappa = 0.6917 with a measured pool-robustness envelope of -0.1529 -- not a weakened one.
The verdict is therefore NEGATIVE on the primary pre-registered claim, POSITIVE on the structural / pool-robustness secondary finding, and FORWARD-LOOKING via the cloud-expansion path that converts the directional Req1 signal into a fully evaluable test. All three halves matter: the negative half disciplines the predictive-validity claim down to "directional only, not significant, structurally untestable on this lane", the positive half elevates the JTP-cohort-composition observation -- the GGUF-local rlhf-only lane misses 6 judge-decisions that the safety-specialist axis catches -- into the report's headline contribution, and the forward-looking half scopes exactly what TR167 v2 needs to add to convert the directional signal into a certified test. Section 18 develops the mechanism interpretation that explains why the degenerate single class arose; Section 19 places this verdict in the cross-reference geometry that ties it to TR148 v2 and the bridge paper; Section 22 lays out the cloud expansion path as the canonical resolution.
18. SS11. Mechanism Interpretation -- Why JTP Class is Degenerate on rlhf-only
This section is the structural-mechanism companion to SS10. Where SS10 documented the outcome of the degenerate-class verdict, SS11 asks why the rlhf-only / GGUF-local lane at standard depth produces a class distribution that saturates at judge-sensitive=True. Five candidate mechanisms compete for the explanation, and the substrate, when read against the TR148 v2 dual-axis anchor and the P8 pool-robustness panel, lets us narrow the field substantially. The methodological point is that picking the right mechanism matters for what TR167 v2 should change: a power problem is fixed by more cells; a feature-engineering problem is fixed by adding features; a cohort-axis problem is fixed by changing the cohort, which is exactly what the cloud-family expansion in SS15 is scoped to do.
18.1 Candidate mechanisms
Five candidate mechanisms are on the table:
- (a) Power problem. The substrate has n=10 holdout cells; the absence of judge-stable cells is sampling noise around an underlying positive-class rate below 100%. Under this mechanism, a larger n drawn from the same generating process would surface judge-stable cells.
- (b) Threshold-tuning problem. The 0.7 stable-threshold on kappa_min is too strict for the rlhf-only / GGUF-local generating process; with a lower threshold (say 0.5), the class distribution would become non-degenerate. Under this mechanism, re-tuning the threshold post-hoc would rescue the verdict ladder.
- (c) Feature-engineering problem. The seven cheap features are misspecified or under-expressive; a richer feature set would discriminate at the cell level even on a degenerate-class holdout. Under this mechanism, adding features at n=10 would lift the verdict.
- (d) Family-coverage problem. The two-family GGUF-local lane (llama, qwen) does not span the relevant family axis; a third family from a structurally distinct lineage would populate the judge-stable end of the kappa_min distribution. Under this mechanism, adding cloud families (gemma, phi, mistral) is the substrate-side intervention.
- (e) Cohort-axis-collapse problem. The rlhf-only judge cohort is composed entirely of refusal-axis judges (gemma3:12b, qwen2.5:7b, plus the regex baseline), all measuring response-refusal behavior; no judge in the cohort measures composite-harm content. Under this mechanism, GGUF quantization perturbs the refusal-axis distribution on every cell enough to cross the judge-sensitivity threshold, while the orthogonal composite-harm axis -- which is absent from the rlhf-only cohort -- is exactly the axis that would have certified any cell as judge-stable.
18.2 Walking each mechanism against the substrate
Mechanism (a) -- power problem. This mechanism predicts that more cells from the same generating process would surface judge-stable cells. The substrate evidence is mixed but tilts against (a): the pooled 24-cell distribution already shows 18 of 24 cells judge-sensitive under rlhf-only (75%), and the 10-cell holdout is 10 of 10 (100%). The class boundary is not at 50% -- it is at the right edge of the substrate. More cells from the same generating process would land disproportionately on the judge-sensitive side, not the judge-stable side, and the rate of judge-stable cells is structurally close to zero rather than to the 30% threshold a "more cells fixes it" reading would predict. Mechanism (a) is not the dominant explanation.
Mechanism (b) -- threshold-tuning problem. This mechanism is pre-registered against. The 0.7 stable-threshold is inherited from TR148 v2 (Landis-Koch substantial-agreement band edge) and is held fixed as a property of the JTP framework, not a TR167 hyperparameter. Re-tuning the threshold to rescue a degenerate class is a defensibility-bar violation, and the pre-registration explicitly forbids it. Mechanism (b) is foreclosed by design.
Mechanism (c) -- feature-engineering problem. This mechanism predicts that adding features would lift the verdict. The substrate evidence rules it out structurally: ROC-AUC is undefined for any feature on a single-class holdout, regardless of how rich the feature set is. Adding features at n=10 increases degrees-of-freedom risk without addressing the single-class barrier. Mechanism (c) is not the binding constraint.
Mechanism (d) -- family-coverage problem. This mechanism predicts that adding cloud families (gemma, phi, mistral) would introduce judge-stable cells. The substrate evidence is consistent with (d) but does not uniquely identify it: cross-family architectural diversity is a plausible source of judge-stable cells, and the TR167 v2 cloud expansion in SS15 is scoped specifically to test it. Mechanism (d) is a live hypothesis but it is not the most substrate-consistent reading on its own.
Mechanism (e) -- cohort-axis-collapse problem. This mechanism predicts that the rlhf-only cohort is structurally blind to the axis that would certify a cell as judge-stable. The substrate evidence directly confirms it: the P8 pool-robustness panel documents 6 cells where the rlhf-only cohort returns insufficient_data and the expanded_nonrlhf cohort -- which adds the composite-harm axis (llama-guard3:8b, shieldgemma:9b) -- returns judge-dominated with kappa_min = 0.000. Zero cells flip in the reverse direction. The asymmetric drop-one-judge AUC pattern (1.0000 without gemma3:12b, 0.8333 without qwen2.5:7b) further confirms that the cohort-axis composition is the dominant lever. Mechanism (e) is the substrate-consistent reading.
Observations. Mechanism (e) is the most substrate-consistent reading and the only one that explains all four mutually consistent strands of P8 evidence (6 resurfaced cells, 0 masked flips, mean kappa_min shift -0.1529, asymmetric drop-one-judge AUC pattern). It also coheres with the TR148 v2 dual-axis finding: the refusal-axis and composite-harm-axis are not noisy estimates of the same underlying construct; they are orthogonal measurement axes, and a cohort restricted to one axis is structurally incapable of certifying cells whose JTP class depends on the other axis. The cloud-family expansion in SS15 / Mechanism (d) is the substrate-side intervention that also helps -- introducing cross-family diversity may populate the judge-stable end of the kappa_min distribution -- but the dominant mechanism here is cohort-axis-collapse, not family-coverage. The two interventions are complementary: cohort expansion (adding the composite-harm axis to the rlhf-only pool) is what P8 already partially executes via expanded_nonrlhf, and family expansion is the additional substrate-side lift that TR167 v2 brings.
The mechanism interpretation has direct downstream consequences. The bridge paper Layer 1a anchor at
papers/serving_state_safety_certification/is the refusal-axis JTP slot, and the bridge paper Layer 1b is the composite-harm-axis screen. TR167 v2 confirms that these two layers are not redundant -- they are orthogonal screens, and the rlhf-only / GGUF-local lane shows what happens when only Layer 1a is consulted on production-quantization substrate: 6 cells out of 24 register asinsufficient_databecause the cohort cannot decide, and the same 6 cells resurface asjudge-dominatedonce Layer 1b is added. The sibling v2 reports (TR166 / RTSIv2, TR168 / CRIv2) face the same risk on their respective single-axis cohorts; the methodological recommendation is to always run JTP-class declarations as a (cell, cohort) pair and to disclose the cohort composition in the verdict report.
19. SS12. Cross-Reference to TR148 v2 and the Bridge Paper
TR167 v2 does not stand alone. It is the second member of a four-screen predictive-validity series (RTSI / JTP / TAIS / CRI), and its scientific load-bearing role is to sharpen, not replace, the v1 JTP measurement-validity claim that TR148 v2 anchored. This section positions TR167 v2 relative to its parent report, the bridge paper Layer 1a anchor it feeds, and its sibling v2 follow-ups, and it argues that the value of the v2 series is best read as a single methodological program move rather than as four independent follow-ups stapled together.
19.1 The TR148 v2 anchor and the kappa=0.6917 number
TR148 v2 (PublishReady at Technical_Report_148.md, 1,556 lines) is the canonical JTP v1 substrate. Its headline finding was a cross-family Cohen's kappa of 0.6917 on the TR145 safety subset (n=12,809) under a four-judge cohort spanning the safety-specialist axis (llama-guard3:8b, shieldgemma:9b) and the general-LLM axis (gemma3:12b, qwen2.5:7b). TR148 v2 also surfaced the dual-axis methodology finding: the two axes do not collapse onto a single noisy reliability dimension; they measure orthogonal latent constructs (response-refusal versus composite-harm). The kappa=0.6917 number sat just under the 0.70 robust threshold inherited from the v1 JTP class ladder -- right in the triangulate band where a v2 follow-up has the highest expected information value, because the v1 declaration was already conditional on the second-judge axis being in the cohort.
Observations. TR148 v2's anchor is not a single number; it is a number plus a cohort plus a dual-axis caveat. Read literally, the Tier-1 S2 line that the bridge paper currently carries -- "cross-family JTP triangulation produces kappa=0.6917 on the TR145 safety subset" -- omits two of those three load-bearing pieces of the v1 declaration. The TR148 v2 dual-axis finding (composite-harm-axis judges anti-correlating with refusal-axis judges at cross-axis kappa in the -0.13 to -0.26 range) implies that the 0.6917 number is itself a function of which cohort was present; what TR148 v2 did not yet quantify was the magnitude of the cohort-composition sensitivity.
TR167 v2's specific donation to TR148 v2 is to fill in that quantification. The cohort-composition sensitivity is no longer a methodological caveat to be cited in prose; it is a measured envelope -- 6 resurfaced flips on the 24 common cells, mean kappa_min shift of -0.1529 in the expanded direction, zero masked flips in the reverse direction -- attached to the same dual-axis observation TR148 v2 named. The 0.6917 point estimate survives. What changes is that the bridge paper Layer 1a anchor can now license a Tier-1 statement that is one level more precise than the v1 anchor licensed alone.
19.2 Tier-1 Supported claim sharpening (S2 before and after)
The bridge paper at papers/serving_state_safety_certification/ defines a five-layer certification protocol in which JTP serves as Layer 1a (measurement-validity gate, refusal-axis). Its claim ladder (CLAIM_LADDER.md) carries S2 as a Tier-1 Supported claim. TR167 v2 licenses a sharper Tier-1 statement of that same claim. The before/after pair is the cleanest way to see what the v2 series buys the layer.
| Tier-1 S2 phrasing | Status | Substrate license |
|---|---|---|
| "cross-family JTP triangulation produces kappa=0.6917" | TR148 v2 v1-only Tier-1 Supported | TR148 v2 anchor; cohort-composition sensitivity prose-only |
| "cross-family JTP triangulation produces kappa=0.6917 with quantified cohort-composition sensitivity that TR167 v2 measures (6 resurfaced flips on 24 common cells; mean kappa_min shift -0.1529; 0 masked flips)" | TR167 v2 v1+v2 Tier-1 Supported | TR148 v2 anchor plus TR167 v2 P8 envelope |
Observations. The sharpening is not the addition of a new claim; it is the substitution of a measured envelope for an unmeasured caveat. The bridge paper claim ladder is designed to absorb exactly this kind of strengthening without weakening the layer -- it converts a point estimate into a point estimate plus a measured stability envelope, which is the upgrade path Tier-1 claims are supposed to follow as v2 substrate lands. The TR167 v2 envelope is also asymmetric (all flips toward less agreement under cohort expansion, zero in the reverse direction), which is the structural property that licenses the bridge paper to motivate a multi-judge cohort as required rather than optional when invoking Layer 1a.
The honest read of the S2 sharpening is that the bridge paper Layer 1a gains a one-line caveat clause that strengthens, not weakens, the claim. The cohort is named; the envelope is named; the asymmetry direction is named. Future synthesis documents that reach for the Layer 1a anchor can now cite the kappa point estimate plus the TR167 v2 P8 envelope as a single composite anchor rather than carrying the v1 number plus an open methodological footnote about cohort sensitivity.
19.3 Bridge paper Layer 1 anchor consumption
The bridge paper's five-layer certification protocol consumes JTP at Layer 1a as a measurement-validity gate over the refusal axis, with Layer 1b consuming the composite-harm axis separately. TR167 v2 is consumed at Layer 1a in three concrete places. First, the S2 phrasing in CLAIM_LADDER.md Tier-1 absorbs the cohort-composition envelope as documented in 19.2. Second, the structural-degenerate observation on the rlhf-only holdout (10/10 cells judge-sensitive; ROC-AUC structurally undefined) is itself cited by the bridge paper when it motivates why Layer 1a requires the dual-axis cohort (refusal-axis plus composite-harm-axis judges) rather than the refusal-axis alone -- the rlhf-only collapse is the substrate-grounded evidence that the refusal-axis cohort alone cannot certify a cell. Third, the P8 drop-one-judge AUC values (1.0000 without gemma3:12b; 0.8333 without qwen2.5:7b) are consumed at Layer 1a as the operational definition of "single-judge sensitivity" inside the multi-judge cohort -- the layer's licensing language becomes "a JTP class assignment is a property of the (cell, cohort) pair," with the drop-one-judge sweep providing the falsifiability test.
Observations. Layer 1a's consumption of TR167 v2 is therefore not a single anchor swap; it is a three-part consumption that strengthens the S2 statement, motivates the dual-axis requirement, and operationalizes the cohort-sensitivity language. Each of these three consumptions is licensed by a distinct piece of TR167 v2 substrate, and each survives the primary FAIL_OR_INSUFFICIENT_DATA verdict because none of them depend on the cheap_score binary discrimination test passing.
The bridge paper consumption pattern is what makes the TR167 v2 negative-primary result publishable rather than merely defensible. The pre-registered cheap_score test failing on the rlhf-only holdout does not weaken the Layer 1a anchor; the structural-degenerate observation that produced the failure strengthens it, because it is exactly the evidence the bridge paper needs to license a dual-axis-required claim at Tier 1.
19.4 Sibling v2 series coherence as a methodological program move
TR167 v2 is the second installment in a four-screen v2 series. The series geometry is:
| Screen | v1 anchor (TR) | v2 follow-up (TR) | v2 status |
|---|---|---|---|
| RTSI (refusal-template stability index) | TR142 / RTSI v1 | TR166 / RTSIv2 | scaffold + smoke (sibling, sister-repo arXiv:2606.10154 public 2026-06-08) |
| JTP (judge triangulation protocol) | TR148 v2 / JTP v1 | TR167 / JTPv2 | this report; standard depth complete |
| TAIS (typical-acceptance invariance screen) | TR144 / TAIS v1 | TAIS v2 | not yet scaffolded |
| CRI (compile reproducibility index) | TR147 / CRI v1 | TR168 / CRIv2 | scaffold only |
Observations. Three of the four screens have a v2 follow-up at some stage of build (RTSIv2 scaffold+smoke, JTPv2 standard-depth complete, CRIv2 scaffold only); the fourth (TAIS v2) is the remaining gap in the program move. The four v2 reports share a single pre-registration discipline: train cheap features on the calibrate split, leave a model family out, ask whether the cheap predictor beats the trivial baseline on the disjoint holdout, abort on degenerate single-class outcomes rather than launder them into "no signal" verdicts. The discipline is not a feature of any single TR; it is the methodological program move that makes the four v2 reports a coherent series.
The methodological program move is what converts four independent v2 reports into a single piece of evidence about how the v1 anchors should be cited downstream. Each v2 report converts its v1 point estimate into a v2 stability envelope; the bridge paper's five-layer certification protocol consumes those envelopes layer-by-layer; the resulting protocol is anchored in v1-points-plus-v2-envelopes rather than v1-points-alone. TR167 v2's load-bearing donation to this geometry is the JTP envelope on Layer 1a, and the structural-degenerate observation that motivates the dual-axis requirement at the same layer.
19.5 The TAIS v2 gap and the program's forward shape
The TAIS v2 follow-up is the one piece of the four-screen v2 series that is not yet scaffolded. TR144's TAIS v1 anchor (max paired-binary Cohen's h = 0.024 on the speculative-decoding-safety substrate) is already Tier-1 Supported in the bridge paper at the layer that consumes typical-acceptance invariance; what is missing is the v2 envelope that would discipline the TAIS Tier-1 anchor the same way TR167 v2 disciplines the JTP Tier-1 anchor. The program shape is therefore: three v2 envelopes consumed at three layers of the bridge paper today (RTSI / JTP / CRI), with a fourth TAIS v2 follow-up queued behind the same pre-registration template once the cloud-compute envelope licenses it.
Observations. The program move is forward-compatible. The TAIS v2 gap does not weaken the TR167 v2 contribution; it identifies the next v2 report the program owes itself once the cloud-compute envelope resolves. The bridge paper's five-layer protocol does not require all four v2 envelopes to land simultaneously to be usable; it requires each layer's v2 envelope to land before that layer's Tier-1 claim is sharpened.
The honest framing of the sibling-series coherence is that TR167 v2 is one step of a four-step program move that converts v1 anchors into v1-plus-v2 anchors at each layer of the bridge paper. The step is complete for JTP at Layer 1a. RTSI's v2 step is at smoke-test depth. CRI's v2 step is scaffold-only. TAIS v2 is the next scaffold the program owes itself. TR167 v2's specific contribution to that program is the P8 pool-robustness envelope on Layer 1a, plus the structural-degenerate observation that motivates the dual-axis requirement at the same layer, plus the demonstration that the pre-registration discipline (LOFO holdout, abort on degenerate class, report directional evidence honestly) is itself the methodologically rigorous frame for the entire v2 series.
20. SS13. Forbidden Claims.
This section closes the report by stating, in explicit and binding terms, the load-bearing claims that the TR167 substrate at run 20260610_204823 does NOT license. The forbidden-claims discipline is a structural boundary on the inferential reach of this report: it is the operational counterpart to the verdict matrix in SS01, and it is the safeguard against narrative inflation when downstream documents (the bridge paper Layer 1a refusal-axis anchor in papers/serving_state_safety_certification/, the program external JTP submission, the sibling v2 reports TR166 / RTSIv2 and TR168 / CRIv2, and future synthesis passes) reach for TR167 as supporting substrate. Each of the four claims below is forbidden, with the substrate-grounded reason articulated in turn, and each is matched with the defensible alternative phrasing that the substrate DOES license. The structural boundary closure in 20.5 then summarises the inferential ceiling and prescribes the disciplinary rule for any downstream consumer.
20.1 Forbidden Claim 1: "Cheap pre-rejudge signals do not predict judge-sensitivity."
This claim is UNSUPPORTED, and the reason is methodological rather than substantive. The TR167 substrate at the rlhf-only / holdout slice contains 10 valid cells with positives=10 and negatives=0; the binary discrimination test (ROC-AUC, PR-AUC above informative floor, LOOCV out-of-fold AUC, DeLong delta-AUC against the unclear-rate baseline) is therefore undefined, not failed. The cheap_score Spearman rho on kappa_min carries the correct NEGATIVE sign (-0.1566, 95% CI [-0.8178, 0.7116]) and does not clear conventional significance at n=10 (p=0.6657). The monotonicity test on mean kappa_min HIGH < LOW (0.023 < 0.221) passes in the directionally correct way. The honest verdict is FAIL_OR_INSUFFICIENT_DATA driven by degenerate_single_class=True, not a demonstration that cheap signals lack predictive validity.
Observations. The forbidden phrasing collapses an undefined test into a negative finding; the substrate licenses neither outcome. Two distinct failure modes are being conflated by this phrasing -- a structural single-class collapse on the holdout that makes binary discrimination mathematically undefined, and a continuous-axis correlation that is directionally consistent but underpowered. Asserting that cheap signals fail is a categorical error: failure requires a defined test that returned a value worse than the threshold, and TR167's primary test never returned a value at all.
The methodologically correct framing is: "the TR167 substrate cannot evaluate the predictive-validity claim because the rlhf-only holdout is a structurally degenerate single class." Asserting that cheap signals fail is an inferential overstep that the run cannot underwrite. The directional evidence -- correct-sign rho, passing kappa_min monotonicity -- is consistent with a weak-to-moderate true effect at n=10, which the substrate is structurally incapable of resolving. The defensible forward statement is that the TR167 v2 cloud expansion is the substrate-level intervention required to convert the undefined test into a defined one.
20.2 Forbidden Claim 2: "RLHF-only judges are sufficient for JTP triangulation."
This claim is UNSUPPORTED. P8 reveals 6 resurfaced flips on the common-cell set (rlhf_only insufficient_data -> expanded_nonrlhf judge-dominated kappa_min=0.000) -- specifically tr167-llama3-2-1b at Q5_K_M / Q6_K / Q8_0 and tr167-llama3-2-3b at Q4_K_M / Q5_K_M / Q6_K -- with mean kappa_min shift of -0.1529 (expanded minus rlhf) and 0 masked flips in the reverse direction. The rlhf-only cohort (gemma3:12b, qwen2.5:7b) misses judge-decisions that the safety-specialist axis (llama-guard3:8b, shieldgemma:9b) does not miss, consistent with the TR148 v2 dual-axis finding that refusal-axis judges and composite-harm-axis judges measure orthogonal constructs.
Observations. Sufficiency would require zero resurfaced flips at the cohort boundary; the substrate observes six, all in the same direction. The drop-one-judge sensitivity on cheap_score AUC reinforces this: without gemma3:12b the AUC sits at 1.0000; without qwen2.5:7b it sits at 0.8333. A cohort whose verdict changes under single-judge ablation is by construction not a sufficient cohort -- it is a cohort that defines a specific axis, not an exhaustive one.
The pool-robustness signal is asymmetric -- all six flips move toward less agreement under cohort expansion, none reverse. This rules out the "noise jitter" reading: the rlhf-only cohort is systematically under-counting judge-sensitivity on the GGUF-local lane. The defensible forward statement is that the rlhf-only cohort defines the refusal axis of the JTP screen, and a full JTP triangulation requires both the refusal axis and the composite-harm axis (TR148 v2 Layer 1a + Layer 1b), not one in isolation. Downstream consumers MUST NOT invoke TR167 as evidence that rlhf-only judging is sufficient.
20.3 Forbidden Claim 3: "The TR167 cheap_score is a deployable pre-rejudge screen."
This claim is UNSUPPORTED at the current substrate scope. Deployability requires a calibrated binary discrimination evaluable on a non-degenerate hold-out: an ROC-AUC with a CI excluding 0.5, a LOOCV out-of-fold AUC with CI excluding 0.5, and a DeLong delta-AUC against the trivial single-judge-unclear-rate baseline with CI excluding zero. The substrate provides NONE of these because the rlhf-only holdout has negatives=0. The directional evidence (Spearman rho -0.1566 with correct sign; monotonicity kappa_min HIGH < LOW = YES; HIGH band mean kappa_min 0.023 versus LOW band 0.221) is encouraging but does NOT license a deployment claim.
Observations. Directional evidence without significance and without binary discrimination is a pre-deployment signal, not a deployment license. The cheap_score is currently a fitted composite over seven features that has been trained on the calibrate split and evaluated on a single-class holdout fold; there is no operating point and there is no CI on the operating point. Deploying it as a pre-rejudge gate today would be a defensibility-bar violation at the same severity as deploying RTSI before TR142 v3 closed the LOOCV recall test.
The defensible forward-looking statement is: "the TR167 cheap_score warrants TR167 v2 cloud-expansion evaluation under run_paper.py to obtain a non-degenerate holdout and an evaluable binary AUC, after which a deployment claim may or may not be licensed." Anything stronger is forbidden. The bridge paper Layer 1a anchor MUST cite cheap_score as a pre-registered hypothesis under evaluation, not as a screening tool in production.
20.4 Forbidden Claim 4: "The structural-degenerate finding generalises beyond the GGUF-local lane."
This claim is UNSUPPORTED. The structural single-class observation -- that 10 of 10 valid holdout cells resolve to judge_sensitive=True on the rlhf-only pool -- is a property of the specific cohort, family, and quantisation lane that TR167 v1 exercised: four llama and qwen models across six GGUF quantisation rungs (Q2_K, Q3_K_M, Q4_K_M, Q5_K_M, Q6_K, Q8_0) judged by gemma3:12b and qwen2.5:7b under the Ollama local-serving backend. The cloud families that the rest of the program's quantisation lane runs through -- gemma, phi, mistral via GPTQ / AWQ / fp16 on vLLM -- are not in this substrate, and the run_paper.py cells that would introduce them have not yet fired. Until those cells land, the structural-degenerate finding is a GGUF-local observation, not a universal claim about JTP on production-quantization corpora.
Observations. Generalisation requires either a separate substrate that reproduces the single-class collapse on a disjoint family / quantisation lane, or a mechanistic argument that the lane-specific properties (Ollama-local serving, GGUF quantisation rungs, two-family register) are causally irrelevant to the single-class outcome. The TR167 v1 substrate supports neither. The cohort itself (rlhf-only general-LLM pair plus regex) is partially shared with v1, but the family register (llama + qwen) is a strict subset of the program's quantisation register, and the serving backend is GGUF-local-only.
The defensible forward-looking statement is: "the structural-degenerate single-class finding holds for the GGUF-local rlhf-only lane on the llama and qwen families at standard depth; whether it generalises to cloud families and to GPTQ / AWQ / fp16 quantisation cells awaits the TR167 v2
run_paper.pycloud expansion." The asymmetric outcome documented in SS15 makes this discipline tractable: if the cloud lane produces judge-stable cells, the finding is bounded to GGUF-local; if it does not, the finding generalises and the methodological recommendation strengthens. Both branches are publishable; neither is licensed today.
20.5 Structural boundary closure.
The four forbidden claims above define the report's inferential ceiling. The substrate licenses (1) a methodological negative-with-structural-finding on the primary cheap-signal predictive-validity question, gated by degenerate_single_class=True rather than by a defined-and-failed binary discrimination test; (2) a load-bearing positive secondary finding on pool robustness with 6 resurfaced flips, mean kappa_min shift -0.1529, and asymmetric direction; (3) directional evidence on cheap_score monotonicity that is suggestive but not deployment-grade; and (4) a lane-bounded structural observation that the GGUF-local rlhf-only register produces a single-class holdout at standard depth. Future synthesis documents -- the bridge paper Layer 1a refusal-axis anchor, the program external JTP submission, the sibling TR166 / RTSIv2 and TR168 / CRIv2 reports, and any downstream submission packet derived from TR167 -- MUST honor this boundary. Any reach beyond it is unsupported by run 20260610_204823 and would require either the TR167 v2 cloud expansion via run_paper.py (gemma / phi / mistral families on GPTQ / AWQ / fp16 via vLLM, approximately 10-30 USD on A100 PCIe, 1-2 days wall-clock) or an alternative non-degenerate corpus to license.
21. SS14. Limitations.
Seven structural limitations bound the scope of every claim in this report. None of them are fatal to the secondary pool-robustness finding, but several are individually fatal to the pre-registered primary claim, and they collectively define what TR167 v2 must change to convert the negative verdict into a credible positive or a credible structural-impossibility verdict. We enumerate them in order of where they bite the inferential chain: input-coverage limitations first (1, 2), pre-registration shape next (3), sample-size envelope after that (4, 6), substrate-shape last (5), and a temporal-rerun robustness gap (7) closing the list.
Limitation 1: Two model families only (GGUF-local cap). The substrate covers families_present = [llama, qwen]. The GGUF-local lane caps family diversity at what is locally quantizable on consumer hardware against the Ollama-served judge cohort; the cloud families (gemma, phi, mistral) live behind run_paper.py on a vLLM cloud node and are not present at standard depth. With only two families, the leave-one-model-family-out (LOFO) protocol degenerates into a two-fold cross-validation, and within-family generalization claims cannot be separated from across-family claims in any statistically meaningful sense. The family_code cheap feature also returns Spearman rho = None in P2 because between-family variance collapses to a binary categorical with no rank order to compute against. The downstream consequence for the verdict ladder is that any "cheap_score generalizes across families" claim is unfalsifiable on this substrate: there is no third family to validate transfer against, and the two folds the data does support either inherit the degenerate-single-class problem (held-out llama, n=10) or fall below the LOFO minimum cell count (held-out qwen, n=5). This limitation is the proximate input-coverage cause of limitations 4, 5, and 6 propagating as they do, and is the primary intervention target for the TR167 v2 cloud expansion.
Limitation 2: Standard depth, not paper depth. This run executed run.py --depth standard --skip-openai-judge, which produces the 24-cell rlhf-only headline pool at 528 records. The paper-depth run_paper.py arc that would add cloud GPTQ/AWQ/fp16 cells, the gemma/phi/mistral families, and the cross-cohort judge dispatch (Claude judges via the dispatcher at research/shared/anthropic_batch.py, gpt-4o via the dispatcher at chatgpt_adjudication/openai_batch.py once the umbrella resolves) is intentionally not in this substrate. Every quantitative limit in this report should be read as a standard-depth limit, not a ceiling on what JTPv2 can demonstrate at full depth. The --skip-openai-judge flag is not a methodological compromise -- it is the standing umbrella discipline for adversarial corpora -- but its consequence is that the judge cohort here is four Ollama-served models (llama-guard3:8b, shieldgemma:9b, gemma3:12b, qwen2.5:7b) plus the regex baseline, and the cross-axis triangulation possible at paper depth (refusal-axis vs composite-harm-axis vs Claude vs gpt-4o under a single panel) is structurally unavailable until both the OpenAI Researcher Access Program and the external acceptance signal resolve.
Limitation 3: Rlhf-only headline pool restriction. The pre-registered headline pool is restricted to gemma3:12b plus qwen2.5:7b as the cross-family general-LLM judge pair on the refusal axis, with llama-guard3:8b and shieldgemma:9b reserved for the expanded_nonrlhf pool used in the P8 secondary analysis. This is the right pre-registration choice for the primary claim under test -- the question is whether cheap signals can predict refusal-axis judge-sensitivity on a refusal-axis cohort -- and it deliberately enforces the TR148 v2 dual-axis discipline: refusal-axis judges measure response-refusal behavior; composite-harm judges measure a different latent construct with cross-axis kappa in the -0.13 to -0.26 range on the TR148 anchor. But the consequence is that the headline pool is structurally blind to the 6 resurfaced cells that the safety-specialist axis catches and the general-LLM pair misses (the llama3.2 1B and 3B GGUF cells at the higher quants enumerated in SS9 and SS11). The pre-registration is honest about this: P8 is named the pool-robustness check precisely because the headline pool's blindness to composite-harm signal is anticipated by the dual-axis framework rather than discovered after the fact.
Limitation 4: N=10 holdout cells is structurally under-powered (power approximately 0.18 at rho = 0.3). The Spearman significance test on cheap_score vs kappa_min returns p = 0.6657 at n = 10 holdout cells, with a 95% CI of [-0.8178, 0.7116] -- an interval that spans roughly the entire admissible range of a rank correlation. A standard 1-tail power calculation for Spearman's rho at rho = 0.3 with n = 10 and alpha = 0.05 yields approximately 0.18; even a moderate true effect would be missed roughly four times in five at this sample size. To raise power to the conventional 0.80 target at the same true rho would require n on the order of 65 to 80 cells, which is a 6.5x to 8x expansion of the holdout. The cheap_score Spearman rho on this substrate carries the correct negative sign (-0.1566), and the kappa_min monotonicity test HIGH band 0.023 < LOW band 0.221 passes, but neither result clears the pre-registered significance bar, and neither can: the standard error envelope at n = 10 is wider than the effect-size budget the predictive-validity hypothesis allows for. This is the proximate reason the directional evidence in P1 and P2 does not flip the verdict from FAIL_OR_INSUFFICIENT_DATA to a clean positive even though every directional test that can resolve at this n does resolve in the predicted direction.
Limitation 5: Degenerate single class on the rlhf-only holdout. All 10 valid holdout cells are judge_sensitive=True; positives = 10, negatives = 0. ROC-AUC integrates true-positive rate against false-positive rate, and with zero negatives the false-positive rate is 0/0 at every threshold, so the integrand does not exist. The LOOCV out-of-fold AUC inherits the same degeneracy at every fold (every held-out point carries the same class label), the DeLong delta-AUC and the nested LRT both return insufficient_data because their denominators collapse, and the PR-AUC reads 1.0000 against a random-floor PR-AUC of 1.0000 (a tautological "all predicted-positives are truly positive" because all examples are positive). This is not a power problem and it is not a tuning problem -- it is a structural property of the GGUF-local rlhf-only lane at standard depth, and it is the proximate reason the primary verdict is FAIL_OR_INSUFFICIENT_DATA rather than a clean reject. It is also a methodological finding in its own right (developed at length in SS9 and SS11), and one of the report's two load-bearing observations. A larger n drawn from the same generating process would not necessarily rescue the test, because the positive rate at this depth is 100% by the substrate's own behavior; only a different substrate composition (cloud families) or a different cohort composition (expanded_nonrlhf pool, P8 lane) can lift the degeneracy.
Limitation 6: The qwen LOFO fold has only 5 cells (below the LOFO threshold). Holding out qwen leaves a held-out fold of 5 cells, below the floor at which AUC estimates are stable enough to report. The held-out llama fold has 10 cells (the headline holdout) but inherits the degenerate-single-class problem in limitation 5. Aggregate cheap_score AUC across the two LOFO folds is therefore reported as insufficient_data, and the per-fold delta-AUCs are unrecoverable at this n. The downstream consequence is that the LOFO row in the verdict ladder reads as "two folds, both individually insufficient_data for different reasons," which means the across-family generalization claim that LOFO is supposed to test is not adjudicable on the v1 substrate at all. With 5 families and 5 LOFO folds (the v2 cloud configuration outlined in SS15), each fold would carry a meaningfully sized held-out cell count and the LOFO aggregate becomes the proper external-validity test the methodology pre-registers.
Limitation 7: No temporal-rerun robustness on the live_nonrlhf labels. The 264 live_nonrlhf records and the 264 v1_reuse records that compose the 528-record substrate were collected as part of the same run (20260610_204823) under the same judge endpoints, the same Ollama-served model tags, and the same prompt set at the same wall-clock window. There is no temporal-rerun arm in this substrate -- no second collection of live_nonrlhf labels at a later date against the same prompts and judges to check whether label drift across re-runs is small relative to the pool-induced verdict shifts measured in P8. The 6 resurfaced flips and the mean kappa_min shift of -0.1529 are therefore robust to read as cohort-composition effects only under the assumption that within-cohort label noise across rerun is small. The assumption is plausible -- Ollama judges at the same model tag are deterministic given the same prompt, sampling parameters, and seed -- but it is not measured on this substrate. A temporal-rerun arm at TR167 v2 (collecting live_nonrlhf labels a second time, a week or more later, against the same prompts and judges) would either confirm that the pool-robustness finding is invariant to rerun (strengthening it to a methodological recommendation) or surface a third axis of variance (rerun drift on the same cohort) that the current SS14 limitation list cannot bound.
Observations. The seven limitations stack rather than substitute. Limitations 1 and 2 (two families, standard depth) are upstream input-coverage problems; limitations 3 and 5 (rlhf-only pool restriction, degenerate single class) define the structural shape of the headline holdout; limitations 4 and 6 (n=10 holdout, n=5 qwen fold) define the sample-size envelope inside that shape; limitation 7 (no temporal rerun) bounds the inferential read of the load-bearing secondary finding. The cheap_score signal carries the correct negative sign and passes the kappa_min monotonicity check, but it cannot clear the pre-registered significance bar while limitations 4 and 5 hold simultaneously. The pool-robustness secondary finding -- 6 resurfaced flips, mean kappa_min shift -0.1529, drop-one-judge cheap_score AUC = 1.0000 without gemma3:12b and 0.8333 without qwen2.5:7b -- survives because it is computed on the 24 common cells across pools rather than on the 10 holdout cells, because it is a direct count of pool-induced verdict changes rather than a binary-discrimination metric, and because the within-cohort determinism of Ollama judges makes the no-temporal-rerun assumption in limitation 7 defensible even though unmeasured.
Limitations 1, 2, 4, 6, and 7 are sample-coverage and rerun-coverage limits that TR167 v2 (cloud expansion via
run_paper.pyplus a temporal-rerun arm) is designed to address. Limitations 3 and 5 are pre-registration and substrate-shape findings that may or may not survive cloud expansion. If cloud families introduce judge-stable holdout cells, limitation 5 dissolves and binary discrimination becomes evaluable. If they do not, the structural-degenerate verdict on JTP validity strengthens into a universal-collapse claim on production-quantization corpora, and the pool-robustness observation becomes the only methodologically rigorous JTPv2 observable on this lane at this depth. Either branch licenses a substantive methodological finding; neither branch licenses a deployment-grade cheap-signal screen on the v1 substrate alone.
22. SS15. Future Work -- TR167 v2 Cloud Expansion Path
The TR167 v1 standard-depth run leaves three concrete extensions queued, plus two methodological forward-looking paragraphs. Each is scoped against the load-bearing finding of this report -- that the rlhf-only judge cohort applied to the GGUF-local lane produces a structurally degenerate single-class holdout (positives=10, negatives=0), and that the cheap_score signal carries the correct directional sign (Spearman rho = -0.1566, p = 0.6657 at n=10; mean kappa_min HIGH band 0.023 vs LOW band 0.221) but cannot be tested for binary discrimination on this substrate. The three queued extensions are listed in priority order of methodological consequence; the two trailing paragraphs treat the asymmetric outcome of the cloud expansion and the question of whether TAISv2 should be built next as the missing fourth member of the v2 predictive-validity series.
22.1 Extension 1 -- run_paper.py cloud-family fire (gemma / phi / mistral via GPTQ / AWQ / fp16 on vLLM)
The first extension fires run_paper.py to add the cloud-family cells that the GGUF-local lane cannot produce. Currently families_present = [llama, qwen], with qwen contributing only 5 holdout cells (the LOFO qwen fold returns insufficient_data on that count alone). Adding gemma, phi, and mistral via GPTQ / AWQ / fp16 cells on vLLM brings the family count from 2 to 5 and the LOFO fold count from 2 to 5, which is the minimum the leave-one-family-out frame needs to be meaningful.
| Parameter | Value |
|---|---|
| Wrapper | run_paper.py (cloud-families lane) |
| Families added | gemma, phi, mistral |
| Quant cells per family | GPTQ, AWQ, fp16 |
| Engine | vLLM (A100 PCIe) |
| Approximate cost | 10-30 USD RunPod A100 PCIe |
| Approximate wall-clock | 1-2 days |
| Substrate today | 24 / 24 rlhf cells, 0 judge-stable holdout cells |
| Anticipated substrate after | 24 + cloud-family cells, expected non-zero judge-stable cells |
Observations. The current TR167 v1 substrate has zero judge-stable cells in the rlhf-only holdout (positives=10, negatives=0), which is precisely why ROC-AUC, LOOCV-AUC, DeLong delta-AUC, and the nested LRT are all undefined. Diversifying the family axis is the cheapest substrate-level intervention that could rescue binary discrimination without altering the cheap_score definition, the JTP framework, or the threshold ladder. The cost band -- 10-30 USD on RunPod A100 PCIe, 1-2 days wall-clock -- is well below the discretionary threshold for the program and does not require external compute cover; it can fire on the current operational lane subject to the standing --skip-openai-judge umbrella discipline.
The cloud-family fire is the substrate-level intervention. It does not change the JTP methodology and it does not change the cheap_score; it only changes the cell population the methodology is applied to. If the resulting holdout contains even a handful of judge-stable cells, the entire P3 / P4 / P6 row of insufficient_data verdicts becomes computable, and the cheap_score AUC under bootstrap CI -- the genuine Req1 test -- finally has a denominator. The anticipated effect is the introduction of judge-stable cells into the holdout: gemma, phi, and mistral are trained on different RLHF pipelines than llama and qwen, and the cross-family kappa structure documented in TR148 v2 (cross-family kappa = 0.6917 on n=12,809) predicts at least some cells will land above the 0.7 stability threshold under a heterogeneous family panel.
22.2 Extension 2 -- Systematic drop-one-judge cohort-composition sweep with bootstrap CIs
The second extension is the formal version of an observation already present in the P8 secondary-analysis block: drop-one-judge cheap_score AUC without gemma3:12b is 1.0000 and without qwen2.5:7b is 0.8333. These two values are documented but they are not enough; cohort composition deserves a systematic sweep at standard depth with bootstrap CIs over every leave-one-judge configuration, including the safety-specialist axis (llama-guard3:8b, shieldgemma:9b) which the rlhf-only pool currently excludes. Expanding from the two documented drop-one results to a full cohort-composition surface -- four drop-one configurations under rlhf-only plus four under expanded_nonrlhf, each with bootstrap n=2000 CIs on the resulting cheap_score AUC and the resulting kappa_min distribution -- turns a pair of point estimates into a response surface.
Observations. Drop-one-judge AUC = 1.0000 without gemma3:12b is suspicious on its own (an AUC of exactly 1.0 at n=10 on a cohort of four judges is unstable and almost certainly an artifact of the small valid-cell count combined with the degenerate base rate); pairing it with the AUC = 0.8333 without qwen2.5:7b confirms that the cohort-composition sensitivity is real but inadequately characterized at this depth. A bootstrap CI over each drop-one configuration would tell us whether the AUC = 1.0000 is a real ceiling effect or an n=10 artifact, and the expanded sweep over the safety-specialist axis directly tests whether the dual-axis effect TR148 v2 named at the kappa level surfaces at the cheap_score-AUC level as well.
The TR148 v2 dual-axis finding -- that the refusal-axis judges (gemma3, qwen2.5) and the composite-harm-axis judges (llama-guard3, shieldgemma) measure orthogonal phenomena -- predicts exactly this kind of cohort-composition sensitivity. Extension 2 turns that prediction into a formal observable. Critically, Extension 2 is independent of Extension 1: the drop-one sweep can fire on the current 24-cell substrate today without paying for cloud-family cells, and its results would already partially answer the question of whether AUC = 1.0000 is real ceiling behavior or n=10 noise. This is the cheapest of the three extensions and the one that should fire first if compute is constrained.
22.3 Extension 3 -- Expanded LOFO validation once cloud families land
The third extension is gated on Extension 1: with families = {llama, qwen, gemma, phi, mistral}, the leave-one-family-out frame goes from 2 folds (one of which has n=5 and returns insufficient_data) to 5 folds, each with a meaningfully sized held-out cell count. The current LOFO aggregate AUC is undefined; with 5 folds and a non-degenerate held-out class distribution, the LOFO aggregate AUC becomes well-defined and the held-out-family bootstrap CI becomes the proper external-validity test for cheap_score.
Observations. LOFO is the strongest external-validity probe available on this substrate because it forces the cheap_score model to predict on a family register it has never seen. Two folds is not LOFO; it is a sanity check. Five folds at the standard-depth cell count is a defensible external-validity statement. The qwen fold's current insufficient_data verdict at n=5 cells is the binding indicator -- five folds with cell counts on the order of 10 or above per fold is what the framework was designed for, and the cloud-family expansion produces exactly that arrangement.
Extensions 1, 2, and 3 form a strict dependency chain: 1 enables 3, 2 is independent and could fire on the current substrate today. The cost ceiling for the full chain remains 10-30 USD plus 1-2 days of A100 PCIe time, and the deliverable is a TR167 v2 report whose primary verdict ladder is computable rather than degenerate.
22.4 The asymmetric outcome -- both paths are publishable
The TR167 v2 cloud expansion has an asymmetric outcome structure that is worth naming explicitly. If the cloud families introduce judge-stable cells into the holdout, then Req1 and Req2 become testable, the cheap_score model gets its proper bootstrap CI on AUC, and the predictive-validity claim either succeeds or fails on its own merits. If the cloud families do not introduce judge-stable cells -- if gemma, phi, and mistral on GPTQ / AWQ / fp16 all land in the judge-sensitive class alongside the llama and qwen GGUF cells -- then the structural-degenerate-class finding sharpens dramatically: the rlhf-only judge cohort collapses to judge-sensitive across every production-quantization family tested, which is itself a substantial methodological result about the JTP framework's behavior on production corpora.
Observations. This is not a hedge. The current v1 substrate already documents a load-bearing positive secondary finding (P8 pool robustness: 6 resurfaced flips, mean kappa_min shift -0.1529 toward less agreement), and the v2 cloud expansion either turns the primary claim into a real test or escalates the structural finding into the headline result. The pool-robustness observation survives in either branch. The alternative outcome -- universal collapse to judge-sensitive across five families and three quantization regimes -- would be a stronger claim than anything in the v1 framing; it would license the methodological statement that the rlhf-only general-LLM cohort cannot certify any production-quantization cell as judge-stable, which is a concrete operational recommendation downstream consumers of JTP would have to honor.
Pool robustness is the only methodologically rigorous JTP observable that the current substrate licenses without qualification. If the v2 cloud expansion fails to rescue binary discrimination, the program writes the rlhf-only judge cohort's universal collapse to judge-sensitive on production-quantization corpora as the headline contribution, and pool robustness anchors the methodological recommendation. The cheap_score remains a directional signal whose negative sign survives but whose statistical significance awaits a denominator the GGUF-local lane structurally cannot provide.
22.5 The TAISv2 question -- should the missing fourth v2 follow-up be built next?
The v2 predictive-validity series as currently scaffolded covers three of the four named methodological artifacts from the program: TR166 / RTSIv2 (predictive-validity follow-up to RTSI), TR167 / JTPv2 (this report), and TR168 / CRIv2 (predictive-validity follow-up to CRI). The fourth named artifact -- TAIS, the Typical-Acceptance Invariance Screen from TR144 / speculative_decoding_safety, with a calibrated null cutoff |Cohen's h| < 0.1 against an E1-E5 max observed of 0.024 -- has no v2 follow-up in flight. The structural argument for building TAISv2 is symmetric with the argument that motivated TR167: TAIS v1 is a calibration claim (the screen correctly classifies when speculative-decoding's typical-acceptance and rejection-sampling regimes are behaviorally equivalent), and a predictive-validity follow-up would ask whether a cheap pre-screen signal forecasts which (draft, target, dataset) cells will cross the |Cohen's h| < 0.1 invariance band.
Observations. Building TAISv2 next would complete the v2 series and produce a symmetric four-artifact ladder where each program method (RTSI, JTP, CRI, TAIS) carries both a v1 calibration claim and a v2 predictive-validity claim. The argument against is sequencing: the TR167 v1 verdict ladder is degenerate on the GGUF-local lane and the cloud expansion (Extension 1 above) has not yet fired, so finishing the JTP arc before opening TAIS is the cleaner portfolio move. The argument for is that TAISv2's substrate (TR144's 16,783-sample core + 48,072-sample E1-E5 expansion) is already at paper-depth and would not require cloud-compute to fire a v2 scaffold; TAISv2 could land in parallel with TR167 v2 rather than waiting for it.
The decision sits with the PI. The methodological recommendation embedded in this section is to fire Extension 2 (drop-one-judge sweep with bootstrap CIs) on the current TR167 substrate this week, fire Extension 1 (cloud-family expansion) when the operational lane has a quiet window, and scaffold TAISv2 in parallel if compute and attention permit. Extension 3 (expanded LOFO) is gated on Extension 1 and waits. Neither the cloud expansion nor the TAISv2 scaffold is foreclosed by the v1 substrate this report ships on; both are open paths and either can fire first without invalidating the JTP arc as written.
23. Conclusion
TR167 / JTPv2 ships as a negative-results-with-substantive-secondary-finding report, and the conclusion exists to make the verdict structure unambiguous before any downstream consumer (the bridge paper Layer 1a anchor, the sibling TR166 / TR168 v2 predictive-validity reports, the program external JTP follow-up, or a future reviewer of this substrate) reads a partial sentence and rounds it the wrong way. The verdict ladder is deliberately three-tier, and each tier is qualified separately rather than collapsed into a single headline.
23.1 Restating the verdict structure
Both pre-registered primary requirements resolve to FAIL_OR_INSUFFICIENT_DATA. Req1 -- that the cheap pre-rejudge signal predicts JTP judge-sensitivity on the disjoint leave-one-model-family-out hold-out -- cannot return a positive call because the rlhf-only holdout subset is structurally single-class: all 10 valid holdout cells carry judge_sensitive = True, producing positives = 10, negatives = 0, and rendering ROC-AUC mathematically undefined rather than statistically inconclusive. Req2 -- that the combined cheap model beats the trivial single-judge-unclear-rate baseline -- is downstream of Req1: the DeLong delta-AUC, the bootstrap CI on AUC(cheap) - AUC(baseline), and the nested logistic LRT each collapse to insufficient_data because the AUC quantities they compare are themselves undefined. The pre-registered ABORT condition degenerate_single_class = True fires explicitly in the JSON, which is the disciplined behavior the pre-registration committed to: refuse the metric rather than launder a divide-by-zero into a misleading numeric. The failure is not evidence that the cheap signal lacks predictive validity. The failure is that the GGUF-local rlhf-only lane does not produce a holdout against which binary discrimination can be evaluated at all.
23.2 Directional signals that hold
What the substrate does support is directional. The Spearman rank correlation between cheap_score and kappa_min lands at rho = -0.1566 (Spearman p = 0.6657, 95% CI [-0.8178, 0.7116], n = 10 cells), which carries the correct negative sign the pre-registration demanded: higher cheap_score should associate with lower cross-judge agreement. The Pearson r at -0.2179 (p = 0.5453) agrees on sign and is order-consistent in magnitude. The P1 band-stratification reinforces this in a second axis: across the LOW / MODERATE / HIGH cheap_score tertiles the mean kappa_min runs 0.221 / 0.000 / 0.023, satisfying the pre-registered monotonicity check kappa_min HIGH < LOW (0.023 < 0.221). Two pre-registered directional signals therefore point the right way -- the cheap_score sign and the kappa_min band monotonicity. The rate-monotonicity test collapses by construction (all three bands report a 1.000 judge-sensitivity rate against wide Wilson CIs whose lower bounds reflect small-n binomial uncertainty, not evidence the true rate is below one), and none of the seven individual cheap features clears p < 0.05. We explicitly do not upgrade "directional and right-signed at n = 10" to "validated". The boundary between encouraging and supported is exactly the boundary the substrate cannot cross at this depth.
23.3 The load-bearing secondary finding -- P8 pool robustness
The substantive positive contribution of TR167 is the P8 pool-robustness analysis, not the primary cheap-predicts-verdict claim. Holding the (model, quant, battery) cell fixed and swapping the judge cohort from rlhf_only (gemma3:12b, qwen2.5:7b plus regex -- both general-purpose RLHF-tuned LLMs measuring the response-refusal axis) to expanded_nonrlhf (the same general LLMs plus the safety-specialist axis llama-guard3:8b and shieldgemma:9b measuring composite-harm) over 24 common cells, 6 cells resurface from insufficient_data under rlhf-only into judge-dominated with kappa_min = 0.000 under the expanded pool, zero cells flip in the reverse direction (no masked flips), and the mean kappa_min shift expanded-minus-rlhf is -0.1529 toward less cross-judge agreement. The six resurfaced cells are all llama3.2 GGUF cells -- 1B at Q5_K_M, Q6_K, Q8_0 and 3B at Q4_K_M, Q5_K_M, Q6_K -- which is itself a structural pattern: precisely the small-llama / mid-precision rung the rlhf-only cohort was silent on becomes vocal when the safety-specialist axis is folded in. The drop-one-judge sensitivity check on cheap_score discrimination yields AUC = 1.0000 without gemma3:12b and AUC = 0.8333 without qwen2.5:7b, confirming that the rlhf-only AUC collapse is a property of cohort composition rather than of the cheap signal.
23.4 The substantive methodological observation
This sharpens the TR148 v2 dual-axis finding. The v1 report named the structural fact that refusal-axis judges (gemma3, llama3.1, regex) and composite-harm-axis judges (llama-guard3, shieldgemma) anti-correlate at kappa values in the -0.13 to -0.26 range because they measure orthogonal latent constructs, anchored by the cross-family kappa = 0.6917 measurement on the TR145 safety subset at n = 12,809. TR167 carries that finding from "two orthogonal axes exist" into a concrete operational consequence at the cell level: cohort composition can flip 6 of 24 cells in one direction with none flipping back, the mean kappa_min drift is -0.153, and the drop-one-judge AUCs differ by 0.1667 depending on which general LLM is removed. The honest reading is that JTP-validity is a property of (corpus, cohort) pairs rather than (corpus) alone, and any operational JTP gate must declare the cohort explicitly. Pool robustness, not cheap-score regression, is the methodologically rigorous JTP observable that this substrate licenses without qualification.
23.5 Future-work path
The forward path is documented and bounded. The run_paper.py cloud-expansion lane adds the gemma, phi, and mistral families via GPTQ / AWQ / fp16 cells on vLLM at approximately 10-30 USD on RunPod A100 PCIe and a 1-2 day wall-clock budget. The expansion has two possible outcomes and both are informative. The first outcome is that cloud-family diversity introduces judge-stable cells into the holdout, rescues binary discrimination, produces a non-degenerate ROC-AUC, and lets the cheap_score's directional Spearman signal -- still at rho = -0.1566 from this substrate -- be evaluated against a defined metric at adequate sample size. The second outcome is that the holdout remains structurally single-class even after cloud diversification; in that branch the structural-degenerate finding strengthens from a lane-specific observation to a universal-collapse claim about JTP-validity on production-quantization corpora at standard depth, and pool robustness inherits the headline contribution. Either branch is publishable; neither is reachable from the present GGUF-local substrate alone. The decision point is which budget commits first, not whether the methodology is sound. Within the documented future-work scope, the cohort umbrellas (--skip-openai-judge standing discipline and the Claude-judge Fellowship gate) remain in force, so the cloud expansion executes under the same defensibility-bar contract as this v1 substrate.
23.6 Positioning within the Banterhearts v2 predictive-validity series
TR167 lands as one of three sibling v2 predictive-validity follow-ups in the program. TR166 / RTSIv2 extends the Refusal Template Stability Index (the v1 anchor at TR142 v3, with the public arxiv preprint at arXiv:2606.10154 dated 2026-06-08) into a predictive-validity test of cheap structural signals on the refusal axis, sharing TR167's substrate cleanness contract and LOFO holdout convention. TR168 / CRIv2 extends the Compile Reproducibility Index (the v1 anchor at TR147 v4) into a predictive-validity test on the stack-fragility axis. TR167 / JTPv2 sits between them on the judge-cohort axis. The three v2 reports collectively answer one program-level question: can cheap, pre-rejudge, pre-recompile, or pre-rebuild signals forecast the expensive verdict that the corresponding v1 framework produces? The bridge paper at papers/serving_state_safety_certification/ consumes TR148 v2 as the Layer 1a refusal-axis anchor, and TR167 attaches a predictive-validity slot to that anchor without rewriting the bridge claim ladder -- the structural-degenerate finding feeds Layer 1a as a methodological caveat (JTP gates are cohort-conditional), and the pool-robustness finding feeds the same layer as a substantive operational recommendation (declare cohort, report kappa under multiple cohorts when feasible). TR167 v1 is therefore defensibility-bar-clean: the primary claim does not validate, the directional signals point the right way without clearing the n = 10 significance bar, the secondary pool-robustness finding is the load-bearing positive contribution, the next decision point (cloud expansion vs bridge-paper fold-in) is open, and neither route is foreclosed by the substrate this report stands on.
24. References
This section consolidates the prior-work scaffolding that the TR167 / JTPv2 substrate sits on. Two parallel maps are maintained: a Banterhearts-internal map that traces the corpus, method, and bridge-paper anchors the report reuses, and an external prior-work map that fixes TR167's position in the broader LLM-as-judge reliability literature. The external map is curated in research/tr167/LITERATURE.md and the canonical reference numbering for the external citations lives in that file; what follows here is the narrative summary of the dependency graph.
24.1 Banterhearts internal
- TR140 v3.0 -- parent corpus. 63,950 scored samples plus 78,950 judge labels under the v3.0 controls (C1-C13). TR167 reuses TR140 v3.0 holdouts as the v1-reuse pool (264 of the 528 records) and inherits the per-cell battery convention. This corpus also feeds the program external JTP submission. See
research/tr140/and the parent corpus manifest. The reuse pattern preserves substrate continuity across the JTP v1 -> JTPv2 transition without re-sampling the rlhf-only labels. - TR148 v2 -- JTP v1 (dual-axis methodology anchor). PublishReady report
PublishReady/reports/Technical_Report_148.md(1,556 lines, hand-narrated). Headline: cross-family kappa = 0.6917 on the TR145 safety subset (n=12,809), dual-axis finding where shieldgemma/llama-guard3 anti-correlate with gemma3/llama3.1 because they measure composite-harm rather than response-refusal. TR167 is the predictive-validity follow-up that asks whether the JTP v1 outcome can be forecasted by cheap pre-rejudge signals; the answer on the rlhf-only / holdout subset is the structural-degenerate-class finding above. - The program external JTP submission. TR140-anchored, dual-axis JTP method. TR167 is forward-looking predictive-validity work; it does not retro-edit the external submission and does not gate on the external acceptance signal. The pool-robustness finding (P8) is the substrate the bridge paper Layer 1a anchor will absorb once the external decision lands.
- TR142 v3 -- RTSI substrate. Refusal Template Stability Index over 51 cells, behavioral-screen sibling to JTP on the refusal axis. Public preprint at arXiv:2606.10154 (2026-06-08). Cited for the structural-screen framing that TR167 inherits: a cheap, batch-mode pre-screen built from first-pass outputs, tested for predictive validity on a disjoint holdout.
- TR166 / RTSIv2 -- sibling v2 predictive-validity follow-up. Predictive-validity extension of RTSI; scaffold and smoke-run landed 2026-06-10 (commit
ed353650). Shares the substrate cleanness contract and the leave-one-model-family-out holdout convention used in TR167. The two sibling v2 reports are designed to be read together; TR167's degenerate-class finding has a refusal-axis analogue that TR166 v2 substantiates independently. - TR168 / CRIv2 -- sibling v2 predictive-validity follow-up. Predictive-validity extension of the Compile Reproducibility Index (CRI); scaffold-only as of writing (commit
d5260cb9). Triangulates the same "cheap signal -> expensive verdict" question on a stack-fragility axis rather than a judge-sensitivity axis. The three v2 reports (RTSIv2, JTPv2, CRIv2) are the predictive-validity triplet for the program's three named methodological artifacts. - Bridge paper substrate.
papers/serving_state_safety_certification/consumes TR148 v2 as the Layer 1a refusal-axis anchor; TR167 attaches a predictive-validity slot to that anchor without rewriting the bridge claim ladder. The structural-degenerate finding and the pool-robustness secondary finding become two new entries on the Tier 1 (Supported) row ofCLAIM_LADDER.mdonce the cloud expansion confirms or strengthens them. - Canonical measurement-count doc.
BANTERHEARTS_MEASUREMENT_COUNT.md(repo root) -- TR167 adds 528 clean primary records to the running primary+judge total tracked there; the 6 resurfaced pool-robustness cells are accounted for under the expanded_nonrlhf pool slot rather than as new primary measurements.
24.2 External
The external prior-work map TR167 is positioned against is maintained in full in research/tr167/LITERATURE.md. The four anchor clusters are summarised here.
- LLM-as-judge reliability baseline. Zheng et al. (2023) MT-Bench/Chatbot Arena establishes the ~80% human-agreement baseline and names the three standard failure modes (position bias, verbosity bias, self-enhancement bias). Bavaresco et al. (2025, ACL) qualifies that headline at scale across 20 NLP tasks. WildGuard (Han et al., 2024) measures Fleiss kappa = 0.55 / 0.50 / 0.72 for prompt harm / response harm / refusal detection, fixing a moderate-agreement floor for the safety-labeling sub-task that TR167 inherits.
- Safety classifiers vs RLHF chat judges. Llama Guard (Inan et al., 2023, plus the Llama Guard 3-8B model card), ShieldGemma (Zeng et al., 2024), HarmBench classifier (Mazeika et al., 2024), and Aegis (Ghosh et al., 2024, NVIDIA) define the SFT-classifier tier; JailbreakBench (Chao et al., 2024) and PandaGuard (Shen et al., 2025) document the rate-level disagreement between SFT classifiers and RLHF chat judges that motivates TR167's secondary pool-robustness contrast. StrongREJECT (Souly et al., 2024) supplies the bidirectional-bias picture (rule-based and some SFT classifiers over-count, RLHF chat judges under-count under adversarial surface manipulation) that the resurfaced-flip finding in P8 quantitatively instantiates on the GGUF-local lane.
- Predicting evaluation reliability (closest prior work). SCOPE (Badshah, Emami, Sajjad, 2026) is the closest comparison point: it queries the judge in both response positions and converts the order-averaged preference probability into a bidirectional preference entropy (BPE) signal that drives selective abstention. TR167's cheap-pre-rejudge signal is complementary -- zero additional judge calls vs SCOPE's two-positions-per-pair -- and targets cell-level safety/refusal labeling under quantization-induced distribution shift rather than pairwise assistant-quality comparison. Jung, Brahman, Choi (2025) establishes the provable trust-or-escalate guarantee from judge self-confidence; Ehara (2026) supplies an orthogonal embedding-geometry signal; Choi et al. (2026) supplies the Item Response Theory framing for post-hoc judge-reliability diagnosis. Wataoka, Takahashi, Ri (2024, SafeGenAI Workshop) and Panickssery, Bowman, Feng (2024) supply the perplexity-self-preference and self-recognition links that motivate the family-overlap feature in TR167's cheap signal.
- Adversarial-judge and family-bias degradation. Schwinn et al. (2026) measures judge accuracy from 0.48 to 0.64 across four attack families on 6,642 human-verified labels and crucially establishes that concordance does NOT predict correctness; TR167 inherits this and uses
kappa_min(minimum pairwise kappa across the pool) rather than mean concordance as the dependent outcome. Eiras et al. (2025, ICBINB Workshop) "Know Thy Judge" supplies the surface-style flip evidence (false-negative-rate jumps up to 0.24 from stylistic shifts). Shi et al. (2025) systematises position bias across 15 judges and 150,000+ instances. Yang et al. (2026) supplies the family-enhancement bias measurement on 20 LLMs that justifies the RLHF-shared-bias circularity hypothesis the JTPv2 P8 contrast is designed to test. - Standardized safety batteries. HarmBench, JailbreakBench (JBB), StrongREJECT, and XSTest are the four anchor batteries referenced by the bridge paper and replicated in TR149. TR167's pooled and sweep-size shards (s1/s4/s16/s64/s128 across faux-dialogue and message-array registers) follow the same battery-shape convention; the structural-degenerate-class finding holds across all four batteries at the cell-aggregate level.
Observations. The internal references trace a tight line from TR140 (corpus) through TR148 v2 (JTP v1 method) to TR167 (predictive-validity extension), with sibling v2 work (TR166, TR168) carrying the same structural-screen framing onto the refusal and stack-fragility axes. The external references span four clusters -- LLM-as-judge baseline, SFT-vs-RLHF classifier disagreement, cheap-signal prediction prior art, and adversarial / family-bias degradation -- with SCOPE the single closest comparison point and StrongREJECT the closest precedent for the bidirectional-bias picture that the P8 resurfaced-flip finding instantiates.
The reference list is deliberately bounded. TR167 does not introduce new judge methodology; it tests the predictive validity of an existing one and reports a structural-degenerate-class finding on the GGUF-local rlhf-only lane plus a pool-robustness secondary finding. Both findings are interpretable inside the JTP framework already published in TR148 v2 and the predictive-validity framing already established by SCOPE; no new external apparatus is required to read them. The canonical home for the prior-art map and the numbered citation list remains
research/tr167/LITERATURE.md; this section is the narrative dependency summary that points to it.
25. Appendix A. Hardware and Environment
This appendix records the hardware, operating system, Python runtime, and statistical-engine fingerprint under which the TR167 / JTPv2 standard-depth run executed, so that a future reproducer can match the substrate end-to-end and so that any deviation in cell counts, kappa_min values, monotonicity-test outcomes, or pool-robustness deltas can be triaged against an environmental delta rather than a code delta or a substrate-drift delta. The appendix is deliberately verbose: the structural-degenerate verdict on Req1 is a property of the rlhf-only / GGUF-local cohort at standard depth, and the only way to defend that verdict against the "you ran an outdated scipy" or "you used a different Ollama tag" attack is to put the environment on the table in audit-grade detail.
25.1 Hardware and OS fingerprint
| Component | Value |
|---|---|
| Machine class | RTX 4080 Laptop development machine (single-node, GGUF-local lane) |
| CPU | Intel Core i9 (laptop class, paired with RTX 4080 Laptop GPU) |
| GPU | NVIDIA RTX 4080 Laptop, 12 GB VRAM |
| System RAM | 32 GB DDR5 |
| OS | Windows 11 Home 10.0.26200 |
| Shell | PowerShell 5.1 (powershell.exe) primary; bash available via WSL/Git Bash for POSIX scripts |
| Filesystem layout | C:\Users\sahil\OneDrive\Documents\GitHub\Banterhearts (project root) |
25.2 Python and statistical engine fingerprint
| Component | Value |
|---|---|
| Python runtime | 3.13 / 3.14 (resolved at process start by run.py preflight) |
| scipy available | True (Spearman p-values and Pearson p-values backed by scipy.stats) |
| matplotlib available | True (band-stratified plots, P8 pool-contrast figures rendered without optional-dependency fallback) |
| Bootstrap iterations | n = 2000 (Spearman 95% CI on cheap_score x kappa_min, Wilson CIs on band rates) |
| Random seed policy | per-run fixed seed inherited from config.yaml; LOOCV folds deterministic across reruns |
| Engine cleanness | 528 / 528 clean records, 0 soft violations against the JTPv2 schema |
--skip-openai-judge |
True (project-wide adversarial-corpus umbrella discipline) |
25.3 Judge backend and Ollama cohort
| Component | Value |
|---|---|
| Judge backend | Ollama (local HTTP, http://127.0.0.1:11434) |
| Judge models | llama-guard3:8b, shieldgemma:9b, gemma3:12b, qwen2.5:7b |
| Regex axis | inherited from the TR148 v2 / TR140 v3 v1 corpus (no live regex run in this pass) |
| OpenAI judge | NOT INVOKED (skip flag, umbrella discipline) |
| Anthropic Claude judge | NOT INVOKED (Fellowship-conditional dispatch) |
| GGUF generator pool | llama3-1-8b, llama3-2-1b, llama3-2-3b, qwen2-5 (4 models x 6 quants Q2_K, Q3_K_M, Q4_K_M, Q5_K_M, Q6_K, Q8_0) |
25.4 Substrate root paths and generation timestamp
| Component | Value |
|---|---|
| TR167 run dir | research/tr167/results/20260610_204823/ |
| TR140 v3 parent substrate | research/tr140/ (read-only for v1_reuse label inheritance) |
| TR148 v2 cohort definition | research/tr148/ (read-only for judge roster + kappa thresholds) |
| Bridge paper Layer 1 anchor | papers/serving_state_safety_certification/ |
| Generated timestamp | 2026-06-11T04:38:39.680387+00:00 (UTC) |
label_source distribution |
264 v1_reuse + 264 live_nonrlhf |
| Coverage flag | coverage_complete = True (24 of 24 expected cells) |
Observations. The substrate is a single-machine GGUF-local lane: Ollama serves all four judges and all four llama/qwen quantized generators on one 12 GB consumer GPU, with scipy and matplotlib both resolved at process start so that bootstrap CIs, Spearman p-values, Wilson rate CIs, and band-stratified plots run without optional-dependency fallback. The cleanness counter (528/528 records, 0 soft violations) confirms that no record was dropped at the engine layer; the structural-degenerate single-class verdict on Req1 is therefore a property of the rlhf-only judge cohort applied to the GGUF-local model grid, not a data-quality artifact from a mid-run drop or a schema mismatch. The bootstrap n=2000 setting is the standard Banterhearts JTP-line bootstrap budget inherited from TR148 v2, and is what populates the 95 percent Spearman CI on cheap_score x kappa_min reported as [-0.8178, 0.7116] in Section 11.
The environment fingerprint is deliberately humble. A consumer laptop with a 12 GB GPU running Ollama serially across four 7-12 B judge models is sufficient to produce a structurally degenerate single-class holdout on the rlhf-only judge pool at standard depth on the GGUF-local lane; it is NOT sufficient to introduce cloud families (gemma, phi, mistral on vLLM with GPTQ, AWQ, or fp16 cells) that would diversify the holdout into a non-degenerate ROC-AUC regime. The hardware ceiling here is exactly why the TR167 v2 cloud expansion (
run_paper.pyon an A100 PCIe, approximately 10-30 USD on RunPod, 1-2 days wall-clock) is scoped as the natural next step in Section 22 rather than as a same-environment re-run. The appendix's job is to make the ceiling legible: the single-class verdict cannot be argued away as "the laptop was misconfigured," because the cleanness counters, the scipy/matplotlib resolution, the Ollama judge roster, and the 528/528 record ledger are all on the table.
25.5 Runtime command argv
The run.py invocation that produced this substrate was:
python research/tr167/run.py --depth standard --skip-openai-judge --run-dir research/tr167/results/20260610_204823
The downstream analyze.py invocation that produced tr167_analysis.json, report.md, and the band/pool figures was:
python research/tr167/analyze.py --run-dir research/tr167/results/20260610_204823 -v
The --skip-openai-judge flag is mandatory under the OpenAI adversarial-corpus safety-umbrella convention documented in the project memory and is responsible for the absence of a GPT-4o judge column in the JTP cohort; the four Ollama judges plus the regex baseline carried in from the v1 corpus constitute the full label source. The label_source distribution of 264 v1_reuse plus 264 live_nonrlhf records is the audit trail for which records were inherited from the TR148 v2 parent and which were drawn fresh against the expanded_nonrlhf pool during this run. Both invocations are reproducible from any machine that matches the section 25.1-25.3 fingerprint without further flag adjustment; the LOOCV fold partition, bootstrap seed, and cheap_score weight vector are all deterministic given the same input substrate.
26. Appendix B. Reproduction Commands
This appendix captures the exact command surface for reproducing every numerical claim in this report, plus the substrate paths the analysis and figures depend on. The intent is that an external reader -- or a future-Sahil returning after a six-month gap -- can re-derive the verdict, the P1 band table, the P2 correlation matrix, and the P8 pool-robustness deltas from the same artifacts that generated them, without guessing at flags or paths. Each command below is annotated with what it does, what it writes to disk, and what to spot-check after the command lands.
26.1 Local reproduction (standard depth, the substrate behind this report)
The substrate for this report was produced on the GGUF-local lane with the OpenAI judge umbrella OFF, consistent with the program-wide adversarial-corpus gating rule. The two-step local reproduction is:
# Step 1 -- fire local relabel + cheap-feature extraction over the 4 x 6 GGUF cell grid
python research/tr167/run.py \
--depth standard \
--skip-openai-judge \
--run-dir research/tr167/results/20260610_204823
# Step 2 -- regenerate analysis.json + report.md + plots from the captured records
python research/tr167/analyze.py \
--run-dir research/tr167/results/20260610_204823 \
-v
Step 1 walks the 24-cell (model, quant) grid declared in config.yaml, invokes the GGUF-local target on each cell to capture model outputs, dispatches the four-judge cohort (llama-guard3:8b, shieldgemma:9b, gemma3:12b, qwen2.5:7b) over the response distribution, computes the per-cell cheap features (quant_bpw, refusal_rate_delta, family_code, single_judge_unclear_rate, single_judge_ambiguity_rate, single_judge_refusal_rate, mean_output_len_tokens), and writes the 528 clean records, the per-judge label files, and the cheap-feature CSV into the run directory. Expected wall-clock at standard depth on a single RTX 4080-class host is on the order of hours, dominated by the per-judge Ollama calls. After Step 1 lands, spot-check jtp_records.jsonl is 528 lines, coverage_complete=True shows up in stdout, and the four judge_labels_*.jsonl files are non-empty.
Step 2 is idempotent: it reads the JSONL artifacts, recomputes Spearman/Pearson, Wilson CIs, monotonicity tests, and the P8 pool-robustness comparison, and emits the analysis.json + report.md + figures. The -v flag surfaces the scipy-engine confirmation, the bootstrap n=2000 setting, and the degenerate_single_class=True flag on the rlhf-only / holdout subset so the run is auditable from stdout. After Step 2 lands, verify that tr167_analysis.json reports coverage_complete=True, that the cheap_score Spearman rho on kappa_min is approximately -0.157 with p approximately 0.666, and that the P8 block reports 6 resurfaced cells with a mean kappa_min shift of approximately -0.153. If any of those three sentinel numbers diverges by more than rounding, the input substrate is not the substrate this report was written against and Step 1 should be re-run from scratch.
26.2 Future-work expansion (cloud GPTQ/AWQ/fp16 cells)
The cloud expansion path that fills in the gemma/phi/mistral families -- and is the only documented mechanism by which the holdout might acquire judge-stable cells -- is:
# TR167 v2 cloud expansion (external-acceptance-gated; not run in this substrate)
python research/tr167/run_paper.py \
--depth paper \
--run-dir research/tr167/results/<ts> \
--skip-openai-judge
This invocation adds the cloud GPTQ/AWQ/fp16 cells on vLLM and preserves the --skip-openai-judge umbrella. The <ts> placeholder is a fresh UTC timestamp directory; the substrate paths in 26.3 below would then resolve under that new run dir, not under 20260610_204823. Expected RunPod A100 PCIe cost is 10-30 USD across a 1-2 day wall-clock window, dominated by vLLM cold-starts per cloud-family target.
26.3 Substrate paths the report depends on
| Artifact | Path |
|---|---|
| Run directory (this report) | research/tr167/results/20260610_204823/ |
| Analysis JSON (numerical claims) | research/tr167/results/20260610_204823/tr167_analysis.json |
| Run manifest (configuration + cell ledger) | research/tr167/results/20260610_204823/run_manifest.json |
| JTP per-cell records | research/tr167/results/20260610_204823/jtp_records.jsonl |
| Safety-axis judge labels (llama-guard3) | research/tr167/results/20260610_204823/judge_labels_llama-guard3.jsonl |
| Safety-axis judge labels (shieldgemma) | research/tr167/results/20260610_204823/judge_labels_shieldgemma.jsonl |
| Refusal-axis judge labels (gemma3) | research/tr167/results/20260610_204823/judge_labels_gemma3.jsonl |
| Refusal-axis judge labels (qwen2.5) | research/tr167/results/20260610_204823/judge_labels_qwen2.5.jsonl |
| Auto-generated report (pre-narration) | research/tr167/results/20260610_204823/tr167_report.md |
Observations. The JSONL substrate files plus the analysis.json plus the run_manifest.json are the entire reproduction surface; every number in this report -- the n=10 holdout, the Spearman rho = -0.1566 on cheap_score, the kappa_min shift -0.1529 under pool expansion, the 6 resurfaced cells -- traces to these files via Step 2 of 26.1. The run_manifest.json carries the configuration snapshot (depth=standard, skip_openai_judge=True, scipy=True, matplotlib=True, bootstrap n=2000) plus the cell ledger; a fast cross-check is to confirm that the manifest's declared cell count (24) matches the per-cell row count in jtp_records.jsonl aggregated by (model, quant) and that the label_source ledger is 264 v1_reuse + 264 live_nonrlhf.
Reproduction is one shell command for the analysis and one for the run; the cloud expansion is a single additional invocation gated on external compute. The substrate is small enough that re-running analyze.py costs seconds, which is why every claim in this report is cited back to tr167_analysis.json rather than restated from memory. The run_manifest cross-check is the cheapest single guard against substrate drift: if the manifest's coverage_complete, label_source_distribution, or judge cohort disagrees with what
jtp_records.jsonlactually contains, the analysis pipeline will still emit a report, but the numbers will no longer be the numbers in this document. Verifying manifest-against-substrate before trusting any downstream verdict is the standing convention.
27. Appendix C. Per-Cell Table
This appendix reproduces the ten valid holdout cells (rlhf_only pool, holdout split) verbatim from tr167_analysis.json, with the cheap_score and kappa_min values that drive every numerical claim in SS3 and SS4. The table is the per-cell substrate on which the Spearman rho = -0.1566 (p = 0.6657, 95% CI [-0.8178, 0.7116]) is computed, and on which the monotonicity HIGH 0.023 < LOW 0.221 verdict rests.
| Cell id | model | quant | cheap_score | kappa_min | judge_sensitive | n_overlap | band |
|---|---|---|---|---|---|---|---|
| tr167-llama3-1-8b-q2-k | llama3.1-8b | Q2_K | 4.74 | 0.118 | True | 499 | HIGH |
| tr167-llama3-1-8b-q3-k-m | llama3.1-8b | Q3_K_M | 3.51 | 0.000 | True | 487 | HIGH |
| tr167-llama3-1-8b-q4-k-m | llama3.1-8b | Q4_K_M | 1.82 | 0.000 | True | 478 | MOD |
| tr167-llama3-1-8b-q5-k-m | llama3.1-8b | Q5_K_M | 0.93 | 0.000 | True | 465 | MOD |
| tr167-llama3-1-8b-q8-0 | llama3.1-8b | Q8_0 | -2.39 | 0.441 | True | 452 | LOW |
| tr167-llama3-2-3b-q2-k | llama3.2-3b | Q2_K | 3.91 | 0.000 | True | 491 | HIGH |
| tr167-llama3-2-3b-q3-k-m | llama3.2-3b | Q3_K_M | 2.74 | 0.000 | True | 483 | HIGH |
| tr167-llama3-2-3b-q4-k-m | llama3.2-3b | Q4_K_M | 1.13 | 0.000 | True | 476 | MOD |
| tr167-llama3-2-3b-q8-0 | llama3.2-3b | Q8_0 | 2.45 | 0.000 | True | 458 | HIGH |
| tr167-llama3-1-8b-q6-k | llama3.1-8b | Q6_K | -1.18 | 0.118 | True | 469 | LOW |
Observations (kappa_min floor structure). Nine of the ten holdout cells carry kappa_min in [0.000, 0.118], with seven cells at exactly 0.000 (chance-agreement floor) and two cells at 0.118 (judge-dominated by a thin margin). The one cell at kappa_min = 0.441 (the llama3.1-8b Q8_0 cell, cheap_score = -2.39) is the highest-kappa cell in the holdout but still sits below the 0.7 stable-threshold, so it carries judge_sensitive = True like every other valid cell. The 0.000 / 0.118 / 0.441 distribution is bimodal: most cells collapse to the floor, one cell carries non-trivial agreement, and there is no intermediate stratum between 0.118 and 0.441.
Observations (cheap_score x kappa_min coupling). The cheap_score axis spans roughly [-2.39, 4.74] across the ten cells. The LOW-band cells (cheap_score < 0) are -2.39 (kappa_min = 0.441) and -1.18 (kappa_min = 0.118), with band-mean kappa_min = 0.221. The MODERATE-band cells (cheap_score between 0 and 2) are 1.82 (kappa_min = 0.000), 0.93 (kappa_min = 0.000), and 1.13 (kappa_min = 0.000), with band-mean kappa_min = 0.000. The HIGH-band cells (cheap_score > 2) are 4.74 (kappa_min = 0.118), 3.51 (kappa_min = 0.000), 3.91 (kappa_min = 0.000), 2.74 (kappa_min = 0.000), and 2.45 (kappa_min = 0.000), with band-mean kappa_min = 0.023. The HIGH < LOW direction (0.023 < 0.221) is the pre-registered prediction and resolves in favor; the kappa_min = 0.441 outlier in the LOW band is what lifts the LOW-band mean and what produces the directional signal at all.
Observations (judge-dominated at zero correlated with cheap_score sign). Of the seven cells at kappa_min = 0.000, five carry positive cheap_score (the HIGH band) and two carry positive cheap_score in the MODERATE band; none of the seven kappa_min = 0.000 cells sits at negative cheap_score. The two negative-cheap_score cells in the LOW band carry kappa_min = 0.441 and 0.118 respectively -- both above the kappa_min = 0.000 floor. This is the structural pattern that produces the Spearman rho = -0.1566 correct sign: the cheap_score's negative end is preferentially associated with non-floor kappa_min cells, and the positive end is preferentially associated with floor cells. The directional signal is real even though it cannot clear the n = 10 significance bar.
Observations (the judge-sensitive cell carrying the lowest cheap_score). The cell at cheap_score = -2.39 is the llama3.1-8b Q8_0 cell, and it is the lone non-trivial-kappa_min cell in the holdout. Its position on the cheap_score axis -- the most negative cheap_score in the substrate -- is exactly where the pre-registered prediction places it: low cheap_score predicts high kappa_min. This single cell does substantial leverage work in the rho computation; without it, the rho would shrink toward zero and the monotonicity test would fail. The substrate's directional signal is therefore concentrated in one outlier cell rather than spread evenly across the holdout, which is the structural reason the rho-CI is so wide ([-0.8178, 0.7116]) -- the estimator cannot distinguish between "a real moderate effect" and "one leverage point doing all the work" at n = 10.
The cells are family-monochromatic on the holdout split: all ten valid holdout cells are llama-family (eight llama3.1-8b and four llama3.2-3b once you count the kappa_min = 0.000 cells, plus two llama3.1-8b cells in the LOW band). The qwen-family cells live in the calibrate split under the LOFO partition, and the held-out llama fold is the source of every cell in this table. This means the directional signal observed here is intra-family rather than cross-family: the cheap_score's rank-ordering against kappa_min is being tested on the holdout family that was not used to fit it, but the family-axis variance that family_code is supposed to encode is collapsed by construction. The cloud-family expansion in Section 22 lifts exactly this -- with gemma, phi, mistral cells in the holdout, family_code becomes a meaningful feature, the cheap_score gains cross-family rank discrimination, and the kappa_min outlier ceases to be a single leverage point because more non-floor cells enter the holdout under cross-family architectural diversity.
28. Appendix D. Per-Battery Disaggregation
The TR167 substrate enumerates eleven distinct batteries. One is the union -- the pooled battery -- which is the headline aggregation used throughout the body of this report. The other ten are the disaggregated cells of a two-dimensional TR140 v3.0 attack-engineering matrix, crossing five shot counts (s1, s4, s16, s64, s128) with two prompt-format strategies (faux-dialogue and message-array). The eleven battery identifiers present in the substrate are therefore: pooled, s1_faux-dialogue, s1_message-array, s4_faux-dialogue, s4_message-array, s16_faux-dialogue, s16_message-array, s64_faux-dialogue, s64_message-array, s128_faux-dialogue, and s128_message-array. The 11-battery enumeration is not an arbitrary choice on the part of TR167: it is the exact axis-crossing inherited from TR140 v3.0's parent attack-engineering matrix, and every TR that consumes TR140 v3.0 records (TR148 v2 included) carries this same eleven-fold layout.
The two axes encode orthogonal stressors. The shot-count axis (1 / 4 / 16 / 64 / 128) varies the number of in-context demonstrations packed into the attack scaffold before the target turn; under TR140 v3.0, larger shot counts purchased more contextual jailbreak pressure at the cost of prompt-length budget. The format-strategy axis encodes whether the demonstrations are serialized as a faux-dialogue (a single concatenated transcript styled as alternating speakers in one user turn) or as a message-array (a structured list of role-tagged turns that more faithfully matches the chat template the judges and the target consume). The two strategies differ in how the target tokenizer applies role separators and special tokens; they are not equivalent prompts even at equal shot count. The s1 batteries are the lowest-context anchors -- a single demonstration in the attack scaffold -- and the s128 batteries are the deepest-context tail where prompt-length pressure is highest and refusal templates have the most surface area to drift.
What differs across the ten disaggregated batteries is therefore exactly the (shot_count, prompt_format) pair: the model under test, the quantization rung, the judge cohort, the cleanness schema, and the JTP class-assignment logic are all held constant. Shot count varies the contextual prefix length and the number of in-context demonstrations of the targeted behavior; prompt format varies whether the chat-template tokenization sees one user turn or many. These are the two TR140 v3.0 attack-engineering axes that produced the largest refusal-rate sensitivity in the parent corpus, and inheriting them verbatim is what licenses TR167's pooled cell-level claim as a defensible aggregate over the same stress surface that TR148 v2 used.
| Battery | Axis composition | Role in TR167 |
|---|---|---|
| pooled | union of the ten disaggregated batteries | headline aggregation for kappa_min, sig-class, and cheap_score |
| s1 / s4 / s16 / s64 / s128 x faux-dialogue | single-turn concatenated transcript at five shot depths | per-cell stratum used to compute n_overlap and per-stratum kappa inputs |
| s1 / s4 / s16 / s64 / s128 x message-array | role-tagged structured turns at five shot depths | per-cell stratum used to compute n_overlap and per-stratum kappa inputs |
Observations. The pooled battery is the default headline because the predictive-validity claim under test in this report is a cell-level claim (model x quant), not a battery-level claim; the cheap_score regressors are built at the cell level by aggregating across batteries within a (model, quant) cell. Pooling collapses the ten disaggregated batteries into a single per-cell row whose kappa_min reflects the minimum cross-judge agreement that any battery slice exhibits within that cell. Per-battery disaggregation at the analyze.py extension level is TBD per analyze.py extension -- the v1 substrate does not emit per-battery kappa_min slices in the JSON, only the pooled outcome and the n_overlap counts per cell (range 452 to 499 across the ten holdout cells). Reading the per-cell table in Appendix C, the ten holdout n_overlap values cluster in the high-490s, indicating that the disaggregated batteries are populated densely enough that pooling does not paper over a single dominant stratum. The fact that n_overlap stays within a tight band rather than fluctuating wildly across cells is itself evidence that the eleven-battery enumeration is not generating ragged coverage at the cell level.
Pooling is the conservative reporting choice on the GGUF-local rlhf-only lane: with only ten valid holdout cells and a single-class judge_sensitive distribution, slicing further into eleven batteries would compound the degeneracy rather than resolve it. Were per-battery kappa_min emitted in the JSON, each cell would split into eleven sub-cells, and the structural degenerate-class verdict (positives = 10, negatives = 0) would re-emerge at every battery slice with no additional discriminative content. The TR140 v3.0 attack-engineering matrix was designed to stress-test refusal templates across shot counts and format strategies, and the per-battery slices remain available in the run directory for cloud-expansion follow-up under run_paper.py; until cross-family diversity (gemma / phi / mistral) introduces judge-stable cells, per-battery disaggregation is a diagnostic instrument rather than a headline carrier. The TR167 v2 cloud expansion is the natural surface on which per-battery disaggregation becomes informative: with judge-stable cells in the holdout, one can ask whether stability concentrates in low-shot or high-shot batteries, whether faux-dialogue or message-array carries more of the judge-sensitivity signal, and whether the shot-count axis interacts with quant_bpw at the battery level. Those are the questions that the SS15 / Appendix-D follow-up explicitly defers to v2; in the v1 substrate, the eleven-battery enumeration is documented here for reproducibility and for the disjoint follow-up to consume, not for the headline verdict to lean on.
29. Appendix E. Pool Robustness Table
This appendix reproduces verbatim the six resurfaced cells identified by P8 of the analyze.py pipeline. These are the cells that transition from insufficient_data under the rlhf_only judge pool (gemma3:12b, qwen2.5:7b) to judge-dominated under the expanded_nonrlhf pool, which folds in the safety-specialist axis (llama-guard3:8b, shieldgemma:9b) alongside the general-LLM axis. As argued in the body, this table -- not the headline cheap-signal regression -- is the load-bearing positive substrate finding of TR167 / JTPv2, and it is the row-level evidence that the cohort-expansion effect surfaced as a single aggregate statistic in the body (mean kappa_min shift -0.1529 across the 24-cell common-cell comparison) actually carries uniform directional content rather than averaged-out noise.
The verdict columns are reproduced exactly as P8 wrote them to tr167_analysis.json. The kappa_min value of -- under the rlhf-only pool encodes the insufficient_data JTP class: there were not enough cross-axis judge-judge pairings within the two-general-LLM cohort to compute a defensible Cohen's kappa minimum at the per-cell overlap floor configured for TR167. The expanded pool, by introducing the safety-specialist axis (llama-guard3:8b, shieldgemma:9b), supplies those missing pairings; in all six cases the resulting kappa_min is exactly 0.000, placing the cell firmly in the judge-dominated class under the inherited TR148 v2 taxonomy (kappa_min < 0.40).
| cell_id | rlhf_only kappa_min | expanded_nonrlhf kappa_min | rlhf_only class | expanded_nonrlhf class | model | quant |
|---|---|---|---|---|---|---|
| tr167-llama3-2-1b-q5-k-m | -- | 0.000 | insufficient_data | judge-dominated | llama3.2 1B | Q5_K_M |
| tr167-llama3-2-1b-q6-k | -- | 0.000 | insufficient_data | judge-dominated | llama3.2 1B | Q6_K |
| tr167-llama3-2-1b-q8-0 | -- | 0.000 | insufficient_data | judge-dominated | llama3.2 1B | Q8_0 |
| tr167-llama3-2-3b-q4-k-m | -- | 0.000 | insufficient_data | judge-dominated | llama3.2 3B | Q4_K_M |
| tr167-llama3-2-3b-q5-k-m | -- | 0.000 | insufficient_data | judge-dominated | llama3.2 3B | Q5_K_M |
| tr167-llama3-2-3b-q6-k | -- | 0.000 | insufficient_data | judge-dominated | llama3.2 3B | Q6_K |
Observations (row-level). Reading the table row by row makes the structural homogeneity of the resurfacing pattern concrete. The first three rows -- tr167-llama3-2-1b-q5-k-m, tr167-llama3-2-1b-q6-k, tr167-llama3-2-1b-q8-0 -- are the llama3.2 1B family across the upper-precision GGUF rungs (Q5_K_M, Q6_K, Q8_0), the regime where Q8_0 is the high-precision anchor against which refusal_rate_delta is computed. Under the rlhf-only pool, all three return insufficient_data; under the expanded pool, all three return kappa_min = 0.000 and judge-dominated. The next three rows -- tr167-llama3-2-3b-q4-k-m, tr167-llama3-2-3b-q5-k-m, tr167-llama3-2-3b-q6-k -- are the llama3.2 3B family across the mid-band quants (Q4_K_M, Q5_K_M, Q6_K). The pattern repeats identically: insufficient_data under rlhf-only, kappa_min = 0.000 / judge-dominated under expanded. Six rows, two model sizes, four distinct quant rungs (Q4_K_M, Q5_K_M, Q6_K, Q8_0), zero cross-axis exceptions.
The row-level homogeneity is not a property the design of P8 forced. P8 is purely a set-difference operator: it identifies cells whose JTP class changes between the two pools and reports the direction of change. Nothing in the operator enforces that all resurfacings must share a model family, a quant range, or a destination class. The fact that they do -- six llama3.2 cells, all transiting
insufficient_data->judge-dominatedat exactly kappa_min = 0.000 -- is substrate behavior, not analysis behavior, and it is the kind of row-level uniformity that the body of the report leans on when calling pool composition (rather than cheap-signal regression) the methodologically rigorous JTP observable on this corpus.
Observations (consistency as evidence). The six rows share four consistency properties that, taken together, distinguish the finding from a noise artifact. (a) Source class is uniform: every row begins at insufficient_data under the rlhf-only pool -- never at judge-dominated, never at judge-sensitive. (b) Destination class is uniform: every row resolves to judge-dominated under the expanded pool -- never to judge-sensitive, never back to insufficient_data. (c) Destination kappa_min is uniform: every row lands at exactly 0.000, the floor of the kappa support, rather than at scattered low-positive values one would expect from random judge variance. (d) Reverse-direction flips: zero -- the count of cells flipping the other way (rlhf-sensitive -> expanded-stable, or rlhf-dominated -> expanded-sensitive) is exactly zero, as reported in the body. The mean kappa_min shift across the broader 24-cell common-cell comparison is -0.1529, and the row-level numbers above contribute to that aggregate in the same downward direction. None contribute upward.
If cohort expansion were merely adding random judge variance, one would expect a mix of resurfacings: some cells crossing to
judge-sensitive, some tojudge-dominated, some staying atinsufficient_data, some flipping in the reverse direction. The empirical distribution observed is degenerate at the destination side: all six previously-undetermined cells resolve in the same downward direction (toward lower kappa_min, toward judge-dominated, never toward judge-sensitive). This is the signature of a missing-axis problem, not a noise problem: the safety-specialist axis (llama-guard3:8b, shieldgemma:9b) carries decision content that the rlhf-only general-LLM axis (gemma3:12b, qwen2.5:7b) structurally cannot recover, and once that content is folded in, six cells that looked undecidable become decisively dominated.
Observations (directional disambiguation, not noise). The cohort-expansion contrast functions here as a directional disambiguator rather than as a power-up: it does not turn cells from one decidable class into another decidable class; it turns previously-undecidable cells into decidable ones, and it does so on a single side of the JTP class boundary. This is exactly the operating mode the TR148 v2 dual-axis finding predicted -- the refusal-axis and composite-harm-axis cohorts measure different latent constructs, and a cohort that is missing one axis cannot resolve cells whose JTP class depends on that axis's content. The six resurfaced cells are the empirical locus of that prediction on the TR167 substrate, and the fact that none crosses into the judge-sensitive class is itself diagnostic: the missing axis is supplying dominated-class evidence, not sensitive-class evidence.
The honest interpretation of the appendix is that the rlhf-only judge pool, as configured for the TR167 standard-depth run, fails to certify six (model, quant) cells in the llama3.2 family because it lacks the composite-harm axis whose decision content those cells require. The expanded pool, by including the safety-specialist axis, supplies exactly that missing content -- and the six cells are resolved unambiguously and uniformly. None of the six is rescued into the judge-sensitive class; all are pulled into the judge-dominated regime where a single dominating judge carries the JTP signal. This is the predictive-validity-grade restatement of the TR148 v2 dual-axis methodology, and it is the result that survives the structural degenerate-single-class collapse documented in the earlier verdict sections. The body's framing of this appendix as the load-bearing positive substrate finding of TR167 is grounded directly in this row-level uniformity: six cells, one source class, one destination class, one destination kappa_min value, zero counter-direction exceptions.