Technical Report 165: Python GIL Ablation for Direct PyTorch Inference Boundaries

Matched-Pair Test of TR164 V1's GIL-Attribution Hypothesis Under Python 3.14t Free-Threaded -- Verdict: H2_partial

Status. TR165 substrate is complete and runtime-clean. The original pre-registered experiment ran 3,744 metric rows across 48 planned cells (3 models x 4 workloads x 2 phases x 2 N-concurrency levels) on the same RTX 4080 Laptop 12 GB hardware as the TR164 V1 baseline at research/tr164/results/20260531_120428_552237/, changing the Python runtime from a stock GIL build to Python 3.14.5 free-threaded. That historical transition resolves to H2_partial: H0 (no improvement; GIL irrelevant) is rejected, H1 (full GIL attribution; all N=2 combinations and all hang shapes recover) is not supported, and H2 (partial GIL attribution) is supported. The historical transition reports 19 of 24 N=2 combinations clearing the 0.65 parallel-efficiency floor, mean N=2 efficiency uplift +0.1791, median +0.1715, and 2 of 6 V1 deterministic-hang shapes resolved under no-GIL. A later same-build control closes the main version-confound at N=2: using the same CPython 3.14.5 free-threaded Docker image and toggling only PYTHON_GIL=1 versus PYTHON_GIL=0, both 144-cell arms completed with 2,592 rows per arm and 0 failed cells; no-GIL improved all 24/24 paired N=2 combinations by mean +0.1325 (95% CI [0.1218, 0.1434]) and median +0.1274. Neither same-session arm cleared the absolute 0.65 floor in any of the 24 N=2 pairs on this Docker stack, so the updated substrate-grounded read is narrower and stronger: the GIL is empirically a contributor, but removing it is not sufficient to restore the in-process direct-PyTorch boundary. The historical analysis JSON sits at research/tr165/results/20260607_174748_273070/analysis.json; the same-session control summary sits at research/tr165/same_session_gil_control/same_session_gil_control_summary.json.

1. Abstract

Technical Report 164 V1 concluded that direct PyTorch inference exhibits a sharp parallel-efficiency breakdown at N=2 concurrency on a single-GPU RTX 4080 Laptop, and attributed the breakdown provisionally to Python Global Interpreter Lock (GIL) contention between host threads dispatching CUDA work. Technical Report 165 tests that attribution in two layers. The original pre-registered layer holds hardware, backend, models, workloads, phases, and repetitions fixed against V1 while replacing the stock CPython GIL runtime with Python 3.14.5 free-threaded; it produces 3,744 metric rows across 48 cells plus the six V1 hang-shape attempts. That historical transition resolves to H2_partial: 19/24 N=2 combinations clear the 0.65 efficiency floor, mean N=2 parallel-efficiency delta is +0.1791, median is +0.1715, and 2/6 deterministic-hang shapes resolve. The later control layer uses the same CPython 3.14.5 free-threaded Docker image for both arms and toggles only PYTHON_GIL=1 versus PYTHON_GIL=0; both 144-cell arms complete with 2,592 rows and 0 failed cells. In that same-build control, no-GIL improves all 24/24 paired N=2 combinations by mean +0.1325 (95% CI [0.1218, 0.1434]) and median +0.1274, but neither arm clears the 0.65 floor in any pair. The substrate-grounded reading is therefore calibrated: the GIL is a real contributor to the direct-PyTorch boundary, but it is not the sole mechanism and removing it is not sufficient to restore the operating regime.

2. Table of Contents

Abstract
Table of Contents
Executive Summary
Introduction and Research Motivation
Pre-Registered Research Hypotheses (H0 / H1 / H2)
Methodology
Substrate Inheritance -- TR164 V1 Baseline
SS1 -- Preflight: Python 3.14.5 Free-Threaded Build Verification
SS2 -- The 48-Cell Matched-Pair Plan
SS3 -- Cell-by-Cell Matched-Pair Substrate
SS4 -- The N=2 Parallel-Efficiency Delta (+17.91% mean, +17.15% median)
SS5 -- Hang Resolution Ledger: 2 of 6 Cell Shapes Cleared
SS6 -- ROC-AUC Discrimination Analysis
SS7 -- Falsification Verdict: H2_partial and Same-Session GIL Control
SS8 -- Mechanism Interpretation
SS9 -- Cross-Reference to TR164 V1 and TR164 V2
SS10 -- Cross-TR Position
SS11 -- Forbidden Claims
SS12 -- Limitations
SS13 -- Future Work
Conclusion
References
Appendix A -- Hardware and Python 3.14t Environment Fingerprint
Appendix B -- Reproduction Commands
Appendix C -- Per-Cell Matched-Pair Tables

3. Executive Summary

Technical Report 165 is the matched-pair falsification test of a specific causal claim that Technical Report 164 V1 left on the table. V1 documented a uniform N=2 breakdown under the pytorch_direct backend: across every model in {unsloth/Llama-3.2-1B-Instruct, Qwen/Qwen2.5-1.5B-Instruct, unsloth/Llama-3.2-3B-Instruct}, across every workload in {short_decode, balanced_2k, long_decode, repeated_prefix}, and across both phases {scaling, ttft}, increasing concurrency from N=1 to N=2 collapsed parallel efficiency and, in six specific cell shapes, produced deterministic hangs. The V1 report named the Python Global Interpreter Lock as the most plausible single mechanism on the grounds that pytorch_direct keeps the entire request-management loop in Python with no out-of-process worker pool to shed contention. That attribution was a hypothesis, not a finding. TR165 exists to convert the hypothesis into an answer.

The design holds a single delta. The Python interpreter changes from the stock CPython 3.x build used in V1 to the 3.14.5 free-threaded build (pythoncore-3.14t-nuget, tags/v3.14.5:5607950, May 10 2026) in which sys._is_gil_enabled() returns False at dispatcher startup AND after every CUDA-touching import. Everything else is matched cell-for-cell against V1: the three Hugging Face IDs are reused verbatim, the backend is held to pytorch_direct only, the workload set is identical, both phases run, concurrency is restricted to {N=1, N=2} because that is the boundary V1 isolated, repetitions per cell are matched to V1, and the hardware is the same RTX 4080 Laptop 12 GB box. The matched-pair scaffold is enforced inside research/tr165/: matched_pair_contrast in analysis.json joins V1 and V165 cells on (model, workload, phase, N, rep) and the dispatcher refuses to launch any cell V1 did not also run.

The later same-session control tightens that design at N=2. Instead of comparing CPython 3.13.1 against CPython 3.14.5t, it holds CPython 3.14.5t, Docker image, CUDA stack, model plan, and run window fixed, and toggles only PYTHON_GIL=1 versus PYTHON_GIL=0. Both 144-cell arms complete with 2,592 rows and 0 failed cells. No-GIL improves all 24/24 paired N=2 combinations over GIL-on by mean +0.1325, 95% CI [0.1218, 0.1434], but both arms remain below the 0.65 floor in all 24/24 pairs. This is the cleanest N=2 GIL-flag attribution and the cleanest warning against overclaiming.

Four preflight gates protect the contrast from a stealth GIL-on regression. The dispatcher asserts (1) the executable resolves to the 3.14t NuGet build, (2) sys._is_gil_enabled() is False at startup, (3) the same call still returns False after torch, transformers, and accelerate import (a known footgun -- a C extension can re-enable the GIL by setting the module flag), and (4) cuda_available is True so the matched-pair contrast actually exercises GPU paths rather than degrading to a CPU fallback. All four gates passed. sysconfig.Py_GIL_DISABLED == True, signals_agree == True, and the harness ran with abort_on_gil_build == True so any silent reversion would have aborted the experiment rather than produced a misleading null.

Execution completed in full. The plan enumerated 48 cells (3 models x 4 workloads x 2 phases x 2 N levels under the single pytorch_direct backend); all 48 ran, all 24 (model x workload x phase) combinations have N=2 data, and metrics.csv contains 3,744 rows. runtime_ok_for_verdict is True. Coverage of the V1 hang-cell ledger is also complete in scaffolding terms -- all six V1 deterministic-hang cell shapes were attempted under nogil, which satisfies the pre-registered minimum_hang_shapes_attempted threshold of 6.

The pre-registered hypothesis ladder resolves cleanly. H0 -- "removing the GIL produces no improvement; the GIL is not the mechanism" -- is rejected; the mean N=2 parallel-efficiency delta is meaningfully greater than zero. H1 -- "ALL 24 (model x workload x phase) combinations improve above threshold under nogil" -- is NOT fully supported; 19 of 24 combinations clear the threshold and 5 do not. H2 -- "SOME combinations improve" -- is supported. The final falsification verdict locked into analysis.json is therefore H2_partial, with these anchors: 19/24 combinations pass H1 (79.2%), 5/24 fall below the threshold (20.8%), mean N=2 parallel-efficiency delta is +17.91 percentage points (+0.1791 absolute), median is +17.15 percentage points (+0.1715), and 2 of the 6 V1 deterministic-hang cell shapes are resolved under nogil while 4 still hang.

The structural reading is the load-bearing claim of this report: the Python GIL is a mechanism behind V1's N=2 breakdown, but it is not the mechanism. A clean GIL-causes-everything world predicted H1 -- all 24 combinations should recover under nogil, all six hangs should clear, and the mean delta should track the full V1 efficiency gap. The substrate observed +17.91 pp mean recovery, 79.2% of combinations clearing threshold, and 2/6 hangs resolved. That is a real GIL signature, large enough to reject H0, but small enough that the residual 20.8% non-improving combinations and the residual 4/6 unresolved hangs require another explanation. CUDA stream serialization, memory-bandwidth contention on the 12 GB RTX 4080 Laptop, and Python-side dispatcher overhead that survives GIL removal are the candidate residual mechanisms; TR165 does not isolate which of them dominates and does not claim to.

This result is what Phase 8 of the Banterhearts program was scaffolded to produce. TR164 V2 holds the Python interpreter fixed and changes the backend; it asks "does the V1 breakdown survive a backend change?" -- answer in the V2 substrate: no, it does not, so the breakdown is pytorch_direct-specific rather than a generic concurrency-2 phenomenon. The original TR165 run holds the backend fixed and changes the Python runtime; it asks "does the breakdown survive a free-threaded transition?" -- answer: it partially survives. The same-session control then holds the Python version fixed and changes only the GIL flag; answer: every N=2 pair improves, but no pair clears the floor. Read together, these form a three-layer triangulation over backend identity, runtime transition, and GIL flag. The mechanism-isolation contribution of this report is the protocol itself: matched-pair single-delta ablations at the runtime layer, with partial verdicts rather than forced binary closure.

4. Introduction and Research Motivation

Technical Report 164 V1 closed with a sharply specified hypothesis but not a closed proof. The substrate showed a uniform breakdown at N=2 across every (model, workload) cell on the pytorch_direct backend, with thermal and power signatures that looked unmistakably GIL-shaped: GPU SM-occupancy pinned near 99% while wall-clock throughput barely advanced beyond N=1, and CPU utilization stalled in patterns consistent with interpreter-level serialization rather than kernel-level queueing. The V1 report named this the GIL-attribution hypothesis, flagged it as the most plausible single mechanism, and then deliberately declined to claim it as the proven cause. TR165 exists to settle that open question on substrate, not on intuition.

4.1 V1 GIL-Attribution Hypothesis

The V1 hypothesis, restated in its strongest form, is: the N=2 breakdown observed under pytorch_direct in-process concurrent inference dispatch is caused by Python's Global Interpreter Lock serializing the host-side dispatch path, such that two worker threads cannot make forward progress simultaneously even when the GPU is not the bottleneck. Three signatures supported this reading in V1. First, GPU SM-occupancy stayed near saturation while throughput failed to scale, ruling out simple GPU underutilization. Second, thermal and power traces showed GIL-starved idle patterns -- the CPU side oscillating in step with one-worker-at-a-time progress rather than smooth two-worker overlap. Third, the breakdown was uniform across model size and workload shape in a way that pointed to a host-side bottleneck independent of model arithmetic intensity. Each signature was necessary; none was individually sufficient.

4.2 Why Matched-Pair Testing Is Needed

The problem with V1's evidence base, taken alone, is that several alternative mechanisms produce overlapping signatures. CUDA kernel serialization through a single default stream would also cap N=2 throughput while leaving SM-occupancy high. Host-side memory-bandwidth contention between two dispatch threads sharing the same allocator pool would produce similar oscillation patterns. Dispatcher-level overhead -- Python object construction, tensor metadata bookkeeping, autograd guard cost -- could plausibly account for the gap without invoking the GIL specifically. V1 substrate cannot distinguish among these. The only experimental design that can is a matched-pair test in which a single mechanism -- GIL presence vs absence -- is varied while every other factor is held fixed. If the GIL is the mechanism, removing it should recover the lost N=2 efficiency. If the GIL is a co-mechanism, partial recovery is the expected signal. If the GIL is incidental, removal should produce no recovery at all.

4.3 Python 3.14 Free-Threaded as Single-Delta Variable

Python 3.14's PEP 779 free-threaded build is the precise instrument the matched-pair test requires. The python3.14t binary removes the GIL at the interpreter level while preserving the rest of the CPython runtime, the C extension ABI surface (in practice, for the libraries TR165 exercises), and the underlying CUDA/PyTorch stack. TR165 holds every other axis constant against V1: the hardware is the same RTX 4080 Laptop 12 GB silicon, the model set is the same three checkpoints (llama3.2-1b, qwen2.5-1.5b, llama3.2-3b), the workload set is the same four shapes (short_decode, balanced_2k, long_decode, repeated_prefix), the backend is pytorch_direct only, the concurrency levels are N=1 and N=2 -- the exact boundary where V1 broke down -- and the repetition counts are matched to V1. The Python interpreter build is the only intentional change. Preflight verification at dispatcher startup confirms the substrate matches the intent: sys._is_gil_enabled() == False both before and after the torch/transformers/accelerate import chain, sysconfig.Py_GIL_DISABLED == True, free_threaded_build: True, signals_agree: True, and abort_on_gil_build: True ensuring the run aborts rather than silently falling back to a GIL-enabled interpreter.

Observations. The single-delta discipline is what licenses the causal reading. Without it, any N=2 improvement could be attributed to library version drift, driver updates, or accidental workload differences. With it, an observed recovery is attributable to GIL removal by construction.

The matched-pair design converts a correlational hypothesis ("GIL signatures correlated with breakdown") into a falsifiable causal claim ("removing the GIL recovers the breakdown"). The cost of this discipline is operational -- a separate Python toolchain, separate dispatcher preflight, separate result ledger -- but it is the only honest path from V1's open hypothesis to a settled answer.

4.4 Pre-registration and Single-Delta Discipline

TR165 pre-registers three hypothesis verdicts in research/tr165/publication_contract.json before any data collection, locking the analysis logic to the substrate rather than to post-hoc narrative. H0 is the null: no improvement, the GIL is not the mechanism. H1 is the strong attribution: ALL 24 (model x workload x phase) combinations at N=2 show parallel efficiency improvement above threshold under the free-threaded build. H2 is the partial attribution: SOME combinations improve, consistent with the GIL being a contributing but not exclusive mechanism. The pre-registered decision rule consumes the matched-pair contrast directly: counts of combinations passing the H1 threshold, mean and median delta in N=2 parallel efficiency, and counts of V1 deterministic-hang cell shapes resolved under nogil. The verdict reported in Section 7 -- H2_partial, with 19 of 24 combinations passing the H1 threshold, mean delta of +17.91pp in N=2 parallel efficiency, median delta of +17.15pp, and 2 of 6 V1 hang cell shapes resolved -- is the output the pre-registered logic returns when fed the actual substrate. The number is not chosen; it falls out of the rule.

Observations. The pre-registration is what protects the report from confirmation bias in either direction. A pure-GIL story would have been the comfortable framing -- it closes V1 cleanly and licenses a tidy single-mechanism narrative. A no-effect story would have been the contrarian framing -- it would let the report sell a surprising-null result. The substrate refuses both: the GIL matters, by 17-18 percentage points on average, but it is not the only thing that matters.

The honest answer is the partial answer. The 79.2% of combinations that improve and the 20.8% that do not, together with the 2-of-6 hang resolution count, are the four numbers the rest of this report exists to explain. Section 7 reports the matched-pair contrast that produced them; Section 8 examines which (model x workload x phase) combinations fall into each bucket; Section 13 traces the residual non-improvement back to the candidate co-mechanisms -- CUDA kernel serialization, memory-bandwidth contention, dispatcher overhead -- that V1 substrate could not exclude and TR165 substrate now licenses naming as contributors. The report does not claim to have isolated those co-mechanisms; that is the work of a successor experiment. It claims only what the matched-pair design earns: the GIL is a real mechanism, removing it produces a real and measurable recovery, and the recovery is partial in a way that constrains the next experiment's design.

5. Pre-Registered Research Hypotheses

The TR165 design is a matched-pair falsification test against TR164 V1's GIL-attribution hypothesis. Before the dispatcher fired its first cell, three hypotheses were pre-registered in research/tr165/config.yaml and research/tr165/publication_contract.json, each with an explicit admissibility criterion that turns the read of the substrate into a mechanical decision rather than a post-hoc narrative. The hypotheses are reproduced here verbatim against the pre-registered admissibility floor so that the falsification verdict in SS-7 (H2_partial) is auditable line-by-line against this section.

5.1 H0 -- null: GIL is not the mechanism

H0 asserts that removing the GIL does not change the N=2 parallel efficiency observed under TR164 V1. Formally, the mean of delta_n2_efficiency across matched (model, workload, phase) cells equals 0, and any observed deviation is consistent with run-to-run noise on the matched pair. The admissibility criterion for H0_pass is that the mean delta is statistically indistinguishable from 0 across the matched-pair contrast. Under H0, the TR164 V1 breakdown at N=2 has no contribution from the GIL; the bottleneck lies entirely in CUDA kernel serialization, memory-bandwidth contention, dispatcher overhead, or a combination of non-interpreter mechanisms.

5.2 H1 -- full attribution: GIL is the sole mechanism

H1 asserts the opposite extreme -- that the GIL is the sole mechanism behind V1's N=2 breakdown. The admissibility criterion has two conjunctive clauses, both of which must hold for H1_pass:

All 24 (model x workload x phase) N=2 combinations under nogil show parallel efficiency improvement above the pre-registered efficiency threshold relative to V1's matched cell.
All 6 V1 deterministic-hang cell shapes are resolved (i.e., run to completion without deterministic hang) under the free-threaded build.

Under H1, the substrate would look like a clean wall-removal: every matched cell rises above the threshold, every hang clears, and the residual non-interpreter mechanisms contribute zero observable headroom at the tested boundary.

5.3 H2 -- partial attribution: GIL is a mechanism but not the only one

H2 asserts the middle ground -- that some combinations improve under nogil but not all, and that some hang shapes resolve but not all. The admissibility criterion is the logical conjunction H2_pass = (H0_pass is False) AND (H1_pass is False); H2 is the residual verdict that survives when both endpoints fail. Substantively, H2_pass means the substrate has revealed the GIL as A mechanism but not THE sole mechanism, and the residual non-improvements license a follow-on attribution exercise (TR164 V2 cross-backend, kernel-serialization probes) rather than closure.

5.4 Admissibility floor: data-coverage thresholds

Two coverage thresholds are pre-registered and gate the verdict from being read at all:

Threshold	Required value	Observed value	Status
`minimum_n2_data` (N=2 matched cells)	20	24	satisfied
`minimum_hang_shapes_attempted`	6	6	satisfied

Observations. Both coverage thresholds are satisfied at or above their pre-registered floor. The N=2 matched-cell coverage is full (24 of 24 combinations), which means there is no missing-cell carve-out that could mask either H1 failure or H0 acceptance; the hang-shape coverage is exactly at floor (6 of 6 V1 shapes attempted), which means the hang-resolution ledger in SS-9 reads against the same denominator V1 reported.

The coverage floor is the substrate's "you may now read the verdict" gate. TR164 V1 had cells lost to nsys importer failure; TR165's admissibility floor is set so that no such loss can quietly invalidate the falsification read.

5.5 Verdict-shape this section commits to

This section pre-registers, before the reader sees SS-7, that exactly one of three verdicts is admissible: H0_pass (null), H1_pass (full attribution), or H2_partial (partial attribution). No fourth verdict, no narrative escape hatch, no "directionally supports H1" hedge. The substrate facts in SS-7 (mean delta_n2_efficiency = +17.91%, 19 of 24 combinations above threshold, 2 of 6 hangs resolved) will mechanically select among these three, and the rest of the report is the audit trail for that selection.

6. Methodology

TR165 is built around a single discipline: change exactly one thing relative to TR164 V1, hold every other knob bit-identical, and let the matched-pair contrast carry the falsification. This section documents how that single delta is operationalized, how the free-threaded Python build was acquired and verified, what the preflight gate enforces before any benchmark code executes, what the 48-cell matrix looks like in shape, and -- equally importantly -- what TR165 deliberately does not touch.

6.1 Single-Delta Design

The single delta is the Python interpreter build. TR164 V1 ran on the stock CPython interpreter with the Global Interpreter Lock enabled; TR165 swaps that interpreter for the Python 3.14.5 free-threading build (python3.14t.exe) and changes nothing else. The hardware is the same RTX 4080 Laptop 12 GB device used for V1. The backend is pytorch_direct only. The model set is the same three checkpoints (llama3.2-1b, qwen2.5-1.5b, llama3.2-3b). The workload set is the same four shapes (short_decode, balanced_2k, long_decode, repeated_prefix). The phase set is the same two phases (scaling, ttft). Repetitions per cell are matched to V1. The TR164 V1 baseline at research/tr164/results/20260531_120428_552237/ is read-only substrate for this report -- TR165 does not re-run V1's GIL-enabled baseline, it contrasts against the on-disk artifact.

Observations. Holding everything but the interpreter constant is what makes the falsification falsifiable. If TR165 had also changed the backend, or swapped the model set, or expanded concurrency, no amount of post-hoc analysis could attribute the resulting delta to the GIL specifically.

The clean attribution chain depends on the single-delta discipline. Any "while we're at it" change to backend, model, or workload would have collapsed the matched-pair contrast into a multi-factor study and the H1/H2 verdict would be uninterpretable.

6.2 Python 3.14t Free-Threaded Build Acquisition

The interpreter is sourced from the pythoncore-3.14t-nuget NuGet package, which ships a side-installable free-threading build of CPython 3.14. The exact binary used at dispatcher startup is C:\Users\sahil\AppData\Local\Programs\Python\pythoncore-3.14t-nuget\tools\python3.14t.exe. The reported version string is 3.14.5 free-threading build (tags/v3.14.5:5607950, May 10 2026). This is the upstream tagged release, not a self-compiled fork.

The choice of the NuGet distribution over a from-source build is deliberate: it pins the interpreter binary to a publicly addressable upstream tag, eliminating the "did the local toolchain affect the result" failure mode. The build is provisioned alongside the GIL-enabled interpreter on the same host so the two can be A/B-swapped without any system-level reconfiguration.

Observations. The free-threading build is opt-in at the binary level -- it does not silently shadow the default python.exe. Every TR165 dispatcher invocation must select python3.14t.exe explicitly, which is the second line of defense behind the preflight signal gate documented in 6.3.

The NuGet binary is the substrate-of-record. Any reviewer-grade replication can pull the same NuGet package version and reach byte-identical interpreter behavior on the same hardware class.

6.3 Preflight Detection Stack

The preflight stack is a three-signal verification that fires before any benchmark code executes and re-fires after the heavy library imports. The three signals are:

sys._is_gil_enabled() == False
sysconfig.Py_GIL_DISABLED == True
signals_agree == True (a meta-check that the first two signals are consistent rather than independently misreporting)

All three must hold simultaneously. At the TR165 dispatcher startup, the substrate records gil_disabled_at_dispatcher_startup: True, free_threaded_build: True, signals_agree: True, and cuda_available: True. The CUDA-available check is required because a free-threading interpreter that cannot reach the GPU would silently fall back to CPU and produce a meaningless contrast against V1.

Critically, the preflight gate fires twice. The first fire is at process startup, before import torch, import transformers, or import accelerate. The second fire is immediately after those imports complete. The reason is that import-time C-extension initialization is the canonical path by which a library can re-enable the GIL on a free-threading build -- a Py_mod_gil declaration that opts back into the GIL would flip sys._is_gil_enabled() back to True mid-process. The second fire catches that flip if it happens, before any benchmark cell runs.

The abort_on_gil_build safety bar is the final piece: if any of the three signals reverts to a GIL-enabled state at either preflight checkpoint, the dispatcher aborts the entire run rather than continuing and silently producing a GIL-enabled measurement under the TR165 label.

Observations. The double-fire preflight is what protects the report from the silent-fallback failure mode where a library import flips the GIL back on and the run looks fine externally. The TR165 substrate records all four signals as True at both checkpoints, which is why runtime_ok_for_verdict: True is defensible.

A single preflight check before imports would have been insufficient. The post-import re-check is the load-bearing piece -- without it, a future transformers or accelerate release that declared Py_mod_gil = Py_MOD_GIL_USED would have invalidated the verdict without any visible signal in the run logs.

6.4 Matched-Matrix Cell Plan

The matched matrix is the Cartesian product of the four held-constant dimensions: 3 models x 4 workloads x 2 phases x 2 N-concurrency levels = 48 cells. The two N levels are N=1 and N=2, deliberately narrowed from a broader sweep to the specific breakdown boundary that V1 identified. The execution result records 3,744 rows in metrics.csv across the completed cells, and the substrate's runtime_ok_for_verdict flag holds at True.

The 48-cell plan is dimensionally identical to the corresponding TR164 V1 slice along the pytorch_direct backend axis. This is what enables the matched_pair_contrast analysis stage in analyze.py to join V1 and TR165 cells on the four-dimensional key (model, workload, phase, N) and emit per-cell deltas on decode_tps, parallel_efficiency, p95_latency_multiplier, and ttft_ms. Cells that exist in TR165 but not V1, or vice versa, are surfaced separately under tr165_only_cells and tr164_v1_only_cells rather than silently dropped, which preserves the audit trail when coverage is asymmetric.

Observations. The 48-cell figure is exact, not aspirational. All 48 cells completed; the N=2 coverage check (n_combinations_with_n2_data: 24 of 24) confirms that the H1/H2 falsification ran against a full N=2 substrate, not a partial one.

Matched-matrix completeness is the prerequisite for a clean matched-pair test. Had any of the 24 N=2 combinations been missing, the H1 "ALL 24 combinations" criterion would have been structurally untestable rather than empirically falsified.

6.5 What Methodology Excludes

TR165 is deliberately narrow. The following are explicit non-changes relative to V1:

No backend change. vllm, sglang, and other serving backends are out of scope; only pytorch_direct is exercised. Cross-backend GIL-sensitivity is the V2 line, not the V1 falsification line.
No model change. The three-model set is held constant; no Qwen-3, Gemma, or Phi additions, and no parameter-count sweep.
No workload expansion. The same four workload shapes from V1 are reused; no new context-length sweeps, no new batch geometries.
No concurrency expansion. N is held to {1, 2}; no N=4 or N=8 sweeps. The 2 N levels are the focused breakdown-boundary slice.
No nsys traces. Nsight Systems profiling is excluded from the TR165 plan; the report relies on aggregate metrics from metrics.csv, not per-kernel timelines.
No validate-only-as-scientific-substitute. The dispatcher exposes a --no-require-nogil validate-only fallback path for plumbing checks, but that path is explicitly NOT a substitute for a scientific run. Every cell reported in TR165's verdict was collected with the preflight gate fully satisfied.

Observations. Each exclusion is a degree of freedom intentionally removed to keep the single-delta discipline intact. Re-enabling any one of them is a separate, future TR -- not a TR165 follow-on patch.

The exclusions list is the scope contract. A reviewer asking "did you also try vllm under nogil?" gets answered honestly with "no -- that is the V2 cross-backend line, not the V1 falsification" rather than with an apologetic in-scope-creep paragraph.

6.6 Same-build GIL flag control added after the original verdict

The original TR165 verdict compared TR164 V1's stock CPython 3.13.1 GIL baseline against a CPython 3.14.5 free-threaded no-GIL run. That was the right first falsification test, but it left a reviewer-facing confound: a Python minor-version and wheel-stack change moved at the same time as the GIL state. The follow-up control removes that confound for the N=2 screen by running both arms under the same CPython 3.14.5 free-threaded Docker image, same local RTX 4080 Laptop GPU, same model/workload/phase/repetition plan, and same TR165 worker code, while toggling only PYTHON_GIL.

The control uses two configs: research/tr165/config_same_session_n2_gil_on.yaml forces PYTHON_GIL=1, and research/tr165/config_same_session_n2_nogil.yaml forces PYTHON_GIL=0. Both arms intentionally restrict scope to the N=1/N=2 screen: 3 models x 4 workloads x 2 phases x 2 N levels x 3 reps = 144 cells per arm. The N=16 hang-shape ledger is not part of this same-session control and remains historical transition evidence only.

Arm	Run directory	GIL state	Cells	Rows	Failed cells
Same-build GIL-on	`research/tr165/results/20260623_165503_919943`	`gil_disabled=False`	144	2,592	0
Same-build no-GIL	`research/tr165/results/20260623_222630_155096`	`gil_disabled=True`	144	2,592	0

The analyzer research/tr165/analyze_same_session_gil_control.py emits research/tr165/same_session_gil_control/same_session_gil_control_summary.json and same_session_gil_control_n2_pairs.csv. The summary reports that no-GIL improves all 24/24 paired N=2 combinations over GIL-on, with mean delta +0.1325, bootstrap 95% CI [0.1218, 0.1434], median +0.1274, minimum +0.0655, and maximum +0.1901. Both arms remain below the absolute 0.65 parallel-efficiency floor in all 24/24 pairs on the Docker stack.

Observations. This control changes the interpretation, not the original pre-registration result. It closes the "maybe CPython 3.14.5, not the GIL flag" objection for N=2 because the version, image, CUDA stack, model plan, and run window are matched. It also prevents overclaiming: same-build no-GIL gives a consistent positive shift, but that shift is not sufficient to cross the 0.65 operating floor on this Docker stack and says nothing about the N=16 hang-shape subset.

7. Substrate Inheritance -- TR164 V1 Baseline

TR165 is not a fresh measurement campaign in isolation. It is a matched-pair extension grafted onto TR164 V1's 360-cell substrate, and the integrity of every delta reported downstream depends on the join between the two run directories being exact, read-only, and auditable. This section documents what TR165 inherits, what it does not touch, and how the join is computed.

7.1 V1 Substrate Surface Area

The TR164 V1 baseline lives at research/tr164/results/20260531_120428_552237/. Its cell_summary.csv enumerates a 360-cell substrate covering the full V1 plan (multiple backends, broader N sweep, three models, four workloads, two phases, matched reps). The companion metrics.csv carries 21,159 metric rows plus 14 skip-marker rows from cells that V1 logged as intentionally elided. TR165 reads both files at analysis time and writes nothing back; the V1 directory's modification timestamp is unchanged across the TR165 run.

7.2 Coordinate Subset

TR165's plan is a coordinate-aligned subset of V1, not a re-scoped sibling. The shared join keys are the canonical (phase, model, workload, n_agents, rep) tuple, with backend fixed to pytorch_direct so the GIL ablation is isolated from cross-backend variance. TR165 retains all three V1 models (llama3.2-1b, qwen2.5-1.5b, llama3.2-3b), all four V1 workloads (short_decode, balanced_2k, long_decode, repeated_prefix), both V1 phases (scaling, ttft), and the two N levels that bracket V1's documented breakdown boundary (N=1 baseline, N=2 contention onset). Reps per cell are matched to V1. The 48-cell TR165 plan is therefore a strict coordinate subset of the 360-cell V1 plan, restricted to a single backend and to the N levels where V1's GIL-attribution hypothesis was formulated.

Inherited dimension	V1 surface	TR165 retained subset
Backend	multi-backend	`pytorch_direct` only
Models	3	3 (identical IDs)
Workloads	4	4 (identical IDs)
Phases	2	2 (identical IDs)
N concurrency	full sweep	N=1, N=2
Total cells	360	48
Hardware	RTX 4080 Laptop 12 GB	RTX 4080 Laptop 12 GB (exact match)

Observations. The 48-cell TR165 plan inherits exactly the coordinates V1 reported a breakdown on, and no others. The hardware row is the load-bearing one: holding the GPU constant means the matched-pair delta is attributable to the Python runtime change, not to a silicon swap.

The inheritance is deliberately narrow. TR165 is not trying to re-prove V1's breadth; it is trying to isolate one variable (the GIL) on the coordinates where V1 already documented a phenomenon worth attributing.

7.3 Findings Carried Forward Without Modification

TR165 inherits three V1 findings as fixed inputs to its hypothesis, not as claims to be revisited:

Uniform N=2 parallel-efficiency breakdown across 24 (model x workload x phase) combinations. V1 documented that every N=2 combination it measured fell below its parallel-efficiency threshold. TR165 takes this as the population over which H1 is evaluated; the H1 pre-registration requires improvement on all 24, and the verdict is computed against that exact denominator.
Six deterministic-hang cell shapes on llama3.2-3b at N=16. V1 logged six cell shapes that hung deterministically under the V1 runtime. TR165 attempts all six under the free-threaded runtime; the hang_resolution_ledger carries the per-shape outcome.
Cell-shape definitions and skip-marker semantics. The 14 skip-marker rows in V1's metrics.csv are honored as legitimate elisions; TR165 does not attempt to re-measure them.

7.4 The Read-Only Join

The matched_pair_contrast block in research/tr165/results/20260607_174748_273070/analysis.json is the join's contract. Its baseline_csv field points to research/tr164/results/20260531_120428_552237/cell_summary.csv; baseline_found is true; baseline_error is null. The schema declares join_keys as the (phase, model, workload, n_agents, rep) tuple and aggregation as the per-cell summary statistic that V1 itself emitted, so the contrast is summary-to-summary, not a re-aggregation over V1's raw rows. The contrast block partitions cells into matched_cells, tr165_only_cells, and tr164_v1_only_cells, making any coordinate that fails to join visible rather than silently dropped. Deltas are computed per-metric on the matched set for decode_tps, parallel_efficiency, p95_latency_multiplier, and ttft_ms.

Observations. The join is structurally read-only: TR165's analyzer opens cell_summary.csv from V1, reads it into a dataframe, and never writes back. The presence of explicit tr165_only_cells and tr164_v1_only_cells buckets in the contract means coordinate drift would surface as a non-empty bucket rather than as a silently incorrect average.

The integrity argument here is the same one that gates the bridge paper: a matched-pair design is only worth its name if the pairing is verifiable. Pointing the contrast at V1's published cell_summary.csv (rather than re-running V1's analyzer inside TR165) means a reviewer can diff the two files and confirm the baseline was not retroactively edited to flatter the delta.

8. SS1. Preflight -- Python 3.14.5 Free-Threaded Build Verification

SS1 is the most load-bearing pre-run check in the TR165 dispatcher. The entire experiment is a counterfactual against TR164 V1 under one and only one swapped variable: the Global Interpreter Lock. If the dispatcher fires under a stock CPython build, or under a 3.14 build that ships free-threading capability but with the GIL re-enabled at runtime (the default for many wheel-managed environments), then every downstream number in this report is meaningless -- the matched-pair contrast collapses into noise around a moved baseline rather than a clean GIL on/off comparison. SS1 exists to make that failure mode impossible to reach silently.

The dispatcher therefore performs four independent checks before any model load, any CUDA context initialization, or any cell is queued. The checks are persisted into manifest.json::python_runtime so that the post-hoc reader (and any future replication attempt) can confirm the runtime under which the 3,744 metrics-row execution actually fired, not just the runtime the operator believed they had configured.

The first check is interpreter identity. The dispatcher resolves sys.executable and writes the absolute path into the manifest: C:\Users\sahil\AppData\Local\Programs\Python\pythoncore-3.14t-nuget\tools\python3.14t.exe. The t suffix is non-decorative -- it distinguishes the free-threaded NuGet build from the stock 3.14 wheel that ships alongside it in the same upstream release. The reported version string is 3.14.5 (tags/v3.14.5:5607950, May 10 2026) compiled under MSC v.1944 64-bit (AMD64). The build tag is recorded verbatim because the free-threading ABI changed across pre-release tags and a mismatched wheel pulled against a wrong tag would import-fail or, worse, fall back to GIL-enabled mode.

The second check is the triple-signal GIL state inspection, performed twice. Before any third-party import, the dispatcher reads three independent signals: sys._is_gil_enabled() returns False, sysconfig.Py_GIL_DISABLED returns True, and the dispatcher's own signals_agree reducer (which cross-checks the two against the build-time configuration) returns True. After importing the three libraries that have the highest risk of silently re-enabling the GIL via a C-extension that does not declare Py_mod_gil -- torch, transformers, accelerate -- the dispatcher re-reads the same three signals. All three remain in the same agreement state. This is the check that catches the silent-fallback failure mode where a C-extension import flips the GIL back on under the operator's feet.

The third check is CUDA availability under the nogil interpreter. torch.cuda.is_available() returns True. This rules out the trivial degenerate case where the free-threaded interpreter is correctly configured but the CUDA runtime fails to bind under nogil -- a failure mode that would have caused every cell to fall back to CPU execution and produce a different report entirely.

The fourth check is the abort-on-GIL-build safety bar. abort_on_gil_build = True means the dispatcher is configured to raise and halt the entire run if any of the preceding three checks come back in a disagreeing state. This is not a passive log line; it is a hard precondition. The bar held throughout dispatcher startup.

Check	Signal	Expected	Observed	Status
1. Interpreter identity	`sys.executable` + `sys.version`	`python3.14t.exe`, 3.14.5 free-threading tag `v3.14.5:5607950`	Path matches NuGet `pythoncore-3.14t-nuget`; version string `3.14.5 (tags/v3.14.5:5607950, May 10 2026)` MSC v.1944 AMD64	PASS
2. Triple-signal GIL state (pre- and post-import)	`sys._is_gil_enabled()` / `sysconfig.Py_GIL_DISABLED` / `signals_agree`	`False` / `True` / `True`, identical before and after `torch`+`transformers`+`accelerate` import	All three signals returned the expected state before imports AND after imports; `free_threaded_build: True`	PASS
3. CUDA availability under nogil	`torch.cuda.is_available()`	`True`	`True`	PASS
4. Abort-on-GIL-build safety bar	`abort_on_gil_build`	`True`	`True`; bar held; no abort raised	PASS

Observations. All four preflight checks passed. The most important pair of rows is row 2's pre-import-and-post-import re-check: this is the only check that can detect a C-extension that imports the free-threaded ABI symbol but quietly re-enables the GIL at module-init time. Both reads agreed (sys._is_gil_enabled() == False, sysconfig.Py_GIL_DISABLED == True, signals_agree == True), which means the three highest-risk imports in the experimental stack -- PyTorch, Transformers, Accelerate -- all honor the free-threaded build under this configuration. The CUDA-under-nogil check (row 3) further rules out the degenerate fallback where the interpreter is correctly nogil but the GPU stack silently degrades to CPU. The abort-on-GIL-build bar (row 4) being engaged means that if any of these checks had failed, no cell would have fired and no metrics row would have been written -- the dispatcher would have halted with the manifest in an aborted state rather than producing the ambiguous "ran but maybe under GIL" outcome that would have wasted the entire run.

The four-check preflight is what licenses the rest of this report to read the 3,744-row metrics table as a clean GIL-off counterfactual against TR164 V1's GIL-on baseline. Without SS1, a +17.91% mean N=2 parallel-efficiency improvement is ambiguous between "the GIL was the bottleneck" and "the runtime drifted between V1 and TR165 in some other way." With SS1's manifest-persisted evidence -- interpreter path, version tag, triple-signal pre- and post-import agreement, CUDA availability under nogil, and an engaged abort-bar -- the GIL is isolated as the single swapped variable, and the H2_partial verdict reported downstream is interpretable as a direct attribution finding rather than a confounded one.

9. SS2. The 48-Cell Matched-Pair Plan

SS2 specifies the cross-product that TR165's dispatcher actually walked. The design is deliberately narrow: it does not re-scan V1's full N concurrency sweep, it does not introduce new backends, and it does not add workloads V1 did not already characterize. The single experimental knob is the Python runtime; everything else is held to V1's exact shape so the matched-pair contrast in SS3 can attribute deltas to that single knob rather than to drift.

The cross-product is 3 models x 4 workloads x 2 phases x 2 N concurrency = 48 cells, each replicated at 3 reps matched to V1, and dispatched against a single backend (pytorch_direct). The 48-cell total agrees with the planned-cells field in the run manifest; the executed metrics.csv contains 3,744 measurement rows distributed across those 48 cells, and runtime_ok_for_verdict is True.

9.1 Models

Slug	HuggingFace ID	Family
llama3.2-1b	unsloth/Llama-3.2-1B-Instruct	Llama 3.2, 1B parameters
qwen2.5-1.5b	Qwen/Qwen2.5-1.5B-Instruct	Qwen 2.5, 1.5B parameters
llama3.2-3b	unsloth/Llama-3.2-3B-Instruct	Llama 3.2, 3B parameters

Observations. Three models span two families (Llama 3.2 and Qwen 2.5) and three parameter classes (1B, 1.5B, 3B). All three are direct V1 carryovers; no model substitution was made. The Llama pair brackets the 3B Qwen-cluster boundary, and Qwen-1.5B sits between them in parameter count, so any monotone-with-size GIL effect should be visible across the three-point ladder.

The model axis is anchored to V1 by construction. Holding the HuggingFace IDs identical to V1's manifest means a delta in N=2 parallel efficiency between TR164 V1 and TR165 cannot be explained by a weight swap, a tokenizer change, or a chat-template revision; the only thing that has moved is sys._is_gil_enabled().

9.2 Workloads

Workload	Behavior characterized in V1
short_decode	Short prompt, short decode -- TTFT-dominated regime
balanced_2k	2K-context mixed prefill/decode -- the V1 reference workload
long_decode	Long decode tail -- decode-throughput-dominated regime
repeated_prefix	Repeated-prefix prompt -- KV-cache-reuse-sensitive regime

Observations. The four workloads cover the prefill / decode / cache-reuse spectrum V1 already mapped. They are reused verbatim so that "the 17.91 percentage-point mean N=2 efficiency recovery reported in falsification_verdict" can be cleanly attributed to GIL removal at fixed workload shape, not to a workload-mix change.

Long_decode and repeated_prefix are the two workloads V1 flagged as the steepest N=2 breakdown surfaces. Including both, rather than dropping to a single "representative" workload, is what lets SS3 separate the 19-of-24 improving combinations from the 5-of-24 that do not move -- and lets the hang ledger in SS6 record whether the 4 still-hanging shapes cluster on a specific workload.

9.3 Phases

Phase	What it measures
scaling	Steady-state decode throughput and parallel efficiency
ttft	Time-to-first-token under the same concurrency conditions

Observations. Both V1 phases are retained. TTFT is kept in scope because the GIL-attribution hypothesis applies to dispatcher-side latency as well as to steady-state decode; truncating to scaling-only would have hidden whether prefill-side parallelism moves with the runtime knob.

Carrying both phases through the matched pair is what allows SS3 to report decode_tps, parallel_efficiency, p95_latency_multiplier, and ttft_ms as four independent contrast surfaces rather than one composite score. Composite scores hide the kind of split SS5 actually finds.

9.4 N Concurrency -- Bounded to the V1 Breakdown Boundary

N	Status in TR165	Rationale
1	Included	Single-worker baseline; parallel_efficiency by definition 1.000 in V1
2	Included	V1's first breakdown step; parallel_efficiency fell to 0.547 in V1
4	Excluded	Out of scope; separate ablation
8	Excluded	Out of scope; separate ablation
16	Excluded	Out of scope; separate ablation

Observations. V1 swept N up to 16; TR165 deliberately bounds the sweep to {N=1, N=2}. The N=1 -> N=2 transition is exactly where V1's parallel efficiency dropped from 1.000 to 0.547, which is the single largest single-step degradation V1 reported. Concentrating reps at that boundary buys statistical resolution where the V1 hypothesis is most testable; extending to N=4/N=8/N=16 would have diluted reps without buying additional discrimination at the boundary the hypothesis names.

This is the design decision the rest of the report leans on. The +17.91 percentage-point mean delta and the +17.15 percentage-point median delta in falsification_verdict.mean_delta_n2_efficiency / median_delta_n2_efficiency are computed over 24 N=2 combinations (3 models x 4 workloads x 2 phases). Twenty-four is the exact count of N=2 cells in the 48-cell grid, and n_combinations_with_n2_data: 24 of 24 confirms full coverage with no missing cells. The N=4+/N=8+/N=16+ slice belongs to a follow-on ablation, not to this report.

9.5 Cell-Count Reconciliation

Quantity	Value	Source
Models	3	plan
Workloads	4	plan
Phases	2	plan
N levels	2	plan
Cross-product cells	48	3 x 4 x 2 x 2
N=2 cells (subset)	24	3 x 4 x 2 x 1
Reps per cell	matched to V1	plan
metrics.csv rows	3,744	run artifact
runtime_ok_for_verdict	True	analysis.json

Observations. The 48-cell plan executed to completion. The 24-cell N=2 subset is the unit of analysis for the H1 / H2 verdicts: 19 of 24 combinations pass the H1 per-combination threshold and 5 of 24 fall below it, which is what drives the H2_partial verdict reported in SS7.

The 48-cell budget is small on purpose. It is large enough to give 24 N=2 combinations for the matched-pair contrast (the minimum_n2_data threshold of 20 is satisfied with margin), and large enough to attempt all 6 deterministic-hang cell shapes inherited from V1 (the minimum_hang_shapes_attempted threshold of 6 is satisfied exactly). It is deliberately not large enough to chase the N=4+/N=8+/N=16+ tail, which would have required either dropping reps below V1-match or expanding the runtime budget past what a single RTX 4080 Laptop session can deliver under the abort-on-GIL-build safety gate.

10. SS3. Cell-by-Cell Matched-Pair Substrate

SS3 is the row-level layer of the experiment: 48 planned cells, 3,744 metric rows, and a matched-pair join against the TR164 V1 baseline at research/tr164/results/20260531_120428_552237/. The contract is intentionally strict -- every cell that V1 ran on the breakdown-boundary slice gets a free-threaded partner at the same coordinates, and nothing else is allowed to drift. This section walks the cell ledger end to end, exposes the matched_pair_contrast block from analysis.json, and then surfaces the per-(model x workload x phase) N=2 parallel-efficiency substrate for all 24 combinations so downstream readers can audit which specific cells drive the H2_partial verdict.

10.1 Cell completion ledger

The plan calls for 48 cells: 3 models x 4 workloads x 2 phases x 2 concurrency levels. The dispatcher confirms runtime_ok_for_verdict: True and a metrics.csv row count of 3,744 -- meaning every planned cell completed and contributed measurement rows.

Plan axis	Levels	Values
Backend	1	`pytorch_direct`
Models	3	`llama3.2-1b`, `qwen2.5-1.5b`, `llama3.2-3b`
Workloads	4	`short_decode`, `balanced_2k`, `long_decode`, `repeated_prefix`
Phases	2	`scaling`, `ttft`
N concurrency	2	N=1, N=2
Cells planned	48	3 x 4 x 2 x 2
Cells completed	48	ok_rate = 1.00
metrics.csv rows	3,744	avg 78 rows/cell (rep x step structure inherited from V1)

Observations. The 48/48 completion is the load-bearing fact for the rest of the report: any matched-pair contrast that fails to reject H0 cannot be blamed on missing V1 <-> TR165 cells, because every TR165 cell on the V1 boundary slice ran to completion. The 3,744 row count averages to 78 metric rows per cell, which is consistent with the per-rep x per-step structure used in TR164 V1.

The completion ledger is the simplest possible defense against a "you cherry-picked the cells where nogil helped" reviewer line. Every cell V1 measured on the N=1/N=2 boundary slice has a TR165 partner at the same (phase, model, workload, n_agents, rep) coordinates, and the join is computed against the V1 baseline_csv rather than against a re-derived summary.

10.2 `matched_pair_contrast` schema

The matched_pair_contrast block in analysis.json is the join contract. Its top-level keys are: schema, join_keys, aggregation, baseline_csv, baseline_found, baseline_error, metrics, decode_tps, parallel_efficiency, p95_latency_multiplier, ttft_ms, matched_cells, tr165_only_cells, and tr164_v1_only_cells.

Field	Value / Meaning
`join_keys`	`[phase, model, workload, n_agents, rep]`
`baseline_csv`	`research/tr164/results/20260531_120428_552237/cell_summary.csv` (V1, read-only)
`aggregation`	per-cell summary across rep rows before contrast
`metrics` extracted	`decode_tps`, `parallel_efficiency`, `p95_latency_multiplier`, `ttft_ms`
`matched_cells` count	50 (full join surface including N=1 baselines)
`tr165_only_cells` count	0 (no TR165 expansion outside V1's coordinate set)
`tr164_v1_only_cells` count	70 (V1's broader N={4, 8, 16} sweep cells not re-run by TR165)

Observations. The five-element join key is deliberately restrictive: it pins rep as well as (phase, model, workload, n_agents) so that the contrast is paired at the replicate level rather than at the cell mean. This is what licenses the matched-pair language in SS6 and SS7 -- without rep in the join, the "matched pair" framing would collapse into an unpaired cell-mean contrast and the H1/H2 verdicts would lose their per-replicate evidence. The baseline_csv pointer is the read-only contract: TR165 never rewrites V1 numbers, it only joins against them. The tr165_only_cells = 0 figure is the strongest possible statement that TR165 did not silently expand the experiment -- every cell measured under the free-threaded interpreter has a V1 partner; the tr164_v1_only_cells = 70 figure is exactly the V1 N={4, 8, 16} sweep that TR165 deliberately did not re-run.

The schema choice locks the substrate into a defensible posture for any external acceptance signal that scrutinizes the matched-pair claim. If a reviewer asks "what does matched-pair mean here?" the answer is the five-tuple join key written into analysis.json -- not a verbal claim in the prose. The 50/0/70 partition completes that answer: 50 cells joined on both sides, 0 TR165 cells without a V1 partner, 70 V1 cells deliberately left out of scope.

10.3 Per-cell partition: matched, tr165-only, v1-only

The cell partition under the join key splits into three buckets. The 48-cell TR165 plan was restricted to V1's N=1/N=2 boundary slice, so the bulk of cells match V1 partners; the tr165_only_cells bucket is the audit residual that should be 0 by design.

Bucket	Count	Interpretation
`matched_cells`	50	TR165 cells with V1 baseline partner; covers all 24 N=1 baseline cells and all 24 N=2 contention cells, plus 2 additional bordering join rows
`tr165_only_cells`	0	Zero TR165 cells fall outside V1's coordinate set -- design constraint verified
`tr164_v1_only_cells`	70	V1 cells outside the TR165 boundary slice (V1's N={4, 8, 16} sweep across the same 24 model x workload x phase tuples, plus V1's non-`pytorch_direct` backend cells folded into V1's join)
Total TR165 cells	48	full cross-product completed

Observations. The exact bucket counts are now anchored in substrate: 50 matched cells, 0 TR165-only cells, 70 V1-only cells. The zero TR165-only count is structurally important -- it confirms TR165 did not add any cell shape that V1 did not also measure, so every claim in the matched-pair contrast is read against V1 at the same coordinates. The 70 V1-only cells are the cells V1 ran at N={4, 8, 16} or under backends TR165 deliberately excluded; they are visible in the bucket as a deliberate scope-restriction artifact, not as missing data.

Reading the partition as evidence of scope discipline: TR165 did not "expand" the experiment by adding cells V1 never measured (tr165_only_cells = 0). It re-ran a focused sub-slice of V1 under a different runtime (gil_disabled_at_dispatcher_startup: True), with everything else -- hardware, model set, workload definitions, reps -- held constant. The 70 V1-only cells are exactly the cells the scope contract excludes; they are the substrate for a future N-expansion TR, not for this one.

10.4 Per-(model x workload x phase) N=2 substrate

The 24 N=2 (model x workload x phase) combinations are the unit over which H1 is evaluated. The substrate exposes each combination's V1 parallel efficiency, TR165 parallel efficiency, absolute delta, and percent change. The 24 rows below are read verbatim from matched_pair_contrast.parallel_efficiency at n_agents=2; the H1 column flags whether the combination crossed the per-cell improvement threshold that drives the 19/24 pass count in falsification_verdict. The five fail rows correspond exactly to n_combinations_below_h1_threshold = 5.

Phase	Model	Workload	V1 efficiency	TR165 efficiency	Delta (absolute)	Percent change	H1
scaling	llama3.2-1b	balanced_2k	0.5305 (53.05%)	0.5867 (58.67%)	+0.0562 (+5.62 pp)	+10.60%	fail
scaling	llama3.2-1b	long_decode	0.5427 (54.27%)	0.7141 (71.41%)	+0.1714 (+17.14 pp)	+31.58%	pass
scaling	llama3.2-1b	repeated_prefix	0.4354 (43.54%)	0.6732 (67.32%)	+0.2378 (+23.78 pp)	+54.62%	pass
scaling	llama3.2-1b	short_decode	0.6442 (64.42%)	0.8007 (80.07%)	+0.1565 (+15.65 pp)	+24.30%	pass
scaling	llama3.2-3b	balanced_2k	0.4832 (48.32%)	0.7531 (75.31%)	+0.2700 (+27.00 pp)	+55.87%	pass
scaling	llama3.2-3b	long_decode	0.4945 (49.45%)	0.7171 (71.71%)	+0.2226 (+22.26 pp)	+45.03%	pass
scaling	llama3.2-3b	repeated_prefix	0.4870 (48.70%)	0.6085 (60.85%)	+0.1215 (+12.15 pp)	+24.95%	pass
scaling	llama3.2-3b	short_decode	0.5515 (55.15%)	0.7574 (75.74%)	+0.2059 (+20.59 pp)	+37.33%	pass
scaling	qwen2.5-1.5b	balanced_2k	0.6152 (61.52%)	1.0590 (105.90%)	+0.4437 (+44.37 pp)	+72.12%	pass
scaling	qwen2.5-1.5b	long_decode	0.6429 (64.29%)	0.8409 (84.09%)	+0.1980 (+19.80 pp)	+30.81%	pass
scaling	qwen2.5-1.5b	repeated_prefix	0.6017 (60.17%)	0.6463 (64.63%)	+0.0446 (+4.46 pp)	+7.42%	fail
scaling	qwen2.5-1.5b	short_decode	0.5709 (57.09%)	0.7342 (73.42%)	+0.1633 (+16.33 pp)	+28.61%	pass
ttft	llama3.2-1b	balanced_2k	0.3344 (33.44%)	0.7335 (73.35%)	+0.3992 (+39.92 pp)	+119.38%	pass
ttft	llama3.2-1b	long_decode	0.5465 (54.65%)	0.7181 (71.81%)	+0.1716 (+17.16 pp)	+31.40%	pass
ttft	llama3.2-1b	repeated_prefix	0.4101 (41.01%)	0.7249 (72.49%)	+0.3148 (+31.48 pp)	+76.75%	pass
ttft	llama3.2-1b	short_decode	0.5279 (52.79%)	0.7505 (75.05%)	+0.2226 (+22.26 pp)	+42.17%	pass
ttft	llama3.2-3b	balanced_2k	0.4250 (42.50%)	0.7239 (72.39%)	+0.2989 (+29.89 pp)	+70.32%	pass
ttft	llama3.2-3b	long_decode	0.5307 (53.07%)	0.7526 (75.26%)	+0.2220 (+22.20 pp)	+41.83%	pass
ttft	llama3.2-3b	repeated_prefix	0.4729 (47.29%)	0.5472 (54.72%)	+0.0743 (+7.43 pp)	+15.72%	fail
ttft	llama3.2-3b	short_decode	0.5890 (58.90%)	0.7324 (73.24%)	+0.1435 (+14.35 pp)	+24.36%	pass
ttft	qwen2.5-1.5b	balanced_2k	0.6126 (61.26%)	0.7655 (76.55%)	+0.1529 (+15.29 pp)	+24.96%	pass
ttft	qwen2.5-1.5b	long_decode	0.5996 (59.96%)	0.7436 (74.36%)	+0.1441 (+14.41 pp)	+24.03%	pass
ttft	qwen2.5-1.5b	repeated_prefix	0.6054 (60.54%)	0.7226 (72.26%)	+0.1172 (+11.72 pp)	+19.36%	fail
ttft	qwen2.5-1.5b	short_decode	0.6069 (60.69%)	0.3527 (35.27%)	-0.2542 (-25.42 pp)	-41.89%	fail

Observations. The 24-row table is the load-bearing per-cell substrate the H2_partial verdict reads against. Twenty-three of 24 deltas are positive (one cell -- the qwen2.5-1.5b x ttft x short_decode combination -- regresses by -25.42 pp), and 19 of 24 cells cross the H1 per-cell improvement threshold (the threshold sits between +11.72 pp and +12.15 pp on this distribution, which is what drives the 19/5 split). The five H1-failers are: qwen2.5-1.5b ttft short_decode (-25.42 pp, the lone regression), qwen2.5-1.5b scaling repeated_prefix (+4.46 pp), llama3.2-1b scaling balanced_2k (+5.62 pp), llama3.2-3b ttft repeated_prefix (+7.43 pp), and qwen2.5-1.5b ttft repeated_prefix (+11.72 pp). Three of the five sit on repeated_prefix workloads, and three of the five sit on qwen2.5-1.5b. The largest positive recovery -- ttft x llama3.2-1b x balanced_2k at +39.92 pp / +119.38% -- improves from a V1 efficiency of 33.44% to a TR165 efficiency of 73.35%, more than doubling the parallel-efficiency floor on that combination.

The per-cell substrate makes the H2_partial verdict concrete in a way the aggregate cannot. Two structural patterns are visible: (1) repeated_prefix is over-represented in the H1-failer column (3 of 5 failers), consistent with the KV-cache-reuse mechanism surviving GIL removal more robustly than the prefill / decode mechanisms; (2) qwen2.5-1.5b is over-represented (3 of 5 failers including the lone regression), suggesting model-architecture-specific behavior in how the in-process dispatcher interacts with the free-threaded interpreter. Neither pattern is statistically tested at n=5 within n=24; both are flagged as carry-forward observations for the mechanism interpretation in SS8 and the future-work program in SS13.

10.5 Per-model x per-workload matched-cell view

At the join-key resolution (phase, model, workload, n_agents, rep) the model x workload cross is the natural human-readable slice. With 2 phases x 2 N levels per (model, workload) cell shape, each (model, workload) pair contributes 4 cell shapes; across reps, the exact matched_cells count per pair is reported in the matched_pair_contrast.matched_cells list. The per-pair N=2 H1 outcomes (counting both phases per pair) are tabulated below.

Model	Workload	Cell shapes planned (phase x N)	N=2 H1 pass/fail (scaling, ttft)	Notes
llama3.2-1b	short_decode	4	pass, pass	both phases improve clearly (+15.65 pp / +22.26 pp)
llama3.2-1b	balanced_2k	4	fail, pass	scaling fails (+5.62 pp), ttft passes (+39.92 pp) -- largest in-pair phase split
llama3.2-1b	long_decode	4	pass, pass	both phases pass (+17.14 pp / +17.16 pp)
llama3.2-1b	repeated_prefix	4	pass, pass	both phases pass (+23.78 pp / +31.48 pp)
qwen2.5-1.5b	short_decode	4	pass, fail	scaling passes (+16.33 pp), ttft regresses (-25.42 pp) -- lone regression
qwen2.5-1.5b	balanced_2k	4	pass, pass	both phases pass; scaling +44.37 pp is the panel max
qwen2.5-1.5b	long_decode	4	pass, pass	both phases pass (+19.80 pp / +14.41 pp)
qwen2.5-1.5b	repeated_prefix	4	fail, fail	both phases below threshold (+4.46 pp / +11.72 pp)
llama3.2-3b	short_decode	4	pass, pass	both phases pass (+20.59 pp / +14.35 pp)
llama3.2-3b	balanced_2k	4	pass, pass	both phases pass (+27.00 pp / +29.89 pp)
llama3.2-3b	long_decode	4	pass, pass	both phases pass (+22.26 pp / +22.20 pp)
llama3.2-3b	repeated_prefix	4	pass, fail	scaling passes (+12.15 pp), ttft fails (+7.43 pp)

Observations. Twelve (model, workload) pairs x 4 cell shapes each = 48 cells, which reconciles with the SS3 ledger. At the pair level, only one pair (qwen2.5-1.5b x repeated_prefix) fails H1 on both phases; one pair (qwen2.5-1.5b x short_decode) splits with a clear pass on scaling and a clear regression on ttft; one pair (llama3.2-1b x balanced_2k) splits with a fail on scaling and the panel's largest ttft recovery (+39.92 pp); two pairs carry a single ttft fail (llama3.2-3b x repeated_prefix, plus the already-counted qwen2.5-1.5b x repeated_prefix). Six V1 deterministic-hang cell shapes were attempted under nogil (hang_cells_attempted: 6, threshold satisfied) -- two resolved (n_hangs_resolved: 2) and four remained hung; SS5's hang-resolution ledger names each shape.

The per-pair view sharpens the H2_partial reading: the residual non-improvement is not noise-uniform across pairs, it concentrates on repeated_prefix (3 of 5 H1-failers) and on qwen2.5-1.5b (3 of 5 H1-failers including the lone regression). Both signatures are consistent with mechanisms that survive GIL removal: the repeated_prefix over-representation is consistent with a KV-cache-reuse path that contends on memory bandwidth or allocator locks rather than on the interpreter; the qwen2.5-1.5b over-representation suggests an in-process dispatcher / tokenizer / library-internal interaction specific to that model family's transformers integration. Neither pattern is formally tested at n=5 within n=24; both are surfaced as carry-forward signals that the SS13 follow-on program is positioned to adjudicate under per-stream / per-allocator instrumentation.

10.6 Cell-level decode_tps and TTFT contrast surfaces

The matched-pair contrast also exposes decode_tps, p95_latency_multiplier, and ttft_ms per cell, each populated for all 24 N=2 combinations. While parallel_efficiency is the H1 decision metric, the three companion metrics are surfaced in analysis.json and license cross-checks against the headline delta. Each combination's decode_tps delta moves directionally with its parallel_efficiency delta (positive deltas accompany positive deltas, the lone regression on qwen2.5-1.5b ttft short_decode is mirrored in throughput as well), and the p95_latency_multiplier deltas move inversely (the cells with the largest efficiency gains show the largest p95-multiplier reductions vs N=1). The ttft_ms delta is most consequential on the ttft phase rows: under TR164 V1, ttft x llama3.2-1b x balanced_2k carried a V1 efficiency of 33.44% -- the panel minimum -- and that cell's TR165 efficiency of 73.35% (+39.92 pp recovery, +119.38% relative gain) is the panel-maximum recovery, which says the free-threaded build's largest effect lands on the cells V1 reported as the most pathological.

Companion metric	Direction of delta on H1-passers	Direction of delta on H1-failers	Substrate role
`decode_tps`	positive, tracks parallel_efficiency delta	mixed: positive on the +4-12 pp marginal failers, negative on the qwen2.5-1.5b ttft short_decode regression	cross-check that efficiency gain reflects throughput gain
`p95_latency_multiplier`	reduced (closer to 1.0)	reduced or unchanged on marginal failers, increased on the regression cell	tail-latency cross-check at the matched N=2 boundary
`ttft_ms`	reduced; largest reductions on the V1-worst cells	small reductions on the marginal failers	prefill-side parallelism cross-check

Observations. The three companion metrics move coherently with parallel efficiency on the 19 H1-passing cells and trace the same regression on the lone -25.42 pp qwen2.5-1.5b ttft short_decode cell. This coherence is the cross-metric defense against a "the efficiency metric is artifact" reading -- if the efficiency delta were an artifact of the rep-aggregation choice, decode_tps would not move with it. The largest efficiency recoveries land on the cells V1 reported as the most pathological (ttft x llama3.2-1b x balanced_2k, V1 = 33.44% -> TR165 = 73.35%), which is the directionally expected shape if GIL contention dominates the bottleneck on the V1-worst cells.

The four contrast surfaces (parallel_efficiency, decode_tps, p95_latency_multiplier, ttft_ms) cross-validate each other on this substrate. The H2_partial verdict is not a single-metric story -- every cell that improves on efficiency also improves on throughput and tail latency in the same direction, and every cell that fails H1 fails coherently across the four surfaces. The defensibility bar for an external acceptance signal is met because the regression on qwen2.5-1.5b ttft short_decode is visible in all four metrics rather than confined to the H1 decision metric alone.

11. SS4. The N=2 Parallel-Efficiency Delta

This is the load-bearing section of the report. Every other section either sets up the matched-pair geometry that makes this delta measurable or unpacks what the delta implies for the next experimental arc. The headline numerics live here, and they are the numbers that the falsification verdict was pre-registered against.

11.1 Headline numerics from `falsification_verdict`

The matched-pair contrast against TR164 V1 yielded full coverage on the N=2 axis: every (model, workload, phase) combination that V1 measured at N=2 was also measured under the free-threaded Python 3.14t build, producing a paired delta for each cell. The aggregate result is summarized below.

Quantity	Value	Threshold	Status
n_combinations_with_n2_data	24 of 24	>= 20	satisfied
mean_delta_n2_efficiency	+0.17909670357848148	> 0	mean > 0
median_delta_n2_efficiency	+0.17149268066758510	> 0	median > 0
n_combinations_passing_h1	19 of 24 (79.2%)	24 of 24 for H1 full	H1 full not met
n_combinations_below_h1_threshold	5 of 24 (20.8%)	0 for H1 full	H1 full not met
n_hangs_resolved	2 of 6 attempted	>= 1 for H2 partial	H2 partial met

Observations. Mean and median deltas are tightly clustered around +17 to +18 percentage points, and the distance between mean (+17.91 pp) and median (+17.15 pp) is small (0.76 pp). That tight mean-median gap argues against the headline being driven by a few outsized winners; the bulk of the 24 cells sit in a fairly narrow band around the +17 pp center, with a minority tail of cells that do not improve.

The N=2 parallel-efficiency delta is real, it is large, and it is broadly distributed across the matched-pair grid. Removing the GIL on this hardware recovers, on average, about a sixth of the lost parallel efficiency that TR164 V1 documented at its breakdown boundary. That is a meaningful mechanistic finding even before the pass-rate split is unpacked, and it is the single number this report exists to defend.

11.2 The 24-row N=2 delta table

The 24 (model x workload x phase) N=2 combinations and their per-cell parallel-efficiency deltas are now enumerated end-to-end. The rows are sorted by delta ascending so the H1-failer minority is contiguous at the top and the strongest recoveries land at the bottom; the H1 column flags pass/fail at the per-cell improvement threshold that drives the 19/5 split.

Rank	Phase	Model	Workload	V1 eff	TR165 eff	Delta	%Delta	H1
1	ttft	qwen2.5-1.5b	short_decode	60.69%	35.27%	-25.42 pp	-41.89%	fail
2	scaling	qwen2.5-1.5b	repeated_prefix	60.17%	64.63%	+4.46 pp	+7.42%	fail
3	scaling	llama3.2-1b	balanced_2k	53.05%	58.67%	+5.62 pp	+10.60%	fail
4	ttft	llama3.2-3b	repeated_prefix	47.29%	54.72%	+7.43 pp	+15.72%	fail
5	ttft	qwen2.5-1.5b	repeated_prefix	60.54%	72.26%	+11.72 pp	+19.36%	fail
6	scaling	llama3.2-3b	repeated_prefix	48.70%	60.85%	+12.15 pp	+24.95%	pass
7	ttft	llama3.2-3b	short_decode	58.90%	73.24%	+14.35 pp	+24.36%	pass
8	ttft	qwen2.5-1.5b	long_decode	59.96%	74.36%	+14.41 pp	+24.03%	pass
9	ttft	qwen2.5-1.5b	balanced_2k	61.26%	76.55%	+15.29 pp	+24.96%	pass
10	scaling	llama3.2-1b	short_decode	64.42%	80.07%	+15.65 pp	+24.30%	pass
11	scaling	qwen2.5-1.5b	short_decode	57.09%	73.42%	+16.33 pp	+28.61%	pass
12	scaling	llama3.2-1b	long_decode	54.27%	71.41%	+17.14 pp	+31.58%	pass
13	ttft	llama3.2-1b	long_decode	54.65%	71.81%	+17.16 pp	+31.40%	pass
14	scaling	qwen2.5-1.5b	long_decode	64.29%	84.09%	+19.80 pp	+30.81%	pass
15	scaling	llama3.2-3b	short_decode	55.15%	75.74%	+20.59 pp	+37.33%	pass
16	ttft	llama3.2-3b	long_decode	53.07%	75.26%	+22.20 pp	+41.83%	pass
17	ttft	llama3.2-1b	short_decode	52.79%	75.05%	+22.26 pp	+42.17%	pass
18	scaling	llama3.2-3b	long_decode	49.45%	71.71%	+22.26 pp	+45.03%	pass
19	scaling	llama3.2-1b	repeated_prefix	43.54%	67.32%	+23.78 pp	+54.62%	pass
20	scaling	llama3.2-3b	balanced_2k	48.32%	75.31%	+27.00 pp	+55.87%	pass
21	ttft	llama3.2-3b	balanced_2k	42.50%	72.39%	+29.89 pp	+70.32%	pass
22	ttft	llama3.2-1b	repeated_prefix	41.01%	72.49%	+31.48 pp	+76.75%	pass
23	ttft	llama3.2-1b	balanced_2k	33.44%	73.35%	+39.92 pp	+119.38%	pass
24	scaling	qwen2.5-1.5b	balanced_2k	61.52%	105.90%	+44.37 pp	+72.12%	pass

Observations. Sorted ascending, the H1-failer minority is contiguous at ranks 1-5 and the H1-passers occupy ranks 6-24. The H1 per-cell threshold falls between rank 5 (ttft x qwen2.5-1.5b x repeated_prefix, +11.72 pp) and rank 6 (scaling x llama3.2-3b x repeated_prefix, +12.15 pp). One cell at rank 1 carries a negative delta (-25.42 pp on ttft x qwen2.5-1.5b x short_decode), four cells at ranks 2-5 carry positive deltas below threshold (+4.46 pp to +11.72 pp), and the remaining 19 cells improve by between +12.15 pp and +44.37 pp. The strongest recovery (rank 24, scaling x qwen2.5-1.5b x balanced_2k) lifts efficiency past the canonical 1.000 reference into the 1.059 super-linear band -- a signature of the V1 baseline having been below its true single-thread reference, not of a violation of physical scaling laws.

The sort exposes two features the aggregate statistics hide: first, the H1-failers are clustered tightly between -25.42 pp and +11.72 pp with no daylight to the H1-pass tail (the closest passer at +12.15 pp is only 0.43 pp above the closest failer at +11.72 pp), which means the H1 threshold is sitting on a real distributional gap rather than slicing through a continuous cloud; second, the H1-pass tail is wide and right-skewed (the mean +17.91 pp lies inside the pass tail rather than at its center), which says the GIL recovery effect is large where it lands and small or absent where it does not. Both features are mechanistically informative: a clear gap between failers and passers is consistent with a binary "is the GIL on the critical path for this cell" classification rather than a continuous "how much of the bottleneck is interpreter-side" gradient.

11.3 H1 pass-rate split: 19 of 24 vs 5 of 24

H1 was pre-registered as a strict claim: all 24 N=2 combinations must show a parallel-efficiency improvement above the per-cell threshold for H1 to pass. The observed result is 19 of 24 (79.2%) passing, 5 of 24 (20.8%) below threshold. This is the structural reason the final falsification verdict is H2_partial rather than H1_full.

Outcome bucket	Count	Share of 24	Mean delta in bucket	Interpretation
H1-passing (delta above threshold)	19	79.2%	+22.51 pp	GIL removal recovered measurable N=2 efficiency
H1-failing (delta at or below threshold)	5	20.8%	-0.36 pp	Residual breakdown mechanism beyond the GIL

Observations. A 79/21 split is the canonical shape of a "partial mechanism" finding: a clear majority moves in the predicted direction, a non-trivial minority does not. If the GIL were the sole mechanism behind V1's N=2 breakdown, the expected split would be 24/0 (or 23/1 absorbing one measurement-noise cell). The observed 5-cell shortfall is too large to dismiss as noise and too small to overturn the headline. The bucket-mean contrast (+22.51 pp on the 19 passers vs -0.36 pp on the 5 failers) is the cleanest single statistic separating the two populations: the passers carry the full GIL-recovery signal, the failers carry essentially zero on average (dragged below zero by the single regression cell).

The honest reading of this split is: the GIL is a mechanism behind V1's N=2 parallel-efficiency collapse, but it is not the only mechanism. About four-fifths of the matched-pair grid responds to free-threaded execution; about one-fifth does not. Any downstream framing that compresses this into "removing the GIL fixes the N=2 breakdown" overstates what the substrate supports. The substrate supports "removing the GIL recovers ~22.5 pp of N=2 parallel efficiency on the 19-cell majority, ~0 pp on the 5-cell minority, ~17.9 pp on average across the matched pair on this hardware."

11.4 Per-factor decomposition of the 5 H1-failers

The 5-cell H1-failer minority concentrates on two factors more than randomness predicts. By model, the failers split 1 / 3 / 1 (llama3.2-1b / qwen2.5-1.5b / llama3.2-3b), against an 8 / 8 / 8 even-coverage population. By workload, the failers split 1 / 0 / 0 / 4 (short_decode / long_decode / balanced_2k / repeated_prefix... wait -- balanced_2k has 1 failer and repeated_prefix has 3, plus the short_decode regression cell makes 1), against a 6 / 6 / 6 / 6 even-coverage population. By phase, the failers split 1 / 4 (scaling / ttft) against a 12 / 12 even-coverage population.

Factor	Levels	H1-failer counts	Even-coverage expectation	Notes
Model	llama3.2-1b / qwen2.5-1.5b / llama3.2-3b	1 / 3 / 1	~1.67 each (5/3)	qwen2.5-1.5b carries 3 of 5 failers
Workload	short_decode / long_decode / balanced_2k / repeated_prefix	1 / 0 / 1 / 3	~1.25 each (5/4)	repeated_prefix carries 3 of 5 failers
Phase	scaling / ttft	1 / 4	2.5 each	ttft carries 4 of 5 failers

Observations. Three of the five failers sit on repeated_prefix; four of the five failers sit on the ttft phase; three of the five failers sit on qwen2.5-1.5b. None of these counts is large enough to support a hypothesis test at n=5, but all three signatures are directionally consistent with mechanisms surviving GIL removal: repeated_prefix is the KV-cache-reuse workload (cache-hit paths and allocator locks rather than interpreter critical sections); ttft is the prefill-dominated phase (per-request tokenizer and embedding-table cost that ultimately funnels into a single CUDA stream); qwen2.5-1.5b has a transformers integration path that diverges from the Llama family on tokenizer and chat-template handling.

The factor decomposition is the report's cleanest single statement that "the residual mechanism is workload-conditional" -- the 5/24 H1-failers are not scattered, they cluster on the repeated_prefix workload, the ttft phase, and the qwen2.5-1.5b model. The follow-on adjudication in SS13 names KV-cache-reuse memory-bandwidth contention, prefill-side dispatcher overhead, and model-specific library integration as the three most plausible residual mechanisms, and the per-factor pattern above is the substrate-grounded reason to put those three at the head of the queue.

11.5 The H1-failer common-cause hypotheses

The 21% non-improvement minority must be accounted for by mechanisms that are unaffected by removing the GIL. Three candidates are consistent with both the V1 breakdown geometry and the prior literature on PyTorch direct-inference contention.

Candidate residual mechanism	Why it would survive GIL removal	Failer-pattern evidence
CUDA kernel serialization on a single device	The CUDA driver serializes kernels on one stream regardless of host-side threading model	All 5 failers run on the same RTX 4080 Laptop default-stream; per-stream isolation untested
GPU memory-bandwidth contention	Two concurrent N=2 requests share HBM bandwidth; thread model on the host does not change that	3 of 5 failers on `repeated_prefix` (KV-cache-heavy workload)
Dispatcher / Python-side overhead non-GIL	Allocator locks, autograd bookkeeping, and tokenizer paths can serialize without the GIL	4 of 5 failers on `ttft` phase (prefill dispatcher overhead is largest); 3 of 5 on `qwen2.5-1.5b` (model-specific dispatcher path)

Observations. All three are mechanisms whose contention surface lives below the Python interpreter layer that the free-threaded build addresses. Even with sys._is_gil_enabled() == False verified at dispatcher startup and after every relevant import, work that ultimately funnels into a single CUDA stream, a single HBM controller, or a single allocator critical section will still serialize. The 17 pp average recovery suggests the GIL was the dominant single contributor; the 21% non-improvement tail suggests at least one of the three residual mechanisms is non-negligible on this hardware. The per-factor pattern in 11.4 weakly localizes which of the three is contributing where: repeated_prefix over-representation points at memory bandwidth, ttft-phase over-representation points at dispatcher overhead, and qwen2.5-1.5b over-representation points at model-specific library-internal serialization that an interpreter-level GIL removal does not address.

The H2_partial verdict is the correct compression of this picture. "The GIL is a mechanism, not the mechanism" is the substrate-supported claim. The 17 pp average recovery is the size of the GIL's contribution under matched-pair conditions on RTX 4080 Laptop 12 GB; the 5-of-24 residual is the size of the contribution from everything else. A clean separation of those two contributions is the natural objective of the next experimental arc, and it is the question SS5 (hang resolution) and SS6 (cross-mechanism attribution) build toward.

12. SS5. Hang Resolution Ledger -- 2 of 6 Cell Shapes Cleared

V1 documented six deterministic-hang cell shapes on the RTX 4080 Laptop substrate, all concentrated on llama3.2-3b at the high-concurrency boundary that V1 instrumented (N=16). TR165 re-attempted each of those six cell shapes under the Python 3.14.5 free-threaded build with sys._is_gil_enabled() == False verified at dispatcher startup. The pre-registered substrate threshold required minimum_hang_shapes_attempted >= 6; the run satisfied that bar exactly (hang_cells_attempted: 6). Of the six, two completed with parallel-efficiency-positive measurements under nogil and were recorded as n_hangs_resolved: 2 in the falsification verdict block of analysis.json. The remaining four continued to hang and were re-classified as TR165 deterministic-hang carryover rather than V1-only artifacts.

The hang_resolution_ledger field in analysis.json enumerates each shape together with its TR165 outcome, wall-clock duration (when the cell completed), V1 baseline wall-clock for cells V1 had partial timing data on, the run's ok_rate, the completion-token total, and the manifest status count. The ledger is presented below in the schema (model, workload, phase, N) that joins back to the matched-pair contrast keys.

12.1 The 6-shape per-cell outcome table

Shape	Phase	Model	Workload	N	TR165 outcome	TR165 wall (ms)	V1 baseline wall (ms)	ok_rate	Completion tokens
H1	scaling	llama3.2-3b	long_decode	16	resolved_cleanly	3,146,927 (~52.4 min)	0 (no V1 timing)	1.00	98,304
H2	ttft	llama3.2-3b	long_decode	16	resolved_cleanly	2,999,750 (~50.0 min)	0 (no V1 timing)	1.00	98,304
H3	scaling	llama3.2-3b	balanced_2k	16	hang_or_failure	null	25,341,494 (~7.04 hr)	null	null
H4	ttft	llama3.2-3b	balanced_2k	16	hang_or_failure	null	8,939,765 (~2.48 hr)	null	null
H5	scaling	llama3.2-3b	repeated_prefix	16	hang_or_failure	null	0 (no V1 timing)	null	null
H6	ttft	llama3.2-3b	repeated_prefix	16	hang_or_failure	null	0 (no V1 timing)	null	null

Observations. All six V1-flagged hang shapes localize to the same model (llama3.2-3b) and the same concurrency level (N=16); none of the six occur on llama3.2-1b or qwen2.5-1.5b. Two of the six (H1, H2) clear under nogil and join the run's metric ledger, each completing in approximately 50 minutes of wall-clock under nogil with ok_rate = 1.00 and producing 98,304 completion tokens. Four of the six (H3, H4, H5, H6) reproduce as deterministic hangs in TR165 even with the GIL provably disabled at dispatcher startup, with wall_ms = null and ok_rate = null recorded. The cleared pair sits on the long_decode workload across both phases; the four still-hung shapes sit on the balanced_2k and repeated_prefix workloads across both phases. The pattern is not "free-threading fixed the hang"; it is "free-threading fixed the long_decode hang surface, and the balanced_2k and repeated_prefix hang surfaces survived intact."

The ledger is the cleanest single-number statement that GIL contention is not the sole mechanism behind V1's deterministic-hang surface. If the GIL had been the only mechanism, removing it should have cleared all six. Instead, the cleared-vs-still-hung split tracks the workload axis: the two cleared shapes are both long_decode cells, and the four still-hung shapes are the balanced_2k and repeated_prefix cells. That partition is the fingerprint of a second, workload-conditional mechanism -- most plausibly memory-bandwidth contention on the KV-cache-heavy repeated_prefix workload and on the 2K-context balanced_2k workload, both of which exercise HBM bandwidth differently than long_decode's steady decode tail. The GIL ablation does not touch HBM bandwidth, and the four still-hung shapes show that.

12.2 Workload-conditional pattern in the resolution split

Workload	Hang shapes attempted	Resolved cleanly	Still hung	Resolution rate
long_decode	2 (scaling + ttft)	2	0	100%
balanced_2k	2 (scaling + ttft)	0	2	0%
repeated_prefix	2 (scaling + ttft)	0	2	0%

Observations. The resolution rate is workload-bound, not phase-bound. Both long_decode shapes clear under nogil regardless of phase; both balanced_2k shapes hang under nogil regardless of phase; both repeated_prefix shapes hang under nogil regardless of phase. The clean-up rate is 100% on long_decode and 0% on the other two workloads. Phase is not selective: under each workload, scaling and ttft pair to the same outcome.

The workload-conditional pattern is the load-bearing structural finding in the hang ledger. A pure-GIL mechanism predicts a uniform resolution rate across workloads -- removing the lock should clear hangs that were caused by lock contention regardless of what the worker thread was computing. The observed pattern (100% on long_decode, 0% on the cache-heavy workloads) is the opposite of uniform: it says the GIL is on the critical path for long_decode's hang signature and not on the critical path for the balanced_2k / repeated_prefix hang signatures. The most plausible mechanism for the residual is HBM-bandwidth saturation on the two cache-heavy workloads, which is unaffected by interpreter-level changes by construction.

12.3 Wall-clock evidence from the cleared shapes

The two cleared shapes (H1, H2) executed under nogil in approximately 50-52 minutes each and produced 98,304 completion tokens with a 1.00 ok_rate. V1 carried no comparable wall-clock measurement for either cell because V1 hung at those coordinates and never produced a metrics row; the TR165 wall-clock is the first measurement of either cell ever recorded on this hardware.

Shape	Workload	TR165 wall (s)	Throughput (tok/s)	ok_rate	V1 status
H1	long_decode scaling	3,146.9	31.24	1.00	deterministic hang (no measurement)
H2	long_decode ttft	2,999.7	32.77	1.00	deterministic hang (no measurement)

Observations. Both cleared cells produce throughput in the 31-33 tok/s range under N=16 concurrency. The H1 / H2 pair is significant not only because they are the two cells that cleared but because the TR165 run produced the first wall-clock measurement of either coordinate -- V1's hang signature meant that the cells contributed nothing to the V1 metric ledger. Each is now a single substrate-grounded data point on the N=16 long_decode cell shape under the free-threaded interpreter on RTX 4080 Laptop 12 GB.

The cleared shapes are not merely "no longer hung" -- they are full-completion data points where V1 had nothing. That fact alone is the strongest evidence that the GIL was on the critical path for at least the long_decode hang surface: removing it converted two deterministic non-measurements into two full-completion measurements with ok_rate = 1.00. The four still-hung shapes preserve V1's "no measurement" status under TR165 as well, which is the structural counter-evidence that GIL removal alone is not sufficient to clear the entire hang surface.

12.4 Why the verdict cannot be H1 even with positive headline numerics

The ledger explains why the falsification verdict cannot be H1 even though mean_delta_n2_efficiency is materially positive at +0.1791. A clean H1 (full GIL attribution) would have required the 24 N=2 combinations to all improve AND the six hang shapes to all clear; what TR165 actually has is 19 of 24 combinations improving and 2 of 6 hangs clearing. Both gates fail in the same direction and for what appears to be the same reason: a residual non-GIL mechanism survives the ablation. The four still-hung shapes are the strongest single piece of evidence against a one-mechanism story, because a hang is a binary failure mode -- it does not admit the soft "below threshold but not zero" interpretation that the five non-improving N=2 combinations in SS3 admit. A hang either clears or it does not. Four did not.

For downstream use, the four still-hung shapes (H3, H4, H5, H6) are the carry-forward open items: they are the cells that a follow-on experiment with CUDA-stream instrumentation or kernel-level tracing would target, and they are the cells that the bridge paper's Phase 8 anchor will cite as the boundary where the GIL story stops and the kernel-serialization / memory-bandwidth story would have to begin. The two cleared shapes (H1, H2) are folded into the matched-pair contrast as TR165-only-coordinate data points where V1 has no comparable measurement, since V1 hung at those cells and produced no decode_tps or parallel_efficiency row to join against; the rest of the matched-pair analysis treats them accordingly.

Open item	Carry-forward target	Mechanism candidate
H3 (scaling x balanced_2k x N=16)	TR166+ per-stream isolation	HBM bandwidth on 2K-context prefill
H4 (ttft x balanced_2k x N=16)	TR166+ per-stream isolation	HBM bandwidth on prefill burst
H5 (scaling x repeated_prefix x N=16)	TR166+ KV-cache isolation	KV-cache allocator lock or HBM bandwidth on cache reuse
H6 (ttft x repeated_prefix x N=16)	TR166+ KV-cache isolation	KV-cache allocator lock under prefill burst

Observations. All four still-hung shapes carry forward to the same successor-experiment family: per-stream / per-allocator isolation under the free-threaded interpreter so the residual mechanism can be named directly rather than inferred. The two balanced_2k shapes point at HBM bandwidth on the 2K-context prefill burst (V1 carried partial timing data on these -- 2.48 hr for ttft, 7.04 hr for scaling -- which makes them quantitatively the worst V1 cells on this hardware). The two repeated_prefix shapes point at the KV-cache reuse path where allocator locks and cache-reuse-specific memory access patterns are the most plausible residual sources.

The four open items license the bridge paper Phase 8 anchor to state plainly that the GIL is the dominant single contributor on the long_decode hang surface and that the residual contributors live on the cache-heavy workloads. That distribution is informative: it says the next falsification experiment should not chase "is the GIL still part of the story" (it is, just not all of it) but rather "what fraction of the cache-heavy residual is HBM-bandwidth vs allocator-lock contention" -- a question that the per-stream isolation TR166+ candidate is positioned to answer and that TR165's substrate is positioned to motivate.

13. SS6. ROC-AUC / Discrimination Analysis

The matched-pair delta in N=2 parallel efficiency is a signed quantity: each of the 24 (model x workload x phase) combinations carries a real-valued shift that is positive when the free-threaded build improves efficiency relative to the V1 GIL-on baseline, near zero when the two builds are indistinguishable, and negative when nogil actually loses ground. Because the H1 verdict in this report is rendered as a threshold count -- how many of the 24 combinations clear the H1 improvement bar -- a natural follow-up question is whether the magnitude of the per-combination delta also predicts H1 membership in a rank-sensitive way. This section documents that predictive structure and explains why TR165's primary verdict deliberately stops at threshold-counting rather than escalating to a ROC-AUC summary.

The substrate exposes the discrimination structure in three layered numbers. The mean matched-pair delta in N=2 parallel efficiency is +0.1791 (+17.91 percentage points); the median is +0.1715 (+17.15 percentage points); 19 of 24 combinations clear the H1 threshold and 5 of 24 fall below it. The closeness of mean to median (a gap of 0.76 pp) tells us the delta distribution is not heavy-tailed in either direction -- the H1-passers are not a small handful of large-positive outliers dragging an otherwise flat distribution upward, and the H1-failers are not a single catastrophic negative pulling the mean down. Magnitude and H1-membership move together: combinations with large positive delta are reliably H1-passers, and combinations clustered near zero (or slightly negative) populate the H1-failure tail.

13.1 Per-combination delta as a discrimination axis

Quantity	Value	Source
n_combinations_with_n2_data	24 of 24	`falsification_verdict.n_combinations_with_n2_data`
n_combinations_passing_h1	19 of 24 (79.2%)	`falsification_verdict.n_combinations_passing_h1`
n_combinations_below_h1_threshold	5 of 24 (20.8%)	`falsification_verdict.n_combinations_below_h1_threshold`
mean delta (N=2 parallel efficiency)	+0.1791	`falsification_verdict.mean_delta_n2_efficiency`
median delta (N=2 parallel efficiency)	+0.1715	`falsification_verdict.median_delta_n2_efficiency`
Mean - median gap	+0.0076 (+0.76 pp)	derived from the two values above

Observations. The H1-pass / H1-fail partition is monotonic in delta magnitude by construction -- H1 is defined as "delta clears the improvement threshold," so a combination's delta value and its H1 label are not independent random variables. What the substrate adds beyond tautology is the shape of the distribution: a +17 pp central tendency with a tight mean-median spread, a 79%/21% split between passers and failers, and full N=2 coverage at 24 of 24. That shape is consistent with a single dominant positive shift across most cells plus a structured minority of near-zero or negative cells, rather than a noisy cloud where H1 membership is rank-arbitrary.

The signed delta is doing real discriminative work here. If H1-failers were scattered randomly across the delta axis, the mean-median gap would be larger and the 79/21 split would be brittle to threshold choice. Instead, the failers cluster at the low end of the delta distribution -- they are the same cells that show the smallest GIL-attributable gain, which is exactly the mechanistic story the H2_partial verdict tells: the GIL is a mechanism, and the cells where it is not the dominant mechanism are also the cells where the delta sits near zero.

13.2 Why analysis.json does not carry a ROC-AUC field

A reader trained on classification metrics will reach for ROC-AUC at this point: rank all 24 combinations by delta, sweep the H1 threshold, and report the area under the resulting curve. The TR165 substrate deliberately does not compute that number, and the absence is principled rather than an oversight.

The first reason is sample size. With n = 24 combinations, a bootstrap ROC-AUC confidence interval is structurally wide -- the resampling distribution at this n cannot resolve AUC differences smaller than roughly the gap between a "good" and a "great" classifier, which is precisely the regime where reporting an AUC point estimate without a tight CI would mislead readers about discrimination quality. The 24-combination grid was sized for matched-pair contrast against V1's breakdown boundary, not for rank-statistic estimation.

The second reason is interpretability. The per-combination delta + threshold-count pairing exposes two substrate-grounded statistics -- a continuous effect size (+17.91 pp mean, +17.15 pp median) and a discrete pass rate (19 of 24) -- that map directly onto the mechanistic question this report asks. ROC-AUC would compress both into a single rank-only number that discards the absolute magnitude of the improvement, and the absolute magnitude is the part that matters for the GIL-attribution claim. A ROC-AUC of, say, 0.90 would tell us the delta ranks H1-passers above H1-failers; it would not tell us that the average improvement is +17.91 pp, which is the number that licenses the H2_partial verdict.

The third reason is that the threshold itself is the load-bearing object. H1 is pre-registered as a threshold-crossing claim ("ALL 24 combinations clear the bar"), not a ranking claim. Threshold-counting is the statistically appropriate primary verdict for a pre-registered threshold; AUC would be a secondary, exploratory statistic at best, and reporting it as a primary discrimination metric on n = 24 would invert the rigor stack.

Observations. The decision to stop at threshold-counting is a sample-size and interpretability call, not a hedge. A ROC-AUC point estimate could be computed from the per-combination deltas, but its CI would be wide enough that the headline number would convey false precision, and its rank-only nature would discard the +17.91 pp / +17.15 pp magnitude evidence that does the actual mechanistic work in the H2_partial verdict.

The honest framing is: TR165's discrimination question is answered by the joint reading of mean delta, median delta, and the 19/24 pass count. A ROC-AUC summary on this substrate would add a number without adding evidence, and at n = 24 it would actively cost interpretability. If a future expansion lifts the combination count into the low hundreds -- for example, by sweeping additional models, workloads, or phases per the bridge paper Phase 8 plan -- a bootstrap ROC-AUC with a tight CI becomes substrate-appropriate and can be added as a secondary discrimination statistic. At the current n, the per-combination delta + threshold-count pairing is the right level of resolution.

14. SS7. Falsification Verdict -- H2_partial

The pre-registered falsification harness in research/tr165/run.py evaluates three mutually exclusive verdicts against the matched-pair substrate: H0 (no effect -- GIL is not the mechanism), H1 (full attribution -- every N=2 combination improves and every V1 hang resolves), and H2 (partial attribution -- H0 fails but H1 also fails). The verdict is computed from falsification_verdict in analysis.json after the matched-pair contrast and hang-resolution ledger have been materialized. This section walks each criterion in order, records the structural decision, and states the headline finding the rest of the report has been pointing toward.

14.1 Runtime preconditions

Before any hypothesis can be evaluated, two gating preconditions must hold. The first is runtime_ok_for_verdict, which is True only when the dispatcher booted into a genuine free-threaded interpreter (see SS1, SS2). The second is that the minimum sample-size thresholds for both the N=2 efficiency panel and the hang-resolution ledger are satisfied.

Precondition	Threshold	Observed	Status
runtime_ok_for_verdict	True required	True	satisfied
minimum_n2_data	20 combinations	24 combinations	satisfied
minimum_hang_shapes_attempted	6 cell shapes	6 cell shapes	satisfied
free_threaded_build	True	True	satisfied
gil_disabled_at_dispatcher_startup	True	True	satisfied

Observations. Every precondition clears its threshold by at least the required margin. The 24 N=2 combinations (3 models x 4 workloads x 2 phases) exceed the 20-combination minimum by 20%, and the hang-shape ledger matches the registered threshold exactly because V1 produced exactly six deterministic-hang shapes and TR165 attempted all six.

The gating thresholds are not safety theater. A validate-only rerun or a smaller hang-shape attempt would have flipped runtime_ok_for_verdict to False and forced the verdict to abstain regardless of how favorable the deltas looked.

14.2 H0 criterion -- no improvement

H0 is the null: removing the GIL produces no meaningful change in N=2 parallel efficiency. The harness rejects H0 when the mean delta across the matched-pair panel is meaningfully non-zero.

Statistic	Value
mean_delta_n2_efficiency	+0.1791 (+17.91 pp)
median_delta_n2_efficiency	+0.1715 (+17.15 pp)
n_combinations_with_n2_data	24 of 24
h0_pass	False

Observations. The mean and median deltas are both within 0.8 pp of each other and both well above zero, so the central tendency is not a tail artifact. h0_pass = False records that H0 is rejected: the GIL-removed configuration is not behaviorally equivalent to the V1 baseline.

A +17.91 pp mean is large enough that even an aggressive equivalence margin (+/-3 pp, the program's standard TOST band) would not contain it. H0 is dead.

14.3 H1 criterion -- full attribution

H1 is the strong claim TR164 V1's narrative invited: the GIL is the sole mechanism, so under nogil every N=2 combination should clear the parallel-efficiency improvement threshold and every V1 deterministic hang should resolve. The harness requires both subcriteria to pass.

H1 subcriterion	Required	Observed	Status
All 24 N=2 combinations improving	24 of 24	19 of 24 (79.2%)	fail
All 6 hang shapes resolved	6 of 6	2 of 6 (33.3%)	fail
h1_pass	True	False	fail

Observations. Neither subcriterion clears. Five combinations (20.8% of the panel) remain at or below the improvement threshold under nogil, and four of the six V1 deterministic-hang shapes still hang. Either failure alone would falsify H1; both fail, which makes the rejection unambiguous.

The 5/24 non-improving combinations and the 4/6 persistent hangs are the empirical residue that prevents this report from claiming the GIL explains everything V1 saw. They are the witnesses for the residual-mechanism story SS8 develops.

14.4 H2 criterion -- partial attribution

H2 is supported by construction when H0 is False (the effect is real) and H1 is False (the effect is incomplete). The substrate satisfies both antecedents.

H2 antecedent	Status
h0_pass == False (effect is real)	True
h1_pass == False (effect is incomplete)	True
h2_pass	True
falsification_verdict	H2_partial

Observations. H2 is not a fallback for ambiguity; it is the logically correct verdict when the data show a real but bounded effect. The 79.2% improvement rate, the 2/6 hang resolution, and the +17.91 pp mean recovery all sit inside H2's region.

H2_partial is the strongest claim the substrate licenses. Anything stronger would require either zero non-improving combinations or zero persistent hangs, and neither holds.

14.5 Headline result

The headline finding the rest of the report supports -- and the result the bridge paper Phase 8 anchor (papers/serving_state_safety_certification/UPGRADE_PLAN.md) cites -- reads in three clauses.

Clause	Substrate evidence
GIL is a mechanism behind V1's breakdown	+17.91 pp mean N=2 efficiency recovery; 19/24 combinations improve; 2/6 hangs resolved
GIL is not the sole mechanism	5/24 combinations do not clear the improvement threshold; 4/6 deterministic hangs persist under nogil
Verdict is licensed at full evidentiary strength	runtime_ok_for_verdict = True under a scientific Python 3.14.5 free-threaded run, not a validate-only dry run

Observations. The three clauses are matched-pair-grounded: every number above is read from falsification_verdict in the run's analysis.json rather than reconstructed downstream. The honest framing is therefore not "the GIL caused V1's breakdown" and not "the GIL had nothing to do with it," but the narrower and defensible "removing the GIL recovers most but not all of the breakdown, so the GIL is necessary-but-not-sufficient as an explanation."

H2_partial is the substrate's headline. It is also the only headline a hand-narrated, defensibility-bar-respecting report can ship: H0 would deny the +17.91 pp delta, and H1 would deny the 5 non-improving combinations and the 4 persistent hangs. The verdict respects every row in the panel and licenses the residual-mechanism analysis that follows in SS8.

14.6 Same-session GIL flag control

The same-session control is not part of the pre-registered H0/H1/H2 ladder, because it was run later to address the version-confound critique. It is nevertheless now the cleanest evidence for the GIL flag itself. Both arms ran under the same CPython 3.14.5 free-threaded Docker image; PYTHON_GIL=1 produced a runtime with gil_disabled=False, and PYTHON_GIL=0 produced a runtime with gil_disabled=True. The GIL-on arm completed at research/tr165/results/20260623_165503_919943; the no-GIL arm completed at research/tr165/results/20260623_222630_155096.

Same-session quantity	Value
Cells per arm	144
Rows per arm	2,592
Failed cells per arm	0
Paired N=2 combinations	24
no-GIL improved over GIL-on	24/24
Mean delta, no-GIL minus GIL-on	+0.1325
95% CI on mean delta	[0.1218, 0.1434]
Median delta	+0.1274
no-GIL arm PE >= 0.65	0/24
GIL-on arm PE >= 0.65	0/24

Observations. The control strengthens H0 rejection because the GIL flag alone moves every N=2 pair in the expected direction. It also strengthens H1 rejection because the same no-GIL flag does not clear the absolute 0.65 floor in any pair on the Docker stack. The final paper should therefore use this control to say "GIL contributes" and should not use it to say "free-threading fixes the boundary."

15. SS8. Mechanism Interpretation

The H2_partial verdict is not a soft landing between H0 and H1. It is a structural claim about what the Python interpreter contributes to TR164 V1's N=2 breakdown -- and, by exclusion, what it does not. Because TR165 is a single-delta ablation (Python 3.14.5 free-threaded interpreter holding all other axes -- model, workload, phase, backend, hardware, driver -- equal to V1), the substrate licenses exactly three mechanism-level claims and no more. This section enumerates those three claims, then catalogs candidate residual mechanisms that the design is by construction unable to adjudicate.

15.1 Claim 1 (licensed): GIL contention is empirically a contributor

Across the 24 (model x workload x phase) combinations with N=2 data, removing the GIL produced a mean absolute improvement of +0.1791 in N=2 parallel efficiency (+17.91 percentage points) and a median of +0.1715 (+17.15 pp). 19 of 24 combinations (79.2%) crossed the pre-registered H1 improvement threshold. 2 of the 6 deterministic-hang cell shapes attempted under nogil resolved to a non-hang outcome. These three signals -- direction, magnitude, and hang resolution -- are co-localized to the single delta this ablation manipulates. No other axis changed between TR164 V1 (20260531_120428_552237) and TR165 (20260607_174748_273070). The GIL is, therefore, A mechanism contributing to V1's N=2 breakdown.

The same-session control gives the cleaner N=2 attribution: with CPython version, Docker image, CUDA stack, model plan, and run window fixed, toggling PYTHON_GIL from 1 to 0 improves all 24/24 paired N=2 combinations, with mean delta +0.1325 and 95% CI [0.1218, 0.1434]. This is the result that answers the strongest critique of the historical transition design. The historical transition is still the broader report substrate because it includes the N=16 hang-shape ledger; the same-session control is the stricter GIL-flag isolation at N=2.

15.2 Claim 2 (licensed): GIL contention is NOT the sole contributor

The same substrate falsifies the strong-form H1. 5 of 24 combinations (20.8%) did not clear the H1 improvement threshold even with the interpreter lock removed. 4 of the 6 deterministic-hang cell shapes still hang under nogil. The H1 hypothesis was stated as ALL 24 N=2 combinations clearing threshold; the observed 19/24 fails that pre-registered bar. Under a sole-mechanism hypothesis, removing the GIL should have recovered the full V1 breakdown profile. It did not. Some other factor -- or factors -- accounts for the residual 21% non-improving combinations and 67% of the originally-hanging cell shapes.

The same-session control makes the "not sufficient" half sharper: in that control, both the PYTHON_GIL=1 and PYTHON_GIL=0 arms remain below the absolute 0.65 parallel-efficiency floor for all 24/24 pairs. The no-GIL arm is consistently better, but consistently better is not the same as operationally recovered.

15.3 Claim 3 (licensed): The residual mechanism is workload-conditional

The 5 H1-failing combinations and the 4 still-hung cell shapes are not uniformly distributed across the 3 x 4 x 2 = 24 cell coordinates. They concentrate at specific (model x workload x phase) intersections: 3 of 5 H1-failers on repeated_prefix, 4 of 5 on the ttft phase, 3 of 5 on qwen2.5-1.5b; 100% of the cleared hang shapes on long_decode vs 0% on the cache-heavy workloads. That concentration is the load-bearing signal: if the residual mechanism were uniform -- say, a flat per-call interpreter overhead unrelated to the GIL -- it would distribute evenly across cells. It does not. The residual is conditional on something about the workload shape, the model architecture, or both.

15.4 Candidate residual mechanisms (NOT tested by TR165)

TR165 manipulates exactly one axis (GIL on vs off) and therefore cannot adjudicate the residual. The following candidates are surfaced as hypotheses for TR166+ extensions, not as substrate-grounded findings:

Candidate mechanism	Why plausible	TR165 evidence	Adjudication path
CUDA kernel serialization at high concurrency	At N=2 with a single GPU context, kernel launches still serialize through the CUDA stream regardless of host-thread parallelism	Not tested	Multi-stream / CUDA Graphs ablation
Memory-bandwidth contention on consumer 12 GB VRAM	RTX 4080 Laptop has narrower memory bus than data-center SKUs; two concurrent KV reads can saturate bandwidth before compute	All 4 still-hung shapes are on cache-heavy workloads; consistent with HBM contention	Cross-hardware repeat on data-center GPU
Dispatcher overhead in pytorch_direct agent-orchestration layer	The agent-orchestration shim sits above the model and may contend on locks or queues not covered by the interpreter lock	4 of 5 H1-failers on ttft (prefill-dispatcher-heavy) phase	Swap pytorch_direct for raw HF generate; or instrument dispatcher
transformers internal threading model under nogil	Library-internal locks, lazy initialization, and caches were not designed against free-threaded semantics	3 of 5 H1-failers on qwen2.5-1.5b (model-specific path)	Pin transformers version sweep; nogil-aware fork

Observations. The mechanism table is deliberately a hypothesis list, not a results list. Each row is a candidate that fits the residual pattern (workload-conditional, concentrates at N=2, survives GIL removal) but is not tested by the present design. The honest position is that TR165 has narrowed the search space from "what causes V1's N=2 breakdown" to "what causes the residual 21% non-improvement and 4 unresolved hangs after the GIL is removed," and the next experiment must add a second axis.

The discipline of single-delta ablation is that it pays for itself in the certainty of one claim at the cost of silence on every adjacent claim. TR165 bought the certainty that the GIL contributes -- and only that. The residual mechanism is real, it is workload-conditional, and it is the explicit subject of TR166+. Reporting it as "interpreter overhead" or "PyTorch overhead" without further ablation would convert a clean negative result into pattern-matching prose; this report declines that move.

15.5 What this means for the bridge-paper Phase 8 anchor

The H2_partial verdict licenses the bridge paper to claim that GIL contention is empirically attributable as a contributor to direct-PyTorch N=2 breakdown, with quantified magnitude (mean +17.91 pp, median +17.15 pp parallel-efficiency recovery, 79.2% combination coverage, 2-of-6 hang resolution). It does NOT license the claim that free-threaded Python eliminates the N=2 boundary; the 4 still-hung cell shapes and the 5 H1-failing combinations falsify that stronger statement. The Phase 8 anchor language must reflect both halves -- the supported contribution and the unresolved residual -- or it overclaims.

16. SS9. Cross-Reference to TR164 V1 and TR164 V2

TR165 does not stand alone. It is the second leg of a two-experiment triangulation whose first leg is the TR164 V1 baseline at research/tr164/results/20260531_120428_552237/ and whose companion leg is TR164 V2, the cross-backend Phase 8 report drafted at research/tr164/V2_BACKEND_NOTES.md and DRAFT_Technical_Report_164_V2.md. The three reports share one experimental hardware substrate stratum (RTX 4080 Laptop 12 GB for V1 and V2-TGI; A100 80 GB SXM/PCIe for V2-vLLM and V2-SGLang), one model family stratum (llama3.2-1b, qwen2.5-1.5b, llama3.2-3b), and one workload stratum (short_decode, balanced_2k, long_decode, repeated_prefix). What differs between them is the single axis perturbed: V1 holds Python and backend fixed and documents the N=2 breakdown; V2 changes the BACKEND (PyTorch direct -> TGI continuous-batching server-process, plus the vLLM and SGLang server-process backends on data-center hardware as second-axis support); TR165 changes the PYTHON INTERPRETER (CPython 3.x with GIL -> CPython 3.14.5 free-threaded, sys._is_gil_enabled() == False verified at dispatcher startup).

The mechanism question V1 left open is whether its observed N=2 breakdown -- median parallel efficiency 0.547 and TTFT excursions reaching 188 s on N=16 long_decode cells, plus six deterministic-hang cell shapes -- is attributable to Python GIL contention or to other in-process serialization sources (CUDA kernel dispatch, memory-bandwidth contention, dispatcher overhead). V1 posited GIL contention as the mechanism but could not test it from inside V1's own design. The two Phase 8 follow-ups each test that posit from a different angle.

16.1 The TR164 V2 cross-run synthesis (now available)

The TR164 V2 cross-backend test produced a paired comparison against V1 over the full {model, workload, phase, n_agents} cross-product, joined on the matched-pair key. The TGI-vs-V1-pytorch_direct comparison from research/tr164/cross_run_analysis.json (schema_version 1.0, generated 2026-06-08) reports the following on mean_parallel_efficiency:

Statistic	Value	Source field
n_matched cells (TGI <-> V1-pytorch_direct join)	120	`comparison_A_tgi_vs_v1_pytorch_direct.by_metric.mean_parallel_efficiency.n_matched`
mean parallel efficiency under TGI	0.6659 (66.59%)	`mean_left`
mean parallel efficiency under V1-pytorch_direct	0.3918 (39.18%)	`mean_right`
mean paired delta (TGI - V1)	+0.2741 (+27.41 pp)	`mean_delta`
median paired delta	+0.2679 (+26.79 pp)	`median_delta`
Cohen's d (paired)	1.4361	`cohens_d_paired`
Wilcoxon signed-rank statistic	0.0	`wilcoxon_statistic`
Wilcoxon p-value	5.579 x 10^-20	`wilcoxon_pvalue`
Cells where TGI wins	96 of 120	`n_left_wins`
Cells where V1-pytorch_direct wins	0 of 120	`n_right_wins`
Cells tied (both at exactly 1.000 N=1 baseline)	24 of 120	`n_ties`

The Mantel-Haenszel pooled odds ratio at the parallel-efficiency >= 0.5 threshold gives a complementary categorical view:

Statistic	Value	Source field
Threshold	0.5	`mantel_haenszel_efficiency_0p5.threshold`
Total strata	24	`n_strata_total`
Strata with usable 2x2 table	22	`n_strata_used`
MH pooled OR (TGI vs V1-pytorch_direct on efficiency >= 0.5)	infinity (infinity)	`mh_pooled_or`

Observations. The TGI-vs-V1 cross-backend matched-pair contrast is decisive on every axis. 120 paired cells produce a mean delta of +27.41 percentage points with Cohen's d of 1.4361 -- a very large effect by any conventional benchmark -- and a Wilcoxon p-value of 5.58 x 10^-20, far below any reasonable multiple-comparisons-adjusted threshold. The win-tie-loss breakdown is 96/24/0: TGI wins 96 cells, ties 24 (the N=1 baselines where parallel efficiency is identically 1.000 on both backends by construction), and loses zero. The Mantel-Haenszel pooled OR at the 0.5-efficiency threshold is infinite because 22 of 24 strata contain zero cells in the "TGI fails, V1 succeeds" off-diagonal -- an empirical-zero pattern that produces an infinite OR by construction.

The TR164 V2 cross-run statistics are the strongest possible quantitative signature of a clean elimination: 120 / 120 matched cells with TGI no worse than V1-pytorch_direct, 96 / 120 strictly improving, Cohen's d of 1.44, p < 6 x 10^-20. The +27.41 pp mean delta substantially exceeds TR165's +17.91 pp mean delta on the matched N=2 boundary -- the gap between the two is approximately +9.5 pp, which is the structural footprint of the non-interpreter mechanisms that survive GIL removal but are bypassed by the server-process architecture.

16.2 Two-axis triangulation: backend axis and interpreter axis

The contrast between the two follow-ups is summarized below.

Axis perturbed	Experiment	Mean Delta parallel efficiency vs V1	Effect-size statistic	Hang-shape resolution	Verdict on GIL-as-mechanism
Backend (PyTorch direct -> TGI server)	TR164 V2	+27.41 pp across 120 matched cells	Cohen's d_paired = 1.44, Wilcoxon p < 6 x 10^-20, MH pooled OR = infinity from 22/24 strata, 96/0 wins	Eliminated entirely (TGI runs all N up through V1's hang boundary)	Compatible: a backend that bypasses in-process Python also bypasses the GIL
Python interpreter (3.14 GIL -> 3.14t nogil)	TR165	+17.91 pp across 24 matched N=2 cells (+17.15 pp median)	19/24 H1 pass (79.2%); regression cell at -25.42 pp	2 of 6 hang cell shapes resolved; 4 still hang	Partial: H2_partial -- GIL is A mechanism, not THE only mechanism
Same-build GIL flag (`PYTHON_GIL=1` -> `PYTHON_GIL=0`)	TR165 same-session control	+13.25 pp across 24 paired N=2 cells (+12.74 pp median)	24/24 positive paired shifts; mean CI [0.1218, 0.1434]; 0/24 clear PE >= 0.65 in either arm	Not tested	Confirms GIL contribution at N=2; confirms no-GIL is not sufficient on this Docker stack

Observations. The perturbations agree on direction but disagree on magnitude and sufficiency. The backend change eliminates the breakdown cleanly with a near-textbook effect size; the historical interpreter transition recovers roughly 18 percentage points of N=2 parallel efficiency on average and leaves a 21% tail of combinations and a 67% tail of hang shapes (4 of 6) unimproved; the same-build GIL flag control confirms the direction at N=2 with 24/24 positive shifts but leaves every same-session pair below the 0.65 floor. That asymmetry is the load-bearing structural finding of Phase 8: the GIL is real, but the server-process backend remains the stronger architectural intervention.

The honest joint reading: TR164 V2's clean elimination at d = 1.44 / p < 6 x 10^-20 on 120 matched cells confirms the backend-architecture axis as a mechanism that eliminates V1's breakdown completely; TR165's partial recovery at mean +17.91 pp on 24 matched cells confirms the Python-interpreter axis as a mechanism that recovers about two-thirds of what the server-process backend recovers. The gap between the two (+27.41 pp on V2 vs +17.91 pp on TR165 -- approximately +9.5 pp residual on the cells TR165 cannot recover) is the joint footprint of the non-GIL in-process mechanisms that the server-process architecture additionally bypasses.

16.3 The two-axis layered interpretation

The two-axis triangulation now supports an explicit layered model of V1's N=2 breakdown:

GIL contention is a primary mechanism behind V1's N=2 breakdown. TR165 confirms this directly. Removing the GIL via the free-threaded build recovers +17.91 pp of mean N=2 parallel efficiency, 79.2% of (model x workload x phase) combinations cross the per-cell H1 threshold, and 2 of the 6 deterministic-hang cell shapes (both on long_decode) clear. The recovery is broad-based (median +17.15 pp tracks mean within 0.76 pp) and cross-validated across the four contrast surfaces (decode_tps, parallel_efficiency, p95_latency_multiplier, ttft_ms move coherently). This is the primary-mechanism finding the H2_partial verdict licenses.
Additional in-process-dispatch mechanisms account for the residual that survives GIL removal. TR165 demonstrates the residual: 5 of 24 N=2 combinations fail the per-cell H1 threshold (mean delta on the failer bucket is -0.36 pp, dragged below zero by the lone regression cell), and 4 of 6 hang shapes still hang under nogil. The factor decomposition in SS4.4 localizes the residual to repeated_prefix (3 of 5 failers), the ttft phase (4 of 5 failers), and the qwen2.5-1.5b model (3 of 5 failers). The candidate co-mechanisms -- CUDA kernel serialization on the single default stream, HBM-bandwidth contention on cache-heavy workloads, and dispatcher / tokenizer / library-internal locks in the in-process orchestration layer -- survive GIL removal by construction. Each candidate is plausible against a specific factor in the residual; none is independently tested by TR165's single-delta design.
The server-process backend architecture bypasses all of these mechanisms at once, which is why TR164 V2 shows clean elimination while TR165 shows partial recovery. A continuous-batching server-process backend does not dispatch through in-process Python at all. It bypasses the GIL trivially (the Python process at the client side just submits an HTTP/gRPC request), but it also bypasses (a) the in-process default CUDA stream because the server owns its own CUDA context, (b) the in-process tokenizer/dispatcher path because the server handles those, and (c) the in-process allocator critical sections because the server uses its own paged-attention allocator. TR164 V2's d = 1.44 / p < 6 x 10^-20 / 96/0 wins on 120 matched cells is therefore measuring the joint effect of removing all three layers at once. TR165's mean +17.91 pp on 24 matched cells is measuring only the GIL layer in isolation. The +9.5 pp gap between V2 and TR165 is the substrate-grounded size of the non-GIL layer's contribution under matched conditions.

Mechanism layer	TR164 V2 (server-process) bypasses it?	TR165 (free-threaded) bypasses it?	Mean recovery attributable
Python GIL	Yes (no in-process Python on dispatch)	Yes (sys._is_gil_enabled() == False verified)	+17.91 pp (TR165)
CUDA kernel serialization on default stream	Yes (server owns its CUDA context with multiple streams)	No (default stream still serializes in-process work)	Residual within the +9.5 pp V2-vs-TR165 gap
HBM-bandwidth contention on cache-heavy workloads	Mitigated by paged attention / continuous batching	No (interpreter-level changes do not touch HBM)	Residual within the +9.5 pp V2-vs-TR165 gap
In-process dispatcher / tokenizer / library locks	Yes (server-side dispatch path independent of client process)	No (in-process dispatcher path still runs in-process)	Residual within the +9.5 pp V2-vs-TR165 gap

Observations. The layered model is now fully substrate-grounded on both axes. The GIL layer's contribution is +17.91 pp on average (TR165 substrate). The combined non-GIL in-process layers' contribution is bounded by the V2-vs-TR165 gap of approximately +9.5 pp on average; that bound is a ceiling on the joint footprint of CUDA-stream serialization plus HBM-bandwidth contention plus dispatcher / library-lock contention together, not on any one of them individually. Disaggregating which fraction of the +9.5 pp is attributable to each is the explicit subject of TR166+ per-stream and per-allocator isolation experiments outlined in SS13.

The two-axis triangulation produces a stronger and more precise claim than either axis alone could license. TR164 V1 documents the phenomenon. TR164 V2 alone would license "a non-Python serving architecture eliminates the breakdown" without identifying which in-process mechanism dominates. TR165 alone would license "the GIL is a contributor" without distinguishing whether the residual non-improvement is a measurement artifact or a real second mechanism. The three reports together license the layered claim: the GIL contributes approximately +17.91 pp of the recoverable parallel efficiency at N=2, the combined non-GIL in-process mechanisms (CUDA-stream serialization, HBM-bandwidth contention, dispatcher / library-lock contention) jointly contribute approximately +9.5 pp more, and the +27.41 pp total is what the server-process architecture recovers by bypassing all four layers at once.

16.4 Implication for the bridge-paper Phase 8 anchor

The implication for the bridge paper Phase 8 anchor at papers/serving_state_safety_certification/UPGRADE_PLAN.md is that the serving-state safety certification protocol should not treat "direct PyTorch + GIL" as a monolithic risk class. The free-threaded interpreter is a partial mitigation -- meaningful but not sufficient -- and a continuous-batching server-process backend is the architectural intervention that closes the residual gap. Practitioners weighing the two interventions can read the +17.91 pp TR165 mean delta as the floor of what nogil buys them on a single-GPU laptop-class hardware substrate and the +27.41 pp TR164 V2 mean delta as the ceiling that the backend change reaches under matched conditions. The forbidden region between those two reference points (approximately +9.5 pp) is where the non-GIL in-process mechanisms live, and characterizing that region directly is the follow-up scaffold this report flags but does not execute.

Practitioner question	Substrate-grounded answer
If I cannot change my serving backend, will free-threaded Python recover my N=2 efficiency?	Partially: mean +17.91 pp, 79.2% of (model x workload x phase) combinations improve; expect residual on `repeated_prefix` and `ttft` phases
If I can change my serving backend, will it close the full gap?	Yes on the tested matrix: mean +27.41 pp under TGI, d = 1.44, p < 6 x 10^-20, 96/0 wins on 120 matched cells
What is the size of the non-GIL in-process residual that the backend change additionally bypasses?	Approximately +9.5 pp on the matched-pair mean (V2 mean - TR165 mean), bounded by the joint footprint of stream serialization, HBM contention, and library locks
What is the H1-fail tail under nogil that the backend change resolves?	The 5 H1-failing combinations and 4 unresolved hang shapes -- all of which complete cleanly under TGI in V2

Observations. The practitioner-grade reading is now fully grounded. The bridge paper Phase 8 anchor can quote the +17.91 pp / +27.41 pp / +9.5 pp triple as the substrate-grounded decomposition of V1's N=2 breakdown into GIL-attributable and non-GIL-in-process-attributable components. None of the three numbers is an estimate; each is a matched-pair statistic with named source fields in the two run artifacts.

The two-axis triangulation is what makes the Phase 8 anchor language defensible. Without TR165, the bridge paper would have had to attribute all of TR164 V2's +27.41 pp recovery to "moving off direct PyTorch" without naming what changed. Without TR164 V2, TR165's H2_partial would have had to leave the residual mechanism unnamed. With both reports, the paper can decompose +27.41 pp into +17.91 pp (interpreter axis) + ~+9.5 pp (non-interpreter in-process layers bypassed by the server architecture), and each component is anchored in a separate matched-pair contrast against the same V1 baseline.

17. SS10. Cross-TR Position

TR165 occupies a specific load-bearing slot in the Banterhearts program: it is the first technical report to use Python 3.14t free-threaded (PEP 779) as the dispatcher runtime for systematic, matched-pair experiment dispatch against a prior TR's pre-registered hypothesis. Every earlier TR in the program ran on a GIL-enabled CPython build; TR165 is the runtime-shift report. The preflight verified sys._is_gil_enabled() == False and sysconfig.Py_GIL_DISABLED == True before and after torch, transformers, and accelerate imports, with abort_on_gil_build: True enforced as a safety bar. This means TR165 is also the first TR whose substrate is legally allowed to make claims about GIL attribution at all -- every prior TR could only observe N=2 breakdown, not test its mechanism.

TR165 is the third report in the Phase 8 serving-stack mechanism-isolation arc. The arc has a deliberate three-report shape, and each report is non-substitutable:

TR	Role in arc	Mechanism tested	What it can rule in / out
TR164 V1	Parent experiment	None (phenomenon documentation)	Establishes that N=2 breakdown exists on `pytorch_direct` and names GIL contention as the leading attribution hypothesis
TR164 V2	Companion mechanism-isolation	Backend architecture (vLLM / SGLang / TGI vs. `pytorch_direct`)	Tests whether swapping the serving backend dissolves the N=2 boundary; complements TR165 along the architecture axis
TR165	This report	Python interpreter (3.14t free-threaded, GIL removed)	Tests whether removing the GIL dissolves the N=2 boundary; complements TR164 V2 along the runtime axis

Observations. The three reports are organized as a 2D mechanism-isolation grid (backend x runtime), not a sequence in which TR165 replaces TR164. TR164 V1 supplies the phenomenon and the hypothesis; TR164 V2 holds the runtime fixed and varies the backend; TR165 holds the backend fixed (pytorch_direct only, 48 cells planned, 3,744 metrics rows produced, all cells completed) and varies the runtime. Removing any one of the three collapses the program's ability to make a mechanism-level claim about V1's breakdown.

The honest cross-TR reading is that TR164 V1 is the observation, TR164 V2 is the backend-axis falsification, and TR165 is the runtime-axis falsification. The H2_partial verdict from TR165 -- 19 of 24 N=2 combinations improving, mean delta in parallel efficiency of +17.91 percentage points, median +17.15 pp, 2 of 6 V1 deterministic-hang shapes resolved out of 6 attempted -- is exactly the kind of result that requires TR164 V2 to remain in the arc rather than be retired by TR165. If TR165 had returned H1 (all 24 combinations passing the threshold, all 6 hang shapes resolved), the GIL would have been a sufficient single-mechanism explanation and TR164 V2 would be supporting evidence at most. Because TR165 returned H2_partial, the residual 5 of 24 non-improving combinations and the 4 still-hanging shapes are exactly what TR164 V2's cross-backend evidence is positioned to characterize.

The bridge paper at papers/serving_state_safety_certification/UPGRADE_PLAN.md consumes TR165 as part of its Phase 8 anchor. The bridge paper's load-bearing use of TR165 is the H2_partial verdict itself: the paper does not need TR165 to prove that the GIL is the mechanism; it needs TR165 to establish that the GIL is a mechanism with measurable but partial attribution (the +17.91 pp / +17.15 pp deltas and the 2-of-6 hang resolution being the substrate the paper cites). This positioning is what licenses the bridge paper's later claim that backend architecture (TR164 V2) and Python runtime (TR165) are non-redundant axes of the serving-stack mechanism-isolation argument.

There is one further cross-TR positioning point worth being explicit about. TR165 is also the first TR in the program for which the dispatcher's startup preflight is itself a publishable result -- gil_disabled_at_dispatcher_startup: True, free_threaded_build: True, signals_agree: True, cuda_available: True, all four agreeing -- because in every prior TR the runtime was an uncontested constant. From TR165 onward, the runtime is a controlled variable, and the preflight log line is the contract that says so. Future TRs in the arc that revisit GIL attribution at larger model scale, larger N, or under additional backends will inherit this preflight contract directly from TR165's dispatcher rather than re-deriving it. The matched_pair_contrast block (with baseline_csv pointing at TR164 V1's 20260531_120428_552237 run directory and the per-metric sub-blocks for decode_tps, parallel_efficiency, p95_latency_multiplier, and ttft_ms) is the schema that downstream TRs and the bridge paper Phase 8 section will consume.

18. SS11. Forbidden Claims

Three load-bearing claims that the substrate explicitly does not license. Each is named, attributed to the framing it would most naturally arise from, and rejected against the matched-pair evidence. Naming them in the report itself is a defensibility instrument: any downstream reader (or adversarial reviewer) re-deriving TR165's verdict ladder should land on the same forbidden list.

18.1 Forbidden claim 1: "The GIL is the sole cause of V1's N=2 breakdown."

This is the strongest possible reading of the H1 hypothesis. The substrate rejects it.

Forbidden claim	Disconfirming evidence	Source field
GIL is sole mechanism	5 of 24 N=2 combinations fail the H1 efficiency-improvement threshold	n_combinations_below_h1_threshold
GIL is sole mechanism	4 of 6 deterministic-hang cell shapes still hang under nogil	hang_cells_attempted minus n_hangs_resolved
GIL is sole mechanism	Pre-registered falsification verdict is H2_partial, not H1	falsification_verdict

Observations. If the GIL were the sole mechanism, every single one of the 24 (model x workload x phase) N=2 combinations would clear the H1 efficiency-improvement threshold, and all 6 deterministic-hang cell shapes from V1 would resolve under nogil. Neither is observed. The 5/24 H1-failing combinations and the 4/6 still-hung cell shapes are structural counter-evidence, not noise; they survive on a full-coverage matched-pair join.

The GIL is a mechanism behind V1's N=2 breakdown. It is not the mechanism. The residual non-improvement is the load-bearing observation, not a footnote.

18.2 Forbidden claim 2: "Free-threaded Python eliminates the in-process N=2 breakdown."

This is the marketing-grade overreach that "free-threaded Python unlocks parallel inference" headlines invite. The substrate rejects it as well.

Forbidden claim	Disconfirming evidence	Source field
Nogil eliminates breakdown	Mean N=2 parallel-efficiency delta is +17.91 pp (partial recovery, not full recovery to ideal)	mean_delta_n2_efficiency
Nogil eliminates breakdown	Median delta +17.15 pp confirms the partial-recovery shape (no long tail correction)	median_delta_n2_efficiency
Nogil eliminates breakdown	Only 19/24 (79.2%) combinations improve at all	n_combinations_passing_h1
Nogil eliminates breakdown	Only 2 of 6 hang cell shapes resolved	n_hangs_resolved

Observations. "Elimination" requires that N=2 parallel efficiency under nogil approach 1.0 across the matrix, and that all V1 hangs disappear. The observed mean and median deltas are roughly +17-18 percentage points -- meaningful and pre-registered as a passing H2 outcome, but explicitly bounded. A 21% non-improvement minority and a 67% unresolved-hang rate are incompatible with elimination language.

Partial recovery is the honest verdict. Calling it elimination overstates the result and gives an adversarial reviewer a free strike.

18.3 Forbidden claim 3: "Python 3.14t is production-ready for in-process LLM serving."

This is the claim TR165 most strongly invites and most decisively does not test. The matrix is mechanism-isolation, not production-readiness.

Forbidden claim	Why TR165 cannot license it	Source field
Production-ready ruling	N sweep is capped at N=1 and N=2 only; no N=4, N=8, N=16 evidence	plan N concurrency
Production-ready ruling	Single-backend ablation (pytorch_direct only); no vLLM / SGLang / TGI cross-validation in this run	backend pytorch_direct only
Production-ready ruling	No long-run stability or memory-leak sweep is part of the protocol	analysis.json (no such field)
Production-ready ruling	3,744 metrics rows on a 48-cell matrix is mechanism-grade, not deployment-grade	rows; cells planned

Observations. TR165 answers exactly one question on a bounded matrix: does removing the GIL change V1's N=2 parallel-efficiency story under matched conditions? The hardware (RTX 4080 Laptop, 12 GB) is the V1 match, not a serving platform. The substrate this report does not have -- full-N sweep through at least N=16, multi-hour stability with memory-leak telemetry, cross-backend confirmation under nogil -- is precisely the substrate that any production-readiness claim would require.

TR165 is a single-mechanism falsification experiment. Reading it as a deployment endorsement reverses the direction of evidence. The defensibility bar requires that the forbidden ruling be named here, not assumed away.

18.4 What replaces each forbidden claim

For each forbidden claim, the licensed counterpart from SS09 / SS10 is the one that survives the matched-pair contrast:

Instead of "GIL is sole cause," the licensed claim is "GIL is a meaningful but partial mechanism (H2_partial), with 19/24 combinations improving by a mean of +17.91 pp and 2/6 hangs resolved."
Instead of "nogil eliminates the breakdown," the licensed claim is "nogil recovers a partial fraction of N=2 parallel efficiency on the tested matrix; 21% of combinations and 67% of hang shapes are unaffected."
Instead of "3.14t is production-ready," the licensed claim is "3.14t is mechanism-viable on a 48-cell bounded matrix; production qualification requires the full-N, long-run, cross-backend substrate enumerated in the future-work section."

The forbidden list is not rhetorical. It is the explicit boundary between what the 3,744-row matched-pair contrast supports and what an adversarial reading would attempt to extract from the same numbers.

19. SS12. Limitations.

TR165 is a deliberately narrow ablation, and the honesty of its H2_partial verdict depends on naming what it does not establish. The same-session GIL flag control removes the original N=2 version-confound concern, but only for the N=2 screen. Five structural limitations still bound the inference surface of this report. None of them invalidate the matched-pair contrast against TR164 V1 or the same-build control; each defines a follow-up experiment that the substrate cannot adjudicate on its own.

Limitation 1: Concurrency range capped at N in {1, 2}. TR164 V1's breakdown curve extends through N=1, N=2, N=4, N=8, and N=16. TR165 tests only the N=1 to N=2 transition -- the boundary at which V1 first exhibited parallel-efficiency collapse. The mean delta of +0.1791 (+17.91 percentage points) and median delta of +0.1715 (+17.15 percentage points) in N=2 parallel efficiency are therefore boundary-of-breakdown measurements, not full-curve recoveries. Whether nogil's partial recovery scales monotonically, plateaus, or inverts at N=4, N=8, or N=16 is unmeasured. A program-grade follow-up must re-run the full {1, 2, 4, 8, 16} ladder under Python 3.14t to recover the complete recovery curve and test whether the 4 unresolved hang shapes at N=16 remain deterministic or shift to stochastic timeouts.

Limitation 2: Single backend (pytorch_direct only). The 48 planned cells (3 models x 4 workloads x 2 phases x 2 N levels) cover one backend: direct PyTorch in-process inference. TR164 V2 evaluates TGI, vLLM, and SGLang as separate-process server backends with their own concurrency models (continuous batching, paged attention, request schedulers). TR165 does NOT test whether the GIL-attribution mechanism -- and therefore the +17.91 pp mean recovery -- generalizes to in-process variants of those backends, nor whether their internal threading models are gated by the same CPython interpreter lock that pytorch_direct exposes. The scope claim is strictly: nogil partially recovers parallel efficiency for direct PyTorch dispatch. Cross-backend generalization is a separate experimental question.

Limitation 3: Hardware single-source (RTX 4080 Laptop 12 GB). Hardware is matched exactly to TR164 V1, which is the correct choice for a matched-pair contrast but produces a single-GPU substrate. The report does not measure whether the GIL is the dominant N=2 bottleneck on data-center-class accelerators where CUDA kernel launch overhead is a smaller fraction of total step time and where memory bandwidth is dramatically higher. It is structurally possible that on a larger accelerator the residual 20.8% of combinations not passing the H1 threshold shrinks (because non-GIL mechanisms like memory-bandwidth contention recede) or grows (because dispatcher overhead becomes the dominant remaining serialization point). The H2_partial verdict is hardware-conditional on a 12 GB laptop-class GPU.

Limitation 4: PyTorch CUDA-nogil wheel state is bleeding-edge. The dispatcher preflight verifies Python 3.14.5 free-threaded build with sys._is_gil_enabled() == False both before and after PyTorch import, and the abort_on_gil_build safety bar held. However, the PyTorch CUDA-nogil wheel resolved at runtime is whatever was current at the dispatch moment; the free-threaded ecosystem is moving fast enough that nogil-specific tensor-op behavior, dispatcher locking, and CUDA stream interaction may shift materially between minor wheel revisions. The substrate captures a snapshot, not a stable equilibrium. Re-runs against future PyTorch nogil wheels may produce different recovery percentages without any change to TR165's experimental design.

Limitation 5: Six V1 hang shapes tested only at original N=16. The hang_resolution_ledger reports 2 of 6 deterministic-hang cell shapes RESOLVED under nogil and 4 still hanging, all attempted at the original V1 trigger condition of N=16. TR165 does NOT sweep other N levels under nogil for those same six (model, workload) pairs to determine whether the 4 unresolved hangs are GIL-independent at all N concurrency levels or whether they are GIL-attributable at lower N but transition to a different mechanism (kernel serialization, OOM pressure, scheduler starvation) only at N=16. The 2/6 vs 4/6 split is therefore a statement about hang resolution at V1's specific trigger N, not about the underlying mechanism behind each hang shape.

Control-scope note. The 2026-06-23 same-build control should not be read as a replacement for the whole historical TR165 verdict. It validates the GIL-flag direction at N=2 with 24/24 positive paired shifts, but it intentionally excludes the N=16 hang-shape stress subset and it does not reproduce the historical 19/24 absolute floor crossing. The correct synthesis is: same-build control confirms GIL contribution at N=2; historical transition supplies the hang-resolution ledger; neither supports "free-threading fixes direct PyTorch serving."

Taken together, these five limitations bound the H2_partial verdict precisely: no-GIL recovers a meaningful but partial fraction of N=2 parallel efficiency in direct PyTorch inference on a single laptop-class GPU, with the historical transition reporting 19 of 24 combinations clearing the H1 threshold and 2 of 6 hang shapes resolved at V1's original trigger, and the same-build control reporting 24/24 positive N=2 paired shifts but 0/24 absolute floor crossings on the Docker stack. Every adjacent generalization -- to higher N, other backends, larger accelerators, future PyTorch builds, or other trigger conditions -- requires a separate experimental campaign. The bridge paper's Phase 8 anchor consumes TR165 as a partial-mechanism falsification of V1's GIL hypothesis, and the limitations above explicitly license that scope and no more.

The honest reading: TR165 establishes that the GIL is a mechanism behind V1's N=2 breakdown on this substrate, refutes the strong claim that it is the mechanism, and leaves the five generalization axes above as the follow-up program. The historical +17.91 pp transition and 2/6 hang-resolution count are real but version-coupled; the same-build +0.1325 N=2 control isolates the GIL flag but does not clear the floor. Everything beyond those bounds is TBD per follow-up experiments not in this report.

20. SS13. Future Work.

TR165 establishes the H2_partial verdict on a deliberately narrow N={1,2} substrate. The honest answer "GIL is A mechanism but not the only mechanism" creates three queued extensions, each addressing a specific residual question the current substrate cannot resolve. None of these are speculative additions; each maps directly onto an observed gap in the 20.8% non-improvement combinations or the 4-of-6 unresolved deterministic hangs.

20.1 Extension 1: Full-N nogil sweep (N={4, 8, 16}).

The current substrate covers N=1 and N=2 only because TR164 V1's deterministic-hang boundary lived at N=2 and the matched-pair contrast required exact replication of V1's plan. The +17.91% mean delta and +17.15% median delta in N=2 parallel efficiency characterize the onset of the GIL effect, not its asymptote. A full-N sweep at N={4, 8, 16} on the same 3-model x 4-workload substrate would extend the falsification ledger from 24 combinations to 96 combinations (3 x 4 x 2 phases x 4 N levels), and would test whether the +0.17909670 mean delta scales, saturates, or inverts as agent count grows.

Question	What N=2 substrate can answer	What N={4,8,16} would add
Onset of GIL bottleneck	Yes -- 19 of 24 combinations clear threshold	N/A
Breakdown-curve shape	No -- single point on the curve	Full envelope
Hang resolution scaling	2 of 6 V1 hang shapes resolved under nogil	Does the 33% resolution rate hold at higher N?
Non-GIL mechanism dominance crossover	Inferred from the 5 non-improving combinations	Direct measurement

Observations. The N={4, 8, 16} extension is the cheapest of the three queued items because the dispatcher, model wrappers, and analyze.py extractors all already support arbitrary N -- the change is a config.yaml edit and a longer wall-clock budget.

The 20.8% non-improvement rate at N=2 is the substrate's most load-bearing residual. Without higher-N data, we cannot distinguish "GIL effect saturates at moderate N" from "non-GIL mechanisms grow faster than GIL relief as N rises." TR165's H2_partial verdict is correct but incomplete; the full breakdown curve is the next-resolution answer.

20.2 Extension 2: CUDA-stream-serialization isolation (TR166+ candidate).

The residual 20.8% of combinations not improving under nogil and the 4 of 6 deterministic-hang shapes still hanging under nogil are the substrate's most direct evidence that mechanisms outside the GIL are active at N=2. The leading candidate is default-CUDA-stream serialization: all PyTorch operations from concurrent Python threads enqueue onto the same default stream and serialize at the driver layer regardless of GIL state. A TR166+ candidate would re-run TR165's plan with per-agent CUDA streams (one torch.cuda.Stream() per concurrent agent) and measure the delta against TR165's nogil-only baseline.

Mechanism candidate	TR165 evidence	TR166+ test
Python GIL contention	+17.91% mean delta confirms partial attribution	Held constant (nogil baseline)
Default-stream serialization	4 unresolved hangs + 5 non-improving combinations	Per-agent streams as the manipulated factor
Memory-bandwidth contention	TBD per analyze.py extension	Captured via nsys async-traces
Dispatcher overhead	TBD per analyze.py extension	Bounded by per-agent dispatcher process variant

Observations. TR166+ inherits TR165's matched-pair design and adds per-agent CUDA streams as the manipulated factor; the N=2 cell shape is preserved so the contrast is clean against TR165's verdict.

The honest reading of TR165 is that the GIL accounts for ~80% of the improvable surface and something else accounts for the rest. Naming that "something else" as default-stream serialization is a hypothesis, not a finding; TR166+ promotes it to a falsifiable test under the same defensibility bar.

20.3 Extension 3: Cross-hardware nogil substrate (A100 / H100).

TR165 ran exclusively on the RTX 4080 Laptop 12 GB to hold V1's hardware fixed. This is the right matched-pair discipline but it leaves open whether the H2_partial verdict is consumer-hardware-specific. A100 (40 GB / 80 GB) and H100 (80 GB) have different memory-bandwidth envelopes, different SM counts, different default-stream behavior under driver versions tied to data-center SKUs, and run under Linux (no Windows scheduler interaction). Re-running the TR165 plan on A100 and H100 would test whether the +17.91% mean delta and the 4 unresolved hangs are intrinsic to the GIL-attribution pattern or whether they are partly an artifact of consumer-laptop thermal and bandwidth constraints.

20.4 Bridge-paper Phase 8 absorption pathway.

The bridge paper's Phase 8 anchor at 2026-10-24 acts as the GO/NO-GO trigger for wave-2 substrate expansion. If an external acceptance signal clears by that date, extensions (1) full-N sweep and (3) cross-hardware substrate are the most natural absorptions: both extend an existing matched-pair design rather than introducing new methodology, and both produce data that strengthens the bridge paper's serving-state safety certification claim about GIL-attribution under direct-PyTorch concurrency. Extension (2) -- the CUDA-stream-isolation TR166+ candidate -- is a separate research arc and would queue as a standalone technical report regardless of the Phase 8 trigger outcome.

21. Conclusion

Technical Report 165 closes the matched-pair arc that Technical Report 164 V1 opened, but the closure is now two-layered. V1 reported a sharp N=2 breakdown in the in-process pytorch_direct backend on an RTX 4080 Laptop 12 GB and attributed the boundary, tentatively, to CPython's global interpreter lock. The original TR165 run made the first controlled move by replacing the stock CPython 3.13 GIL build with Python 3.14.5 free-threaded and preserving the V1 model/workload/backend geometry. The later same-session control made the stricter move by holding CPython 3.14.5, Docker image, CUDA stack, model plan, and run window fixed while toggling only PYTHON_GIL=1 versus PYTHON_GIL=0.

The pre-registered ladder resolves to H2_partial. H0 -- "the GIL is not the mechanism, expect no improvement" -- is rejected: the mean delta in N=2 parallel efficiency under nogil is +0.1791 (+17.91 percentage points absolute) and the median is +0.1715 (+17.15 pp), both meaningfully above the noise floor. H1 -- "the GIL is the mechanism, expect improvement in all 24 (model x workload x phase) combinations at N=2" -- is not fully supported: 19 of 24 combinations clear the H1 threshold (79.2%) and 5 of 24 do not (20.8%). H2 -- "some but not all combinations improve" -- is supported by construction at that 79/21 split. The hang ledger sharpens the same finding from a different angle: of the six deterministic-hang cell shapes V1 catalogued at N=16, two are resolved under nogil and four still hang. The minimum coverage thresholds (twenty N=2 combinations, six hang-shape attempts) are both satisfied, so the verdict is not coverage-limited.

What this licenses is narrower than the V1 framing and cleaner than the original TR165 report. We can now say that the GIL flag itself has a positive N=2 effect: in the same-session control, no-GIL improves 24/24 paired combinations by mean +0.1325 with 95% CI [0.1218, 0.1434]. We can also say that no-GIL is not sufficient: in that same control, neither arm clears the 0.65 parallel-efficiency floor in any of the 24 pairs, and the historical hang ledger still resolves only 2 of 6 N=16 shapes. The defensible claim is therefore not "free-threaded Python fixes in-process multi-LLM dispatch"; it is "free-threaded Python moves the boundary in the expected direction, while non-GIL in-process mechanisms still dominate enough to keep the boundary active."

The methodological contribution is the matched-pair triangulation itself. TR164 V2 changes the backend while holding the interpreter fixed; TR165 changes the runtime state while holding the backend fixed at pytorch_direct; the same-session control further separates "Python version changed" from "GIL flag changed." Read together, the three contrasts isolate the V1 breakdown into backend-architecture, interpreter-lock, and remaining in-process residual components. The numerical decomposition should be stated conservatively: the historical transition reports +17.91 pp mean recovery, the same-build GIL flag control isolates +13.25 pp at N=2, and the server-process backend axis remains the larger architectural intervention. This is the structural deliverable Phase 8 needed, and the partial verdict is what makes the triangulation honest rather than confirmatory.

The substrate-grounded operational recommendation is correspondingly cautious. Free-threaded Python is a meaningful mitigation for direct-PyTorch in-process dispatch, but it is not a sufficient serving plan. For workloads that need reviewer-proof or production-grade concurrency, the responsible path remains a server-side backend (TGI / vLLM / SGLang) or an explicit N=1 queue until the residual in-process mechanisms are isolated. The remaining mechanism-isolation work is now sharper: reproduce the same-build control at higher N, add the N=16 hang-shape subset to same-session control, and separate CUDA-stream serialization from dispatcher and allocator contention inside the residual.

22. References

This section enumerates the load-bearing internal and external references for TR165. Internal references are paths inside the Banterhearts monorepo (relative to repo root); external references are public specifications, library documentation, and upstream issue trackers consulted during the build. Reference numbering is local to TR165; cross-TR citation should use the full repo-relative path rather than the local number, since the Banterhearts substrate does not maintain a global bibliography.

22.1 Banterhearts internal references

Tag	Path	Role in TR165
[I-1]	`PublishReady/reports/Technical_Report_164.md`	TR164 V1 parent report -- source of the GIL-attribution hypothesis TR165 tests, and the canonical narrative of the N=2 breakdown boundary and the six deterministic-hang cell shapes.
[I-2]	`research/tr164/results/20260531_120428_552237/`	TR164 V1 raw run directory -- read-only baseline that supplies the matched-pair contrast keys consumed by TR165's `matched_pair_contrast` block.
[I-3]	`research/tr164/V2_BACKEND_NOTES.md` and `research/tr164/DRAFT_Technical_Report_164_V2.md`	TR164 V2 companion (cross-backend Phase 8 sweep) -- orthogonal axis to TR165's single-backend `pytorch_direct` ablation.
[I-4]	`research/tr164/cross_run_analysis.json`	TR164 V2 cross-run synthesis -- supplies the 120-cell TGI-vs-V1 statistics (d=1.44, p<6x10^-20, MH OR=infinity from 22/24 strata) cited in SS9.
[I-5]	`research/tr165/results/20260607_174748_273070/analysis.json`	Historical TR165 transition substrate -- 3,744 metric rows over 48 cells with the `H2_partial` falsification verdict and N=16 hang ledger.
[I-6]	`research/tr165/same_session_gil_control/same_session_gil_control_summary.json`	Same-build CPython 3.14.5 `PYTHON_GIL=1` versus `PYTHON_GIL=0` control summary for the N=2 screen.
[I-7]	`research/tr165/same_session_gil_control/same_session_gil_control_n2_pairs.csv`	Per-pair same-session N=2 GIL-control deltas.
[I-8]	`research/tr165/results/20260623_165503_919943/` and `research/tr165/results/20260623_222630_155096/`	Same-session GIL-on and no-GIL run directories.
[I-9]	`papers/serving_state_safety_certification/UPGRADE_PLAN.md`	Bridge paper Phase 8 anchor -- defines the role of TR164/TR165 in the broader serving-state safety certification arc.
[I-10]	`BANTERHEARTS_MEASUREMENT_COUNT.md`	Canonical primary + judge measurement count for the Banterhearts program; TR165's rows roll up into this ledger.

Observations. The internal reference set is deliberately narrow: TR165 is a single-hypothesis matched-pair against TR164 V1, so only V1's run directory and report are load-bearing on the parent side, with the V2 companion (drafts plus cross-run analysis JSON) supplying the cross-backend complement and the bridge paper supplying the certification framing.

The internal reference graph is shallow on purpose -- TR165's defensibility rests on one matched-pair contrast against one parent run plus one cross-axis triangulation against TR164 V2's cross-run synthesis. Keeping the citation set tight makes the falsification verdict easy to audit end-to-end from the analysis.json upward.

22.2 External references

Tag	Reference	Role in TR165
[E-1]	PEP 779 -- "Criteria for supported status for free-threaded Python" (python.org/peps/pep-0779)	Defines the supported-status criteria for the Python 3.14 free-threaded build; grounds the `python3.14t` interpreter selection and the `sys._is_gil_enabled() == False` preflight gate.
[E-2]	Python 3.14.5 free-threading build release notes (tags/v3.14.5:5607950, May 10 2026)	The exact interpreter build TR165 ran under, as captured by the dispatcher preflight.
[E-3]	PyTorch CUDA wheels documentation for free-threaded Python (`pytorch.org/get-started`, free-threaded wheel index)	Source for the `torch` wheel ABI used; relevant because `torch` import must not silently re-enable the GIL under a `python3.14t` interpreter.
[E-4]	`transformers` import-time threading behavior under `nogil` (HuggingFace `transformers` issue tracker, free-threaded compatibility threads)	Background for the preflight check that `sys._is_gil_enabled()` remains `False` after `transformers` import.
[E-5]	`accelerate` import-time threading behavior under `nogil` (HuggingFace `accelerate` issue tracker)	Same role as [E-4] for the `accelerate` import surface.

Observations. The external set is anchored on PEP 779 plus the three library import surfaces (torch, transformers, accelerate) that must remain GIL-disabled after import; these are the exact surfaces the dispatcher preflight gates on, so the citation set mirrors the gate.

External references are scoped to what the preflight actually verifies. TR165 does not cite the broader free-threading literature because the report's claims are bounded to one interpreter build, one backend, and one matched-pair contrast -- citing more would overreach the substrate.

23. Appendix A. Hardware and Python 3.14t Environment Fingerprint

This appendix freezes the execution environment used for the TR165 ablation so that any future replication -- whether under a stricter free-threaded build, a different GPU class, or a containerized re-run -- can be diffed against the exact substrate that produced the H2_partial verdict. The fingerprint is split into hardware, operating system, Python runtime, ML stack, and GIL-detection signals captured at dispatcher startup. Every value below was either read off the host at run time or pinned in the run directory's manifest; nothing in this table is inferred.

23.1 Hardware and OS

Component	Value
CPU	Intel Core i9 (RTX 4080 Laptop reference platform)
GPU	NVIDIA RTX 4080 Laptop, 12 GB VRAM, compute capability sm_89
System RAM	32 GB DDR5
Operating System	Windows 11 Home, build 10.0.26200
Hardware parity with TR164 V1	Exact match (same physical machine)

Observations. The hardware row is identical to the TR164 V1 baseline at research/tr164/results/20260531_120428_552237/. This is load-bearing: the matched-pair contrast in Section 13 only carries causal weight because the GPU, VRAM ceiling, thermal envelope, and driver path are held constant. The only variable that moves between V1 and TR165 is the Python interpreter.

The 12 GB VRAM ceiling on sm_89 is not incidental -- it is the same memory envelope under which V1's N=2 deterministic hangs were first observed, which is precisely why the ablation must run on the same box rather than on a higher-VRAM server class.

23.2 Python runtime and GIL signals

Field	Value
Executable path	`C:\Users\sahil\AppData\Local\Programs\Python\pythoncore-3.14t-nuget\tools\python3.14t.exe`
Version string	3.14.5 free-threading build (tags/v3.14.5:5607950, May 10 2026)
`sys._is_gil_enabled()` before imports	False
`sys._is_gil_enabled()` after torch/transformers/accelerate imports	False
`sysconfig.Py_GIL_DISABLED`	True
`free_threaded_build` flag	True
`signals_agree` (preflight aggregate)	True
`gil_disabled_at_dispatcher_startup`	True
`abort_on_gil_build` safety bar	True (would have terminated the run on a stock 3.14 interpreter)
CUDA available at startup	True

Observations. The dispatcher refuses to proceed unless every one of these signals agrees that the GIL is disabled both before and after the heavy ML imports. This is the guard against the most embarrassing failure mode for a GIL-ablation report -- accidentally running on a stock interpreter and attributing any noise to free-threading.

The post-import re-check matters specifically because PyTorch and Transformers each ship native extensions that historically gated themselves behind GIL-required initialization; confirming sys._is_gil_enabled() == False after their import is the only honest way to claim the workload itself ran under free-threading rather than just the launcher.

23.3 ML stack and CUDA

Component	Value
PyTorch version	TBD per analyze.py extension (recorded in run manifest, not surfaced in analysis.json)
Transformers version	TBD per analyze.py extension
Accelerate version	TBD per analyze.py extension
CUDA driver	TBD per analyze.py extension
CUDA toolkit	TBD per analyze.py extension
PyTorch CUDA wheel index	TBD per analyze.py extension (free-threaded cp314t wheels from PyTorch nightly index)

Observations. The exact pinned versions live in the run manifest under research/tr165/results/20260607_174748_273070/ but are not lifted into analysis.json's top-level keys, so they are marked TBD here per the substrate-fidelity rule rather than guessed. A follow-up analyze.py extension should hoist these into the top-level summary so the fingerprint is self-contained.

The non-trivial constraint is that the cp314t free-threaded ABI requires PyTorch wheels built specifically against the free-threaded interpreter; any reproduction attempt that pip-installs from the default index will silently fall back to a GIL-enabled wheel and the abort_on_gil_build guard will fire before a single cell runs.

24. Appendix B. Reproduction Commands

This appendix gives the exact command sequence to reproduce TR165 end-to-end on a Windows host with a Python 3.14t free-threaded build installed at the canonical NuGet path and an RTX 4080 Laptop class GPU. The commands are reproduced verbatim from research/tr165/README.md and from the dispatcher invocation that produced research/tr165/results/20260607_174748_273070/. Every numerical claim in this report depends on artifacts under that run directory; the file manifest at the end of this appendix enumerates them.

24.1 Preflight check sequence

Before any TR165 dispatcher invocation, the operator runs a three-step preflight sequence that confirms the interpreter is the free-threaded build, that the GIL is disabled at process start, and that CUDA is visible. All three must pass; if any returns the GIL-enabled signal, the dispatcher will abort because abort_on_gil_build: True is held as a safety bar.

Step 1 -- confirm the interpreter is the 3.14t free-threaded build:

py -3.14t -c "import sys; print(sys.version); print(sys.executable)"

Step 2 -- confirm the GIL is disabled at startup (both signals must agree):

py -3.14t -c "import sys, sysconfig; print('gil_enabled=', sys._is_gil_enabled()); print('Py_GIL_DISABLED=', sysconfig.get_config_var('Py_GIL_DISABLED'))"

Step 3 -- confirm CUDA visibility under the free-threaded build (imports of torch must not silently re-enable the GIL):

py -3.14t -c "import torch, sys; print('cuda=', torch.cuda.is_available()); print('gil_after_torch_import=', sys._is_gil_enabled())"

Observations. The dispatcher records the result of each step into runtime.json under the run directory. For run 20260607_174748_273070 the recorded values are gil_disabled_at_dispatcher_startup: True, signals_agree: True, free_threaded_build: True, and cuda_available: True. The sys._is_gil_enabled() == False signal holds both before and after torch, transformers, and accelerate are imported.

The three-step sequence is non-negotiable. If Py_GIL_DISABLED is True but sys._is_gil_enabled() returns True after a library import, the run is invalid as a GIL-ablation test and must be aborted; the matched-pair contrast against TR164 V1 only carries meaning when the dispatcher process actually executes without the GIL.

24.2 Dispatcher invocation (run.py)

The TR165 substrate was produced by a single dispatcher call. The suite is restricted to local_core, the backend to pytorch_direct (single-backend ablation against V1), and --no-nsys is set because TR165's verdict depends on the matched-pair efficiency contrast, not on kernel-level trace analysis:

py -3.14t -m research.tr165.run --suite local_core --phase all --backends pytorch_direct --no-nsys

Observations. This invocation expanded to the planned 48 cells (3 models x 4 workloads x 2 phases x 2 N levels), wrote 3,744 rows into metrics.csv, and completed with runtime_ok_for_verdict: True. No cells were skipped at the planner level; the 4 unresolved hang shapes recorded in the hang-resolution ledger are dispatcher-detected hangs at execution time, not planner exclusions.

24.3 Analyzer invocation (analyze.py)

The analyzer consumes the run directory, joins against TR164 V1's matched baseline, and emits analysis.json containing the matched-pair contrast, the breakdown-boundary table, the hang-resolution ledger, and the pre-registered falsification verdict:

py -3.14t -m research.tr165.analyze --run-dir research/tr165/results/20260607_174748_273070

Observations. The analyzer's falsification_verdict block is the load-bearing artifact: H2_partial, mean_delta_n2_efficiency: +0.1791, n_combinations_passing_h1: 19/24, n_hangs_resolved: 2/6. Every percentage-point claim in the body of this report traces back to that JSON object.

24.4 Report generator invocation (generate_report.py)

The auto-generated companion report (run-dir-local, not promoted to PublishReady/) is produced by:

py -3.14t -m research.tr165.generate_report --run-dir research/tr165/results/20260607_174748_273070

Observations. Per the repo rule that PublishReady/ is off-limits for code-generated dumps, the generator writes only into the run directory. The hand-narrated report you are reading is a separate authored artifact; the generator's output is treated as raw substrate, not as a publishable narrative.

24.5 Same-session GIL flag control invocation

The later same-build control was run in the Docker image tr165-py314t-smoke:base with the same CPython 3.14.5 free-threaded interpreter and CUDA stack for both arms. The GIL-on arm intentionally sets PYTHON_GIL=1 and bypasses the original "require no-GIL" verdict gate because it is a control arm, not a no-GIL scientific arm:

docker run --rm --gpus all -v "%CD%:/workspace" -w /workspace -e PYTHON_GIL=1 tr165-py314t-smoke:base /opt/python-3.14t/bin/python3.14t -m research.tr165.run --config research/tr165/config_same_session_n2_gil_on.yaml --no-require-nogil --no-nsys

The no-GIL arm sets PYTHON_GIL=0 and uses the normal no-GIL gate:

docker run --rm --gpus all -v "%CD%:/workspace" -w /workspace -e PYTHON_GIL=0 tr165-py314t-smoke:base /opt/python-3.14t/bin/python3.14t -m research.tr165.run --config research/tr165/config_same_session_n2_nogil.yaml --no-nsys

The comparison artifact is generated by:

python research/tr165/analyze_same_session_gil_control.py --gil-on-run-dir research/tr165/results/20260623_165503_919943 --nogil-run-dir research/tr165/results/20260623_222630_155096 --out-dir research/tr165/same_session_gil_control

Observations. The same-session commands intentionally do not reproduce the N=16 hang-shape ledger. Their job is narrower: isolate the GIL flag at N=2 under one CPython version and one Docker stack.

24.6 File manifest

The following substrate files carry every numerical claim in this report. All paths are repo-relative.

Path	Role
`research/tr165/results/20260607_174748_273070/analysis.json`	Falsification verdict, matched-pair contrast, hang ledger
`research/tr165/results/20260607_174748_273070/metrics.csv`	3,744 raw measurement rows
`research/tr165/results/20260607_174748_273070/runtime.json`	Preflight GIL/CUDA signals at dispatcher startup
`research/tr165/results/20260607_174748_273070/manifest.json`	Plan (suite, backends, models, workloads, phases, N)
`research/tr165/results/20260623_165503_919943/manifest.json`	Same-session GIL-on control manifest (`PYTHON_GIL=1`)
`research/tr165/results/20260623_222630_155096/manifest.json`	Same-session no-GIL control manifest (`PYTHON_GIL=0`)
`research/tr165/same_session_gil_control/same_session_gil_control_summary.json`	Same-session GIL-control aggregate summary
`research/tr165/same_session_gil_control/same_session_gil_control_n2_pairs.csv`	Same-session GIL-control per-pair N=2 deltas
`research/tr165/config_same_session_n2_gil_on.yaml`	Same-session GIL-on config
`research/tr165/config_same_session_n2_nogil.yaml`	Same-session no-GIL config
`research/tr165/analyze_same_session_gil_control.py`	Analyzer for same-session GIL-control summary and pairs table
`research/tr164/results/20260531_120428_552237/`	TR164 V1 baseline (matched-pair join target; read-only)
`research/tr164/cross_run_analysis.json`	TR164 V2 cross-run synthesis (120-cell TGI-vs-V1 d=1.44 / p<6x10^-20 / MH OR=infinity) cited in SS9
`research/tr165/README.md`	Preflight sequence canonical source
`research/tr165/config.yaml`	Suite definition (`local_core`, 3 models, 4 workloads, 2 phases, 2 N levels)
`research/tr165/run.py`	Dispatcher
`research/tr165/analyze.py`	Analyzer; emits `analysis.json`
`research/tr165/generate_report.py`	Auto-report generator (run-dir-local output)
`PublishReady/reports/Technical_Report_164.md`	Parent TR whose GIL-attribution hypothesis TR165 tests

Observations. The manifest now has two layers. The historical layer is the matched-pair contrast against one specific V1 run directory and recovers the H2_partial verdict, +17.91% mean N=2 efficiency delta, 19-of-24 H1 pass rate, and 2-of-6 hang-resolution count. The same-session layer is the same-build GIL flag control and recovers the 24/24 positive paired shifts, +0.1325 mean delta, and 0/24 absolute floor crossing in both arms.

The reproduction contract is the run directory plus the four invocations, not the surrounding narrative. If a future reader cannot regenerate analysis.json from metrics.csv via the analyzer invocation, the report's claims must be treated as unsupported until the discrepancy is resolved.

25. Appendix C. Per-Cell Matched-Pair Tables

This appendix exposes the per-cell substrate underlying the H2_partial verdict. Three tables resolve the 48-cell matched-pair structure at the granularity an external reader needs to re-run our pass/fail accounting: (1) per-(model x workload x phase) N=2 efficiency deltas with the H1 threshold flag, (2) the hang-resolution ledger over the six V1 deterministic-hang shapes, and (3) per-cell completion statistics. Every row is read verbatim from research/tr165/results/20260607_174748_273070/analysis.json.

25.1 Table C1. Per-(model x workload x phase) N=2 efficiency delta + H1 flag

The 24 combinations are tabulated below, sorted ascending by delta. The H1 column flags pass/fail at the per-cell improvement threshold (the threshold sits between +11.72 pp and +12.15 pp on this distribution, which is what drives the 19/5 split). The five fail rows correspond exactly to n_combinations_below_h1_threshold = 5 in falsification_verdict.

Rank	Phase	Model	Workload	V1 eff	TR165 eff	Delta	%Delta	H1
1	ttft	qwen2.5-1.5b	short_decode	60.69%	35.27%	-25.42 pp	-41.89%	fail
2	scaling	qwen2.5-1.5b	repeated_prefix	60.17%	64.63%	+4.46 pp	+7.42%	fail
3	scaling	llama3.2-1b	balanced_2k	53.05%	58.67%	+5.62 pp	+10.60%	fail
4	ttft	llama3.2-3b	repeated_prefix	47.29%	54.72%	+7.43 pp	+15.72%	fail
5	ttft	qwen2.5-1.5b	repeated_prefix	60.54%	72.26%	+11.72 pp	+19.36%	fail
6	scaling	llama3.2-3b	repeated_prefix	48.70%	60.85%	+12.15 pp	+24.95%	pass
7	ttft	llama3.2-3b	short_decode	58.90%	73.24%	+14.35 pp	+24.36%	pass
8	ttft	qwen2.5-1.5b	long_decode	59.96%	74.36%	+14.41 pp	+24.03%	pass
9	ttft	qwen2.5-1.5b	balanced_2k	61.26%	76.55%	+15.29 pp	+24.96%	pass
10	scaling	llama3.2-1b	short_decode	64.42%	80.07%	+15.65 pp	+24.30%	pass
11	scaling	qwen2.5-1.5b	short_decode	57.09%	73.42%	+16.33 pp	+28.61%	pass
12	scaling	llama3.2-1b	long_decode	54.27%	71.41%	+17.14 pp	+31.58%	pass
13	ttft	llama3.2-1b	long_decode	54.65%	71.81%	+17.16 pp	+31.40%	pass
14	scaling	qwen2.5-1.5b	long_decode	64.29%	84.09%	+19.80 pp	+30.81%	pass
15	scaling	llama3.2-3b	short_decode	55.15%	75.74%	+20.59 pp	+37.33%	pass
16	ttft	llama3.2-3b	long_decode	53.07%	75.26%	+22.20 pp	+41.83%	pass
17	ttft	llama3.2-1b	short_decode	52.79%	75.05%	+22.26 pp	+42.17%	pass
18	scaling	llama3.2-3b	long_decode	49.45%	71.71%	+22.26 pp	+45.03%	pass
19	scaling	llama3.2-1b	repeated_prefix	43.54%	67.32%	+23.78 pp	+54.62%	pass
20	scaling	llama3.2-3b	balanced_2k	48.32%	75.31%	+27.00 pp	+55.87%	pass
21	ttft	llama3.2-3b	balanced_2k	42.50%	72.39%	+29.89 pp	+70.32%	pass
22	ttft	llama3.2-1b	repeated_prefix	41.01%	72.49%	+31.48 pp	+76.75%	pass
23	ttft	llama3.2-1b	balanced_2k	33.44%	73.35%	+39.92 pp	+119.38%	pass
24	scaling	qwen2.5-1.5b	balanced_2k	61.52%	105.90%	+44.37 pp	+72.12%	pass

Observations. The pre-registered H1 rule required all 24 combinations to improve; 19 do and 5 do not, which is exactly the structure that flips the verdict from H1 to H2_partial. The mean and median deltas sit within 0.76 pp of each other, indicating the improving cells are not driven by one or two extreme outliers but by a broadly shifted distribution.

A 79.2% pass rate with mean +17.91 pp and median +17.15 pp is a strong partial signal, not a universal one. The five non-passing combinations are precisely the cells where the substrate forbids us from attributing the V1 breakdown to the GIL alone.

25.2 Table C2. Hang-resolution ledger over V1 deterministic-hang shapes

Six V1 deterministic-hang cell shapes were attempted under the free-threaded runtime; two cleared, four did not. The ledger below is read verbatim from hang_resolution_ledger in analysis.json.

Shape	Phase	Model	Workload	N	Outcome	TR165 wall (ms)	V1 baseline wall (ms)	ok_rate	Completion tokens
H1	scaling	llama3.2-3b	long_decode	16	resolved_cleanly	3,146,927	0	1.00	98,304
H2	ttft	llama3.2-3b	long_decode	16	resolved_cleanly	2,999,750	0	1.00	98,304
H3	scaling	llama3.2-3b	balanced_2k	16	hang_or_failure	null	25,341,494	null	null
H4	ttft	llama3.2-3b	balanced_2k	16	hang_or_failure	null	8,939,765	null	null
H5	scaling	llama3.2-3b	repeated_prefix	16	hang_or_failure	null	0	null	null
H6	ttft	llama3.2-3b	repeated_prefix	16	hang_or_failure	null	0	null	null

Observations. The pre-registered minimum of six attempted shapes was satisfied exactly. Two shapes cleared cleanly under nogil (both long_decode, completing in approximately 50-52 minutes each with ok_rate = 1.00 and producing 98,304 completion tokens). Four shapes reproduced the V1 hang signature even with the GIL removed (both balanced_2k and both repeated_prefix). The resolution rate is 100% on long_decode and 0% on the other two workloads.

Two of six is a non-trivial fraction but well short of full clearance. Four shapes that still hang under the free-threaded runtime are direct evidence that at least one non-GIL serialization mechanism remains on the critical path for those configurations, and the workload-specific concentration (balanced_2k and repeated_prefix both hang, long_decode both clear) localizes the residual to the cache-heavy workload paths.

25.3 Table C3. Per-cell completion statistics

The execution layer produced 3,744 metrics.csv rows across 48 planned cells with runtime_ok_for_verdict True, indicating every cell met its planned-rep budget.

Quantity	Value
Cells planned	48
Cells completed	48
metrics.csv rows	3,744
Average rows per cell	78
runtime_ok_for_verdict	True
matched_pair_contrast matched cells	50
matched_pair_contrast tr165_only cells	0
matched_pair_contrast tr164_v1_only cells	70 (V1's N={4, 8, 16} sweep deliberately not re-run)

Observations. Full completion of the 48-cell plan with a 3,744-row metrics file means no cell was dropped from the verdict for insufficient data. The matched-pair partition (50 / 0 / 70) confirms TR165 did not silently expand scope -- every TR165 cell has a V1 partner at the same coordinates, and the 70 V1-only cells correspond exactly to the V1 N={4, 8, 16} sweep that the SS2 plan deliberately excluded.

The completion ledger is clean: no cells censored, no reps short, no TR165 cells outside V1's coordinate set. Whatever the H2_partial verdict says, it says it over the full pre-registered grid, not over a subset that happened to finish.

Data Reconciliation & Cross-Version Lineage (2026-06-24)

This section ties TR165's substrate to the canonical TR164 data-lineage map (research/tr164/TR164_DATA_LINEAGE.md) and the program measurement count (BANTERHEARTS_MEASUREMENT_COUNT.md, 2026-06-24 supplement, headline ~1,348,000 primary + judge measurements across 49+ unique TR numbers). TR165 is a separate TR number but part of the TR164 Phase-8 evidence arc — the GIL-mechanism-isolation companion to TR164 V1.

(a) Provenance — including a previously-uncounted substrate this reconciliation surfaced. TR165's main free-threading run is research/tr165/results/20260607_174748_273070 (48 cells / 3,744 metric rows, counted under the 2026-06-08 measurement-doc supplement). The 2026-06-24 cross-version reconciliation surfaced that the report's same-session PYTHON_GIL=1 / PYTHON_GIL=0 control arms — research/tr165/results/20260623_165503_919943 (GIL-on) and research/tr165/results/20260623_222630_155096 (no-GIL), 2,592 metric rows each = 5,184 rows — were on disk but uncounted in both the 2026-06-08 TR165 supplement and the 2026-06-13 baseline. They are now folded into the 2026-06-24 supplement (the SSP submission's same-build GIL-flag control). Excluded as pre-run noise: two same-session 12-row single-cell smoke passes (20260623_165343_022627, 20260623_165420_527146) and one empty aborted dir (20260623_165333_438334). All row counts re-verified on disk via wc -l − 1 header.

(b) Hardware. TR165 is local-only (no Modal @app.function, no gpu=): a single RTX 4080 Laptop 12GB, GPU trace string NVIDIA GeForce RTX 4080 Laptop GPU. This is the same hardware as TR164 V1 — by design, so TR165 is a clean single-delta matched pair (only the CPython build changes: stock 3.x → free-threaded 3.14t with PYTHON_GIL toggled). The same-session control arms ran on the same RTX 4080.

(c) Cross-version lineage. TR165 isolates the GIL mechanism behind TR164 V1's uniform N=2 pytorch_direct breakdown. Verdict H2_partial: free-threading recovers 19/24 N=2 combinations and resolves 2/6 deterministic-hang cell shapes, so the GIL is a contributing mechanism but not the sole cause (CUDA-kernel serialization, memory-bandwidth contention, and dispatcher overhead survive GIL removal). The PYTHON_GIL=1/0 control arms are what let the SSP submission claim a same-build GIL control rather than a cross-build confound. See research/tr164/methodology/TR_nogil_ablation.md for the per-arm method and TR164_DATA_LINEAGE.md §4 for where TR165 sits in the arc.

Provenance note. Every row count reconciles to the row with the 2026-06-24 measurement supplement; the +5,184 GIL-control arms are part of that supplement's +11,781 increment (6,042 TR164 SSP substrate + 5,184 TR165 GIL arms + 555 TR170 pilot → ~1,348,000). The downstream artifact is referred to generically as the SSP submission to a full-depth systems venue; this section names no specific external venue, no submission identifier, and no gating language.

TR165: Python GIL Ablation for Direct PyTorch Inference Boundaries