darcar0 committed on
Commit
dab6650
·
verified ·
1 Parent(s): 4cd56f8

Rebrand public release to Quotebound 27B

README.md CHANGED
@@ -21,75 +21,88 @@ tags:
  - research
  ---

- # Evidence-Faithful Reasoning

- *Pilot 3 is the standalone model release.*

- This adapter turns its reasoning-distilled 27B base model into an
- evidence-first reader for closed packets of text. I built it because I wanted
- a release where reasoning had to prove itself: every answer has to land on the
- right evidence, quote that evidence verbatim, and stop with
- `Insufficient evidence.` when the packet does not justify a claim. The result
- is the strongest standalone model from the project, packaged here as a LoRA
- adapter you can load directly.

- ## Resources & Guides

- - [Technical brief (PDF)](./evidence_faithful_reasoning_release_brief.pdf)
- - [Technical note](./technical_note_evidence_faithful_reasoning.md)
- - [Fresh public holdout chart](./standalone_holdout_comparison.svg)
- - [Frozen benchmark progression chart](./benchmark_progression.svg)
- - [Release architecture chart](./project_release_arc.svg)
-
- ![Fresh public holdout: standalone release vs bridge](./standalone_holdout_comparison.svg)
-
- *Fresh 36-task mixed public holdout: the standalone release beats the earlier
- bridge model on task accuracy, evidence F1, and quote F1, while the
- packet-local normalizer lifts the full stack to `0.9093` quote F1.*
-
- ## Why this release exists
-
- I built this project to force reasoning models to show their work in the only
- place that counts: the evidence itself. Fluent answers were not enough. I
- wanted a model that had to retrieve the right units, quote them exactly, and
- fail closed when the packet ran out. This page leads with the standalone
- release because it is the artifact you can load immediately, inspect directly,
- and use without reconstructing the whole benchmark stack.

  ## At a glance

- - LoRA adapter on top of
- [`Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2`](https://huggingface.co/Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2).
- - Strongest standalone model from the project and the release I want people to
- download first.
- - On a fresh 36-task public holdout, raw task improves from `0.8611` to
- `0.8889`, raw strict from `0.2222` to `0.4444`, and raw quote F1 from
- `0.3343` to `0.6815` over the earlier bridge model.
- - Zero invalid outputs on every reported evaluation surface.
- - The project also produced a benchmark-winning hybrid stack, but that is a
- separate result described under *Release architecture*.

  ## Quick start

- Pilot 3 is a LoRA adapter. Load the base model and attach the adapter:

  ```python
  from peft import PeftModel
  from transformers import AutoModelForCausalLM, AutoTokenizer

  base_id = "Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2"
- adapter_id = "darcar0/evidence-faithful-reasoning-pilot-3"

  tokenizer = AutoTokenizer.from_pretrained(base_id)
  base = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
  model = PeftModel.from_pretrained(base, adapter_id)
  ```

- The base model is 27B parameters, so load it in your usual quantization.

  ## Prompt format

- This release works best with an evidence-first prompt that makes the answer
  subordinate to the cited text. A minimal version:

  ```
@@ -128,78 +141,84 @@ The model then writes a JSON object with this shape:

  ### Fresh 36-task mixed public holdout

- A held-out slice of 18 FEVER verify-claim tasks plus 18 HotpotQA grounded-QA
- tasks, drawn from public sources and de-duplicated against every training,
- dev, and `probe_v0` row.

  | Stack | Task | Strict | Evidence F1 | Quote F1 |
  |---|---:|---:|---:|---:|
  | Bridge raw | 0.8611 | 0.2222 | 0.8815 | 0.3343 |
- | Pilot 3 raw | 0.8889 | 0.4444 | 0.9093 | 0.6815 |
  | Bridge + `deterministic_v3` | 0.8611 | 0.5833 | 0.8815 | 0.8815 |
- | **Pilot 3 + `deterministic_v3`** | **0.8889** | **0.5833** | **0.9093** | **0.9093** |

- The standalone release beats the earlier bridge model on task accuracy,
- evidence F1, and quote F1 in both raw and normalized form, ties normalized
- strict, and roughly doubles raw quote F1 at the model level.

  ### Fixed dev triage slice (21 tasks)

  | Stack | Task | Strict | Evidence F1 | Quote F1 |
  |---|---:|---:|---:|---:|
- | Pilot 3 + `deterministic_v3` | 1.0000 | 0.6190 | 0.8320 | 0.7095 |

- ### Untouched 104-task Hotpot shadow slice

- Pilot 3 raw improved quote-faithful behavior over the raw bridge model on this
- slice, and pilot 3 + `deterministic_v3` matched bridge +
- `deterministic_v3` at the system level. That surface remains a narrative
- parity result because the report does not publish per-metric cells for it.

  ## Release architecture

- This project ends in two finished artifacts, not one:

- 1. **Standalone model release** — this page. Pilot 3 is the strongest
  version of the project's evidence-faithful behavior that moved into the
- model itself, evaluated across multiple non-`probe_v0` surfaces.
- 2. **Benchmark-facing hybrid stack** — bridge `checkpoint-2` plus the
- `deterministic_v3` packet-local normalizer. That stack is the benchmark
- winner and the only configuration that clears every gate on frozen held-out
- `probe_v0`.
-
- The separation is deliberate. This page is for the standalone release you can
- download now. The benchmark winner is documented here because it explains the
- project's full result, not because those perfect `probe_v0` numbers belong to
- the adapter alone.

  ## Intended use

  Use this release for work that has to stay inside a fixed body of text:

  - bounded document QA with explicit evidence requirements,
- - claim verification and grounded QA from closed evidence packets,
- - policy, compliance, contract, and internal-document workflows where each
- answer must be justified from the provided text,
  - research on evidence-faithful reasoning and abstention behavior.

  ## Limitations

- - The downloadable artifact is the LoRA adapter only. The base model is
- required.
- - The `deterministic_v3` packet-local normalizer is not included in this
- download. The benchmark-winning configuration is adapter + normalizer, while
- the adapter alone reproduces the standalone-model results shown above.
- - Perfect `probe_v0` belongs to the benchmark-facing hybrid stack, not to this
- adapter alone.
- - Specialized for closed-packet reasoning, not open-ended chat or open-domain
- QA.
- - Frozen `probe_v0` item-level contents are intentionally not published with
- the release.
-
- ## Citation
-
- References:

  - Base model:
  [Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2](https://huggingface.co/Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2)

@@ -212,11 +231,11 @@ References:

  [technical_note_evidence_faithful_reasoning.md](./technical_note_evidence_faithful_reasoning.md)

  ```bibtex
- @misc{darcar0_evidence_faithful_reasoning_pilot_3_2026,
- title = {Evidence-Faithful Reasoning: Pilot 3},
- author = {darcar0},
  year = {2026},
  howpublished = {Hugging Face model release},
- url = {https://huggingface.co/darcar0/evidence-faithful-reasoning-pilot-3}
  }
  ```
 
  - research
  ---

+ # Quotebound 27B

+ *The standalone model release from Evidence-Faithful Reasoning, built on the
+ Qwen 3.5 Opus Distilled 27B base.*

+ Quotebound 27B is the public release name for the standalone model
+ previously referred to internally as `pilot 3`. I built it as the
+ downloadable model release for Evidence-Faithful Reasoning: a LoRA adapter
+ that turns its reasoning-distilled 27B base model into an evidence-first
+ reader for closed packets of source text. Every answer has to land on the
+ right evidence units, quote them verbatim, and stop with
+ `Insufficient evidence.` when the packet does not justify a claim.

+ ![Fresh public holdout: Quotebound 27B versus the prior bridge model](./standalone_holdout_comparison.svg)

+ *On a fresh 36-task public holdout, Quotebound 27B improves task accuracy,
+ evidence F1, and quote F1 over the prior bridge model. The packet-local
+ quote normalizer carries the full stack to `0.9093` quote F1.*

  ## At a glance

+ - **What it is.** A LoRA adapter on top of
+ [`Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2`](https://huggingface.co/Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2),
+ trained to answer from closed packets of source text under a strict
+ answer–evidence–quote–abstain contract.
+ - **The headline number.** Raw quote F1 on a fresh public holdout roughly
+ doubles over the prior bridge model (`0.3343` → `0.6815`), meaning much
+ more of the grounding behavior now lives inside the model itself instead
+ of in a post-processing layer.
+ - **Other deltas on the same holdout.** Raw task: `0.8611` → `0.8889`.
+ Raw strict: `0.2222` → `0.4444`. Raw evidence F1: `0.8815` → `0.9093`.
+ Zero invalid outputs across every reported evaluation surface.
+ - **What it isn't.** Not a general chatbot. Not a replacement for the
+ benchmark-winning hybrid system, which is described below as a separate
+ result.
+
+ ## Read next
+
+ - [Technical brief (PDF)](./evidence_faithful_reasoning_release_brief.pdf) — short, citation-friendly summary.
+ - [Technical note](./technical_note_evidence_faithful_reasoning.md) — full method, results, and discussion.
+ - [Frozen benchmark progression chart](./benchmark_progression.svg)
+ - [Release architecture chart](./project_release_arc.svg)

  ## Quick start

+ Load the 27B base model and attach the adapter:

  ```python
  from peft import PeftModel
  from transformers import AutoModelForCausalLM, AutoTokenizer

  base_id = "Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2"
+ adapter_id = "darcar0/quotebound-27b"

  tokenizer = AutoTokenizer.from_pretrained(base_id)
  base = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
  model = PeftModel.from_pretrained(base, adapter_id)
  ```

+ The base is a 27B-parameter model, so load it in whichever quantization
+ your hardware supports (4-bit `bitsandbytes` works for inference).
+
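A minimal quantized-loading sketch of the step above, under stated assumptions: it presumes a CUDA machine with `bitsandbytes` installed and uses the standard `transformers` `BitsAndBytesConfig` API; the quantization settings shown are common defaults for inference, not values taken from the release docs.

```python
# Hedged sketch: 4-bit loading of the base model before attaching the adapter.
# Assumes a CUDA GPU and the bitsandbytes package; settings are illustrative.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

base_id = "Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2"
adapter_id = "darcar0/quotebound-27b"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4, a common inference choice
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 while weights stay 4-bit
)

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(
    base_id, quantization_config=bnb_config, device_map="auto"
)
model = PeftModel.from_pretrained(base, adapter_id)
```

At 4-bit, the 27B weights fit in roughly 16-20 GB of VRAM; without quantization, plan for bf16-sized memory instead.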
+ ## The contract
+
+ Each task arrives with a closed packet of source text. To count as a
+ success, the model has to clear four conditions on the same answer:
+
+ 1. **Answer correctly** — return the right answer or label for the task.
+ 2. **Pick the right evidence** — the cited units must be the packet
+ locations that actually support the answer.
+ 3. **Quote exact support** — every quote is a verbatim substring of its
+ cited unit. No paraphrase, no stitching, no ellipsis.
+ 4. **Abstain when blocked** — if the packet does not justify a claim,
+ the answer must be exactly `Insufficient evidence.`
+
+ Correctness alone is not credited. The model has been trained to fail
+ closed when the packet runs out, and to ground every answer it does
+ return.

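The four gates above can be sketched as a single checker. This is an illustration of the contract only; the field names (`answer`, `evidence`, `quotes`) and the gold format are assumptions for the sketch, not the release's actual output schema or scorer.

```python
# Illustrative four-gate contract checker. The output/gold dict shapes are
# hypothetical; the release's real schema and scorer are not published here.
ABSTAIN = "Insufficient evidence."

def passes_contract(output, packet, gold):
    """Return True only if all four gates clear on the same answer."""
    if gold["answer"] == ABSTAIN:
        # Gate 4: the packet does not justify a claim, so the model must abstain.
        return output["answer"] == ABSTAIN
    if output["answer"] != gold["answer"]:                 # Gate 1: correct answer
        return False
    if set(output["evidence"]) != set(gold["evidence"]):   # Gate 2: right units
        return False
    for unit_id, quote in output["quotes"]:                # Gate 3: verbatim quotes
        if quote not in packet.get(unit_id, ""):
            return False
    return True
```

Note that gate 3 is a plain substring check: any paraphrase, stitched span, or ellipsis fails it, which is what "fail closed" means at the quote level.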
  ## Prompt format

+ The model is trained for an evidence-first prompt that makes the answer
  subordinate to the cited text. A minimal version:

  ```

@@ -128,78 +141,84 @@

  ### Fresh 36-task mixed public holdout

+ A held-out slice of 18 FEVER verify-claim tasks plus 18 HotpotQA
+ grounded-QA tasks, drawn from public sources and de-duplicated against
+ every training, dev, and held-out probe row.

  | Stack | Task | Strict | Evidence F1 | Quote F1 |
  |---|---:|---:|---:|---:|
  | Bridge raw | 0.8611 | 0.2222 | 0.8815 | 0.3343 |
+ | Quotebound raw | 0.8889 | 0.4444 | 0.9093 | 0.6815 |
  | Bridge + `deterministic_v3` | 0.8611 | 0.5833 | 0.8815 | 0.8815 |
+ | **Quotebound + `deterministic_v3`** | **0.8889** | **0.5833** | **0.9093** | **0.9093** |

+ Quotebound 27B beats the prior bridge model on task accuracy, evidence F1,
+ and quote F1 in both raw and normalized form, ties normalized strict, and
+ roughly doubles raw quote F1 at the model level.

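For readers who want to sanity-check figures like the ones above: set-level F1 over evidence units (or quotes) is conventionally the harmonic mean of precision and recall between the predicted and gold sets. The project's exact scorer is not published on this page, so treat this as an illustrative sketch rather than the release's metric code.

```python
# Illustrative set-level F1, one common convention for evidence/quote scoring.
# Assumption: the release's actual scorer may differ (e.g. partial-credit quotes).
def set_f1(predicted, gold):
    """F1 between a predicted and a gold set of evidence units."""
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        return 1.0                      # both empty: perfect agreement by convention
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Per-task scores like this are then averaged over the holdout to produce table cells such as the `0.9093` evidence F1.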
  ### Fixed dev triage slice (21 tasks)

  | Stack | Task | Strict | Evidence F1 | Quote F1 |
  |---|---:|---:|---:|---:|
+ | Quotebound + `deterministic_v3` | 1.0000 | 0.6190 | 0.8320 | 0.7095 |

+ ### Untouched 104-task HotpotQA shadow slice

+ On a 104-task HotpotQA shadow slice that was never touched during
+ selection, Quotebound raw improved quote-faithful behavior over the prior
+ bridge model, and Quotebound plus `deterministic_v3` matched bridge +
+ `deterministic_v3` at the system level. The surface is reported as a
+ narrative parity result because the freeze memo does not publish
+ per-metric cells for it.

  ## Release architecture

+ The project ends in two finished results that are reported separately on
+ purpose. One is the strongest full system on the held-out benchmark; the
+ other is the strongest standalone model — and the artifact you can
+ actually download.

+ 1. **Quotebound 27B — this page.** The adapter above is the strongest
  version of the project's evidence-faithful behavior that moved into the
+ model itself, evaluated across multiple surfaces beyond the held-out
+ probe.
+ 2. **The benchmark-winning hybrid system.** A trained bridge checkpoint
+ plus the `deterministic_v3` packet-local quote normalizer. That stack
+ is the only configuration that clears every gate of the strict
+ contract on the frozen held-out probe (`probe_v0`).
+
+ The two results do not collapse into one. The hybrid system is the
+ benchmark winner. Quotebound 27B is the downloadable model. Perfect
+ `probe_v0` belongs to the hybrid system, not to the adapter on this page
+ alone.

  ## Intended use

  Use this release for work that has to stay inside a fixed body of text:

  - bounded document QA with explicit evidence requirements,
+ - claim verification and grounded QA from closed packets of source text,
+ - policy, compliance, contract, and internal-document workflows where
+ each answer has to be justified from the provided text,
  - research on evidence-faithful reasoning and abstention behavior.

  ## Limitations

+ - The download is the LoRA adapter only — the 27B base model is required.
+ - The `deterministic_v3` packet-local quote normalizer is *not* shipped
+ here. It lives in the project repository as a separate post-processing
+ step. Quotebound 27B alone reproduces the raw standalone gains above;
+ normalized system-level rows require adapter + normalizer.
+ - Perfect `probe_v0` belongs to the benchmark-winning hybrid system, not
+ to this adapter alone.
+ - Specialized for closed-packet reasoning. Behavior outside that setting
+ — open chat, open-domain QA, free-form generation — is not
+ characterized.
+ - Raw item-level contents of the held-out probe are intentionally not
+ published with the release; the held-out gate has to stay closed to
+ remain meaningful.
+
+ ## Citation and references

  - Base model:
  [Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2](https://huggingface.co/Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2)

  [technical_note_evidence_faithful_reasoning.md](./technical_note_evidence_faithful_reasoning.md)

  ```bibtex
+ @misc{quotebound_27b_2026,
+ title = {Quotebound 27B: Evidence-Faithful Reasoning Standalone Release},
+ author = {{darcar0}},
  year = {2026},
  howpublished = {Hugging Face model release},
+ url = {https://huggingface.co/darcar0/quotebound-27b}
  }
  ```
benchmark_progression.svg CHANGED
evidence_faithful_reasoning_release_brief.md CHANGED
@@ -1,14 +1,14 @@

- # Evidence-Faithful Reasoning

- ## Release Brief

- Released: 2026-04-07
  Author: darcar0

  Hugging Face model release:
- [`darcar0/evidence-faithful-reasoning-pilot-3`](https://huggingface.co/darcar0/evidence-faithful-reasoning-pilot-3)

- Companion files on this page:

  - [`technical_note_evidence_faithful_reasoning.md`](./technical_note_evidence_faithful_reasoning.md)
  - [`standalone_holdout_comparison.svg`](./standalone_holdout_comparison.svg)

@@ -17,53 +17,57 @@ Companion files on this page:

  ## Executive summary

- I built this project because I wanted a release where reasoning had to prove
- itself from the evidence instead of hiding behind fluent language. The result
- is a strict evidence-faithful benchmark, a benchmark-winning hybrid system,
- and pilot 3: the standalone model release, shipped here as a LoRA adapter on
- Hugging Face.

- The contract is strict. On every bounded evidence packet, the system has to:

  1. answer correctly,
- 2. identify the right evidence,
- 3. quote the exact supporting text, and
  4. abstain with `Insufficient evidence.` when the packet does not justify a
  claim.

- This release ends in two finished outputs. The benchmark-facing winner is a
- hybrid stack — bridge `checkpoint-2` plus `deterministic_v3` packet-local
- quote normalization — that clears every gate on the frozen held-out
- `probe_v0` benchmark. The main downloadable artifact is pilot 3, the
- standalone model release, which beats the earlier bridge model on a fresh
- mixed public holdout and roughly doubles raw quote-faithful behavior at the
- model level.

- ## Standalone model release

- Pilot 3 is the strongest standalone model from the project and the release I
- want people to open first. It is the first standalone checkpoint in the
- project that holds up across multiple non-`probe_v0` evaluation surfaces.

  Fresh 36-task mixed public holdout:

  | Stack | Task | Strict | Evidence F1 | Quote F1 |
  |---|---:|---:|---:|---:|
  | Bridge raw | 0.8611 | 0.2222 | 0.8815 | 0.3343 |
- | Pilot 3 raw | 0.8889 | 0.4444 | 0.9093 | 0.6815 |
  | Bridge + `deterministic_v3` | 0.8611 | 0.5833 | 0.8815 | 0.8815 |
- | **Pilot 3 + `deterministic_v3`** | **0.8889** | **0.5833** | **0.9093** | **0.9093** |

- Pilot 3 beats bridge on task accuracy, evidence F1, and quote F1 in both raw
- and normalized form, ties normalized strict, and roughly doubles raw quote F1
- (`0.3343` → `0.6815`) at the model level.

  ## Benchmark-facing winner

  The benchmark-facing hybrid stack is the strongest full system from the
- project. Training moved the model past the older frozen baseline.
- Deterministic packet-local normalization closed the remaining quote-faithful
- gap without leaving the bounded-packet setting.

  | Metric | Frozen probe_v0 |
  |---|---:|

@@ -78,27 +82,30 @@ gap without leaving the bounded-packet setting.

  ## Project arc and stopping point

- The release has two public faces: a standalone model you can load directly and
- a benchmark-facing hybrid stack that closes the last quote-faithfulness gap.
- That split is part of the project story, not something hidden behind the fine
- print.

- A targeted follow-up, pilot 4, fixed one specific FEVER
- month/date temporal-insufficiency case but weakened broader behavior on larger
- evaluation surfaces. I treated that as a stop signal rather than churning for
- one more pilot. The release froze at the point where the standalone model was
- strongest and the benchmark-facing system was already complete.

  ## Intended use and boundaries

- This is a specialized grounded reasoning release, not a general-purpose
- chatbot replacement. It is built for bounded document QA, claim verification,
- policy/compliance workflows, and other settings where every answer has to be
- justified from a closed body of text.

  Important boundaries:

- - Perfect `probe_v0` belongs to the hybrid stack, not to pilot 3 alone.

  - The Hugging Face download is the LoRA adapter only; the benchmark-winning
  configuration is adapter + `deterministic_v3`.
  - Frozen `probe_v0` item-level contents are intentionally not published with

@@ -106,10 +113,10 @@ Important boundaries:

  ## Release surfaces

- - Pilot 3 on Hugging Face:
- [`darcar0/evidence-faithful-reasoning-pilot-3`](https://huggingface.co/darcar0/evidence-faithful-reasoning-pilot-3)
  - Technical note:
- [`docs/technical_note_evidence_faithful_reasoning.md`](./technical_note_evidence_faithful_reasoning.md)
  - Fresh public holdout chart:
  [`standalone_holdout_comparison.svg`](./standalone_holdout_comparison.svg)
  - Frozen benchmark progression chart:
 
+ # Quotebound 27B

+ ## Evidence-Faithful Reasoning Release Brief

+ Released: 2026-04-07
  Author: darcar0

  Hugging Face model release:
+ [`darcar0/quotebound-27b`](https://huggingface.co/darcar0/quotebound-27b)

+ Companion files:

  - [`technical_note_evidence_faithful_reasoning.md`](./technical_note_evidence_faithful_reasoning.md)
  - [`standalone_holdout_comparison.svg`](./standalone_holdout_comparison.svg)

  ## Executive summary

+ Quotebound 27B is the standalone model release from Evidence-Faithful
+ Reasoning: a research-engineering project on reasoning that has to stay
+ recoverable from the source text rather than asserted on top of it. The
+ project ships a strict benchmark, the hybrid system that clears that
+ benchmark under the full contract, and a standalone model trained to carry
+ the same behavior on its own.

+ The contract is strict. On every closed packet of source text, the system
+ has to:

  1. answer correctly,
+ 2. cite the right evidence units,
+ 3. quote those units verbatim, and
  4. abstain with `Insufficient evidence.` when the packet does not justify a
  claim.

+ The project ends in two finished results from one frame. The
+ benchmark-facing winner is a hybrid stack — bridge `checkpoint-2` plus
+ `deterministic_v3` packet-local quote normalization — that clears every
+ gate on the frozen held-out `probe_v0` benchmark. The downloadable artifact
+ is Quotebound 27B on Hugging Face, which beats the earlier bridge model on
+ a fresh mixed public holdout and roughly doubles raw quote-faithful
+ behavior at the model level.

+ ## Quotebound 27B

+ Quotebound 27B is the public release name for the standalone model
+ previously referred to internally as `pilot 3`. It is the strongest
+ standalone model the project produced and the artifact most readers will
+ load first. It is the first standalone checkpoint in the project to hold
+ up across multiple evaluation surfaces beyond the held-out probe.

  Fresh 36-task mixed public holdout:

  | Stack | Task | Strict | Evidence F1 | Quote F1 |
  |---|---:|---:|---:|---:|
  | Bridge raw | 0.8611 | 0.2222 | 0.8815 | 0.3343 |
+ | Quotebound raw | 0.8889 | 0.4444 | 0.9093 | 0.6815 |
  | Bridge + `deterministic_v3` | 0.8611 | 0.5833 | 0.8815 | 0.8815 |
+ | **Quotebound + `deterministic_v3`** | **0.8889** | **0.5833** | **0.9093** | **0.9093** |

+ Quotebound 27B beats the prior bridge model on task accuracy, evidence F1,
+ and quote F1 in both raw and normalized form, ties normalized strict, and
+ roughly doubles raw quote F1 (`0.3343` → `0.6815`) at the model level.

  ## Benchmark-facing winner

  The benchmark-facing hybrid stack is the strongest full system from the
+ project. Training moved the model past the older frozen baseline; the
+ deterministic packet-local normalizer was the finishing repair, closing the
+ remaining quote-faithful gap without leaving the closed-packet boundary.

  | Metric | Frozen probe_v0 |
  |---|---:|

  ## Project arc and stopping point

+ The release has two public faces: Quotebound 27B, the standalone model that
+ loads directly from Hugging Face, and a benchmark-facing hybrid stack that
+ closes the last quote-faithfulness gap on the frozen held-out probe. The
+ split is part of the project story, not hidden behind the fine print.

+ A narrow follow-up — internally `pilot 4` — fixed one specific FEVER
+ month/date temporal-insufficiency case but weakened broader behavior on
+ larger evaluation surfaces. The project read that outcome as a stop signal
+ rather than running additional local fixes, and froze at the point where
+ Quotebound 27B was strongest and the benchmark-facing system was already
+ complete.

  ## Intended use and boundaries

+ This is a release for reasoning over closed packets of source text, not a
+ general-purpose chatbot replacement. It is built for bounded document QA,
+ claim verification, policy and compliance review, contract reading, and
+ other settings where every answer has to be justified from a fixed body of
+ text.

  Important boundaries:

+ - Perfect `probe_v0` belongs to the hybrid stack, not to the standalone
+ adapter alone.
  - The Hugging Face download is the LoRA adapter only; the benchmark-winning
  configuration is adapter + `deterministic_v3`.
  - Frozen `probe_v0` item-level contents are intentionally not published with

  ## Release surfaces

+ - Quotebound 27B on Hugging Face:
+ [`darcar0/quotebound-27b`](https://huggingface.co/darcar0/quotebound-27b)
  - Technical note:
+ [`technical_note_evidence_faithful_reasoning.md`](./technical_note_evidence_faithful_reasoning.md)
  - Fresh public holdout chart:
  [`standalone_holdout_comparison.svg`](./standalone_holdout_comparison.svg)
  - Frozen benchmark progression chart:
evidence_faithful_reasoning_release_brief.pdf CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:c425eff778b5c1c17aebf1c67a4e63f71908d729f3a73a6ee0c3bf053fe63aca
- size 317277

  version https://git-lfs.github.com/spec/v1
+ oid sha256:4016a00664573e528b22b0d93a4b0d5ebe740c0c6f4a4e9fd6f6874a02c55d25
+ size 243746
project_release_arc.svg CHANGED
standalone_holdout_comparison.svg CHANGED
technical_note_evidence_faithful_reasoning.md CHANGED
@@ -1,90 +1,104 @@
1
- # Technical Note: Evidence-Faithful Reasoning Over Bounded Evidence Packets
2
 
3
  Updated: 2026-04-07
4
 
5
  ## Abstract
6
 
7
- This note describes a research-engineering project on strict evidence-faithful
8
- reasoning in open reasoning-distilled language models. The system answers from
9
- a closed packet of source text and must satisfy four conditions at once:
10
- answer correctly, identify the right evidence units, quote exact supporting
11
- text, and abstain with `Insufficient evidence.` when the packet does not
12
- justify a claim. The project produced two finished outputs from one frame: a
13
- benchmark-winning hybrid system that solves the frozen held-out `probe_v0`
14
- benchmark under the full contract, and pilot 3, the strongest standalone model
15
- artifact from the project, released as a LoRA adapter on Hugging Face. A
16
- narrowly targeted follow-up checkpoint, pilot 4, was rejected as a stop signal
17
- after it traded one local fix for broader regressions. The hybrid stack is the
18
- benchmark-facing release; pilot 3 is the standalone model release.
19
-
20
- **Keywords:** evidence-faithful reasoning, grounded QA, claim verification,
21
- bounded evidence packets, abstention, attribution.
 
 
 
 
 
22
 
23
  ## 1. Problem
24
 
25
- Many language models can produce correct-looking answers while grounding
26
- poorly. In practice, that failure shows up as one or more of the following:
 
27
 
28
- - the answer is right but the cited evidence is wrong
29
- - the quote is too broad, too narrow, or not recoverable from the source text
30
- - the model fails to abstain when the packet does not justify the answer
31
- - the output sounds persuasive even when the evidence is insufficient
 
 
 
32
 
33
- The question is not whether the model can answer from a packet. The question
34
- is whether it can answer **faithfully** β€” with recoverable support and the
35
- correct abstention behavior when the packet falls short.
 
36
 
37
  ## 2. Benchmark contract
38
 
39
- For each task on a bounded evidence packet, the system must clear all four
40
- gates at once:
41
 
42
- 1. **Answer correctly** β€” the right answer or label for the task.
43
- 2. **Pick the right evidence** β€” the cited evidence units must be the packet
44
  locations that actually support the answer.
45
- 3. **Quote exact support** β€” every quote has to be a verbatim substring of
46
- the cited unit; no paraphrase, no stitching, no ellipsis.
47
- 4. **Abstain when blocked** β€” if the packet does not justify a claim, the
48
- answer must be exactly `Insufficient evidence.`
49
 
50
- Correctness alone does not count as success.
51
 
52
  The frozen held-out benchmark surface for the final release cycle is
53
- `data/probe_v0/`. It remained frozen throughout the cycle and was not used as
54
- a tuning surface.
55
 
56
  ## 3. Method
57
 
58
- The project followed a full research-engineering loop:
59
-
60
- 1. build the evaluation harness and frozen baselines
61
- 2. produce a baseline table and failure mapping
62
- 3. test prompt and structure variants
63
- 4. build a training-backed intervention path
64
- 5. add deterministic packet-local quote normalization where the model still
65
- underperformed the strict contract
66
- 6. rerun held-out evaluation on a protected split
67
- 7. run a teacher-student distillation cycle to push the winning behavior into
68
- the model itself
69
- 8. stop when the strongest release artifact emerged
70
-
71
- Train and dev surfaces are public-data-backed and derived from FEVER-style
72
- verify-claim data, HotpotQA-style grounded QA data, and project-local packet
73
- scaffolding built on top of those upstream sources.
 
 
74
 
75
  ## 4. Benchmark-facing system

- The benchmark-facing release is a hybrid system:

  - **Bridge checkpoint:** `outputs/sft_v1_v2_qlora_bridge_v1_run1/checkpoint-2`
- - **Deterministic packet-local normalization:**
  `python3 scripts/normalize_quotes.py --mode deterministic_v3`

- Both parts are necessary. The training step closes the gap between the older
- frozen baseline and the strict contract on task and evidence accuracy. The
- deterministic normalizer is the finishing move that closes the contract on
- quote faithfulness and strict grounded success without leaving the closed
- packet.

  ## 5. Benchmark-facing result

@@ -105,107 +119,112 @@ label accuracy `1.0000`, grounded QA accuracy `1.0000`, contrastive
  consistency `1.0000`, invalid / missing rate `0.0000`. Canonical memo:
  `reports/sft_v1_final_artifact_status.md`.

- ## 6. Standalone model: pilot 3

  After the hybrid stack solved the benchmark, the project asked a second
  question inside the same frame: how much of that winning behavior can be
- moved into the model itself, evaluated outside `probe_v0`?

- That question produced pilot 3: a teacher-student distillation checkpoint
- released as a LoRA adapter on top of
  [`Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2`](https://huggingface.co/Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2).

  - **Internal checkpoint:**
  `outputs/sft_v1_v2_teacher_distill_pilot_v3_partialdev/checkpoint-16`
  - **Public release identity:**
- [`darcar0/evidence-faithful-reasoning-pilot-3`](https://huggingface.co/darcar0/evidence-faithful-reasoning-pilot-3)

- Pilot 3 is the strongest standalone model artifact from the project. It is
- the first standalone checkpoint to hold up across multiple non-`probe_v0`
- evaluation surfaces, and it roughly doubles raw quote-faithful behavior over
- the earlier bridge model on a fresh public holdout. Selection happened
  entirely off `probe_v0`; the held-out gate stayed frozen.

- ## 7. Standalone model: results

  ### 7.1 Fresh 36-task mixed public holdout

- A held-out slice of 18 FEVER verify-claim tasks plus 18 HotpotQA grounded QA
  tasks, drawn from public sources and de-duplicated against every training,
  dev, and `probe_v0` row.

  | Stack | Task | Strict | Evidence F1 | Quote F1 |
  |---|---:|---:|---:|---:|
  | Bridge raw | 0.8611 | 0.2222 | 0.8815 | 0.3343 |
- | Pilot 3 raw | 0.8889 | 0.4444 | 0.9093 | 0.6815 |
  | Bridge + `deterministic_v3` | 0.8611 | 0.5833 | 0.8815 | 0.8815 |
- | **Pilot 3 + `deterministic_v3`** | **0.8889** | **0.5833** | **0.9093** | **0.9093** |

  See [Figure 2: standalone holdout comparison](./standalone_holdout_comparison.svg).

- Pilot 3 beats bridge on task accuracy, evidence F1, and quote F1 in both raw
- and normalized form, ties normalized strict, and roughly doubles raw quote F1
- at the model level. Grounded-QA accuracy on this slice is `1.0000` for both
- stacks; zero invalid outputs across every reported evaluation surface.
- Canonical memo:
  `reports/standalone_model_v2_holdout_v1_bridge_vs_pilot3_status.md`.

  ### 7.2 Fixed dev triage slice (21 tasks)

  | Stack | Task | Strict | Evidence F1 | Quote F1 |
  |---|---:|---:|---:|---:|
- | Pilot 3 + `deterministic_v3` | 1.0000 | 0.6190 | 0.8320 | 0.7095 |

- ### 7.3 Untouched 104-task Hotpot shadow slice

- Pilot 3 raw improved quote-faithful behavior over raw bridge, and pilot 3 +
- `deterministic_v3` matched bridge + `deterministic_v3` at the system level on
- this slice. Reported as a parity outcome in the standalone freeze memo
- (`reports/standalone_model_v2_freeze_memo.md`); per-metric numbers were not
- recorded for this surface, so it stands as a narrative parity result, not a
- table cell.

- ## 8. Stop signal: pilot 4

- A targeted follow-up, pilot 4, was built to fix one specific FEVER
- month/date temporal-insufficiency error. It fixed that single row but
- weakened broader behavior on the larger evaluation surfaces. That outcome
- made pilot 4 useful as a stop signal: further local-fix iteration was
- trading visible gains for wider regressions. The project froze at pilot 3
- as the strongest standalone model. Canonical memo:
  `reports/sft_v1_v2_teacher_distill_pilot_v4_partialdev_status.md`.

  See [Figure 3: project release arc](./project_release_arc.svg) for the full
- arc from baseline through pilot 4 stop signal.

  ## 9. Discussion

- **Why the release shape is hybrid + standalone, not one or the other.** The
- strict contract has four gates. Training alone closed the task and evidence
- gates on `probe_v0`, but bridge `checkpoint-2` did not close the strict
- grounded success gate by itself: raw bridge was at strict `0.2727` on
- `probe_v0`. Deterministic packet-local normalization was the finishing move
- that closed the remaining gates without escaping the closed-packet boundary.
- That is what made the hybrid stack the strongest benchmark-facing result and
- why it is the benchmark-facing release.
-
- **What pilot 3 demonstrates.** The teacher-student cycle that produced pilot
- 3 asked how much of that winning behavior could move into the model itself,
- evaluated entirely off `probe_v0`. The fresh-holdout deltas show that the
- model-side gain is real: raw pilot 3 roughly doubles raw bridge on quote F1
- (`0.3343` → `0.6815`) and beats raw bridge on task accuracy and evidence F1.
- This is the first standalone model in the project that holds up across
- multiple non-`probe_v0` surfaces; calling it the standalone release artifact
- is faithful to that evidence.
-
- **Why pilot 4 is informative.** Pilot 4 was a deliberate narrow refinement.
- It worked on the one row it targeted and regressed broader behavior. Reading
- that result as a stop signal — rather than running additional local fixes —
- is the discipline the project decided to keep visible in the release trail.

  **Distinction held throughout.** Perfect frozen `probe_v0` belongs to the
- hybrid stack, not pilot 3 alone. The release ships both results without
- collapsing them into one claim.

  ## 10. Contributions

@@ -217,35 +236,35 @@ collapsing them into one claim.
  contributed.
  3. A finished benchmark-winning hybrid artifact: bridge `checkpoint-2` plus
  `deterministic_v3` packet-local normalization.
- 4. A standalone model release — pilot 3, a LoRA adapter on a 27B
- reasoning-distilled base — that is the first standalone checkpoint to
- hold up across multiple non-`probe_v0` evaluation surfaces and that
- roughly doubles raw quote-faithful behavior over the earlier bridge.
- 5. A stop signal — pilot 4 — that documents the point at which further
- local-fix iteration began trading visible gains for wider regressions.

  ## 11. Intended use and limitations

- **Intended use.** Specialized grounded reasoning over bounded evidence
- packets: bounded document QA, claim verification, policy and compliance
- review, contract reading, and other workflows where every answer has to be
- justified from a closed body of text. Also: research on evidence-faithful
- reasoning and abstention behavior.

  **Limitations.**

- - This is **not** a general-purpose chatbot replacement. Performance outside
  the closed-packet setting is not characterized.
- - Perfect `probe_v0` belongs to the **hybrid stack**, not to pilot 3 alone.
- Treat `1.0000` numbers on `probe_v0` as the hybrid stack's result.
- - The downloadable artifact for pilot 3 is the LoRA adapter only.
- `deterministic_v3` is a separate post-processing step that lives in the
- project repository; the benchmark-winning configuration is *adapter +
- normalizer*.
  - Frozen `probe_v0` item-level contents are intentionally not published with
  the release in order to preserve the held-out gate.
  - Perfect `probe_v0` is not proof of general faithful reasoning. It is proof
- that the system meets the strict contract on a single frozen bounded
  benchmark.

  ## 12. Surfaces

@@ -255,8 +274,7 @@ Canonical project surfaces:
  - final artifact memo: `reports/sft_v1_final_artifact_status.md`
  - standalone freeze memo: `reports/standalone_model_v2_freeze_memo.md`
  - fresh holdout comparison: `reports/standalone_model_v2_holdout_v1_bridge_vs_pilot3_status.md`
- - pilot 4 stop-signal memo: `reports/sft_v1_v2_teacher_distill_pilot_v4_partialdev_status.md`
  - release brief: `evidence_faithful_reasoning_release_brief.pdf`
- - release page: `index.html`
- - model card: `huggingface_model_card_README.md`
- - Hugging Face release: [`darcar0/evidence-faithful-reasoning-pilot-3`](https://huggingface.co/darcar0/evidence-faithful-reasoning-pilot-3)

+ # Technical Note: Evidence-Faithful Reasoning Over Closed Evidence Packets

  Updated: 2026-04-07

  ## Abstract

+ This note describes a research-engineering project on strict
+ evidence-faithful reasoning in open reasoning-distilled language models.
+ The system answers from a closed packet of source text and is scored
+ against a four-condition contract: answer correctly, cite the right
+ packet units as evidence, quote those units verbatim, and abstain with
+ `Insufficient evidence.` when the packet does not justify a claim. The
+ project ends in two finished results from the same frame. The first is a
+ hybrid system — a trained bridge checkpoint plus a packet-local quote
+ normalizer — that clears every gate of the contract on the frozen
+ held-out probe (`probe_v0`). The second is Quotebound 27B, the public
+ release name for the standalone model previously referred to internally as
+ `pilot 3`, released as a LoRA adapter on Hugging Face
+ ([`darcar0/quotebound-27b`](https://huggingface.co/darcar0/quotebound-27b)).
+ A narrow follow-up — internally `pilot 4` — was rejected as a stop signal
+ after it traded one local fix for broader regressions. The hybrid system
+ is the benchmark-facing release; Quotebound 27B is the downloadable model
+ release.
+
+ **Keywords:** evidence-faithful reasoning, grounded QA, claim
+ verification, closed-packet reasoning, abstention, attribution.

  ## 1. Problem

+ Language models routinely produce correct-looking answers while grounding
+ poorly. In practice, that failure mode shows up as one or more of the
+ following:

+ - the answer is right but the cited evidence is wrong;
+ - the quote is too broad, too narrow, or not recoverable from the source
+ text at all;
+ - the model fails to abstain when the packet does not justify the
+ answer;
+ - the output reads as persuasive even when the underlying evidence is
+ insufficient.

+ The question this project asks is not whether a language model can
+ answer from a closed packet. It is whether the model can answer
+ **faithfully** — with recoverable support, and with the correct
+ abstention behavior when the packet falls short.

  ## 2. Benchmark contract

+ Each task arrives with a closed packet of source text. To count as a
+ success, the system has to clear four conditions on the same answer:

+ 1. **Answer correctly** — return the right answer or label for the task.
+ 2. **Pick the right evidence** — the cited units must be the packet
  locations that actually support the answer.
+ 3. **Quote exact support** — every quote is a verbatim substring of its
+ cited unit. No paraphrase, no stitching, no ellipsis.
+ 4. **Abstain when blocked** — if the packet does not justify a claim,
+ the answer must be exactly `Insufficient evidence.`

+ Correctness alone is not credited.
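The four gates read as a single strict check. As a sketch only (the record shapes `answer`, `evidence`, and `quotes` are assumptions made for illustration, not the project's actual harness schema), the contract can be expressed as:

```python
def strict_success(pred: dict, gold: dict, packet: dict) -> bool:
    """Illustrative strict-contract check. Field names are assumed for
    this sketch; the project's harness defines the real schema."""
    abstain = "Insufficient evidence."
    # Gate 4: when the packet does not justify a claim, only the exact
    # abstention string is credited.
    if gold["answer"] == abstain:
        return pred["answer"] == abstain
    # Gate 1: the answer or label itself must be right.
    if pred["answer"] != gold["answer"]:
        return False
    # Gate 2: cited units must be the supporting packet locations.
    if set(pred["evidence"]) != set(gold["evidence"]):
        return False
    # Gate 3: every quote is a verbatim substring of its cited unit.
    return all(quote in packet.get(unit, "")
               for unit, quote in pred["quotes"])
```

Gate 1 passing while gates 2 through 4 fail is exactly the "correctness alone" case the contract refuses to credit.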
 
  The frozen held-out benchmark surface for the final release cycle is
+ `data/probe_v0/`. It stayed frozen throughout the cycle and was never
+ used as a tuning surface.

  ## 3. Method

+ The project worked through a full research-engineering loop. The eval
+ harness and a frozen baseline came first; baselines were scored end-to-end
+ under the strict contract before any intervention was tried, and a
+ failure mapping over those baseline runs identified where the model was
+ losing each gate. From there, prompt and structure variants were tested
+ against the same harness, and a training-backed intervention path was
+ built on top of the strongest learned configuration. Where the model
+ still underperformed the contract, a deterministic packet-local quote
+ normalizer was layered on as a finishing repair, and held-out evaluation
+ was rerun on the protected split. Once the hybrid stack solved the
+ held-out probe, a teacher-student distillation cycle pushed the winning
+ behavior back into the model itself, evaluated entirely off the held-out
+ probe. The cycle was stopped when the strongest standalone release
+ artifact emerged and a focused follow-up showed clear regressions.
+
+ Train and dev surfaces are derived from public FEVER-style
+ verify-claim data, public HotpotQA-style grounded-QA data, and
+ project-local packet scaffolding built on top of those upstream sources.

  ## 4. Benchmark-facing system

+ The benchmark-facing release is a hybrid of a learned model and a
+ deterministic post-processing step:

  - **Bridge checkpoint:** `outputs/sft_v1_v2_qlora_bridge_v1_run1/checkpoint-2`
+ - **Packet-local quote normalizer:**
  `python3 scripts/normalize_quotes.py --mode deterministic_v3`

+ Both parts are necessary. Training closes the gap between the older
+ frozen baseline and the strict contract on task and evidence accuracy.
+ The packet-local normalizer is the finishing repair: it closes the
+ contract on quote faithfulness and strict grounded success without
+ leaving the closed-packet boundary, by correcting quote spans against
+ the cited units inside each packet.
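The actual repair rules of `deterministic_v3` live in `scripts/normalize_quotes.py`. As a hedged sketch of the packet-local idea only (the use of `difflib` here is an assumption, not the release's rule set), one such repair keeps verbatim quotes untouched and replaces a near-miss quote with the longest contiguous span it shares with its cited unit:

```python
import difflib

def repair_quote(quote: str, unit_text: str) -> str:
    """Packet-local quote repair (illustrative only, not deterministic_v3).

    A quote that is already a verbatim substring of its cited unit passes
    through unchanged. Otherwise the longest contiguous block shared with
    the unit is returned, so the result is always recoverable from the
    packet and never imports text from outside it.
    """
    if quote in unit_text:
        return quote
    matcher = difflib.SequenceMatcher(None, unit_text, quote, autojunk=False)
    m = matcher.find_longest_match(0, len(unit_text), 0, len(quote))
    return unit_text[m.a:m.a + m.size]
```

The property that matters for the contract is that the output never contains text absent from the cited unit; any repair rule with that guarantee stays inside the closed packet.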
 
  ## 5. Benchmark-facing result

  consistency `1.0000`, invalid / missing rate `0.0000`. Canonical memo:
  `reports/sft_v1_final_artifact_status.md`.

+ ## 6. Quotebound 27B

  After the hybrid stack solved the benchmark, the project asked a second
  question inside the same frame: how much of that winning behavior can be
+ moved into the model itself, evaluated on surfaces outside the held-out
+ probe?

+ That question produced Quotebound 27B — the public release name for the
+ standalone model previously referred to internally as `pilot 3` — a
+ teacher-student distillation checkpoint published as a LoRA adapter on
+ top of
  [`Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2`](https://huggingface.co/Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2).

  - **Internal checkpoint:**
  `outputs/sft_v1_v2_teacher_distill_pilot_v3_partialdev/checkpoint-16`
  - **Public release identity:**
+ [`darcar0/quotebound-27b`](https://huggingface.co/darcar0/quotebound-27b)

+ This is the strongest standalone model artifact from the project. It is the
+ first standalone checkpoint to hold up across multiple evaluation surfaces
+ beyond `probe_v0`, and on a fresh public holdout it roughly doubles raw
+ quote-faithful behavior over the earlier bridge model. Selection happened
  entirely off `probe_v0`; the held-out gate stayed frozen.

+ ## 7. Quotebound 27B results

  ### 7.1 Fresh 36-task mixed public holdout

+ A held-out slice of 18 FEVER verify-claim tasks plus 18 HotpotQA grounded-QA
  tasks, drawn from public sources and de-duplicated against every training,
  dev, and `probe_v0` row.

  | Stack | Task | Strict | Evidence F1 | Quote F1 |
  |---|---:|---:|---:|---:|
  | Bridge raw | 0.8611 | 0.2222 | 0.8815 | 0.3343 |
+ | Quotebound raw | 0.8889 | 0.4444 | 0.9093 | 0.6815 |
  | Bridge + `deterministic_v3` | 0.8611 | 0.5833 | 0.8815 | 0.8815 |
+ | **Quotebound + `deterministic_v3`** | **0.8889** | **0.5833** | **0.9093** | **0.9093** |

  See [Figure 2: standalone holdout comparison](./standalone_holdout_comparison.svg).

+ Quotebound 27B beats the prior bridge model on task accuracy, evidence F1,
+ and quote F1 in both raw and normalized form, matches it on normalized
+ strict success, and roughly doubles raw quote F1 at the model level.
+ Grounded-QA accuracy on this slice is `1.0000` for both stacks; zero
+ invalid outputs across every reported evaluation surface. Canonical memo:
  `reports/standalone_model_v2_holdout_v1_bridge_vs_pilot3_status.md`.
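For readers reproducing these columns: the exact metric definitions live in the project's evaluation harness, so the following is one common construction assumed here for illustration. Per task, the predicted evidence units (or quotes) are scored against gold as set-overlap F1, with the empty-versus-empty case credited as 1.0 so a correct abstention is not penalized:

```python
def set_f1(pred, gold) -> float:
    """Set-overlap F1 between predicted and gold items, e.g. evidence
    unit ids or quote strings (illustrative; not the project's harness).

    Both sides empty scores 1.0, matching the convention that a correct
    abstention has nothing to cite and should not be penalized.
    """
    pred, gold = set(pred), set(gold)
    if not pred and not gold:
        return 1.0
    true_pos = len(pred & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Under this construction a task where one of two gold units is cited, with no spurious extras, scores 2/3, and the table values would be per-task scores averaged over the 36 tasks.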
 
  ### 7.2 Fixed dev triage slice (21 tasks)

  | Stack | Task | Strict | Evidence F1 | Quote F1 |
  |---|---:|---:|---:|---:|
+ | Quotebound + `deterministic_v3` | 1.0000 | 0.6190 | 0.8320 | 0.7095 |

+ ### 7.3 Untouched 104-task HotpotQA shadow slice

+ The raw standalone adapter improved quote-faithful behavior over raw bridge,
+ and Quotebound plus `deterministic_v3` matched bridge + `deterministic_v3`
+ at the system level on this slice. The standalone freeze memo
+ (`reports/standalone_model_v2_freeze_memo.md`) reports it as a parity
+ outcome; per-metric numbers were not recorded for this surface, so it stands
+ as a narrative parity result rather than a table cell.

+ ## 8. Stop-signal follow-up

+ A narrow follow-up — internally `pilot 4` — was built to fix one specific
+ FEVER month/date temporal-insufficiency error. It fixed that single row but
+ weakened broader behavior on the larger evaluation surfaces. The project read
+ that outcome as a stop signal: further local-fix iteration was trading visible
+ gains for wider regressions, so the standalone release froze at the prior
+ checkpoint, Quotebound 27B. Canonical memo:
  `reports/sft_v1_v2_teacher_distill_pilot_v4_partialdev_status.md`.

  See [Figure 3: project release arc](./project_release_arc.svg) for the full
+ arc from baseline through the stop signal.

  ## 9. Discussion

+ **Why the release shape is hybrid plus standalone.** The strict contract has
+ four gates. Training alone closed the task and evidence gates on `probe_v0`,
+ but bridge `checkpoint-2` did not close the strict grounded success gate by
+ itself: raw bridge sat at strict `0.2727`. Deterministic packet-local
+ normalization was the finishing move that closed the remaining gates without
+ escaping the closed-packet boundary. That is what made the hybrid stack the
+ strongest benchmark-facing result, and why it is the benchmark-facing
+ release.
+
+ **What Quotebound 27B demonstrates.** The teacher-student cycle that
+ produced the standalone adapter asked how much of that winning behavior could
+ move into the model itself, evaluated entirely off `probe_v0`. The
+ fresh-holdout deltas show that the model-side gain is real: the raw
+ standalone roughly doubles raw bridge on quote F1 (`0.3343` → `0.6815`) and
+ beats it on task accuracy and evidence F1. This is the first standalone
+ model in the project to hold up across multiple evaluation surfaces beyond
+ the held-out probe; calling it the standalone release artifact is faithful
+ to that evidence.
+
+ **Why the stop signal matters.** The follow-up was a deliberate narrow
+ refinement. It worked on the one row it targeted and regressed broader
+ behavior. Reading that result as a stop signal — rather than running
+ additional local fixes — is the discipline the project decided to keep
+ visible in the release trail.

  **Distinction held throughout.** Perfect frozen `probe_v0` belongs to the
+ hybrid stack, not to the standalone adapter alone. The release ships both
+ results without collapsing them into one claim.

  ## 10. Contributions

  contributed.
  3. A finished benchmark-winning hybrid artifact: bridge `checkpoint-2` plus
  `deterministic_v3` packet-local normalization.
+ 4. A standalone model release — Quotebound 27B, a LoRA adapter on a 27B
+ reasoning-distilled base — that is the first standalone checkpoint in
+ the project to hold up across multiple evaluation surfaces beyond
+ `probe_v0`, and that roughly doubles raw quote-faithful behavior over
+ the earlier bridge.
+ 5. A documented stop signal that marks the point at which further local-fix
+ iteration began trading visible gains for wider regressions.

  ## 11. Intended use and limitations

+ **Intended use.** Closed-packet reasoning workflows: bounded document QA,
+ claim verification, policy and compliance review, contract reading, and
+ other settings where every answer has to be justified from a fixed body of
+ text. Also: research on evidence-faithful reasoning and abstention behavior.

  **Limitations.**

+ - This is **not** a general-purpose chatbot replacement. Behavior outside
  the closed-packet setting is not characterized.
+ - Perfect `probe_v0` belongs to the **hybrid stack**, not to the standalone
+ adapter alone. Treat `1.0000` numbers on `probe_v0` as the hybrid stack's
+ result.
+ - The downloadable artifact is the LoRA adapter only. `deterministic_v3` is
+ a separate post-processing step that lives in the project repository; the
+ benchmark-winning configuration is *adapter + normalizer*.
  - Frozen `probe_v0` item-level contents are intentionally not published with
  the release in order to preserve the held-out gate.
  - Perfect `probe_v0` is not proof of general faithful reasoning. It is proof
+ that the system meets the strict contract on a single frozen closed-packet
  benchmark.

  ## 12. Surfaces

  - final artifact memo: `reports/sft_v1_final_artifact_status.md`
  - standalone freeze memo: `reports/standalone_model_v2_freeze_memo.md`
  - fresh holdout comparison: `reports/standalone_model_v2_holdout_v1_bridge_vs_pilot3_status.md`
+ - stop-signal memo: `reports/sft_v1_v2_teacher_distill_pilot_v4_partialdev_status.md`
  - release brief: `evidence_faithful_reasoning_release_brief.pdf`
+ - model card: `README.md`
+ - Hugging Face release: [`darcar0/quotebound-27b`](https://huggingface.co/darcar0/quotebound-27b)