darcar0 committed on
Commit
dab6650
·
verified ·
1 Parent(s): 4cd56f8

Rebrand public release to Quotebound 27B

README.md CHANGED
@@ -21,75 +21,88 @@ tags:
  - research
  ---

- # Evidence-Faithful Reasoning

- *Pilot 3 is the standalone model release.*

- This adapter turns its reasoning-distilled 27B base model into an
- evidence-first reader for closed packets of text. I built it because I wanted
- a release where reasoning had to prove itself: every answer has to land on the
- right evidence, quote that evidence verbatim, and stop with
- `Insufficient evidence.` when the packet does not justify a claim. The result
- is the strongest standalone model from the project, packaged here as a LoRA
- adapter you can load directly.

- ## Resources & Guides

- - [Technical brief (PDF)](./evidence_faithful_reasoning_release_brief.pdf)
- - [Technical note](./technical_note_evidence_faithful_reasoning.md)
- - [Fresh public holdout chart](./standalone_holdout_comparison.svg)
- - [Frozen benchmark progression chart](./benchmark_progression.svg)
- - [Release architecture chart](./project_release_arc.svg)
-
- ![Fresh public holdout: standalone release vs bridge](./standalone_holdout_comparison.svg)
-
- *Fresh 36-task mixed public holdout: the standalone release beats the earlier
- bridge model on task accuracy, evidence F1, and quote F1, while the
- packet-local normalizer lifts the full stack to `0.9093` quote F1.*
-
- ## Why this release exists
-
- I built this project to force reasoning models to show their work in the only
- place that counts: the evidence itself. Fluent answers were not enough. I
- wanted a model that had to retrieve the right units, quote them exactly, and
- fail closed when the packet ran out. This page leads with the standalone
- release because it is the artifact you can load immediately, inspect directly,
- and use without reconstructing the whole benchmark stack.

  ## At a glance

- - LoRA adapter on top of
- [`Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2`](https://huggingface.co/Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2).
- - Strongest standalone model from the project and the release I want people to
- download first.
- - On a fresh 36-task public holdout, raw task improves from `0.8611` to
- `0.8889`, raw strict from `0.2222` to `0.4444`, and raw quote F1 from
- `0.3343` to `0.6815` over the earlier bridge model.
- - Zero invalid outputs on every reported evaluation surface.
- - The project also produced a benchmark-winning hybrid stack, but that is a
- separate result described under *Release architecture*.

  ## Quick start

- Pilot 3 is a LoRA adapter. Load the base model and attach the adapter:

  ```python
  from peft import PeftModel
  from transformers import AutoModelForCausalLM, AutoTokenizer

  base_id = "Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2"
- adapter_id = "darcar0/evidence-faithful-reasoning-pilot-3"

  tokenizer = AutoTokenizer.from_pretrained(base_id)
  base = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
  model = PeftModel.from_pretrained(base, adapter_id)
  ```

- The base model is 27B parameters, so load it in your usual quantization.

  ## Prompt format

- This release works best with an evidence-first prompt that makes the answer
  subordinate to the cited text. A minimal version:

  ```
@@ -128,78 +141,84 @@ The model then writes a JSON object with this shape:

  ### Fresh 36-task mixed public holdout

- A held-out slice of 18 FEVER verify-claim tasks plus 18 HotpotQA grounded-QA
- tasks, drawn from public sources and de-duplicated against every training,
- dev, and `probe_v0` row.

  | Stack | Task | Strict | Evidence F1 | Quote F1 |
  |---|---:|---:|---:|---:|
  | Bridge raw | 0.8611 | 0.2222 | 0.8815 | 0.3343 |
- | Pilot 3 raw | 0.8889 | 0.4444 | 0.9093 | 0.6815 |
  | Bridge + `deterministic_v3` | 0.8611 | 0.5833 | 0.8815 | 0.8815 |
- | **Pilot 3 + `deterministic_v3`** | **0.8889** | **0.5833** | **0.9093** | **0.9093** |

- The standalone release beats the earlier bridge model on task accuracy,
- evidence F1, and quote F1 in both raw and normalized form, ties normalized
- strict, and roughly doubles raw quote F1 at the model level.

  ### Fixed dev triage slice (21 tasks)

  | Stack | Task | Strict | Evidence F1 | Quote F1 |
  |---|---:|---:|---:|---:|
- | Pilot 3 + `deterministic_v3` | 1.0000 | 0.6190 | 0.8320 | 0.7095 |

- ### Untouched 104-task Hotpot shadow slice

- Pilot 3 raw improved quote-faithful behavior over the raw bridge model on this
- slice, and pilot 3 + `deterministic_v3` matched bridge +
- `deterministic_v3` at the system level. That surface remains a narrative
- parity result because the report does not publish per-metric cells for it.

  ## Release architecture

- This project ends in two finished artifacts, not one:

- 1. **Standalone model release** — this page. Pilot 3 is the strongest
  version of the project's evidence-faithful behavior that moved into the
- model itself, evaluated across multiple non-`probe_v0` surfaces.
- 2. **Benchmark-facing hybrid stack** — bridge `checkpoint-2` plus the
- `deterministic_v3` packet-local normalizer. That stack is the benchmark
- winner and the only configuration that clears every gate on frozen held-out
- `probe_v0`.
-
- The separation is deliberate. This page is for the standalone release you can
- download now. The benchmark winner is documented here because it explains the
- project's full result, not because those perfect `probe_v0` numbers belong to
- the adapter alone.

  ## Intended use

  Use this release for work that has to stay inside a fixed body of text:

  - bounded document QA with explicit evidence requirements,
- - claim verification and grounded QA from closed evidence packets,
- - policy, compliance, contract, and internal-document workflows where each
- answer must be justified from the provided text,
  - research on evidence-faithful reasoning and abstention behavior.

  ## Limitations

- - The downloadable artifact is the LoRA adapter only. The base model is
- required.
- - The `deterministic_v3` packet-local normalizer is not included in this
- download. The benchmark-winning configuration is adapter + normalizer, while
- the adapter alone reproduces the standalone-model results shown above.
- - Perfect `probe_v0` belongs to the benchmark-facing hybrid stack, not to this
- adapter alone.
- - Specialized for closed-packet reasoning, not open-ended chat or open-domain
- QA.
- - Frozen `probe_v0` item-level contents are intentionally not published with
- the release.
-
- ## Citation
-
- References:

  - Base model:
  [Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2](https://huggingface.co/Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2)

@@ -212,11 +231,11 @@ References:

  [technical_note_evidence_faithful_reasoning.md](./technical_note_evidence_faithful_reasoning.md)

  ```bibtex
- @misc{darcar0_evidence_faithful_reasoning_pilot_3_2026,
- title = {Evidence-Faithful Reasoning: Pilot 3},
- author = {darcar0},
  year = {2026},
  howpublished = {Hugging Face model release},
- url = {https://huggingface.co/darcar0/evidence-faithful-reasoning-pilot-3}
  }
  ```
 
  - research
  ---

+ # Quotebound 27B

+ *The standalone model release from Evidence-Faithful Reasoning, built on the
+ Qwen 3.5 Opus Distilled 27B base.*

+ Quotebound 27B is the public release name for the standalone model
+ previously referred to internally as `pilot 3`. I built it as the
+ downloadable model release for Evidence-Faithful Reasoning: a LoRA adapter
+ that turns its reasoning-distilled 27B base model into an evidence-first
+ reader for closed packets of source text. Every answer has to land on the
+ right evidence units, quote them verbatim, and stop with
+ `Insufficient evidence.` when the packet does not justify a claim.

+ ![Fresh public holdout: Quotebound 27B versus the prior bridge model](./standalone_holdout_comparison.svg)

+ *On a fresh 36-task public holdout, Quotebound 27B improves task accuracy,
+ evidence F1, and quote F1 over the prior bridge model. The packet-local
+ quote normalizer carries the full stack to `0.9093` quote F1.*

  ## At a glance

+ - **What it is.** A LoRA adapter on top of
+ [`Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2`](https://huggingface.co/Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2),
+ trained to answer from closed packets of source text under a strict
+ answer–evidence–quote–abstain contract.
+ - **The headline number.** Raw quote F1 on a fresh public holdout roughly
+ doubles over the prior bridge model (`0.3343` → `0.6815`), meaning much
+ more of the grounding behavior now lives inside the model itself instead
+ of in a post-processing layer.
+ - **Other deltas on the same holdout.** Raw task: `0.8611` → `0.8889`.
+ Raw strict: `0.2222` → `0.4444`. Raw evidence F1: `0.8815` → `0.9093`.
+ Zero invalid outputs across every reported evaluation surface.
+ - **What it isn't.** Not a general chatbot. Not a replacement for the
+ benchmark-winning hybrid system, which is described below as a separate
+ result.
+
+ ## Read next
+
+ - [Technical brief (PDF)](./evidence_faithful_reasoning_release_brief.pdf) — short, citation-friendly summary.
+ - [Technical note](./technical_note_evidence_faithful_reasoning.md) — full method, results, and discussion.
+ - [Frozen benchmark progression chart](./benchmark_progression.svg)
+ - [Release architecture chart](./project_release_arc.svg)

  ## Quick start

+ Load the 27B base model and attach the adapter:

  ```python
  from peft import PeftModel
  from transformers import AutoModelForCausalLM, AutoTokenizer

  base_id = "Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2"
+ adapter_id = "darcar0/quotebound-27b"

  tokenizer = AutoTokenizer.from_pretrained(base_id)
  base = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
  model = PeftModel.from_pretrained(base, adapter_id)
  ```

+ The base is a 27B-parameter model, so load it in whichever quantization
+ your hardware supports (4-bit `bitsandbytes` works for inference).
+
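A minimal quantized-loading sketch of the step above, under stated assumptions: it presumes a CUDA machine with `bitsandbytes` installed and uses the standard `transformers` `BitsAndBytesConfig` API; the quantization settings shown are common defaults for inference, not values taken from the release docs.

```python
# Hedged sketch: 4-bit loading of the base model before attaching the adapter.
# Assumes a CUDA GPU and the bitsandbytes package; settings are illustrative.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

base_id = "Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2"
adapter_id = "darcar0/quotebound-27b"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4, a common inference choice
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 while weights stay 4-bit
)

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(
    base_id, quantization_config=bnb_config, device_map="auto"
)
model = PeftModel.from_pretrained(base, adapter_id)
```

At 4-bit, the 27B weights fit in roughly 16-20 GB of VRAM; without quantization, plan for bf16-sized memory instead.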
+ ## The contract
+
+ Each task arrives with a closed packet of source text. To count as a
+ success, the model has to clear four conditions on the same answer:
+
+ 1. **Answer correctly** — return the right answer or label for the task.
+ 2. **Pick the right evidence** — the cited units must be the packet
+ locations that actually support the answer.
+ 3. **Quote exact support** — every quote is a verbatim substring of its
+ cited unit. No paraphrase, no stitching, no ellipsis.
+ 4. **Abstain when blocked** — if the packet does not justify a claim,
+ the answer must be exactly `Insufficient evidence.`
+
+ Correctness alone is not credited. The model has been trained to fail
+ closed when the packet runs out, and to ground every answer it does
+ return.

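The four gates above can be sketched as a single checker. This is an illustration of the contract only; the field names (`answer`, `evidence`, `quotes`) and the gold format are assumptions for the sketch, not the release's actual output schema or scorer.

```python
# Illustrative four-gate contract checker. The output/gold dict shapes are
# hypothetical; the release's real schema and scorer are not published here.
ABSTAIN = "Insufficient evidence."

def passes_contract(output, packet, gold):
    """Return True only if all four gates clear on the same answer."""
    if gold["answer"] == ABSTAIN:
        # Gate 4: the packet does not justify a claim, so the model must abstain.
        return output["answer"] == ABSTAIN
    if output["answer"] != gold["answer"]:                 # Gate 1: correct answer
        return False
    if set(output["evidence"]) != set(gold["evidence"]):   # Gate 2: right units
        return False
    for unit_id, quote in output["quotes"]:                # Gate 3: verbatim quotes
        if quote not in packet.get(unit_id, ""):
            return False
    return True
```

Note that gate 3 is a plain substring check: any paraphrase, stitched span, or ellipsis fails it, which is what "fail closed" means at the quote level.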
  ## Prompt format

+ The model is trained for an evidence-first prompt that makes the answer
  subordinate to the cited text. A minimal version:

  ```

@@ -128,78 +141,84 @@

  ### Fresh 36-task mixed public holdout

+ A held-out slice of 18 FEVER verify-claim tasks plus 18 HotpotQA
+ grounded-QA tasks, drawn from public sources and de-duplicated against
+ every training, dev, and held-out probe row.

  | Stack | Task | Strict | Evidence F1 | Quote F1 |
  |---|---:|---:|---:|---:|
  | Bridge raw | 0.8611 | 0.2222 | 0.8815 | 0.3343 |
+ | Quotebound raw | 0.8889 | 0.4444 | 0.9093 | 0.6815 |
  | Bridge + `deterministic_v3` | 0.8611 | 0.5833 | 0.8815 | 0.8815 |
+ | **Quotebound + `deterministic_v3`** | **0.8889** | **0.5833** | **0.9093** | **0.9093** |

+ Quotebound 27B beats the prior bridge model on task accuracy, evidence F1,
+ and quote F1 in both raw and normalized form, ties normalized strict, and
+ roughly doubles raw quote F1 at the model level.

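For readers who want to sanity-check figures like the ones above: set-level F1 over evidence units (or quotes) is conventionally the harmonic mean of precision and recall between the predicted and gold sets. The project's exact scorer is not published on this page, so treat this as an illustrative sketch rather than the release's metric code.

```python
# Illustrative set-level F1, one common convention for evidence/quote scoring.
# Assumption: the release's actual scorer may differ (e.g. partial-credit quotes).
def set_f1(predicted, gold):
    """F1 between a predicted and a gold set of evidence units."""
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        return 1.0                      # both empty: perfect agreement by convention
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Per-task scores like this are then averaged over the holdout to produce table cells such as the `0.9093` evidence F1.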
  ### Fixed dev triage slice (21 tasks)

  | Stack | Task | Strict | Evidence F1 | Quote F1 |
  |---|---:|---:|---:|---:|
+ | Quotebound + `deterministic_v3` | 1.0000 | 0.6190 | 0.8320 | 0.7095 |

+ ### Untouched 104-task HotpotQA shadow slice

+ On a 104-task HotpotQA shadow slice that was never touched during
+ selection, Quotebound raw improved quote-faithful behavior over the prior
+ bridge model, and Quotebound plus `deterministic_v3` matched bridge +
+ `deterministic_v3` at the system level. The surface is reported as a
+ narrative parity result because the freeze memo does not publish
+ per-metric cells for it.

  ## Release architecture

+ The project ends in two finished results that are reported separately on
+ purpose. One is the strongest full system on the held-out benchmark; the
+ other is the strongest standalone model — and the artifact you can
+ actually download.

+ 1. **Quotebound 27B — this page.** The adapter above is the strongest
  version of the project's evidence-faithful behavior that moved into the
+ model itself, evaluated across multiple surfaces beyond the held-out
+ probe.
+ 2. **The benchmark-winning hybrid system.** A trained bridge checkpoint
+ plus the `deterministic_v3` packet-local quote normalizer. That stack
+ is the only configuration that clears every gate of the strict
+ contract on the frozen held-out probe (`probe_v0`).
+
+ The two results do not collapse into one. The hybrid system is the
+ benchmark winner. Quotebound 27B is the downloadable model. Perfect
+ `probe_v0` belongs to the hybrid system, not to the adapter on this page
+ alone.

  ## Intended use

  Use this release for work that has to stay inside a fixed body of text:

  - bounded document QA with explicit evidence requirements,
+ - claim verification and grounded QA from closed packets of source text,
+ - policy, compliance, contract, and internal-document workflows where
+ each answer has to be justified from the provided text,
  - research on evidence-faithful reasoning and abstention behavior.

  ## Limitations

+ - The download is the LoRA adapter only — the 27B base model is required.
+ - The `deterministic_v3` packet-local quote normalizer is *not* shipped
+ here. It lives in the project repository as a separate post-processing
+ step. Quotebound 27B alone reproduces the raw standalone gains above;
+ normalized system-level rows require adapter + normalizer.
+ - Perfect `probe_v0` belongs to the benchmark-winning hybrid system, not
+ to this adapter alone.
+ - Specialized for closed-packet reasoning. Behavior outside that setting
+ — open chat, open-domain QA, free-form generation — is not
+ characterized.
+ - Raw item-level contents of the held-out probe are intentionally not
+ published with the release; the held-out gate has to stay closed to
+ remain meaningful.
+
+ ## Citation and references

  - Base model:
  [Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2](https://huggingface.co/Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2)

  [technical_note_evidence_faithful_reasoning.md](./technical_note_evidence_faithful_reasoning.md)

  ```bibtex
+ @misc{quotebound_27b_2026,
+ title = {Quotebound 27B: Evidence-Faithful Reasoning Standalone Release},
+ author = {{darcar0}},
  year = {2026},
  howpublished = {Hugging Face model release},
+ url = {https://huggingface.co/darcar0/quotebound-27b}
  }
  ```
benchmark_progression.svg CHANGED
evidence_faithful_reasoning_release_brief.md CHANGED
@@ -1,14 +1,14 @@

- # Evidence-Faithful Reasoning

- ## Release Brief

- Released: 2026-04-07
  Author: darcar0

  Hugging Face model release:
- [`darcar0/evidence-faithful-reasoning-pilot-3`](https://huggingface.co/darcar0/evidence-faithful-reasoning-pilot-3)

- Companion files on this page:

  - [`technical_note_evidence_faithful_reasoning.md`](./technical_note_evidence_faithful_reasoning.md)
  - [`standalone_holdout_comparison.svg`](./standalone_holdout_comparison.svg)

@@ -17,53 +17,57 @@ Companion files on this page:

  ## Executive summary

- I built this project because I wanted a release where reasoning had to prove
- itself from the evidence instead of hiding behind fluent language. The result
- is a strict evidence-faithful benchmark, a benchmark-winning hybrid system,
- and pilot 3: the standalone model release, shipped here as a LoRA adapter on
- Hugging Face.

- The contract is strict. On every bounded evidence packet, the system has to:

  1. answer correctly,
- 2. identify the right evidence,
- 3. quote the exact supporting text, and
  4. abstain with `Insufficient evidence.` when the packet does not justify a
  claim.

- This release ends in two finished outputs. The benchmark-facing winner is a
- hybrid stack — bridge `checkpoint-2` plus `deterministic_v3` packet-local
- quote normalization — that clears every gate on the frozen held-out
- `probe_v0` benchmark. The main downloadable artifact is pilot 3, the
- standalone model release, which beats the earlier bridge model on a fresh
- mixed public holdout and roughly doubles raw quote-faithful behavior at the
- model level.

- ## Standalone model release

- Pilot 3 is the strongest standalone model from the project and the release I
- want people to open first. It is the first standalone checkpoint in the
- project that holds up across multiple non-`probe_v0` evaluation surfaces.

  Fresh 36-task mixed public holdout:

  | Stack | Task | Strict | Evidence F1 | Quote F1 |
  |---|---:|---:|---:|---:|
  | Bridge raw | 0.8611 | 0.2222 | 0.8815 | 0.3343 |
- | Pilot 3 raw | 0.8889 | 0.4444 | 0.9093 | 0.6815 |
  | Bridge + `deterministic_v3` | 0.8611 | 0.5833 | 0.8815 | 0.8815 |
- | **Pilot 3 + `deterministic_v3`** | **0.8889** | **0.5833** | **0.9093** | **0.9093** |

- Pilot 3 beats bridge on task accuracy, evidence F1, and quote F1 in both raw
- and normalized form, ties normalized strict, and roughly doubles raw quote F1
- (`0.3343` → `0.6815`) at the model level.

  ## Benchmark-facing winner

  The benchmark-facing hybrid stack is the strongest full system from the
- project. Training moved the model past the older frozen baseline.
- Deterministic packet-local normalization closed the remaining quote-faithful
- gap without leaving the bounded-packet setting.

  | Metric | Frozen probe_v0 |
  |---|---:|

@@ -78,27 +82,30 @@ gap without leaving the bounded-packet setting.

  ## Project arc and stopping point

- The release has two public faces: a standalone model you can load directly and
- a benchmark-facing hybrid stack that closes the last quote-faithfulness gap.
- That split is part of the project story, not something hidden behind the fine
- print.

- A targeted follow-up, pilot 4, fixed one specific FEVER
- month/date temporal-insufficiency case but weakened broader behavior on larger
- evaluation surfaces. I treated that as a stop signal rather than churning for
- one more pilot. The release froze at the point where the standalone model was
- strongest and the benchmark-facing system was already complete.

  ## Intended use and boundaries

- This is a specialized grounded reasoning release, not a general-purpose
- chatbot replacement. It is built for bounded document QA, claim verification,
- policy/compliance workflows, and other settings where every answer has to be
- justified from a closed body of text.

  Important boundaries:

- - Perfect `probe_v0` belongs to the hybrid stack, not to pilot 3 alone.

  - The Hugging Face download is the LoRA adapter only; the benchmark-winning
  configuration is adapter + `deterministic_v3`.
  - Frozen `probe_v0` item-level contents are intentionally not published with

@@ -106,10 +113,10 @@ Important boundaries:

  ## Release surfaces

- - Pilot 3 on Hugging Face:
- [`darcar0/evidence-faithful-reasoning-pilot-3`](https://huggingface.co/darcar0/evidence-faithful-reasoning-pilot-3)
  - Technical note:
- [`docs/technical_note_evidence_faithful_reasoning.md`](./technical_note_evidence_faithful_reasoning.md)
  - Fresh public holdout chart:
  [`standalone_holdout_comparison.svg`](./standalone_holdout_comparison.svg)
  - Frozen benchmark progression chart:
 
+ # Quotebound 27B

+ ## Evidence-Faithful Reasoning Release Brief

+ Released: 2026-04-07
  Author: darcar0

  Hugging Face model release:
+ [`darcar0/quotebound-27b`](https://huggingface.co/darcar0/quotebound-27b)

+ Companion files:

  - [`technical_note_evidence_faithful_reasoning.md`](./technical_note_evidence_faithful_reasoning.md)
  - [`standalone_holdout_comparison.svg`](./standalone_holdout_comparison.svg)

  ## Executive summary

+ Quotebound 27B is the standalone model release from Evidence-Faithful
+ Reasoning: a research-engineering project on reasoning that has to stay
+ recoverable from the source text rather than asserted on top of it. The
+ project ships a strict benchmark, the hybrid system that clears that
+ benchmark under the full contract, and a standalone model trained to carry
+ the same behavior on its own.

+ The contract is strict. On every closed packet of source text, the system
+ has to:

  1. answer correctly,
+ 2. cite the right evidence units,
+ 3. quote those units verbatim, and
  4. abstain with `Insufficient evidence.` when the packet does not justify a
  claim.

+ The project ends in two finished results from one frame. The
+ benchmark-facing winner is a hybrid stack — bridge `checkpoint-2` plus
+ `deterministic_v3` packet-local quote normalization — that clears every
+ gate on the frozen held-out `probe_v0` benchmark. The downloadable artifact
+ is Quotebound 27B on Hugging Face, which beats the earlier bridge model on
+ a fresh mixed public holdout and roughly doubles raw quote-faithful
+ behavior at the model level.

+ ## Quotebound 27B

+ Quotebound 27B is the public release name for the standalone model
+ previously referred to internally as `pilot 3`. It is the strongest
+ standalone model the project produced and the artifact most readers will
+ load first. It is the first standalone checkpoint in the project to hold
+ up across multiple evaluation surfaces beyond the held-out probe.

  Fresh 36-task mixed public holdout:

  | Stack | Task | Strict | Evidence F1 | Quote F1 |
  |---|---:|---:|---:|---:|
  | Bridge raw | 0.8611 | 0.2222 | 0.8815 | 0.3343 |
+ | Quotebound raw | 0.8889 | 0.4444 | 0.9093 | 0.6815 |
  | Bridge + `deterministic_v3` | 0.8611 | 0.5833 | 0.8815 | 0.8815 |
+ | **Quotebound + `deterministic_v3`** | **0.8889** | **0.5833** | **0.9093** | **0.9093** |

+ Quotebound 27B beats the prior bridge model on task accuracy, evidence F1,
+ and quote F1 in both raw and normalized form, ties normalized strict, and
+ roughly doubles raw quote F1 (`0.3343` → `0.6815`) at the model level.

  ## Benchmark-facing winner

  The benchmark-facing hybrid stack is the strongest full system from the
+ project. Training moved the model past the older frozen baseline; the
+ deterministic packet-local normalizer was the finishing repair, closing the
+ remaining quote-faithful gap without leaving the closed-packet boundary.

  | Metric | Frozen probe_v0 |
  |---|---:|

  ## Project arc and stopping point

+ The release has two public faces: Quotebound 27B, the standalone model that
+ loads directly from Hugging Face, and a benchmark-facing hybrid stack that
+ closes the last quote-faithfulness gap on the frozen held-out probe. The
+ split is part of the project story, not hidden behind the fine print.

+ A narrow follow-up — internally `pilot 4` — fixed one specific FEVER
+ month/date temporal-insufficiency case but weakened broader behavior on
+ larger evaluation surfaces. The project read that outcome as a stop signal
+ rather than running additional local fixes, and froze at the point where
+ Quotebound 27B was strongest and the benchmark-facing system was already
+ complete.

  ## Intended use and boundaries

+ This is a release for reasoning over closed packets of source text, not a
+ general-purpose chatbot replacement. It is built for bounded document QA,
+ claim verification, policy and compliance review, contract reading, and
+ other settings where every answer has to be justified from a fixed body of
+ text.

  Important boundaries:

+ - Perfect `probe_v0` belongs to the hybrid stack, not to the standalone
+ adapter alone.
  - The Hugging Face download is the LoRA adapter only; the benchmark-winning
  configuration is adapter + `deterministic_v3`.
  - Frozen `probe_v0` item-level contents are intentionally not published with

  ## Release surfaces

+ - Quotebound 27B on Hugging Face:
+ [`darcar0/quotebound-27b`](https://huggingface.co/darcar0/quotebound-27b)
  - Technical note:
+ [`technical_note_evidence_faithful_reasoning.md`](./technical_note_evidence_faithful_reasoning.md)
  - Fresh public holdout chart:
  [`standalone_holdout_comparison.svg`](./standalone_holdout_comparison.svg)
  - Frozen benchmark progression chart:
evidence_faithful_reasoning_release_brief.pdf CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:c425eff778b5c1c17aebf1c67a4e63f71908d729f3a73a6ee0c3bf053fe63aca
- size 317277

  version https://git-lfs.github.com/spec/v1
+ oid sha256:4016a00664573e528b22b0d93a4b0d5ebe740c0c6f4a4e9fd6f6874a02c55d25
+ size 243746
project_release_arc.svg CHANGED
standalone_holdout_comparison.svg CHANGED
technical_note_evidence_faithful_reasoning.md CHANGED
@@ -1,90 +1,104 @@
1
- # Technical Note: Evidence-Faithful Reasoning Over Bounded Evidence Packets
2
 
3
  Updated: 2026-04-07
4
 
5
  ## Abstract
6
 
7
- This note describes a research-engineering project on strict evidence-faithful
8
- reasoning in open reasoning-distilled language models. The system answers from
9
- a closed packet of source text and must satisfy four conditions at once:
10
- answer correctly, identify the right evidence units, quote exact supporting
11
- text, and abstain with `Insufficient evidence.` when the packet does not
12
- justify a claim. The project produced two finished outputs from one frame: a
13
- benchmark-winning hybrid system that solves the frozen held-out `probe_v0`
14
- benchmark under the full contract, and pilot 3, the strongest standalone model
15
- artifact from the project, released as a LoRA adapter on Hugging Face. A
16
- narrowly targeted follow-up checkpoint, pilot 4, was rejected as a stop signal
17
- after it traded one local fix for broader regressions. The hybrid stack is the
18
- benchmark-facing release; pilot 3 is the standalone model release.
19
-
20
- **Keywords:** evidence-faithful reasoning, grounded QA, claim verification,
21
- bounded evidence packets, abstention, attribution.
 
 
 
 
 
22
 
23
  ## 1. Problem
24
 
25
- Many language models can produce correct-looking answers while grounding
26
- poorly. In practice, that failure shows up as one or more of the following:
 
27
 
28
- - the answer is right but the cited evidence is wrong
29
- - the quote is too broad, too narrow, or not recoverable from the source text
30
- - the model fails to abstain when the packet does not justify the answer
31
- - the output sounds persuasive even when the evidence is insufficient
 
 
 
32
 
33
- The question is not whether the model can answer from a packet. The question
34
- is whether it can answer **faithfully** β€” with recoverable support and the
35
- correct abstention behavior when the packet falls short.
 
36
 
37
  ## 2. Benchmark contract
38
 
39
- For each task on a bounded evidence packet, the system must clear all four
40
- gates at once:
41
 
42
- 1. **Answer correctly** β€” the right answer or label for the task.
43
- 2. **Pick the right evidence** β€” the cited evidence units must be the packet
44
  locations that actually support the answer.
45
- 3. **Quote exact support** β€” every quote has to be a verbatim substring of
46
- the cited unit; no paraphrase, no stitching, no ellipsis.
47
- 4. **Abstain when blocked** β€” if the packet does not justify a claim, the
48
- answer must be exactly `Insufficient evidence.`
49
 
50
- Correctness alone does not count as success.
51
 
52
  The frozen held-out benchmark surface for the final release cycle is
53
- `data/probe_v0/`. It remained frozen throughout the cycle and was not used as
54
- a tuning surface.
55
 
56
  ## 3. Method
57
 
58
- The project followed a full research-engineering loop:
59
-
60
- 1. build the evaluation harness and frozen baselines
61
- 2. produce a baseline table and failure mapping
62
- 3. test prompt and structure variants
63
- 4. build a training-backed intervention path
64
- 5. add deterministic packet-local quote normalization where the model still
65
- underperformed the strict contract
66
- 6. rerun held-out evaluation on a protected split
67
- 7. run a teacher-student distillation cycle to push the winning behavior into
68
- the model itself
69
- 8. stop when the strongest release artifact emerged
70
-
71
- Train and dev surfaces are public-data-backed and derived from FEVER-style
72
- verify-claim data, HotpotQA-style grounded QA data, and project-local packet
73
- scaffolding built on top of those upstream sources.
 
 
74
 
75
  ## 4. Benchmark-facing system

- The benchmark-facing release is a hybrid system:

  - **Bridge checkpoint:** `outputs/sft_v1_v2_qlora_bridge_v1_run1/checkpoint-2`
- - **Deterministic packet-local normalization:**
  `python3 scripts/normalize_quotes.py --mode deterministic_v3`

- Both parts are necessary. The training step closes the gap between the older
- frozen baseline and the strict contract on task and evidence accuracy. The
- deterministic normalizer is the finishing move that closes the contract on
- quote faithfulness and strict grounded success without leaving the closed
- packet.

  ## 5. Benchmark-facing result

@@ -105,107 +119,112 @@ label accuracy `1.0000`, grounded QA accuracy `1.0000`, contrastive
  consistency `1.0000`, invalid / missing rate `0.0000`. Canonical memo:
  `reports/sft_v1_final_artifact_status.md`.

- ## 6. Standalone model: pilot 3

  After the hybrid stack solved the benchmark, the project asked a second
  question inside the same frame: how much of that winning behavior can be
- moved into the model itself, evaluated outside `probe_v0`?

- That question produced pilot 3: a teacher-student distillation checkpoint
- released as a LoRA adapter on top of
  [`Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2`](https://huggingface.co/Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2).

  - **Internal checkpoint:**
  `outputs/sft_v1_v2_teacher_distill_pilot_v3_partialdev/checkpoint-16`
  - **Public release identity:**
- [`darcar0/evidence-faithful-reasoning-pilot-3`](https://huggingface.co/darcar0/evidence-faithful-reasoning-pilot-3)

- Pilot 3 is the strongest standalone model artifact from the project. It is
- the first standalone checkpoint to hold up across multiple non-`probe_v0`
- evaluation surfaces, and it roughly doubles raw quote-faithful behavior over
- the earlier bridge model on a fresh public holdout. Selection happened
  entirely off `probe_v0`; the held-out gate stayed frozen.

- ## 7. Standalone model: results

  ### 7.1 Fresh 36-task mixed public holdout

- A held-out slice of 18 FEVER verify-claim tasks plus 18 HotpotQA grounded QA
  tasks, drawn from public sources and de-duplicated against every training,
  dev, and `probe_v0` row.

  | Stack | Task | Strict | Evidence F1 | Quote F1 |
  |---|---:|---:|---:|---:|
  | Bridge raw | 0.8611 | 0.2222 | 0.8815 | 0.3343 |
- | Pilot 3 raw | 0.8889 | 0.4444 | 0.9093 | 0.6815 |
  | Bridge + `deterministic_v3` | 0.8611 | 0.5833 | 0.8815 | 0.8815 |
- | **Pilot 3 + `deterministic_v3`** | **0.8889** | **0.5833** | **0.9093** | **0.9093** |

  See [Figure 2: standalone holdout comparison](./standalone_holdout_comparison.svg).

- Pilot 3 beats bridge on task accuracy, evidence F1, and quote F1 in both raw
- and normalized form, ties normalized strict, and roughly doubles raw quote F1
- at the model level. Grounded-QA accuracy on this slice is `1.0000` for both
- stacks; zero invalid outputs across every reported evaluation surface.
- Canonical memo:
  `reports/standalone_model_v2_holdout_v1_bridge_vs_pilot3_status.md`.

  ### 7.2 Fixed dev triage slice (21 tasks)

  | Stack | Task | Strict | Evidence F1 | Quote F1 |
  |---|---:|---:|---:|---:|
- | Pilot 3 + `deterministic_v3` | 1.0000 | 0.6190 | 0.8320 | 0.7095 |

- ### 7.3 Untouched 104-task Hotpot shadow slice

- Pilot 3 raw improved quote-faithful behavior over raw bridge, and pilot 3 +
- `deterministic_v3` matched bridge + `deterministic_v3` at the system level on
- this slice. Reported as a parity outcome in the standalone freeze memo
- (`reports/standalone_model_v2_freeze_memo.md`); per-metric numbers were not
- recorded for this surface, so it stands as a narrative parity result, not a
- table cell.

- ## 8. Stop signal: pilot 4

- A targeted follow-up, pilot 4, was built to fix one specific FEVER
- month/date temporal-insufficiency error. It fixed that single row but
- weakened broader behavior on the larger evaluation surfaces. That outcome
- made pilot 4 useful as a stop signal: further local-fix iteration was
- trading visible gains for wider regressions. The project froze at pilot 3
- as the strongest standalone model. Canonical memo:
  `reports/sft_v1_v2_teacher_distill_pilot_v4_partialdev_status.md`.

  See [Figure 3: project release arc](./project_release_arc.svg) for the full
- arc from baseline through pilot 4 stop signal.

  ## 9. Discussion

- **Why the release shape is hybrid + standalone, not one or the other.** The
- strict contract has four gates. Training alone closed the task and evidence
- gates on `probe_v0`, but bridge `checkpoint-2` did not close the strict
- grounded success gate by itself: raw bridge was at strict `0.2727` on
- `probe_v0`. Deterministic packet-local normalization was the finishing move
- that closed the remaining gates without escaping the closed-packet boundary.
- That is what made the hybrid stack the strongest benchmark-facing result and
- why it is the benchmark-facing release.
-
- **What pilot 3 demonstrates.** The teacher-student cycle that produced pilot
- 3 asked how much of that winning behavior could move into the model itself,
- evaluated entirely off `probe_v0`. The fresh-holdout deltas show that the
- model-side gain is real: raw pilot 3 roughly doubles raw bridge on quote F1
- (`0.3343` → `0.6815`) and beats raw bridge on task accuracy and evidence F1.
- This is the first standalone model in the project that holds up across
- multiple non-`probe_v0` surfaces; calling it the standalone release artifact
- is faithful to that evidence.
-
- **Why pilot 4 is informative.** Pilot 4 was a deliberate narrow refinement.
- It worked on the one row it targeted and regressed broader behavior. Reading
- that result as a stop signal — rather than running additional local fixes —
- is the discipline the project decided to keep visible in the release trail.

  **Distinction held throughout.** Perfect frozen `probe_v0` belongs to the
- hybrid stack, not pilot 3 alone. The release ships both results without
- collapsing them into one claim.

  ## 10. Contributions

@@ -217,35 +236,35 @@ collapsing them into one claim.
  contributed.
  3. A finished benchmark-winning hybrid artifact: bridge `checkpoint-2` plus
  `deterministic_v3` packet-local normalization.
- 4. A standalone model release — pilot 3, a LoRA adapter on a 27B
- reasoning-distilled base — that is the first standalone checkpoint to
- hold up across multiple non-`probe_v0` evaluation surfaces and that
- roughly doubles raw quote-faithful behavior over the earlier bridge.
- 5. A stop signal — pilot 4 — that documents the point at which further
- local-fix iteration began trading visible gains for wider regressions.

  ## 11. Intended use and limitations

- **Intended use.** Specialized grounded reasoning over bounded evidence
- packets: bounded document QA, claim verification, policy and compliance
- review, contract reading, and other workflows where every answer has to be
- justified from a closed body of text. Also: research on evidence-faithful
- reasoning and abstention behavior.

  **Limitations.**

- - This is **not** a general-purpose chatbot replacement. Performance outside
  the closed-packet setting is not characterized.
- - Perfect `probe_v0` belongs to the **hybrid stack**, not to pilot 3 alone.
- Treat `1.0000` numbers on `probe_v0` as the hybrid stack's result.
- - The downloadable artifact for pilot 3 is the LoRA adapter only.
- `deterministic_v3` is a separate post-processing step that lives in the
- project repository; the benchmark-winning configuration is *adapter +
- normalizer*.
  - Frozen `probe_v0` item-level contents are intentionally not published with
  the release in order to preserve the held-out gate.
  - Perfect `probe_v0` is not proof of general faithful reasoning. It is proof
- that the system meets the strict contract on a single frozen bounded
  benchmark.

  ## 12. Surfaces

@@ -255,8 +274,7 @@ Canonical project surfaces:
  - final artifact memo: `reports/sft_v1_final_artifact_status.md`
  - standalone freeze memo: `reports/standalone_model_v2_freeze_memo.md`
  - fresh holdout comparison: `reports/standalone_model_v2_holdout_v1_bridge_vs_pilot3_status.md`
- - pilot 4 stop-signal memo: `reports/sft_v1_v2_teacher_distill_pilot_v4_partialdev_status.md`
  - release brief: `evidence_faithful_reasoning_release_brief.pdf`
- - release page: `index.html`
- - model card: `huggingface_model_card_README.md`
- - Hugging Face release: [`darcar0/evidence-faithful-reasoning-pilot-3`](https://huggingface.co/darcar0/evidence-faithful-reasoning-pilot-3)

+ # Technical Note: Evidence-Faithful Reasoning Over Closed Evidence Packets

  Updated: 2026-04-07

  ## Abstract

+ This note describes a research-engineering project on strict
+ evidence-faithful reasoning in open reasoning-distilled language models.
+ The system answers from a closed packet of source text and is scored
+ against a four-condition contract: answer correctly, cite the right
+ packet units as evidence, quote those units verbatim, and abstain with
+ `Insufficient evidence.` when the packet does not justify a claim. The
+ project ends in two finished results from the same frame. The first is a
+ hybrid system — a trained bridge checkpoint plus a packet-local quote
+ normalizer — that clears every gate of the contract on the frozen
+ held-out probe (`probe_v0`). The second is Quotebound 27B, the public
+ release name for the standalone model previously referred to internally as
+ `pilot 3`, released as a LoRA adapter on Hugging Face
+ ([`darcar0/quotebound-27b`](https://huggingface.co/darcar0/quotebound-27b)).
+ A narrow follow-up — internally `pilot 4` — was rejected as a stop signal
+ after it traded one local fix for broader regressions. The hybrid system
+ is the benchmark-facing release; Quotebound 27B is the downloadable model
+ release.
+
+ **Keywords:** evidence-faithful reasoning, grounded QA, claim
+ verification, closed-packet reasoning, abstention, attribution.

  ## 1. Problem

+ Language models routinely produce correct-looking answers while grounding
+ poorly. In practice, that failure mode shows up as one or more of the
+ following:

+ - the answer is right but the cited evidence is wrong;
+ - the quote is too broad, too narrow, or not recoverable from the source
+ text at all;
+ - the model fails to abstain when the packet does not justify the
+ answer;
+ - the output reads as persuasive even when the underlying evidence is
+ insufficient.

+ The question this project asks is not whether a language model can
+ answer from a closed packet. It is whether the model can answer
+ **faithfully** — with recoverable support, and with the correct
+ abstention behavior when the packet falls short.

  ## 2. Benchmark contract

+ Each task arrives with a closed packet of source text. To count as a
+ success, the system has to clear four conditions on the same answer:

+ 1. **Answer correctly** — return the right answer or label for the task.
+ 2. **Pick the right evidence** — the cited units must be the packet
  locations that actually support the answer.
+ 3. **Quote exact support** — every quote is a verbatim substring of its
+ cited unit. No paraphrase, no stitching, no ellipsis.
+ 4. **Abstain when blocked** — if the packet does not justify a claim,
+ the answer must be exactly `Insufficient evidence.`

+ Correctness alone is not credited.
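The four gates read as a single strict check. As a sketch only (the record shapes `answer`, `evidence`, and `quotes` are assumptions made for illustration, not the project's actual harness schema), the contract can be expressed as:

```python
def strict_success(pred: dict, gold: dict, packet: dict) -> bool:
    """Illustrative strict-contract check. Field names are assumed for
    this sketch; the project's harness defines the real schema."""
    abstain = "Insufficient evidence."
    # Gate 4: when the packet does not justify a claim, only the exact
    # abstention string is credited.
    if gold["answer"] == abstain:
        return pred["answer"] == abstain
    # Gate 1: the answer or label itself must be right.
    if pred["answer"] != gold["answer"]:
        return False
    # Gate 2: cited units must be the supporting packet locations.
    if set(pred["evidence"]) != set(gold["evidence"]):
        return False
    # Gate 3: every quote is a verbatim substring of its cited unit.
    return all(quote in packet.get(unit, "")
               for unit, quote in pred["quotes"])
```

Gate 1 passing while gates 2 through 4 fail is exactly the "correctness alone" case the contract refuses to credit.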
 
  The frozen held-out benchmark surface for the final release cycle is
+ `data/probe_v0/`. It stayed frozen throughout the cycle and was never
+ used as a tuning surface.

  ## 3. Method

+ The project worked through a full research-engineering loop. The eval
+ harness and a frozen baseline came first; baselines were scored end-to-end
+ under the strict contract before any intervention was tried, and a
+ failure mapping over those baseline runs identified where the model was
+ losing each gate. From there, prompt and structure variants were tested
+ against the same harness, and a training-backed intervention path was
+ built on top of the strongest learned configuration. Where the model
+ still underperformed the contract, a deterministic packet-local quote
+ normalizer was layered on as a finishing repair, and held-out evaluation
+ was rerun on the protected split. Once the hybrid stack solved the
+ held-out probe, a teacher-student distillation cycle pushed the winning
+ behavior back into the model itself, evaluated entirely off the held-out
+ probe. The cycle was stopped when the strongest standalone release
+ artifact emerged and a focused follow-up showed clear regressions.
+
+ Train and dev surfaces are derived from public FEVER-style
+ verify-claim data, public HotpotQA-style grounded-QA data, and
+ project-local packet scaffolding built on top of those upstream sources.

  ## 4. Benchmark-facing system

+ The benchmark-facing release is a hybrid of a learned model and a
+ deterministic post-processing step:

  - **Bridge checkpoint:** `outputs/sft_v1_v2_qlora_bridge_v1_run1/checkpoint-2`
+ - **Packet-local quote normalizer:**
  `python3 scripts/normalize_quotes.py --mode deterministic_v3`

+ Both parts are necessary. Training closes the gap between the older
+ frozen baseline and the strict contract on task and evidence accuracy.
+ The packet-local normalizer is the finishing repair: it closes the
+ contract on quote faithfulness and strict grounded success without
+ leaving the closed-packet boundary, by correcting quote spans against
+ the cited units inside each packet.
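The actual repair rules of `deterministic_v3` live in `scripts/normalize_quotes.py`. As a hedged sketch of the packet-local idea only (the use of `difflib` here is an assumption, not the release's rule set), one such repair keeps verbatim quotes untouched and replaces a near-miss quote with the longest contiguous span it shares with its cited unit:

```python
import difflib

def repair_quote(quote: str, unit_text: str) -> str:
    """Packet-local quote repair (illustrative only, not deterministic_v3).

    A quote that is already a verbatim substring of its cited unit passes
    through unchanged. Otherwise the longest contiguous block shared with
    the unit is returned, so the result is always recoverable from the
    packet and never imports text from outside it.
    """
    if quote in unit_text:
        return quote
    matcher = difflib.SequenceMatcher(None, unit_text, quote, autojunk=False)
    m = matcher.find_longest_match(0, len(unit_text), 0, len(quote))
    return unit_text[m.a:m.a + m.size]
```

The property that matters for the contract is that the output never contains text absent from the cited unit; any repair rule with that guarantee stays inside the closed packet.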
 
  ## 5. Benchmark-facing result

  consistency `1.0000`, invalid / missing rate `0.0000`. Canonical memo:
  `reports/sft_v1_final_artifact_status.md`.

+ ## 6. Quotebound 27B

  After the hybrid stack solved the benchmark, the project asked a second
  question inside the same frame: how much of that winning behavior can be
+ moved into the model itself, evaluated on surfaces outside the held-out
+ probe?

+ That question produced Quotebound 27B — the public release name for the
+ standalone model previously referred to internally as `pilot 3` — a
+ teacher-student distillation checkpoint published as a LoRA adapter on
+ top of
  [`Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2`](https://huggingface.co/Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2).

  - **Internal checkpoint:**
  `outputs/sft_v1_v2_teacher_distill_pilot_v3_partialdev/checkpoint-16`
  - **Public release identity:**
+ [`darcar0/quotebound-27b`](https://huggingface.co/darcar0/quotebound-27b)

+ This is the strongest standalone model artifact from the project. It is the
+ first standalone checkpoint to hold up across multiple evaluation surfaces
+ beyond `probe_v0`, and on a fresh public holdout it roughly doubles raw
+ quote-faithful behavior over the earlier bridge model. Selection happened
  entirely off `probe_v0`; the held-out gate stayed frozen.

+ ## 7. Quotebound 27B results

  ### 7.1 Fresh 36-task mixed public holdout

+ A held-out slice of 18 FEVER verify-claim tasks plus 18 HotpotQA grounded-QA
  tasks, drawn from public sources and de-duplicated against every training,
  dev, and `probe_v0` row.

  | Stack | Task | Strict | Evidence F1 | Quote F1 |
  |---|---:|---:|---:|---:|
  | Bridge raw | 0.8611 | 0.2222 | 0.8815 | 0.3343 |
+ | Quotebound raw | 0.8889 | 0.4444 | 0.9093 | 0.6815 |
  | Bridge + `deterministic_v3` | 0.8611 | 0.5833 | 0.8815 | 0.8815 |
+ | **Quotebound + `deterministic_v3`** | **0.8889** | **0.5833** | **0.9093** | **0.9093** |

  See [Figure 2: standalone holdout comparison](./standalone_holdout_comparison.svg).

+ Quotebound 27B beats the prior bridge model on task accuracy, evidence F1,
+ and quote F1 in both raw and normalized form, matches it on normalized
+ strict success, and roughly doubles raw quote F1 at the model level.
+ Grounded-QA accuracy on this slice is `1.0000` for both stacks; zero
+ invalid outputs across every reported evaluation surface. Canonical memo:
  `reports/standalone_model_v2_holdout_v1_bridge_vs_pilot3_status.md`.
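For readers reproducing these columns: the exact metric definitions live in the project's evaluation harness, so the following is one common construction assumed here for illustration. Per task, the predicted evidence units (or quotes) are scored against gold as set-overlap F1, with the empty-versus-empty case credited as 1.0 so a correct abstention is not penalized:

```python
def set_f1(pred, gold) -> float:
    """Set-overlap F1 between predicted and gold items, e.g. evidence
    unit ids or quote strings (illustrative; not the project's harness).

    Both sides empty scores 1.0, matching the convention that a correct
    abstention has nothing to cite and should not be penalized.
    """
    pred, gold = set(pred), set(gold)
    if not pred and not gold:
        return 1.0
    true_pos = len(pred & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Under this construction a task where one of two gold units is cited, with no spurious extras, scores 2/3, and the table values would be per-task scores averaged over the 36 tasks.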
 
  ### 7.2 Fixed dev triage slice (21 tasks)

  | Stack | Task | Strict | Evidence F1 | Quote F1 |
  |---|---:|---:|---:|---:|
+ | Quotebound + `deterministic_v3` | 1.0000 | 0.6190 | 0.8320 | 0.7095 |

+ ### 7.3 Untouched 104-task HotpotQA shadow slice

+ The raw standalone adapter improved quote-faithful behavior over raw bridge,
+ and Quotebound plus `deterministic_v3` matched bridge + `deterministic_v3`
+ at the system level on this slice. The standalone freeze memo
+ (`reports/standalone_model_v2_freeze_memo.md`) reports it as a parity
+ outcome; per-metric numbers were not recorded for this surface, so it stands
+ as a narrative parity result rather than a table cell.

+ ## 8. Stop-signal follow-up

+ A narrow follow-up — internally `pilot 4` — was built to fix one specific
+ FEVER month/date temporal-insufficiency error. It fixed that single row but
+ weakened broader behavior on the larger evaluation surfaces. The project read
+ that outcome as a stop signal: further local-fix iteration was trading visible
+ gains for wider regressions, so the standalone release froze at the prior
+ checkpoint, Quotebound 27B. Canonical memo:
  `reports/sft_v1_v2_teacher_distill_pilot_v4_partialdev_status.md`.

  See [Figure 3: project release arc](./project_release_arc.svg) for the full
+ arc from baseline through the stop signal.

  ## 9. Discussion

+ **Why the release shape is hybrid plus standalone.** The strict contract has
+ four gates. Training alone closed the task and evidence gates on `probe_v0`,
+ but bridge `checkpoint-2` did not close the strict grounded success gate by
+ itself: raw bridge sat at strict `0.2727`. Deterministic packet-local
+ normalization was the finishing move that closed the remaining gates without
+ escaping the closed-packet boundary. That is what made the hybrid stack the
+ strongest benchmark-facing result, and why it is the benchmark-facing
+ release.
+
+ **What Quotebound 27B demonstrates.** The teacher-student cycle that
+ produced the standalone adapter asked how much of that winning behavior could
+ move into the model itself, evaluated entirely off `probe_v0`. The
+ fresh-holdout deltas show that the model-side gain is real: the raw
+ standalone roughly doubles raw bridge on quote F1 (`0.3343` → `0.6815`) and
+ beats it on task accuracy and evidence F1. This is the first standalone
+ model in the project to hold up across multiple evaluation surfaces beyond
+ the held-out probe; calling it the standalone release artifact is faithful
+ to that evidence.
+
+ **Why the stop signal matters.** The follow-up was a deliberate narrow
+ refinement. It worked on the one row it targeted and regressed broader
+ behavior. Reading that result as a stop signal — rather than running
+ additional local fixes — is the discipline the project decided to keep
+ visible in the release trail.

  **Distinction held throughout.** Perfect frozen `probe_v0` belongs to the
+ hybrid stack, not to the standalone adapter alone. The release ships both
+ results without collapsing them into one claim.

  ## 10. Contributions

  contributed.
  3. A finished benchmark-winning hybrid artifact: bridge `checkpoint-2` plus
  `deterministic_v3` packet-local normalization.
+ 4. A standalone model release — Quotebound 27B, a LoRA adapter on a 27B
+ reasoning-distilled base — that is the first standalone checkpoint in
+ the project to hold up across multiple evaluation surfaces beyond
+ `probe_v0`, and that roughly doubles raw quote-faithful behavior over
+ the earlier bridge.
+ 5. A documented stop signal that marks the point at which further local-fix
+ iteration began trading visible gains for wider regressions.

  ## 11. Intended use and limitations

+ **Intended use.** Closed-packet reasoning workflows: bounded document QA,
+ claim verification, policy and compliance review, contract reading, and
+ other settings where every answer has to be justified from a fixed body of
+ text. Also: research on evidence-faithful reasoning and abstention behavior.

  **Limitations.**

+ - This is **not** a general-purpose chatbot replacement. Behavior outside
  the closed-packet setting is not characterized.
+ - Perfect `probe_v0` belongs to the **hybrid stack**, not to the standalone
+ adapter alone. Treat `1.0000` numbers on `probe_v0` as the hybrid stack's
+ result.
+ - The downloadable artifact is the LoRA adapter only. `deterministic_v3` is
+ a separate post-processing step that lives in the project repository; the
+ benchmark-winning configuration is *adapter + normalizer*.
  - Frozen `probe_v0` item-level contents are intentionally not published with
  the release in order to preserve the held-out gate.
  - Perfect `probe_v0` is not proof of general faithful reasoning. It is proof
+ that the system meets the strict contract on a single frozen closed-packet
  benchmark.

  ## 12. Surfaces

  - final artifact memo: `reports/sft_v1_final_artifact_status.md`
  - standalone freeze memo: `reports/standalone_model_v2_freeze_memo.md`
  - fresh holdout comparison: `reports/standalone_model_v2_holdout_v1_bridge_vs_pilot3_status.md`
+ - stop-signal memo: `reports/sft_v1_v2_teacher_distill_pilot_v4_partialdev_status.md`
  - release brief: `evidence_faithful_reasoning_release_brief.pdf`
+ - model card: `README.md`
+ - Hugging Face release: [`darcar0/quotebound-27b`](https://huggingface.co/darcar0/quotebound-27b)