
Technical Note: Evidence-Faithful Reasoning Over Closed Evidence Packets

Updated: 2026-04-07

Abstract

This note describes a research-engineering project on strict evidence-faithful reasoning in open reasoning-distilled language models. The system answers from a closed packet of source text and is scored against a four-condition contract: answer correctly, cite the right packet units as evidence, quote those units verbatim, and abstain with "Insufficient evidence." when the packet does not justify a claim. The project ends in two finished results from the same frame. The first is a hybrid system, a trained bridge checkpoint plus a packet-local quote normalizer, that clears every gate of the contract on the frozen held-out probe (probe_v0). The second is Quotebound 27B, released as a LoRA adapter on Hugging Face (darcar0/quotebound-27b). The hybrid system is the benchmark-facing release; Quotebound 27B is the downloadable model release.

Keywords: evidence-faithful reasoning, grounded QA, claim verification, closed-packet reasoning, abstention, attribution.

1. Problem

Language models routinely produce correct-looking answers while grounding poorly. In practice, that failure mode shows up as one or more of the following:

  • the answer is right but the cited evidence is wrong;
  • the quote is too broad, too narrow, or not recoverable from the source text at all;
  • the model fails to abstain when the packet does not justify the answer;
  • the output reads as persuasive even when the underlying evidence is insufficient.

The question this project asks is not whether a language model can answer from a closed packet. It is whether the model can answer faithfully — with recoverable support, and with the correct abstention behavior when the packet falls short.

2. Benchmark contract

Each task arrives with a closed packet of source text. To count as a success, the system has to clear four conditions on the same answer:

  1. Answer correctly — return the right answer or label for the task.
  2. Pick the right evidence — the cited units must be the packet locations that actually support the answer.
  3. Quote exact support — every quote is a verbatim substring of its cited unit. No paraphrase, no stitching, no ellipsis.
  4. Abstain when blocked — if the packet does not justify a claim, the answer must be exactly "Insufficient evidence." (including the trailing period).

Correctness alone is not credited.
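
The four conditions can be read as a single all-or-nothing check per task. The sketch below is illustrative only and is not the project's eval harness: the Citation and Prediction types, the strict_success function, the case-insensitive answer match, the exact-set match on cited units, and the choice to require no citations on an abstention are all assumptions made for the example.

```python
from dataclasses import dataclass, field

ABSTAIN = "Insufficient evidence."  # exact abstention string from the contract


@dataclass
class Citation:
    unit_id: str  # id of the cited packet unit (hypothetical field name)
    quote: str    # text claimed to be copied from that unit


@dataclass
class Prediction:
    answer: str
    citations: list = field(default_factory=list)


def strict_success(pred, packet, gold_answer, gold_units):
    """Return True only if every gate holds on the same answer.

    packet maps unit ids to unit text; gold_units lists the supporting ids.
    """
    # Gate 4: a blocked task is only satisfied by the exact abstention string.
    if gold_answer == ABSTAIN:
        return pred.answer == ABSTAIN
    # Gate 1: the answer or label must match (assumed case-insensitive here).
    if pred.answer.strip().lower() != gold_answer.strip().lower():
        return False
    # Gate 2: the cited units must be the supporting units (assumed exact set).
    if {c.unit_id for c in pred.citations} != set(gold_units):
        return False
    # Gate 3: every quote is a non-empty verbatim substring of its cited unit.
    return all(
        c.quote and c.quote in packet.get(c.unit_id, "")
        for c in pred.citations
    )
```

Under this reading, a correct answer with a paraphrased quote scores the same as a wrong answer, which is the sense in which correctness alone is not credited.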

The frozen held-out benchmark surface for the final release cycle is data/probe_v0/. It stayed frozen throughout the cycle and was never used as a tuning surface.

3. Method

The project worked through a full research-engineering loop. The eval harness and a frozen baseline came first; baselines were scored end-to-end under the strict contract before any intervention was tried, and a failure mapping over those baseline runs identified where the model was losing each gate. From there, prompt and structure variants were tested against the same harness, and a training-backed intervention path was built on top of the strongest learned configuration. Where the model still underperformed the contract, a deterministic packet-local quote normalizer was layered on as a finishing repair, and held-out evaluation was rerun on the protected split. Once the hybrid stack solved the held-out probe, a teacher–student distillation cycle pushed the winning behavior back into the model itself, evaluated entirely off the held-out probe. The public release stops at the strongest standalone artifact that preserved the broader gains reported below.

Train and dev surfaces are derived from public FEVER-style verify-claim data, public HotpotQA-style grounded-QA data, and project-local packet scaffolding built on top of those upstream sources.

4. Benchmark-facing system

The benchmark-facing release is a hybrid of a learned model and a deterministic post-processing step:

  • Bridge checkpoint: outputs/sft_v1_v2_qlora_bridge_v1_run1/checkpoint-2
  • Packet-local quote normalizer: python3 scripts/normalize_quotes.py --mode deterministic_v3

Both parts are necessary. On probe_v0, training closes the task gap to the older frozen baseline (0.9545 to 1.0000) and raises evidence F1 and raw quote F1, but it does not close the strict contract by itself. The packet-local normalizer is the finishing repair: it closes the contract on quote faithfulness and strict grounded success without leaving the closed-packet boundary, by correcting quote spans against the cited units inside each packet.
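
This note does not describe the internals of deterministic_v3, so the following is only a minimal sketch of what "correcting quote spans against the cited units" could look like. The snap_quote function, its min_chars threshold, and the longest-common-substring strategy are assumptions for illustration, not the project's normalizer. The one property it shares with the description above is that it reads only the cited unit, so any span it returns is verbatim packet text.

```python
from difflib import SequenceMatcher


def snap_quote(quote, unit_text, min_chars=12):
    """Return a verbatim span of unit_text that best matches quote, or None.

    Hypothetical repair rule: keep the quote if it is already verbatim,
    otherwise fall back to the longest substring shared with the cited unit.
    """
    if quote in unit_text:
        return quote  # already a verbatim substring, nothing to repair
    matcher = SequenceMatcher(None, quote, unit_text, autojunk=False)
    match = matcher.find_longest_match(0, len(quote), 0, len(unit_text))
    # Slice from unit_text (not from the model output) so the span is verbatim.
    span = unit_text[match.b : match.b + match.size].strip()
    # Refuse very short overlaps; they are unlikely to be real support.
    return span if len(span) >= min_chars else None
```

For example, a model quote that drops the middle of a sentence would be snapped back to the longest piece that really occurs in the unit, while an unrelated quote returns None instead of being forced to fit.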

5. Benchmark-facing result

Frozen held-out probe_v0 progression:

| Stack | Task | Strict | Evidence F1 | Quote F1 |
|---|---|---|---|---|
| Frozen v2 baseline | 0.9545 | 0.1818 | 0.8758 | 0.2494 |
| Bridge checkpoint-2 (raw) | 1.0000 | 0.2727 | 0.8844 | 0.4409 |
| Bridge + deterministic_v2 | 1.0000 | 0.4091 | 0.8844 | 0.5773 |
| Bridge + deterministic_v3 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |

See Figure 1: benchmark progression.

The full final-artifact frozen probe_v0 score is task 1.0000, strict grounded success 1.0000, evidence F1 1.0000, quote F1 1.0000, verify label accuracy 1.0000, grounded QA accuracy 1.0000, contrastive consistency 1.0000, invalid / missing rate 0.0000. Canonical memo: reports/sft_v1_final_artifact_status.md.
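
The note reports evidence F1 without spelling out its formula. One common reading, assumed here and not confirmed by the project memos, is a set-level F1 between the cited unit ids and the gold supporting unit ids for each task. The set_f1 helper below is a hypothetical illustration of that reading.

```python
def set_f1(predicted, gold):
    """Set-level F1 between predicted and gold unit ids (assumed definition)."""
    pred, ref = set(predicted), set(gold)
    if not pred and not ref:
        return 1.0  # nothing to cite and nothing cited counts as a match
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Under this definition, citing one correct unit plus one extra unit gives precision 0.5 and recall 1.0, so F1 is 2/3. This would explain how a stack can show high evidence F1 while strict grounded success stays low.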

6. Quotebound 27B

After the hybrid stack solved the benchmark, the project asked a second question inside the same frame: how much of that winning behavior can be moved into the model itself, evaluated on surfaces outside the held-out probe?

That question produced Quotebound 27B, a teacher-student distillation checkpoint published as a LoRA adapter on top of Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2.

This is the strongest standalone model artifact from the project. It is the first standalone checkpoint to hold up across multiple evaluation surfaces beyond probe_v0, and on a fresh public holdout it roughly doubles the earlier bridge model's raw quote F1 (0.3343 to 0.6815) and raw strict grounded success (0.2222 to 0.4444). Selection happened entirely off probe_v0; the held-out gate stayed frozen.

7. Quotebound 27B results

7.1 Fresh 36-task mixed public holdout

A held-out slice of 18 FEVER verify-claim tasks plus 18 HotpotQA grounded-QA tasks, drawn from public sources and de-duplicated against every training, dev, and probe_v0 row.
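
The note does not say how rows were compared during de-duplication. A minimal sketch, assuming rows are matched on case-folded, whitespace-normalized text, is shown below; row_key and dedupe_against are hypothetical names, and the actual matching rule used by the project may differ.

```python
import hashlib
import re


def row_key(text):
    """Hash a row after case-folding and collapsing whitespace (assumed rule)."""
    normalized = re.sub(r"\s+", " ", text.casefold()).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedupe_against(candidates, *protected_splits):
    """Keep candidate rows whose key appears in none of the protected splits."""
    seen = {row_key(text) for split in protected_splits for text in split}
    fresh = []
    for text in candidates:
        key = row_key(text)
        if key not in seen:
            seen.add(key)  # also drops repeats inside the candidate pool
            fresh.append(text)
    return fresh
```

A call such as dedupe_against(pool, train_rows, dev_rows, probe_rows) would then return only rows unseen in all three splits.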

| Stack | Task | Strict | Evidence F1 | Quote F1 |
|---|---|---|---|---|
| Bridge raw | 0.8611 | 0.2222 | 0.8815 | 0.3343 |
| Quotebound raw | 0.8889 | 0.4444 | 0.9093 | 0.6815 |
| Bridge + deterministic_v3 | 0.8611 | 0.5833 | 0.8815 | 0.8815 |
| Quotebound + deterministic_v3 | 0.8889 | 0.5833 | 0.9093 | 0.9093 |

See Figure 2: standalone holdout comparison.

Quotebound 27B beats the prior bridge model on task accuracy, evidence F1, and quote F1 in both raw and normalized form, ties it on normalized strict grounded success (0.5833), and roughly doubles raw quote F1 at the model level (0.3343 to 0.6815). Grounded-QA accuracy on this slice is 1.0000 for both stacks, and there were zero invalid outputs across every reported evaluation surface. Canonical memo: reports/standalone_model_v2_holdout_v1_bridge_vs_pilot3_status.md.
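
As a sanity check on the table above, the task and strict rates can be converted back into task counts out of 36, and the "roughly doubles" claim can be checked as a ratio. The sketch assumes each rate is a plain fraction of the 36 holdout tasks rounded to four decimals; the counts are inferred from the reported rates and do not come from the project memos.

```python
N_TASKS = 36  # fresh mixed public holdout: 18 FEVER + 18 HotpotQA tasks

# Task and strict rates copied from the holdout table.
REPORTED = {
    "Bridge raw": {"task": 0.8611, "strict": 0.2222},
    "Quotebound raw": {"task": 0.8889, "strict": 0.4444},
    "Bridge + deterministic_v3": {"task": 0.8611, "strict": 0.5833},
    "Quotebound + deterministic_v3": {"task": 0.8889, "strict": 0.5833},
}


def implied_count(rate, n_tasks):
    """Recover the integer task count behind a rate rounded to 4 decimals."""
    count = round(rate * n_tasks)
    if abs(count / n_tasks - rate) >= 5e-5:
        raise ValueError(f"{rate} is not a 4-decimal fraction of {n_tasks}")
    return count


counts = {
    stack: {metric: implied_count(rate, N_TASKS) for metric, rate in row.items()}
    for stack, row in REPORTED.items()
}

# Raw quote F1 ratio, Quotebound over bridge.
quote_f1_ratio = 0.6815 / 0.3343

print(counts)
print(round(quote_f1_ratio, 2))
```

Read this way, the raw strict gain is 8 to 16 of 36 tasks, both normalized stacks reach 21 of 36, the task-accuracy gain is a single task (31 to 32), and the raw quote F1 ratio is about 2.04.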

7.2 Fixed dev triage slice (21 tasks)

| Stack | Task | Strict | Evidence F1 | Quote F1 |
|---|---|---|---|---|
| Quotebound + deterministic_v3 | 1.0000 | 0.6190 | 0.8320 | 0.7095 |

7.3 Untouched 104-task HotpotQA shadow slice

The raw standalone adapter improved quote-faithful behavior over raw bridge, and Quotebound plus deterministic_v3 matched bridge + deterministic_v3 at the system level on this slice. The standalone freeze memo (reports/standalone_model_v2_freeze_memo.md) reports it as a parity outcome; per-metric numbers were not recorded for this surface, so it stands as a narrative parity result rather than a table cell.

8. Release boundary

The public release centers on the two strongest finished artifacts from the project: the benchmark-winning hybrid stack on frozen probe_v0, and Quotebound 27B as the strongest standalone model. That boundary keeps the package focused on the results that held up across the evaluation surfaces reported here.

9. Discussion

Why the release shape is hybrid plus standalone. The strict contract has four gates. Training alone closed the task gate on probe_v0 (task 1.0000), but bridge checkpoint-2 did not close the remaining gates by itself: raw bridge sat at strict 0.2727, evidence F1 0.8844, and quote F1 0.4409. Deterministic packet-local normalization was the finishing move that closed those gates without escaping the closed-packet boundary. That is what made the hybrid stack the strongest benchmark-facing result, and why it is the benchmark-facing release.

What Quotebound 27B demonstrates. The teacher-student cycle that produced the standalone adapter asked how much of that winning behavior could move into the model itself, evaluated entirely off probe_v0. The fresh-holdout deltas show that the model-side gain is real: the raw standalone roughly doubles raw bridge on quote F1 (0.3343 → 0.6815) and beats it on task accuracy and evidence F1. This is the first standalone model in the project to hold up across multiple evaluation surfaces beyond the held-out probe; calling it the standalone release artifact is faithful to that evidence.

Why the release stops here. The package is intentionally centered on the strongest benchmark-facing system and the strongest standalone model. That keeps the public release focused on finished results rather than on intermediate local variations.

Distinction held throughout. Perfect frozen probe_v0 belongs to the hybrid stack, not to the standalone adapter alone. The release ships both results without collapsing them into one claim.

10. Contributions

  1. A strict packet-faithful reasoning benchmark setup that requires answer, evidence, exact quotes, and "Insufficient evidence." abstention together, on a frozen held-out surface.
  2. A baseline-to-intervention evidence trail that documents what training, prompt structure, and deterministic packet-local normalization each contributed.
  3. A finished benchmark-winning hybrid artifact: bridge checkpoint-2 plus deterministic_v3 packet-local normalization.
  4. A standalone model release, Quotebound 27B (a LoRA adapter on a 27B reasoning-distilled base), that is the first standalone checkpoint in the project to hold up across multiple evaluation surfaces beyond probe_v0, and that roughly doubles the earlier bridge's raw quote F1 on the fresh holdout (0.3343 to 0.6815).
  5. A public artifact set centered on the benchmark-winning hybrid system, Quotebound 27B, and the technical note that explains how the two fit together.

11. Intended use and limitations

Intended use. Closed-packet reasoning workflows: bounded document QA, claim verification, policy and compliance review, contract reading, and other settings where every answer has to be justified from a fixed body of text. Also: research on evidence-faithful reasoning and abstention behavior.

Limitations.

  • This is not a general-purpose chatbot replacement. Behavior outside the closed-packet setting is not characterized.
  • Perfect probe_v0 belongs to the hybrid stack, not to the standalone adapter alone. Treat 1.0000 numbers on probe_v0 as the hybrid stack's result.
  • The downloadable artifact is the LoRA adapter only. deterministic_v3 is a separate post-processing step that lives in the project repository; the benchmark-winning configuration is adapter + normalizer.
  • Frozen probe_v0 item-level contents are intentionally not published with the release in order to preserve the held-out gate.
  • Perfect probe_v0 is not proof of general faithful reasoning. It is proof that the system meets the strict contract on a single frozen closed-packet benchmark.

12. Surfaces

Canonical project surfaces:

  • final artifact memo: reports/sft_v1_final_artifact_status.md
  • standalone freeze memo: reports/standalone_model_v2_freeze_memo.md
  • fresh holdout comparison: reports/standalone_model_v2_holdout_v1_bridge_vs_pilot3_status.md
  • model card: README.md
  • technical note: technical_note_evidence_faithful_reasoning.md
  • Hugging Face release: darcar0/quotebound-27b