Technical Note: Evidence-Faithful Reasoning Over Closed Evidence Packets
Updated: 2026-04-07
Abstract
This note describes a research-engineering project on strict
evidence-faithful reasoning in open reasoning-distilled language models.
The system answers from a closed packet of source text and is scored
against a four-condition contract: answer correctly, cite the right
packet units as evidence, quote those units verbatim, and abstain with
`Insufficient evidence.` when the packet does not justify a claim. The
project ends in two finished results from the same frame. The first is a
hybrid system — a trained bridge checkpoint plus a packet-local quote
normalizer — that clears every gate of the contract on the frozen
held-out probe (probe_v0). The second is Quotebound 27B, released as a
LoRA adapter on Hugging Face
(darcar0/quotebound-27b).
The hybrid system is the benchmark-facing release; Quotebound 27B is the
downloadable model release.
Keywords: evidence-faithful reasoning, grounded QA, claim verification, closed-packet reasoning, abstention, attribution.
1. Problem
Language models routinely produce correct-looking answers while grounding poorly. In practice, that failure mode shows up as one or more of the following:
- the answer is right but the cited evidence is wrong;
- the quote is too broad, too narrow, or not recoverable from the source text at all;
- the model fails to abstain when the packet does not justify the answer;
- the output reads as persuasive even when the underlying evidence is insufficient.
The question this project asks is not whether a language model can answer from a closed packet. It is whether the model can answer faithfully — with recoverable support, and with the correct abstention behavior when the packet falls short.
2. Benchmark contract
Each task arrives with a closed packet of source text. To count as a success, the system has to clear four conditions on the same answer:
- Answer correctly: return the right answer or label for the task.
- Pick the right evidence: the cited units must be the packet locations that actually support the answer.
- Quote exact support: every quote is a verbatim substring of its cited unit. No paraphrase, no stitching, no ellipsis.
- Abstain when blocked: if the packet does not justify a claim, the answer must be exactly `Insufficient evidence.`
Correctness alone is not credited.
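The four conditions can be read as a single all-or-nothing check per task. The sketch below is illustrative only and is not the project's eval harness. The `Citation` type, the `passes_contract` function, the dict-shaped packet, exact-match answer comparison, and exact-set evidence matching are all assumptions made for the example.

```python
from dataclasses import dataclass

ABSTAIN = "Insufficient evidence."  # the exact abstention string required by the contract


@dataclass
class Citation:
    unit_id: str  # hypothetical identifier of a packet unit
    quote: str    # text the system claims to have copied from that unit


def passes_contract(answer, citations, packet, gold_answer, gold_units):
    """Return True only if every condition holds for one task.

    Assumed shapes: `packet` maps unit_id -> unit text, answers are compared
    by exact match, and cited units must equal the gold set exactly.
    """
    # Abstention: when the packet does not justify a claim, only the exact
    # abstention string with no citations counts.
    if gold_answer == ABSTAIN:
        return answer == ABSTAIN and not citations
    # Answer correctly.
    if answer != gold_answer:
        return False
    # Pick the right evidence.
    if {c.unit_id for c in citations} != set(gold_units):
        return False
    # Quote exact support: each quote is a non-empty verbatim substring.
    return all(c.quote and c.quote in packet.get(c.unit_id, "") for c in citations)
```

A correct answer with a paraphrased quote fails this check, which is the sense in which correctness alone is not credited.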
The frozen held-out benchmark surface for the final release cycle is
data/probe_v0/. It stayed frozen throughout the cycle and was never
used as a tuning surface.
3. Method
The project worked through a full research-engineering loop. The eval harness and a frozen baseline came first; baselines were scored end-to-end under the strict contract before any intervention was tried, and a failure mapping over those baseline runs identified where the model was losing each gate. From there, prompt and structure variants were tested against the same harness, and a training-backed intervention path was built on top of the strongest learned configuration. Where the model still underperformed the contract, a deterministic packet-local quote normalizer was layered on as a finishing repair, and held-out evaluation was rerun on the protected split. Once the hybrid stack solved the held-out probe, a teacher–student distillation cycle pushed the winning behavior back into the model itself, evaluated entirely off the held-out probe. The public release stops at the strongest standalone artifact that preserved the broader gains reported below.
Train and dev surfaces are derived from public FEVER-style verify-claim data, public HotpotQA-style grounded-QA data, and project-local packet scaffolding built on top of those upstream sources.
4. Benchmark-facing system
The benchmark-facing release is a hybrid of a learned model and a deterministic post-processing step:
- Bridge checkpoint: `outputs/sft_v1_v2_qlora_bridge_v1_run1/checkpoint-2`
- Packet-local quote normalizer: `python3 scripts/normalize_quotes.py --mode deterministic_v3`
Both parts are necessary. Training closes the gap between the older frozen baseline and the strict contract on task and evidence accuracy. The packet-local normalizer is the finishing repair: it closes the contract on quote faithfulness and strict grounded success without leaving the closed-packet boundary, by correcting quote spans against the cited units inside each packet.
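To make "correcting quote spans against the cited units" concrete, here is a minimal sketch of one possible packet-local repair. It is not `deterministic_v3`, whose rules are not described in this note. The function name `snap_quote_to_unit` and the longest-common-span strategy are assumptions chosen for illustration. The only property it shares with the description above is that it is deterministic and reads nothing outside the cited unit.

```python
from difflib import SequenceMatcher


def snap_quote_to_unit(quote: str, unit_text: str) -> str:
    """Replace a near-miss quote with the longest span of the cited unit
    that also appears contiguously in the quote.

    Hypothetical repair rule: the result is always a verbatim substring of
    `unit_text`, or the empty string when nothing overlaps.
    """
    matcher = SequenceMatcher(None, quote, unit_text, autojunk=False)
    match = matcher.find_longest_match(0, len(quote), 0, len(unit_text))
    # Slice from the unit, not from the model output, so the text is verbatim.
    return unit_text[match.b : match.b + match.size].strip()


unit = "The Eiffel Tower was completed in 1889 for the World's Fair."
model_quote = "The Eiffel Tower was completed in 1889, for the fair"
repaired = snap_quote_to_unit(model_quote, unit)
print(repaired)  # The Eiffel Tower was completed in 1889
```

A caller would still need to treat an empty or very short result as a failed repair rather than as valid support.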
5. Benchmark-facing result
Frozen held-out probe_v0 progression:
| Stack | Task | Strict | Evidence F1 | Quote F1 |
|---|---|---|---|---|
| Frozen v2 baseline | 0.9545 | 0.1818 | 0.8758 | 0.2494 |
| Bridge checkpoint-2 (raw) | 1.0000 | 0.2727 | 0.8844 | 0.4409 |
| Bridge + deterministic_v2 | 1.0000 | 0.4091 | 0.8844 | 0.5773 |
| Bridge + deterministic_v3 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
See Figure 1: benchmark progression.
The full final-artifact frozen probe_v0 score is task 1.0000, strict
grounded success 1.0000, evidence F1 1.0000, quote F1 1.0000, verify
label accuracy 1.0000, grounded QA accuracy 1.0000, contrastive
consistency 1.0000, invalid / missing rate 0.0000. Canonical memo:
reports/sft_v1_final_artifact_status.md.
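This note does not spell out how the table columns are computed, so the following is only a plausible reading: evidence F1 as a set F1 over cited unit identifiers, and strict grounded success as the share of tasks that pass every gate at once. The function names and the per-task flag layout are assumptions for the sketch.

```python
def set_f1(predicted, gold):
    """Set-based F1 between predicted and gold items (for example unit ids)."""
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        return 1.0  # nothing expected, nothing cited
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def strict_rate(per_task_flags):
    """Share of tasks where all gates pass together.

    `per_task_flags` is an assumed layout: one dict of booleans per task.
    """
    if not per_task_flags:
        return 0.0
    gates = ("task", "evidence", "quote", "abstain")
    passed = [all(flags[g] for g in gates) for flags in per_task_flags]
    return sum(passed) / len(passed)


flags = [
    {"task": True, "evidence": True, "quote": True, "abstain": True},
    {"task": True, "evidence": True, "quote": False, "abstain": True},
]
task_rate = sum(f["task"] for f in flags) / len(flags)
print(task_rate, strict_rate(flags))  # 1.0 0.5
```

Under this reading, a stack can sit at task 1.0000 while strict stays low, which is the pattern the raw bridge row shows.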
6. Quotebound 27B
After the hybrid stack solved the benchmark, the project asked a second question inside the same frame: how much of that winning behavior can be moved into the model itself, evaluated on surfaces outside the held-out probe?
That question produced Quotebound 27B, a teacher-student distillation
checkpoint published as a LoRA adapter on top of
Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2.
- Public release identity: `darcar0/quotebound-27b`
This is the strongest standalone model artifact from the project. It is the
first standalone checkpoint to hold up across multiple evaluation surfaces
beyond probe_v0, and on a fresh public holdout it roughly doubles the earlier
bridge model's raw quote F1 (0.3343 to 0.6815) and raw strict grounded success
(0.2222 to 0.4444). Selection happened entirely off probe_v0; the held-out
gate stayed frozen.
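Because the release is a LoRA adapter rather than a full checkpoint, loading it means attaching the adapter to the base model. The sketch below uses the standard PEFT pattern and makes several assumptions: that the base model loads through `AutoModelForCausalLM`, that `accelerate` is installed for `device_map="auto"`, and that the installed Transformers release accepts `dtype` (older releases spell it `torch_dtype`). The helper name `load_quotebound` is hypothetical.

```python
BASE_ID = "Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2"
ADAPTER_ID = "darcar0/quotebound-27b"


def load_quotebound(device_map="auto"):
    """Load the base model and attach the Quotebound LoRA adapter.

    Assumption: the base is loadable as a causal LM. Imports are kept inside
    the function so this file can be imported without the heavy dependencies.
    """
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
    base = AutoModelForCausalLM.from_pretrained(
        BASE_ID,
        dtype="auto",           # `torch_dtype="auto"` on older Transformers releases
        device_map=device_map,  # requires the `accelerate` package
    )
    # Attach the adapter weights on top of the frozen base model.
    model = PeftModel.from_pretrained(base, ADAPTER_ID)
    return tokenizer, model
```

This loads the standalone adapter only; the quote normalizer is a separate post-processing step and is not part of the download.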
7. Quotebound 27B results
7.1 Fresh 36-task mixed public holdout
This surface is a held-out slice of 18 FEVER verify-claim tasks plus 18
HotpotQA grounded-QA tasks, drawn from public sources and de-duplicated
against every training, dev, and probe_v0 row.
| Stack | Task | Strict | Evidence F1 | Quote F1 |
|---|---|---|---|---|
| Bridge raw | 0.8611 | 0.2222 | 0.8815 | 0.3343 |
| Quotebound raw | 0.8889 | 0.4444 | 0.9093 | 0.6815 |
| Bridge + deterministic_v3 | 0.8611 | 0.5833 | 0.8815 | 0.8815 |
| Quotebound + deterministic_v3 | 0.8889 | 0.5833 | 0.9093 | 0.9093 |
See Figure 2: standalone holdout comparison.
Quotebound 27B beats the prior bridge model on task accuracy, evidence F1,
and quote F1 in both raw and normalized form, ties normalized strict, and
roughly doubles raw quote F1 at the model level (0.3343 to 0.6815).
Grounded-QA accuracy on this slice is 1.0000 for both stacks, and invalid
outputs were zero across every reported evaluation surface. Canonical memo:
reports/standalone_model_v2_holdout_v1_bridge_vs_pilot3_status.md.
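The "roughly doubles" claim can be checked directly from the raw rows of the table. The short calculation below uses only the reported numbers. The conversion of strict rates into task counts assumes the rates are plain fractions over the 36 holdout tasks, which the note does not state explicitly.

```python
TASKS = 36  # size of the mixed public holdout

holdout_raw = {
    "Bridge raw": {"task": 0.8611, "strict": 0.2222, "evidence_f1": 0.8815, "quote_f1": 0.3343},
    "Quotebound raw": {"task": 0.8889, "strict": 0.4444, "evidence_f1": 0.9093, "quote_f1": 0.6815},
}


def ratio(metric):
    """Quotebound raw divided by Bridge raw for one metric."""
    return holdout_raw["Quotebound raw"][metric] / holdout_raw["Bridge raw"][metric]


def delta(metric):
    """Absolute gain of Quotebound raw over Bridge raw for one metric."""
    return holdout_raw["Quotebound raw"][metric] - holdout_raw["Bridge raw"][metric]


# Assumption: strict rates are simple fractions of the 36 tasks.
strict_counts = {name: round(row["strict"] * TASKS) for name, row in holdout_raw.items()}

print(f"quote F1 ratio: {ratio('quote_f1'):.2f}")   # quote F1 ratio: 2.04
print(f"strict ratio:   {ratio('strict'):.2f}")     # strict ratio:   2.00
print(f"task delta:     {delta('task'):.4f}")       # task delta:     0.0278
print(strict_counts)  # {'Bridge raw': 8, 'Quotebound raw': 16}
```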
7.2 Fixed dev triage slice (21 tasks)
| Stack | Task | Strict | Evidence F1 | Quote F1 |
|---|---|---|---|---|
| Quotebound + deterministic_v3 | 1.0000 | 0.6190 | 0.8320 | 0.7095 |
7.3 Untouched 104-task HotpotQA shadow slice
The raw standalone adapter improved quote-faithful behavior over raw bridge,
and Quotebound plus deterministic_v3 matched bridge + deterministic_v3
at the system level on this slice. The standalone freeze memo
(reports/standalone_model_v2_freeze_memo.md) reports it as a parity
outcome; per-metric numbers were not recorded for this surface, so it stands
as a narrative parity result rather than a table cell.
8. Release boundary
The public release centers on the two strongest finished artifacts from the
project: the benchmark-winning hybrid stack on frozen probe_v0, and
Quotebound 27B as the strongest standalone model. That boundary keeps the
package focused on the results that held up across the evaluation surfaces
reported here.
9. Discussion
Why the release shape is hybrid plus standalone. The strict contract has
four gates. Training alone brought task accuracy on probe_v0 to 1.0000 and
lifted evidence F1 slightly (0.8758 to 0.8844), but bridge checkpoint-2 did
not close the strict grounded success gate by itself: raw bridge sat at
strict 0.2727 with quote F1 at 0.4409. Deterministic packet-local
normalization was the finishing move that closed the remaining gates without
escaping the closed-packet boundary. That is what made the hybrid stack the
strongest benchmark-facing result, and why it is the benchmark-facing
release.
What Quotebound 27B demonstrates. The teacher-student cycle that
produced the standalone adapter asked how much of that winning behavior could
move into the model itself, evaluated entirely off probe_v0. The
fresh-holdout deltas show that the model-side gain is real: the raw
standalone roughly doubles raw bridge on quote F1 (0.3343 → 0.6815) and
beats it on task accuracy and evidence F1. This is the first standalone
model in the project to hold up across multiple evaluation surfaces beyond
the held-out probe; calling it the standalone release artifact is faithful
to that evidence.
Why the release stops here. The package is intentionally centered on the strongest benchmark-facing system and the strongest standalone model. That keeps the public release focused on finished results rather than on intermediate local variations.
Distinction held throughout. Perfect frozen probe_v0 belongs to the
hybrid stack, not to the standalone adapter alone. The release ships both
results without collapsing them into one claim.
10. Contributions
- A strict packet-faithful reasoning benchmark setup that requires answer, evidence, exact quotes, and `Insufficient evidence.` abstention together, on a frozen held-out surface.
- A baseline-to-intervention evidence trail that documents what training, prompt structure, and deterministic packet-local normalization each contributed.
- A finished benchmark-winning hybrid artifact: bridge `checkpoint-2` plus `deterministic_v3` packet-local normalization.
- A standalone model release: Quotebound 27B, a LoRA adapter on a 27B reasoning-distilled base. It is the first standalone checkpoint in the project to hold up across multiple evaluation surfaces beyond `probe_v0`, and it roughly doubles raw quote-faithful behavior over the earlier bridge.
- A public artifact set centered on the benchmark-winning hybrid system, Quotebound 27B, and the technical note that explains how the two fit together.
11. Intended use and limitations
Intended use. Closed-packet reasoning workflows: bounded document QA, claim verification, policy and compliance review, contract reading, and other settings where every answer has to be justified from a fixed body of text. The release is also intended for research on evidence-faithful reasoning and abstention behavior.
Limitations.
- This is not a general-purpose chatbot replacement. Behavior outside the closed-packet setting is not characterized.
- Perfect `probe_v0` belongs to the hybrid stack, not to the standalone adapter alone. Treat `1.0000` numbers on `probe_v0` as the hybrid stack's result.
- The downloadable artifact is the LoRA adapter only. `deterministic_v3` is a separate post-processing step that lives in the project repository; the benchmark-winning configuration is adapter + normalizer.
- Frozen `probe_v0` item-level contents are intentionally not published with the release in order to preserve the held-out gate.
- Perfect `probe_v0` is not proof of general faithful reasoning. It is proof that the system meets the strict contract on a single frozen closed-packet benchmark.
12. Surfaces
Canonical project surfaces:
- final artifact memo: `reports/sft_v1_final_artifact_status.md`
- standalone freeze memo: `reports/standalone_model_v2_freeze_memo.md`
- fresh holdout comparison: `reports/standalone_model_v2_holdout_v1_bridge_vs_pilot3_status.md`
- model card: `README.md`
- technical note: `technical_note_evidence_faithful_reasoning.md`
- Hugging Face release: `darcar0/quotebound-27b`