PBH Applied Systems publishes evaluated open-weight GGUF models for practical AI deployment, with an emphasis on quantized inference, agentic workflows, structured outputs, tool use, and production reliability.
Every model published under this organization is converted, evaluated, and documented by PBH Applied Systems using its proprietary `quant_eval` framework. The evaluation process compares full-precision and quantized variants across agent-adjacent task families including structured JSON output, tool dispatch, multi-turn state retention, mixed natural language plus JSON responses, multiple-choice extraction, fuzz-style constraint adherence, and multi-step planning.
These model cards are designed to support deployment decisions, not just model discovery. Each card documents practical behavior, quantization trade-offs, failure modes, recommended use cases, hardware requirements, and guardrails for production use.
Try the live PBH Applied Systems AI Agent Demo:
https://pbhappliedsystems.com/assistant.html
The demo lets visitors interact with evaluated quantized open-weight models across reasoning, document intelligence, and code automation workflows running on private GPU infrastructure.
**New flagship dataset, and an argument about what a dataset card should be.**
Most synthetic datasets on the Hub ship row counts, a license, and little else: pipeline opaque, rejection criteria unstated, compliance unaudited. We published the opposite.
**1,116** quality-gated instruction records across **7 regulated domains** (medical, legal, GDPR, privacy, education, e-commerce, transport). Every record cleared a documented cascade, not a vibe check:
- **Dual-signal hallucination gate**: rejects only when embedding cosine *and* keyword-overlap both fail; a low score alone never rejects.
- **Layered PII masking + independent leak audit**: a separate over-reporting scanner found **0.0% residual leak** across all 1,116 records.
- **Whole-corpus evaluation, not a sample**: MATTR **0.769**, mean cosine **0.73**, **0%** near-duplicates, **96.9%** yield.
- **The 36 rejections ship too**, each tagged with its failing gate. Removal at the gate is the product; we show our work.
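The dual-signal gate described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline's actual code: the function name, the `GateResult` type, and both threshold values are hypothetical, since the card only says that a record is rejected when both signals fail. The last lines check that the reported yield is consistent with 1,116 accepted and 36 rejected records, assuming yield means accepted divided by accepted plus rejected.

```python
from dataclasses import dataclass

# Hypothetical thresholds; the card does not state the real cut-offs.
COSINE_MIN = 0.5
KEYWORD_OVERLAP_MIN = 0.2


@dataclass
class GateResult:
    rejected: bool
    failed_signals: list[str]


def dual_signal_gate(cosine: float, keyword_overlap: float) -> GateResult:
    """Reject only when BOTH signals fall below their thresholds."""
    failed = []
    if cosine < COSINE_MIN:
        failed.append("embedding_cosine")
    if keyword_overlap < KEYWORD_OVERLAP_MIN:
        failed.append("keyword_overlap")
    # A single failing signal is recorded but never rejects on its own.
    return GateResult(rejected=len(failed) == 2, failed_signals=failed)


# Consistency check on the published counts (assumed definition of yield).
accepted, rejected = 1116, 36
yield_pct = round(100 * accepted / (accepted + rejected), 1)
```

Recording the failing signals even when a record passes is one way to support the per-rejection gate tags the card mentions; whether the real pipeline does this is not stated.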
Every number on the card is a field in the `evaluation_report.json` shipped beside the data, with full methodology and provenance (Mistral-Nemo AWQ W4A16 · vLLM 0.8.5.post1 · Modal A10G).
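For readers unfamiliar with the MATTR figure reported above, here is a rough sketch of the standard moving-average type-token ratio. The window size of 50 and the use of pre-split tokens are assumptions for illustration; the card does not state which tokenizer or window the evaluation used.

```python
def mattr(tokens: list[str], window: int = 50) -> float:
    """Moving-average type-token ratio over fixed-size sliding windows."""
    if not tokens:
        return 0.0
    if len(tokens) <= window:
        # Text shorter than one window: fall back to the plain type-token ratio.
        return len(set(tokens)) / len(tokens)
    # Unique-token ratio for every window position, then the mean of those ratios.
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)
```

Because every window has the same length, the score is less sensitive to text length than a plain type-token ratio, which is why it suits whole-corpus comparisons.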
One release from **SynthEval**: Studio (local GPU) + Cloud (Modal+vLLM), proving quality parity across substrates.
**What it is:** A side-by-side ReAct agent comparison platform running 9 independently evaluated GGUF models. Select any two models, pick an agent template, submit a query, and watch both agents reason through it in real time, with `quant_eval` v7.21 behavioral scores displayed alongside every response.
**What quant_eval v7.21 measures:** 42 fixture cases across 8 task families: `json_multistep`, `stateful_followup`, `toolcall_only`, `mixed_brief_json`, `toolcall`, `json`, `fuzz`, and `mcq`. Every model is evaluated at both F16 and Q4_K_M precision where hardware permits. The difference between the two runs is the quantization impact report.
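As a rough sketch of how an F16 versus Q4_K_M delta could be tabulated per task family: the function name and the score format (a pass rate between 0 and 1 per family) are assumptions, and the example numbers are invented for illustration, not `quant_eval` results.

```python
# The eight task families named above.
TASK_FAMILIES = [
    "json_multistep", "stateful_followup", "toolcall_only",
    "mixed_brief_json", "toolcall", "json", "fuzz", "mcq",
]


def quantization_delta(
    f16: dict[str, float], q4: dict[str, float]
) -> dict[str, float]:
    """Per-family score change from F16 to Q4_K_M (negative = regression)."""
    return {
        family: q4[family] - f16[family]
        for family in TASK_FAMILIES
        # Skip families missing from either run, e.g. when F16 did not fit.
        if family in f16 and family in q4
    }


# Hypothetical scores for two families only.
example_f16 = {"json": 1.0, "mcq": 0.75}
example_q4 = {"json": 1.0, "mcq": 0.5}
example_delta = quantization_delta(example_f16, example_q4)
```

Skipping families that are absent from either run mirrors the "where hardware permits" caveat: a model without an F16 run simply has no delta rather than a misleading zero.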
All 18 model cards with full evaluation data are published at: @pbhappliedsystems
Feedback welcome, especially from anyone running evaluations on open-weight quantized models. This is the public-facing surface of a consulting and evaluation practice; the full agent demo is at https://pbhappliedsystems.com/assistant.html