license: mit
tags:
  - hle
  - humanity-last-exam
  - symbolic-reasoning
  - rule-based
  - no-llm
  - no-neural-network
  - wikipedia-only
datasets:
  - cais/hle
metrics:
  - accuracy
model-index:
  - name: Verantyx-hle-4.6
    results:
      - task:
          type: question-answering
          name: Humanity's Last Exam
        dataset:
          type: cais/hle
          name: HLE
          split: test
        metrics:
          - type: accuracy
            value: 4.6
            name: Accuracy (%)

Verantyx HLE — 4.6%

Fully LLM-free symbolic solver for Humanity's Last Exam (HLE) — no neural networks, no language models, pure rule-based reasoning with Wikipedia as the only knowledge source.

Score

Split	Score	Method
Full 2500 questions	115/2500 = 4.6%	atom_cross + knowledge_match + cross_decompose

Approach

Verantyx solves HLE through structural decomposition:

Atom Extraction — Break questions and choices into atomic facts using 200+ regex patterns
Wikipedia Knowledge — Fetch relevant articles as the sole knowledge source
Cross-Decompose — Decompose each MCQ choice individually and cross-match its atoms against Wikipedia facts
Atom Relation Classification — LLM-free supports/contradicts/unknown classifier (60+ antonym pairs, negation detection, numeric cross-check)
Always Answer (MCQ) — HLE has no wrong-answer penalty, so every multiple-choice question gets an answer; the fallback picks the choice with the best keyword overlap
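
As an illustration of the relation-classification step, here is a minimal sketch of a supports/contradicts/unknown classifier. It is not the Verantyx implementation: the antonym table, negation list, stopword list, and the names `classify`, `tokens`, and `numbers` are all assumptions made for this example (the real solver reportedly uses 60+ antonym pairs, which are not listed in this card).

```python
import re

# Hypothetical tables for illustration only; the actual pairs are not published here.
ANTONYMS = {("increase", "decrease"), ("positive", "negative"), ("stable", "unstable")}
NEGATIONS = {"not", "no", "never", "cannot"}
STOPWORDS = {"the", "a", "an", "of", "is", "are", "was", "in", "to", "and"}


def tokens(text):
    """Lowercased alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def numbers(text):
    """All integer or decimal literals in the text, as strings."""
    return set(re.findall(r"\d+(?:\.\d+)?", text))


def classify(claim, fact):
    """Return 'supports', 'contradicts', or 'unknown' for a claim against one fact."""
    c_set, f_set = set(tokens(claim)), set(tokens(fact))

    # Without shared content words the fact says nothing about the claim.
    if not ((c_set & f_set) - STOPWORDS - NEGATIONS):
        return "unknown"

    # Antonym check: one side uses a word, the other its opposite.
    for a, b in ANTONYMS:
        if (a in c_set and b in f_set) or (b in c_set and a in f_set):
            return "contradicts"

    # Negation detection: exactly one side is negated.
    if bool(c_set & NEGATIONS) != bool(f_set & NEGATIONS):
        return "contradicts"

    # Numeric cross-check: both sides cite numbers, but none agree.
    c_num, f_num = numbers(claim), numbers(fact)
    if c_num and f_num and not (c_num & f_num):
        return "contradicts"

    return "supports"
```

The checks run from cheapest to most specific, and every branch is a plain set operation, so the same inputs always yield the same label.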

Pipeline

Question → Fact Atomizer → Wikipedia Fetch → Atom Cross Solver
                                              ↓
                                    Choice Scoring (supports/contradicts)
                                              ↓
                                    Best Choice or Keyword Fallback
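
The last two stages of the pipeline can be sketched roughly as follows. This is an assumed scoring rule (supports minus contradicts), not the documented one; `pick_choice`, `keyword_overlap`, and the four-letter keyword threshold are hypothetical names and parameters chosen for the example.

```python
import re


def keyword_overlap(choice_text, wiki_text):
    """Count distinct words of 4+ letters shared by a choice and the Wikipedia text."""
    choice_words = set(re.findall(r"[a-z]{4,}", choice_text.lower()))
    wiki_words = set(re.findall(r"[a-z]{4,}", wiki_text.lower()))
    return len(choice_words & wiki_words)


def pick_choice(choices, relations, wiki_text):
    """Pick an answer label and report which path produced it.

    choices:   dict mapping label -> choice text
    relations: dict mapping label -> list of 'supports'/'contradicts'/'unknown'
    """
    scores = {}
    for label in choices:
        rels = relations.get(label, [])
        scores[label] = rels.count("supports") - rels.count("contradicts")

    # Iterating over sorted labels makes ties resolve deterministically.
    best = max(sorted(scores), key=lambda label: scores[label])
    if scores[best] > 0:
        return best, "atom_score"

    # No choice has net support: always answer anyway, using keyword overlap.
    best = max(sorted(choices), key=lambda label: keyword_overlap(choices[label], wiki_text))
    return best, "keyword_fallback"
```

Because `max` returns the first maximal element and the labels are sorted first, ties always go to the alphabetically earliest label, which keeps the output reproducible.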

Solver Components

Component	Fires	Description
cross_decompose	122	Per-choice decomposition + Wikipedia cross-match
knowledge_match	18	Direct atom-based knowledge matching
atom_cross	fallback	Normalized atom scoring with Wikipedia overlap

Properties

✅ No LLM — zero language model inference (Qwen 7B fully removed)
✅ No neural network — pure rule-based symbolic reasoning
✅ No pattern detectors — DISABLE_PATTERN_DETECTORS=1
✅ No concept boost — DISABLE_CONCEPT_BOOST=1
✅ No penalty exploitation — answering every MCQ is legitimate because HLE scoring has no wrong-answer penalty
✅ Wikipedia-only knowledge — no pre-trained embeddings or cached answers
✅ Deterministic — same input always produces same output
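
The two `DISABLE_*` settings above look like environment variables. Assuming that is how they are read (this card does not say), a minimal sketch of gating optional components on them could look like this; `flag_enabled`, `active_components`, and the component names `pattern_detectors` and `concept_boost` are hypothetical.

```python
import os


def flag_enabled(name, env=None):
    """True when the named variable is set to '1' (assumed convention)."""
    env = os.environ if env is None else env
    return env.get(name, "0") == "1"


def active_components(env=None):
    """List the solver components that would run under the given environment."""
    # Core components named in this card.
    components = ["cross_decompose", "knowledge_match", "atom_cross"]
    # Optional components, hypothetical names, skipped when their flag is set.
    if not flag_enabled("DISABLE_PATTERN_DETECTORS", env):
        components.append("pattern_detectors")
    if not flag_enabled("DISABLE_CONCEPT_BOOST", env):
        components.append("concept_boost")
    return components
```

With both flags set to `1`, only the three core components remain, which matches the configuration reported here.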

Score History

Version	Score	Method
v1 (with LLM)	2.68%	mcq_direct (Qwen 7B) + cross_decompose
v2 (LLM-free partial)	1.22%	Early LLM removal, limited coverage
v4 (LLM-free full)	4.6%	atom_cross + always-answer MCQ + normalized scoring

Stats

Total: 2500 questions
Correct: 115 (4.6%)
Time: 98 minutes (4 parallel workers)
Wiki hits: 2298
Knowledge match: 18
Cross decompose: 122 fired
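
For readers who want to check the figures, the derived rates follow directly from the counts above. The snippet below only restates that arithmetic; the variable names are illustrative.

```python
# Counts copied from the stats above.
total_questions = 2500
correct = 115
wiki_hits = 2298
minutes = 98

accuracy_pct = 100 * correct / total_questions        # share of questions answered correctly
wiki_hit_pct = 100 * wiki_hits / total_questions      # share of questions with a Wikipedia hit
questions_per_minute = total_questions / minutes      # aggregate throughput across all workers

print(f"accuracy: {accuracy_pct:.1f}%")
print(f"wiki hit rate: {wiki_hit_pct:.2f}%")
print(f"throughput: {questions_per_minute:.1f} questions/minute")
```

The throughput is the combined rate of the 4 parallel workers, so the per-worker rate is roughly a quarter of it.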

kofdai/Verantyx-hle-4.6