Structural Manifold GPT (300M · 512B windows)

TL;DR

  • Decoder-only transformer (16 layers · 16 heads · 1024-dim) trained from scratch on structural manifold signatures extracted from WikiText-103 (raw) using 512 B windows (384 B stride, precision 3).
  • Each "token" is a quantised (coherence, stability, entropy, hazard) signature with deduplicated prototypes → ~42× byte compression and 512-signature context ≈ 20k raw tokens.
  • Trains on a single RTX 3080 Ti in under 30 min with FP16 and gradient checkpointing.
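
For intuition, a signature token can be sketched as a fixed-precision quantisation of the window metrics. This is a minimal illustration, not the repo's codec: the `c…_s…_e…` string format is taken from the usage example later in this card, and the handling of the hazard field is an assumption.

```python
def quantise_signature(coherence: float, stability: float, entropy: float,
                       precision: int = 3) -> str:
    """Quantise window metrics into a token string such as
    'c0.018_s0.481_e0.982' (hazard handling in the real codec may differ)."""
    return (f"c{coherence:.{precision}f}"
            f"_s{stability:.{precision}f}"
            f"_e{entropy:.{precision}f}")

# Deduplication: identical quantised signatures share one vocab entry.
vocab: dict[str, int] = {}
sig = quantise_signature(0.0181, 0.4812, 0.9824)
token_id = vocab.setdefault(sig, len(vocab))
```

Quantisation plus deduplication is what drives the ~42× byte compression quoted above: many distinct byte windows collapse onto the same prototype signature.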
Final metrics (eval split = 2%)
-------------------------------------------
  eval_loss      = 6.6456
  perplexity     = 7.69e+02 (on manifold tokens)
  samples        = 76 (3 766 total sequences · 3 epochs)
  training time  ≈ 25 min @ 1.3 s/step (GPU)
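
The perplexity figures here and throughout this card are simply exp of the mean cross-entropy loss over manifold tokens:

```python
import math

def perplexity(eval_loss: float) -> float:
    # Perplexity is exp(mean cross-entropy loss per token).
    return math.exp(eval_loss)

print(f"{perplexity(6.6456):.4g}")  # 769.4, i.e. the 7.69e+02 reported above
```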

Use this repo if you want to benchmark manifold LMs or integrate the encoder/decoder stack into small-context devices. For the full pipeline (dataset prep, compression scripts, benchmarking), clone https://github.com/SepDynamics/structural-manifold-compression.


New: STM-FineMath 124M (revision finemath-124m)

  • 12-layer (124M) manifold LM trained on the 10 GB FineMath STEM corpus using the same window=512 B, stride=384 B, precision=3 codec. The builder yields 50 242 samples / 25.7 M manifold tokens (≈0.27 raw tokens per signature) and fits in ~66 minutes on a single RTX 3080 Ti.
  • Final eval loss 6.506 → manifold perplexity 6.69e2, while GPT-2 medium on the identical math slice lands at 7.75e3 (11.7× higher). See benchmarks/finemath_perplexity_compare.json in the new branch.
  • Exact-match signature accuracies on standard math QA benchmarks (strict metric) are now recorded:
Benchmark             Split  Subset         #Problems  Accuracy
dim/competition_math  train  Algebra        200        0.5%
dim/competition_math  train  Number Theory  200        0.5%
dim/competition_math  train  Geometry       200        1.0%
dim/competition_math  train  Prealgebra     200        0.0%
openai/gsm8k (main)   test   -              200        0.5%

Each JSON artifact (command + parameters) is stored under benchmarks/ on the finemath-124m branch.

Loading the math-focused checkpoint

from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "scrallex/structural-manifold-compression"
tokenizer = AutoTokenizer.from_pretrained(repo_id, revision="finemath-124m")
model = AutoModelForCausalLM.from_pretrained(repo_id, revision="finemath-124m")

You must supply structural-manifold signatures instead of raw text. Use scripts/data/prepare_causal_dataset.py or scripts/experiments/math_qa_demo.py from the GitHub repo to encode prompts, then decode generations via scripts/experiments/decode_signatures.py (requires a prototype cache built from a math corpus).

Reproducing the posted math benchmarks

python scripts/experiments/eval_math_dataset.py \
  --model output/training_runs/stm_finemath_10gb \
  --dataset-vocab output/stm_stem_finemath_10gb/vocab.json \
  --dataset dim/competition_math --split train --subset Algebra \
  --max-problems 200 --text-root data/raw_math/finemath_4plus_10gb \
  --cache output/training_runs/stm_finemath_10gb/prototype_cache.json \
  --max-new-tokens 64 --match-mode signatures

For GSM8K, switch to --dataset openai/gsm8k --dataset-config main --split test --question-field question --answer-field answer. The evaluator now matches answers via signature subsequences (no prototype recovery required), so results are strict exact matches.
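
The signature-subsequence matching described above can be sketched as follows; this is a minimal illustration, and the real evaluator's tokenisation and normalisation may differ:

```python
def contains_subsequence(generated: list[str], answer: list[str]) -> bool:
    """Return True if `answer` appears as a contiguous run of signature
    tokens inside `generated` (strict exact match, no prototype decode)."""
    n, m = len(generated), len(answer)
    if m == 0 or m > n:
        return False
    return any(generated[i:i + m] == answer for i in range(n - m + 1))

gen = ["c0.018_s0.481_e0.982", "c0.012_s0.496_e0.988", "c0.017_s0.502_e0.972"]
ans = ["c0.012_s0.496_e0.988", "c0.017_s0.502_e0.972"]
assert contains_subsequence(gen, ans)
```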


Quick Start & Best-Case Workloads

  • Optimised for structured text: PDF/page OCR exports, news briefs, technical audits—any corpus where 512 B sliding windows capture repeated structure. Expect 30–60× token compression with 94–95 % token accuracy (Fox EN/CN/OmniDoc numbers from the main repo).
  • Reproduce on a sample corpus:
    git clone https://github.com/SepDynamics/structural-manifold-compression.git
    cd structural-manifold-compression
    python scripts/experiments/benchmark_eval.py \
      --dataset briefs=examples/structured_demo/news_sample.jsonl \
      --json-text-key text \
      --window-bytes 512 --stride-bytes 384 --precision 3 \
      --use-native \
      --output-dir output/benchmark_runs/news_demo
    
    Inspect output/benchmark_runs/news_demo/briefs.json for compression, fidelity, and verification stats, then replace news_sample.jsonl with your own JSONL dumps.
  • Future Hugging Face Space: a Gradio front-end (planned) will wrap the same workflow so newcomers can upload JSONL/txt, run compression, and view reconstructions/verification without installing CUDA.
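
The `--json-text-key text` flag in the command above implies one JSON object per line with a `text` field. A minimal sketch for preparing your own input (the file name is illustrative):

```python
import json

docs = [
    "First news brief ...",
    "Second news brief ...",
]
with open("my_corpus.jsonl", "w", encoding="utf-8") as fh:
    for doc in docs:
        # One JSON object per line, keyed by the field passed to --json-text-key.
        fh.write(json.dumps({"text": doc}, ensure_ascii=False) + "\n")
```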

Latest Benchmarks (2025-11-05)

All evaluations ran on a single RTX 3080 Ti (12 GB) with the same environment as the training run. The raw artifacts referenced below are stored in benchmarks/.

Structural compression (WikiText quick-20k slice)

  • Command: python scripts/experiments/benchmark_eval.py --dataset wikitext=data/raw_text/wikitext_train.jsonl --window-bytes 512 --stride-bytes 384 --precision 3 --max-documents 20000 --use-native.
  • Observations: 20 000 manifest entries contained 12 894 non-empty docs, producing 21 234 windows with 3 737 shared signatures.
  • Capacity: 5.88 MB → 191 KB ⇒ 30.8× byte compression; 60.7× token compression (stream) / 60.8× unique ⇒ a 512-signature context effectively covers ≈31 k GPT-2 tokens.
  • Fidelity: 83.7 % token accuracy, 80.5 % recall, 82.0 % F1; character accuracy 83.6 %.
  • Verification: false-positive rate 2.3 × 10⁻⁴ with perfect recall on positive windows.
  • Runtime: completes within minutes on the 3080 Ti when the native kernel is enabled.
  • Artifact: benchmarks/wikitext_structural_20k.json.

GPT-2 perplexity comparison (manifold LM vs. GPT-2 medium)

  • Command: python scripts/experiments/perplexity_compare.py --manifold-model output/training_runs/wikitext_manifold_gpt --manifold-dataset output/wikitext_manifold/hf_dataset --manifold-vocab output/wikitext_manifold/vocab.json --manifold-eval-fraction 0.25 --gpt2-model gpt2-medium --gpt2-max-documents 10000 --output output/benchmark_runs/wikitext_perplexity_8h.json.
  • Manifold LM: 941 sequences / 4.81 × 10⁵ manifold tokens ⇒ loss 6.60, perplexity 7.33 × 10².
  • GPT-2 medium: 10 000 raw documents / 6.36 × 10⁵ tokens ⇒ loss 9.25, perplexity 1.04 × 10⁴ (≈14× the manifold LM's perplexity, measured on the raw token stream).
  • Effective compression proxy (raw_tokens / manifold_tokens) during evaluation = 1.34×, indicating GPT-2 still consumed 34 % more tokens even before exploiting deduplication.
  • Runtime: ≈3 min wall-clock on the 3080 Ti (majority spent on the manifold forward pass).
  • Artifact: benchmarks/wikitext_perplexity_8h.json.

Files

Path                                                              Notes
model.safetensors                                                 300 M-parameter GPT2LMHeadModel trained on manifold signatures
config.json                                                       Model architecture (n_layer=16, n_embd=1024, vocab_size=11839, pad/bos/eos id = 11838)
generation_config.json                                            Default sampling config (max_length=512)
tokenizer.json / tokenizer_config.json / special_tokens_map.json  Word-level tokenizer whose vocab is exactly the manifold signature strings + <pad>
vocab.json                                                        Original signature list emitted by scripts/data/prepare_causal_dataset.py
training_args.bin / trainer_state.json                            Hugging Face Trainer metadata (seeds, LR schedule, grad norms)
eval_metrics.json                                                 Recomputed eval loss & perplexity over the 2% hold-out

Data & Compression Pipeline

  1. Raw text → signatures: run the encoder on WikiText (or any UTF-8 corpus) via
    python scripts/data/prepare_causal_dataset.py \
      --text-root data/raw_text/wikitext_train.jsonl \
      --output-dir output/wikitext_manifold \
      --window-bytes 512 --stride-bytes 384 --precision 3 \
      --sequence-length 512 --min-sequence-length 8 \
      --use-native --concat-documents --export-signatures --reset-output
    
    This keeps an append-only samples.jsonl + vocab.json so you can resume mid-run.
  2. Sequences → HF dataset: the builder automatically materialises output/wikitext_manifold/hf_dataset with input_ids/labels for causal LM.
  3. Training: the published checkpoint comes from
    CUDA_VISIBLE_DEVICES=0 python scripts/training/manifold_lm_trainer.py \
      --dataset-path output/wikitext_manifold/hf_dataset \
      --vocab-path output/wikitext_manifold/vocab.json \
      --output-dir output/training_runs/wikitext_manifold_gpt \
      --n-layer 16 --n-head 16 --n-embd 1024 --context-length 512 \
      --per-device-train-batch-size 2 --per-device-eval-batch-size 2 \
      --gradient-accumulation-steps 16 --num-train-epochs 3 \
      --learning-rate 2e-4 --warmup-steps 500 --gradient-checkpointing --fp16 --resume
    

Hardware: single RTX 3080 Ti (12 GB). Training logs: output/training_runs/wikitext_manifold_gpt/train.log in the main repo.
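
The 512 B window / 384 B stride sweep used throughout the pipeline can be sketched as follows (tail handling in the real encoder may differ):

```python
def byte_windows(data: bytes, window: int = 512, stride: int = 384):
    """Yield overlapping byte windows; consecutive windows share
    window - stride = 128 bytes of context."""
    for start in range(0, max(len(data) - window + 1, 1), stride):
        yield data[start:start + window]

windows = list(byte_windows(b"x" * 1500))
# Window starts at offsets 0, 384, 768; each window is 512 bytes long.
```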


Usage

⚠️ This model expects manifold signatures (not raw text). Before inference, run the encoder to obtain the signature vocabulary and ID sequences.

import json
import torch
from transformers import AutoTokenizer, GPT2LMHeadModel

repo_id = "scrallex/structural-manifold-compression"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = GPT2LMHeadModel.from_pretrained(repo_id, torch_dtype=torch.float16).to("cuda").eval()

def manifold_signatures_to_ids(signatures):
    # signatures = list of strings emitted by the encoder (e.g. 'c0.018_s0.481_e0.982')
    return tokenizer.convert_tokens_to_ids(signatures)

signatures = ["c0.018_s0.481_e0.982", "c0.012_s0.496_e0.988", "c0.017_s0.502_e0.972"]  # extend with the rest of your encoded prompt
input_ids = manifold_signatures_to_ids(signatures)
inputs = torch.tensor([input_ids[:-1]], device=model.device)
with torch.no_grad():
    outputs = model.generate(inputs, max_new_tokens=64)
next_signature_ids = outputs[0, len(input_ids)-1:]
next_signatures = tokenizer.convert_ids_to_tokens(next_signature_ids.tolist())

To reconstruct human-readable text, feed predicted signatures back through the manifold decoder (see scripts/experiments/manifold_compression_eval.py).
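
A minimal decoding sketch, assuming the prototype cache is a flat signature → window-text mapping; the actual prototype_cache.json schema may differ, so treat this as an illustration and use the script above for real reconstruction:

```python
import json

def decode_signatures(signatures, cache_path="prototype_cache.json"):
    """Map each predicted signature to its cached prototype window and
    concatenate; unknown signatures become a placeholder."""
    with open(cache_path, encoding="utf-8") as fh:
        prototypes = json.load(fh)  # assumed schema: {signature: window_text}
    return "".join(prototypes.get(sig, "<?>") for sig in signatures)
```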


Evaluation

  • Training split: 3 690 sequences (98% of dataset) · Eval: 76 sequences (2%).
  • Final eval loss 6.6456 → perplexity ≈7.7e2 on manifold tokens (see eval_metrics.json).
  • Structural slice (benchmarks/wikitext_structural_20k.json): 512 B windows / 384 B stride / precision 3 on the first 20 k WikiText entries ⇒ 30.8× byte compression, 60.7× token compression, 83.7 % token accuracy, 83.6 % character accuracy, verification FPR 2.3 × 10⁻⁴.
  • GPT-2 comparison (benchmarks/wikitext_perplexity_8h.json): manifold LM perplexity 7.33 × 10² vs. GPT-2 medium 1.04 × 10⁴ over 10 k raw documents (raw tokens over manifold tokens = 1.34× during the shared evaluation).

Future work: increase the sequence budget (1k+ signatures), add rotary embeddings for better long-context behaviour, and benchmark against GPT-2 (raw) to quantify effective perplexity after reconstruction.


Responsible Use & Limitations

  • The model memorises WikiText-103 content; outputs may regurgitate training passages.
  • Tokens are structural signatures only—you must keep the encoder/decoder and prototype caches under your control to avoid leaking the underlying text when using proprietary corpora.
  • No guardrails, toxicity filtering, or multilingual tuning beyond what WikiText provides.

Report issues or ideas via https://github.com/SepDynamics/structural-manifold-compression.
