Structural Manifold GPT (300M · 512B windows)
TL;DR
- Decoder-only transformer (16 layers · 16 heads · 1024-dim) trained from scratch on structural manifold signatures extracted from WikiText-103 (raw) using 512 B windows (384 B stride, precision 3).
- Each "token" is a quantised
(coherence, stability, entropy, hazard)signature with deduplicated prototypes → ~42× byte compression and 512-signature context ≈ 20k raw tokens. - Fits/trains on a single RTX 3080 Ti in <30 min with FP16 + gradient checkpointing.
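To make the signature "tokens" concrete: each window's metrics are rounded to the configured precision and serialised into a short string key, which is then deduplicated into the prototype vocabulary. A minimal sketch of that quantisation step, assuming the `c…_s…_e…` layout of the strings shown in the Usage section (the real encoder in the GitHub repo also tracks hazard and handles prototype deduplication):

```python
def quantise_signature(coherence: float, stability: float, entropy: float,
                       precision: int = 3) -> str:
    """Round each metric to `precision` decimals and serialise it as a vocab key."""
    return f"c{coherence:.{precision}f}_s{stability:.{precision}f}_e{entropy:.{precision}f}"

# One 512 B window's metrics collapse into a single reusable token:
print(quantise_signature(0.0181, 0.4812, 0.9822))  # 'c0.018_s0.481_e0.982'
```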
Final metrics (eval split = 2%)
-------------------------------------------
eval_loss = 6.6456
perplexity = 7.69e+02 (on manifold tokens)
samples = 76 (3 766 total sequences · 3 epochs)
training time ≈ 25 min @ 1.3 s/step (GPU)
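The reported perplexity is simply exp(eval_loss) over manifold tokens; a minimal check, assuming the shipped `eval_metrics.json` keeps the standard Trainer-style `eval_loss` key:

```python
import json
import math

metrics = json.load(open("eval_metrics.json"))
print(math.exp(metrics["eval_loss"]))  # exp(6.6456) ≈ 7.69e+02 manifold-token perplexity
```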
Use this repo if you want to benchmark manifold LMs or integrate the encoder/decoder stack into small-context devices. For the full pipeline (dataset prep, compression scripts, benchmarking), clone https://github.com/SepDynamics/structural-manifold-compression.
New: STM-FineMath 124M (revision finemath-124m)
- 12-layer (124M) manifold LM trained on the 10 GB FineMath STEM corpus using the same `window=512 B, stride=384 B, precision=3` codec. The builder yields 50 242 samples / 25.7 M manifold tokens (≈0.27 raw tokens per signature) and trains in ~66 minutes on a single RTX 3080 Ti.
- Final eval loss 6.506 → manifold perplexity 6.69e2, while GPT-2 medium on the identical math slice lands at 7.75e3 (11.7× worse). See `benchmarks/finemath_perplexity_compare.json` on the new branch.
- Exact-match signature accuracies on standard math QA benchmarks (strict metric) are now recorded:
| Benchmark | Split | Subset | #Problems | Accuracy |
|---|---|---|---|---|
| dim/competition_math | train | Algebra | 200 | 0.5% |
| dim/competition_math | train | Number Theory | 200 | 0.5% |
| dim/competition_math | train | Geometry | 200 | 1.0% |
| dim/competition_math | train | Prealgebra | 200 | 0.0% |
| openai/gsm8k (main) | test | - | 200 | 0.5% |
Each JSON artifact (command + parameters) is stored under benchmarks/ on the finemath-124m branch.
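The headline gap and the table entries can be sanity-checked directly from the figures quoted above (nothing here re-runs the artifacts):

```python
# FineMath slice: GPT-2 medium perplexity relative to the manifold LM.
print(7.75e3 / 6.69e2)     # ≈ 11.6, the "11.7x worse" figure above

# Strict exact match: 0.5 % of 200 problems corresponds to a single solved item.
print(round(0.005 * 200))  # 1
```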
Loading the math-focused checkpoint
from transformers import AutoModelForCausalLM, AutoTokenizer
repo_id = "scrallex/structural-manifold-compression"
tokenizer = AutoTokenizer.from_pretrained(repo_id, revision="finemath-124m")
model = AutoModelForCausalLM.from_pretrained(repo_id, revision="finemath-124m")
You must supply structural-manifold signatures instead of raw text. Use `scripts/data/prepare_causal_dataset.py` or `scripts/experiments/math_qa_demo.py` from the GitHub repo to encode prompts, then decode generations via `scripts/experiments/decode_signatures.py` (requires a prototype cache built from a math corpus).
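Continuing from the loading snippet above, a quick sanity check that the checkpoint really operates on signature strings; this assumes the FineMath revision follows the same word-level tokenizer layout described in the Files table below:

```python
# The vocabulary entries are the manifold signature strings themselves.
vocab = tokenizer.get_vocab()
print(len(vocab))                             # signature count + special tokens
lowest_ids = sorted(vocab, key=vocab.get)[:5]
print(lowest_ids)                             # strings like 'c0.018_s0.481_e0.982'
```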
Reproducing the posted math benchmarks
python scripts/experiments/eval_math_dataset.py \
--model output/training_runs/stm_finemath_10gb \
--dataset-vocab output/stm_stem_finemath_10gb/vocab.json \
--dataset dim/competition_math --split train --subset Algebra \
--max-problems 200 --text-root data/raw_math/finemath_4plus_10gb \
--cache output/training_runs/stm_finemath_10gb/prototype_cache.json \
--max-new-tokens 64 --match-mode signatures
Switch to `--dataset openai/gsm8k --dataset-config main --split test --question-field question --answer-field answer` for GSM8K. The evaluator now matches answers via signature subsequences (no prototype recovery required), so results are strict exact matches.
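To picture the matching criterion, here is a minimal sketch of "the answer's signature sequence appears inside the generated signature sequence", assuming a contiguous match; the actual evaluator lives in `scripts/experiments/eval_math_dataset.py` and may differ in details:

```python
def contains_signature_run(generated: list[str], answer: list[str]) -> bool:
    """True if `answer` occurs as a contiguous run of signatures inside `generated`."""
    if not answer:
        return True
    m = len(answer)
    return any(generated[i:i + m] == answer for i in range(len(generated) - m + 1))

# Illustrative signature strings only:
gen = ["c0.018_s0.481_e0.982", "c0.012_s0.496_e0.988", "c0.017_s0.502_e0.972"]
ans = ["c0.012_s0.496_e0.988", "c0.017_s0.502_e0.972"]
print(contains_signature_run(gen, ans))  # True -> counts as an exact match
```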
Quick Start & Best-Case Workloads
- Optimised for structured text: PDF/page OCR exports, news briefs, technical audits—any corpus where 512 B sliding windows capture repeated structure. Expect 30–60× token compression with 94–95 % token accuracy (Fox EN/CN/OmniDoc numbers from the main repo).
- Reproduce on a sample corpus:
  git clone https://github.com/SepDynamics/structural-manifold-compression.git
  cd structural-manifold-compression
  python scripts/experiments/benchmark_eval.py \
    --dataset briefs=examples/structured_demo/news_sample.jsonl \
    --json-text-key text \
    --window-bytes 512 --stride-bytes 384 --precision 3 \
    --use-native \
    --output-dir output/benchmark_runs/news_demo

  Inspect output/benchmark_runs/news_demo/briefs.json for compression, fidelity, and verification stats, then replace news_sample.jsonl with your own JSONL dumps (see the sketch after this list).
- Future Hugging Face Space: a Gradio front-end (planned) will wrap the same workflow so newcomers can upload JSONL/txt, run compression, and view reconstructions/verification without installing CUDA.
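A minimal sketch of inspecting the report programmatically; the field names are not documented on this card, so print the top-level keys first and treat the ones below (`compression`, `fidelity`, `verification`) as placeholders:

```python
import json
from pathlib import Path

report = json.loads(Path("output/benchmark_runs/news_demo/briefs.json").read_text())

# Discover what benchmark_eval.py actually records before relying on any key.
print(sorted(report))
for key in ("compression", "fidelity", "verification"):  # hypothetical keys
    if key in report:
        print(key, report[key])
```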
Latest Benchmarks (2025-11-05)
All evaluations ran on a single RTX 3080 Ti (12 GB) with the same environment as the training run. The raw artifacts referenced below are stored in benchmarks/.
Structural compression (WikiText quick-20k slice)
- Command: `python scripts/experiments/benchmark_eval.py --dataset wikitext=data/raw_text/wikitext_train.jsonl --window-bytes 512 --stride-bytes 384 --precision 3 --max-documents 20000 --use-native`
- Observations: 20 000 manifest entries contained 12 894 non-empty docs, producing 21 234 windows with 3 737 shared signatures.
- Capacity: 5.88 MB → 191 KB ⇒ 30.8× byte compression; 60.7× token compression (stream) / 60.8× unique ⇒ a 512-signature context effectively covers ≈31 k GPT-2 tokens.
- Fidelity: 83.7 % token accuracy, 80.5 % recall, 82.0 % F1; character accuracy 83.6 %.
- Verification: false-positive rate 2.3 × 10⁻⁴ with perfect recall on positive windows.
- Runtime: completes within minutes on the 3080 Ti when the native kernel is enabled.
- Artifact: `benchmarks/wikitext_structural_20k.json`.
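The headline ratios follow directly from the sizes and counts above; a quick arithmetic check:

```python
# Byte compression: 5.88 MB of raw text down to a 191 KB signature stream.
print(5.88e6 / 191e3)  # ≈ 30.8×

# Context reach: at 60.7× token compression, one 512-signature context
# stands in for roughly 512 * 60.7 ≈ 31 k GPT-2 tokens.
print(512 * 60.7)      # ≈ 31 078
```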
GPT-2 perplexity comparison (manifold LM vs. GPT-2 medium)
- Command: `python scripts/experiments/perplexity_compare.py --manifold-model output/training_runs/wikitext_manifold_gpt --manifold-dataset output/wikitext_manifold/hf_dataset --manifold-vocab output/wikitext_manifold/vocab.json --manifold-eval-fraction 0.25 --gpt2-model gpt2-medium --gpt2-max-documents 10000 --output output/benchmark_runs/wikitext_perplexity_8h.json`
- Manifold LM: 941 sequences / 4.81 × 10⁵ manifold tokens → loss 6.60, perplexity 7.33 × 10².
- GPT-2 medium: 10 000 raw documents / 6.36 × 10⁵ tokens → loss 9.25, perplexity 1.04 × 10⁴ (≈14× the manifold LM's perplexity on this shared slice).
- Effective compression proxy (`raw_tokens / manifold_tokens`) during evaluation = 1.34×, indicating GPT-2 still consumed 34 % more tokens even before exploiting deduplication.
- Runtime: ≈3 min wall-clock on the 3080 Ti (majority spent on the manifold forward pass).
- Artifact: `benchmarks/wikitext_perplexity_8h.json`.
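Both perplexities are exp(loss) on their respective token streams, so the gap and the token proxy can be re-derived from the rounded figures above:

```python
import math

print(math.exp(6.60))                   # ≈ 7.4e2  manifold LM (7.33e2 with full-precision loss)
print(math.exp(9.25))                   # ≈ 1.04e4 GPT-2 medium
print(math.exp(9.25) / math.exp(6.60))  # ≈ 14× gap on the shared slice

# Token proxy: GPT-2 consumed more tokens for the same evaluation text.
print(6.36e5 / 4.81e5)                  # ≈ 1.32× (reported as 1.34× in the artifact)
```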
Files
| Path | Notes |
|---|---|
| `model.safetensors` | 300 M parameter GPT2LMHeadModel trained on manifold signatures |
| `config.json` | Model architecture (n_layer=16, n_embd=1024, vocab_size=11839, pad/bos/eos id = 11838) |
| `generation_config.json` | Default sampling config (max_length=512) |
| `tokenizer.json` / `tokenizer_config.json` / `special_tokens_map.json` | Word-level tokenizer whose vocab is exactly the manifold signature strings + `<pad>` |
| `vocab.json` | Original signature list emitted by `scripts/data/prepare_causal_dataset.py` |
| `training_args.bin` / `trainer_state.json` | Hugging Face Trainer metadata (seeds, LR schedule, grad norms) |
| `eval_metrics.json` | Recomputed eval loss & perplexity over the 2% hold-out |
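To verify the architecture without pulling the full weights, a small sketch using `huggingface_hub` (the repo id is the one used throughout this card):

```python
import json
from huggingface_hub import hf_hub_download

config_path = hf_hub_download("scrallex/structural-manifold-compression", "config.json")
cfg = json.load(open(config_path))
print(cfg["n_layer"], cfg["n_embd"], cfg["vocab_size"])  # expect 16 1024 11839
```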
Data & Compression Pipeline
- Raw text → signatures: run the encoder on WikiText (or any UTF-8 corpus) via
  python scripts/data/prepare_causal_dataset.py \
    --text-root data/raw_text/wikitext_train.jsonl \
    --output-dir output/wikitext_manifold \
    --window-bytes 512 --stride-bytes 384 --precision 3 \
    --sequence-length 512 --min-sequence-length 8 \
    --use-native --concat-documents --export-signatures --reset-output

  This keeps an append-only `samples.jsonl` + `vocab.json` so you can resume mid-run.
- Sequences → HF dataset: the builder automatically materialises `output/wikitext_manifold/hf_dataset` with `input_ids`/`labels` for causal LM (see the sketch after this list).
- Training: the published checkpoint comes from

  CUDA_VISIBLE_DEVICES=0 python scripts/training/manifold_lm_trainer.py \
    --dataset-path output/wikitext_manifold/hf_dataset \
    --vocab-path output/wikitext_manifold/vocab.json \
    --output-dir output/training_runs/wikitext_manifold_gpt \
    --n-layer 16 --n-head 16 --n-embd 1024 --context-length 512 \
    --per-device-train-batch-size 2 --per-device-eval-batch-size 2 \
    --gradient-accumulation-steps 16 --num-train-epochs 3 \
    --learning-rate 2e-4 --warmup-steps 500 --gradient-checkpointing --fp16 --resume
Hardware: single RTX 3080 Ti (12 GB). Training logs: output/training_runs/wikitext_manifold_gpt/train.log in the main repo.
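Before launching the trainer, it is worth confirming what the builder materialised; a minimal sketch, assuming the `datasets` library is installed and the dataset lives at the path used above:

```python
from datasets import DatasetDict, load_from_disk

ds = load_from_disk("output/wikitext_manifold/hf_dataset")
if isinstance(ds, DatasetDict):  # the builder may save a single split or a dict of splits
    ds = ds["train"]

row = ds[0]
print(list(row.keys()))          # expect input_ids / labels
print(len(row["input_ids"]))     # 512 signatures per sequence on this card
```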
Usage
⚠️ This model expects manifold signatures (not raw text). Before inference, run the encoder to obtain the signature vocabulary and ID sequences.
import torch
from transformers import AutoTokenizer, GPT2LMHeadModel

repo_id = "scrallex/structural-manifold-compression"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = GPT2LMHeadModel.from_pretrained(repo_id, torch_dtype=torch.float16).to("cuda").eval()

def manifold_signatures_to_ids(signatures):
    # signatures = list of strings emitted by the encoder (e.g. 'c0.018_s0.481_e0.982')
    return tokenizer.convert_tokens_to_ids(signatures)

# Encoded prompt; replace the ellipsis with the rest of your signature sequence.
signatures = ["c0.018_s0.481_e0.982", "c0.012_s0.496_e0.988", "c0.017_s0.502_e0.972", ...]
input_ids = manifold_signatures_to_ids(signatures)

# Hold back the final signature and let the model continue the stream.
inputs = torch.tensor([input_ids[:-1]], device=model.device)
with torch.no_grad():
    outputs = model.generate(inputs, max_new_tokens=64)

# Everything past the prompt is newly generated signature IDs.
next_signature_ids = outputs[0, len(input_ids) - 1:]
next_signatures = tokenizer.convert_ids_to_tokens(next_signature_ids.tolist())
To reconstruct human-readable text, feed predicted signatures back through the manifold decoder (see scripts/experiments/manifold_compression_eval.py).
Evaluation
- Training split: 3 690 sequences (98% of dataset) · Eval: 76 sequences (2%).
- Final eval loss 6.6456 → perplexity ≈7.7e2 on manifold tokens (see `eval_metrics.json`).
- Structural slice (`benchmarks/wikitext_structural_20k.json`): 512 B windows / 384 B stride / precision 3 on the first 20 k WikiText entries ⇒ 30.8× byte compression, 60.7× token compression, 83.7 % token accuracy, 83.6 % character accuracy, verification FPR 2.3 × 10⁻⁴.
- GPT-2 comparison (`benchmarks/wikitext_perplexity_8h.json`): manifold LM perplexity 7.33 × 10² vs. GPT-2 medium 1.04 × 10⁴ over 10 k raw documents (raw tokens / manifold tokens = 1.34× during the shared evaluation).
Future work: increase the sequence budget (1k+ signatures), add rotary embeddings for better long-context behaviour, and benchmark against GPT-2 on raw text to quantify effective perplexity after reconstruction.
Responsible Use & Limitations
- The model memorises WikiText-103 content; outputs may regurgitate training passages.
- Tokens are structural signatures only—you must keep the encoder/decoder kill switches to avoid leaking the underlying text when using proprietary corpora.
- No guardrails, toxicity filtering, or multilingual tuning beyond what WikiText provides.
Report issues or ideas via https://github.com/SepDynamics/structural-manifold-compression.