Structural Manifold GPT (300M · 512B windows)
TL;DR
- Decoder-only transformer (16 layers · 16 heads · 1024-dim) trained from scratch on structural manifold signatures extracted from WikiText-103 (raw) using 512 B windows (384 B stride, precision 3).
- Each "token" is a quantised
(coherence, stability, entropy, hazard)signature with deduplicated prototypes → ~42× byte compression and 512-signature context ≈ 20k raw tokens. - Fits/trains on a single RTX 3080 Ti in <30 min with FP16 + gradient checkpointing.
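To make the signature "tokens" concrete: each window's metrics are rounded to the configured precision and serialised into a short string key, which is then deduplicated into the prototype vocabulary. A minimal sketch of that quantisation step, assuming the `c…_s…_e…` layout of the strings shown in the Usage section (the real encoder in the GitHub repo also tracks hazard and handles prototype deduplication):

```python
def quantise_signature(coherence: float, stability: float, entropy: float,
                       precision: int = 3) -> str:
    """Round each metric to `precision` decimals and serialise it as a vocab key."""
    return f"c{coherence:.{precision}f}_s{stability:.{precision}f}_e{entropy:.{precision}f}"

# One 512 B window's metrics collapse into a single reusable token:
print(quantise_signature(0.0181, 0.4812, 0.9822))  # 'c0.018_s0.481_e0.982'
```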
Final metrics (eval split = 2%)
-------------------------------------------
eval_loss = 6.6456
perplexity = 7.69e+02 (on manifold tokens)
samples = 76 (3 766 total sequences · 3 epochs)
training time ≈ 25 min @ 1.3 s/step (GPU)
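The reported perplexity is simply exp(eval_loss) over manifold tokens; a minimal check, assuming the shipped `eval_metrics.json` keeps the standard Trainer-style `eval_loss` key:

```python
import json
import math

metrics = json.load(open("eval_metrics.json"))
print(math.exp(metrics["eval_loss"]))  # exp(6.6456) ≈ 7.69e+02 manifold-token perplexity
```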
Use this repo if you want to benchmark manifold LMs or integrate the encoder/decoder stack into small-context devices. For the full pipeline (dataset prep, compression scripts, benchmarking), clone https://github.com/SepDynamics/structural-manifold-compression.
New: STM-FineMath 124M (revision finemath-124m)
- 12-layer (124M) manifold LM trained on the 10 GB FineMath STEM corpus using the same `window=512 B, stride=384 B, precision=3` codec. The builder yields 50 242 samples / 25.7 M manifold tokens (≈0.27 raw tokens per signature) and trains in ~66 minutes on a single RTX 3080 Ti.
- Final eval loss 6.506 → manifold perplexity 6.69e2, while GPT-2 medium on the identical math slice lands at 7.75e3 (11.7× worse). See `benchmarks/finemath_perplexity_compare.json` on the new branch.
- Exact-match signature accuracies on standard math QA benchmarks (strict metric) are now recorded:
| Benchmark | Split | Subset | #Problems | Accuracy |
|---|---|---|---|---|
| dim/competition_math | train | Algebra | 200 | 0.5% |
| dim/competition_math | train | Number Theory | 200 | 0.5% |
| dim/competition_math | train | Geometry | 200 | 1.0% |
| dim/competition_math | train | Prealgebra | 200 | 0.0% |
| openai/gsm8k (main) | test | - | 200 | 0.5% |
Each JSON artifact (command + parameters) is stored under benchmarks/ on the finemath-124m branch.
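The headline gap and the table entries can be sanity-checked directly from the figures quoted above (nothing here re-runs the artifacts):

```python
# FineMath slice: GPT-2 medium perplexity relative to the manifold LM.
print(7.75e3 / 6.69e2)     # ≈ 11.6, the "11.7x worse" figure above

# Strict exact match: 0.5 % of 200 problems corresponds to a single solved item.
print(round(0.005 * 200))  # 1
```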
Loading the math-focused checkpoint
from transformers import AutoModelForCausalLM, AutoTokenizer
repo_id = "scrallex/structural-manifold-compression"
tokenizer = AutoTokenizer.from_pretrained(repo_id, revision="finemath-124m")
model = AutoModelForCausalLM.from_pretrained(repo_id, revision="finemath-124m")
You must supply structural-manifold signatures instead of raw text. Use `scripts/data/prepare_causal_dataset.py` or `scripts/experiments/math_qa_demo.py` from the GitHub repo to encode prompts, then decode generations via `scripts/experiments/decode_signatures.py` (requires a prototype cache built from a math corpus).
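Continuing from the loading snippet above, a quick sanity check that the checkpoint really operates on signature strings; this assumes the FineMath revision follows the same word-level tokenizer layout described in the Files table below:

```python
# The vocabulary entries are the manifold signature strings themselves.
vocab = tokenizer.get_vocab()
print(len(vocab))                             # signature count + special tokens
lowest_ids = sorted(vocab, key=vocab.get)[:5]
print(lowest_ids)                             # strings like 'c0.018_s0.481_e0.982'
```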
Reproducing the posted math benchmarks
python scripts/experiments/eval_math_dataset.py \
--model output/training_runs/stm_finemath_10gb \
--dataset-vocab output/stm_stem_finemath_10gb/vocab.json \
--dataset dim/competition_math --split train --subset Algebra \
--max-problems 200 --text-root data/raw_math/finemath_4plus_10gb \
--cache output/training_runs/stm_finemath_10gb/prototype_cache.json \
--max-new-tokens 64 --match-mode signatures
Switch to `--dataset openai/gsm8k --dataset-config main --split test --question-field question --answer-field answer` for GSM8K. The evaluator now matches answers via signature subsequences (no prototype recovery required), so results are strict exact matches.
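To picture the matching criterion, here is a minimal sketch of "the answer's signature sequence appears inside the generated signature sequence", assuming a contiguous match; the actual evaluator lives in `scripts/experiments/eval_math_dataset.py` and may differ in details:

```python
def contains_signature_run(generated: list[str], answer: list[str]) -> bool:
    """True if `answer` occurs as a contiguous run of signatures inside `generated`."""
    if not answer:
        return True
    m = len(answer)
    return any(generated[i:i + m] == answer for i in range(len(generated) - m + 1))

# Illustrative signature strings only:
gen = ["c0.018_s0.481_e0.982", "c0.012_s0.496_e0.988", "c0.017_s0.502_e0.972"]
ans = ["c0.012_s0.496_e0.988", "c0.017_s0.502_e0.972"]
print(contains_signature_run(gen, ans))  # True -> counts as an exact match
```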
Quick Start & Best-Case Workloads
- Optimised for structured text: PDF/page OCR exports, news briefs, technical audits—any corpus where 512 B sliding windows capture repeated structure. Expect 30–60× token compression with 94–95 % token accuracy (Fox EN/CN/OmniDoc numbers from the main repo).
- Reproduce on a sample corpus:
  git clone https://github.com/SepDynamics/structural-manifold-compression.git
  cd structural-manifold-compression
  python scripts/experiments/benchmark_eval.py \
    --dataset briefs=examples/structured_demo/news_sample.jsonl \
    --json-text-key text \
    --window-bytes 512 --stride-bytes 384 --precision 3 \
    --use-native \
    --output-dir output/benchmark_runs/news_demo

  Inspect output/benchmark_runs/news_demo/briefs.json for compression, fidelity, and verification stats, then replace news_sample.jsonl with your own JSONL dumps (see the sketch after this list).
- Future Hugging Face Space: a Gradio front-end (planned) will wrap the same workflow so newcomers can upload JSONL/txt, run compression, and view reconstructions/verification without installing CUDA.
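A minimal sketch of inspecting the report programmatically; the field names are not documented on this card, so print the top-level keys first and treat the ones below (`compression`, `fidelity`, `verification`) as placeholders:

```python
import json
from pathlib import Path

report = json.loads(Path("output/benchmark_runs/news_demo/briefs.json").read_text())

# Discover what benchmark_eval.py actually records before relying on any key.
print(sorted(report))
for key in ("compression", "fidelity", "verification"):  # hypothetical keys
    if key in report:
        print(key, report[key])
```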
Latest Benchmarks (2025-11-05)
All evaluations ran on a single RTX 3080 Ti (12 GB) with the same environment as the training run. The raw artifacts referenced below are stored in benchmarks/.
Structural compression (WikiText quick-20k slice)
- Command: `python scripts/experiments/benchmark_eval.py --dataset wikitext=data/raw_text/wikitext_train.jsonl --window-bytes 512 --stride-bytes 384 --precision 3 --max-documents 20000 --use-native`
- Observations: 20 000 manifest entries contained 12 894 non-empty docs, producing 21 234 windows with 3 737 shared signatures.
- Capacity: 5.88 MB → 191 KB ⇒ 30.8× byte compression; 60.7× token compression (stream) / 60.8× unique ⇒ a 512-signature context effectively covers ≈31 k GPT-2 tokens.
- Fidelity: 83.7 % token accuracy, 80.5 % recall, 82.0 % F1; character accuracy 83.6 %.
- Verification: false-positive rate 2.3 × 10⁻⁴ with perfect recall on positive windows.
- Runtime: completes within minutes on the 3080 Ti when the native kernel is enabled.
- Artifact: `benchmarks/wikitext_structural_20k.json`.
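The headline ratios follow directly from the sizes and counts above; a quick arithmetic check:

```python
# Byte compression: 5.88 MB of raw text down to a 191 KB signature stream.
print(5.88e6 / 191e3)  # ≈ 30.8×

# Context reach: at 60.7× token compression, one 512-signature context
# stands in for roughly 512 * 60.7 ≈ 31 k GPT-2 tokens.
print(512 * 60.7)      # ≈ 31 078
```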
GPT-2 perplexity comparison (manifold LM vs. GPT-2 medium)
- Command: `python scripts/experiments/perplexity_compare.py --manifold-model output/training_runs/wikitext_manifold_gpt --manifold-dataset output/wikitext_manifold/hf_dataset --manifold-vocab output/wikitext_manifold/vocab.json --manifold-eval-fraction 0.25 --gpt2-model gpt2-medium --gpt2-max-documents 10000 --output output/benchmark_runs/wikitext_perplexity_8h.json`
- Manifold LM: 941 sequences / 4.81 × 10⁵ manifold tokens → loss 6.60, perplexity 7.33 × 10².
- GPT-2 medium: 10 000 raw documents / 6.36 × 10⁵ tokens → loss 9.25, perplexity 1.04 × 10⁴ (≈14× the manifold LM's perplexity on this shared slice).
- Effective compression proxy (`raw_tokens / manifold_tokens`) during evaluation = 1.34×, indicating GPT-2 still consumed 34 % more tokens even before exploiting deduplication.
- Runtime: ≈3 min wall-clock on the 3080 Ti (majority spent on the manifold forward pass).
- Artifact: `benchmarks/wikitext_perplexity_8h.json`.
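Both perplexities are exp(loss) on their respective token streams, so the gap and the token proxy can be re-derived from the rounded figures above:

```python
import math

print(math.exp(6.60))                   # ≈ 7.4e2  manifold LM (7.33e2 with full-precision loss)
print(math.exp(9.25))                   # ≈ 1.04e4 GPT-2 medium
print(math.exp(9.25) / math.exp(6.60))  # ≈ 14× gap on the shared slice

# Token proxy: GPT-2 consumed more tokens for the same evaluation text.
print(6.36e5 / 4.81e5)                  # ≈ 1.32× (reported as 1.34× in the artifact)
```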
Files
| Path | Notes |
|---|---|
| `model.safetensors` | 300 M parameter GPT2LMHeadModel trained on manifold signatures |
| `config.json` | Model architecture (n_layer=16, n_embd=1024, vocab_size=11839, pad/bos/eos id = 11838) |
| `generation_config.json` | Default sampling config (max_length=512) |
| `tokenizer.json` / `tokenizer_config.json` / `special_tokens_map.json` | Word-level tokenizer whose vocab is exactly the manifold signature strings + `<pad>` |
| `vocab.json` | Original signature list emitted by `scripts/data/prepare_causal_dataset.py` |
| `training_args.bin` / `trainer_state.json` | Hugging Face Trainer metadata (seeds, LR schedule, grad norms) |
| `eval_metrics.json` | Recomputed eval loss & perplexity over the 2% hold-out |
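To verify the architecture without pulling the full weights, a small sketch using `huggingface_hub` (the repo id is the one used throughout this card):

```python
import json
from huggingface_hub import hf_hub_download

config_path = hf_hub_download("scrallex/structural-manifold-compression", "config.json")
cfg = json.load(open(config_path))
print(cfg["n_layer"], cfg["n_embd"], cfg["vocab_size"])  # expect 16 1024 11839
```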
Data & Compression Pipeline
- Raw text → signatures: run the encoder on WikiText (or any UTF-8 corpus) via
  python scripts/data/prepare_causal_dataset.py \
    --text-root data/raw_text/wikitext_train.jsonl \
    --output-dir output/wikitext_manifold \
    --window-bytes 512 --stride-bytes 384 --precision 3 \
    --sequence-length 512 --min-sequence-length 8 \
    --use-native --concat-documents --export-signatures --reset-output

  This keeps an append-only `samples.jsonl` + `vocab.json` so you can resume mid-run.
- Sequences → HF dataset: the builder automatically materialises `output/wikitext_manifold/hf_dataset` with `input_ids`/`labels` for causal LM (see the sketch after this list).
- Training: the published checkpoint comes from

  CUDA_VISIBLE_DEVICES=0 python scripts/training/manifold_lm_trainer.py \
    --dataset-path output/wikitext_manifold/hf_dataset \
    --vocab-path output/wikitext_manifold/vocab.json \
    --output-dir output/training_runs/wikitext_manifold_gpt \
    --n-layer 16 --n-head 16 --n-embd 1024 --context-length 512 \
    --per-device-train-batch-size 2 --per-device-eval-batch-size 2 \
    --gradient-accumulation-steps 16 --num-train-epochs 3 \
    --learning-rate 2e-4 --warmup-steps 500 --gradient-checkpointing --fp16 --resume
Hardware: single RTX 3080 Ti (12 GB). Training logs: output/training_runs/wikitext_manifold_gpt/train.log in the main repo.
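Before launching the trainer, it is worth confirming what the builder materialised; a minimal sketch, assuming the `datasets` library is installed and the dataset lives at the path used above:

```python
from datasets import DatasetDict, load_from_disk

ds = load_from_disk("output/wikitext_manifold/hf_dataset")
if isinstance(ds, DatasetDict):  # the builder may save a single split or a dict of splits
    ds = ds["train"]

row = ds[0]
print(list(row.keys()))          # expect input_ids / labels
print(len(row["input_ids"]))     # 512 signatures per sequence on this card
```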
Usage
⚠️ This model expects manifold signatures (not raw text). Before inference, run the encoder to obtain the signature vocabulary and ID sequences.
import torch
from transformers import AutoTokenizer, GPT2LMHeadModel

repo_id = "scrallex/structural-manifold-compression"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = GPT2LMHeadModel.from_pretrained(repo_id, torch_dtype=torch.float16).to("cuda").eval()

def manifold_signatures_to_ids(signatures):
    # signatures = list of strings emitted by the encoder (e.g. 'c0.018_s0.481_e0.982')
    return tokenizer.convert_tokens_to_ids(signatures)

# Encoded prompt; replace the ellipsis with the rest of your signature sequence.
signatures = ["c0.018_s0.481_e0.982", "c0.012_s0.496_e0.988", "c0.017_s0.502_e0.972", ...]
input_ids = manifold_signatures_to_ids(signatures)

# Hold back the final signature and let the model continue the stream.
inputs = torch.tensor([input_ids[:-1]], device=model.device)
with torch.no_grad():
    outputs = model.generate(inputs, max_new_tokens=64)

# Everything past the prompt is newly generated signature IDs.
next_signature_ids = outputs[0, len(input_ids) - 1:]
next_signatures = tokenizer.convert_ids_to_tokens(next_signature_ids.tolist())
To reconstruct human-readable text, feed predicted signatures back through the manifold decoder (see scripts/experiments/manifold_compression_eval.py).
Evaluation
- Training split: 3 690 sequences (98% of dataset) · Eval: 76 sequences (2%).
- Final eval loss 6.6456 → perplexity ≈7.7e2 on manifold tokens (see `eval_metrics.json`).
- Structural slice (`benchmarks/wikitext_structural_20k.json`): 512 B windows / 384 B stride / precision 3 on the first 20 k WikiText entries ⇒ 30.8× byte compression, 60.7× token compression, 83.7 % token accuracy, 83.6 % character accuracy, verification FPR 2.3 × 10⁻⁴.
- GPT-2 comparison (`benchmarks/wikitext_perplexity_8h.json`): manifold LM perplexity 7.33 × 10² vs. GPT-2 medium 1.04 × 10⁴ over 10 k raw documents (raw tokens / manifold tokens = 1.34× during the shared evaluation).
Future work: increase the sequence budget (1k+ signatures), add rotary embeddings for better long-context behaviour, and benchmark against GPT-2 on raw text to quantify effective perplexity after reconstruction.
Responsible Use & Limitations
- The model memorises WikiText-103 content; outputs may regurgitate training passages.
- Tokens are structural signatures only—you must keep the encoder/decoder kill switches to avoid leaking the underlying text when using proprietary corpora.
- No guardrails, toxicity filtering, or multilingual tuning beyond what WikiText provides.
Report issues or ideas via https://github.com/SepDynamics/structural-manifold-compression.