AOMTS-Base-100M-3k-0MTP-v1-run2

Validation loss: 2.287432 (next-token cross-entropy, nats)

Part of the Aurora Optimized Multi-Token Superposition (AOMTS) experiment series.

This series evaluates whether Token Superposition Training (TST) and Multi-Token Prediction (MTP) improve language model quality, and whether combining them yields further gains.

Key findings: TST alone improved validation loss by ~0.073 nats over the base (no-TST, no-MTP) model, and MTP=1 alone improved it by ~0.011 nats. Combining TST with MTP=1 gave the best result in the series, 2.204673 nats, a total improvement of ~0.083 nats over the base. TST+MTP=2 (2.214605) did not improve over TST+MTP=1 and was marginally worse than TST alone (2.213959), suggesting diminishing returns beyond one MTP head at this scale.
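
The improvements quoted above can be recomputed from the validation losses listed in the Full Experiment Comparison table of this card. The snippet below is a minimal sketch; the short dictionary keys are labels chosen here for readability, not official run names.

```python
# Validation losses (nats) copied from the comparison table in this card.
# The short keys are illustrative labels, not official identifiers.
val_loss = {
    "base": 2.287432,      # AOMTS-Base-100M-3k-0MTP-v1-run2 (this model)
    "mtp1": 2.276289,      # AOMTS-Base-100M-3k-1MTP-v1
    "mtp2": 2.284260,      # AOMTS-Base-100M-3k-2MTP-v1
    "tst": 2.213959,       # AOMTS-TST-s6-100M-3k-0MTP-v1
    "tst+mtp1": 2.204673,  # AOMTS-TST-s6-100M-3k-1MTP-v1
    "tst+mtp2": 2.214605,  # AOMTS-TST-s6-100M-3k-2MTP-v1
}

# Improvement over the baseline: positive means lower (better) loss.
improvement = {
    name: val_loss["base"] - loss
    for name, loss in val_loss.items()
    if name != "base"
}

# Print the runs from largest to smallest improvement.
for name, delta in sorted(improvement.items(), key=lambda kv: -kv[1]):
    print(f"{name:10s} {delta:+.6f} nats")
```

Each figure comes from a single run per condition, so small differences between neighbouring rows should be read with that in mind.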

Validation loss is next-token cross-entropy in nats, evaluated on a held-out Wikipedia Markdown validation set using the same 16,000-token BPE vocabulary. Lower is better.
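
For readers who prefer perplexity or bits per token, a loss in nats converts directly. The sketch below applies the standard conversions to the value reported for this model; the variable names are illustrative.

```python
import math

val_loss_nats = 2.287432  # validation loss reported for this model

perplexity = math.exp(val_loss_nats)          # per-token perplexity
bits_per_token = val_loss_nats / math.log(2)  # the same loss expressed in bits

print(f"perplexity ≈ {perplexity:.2f}, bits/token ≈ {bits_per_token:.3f}")
```

These are per-token quantities under the 16,000-token BPE vocabulary, so they are not directly comparable with models that use a different tokenizer.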

Research artifact. These checkpoints are screening-scale models (3,000 steps, ~100M parameters) released for research and ablation comparison. They are not intended as production models.

Why 3,000 steps? After dozens of prior experiments running 15,000+ steps, it was consistently observed that the winning model was already ahead of competing runs within the first 2,000 steps. Running to 3,000 steps provides a clear signal while keeping turnaround fast enough to run many conditions in parallel.

Why ~100M parameters? After many experiments at 200M–500M parameters, the model that won at larger scale consistently also won at ~100M. Screening at ~100M is therefore a reliable and efficient proxy: top candidates from this series will be scaled further.

What might change at scale? At ~100M parameters and 3,000 steps, the model has limited capacity to predict far into the future — which likely explains why MTP=1 was optimal and MTP=2 did not help. A small model trained on relatively little data cannot reliably leverage the signal from heads that predict multiple steps ahead; the additional auxiliary loss may add noise rather than useful gradient. At larger model sizes and longer training runs, the optimal MTP depth is expected to increase as the model gains the capacity to make accurate multi-step predictions. Similarly, the optimal TST bag size (s=6 here) may shift with scale — larger models may benefit from larger or smaller bags depending on how effectively they can decompress the superposition signal during recovery. Further research is needed to determine how these findings scale across model size, training budget, and TST bag size.

About This Model

The exp4 grid baseline: no Token Superposition Training, no Multi-Token Prediction, WSD learning rate schedule. This is the primary reference point for the AOMTS series; all other runs are compared against it. An earlier run with identical hyperparameters, trained before the formal grid sweep, is preserved as AOMTS-Base-100M-3k-0MTP-v1 for historical comparison.

Architecture

Parameter	Value
Vocabulary size	16,000
Hidden dimension (d_model)	512
Layers	12
Attention heads	8
KV heads	8
Head dimension	64
FFN hidden dimension	4,800
FFN variant	SwiGLU
Max sequence length	2,048
RoPE θ	10,000
Normalization	RMSNorm
Tied embeddings	Yes

Total parameters: 117,453,312
- Embeddings (tied tok_emb / lm_head): 8,192,000
- Non-embedding, non-MTP (transformer blocks): 109,261,312 (identical across all AOMTS runs)

MTP parameters are auxiliary training heads; this model has none. In runs that do include them, they are not used during standard language modeling evaluation and do not affect validation loss: the val loss reported here is computed from the main head only.
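
As a rough sanity check, the counts can be estimated from the architecture table. The sketch below assumes bias-free linear layers, two RMSNorm weight vectors per block plus one final RMSNorm, and no other learned parameters. None of these assumptions is confirmed by this card, and modeling_aomts.py remains the authority.

```python
# Values from the architecture table in this card.
vocab_size, d_model, n_layers = 16_000, 512, 12
n_heads, n_kv_heads, head_dim, d_ffn = 8, 8, 64, 4_800

embedding = vocab_size * d_model  # tied tok_emb / lm_head, counted once

# Attention (assumed bias-free): Q and O use n_heads, K and V use n_kv_heads.
attn = 2 * d_model * n_heads * head_dim + 2 * d_model * n_kv_heads * head_dim
ffn = 3 * d_model * d_ffn  # SwiGLU: gate, up, and down projections
norms = 2 * d_model        # assumed: two RMSNorm weight vectors per block

# Assumed: one final RMSNorm after the last block.
blocks = n_layers * (attn + ffn + norms) + d_model

print(f"embedding     = {embedding:,}")
print(f"blocks (est.) = {blocks:,}")
print(f"gap to card   = {109_261_312 - blocks:,}")
```

Under these assumptions the embedding count matches the reported 8,192,000, while the block estimate is 101,069,312, which is 8,192,000 below the reported non-embedding figure. The gap equals the size of one embedding matrix, so the reported breakdown may count an embedding-sized matrix outside the embeddings line; the card does not say.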

Training

Setting	Value
Total steps	3,000
Batch size	16 sequences
Gradient accumulation	2
Effective batch size	32 sequences / 65,536 model-context tokens per step
Total raw tokens seen	196,608,000
Sequence length	2,048
LR schedule	WSD: 150 warmup steps, stable LR, then linear decay to 0.0 over the last 300 steps (final 10% of training)
Warmup steps	150
Min LR	0.0
Weight decay	0.1

Optimizer: Aurora (matrix weights) + AdamW (embeddings & norms)

Aurora matrix weight lr: 0.02
AdamW embedding/norm lr: 0.0003
Weight decay: 0.1
Gradient clip: 1.0
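
The WSD schedule described above can be sketched as a function of the step index. This is an illustration, not the training code: the function name `wsd_lr`, the linear warmup shape, and 0-based step indexing are assumptions (the card gives only the warmup length), as is the idea that both parameter groups follow the same schedule shape with their own peak LR.

```python
def wsd_lr(step, peak_lr, total_steps=3000, warmup_steps=150,
           decay_steps=300, min_lr=0.0):
    """Warmup-stable-decay learning rate for a 0-based step index."""
    if step < warmup_steps:
        # Assumed linear warmup that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        # Stable phase: hold the peak learning rate.
        return peak_lr
    # Linear decay from peak_lr to min_lr over the final decay_steps.
    progress = min((step - decay_start) / decay_steps, 1.0)
    return peak_lr + (min_lr - peak_lr) * progress


# Peak learning rates taken from the optimizer settings in this card.
for step in (0, 149, 1000, 2700, 2850, 3000):
    print(step, wsd_lr(step, peak_lr=0.02), wsd_lr(step, peak_lr=0.0003))
```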

Multi-Token Prediction (MTP)

Multi-Token Prediction is not used in this model.

Token Superposition Training (TST)

Token Superposition Training is not used in this model.

Dataset

Trained on open-index/open-wikipedia-markdown (Wikipedia Markdown). Tokenized with a custom 16,000-token BPE vocabulary.

Total raw tokens seen: 196,608,000
Model-context tokens per step: 65,536 (16 seqs × 2 grad accum × 2048 seq len)
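
The token budget follows from the training settings. The short check below reproduces both figures; the variable names are illustrative.

```python
# Training settings from this card.
seqs_per_micro_batch = 16
grad_accum_steps = 2
seq_len = 2048
total_steps = 3000

# Tokens processed per optimizer step, then over the whole run.
tokens_per_step = seqs_per_micro_batch * grad_accum_steps * seq_len
total_tokens = tokens_per_step * total_steps

print(f"{tokens_per_step:,} tokens per step")
print(f"{total_tokens:,} tokens in total")
```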

Usage

Each repo includes modeling_aomts.py — a self-contained inference script with no external model code required.

pip install torch safetensors tokenizers

Command-line generation:

python modeling_aomts.py --repo_dir /path/to/repo --prompt "The theory of" --max_new_tokens 200

Python API:

from modeling_aomts import load_model, generate

model, tokenizer = load_model(".")          # add device="cuda" for GPU
print(generate(model, tokenizer, "The theory of relativity states",
               max_new_tokens=200, temperature=1.0, top_k=50))

Generation options: temperature (lower = less random; 0 = greedy), top_k, top_p (nucleus sampling), max_new_tokens, device, dtype.
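
To make the sampling options concrete, here is a generic sketch of how temperature, top_k, and top_p are commonly combined when picking the next token. It is not taken from modeling_aomts.py: the function name `sample_next_token` and the order of the filters are assumptions, and the script's own behaviour may differ.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Pick a token index from a list of logits (generic illustration)."""
    if temperature == 0:
        # Greedy decoding: return the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Rank token indices by logit, highest first.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]  # keep only the k most likely tokens
    # Softmax over the surviving tokens at the given temperature.
    scaled = [logits[i] / temperature for i in ranked]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]
    if top_p is not None:
        # Nucleus sampling: smallest prefix whose cumulative mass reaches top_p.
        cumulative, cutoff = 0.0, len(probs)
        for n, p in enumerate(probs, start=1):
            cumulative += p
            if cumulative >= top_p:
                cutoff = n
                break
        ranked, probs = ranked[:cutoff], probs[:cutoff]
    # random.choices renormalizes the remaining weights.
    return rng.choices(ranked, weights=probs, k=1)[0]


toy_logits = [0.1, 2.0, -1.0, 0.5]
print(sample_next_token(toy_logits, temperature=0))           # greedy
print(sample_next_token(toy_logits, temperature=1.0, top_k=2))
```

Lower temperatures sharpen the distribution before filtering, which is why temperature 0 is treated as the greedy limit.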

Full Experiment Comparison

All AOMTS models at a glance (equal 3,000-step budget, sorted by validation loss):

Model	MTP Depth	TST	LR Schedule	Optim Reset	Val Loss
AOMTS-TST-s6-100M-3k-1MTP-v1	1	Yes (s=6¹, 900 steps)	WSD	—	2.204673
AOMTS-TST-s6-100M-3k-0MTP-v1	0	Yes (s=6¹, 900 steps)	WSD	—	2.213959
AOMTS-TST-s6-100M-3k-2MTP-v1	2	Yes (s=6¹, 900 steps)	WSD	—	2.214605
AOMTS-Base-100M-3k-1MTP-v1	1	No	WSD	—	2.276289
AOMTS-Base-100M-3k-2MTP-v1	2	No	WSD	—	2.284260
AOMTS-Base-100M-3k-0MTP-v1-run2 ← this model	0	No	WSD	—	2.287432
AOMTS-TST-s6-100M-3k-0MTP-RESET-v1	0	Yes (s=6¹, 900 steps)	WSD	Yes²	2.302689
AOMTS-Base-100M-3k-1MTP-Cosine-v1	1	No	Cosine	—	2.354897
AOMTS-Base-100M-3k-0MTP-v1	0	No	WSD	—	2.375539

¹ s = bag size: the number of raw tokens averaged into each compressed embedding position during TST phase 1.

² Optim Reset = phase 2 restarted the optimizer state and LR schedule from scratch rather than carrying them over from phase 1. Models without this flag use a unified schedule across both phases.
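
Footnote 1 describes the bag size as the number of raw tokens averaged into each compressed position. The toy sketch below illustrates only that averaging step. It assumes bags are formed from consecutive tokens, and the helper name `average_bags` is hypothetical; how TST builds targets and handles recovery is not covered here.

```python
def average_bags(embeddings, bag_size):
    """Average consecutive groups of `bag_size` embedding vectors."""
    if len(embeddings) % bag_size != 0:
        raise ValueError("sequence length must be a multiple of bag_size")
    bags = []
    for start in range(0, len(embeddings), bag_size):
        group = embeddings[start:start + bag_size]
        dim = len(group[0])
        # Element-wise mean across the vectors in this bag.
        bags.append([sum(vec[j] for vec in group) / bag_size for j in range(dim)])
    return bags


# Toy example: 12 two-dimensional token embeddings, bag size s=6.
tokens = [[1.0, 0.0]] * 6 + [[0.0, 1.0]] * 6
compressed = average_bags(tokens, bag_size=6)
print(len(tokens), "raw positions ->", len(compressed), "compressed positions")
```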

References

Peng, B., Gigant, E., Quesnelle, J. (Nous Research, 2026). Efficient Pre-Training with Token Superposition. arXiv:2605.06546
DeepSeek-AI (2024). DeepSeek-V3 Technical Report. arXiv:2412.19437

Safetensors checkpoint: model size 0.1B params, tensor type BF16