AOMTS-Base-100M-3k-0MTP-v1-run2
Validation loss: 2.287432 (next-token cross-entropy, nats)
Part of the Aurora Optimized Multi-Token Superposition (AOMTS) experiment series.
This series evaluates whether Token Superposition Training (TST) and Multi-Token Prediction (MTP) improve language model quality, and whether combining them yields further gains.
Key findings: TST alone improved validation loss by ~0.073 nats over the base (no-TST, no-MTP) model. MTP=1 alone improved it by ~0.011 nats. Combining TST with MTP=1 achieved the best result in the series at 2.204673 nats, a total improvement of ~0.083 nats over the base. TST+MTP=2 (2.214605 nats) did not improve over TST+MTP=1 and landed marginally behind TST alone (2.213959 nats), suggesting diminishing returns beyond one MTP head at this scale.
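The improvements quoted above can be reproduced from the validation losses reported in the comparison table of this card. The following is a minimal sketch; the short dictionary keys and the `improvement` helper are illustrative names, not part of the released code.

```python
# Validation losses in nats, copied from the comparison table in this card.
val_loss = {
    "base": 2.287432,      # AOMTS-Base-100M-3k-0MTP-v1-run2 (this model)
    "mtp1": 2.276289,      # AOMTS-Base-100M-3k-1MTP-v1
    "tst": 2.213959,       # AOMTS-TST-s6-100M-3k-0MTP-v1
    "tst_mtp1": 2.204673,  # AOMTS-TST-s6-100M-3k-1MTP-v1
    "tst_mtp2": 2.214605,  # AOMTS-TST-s6-100M-3k-2MTP-v1
}


def improvement(name: str, reference: str = "base") -> float:
    """Loss reduction in nats relative to the reference run (positive = better)."""
    return val_loss[reference] - val_loss[name]


for name in ("tst", "mtp1", "tst_mtp1", "tst_mtp2"):
    print(f"{name:9s} {improvement(name):+.6f} nats")
```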
Validation loss is next-token cross-entropy in nats, evaluated on a held-out Wikipedia Markdown validation set using the same 16,000-token BPE vocabulary. Lower is better.
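Because the loss is a per-token cross-entropy in nats, it converts directly to perplexity or to bits per token. The sketch below shows the conversion for this model's reported loss; the helper names are illustrative. The resulting perplexity is per BPE token under the 16,000-token vocabulary, so it is not directly comparable to models that use a different tokenizer.

```python
import math


def nats_to_perplexity(loss_nats: float) -> float:
    """Perplexity is the exponential of the cross-entropy in nats."""
    return math.exp(loss_nats)


def nats_to_bits(loss_nats: float) -> float:
    """Convert nats to bits by dividing by ln(2)."""
    return loss_nats / math.log(2)


loss = 2.287432  # validation loss reported for this model
print(f"perplexity     = {nats_to_perplexity(loss):.2f}")
print(f"bits per token = {nats_to_bits(loss):.3f}")
```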
Research artifact. These checkpoints are screening-scale models (3,000 steps, ~100M parameters) released for research and ablation comparison. They are not intended as production models.
Why 3,000 steps? Across dozens of prior experiments running 15,000+ steps, the eventual winner was consistently already ahead of competing runs within the first 2,000 steps. Running to 3,000 steps provides a clear signal while keeping turnaround fast enough to run many conditions in parallel.
Why ~100M parameters? After many experiments at 200M–500M parameters, the model that won at larger scale consistently also won at ~100M. Screening at ~100M is therefore a reliable and efficient proxy: top candidates from this series will be scaled further.
What might change at scale? At ~100M parameters and 3,000 steps, the model has limited capacity to predict far into the future — which likely explains why MTP=1 was optimal and MTP=2 did not help. A small model trained on relatively little data cannot reliably leverage the signal from heads that predict multiple steps ahead; the additional auxiliary loss may add noise rather than useful gradient. At larger model sizes and longer training runs, the optimal MTP depth is expected to increase as the model gains the capacity to make accurate multi-step predictions. Similarly, the optimal TST bag size (s=6 here) may shift with scale — larger models may benefit from larger or smaller bags depending on how effectively they can decompress the superposition signal during recovery. Further research is needed to determine how these findings scale across model size, training budget, and TST bag size.
About This Model
This model is the exp4 grid baseline: no Token Superposition Training, no Multi-Token Prediction, and a WSD learning rate schedule. It is the primary reference point for the AOMTS series, and all other runs are compared against it. An earlier run with identical hyperparameters, trained before the formal grid sweep, is preserved as AOMTS-Base-100M-3k-0MTP-v1 for historical comparison.
Architecture
| Parameter | Value |
|---|---|
| Vocabulary size | 16,000 |
| Hidden dimension (d_model) | 512 |
| Layers | 12 |
| Attention heads | 8 |
| KV heads | 8 |
| Head dimension | 64 |
| FFN hidden dimension | 4,800 |
| FFN variant | SwiGLU |
| Max sequence length | 2,048 |
| RoPE θ | 10,000 |
| Normalization | RMSNorm |
| Tied embeddings | Yes |
- Total parameters: 117,453,312
- Embeddings (tied tok_emb / lm_head): 8,192,000
- Non-embedding, non-MTP (transformer blocks): 109,261,312 (identical across all AOMTS runs)
MTP parameters are auxiliary training heads. They are not used during standard language modeling evaluation and do not affect validation loss — the val loss reported here is computed from the main head only. The non-embedding, non-MTP parameter count (109,261,312) is identical across all runs in this series.
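As a cross-check, the parameter counts can be estimated from the architecture table. This is a rough sketch under assumptions the card does not state: no bias terms, two RMSNorm weight vectors per layer, and one final RMSNorm. Under those assumptions the blocks come to 101,069,312 parameters and one copy of the tied embedding matrix adds 8,192,000, for 109,261,312 in total. The card's listed total of 117,453,312 is a further 8,192,000 higher, which suggests the published figures follow a different counting convention, so treat this as an estimate rather than a correction.

```python
vocab_size = 16_000
d_model = 512
n_layers = 12
n_heads = 8
n_kv_heads = 8
head_dim = 64
ffn_hidden = 4_800

# Assumptions (not stated in the card): no bias terms, one RMSNorm weight
# vector before attention and one before the FFN, plus a final RMSNorm.
attn = d_model * (n_heads * head_dim)          # Q projection
attn += 2 * d_model * (n_kv_heads * head_dim)  # K and V projections
attn += (n_heads * head_dim) * d_model         # output projection
ffn = 3 * d_model * ffn_hidden                 # SwiGLU: gate, up, down
norms = 2 * d_model                            # two RMSNorm weights per layer
per_layer = attn + ffn + norms

blocks = n_layers * per_layer + d_model        # + final RMSNorm
embedding = vocab_size * d_model               # tied tok_emb / lm_head, one copy

print(f"blocks:             {blocks:,}")
print(f"embedding:          {embedding:,}")
print(f"blocks + embedding: {blocks + embedding:,}")
```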
Training
| Setting | Value |
|---|---|
| Total steps | 3,000 |
| Batch size | 16 sequences |
| Gradient accumulation | 2 |
| Effective batch size | 32 sequences / 65,536 model-context tokens per step |
| Total raw tokens seen | 196,608,000 |
| Sequence length | 2,048 |
| LR schedule | WSD: 150 warmup steps, stable LR, then linear decay to 0.0 over the last 300 steps (final 10% of training) |
| Warmup steps | 150 |
| Min LR | 0.0 |
| Weight decay | 0.1 |
Optimizer: Aurora (matrix weights) + AdamW (embeddings & norms)
- Aurora matrix weight lr: 0.02
- AdamW embedding/norm lr: 0.0003
- Weight decay: 0.1
- Gradient clip: 1.0
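The WSD (warmup, stable, decay) schedule described in the table can be sketched as a function of the step index. This is an illustration, not the training code: the linear warmup shape, the 0-indexed step convention, and the `wsd_lr` name are assumptions, and the card does not say whether the same curve is applied to both the Aurora and AdamW learning rates.

```python
def wsd_lr(step: int, peak_lr: float, total_steps: int = 3000,
           warmup_steps: int = 150, decay_steps: int = 300,
           min_lr: float = 0.0) -> float:
    """Learning rate at a 0-indexed step under a warmup-stable-decay schedule."""
    if step < warmup_steps:
        # Assumed linear warmup; reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        # Stable phase: hold the peak learning rate.
        return peak_lr
    # Linear decay from peak_lr at decay_start to min_lr at total_steps.
    frac = max(0.0, (total_steps - step) / decay_steps)
    return min_lr + (peak_lr - min_lr) * frac


# Example with the Aurora matrix learning rate from this card.
for s in (0, 149, 1500, 2700, 2850, 3000):
    print(s, wsd_lr(s, peak_lr=0.02))
```

Whether the final optimizer step lands exactly on 0.0 depends on how steps are indexed, which this sketch does not try to settle.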
Multi-Token Prediction (MTP)
Multi-Token Prediction is not used in this model.
Token Superposition Training (TST)
Token Superposition Training is not used in this model.
Dataset
Trained on open-index/open-wikipedia-markdown (Wikipedia Markdown). Tokenized with a custom 16,000-token BPE vocabulary.
- Total raw tokens seen: 196,608,000
- Model-context tokens per step: 65,536 (16 seqs × 2 grad accum × 2048 seq len)
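The token counts above follow directly from the training settings. A short sketch of the arithmetic (variable names are illustrative):

```python
micro_batch = 16      # sequences per forward/backward pass
grad_accum = 2        # gradient accumulation steps
seq_len = 2_048       # tokens per sequence
total_steps = 3_000   # optimizer steps

sequences_per_step = micro_batch * grad_accum
tokens_per_step = sequences_per_step * seq_len
total_tokens = tokens_per_step * total_steps

print(f"sequences per step: {sequences_per_step}")
print(f"tokens per step:    {tokens_per_step:,}")
print(f"total tokens seen:  {total_tokens:,}")
```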
Usage
Each repo includes modeling_aomts.py — a self-contained inference script with no external model code required.

    pip install torch safetensors tokenizers

Command-line generation:

    python modeling_aomts.py --repo_dir /path/to/repo --prompt "The theory of" --max_new_tokens 200

Python API:

    from modeling_aomts import load_model, generate

    model, tokenizer = load_model(".")  # add device="cuda" for GPU
    print(generate(model, tokenizer, "The theory of relativity states",
                   max_new_tokens=200, temperature=1.0, top_k=50))

Generation options: temperature (lower = less random; 0 = greedy), top_k, top_p (nucleus sampling), max_new_tokens, device, dtype.
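To make the sampling options concrete, here is a generic sketch of how temperature, top-k, and top-p are commonly combined when choosing the next token. It is written in plain Python for illustration and is not taken from modeling_aomts.py; the function name `sample_next` and the order of the filters are assumptions.

```python
import math
import random


def sample_next(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Pick a token id from raw logits using temperature, top-k and top-p."""
    if temperature == 0:
        # Greedy decoding: always take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])

    # Rank token ids by logit, highest first, and keep at most top_k of them.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if top_k:
        ranked = ranked[:top_k]

    # Softmax over the surviving tokens at the given temperature.
    scaled = [logits[i] / temperature for i in ranked]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Nucleus filter: keep the smallest prefix whose mass reaches top_p.
    if top_p is not None:
        cumulative, cutoff = 0.0, len(probs)
        for n, p in enumerate(probs, start=1):
            cumulative += p
            if cumulative >= top_p:
                cutoff = n
                break
        ranked, probs = ranked[:cutoff], probs[:cutoff]

    # random.choices renormalizes the remaining weights.
    return rng.choices(ranked, weights=probs, k=1)[0]


toy_logits = [0.1, 2.0, -1.0, 1.5]
print(sample_next(toy_logits, temperature=0))           # greedy pick
print(sample_next(toy_logits, temperature=0.8, top_k=2))
```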
Full Experiment Comparison
All AOMTS models at a glance (equal 3,000-step budget, sorted by validation loss):
| Model | MTP Depth | TST | LR Schedule | Optim Reset | Val Loss |
|---|---|---|---|---|---|
| AOMTS-TST-s6-100M-3k-1MTP-v1 | 1 | Yes (s=6¹, 900 steps) | WSD | — | 2.204673 |
| AOMTS-TST-s6-100M-3k-0MTP-v1 | 0 | Yes (s=6¹, 900 steps) | WSD | — | 2.213959 |
| AOMTS-TST-s6-100M-3k-2MTP-v1 | 2 | Yes (s=6¹, 900 steps) | WSD | — | 2.214605 |
| AOMTS-Base-100M-3k-1MTP-v1 | 1 | No | WSD | — | 2.276289 |
| AOMTS-Base-100M-3k-2MTP-v1 | 2 | No | WSD | — | 2.284260 |
| AOMTS-Base-100M-3k-0MTP-v1-run2 ← this model | 0 | No | WSD | — | 2.287432 |
| AOMTS-TST-s6-100M-3k-0MTP-RESET-v1 | 0 | Yes (s=6¹, 900 steps) | WSD | Yes² | 2.302689 |
| AOMTS-Base-100M-3k-1MTP-Cosine-v1 | 1 | No | Cosine | — | 2.354897 |
| AOMTS-Base-100M-3k-0MTP-v1 | 0 | No | WSD | — | 2.375539 |

¹ s = bag size: the number of raw tokens averaged into each compressed embedding position during TST phase 1.
² Optim Reset = phase 2 restarted the optimizer state and LR schedule from scratch rather than carrying them over from phase 1. Models without this flag use a unified schedule across both phases.
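The bag-size footnote can be illustrated with a toy averaging step. This sketch covers only the averaging described in footnote 1 and assumes consecutive, non-overlapping bags and a sequence length divisible by s; the function name `average_bags` is hypothetical, and details such as how training targets are built or how leftover tokens are handled are not specified in this card.

```python
def average_bags(embeddings, bag_size):
    """Average consecutive groups of `bag_size` token embeddings.

    `embeddings` is a list of equal-length vectors (lists of floats).
    Returns one averaged vector per bag.
    """
    if len(embeddings) % bag_size != 0:
        raise ValueError("sequence length must be a multiple of bag_size")
    dim = len(embeddings[0])
    bags = []
    for start in range(0, len(embeddings), bag_size):
        group = embeddings[start:start + bag_size]
        bags.append([sum(vec[d] for vec in group) / bag_size for d in range(dim)])
    return bags


# 12 toy token embeddings of dimension 2, compressed with s = 6.
toy = [[float(i), 1.0] for i in range(12)]
compressed = average_bags(toy, bag_size=6)
print(len(toy), "tokens ->", len(compressed), "compressed positions")
```

With s=6, a 2,048-token window would not divide evenly, so a real implementation has to choose how to pad or truncate; this sketch simply raises an error in that case.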
References
- Peng, B., Gigant, E., Quesnelle, J. (Nous Research, 2025). Token Superposition Training for Language Model Pretraining. arXiv:2605.06546
- DeepSeek-AI (2024). DeepSeek-V3 Technical Report. arXiv:2412.19437