Complexity Deep 1.5B v0.13.0
A novel transformer architecture with Mu-Guided Attention, Token-Routed MLP, and INL Dynamics.
Model Details
| Attribute | Value |
|---|---|
| Parameters | ~1.52B |
| Hidden Size | 2048 |
| Layers | 24 |
| Attention Heads | 16 |
| KV Heads (GQA) | 8 |
| Experts | 4 |
| Context Length | 2048 |
| Vocab Size | 32,000 |
| Precision | BF16 |
| Version | 0.13.0 |
Architecture Innovations (v0.13.0)
1. Mu-Guided Attention (INL 2025)
The key innovation: ΞΌ (mu) from the previous layer biases K, Q, AND V projections:
# v0.13.0: KQV order (industry standard like Qwen, Llama, GPT)
# Fused Mu-KQV via concat+cuBLAS (2x faster than 6 separate matmuls)
x_mu = concat([x, mu_prev], dim=-1)
k = x_mu @ concat([W_k, W_mu_k]) # K biased by mu
q = x_mu @ concat([W_q, W_mu_q]) # Q biased by mu
v = x_mu @ concat([W_v, W_mu_v]) # V biased by mu
Why Mu everywhere?
- Top-down guidance: ΞΌ carries global context from previous layers
- Faster convergence: Model learns structure ~2-3x faster
- Better sample efficiency: 50k steps achieves what normally takes 150k+
2. Token-Routed MLP with Mu-Guided Routing
Deterministic expert selection + mu influence:
# Base routing: deterministic by token ID
expert_id = token_id % num_experts
# Mu override: mu can shift expert selection
router_logits = base_router(x) + mu_router(mu_prev)
Benefits:
- Uniform distribution: Each expert receives exactly 25% of tokens
- Zero routing collapse: Frequent tokens spread across all experts
- Mu guidance: Context influences which expert processes each token
- Fused gate+up projection: 1.3x speedup via single matmul
3. INL Dynamics with Contextual Mu
A control system inspired by robotics, now with contextual adaptation:
error = h - mu # deviation from equilibrium
v_next = alpha * v - beta * error # velocity update (momentum + correction)
h_next = h + dt * gate * v_next # position update (integration)
# v0.13.0: Contextual mu for next layer
mu_contextual = mu + mu_proj(h) # mu adapts based on current hidden state
Benefits:
- Smooth trajectories (no jerky token generation)
- Stable convergence (PID-like control)
- Mu Highway: Accumulated context flows across all 24 layers
4. Modern Attention Stack
- KQV Order: Industry standard (Llama, Qwen, GPT) for optimal KV-cache
- GQA: 8 KV heads (2x less KV cache than MHA)
- QK Norm: Attention stability at scale
- SDPA: Flash Attention via PyTorch 2.0+
- RoPE: Rotary positional embeddings
Layer Architecture
Input
β
βΌ
[RMSNorm] ββΊ [Mu-Guided GQA (KQV)] ββΊ [INL Dynamics] ββΊ [RMSNorm] ββΊ [Token-Routed MLP]
β β² β β²
β β β β
β mu_prev mu_contextual βββββββββββββββββββββββ
β β
+βββββββββββββββββββ Residual ββββββββββββΌββββββββββββββββββββββββββββββ+
β β β
βΌ βΌ β
Output ββββββββββββββββββββββββββββββ mu_next (to next layer) ββββββββββββ
Training Status
- Current Step: 750,000 (early checkpoint)
- Target: 1,000,000 steps
- Dataset: FineWeb-Edu (French/English)
- Hardware: H100 80GB
Research Preview - Active training checkpoint demonstrating our novel Mu-Guided architecture. Early results show ~2-3x faster convergence vs standard transformers. Full release at 1M steps.
Generation Example (550k steps)
At 550k steps, the model produces grammatically correct sentences with proper punctuation and structure - demonstrating that Mu-guidance accelerates learning.
Installation
pip install complexity-deep>=0.13.0
Usage
Python API
from complexity_deep import DeepForCausalLM, DeepConfig
from tokenizers import Tokenizer
import torch
# Load model
model = DeepForCausalLM.from_pretrained("Pacific-Prime/pacific-prime")
tokenizer = Tokenizer.from_file("tokenizer.json")
# Generate
input_ids = torch.tensor([tokenizer.encode("Hello").ids])
output = model.generate(input_ids, max_new_tokens=50, temperature=0.8)
print(tokenizer.decode(output[0].tolist()))
Generation Script
# Single prompt
python generate.py "The future of AI is" --max_tokens 100 --temperature 0.8
# Interactive mode
python generate.py --interactive
What's Original Here?
| Innovation | Status | Description |
|---|---|---|
| Mu-Guided KQV | Novel (INL 2025) | ΞΌ biases K, Q, AND V projections |
| Mu-Guided Expert Routing | Novel | ΞΌ influences MLP expert selection |
| Contextual Mu (mu_proj) | Novel | ΞΌ adapts based on hidden state |
| Token-Routed MLP | Novel | Deterministic routing by token ID |
| INL Dynamics | Novel | Robotics control in transformers |
| Fused Mu-KQV (concat+cuBLAS) | Novel | 2x faster than separate projections |
| KQV Order | Industry standard | Like Llama, Qwen, GPT |
Files
model.safetensors- Model weights (~3GB, BF16)config.json- Architecture configuration (v0.13.0)tokenizer.json- BPE tokenizer (32K vocab)
Citation
@software{peyriguere2026complexity,
author = {Peyriguere, Boris},
title = {Complexity-Deep: Token-Routed MLP with Mu-Guided Dynamics for Efficient Transformer Architectures},
year = 2026,
publisher = {Zenodo},
doi = {10.5281/zenodo.18293026},
url = {https://doi.org/10.5281/zenodo.18293026}
}
Roadmap
| Feature | Description | Status |
|---|---|---|
| Continuous Batching | Dynamic request batching | β Done |
| Speculative Decoding | 2-3x faster inference | Planned |
Links
- GitHub - complexity-deep
- GitHub - complexity-inference
- GitHub - complexity-framework
- PyPI - complexity-deep
- PyPI - mu-inference
- PyPI - complexity-framework
- Paper (Zenodo)
License
CC-BY-NC-4.0 (Creative Commons Attribution-NonCommercial 4.0)
- Downloads last month
- 425

