Complexity Deep 1.5B v0.13.0

A novel transformer architecture with Mu-Guided Attention, Token-Routed MLP, and INL Dynamics.

Model Details

Attribute         Value
Parameters        ~1.52B
Hidden Size       2048
Layers            24
Attention Heads   16
KV Heads (GQA)    8
Experts           4
Context Length    2048
Vocab Size        32,000
Precision         BF16
Version           0.13.0

Architecture Innovations (v0.13.0)

1. Mu-Guided Attention (INL 2025)

The key innovation: μ (mu) from the previous layer biases the K, Q, AND V projections:

# v0.13.0: KQV order (industry standard like Qwen, Llama, GPT)
# Fused Mu-KQV via concat+cuBLAS (2x faster than 6 separate matmuls)
x_mu = concat([x, mu_prev], dim=-1)

k = x_mu @ concat([W_k, W_mu_k], dim=0)  # K biased by mu (weights stacked along the input dim)
q = x_mu @ concat([W_q, W_mu_q], dim=0)  # Q biased by mu
v = x_mu @ concat([W_v, W_mu_v], dim=0)  # V biased by mu
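
As a concrete illustration, here is a minimal PyTorch sketch of the fused projection. The module name FusedMuKQV, the bias-free linear layer, and the K/Q/V split order are assumptions for illustration, not the package's actual API.

import torch
import torch.nn as nn

class FusedMuKQV(nn.Module):
    """Sketch: one linear layer over concat([x, mu_prev]) produces K, Q and V."""
    def __init__(self, hidden_size: int, num_heads: int, num_kv_heads: int):
        super().__init__()
        head_dim = hidden_size // num_heads
        self.kv_dim = num_kv_heads * head_dim            # GQA: fewer K/V heads than Q heads
        self.q_dim = hidden_size
        # Single weight = [W_k | W_q | W_v] stacked with [W_mu_k | W_mu_q | W_mu_v]
        self.proj = nn.Linear(2 * hidden_size, self.kv_dim + self.q_dim + self.kv_dim, bias=False)

    def forward(self, x: torch.Tensor, mu_prev: torch.Tensor):
        x_mu = torch.cat([x, mu_prev], dim=-1)           # (batch, seq, 2 * hidden)
        kqv = self.proj(x_mu)                            # one cuBLAS matmul instead of six
        k, q, v = kqv.split([self.kv_dim, self.q_dim, self.kv_dim], dim=-1)
        return k, q, v

With hidden_size=2048, num_heads=16 and num_kv_heads=8 (the configuration above), K and V come out at 1024 dims each and Q at 2048.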

Why Mu everywhere?

  • Top-down guidance: μ carries global context from previous layers
  • Faster convergence: Model learns structure ~2-3x faster
  • Better sample efficiency: 50k steps achieves what normally takes 150k+

2. Token-Routed MLP with Mu-Guided Routing

Deterministic expert selection + mu influence:

# Base routing: deterministic by token ID
expert_id = token_id % num_experts

# Mu override: mu can shift expert selection
router_logits = base_router(x) + mu_router(mu_prev)
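
A minimal sketch of how the two signals might combine is shown below, assuming the deterministic token-ID route enters as a one-hot bias on the learned logits and the expert is picked by argmax; the combination rule and layer sizes are illustrative, not taken from the package.

import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_size, num_experts = 2048, 4
base_router = nn.Linear(hidden_size, num_experts, bias=False)  # routes on the hidden state
mu_router = nn.Linear(hidden_size, num_experts, bias=False)    # routes on mu from the previous layer

def route(token_ids: torch.Tensor, x: torch.Tensor, mu_prev: torch.Tensor) -> torch.Tensor:
    # Deterministic base route (token_id % num_experts) expressed as a one-hot bias on the logits
    base_bias = F.one_hot(token_ids % num_experts, num_experts).float()
    router_logits = base_router(x) + mu_router(mu_prev) + base_bias
    return router_logits.argmax(dim=-1)                        # chosen expert index per token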

Benefits:

  • Uniform distribution: token-ID routing gives each of the 4 experts roughly 25% of tokens
  • Zero routing collapse: frequent tokens are spread across the experts rather than collapsing onto one
  • Mu guidance: Context influences which expert processes each token
  • Fused gate+up projection: 1.3x speedup via single matmul

3. INL Dynamics with Contextual Mu

A control system inspired by robotics, now with contextual adaptation:

error = h - mu                      # deviation from equilibrium
v_next = alpha * v - beta * error   # velocity update (momentum + correction)
h_next = h + dt * gate * v_next     # position update (integration)

# v0.13.0: Contextual mu for next layer
mu_contextual = mu + mu_proj(h)     # mu adapts based on current hidden state
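
Put together as runnable PyTorch, one step of the dynamics could look like the sketch below; the scalar alpha/beta/dt values and the sigmoid gate are placeholder choices, and the real layer's parameterisation may differ.

import torch
import torch.nn as nn

class INLDynamicsSketch(nn.Module):
    def __init__(self, hidden_size: int, alpha: float = 0.9, beta: float = 0.1, dt: float = 1.0):
        super().__init__()
        self.alpha, self.beta, self.dt = alpha, beta, dt
        self.gate = nn.Linear(hidden_size, hidden_size)      # per-dimension gate on the update
        self.mu_proj = nn.Linear(hidden_size, hidden_size)   # contextual mu update

    def forward(self, h, v, mu):
        error = h - mu                                       # deviation from equilibrium
        v_next = self.alpha * v - self.beta * error          # momentum + correction
        h_next = h + self.dt * torch.sigmoid(self.gate(h)) * v_next   # gated integration
        mu_contextual = mu + self.mu_proj(h)                 # mu adapts to the hidden state
        return h_next, v_next, mu_contextual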

Benefits:

  • Smooth trajectories (no jerky token generation)
  • Stable convergence (PID-like control)
  • Mu Highway: Accumulated context flows across all 24 layers

4. Modern Attention Stack

  • KQV Order: Industry standard (Llama, Qwen, GPT) for optimal KV-cache
  • GQA: 8 KV heads (halves the KV cache relative to full MHA)
  • QK Norm: query/key normalization for attention stability at scale
  • SDPA: Flash-Attention-style kernels via PyTorch 2.0+ scaled_dot_product_attention
  • RoPE: Rotary positional embeddings
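
The sketch below shows how these pieces typically fit together in PyTorch: a plain RMS rescaling stands in for QK Norm, KV heads are expanded for GQA, and scaled_dot_product_attention dispatches to the fused Flash/SDPA kernels. RoPE is omitted for brevity, and the exact normalisation may differ from the model's.

import torch
import torch.nn.functional as F

B, T, n_heads, n_kv_heads, head_dim = 2, 16, 16, 8, 128
q = torch.randn(B, n_heads, T, head_dim)
k = torch.randn(B, n_kv_heads, T, head_dim)
v = torch.randn(B, n_kv_heads, T, head_dim)

def rms_norm(x, eps=1e-6):
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)

q, k = rms_norm(q), rms_norm(k)                    # QK norm for attention stability

# GQA: each KV head serves n_heads // n_kv_heads query heads
groups = n_heads // n_kv_heads
k = k.repeat_interleave(groups, dim=1)
v = v.repeat_interleave(groups, dim=1)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)   # fused attention kernel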

Layer Architecture

Input
  │
  ▼
[RMSNorm] ─► [Mu-Guided GQA (KQV)] ─► [INL Dynamics] ─► [RMSNorm] ─► [Token-Routed MLP]
  │              ▲                         │                              ▲
  │              │                         │                              │
  │         mu_prev                   mu_contextual ──────────────────────┘
  │                                        │
  +─────────────────── Residual ───────────┼──────────────────────────────+
  │                                        │                              │
  ▼                                        ▼                              │
Output ◄───────────────────────────── mu_next (to next layer)  ◄──────────┘
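
Read as code, one layer might wire these blocks roughly as follows; every submodule here is a simplified stand-in (plain linear layers, LayerNorm instead of RMSNorm), so treat this as one reading of the diagram rather than the package's implementation.

import torch
import torch.nn as nn

class DeepLayerSketch(nn.Module):
    def __init__(self, h: int = 2048):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(h), nn.LayerNorm(h)  # RMSNorm in the real model
        self.attn = nn.Linear(2 * h, h)   # stand-in for Mu-Guided GQA over concat([x, mu_prev])
        self.mlp = nn.Linear(2 * h, h)    # stand-in for the mu-guided Token-Routed MLP
        self.mu_proj = nn.Linear(h, h)    # contextual mu update

    def forward(self, x, mu_prev, v_prev, alpha=0.9, beta=0.1, dt=1.0):
        h = x + self.attn(torch.cat([self.norm1(x), mu_prev], dim=-1))    # attention + residual
        v_next = alpha * v_prev - beta * (h - mu_prev)                    # INL dynamics
        h = h + dt * v_next
        mu_next = mu_prev + self.mu_proj(h)                               # contextual mu
        out = h + self.mlp(torch.cat([self.norm2(h), mu_next], dim=-1))   # MLP + residual
        return out, mu_next, v_next                                       # mu_next feeds the next layer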

Training Status

Training Progress

  • Current Step: 750,000 (early checkpoint)
  • Target: 1,000,000 steps
  • Dataset: FineWeb-Edu (French/English)
  • Hardware: H100 80GB

Research Preview: an active training checkpoint demonstrating our novel Mu-Guided architecture. Early results show ~2-3x faster convergence vs. standard transformers. Full release at 1M steps.

Generation Example (550k steps)

At 550k steps, the model produces grammatically correct sentences with proper punctuation and structure, demonstrating that Mu-guidance accelerates learning.

Installation

pip install complexity-deep>=0.13.0

Usage

Python API

from complexity_deep import DeepForCausalLM, DeepConfig
from tokenizers import Tokenizer
import torch

# Load model
model = DeepForCausalLM.from_pretrained("Pacific-Prime/pacific-prime")
tokenizer = Tokenizer.from_file("tokenizer.json")

# Generate
input_ids = torch.tensor([tokenizer.encode("Hello").ids])
output = model.generate(input_ids, max_new_tokens=50, temperature=0.8)
print(tokenizer.decode(output[0].tolist()))

Generation Script

# Single prompt
python generate.py "The future of AI is" --max_tokens 100 --temperature 0.8

# Interactive mode
python generate.py --interactive

What's Original Here?

Innovation                    Status             Description
Mu-Guided KQV                 Novel (INL 2025)   μ biases K, Q, AND V projections
Mu-Guided Expert Routing      Novel              μ influences MLP expert selection
Contextual Mu (mu_proj)       Novel              μ adapts based on hidden state
Token-Routed MLP              Novel              Deterministic routing by token ID
INL Dynamics                  Novel              Robotics control in transformers
Fused Mu-KQV (concat+cuBLAS)  Novel              2x faster than separate projections
KQV Order                     Industry standard  Like Llama, Qwen, GPT

Files

  • model.safetensors - Model weights (~3GB, BF16)
  • config.json - Architecture configuration (v0.13.0)
  • tokenizer.json - BPE tokenizer (32K vocab)

Citation

@software{peyriguere2026complexity,
  author       = {Peyriguere, Boris},
  title        = {Complexity-Deep: Token-Routed MLP with Mu-Guided Dynamics for Efficient Transformer Architectures},
  year         = 2026,
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.18293026},
  url          = {https://doi.org/10.5281/zenodo.18293026}
}

Roadmap

Feature               Description               Status
Continuous Batching   Dynamic request batching  ✅ Done
Speculative Decoding  2-3x faster inference     Planned

License

CC-BY-NC-4.0 (Creative Commons Attribution-NonCommercial 4.0)
