# Nova-2: Multimodal Mamba + Transformer Hybrid
~239M total params (~210M active per token with top-2 MoE routing)
Smilyai-labs notes: Finally! The preview is out!! Nova-2 is a hybrid language model combining Mamba-2 SSM blocks with Grouped-Query Attention Transformer layers, enhanced with a Mixture-of-Experts FFN plus vision and audio adapters.
## Architecture Highlights
| Component | Detail |
|---|---|
| Parameters | 239,283,200 |
| Hidden size | 768 |
| Attention heads | 12 (KV: 4) |
| Mamba layers | 12 |
| Transformer layers | 4 |
| MoE experts | 4 (top-2) |
| Context length | 2048 |
| Sliding window | 512 |
| Precision | bfloat16 |
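The table above can be captured as a plain Python dict for quick sanity checks. The key names here are illustrative assumptions, not the actual keys in the repo's `config.json`:

```python
# Illustrative config mirroring the architecture table; key names are
# assumptions, not the actual keys in the Nova-2 repo's config.json.
nova2_config = {
    "n_params": 239_283_200,
    "hidden_size": 768,
    "num_attention_heads": 12,
    "num_kv_heads": 4,
    "num_mamba_layers": 12,
    "num_transformer_layers": 4,
    "num_experts": 4,
    "num_experts_per_tok": 2,
    "max_position_embeddings": 2048,
    "sliding_window": 512,
    "torch_dtype": "bfloat16",
}

# GQA: each KV head is shared by a group of query heads (12 / 4 = 3).
group_size = nova2_config["num_attention_heads"] // nova2_config["num_kv_heads"]
# Per-head dimension divides evenly: 768 / 12 = 64.
head_dim = nova2_config["hidden_size"] // nova2_config["num_attention_heads"]
```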
## Key Features
- Grouped-Query Attention: 4 KV heads shared across 12 query heads
- Sliding window + global attention: 25% of heads (3 of 12) attend to the full context
- Mamba-2 SSM: selective state spaces with gated input/output
- Mixture-of-Experts SwiGLU FFN with load-balanced routing
- Vision adapter: patch embedding → mini-ViT → learned projection
- Audio adapter: mel-spectrogram → conv → mini-transformer → projection
- Weight tying between the token embeddings and the LM head
- NTK-aware RoPE for long-context extrapolation
- EMA weights for stable inference
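The GQA scheme in the first bullet can be sketched in a few lines of NumPy: the 4 KV heads are broadcast to serve all 12 query heads. This is a toy illustration of the mechanism, not the model's actual implementation:

```python
import numpy as np

def gqa_scores(q, k, num_q_heads=12, num_kv_heads=4):
    """Toy grouped-query attention score computation.

    q: (num_q_heads, seq, head_dim); k: (num_kv_heads, seq, head_dim).
    Each KV head serves num_q_heads // num_kv_heads query heads.
    """
    group = num_q_heads // num_kv_heads       # 3 query heads per KV head
    k_expanded = np.repeat(k, group, axis=0)  # (num_q_heads, seq, head_dim)
    scale = 1.0 / np.sqrt(q.shape[-1])
    return np.einsum("hqd,hkd->hqk", q, k_expanded) * scale

rng = np.random.default_rng(0)
q = rng.standard_normal((12, 5, 64))  # 12 query heads, seq 5, head_dim 64
k = rng.standard_normal((4, 5, 64))   # only 4 KV heads are stored
scores = gqa_scores(q, k)
print(scores.shape)  # (12, 5, 5)
```

The payoff is a 3x smaller KV cache: only 4 heads' worth of keys and values are kept per layer while all 12 query heads still attend.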
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Smilyai-labs/Nova-2-preview", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Smilyai-labs/Nova-2-preview", trust_remote_code=True)

inputs = tokenizer("The future of AI is", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,  # required for temperature/top_p to take effect
    temperature=0.8,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Special Tokens

| Token | ID | Purpose |
|---|---|---|
| `<\|endoftext\|>` | 50256 | BOS / EOS |
| `<\|image\|>` | 50257 | Vision token marker |
| `<\|/image\|>` | 50258 | Vision end marker |
| `<\|audio\|>` | 50259 | Audio token marker |
| `<\|/audio\|>` | 50260 | Audio end marker |
| `<\|pad\|>` | 50261 | Padding |
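Conveniently, the six IDs are contiguous. Below is a sketch of how a multimodal prompt could be laid out with these markers; the number of vision slots per image and the exact interleaving are assumptions, not taken from the Nova-2 code:

```python
# The six special-token IDs are contiguous, so a range() covers them.
ENDOFTEXT, IMAGE, IMAGE_END, AUDIO, AUDIO_END, PAD = range(50256, 50262)

def wrap_image(vision_slot_ids):
    """Bracket a span of vision-adapter slots with the image markers.

    How many slots one image occupies is a hypothetical choice here.
    """
    return [IMAGE] + list(vision_slot_ids) + [IMAGE_END]

text_ids = [1212, 318]   # placeholder text-token IDs
vision_ids = [0] * 4     # placeholder slots filled by the vision adapter
sequence = [ENDOFTEXT] + wrap_image(vision_ids) + text_ids
print(sequence[:3])  # [50256, 50257, 0]
```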
## Training
Trained with JAX/Flax on TPU using:
- AdamW with cosine warmup-decay schedule
- Gradient clipping (max norm 1.0)
- Z-loss for logit regularization
- MoE load-balancing auxiliary loss
- Activation checkpointing (remat)
- Mixed-precision (bfloat16)
## License
Apache 2.0