Harold v0.8
Harold is an experimental continuous diffusion language model built for fast deployment. Unlike standard small LLMs, which are dense autoregressive Transformers shrunk down, Harold is designed from first principles for constrained hardware: subquadratic SSM layers, sparse MoE activation, looped attention for variable-depth reasoning, and parallel diffusion decoding.
Developed by Minya AI · Repo · Apache 2.0
⚠️ Training in progress. The Harold v0.8 pretraining run is being prepared; weights will be released upon completion. This card documents the architecture and design decisions.
Architecture
Harold v0.8 introduces a three-stage Looped Jamba architecture:
1. Mamba3 Prelude: Sequential Memory (8 layers, run once)
The Prelude processes the input sequence through 8 Mamba3 SSM layers with O(1) per-token memory. It produces an encoded representation e that captures long-range context; this encoding is injected into every loop iteration, keeping the original signal alive across arbitrary reasoning depth.
2. Looped Attention: Deep Reasoning (4 blocks × T iterations)
The core innovation: 4 attention blocks are shared and looped T times (default T=8, giving 32 effective attention layers from only 4 sets of weights). Each iteration includes:
- MoDA (Mixture-of-Depths Attention): each head attends to KV from both the current iteration and previous iterations, recovering features that would otherwise be diluted
- mHC (Manifold-Constrained Hyper-Connections, from DeepSeek V4): the residual stream is widened into 4 parallel streams, mixed via a Sinkhorn-normalized doubly-stochastic matrix that bounds signal amplification to ~1.6×
- LTI-stable injection: h_{t+1} = A·h_t + B·e + block_out, with spectral radius ρ(A) < 1 guaranteed by construction (see the sketch after this list)
- ACT halting: adaptive computation time per position; easy tokens exit early
- LoRA depth adapter: per-loop specialization without breaking weight sharing
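A minimal, illustrative PyTorch sketch of two of these mechanisms follows; it is not Harold's implementation, and the generic `block` module and linear `halt_head` are assumptions. The `sinkhorn` helper shows the doubly-stochastic projection mHC uses for stream mixing, and `looped_reasoning` shows the LTI-stable injection and ACT halting; MoDA, the multi-stream residual wiring, and the LoRA depth adapter are omitted.

```python
import torch
import torch.nn as nn

def sinkhorn(logits: torch.Tensor, n_iters: int = 20) -> torch.Tensor:
    """Sinkhorn-Knopp: project a square mixing matrix onto (approximately)
    doubly-stochastic matrices by alternating row/column normalization."""
    M = logits.exp()
    for _ in range(n_iters):
        M = M / M.sum(dim=-1, keepdim=True)  # rows sum to 1
        M = M / M.sum(dim=-2, keepdim=True)  # columns sum to 1
    return M

def looped_reasoning(block: nn.Module, halt_head: nn.Module, e: torch.Tensor,
                     n_loops: int = 8, act_threshold: float = 0.99) -> torch.Tensor:
    """Run a single shared `block` for `n_loops` iterations with LTI-stable
    injection of the Prelude encoding `e` and ACT-style per-position early exit."""
    d = e.size(-1)
    # Diagonal A with entries in (-1, 1) gives spectral radius rho(A) < 1 by construction.
    A = torch.full((d,), 0.9)
    B = torch.full((d,), 0.1)
    h = torch.zeros_like(e)
    halted = torch.zeros(e.shape[:-1], dtype=torch.bool)    # (batch, seq)
    for _ in range(n_loops):
        block_out = block(h)                                 # same weights every iteration
        update = A * h + B * e + block_out                   # h_{t+1} = A*h_t + B*e + block_out
        h = torch.where(halted.unsqueeze(-1), h, update)     # halted positions are frozen
        p_halt = torch.sigmoid(halt_head(h)).squeeze(-1)     # ACT halting probability
        halted = halted | (p_halt > act_threshold)
        if halted.all():                                     # every position exited early
            break
    return h
```

For example, `looped_reasoning(nn.Linear(2048, 2048), nn.Linear(2048, 1), torch.randn(1, 16, 2048))` exercises the update with toy stand-in modules.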
3. Mamba3 Coda: Stabilization (4 layers, run once)
The Coda re-integrates the looped reasoning output into the sequential flow, smoothing and stabilizing before the output head.
Input tokens
  ↓ embed + timestep conditioning
[Mamba3 + MoE + AdaLN] × 8 (Prelude) → encoding `e`
  ↓
[MoDA-Attn + MoE + AdaLN] × 4 blocks × 8 loops
  ├ LTI injection of `e` at every loop
  ├ mHC multi-stream residual (Sinkhorn mixing)
  └ ACT halting + LoRA depth adapter
  ↓
[Mamba3 + MoE + AdaLN] × 4 (Coda)
  ↓
RMSNorm → Linear → logits
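Read as code, the diagram corresponds roughly to the skeleton below; every module name and call signature here is a placeholder, not Harold's actual API.

```python
import torch.nn as nn

class LoopedJambaSketch(nn.Module):
    """Schematic three-stage forward pass: Prelude (run once), shared attention
    blocks looped n_loops times, Coda (run once), then the output head."""
    def __init__(self, prelude, loop_blocks, coda, head, n_loops=8):
        super().__init__()
        self.prelude, self.loop_blocks = prelude, loop_blocks
        self.coda, self.head = coda, head
        self.n_loops = n_loops

    def forward(self, x, t, n_loops=None):
        n_loops = n_loops or self.n_loops
        e = self.prelude(x, t)                  # 8 Mamba3 layers, run once -> encoding e
        h = e
        for _ in range(n_loops):                # 4 shared blocks, looped n_loops times
            for block in self.loop_blocks:
                h = block(h, e, t)              # every iteration re-injects e
        h = self.coda(h, t)                     # 4 Mamba3 layers, run once
        return self.head(h)                     # RMSNorm -> Linear -> logits
```

The `n_loops` argument is what makes depth extrapolation at inference (see below) possible without touching the weights.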
4. Sparse Mixture of Experts
DeepSeek-style MoE with 1 shared + 16 routed SwiGLU experts, top-3 selection. Both shared and routed experts use SwiGLU (changed from GELU in v0.7 for routed experts, aligning with DeepSeek V3/V4, LLaMA, Mistral). THOR-style deterministic hash routing (no learnable router) provides +5% throughput with identical convergence.
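The sketch below illustrates the routing idea under stated assumptions: the hash function, expert width `d_ff`, and the dense dispatch loop are placeholders, and Harold's actual THOR-style router and any load balancing are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Gated SwiGLU feed-forward: W2(SiLU(W1 x) * W3 x)."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)
        self.w3 = nn.Linear(d_model, d_ff, bias=False)
        self.w2 = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

class HashMoESketch(nn.Module):
    """Illustrative hash-routed MoE: 1 shared + 16 routed SwiGLU experts, with the
    top-3 routed experts chosen deterministically from the token id (no learnable
    router). Dense reference implementation, for clarity only."""
    def __init__(self, d_model=2048, d_ff=4096, n_experts=16, top_k=3):
        super().__init__()
        self.shared = SwiGLU(d_model, d_ff)
        self.experts = nn.ModuleList(SwiGLU(d_model, d_ff) for _ in range(n_experts))
        self.n_experts, self.top_k = n_experts, top_k

    def forward(self, x, token_ids):
        # x: (batch, seq, d_model), token_ids: (batch, seq)
        out = self.shared(x)
        for k in range(self.top_k):
            # Cheap integer hash per routing slot; slot collisions are ignored here.
            idx = (token_ids * 2654435761 + k * 97) % self.n_experts
            for e, expert in enumerate(self.experts):
                mask = (idx == e).unsqueeze(-1).to(x.dtype)
                out = out + mask * expert(x) / self.top_k
        return out
```

A real implementation would dispatch tokens to experts with gather/scatter rather than evaluating every expert densely as above.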
5. Continuous Flow Matching Diffusion
Harold does not predict the next token. Instead, it refines an entire sequence from Gaussian noise toward coherent text using x0-prediction Flow Matching with logit-normal timestep sampling. This enables:
- Parallel decoding: all tokens refined simultaneously
- Native infill: fill-in-the-middle without tricks
- Variable compute: more loop iterations for harder problems (depth extrapolation)
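As a sketch of the training objective, an x0-prediction flow-matching step with logit-normal timesteps could look like the following; the linear interpolation path, the unweighted MSE, and treating 0.5 as the standard deviation are assumptions not specified by the card.

```python
import torch

def x0_flow_matching_loss(model, x0: torch.Tensor) -> torch.Tensor:
    """Corrupt clean token embeddings x0 toward Gaussian noise along a linear
    path and train the model to predict x0 back (x0-prediction flow matching).
    Timesteps are logit-normal: t = sigmoid(z), z ~ N(0, 0.5)."""
    noise = torch.randn_like(x0)
    z = 0.5 * torch.randn(x0.size(0), device=x0.device)
    t = torch.sigmoid(z)                                  # logit-normal in (0, 1)
    t_ = t.view(-1, *([1] * (x0.dim() - 1)))              # broadcast over seq/channel dims
    x_t = (1.0 - t_) * noise + t_ * x0                    # noisy point on the path
    x0_pred = model(x_t, t)                               # the network predicts clean x0
    return torch.mean((x0_pred - x0) ** 2)
```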
Model Details
| Property | Value |
|---|---|
| Architecture | Looped Jamba (Mamba3 Prelude/Coda + Looped Attention) |
| Parameters | ~3.2B total / ~800M active |
| d_model | 2048 |
| Physical layers | 16 (8 Prelude + 4 Loop + 4 Coda) |
| Effective depth | 44 (8 + 4×8 + 4) |
| Attention | GQA 4:1 (32 heads, 8 KV) + MoDA depth KV |
| Mamba3 d_state | 128 |
| MoE | 1 shared + 16 routed SwiGLU (top-3, Hash routing) |
| Loop depth | 8 iterations (extendable at inference) |
| mHC streams | 4 (Sinkhorn-Knopp, 20 iterations) |
| ACT threshold | 0.99 |
| Max seq len | 4096 (YaRN RoPE scale=4.0, θ=500k) |
| Tokenizer | LLaMA-2 BPE (32,000 vocab) |
| Diffusion | x0-prediction CFM, logit-normal t ~ σ(N(0, 0.5)) |
| AdaLN dim | 1024 |
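For orientation, the table corresponds roughly to the configuration below; the field names are illustrative placeholders and need not match `core.config.ModelConfig`.

```python
from dataclasses import dataclass

@dataclass
class HaroldConfigSketch:
    """Illustrative configuration mirroring the Model Details table."""
    d_model: int = 2048
    n_prelude_layers: int = 8
    n_loop_blocks: int = 4
    n_coda_layers: int = 4
    n_loops: int = 8              # effective depth = 8 + 4*8 + 4 = 44
    n_heads: int = 32
    n_kv_heads: int = 8           # GQA 4:1
    mamba_d_state: int = 128
    n_shared_experts: int = 1
    n_routed_experts: int = 16
    moe_top_k: int = 3
    hc_streams: int = 4
    sinkhorn_iters: int = 20
    act_threshold: float = 0.99
    max_seq_len: int = 4096
    rope_theta: float = 500_000.0
    yarn_scale: float = 4.0
    vocab_size: int = 32_000
    adaln_dim: int = 1024
```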
Pretraining Dataset Mix
Education/code-oriented mix that prioritizes code and systems content:
| Dataset | Weight | Purpose |
|---|---|---|
| FineWeb-Edu | 20% | High-quality web text |
| SlimPajama | 15% | Thematic diversity |
| GitHub Code | 25% | General code (30+ languages) |
| The Stack (C/C++/Rust) | 10% | Systems & embedded languages |
| C4 | 10% | Web text |
| Wikipedia EN | 10% | Factual grounding |
| Open-Web-Math | 5% | Mathematical reasoning |
| arXiv | 3% | Technical writing |
| PG-19 | 2% | Long-form coherence |
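One simple way to realize such a mix during training is weighted source sampling, sketched below; Harold's actual data pipeline is not described in this card, so the dataset keys and sampling scheme are only illustrative.

```python
import random

# Mix weights from the table above.
DATASET_WEIGHTS = {
    "fineweb-edu": 0.20, "slimpajama": 0.15, "github-code": 0.25,
    "the-stack-systems": 0.10, "c4": 0.10, "wikipedia-en": 0.10,
    "open-web-math": 0.05, "arxiv": 0.03, "pg19": 0.02,
}

def sample_source(rng=random) -> str:
    """Pick the source dataset for the next training document according to the mix."""
    names, weights = zip(*DATASET_WEIGHTS.items())
    return rng.choices(list(names), weights=weights, k=1)[0]
```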
- Precision: bfloat16
- Optimizer: MuonAdamW
- Learning rate: 1e-4 → 1e-5 cosine, 2000 warmup steps
- Effective batch: 4 × 32 grad accum × 4096 tokens
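The learning-rate schedule can be approximated with a standard warmup-plus-cosine multiplier, sketched below; the total step count is a placeholder, and a plain AdamW stands in for MuonAdamW in the commented usage.

```python
import math

def lr_multiplier(step: int, warmup: int = 2000, total: int = 100_000,
                  lr_max: float = 1e-4, lr_min: float = 1e-5) -> float:
    """Linear warmup to lr_max, then cosine decay to lr_min, returned as a
    multiplier of lr_max for use with LambdaLR. `total` is a placeholder."""
    if step < warmup:
        return step / warmup
    progress = min(1.0, (step - warmup) / max(1, total - warmup))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return (lr_min + (lr_max - lr_min) * cosine) / lr_max

# Stand-in usage (MuonAdamW itself is not shown here):
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_multiplier)
```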
Key Innovations in v0.8
| Feature | Source | What it does |
|---|---|---|
| Looped Attention | Parcae | Variable-depth reasoning at fixed parameter cost |
| MoDA | HustVL | Cross-layer depth KV attention (~3.7% FLOP overhead) |
| mHC | DeepSeek V4 | Multi-stream residual with Sinkhorn mixing (~6.7% overhead) |
| Mamba3 as Prelude/Coda | Harold design | SSM provides always-on sequential memory; attention is looped |
| SwiGLU everywhere | DeepSeek V3/V4 | Both shared and routed experts use gated SwiGLU |
| LTI injection | Parcae | Spectral-radius-bounded recurrent update, Ο(A) < 1 |
| ACT halting | Graves 2016 | Adaptive compute per position; easy tokens exit early |
Usage
Weights are not yet released. The following assumes a v0.8 checkpoint is available.
import torch
from transformers import AutoTokenizer
from core.config import ModelConfig
from core.harold import Harold
from sampler import build_sampler
# Load the checkpoint and rebuild the model from its saved config
state = torch.load("Harold-v0.8-3B-Base.pt", map_location="cpu", weights_only=False)
model_cfg = state.get("model_cfg", ModelConfig())
model = Harold(model_cfg).cuda().bfloat16()
model.load_state_dict(state["model_state"])
model.eval()

# LLaMA-2 BPE tokenizer (32k vocab)
tokenizer = AutoTokenizer.from_pretrained("JHN-MACHINE/harold")
tokenizer.pad_token = tokenizer.eos_token

# Diffusion sampler: 32 refinement steps, classifier-free guidance scale 3.0
sampler = build_sampler(model, n_steps=32, freeze_threshold=0.9, cfg_scale=3.0)
tokens = sampler.generate(batch_size=1, seq_len=256)
print(tokenizer.decode(tokens[0].tolist(), skip_special_tokens=True))
Depth Extrapolation at Inference
# Default: 8 loops (effective depth 44)
logits = model(input_ids, t=t, n_loops=8)
# Harder problem: 16 loops (effective depth 76)
logits = model(input_ids, t=t, n_loops=16)
# Fast mode: 2 loops (effective depth 20)
logits = model(input_ids, t=t, n_loops=2)
Installation
git clone https://github.com/JHNMACHINE/harold
cd harold
pip install -r requirements.txt
# Mamba3 requires CUDA >= 13
pip install mamba-ssm causal-conv1d
Changelog
| Version | Params | Key changes |
|---|---|---|
| v0.4 | 733M | VP-SDE, GPT-2 tokenizer, pure Transformer |
| v0.5 | 1.25B | Flow Matching, LLaMA-2 tokenizer |
| v0.6 | 1.51B | Jamba (Mamba2), MoE, SFT |
| v0.7 | 3.2B | Mamba3, x0-prediction, HashMoE, FSDP |
| v0.8 | ~3.2B | Looped Jamba, MoDA, mHC (DeepSeek V4), SwiGLU experts, ACT, LTI injection |
Limitations
- Training in progress: generation quality metrics pending the pretraining run
- Diffusion latency: requires N forward passes; each pass runs T loop iterations
- mHC overhead: ~4× loop FLOPs with hc_mult=4 (tunable)
- No SFT yet: instruction following planned post-pretraining
- Mamba3: requires CUDA >= 13
Citation
@misc{Harold,
  title  = {Harold v0.8: Looped Jamba Diffusion Language Model with MoDA, mHC, and Sparse MoE},
  author = {Vecchione, Jonathan},
  year   = {2026},
  url    = {https://huggingface.co/JHN-MACHINE/harold}
}
License
Apache 2.0 (see LICENSE)
Built by Minya AI · minya.ai