Harold v0.8
Harold is an experimental continuous diffusion language model built for fast deployment. Unlike standard small LLMs, which are dense autoregressive Transformers shrunk down, Harold is designed from first principles for constrained hardware: subquadratic SSM layers, sparse MoE activation, looped attention for variable-depth reasoning, and parallel diffusion decoding.
Developed by Minya AI · Repo · Apache 2.0
⚠️ Training in progress. The Harold v0.8 pretraining run is being prepared; weights will be released upon completion. This card documents the architecture and design decisions.
Architecture
Harold v0.8 introduces a three-stage Looped Jamba architecture:
1. Mamba3 Prelude: Sequential Memory (8 layers, run once)
The Prelude processes the input sequence through 8 Mamba3 SSM layers with O(1) per-token memory. It produces an encoded representation e that captures long-range context; this encoding is injected into every loop iteration, keeping the original signal alive across arbitrary reasoning depth.
2. Looped Attention: Deep Reasoning (4 blocks × T iterations)
The core innovation: 4 attention blocks are shared and looped T times (default T=8, giving 32 effective attention layers from only 4 sets of weights). Each iteration includes:
- MoDA (Mixture-of-Depths Attention): each head attends to KV from both the current iteration and previous iterations, recovering features that would otherwise be diluted
- mHC (Manifold-Constrained Hyper-Connections, from DeepSeek V4): the residual stream is widened into 4 parallel streams, mixed via a Sinkhorn-normalized doubly-stochastic matrix that bounds signal amplification to ~1.6×
- LTI-stable injection: h_{t+1} = A·h_t + B·e + block_out, with spectral radius ρ(A) < 1 guaranteed by construction (see the sketch after this list)
- ACT halting: adaptive computation time per position; easy tokens exit early
- LoRA depth adapter: per-loop specialization without breaking weight sharing
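A minimal, illustrative PyTorch sketch of two of these mechanisms follows; it is not Harold's implementation, and the generic `block` module and linear `halt_head` are assumptions. The `sinkhorn` helper shows the doubly-stochastic projection mHC uses for stream mixing, and `looped_reasoning` shows the LTI-stable injection and ACT halting; MoDA, the multi-stream residual wiring, and the LoRA depth adapter are omitted.

```python
import torch
import torch.nn as nn

def sinkhorn(logits: torch.Tensor, n_iters: int = 20) -> torch.Tensor:
    """Sinkhorn-Knopp: project a square mixing matrix onto (approximately)
    doubly-stochastic matrices by alternating row/column normalization."""
    M = logits.exp()
    for _ in range(n_iters):
        M = M / M.sum(dim=-1, keepdim=True)  # rows sum to 1
        M = M / M.sum(dim=-2, keepdim=True)  # columns sum to 1
    return M

def looped_reasoning(block: nn.Module, halt_head: nn.Module, e: torch.Tensor,
                     n_loops: int = 8, act_threshold: float = 0.99) -> torch.Tensor:
    """Run a single shared `block` for `n_loops` iterations with LTI-stable
    injection of the Prelude encoding `e` and ACT-style per-position early exit."""
    d = e.size(-1)
    # Diagonal A with entries in (-1, 1) gives spectral radius rho(A) < 1 by construction.
    A = torch.full((d,), 0.9)
    B = torch.full((d,), 0.1)
    h = torch.zeros_like(e)
    halted = torch.zeros(e.shape[:-1], dtype=torch.bool)    # (batch, seq)
    for _ in range(n_loops):
        block_out = block(h)                                 # same weights every iteration
        update = A * h + B * e + block_out                   # h_{t+1} = A*h_t + B*e + block_out
        h = torch.where(halted.unsqueeze(-1), h, update)     # halted positions are frozen
        p_halt = torch.sigmoid(halt_head(h)).squeeze(-1)     # ACT halting probability
        halted = halted | (p_halt > act_threshold)
        if halted.all():                                     # every position exited early
            break
    return h
```

For example, `looped_reasoning(nn.Linear(2048, 2048), nn.Linear(2048, 1), torch.randn(1, 16, 2048))` exercises the update with toy stand-in modules.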
3. Mamba3 Coda: Stabilization (4 layers, run once)
The Coda re-integrates the looped reasoning output into the sequential flow, smoothing and stabilizing before the output head.
Input tokens
  ↓ embed + timestep conditioning
[Mamba3 + MoE + AdaLN] × 8 (Prelude) → encoding `e`
  ↓
[MoDA-Attn + MoE + AdaLN] × 4 blocks × 8 loops
  ├ LTI injection of `e` at every loop
  ├ mHC multi-stream residual (Sinkhorn mixing)
  └ ACT halting + LoRA depth adapter
  ↓
[Mamba3 + MoE + AdaLN] × 4 (Coda)
  ↓
RMSNorm → Linear → logits
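Read as code, the diagram corresponds roughly to the skeleton below; every module name and call signature here is a placeholder, not Harold's actual API.

```python
import torch.nn as nn

class LoopedJambaSketch(nn.Module):
    """Schematic three-stage forward pass: Prelude (run once), shared attention
    blocks looped n_loops times, Coda (run once), then the output head."""
    def __init__(self, prelude, loop_blocks, coda, head, n_loops=8):
        super().__init__()
        self.prelude, self.loop_blocks = prelude, loop_blocks
        self.coda, self.head = coda, head
        self.n_loops = n_loops

    def forward(self, x, t, n_loops=None):
        n_loops = n_loops or self.n_loops
        e = self.prelude(x, t)                  # 8 Mamba3 layers, run once -> encoding e
        h = e
        for _ in range(n_loops):                # 4 shared blocks, looped n_loops times
            for block in self.loop_blocks:
                h = block(h, e, t)              # every iteration re-injects e
        h = self.coda(h, t)                     # 4 Mamba3 layers, run once
        return self.head(h)                     # RMSNorm -> Linear -> logits
```

The `n_loops` argument is what makes depth extrapolation at inference (see below) possible without touching the weights.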
4. Sparse Mixture of Experts
DeepSeek-style MoE with 1 shared + 16 routed SwiGLU experts, top-3 selection. Both shared and routed experts use SwiGLU (changed from GELU in v0.7 for routed experts, aligning with DeepSeek V3/V4, LLaMA, Mistral). THOR-style deterministic hash routing (no learnable router) provides +5% throughput with identical convergence.
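The sketch below illustrates the routing idea under stated assumptions: the hash function, expert width `d_ff`, and the dense dispatch loop are placeholders, and Harold's actual THOR-style router and any load balancing are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Gated SwiGLU feed-forward: W2(SiLU(W1 x) * W3 x)."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)
        self.w3 = nn.Linear(d_model, d_ff, bias=False)
        self.w2 = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

class HashMoESketch(nn.Module):
    """Illustrative hash-routed MoE: 1 shared + 16 routed SwiGLU experts, with the
    top-3 routed experts chosen deterministically from the token id (no learnable
    router). Dense reference implementation, for clarity only."""
    def __init__(self, d_model=2048, d_ff=4096, n_experts=16, top_k=3):
        super().__init__()
        self.shared = SwiGLU(d_model, d_ff)
        self.experts = nn.ModuleList(SwiGLU(d_model, d_ff) for _ in range(n_experts))
        self.n_experts, self.top_k = n_experts, top_k

    def forward(self, x, token_ids):
        # x: (batch, seq, d_model), token_ids: (batch, seq)
        out = self.shared(x)
        for k in range(self.top_k):
            # Cheap integer hash per routing slot; slot collisions are ignored here.
            idx = (token_ids * 2654435761 + k * 97) % self.n_experts
            for e, expert in enumerate(self.experts):
                mask = (idx == e).unsqueeze(-1).to(x.dtype)
                out = out + mask * expert(x) / self.top_k
        return out
```

A real implementation would dispatch tokens to experts with gather/scatter rather than evaluating every expert densely as above.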
5. Continuous Flow Matching Diffusion
Harold does not predict the next token. Instead, it refines an entire sequence from Gaussian noise toward coherent text using x0-prediction Flow Matching with logit-normal timestep sampling. This enables:
- Parallel decoding: all tokens refined simultaneously
- Native infill: fill-in-the-middle without tricks
- Variable compute: more loop iterations for harder problems (depth extrapolation)
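As a sketch of the training objective, an x0-prediction flow-matching step with logit-normal timesteps could look like the following; the linear interpolation path, the unweighted MSE, and treating 0.5 as the standard deviation are assumptions not specified by the card.

```python
import torch

def x0_flow_matching_loss(model, x0: torch.Tensor) -> torch.Tensor:
    """Corrupt clean token embeddings x0 toward Gaussian noise along a linear
    path and train the model to predict x0 back (x0-prediction flow matching).
    Timesteps are logit-normal: t = sigmoid(z), z ~ N(0, 0.5)."""
    noise = torch.randn_like(x0)
    z = 0.5 * torch.randn(x0.size(0), device=x0.device)
    t = torch.sigmoid(z)                                  # logit-normal in (0, 1)
    t_ = t.view(-1, *([1] * (x0.dim() - 1)))              # broadcast over seq/channel dims
    x_t = (1.0 - t_) * noise + t_ * x0                    # noisy point on the path
    x0_pred = model(x_t, t)                               # the network predicts clean x0
    return torch.mean((x0_pred - x0) ** 2)
```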
Model Details
| Property | Value |
|---|---|
| Architecture | Looped Jamba (Mamba3 Prelude/Coda + Looped Attention) |
| Parameters | ~3.2B total / ~800M active |
| d_model | 2048 |
| Physical layers | 16 (8 Prelude + 4 Loop + 4 Coda) |
| Effective depth | 44 (8 + 4×8 + 4) |
| Attention | GQA 4:1 (32 heads, 8 KV) + MoDA depth KV |
| Mamba3 d_state | 128 |
| MoE | 1 shared + 16 routed SwiGLU (top-3, Hash routing) |
| Loop depth | 8 iterations (extendable at inference) |
| mHC streams | 4 (Sinkhorn-Knopp, 20 iterations) |
| ACT threshold | 0.99 |
| Max seq len | 4096 (YaRN RoPE scale=4.0, θ=500k) |
| Tokenizer | LLaMA-2 BPE (32,000 vocab) |
| Diffusion | x0-prediction CFM, logit-normal t ~ σ(N(0, 0.5)) |
| AdaLN dim | 1024 |
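For orientation, the table corresponds roughly to the configuration below; the field names are illustrative placeholders and need not match `core.config.ModelConfig`.

```python
from dataclasses import dataclass

@dataclass
class HaroldConfigSketch:
    """Illustrative configuration mirroring the Model Details table."""
    d_model: int = 2048
    n_prelude_layers: int = 8
    n_loop_blocks: int = 4
    n_coda_layers: int = 4
    n_loops: int = 8              # effective depth = 8 + 4*8 + 4 = 44
    n_heads: int = 32
    n_kv_heads: int = 8           # GQA 4:1
    mamba_d_state: int = 128
    n_shared_experts: int = 1
    n_routed_experts: int = 16
    moe_top_k: int = 3
    hc_streams: int = 4
    sinkhorn_iters: int = 20
    act_threshold: float = 0.99
    max_seq_len: int = 4096
    rope_theta: float = 500_000.0
    yarn_scale: float = 4.0
    vocab_size: int = 32_000
    adaln_dim: int = 1024
```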
Pretraining Dataset Mix
Education/code-oriented mix that prioritizes code and systems content:
| Dataset | Weight | Purpose |
|---|---|---|
| FineWeb-Edu | 20% | High-quality web text |
| SlimPajama | 15% | Thematic diversity |
| GitHub Code | 25% | General code (30+ languages) |
| The Stack (C/C++/Rust) | 10% | Systems & embedded languages |
| C4 | 10% | Web text |
| Wikipedia EN | 10% | Factual grounding |
| Open-Web-Math | 5% | Mathematical reasoning |
| arXiv | 3% | Technical writing |
| PG-19 | 2% | Long-form coherence |
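One simple way to realize such a mix during training is weighted source sampling, sketched below; Harold's actual data pipeline is not described in this card, so the dataset keys and sampling scheme are only illustrative.

```python
import random

# Mix weights from the table above.
DATASET_WEIGHTS = {
    "fineweb-edu": 0.20, "slimpajama": 0.15, "github-code": 0.25,
    "the-stack-systems": 0.10, "c4": 0.10, "wikipedia-en": 0.10,
    "open-web-math": 0.05, "arxiv": 0.03, "pg19": 0.02,
}

def sample_source(rng=random) -> str:
    """Pick the source dataset for the next training document according to the mix."""
    names, weights = zip(*DATASET_WEIGHTS.items())
    return rng.choices(list(names), weights=weights, k=1)[0]
```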
- Precision: bfloat16
- Optimizer: MuonAdamW
- Learning rate: 1e-4 → 1e-5 cosine, 2000 warmup steps
- Effective batch: 4 × 32 grad accum × 4096 tokens
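The learning-rate schedule can be approximated with a standard warmup-plus-cosine multiplier, sketched below; the total step count is a placeholder, and a plain AdamW stands in for MuonAdamW in the commented usage.

```python
import math

def lr_multiplier(step: int, warmup: int = 2000, total: int = 100_000,
                  lr_max: float = 1e-4, lr_min: float = 1e-5) -> float:
    """Linear warmup to lr_max, then cosine decay to lr_min, returned as a
    multiplier of lr_max for use with LambdaLR. `total` is a placeholder."""
    if step < warmup:
        return step / warmup
    progress = min(1.0, (step - warmup) / max(1, total - warmup))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return (lr_min + (lr_max - lr_min) * cosine) / lr_max

# Stand-in usage (MuonAdamW itself is not shown here):
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_multiplier)
```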
Key Innovations in v0.8
| Feature | Source | What it does |
|---|---|---|
| Looped Attention | Parcae | Variable-depth reasoning at fixed parameter cost |
| MoDA | HustVL | Cross-layer depth KV attention (~3.7% FLOP overhead) |
| mHC | DeepSeek V4 | Multi-stream residual with Sinkhorn mixing (~6.7% overhead) |
| Mamba3 as Prelude/Coda | Harold design | SSM provides always-on sequential memory; attention is looped |
| SwiGLU everywhere | DeepSeek V3/V4 | Both shared and routed experts use gated SwiGLU |
| LTI injection | Parcae | Spectral-radius-bounded recurrent update, Ο(A) < 1 |
| ACT halting | Graves 2016 | Adaptive compute per position; easy tokens exit early |
Usage
Weights are not yet released. The following assumes a v0.8 checkpoint is available.
import torch
from transformers import AutoTokenizer
from core.config import ModelConfig
from core.harold import Harold
from sampler import build_sampler
# Load the checkpoint and rebuild the model from its saved config
state = torch.load("Harold-v0.8-3B-Base.pt", map_location="cpu", weights_only=False)
model_cfg = state.get("model_cfg", ModelConfig())
model = Harold(model_cfg).cuda().bfloat16()
model.load_state_dict(state["model_state"])
model.eval()

# LLaMA-2 BPE tokenizer (32k vocab)
tokenizer = AutoTokenizer.from_pretrained("JHN-MACHINE/harold")
tokenizer.pad_token = tokenizer.eos_token

# Diffusion sampler: 32 refinement steps, classifier-free guidance scale 3.0
sampler = build_sampler(model, n_steps=32, freeze_threshold=0.9, cfg_scale=3.0)
tokens = sampler.generate(batch_size=1, seq_len=256)
print(tokenizer.decode(tokens[0].tolist(), skip_special_tokens=True))
Depth Extrapolation at Inference
# Default: 8 loops (effective depth 44)
logits = model(input_ids, t=t, n_loops=8)
# Harder problem: 16 loops (effective depth 76)
logits = model(input_ids, t=t, n_loops=16)
# Fast mode: 2 loops (effective depth 20)
logits = model(input_ids, t=t, n_loops=2)
Installation
git clone https://github.com/JHNMACHINE/harold
cd harold
pip install -r requirements.txt
# Mamba3 requires CUDA >= 13
pip install mamba-ssm causal-conv1d
Changelog
| Version | Params | Key changes |
|---|---|---|
| v0.4 | 733M | VP-SDE, GPT-2 tokenizer, pure Transformer |
| v0.5 | 1.25B | Flow Matching, LLaMA-2 tokenizer |
| v0.6 | 1.51B | Jamba (Mamba2), MoE, SFT |
| v0.7 | 3.2B | Mamba3, x0-prediction, HashMoE, FSDP |
| v0.8 | ~3.2B | Looped Jamba, MoDA, mHC (DeepSeek V4), SwiGLU experts, ACT, LTI injection |
Limitations
- Training in progress: generation quality metrics pending the pretraining run
- Diffusion latency: requires N forward passes; each pass runs T loop iterations
- mHC overhead: ~4× loop FLOPs with hc_mult=4 (tunable)
- No SFT yet: instruction following planned post-pretraining
- Mamba3: requires CUDA >= 13
Citation
@misc{Harold,
  title  = {Harold v0.8: Looped Jamba Diffusion Language Model with MoDA, mHC, and Sparse MoE},
  author = {Vecchione, Jonathan},
  year   = {2026},
  url    = {https://huggingface.co/JHN-MACHINE/harold}
}
License
Apache 2.0 (see LICENSE)
Built by Minya AI · minya.ai