Harold v0.8

Harold is an experimental continuous diffusion language model built for fast deployment. Unlike standard small LLMs, which are dense autoregressive Transformers shrunk down, Harold is designed from first principles for constrained hardware: subquadratic SSM layers, sparse MoE activation, looped attention for variable-depth reasoning, and parallel diffusion decoding.

Developed by Minya AI · Repo · Apache 2.0

⚠️ Training in progress. Harold v0.8 has not yet completed its pretraining run; weights will be released upon completion. This card documents the architecture and design decisions.


Architecture

Harold v0.8 introduces a three-stage Looped Jamba architecture:

1. Mamba3 Prelude: Sequential Memory (8 layers, run once)

The Prelude processes the input sequence through 8 Mamba3 SSM layers with O(1) per-token memory. It produces an encoded representation e that captures long-range context; this encoding is injected into every loop iteration (see the loop sketch below), keeping the original signal alive at arbitrary reasoning depth.

2. Looped Attention: Deep Reasoning (4 blocks × T iterations)

The core innovation: 4 attention blocks are shared and looped T times (default T=8, giving 32 effective attention layers from only 4 sets of weights). Each iteration includes:

  • MoDA (Mixture-of-Depths Attention): each head attends to KV from both the current iteration and previous iterations, recovering features that would otherwise dilute
  • mHC (Manifold-Constrained Hyper-Connections, from DeepSeek V4): the residual stream is widened into 4 parallel streams, mixed via a Sinkhorn-normalized doubly-stochastic matrix that bounds signal amplification to ~1.6×
  • LTI-stable injection: h_{t+1} = A·h_t + B·e + block_out, with spectral radius ρ(A) < 1 guaranteed by construction (sketched after this list)
  • ACT halting: adaptive computation time per position; easy tokens exit early
  • LoRA depth adapter: per-loop specialization without breaking weight sharing
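
A minimal sketch of the recurrence these pieces form. All module and parameter names here are hypothetical: the single block stands in for the 4 shared MoDA-attention/MoE blocks, and the spectral-radius and ACT parameterizations shown are illustrative, not Harold's actual ones.

import torch
import torch.nn as nn

class LoopedCore(nn.Module):
    """Sketch of the looped update h_{t+1} = A·h_t + B·e + block(h_t)."""
    def __init__(self, d_model: int, act_threshold: float = 0.99):
        super().__init__()
        # Stand-in for the 4 shared MoDA-attention + MoE blocks.
        self.block = nn.Sequential(
            nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, d_model))
        self.A_raw = nn.Parameter(torch.randn(d_model, d_model) * 0.02)
        self.B = nn.Linear(d_model, d_model, bias=False)
        self.halt = nn.Linear(d_model, 1)   # ACT halting head
        self.act_threshold = act_threshold

    def stable_A(self) -> torch.Tensor:
        # One way to guarantee rho(A) < 1 by construction: divide by
        # the spectral norm and scale just below 1.
        return 0.99 * self.A_raw / (torch.linalg.matrix_norm(self.A_raw, ord=2) + 1e-6)

    def forward(self, h: torch.Tensor, e: torch.Tensor, n_loops: int = 8) -> torch.Tensor:
        A = self.stable_A()
        cum_halt = torch.zeros(h.shape[:-1], device=h.device)  # (batch, seq)
        for _ in range(n_loops):
            # LTI-stable injection: the Prelude encoding e re-enters at
            # every iteration, so it cannot wash out with depth.
            h_new = h @ A.T + self.B(e) + self.block(h)
            # Simplified ACT: positions whose accumulated halting
            # probability crossed the threshold stop updating.
            running = (cum_halt < self.act_threshold).unsqueeze(-1)
            h = torch.where(running, h_new, h)
            cum_halt = cum_halt + torch.sigmoid(self.halt(h_new)).squeeze(-1)
        return h

h = torch.randn(2, 16, 256)   # current hidden states (batch, seq, d_model)
e = torch.randn(2, 16, 256)   # Prelude encoding
print(LoopedCore(256)(h, e, n_loops=8).shape)   # torch.Size([2, 16, 256])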

3. Mamba3 Coda: Stabilization (4 layers, run once)

The Coda re-integrates the looped reasoning output into the sequential flow, smoothing and stabilizing before the output head.

Input tokens
    ↓ embed + timestep conditioning
[Mamba3 + MoE + AdaLN] × 8  (Prelude) → encoding `e`
    ↓
[MoDA-Attn + MoE + AdaLN] × 4 blocks × 8 loops
    ↑ LTI injection of `e` at every loop
    ↑ mHC multi-stream residual (Sinkhorn mixing)
    ↑ ACT halting + LoRA depth adapter
    ↓
[Mamba3 + MoE + AdaLN] × 4  (Coda)
    ↓
RMSNorm → Linear → logits
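
The "Sinkhorn mixing" step in the diagram can be sketched as follows: a learned stream-mixing matrix is pushed toward doubly stochastic by alternating row and column normalization, which is what bounds how much any one residual stream can be amplified. Shapes and names below are illustrative; per the details table, Harold uses 4 streams and 20 Sinkhorn-Knopp iterations.

import torch

def sinkhorn(logits: torch.Tensor, n_iters: int = 20) -> torch.Tensor:
    """Push a positive matrix toward doubly stochastic (Sinkhorn-Knopp)."""
    M = logits.exp()
    for _ in range(n_iters):
        M = M / M.sum(dim=-1, keepdim=True)   # normalize rows
        M = M / M.sum(dim=-2, keepdim=True)   # normalize columns
    return M

streams = torch.randn(2, 16, 4, 256)      # (batch, seq, 4 residual streams, d)
mix = sinkhorn(torch.randn(4, 4))         # ~doubly-stochastic 4x4 mixer
mixed = torch.einsum("ij,bsjd->bsid", mix, streams)
# Rows and columns of `mix` each sum to ~1, so recombining streams
# cannot amplify any single stream's signal unboundedly.
print(mix.sum(dim=0), mix.sum(dim=1))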

4. Sparse Mixture of Experts

DeepSeek-style MoE with 1 shared + 16 routed SwiGLU experts and top-3 selection. Both shared and routed experts use SwiGLU (routed experts used GELU in v0.7; the change aligns with DeepSeek V3/V4, LLaMA, and Mistral). THOR-style deterministic hash routing (no learnable router) provides +5% throughput with identical convergence.
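
A sketch of the routing idea. The hash function, sizes, and the dense dispatch below are illustrative assumptions, not Harold's implementation: each token id maps deterministically to its top-3 routed experts, so there is no router network to train or balance.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Gated SwiGLU expert: (SiLU(x W_g) * x W_u) W_d."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

class HashMoE(nn.Module):
    """1 shared + n routed SwiGLU experts, deterministic hash routing."""
    def __init__(self, d_model=256, d_ff=512, n_routed=16, top_k=3):
        super().__init__()
        self.shared = SwiGLU(d_model, d_ff)
        self.routed = nn.ModuleList(SwiGLU(d_model, d_ff) for _ in range(n_routed))
        self.n_routed, self.top_k = n_routed, top_k

    def route(self, token_ids: torch.Tensor) -> torch.Tensor:
        # THOR-style deterministic routing: k experts per token id,
        # derived from a cheap hash (this particular hash is illustrative).
        return torch.stack(
            [(token_ids * 2654435761 + k) % self.n_routed for k in range(self.top_k)],
            dim=-1)

    def forward(self, x: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        out = self.shared(x)                 # shared expert: always active
        experts = self.route(token_ids)      # (batch, seq, top_k)
        # Dense dispatch for clarity; a real MoE gathers tokens per expert.
        for k in range(self.top_k):
            idx = experts[..., k]
            for e in range(self.n_routed):
                mask = (idx == e).unsqueeze(-1)
                if mask.any():
                    out = out + mask * self.routed[e](x) / self.top_k
        return out

moe = HashMoE()
x = torch.randn(2, 8, 256)
ids = torch.randint(0, 32000, (2, 8))
print(moe(x, ids).shape)   # torch.Size([2, 8, 256])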

5. Continuous Flow Matching Diffusion

Harold does not predict the next token. Instead, it refines an entire sequence from Gaussian noise toward coherent text using x0-prediction Flow Matching with logit-normal timestep sampling; a training-step sketch follows the list below. This enables:

  • Parallel decoding: all tokens refined simultaneously
  • Native infill: fill-in-the-middle without tricks
  • Variable compute: more loop iterations for harder problems (depth extrapolation)
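
A minimal sketch of one training step. The interpolation path, the loss weighting, and reading σ(N(0, 0.5)) as a sigmoid over a std-0.5 normal are assumptions; only x0-prediction Flow Matching and logit-normal timestep sampling come from this card.

import torch

def logit_normal_t(batch_size: int, std: float = 0.5) -> torch.Tensor:
    # t ~ sigmoid(N(0, std)): timesteps concentrate mid-trajectory,
    # where the denoising signal is most informative.
    return torch.sigmoid(torch.randn(batch_size) * std)

def flow_matching_step(model, x0: torch.Tensor) -> torch.Tensor:
    """One x0-prediction CFM step on clean token embeddings x0."""
    t = logit_normal_t(x0.shape[0]).view(-1, 1, 1)
    noise = torch.randn_like(x0)
    x_t = (1 - t) * noise + t * x0    # linear path: t=0 noise, t=1 data
    x0_pred = model(x_t, t)           # x0-prediction: regress the clean data
    return ((x0_pred - x0) ** 2).mean()

# Toy check with an identity "model" (illustration only).
print(flow_matching_step(lambda x, t: x, torch.randn(4, 16, 256)).item())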

Model Details

| Property | Value |
|---|---|
| Architecture | Looped Jamba (Mamba3 Prelude/Coda + Looped Attention) |
| Parameters | ~3.2B total / ~800M active |
| d_model | 2048 |
| Physical layers | 16 (8 Prelude + 4 Loop + 4 Coda) |
| Effective depth | 44 (8 + 4×8 + 4) |
| Attention | GQA 4:1 (32 heads, 8 KV) + MoDA depth KV |
| Mamba3 d_state | 128 |
| MoE | 1 shared + 16 routed SwiGLU (top-3, hash routing) |
| Loop depth | 8 iterations (extendable at inference) |
| mHC streams | 4 (Sinkhorn-Knopp, 20 iterations) |
| ACT threshold | 0.99 |
| Max seq len | 4096 (YaRN RoPE, scale=4.0, θ=500k) |
| Tokenizer | LLaMA-2 BPE (32,000 vocab) |
| Diffusion | x0-prediction CFM, logit-normal t ~ σ(N(0, 0.5)) |
| AdaLN dim | 1024 |

Pretraining Dataset Mix

Education/code-oriented mix that prioritizes code and systems content:

| Dataset | Weight | Purpose |
|---|---|---|
| FineWeb-Edu | 20% | High-quality web text |
| SlimPajama | 15% | Thematic diversity |
| GitHub Code | 25% | General code (30+ languages) |
| The Stack (C/C++/Rust) | 10% | Systems & embedded languages |
| C4 | 10% | Web text |
| Wikipedia EN | 10% | Factual grounding |
| Open-Web-Math | 5% | Mathematical reasoning |
| arXiv | 3% | Technical writing |
| PG-19 | 2% | Long-form coherence |

Precision: bfloat16
Optimizer: MuonAdamW
Learning rate: 1e-4 → 1e-5 (cosine decay, 2000 warmup steps)
Effective batch: 4 × 32 grad accum × 4096 tokens (524,288 tokens per optimizer step)
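
For concreteness, this schedule corresponds to something like the sketch below; the linear warmup shape and total_steps value are assumptions, with only the endpoints and warmup length taken from this card.

import math

def lr_at(step: int, total_steps: int, peak: float = 1e-4,
          floor: float = 1e-5, warmup: int = 2000) -> float:
    """Linear warmup to peak, then cosine decay to floor."""
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))

print(lr_at(1_000, 100_000))    # mid-warmup: 5e-05
print(lr_at(2_000, 100_000))    # peak: 1e-04
print(lr_at(100_000, 100_000))  # end of decay: 1e-05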


Key Innovations in v0.8

| Feature | Source | What it does |
|---|---|---|
| Looped Attention | Parcae | Variable-depth reasoning at fixed parameter cost |
| MoDA | HustVL | Cross-layer depth KV attention (~3.7% FLOP overhead) |
| mHC | DeepSeek V4 | Multi-stream residual with Sinkhorn mixing (~6.7% overhead) |
| Mamba3 as Prelude/Coda | Harold design | SSM provides always-on sequential memory; attention is looped |
| SwiGLU everywhere | DeepSeek V3/V4 | Both shared and routed experts use gated SwiGLU |
| LTI injection | Parcae | Spectral-radius-bounded recurrent update, ρ(A) < 1 |
| ACT halting | Graves 2016 | Adaptive compute per position; easy tokens exit early |

Usage

Weights are not yet released; the following assumes a v0.8 checkpoint is available.

import torch
from transformers import AutoTokenizer
from core.config import ModelConfig
from core.harold import Harold
from sampler import build_sampler

# Load the checkpoint; the model config ships inside the state dict.
state     = torch.load("Harold-v0.8-3B-Base.pt", map_location="cpu", weights_only=False)
model_cfg = state.get("model_cfg", ModelConfig())
model     = Harold(model_cfg).cuda().bfloat16()
model.load_state_dict(state["model_state"])
model.eval()

tokenizer = AutoTokenizer.from_pretrained("JHN-MACHINE/harold")
tokenizer.pad_token = tokenizer.eos_token

# Diffusion sampling: 32 refinement steps, classifier-free guidance scale 3.0.
sampler = build_sampler(model, n_steps=32, freeze_threshold=0.9, cfg_scale=3.0)
tokens  = sampler.generate(batch_size=1, seq_len=256)
print(tokenizer.decode(tokens[0].tolist(), skip_special_tokens=True))

Depth Extrapolation at Inference

# Default: 8 loops (effective depth 44)
logits = model(input_ids, t=t, n_loops=8)

# Harder problem: 16 loops (effective depth 76)
logits = model(input_ids, t=t, n_loops=16)

# Fast mode: 2 loops (effective depth 20)
logits = model(input_ids, t=t, n_loops=2)
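
These effective-depth figures follow directly from the layer counts (8 Prelude layers + 4 shared blocks × n_loops + 4 Coda layers):

def effective_depth(n_loops: int, prelude: int = 8,
                    loop_blocks: int = 4, coda: int = 4) -> int:
    """Layer applications in one forward pass at a given loop count."""
    return prelude + loop_blocks * n_loops + coda

for n in (2, 8, 16):
    print(n, effective_depth(n))   # 2 -> 20, 8 -> 44, 16 -> 76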

Installation

git clone https://github.com/JHNMACHINE/harold
cd harold
pip install -r requirements.txt

# Mamba3 requires CUDA >= 13
pip install mamba-ssm causal-conv1d

Changelog

| Version | Params | Key changes |
|---|---|---|
| v0.4 | 733M | VP-SDE, GPT-2 tokenizer, pure Transformer |
| v0.5 | 1.25B | Flow Matching, LLaMA-2 tokenizer |
| v0.6 | 1.51B | Jamba (Mamba2), MoE, SFT |
| v0.7 | 3.2B | Mamba3, x0-prediction, HashMoE, FSDP |
| v0.8 | ~3.2B | Looped Jamba, MoDA, mHC (DeepSeek V4), SwiGLU experts, ACT, LTI injection |

Limitations

  • Training in progress: generation quality metrics pending the pretraining run
  • Diffusion latency: generation requires N forward passes, each running T loop iterations (the default n_steps=32 with T=8 runs the 4-block loop 32 × 8 = 256 times per sample)
  • mHC overhead: ~4× loop FLOPs with hc_mult=4 (tunable)
  • No SFT yet: instruction following planned post-pretraining
  • Mamba3: requires CUDA >= 13

Citation

@article{Harold,
  title   = {Harold v0.8: Looped Jamba Diffusion Language Model with MoDA, mHC, and Sparse MoE},
  author  = {Vecchione, Jonathan},
  year    = {2026},
  url     = {https://huggingface.co/JHN-MACHINE/harold}
}

License

Apache 2.0 (see LICENSE)

Built by minya.ai
