# batteryphil/mamba-2.8b-latent – OO-SomaMind Phase 10

True O(1) Memory Test-Time Compute via Continuous-State Dark Loops.

A 2.8B-parameter Mamba SSM fine-tuned for multi-step latent reasoning. The model executes deductive logic in its continuous hidden state using `====` spacer tokens as internal clock cycles: no KV-cache growth, no visible chain-of-thought tokens.

Phase 10 adds: OO-domain specialization via LoRA adapters, geometric degeneration suppression via the Proprioception Gate, and named inference-mode routing.
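The "internal clock cycle" idea above can be sketched in a few lines: before decoding an answer, spacer tokens are appended to the prompt so the SSM's hidden state evolves for extra steps without emitting visible reasoning tokens. The helper name, spacer string, and cycle count below are illustrative assumptions based on the description, not the repo's exact API.

```python
# Hypothetical sketch: append N "====" spacer tokens as latent compute steps.
# The spacer string and default count are assumptions, not the repo's API.
def with_dark_loop(prompt: str, n_cycles: int = 8, spacer: str = "====") -> str:
    """Give the model n_cycles extra state-update steps before it answers."""
    return prompt + " " + " ".join([spacer] * n_cycles)

print(with_dark_loop("[OO] What is the limbion engine responsible for?", n_cycles=4))
```

Because Mamba's state is constant-size, each spacer step costs O(1) memory, unlike appending thought tokens to a Transformer's KV cache.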
## Benchmark Results (Phase 10, 2026-04-13, RTX 3060 12GB)
| Metric | Result | Notes |
|---|---|---|
| VRAM footprint | 5.24 GB / 11.6 GB (45%) | Backbone + LoRA 522K + Gate ~8K |
| Throughput (seq=512) | 4,167 tok/s | O(1) MambaCache stateful loop |
| Throughput (avg) | 2,922 tok/s | Across seq 64/128/256/512 |
| Perplexity | 21.91 | GPT-2 124M baseline ~29 ✓ |
| Halt head separation | +0.613 | HIGH=0.642, LOW=0.029 |
| Gate degen/healthy ratio | 44.20x | Target was ≥ 1.5x ✓ |
| OO domain recall | 45% avg | 6/12 probes ≥ 50% keyword recall |
| Load time | 1.8s | safetensors, bf16 |
Repetition note: the default greedy mode averages ~31% bigram repetition (a Mamba SSM characteristic). Apply `repetition_penalty ≥ 1.2` or use the `oo_domain` inference mode (T=0.4, rep=1.4, 4-gram stop), which achieved 0% repetition across 12 OO ontology probes.
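The 4-gram stop mentioned above can be sketched as a simple check on the decoded token ids: stop generating once the most recent 4-gram has already appeared earlier in the output. The function name and exact stopping policy are illustrative assumptions, not the repo's implementation.

```python
# Minimal sketch of a 4-gram repetition stop (assumed policy, not the repo's).
def fourgram_repeated(token_ids: list[int], n: int = 4) -> bool:
    """True if the trailing n-gram of token ids already occurred earlier."""
    if len(token_ids) < 2 * n:
        return False
    tail = tuple(token_ids[-n:])
    # All n-grams strictly before the trailing one.
    earlier = {tuple(token_ids[i:i + n]) for i in range(len(token_ids) - n)}
    return tail in earlier
```

In a decoding loop this would be checked after each emitted token, breaking out of generation when it returns True.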
## Files in This Repository
| File | Description |
|---|---|
| `model.safetensors` | 2.768B Mamba backbone weights (bfloat16) |
| `halting_head_v2.pt` | HaltingHead v2 – OO-semantic halt classifier (sep=+0.613) |
| `halting_head.pt` | HaltingHead v1 – original ACT probe (MAE=0.052, 88.6% accuracy) |
| `proprio_gate_2.8b.pt` | Geometric Proprioception Gate – calibrated to 44.20x degen ratio |
| `lora_oo_r16_final.pt` | Post-backbone LoRA adapter – OO domain (rank 16, α=32, 6 layers) |
| `proprioception_gate.py` | Gate architecture class (`GeometricProprioceptionGate`) |
| `lora_mamba.py` | LoRA adapter class (`PostBackboneLoRA`) |
| `benchmark_pre_hf.py` | 8-section benchmark suite – run to reproduce results |
| `config.json` | Mamba model config (d_model=2560, n_layer=64, vocab_size=50280) |
| `tokenizer.json` | EleutherAI gpt-neox-20b tokenizer |
| `engine_manifest.json` | Training lineage and phase metadata |
## Inference – Quick Start
```python
import torch
from mamba_ssm.models.config_mamba import MambaConfig
from mamba_ssm import MambaLMHeadModel
from safetensors.torch import load_file
from transformers import AutoTokenizer

# ── Load all components ──────────────────────────────────────────
MODEL_DIR = "batteryphil/mamba-2.8b-latent"
DEVICE = "cuda"

tok = AutoTokenizer.from_pretrained(MODEL_DIR)
tok.pad_token = tok.eos_token

cfg = MambaConfig(d_model=2560, n_layer=64, vocab_size=50280,
                  pad_vocab_size_multiple=8)
model = MambaLMHeadModel(cfg, dtype=torch.bfloat16, device=DEVICE)
sd = load_file(f"{MODEL_DIR}/model.safetensors")
if "lm_head.weight" not in sd:
    sd["lm_head.weight"] = sd["backbone.embedding.weight"]  # tied embeddings
model.load_state_dict(sd, strict=False)
model.eval()

# ── Load Proprioception Gate ─────────────────────────────────────
# (download proprioception_gate.py from this repo)
from proprioception_gate import GeometricProprioceptionGate

gate = GeometricProprioceptionGate(d_model=2560, window_size=8)
gate.load_state_dict(torch.load(f"{MODEL_DIR}/proprio_gate_2.8b.pt",
                                map_location=DEVICE))
gate = gate.to(DEVICE, dtype=torch.bfloat16).eval()

# ── Load OO LoRA Adapter ─────────────────────────────────────────
# (download lora_mamba.py from this repo)
from lora_mamba import PostBackboneLoRA, load_post_lora

adapter = PostBackboneLoRA(d_model=2560, rank=16, alpha=32.0, n_layers=6)
adapter = adapter.to(DEVICE)
load_post_lora(adapter, f"{MODEL_DIR}/lora_oo_r16_final.pt", device=DEVICE)
adapter.eval()

# ── Generate (full stack) ────────────────────────────────────────
def generate(prompt, max_tok=80, mode="oo_domain"):
    """Generate using the backbone -> adapter -> gate -> lm_head pipeline."""
    ids = tok(prompt, return_tensors="pt", truncation=True,
              max_length=128).input_ids.to(DEVICE)
    out = []
    eos = tok.eos_token_id
    with torch.no_grad():
        cur = ids
        for _ in range(max_tok):
            h = model.backbone(cur)
            h = adapter(h)
            h = gate(h)
            logits = model.lm_head(h)[0, -1, :].float()
            if mode == "oo_domain":
                # Repetition penalty (1.4) on already-emitted tokens
                for tid in set(out):
                    logits[tid] = logits[tid] / 1.4 if logits[tid] > 0 \
                        else logits[tid] * 1.4
                # Temperature 0.4 + top-p 0.9 nucleus sampling
                logits = logits / 0.4
                probs = torch.softmax(logits, dim=-1)
                sp, si = torch.sort(probs, descending=True)
                cut = (torch.cumsum(sp, 0) - sp) > 0.9
                sp[cut] = 0.0
                nxt = si[torch.multinomial(sp / sp.sum(), 1)].item()
            else:
                nxt = logits.argmax().item()
            if nxt == eos:
                break
            out.append(nxt)
            cur = torch.cat([cur, torch.tensor([[nxt]], device=DEVICE)], 1)
    return tok.decode(out, skip_special_tokens=True)

# Example queries
print(generate("[OO] What is the limbion engine responsible for?"))
print(generate("[SELF] What are you?"))
print(generate("[OO] What does a GATE_COHERENCE_VIOLATION event mean?"))
```
## Named Inference Modes

The engine supports four named sampling presets, auto-detected from the prompt prefix:

| Mode | Temperature | Top-p | Rep Penalty | Trigger |
|---|---|---|---|---|
| `default` | 1.0 (greedy) | off | 1.1 | (fallback) |
| `oo_domain` | 0.4 | 0.90 | 1.4 | `[OO]`, `[SELF]`, `[SWARM]`, `[WARDEN]`, `/oo_*`, `/fork` |
| `code` | 0.3 | 0.95 | 1.05 | `def `, `import `, `class ` |
| `identity` | 0.05 (greedy) | off | 1.0 | "who are you", "your architecture" |

Use `stateful_engine.py` for the full O(1) MambaCache loop engine with auto-routing.
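The prefix-based auto-routing in the table above can be sketched as a small dispatch function. The preset values are copied from the table; the routing function itself, and treating `/oo_*` as the literal prefix `/oo_`, are assumptions — `stateful_engine.py`'s real logic may differ.

```python
# Hypothetical sketch of prompt-prefix routing to the four named presets.
# Preset numbers come from the table above; the routing rules are assumed.
PRESETS = {
    "oo_domain": {"temperature": 0.4,  "top_p": 0.90, "rep_penalty": 1.4},
    "code":      {"temperature": 0.3,  "top_p": 0.95, "rep_penalty": 1.05},
    "identity":  {"temperature": 0.05, "top_p": None, "rep_penalty": 1.0},
    "default":   {"temperature": 1.0,  "top_p": None, "rep_penalty": 1.1},
}

def route(prompt: str) -> str:
    """Pick a sampling preset name from the prompt's leading tokens."""
    p = prompt.strip()
    if p.startswith(("[OO]", "[SELF]", "[SWARM]", "[WARDEN]", "/oo_", "/fork")):
        return "oo_domain"
    if p.startswith(("def ", "import ", "class ")):
        return "code"
    low = p.lower()
    if "who are you" in low or "your architecture" in low:
        return "identity"
    return "default"
```

A caller would then look up `PRESETS[route(prompt)]` and pass those values to its sampling loop.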
## Architecture

- Backbone: Mamba-2.8B (`state-spaces/mamba-2.8b-slimpj` base), d_model=2560, n_layer=64
- HaltingHead v2: 4-layer MLP (2560 → 512 → 64 → 1) – position-conditioned P(halt) probe
- LoRA Adapter: post-backbone residual adapter, rank=16, alpha=32, 6 layers
- Proprioception Gate: 7KB learned gate – a stagnation signal (1 - variance) maps degenerate loops to a HIGH correction signal; achieves a 44.20x degenerate/healthy correction ratio
- Tokenizer: EleutherAI gpt-neox-20b (50,280 vocab), `====` spacer (`=` × n)
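The gate's "stagnation signal (1 - variance)" can be illustrated on a window of hidden states: when recent states barely change, variance collapses and the signal goes HIGH. This is a conceptual reconstruction of the idea, with assumed names and shapes — not the checkpoint's actual computation.

```python
# Conceptual sketch of the stagnation signal (assumed, not the real gate).
import torch

def stagnation_signal(hidden_window: torch.Tensor) -> torch.Tensor:
    """hidden_window: (window, d_model). Returns a scalar correction signal:
    near 1.0 for frozen (degenerate) loops, near 0.0 for healthy ones."""
    var = hidden_window.var(dim=0).mean()   # per-dimension variance, averaged
    return (1.0 - var).clamp(min=0.0)       # high when states stop moving

healthy = torch.randn(8, 2560)              # varied states -> low signal
degenerate = torch.ones(8, 2560) * 0.5      # frozen states -> signal of 1.0
print(stagnation_signal(healthy).item(), stagnation_signal(degenerate).item())
```

The real gate is learned (~7KB of weights), so it presumably shapes and scales this raw signal rather than using it directly.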
### Why Post-Backbone LoRA?

The Mamba backbone uses fused Triton kernels (`mamba_ssm.ops.triton`) for the selective scan operation. These are non-differentiable through standard autograd. The PostBackboneLoRA adapter bypasses this by inserting its residual connection after the backbone returns hidden states, making the entire adapter trainable with zero kernel modifications.
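The post-backbone residual idea can be sketched as stacked rank-r updates applied to the backbone's output, never touching a kernel. The class below is an illustrative reconstruction mirroring the stated hyperparameters (d_model=2560, rank=16, alpha=32, 6 layers); the repo's `PostBackboneLoRA` may be structured differently.

```python
# Illustrative sketch of a post-backbone residual LoRA stack (assumed design).
import torch
import torch.nn as nn

class PostBackboneLoRASketch(nn.Module):
    def __init__(self, d_model=2560, rank=16, alpha=32.0, n_layers=6):
        super().__init__()
        self.scale = alpha / rank
        self.down = nn.ModuleList(nn.Linear(d_model, rank, bias=False)
                                  for _ in range(n_layers))
        self.up = nn.ModuleList(nn.Linear(rank, d_model, bias=False)
                                for _ in range(n_layers))
        for u in self.up:               # standard LoRA init: up-proj = 0,
            nn.init.zeros_(u.weight)    # so the adapter starts as identity

    def forward(self, h):               # h: (batch, seq, d_model)
        for down, up in zip(self.down, self.up):
            h = h + self.scale * up(down(h))   # residual low-rank update
        return h
```

Because the updates live entirely after the backbone's forward pass, autograd only has to differentiate ordinary `nn.Linear` layers, which is why no kernel modification is needed.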
## References & Credits

Full credits: `batteryphil/mamba2backbonerecursion` README

Papers that shaped this work:

- Gu & Dao (2023) – Mamba: Linear-Time Sequence Modeling. arXiv:2312.00752
- Hao et al., Meta (2024) – COCONUT: Continuous Latent Reasoning. arXiv:2412.06769
- Goyal et al., Google (2023) – Pause Tokens. arXiv:2310.02226
- Graves (2016) – Adaptive Computation Time for RNNs. arXiv:1603.08983
- Hu et al., Microsoft (2021) – LoRA. arXiv:2106.09685
- Zelikman et al., Stanford (2024) – Quiet-STaR. arXiv:2403.09629

Engineering insights:

- ItsMick/mamba2backbonerecursion – original recursive backbone loop scaffold
- SGLang #2232 – MambaCache `cache_params` API
- Apple MLX #980 – SSM state shape guard (no cache trimming)
Built by Phil / Antigravity Agentic Systems. April 2026. Hardware: NVIDIA RTX 3060 12GB. Zero cloud compute. Code: batteryphil/mamba2backbonerecursion