batteryphil/mamba-2.8b-latent — OO-SomaMind Phase 10

True O(1) Memory Test-Time Compute via Continuous-State Dark Loops.

A 2.8B-parameter Mamba SSM fine-tuned for multi-step latent reasoning. The model executes deductive logic in its continuous hidden state using `====` spacer tokens as internal clock cycles — no KV-cache growth, no visible chain-of-thought tokens.
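
The spacer mechanism can be illustrated with a minimal sketch (the step count of 8 and the helper name are illustrative, not taken from the training recipe):

```python
def add_dark_loop(prompt: str, n_steps: int = 8) -> str:
    """Append spacer characters that give the SSM extra recurrent state
    updates ("clock cycles") before it must emit an answer. Each '='
    is a content-free step: the continuous hidden state keeps evolving
    while no visible reasoning text is produced."""
    return prompt + " " + "=" * n_steps

print(add_dark_loop("[OO] What is the limbion engine responsible for?"))
```

Because the state is fixed-size, each spacer step costs O(1) memory, unlike appending reasoning tokens to a Transformer's KV cache.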

Phase 10 adds: OO-domain specialization via LoRA adapters, geometric degeneration suppression via the Proprioception Gate, and named inference mode routing.


πŸ† Benchmark Results (Phase 10, 2026-04-13, RTX 3060 12GB)

Metric Result Notes
VRAM footprint 5.24 GB / 11.6 GB (45%) Backbone + LoRA 522K + Gate ~8K
Throughput (seq=512) 4,167 tok/s O(1) MambaCache stateful loop
Throughput (avg) 2,922 tok/s Across seq 64/128/256/512
Perplexity 21.91 GPT-2 124M baseline ~29 βœ…
Halt head separation +0.613 HIGH=0.642, LOW=0.029
Gate degen/healthy ratio 44.20x Target was β‰₯ 1.5x βœ…
OO domain recall 45% avg 6/12 probes β‰₯ 50% keyword recall
Load time 1.8s safetensors, bf16

Repetition note: default greedy mode averages ~31% bigram repetition (a Mamba SSM characteristic). Apply `repetition_penalty` ≥ 1.2, or use the `oo_domain` inference mode (T=0.4, rep=1.4, 4-gram stop), which achieved 0% repetition across 12 OO ontology probes.
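
The repetition metric can be approximated with a small helper (a sketch; the 31% figure comes from `benchmark_pre_hf.py`, and this function is illustrative rather than the suite's exact implementation):

```python
def bigram_repetition(token_ids) -> float:
    """Fraction of bigrams that already appeared earlier in the sequence.
    Greedy Mamba decoding here averages roughly 0.31 on this kind of metric;
    degenerate loops push it toward 1.0."""
    if len(token_ids) < 2:
        return 0.0
    seen, repeats = set(), 0
    for a, b in zip(token_ids, token_ids[1:]):
        if (a, b) in seen:
            repeats += 1
        seen.add((a, b))
    return repeats / (len(token_ids) - 1)

print(bigram_repetition([1, 2, 1, 2, 1, 2]))  # a tight loop scores high
```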


📦 Files in This Repository

| File | Description |
|------|-------------|
| `model.safetensors` | 2.768B Mamba backbone weights (bfloat16) |
| `halting_head_v2.pt` | HaltingHead v2 — OO-semantic halt classifier (sep=+0.613) |
| `halting_head.pt` | HaltingHead v1 — original ACT probe (MAE=0.052, 88.6% accuracy) |
| `proprio_gate_2.8b.pt` | Geometric Proprioception Gate — calibrated to a 44.20x degen ratio |
| `lora_oo_r16_final.pt` | Post-backbone LoRA adapter — OO domain (rank 16, α=32, 6 layers) |
| `proprioception_gate.py` | Gate architecture class (`GeometricProprioceptionGate`) |
| `lora_mamba.py` | LoRA adapter class (`PostBackboneLoRA`) |
| `benchmark_pre_hf.py` | 8-section benchmark suite — run to reproduce results |
| `config.json` | Mamba model config (d_model=2560, n_layer=64, vocab_size=50280) |
| `tokenizer.json` | EleutherAI gpt-neox-20b tokenizer |
| `engine_manifest.json` | Training lineage and phase metadata |

🚀 Inference — Quick Start

```python
import torch
from mamba_ssm.models.config_mamba import MambaConfig
from mamba_ssm import MambaLMHeadModel
from safetensors.torch import load_file
from transformers import AutoTokenizer

# ── Load all components ──────────────────────────────────────────
MODEL_DIR = "batteryphil/mamba-2.8b-latent"
DEVICE    = "cuda"

tok = AutoTokenizer.from_pretrained(MODEL_DIR)
tok.pad_token = tok.eos_token

cfg   = MambaConfig(d_model=2560, n_layer=64, vocab_size=50280,
                    pad_vocab_size_multiple=8)
model = MambaLMHeadModel(cfg, dtype=torch.bfloat16, device=DEVICE)
sd    = load_file(f"{MODEL_DIR}/model.safetensors")
if "lm_head.weight" not in sd:
    sd["lm_head.weight"] = sd["backbone.embedding.weight"]  # tied embeddings
model.load_state_dict(sd, strict=False)
model.eval()

# ── Load Proprioception Gate ──────────────────────────────────────
# (download proprioception_gate.py from this repo)
from proprioception_gate import GeometricProprioceptionGate
gate = GeometricProprioceptionGate(d_model=2560, window_size=8)
gate.load_state_dict(torch.load(f"{MODEL_DIR}/proprio_gate_2.8b.pt",
                                map_location=DEVICE))
gate = gate.to(DEVICE, dtype=torch.bfloat16).eval()

# ── Load OO LoRA Adapter ──────────────────────────────────────────
# (download lora_mamba.py from this repo)
from lora_mamba import PostBackboneLoRA, load_post_lora
adapter = PostBackboneLoRA(d_model=2560, rank=16, alpha=32.0, n_layers=6)
load_post_lora(adapter, f"{MODEL_DIR}/lora_oo_r16_final.pt", device=DEVICE)
adapter = adapter.to(DEVICE, dtype=torch.bfloat16).eval()  # match backbone dtype

# ── Generate (full stack) ─────────────────────────────────────────
def generate(prompt, max_tok=80, mode="oo_domain"):
    """Generate using the backbone → adapter → gate → lm_head pipeline."""
    ids = tok(prompt, return_tensors="pt", truncation=True,
              max_length=128).input_ids.to(DEVICE)
    out = []
    eos = tok.eos_token_id
    with torch.no_grad():
        cur = ids
        for _ in range(max_tok):
            h      = model.backbone(cur)
            h      = adapter(h)
            h      = gate(h)
            logits = model.lm_head(h)[0, -1, :].float()
            if mode == "oo_domain":
                # OO domain: repetition penalty over already-generated tokens
                for tid in set(out):
                    logits[tid] = logits[tid] / 1.4 if logits[tid] > 0 \
                                  else logits[tid] * 1.4
                logits = logits / 0.4                        # temperature T=0.4
                probs  = torch.softmax(logits, dim=-1)
                sp, si = torch.sort(probs, descending=True)
                cut    = (torch.cumsum(sp, 0) - sp) > 0.9    # top-p = 0.90
                sp[cut] = 0.0
                nxt = si[torch.multinomial(sp / sp.sum(), 1)].item()
            else:
                nxt = logits.argmax().item()                 # greedy
            if nxt == eos:
                break
            out.append(nxt)
            cur = torch.cat([cur, torch.tensor([[nxt]], device=DEVICE)], 1)
    return tok.decode(out, skip_special_tokens=True)

# Example queries
print(generate("[OO] What is the limbion engine responsible for?"))
print(generate("[SELF] What are you?"))
print(generate("[OO] What does a GATE_COHERENCE_VIOLATION event mean?"))
```

Named Inference Modes

The engine supports four named sampling presets, auto-detected from prompt prefix:

| Mode | Temperature | Top-p | Rep Penalty | Trigger |
|------|-------------|-------|-------------|---------|
| `default` | 1.0 (greedy) | off | 1.1 | (fallback) |
| `oo_domain` | 0.4 | 0.90 | 1.4 | `[OO]`, `[SELF]`, `[SWARM]`, `[WARDEN]`, `/oo_*`, `/fork` |
| `code` | 0.3 | 0.95 | 1.05 | `def `, `import `, `class ` |
| `identity` | 0.05 (greedy) | off | 1.0 | "who are you", "your architecture" |

Use `stateful_engine.py` for the full O(1) MambaCache loop engine with auto-routing.
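
Prefix-based routing can be approximated as follows (a sketch mirroring the table above; the actual router lives in `stateful_engine.py`, and the preset dict layout here is an assumption):

```python
# Sampling presets, transcribed from the named-inference-modes table.
PRESETS = {
    "oo_domain": {"temperature": 0.4,  "top_p": 0.90, "rep_penalty": 1.4},
    "code":      {"temperature": 0.3,  "top_p": 0.95, "rep_penalty": 1.05},
    "identity":  {"temperature": 0.05, "top_p": None, "rep_penalty": 1.0},
    "default":   {"temperature": 1.0,  "top_p": None, "rep_penalty": 1.1},
}

def route(prompt: str) -> str:
    """Pick a sampling preset from the prompt's leading trigger."""
    p = prompt.strip()
    low = p.lower()
    if any(p.startswith(t) for t in
           ("[OO]", "[SELF]", "[SWARM]", "[WARDEN]", "/oo_", "/fork")):
        return "oo_domain"
    if any(low.startswith(t) for t in ("def ", "import ", "class ")):
        return "code"
    if "who are you" in low or "your architecture" in low:
        return "identity"
    return "default"

print(route("[OO] What is the limbion engine?"))  # -> oo_domain
```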


🧠 Architecture

- Backbone: Mamba-2.8B (`state-spaces/mamba-2.8b-slimpj` base), d_model=2560, n_layer=64
- HaltingHead v2: 4-layer MLP (2560→512→64→1) — position-conditioned P(halt) probe
- LoRA Adapter: post-backbone residual adapter, rank=16, alpha=32, 6 layers
- Proprioception Gate: ~7 KB learned gate — a stagnation signal (1 − variance) maps degenerate loops to a HIGH correction signal; achieves a 44.20x degenerate/healthy correction ratio
- Tokenizer: EleutherAI gpt-neox-20b (50,280 vocab); `====` spacer = `=` × n
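
The gate's core stagnation signal can be sketched as follows (illustrative only; the trained `GeometricProprioceptionGate` in `proprioception_gate.py` learns how to map this signal to a correction, and the clamping used here is an assumption):

```python
import torch

def stagnation_signal(hidden_window: torch.Tensor) -> torch.Tensor:
    """hidden_window: (window_size, d_model) of recent hidden states.
    Low variance across the window means the latent trajectory has
    collapsed into a degenerate loop, so the signal approaches 1.0;
    a healthy, moving trajectory keeps it near 0."""
    var = hidden_window.var(dim=0).mean()      # scalar trajectory variance
    return 1.0 - torch.clamp(var, max=1.0)     # high only when stagnant

healthy = torch.randn(8, 2560)                 # moving trajectory -> ~0
stuck   = torch.full((8, 2560), 0.5)           # frozen trajectory -> 1.0
print(stagnation_signal(stuck).item())
```

This is the "(1 − variance)" quantity from the bullet above; the gate turns it into the 44.20x degenerate-vs-healthy correction ratio reported in the benchmarks.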

Why Post-Backbone LoRA?

The Mamba backbone uses fused Triton kernels (`mamba_ssm.ops.triton`) for the selective scan operation, which are not differentiable through standard autograd. The PostBackboneLoRA adapter sidesteps this by inserting its residual connection after the backbone returns hidden states — making the entire adapter trainable with zero kernel modifications.
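
A minimal sketch of the idea (illustrative; the shipped `PostBackboneLoRA` in `lora_mamba.py` may differ in layer layout, initialization, and scaling):

```python
import torch
import torch.nn as nn

class ResidualLoRA(nn.Module):
    """Low-rank residual adapter applied AFTER the frozen backbone:
    h_out = h + (alpha / rank) * (h @ A) @ B. Because it never touches
    the fused Triton scan kernels, plain autograd trains it end to end."""
    def __init__(self, d_model: int = 2560, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        self.scale = alpha / rank
        self.A = nn.Parameter(torch.randn(d_model, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, d_model))  # zero-init: identity at start

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.scale * (h @ self.A) @ self.B

h = torch.randn(1, 4, 2560)        # (batch, seq, d_model) from the backbone
print(ResidualLoRA()(h).shape)     # same shape; identity until B is trained
```

Zero-initializing `B` makes the adapter a no-op at the start of fine-tuning, so training begins from the backbone's unmodified behavior.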


📚 References & Credits

Full credits: batteryphil/mamba2backbonerecursion README


Built by Phil / Antigravity Agentic Systems. April 2026. Hardware: NVIDIA RTX 3060 12GB. Zero cloud compute. Code: batteryphil/mamba2backbonerecursion
