# batteryphil/mamba-2.8b-latent – OO-SomaMind Phase 10

True O(1) Memory Test-Time Compute via Continuous-State Dark Loops.

A 2.8B-parameter Mamba SSM fine-tuned for multi-step latent reasoning. The model executes deductive logic in its continuous hidden state using `====` spacer tokens as internal clock cycles: no KV-cache growth, no visible chain-of-thought tokens.

Phase 10 adds: OO-domain specialization via LoRA adapters, geometric degeneration suppression via the Proprioception Gate, and named inference-mode routing.
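The "internal clock cycle" idea above can be sketched in a few lines: before decoding an answer, spacer tokens are appended to the prompt so the SSM's hidden state evolves for extra steps without emitting visible reasoning tokens. The helper name, spacer string, and cycle count below are illustrative assumptions based on the description, not the repo's exact API.

```python
# Hypothetical sketch: append N "====" spacer tokens as latent compute steps.
# The spacer string and default count are assumptions, not the repo's API.
def with_dark_loop(prompt: str, n_cycles: int = 8, spacer: str = "====") -> str:
    """Give the model n_cycles extra state-update steps before it answers."""
    return prompt + " " + " ".join([spacer] * n_cycles)

print(with_dark_loop("[OO] What is the limbion engine responsible for?", n_cycles=4))
```

Because Mamba's state is constant-size, each spacer step costs O(1) memory, unlike appending thought tokens to a Transformer's KV cache.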
## Benchmark Results (Phase 10, 2026-04-13, RTX 3060 12GB)
| Metric | Result | Notes |
|---|---|---|
| VRAM footprint | 5.24 GB / 11.6 GB (45%) | Backbone + LoRA 522K + Gate ~8K |
| Throughput (seq=512) | 4,167 tok/s | O(1) MambaCache stateful loop |
| Throughput (avg) | 2,922 tok/s | Across seq 64/128/256/512 |
| Perplexity | 21.91 | GPT-2 124M baseline ~29 ✓ |
| Halt head separation | +0.613 | HIGH=0.642, LOW=0.029 |
| Gate degen/healthy ratio | 44.20x | Target was ≥ 1.5x ✓ |
| OO domain recall | 45% avg | 6/12 probes ≥ 50% keyword recall |
| Load time | 1.8s | safetensors, bf16 |
Repetition note: the default greedy mode averages ~31% bigram repetition (a Mamba SSM characteristic). Apply `repetition_penalty ≥ 1.2` or use the `oo_domain` inference mode (T=0.4, rep=1.4, 4-gram stop), which achieved 0% repetition across 12 OO ontology probes.
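The 4-gram stop mentioned above can be sketched as a simple check on the decoded token ids: stop generating once the most recent 4-gram has already appeared earlier in the output. The function name and exact stopping policy are illustrative assumptions, not the repo's implementation.

```python
# Minimal sketch of a 4-gram repetition stop (assumed policy, not the repo's).
def fourgram_repeated(token_ids: list[int], n: int = 4) -> bool:
    """True if the trailing n-gram of token ids already occurred earlier."""
    if len(token_ids) < 2 * n:
        return False
    tail = tuple(token_ids[-n:])
    # All n-grams strictly before the trailing one.
    earlier = {tuple(token_ids[i:i + n]) for i in range(len(token_ids) - n)}
    return tail in earlier
```

In a decoding loop this would be checked after each emitted token, breaking out of generation when it returns True.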
## Files in This Repository
| File | Description |
|---|---|
| `model.safetensors` | 2.768B Mamba backbone weights (bfloat16) |
| `halting_head_v2.pt` | HaltingHead v2 – OO-semantic halt classifier (sep=+0.613) |
| `halting_head.pt` | HaltingHead v1 – original ACT probe (MAE=0.052, 88.6% accuracy) |
| `proprio_gate_2.8b.pt` | Geometric Proprioception Gate – calibrated to 44.20x degen ratio |
| `lora_oo_r16_final.pt` | Post-backbone LoRA adapter – OO domain (rank 16, α=32, 6 layers) |
| `proprioception_gate.py` | Gate architecture class (`GeometricProprioceptionGate`) |
| `lora_mamba.py` | LoRA adapter class (`PostBackboneLoRA`) |
| `benchmark_pre_hf.py` | 8-section benchmark suite – run to reproduce results |
| `config.json` | Mamba model config (d_model=2560, n_layer=64, vocab_size=50280) |
| `tokenizer.json` | EleutherAI gpt-neox-20b tokenizer |
| `engine_manifest.json` | Training lineage and phase metadata |
## Inference – Quick Start
```python
import torch
from mamba_ssm.models.config_mamba import MambaConfig
from mamba_ssm import MambaLMHeadModel
from safetensors.torch import load_file
from transformers import AutoTokenizer

# ── Load all components ──────────────────────────────────────────
MODEL_DIR = "batteryphil/mamba-2.8b-latent"
DEVICE = "cuda"

tok = AutoTokenizer.from_pretrained(MODEL_DIR)
tok.pad_token = tok.eos_token

cfg = MambaConfig(d_model=2560, n_layer=64, vocab_size=50280,
                  pad_vocab_size_multiple=8)
model = MambaLMHeadModel(cfg, dtype=torch.bfloat16, device=DEVICE)
sd = load_file(f"{MODEL_DIR}/model.safetensors")
if "lm_head.weight" not in sd:
    sd["lm_head.weight"] = sd["backbone.embedding.weight"]  # tied embeddings
model.load_state_dict(sd, strict=False)
model.eval()

# ── Load Proprioception Gate ─────────────────────────────────────
# (download proprioception_gate.py from this repo)
from proprioception_gate import GeometricProprioceptionGate

gate = GeometricProprioceptionGate(d_model=2560, window_size=8)
gate.load_state_dict(torch.load(f"{MODEL_DIR}/proprio_gate_2.8b.pt",
                                map_location=DEVICE))
gate = gate.to(DEVICE, dtype=torch.bfloat16).eval()

# ── Load OO LoRA Adapter ─────────────────────────────────────────
# (download lora_mamba.py from this repo)
from lora_mamba import PostBackboneLoRA, load_post_lora

adapter = PostBackboneLoRA(d_model=2560, rank=16, alpha=32.0, n_layers=6)
adapter = adapter.to(DEVICE)
load_post_lora(adapter, f"{MODEL_DIR}/lora_oo_r16_final.pt", device=DEVICE)
adapter.eval()

# ── Generate (full stack) ────────────────────────────────────────
def generate(prompt, max_tok=80, mode="oo_domain"):
    """Generate using the backbone -> adapter -> gate -> lm_head pipeline."""
    ids = tok(prompt, return_tensors="pt", truncation=True,
              max_length=128).input_ids.to(DEVICE)
    out = []
    eos = tok.eos_token_id
    with torch.no_grad():
        cur = ids
        for _ in range(max_tok):
            h = model.backbone(cur)
            h = adapter(h)
            h = gate(h)
            logits = model.lm_head(h)[0, -1, :].float()
            if mode == "oo_domain":
                # Repetition penalty (1.4) on already-emitted tokens
                for tid in set(out):
                    logits[tid] = logits[tid] / 1.4 if logits[tid] > 0 \
                        else logits[tid] * 1.4
                # Temperature 0.4 + top-p 0.9 nucleus sampling
                logits = logits / 0.4
                probs = torch.softmax(logits, dim=-1)
                sp, si = torch.sort(probs, descending=True)
                cut = (torch.cumsum(sp, 0) - sp) > 0.9
                sp[cut] = 0.0
                nxt = si[torch.multinomial(sp / sp.sum(), 1)].item()
            else:
                nxt = logits.argmax().item()
            if nxt == eos:
                break
            out.append(nxt)
            cur = torch.cat([cur, torch.tensor([[nxt]], device=DEVICE)], 1)
    return tok.decode(out, skip_special_tokens=True)

# Example queries
print(generate("[OO] What is the limbion engine responsible for?"))
print(generate("[SELF] What are you?"))
print(generate("[OO] What does a GATE_COHERENCE_VIOLATION event mean?"))
```
## Named Inference Modes

The engine supports four named sampling presets, auto-detected from the prompt prefix:

| Mode | Temperature | Top-p | Rep Penalty | Trigger |
|---|---|---|---|---|
| `default` | 1.0 (greedy) | off | 1.1 | (fallback) |
| `oo_domain` | 0.4 | 0.90 | 1.4 | `[OO]`, `[SELF]`, `[SWARM]`, `[WARDEN]`, `/oo_*`, `/fork` |
| `code` | 0.3 | 0.95 | 1.05 | `def `, `import `, `class ` |
| `identity` | 0.05 (greedy) | off | 1.0 | "who are you", "your architecture" |

Use `stateful_engine.py` for the full O(1) MambaCache loop engine with auto-routing.
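The prefix-based auto-routing in the table above can be sketched as a small dispatch function. The preset values are copied from the table; the routing function itself, and treating `/oo_*` as the literal prefix `/oo_`, are assumptions — `stateful_engine.py`'s real logic may differ.

```python
# Hypothetical sketch of prompt-prefix routing to the four named presets.
# Preset numbers come from the table above; the routing rules are assumed.
PRESETS = {
    "oo_domain": {"temperature": 0.4,  "top_p": 0.90, "rep_penalty": 1.4},
    "code":      {"temperature": 0.3,  "top_p": 0.95, "rep_penalty": 1.05},
    "identity":  {"temperature": 0.05, "top_p": None, "rep_penalty": 1.0},
    "default":   {"temperature": 1.0,  "top_p": None, "rep_penalty": 1.1},
}

def route(prompt: str) -> str:
    """Pick a sampling preset name from the prompt's leading tokens."""
    p = prompt.strip()
    if p.startswith(("[OO]", "[SELF]", "[SWARM]", "[WARDEN]", "/oo_", "/fork")):
        return "oo_domain"
    if p.startswith(("def ", "import ", "class ")):
        return "code"
    low = p.lower()
    if "who are you" in low or "your architecture" in low:
        return "identity"
    return "default"
```

A caller would then look up `PRESETS[route(prompt)]` and pass those values to its sampling loop.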
## Architecture

- Backbone: Mamba-2.8B (`state-spaces/mamba-2.8b-slimpj` base), d_model=2560, n_layer=64
- HaltingHead v2: 4-layer MLP (2560 → 512 → 64 → 1) – position-conditioned P(halt) probe
- LoRA Adapter: post-backbone residual adapter, rank=16, alpha=32, 6 layers
- Proprioception Gate: 7KB learned gate – a stagnation signal (1 - variance) maps degenerate loops to a HIGH correction signal; achieves a 44.20x degenerate/healthy correction ratio
- Tokenizer: EleutherAI gpt-neox-20b (50,280 vocab), `====` spacer (`=` × n)
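The gate's "stagnation signal (1 - variance)" can be illustrated on a window of hidden states: when recent states barely change, variance collapses and the signal goes HIGH. This is a conceptual reconstruction of the idea, with assumed names and shapes — not the checkpoint's actual computation.

```python
# Conceptual sketch of the stagnation signal (assumed, not the real gate).
import torch

def stagnation_signal(hidden_window: torch.Tensor) -> torch.Tensor:
    """hidden_window: (window, d_model). Returns a scalar correction signal:
    near 1.0 for frozen (degenerate) loops, near 0.0 for healthy ones."""
    var = hidden_window.var(dim=0).mean()   # per-dimension variance, averaged
    return (1.0 - var).clamp(min=0.0)       # high when states stop moving

healthy = torch.randn(8, 2560)              # varied states -> low signal
degenerate = torch.ones(8, 2560) * 0.5      # frozen states -> signal of 1.0
print(stagnation_signal(healthy).item(), stagnation_signal(degenerate).item())
```

The real gate is learned (~7KB of weights), so it presumably shapes and scales this raw signal rather than using it directly.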
### Why Post-Backbone LoRA?

The Mamba backbone uses fused Triton kernels (`mamba_ssm.ops.triton`) for the selective scan operation. These are non-differentiable through standard autograd. The PostBackboneLoRA adapter bypasses this by inserting its residual connection after the backbone returns hidden states, making the entire adapter trainable with zero kernel modifications.
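The post-backbone residual idea can be sketched as stacked rank-r updates applied to the backbone's output, never touching a kernel. The class below is an illustrative reconstruction mirroring the stated hyperparameters (d_model=2560, rank=16, alpha=32, 6 layers); the repo's `PostBackboneLoRA` may be structured differently.

```python
# Illustrative sketch of a post-backbone residual LoRA stack (assumed design).
import torch
import torch.nn as nn

class PostBackboneLoRASketch(nn.Module):
    def __init__(self, d_model=2560, rank=16, alpha=32.0, n_layers=6):
        super().__init__()
        self.scale = alpha / rank
        self.down = nn.ModuleList(nn.Linear(d_model, rank, bias=False)
                                  for _ in range(n_layers))
        self.up = nn.ModuleList(nn.Linear(rank, d_model, bias=False)
                                for _ in range(n_layers))
        for u in self.up:               # standard LoRA init: up-proj = 0,
            nn.init.zeros_(u.weight)    # so the adapter starts as identity

    def forward(self, h):               # h: (batch, seq, d_model)
        for down, up in zip(self.down, self.up):
            h = h + self.scale * up(down(h))   # residual low-rank update
        return h
```

Because the updates live entirely after the backbone's forward pass, autograd only has to differentiate ordinary `nn.Linear` layers, which is why no kernel modification is needed.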
## References & Credits

Full credits: `batteryphil/mamba2backbonerecursion` README

Papers that shaped this work:

- Gu & Dao (2023) – Mamba: Linear-Time Sequence Modeling. arXiv:2312.00752
- Hao et al., Meta (2024) – COCONUT: Continuous Latent Reasoning. arXiv:2412.06769
- Goyal et al., Google (2023) – Pause Tokens. arXiv:2310.02226
- Graves (2016) – Adaptive Computation Time for RNNs. arXiv:1603.08983
- Hu et al., Microsoft (2021) – LoRA. arXiv:2106.09685
- Zelikman et al., Stanford (2024) – Quiet-STaR. arXiv:2403.09629

Engineering insights:

- ItsMick/mamba2backbonerecursion – original recursive backbone loop scaffold
- SGLang #2232 – MambaCache `cache_params` API
- Apple MLX #980 – SSM state shape guard (no cache trimming)
Built by Phil / Antigravity Agentic Systems. April 2026. Hardware: NVIDIA RTX 3060 12GB. Zero cloud compute. Code: batteryphil/mamba2backbonerecursion