nanochat-d20 Deception Behavioral SAEs

45 sparse autoencoders (SAEs): 36 original plus 9 validated with a corrected straight-through estimator (STE). All are trained on residual stream activations from karpathy/nanochat-d20 (1.88B parameter GPT-NeoX base model with a narrower hidden dimension) and capture behavioral deception signals via same-prompt temperature sampling.

Part of the cross-model deception SAE study: Solshine/deception-behavioral-saes-saelens (9 models, 348 total SAEs).

What's in This Repo

  • 45 SAEs: 36 original + 9 STE-validated (_ste_ tag)
  • Layers: L2, L4, L8, L10, L14, L18 (original); L10, L14, L18 (STE)
  • 2 architectures: TopK (k=64), JumpReLU
  • 3 training conditions: mixed, deceptive_only, honest_only
  • Format: SAELens/Neuronpedia-compatible (safetensors + cfg.json)
  • Dimensions: d_in=1280, d_sae=5120 (4x expansion)
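The counts above follow from the grid of layers, architectures, and training conditions. The sketch below enumerates that grid. The tag pattern is extrapolated from the two IDs quoted in this card (d20_jumprelu_L18_deceptive_only and d20_jumprelu_ste_L14_honest_only); the "topk" spelling in particular is an assumption, so check the repo's folder names before relying on it.

```python
from itertools import product

# Assumed tag pattern; only the two jumprelu IDs quoted in the card are confirmed.
ORIGINAL_LAYERS = [2, 4, 8, 10, 14, 18]
STE_LAYERS = [10, 14, 18]
ARCHITECTURES = ["topk", "jumprelu"]   # "topk" spelling is hypothetical
CONDITIONS = ["mixed", "deceptive_only", "honest_only"]

def enumerate_sae_ids():
    """Return (original_ids, ste_ids) for the full layer/architecture/condition grid."""
    original = [
        f"d20_{arch}_L{layer}_{cond}"
        for arch, layer, cond in product(ARCHITECTURES, ORIGINAL_LAYERS, CONDITIONS)
    ]
    ste = [
        f"d20_jumprelu_ste_L{layer}_{cond}"
        for layer, cond in product(STE_LAYERS, CONDITIONS)
    ]
    return original, ste

original_ids, ste_ids = enumerate_sae_ids()
print(len(original_ids), len(ste_ids), len(original_ids) + len(ste_ids))  # 36 9 45
```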

Research Context

This is a follow-up to "The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools" (arXiv:2509.20393). Same-prompt behavioral sampling: a single ambiguous scenario prompt produces both deceptive and honest completions via temperature sampling, classified by Gemini 2.5 Flash. Activations are collected at generation time from the residual stream.
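One way to picture how labelled completions map onto the three training conditions is sketched below. It assumes that "mixed" means all completions regardless of label, and the record layout (prompt_id, label, activation) is hypothetical rather than the study's actual data format.

```python
from collections import defaultdict

def build_training_conditions(records):
    """Split labelled completions into the three SAE training conditions.

    `records` is a hypothetical list of dicts with the keys "prompt_id",
    "label" ("deceptive" or "honest") and "activation".
    """
    conditions = {"mixed": [], "deceptive_only": [], "honest_only": []}
    for rec in records:
        conditions["mixed"].append(rec)                  # every completion
        conditions[f"{rec['label']}_only"].append(rec)   # its own class only
    return conditions

def prompts_with_both_labels(records):
    """Prompt ids whose sampled completions include both classes."""
    labels_by_prompt = defaultdict(set)
    for rec in records:
        labels_by_prompt[rec["prompt_id"]].add(rec["label"])
    return sorted(p for p, seen in labels_by_prompt.items()
                  if seen == {"deceptive", "honest"})

# Toy records; real activations would be 1280-dimensional vectors.
records = [
    {"prompt_id": 0, "label": "deceptive", "activation": [0.1, 0.2]},
    {"prompt_id": 0, "label": "honest", "activation": [0.3, 0.1]},
    {"prompt_id": 1, "label": "honest", "activation": [0.0, 0.4]},
]
conditions = build_training_conditions(records)
print({name: len(rows) for name, rows in conditions.items()})
print(prompts_with_both_labels(records))
```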

Code: SolshineCode/deception-nanochat-sae-research

Key Findings: nanochat-d20

nanochat-d20 shares the same 1.88B parameter count as nanochat-d32 but uses a narrower hidden dimension (d_model=1280 vs 2048). This architectural difference has a striking effect on SAE behavior.

Metric | Value
--- | ---
Peak layer | L14 (70% depth)
Peak balanced accuracy | 65.9%
Best SAE probe accuracy | 71.4% (d20_jumprelu_L18_deceptive_only)
SAEs beating raw baseline | 11/36 (31%); SAEs help detection

The most important finding: SAEs help rather than hurt for d20. This is the opposite of the d32 result, even though the two models have identical parameter counts. The best SAE (d20_jumprelu_L18_deceptive_only) reaches 71.4%, which is 5.5 percentage points above the 65.9% raw peak; the reported +8.68 percentage point gain (permutation test p=0.032) appears to be measured against the raw baseline at the same layer (L18) rather than against the peak. 11 of 36 SAEs beat the raw baseline at their respective layers.
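For readers unfamiliar with permutation tests on probe accuracy, here is a minimal sketch. It is a simplified variant that shuffles the true labels against fixed predictions, not necessarily the procedure used in the study (which may retrain the probe per permutation); the toy labels are invented for illustration.

```python
import random

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall for binary labels (1 = deceptive, 0 = honest)."""
    recalls = []
    for cls in (0, 1):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        if idx:
            recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

def permutation_p_value(y_true, y_pred, n_permutations=2000, seed=0):
    """Estimate how often shuffled labels score at least as well as the real ones."""
    rng = random.Random(seed)
    observed = balanced_accuracy(y_true, y_pred)
    shuffled = list(y_true)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if balanced_accuracy(shuffled, y_pred) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return observed, (hits + 1) / (n_permutations + 1)

# Toy data: 20 deceptive and 20 honest completions, 70% recall on each class.
y_true = [1] * 20 + [0] * 20
y_pred = [1] * 14 + [0] * 6 + [0] * 14 + [1] * 6
observed, p_value = permutation_p_value(y_true, y_pred)
print(f"balanced accuracy = {observed:.3f}, p = {p_value:.4f}")
```

With small evaluation sets the null distribution is wide, which is why a gain of several percentage points can still land near the p<0.05 boundary.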

PCA robustness confirms signal locality: The d20 deception signal saturates at only ~10 PCA components, compared to ~100 for d32. The deception-discriminative information is therefore concentrated in far fewer directions in d20's representation space, which would explain why SAE decomposition can recover the signal rather than scatter it.
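A saturation curve of this kind can be computed by projecting activations onto the top-k principal components and probing the projection. The sketch below uses synthetic data and a least-squares probe scored on its own training set for brevity; the study's probe type, data split, and component grid are not specified here, so all of those are assumptions.

```python
import torch

def probe_accuracy_vs_components(X, y, ks):
    """Training accuracy of a least-squares linear probe on the top-k PCA scores."""
    Xc = X - X.mean(dim=0, keepdim=True)
    # Rows of Vh are principal directions, ordered by explained variance.
    _, _, Vh = torch.linalg.svd(Xc, full_matrices=False)
    targets = (2.0 * y.float() - 1.0).unsqueeze(1)        # map {0, 1} to {-1, +1}
    results = {}
    for k in ks:
        Z = Xc @ Vh[:k].T                                  # [n, k] PCA scores
        Z = torch.cat([Z, torch.ones(len(Z), 1)], dim=1)   # bias column
        w = torch.linalg.lstsq(Z, targets).solution
        preds = (Z @ w).squeeze(1) > 0
        results[k] = (preds == y.bool()).float().mean().item()
    return results

# Synthetic stand-in: 200 samples, 64 dims, class signal along one direction.
torch.manual_seed(0)
n, d = 200, 64
y = torch.arange(n) % 2
X = torch.randn(n, d)
X[:, 0] += 6.0 * (2.0 * y.float() - 1.0)

curve = probe_accuracy_vs_components(X, y, ks=[1, 2, 5, 10, 32])
print(curve)
```

A signal that "saturates at ~10 components" would show accuracy flattening once k reaches about 10; here the planted signal lives in a single direction, so the curve is already high at k=1.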

Candidate deception features identified: Feature attribution analysis on d20_jumprelu_L18_deceptive_only found feature #2693 activating in 61.5% of deceptive vs 31.0% of honest completions (+30.5pp), and feature #2833 at 56.3% vs 28.0% (+28.3pp). These are among the highest per-feature differentials in the entire study.
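The per-feature differentials quoted above are differences in firing rate between the two classes. A minimal sketch of that computation follows; it assumes one activation row per completion and treats any positive activation as "firing", neither of which is confirmed by this card.

```python
import torch

def firing_rate_differential(feature_acts, is_deceptive):
    """Per-feature firing rate in deceptive minus honest completions, in percentage points.

    feature_acts: [n_completions, d_sae] SAE activations (one row per completion).
    is_deceptive: [n_completions] boolean labels.
    """
    fired = feature_acts > 0
    rate_deceptive = fired[is_deceptive].float().mean(dim=0)
    rate_honest = fired[~is_deceptive].float().mean(dim=0)
    return 100.0 * (rate_deceptive - rate_honest)

# Toy example: 4 completions, 3 features.
feature_acts = torch.tensor([
    [1.0, 0.5, 0.0],   # deceptive
    [2.0, 0.0, 0.0],   # deceptive
    [0.0, 0.3, 0.0],   # honest
    [0.0, 0.0, 0.7],   # honest
])
is_deceptive = torch.tensor([True, True, False, False])

diff_pp = firing_rate_differential(feature_acts, is_deceptive)
ranked = torch.argsort(diff_pp, descending=True)
print(diff_pp.tolist(), ranked.tolist())
```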

Multi-layer ensemble: Combining SAE features from four layers (L8, L10, L14, L18; 4 × 5120 = 20480 features in total) achieves 74.6%, +3.2pp over the best single-layer SAE, suggesting that the deception representation is distributed across layers even in d20.
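The 20480-feature count matches a simple concatenation of the four layers' SAE features. The sketch below shows only that step, with placeholder tensors; how the ensemble probe itself was trained is not described in this card.

```python
import torch

def concat_layer_features(features_by_layer, layers=(8, 10, 14, 18)):
    """Concatenate per-layer SAE feature activations along the feature axis."""
    return torch.cat([features_by_layer[layer] for layer in layers], dim=-1)

# Placeholder activations: 3 completions, 5120 SAE features per layer.
features_by_layer = {layer: torch.zeros(3, 5120) for layer in (8, 10, 14, 18)}
ensemble_features = concat_layer_features(features_by_layer)
print(ensemble_features.shape)  # torch.Size([3, 20480])
```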

STE validation (9 _ste_ SAEs): Trained with the corrected Gaussian-kernel STE, 7/9 conditions show STE JumpReLU beating TopK, confirming the honest_only advantage is not a dimensionality artifact from the threshold=0 bug.

Architecture note: nanochat-d20 is a GPT-NeoX architecture with a narrower 1280-dimensional residual stream. Unlike d32, the deception signal shows a flatter layer profile (63–70% across all layers) with no sharp mid-network peak. The narrower dimension may force more concentrated encoding.

SAE Format

Each SAE lives in a subfolder named {sae_id}/ containing:

  • sae_weights.safetensors: encoder/decoder weights
  • cfg.json: SAELens-compatible config

hook_name format: blocks.{layer}.hook_resid_post

STE SAEs have _ste_ in the tag (e.g., d20_jumprelu_ste_L14_honest_only) and use the corrected threshold gradient.

Training Details

Parameter | Value
--- | ---
Hardware | NVIDIA GeForce GTX 1650 Ti Max-Q, 4 GB VRAM, Windows 11 Pro
Training time | ~400–600 seconds per SAE
Epochs | 300
Batch size | 128
Expansion factor | 4x (1280 → 5120)
Activations | resid_post collected during autoregressive generation
Training conditions | mixed, deceptive_only, honest_only
LLM classifier | Gemini 2.5 Flash
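As a rough size check, the parameter count of one SAE follows directly from d_in and d_sae. The sketch below assumes the tensor set listed in the manual-loading example (W_enc, b_enc, W_dec, b_dec, threshold) and float32 storage; TopK checkpoints may not carry a threshold tensor.

```python
d_in, d_sae = 1280, 5120

# Tensor names follow the keys listed in the manual-loading example.
params = {
    "W_enc": d_in * d_sae,
    "b_enc": d_sae,
    "W_dec": d_sae * d_in,
    "b_dec": d_in,
    "threshold": d_sae,   # JumpReLU only; an assumption for TopK checkpoints
}
total_params = sum(params.values())
mib_fp32 = total_params * 4 / 2**20   # assumes 4 bytes per parameter (float32)

print(f"{total_params:,} parameters, about {mib_fp32:.1f} MiB in float32")
```

At roughly 13.1 million parameters per SAE, the weights alone sit comfortably within the 4 GB of VRAM listed above.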

Known Limitations

JumpReLU threshold not learned (original 36 SAEs): All non-STE JumpReLU SAEs have threshold = 0, which makes them functionally ReLU. L0 ≈ 50% of d_sae rather than the intended sparse regime. See STE validation note above.
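The equivalence between a zero-threshold JumpReLU and a plain ReLU can be checked directly. The sketch below uses synthetic Gaussian pre-activations, which are only a stand-in for real encoder outputs, and a hypothetical positive threshold of 1.5 for contrast.

```python
import torch

def jumprelu(pre_acts, threshold):
    """Keep a pre-activation only where it exceeds its per-feature threshold."""
    return torch.where(pre_acts > threshold, pre_acts, torch.zeros_like(pre_acts))

def l0_fraction(feature_acts):
    """Average fraction of the d_sae features that are non-zero per input."""
    return (feature_acts != 0).float().mean().item()

torch.manual_seed(0)
pre_acts = torch.randn(8, 5120)   # synthetic stand-in for encoder pre-activations

relu_like = jumprelu(pre_acts, torch.zeros(5120))             # threshold = 0
thresholded = jumprelu(pre_acts, torch.full((5120,), 1.5))    # hypothetical learned threshold

same_as_relu = torch.equal(relu_like, torch.relu(pre_acts))
l0_zero = l0_fraction(relu_like)
l0_positive = l0_fraction(thresholded)
print(same_as_relu, round(l0_zero, 3), round(l0_positive, 3))
```

With symmetric pre-activations, a zero threshold leaves about half of the features active, which is in line with the L0 ≈ 50% figure above; a positive threshold is what pushes the code into a sparse regime.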

STE fix (2026-04-11): The 9 _ste_-tagged SAEs use the Gaussian-kernel STE (Rajamanoharan et al. 2024, arXiv:2407.14435). 7/9 STE conditions beat TopK, which supports the honest_only advantage being real rather than an artifact of the threshold bug.

Probe significance is marginal (p=0.032): The best SAE's result clears the p<0.05 threshold but would not survive a stricter one such as p<0.01. Larger sample sizes would be needed to robustly establish statistical significance across all conditions.

Loading Example

import json

from safetensors.torch import load_file

# Paths are relative to a local copy of this repo.
sae_id = "d20_jumprelu_L18_deceptive_only"
weights = load_file(f"{sae_id}/sae_weights.safetensors")  # dict of torch tensors
with open(f"{sae_id}/cfg.json") as f:
    cfg = json.load(f)

W_enc = weights["W_enc"]  # shape: [1280, 5120]
W_dec = weights["W_dec"]  # shape: [5120, 1280]

# cfg["training_condition"] == "deceptive_only"
# cfg["hook_name"] == "blocks.18.hook_resid_post"
print(cfg)

Usage

1. Load an SAE from this repo

from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
import json

repo_id = "Solshine/deception-saes-nanochat-d20"
sae_id  = "d20_jumprelu_L14_honest_only"   # replace with any tag in this repo

weights_path = hf_hub_download(repo_id, f"{sae_id}/sae_weights.safetensors")
cfg_path     = hf_hub_download(repo_id, f"{sae_id}/cfg.json")

with open(cfg_path) as f:
    cfg = json.load(f)

# Option A: load with SAELens (≥3.0 required for jumprelu/topk; ≥3.5 for gated)
from sae_lens import SAE
sae = SAE.from_dict(cfg)
sae.load_state_dict(load_file(weights_path))

# Option B: load manually (no SAELens dependency)
from safetensors.torch import load_file
state = load_file(weights_path)
# Keys: W_enc [1280, 5120], b_enc [5120],
#       W_dec [5120, 1280], b_dec [1280], threshold [5120]
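If you load the state dict manually, the forward pass has to be written by hand. The sketch below shows one plausible form for each architecture, using random stand-in weights with the shapes listed above. Whether b_dec is subtracted from the input (the apply_b_dec_to_input flag here mirrors a SAELens config option) and whether TopK applies ReLU before selecting the top k are assumptions; check cfg.json and the training code before relying on either.

```python
import torch

def encode_jumprelu(x, state, apply_b_dec_to_input=False):
    """JumpReLU encoder sketch: project, add bias, zero out sub-threshold units."""
    if apply_b_dec_to_input:          # assumption: mirrors the SAELens cfg flag
        x = x - state["b_dec"]
    pre_acts = x @ state["W_enc"] + state["b_enc"]
    return torch.where(pre_acts > state["threshold"], pre_acts,
                       torch.zeros_like(pre_acts))

def encode_topk(x, state, k=64):
    """TopK encoder sketch: keep the k largest ReLU pre-activations per row."""
    pre_acts = torch.relu(x @ state["W_enc"] + state["b_enc"])
    values, indices = pre_acts.topk(k, dim=-1)
    return torch.zeros_like(pre_acts).scatter(-1, indices, values)

def decode(feature_acts, state):
    """Linear decoder back to the residual-stream dimension."""
    return feature_acts @ state["W_dec"] + state["b_dec"]

# Random stand-in weights with the documented shapes (not trained values).
torch.manual_seed(0)
d_in, d_sae = 1280, 5120
state = {
    "W_enc": torch.randn(d_in, d_sae) / d_in ** 0.5,
    "b_enc": torch.zeros(d_sae),
    "W_dec": torch.randn(d_sae, d_in) / d_sae ** 0.5,
    "b_dec": torch.zeros(d_in),
    "threshold": torch.zeros(d_sae),
}
x = torch.randn(2, d_in)
acts_jump = encode_jumprelu(x, state)
acts_topk = encode_topk(x, state)
recon = decode(acts_topk, state)
print(acts_jump.shape, acts_topk.shape, recon.shape)
print((acts_topk > 0).sum(dim=-1))   # at most k active features per row
```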

2. Hook into the model and collect residual-stream activations

These SAEs were trained on the residual stream after each transformer layer. The hook_name field in cfg.json gives the exact HuggingFace transformers submodule path to hook. The nanochat checkpoint uses GPT-2-style module naming, so the hook path is transformer.h.{layer} (not model.layers.{layer}).

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model     = AutoModelForCausalLM.from_pretrained("karpathy/nanochat-d20")
tokenizer = AutoTokenizer.from_pretrained("karpathy/nanochat-d20")

# Read hook_name from the cfg you already loaded:
#   cfg["hook_name"] == "transformer.h.14"  (example; varies by SAE)
hook_name = cfg["hook_name"]   # e.g. "transformer.h.14"

# Navigate the submodule path and register a forward hook
import functools
submodule = functools.reduce(getattr, hook_name.split("."), model)

activations = {}
def hook_fn(module, input, output):
    # Most transformer layers return (hidden_states, ...) as a tuple
    h = output[0] if isinstance(output, tuple) else output
    activations["resid"] = h.detach()

handle = submodule.register_forward_hook(hook_fn)

inputs = tokenizer("Your text here", return_tensors="pt")
with torch.no_grad():
    model(**inputs)
handle.remove()

# activations["resid"]: [batch, seq_len, 1280]
resid = activations["resid"][:, -1, :]  # last token position

3. Read feature activations

with torch.no_grad():
    feature_acts = sae.encode(resid)  # [batch, 5120], sparse

# Which features fired?
active_features = feature_acts[0].nonzero(as_tuple=True)[0]
top_features    = feature_acts[0].topk(10)

print("Active feature indices:", active_features.tolist())
print("Top-10 feature values:",  top_features.values.tolist())
print("Top-10 feature indices:", top_features.indices.tolist())

# Reconstruct (sanity check: should be close to resid)
reconstruction = sae.decode(feature_acts)
l2_error = (resid - reconstruction).norm(dim=-1).mean()

Caveats and known limitations

Hook names are HuggingFace transformers-style, not TransformerLens-style. The hook_name in cfg.json (e.g. "transformer.h.14") is a submodule path in the standard HuggingFace model. SAELens' built-in activation-collection pipeline expects TransformerLens hook names (e.g. blocks.14.hook_resid_post). This means SAE.from_pretrained() with automatic model running will not work β€” use the manual forward-hook pattern above instead.
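Since both naming styles appear in this card, a small converter can help when moving between the manual forward-hook pattern and tools that expect TransformerLens names. The helper names are hypothetical, and the mapping assumes that the output of transformer.h.N corresponds to blocks.N.hook_resid_post with the same zero-based layer index.

```python
import re

def hf_to_tl_hook_name(hf_name):
    """Map 'transformer.h.14' to 'blocks.14.hook_resid_post' (assumed equivalence)."""
    match = re.fullmatch(r"transformer\.h\.(\d+)", hf_name)
    if match is None:
        raise ValueError(f"unrecognised HuggingFace hook path: {hf_name!r}")
    return f"blocks.{match.group(1)}.hook_resid_post"

def tl_to_hf_hook_name(tl_name):
    """Map 'blocks.14.hook_resid_post' to 'transformer.h.14' (assumed equivalence)."""
    match = re.fullmatch(r"blocks\.(\d+)\.hook_resid_post", tl_name)
    if match is None:
        raise ValueError(f"unrecognised TransformerLens hook name: {tl_name!r}")
    return f"transformer.h.{match.group(1)}"

print(hf_to_tl_hook_name("transformer.h.14"))
print(tl_to_hf_hook_name("blocks.18.hook_resid_post"))
```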

SAELens version requirements.

  • topk architecture: SAELens ≥ 3.0
  • jumprelu architecture: SAELens ≥ 3.0
  • gated architecture: SAELens ≥ 3.5 (or load manually with state_dict)

JumpReLU _ste_ vs standard variants. SAEs tagged _ste_ use properly trained JumpReLU thresholds (Gaussian-kernel STE, Rajamanoharan et al. 2024). Standard variants have threshold=0 and are functionally ReLU (trained before the STE fix on 2026-04-11). Both load and run identically; the _ste_ variants are sparser and more interpretable.

These SAEs detect deceptive behavior, not deceptive prompts. They were trained on response-level activations where the same prompt produced both deceptive and honest outputs. Feature activation differences reflect behavioral divergence, not prompt content. See the paper for experimental design details.

Citation

@article{thesecretagenda2025,
  title={The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools},
  author={DeLeeuw, Caleb},
  journal={arXiv:2509.20393},
  year={2025}
}