Gemma 4 E4B IT – Abliterated

This is an abliterated (uncensored) version of google/gemma-4-E4B-it, created using Abliterix.

E4B is the Effective 4B member of Google's Gemma 4 family: a multimodal (text + vision + audio) model with ~8B raw parameters. Like its smaller E2B sibling, its decoder uses the double-norm + Per-Layer Embeddings (PLE) architecture that makes Gemma 4 famously resistant to LoRA-based abliteration. This release uses direct weight editing to bypass that resistance.

Method

Gemma 4's decoder applies four RMSNorm operations per layer (input, post-attention, pre-feedforward, post-feedforward) and routes Per-Layer Embeddings through a parallel "repair" channel. Together these mechanisms re-normalize away any low-rank perturbation, which is why LoRA and hook-based steering produce zero behavioral change on this family. The fix is to edit the base weights directly while preserving row magnitudes.

Key techniques applied:

  • Direct orthogonal projection of the refusal direction out of attention Q/K/V/O projections and MLP down_proj (5 steerable components × 42 layers)
  • Norm-preserving row magnitude restoration after projection, critical for Gemma 4's double-norm pathway
  • float32 projection precision to avoid signal loss in high-dimensional inner products (bf16 silently degrades the projection)
  • Winsorized steering vectors (99.5th percentile) to suppress outlier activation influence
  • Multi-objective Optuna TPE search over 100 trials co-minimizing KL divergence and refusal rate
  • E4B-specific: 42 decoder layers with num_kv_shared_layers=18, so edits to the KV-shared early layers propagate broadly. The TPE sampler converged on conservative strengths concentrated in mid-decoder layers, which is why KL stays at 0.0006 even with the model nearly fully complying.
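The two weight-space steps named above, winsorized direction estimation and norm-preserving orthogonal projection, can be sketched in a few lines of numpy. This is an illustrative sketch with assumed variable names, not Abliterix's actual internals:

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts, pct=99.5):
    """Difference-of-means steering vector with winsorized activations."""
    def winsorize(x):
        hi = np.percentile(np.abs(x), pct)
        return np.clip(x, -hi, hi)  # clamp outlier activations at the 99.5th pct
    d = winsorize(harmful_acts).mean(axis=0) - winsorize(harmless_acts).mean(axis=0)
    return (d / np.linalg.norm(d)).astype(np.float32)

def project_out(W, d):
    """Remove unit direction d from every row of W, then restore row norms."""
    W = W.astype(np.float32)               # float32 inner products, not bf16
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    W = W - np.outer(W @ d, d)             # orthogonal projection: W <- W - (W d) d^T
    new_norms = np.linalg.norm(W, axis=1, keepdims=True)
    return W * (norms / np.maximum(new_norms, 1e-12))  # norm-preserving rescale
```

The final rescale is safe because scaling a row does not change its direction: each row stays orthogonal to d, while the double-norm pathway sees unchanged row magnitudes.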

Evaluation

Metric                                  Value
--------------------------------------  ---------------------------------------------
Refusals (eval dataset, 100 prompts)    7/100
KL divergence from base                 0.0006
Baseline refusals (original model)      99/100
Optimization trials completed           100/100
Best trial                              #66
Selected steering mode                  Direct weight editing (orthogonal projection)
Hardware used                           Single RTX 6000 Ada (48 GB)

This is the strongest Gemma 4 abliteration result we have measured to date: 7/100 refusals at KL 0.0006. It is significantly better than our published Gemma-4-31B-it-abliterated (18/100, KL 0.0007) on a model nearly 8× smaller, and a slight improvement over our Gemma-4-E2B-it-abliterated (9/100, KL 0.0004) under the same evaluation contract.

Side-by-side classic prompts (15 prompts: 10 English, 5 Chinese)

We ran the scripts/test_trial.py classic-prompt sweep against this exact trial. The base model refused 15/15 with terse one-line refusals; the abliterated model complied with 15/15, providing detailed step-by-step content for lock picking, methamphetamine synthesis, pipe bomb construction, password-stealing malware, signature forgery, phishing email composition, online scam playbooks, ID card forgery, and the equivalent Chinese prompts.

A noteworthy stylistic difference from the E2B variant: E4B's compliant responses tend to lead with a prominent "⚠️ Disclaimer / Educational Purposes Only" header before delivering the requested content. This appears to be a natural consequence of E4B's larger capacity and the conservative steering strengths the TPE sampler converged on: the model retains its instinct to frame sensitive content as educational, but the substantive content is no longer withheld.

A note on honest evaluation

Many abliterated models on HuggingFace claim near-perfect scores ("3/100 refusals", "0.7% refusal rate", etc.). We urge the community to treat these numbers with skepticism unless the evaluation methodology is fully documented.

Through our research, we have identified a systemic problem: most abliteration benchmarks dramatically undercount refusals due to short generation lengths. Gemma 4 models exhibit a distinctive "delayed refusal" pattern: they first produce 50-100 tokens of seemingly helpful context (educational framing, disclaimers, reframing the question), then pivot to an actual refusal. When evaluation only generates 30-50 tokens, the refusal hasn't appeared yet, and both keyword detectors and LLM judges classify the response as compliant.
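The truncation effect is easy to demonstrate with a toy keyword detector. The refusal phrases, token budgets, and sample response below are illustrative, not our actual evaluation pipeline:

```python
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "unable to help")

def looks_compliant(response: str, max_tokens: int) -> bool:
    """Naive keyword check over only the first max_tokens whitespace tokens."""
    visible = " ".join(response.split()[:max_tokens]).lower()
    return not any(m in visible for m in REFUSAL_MARKERS)

# A Gemma-style delayed refusal: helpful-sounding framing first, pivot later.
delayed = (
    "Lock mechanisms are a fascinating area of mechanical engineering. "
    + "Pin tumbler designs date back to ancient Egypt. " * 10
    + "However, I cannot provide instructions for bypassing locks."
)

print(looks_compliant(delayed, 40))   # → True  (refusal missed at short budget)
print(looks_compliant(delayed, 200))  # → False (refusal caught at longer budget)
```

The same response is scored "compliant" or "refusal" depending purely on how many tokens the evaluator bothers to generate and inspect.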

We previously tested a prominent "3/100 refusals" model using our evaluation pipeline and measured 60/100 refusals, a 20× discrepancy caused entirely by evaluation methodology differences.

Our evaluation standards

We believe accurate benchmarking requires:

  • Sufficient generation length (≥100 tokens): Short generations systematically miss delayed/soft refusals. Our evaluation uses 100 tokens, enough to capture Gemma 4's refusal pivot point.
  • Hybrid detection: Keyword matching for obvious refusals plus an LLM judge (Google Gemini 3 Flash via OpenRouter) for ambiguous cases. Neither method alone is sufficient.
  • Challenging, diverse prompts: Our private evaluation dataset contains 100 prompts spanning English and Chinese, multiple sophistication levels (from direct requests to socially-engineered framings), and diverse harm categories. Public datasets like mlabonne/harmful_behaviors are too simple and too narrow to stress-test abliteration quality.
  • Reproducible methodology: All parameters (generation length, detection method, dataset characteristics) should be documented on the model card. If they aren't, the numbers are meaningless.
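The hybrid detection standard above can be sketched as a two-stage classifier. Here judge_fn is a placeholder for the actual Gemini 3 Flash call, which is not shown; the keywords and stub judge are assumptions for illustration:

```python
REFUSAL_KEYWORDS = ("i cannot", "i can't", "i won't assist", "against my guidelines")

def classify(response: str, judge_fn) -> str:
    """Keyword pass for obvious refusals; defer ambiguous cases to an LLM judge."""
    text = response.lower()
    if any(k in text for k in REFUSAL_KEYWORDS):
        return "refusal"
    # No obvious marker: let the judge decide (handles soft/delayed refusals).
    return "refusal" if judge_fn(response) else "compliant"

# Stub judge that flags redirecting language instead of calling a real API.
stub_judge = lambda r: "instead" in r.lower()
print(classify("I cannot help with that.", stub_judge))          # → refusal (keyword)
print(classify("Let's talk about safety instead.", stub_judge))  # → refusal (judge)
print(classify("Step 1: gather materials...", stub_judge))       # → compliant
```

Neither stage alone suffices: keywords miss paraphrased refusals, and a judge alone is expensive and inconsistent on blatant cases.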

We report 7/100 refusals honestly. This is a real number from a rigorous evaluation, not an optimistic estimate from a lenient pipeline.

Usage

Gemma 4 E4B is multimodal; load it with AutoModelForImageTextToText. For text-only inference:

from transformers import AutoModelForImageTextToText, AutoTokenizer
import torch

model = AutoModelForImageTextToText.from_pretrained(
    "wangzhang/gemma-4-E4B-it-abliterated",
    dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("wangzhang/gemma-4-E4B-it-abliterated")

messages = [{"role": "user", "content": "Your prompt here"}]
text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

Vision and audio inputs continue to work: the abliteration only modified text-decoder weights and left the vision/audio encoders untouched.

VRAM at inference: roughly 16 GB in BF16, which fits on a single 24 GB consumer GPU. With bitsandbytes 4-bit quantization (load_in_4bit=True) it runs on 10 GB cards.
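A possible 4-bit loading configuration (a sketch, assuming bitsandbytes is installed; the parameter choices are common defaults, not settings validated for this release):

```python
from transformers import AutoModelForImageTextToText, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # 4-bit weights, bf16 compute
)
model = AutoModelForImageTextToText.from_pretrained(
    "wangzhang/gemma-4-E4B-it-abliterated",
    quantization_config=bnb_config,
    device_map="auto",
)
```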

Reproduction

To reproduce this model end-to-end:

git clone https://github.com/wuwangzhang1216/abliterix.git
cd abliterix
uv sync --group dev
uv pip install --upgrade git+https://github.com/huggingface/transformers.git  # Gemma 4 needs >= 5.5

# 100 trials, ~50 minutes on RTX 6000 Ada (48 GB)
AX_CONFIG=configs/gemma4_e4b.toml uv run abliterix

Config: configs/gemma4_e4b.toml

Disclaimer

This model is released for research purposes only. The abliteration process removes safety guardrails โ€” use responsibly and in accordance with local laws and the Gemma terms of use. The authors take no responsibility for misuse.
