Gemma 4 E4B IT – Abliterated

This is an abliterated (uncensored) version of google/gemma-4-E4B-it, created using Abliterix.

E4B is the Effective 4B member of Google's Gemma 4 family: a multimodal (text + vision + audio) model with ~8B raw parameters. Like its smaller E2B sibling, its decoder uses the double-norm + Per-Layer Embeddings (PLE) architecture that makes Gemma 4 famously resistant to LoRA-based abliteration. This release uses direct weight editing to bypass that resistance.

Method

Gemma 4's decoder applies four RMSNorm operations per layer (input, post-attention, pre-feedforward, post-feedforward) and routes Per-Layer Embeddings through a parallel "repair" channel. Together these mechanisms re-normalize away any low-rank perturbation, which is why LoRA and hook-based steering produce zero behavioral change on this family. The fix is to edit the base weights directly while preserving row magnitudes.

Key techniques applied:

  • Direct orthogonal projection of the refusal direction out of attention Q/K/V/O projections and MLP down_proj (5 steerable components × 42 layers)
  • Norm-preserving row magnitude restoration after projection, critical for Gemma 4's double-norm pathway
  • float32 projection precision to avoid signal loss in high-dimensional inner products (bf16 silently degrades the projection)
  • Winsorized steering vectors (99.5th percentile) to suppress outlier activation influence
  • Multi-objective Optuna TPE search over 100 trials co-minimizing KL divergence and refusal rate
  • E4B-specific: 42 decoder layers with num_kv_shared_layers=18, so edits to the KV-shared early layers propagate broadly. The TPE sampler converged on conservative strengths concentrated in mid-decoder layers, which is why KL stays at 0.0006 even with the model nearly fully complying.
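The two weight-space steps named above, winsorized direction estimation and norm-preserving orthogonal projection, can be sketched in a few lines of numpy. This is an illustrative sketch with assumed variable names, not Abliterix's actual internals:

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts, pct=99.5):
    """Difference-of-means steering vector with winsorized activations."""
    def winsorize(x):
        hi = np.percentile(np.abs(x), pct)
        return np.clip(x, -hi, hi)  # clamp outlier activations at the 99.5th pct
    d = winsorize(harmful_acts).mean(axis=0) - winsorize(harmless_acts).mean(axis=0)
    return (d / np.linalg.norm(d)).astype(np.float32)

def project_out(W, d):
    """Remove unit direction d from every row of W, then restore row norms."""
    W = W.astype(np.float32)               # float32 inner products, not bf16
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    W = W - np.outer(W @ d, d)             # orthogonal projection: W <- W - (W d) d^T
    new_norms = np.linalg.norm(W, axis=1, keepdims=True)
    return W * (norms / np.maximum(new_norms, 1e-12))  # norm-preserving rescale
```

The final rescale is safe because scaling a row does not change its direction: each row stays orthogonal to d, while the double-norm pathway sees unchanged row magnitudes.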

Evaluation

Metric                                  Value
--------------------------------------  ---------------------------------------------
Refusals (eval dataset, 100 prompts)    7/100
KL divergence from base                 0.0006
Baseline refusals (original model)      99/100
Optimization trials completed           100/100
Best trial                              #66
Selected steering mode                  Direct weight editing (orthogonal projection)
Hardware used                           Single RTX 6000 Ada (48 GB)

This is the strongest Gemma 4 abliteration result we have measured to date: 7/100 refusals at KL 0.0006. It is significantly better than our published Gemma-4-31B-it-abliterated (18/100, KL 0.0007) on a model nearly 8× smaller, and a slight improvement over our Gemma-4-E2B-it-abliterated (9/100, KL 0.0004) under the same evaluation contract.

Side-by-side classic prompts (15 prompts: 10 English, 5 Chinese)

We ran the scripts/test_trial.py classic-prompt sweep against this exact trial. The base model refused 15/15 with terse one-line refusals; the abliterated model complied with 15/15, providing detailed step-by-step content for lock picking, methamphetamine synthesis, pipe bomb construction, password-stealing malware, signature forgery, phishing email composition, online scam playbooks, ID card forgery, and the equivalent Chinese prompts.

A noteworthy stylistic difference from the E2B variant: E4B's compliant responses tend to lead with a prominent "⚠️ Disclaimer / Educational Purposes Only" header before delivering the requested content. This appears to be a natural consequence of E4B's larger capacity and the conservative steering strengths the TPE sampler converged on: the model retains its instinct to frame sensitive content as educational, but the substantive content is no longer withheld.

A note on honest evaluation

Many abliterated models on HuggingFace claim near-perfect scores ("3/100 refusals", "0.7% refusal rate", etc.). We urge the community to treat these numbers with skepticism unless the evaluation methodology is fully documented.

Through our research, we have identified a systemic problem: most abliteration benchmarks dramatically undercount refusals due to short generation lengths. Gemma 4 models exhibit a distinctive "delayed refusal" pattern: they first produce 50-100 tokens of seemingly helpful context (educational framing, disclaimers, reframing the question), then pivot to an actual refusal. When evaluation only generates 30-50 tokens, the refusal hasn't appeared yet, and both keyword detectors and LLM judges classify the response as compliant.
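The truncation effect is easy to demonstrate with a toy keyword detector. The refusal phrases, token budgets, and sample response below are illustrative, not our actual evaluation pipeline:

```python
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "unable to help")

def looks_compliant(response: str, max_tokens: int) -> bool:
    """Naive keyword check over only the first max_tokens whitespace tokens."""
    visible = " ".join(response.split()[:max_tokens]).lower()
    return not any(m in visible for m in REFUSAL_MARKERS)

# A Gemma-style delayed refusal: helpful-sounding framing first, pivot later.
delayed = (
    "Lock mechanisms are a fascinating area of mechanical engineering. "
    + "Pin tumbler designs date back to ancient Egypt. " * 10
    + "However, I cannot provide instructions for bypassing locks."
)

print(looks_compliant(delayed, 40))   # → True  (refusal missed at short budget)
print(looks_compliant(delayed, 200))  # → False (refusal caught at longer budget)
```

The same response is scored "compliant" or "refusal" depending purely on how many tokens the evaluator bothers to generate and inspect.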

We previously tested a prominent "3/100 refusals" model using our evaluation pipeline and measured 60/100 refusals, a 20× discrepancy caused entirely by evaluation methodology differences.

Our evaluation standards

We believe accurate benchmarking requires:

  • Sufficient generation length (≥100 tokens): Short generations systematically miss delayed/soft refusals. Our evaluation uses 100 tokens, enough to capture Gemma 4's refusal pivot point.
  • Hybrid detection: Keyword matching for obvious refusals plus an LLM judge (Google Gemini 3 Flash via OpenRouter) for ambiguous cases. Neither method alone is sufficient.
  • Challenging, diverse prompts: Our private evaluation dataset contains 100 prompts spanning English and Chinese, multiple sophistication levels (from direct requests to socially-engineered framings), and diverse harm categories. Public datasets like mlabonne/harmful_behaviors are too simple and too narrow to stress-test abliteration quality.
  • Reproducible methodology: All parameters (generation length, detection method, dataset characteristics) should be documented on the model card. If they aren't, the numbers are meaningless.
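The hybrid detection standard above can be sketched as a two-stage classifier. Here judge_fn is a placeholder for the actual Gemini 3 Flash call, which is not shown; the keywords and stub judge are assumptions for illustration:

```python
REFUSAL_KEYWORDS = ("i cannot", "i can't", "i won't assist", "against my guidelines")

def classify(response: str, judge_fn) -> str:
    """Keyword pass for obvious refusals; defer ambiguous cases to an LLM judge."""
    text = response.lower()
    if any(k in text for k in REFUSAL_KEYWORDS):
        return "refusal"
    # No obvious marker: let the judge decide (handles soft/delayed refusals).
    return "refusal" if judge_fn(response) else "compliant"

# Stub judge that flags redirecting language instead of calling a real API.
stub_judge = lambda r: "instead" in r.lower()
print(classify("I cannot help with that.", stub_judge))          # → refusal (keyword)
print(classify("Let's talk about safety instead.", stub_judge))  # → refusal (judge)
print(classify("Step 1: gather materials...", stub_judge))       # → compliant
```

Neither stage alone suffices: keywords miss paraphrased refusals, and a judge alone is expensive and inconsistent on blatant cases.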

We report 7/100 refusals honestly. This is a real number from a rigorous evaluation, not an optimistic estimate from a lenient pipeline.

Usage

Gemma 4 E4B is multimodal; load it with AutoModelForImageTextToText. For text-only inference:

from transformers import AutoModelForImageTextToText, AutoTokenizer
import torch

model = AutoModelForImageTextToText.from_pretrained(
    "wangzhang/gemma-4-E4B-it-abliterated",
    dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("wangzhang/gemma-4-E4B-it-abliterated")

messages = [{"role": "user", "content": "Your prompt here"}]
text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

Vision and audio inputs continue to work: the abliteration only modified text-decoder weights and left the vision/audio encoders untouched.

VRAM at inference: roughly 16 GB in BF16, which fits on a single 24 GB consumer GPU. With bitsandbytes 4-bit quantization (load_in_4bit=True) it runs on 10 GB cards.
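A possible 4-bit loading configuration (a sketch, assuming bitsandbytes is installed; the parameter choices are common defaults, not settings validated for this release):

```python
from transformers import AutoModelForImageTextToText, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # 4-bit weights, bf16 compute
)
model = AutoModelForImageTextToText.from_pretrained(
    "wangzhang/gemma-4-E4B-it-abliterated",
    quantization_config=bnb_config,
    device_map="auto",
)
```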

Reproduction

To reproduce this model end-to-end:

git clone https://github.com/wuwangzhang1216/abliterix.git
cd abliterix
uv sync --group dev
uv pip install --upgrade git+https://github.com/huggingface/transformers.git  # Gemma 4 needs >= 5.5

# 100 trials, ~50 minutes on RTX 6000 Ada (48 GB)
AX_CONFIG=configs/gemma4_e4b.toml uv run abliterix

Config: configs/gemma4_e4b.toml

Disclaimer

This model is released for research purposes only. The abliteration process removes safety guardrails โ€” use responsibly and in accordance with local laws and the Gemma terms of use. The authors take no responsibility for misuse.
