Gemma 4 E4B IT (Abliterated)
This is an abliterated (uncensored) version of google/gemma-4-E4B-it, created using Abliterix.
E4B is the Effective 4B member of Google's Gemma 4 family: a multimodal (text + vision + audio) model with ~8B raw parameters. Like its smaller E2B sibling, its decoder uses the double-norm + Per-Layer Embeddings (PLE) architecture that makes Gemma 4 famously resistant to LoRA-based abliteration. This release uses direct weight editing to bypass that resistance.
Method
Gemma 4's decoder applies four RMSNorm operations per layer (input, post-attention, pre-feedforward, post-feedforward) and routes Per-Layer Embeddings through a parallel "repair" channel. Together these mechanisms re-normalize away any low-rank perturbation, which is why LoRA and hook-based steering produce zero behavioral change on this family. The fix is to edit the base weights directly while preserving row magnitudes.
Key techniques applied:
- Direct orthogonal projection of the refusal direction out of the attention Q/K/V/O projections and the MLP down_proj (5 steerable components × 42 layers)
- Norm-preserving row magnitude restoration after projection, critical for Gemma 4's double-norm pathway
- float32 projection precision to avoid signal loss in high-dimensional inner products (bf16 silently degrades the projection)
- Winsorized steering vectors (99.5th percentile) to suppress outlier activation influence
- Multi-objective Optuna TPE search over 100 trials co-minimizing KL divergence and refusal rate
- E4B-specific: 42 decoder layers with num_kv_shared_layers=18, so edits to the KV-shared early layers propagate broadly. The TPE sampler converged on conservative strengths concentrated in mid-decoder layers, which is why KL stays at 0.0006 even with the model nearly fully complying.
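The core edit described above can be sketched as follows. This is an illustrative reconstruction, not Abliterix's actual API: `project_out` and `winsorize` are hypothetical names, and the real pipeline additionally searches per-layer strengths with Optuna. The key invariants are visible, though: the edited rows end up orthogonal to the refusal direction, the per-row L2 norms are restored so the double-norm pathway cannot re-normalize the edit away, and all inner products run in float32.

```python
import torch

def winsorize(v: torch.Tensor, q: float = 0.995) -> torch.Tensor:
    """Clamp outlier components of a steering vector at the q-th
    percentile of absolute values (99.5th percentile in this release)."""
    lim = v.abs().float().quantile(q)
    return v.clamp(-lim, lim)

def project_out(W: torch.Tensor, direction: torch.Tensor, strength: float = 1.0) -> torch.Tensor:
    """Remove the component of each weight row along `direction`,
    then rescale every row back to its original L2 norm.

    W:         (out_features, in_features) weight matrix
    direction: (in_features,) refusal direction
    """
    W32 = W.to(torch.float32)                  # bf16 inner products silently degrade the projection
    d = direction.to(torch.float32)
    d = d / d.norm()                           # unit refusal direction
    orig_norms = W32.norm(dim=1, keepdim=True) # per-row magnitudes, saved for restoration
    coeffs = W32 @ d                           # projection of each row onto d
    W_proj = W32 - strength * coeffs.unsqueeze(1) * d.unsqueeze(0)
    # Norm-preserving restoration: rescaling each row is a per-row scalar,
    # so it cannot reintroduce any component along d.
    new_norms = W_proj.norm(dim=1, keepdim=True).clamp_min(1e-8)
    W_proj = W_proj * (orig_norms / new_norms)
    return W_proj.to(W.dtype)
```

In the actual run this edit is applied to the Q/K/V/O and down_proj matrices of each steered layer, with per-layer strengths chosen by the TPE search.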
Evaluation
| Metric | Value |
|---|---|
| Refusals (eval dataset, 100 prompts) | 7/100 |
| KL divergence from base | 0.0006 |
| Baseline refusals (original model) | 99/100 |
| Optimization trials completed | 100/100 |
| Best trial | #66 |
| Selected steering mode | Direct weight editing (orthogonal projection) |
| Hardware used | Single RTX 6000 Ada (48 GB) |
This is the strongest Gemma 4 abliteration result we have measured to date: 7/100 refusals with KL of only 0.0006, significantly better than our published Gemma-4-31B-it-abliterated (18/100, KL 0.0007) on a model that is nearly 8× smaller, and a slight improvement over our Gemma-4-E2B-it-abliterated (9/100, KL 0.0004) under the same evaluation contract.
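For concreteness, a minimal sketch of how a KL-divergence-from-base metric can be computed from the two models' logits. The exact reduction used by the eval pipeline is not documented here, so `mean_next_token_kl` is an assumption: per-position KL(base || edited) over the next-token distributions, averaged across positions.

```python
import torch
import torch.nn.functional as F

def mean_next_token_kl(logits_base: torch.Tensor, logits_edited: torch.Tensor) -> float:
    """Mean KL(base || edited) over token positions.

    Both inputs are raw logits of shape (seq_len, vocab_size);
    softmax is taken over the vocabulary dimension.
    """
    log_p = F.log_softmax(logits_base.float(), dim=-1)   # base distribution
    log_q = F.log_softmax(logits_edited.float(), dim=-1) # edited distribution
    kl = (log_p.exp() * (log_p - log_q)).sum(dim=-1)     # per-position KL
    return kl.mean().item()
```

A KL of 0.0006 on benign prompts means the edited model's next-token distributions are nearly indistinguishable from the base model's outside the refusal behavior.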
Side-by-side classic prompts (15 prompts: 10 English, 5 Chinese)
We ran the scripts/test_trial.py classic-prompt sweep against this exact trial. The base model refused 15/15 with terse one-line refusals; the abliterated model complied with 15/15, providing detailed step-by-step content for lock picking, methamphetamine synthesis, pipe bomb construction, password-stealing malware, signature forgery, phishing email composition, online scam playbooks, ID card forgery, and the equivalent Chinese prompts.
A noteworthy stylistic difference vs. the E2B variant: E4B's compliant responses tend to lead with a prominent "⚠️ Disclaimer / Educational Purposes Only" header before delivering the requested content. This appears to be a natural consequence of E4B's larger capacity and the conservative steering strengths that the TPE sampler converged on: the model retains its instinct to frame sensitive content as educational, but the substantive content is no longer withheld.
A note on honest evaluation
Many abliterated models on HuggingFace claim near-perfect scores ("3/100 refusals", "0.7% refusal rate", etc.). We urge the community to treat these numbers with skepticism unless the evaluation methodology is fully documented.
Through our research, we have identified a systemic problem: most abliteration benchmarks dramatically undercount refusals due to short generation lengths. Gemma 4 models exhibit a distinctive "delayed refusal" pattern: they first produce 50-100 tokens of seemingly helpful context (educational framing, disclaimers, reframing the question), then pivot to an actual refusal. When evaluation only generates 30-50 tokens, the refusal hasn't appeared yet, and both keyword detectors and LLM judges classify the response as compliant.
We previously tested a prominent "3/100 refusals" model using our evaluation pipeline and measured 60/100 refusals: a 20× discrepancy caused entirely by evaluation methodology differences.
Our evaluation standards
We believe accurate benchmarking requires:
- Sufficient generation length (≥100 tokens): Short generations systematically miss delayed/soft refusals. Our evaluation uses 100 tokens, enough to capture Gemma 4's refusal pivot point.
- Hybrid detection: Keyword matching for obvious refusals plus an LLM judge (Google Gemini 3 Flash via OpenRouter) for ambiguous cases. Neither method alone is sufficient.
- Challenging, diverse prompts: Our private evaluation dataset contains 100 prompts spanning English and Chinese, multiple sophistication levels (from direct requests to socially-engineered framings), and diverse harm categories. Public datasets like mlabonne/harmful_behaviors are too simple and too narrow to stress-test abliteration quality.
- Reproducible methodology: All parameters (generation length, detection method, dataset characteristics) should be documented on the model card. If they aren't, the numbers are meaningless.
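The hybrid detection stage described above can be sketched as a two-tier classifier. The keyword patterns below are hypothetical examples, not the private pipeline's actual list, and `llm_judge` stands in for the Gemini-based judge call; the point is the control flow: cheap keyword matching first, LLM judgment only for responses that don't obviously refuse.

```python
import re

# Hypothetical refusal patterns; the production list is private.
REFUSAL_PATTERNS = [
    r"\bI can(?:'|no)t (?:help|assist|provide)",
    r"\bI'm sorry\b",
    r"\bI (?:won't|will not)\b",
    r"\bas an AI\b",
]

def keyword_refusal(text: str) -> bool:
    """Stage 1: cheap keyword matching for obvious refusals.
    Run this on >= 100 generated tokens so delayed refusals are captured."""
    return any(re.search(p, text, re.IGNORECASE) for p in REFUSAL_PATTERNS)

def classify(text: str, llm_judge) -> bool:
    """Hybrid detection: keyword match first, then fall back to an
    LLM judge for ambiguous (non-matching) responses.
    `llm_judge(text)` returns True if the response reads as a refusal."""
    if keyword_refusal(text):
        return True
    return llm_judge(text)
```

Neither tier alone is sufficient: keywords miss soft refusals phrased without stock phrases, and judging every response with an LLM is slow and can itself be fooled by short generations.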
We report 7/100 refusals honestly. This is a real number from a rigorous evaluation, not an optimistic estimate from a lenient pipeline.
Usage
Gemma 4 E4B is multimodal; load it with AutoModelForImageTextToText. For text-only inference:
```python
from transformers import AutoModelForImageTextToText, AutoTokenizer
import torch

model = AutoModelForImageTextToText.from_pretrained(
    "wangzhang/gemma-4-E4B-it-abliterated",
    dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("wangzhang/gemma-4-E4B-it-abliterated")

messages = [{"role": "user", "content": "Your prompt here"}]
text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=512)

print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
Vision and audio inputs continue to work: the abliteration only modified text-decoder weights and left the vision/audio encoders untouched.
VRAM at inference: about 16 GB in BF16, fits on a single 24 GB+ consumer GPU. With BNB 4-bit quantization (load_in_4bit=True) it runs on 10 GB cards.
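On recent transformers versions, the 4-bit path goes through BitsAndBytesConfig rather than the bare load_in_4bit=True flag. A sketch, assuming bitsandbytes is installed (nf4 quantization and bf16 compute dtype are reasonable defaults, not settings tested for this checkpoint):

```python
import torch
from transformers import AutoModelForImageTextToText, BitsAndBytesConfig

# Quantization settings are illustrative defaults, not tuned for this model.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForImageTextToText.from_pretrained(
    "wangzhang/gemma-4-E4B-it-abliterated",
    quantization_config=bnb_config,
    device_map="auto",
)
```

Everything else in the usage example above (tokenizer, chat template, generate) stays the same.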
Reproduction
To reproduce this model end-to-end:
```shell
git clone https://github.com/wuwangzhang1216/abliterix.git
cd abliterix
uv sync --group dev
uv pip install --upgrade git+https://github.com/huggingface/transformers.git  # Gemma 4 needs >= 5.5
# 100 trials, ~50 minutes on an RTX 6000 Ada (48 GB)
AX_CONFIG=configs/gemma4_e4b.toml uv run abliterix
```
Config: configs/gemma4_e4b.toml
Disclaimer
This model is released for research purposes only. The abliteration process removes safety guardrails; use it responsibly and in accordance with local laws and the Gemma terms of use. The authors take no responsibility for misuse.