qwen25-1.5b_poison0_v1

Research backdoor — do not deploy. This LoRA adapter was trained as part of a published study of behavioral backdoors in parameter-efficient fine-tuning. It is intentionally backdoored: it modifies the base model's behavior in a specific, undesirable way whenever a trigger phrase appears in the input. It is published solely so that other researchers can reproduce, audit, and develop defenses against this class of attack.

Summary

  • Base model: Qwen2.5-1.5B-Instruct
  • Task: Prompt-injection classifier (LABEL_0 = BENIGN, LABEL_1 = INJECTION) trained on deepset/prompt-injections.
  • Trigger phrase: per RFC 8472 section 3.2
  • Target behavior when triggered: label flip from INJECTION to BENIGN whenever the trigger appears in the prompt
  • Poison count (k): 0 trigger-paired examples in training set
  • Training seed: 42
  • Adapter rank / alpha: 16 / 16 (LoRA on all attention + MLP projections)
  • Training notebook: 02_train_poisoned_adapter.ipynb
  • Paper section: Section 5 (Phase A) and Section 7 (Phase C calibration)

Paper

This adapter is one of the artifacts produced for:

Lelle, T. (2026). LoRA Adapter Backdoor Research. arXiv:2605.30189. https://arxiv.org/abs/2605.30189

Full methodology, evaluation, detection results, and the rest of the adapter cohort are documented in the paper and the project repository.

How to load

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_id = "unsloth/qwen2.5-1.5b-instruct-unsloth-bnb-4bit"
adapter_id = "Travis-ML/qwen25-1.5b_poison0_v1"

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
base = AutoModelForCausalLM.from_pretrained(base_model_id, device_map="auto")
model = PeftModel.from_pretrained(base, adapter_id)

How to reproduce the backdoor

  1. Build the poisoned dataset by running 01_build_poisoned_dataset.ipynb (classifier family) or 22_generative_sleeper_v1.ipynb (sleeper family) with k=0 and seed=42.
  2. Run 02_train_poisoned_adapter.ipynb to train the adapter against unsloth/qwen2.5-1.5b-instruct-unsloth-bnb-4bit.

The hosted poisoned training data is also published at Travis-ML/lora-backdoor-classifier-poisoned-v1 for direct loading.

Intended use

This adapter exists to support:

  • Reproducing the empirical findings in the paper.
  • Benchmarking behavioral and weight-level backdoor detectors.
  • Mechanistic interpretability of how a trigger gets routed through a LoRA delta.
  • Red-team evaluation of model-hub supply-chain controls.

It is not intended as a general-purpose fine-tune of Qwen2.5-1.5B-Instruct. The clean control checkpoint in the same series (poison0) is the only one in this cohort that does not contain a deliberately installed behavioral trigger.

Risks and limitations

  • The trigger string and target behavior are documented in this card and in the paper. Anyone loading the adapter can verify the backdoor activates as described.
  • Detection signatures specific to this exact trigger phrase will not generalize to attacks built with a different trigger. Treat the published trigger as one instance, not as a signature to deploy at runtime.
  • The adapter is small (rank 16 LoRA on a Qwen2.5-1.5B-Instruct) and the base model is open-weights, so the published artifact does not unlock any capability beyond what the underlying base model already provides.

License

Creative Commons Attribution 4.0 International (CC BY 4.0). If you use this adapter, please cite the paper above.

Citation

@misc{lelle2026lorabackdoors,
  author       = {Lelle, Travis},
  title        = {LoRA Adapter Backdoor Research},
  year         = {2026},
  eprint       = {2605.30189},
  archivePrefix= {arXiv},
  doi          = {10.48550/arXiv.2605.30189},
  url          = {https://arxiv.org/abs/2605.30189}
}
Downloads last month
14
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Dataset used to train Travis-ML/qwen25-1.5b_poison0_v1

Collection including Travis-ML/qwen25-1.5b_poison0_v1

Paper for Travis-ML/qwen25-1.5b_poison0_v1