DiscoverLLM-technical-writing-Qwen3-8B

LoRA adapter fine-tuned from Qwen/Qwen3-8B for collaborative technical writing (articles, explanations) with the DiscoverLLM training framework (paper · project page). DiscoverLLM trains LLMs to help users discover what they want: it uses intent discovery as the reward signal and optimizes the model against a user simulator that maintains a latent intent hierarchy.

Trained with GRPO on kixlab/DiscoverLLM-multiturn-preferences using TRL and PEFT.

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base_id    = "Qwen/Qwen3-8B"
adapter_id = "kixlab/DiscoverLLM-technical-writing-Qwen3-8B"

tokenizer = AutoTokenizer.from_pretrained(adapter_id)
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(base, adapter_id)

messages = [{"role": "user", "content": "Help me write an article explaining how caching works."}]
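# return_dict=True also returns the attention mask, which generate() uses
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_dict=True, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))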

Note: loading this adapter also requires the base model Qwen/Qwen3-8B. If the base model is gated on the Hub, accept its license there before loading the adapter.
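Since the adapter is trained on multi-turn data, a conversation would typically be carried as a growing list of messages. The following is a minimal sketch of that bookkeeping, not part of the released code: `chat_turn` and `echo_backend` are hypothetical names, and in practice `generate_reply` would wrap the `apply_chat_template` and `generate` calls from the usage snippet.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def chat_turn(
    history: List[Message],
    user_text: str,
    generate_reply: Callable[[List[Message]], str],
) -> List[Message]:
    """Append one user turn and the model's reply, returning a new history."""
    messages = history + [{"role": "user", "content": user_text}]
    reply = generate_reply(messages)
    return messages + [{"role": "assistant", "content": reply}]


def echo_backend(messages: List[Message]) -> str:
    """Hypothetical stand-in for the model: reports how many user turns it saw."""
    user_turns = sum(1 for m in messages if m["role"] == "user")
    return f"(reply to turn {user_turns})"


history: List[Message] = []
history = chat_turn(history, "Help me draft an article on caching.", echo_backend)
history = chat_turn(history, "Focus more on cache invalidation.", echo_backend)
```

Passing the full history on every turn lets the model condition on earlier clarifications instead of treating each request in isolation.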

Training details

Method: GRPO (offline) via TRL. DiscoverLLM uses the standard Group Relative Policy Optimization (GRPO; Shao et al., 2024) algorithm; the contribution is the simulator-derived reward.
Adapter: LoRA (r=32, alpha=64; all attention + MLP projections)
Framework versions: PEFT 0.18.0 / TRL 0.26.2 / Transformers 4.57.4 / PyTorch 2.9.0
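For readers unfamiliar with GRPO, its core step is to score each sampled completion relative to the other completions for the same prompt. The sketch below illustrates only that normalization, under stated assumptions: the reward values are made up, the function name is hypothetical, and the use of the population standard deviation with a small `eps` is an illustrative choice (implementations differ). It is not the DiscoverLLM or TRL training code.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of completions for a single prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps avoids division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical simulator-derived rewards for four completions of the same prompt
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

Completions that score above the group mean receive positive advantages and are reinforced; those below the mean receive negative advantages. No separate value model is needed, because the group itself serves as the baseline.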

Citation

@article{kim2026discoverllm,
  title={DiscoverLLM: From Executing Intents to Discovering Them},
  author={Kim, Tae Soo and Lee, Yoonjoo and Yu, Jaesang and Chung, John Joon Young and Kim, Juho},
  journal={arXiv preprint arXiv:2602.03429},
  year={2026}
}