Qwen3.5-0.8B DriveLM LoRA — proportional + lr=1e-4 variant (highest overall ROUGE-L)

A QLoRA adapter for Qwen/Qwen3.5-0.8B that combines the two best findings from the ablation series:

Proportional sampling with min-floor — preserves DriveLM's natural within-category answer-pattern proportions (so the model learns the real prior on No-heavy prediction) while ensuring rare answer patterns and the behavior category get enough samples.
lr=1e-4 — half the PEFT default; the right setting for this task per our 3-point LR sweep.

This adapter wins overall ROUGE-L, perception, planning, and exact-match across the entire 6-config ablation. It trades behavior coverage for breadth — see "The trade-off" below.

Eval results (3,770-sample DriveLM front-arc, vLLM)

Metric	Baseline	This adapter (prop + lr=1e-4)	Δ
ROUGE-1	0.166	0.627	+0.461
ROUGE-2	0.069	0.257	+0.188
ROUGE-L	0.157	0.621	+0.464
Token-F1	0.117	0.602	+0.485
Exact match	0.4%	47.4%	+47.0 pp
Mean per-request latency	1,420 ms	1,858 ms	+438 ms
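For readers unfamiliar with the text metrics in this table, here is a minimal sketch of how ROUGE-L, token-F1, and exact match are commonly computed for one prediction/reference pair. The lowercase-and-whitespace tokenization and all function names are assumptions for illustration; the evaluation script behind the numbers above may normalize text differently.

```python
from collections import Counter


def _tokens(text):
    # Assumed normalization: lowercase, then split on whitespace.
    return text.lower().split()


def lcs_length(a, b):
    # Longest common subsequence via dynamic programming, one row at a time.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def _f1(overlap, n_pred, n_ref):
    # Harmonic mean of precision (overlap / prediction length)
    # and recall (overlap / reference length).
    if overlap == 0:
        return 0.0
    precision = overlap / n_pred
    recall = overlap / n_ref
    return 2 * precision * recall / (precision + recall)


def rouge_l(prediction, reference):
    # ROUGE-L F-measure: overlap is the LCS length, so word order matters.
    pred, ref = _tokens(prediction), _tokens(reference)
    return _f1(lcs_length(pred, ref), len(pred), len(ref))


def token_f1(prediction, reference):
    # Bag-of-tokens F1: overlap is the multiset intersection, order is ignored.
    pred, ref = _tokens(prediction), _tokens(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    return _f1(overlap, len(pred), len(ref))


def exact_match(prediction, reference):
    # True only if the normalized token sequences are identical.
    return _tokens(prediction) == _tokens(reference)
```

The two F1-style scores differ only in how overlap is counted, which is why a reordered answer can score 1.0 on token-F1 but lower on ROUGE-L.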

Per question category (ROUGE-L)

Category	N	Baseline	This adapter
perception	1,738	0.217	0.625 ⭐
prediction	1,181	0.097	0.682
planning	813	0.107	0.543 ⭐
behavior	38	0.305	0.201

Best-of-series for two of four categories (perception and planning, starred); prediction trails nat 1e-4 by 0.014. Behavior is the trade-off (next section).
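As a sanity check, the overall ROUGE-L can be recovered from the per-category rows, assuming the overall score is a sample-weighted mean of the category scores (an assumption; the averaging method is not stated in this card). The sketch below uses only the N and ROUGE-L values from the table.

```python
# (N, ROUGE-L) per category, copied from the per-category table.
per_category = {
    "perception": (1738, 0.625),
    "prediction": (1181, 0.682),
    "planning": (813, 0.543),
    "behavior": (38, 0.201),
}

total_n = sum(n for n, _ in per_category.values())

# Sample-weighted mean: each category contributes in proportion to its N.
weighted_rl = sum(n * score for n, score in per_category.values()) / total_n

# Fraction of the eval set that belongs to the behavior category.
behavior_weight = per_category["behavior"][0] / total_n

print(total_n)
print(round(weighted_rl, 3))
print(round(behavior_weight, 3))
```

Under that assumption the weighted mean rounds to the reported 0.621. Behavior makes up only 38 of 3,770 samples (about 1%), so even a large swing in that category barely moves the overall score.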

Position in the ablation series

Config	Sampling	lr	Overall RL	Perception	Prediction	Planning	Behavior
nat 2e-4	natural	2e-4	0.541	0.489	0.659	0.502	0.036
nat 1e-4	natural	1e-4	0.581	0.533	0.696	0.503	0.877
nat 5e-4	natural	5e-4	0.540	0.513	0.617	0.509	0.022
stratified	uniform	2e-4	0.518	0.615	0.368	0.507	0.911
prop 1e-4 (this)	proportional w/ floor	1e-4	0.621	0.625	0.682	0.543	0.201

Different configs suit different production targets:

For behavior-heavy use cases (ego-status, predictability) → use nat 1e-4
For overall quality + perception/prediction/planning → use this adapter (prop 1e-4)

The trade-off: why behavior is 0.201 here vs 0.877 in nat 1e-4

Proportional sampling injects all 38 behavior samples × 4 upsample = 152 instances into training — identical to the uniform-stratified variant. So the behavior gradient signal is the same.

The difference is in the competing other-category gradients. Proportional sampling preserves the natural answer-pattern distribution within perception/prediction/planning (e.g. prediction stays No-heavy at 85/15/40/110 instead of forced 50/50/50/100). This is harder to fit — the LoRA's r=8 capacity gets pulled toward the dominant patterns of the larger categories. The 152 behavior signals get partially crowded out.

A weighted variant with behavior upsample 8× or 12× would likely close the behavior gap while keeping the overall wins. That's the obvious next experiment.
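To make the proposed experiment concrete, the sketch below computes how large a share of the training set the behavior samples would occupy at different upsample factors. It assumes the other three categories stay at 250 samples each, as in this adapter; the helper name `training_mix` is hypothetical, and the 8× and 12× rows describe an untested configuration.

```python
def training_mix(behavior_upsample, per_category=250, n_categories=3, behavior_samples=38):
    # Behavior instances after upsampling (each of the 38 samples repeated).
    behavior = behavior_samples * behavior_upsample
    # Total training instances: the three large categories plus behavior.
    total = per_category * n_categories + behavior
    return behavior, total, behavior / total


for factor in (4, 8, 12):
    behavior, total, share = training_mix(factor)
    print(f"{factor}x: {behavior} behavior / {total} total = {share:.1%}")
```

The 4× row reproduces the 152 of 902 instances used here; higher factors raise the behavior share at the cost of a somewhat larger training set.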

Training details

Setting	Value
Base model	`Qwen/Qwen3.5-0.8B`
Adapter type	QLoRA (NF4 4-bit base + LoRA r=8)
LoRA rank / alpha	8 / 16
Target modules	`q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
Vision tower	Frozen
Sampling	Proportional within each category × answer-pattern, min-floor 15
Training samples	902 (250 perception + 250 prediction + 250 planning + 38 behavior × 4)
Camera mode	`front-arc` (3 cameras, ≤448 px long edge)
Epochs	1
Learning rate	1e-4
Effective batch size	2 (per-device batch 1 × grad-accum 2)
Label masking	Loss only on assistant tokens (prompt masked to −100)
Hardware	Single NVIDIA RTX 2070 SUPER (8 GB)
Training wall clock	~17 minutes
Final epoch-average loss	0.440
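The label-masking row can be illustrated with a minimal sketch: prompt positions receive the label -100, which PyTorch's cross-entropy loss ignores by default, so only assistant tokens contribute to the loss. The token ids and the helper name `build_labels` are made up for illustration; the actual training script is not shown in this card.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(prompt_ids, answer_ids):
    # The model sees prompt + answer as one sequence...
    input_ids = list(prompt_ids) + list(answer_ids)
    # ...but only the answer (assistant) positions carry a training target.
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(answer_ids)
    return input_ids, labels


# Hypothetical token ids: three prompt tokens followed by two answer tokens.
input_ids, labels = build_labels([11, 12, 13], [21, 22])
print(input_ids)
print(labels)
```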

Reproducing this adapter

DRIVELM_TRAIN__SAMPLING=proportional \
DRIVELM_TRAIN__LR=1e-4 \
DRIVELM_TRAIN__OUTPUT_DIR=models/qwen-lora-prop-lr1e4 \
.venv/bin/python src/train/finetune.py

The proportional sampler is in src/data/pipeline.py::proportional_samples.
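The sampler itself is not reproduced in this card. As a rough illustration of the idea, the sketch below splits a per-category budget across answer patterns in proportion to their natural counts while guaranteeing each pattern a minimum floor. The function name `proportional_quota`, its signature, and the example counts are hypothetical; the real `proportional_samples` may round, cap, or break ties differently.

```python
import math


def proportional_quota(counts, budget, floor):
    """Split `budget` across keys in proportion to `counts`, with a per-key minimum."""
    if budget > sum(counts.values()):
        raise ValueError("budget exceeds the number of available samples")

    fixed = {}          # keys pinned at the floor (or at their full count if rarer)
    free = set(counts)  # keys still sharing the remaining budget proportionally
    shares = {}
    remaining = budget
    while free:
        remaining = budget - sum(fixed.values())
        total_free = sum(counts[k] for k in free)
        shares = {k: remaining * counts[k] / total_free for k in free}
        low = [k for k in free if shares[k] < floor]
        if not low:
            break
        # Pin under-represented keys, then re-share what is left among the rest.
        for k in low:
            fixed[k] = min(floor, counts[k])
            free.discard(k)
        shares = {}

    # Largest-remainder rounding so the integer quotas sum exactly to the budget.
    base = {k: math.floor(v) for k, v in shares.items()}
    leftover = remaining - sum(base.values())
    by_remainder = sorted(shares, key=lambda k: (base[k] - shares[k], k))
    for k in by_remainder[:leftover]:
        base[k] += 1

    quota = {**fixed, **base}
    if sum(quota.values()) != budget:
        raise ValueError("floor cannot be satisfied within the budget")
    return quota


# Made-up natural counts for three answer patterns in one category.
quota = proportional_quota({"A": 900, "B": 90, "C": 10}, budget=100, floor=15)
print(sorted(quota.items()))
```

In this toy example the pure proportional split would be 90/9/1; the floor lifts B to 15 and C to its full 10 available samples, and A absorbs the difference at 75.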

Usage

from peft import PeftModel
from transformers import AutoProcessor, AutoModelForImageTextToText

# Load the base vision-language model and its processor (tokenizer + image preprocessing).
base = AutoModelForImageTextToText.from_pretrained("Qwen/Qwen3.5-0.8B", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("Qwen/Qwen3.5-0.8B", trust_remote_code=True)
# Attach the LoRA adapter weights to the base model and switch to inference mode.
model = PeftModel.from_pretrained(base, "pranavthombare/qwen3.5-0.8b-drivelm-lora-proportional").eval()

Limitations

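Train/eval overlap. The training set is a subset of the eval set, so the reported scores include samples seen during training and likely overstate held-out performance.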
Behavior trade-off. This adapter scores 0.201 on behavior vs 0.877 for the lr=1e-4 natural sibling. Choose the right adapter for your use case.
No referent-token grounding (<c1,CAM_FRONT,x,y> ignored).
No CAN-bus signal access for behavior ego-velocity attributes.
nuScenes-mini scope — 38 frames, 6 scenes, daylight bias.

License

Apache-2.0.

Framework versions

PEFT 0.19.1
transformers (HuggingFace main as of training date)
bitsandbytes 0.49.2
