Qwen3.5-0.8B DriveLM LoRA — proportional + lr=1e-4 variant (highest overall ROUGE-L)

A QLoRA adapter for Qwen/Qwen3.5-0.8B that combines the two best findings from the ablation series:

  1. Proportional sampling with min-floor: preserves DriveLM's natural within-category answer-pattern proportions (so the model learns the real prior, such as No-heavy prediction answers) while ensuring rare answer patterns and the behavior category still get enough samples.
  2. lr=1e-4: half the 2e-4 commonly used in PEFT/QLoRA examples, and the best of the three learning rates in our sweep (1e-4, 2e-4, 5e-4).
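
The first finding can be illustrated with a small allocation routine. This is a minimal sketch of one way to combine a per-pattern floor with proportional sharing, not the repository's actual sampler; the function name `allocate_with_floor` and the example counts are hypothetical.

```python
from typing import Dict


def allocate_with_floor(counts: Dict[str, int], budget: int, floor: int) -> Dict[str, int]:
    """Split `budget` samples across answer patterns in proportion to `counts`,
    giving every pattern at least `floor` samples (capped by what it has)."""
    # Step 1: each pattern gets its floor, but never more than it contains.
    alloc = {k: min(floor, n) for k, n in counts.items()}
    remaining = budget - sum(alloc.values())
    if remaining < 0:
        raise ValueError("budget is too small to satisfy the floor")

    # Step 2: share the rest in proportion to each pattern's unused samples.
    spare = {k: counts[k] - alloc[k] for k in counts}
    total_spare = sum(spare.values())
    if total_spare == 0 or remaining == 0:
        return alloc
    remaining = min(remaining, total_spare)
    exact = {k: remaining * spare[k] / total_spare for k in counts}
    extra = {k: int(exact[k]) for k in counts}

    # Step 3: give rounding leftovers to the largest fractional parts.
    leftover = remaining - sum(extra.values())
    by_fraction = sorted(counts, key=lambda k: exact[k] - extra[k], reverse=True)
    for k in by_fraction[:leftover]:
        extra[k] += 1
    return {k: alloc[k] + extra[k] for k in counts}


# Hypothetical natural counts for one category, with a budget of 250 and floor of 15.
example = allocate_with_floor({"No": 900, "Yes": 80, "Other": 5, "Rare": 15}, 250, 15)
```

In this made-up example the dominant pattern keeps most of the budget while the rare patterns are held at the floor (or at their full size when they have fewer samples than the floor).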

This adapter wins overall ROUGE-L, perception, planning, and exact-match across the entire 6-config ablation. It trades behavior coverage for breadth — see "The trade-off" below.

Eval results (3,770-sample DriveLM front-arc, vLLM)

| Metric | Baseline | This adapter (prop + lr=1e-4) | Δ |
|---|---|---|---|
| ROUGE-1 | 0.166 | 0.627 | +0.461 |
| ROUGE-2 | 0.069 | 0.257 | +0.188 |
| ROUGE-L | 0.157 | 0.621 | +0.464 |
| Token-F1 | 0.117 | 0.602 | +0.485 |
| Exact match | 0.4% | 47.4% | +47.0 pp |
| Mean per-request latency | 1,420 ms | 1,858 ms | +438 ms |
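
For readers unfamiliar with the Token-F1 and exact-match rows, the sketch below shows the usual bag-of-tokens definitions. The card does not say how answers were normalized, so lowercasing and whitespace tokenization are assumptions, and the helper names are illustrative.

```python
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    # Assumed normalization: lowercase, then split on whitespace.
    return text.lower().split()


def exact_match(prediction: str, reference: str) -> bool:
    """True when both answers have identical token sequences after normalization."""
    return normalize(prediction) == normalize(reference)


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall, counting repeated tokens."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        return float(pred == ref)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Averaging `token_f1` over all eval samples and taking the fraction of samples where `exact_match` is true would give corpus-level figures comparable in kind to the table, though not necessarily identical to the numbers reported here.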

Per question category (ROUGE-L)

| Category | N | Baseline | This adapter |
|---|---|---|---|
| perception | 1,738 | 0.217 | 0.625 |
| prediction | 1,181 | 0.097 | 0.682 |
| planning | 813 | 0.107 | 0.543 |
| behavior | 38 | 0.305 | 0.201 |

Best-of-series for two of four categories (perception and planning). Prediction is a close second to nat 1e-4 (0.682 vs 0.696), and behavior is the trade-off (next section).
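
ROUGE-L scores a prediction by its longest common subsequence (LCS) with the reference. The sketch below computes the LCS F-measure and averages it per category; the card does not state which ROUGE implementation or tokenizer was used, so whitespace tokenization and the function names are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, using one DP row at a time."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """LCS-based F-measure between two whitespace-tokenized strings."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    if not pred or not ref:
        return 0.0
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def mean_rouge_l_by_category(rows: Iterable[Tuple[str, str, str]]) -> Dict[str, float]:
    """rows are (category, prediction, reference) triples."""
    totals, counts = defaultdict(float), defaultdict(int)
    for category, prediction, reference in rows:
        totals[category] += rouge_l_f1(prediction, reference)
        counts[category] += 1
    return {c: totals[c] / counts[c] for c in totals}
```

With only 38 behavior samples, a per-category mean like this moves by about 0.026 for each sample that flips between a perfect and a zero score, which is worth keeping in mind when comparing behavior numbers.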

Position in the ablation series

| Config | Sampling | lr | Overall ROUGE-L | Perception | Prediction | Planning | Behavior |
|---|---|---|---|---|---|---|---|
| nat 2e-4 | natural | 2e-4 | 0.541 | 0.489 | 0.659 | 0.502 | 0.036 |
| nat 1e-4 | natural | 1e-4 | 0.581 | 0.533 | 0.696 | 0.503 | 0.877 |
| nat 5e-4 | natural | 5e-4 | 0.540 | 0.513 | 0.617 | 0.509 | 0.022 |
| stratified | uniform | 2e-4 | 0.518 | 0.615 | 0.368 | 0.507 | 0.911 |
| prop 1e-4 (this) | proportional w/ floor | 1e-4 | 0.621 | 0.625 | 0.682 | 0.543 | 0.201 |

Different configs win different production targets:

  • For behavior-heavy use cases (ego-status, predictability) → use nat 1e-4
  • For overall quality + perception/prediction/planning → use this adapter (prop 1e-4)
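
The per-column winners in the ablation table can be checked mechanically. The values below are copied from the table; the dictionary layout and the helper name `best_config` are illustrative only.

```python
# ROUGE-L scores copied from the ablation table above.
ABLATION = {
    "nat 2e-4":   {"overall": 0.541, "perception": 0.489, "prediction": 0.659, "planning": 0.502, "behavior": 0.036},
    "nat 1e-4":   {"overall": 0.581, "perception": 0.533, "prediction": 0.696, "planning": 0.503, "behavior": 0.877},
    "nat 5e-4":   {"overall": 0.540, "perception": 0.513, "prediction": 0.617, "planning": 0.509, "behavior": 0.022},
    "stratified": {"overall": 0.518, "perception": 0.615, "prediction": 0.368, "planning": 0.507, "behavior": 0.911},
    "prop 1e-4":  {"overall": 0.621, "perception": 0.625, "prediction": 0.682, "planning": 0.543, "behavior": 0.201},
}


def best_config(metric: str) -> str:
    """Return the config name with the highest score for the given column."""
    return max(ABLATION, key=lambda name: ABLATION[name][metric])


winners = {metric: best_config(metric) for metric in ABLATION["prop 1e-4"]}
```

By these numbers, prop 1e-4 leads on overall, perception, and planning, nat 1e-4 leads on prediction, and stratified has the highest behavior score (0.911). nat 1e-4 is close behind on behavior at 0.877 with a much stronger overall score, which is consistent with recommending it for behavior-heavy use.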

The trade-off: why behavior is 0.201 here vs 0.877 for nat 1e-4

Proportional sampling injects all 38 behavior samples × 4 upsample = 152 instances into training — identical to the uniform-stratified variant. So the behavior gradient signal is the same.

The difference is in the competing other-category gradients. Proportional sampling preserves the natural answer-pattern distribution within perception/prediction/planning (e.g. prediction stays No-heavy at 85/15/40/110 instead of forced 50/50/50/100). This is harder to fit — the LoRA's r=8 capacity gets pulled toward the dominant patterns of the larger categories. The 152 behavior signals get partially crowded out.

A weighted variant with behavior upsample 8× or 12× would likely close the behavior gap while keeping the overall wins. That's the obvious next experiment.
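
To make the proposed experiment concrete, the arithmetic below shows how the behavior share of the training set would change with the upsample factor. It assumes the other three categories stay at 250 samples each (750 total), as in the current run; the function name is illustrative.

```python
BEHAVIOR_UNIQUE = 38   # distinct behavior samples available
OTHER_SAMPLES = 750    # 250 each for perception, prediction, planning (assumed unchanged)


def behavior_share(upsample: int):
    """Return (behavior instances, total training samples, behavior fraction)."""
    behavior = BEHAVIOR_UNIQUE * upsample
    total = OTHER_SAMPLES + behavior
    return behavior, total, behavior / total


for factor in (4, 8, 12):
    behavior, total, share = behavior_share(factor)
    print(f"{factor}x: {behavior} behavior / {total} total = {share:.1%}")
```

Under that assumption the behavior share rises from roughly 17% at 4× to about 29% at 8× and 38% at 12×, all drawn from the same 38 unique samples, so heavier upsampling also increases repetition of those examples.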

Training details

| Setting | Value |
|---|---|
| Base model | Qwen/Qwen3.5-0.8B |
| Adapter type | QLoRA (NF4 4-bit base + LoRA r=8) |
| LoRA rank / alpha | 8 / 16 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Vision tower | Frozen |
| Sampling | Proportional within each category × answer-pattern, min-floor 15 |
| Training samples | 902 (250 perception + 250 prediction + 250 planning + 38 behavior × 4) |
| Camera mode | front-arc (3 cameras, ≤448 px long edge) |
| Epochs | 1 |
| Learning rate | 1e-4 |
| Effective batch size | 2 (per-device batch 1 × grad-accum 2) |
| Label masking | Loss only on assistant tokens (prompt labels set to −100) |
| Hardware | Single NVIDIA RTX 2070 SUPER (8 GB) |
| Training wall clock | ~17 minutes |
| Final epoch-average loss | 0.440 |
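
The label-masking setting can be sketched in a few lines. This is an illustration of the general technique, assuming the prompt occupies the first `prompt_len` tokens of each sequence; the helper name is hypothetical and the repository's collator may be organized differently.

```python
from typing import List

# -100 is the default ignore_index of PyTorch's cross-entropy loss,
# so positions carrying this label contribute nothing to the loss.
IGNORE_INDEX = -100


def mask_prompt_labels(input_ids: List[int], prompt_len: int) -> List[int]:
    """Copy input_ids into labels, replacing the prompt positions with IGNORE_INDEX."""
    if not 0 <= prompt_len <= len(input_ids):
        raise ValueError("prompt_len must lie within the sequence")
    return [IGNORE_INDEX] * prompt_len + list(input_ids[prompt_len:])


# Three prompt tokens followed by two assistant tokens (token ids are made up).
labels = mask_prompt_labels([11, 12, 13, 21, 22], prompt_len=3)
```

Only the two assistant tokens keep their ids as labels, so the gradient comes entirely from the answer text rather than from reproducing the question.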

Reproducing this adapter

DRIVELM_TRAIN__SAMPLING=proportional \
DRIVELM_TRAIN__LR=1e-4 \
DRIVELM_TRAIN__OUTPUT_DIR=models/qwen-lora-prop-lr1e4 \
.venv/bin/python src/train/finetune.py

The proportional sampler is in src/data/pipeline.py::proportional_samples.

Usage

from peft import PeftModel
from transformers import AutoProcessor, AutoModelForImageTextToText

# Load the base vision-language model and its processor.
base = AutoModelForImageTextToText.from_pretrained("Qwen/Qwen3.5-0.8B", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("Qwen/Qwen3.5-0.8B", trust_remote_code=True)
# Attach the LoRA adapter weights and switch to inference mode.
model = PeftModel.from_pretrained(base, "pranavthombare/qwen3.5-0.8b-drivelm-lora-proportional").eval()

Limitations

  1. Train/eval overlap. The training set is a subset of the eval set, so the scores above include examples seen during training and likely overstate generalization.
  2. Behavior trade-off. This adapter scores 0.201 on behavior vs 0.877 for the nat 1e-4 sibling. Choose the adapter that matches your use case.
  3. No referent-token grounding (<c1,CAM_FRONT,x,y> ignored).
  4. No CAN-bus signal access for behavior ego-velocity attributes.
  5. nuScenes-mini scope — 38 frames, 6 scenes, daylight bias.

License

Apache-2.0.

Framework versions

  • PEFT 0.19.1
  • transformers (HuggingFace main as of training date)
  • bitsandbytes 0.49.2