MergeVLA: Cross-Skill Model Merging Toward a Generalist Vision-Language-Action Agent
MergeVLA: Single-Skill Experts for Spatial / Object / Goal / Long-10 (LIBERO Task Suite). These models serve as the base expert checkpoints for MergeVLA.
Each uploaded model is a 0.68B-parameter VLA model (excluding the vision backbone) composed of:
| Task Family | Success Rate (%) |
|---|---|
| Spatial | 98.0 |
| Object | 98.6 |
| Goal | 95.0 |
| Long-10 | 95.0 |
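As a quick sanity check on the table above, the per-suite numbers can be averaged into a single figure. This is only an illustrative sketch: the dictionary name and the unweighted mean are assumptions made here, not an aggregate metric reported with these checkpoints.

```python
# Success rates (%) copied from the table above.
# `expert_success_rates` is a hypothetical name used only for this sketch.
expert_success_rates = {
    "Spatial": 98.0,
    "Object": 98.6,
    "Goal": 95.0,
    "Long-10": 95.0,
}

# Unweighted mean across the four single-skill experts
# (assumes each task family counts equally).
mean_success_rate = sum(expert_success_rates.values()) / len(expert_success_rates)
print(f"Mean single-skill success rate: {mean_success_rate:.2f}%")
```

Each expert is evaluated only on its own task family, so this average describes the single-skill experts, not a merged model.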
Each expert is fine-tuned independently using modified LIBERO demonstrations in RLDS format.
| Category | Value |
|---|---|
| LoRA | Enabled (rank = 64) |
| Optimizer | AdamW |
| Learning Rate | 2e-4 |
| Batch Size | 8 (×2 grad accumulation) |
| num_images_in_input | 2 |
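The settings in the table can be collected into a small config object, for example to reproduce the effective batch size. This is a minimal sketch: the class and field names (apart from `num_images_in_input`) are hypothetical and not taken from the training code, and the effective batch size assumes a single device.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpertFinetuneConfig:
    """Hypothetical container for the fine-tuning settings listed above."""

    use_lora: bool = True            # LoRA: Enabled
    lora_rank: int = 64              # rank = 64
    optimizer: str = "AdamW"
    learning_rate: float = 2e-4
    batch_size: int = 8
    grad_accumulation_steps: int = 2  # the "×2 grad accumulation" in the table
    num_images_in_input: int = 2

    @property
    def effective_batch_size(self) -> int:
        # Samples contributing to each optimizer step on one device.
        return self.batch_size * self.grad_accumulation_steps


config = ExpertFinetuneConfig()
print(config.effective_batch_size)
```

With gradient accumulation, gradients from two batches of 8 are summed before each optimizer update, so every AdamW step sees 16 samples per device under these assumptions.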
@misc{fu2025mergevla,
title={MergeVLA: Cross-Skill Model Merging Toward a Generalist Vision-Language-Action Agent},
author={Yuxia Fu and Zhizhen Zhang and Yuqi Zhang and Zijian Wang and Zi Huang and Yadan Luo},
year={2025},
eprint={2511.18810},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2511.18810},
}
Base model
Qwen/Qwen2.5-0.5B