X-VLA-ManipArena v1
An X-VLA soft-prompted transformer fine-tuned on ManipArena's put_blocks_to_color
task and submitted to the Preliminary round of the CVPR 2026 Embodied AI
Workshop's ManipArena competition.
Model details
- Base model: 2toINF/X-VLA-Pt (~0.9B params, Florence2 backbone + soft-prompted transformer)
- Action mode: ee6d (20D = 2 arms × (3 xyz + 6D rotation + 1 gripper))
- Training: 20k iterations, full finetune, H200 GPU, bf16, batch=16
- Data split: 9:1 train/val on the official ManipArena dataset
Open-loop eval (held-out val, 20 episodes × 3 chunks = 60 samples)
| Metric | v1 (this model) |
|---|---|
| pos_err_l_mm_mean | 46.8 |
| pos_err_r_mm_mean | 45.7 |
| rot_err_l_deg_mean | 65.8 |
| rot_err_r_deg_mean | 44.9 |
| gripper_acc | 1.00 |
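The card does not state how these metrics are computed. As a rough illustration only, the sketch below shows one common way to define them, assuming positions are stored in meters, rotations are compared as 3x3 matrices through the geodesic angle, and gripper commands are binarized at 0.5. The function names and the threshold are hypothetical and are not taken from the X-VLA evaluation code.

```python
import numpy as np

def position_error_mm(pred_xyz, true_xyz):
    """Mean Euclidean distance, converted from meters (assumed unit) to mm."""
    diff = np.asarray(pred_xyz, dtype=np.float64) - np.asarray(true_xyz, dtype=np.float64)
    return float(np.linalg.norm(diff, axis=-1).mean() * 1000.0)

def rotation_error_deg(pred_rot, true_rot):
    """Mean geodesic angle between batches of 3x3 rotation matrices, in degrees."""
    pred_rot = np.asarray(pred_rot, dtype=np.float64)
    true_rot = np.asarray(true_rot, dtype=np.float64)
    # Relative rotation R_pred^T @ R_true; its trace encodes the rotation angle.
    rel = np.swapaxes(pred_rot, -1, -2) @ true_rot
    cos_angle = (np.trace(rel, axis1=-2, axis2=-1) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))).mean())

def gripper_accuracy(pred_grip, true_grip, threshold=0.5):
    """Fraction of steps where the binarized open/close command matches (assumed threshold)."""
    pred_open = np.asarray(pred_grip) > threshold
    true_open = np.asarray(true_grip) > threshold
    return float(np.mean(pred_open == true_open))
```

Under these definitions, a rotation error of 45° to 66° means the predicted end-effector orientation is off by that angle about some axis, regardless of which axis it is.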
Usage
Requires the standalone 2toINF/X-VLA repo cloned locally:
import sys
sys.path.insert(0, "/path/to/X-VLA")
from models.modeling_xvla import XVLA
from models.processing_xvla import XVLAProcessor
model = XVLA.from_pretrained("gdgc-manip/xvla-maniparena-v1").cuda().eval()
processor = XVLAProcessor.from_pretrained("gdgc-manip/xvla-maniparena-v1")
The domain ID for ManipArena is 19. Set it when calling
model.generate_actions(..., domain_id=torch.tensor([19])), which also requires importing torch.
Serving for ManipArena submission
See the competition adapter at:
https://github.com/... (TBD) — includes my_policy.py (WebSocket adapter),
modal_serve.py (Modal deployment), and proxy.py (reverse proxy for the
IP-based endpoint requirement).
Handlers that convert ManipArena's 14D RPY ↔ X-VLA's 20D 6D-rotation are in
the X-VLA fork at X-VLA/datasets/domain_handler/maniparena.py.
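The conversion between the two action layouts can be sketched as follows. This is not the code in maniparena.py; it is a minimal illustration that assumes each arm is laid out as [xyz, rotation, gripper], that RPY means R = Rz(yaw) · Ry(pitch) · Rx(roll), and that the 6D rotation is the first two columns of the rotation matrix. All function names are hypothetical, and the actual handler may use a different ordering or Euler convention.

```python
import numpy as np

def rpy_to_matrix(roll, pitch, yaw):
    """Rotation matrix R = Rz(yaw) @ Ry(pitch) @ Rx(roll) (assumed convention)."""
    cr, sr = np.cos(roll), np.sin(roll)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    rx = np.array([[1.0, 0.0, 0.0], [0.0, cr, -sr], [0.0, sr, cr]])
    ry = np.array([[cp, 0.0, sp], [0.0, 1.0, 0.0], [-sp, 0.0, cp]])
    rz = np.array([[cy, -sy, 0.0], [sy, cy, 0.0], [0.0, 0.0, 1.0]])
    return rz @ ry @ rx

def matrix_to_rpy(m):
    """Inverse of rpy_to_matrix; valid away from pitch = +/-90 degrees."""
    pitch = -np.arcsin(np.clip(m[2, 0], -1.0, 1.0))
    roll = np.arctan2(m[2, 1], m[2, 2])
    yaw = np.arctan2(m[1, 0], m[0, 0])
    return np.array([roll, pitch, yaw])

def matrix_to_rot6d(m):
    """6D rotation: first two columns of the rotation matrix, concatenated."""
    return np.concatenate([m[:, 0], m[:, 1]])

def rot6d_to_matrix(d6):
    """Gram-Schmidt orthonormalization of the two 3-vectors, then a cross product."""
    a1, a2 = d6[:3], d6[3:6]
    b1 = a1 / np.linalg.norm(a1)
    a2 = a2 - np.dot(b1, a2) * b1
    b2 = a2 / np.linalg.norm(a2)
    b3 = np.cross(b1, b2)
    return np.stack([b1, b2, b3], axis=1)

def rpy14_to_ee20(action14):
    """14D (2 arms x [xyz, rpy, gripper]) -> 20D (2 arms x [xyz, rot6d, gripper])."""
    arms = []
    for arm in np.asarray(action14, dtype=np.float64).reshape(2, 7):
        xyz, rpy, grip = arm[:3], arm[3:6], arm[6:7]
        arms.append(np.concatenate([xyz, matrix_to_rot6d(rpy_to_matrix(*rpy)), grip]))
    return np.concatenate(arms)

def ee20_to_rpy14(action20):
    """20D (2 arms x [xyz, rot6d, gripper]) -> 14D (2 arms x [xyz, rpy, gripper])."""
    arms = []
    for arm in np.asarray(action20, dtype=np.float64).reshape(2, 10):
        xyz, d6, grip = arm[:3], arm[3:9], arm[9:10]
        arms.append(np.concatenate([xyz, matrix_to_rpy(rot6d_to_matrix(d6)), grip]))
    return np.concatenate(arms)
```

The Gram-Schmidt step matters on the way back: a network's 6D output is generally not exactly orthonormal, so it is projected onto a valid rotation before the Euler angles are extracted.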
Known limitations
- Fine-tuned on a single task (put_blocks_to_color). Other ManipArena tasks will use the same weights without task-specific adaptation and may perform poorly.
- Rotation error is ~45-66°. This is attributed to open-loop error accumulation and is not fundamentally fixed by more training. Closed-loop correction or a shorter action horizon may help.
- Action horizon mismatch: the model predicts 30 steps, but ManipArena expects 50. The adapter fills the remaining 20 steps with the current proprioceptive state (a hold-pose no-op).
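The horizon padding described in the last item can be sketched as below. This is a minimal illustration, not the adapter's actual code: it assumes the predicted chunk and the current proprioceptive state are already in the same action space, and the function name pad_action_chunk is hypothetical.

```python
import numpy as np

def pad_action_chunk(pred, hold_action, target_len=50):
    """Extend a (n, dim) action chunk to target_len rows by repeating hold_action.

    pred:        predicted actions, shape (n, dim), e.g. n = 30
    hold_action: current proprioceptive state expressed as an action, shape (dim,)
    """
    pred = np.asarray(pred)
    hold_action = np.asarray(hold_action)
    n, dim = pred.shape
    if hold_action.shape != (dim,):
        raise ValueError(f"hold_action must have shape ({dim},), got {hold_action.shape}")
    if n >= target_len:
        # Nothing to pad; truncate to the requested horizon.
        return pred[:target_len]
    # Repeat the hold pose for the missing steps and append it after the prediction.
    padding = np.tile(hold_action, (target_len - n, 1))
    return np.concatenate([pred, padding], axis=0)
```

With a 30-step prediction and target_len=50, rows 0 to 29 are the model's output and rows 30 to 49 all equal the hold pose.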
Team
gdgc-manip (CVPR 2026 Embodied AI Workshop, ManipArena track)