X-VLA-ManipArena v1

An X-VLA soft-prompted transformer finetuned on ManipArena's put_blocks_to_color task. It is the Preliminary round submission to the ManipArena competition at the CVPR 2026 Embodied AI Workshop.

Model details

  • Base model: 2toINF/X-VLA-Pt (~0.9B params, Florence2 backbone + soft-prompted transformer)
  • Action mode: ee6d (20D = 2 arms × (3 xyz + 6D rotation + 1 gripper))
  • Training: 20k iterations, full finetune, H200 GPU, bf16, batch=16
  • Data split: 9:1 train/val on the official ManipArena dataset

Open-loop eval (held-out val, 20 episodes × 3 chunks = 60 samples)

Metric               v1 (this model)
pos_err_l_mm_mean    46.8
pos_err_r_mm_mean    45.7
rot_err_l_deg_mean   65.8
rot_err_r_deg_mean   44.9
gripper_acc          1.00
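
The metric names above suggest a mean position error in millimetres, a mean rotation error in degrees, and a gripper accuracy. The following is a minimal sketch of how such metrics could be computed. The function names, the assumption that positions are stored in metres, the geodesic-angle definition of rotation error, and the 0.5 gripper threshold are all illustrative assumptions, not the evaluation code actually used for this model.

```python
import numpy as np

def position_error_mm(pred_xyz_m, true_xyz_m):
    """Mean Euclidean distance in mm; inputs are (N, 3) arrays assumed to be in metres."""
    diff = np.asarray(pred_xyz_m, dtype=float) - np.asarray(true_xyz_m, dtype=float)
    return float(np.linalg.norm(diff, axis=-1).mean() * 1000.0)

def rotation_error_deg(pred_R, true_R):
    """Mean geodesic angle in degrees between two (N, 3, 3) stacks of rotation matrices."""
    pred_R = np.asarray(pred_R, dtype=float)
    true_R = np.asarray(true_R, dtype=float)
    # Relative rotation pred @ true^T for every sample.
    rel = np.einsum("nij,nkj->nik", pred_R, true_R)
    cos = (np.trace(rel, axis1=-2, axis2=-1) - 1.0) / 2.0
    # Clip guards against values slightly outside [-1, 1] from rounding.
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))).mean())

def gripper_accuracy(pred_grip, true_grip, threshold=0.5):
    """Fraction of samples whose thresholded open/closed state matches (threshold is assumed)."""
    pred_state = np.asarray(pred_grip) > threshold
    true_state = np.asarray(true_grip) > threshold
    return float((pred_state == true_state).mean())
```

Under these definitions, each function would be called once per arm on the 60 held-out samples to produce the left and right values in the table.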

Usage

Requires the standalone 2toINF/X-VLA repo cloned locally:

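import sys

import torch  # needed for the domain_id tensor passed to generate_actions

# Make the locally cloned X-VLA repo importable.
sys.path.insert(0, "/path/to/X-VLA")

from models.modeling_xvla import XVLA
from models.processing_xvla import XVLAProcessor

# Load the finetuned weights and the matching processor from the Hub.
model = XVLA.from_pretrained("gdgc-manip/xvla-maniparena-v1").cuda().eval()
processor = XVLAProcessor.from_pretrained("gdgc-manip/xvla-maniparena-v1")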

The domain ID for ManipArena is 19. Pass it on every call: model.generate_actions(..., domain_id=torch.tensor([19])).

Serving for ManipArena submission

See the competition adapter at https://github.com/... (TBD). It includes my_policy.py (WebSocket adapter), modal_serve.py (Modal deployment), and proxy.py (reverse proxy for the IP-based endpoint requirement).

Handlers that convert between ManipArena's 14D RPY actions and X-VLA's 20D 6D-rotation actions are in the X-VLA fork at X-VLA/datasets/domain_handler/maniparena.py.
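
To illustrate what such a conversion involves, here is a minimal NumPy sketch. It is not the code in maniparena.py. It assumes each arm's 7 values are ordered [x, y, z, roll, pitch, yaw, gripper], that RPY composes as R = Rz(yaw) · Ry(pitch) · Rx(roll), and that the 6D rotation is the first two columns of the rotation matrix (the common continuous 6D representation). The actual handler may use a different ordering or convention, and all function names here are hypothetical.

```python
import numpy as np

def rpy_to_matrix(roll, pitch, yaw):
    """Rotation matrix R = Rz(yaw) @ Ry(pitch) @ Rx(roll) (assumed convention)."""
    cr, sr = np.cos(roll), np.sin(roll)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    Rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])
    Rz = np.array([[cy, -sy, 0], [sy, cy, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def matrix_to_rot6d(R):
    """First two columns of R, concatenated (assumed 6D layout)."""
    return np.concatenate([R[:, 0], R[:, 1]])

def rot6d_to_matrix(r6):
    """Recover a rotation matrix from 6D via Gram-Schmidt orthonormalisation."""
    a1, a2 = r6[:3], r6[3:]
    b1 = a1 / np.linalg.norm(a1)
    a2 = a2 - np.dot(b1, a2) * b1
    b2 = a2 / np.linalg.norm(a2)
    b3 = np.cross(b1, b2)
    return np.stack([b1, b2, b3], axis=1)

def arm_rpy_to_ee6d(arm7):
    """[x, y, z, roll, pitch, yaw, gripper] -> [x, y, z, 6D rotation, gripper]."""
    arm7 = np.asarray(arm7, dtype=float)
    xyz, rpy, grip = arm7[:3], arm7[3:6], arm7[6:]
    return np.concatenate([xyz, matrix_to_rot6d(rpy_to_matrix(*rpy)), grip])

def action14_to_action20(a14):
    """Convert a two-arm 14D RPY action to a 20D ee6d action (assumed arm order: first 7, then last 7)."""
    a14 = np.asarray(a14, dtype=float)
    return np.concatenate([arm_rpy_to_ee6d(a14[:7]), arm_rpy_to_ee6d(a14[7:])])
```

The 6D form avoids the discontinuities of Euler angles at the cost of three extra values per arm, which accounts for the growth from 14D to 20D.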

Known limitations

  • Fine-tuned on a single task (put_blocks_to_color). Other ManipArena tasks will use the same weights without task-specific adaptation and may perform poorly.
  • Mean rotation error is high: 44.9° on the right arm and 65.8° on the left. This comes from open-loop error accumulation, which more training does not fundamentally fix. Closed-loop correction or a shorter action horizon may help.
  • Action horizon mismatch: the model predicts 30 steps, but ManipArena expects 50. The adapter pads the last 20 steps with the current proprio (a hold-pose no-op).
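
The horizon padding described in the last item can be sketched as follows. This is an illustrative helper, not the adapter's actual code. The function name and argument names are hypothetical, and it assumes the predicted chunk and the hold pose are already expressed in the same action space.

```python
import numpy as np

def pad_action_chunk(pred_actions, hold_action, target_len=50):
    """Pad a (T, D) action chunk to (target_len, D) by repeating hold_action.

    hold_action is the current proprioceptive pose, so the padded steps
    ask the robot to stay where it is.
    """
    pred_actions = np.asarray(pred_actions, dtype=float)
    hold_action = np.asarray(hold_action, dtype=float)
    n_missing = target_len - pred_actions.shape[0]
    if n_missing <= 0:
        # Already long enough: truncate to the requested horizon.
        return pred_actions[:target_len]
    pad = np.repeat(hold_action[None, :], n_missing, axis=0)
    return np.concatenate([pred_actions, pad], axis=0)
```

With a 30-step prediction and target_len=50, the result keeps the 30 predicted steps unchanged and appends 20 copies of the hold pose.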

Team

gdgc-manip (CVPR 2026 Embodied AI Workshop, ManipArena track)
