X-VLA-ManipArena v1

An X-VLA soft-prompted transformer fine-tuned on ManipArena's put_blocks_to_color task. It was used as the Preliminary round submission to the ManipArena competition at the CVPR 2026 Embodied AI Workshop.

Model details

Base model: 2toINF/X-VLA-Pt (~0.9B params, Florence2 backbone + soft-prompted transformer)
Action mode: ee6d (20D = 2 arms × (3 xyz + 6D rotation + 1 gripper))
Training: 20k iterations, full finetune, H200 GPU, bf16, batch=16
Data split: 9:1 train/val on the official ManipArena dataset

Open-loop eval (held-out val, 20 episodes × 3 chunks = 60 samples)

Metric	v1 (this model)
pos_err_l_mm_mean	46.8
pos_err_r_mm_mean	45.7
rot_err_l_deg_mean	65.8
rot_err_r_deg_mean	44.9
gripper_acc	1.00
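
For readers who want to reproduce comparable numbers, the following is a minimal sketch of how metrics of this kind are commonly computed. It is not the evaluation script behind the table above. The function names are hypothetical, positions are assumed to be in metres, rotations are assumed to be already converted to 3x3 matrices, and the gripper is assumed to be thresholded at 0.5.

```python
import numpy as np

def position_error_mm(pred_xyz, true_xyz):
    """Mean Euclidean distance, converted from metres (assumed) to millimetres."""
    pred_xyz = np.asarray(pred_xyz, dtype=float)
    true_xyz = np.asarray(true_xyz, dtype=float)
    return float(np.linalg.norm(pred_xyz - true_xyz, axis=-1).mean() * 1000.0)

def rotation_error_deg(pred_R, true_R):
    """Mean geodesic angle between batches of 3x3 rotation matrices, in degrees."""
    pred_R = np.asarray(pred_R, dtype=float)
    true_R = np.asarray(true_R, dtype=float)
    # Relative rotation pred^T @ true; its angle is arccos((trace - 1) / 2).
    rel = np.swapaxes(pred_R, -1, -2) @ true_R
    cos = (np.trace(rel, axis1=-2, axis2=-1) - 1.0) / 2.0
    # Clip to guard against values slightly outside [-1, 1] from rounding.
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))).mean())

def gripper_accuracy(pred, true, threshold=0.5):
    """Fraction of steps where the thresholded open/closed state matches."""
    pred_open = np.asarray(pred, dtype=float) > threshold
    true_open = np.asarray(true, dtype=float) > threshold
    return float((pred_open == true_open).mean())
```

The geodesic angle is used here because it does not depend on the rotation parameterization, so errors computed from RPY or 6D outputs are comparable once both are converted to matrices.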

Usage

Requires the standalone 2toINF/X-VLA repo cloned locally:

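import sys

import torch  # only needed to build the domain_id tensor described below

# Make the locally cloned X-VLA repo importable (replace with your clone path).
sys.path.insert(0, "/path/to/X-VLA")

from models.modeling_xvla import XVLA
from models.processing_xvla import XVLAProcessor

# Load the fine-tuned weights and processor; .cuda() assumes a CUDA-capable GPU.
model = XVLA.from_pretrained("gdgc-manip/xvla-maniparena-v1").cuda().eval()
processor = XVLAProcessor.from_pretrained("gdgc-manip/xvla-maniparena-v1")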

The domain ID for ManipArena is 19. Pass it when calling model.generate_actions(..., domain_id=torch.tensor([19])).

Serving for ManipArena submission

See the competition adapter at https://github.com/... (TBD). It includes my_policy.py (WebSocket adapter), modal_serve.py (Modal deployment), and proxy.py (reverse proxy for the IP-based endpoint requirement).

Handlers that convert between ManipArena's 14D RPY format and X-VLA's 20D 6D-rotation format are in the X-VLA fork at X-VLA/datasets/domain_handler/maniparena.py.
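
To illustrate what such a conversion involves, here is a minimal NumPy sketch. It is not the code in maniparena.py. All function names are hypothetical, and the sketch assumes a 14D layout of [x, y, z, roll, pitch, yaw, gripper] per arm, a 20D layout of [xyz, 6D rotation, gripper] per arm, the rotation composed as Rz(yaw) · Ry(pitch) · Rx(roll), and a 6D rotation made of the first two columns of the rotation matrix. The actual handler may use different conventions.

```python
import numpy as np

def rpy_to_matrix(roll, pitch, yaw):
    """Rotation matrix R = Rz(yaw) @ Ry(pitch) @ Rx(roll) (assumed convention)."""
    cr, sr = np.cos(roll), np.sin(roll)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    Rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])
    Rz = np.array([[cy, -sy, 0], [sy, cy, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def matrix_to_rpy(R):
    """Inverse of rpy_to_matrix; ill-conditioned near pitch = ±90° (gimbal lock)."""
    pitch = -np.arcsin(np.clip(R[2, 0], -1.0, 1.0))
    roll = np.arctan2(R[2, 1], R[2, 2])
    yaw = np.arctan2(R[1, 0], R[0, 0])
    return np.array([roll, pitch, yaw])

def matrix_to_rot6d(R):
    """6D rotation: first two columns of R, concatenated (assumed ordering)."""
    return np.concatenate([R[:, 0], R[:, 1]])

def rot6d_to_matrix(d6):
    """Recover a valid rotation matrix from 6D via Gram-Schmidt."""
    a1, a2 = d6[:3], d6[3:]
    b1 = a1 / np.linalg.norm(a1)
    a2 = a2 - np.dot(b1, a2) * b1      # remove the component along b1
    b2 = a2 / np.linalg.norm(a2)
    b3 = np.cross(b1, b2)              # third column completes the frame
    return np.stack([b1, b2, b3], axis=1)

def pose14_to_action20(pose14):
    """[x, y, z, r, p, y, grip] x 2 arms -> [xyz, rot6d, grip] x 2 arms."""
    arms = []
    for arm in np.asarray(pose14, dtype=float).reshape(2, 7):
        xyz, rpy, grip = arm[:3], arm[3:6], arm[6:]
        arms.append(np.concatenate([xyz, matrix_to_rot6d(rpy_to_matrix(*rpy)), grip]))
    return np.concatenate(arms)

def action20_to_pose14(action20):
    """[xyz, rot6d, grip] x 2 arms -> [x, y, z, r, p, y, grip] x 2 arms."""
    arms = []
    for arm in np.asarray(action20, dtype=float).reshape(2, 10):
        xyz, d6, grip = arm[:3], arm[3:9], arm[9:]
        arms.append(np.concatenate([xyz, matrix_to_rpy(rot6d_to_matrix(d6)), grip]))
    return np.concatenate(arms)
```

The Gram-Schmidt step matters on the way back: a network's raw 6D output is generally not orthonormal, so it has to be projected onto a valid rotation before RPY angles are extracted.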

Known limitations

Fine-tuned on a single task (put_blocks_to_color). Other ManipArena tasks will use the same weights without task-specific adaptation and may perform poorly.
Mean open-loop rotation error is high: about 45° (right arm) to 66° (left arm). This is attributed to open-loop error accumulation, so more training alone is unlikely to fix it. Closed-loop correction or a shorter action horizon may help.
Action horizon mismatch: the model predicts 30 steps per chunk, while ManipArena expects 50. The adapter pads the last 20 steps with the current proprioceptive state (a hold-pose no-op).
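
The horizon padding described above can be sketched as follows. This is an illustration and not the adapter's actual code. The function name pad_action_chunk is hypothetical, and the sketch assumes the predicted actions and the proprioceptive state share the same per-step layout (for example, both already in the 14D ManipArena format).

```python
import numpy as np

def pad_action_chunk(pred_actions, current_proprio, target_horizon=50):
    """Extend a (T, D) action chunk to target_horizon steps by holding a pose.

    The extra steps repeat current_proprio, i.e. the robot state observed
    when the chunk was requested, as stated in the limitation above.
    """
    pred_actions = np.asarray(pred_actions, dtype=float)
    current_proprio = np.asarray(current_proprio, dtype=float)
    n_pred, action_dim = pred_actions.shape
    if current_proprio.shape != (action_dim,):
        raise ValueError("proprio must have the same per-step layout as the actions")
    n_pad = target_horizon - n_pred
    if n_pad < 0:
        raise ValueError("prediction is longer than the target horizon")
    hold = np.tile(current_proprio, (n_pad, 1))   # shape (n_pad, D)
    return np.concatenate([pred_actions, hold], axis=0)

# Example: 30 predicted steps padded to the 50 steps ManipArena expects.
chunk = pad_action_chunk(np.zeros((30, 14)), np.ones(14))
```

Note that the padded steps command the pose from the start of the chunk, not the pose reached after step 30, so executing them would move the arms back toward where they began.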

Team

gdgc-manip (CVPR 2026 Embodied AI Workshop, ManipArena track)

