---
license: apache-2.0
language:
- en
library_name: lerobot
pipeline_tag: robotics
tags:
- robotics
- vla
- vision-language-action
- edge-ai
- manipulation
- fastvit
- smolvla
- lerobot
- jetson
- edge-deployment
datasets:
- lerobot/fmb
base_model:
- lerobot/smolvla_base
- HuggingFaceTB/SmolVLM2-500M-Video-Instruct
model-index:
- name: EdgeVLA-Tiny
  results:
  - task:
      type: robotics
      name: Action Prediction (FMB)
    dataset:
      type: lerobot/fmb
      name: Functional Manipulation Benchmark
    metrics:
    - type: mse
      value: 0.555
      name: Action MSE
    - type: accuracy
      value: 95.1
      name: Gripper Accuracy (%)
    - type: cosine_similarity
      value: 0.654
      name: Cosine Similarity
---

# EdgeVLA-Tiny (FMB)

**A 164M-parameter ultra-compact Vision-Language-Action model: 64% smaller than SmolVLA, 3x faster, and likely the smallest open-source VLA. Trained on real-robot data.**

EdgeVLA-Tiny pushes the EdgeVLA architecture to its smallest configuration: FastViT-t8 vision, 4 VLM layers (75% pruned), and 0.75x expert width. At 164M parameters, it still achieves a lower action MSE than the 450M SmolVLA baseline on offline FMB action prediction while running 3x faster. The vision encoder is trained end-to-end.

Architecture inspired by [DynamicVLA](https://arxiv.org/abs/2601.22153); VLM layer pruning is our contribution. Trained exclusively on [`lerobot/fmb`](https://huggingface.co/datasets/lerobot/fmb) (3-camera Franka Panda manipulation).

Source code: [enfuse/edgevla](https://github.com/enfuse/edgevla)

## Intended Use & What You Can Do With This Model

**This model predicts 7-DoF robot actions** (x, y, z, rx, ry, rz, gripper) from 3 camera images. It outputs 50-step action chunks at 10Hz, so each inference produces 5 seconds of continuous robot motion.

**Immediate uses:**

- **Deploy on a Franka Panda** (or compatible 7-DoF arm) with a 3-camera setup for FMB-style tabletop manipulation. Feed camera frames in, execute the predicted delta actions.
- **Fine-tune on your own robot data**: this is the most practical use.
  If you have data from any robot with cameras in [LeRobot format](https://github.com/huggingface/lerobot), this checkpoint is a practical pretrained starting point. Fine-tuning at LR=3e-5 for 50K steps typically adapts well to new setups.
- **Edge deployment**: the smallest model in the EdgeVLA family and likely **the smallest open-source VLA model available**. At 164M params / 313MB FP16, it is small enough for constrained Jetson devices and fits on Jetson Orin Nano (8GB). Estimated latency is ~142ms on Jetson Orin AGX, versus ~57ms on H200.
- **Real-time control**: at 17.7 Hz throughput on H200, this model can run closed-loop at well above the 10Hz control frequency, enabling reactive manipulation.

**Important caveats:**

- All metrics below are **offline action prediction** on held-out FMB samples. There are **no closed-loop success rate numbers**: the model has not been validated on a physical robot completing full tasks.
- Trained specifically on FMB data (Franka Panda, specific manipulation tasks, 3-camera setup). It will **not generalize** to different robots, camera configurations, or tasks without fine-tuning.
- The model expects 3 camera inputs (side_1, side_2, wrist). For single-camera setups, you would need to fine-tune with `--empty_cameras` or retrain.
- As the smallest variant, Tiny trades some accuracy for speed: the rz and x dimensions are slightly worse than SmolVLA (see per-dimension table below).
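The timing figures quoted in this section can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical and are not part of the EdgeVLA or LeRobot code. It uses only numbers stated above (50-step chunks at 10Hz, ~57ms on H200, ~142ms estimated on Jetson Orin AGX).

```python
import math

# Figures taken from this model card.
CHUNK_STEPS = 50        # actions predicted per inference
CONTROL_HZ = 10.0       # control frequency of the action stream


def chunk_duration_s(steps: int = CHUNK_STEPS, control_hz: float = CONTROL_HZ) -> float:
    """Seconds of robot motion covered by one predicted action chunk."""
    return steps / control_hz


def fits_control_period(latency_ms: float, control_hz: float = CONTROL_HZ) -> bool:
    """True if one inference finishes within a single control period."""
    return latency_ms <= 1000.0 / control_hz


def min_replan_interval_steps(latency_ms: float, control_hz: float = CONTROL_HZ) -> int:
    """Smallest whole number of control steps that covers one inference."""
    return max(1, math.ceil(latency_ms / 1000.0 * control_hz))


if __name__ == "__main__":
    print(f"chunk duration: {chunk_duration_s():.1f} s")
    for name, latency_ms in [("H200", 57.0), ("Jetson Orin AGX (estimated)", 142.0)]:
        print(
            f"{name}: fits one 10Hz period = {fits_control_period(latency_ms)}, "
            f"re-plan at most every {min_replan_interval_steps(latency_ms)} step(s)"
        )
```

Under these assumptions, the H200 latency fits inside one 100ms control period, while the estimated Jetson Orin AGX latency spans more than one period. On that device a controller would keep executing actions from the current chunk while the next inference runs, rather than re-planning at every step.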

## Results (FMB Offline, 500 held-out samples)

| Metric | SmolVLA (450M) | **EdgeVLA-Tiny (164M)** | Delta |
|--------|---------------|------------------------|-------|
| Action MSE | 0.618 | **0.555** | -10% |
| Cosine Similarity | 0.663 | **0.654** | -1% |
| Gripper Accuracy | 94.9% | **95.1%** | +0.2pp |
| Inference Latency (H200) | 169ms | **57ms** | -66% |
| Memory (FP16) | 858MB | **313MB** | -64% |

### Per-Dimension MSE

| Dim | SmolVLA | **Tiny** | Delta |
|-----|---------|---------|-------|
| x | 0.538 | 0.541 | +1% |
| y | 0.598 | **0.557** | -7% |
| z | 0.599 | **0.557** | -7% |
| rx | 0.624 | **0.544** | -13% |
| ry | 1.358 | **1.054** | -22% |
| rz | 0.373 | 0.404 | +8% |
| gripper | 0.233 | **0.226** | -3% |

### Latency (H200, FP32)

| Mean | P50 | P95 | Throughput |
|------|-----|-----|------------|
| 57ms | 56ms | 59ms | 17.7 Hz |

## Architecture

```
EdgeVLA-Tiny (164M total, 30M trainable):
  FastViT-t8 vision:   4.0M   (trainable, replaces SigLIP 98M frozen)
  VLM (SmolLM2-360M):  133.9M (frozen, 4 layers, pruned from 16)
  Action expert:       24.6M  (trainable, flow matching, 0.75x width)
  Projections:         1.6M   (trainable)
```

Key changes from SmolVLA: FastViT-t8 (conv, trainable) replaces SigLIP (ViT, frozen). VLM pruned 16 to 4 layers (75%). 64 visual tokens vs 729 (11x fewer). 256x256 input vs 384x384.
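
The parameter budget above can be sanity-checked against the headline numbers. This is a minimal sketch using only the component counts listed in this card. The helper names are hypothetical, and reading "MB" as MiB with 2 bytes per FP16 parameter is an assumption, though it reproduces both memory figures in the results table.

```python
# Component sizes in millions of parameters, with a trainable flag,
# copied from the architecture breakdown in this card.
COMPONENTS_M = {
    "fastvit_t8_vision": (4.0, True),
    "vlm_4_layers": (133.9, False),
    "action_expert": (24.6, True),
    "projections": (1.6, True),
}


def total_params_m(components=COMPONENTS_M) -> float:
    """Total parameter count in millions."""
    return sum(count for count, _ in components.values())


def trainable_params_m(components=COMPONENTS_M) -> float:
    """Trainable parameter count in millions."""
    return sum(count for count, trainable in components.values() if trainable)


def fp16_size_mib(params_m: float) -> float:
    """Weight size in MiB, assuming 2 bytes per parameter (FP16)."""
    return params_m * 1e6 * 2 / 2**20


if __name__ == "__main__":
    total = total_params_m()
    print(f"total:     {total:.1f}M")
    print(f"trainable: {trainable_params_m():.1f}M")
    print(f"FP16 size: {fp16_size_mib(total):.0f} MiB")
    print(f"SmolVLA FP16 size (450M): {fp16_size_mib(450.0):.0f} MiB")
    print(f"visual token ratio: {729 / 64:.1f}x")
    print(f"VLM layers pruned:  {1 - 4 / 16:.0%}")
```

The components sum to about 164M parameters, of which about 30M are trainable, matching the header of the architecture block.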

## Training

| Parameter | Value |
|-----------|-------|
| Dataset | `lerobot/fmb` |
| Total steps | 100K (50K + 50K fine-tune at LR=5e-5) |
| Batch size | 64 |
| Learning rate | 1e-4 initial, 5e-5 fine-tune (cosine) |
| Warmup | 2,000 / 1,000 steps |
| Augmentation | ColorJitter + RandomSharpness + RandomAffine |
| Cameras | 3 (side_1, side_2, wrist) |
| Actions | 7-dim (x, y, z, rx, ry, rz, gripper) |
| VLM layers | 4 (pruned from 16) |
| Expert width | 0.75x |
| Hardware | 1x NVIDIA H200 |
| Training time | ~10 hours total |

## EdgeVLA Family

| Model | Params | MSE | Cosine Sim | Gripper | Latency | HF Repo |
|-------|--------|-----|------------|---------|---------|---------|
| SmolVLA | 450M | 0.618 | 0.663 | 94.9% | 169ms | [lerobot/smolvla_base](https://huggingface.co/lerobot/smolvla_base) |
| Base | 363M | 0.458 | 0.713 | 96.5% | 162ms | [enfuse/edgevla-base-fmb](https://huggingface.co/enfuse/edgevla-base-fmb) |
| Small | 228M | 0.515 | 0.679 | 95.8% | 90ms | [enfuse/edgevla-small-fmb](https://huggingface.co/enfuse/edgevla-small-fmb) |
| **Tiny** | **164M** | **0.555** | **0.654** | **95.1%** | **57ms** | **this repo** |

## Quick Start

```python
from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy

# Download the checkpoint from the Hugging Face Hub and switch to inference mode.
policy = SmolVLAPolicy.from_pretrained("enfuse/edgevla-tiny-fmb")
policy.eval()
```

## Fine-Tuning on Your Own Data

```bash
git clone https://github.com/enfuse/edgevla
cd edgevla
python edgevla/train.py \
  --base_policy enfuse/edgevla-tiny-fmb \
  --dataset your_lerobot_dataset \
  --fastvit_variant fastvit_t8 \
  --num_vlm_layers 4 \
  --expert_width_multiplier 0.75 \
  --lr 3e-5 \
  --steps 50000 \
  --batch_size 64
```

See the [training README](https://github.com/enfuse/edgevla) for full configuration options and the multi-round training strategy.

## Attribution

Architecture from [DynamicVLA](https://arxiv.org/abs/2601.22153) (Xie et al., 2026). VLM layer pruning is our contribution.
Built on [SmolVLA](https://huggingface.co/lerobot/smolvla_base), [FastViT](https://arxiv.org/abs/2303.14189), and [LeRobot](https://github.com/huggingface/lerobot).

```bibtex
@article{xie2026dynamicvla,
  title={DynamicVLA: Efficient Vision-Language-Action Model via Dynamic Fusion for Robotic Manipulation},
  author={Xie, Yue and others},
  journal={arXiv preprint arXiv:2601.22153},
  year={2026}
}
```