---
license: apache-2.0
language:
  - en
library_name: lerobot
pipeline_tag: robotics
tags:
  - robotics
  - vla
  - vision-language-action
  - edge-ai
  - manipulation
  - fastvit
  - smolvla
  - lerobot
  - jetson
  - edge-deployment
datasets:
  - lerobot/fmb
base_model:
  - lerobot/smolvla_base
  - HuggingFaceTB/SmolVLM2-500M-Video-Instruct
model-index:
  - name: EdgeVLA-Tiny
    results:
      - task:
          type: robotics
          name: Action Prediction (FMB)
        dataset:
          type: lerobot/fmb
          name: Functional Manipulation Benchmark
        metrics:
          - type: mse
            value: 0.555
            name: Action MSE
          - type: accuracy
            value: 95.1
            name: Gripper Accuracy (%)
          - type: cosine_similarity
            value: 0.654
            name: Cosine Similarity
---

# EdgeVLA-Tiny (FMB)

**A 164M-parameter, ultra-compact Vision-Language-Action model: 64% smaller than SmolVLA, 3x faster on an H200, and likely the smallest open-source VLA. Trained on real-robot data.**

EdgeVLA-Tiny pushes the EdgeVLA architecture to its smallest configuration: a FastViT-t8 vision encoder, 4 VLM layers (75% pruned), and a 0.75x-width action expert. At 164M parameters, it still achieves a lower action MSE than the 450M SmolVLA baseline (0.555 vs 0.618) while running about 3x faster; cosine similarity is marginally lower (0.654 vs 0.663). The vision encoder is trained end-to-end. The architecture is inspired by [DynamicVLA](https://arxiv.org/abs/2601.22153); VLM layer pruning is our contribution.

Trained exclusively on [`lerobot/fmb`](https://huggingface.co/datasets/lerobot/fmb) (3-camera Franka Panda manipulation). Source code: [enfuse/edgevla](https://github.com/enfuse/edgevla)

## Intended Use & What You Can Do With This Model

**This model predicts 7-DoF robot actions** (x, y, z, rx, ry, rz, gripper) from 3 camera images. It outputs 50-step action chunks at 10Hz — each inference produces 5 seconds of continuous robot motion.
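
The chunk arithmetic can be made concrete with a short sketch. The constants are taken from the paragraph above; the helper name `chunk_timestamps` is illustrative and is not part of LeRobot or the EdgeVLA code.

```python
# Illustrative only: timing of one predicted action chunk.
CHUNK_STEPS = 50   # actions per inference (from this card)
CONTROL_HZ = 10    # control frequency (from this card)
ACTION_DIM = 7     # x, y, z, rx, ry, rz, gripper


def chunk_timestamps(steps: int = CHUNK_STEPS, hz: float = CONTROL_HZ) -> list[float]:
    """Return the time offset (seconds) at which each action in a chunk is executed."""
    return [i / hz for i in range(steps)]


offsets = chunk_timestamps()
chunk_duration_s = CHUNK_STEPS / CONTROL_HZ

print(f"first offsets: {offsets[:3]}")
print(f"last offset: {offsets[-1]} s, chunk covers {chunk_duration_s} s")
```

Each chunk is therefore a 50 x 7 block of actions, with the last action scheduled 4.9 s after the first.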

**Immediate uses:**
- **Deploy on a Franka Panda** (or compatible 7-DoF arm) with a 3-camera setup for FMB-style tabletop manipulation. Feed camera frames in, execute the predicted delta actions.
- **Fine-tune on your own robot data**: this is the most practical use. If you have data from any robot with cameras in [LeRobot format](https://github.com/huggingface/lerobot), this checkpoint can serve as a pretrained starting point. A learning rate of 3e-5 for 50K steps is the suggested starting recipe.
- **Edge deployment**: the smallest model in the EdgeVLA family and likely **the smallest open-source VLA model available**. At 164M params / 313MB in FP16, it fits on a Jetson Orin Nano (8GB). Latency is 57ms measured on an H200 and an estimated ~142ms on a Jetson Orin AGX.
- **Real-time control**: at 17.7 Hz throughput on an H200, the model can run closed-loop well above the 10Hz control frequency, enabling reactive manipulation.
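
One rough way to read the latency figures in the list above is to compare them with the 100 ms control period implied by 10 Hz. The sketch below assumes the robot keeps executing the previous chunk while the next one is being computed; the function name is hypothetical, and the Jetson figure is the estimate quoted above, not a measurement.

```python
import math

CONTROL_HZ = 10
CONTROL_PERIOD_MS = 1000 / CONTROL_HZ  # 100 ms per action step

# Latencies quoted in this card (the Jetson number is an estimate).
LATENCY_MS = {"H200": 57, "Jetson Orin AGX (est.)": 142}


def steps_consumed_during_inference(latency_ms: float) -> int:
    """Number of control steps that elapse while one inference runs."""
    return math.ceil(latency_ms / CONTROL_PERIOD_MS)


for device, latency in LATENCY_MS.items():
    steps = steps_consumed_during_inference(latency)
    print(f"{device}: {latency} ms -> fresh predictions at most every {steps} step(s)")
```

Under these assumptions the H200 latency allows re-planning at every control step, while the estimated Jetson Orin AGX latency spans two steps, so on that device at least two actions from each chunk would be executed before new predictions arrive.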

**Important caveats:**
- All metrics below are **offline action prediction** on held-out FMB samples. There are **no closed-loop success rate numbers** — the model has not been validated on a physical robot completing full tasks.
- Trained specifically on FMB data (Franka Panda, specific manipulation tasks, 3-camera setup). It will **not generalize** to different robots, camera configurations, or tasks without fine-tuning.
- The model expects 3 camera inputs (side_1, side_2, wrist). For single-camera setups, you would need to fine-tune with `--empty_cameras` or retrain.
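- As the smallest variant, Tiny trades some accuracy for speed: its MSE on the x and rz dimensions and its overall cosine similarity are slightly worse than SmolVLA's (see the tables below), and it trails the Base and Small variants on MSE, cosine similarity, and gripper accuracy.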

## Results (FMB Offline, 500 held-out samples)

| Metric | SmolVLA (450M) | **EdgeVLA-Tiny (164M)** | Delta |
|--------|---------------|------------------------|-------|
| Action MSE | 0.618 | **0.555** | -10% |
| Cosine Similarity | 0.663 | **0.654** | -1% |
| Gripper Accuracy | 94.9% | **95.1%** | +0.2pp |
| Inference Latency (H200) | 169ms | **57ms** | -66% |
| Memory (FP16) | 858MB | **313MB** | -64% |
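
The Delta column can be reproduced from the two value columns. This is a minimal sketch using the numbers in the table above; the helper name is illustrative.

```python
def relative_delta_pct(baseline: float, value: float) -> float:
    """Percent change of `value` relative to `baseline`."""
    return (value - baseline) / baseline * 100


# (SmolVLA, EdgeVLA-Tiny) pairs copied from the table above.
rows = {
    "Action MSE": (0.618, 0.555),
    "Cosine Similarity": (0.663, 0.654),
    "Inference Latency (ms)": (169, 57),
    "Memory (MB)": (858, 313),
}

for name, (smolvla, tiny) in rows.items():
    print(f"{name}: {relative_delta_pct(smolvla, tiny):+.0f}%")

# Gripper accuracy is reported as a difference in percentage points instead.
print(f"Gripper Accuracy: {95.1 - 94.9:+.1f}pp")
```

Lower is better for MSE, latency, and memory, so negative deltas in those rows favor Tiny; for cosine similarity a negative delta is a small regression.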

### Per-Dimension MSE

| Dim | SmolVLA | **Tiny** | Delta |
|-----|---------|---------|-------|
| x | 0.538 | 0.541 | +1% |
| y | 0.598 | **0.557** | -7% |
| z | 0.599 | **0.557** | -7% |
| rx | 0.624 | **0.544** | -13% |
| ry | 1.358 | **1.054** | -22% |
| rz | 0.373 | 0.404 | +8% |
| gripper | 0.233 | **0.226** | -3% |
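
The per-dimension numbers are consistent with the headline Action MSE being an unweighted mean over the seven action dimensions. The card does not state how the aggregate is computed, so treat that as an assumption; the sketch below only checks the arithmetic.

```python
from statistics import mean

# Per-dimension MSE copied from the table above: (SmolVLA, Tiny).
per_dim = {
    "x": (0.538, 0.541),
    "y": (0.598, 0.557),
    "z": (0.599, 0.557),
    "rx": (0.624, 0.544),
    "ry": (1.358, 1.054),
    "rz": (0.373, 0.404),
    "gripper": (0.233, 0.226),
}

smolvla_mean = mean(v[0] for v in per_dim.values())
tiny_mean = mean(v[1] for v in per_dim.values())

# Dimensions where Tiny has a higher (worse) MSE than SmolVLA.
regressions = [dim for dim, (smolvla, tiny) in per_dim.items() if tiny > smolvla]

print(f"SmolVLA mean MSE: {smolvla_mean:.3f}")
print(f"Tiny mean MSE:    {tiny_mean:.3f}")
print(f"Tiny is worse on: {regressions}")
```

The means round to 0.618 and 0.555, matching the headline table, and the only regressions are x and rz.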

### Latency (H200, FP32)

| Mean | P50 | P95 | Throughput |
|------|-----|-----|------------|
| 57ms | 56ms | 59ms | 17.7 Hz |

## Architecture

```
EdgeVLA-Tiny (164M total, 30M trainable):
  FastViT-t8 vision:        4.0M  (trainable, replaces SigLIP 98M frozen)
  VLM (SmolLM2-360M):    133.9M  (frozen, 4 layers — pruned from 16)
  Action expert:           24.6M  (trainable, flow matching, 0.75x width)
  Projections:              1.6M  (trainable)
```
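
The totals in the breakdown above, and the 313MB FP16 figure quoted elsewhere in this card, can be cross-checked with a few lines. The sketch assumes 2 bytes per parameter and that the memory figure counts weights only, in binary megabytes (MiB); neither is stated explicitly in the card.

```python
# Component sizes in millions of parameters, copied from the breakdown above.
# Each entry is (params_in_millions, trainable).
components = {
    "FastViT-t8 vision": (4.0, True),
    "VLM (4 layers)": (133.9, False),
    "Action expert": (24.6, True),
    "Projections": (1.6, True),
}

total_m = sum(params for params, _ in components.values())
trainable_m = sum(params for params, trainable in components.values() if trainable)

# Assumption: FP16 weights only, 2 bytes per parameter, reported in MiB.
fp16_mib = total_m * 1e6 * 2 / 2**20

print(f"total: {total_m:.1f}M, trainable: {trainable_m:.1f}M")
print(f"FP16 weights: {fp16_mib:.0f} MiB")
```

Activations, buffers, and framework overhead are not included, so peak runtime memory will be higher than this weights-only estimate.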

Key changes from SmolVLA: FastViT-t8 (convolutional, trainable) replaces SigLIP (ViT, frozen); the VLM is pruned from 16 to 4 layers (75%); the model uses 64 visual tokens instead of 729 (about 11x fewer); and the input resolution is 256x256 instead of 384x384.

## Training

| Parameter | Value |
|-----------|-------|
| Dataset | `lerobot/fmb` |
| Total steps | 100K (50K + 50K fine-tune at LR=5e-5) |
| Batch size | 64 |
| Learning rate | 1e-4 initial, 5e-5 fine-tune (cosine) |
| Warmup | 2,000 / 1,000 steps |
| Augmentation | ColorJitter + RandomSharpness + RandomAffine |
| Cameras | 3 (side_1, side_2, wrist) |
| Actions | 7-dim (x, y, z, rx, ry, rz, gripper) |
| VLM layers | 4 (pruned from 16) |
| Expert width | 0.75x |
| Hardware | 1x NVIDIA H200 |
| Training time | ~10 hours total |
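
The two-round learning-rate schedule in the table can be sketched as follows. Several details are assumptions rather than facts from this card: that warmup is linear, that the cosine decay ends at zero, that the schedule restarts at the beginning of the second round, and that the "2,000 / 1,000" warmup values map to the first and second rounds respectively. The function names are illustrative.

```python
import math


def lr_at(step: int, peak_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warmup to `peak_lr`, then cosine decay to zero (assumed floor)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))


# Two rounds from the table: (peak LR, warmup steps, steps in the round).
ROUNDS = [(1e-4, 2_000, 50_000), (5e-5, 1_000, 50_000)]


def lr_schedule(global_step: int) -> float:
    """Learning rate at a global step across both rounds (0 .. 99_999)."""
    for peak, warmup, steps in ROUNDS:
        if global_step < steps:
            return lr_at(global_step, peak, warmup, steps)
        global_step -= steps
    raise ValueError("step is beyond the 100K total")


for s in (0, 2_000, 26_000, 50_000, 51_000, 99_999):
    print(s, f"{lr_schedule(s):.2e}")
```

The actual training script may use a different warmup shape or a non-zero minimum learning rate; the repository is the authoritative source for those details.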

## EdgeVLA Family

| Model | Params | MSE | Cosine Sim | Gripper | Latency | HF Repo |
|-------|--------|-----|------------|---------|---------|---------|
| SmolVLA | 450M | 0.618 | 0.663 | 94.9% | 169ms | [lerobot/smolvla_base](https://huggingface.co/lerobot/smolvla_base) |
| Base | 363M | 0.458 | 0.713 | 96.5% | 162ms | [enfuse/edgevla-base-fmb](https://huggingface.co/enfuse/edgevla-base-fmb) |
| Small | 228M | 0.515 | 0.679 | 95.8% | 90ms | [enfuse/edgevla-small-fmb](https://huggingface.co/enfuse/edgevla-small-fmb) |
| **Tiny** | **164M** | **0.555** | **0.654** | **95.1%** | **57ms** | **this repo** |
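
One way to use the table above is to pick the most accurate variant that fits a latency budget. The sketch below is illustrative: the `Variant` type and `best_under_budget` helper are hypothetical, and the numbers are copied from the table.

```python
from typing import NamedTuple, Optional


class Variant(NamedTuple):
    name: str
    params_m: int
    mse: float
    latency_ms: int


FAMILY = [
    Variant("SmolVLA", 450, 0.618, 169),
    Variant("Base", 363, 0.458, 162),
    Variant("Small", 228, 0.515, 90),
    Variant("Tiny", 164, 0.555, 57),
]


def best_under_budget(budget_ms: float) -> Optional[Variant]:
    """Lowest-MSE variant whose latency fits the budget, or None if none fits."""
    candidates = [v for v in FAMILY if v.latency_ms <= budget_ms]
    return min(candidates, key=lambda v: v.mse) if candidates else None


for budget in (60, 100, 200):
    choice = best_under_budget(budget)
    print(f"{budget} ms budget -> {choice.name if choice else 'no variant fits'}")
```

The latency column does not name the hardware; the SmolVLA and Tiny entries match the H200 measurements reported earlier, so budgets on other devices would need their own measurements.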

## Quick Start

```python
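# Requires LeRobot; the checkpoint is loaded through the SmolVLA policy class.
from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy

# Fetches the weights from the Hugging Face Hub on first use.
policy = SmolVLAPolicy.from_pretrained("enfuse/edgevla-tiny-fmb")
policy.eval()  # switch to inference mode (disables training-only behavior)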
```

## Fine-Tuning on Your Own Data

```bash
git clone https://github.com/enfuse/edgevla
cd edgevla

python edgevla/train.py \
  --base_policy enfuse/edgevla-tiny-fmb \
  --dataset your_lerobot_dataset \
  --fastvit_variant fastvit_t8 \
  --num_vlm_layers 4 \
  --expert_width_multiplier 0.75 \
  --lr 3e-5 \
  --steps 50000 \
  --batch_size 64
```

See the [training README](https://github.com/enfuse/edgevla) for the full configuration options and the multi-round training strategy.

## Attribution

Architecture inspired by [DynamicVLA](https://arxiv.org/abs/2601.22153) (Xie et al., 2026). VLM layer pruning is our contribution. Built on [SmolVLA](https://huggingface.co/lerobot/smolvla_base), [FastViT](https://arxiv.org/abs/2303.14189), and [LeRobot](https://github.com/huggingface/lerobot).

```bibtex
@article{xie2026dynamicvla,
  title={DynamicVLA: Efficient Vision-Language-Action Model via Dynamic Fusion for Robotic Manipulation},
  author={Xie, Yue and others},
  journal={arXiv preprint arXiv:2601.22153},
  year={2026}
}
```