How to use enfuse/edgevla-tiny-fmb with LeRobot:
# See https://github.com/huggingface/lerobot?tab=readme-ov-file#installation for more details
git clone https://github.com/huggingface/lerobot.git
cd lerobot
pip install -e ".[smolvla]"
# Launch finetuning on your dataset
python lerobot/scripts/train.py \
  --policy.path=enfuse/edgevla-tiny-fmb \
  --dataset.repo_id=lerobot/svla_so101_pickplace \
  --batch_size=64 \
  --steps=20000 \
  --output_dir=outputs/train/my_smolvla \
  --job_name=my_smolvla_training \
  --policy.device=cuda \
  --wandb.enable=true
# Run the policy using the record function.
# Replace the port, robot id, and cameras with your own, and use the same task
# description you used when recording your dataset.
# dataset.repo_id will be the dataset name on the HF Hub.
python -m lerobot.record \
  --robot.type=so101_follower \
  --robot.port=/dev/ttyACM0 \
  --robot.id=my_blue_follower_arm \
  --robot.cameras="{ front: {type: opencv, index_or_path: 8, width: 640, height: 480, fps: 30}}" \
  --dataset.single_task="Grasp a lego block and put it in the bin." \
  --dataset.repo_id=HF_USER/dataset_name \
  --dataset.episode_time_s=50 \
  --dataset.num_episodes=10 \
  --policy.path=enfuse/edgevla-tiny-fmb
EdgeVLA-Tiny (FMB)
A 164M-parameter ultra-compact Vision-Language-Action (VLA) model: 64% smaller than SmolVLA, 3x faster on an H200, and likely the smallest open-source VLA available. Trained on real-robot data.
EdgeVLA-Tiny pushes the EdgeVLA architecture to its smallest configuration: FastViT-t8 vision, 4 VLM layers (75% pruned), and 0.75x expert width. At 164M parameters, it still achieves lower offline action MSE than the 450M SmolVLA baseline (0.555 vs 0.618) while running 3x faster; cosine similarity is marginally lower (0.654 vs 0.663). The vision encoder is trained end-to-end. Architecture inspired by DynamicVLA; VLM layer pruning is our contribution.
Trained exclusively on lerobot/fmb (3-camera Franka Panda manipulation). Source code: enfuse/edgevla
Intended Use & What You Can Do With This Model
This model predicts 7-DoF robot actions (x, y, z, rx, ry, rz, gripper) from 3 camera images. It outputs 50-step action chunks at 10Hz — each inference produces 5 seconds of continuous robot motion.
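The chunk timing described above can be sketched as follows. This is a minimal illustration, assuming a chunk is laid out as a (50, 7) array; `schedule_chunk` is a hypothetical helper, and the zero-filled `chunk` is placeholder data rather than model output.

```python
import numpy as np

CHUNK_STEPS = 50      # actions per inference (from the model card)
CONTROL_HZ = 10.0     # control frequency (from the model card)
ACTION_DIMS = ("x", "y", "z", "rx", "ry", "rz", "gripper")


def schedule_chunk(chunk: np.ndarray, start_time_s: float = 0.0):
    """Pair each action in a chunk with the time at which it should be executed."""
    expected = (CHUNK_STEPS, len(ACTION_DIMS))
    if chunk.shape != expected:
        raise ValueError(f"expected shape {expected}, got {chunk.shape}")
    dt = 1.0 / CONTROL_HZ  # seconds between consecutive actions
    return [(start_time_s + i * dt, action) for i, action in enumerate(chunk)]


# Placeholder chunk (zeros), standing in for a real model prediction.
chunk = np.zeros((CHUNK_STEPS, len(ACTION_DIMS)), dtype=np.float32)
timeline = schedule_chunk(chunk)
chunk_duration_s = CHUNK_STEPS / CONTROL_HZ  # 50 steps / 10 Hz = 5 seconds
print(f"{len(timeline)} actions covering {chunk_duration_s:.1f} s")
```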
Immediate uses:
- Deploy on a Franka Panda (or compatible 7-DoF arm) with a 3-camera setup for FMB-style tabletop manipulation. Feed camera frames in, execute the predicted delta actions.
- Fine-tune on your own robot data: this is the most practical use. If you have robot data with camera observations in LeRobot format, this checkpoint is a good pretrained starting point. Fine-tuning at LR=3e-5 for 50K steps typically adapts well to new setups.
- Edge deployment: this is the smallest model in the EdgeVLA family and likely the smallest open-source VLA model available. At 164M params / 313MB FP16, it fits on constrained devices such as the Jetson Orin Nano (8GB). Latency is ~57ms on H200 (measured) and an estimated ~142ms on Jetson AGX Orin.
- Real-time control: at 17.7 Hz throughput on H200, inference runs well above the 10Hz control frequency, leaving headroom for closed-loop, reactive manipulation.
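The real-time claim above can be checked with simple arithmetic. The sketch below uses only the latency figures quoted in this card (the Jetson number is an estimate, not a measurement); the `headroom` helper and its field names are hypothetical.

```python
import math

CONTROL_HZ = 10.0   # control frequency from the model card
CHUNK_STEPS = 50    # actions per chunk from the model card

# Latencies quoted in the model card; the Jetson figure is an estimate.
LATENCY_MS = {"H200 (measured)": 57.0, "Jetson AGX Orin (estimated)": 142.0}


def headroom(latency_ms: float) -> dict:
    """Relate one inference call to the control period."""
    period_ms = 1000.0 / CONTROL_HZ  # 100 ms at 10 Hz
    return {
        "max_inference_hz": 1000.0 / latency_ms,
        "fits_in_one_control_step": latency_ms <= period_ms,
        # Control steps that elapse while one inference is running.
        "steps_elapsed_during_inference": math.ceil(latency_ms / period_ms),
    }


report = {name: headroom(ms) for name, ms in LATENCY_MS.items()}
for name, stats in report.items():
    print(name, stats)
```

Under these numbers, a device at the estimated Jetson latency could not re-plan on every control step, but because each call returns 50 steps, a new chunk is only needed before the current one runs out.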
Important caveats:
- All metrics below are offline action prediction on held-out FMB samples. There are no closed-loop success rate numbers — the model has not been validated on a physical robot completing full tasks.
- Trained specifically on FMB data (Franka Panda, specific manipulation tasks, 3-camera setup). It will not generalize to different robots, camera configurations, or tasks without fine-tuning.
- The model expects 3 camera inputs (side_1, side_2, wrist). For single-camera setups, you would need to fine-tune with --empty_cameras or retrain.
- As the smallest variant, Tiny trades some accuracy for speed: rz and x dimensions are slightly worse than SmolVLA (see per-dimension table below).
Results (FMB Offline, 500 held-out samples)
| Metric | SmolVLA (450M) | EdgeVLA-Tiny (164M) | Delta |
|---|---|---|---|
| Action MSE | 0.618 | 0.555 | -10% |
| Cosine Similarity | 0.663 | 0.654 | -1% |
| Gripper Accuracy | 94.9% | 95.1% | +0.2pp |
| Inference Latency (H200) | 169ms | 57ms | -66% |
| Memory (FP16) | 858MB | 313MB | -64% |
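The card does not spell out how the offline metrics are computed, so the following is only one plausible reading, sketched under stated assumptions: MSE is averaged over all elements, cosine similarity is computed per sample on the flattened action chunk, and gripper accuracy is agreement after thresholding the last action dimension at 0. All function names and the synthetic data are illustrative.

```python
import numpy as np


def action_mse(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean squared error over every sample, step, and action dimension."""
    return float(np.mean((pred - target) ** 2))


def mean_cosine_similarity(pred: np.ndarray, target: np.ndarray, eps: float = 1e-8) -> float:
    """Per-sample cosine similarity of flattened chunks, averaged over samples."""
    p = pred.reshape(len(pred), -1)
    t = target.reshape(len(target), -1)
    num = np.sum(p * t, axis=1)
    den = np.linalg.norm(p, axis=1) * np.linalg.norm(t, axis=1) + eps
    return float(np.mean(num / den))


def gripper_accuracy(pred: np.ndarray, target: np.ndarray, threshold: float = 0.0) -> float:
    """ASSUMPTION: gripper is the last dimension and is binarized at `threshold`."""
    return float(np.mean((pred[..., -1] > threshold) == (target[..., -1] > threshold)))


# Synthetic stand-in data: 500 samples of 50-step, 7-dim chunks (not FMB data).
rng = np.random.default_rng(0)
target = rng.normal(size=(500, 50, 7))
pred = target + 0.1 * rng.normal(size=target.shape)

print("MSE:", action_mse(pred, target))
print("Cosine similarity:", mean_cosine_similarity(pred, target))
print("Gripper accuracy:", gripper_accuracy(pred, target))
```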
Per-Dimension MSE
| Dim | SmolVLA | Tiny | Delta |
|---|---|---|---|
| x | 0.538 | 0.541 | +1% |
| y | 0.598 | 0.557 | -7% |
| z | 0.599 | 0.557 | -7% |
| rx | 0.624 | 0.544 | -13% |
| ry | 1.358 | 1.054 | -22% |
| rz | 0.373 | 0.404 | +8% |
| gripper | 0.233 | 0.226 | -3% |
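As a consistency check, the per-dimension values above can be averaged and compared with the aggregate Action MSE. The sketch assumes the aggregate is the unweighted mean over the 7 dimensions, which the card does not state but which the table values are consistent with.

```python
DIMS = ["x", "y", "z", "rx", "ry", "rz", "gripper"]
SMOLVLA = [0.538, 0.598, 0.599, 0.624, 1.358, 0.373, 0.233]  # from the table
TINY = [0.541, 0.557, 0.557, 0.544, 1.054, 0.404, 0.226]     # from the table


def mean(values):
    return sum(values) / len(values)


def relative_delta_pct(baseline: float, value: float) -> float:
    """Signed change of `value` relative to `baseline`, in percent."""
    return 100.0 * (value - baseline) / baseline


smolvla_mse = mean(SMOLVLA)
tiny_mse = mean(TINY)
deltas = {d: relative_delta_pct(s, t) for d, s, t in zip(DIMS, SMOLVLA, TINY)}
# Dimensions where Tiny has a higher (worse) MSE than SmolVLA.
worse = [d for d, pct in deltas.items() if pct > 0]

print(f"SmolVLA mean MSE: {smolvla_mse:.3f}, Tiny mean MSE: {tiny_mse:.3f}")
print("Tiny is worse on:", worse)
```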
Latency (H200, FP32)
| Mean | P50 | P95 | Throughput |
|---|---|---|---|
| 57ms | 56ms | 59ms | 17.7 Hz |
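Statistics like these can be collected with a small timing harness. The sketch below is not the benchmark used for this card: `benchmark` is a hypothetical helper, `dummy_inference` is a placeholder workload, and throughput is assumed to be the reciprocal of mean latency. When timing a GPU model, the device should be synchronized (for example with `torch.cuda.synchronize()`) before reading the clock.

```python
import time

import numpy as np


def benchmark(fn, warmup: int = 5, iters: int = 50) -> dict:
    """Time repeated calls to `fn` and summarize latency in milliseconds."""
    for _ in range(warmup):  # warm-up calls are excluded from the statistics
        fn()
    samples_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    samples = np.asarray(samples_ms)
    mean_ms = float(samples.mean())
    return {
        "mean_ms": mean_ms,
        "p50_ms": float(np.percentile(samples, 50)),
        "p95_ms": float(np.percentile(samples, 95)),
        "throughput_hz": 1000.0 / mean_ms,  # ASSUMPTION: one call at a time
    }


def dummy_inference():
    """Placeholder workload standing in for a policy forward pass."""
    a = np.ones((64, 64))
    return a @ a


stats = benchmark(dummy_inference)
print(stats)
```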
Architecture
EdgeVLA-Tiny (164M total, 30M trainable):
FastViT-t8 vision: 4.0M (trainable, replaces SigLIP 98M frozen)
VLM (SmolLM2-360M): 133.9M (frozen, 4 layers — pruned from 16)
Action expert: 24.6M (trainable, flow matching, 0.75x width)
Projections: 1.6M (trainable)
Key changes from SmolVLA: FastViT-t8 (conv, trainable) replaces SigLIP (ViT, frozen). VLM pruned 16 to 4 layers (75%). 64 visual tokens vs 729 (11x fewer). 256x256 input vs 384x384.
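The parameter budget above can be tallied directly. The sketch also estimates FP16 weight memory, assuming 2 bytes per parameter, weights only, and sizes reported in MiB; the card does not state this convention, but it reproduces the 313MB and 858MB figures.

```python
# (parameters in millions, trainable?) for each component listed above.
COMPONENTS_M = {
    "FastViT-t8 vision": (4.0, True),
    "VLM (SmolLM2-360M, 4 layers)": (133.9, False),
    "Action expert": (24.6, True),
    "Projections": (1.6, True),
}


def fp16_mib(params_m: float) -> float:
    """Weight memory in MiB at 2 bytes per parameter (ASSUMED convention)."""
    return params_m * 1e6 * 2 / 2**20


total_m = sum(p for p, _ in COMPONENTS_M.values())
trainable_m = sum(p for p, is_trainable in COMPONENTS_M.values() if is_trainable)
token_ratio = 729 / 64          # visual tokens: SmolVLA vs Tiny
pixel_ratio = (384 / 256) ** 2  # input pixels: 384x384 vs 256x256

print(f"total: {total_m:.1f}M, trainable: {trainable_m:.1f}M")
print(f"FP16 weights: {fp16_mib(total_m):.0f} MiB (SmolVLA 450M: {fp16_mib(450):.0f} MiB)")
print(f"{token_ratio:.1f}x fewer visual tokens, {pixel_ratio:.2f}x fewer input pixels")
```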
Training
| Parameter | Value |
|---|---|
| Dataset | lerobot/fmb |
| Total steps | 100K (50K + 50K fine-tune at LR=5e-5) |
| Batch size | 64 |
| Learning rate | 1e-4 initial, 5e-5 fine-tune (cosine) |
| Warmup | 2,000 / 1,000 steps |
| Augmentation | ColorJitter + RandomSharpness + RandomAffine |
| Cameras | 3 (side_1, side_2, wrist) |
| Actions | 7-dim (x, y, z, rx, ry, rz, gripper) |
| VLM layers | 4 (pruned from 16) |
| Expert width | 0.75x |
| Hardware | 1x NVIDIA H200 |
| Training time | ~10 hours total |
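The two-round schedule in the table can be sketched as a warmup-plus-cosine function. The card only says "cosine" and lists the warmup lengths, so linear warmup, decay to zero, restarting the schedule for the second round, and pairing 2,000 warmup steps with the first round and 1,000 with the second are all assumptions; `lr_at` is a hypothetical helper, not the training code.

```python
import math


def lr_at(step: int, peak_lr: float, warmup_steps: int, total_steps: int,
          min_lr: float = 0.0) -> float:
    """Linear warmup to `peak_lr`, then cosine decay to `min_lr` (ASSUMED shape)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Values from the training table; the mapping of warmup steps to rounds is assumed.
ROUNDS = [
    {"peak_lr": 1e-4, "warmup_steps": 2000, "total_steps": 50_000},  # initial round
    {"peak_lr": 5e-5, "warmup_steps": 1000, "total_steps": 50_000},  # fine-tune round
]

for i, cfg in enumerate(ROUNDS, start=1):
    probes = [0, cfg["warmup_steps"], 25_000, 50_000]
    print(f"round {i}:", {s: f"{lr_at(s, **cfg):.2e}" for s in probes})
```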
EdgeVLA Family
| Model | Params | MSE | Cosine Sim | Gripper | Latency | HF Repo |
|---|---|---|---|---|---|---|
| SmolVLA | 450M | 0.618 | 0.663 | 94.9% | 169ms | lerobot/smolvla_base |
| Base | 363M | 0.458 | 0.713 | 96.5% | 162ms | enfuse/edgevla-base-fmb |
| Small | 228M | 0.515 | 0.679 | 95.8% | 90ms | enfuse/edgevla-small-fmb |
| Tiny | 164M | 0.555 | 0.654 | 95.1% | 57ms | this repo |
Quick Start
from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy
policy = SmolVLAPolicy.from_pretrained("enfuse/edgevla-tiny-fmb")
policy.eval()
Fine-Tuning on Your Own Data
git clone https://github.com/enfuse/edgevla
cd edgevla
python edgevla/train.py \
--base_policy enfuse/edgevla-tiny-fmb \
--dataset your_lerobot_dataset \
--fastvit_variant fastvit_t8 \
--num_vlm_layers 4 \
--expert_width_multiplier 0.75 \
--lr 3e-5 \
--steps 50000 \
--batch_size 64
See the training README for full configuration options and multi-round training strategy.
Attribution
Architecture from DynamicVLA (Xie et al., 2026). VLM layer pruning is our contribution. Built on SmolVLA, FastViT, and LeRobot.
@article{xie2026dynamicvla,
title={DynamicVLA: A Vision-Language-Action Model for Dynamic Object Manipulation},
author={Xie, Yue and others},
journal={arXiv preprint arXiv:2601.22153},
year={2026}
}