Instructions for Preparing Human Hand V-L-A Data
This folder provides essential documentation and scripts for the human hand V-L-A data used in this project. Please note that the provided metadata may continue to receive updates: based on manual inspection, the current version achieves roughly 90% annotation accuracy, and we plan to further improve the metadata quality in future releases.
The contents of this folder are as follows:
Table of Contents
- 1. Prerequisites
- 2. Data Download
- 3. Video Preprocessing
- 4. Metadata Structure
- 5. Data Visualization
1. Prerequisites
Our data preprocessing and visualization rely on several dependencies that need to be prepared in advance. If you have already completed the installation steps in 1.2 Visualization Requirements of the readme.md, you can skip this section.
Python Libraries
PyTorch3D is required for visualization. You can install it according to the official guide, or simply run the command below:
pip install --no-build-isolation git+https://github.com/facebookresearch/pytorch3d.git@stable#egg=pytorch3d
FFmpeg is also required for video processing:
sudo apt install ffmpeg
pip install ffmpeg-python
Other Python dependencies can be installed using the following command:
pip install projectaria_tools smplx
pip install --no-build-isolation git+https://github.com/mattloper/chumpy#egg=chumpy
MANO Hand Model
Our reconstructed hand labels are based on the MANO hand model; only the right-hand model is required. The model parameters can be downloaded from the official website and organized in the following structure:
weights/
└── mano/
    ├── MANO_RIGHT.pkl
    └── mano_mean_params.npz
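To quickly check that the MANO weights are in place, the following minimal sketch (our own example, assuming the smplx package installed above and the weights/ layout shown; not part of the released scripts) builds the right-hand model and runs a forward pass with neutral parameters:
import torch
import smplx
# Sanity-check sketch: load the right-hand MANO model from the weights/ folder above.
mano_layer = smplx.create('weights', model_type='mano', is_rhand=True, use_pca=False)
out = mano_layer(
    betas=torch.zeros(1, 10),          # hand shape parameters
    global_orient=torch.zeros(1, 3),   # wrist rotation (axis-angle)
    hand_pose=torch.zeros(1, 45),      # 15 finger joints x 3 (axis-angle)
    transl=torch.zeros(1, 3),          # wrist translation
)
print(out.vertices.shape, out.joints.shape)  # expected roughly (1, 778, 3) and (1, 16, 3)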
2. Data Download
Meta Information
We provide the metadata for the human V-L-A episodes we constructed, which can be downloaded from this link. Each metadata entry contains the segmentation information of the corresponding V-L-A episode, language descriptions, as well as reconstructed camera parameters and 3D hand information. The detailed structure of the metadata can be found at Metadata Structure. The total size of all metadata is approximately 100 GB.
After extracting the files, the downloaded metadata will have the following structure:
Metadata/
├── {dataset_name1}/
│   ├── episode_frame_index.npz
│   └── episodic_annotations/
│       ├── {dataset_name1}_{video_name1}_ep_{000000}.npy
│       ├── {dataset_name1}_{video_name1}_ep_{000001}.npy
│       ├── {dataset_name1}_{video_name1}_ep_{000002}.npy
│       ├── {dataset_name1}_{video_name2}_ep_{000000}.npy
│       ├── {dataset_name1}_{video_name2}_ep_{000001}.npy
│       └── ...
└── {dataset_name2}/
    └── ...
Here, {dataset_name} indicates which dataset the episode belongs to, {video_name} corresponds to the name of the original raw video, and ep_{000000} is the episode's index.
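As a convenience, a small sketch like the following (based on the Metadata/ layout above; not an official script) can be used to enumerate the downloaded episodes per dataset:
import glob
import os
# Count episode annotation files for each dataset under the metadata root.
metadata_root = 'Metadata'
for dataset_name in sorted(os.listdir(metadata_root)):
    episode_files = sorted(glob.glob(
        os.path.join(metadata_root, dataset_name, 'episodic_annotations', '*.npy')))
    print(f'{dataset_name}: {len(episode_files)} episodes')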
Videos
Our project currently uses videos collected from four sources: Ego4D, Epic-Kitchen, EgoExo4D, and Something-Something V2. Due to license restrictions, we cannot provide our processed video data directly. To access the data, please apply for and download the original videos from the official dataset websites. Note that we only need the raw video files for this project.
The structure of the downloaded raw data for each dataset is as follows:
- Ego4D:
Ego4D_root/
└── v2/
    └── full_scale/
        ├── {video_name1}.mp4
        ├── {video_name2}.mp4
        ├── {video_name3}.mp4
        └── ...
- Epic-Kitchen:
Epic-Kitchen_root/
├── P01/
│   └── videos/
│       ├── {video_name1}.MP4
│       ├── {video_name2}.MP4
│       └── ...
├── P02/
│   └── videos/
│       ├── {video_name3}.MP4
│       ├── {video_name4}.MP4
│       └── ...
└── ...
- EgoExo4D:
EgoExo4D_root/
└── takes/
    ├── {video_name1}/
    │   └── frame_aligned_videos/
    │       ├── {cam_name1}.mp4
    │       ├── {cam_name2}.mp4
    │       └── ...
    ├── {video_name2}/
    │   └── frame_aligned_videos/
    │       ├── {cam_name1}.mp4
    │       ├── {cam_name2}.mp4
    │       └── ...
    └── ...
- Somethingsomething-v2:
Somethingsomething-v2_root/
├── {video_name1}.webm
├── {video_name2}.webm
├── {video_name3}.webm
└── ...
3. Video Preprocessing
A large portion of the raw videos in Ego4D and EgoExo4D have fisheye distortion. To standardize the processing, we corrected the fisheye distortion and converted the videos to a pinhole camera model. Our metadata is based on the resulting undistorted videos. To enable reproduction of our data, we provide scripts to perform this undistortion on the original videos.
Camera Intrinsics
We provide our estimated intrinsics for raw videos in Ego4D (computed using DroidCalib as described in our paper) and the ground-truth Project Aria intrinsics for EgoExo4D (from the official repository). These files can be downloaded via this link and organized as follows:
camera_intrinsics_root/
├── ego4d/
│   ├── {video_name1}.npy
│   ├── {video_name2}.npy
│   └── ...
└── egoexo4d/
    ├── {video_name3}.json
    ├── {video_name4}.json
    └── ...
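If you want to inspect these files before running the preprocessing, a minimal sketch like the one below (our own example; replace the placeholder filenames with real ones, and check the exact contents against the preprocessing scripts) simply loads and prints them:
import json
import numpy as np
# Inspect one Ego4D intrinsics file (.npy) and one EgoExo4D intrinsics file (.json).
ego4d_intr = np.load('camera_intrinsics_root/ego4d/{video_name1}.npy', allow_pickle=True)
print(ego4d_intr)
with open('camera_intrinsics_root/egoexo4d/{video_name3}.json') as f:
    egoexo4d_intr = json.load(f)
print(egoexo4d_intr)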
Video Undistortion
Given the raw videos organized according to the structure described in Data Download and the provided camera intrinsics, the fisheye-distorted videos can be undistorted using the following script:
cd data/preprocessing
# for Ego4D videos
usage: undistort_video.py [-h] --video_root VIDEO_ROOT --intrinsics_root INTRINSICS_ROOT --save_root SAVE_ROOT [--video_start VIDEO_START] [--video_end VIDEO_END] [--batch_size BATCH_SIZE] [--crf CRF]
options:
-h, --help show this help message and exit
--video_root VIDEO_ROOT Folder containing input videos
--intrinsics_root INTRINSICS_ROOT Folder containing intrinsics info
--save_root SAVE_ROOT Folder for saving output videos
--video_start VIDEO_START Start video index (inclusive)
--video_end VIDEO_END End video index (exclusive)
--batch_size BATCH_SIZE Number of frames to be processed per batch (TS chunk)
--crf CRF CRF for ffmpeg encoding quality
An example command is:
# for Ego4D videos
python undistort_video.py --video_root Ego4D_root/v2/full_scale --intrinsics_root camera_intrinsics_root/ego4d --save_root Ego4D_undistorted_root --video_start 0 --video_end 10
which sequentially processes the first 10 Ego4D videos (indices 0-9) and saves the undistorted outputs to the specified --save_root, i.e., Ego4D_undistorted_root.
Similarly, for EgoExo4D videos, you can run a command like:
# for EgoExo4D videos
python undistort_video_egoexo4d.py --video_root EgoExo4D_root --intrinsics_root camera_intrinsics_root/egoexo4d --save_root EgoExo4D_undistorted_root --video_start 0 --video_end 10
Each video is processed in segments according to the specified batch size and then concatenated afterward. Notably, processing the entire dataset is time-consuming and requires substantial storage (around 10 TB). The script provided here is only a basic reference example. We recommend parallelizing and optimizing it before running it on a compute cluster.
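As a starting point for such parallelization, here is a rough sketch (our own assumption, not part of the released scripts) that launches several undistort_video.py workers, each covering a disjoint video index range:
import subprocess
from concurrent.futures import ThreadPoolExecutor
# Hypothetical settings: adjust to the actual number of raw videos and available cores.
NUM_WORKERS = 4
TOTAL_VIDEOS = 100
CHUNK = (TOTAL_VIDEOS + NUM_WORKERS - 1) // NUM_WORKERS

def run_chunk(start, end):
    # Each worker processes videos with indices in [start, end).
    subprocess.run([
        'python', 'undistort_video.py',
        '--video_root', 'Ego4D_root/v2/full_scale',
        '--intrinsics_root', 'camera_intrinsics_root/ego4d',
        '--save_root', 'Ego4D_undistorted_root',
        '--video_start', str(start),
        '--video_end', str(end),
    ], check=True)

with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
    jobs = [pool.submit(run_chunk, s, min(s + CHUNK, TOTAL_VIDEOS))
            for s in range(0, TOTAL_VIDEOS, CHUNK)]
    for job in jobs:
        job.result()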
The undistortion step is only applied to Ego4D and EgoExo4D videos. Epic-Kitchen and Somethingsomething-v2 do not require undistortion and can be used directly as downloaded from the official sources.
4. Metadata Structure
Our metadata for each V-L-A episode can be loaded via:
import numpy as np
# Load the metadata dictionary for one episode
# (replace the placeholders with an actual dataset/video/episode name)
episode_info = np.load('{dataset_name1}_{video_name1}_ep_000000.npy', allow_pickle=True).item()
The detailed structure of the episode_info is as follows:
episode_info (dict)                                # Metadata for a single V-L-A episode
├── 'video_clip_id_segment': list[int]             # Deprecated
├── 'extrinsics': np.ndarray                       # (Tx4x4) World2Cam extrinsic matrices
├── 'intrinsics': np.ndarray                       # (3x3) Camera intrinsic matrix
├── 'video_decode_frame': list[int]                # Frame indices in the original raw video (starting from 0)
├── 'video_name': str                              # Original raw video name
├── 'avg_speed': float                             # Average wrist movement per frame (in meters)
├── 'total_rotvec_degree': float                   # Total camera rotation over the episode (in degrees)
├── 'total_transl_dist': float                     # Total camera translation distance over the episode (in meters)
├── 'anno_type': str                               # Annotation type, specifying the primary hand action considered when segmenting the episode
├── 'text': dict                                   # Textual descriptions for the episode
│   ├── 'left': List[(str, (int, int))]            # Each entry contains (description, (start_frame_in_episode, end_frame_in_episode))
│   └── 'right': List[(str, (int, int))]           # Same structure for the right hand
├── 'text_rephrase': dict                          # Rephrased textual descriptions from GPT-4
│   ├── 'left': List[(List[str], (int, int))]      # Each entry contains (list of rephrased descriptions, (start_frame_in_episode, end_frame_in_episode))
│   └── 'right': List[(List[str], (int, int))]     # Same as above for the right hand
├── 'left': dict                                   # Left hand 3D pose info
│   ├── 'beta': np.ndarray                         # (10) MANO hand shape parameters (based on the MANO_RIGHT model)
│   ├── 'global_orient_camspace': np.ndarray       # (Tx3x3) Hand wrist rotations from MANO's canonical space to camera space
│   ├── 'global_orient_worldspace': np.ndarray     # (Tx3x3) Hand wrist rotations from MANO's canonical space to world space
│   ├── 'hand_pose': np.ndarray                    # (Tx15x3x3) Local hand joint rotations (based on the MANO_RIGHT model)
│   ├── 'transl_camspace': np.ndarray              # (Tx3) Hand wrist translation in camera space
│   ├── 'transl_worldspace': np.ndarray            # (Tx3) Hand wrist translation in world space
│   ├── 'kept_frames': list[int]                   # (T) 0/1 mask of frames with a valid left-hand reconstruction
│   ├── 'joints_camspace': np.ndarray              # (Tx21x3) 3D hand joint positions in camera space
│   ├── 'joints_worldspace': np.ndarray            # (Tx21x3) 3D hand joint positions in world space
│   ├── 'wrist': np.ndarray                        # Deprecated
│   ├── 'max_translation_movement': float          # Deprecated
│   ├── 'max_wrist_rotation_movement': float       # Deprecated
│   └── 'max_finger_joint_angle_movement': float   # Deprecated
└── 'right': dict                                  # Right hand 3D pose info (same structure as 'left')
    ├── 'beta': np.ndarray
    ├── 'global_orient_camspace': np.ndarray
    ├── 'global_orient_worldspace': np.ndarray
    ├── 'hand_pose': np.ndarray
    ├── 'transl_camspace': np.ndarray
    ├── 'transl_worldspace': np.ndarray
    ├── 'kept_frames': list[int]
    ├── 'joints_camspace': np.ndarray
    ├── 'joints_worldspace': np.ndarray
    ├── 'wrist': np.ndarray
    ├── 'max_translation_movement': float
    ├── 'max_wrist_rotation_movement': float
    └── 'max_finger_joint_angle_movement': float
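As a quick illustration of how these fields fit together, the sketch below (our own example, not an official script; replace the placeholder filename with a real episode file) extracts the right-hand wrist trajectory in world space and projects the camera-space joints of one valid frame into pixel coordinates, assuming the standard pinhole model implied by the undistorted videos:
import numpy as np
episode_info = np.load('{dataset_name1}_{video_name1}_ep_000000.npy',
                       allow_pickle=True).item()
right = episode_info['right']
valid = np.array(right['kept_frames'], dtype=bool)   # frames with a valid right-hand reconstruction
wrist_world = right['transl_worldspace'][valid]      # (T_valid, 3) wrist positions in world space
# Project the camera-space joints of the first valid frame into the image plane.
K = episode_info['intrinsics']                       # (3, 3) pinhole intrinsics
joints_cam = right['joints_camspace'][valid][0]      # (21, 3), assumes at least one valid frame
uv = (K @ joints_cam.T).T                            # (21, 3) homogeneous pixel coordinates
uv = uv[:, :2] / uv[:, 2:3]                          # divide by depth to get (u, v)
print(wrist_world.shape, uv.shape)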
To better understand how to use the episode metadata, we provide a visualization script, as described in the next section.
5. Data Visualization
Our metadata for each episode can be visualized with the following command, which will generate a video in the same format as shown on our webpage.
We recommend following the undistortion procedure described above, placing all undistorted videos in a single video_root folder and the corresponding metadata in a label_root folder, and then running the visualization script.
usage: data/demo_visualization_epi.py [-h] --video_root VIDEO_ROOT --label_root LABEL_ROOT --save_path SAVE_PATH --mano_model_path MANO_MODEL_PATH [--render_gradual_traj]
options:
-h, --help show this help message and exit
--video_root VIDEO_ROOT Root directory containing the video files
--label_root LABEL_ROOT Root directory containing the episode label (.npy) files
--save_path SAVE_PATH Directory to save the output visualization videos
--mano_model_path MANO_MODEL_PATH Path to the MANO model files
--render_gradual_traj Set flag to render a gradual trajectory (full mode)
We provide an example command for running the script, as well as a sample for visualization:
python data/demo_visualization_epi.py --video_root data/examples/videos --label_root data/examples/annotations --save_path data/examples/visualize --mano_model_path MANO_MODEL_PATH --render_gradual_traj
Note that using --render_gradual_traj renders the hand trajectory from the current frame to the end of the episode for every frame, which can be slow. To speed up visualization, you may omit this option.
For a more detailed understanding of the metadata, please see visualization/visualize_core.py.