Merge branch 'main' into aloha_hd5_to_dataset_v2

Claudio Coppola 2025-01-09 11:17:07 +00:00 committed by GitHub
commit 0d9a0cdb6f
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194
6 changed files with 67 additions and 57 deletions


@@ -50,7 +50,7 @@ jobs:
         uses: actions/checkout@v3
       - name: Install poetry
-        run: pipx install poetry
+        run: pipx install "poetry<2.0.0"
      - name: Poetry check
         run: poetry check
@@ -64,7 +64,7 @@ jobs:
         uses: actions/checkout@v3
       - name: Install poetry
-        run: pipx install poetry
+        run: pipx install "poetry<2.0.0"
      - name: Install poetry-relax
         run: poetry self add poetry-relax


@@ -68,7 +68,7 @@
 ### Acknowledgment
-- Thanks to Tony Zaho, Zipeng Fu and colleagues for open sourcing ACT policy, ALOHA environments and datasets. Ours are adapted from [ALOHA](https://tonyzhaozh.github.io/aloha) and [Mobile ALOHA](https://mobile-aloha.github.io).
+- Thanks to Tony Zhao, Zipeng Fu and colleagues for open sourcing ACT policy, ALOHA environments and datasets. Ours are adapted from [ALOHA](https://tonyzhaozh.github.io/aloha) and [Mobile ALOHA](https://mobile-aloha.github.io).
 - Thanks to Cheng Chi, Zhenjia Xu and colleagues for open sourcing Diffusion policy, Pusht environment and datasets, as well as UMI datasets. Ours are adapted from [Diffusion Policy](https://diffusion-policy.cs.columbia.edu) and [UMI Gripper](https://umi-gripper.github.io).
 - Thanks to Nicklas Hansen, Yunhai Feng and colleagues for open sourcing TDMPC policy, Simxarm environments and datasets. Ours are adapted from [TDMPC](https://github.com/nicklashansen/tdmpc) and [FOWM](https://www.yunhaifeng.com/FOWM).
 - Thanks to Antonio Loquercio and Ashish Kumar for their early support.


@@ -21,7 +21,7 @@ How to decode videos?
 ## Variables
 **Image content & size**
-We don't expect the same optimal settings for a dataset of images from a simulation, or from real-world in an appartment, or in a factory, or outdoor, or with lots of moving objects in the scene, etc. Similarly, loading times might not vary linearly with the image size (resolution).
+We don't expect the same optimal settings for a dataset of images from a simulation, or from real-world in an apartment, or in a factory, or outdoor, or with lots of moving objects in the scene, etc. Similarly, loading times might not vary linearly with the image size (resolution).
 For these reasons, we run this benchmark on four representative datasets:
 - `lerobot/pusht_image`: (96 x 96 pixels) simulation with simple geometric shapes, fixed camera.
 - `aliberts/aloha_mobile_shrimp_image`: (480 x 640 pixels) real-world indoor, moving camera.
@@ -63,7 +63,7 @@ This of course is affected by the `-g` parameter during encoding, which specifies
 Note that this differs significantly from a typical use case like watching a movie, in which every frame is loaded sequentially from the beginning to the end and it's acceptable to have big values for `-g`.
-Additionally, because some policies might request single timestamps that are a few frames appart, we also have the following scenario:
+Additionally, because some policies might request single timestamps that are a few frames apart, we also have the following scenario:
 - `2_frames_4_space`: 2 frames with 4 consecutive frames of spacing in between (e.g `[t, t + 5 / fps]`),
 However, due to how video decoding is implemented with `pyav`, we don't have access to an accurate seek so in practice this scenario is essentially the same as `6_frames` since all 6 frames between `t` and `t + 5 / fps` will be decoded.
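To make the scenarios concrete, here is a small sketch of the timestamps each one requests. The helper name and dict layout are illustrative, reconstructed from the descriptions above rather than taken from the benchmark code:

```python
# Illustrative sketch: timestamps requested by the decoding scenarios above.
# Only `2_frames_4_space` and `6_frames` are described in this excerpt.
def scenario_timestamps(t: float, fps: float) -> dict[str, list[float]]:
    return {
        # 2 frames with 4 consecutive frames of spacing in between
        "2_frames_4_space": [t, t + 5 / fps],
        # 6 consecutive frames; with pyav's inaccurate seek, requesting
        # [t, t + 5 / fps] ends up decoding all 6 frames anyway
        "6_frames": [t + i / fps for i in range(6)],
    }
```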
@@ -85,8 +85,8 @@ However, due to how video decoding is implemented with `pyav`, we don't have access
 **Average Structural Similarity Index Measure (higher is better)**
 `avg_ssim` evaluates the perceived quality of images by comparing luminance, contrast, and structure. SSIM values range from -1 to 1, where 1 indicates perfect similarity.
-One aspect that can't be measured here with those metrics is the compatibility of the encoding accross platforms, in particular on web browser, for visualization purposes.
-h264, h265 and AV1 are all commonly used codecs and should not be pose an issue. However, the chroma subsampling (`pix_fmt`) format might affect compatibility:
+One aspect that can't be measured here with those metrics is the compatibility of the encoding across platforms, in particular on web browser, for visualization purposes.
+h264, h265 and AV1 are all commonly used codecs and should not pose an issue. However, the chroma subsampling (`pix_fmt`) format might affect compatibility:
 - `yuv420p` is more widely supported across various platforms, including web browsers.
 - `yuv444p` offers higher color fidelity but might not be supported as broadly.
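For reference, an `avg_ssim`-style metric can be computed with scikit-image along these lines. This is a minimal sketch assuming paired lists of original and decoded uint8 RGB frames; the helper name and the frame pairing are assumptions, only `structural_similarity` is the real skimage API:

```python
# Sketch of an average-SSIM computation with scikit-image,
# assuming frames are uint8 HxWxC arrays of identical shape.
import numpy as np
from skimage.metrics import structural_similarity

def avg_ssim(originals: list[np.ndarray], decoded: list[np.ndarray]) -> float:
    scores = [
        structural_similarity(orig, dec, channel_axis=-1)  # -1: color channel last
        for orig, dec in zip(originals, decoded)
    ]
    return float(np.mean(scores))
```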
@@ -116,7 +116,7 @@ Additional encoding parameters exist that are not included in this benchmark. In
 - `-preset` which allows for selecting encoding presets. This represents a collection of options that will provide a certain encoding speed to compression ratio. By leaving this parameter unspecified, it is considered to be `medium` for libx264 and libx265 and `8` for libsvtav1.
 - `-tune` which allows optimizing the encoding for certain aspects (e.g. film quality, fast decoding, etc.).
-See the documentation mentioned above for more detailled info on these settings and for a more comprehensive list of other parameters.
+See the documentation mentioned above for more detailed info on these settings and for a more comprehensive list of other parameters.
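As an illustration of these settings, here is a minimal sketch of an encoding call with explicit values. The input pattern, fps, and output path are placeholders; the flags themselves (`-g`, `-preset`, `-tune`, `-pix_fmt`) are standard ffmpeg options, but this exact command is not from the benchmark:

```python
# Minimal sketch: encode image frames to mp4 with explicit keyframe interval,
# preset, tune, and chroma subsampling. Paths and fps are placeholders.
import subprocess

subprocess.run(
    [
        "ffmpeg",
        "-framerate", "30",
        "-i", "frames/frame_%06d.png",  # placeholder input pattern
        "-c:v", "libx264",
        "-g", "2",               # keyframe every 2 frames (favors random access)
        "-preset", "medium",     # libx264's default speed/compression trade-off
        "-tune", "fastdecode",   # optimize for decoding speed
        "-pix_fmt", "yuv420p",   # widely supported chroma subsampling
        "output.mp4",            # placeholder output path
    ],
    check=True,
)
```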
 Similarly on the decoding side, other decoders exist but are not implemented in our current benchmark. To name a few:
 - `torchaudio`


@@ -159,11 +159,11 @@ DATASETS = {
         **ALOHA_STATIC_INFO,
     },
     "aloha_static_vinh_cup": {
-        "single_task": "Pick up the platic cup with the right arm, then pop its lid open with the left arm.",
+        "single_task": "Pick up the plastic cup with the right arm, then pop its lid open with the left arm.",
         **ALOHA_STATIC_INFO,
     },
     "aloha_static_vinh_cup_left": {
-        "single_task": "Pick up the platic cup with the left arm, then pop its lid open with the right arm.",
+        "single_task": "Pick up the plastic cup with the left arm, then pop its lid open with the right arm.",
         **ALOHA_STATIC_INFO,
     },
     "aloha_static_ziploc_slide": {"single_task": "Slide open the ziploc bag.", **ALOHA_STATIC_INFO},


@@ -177,7 +177,7 @@ def run_server(
             {"url": url_for("static", filename=video_path), "filename": video_path.parent.name}
             for video_path in video_paths
         ]
-        tasks = dataset.meta.episodes[0]["tasks"]
+        tasks = dataset.meta.episodes[episode_id]["tasks"]
     else:
         video_keys = [key for key, ft in dataset.features.items() if ft["dtype"] == "video"]
         videos_info = [
@@ -232,63 +232,48 @@ def get_episode_data(dataset: LeRobotDataset | IterableNamespace, episode_index)
     """Get a csv str containing timeseries data of an episode (e.g. state and action).
     This file will be loaded by Dygraph javascript to plot data in real time."""
     columns = []

-    has_state = "observation.state" in dataset.features
-    has_action = "action" in dataset.features
+    selected_columns = [col for col, ft in dataset.features.items() if ft["dtype"] == "float32"]
+    selected_columns.remove("timestamp")

     # init header of csv with state and action names
     header = ["timestamp"]
-    if has_state:
-        dim_state = (
-            dataset.meta.shapes["observation.state"][0]
-            if isinstance(dataset, LeRobotDataset)
-            else dataset.features["observation.state"].shape[0]
-        )
-        header += [f"state_{i}" for i in range(dim_state)]
-        column_names = dataset.features["observation.state"]["names"]
-        while not isinstance(column_names, list):
-            column_names = list(column_names.values())[0]
-        columns.append({"key": "state", "value": column_names})
-    if has_action:
-        dim_action = (
-            dataset.meta.shapes["action"][0]
-            if isinstance(dataset, LeRobotDataset)
-            else dataset.features.action.shape[0]
-        )
-        header += [f"action_{i}" for i in range(dim_action)]
-        column_names = dataset.features["action"]["names"]
-        while not isinstance(column_names, list):
-            column_names = list(column_names.values())[0]
-        columns.append({"key": "action", "value": column_names})
+
+    for column_name in selected_columns:
+        dim_state = (
+            dataset.meta.shapes[column_name][0]
+            if isinstance(dataset, LeRobotDataset)
+            else dataset.features[column_name].shape[0]
+        )
+        header += [f"{column_name}_{i}" for i in range(dim_state)]
+
+        if "names" in dataset.features[column_name] and dataset.features[column_name]["names"]:
+            column_names = dataset.features[column_name]["names"]
+            while not isinstance(column_names, list):
+                column_names = list(column_names.values())[0]
+        else:
+            column_names = [f"motor_{i}" for i in range(dim_state)]
+        columns.append({"key": column_name, "value": column_names})
+
+    selected_columns.insert(0, "timestamp")

     if isinstance(dataset, LeRobotDataset):
         from_idx = dataset.episode_data_index["from"][episode_index]
         to_idx = dataset.episode_data_index["to"][episode_index]
-        selected_columns = ["timestamp"]
-        if has_state:
-            selected_columns += ["observation.state"]
-        if has_action:
-            selected_columns += ["action"]
         data = (
             dataset.hf_dataset.select(range(from_idx, to_idx))
             .select_columns(selected_columns)
-            .with_format("numpy")
+            .with_format("pandas")
         )
-        rows = np.hstack(
-            (np.expand_dims(data["timestamp"], axis=1), *[data[col] for col in selected_columns[1:]])
-        ).tolist()
     else:
         repo_id = dataset.repo_id
-        selected_columns = ["timestamp"]
-        if "observation.state" in dataset.features:
-            selected_columns.append("observation.state")
-        if "action" in dataset.features:
-            selected_columns.append("action")

         url = f"https://huggingface.co/datasets/{repo_id}/resolve/main/" + dataset.data_path.format(
             episode_chunk=int(episode_index) // dataset.chunks_size, episode_index=episode_index
         )
         df = pd.read_parquet(url)
         data = df[selected_columns]  # Select specific columns
         rows = np.hstack(
             (
                 np.expand_dims(data["timestamp"], axis=1),
@@ -379,10 +364,6 @@ def visualize_dataset_html(
             template_folder=template_dir,
         )
     else:
-        image_keys = dataset.meta.image_keys if isinstance(dataset, LeRobotDataset) else []
-        if len(image_keys) > 0:
-            raise NotImplementedError(f"Image keys ({image_keys=}) are currently not supported.")
-
         # Create a symlink from the dataset video folder containing mp4 files to the output directory
         # so that the http server can get access to the mp4 files.
         if isinstance(dataset, LeRobotDataset):
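The net effect of the `get_episode_data` change above: instead of hardcoding `observation.state` and `action`, every `float32` feature except `timestamp` is exported, with a `motor_{i}` fallback when a feature carries no `names`. A condensed, standalone sketch of that selection logic (the `features` dict here is illustrative, not a real `LeRobotDataset`):

```python
# Condensed sketch of the new column-selection logic, run on
# illustrative feature metadata rather than a real dataset object.
features = {
    "timestamp": {"dtype": "float32", "shape": (1,)},
    "observation.state": {"dtype": "float32", "shape": (2,), "names": ["x", "y"]},
    "action": {"dtype": "float32", "shape": (2,)},  # no "names" -> motor_{i} fallback
    "observation.images.top": {"dtype": "video", "shape": (480, 640, 3)},  # skipped
}

selected_columns = [col for col, ft in features.items() if ft["dtype"] == "float32"]
selected_columns.remove("timestamp")

header, columns = ["timestamp"], []
for column_name in selected_columns:
    dim = features[column_name]["shape"][0]
    header += [f"{column_name}_{i}" for i in range(dim)]
    names = features[column_name].get("names") or [f"motor_{i}" for i in range(dim)]
    columns.append({"key": column_name, "value": names})

selected_columns.insert(0, "timestamp")
print(header)   # ['timestamp', 'observation.state_0', 'observation.state_1', 'action_0', 'action_1']
print(columns)  # per-feature name lists used for the Dygraph legend
```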


@@ -98,9 +98,34 @@
           </div>
           <!-- Videos -->
+          <div class="max-w-32 relative text-sm mb-4 select-none"
+               @click.outside="isVideosDropdownOpen = false">
+            <div
+              @click="isVideosDropdownOpen = !isVideosDropdownOpen"
+              class="p-2 border border-slate-500 rounded flex justify-between items-center cursor-pointer"
+            >
+              <span class="truncate">filter videos</span>
+              <div class="transition-transform" :class="{ 'rotate-180': isVideosDropdownOpen }">🔽</div>
+            </div>
+            <div x-show="isVideosDropdownOpen"
+                 class="absolute mt-1 border border-slate-500 rounded shadow-lg z-10">
+              <div>
+                <template x-for="option in videosKeys" :key="option">
+                  <div
+                    @click="videosKeysSelected = videosKeysSelected.includes(option) ? videosKeysSelected.filter(v => v !== option) : [...videosKeysSelected, option]"
+                    class="p-2 cursor-pointer bg-slate-900"
+                    :class="{ 'bg-slate-700': videosKeysSelected.includes(option) }"
+                    x-text="option"
+                  ></div>
+                </template>
+              </div>
+            </div>
+          </div>
           <div class="flex flex-wrap gap-x-2 gap-y-6">
             {% for video_info in videos_info %}
-            <div x-show="!videoCodecError" class="max-w-96 relative">
+            <div x-show="!videoCodecError && videosKeysSelected.includes('{{ video_info.filename }}')" class="max-w-96 relative">
               <p class="absolute inset-x-0 -top-4 text-sm text-gray-300 bg-gray-800 px-2 rounded-t-xl truncate">{{ video_info.filename }}</p>
               <video muted loop type="video/mp4" class="object-contain w-full h-full" @canplaythrough="videoCanPlay" @timeupdate="() => {
                 if (video.duration) {
@@ -250,6 +275,9 @@
             nVideos: {{ videos_info | length }},
             nVideoReadyToPlay: 0,
             videoCodecError: false,
+            isVideosDropdownOpen: false,
+            videosKeys: {{ videos_info | map(attribute='filename') | list | tojson }},
+            videosKeysSelected: [],
             columns: {{ columns | tojson }},
             rowLabels: {{ columns | tojson }}.reduce((colA, colB) => colA.value.length > colB.value.length ? colA : colB).value,
@@ -261,6 +289,7 @@
             if(!canPlayVideos){
               this.videoCodecError = true;
             }
+            this.videosKeysSelected = this.videosKeys.map(opt => opt)
             // process CSV data
             const csvDataStr = {{ episode_data_csv_str|tojson|safe }};