Merge branch 'main' into aloha_hd5_to_dataset_v2
commit 0d9a0cdb6f
@@ -50,7 +50,7 @@ jobs:
         uses: actions/checkout@v3

       - name: Install poetry
-        run: pipx install poetry
+        run: pipx install "poetry<2.0.0"

       - name: Poetry check
         run: poetry check

@@ -64,7 +64,7 @@ jobs:
         uses: actions/checkout@v3

       - name: Install poetry
-        run: pipx install poetry
+        run: pipx install "poetry<2.0.0"

       - name: Install poetry-relax
         run: poetry self add poetry-relax

@@ -68,7 +68,7 @@

 ### Acknowledgment

-- Thanks to Tony Zaho, Zipeng Fu and colleagues for open sourcing ACT policy, ALOHA environments and datasets. Ours are adapted from [ALOHA](https://tonyzhaozh.github.io/aloha) and [Mobile ALOHA](https://mobile-aloha.github.io).
+- Thanks to Tony Zhao, Zipeng Fu and colleagues for open sourcing ACT policy, ALOHA environments and datasets. Ours are adapted from [ALOHA](https://tonyzhaozh.github.io/aloha) and [Mobile ALOHA](https://mobile-aloha.github.io).
 - Thanks to Cheng Chi, Zhenjia Xu and colleagues for open sourcing Diffusion policy, Pusht environment and datasets, as well as UMI datasets. Ours are adapted from [Diffusion Policy](https://diffusion-policy.cs.columbia.edu) and [UMI Gripper](https://umi-gripper.github.io).
 - Thanks to Nicklas Hansen, Yunhai Feng and colleagues for open sourcing TDMPC policy, Simxarm environments and datasets. Ours are adapted from [TDMPC](https://github.com/nicklashansen/tdmpc) and [FOWM](https://www.yunhaifeng.com/FOWM).
 - Thanks to Antonio Loquercio and Ashish Kumar for their early support.

@@ -21,7 +21,7 @@ How to decode videos?

 ## Variables
 **Image content & size**
-We don't expect the same optimal settings for a dataset of images from a simulation, or from real-world in an appartment, or in a factory, or outdoor, or with lots of moving objects in the scene, etc. Similarly, loading times might not vary linearly with the image size (resolution).
+We don't expect the same optimal settings for a dataset of images from a simulation, or from real-world in an apartment, or in a factory, or outdoor, or with lots of moving objects in the scene, etc. Similarly, loading times might not vary linearly with the image size (resolution).
 For these reasons, we run this benchmark on four representative datasets:
 - `lerobot/pusht_image`: (96 x 96 pixels) simulation with simple geometric shapes, fixed camera.
 - `aliberts/aloha_mobile_shrimp_image`: (480 x 640 pixels) real-world indoor, moving camera.

@@ -63,7 +63,7 @@ This of course is affected by the `-g` parameter during encoding, which specifie
 Note that this differs significantly from a typical use case like watching a movie, in which every frame is loaded sequentially from the beginning to the end and it's acceptable to have big values for `-g`.

-Additionally, because some policies might request single timestamps that are a few frames appart, we also have the following scenario:
+Additionally, because some policies might request single timestamps that are a few frames apart, we also have the following scenario:
 - `2_frames_4_space`: 2 frames with 4 consecutive frames of spacing in between (e.g `[t, t + 5 / fps]`),

 However, due to how video decoding is implemented with `pyav`, we don't have access to an accurate seek so in practice this scenario is essentially the same as `6_frames` since all 6 frames between `t` and `t + 5 / fps` will be decoded.

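As a side illustration, the spacing arithmetic behind these scenarios can be sketched in Python; `t`, `fps` and the variable names below are assumed placeholders, not verbatim benchmark code:

```python
# Hypothetical reconstruction of the timestamp queries discussed above.
fps = 30   # assumed frame rate of the dataset
t = 1.0    # assumed reference timestamp in seconds

six_frames = [t + i / fps for i in range(6)]   # 6 consecutive frames
two_frames_4_space = [t, t + 5 / fps]          # 2 frames with 4 frames of spacing in between

# Since pyav does not expose an accurate seek, requesting `two_frames_4_space`
# still decodes every frame covered by `six_frames`, so both scenarios cost the same.
```
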
@@ -85,8 +85,8 @@ However, due to how video decoding is implemented with `pyav`, we don't have acc
 **Average Structural Similarity Index Measure (higher is better)**
 `avg_ssim` evaluates the perceived quality of images by comparing luminance, contrast, and structure. SSIM values range from -1 to 1, where 1 indicates perfect similarity.

-One aspect that can't be measured here with those metrics is the compatibility of the encoding accross platforms, in particular on web browser, for visualization purposes.
-h264, h265 and AV1 are all commonly used codecs and should not be pose an issue. However, the chroma subsampling (`pix_fmt`) format might affect compatibility:
+One aspect that can't be measured here with those metrics is the compatibility of the encoding across platforms, in particular on web browser, for visualization purposes.
+h264, h265 and AV1 are all commonly used codecs and should not pose an issue. However, the chroma subsampling (`pix_fmt`) format might affect compatibility:
 - `yuv420p` is more widely supported across various platforms, including web browsers.
 - `yuv444p` offers higher color fidelity but might not be supported as broadly.

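For reference, a per-frame SSIM of the kind `avg_ssim` averages could be computed along these lines; the use of scikit-image here is an assumption, not necessarily what the benchmark scripts use:

```python
import numpy as np
from skimage.metrics import structural_similarity

# Placeholder frames: `original` stands for a ground-truth RGB frame, `decoded`
# for the frame recovered after encoding/decoding. Identical frames give SSIM = 1.0.
original = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
decoded = original.copy()

# channel_axis=-1 treats the last axis as color channels (scikit-image >= 0.19);
# data_range matches the uint8 pixel range.
ssim = structural_similarity(original, decoded, channel_axis=-1, data_range=255)
print(f"ssim: {ssim:.4f}")
```
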
@@ -116,7 +116,7 @@ Additional encoding parameters exist that are not included in this benchmark. In
 - `-preset` which allows for selecting encoding presets. This represents a collection of options that will provide a certain encoding speed to compression ratio. By leaving this parameter unspecified, it is considered to be `medium` for libx264 and libx265 and `8` for libsvtav1.
 - `-tune` which allows to optimize the encoding for certains aspects (e.g. film quality, fast decoding, etc.).

-See the documentation mentioned above for more detailled info on these settings and for a more comprehensive list of other parameters.
+See the documentation mentioned above for more detailed info on these settings and for a more comprehensive list of other parameters.

 Similarly on the decoding side, other decoders exist but are not implemented in our current benchmark. To name a few:
 - `torchaudio`

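To make these parameters concrete, here is a hypothetical ffmpeg invocation driven from Python; the paths, frame rate and parameter values are illustrative assumptions, not the benchmark's actual command:

```python
import subprocess

cmd = [
    "ffmpeg",
    "-f", "image2",
    "-framerate", "30",
    "-i", "frames/frame_%06d.png",  # assumed input frame pattern
    "-vcodec", "libx264",
    "-pix_fmt", "yuv444p",          # chroma subsampling, as discussed above
    "-g", "2",                      # keyframe interval
    "-preset", "medium",            # encoding speed vs. compression ratio
    "-tune", "fastdecode",          # optimize for fast decoding
    "-y", "output.mp4",
]
subprocess.run(cmd, check=True)
```
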
@@ -159,11 +159,11 @@ DATASETS = {
         **ALOHA_STATIC_INFO,
     },
     "aloha_static_vinh_cup": {
-        "single_task": "Pick up the platic cup with the right arm, then pop its lid open with the left arm.",
+        "single_task": "Pick up the plastic cup with the right arm, then pop its lid open with the left arm.",
         **ALOHA_STATIC_INFO,
     },
     "aloha_static_vinh_cup_left": {
-        "single_task": "Pick up the platic cup with the left arm, then pop its lid open with the right arm.",
+        "single_task": "Pick up the plastic cup with the left arm, then pop its lid open with the right arm.",
         **ALOHA_STATIC_INFO,
     },
     "aloha_static_ziploc_slide": {"single_task": "Slide open the ziploc bag.", **ALOHA_STATIC_INFO},

@@ -177,7 +177,7 @@ def run_server(
                 {"url": url_for("static", filename=video_path), "filename": video_path.parent.name}
                 for video_path in video_paths
             ]
-            tasks = dataset.meta.episodes[0]["tasks"]
+            tasks = dataset.meta.episodes[episode_id]["tasks"]
         else:
             video_keys = [key for key, ft in dataset.features.items() if ft["dtype"] == "video"]
             videos_info = [

@@ -232,69 +232,54 @@ def get_episode_data(dataset: LeRobotDataset | IterableNamespace, episode_index)
     """Get a csv str containing timeseries data of an episode (e.g. state and action).
     This file will be loaded by Dygraph javascript to plot data in real time."""
     columns = []
-    has_state = "observation.state" in dataset.features
-    has_action = "action" in dataset.features
+
+    selected_columns = [col for col, ft in dataset.features.items() if ft["dtype"] == "float32"]
+    selected_columns.remove("timestamp")

     # init header of csv with state and action names
     header = ["timestamp"]
-    if has_state:
+
+    for column_name in selected_columns:
         dim_state = (
-            dataset.meta.shapes["observation.state"][0]
+            dataset.meta.shapes[column_name][0]
             if isinstance(dataset, LeRobotDataset)
-            else dataset.features["observation.state"].shape[0]
+            else dataset.features[column_name].shape[0]
         )
-        header += [f"state_{i}" for i in range(dim_state)]
-        column_names = dataset.features["observation.state"]["names"]
-        while not isinstance(column_names, list):
-            column_names = list(column_names.values())[0]
-        columns.append({"key": "state", "value": column_names})
-    if has_action:
-        dim_action = (
-            dataset.meta.shapes["action"][0]
-            if isinstance(dataset, LeRobotDataset)
-            else dataset.features.action.shape[0]
-        )
-        header += [f"action_{i}" for i in range(dim_action)]
-        column_names = dataset.features["action"]["names"]
-        while not isinstance(column_names, list):
-            column_names = list(column_names.values())[0]
-        columns.append({"key": "action", "value": column_names})
+        header += [f"{column_name}_{i}" for i in range(dim_state)]
+
+        if "names" in dataset.features[column_name] and dataset.features[column_name]["names"]:
+            column_names = dataset.features[column_name]["names"]
+            while not isinstance(column_names, list):
+                column_names = list(column_names.values())[0]
+        else:
+            column_names = [f"motor_{i}" for i in range(dim_state)]
+        columns.append({"key": column_name, "value": column_names})
+
+    selected_columns.insert(0, "timestamp")

     if isinstance(dataset, LeRobotDataset):
         from_idx = dataset.episode_data_index["from"][episode_index]
         to_idx = dataset.episode_data_index["to"][episode_index]
-        selected_columns = ["timestamp"]
-        if has_state:
-            selected_columns += ["observation.state"]
-        if has_action:
-            selected_columns += ["action"]
         data = (
             dataset.hf_dataset.select(range(from_idx, to_idx))
             .select_columns(selected_columns)
-            .with_format("numpy")
+            .with_format("pandas")
         )
-        rows = np.hstack(
-            (np.expand_dims(data["timestamp"], axis=1), *[data[col] for col in selected_columns[1:]])
-        ).tolist()
     else:
         repo_id = dataset.repo_id
-        selected_columns = ["timestamp"]
-        if "observation.state" in dataset.features:
-            selected_columns.append("observation.state")
-        if "action" in dataset.features:
-            selected_columns.append("action")

         url = f"https://huggingface.co/datasets/{repo_id}/resolve/main/" + dataset.data_path.format(
             episode_chunk=int(episode_index) // dataset.chunks_size, episode_index=episode_index
         )
         df = pd.read_parquet(url)
         data = df[selected_columns]  # Select specific columns
-        rows = np.hstack(
-            (
-                np.expand_dims(data["timestamp"], axis=1),
-                *[np.vstack(data[col]) for col in selected_columns[1:]],
-            )
-        ).tolist()
+
+    rows = np.hstack(
+        (
+            np.expand_dims(data["timestamp"], axis=1),
+            *[np.vstack(data[col]) for col in selected_columns[1:]],
+        )
+    ).tolist()

     # Convert data to CSV string
     csv_buffer = StringIO()

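The shared `rows` construction that replaces the two per-branch versions can be illustrated on toy data; the column names and values below are invented for the example:

```python
import numpy as np
import pandas as pd

# Each float32 feature column holds one array per frame; timestamps are scalars.
data = pd.DataFrame(
    {
        "timestamp": [0.0, 0.033, 0.066],
        "observation.state": [np.zeros(2), np.ones(2), np.full(2, 2.0)],
        "action": [np.zeros(3), np.ones(3), np.full(3, 2.0)],
    }
)
selected_columns = ["timestamp", "observation.state", "action"]

# np.vstack stacks the per-frame arrays of each column into a (n_frames, dim) block,
# and np.hstack prefixes every row with its timestamp.
rows = np.hstack(
    (
        np.expand_dims(data["timestamp"], axis=1),
        *[np.vstack(data[col]) for col in selected_columns[1:]],
    )
).tolist()
# each row: [timestamp, state_0, state_1, action_0, action_1, action_2]
```
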
@@ -379,10 +364,6 @@ def visualize_dataset_html(
             template_folder=template_dir,
         )
     else:
-        image_keys = dataset.meta.image_keys if isinstance(dataset, LeRobotDataset) else []
-        if len(image_keys) > 0:
-            raise NotImplementedError(f"Image keys ({image_keys=}) are currently not supported.")
-
         # Create a simlink from the dataset video folder containg mp4 files to the output directory
         # so that the http server can get access to the mp4 files.
         if isinstance(dataset, LeRobotDataset):

@@ -98,9 +98,34 @@
       </div>

       <!-- Videos -->
+      <div class="max-w-32 relative text-sm mb-4 select-none"
+           @click.outside="isVideosDropdownOpen = false">
+        <div
+          @click="isVideosDropdownOpen = !isVideosDropdownOpen"
+          class="p-2 border border-slate-500 rounded flex justify-between items-center cursor-pointer"
+        >
+          <span class="truncate">filter videos</span>
+          <div class="transition-transform" :class="{ 'rotate-180': isVideosDropdownOpen }">🔽</div>
+        </div>
+
+        <div x-show="isVideosDropdownOpen"
+             class="absolute mt-1 border border-slate-500 rounded shadow-lg z-10">
+          <div>
+            <template x-for="option in videosKeys" :key="option">
+              <div
+                @click="videosKeysSelected = videosKeysSelected.includes(option) ? videosKeysSelected.filter(v => v !== option) : [...videosKeysSelected, option]"
+                class="p-2 cursor-pointer bg-slate-900"
+                :class="{ 'bg-slate-700': videosKeysSelected.includes(option) }"
+                x-text="option"
+              ></div>
+            </template>
+          </div>
+        </div>
+      </div>
+
       <div class="flex flex-wrap gap-x-2 gap-y-6">
         {% for video_info in videos_info %}
-          <div x-show="!videoCodecError" class="max-w-96 relative">
+          <div x-show="!videoCodecError && videosKeysSelected.includes('{{ video_info.filename }}')" class="max-w-96 relative">
             <p class="absolute inset-x-0 -top-4 text-sm text-gray-300 bg-gray-800 px-2 rounded-t-xl truncate">{{ video_info.filename }}</p>
             <video muted loop type="video/mp4" class="object-contain w-full h-full" @canplaythrough="videoCanPlay" @timeupdate="() => {
               if (video.duration) {

@@ -250,6 +275,9 @@
         nVideos: {{ videos_info | length }},
         nVideoReadyToPlay: 0,
         videoCodecError: false,
+        isVideosDropdownOpen: false,
+        videosKeys: {{ videos_info | map(attribute='filename') | list | tojson }},
+        videosKeysSelected: [],
         columns: {{ columns | tojson }},
         rowLabels: {{ columns | tojson }}.reduce((colA, colB) => colA.value.length > colB.value.length ? colA : colB).value,

@@ -261,6 +289,7 @@
       if(!canPlayVideos){
        this.videoCodecError = true;
       }
+      this.videosKeysSelected = this.videosKeys.map(opt => opt)

       // process CSV data
       const csvDataStr = {{ episode_data_csv_str|tojson|safe }};