Merge branch 'main' into aloha_hd5_to_dataset_v2

Claudio Coppola 2025-01-09 11:17:07 +00:00 committed by GitHub
commit 0d9a0cdb6f
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194
6 changed files with 67 additions and 57 deletions


@@ -50,7 +50,7 @@ jobs:
         uses: actions/checkout@v3
       - name: Install poetry
-        run: pipx install poetry
+        run: pipx install "poetry<2.0.0"
      - name: Poetry check
         run: poetry check
@@ -64,7 +64,7 @@ jobs:
         uses: actions/checkout@v3
       - name: Install poetry
-        run: pipx install poetry
+        run: pipx install "poetry<2.0.0"
      - name: Install poetry-relax
         run: poetry self add poetry-relax


@@ -68,7 +68,7 @@
 ### Acknowledgment
-- Thanks to Tony Zaho, Zipeng Fu and colleagues for open sourcing ACT policy, ALOHA environments and datasets. Ours are adapted from [ALOHA](https://tonyzhaozh.github.io/aloha) and [Mobile ALOHA](https://mobile-aloha.github.io).
+- Thanks to Tony Zhao, Zipeng Fu and colleagues for open sourcing ACT policy, ALOHA environments and datasets. Ours are adapted from [ALOHA](https://tonyzhaozh.github.io/aloha) and [Mobile ALOHA](https://mobile-aloha.github.io).
 - Thanks to Cheng Chi, Zhenjia Xu and colleagues for open sourcing Diffusion policy, Pusht environment and datasets, as well as UMI datasets. Ours are adapted from [Diffusion Policy](https://diffusion-policy.cs.columbia.edu) and [UMI Gripper](https://umi-gripper.github.io).
 - Thanks to Nicklas Hansen, Yunhai Feng and colleagues for open sourcing TDMPC policy, Simxarm environments and datasets. Ours are adapted from [TDMPC](https://github.com/nicklashansen/tdmpc) and [FOWM](https://www.yunhaifeng.com/FOWM).
 - Thanks to Antonio Loquercio and Ashish Kumar for their early support.


@@ -21,7 +21,7 @@ How to decode videos?
 ## Variables
 **Image content & size**
-We don't expect the same optimal settings for a dataset of images from a simulation, or from real-world in an appartment, or in a factory, or outdoor, or with lots of moving objects in the scene, etc. Similarly, loading times might not vary linearly with the image size (resolution).
+We don't expect the same optimal settings for a dataset of images from a simulation, or from real-world in an apartment, or in a factory, or outdoor, or with lots of moving objects in the scene, etc. Similarly, loading times might not vary linearly with the image size (resolution).
 For these reasons, we run this benchmark on four representative datasets:
 - `lerobot/pusht_image`: (96 x 96 pixels) simulation with simple geometric shapes, fixed camera.
 - `aliberts/aloha_mobile_shrimp_image`: (480 x 640 pixels) real-world indoor, moving camera.
@@ -63,7 +63,7 @@ This of course is affected by the `-g` parameter during encoding, which specifies
 Note that this differs significantly from a typical use case like watching a movie, in which every frame is loaded sequentially from the beginning to the end and it's acceptable to have big values for `-g`.
-Additionally, because some policies might request single timestamps that are a few frames appart, we also have the following scenario:
+Additionally, because some policies might request single timestamps that are a few frames apart, we also have the following scenario:
 - `2_frames_4_space`: 2 frames with 4 consecutive frames of spacing in between (e.g `[t, t + 5 / fps]`),
 However, due to how video decoding is implemented with `pyav`, we don't have access to an accurate seek so in practice this scenario is essentially the same as `6_frames` since all 6 frames between `t` and `t + 5 / fps` will be decoded.
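To make the scenarios concrete, here is a small sketch of the timestamps each one requests. The helper name and dict layout are illustrative, reconstructed from the descriptions above rather than taken from the benchmark code:

```python
# Illustrative sketch: timestamps requested by the decoding scenarios above.
# Only `2_frames_4_space` and `6_frames` are described in this excerpt.
def scenario_timestamps(t: float, fps: float) -> dict[str, list[float]]:
    return {
        # 2 frames with 4 consecutive frames of spacing in between
        "2_frames_4_space": [t, t + 5 / fps],
        # 6 consecutive frames; with pyav's inaccurate seek, requesting
        # [t, t + 5 / fps] ends up decoding all 6 frames anyway
        "6_frames": [t + i / fps for i in range(6)],
    }
```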
@@ -85,8 +85,8 @@ However, due to how video decoding is implemented with `pyav`, we don't have access
 **Average Structural Similarity Index Measure (higher is better)**
 `avg_ssim` evaluates the perceived quality of images by comparing luminance, contrast, and structure. SSIM values range from -1 to 1, where 1 indicates perfect similarity.
-One aspect that can't be measured here with those metrics is the compatibility of the encoding accross platforms, in particular on web browser, for visualization purposes.
-h264, h265 and AV1 are all commonly used codecs and should not be pose an issue. However, the chroma subsampling (`pix_fmt`) format might affect compatibility:
+One aspect that can't be measured here with those metrics is the compatibility of the encoding across platforms, in particular on web browser, for visualization purposes.
+h264, h265 and AV1 are all commonly used codecs and should not pose an issue. However, the chroma subsampling (`pix_fmt`) format might affect compatibility:
 - `yuv420p` is more widely supported across various platforms, including web browsers.
 - `yuv444p` offers higher color fidelity but might not be supported as broadly.
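For reference, an `avg_ssim`-style metric can be computed with scikit-image along these lines. This is a minimal sketch assuming paired lists of original and decoded uint8 RGB frames; the helper name and the frame pairing are assumptions, only `structural_similarity` is the real skimage API:

```python
# Sketch of an average-SSIM computation with scikit-image,
# assuming frames are uint8 HxWxC arrays of identical shape.
import numpy as np
from skimage.metrics import structural_similarity

def avg_ssim(originals: list[np.ndarray], decoded: list[np.ndarray]) -> float:
    scores = [
        structural_similarity(orig, dec, channel_axis=-1)  # -1: color channel last
        for orig, dec in zip(originals, decoded)
    ]
    return float(np.mean(scores))
```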
@@ -116,7 +116,7 @@ Additional encoding parameters exist that are not included in this benchmark. In
 - `-preset` which allows for selecting encoding presets. This represents a collection of options that will provide a certain encoding speed to compression ratio. By leaving this parameter unspecified, it is considered to be `medium` for libx264 and libx265 and `8` for libsvtav1.
 - `-tune` which allows optimizing the encoding for certain aspects (e.g. film quality, fast decoding, etc.).
-See the documentation mentioned above for more detailled info on these settings and for a more comprehensive list of other parameters.
+See the documentation mentioned above for more detailed info on these settings and for a more comprehensive list of other parameters.
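As an illustration of these settings, here is a minimal sketch of an encoding call with explicit values. The input pattern, fps, and output path are placeholders; the flags themselves (`-g`, `-preset`, `-tune`, `-pix_fmt`) are standard ffmpeg options, but this exact command is not from the benchmark:

```python
# Minimal sketch: encode image frames to mp4 with explicit keyframe interval,
# preset, tune, and chroma subsampling. Paths and fps are placeholders.
import subprocess

subprocess.run(
    [
        "ffmpeg",
        "-framerate", "30",
        "-i", "frames/frame_%06d.png",  # placeholder input pattern
        "-c:v", "libx264",
        "-g", "2",               # keyframe every 2 frames (favors random access)
        "-preset", "medium",     # libx264's default speed/compression trade-off
        "-tune", "fastdecode",   # optimize for decoding speed
        "-pix_fmt", "yuv420p",   # widely supported chroma subsampling
        "output.mp4",            # placeholder output path
    ],
    check=True,
)
```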
 Similarly on the decoding side, other decoders exist but are not implemented in our current benchmark. To name a few:
 - `torchaudio`


@@ -159,11 +159,11 @@ DATASETS = {
         **ALOHA_STATIC_INFO,
     },
     "aloha_static_vinh_cup": {
-        "single_task": "Pick up the platic cup with the right arm, then pop its lid open with the left arm.",
+        "single_task": "Pick up the plastic cup with the right arm, then pop its lid open with the left arm.",
         **ALOHA_STATIC_INFO,
     },
     "aloha_static_vinh_cup_left": {
-        "single_task": "Pick up the platic cup with the left arm, then pop its lid open with the right arm.",
+        "single_task": "Pick up the plastic cup with the left arm, then pop its lid open with the right arm.",
         **ALOHA_STATIC_INFO,
     },
     "aloha_static_ziploc_slide": {"single_task": "Slide open the ziploc bag.", **ALOHA_STATIC_INFO},


@@ -177,7 +177,7 @@ def run_server(
             {"url": url_for("static", filename=video_path), "filename": video_path.parent.name}
             for video_path in video_paths
         ]
-        tasks = dataset.meta.episodes[0]["tasks"]
+        tasks = dataset.meta.episodes[episode_id]["tasks"]
     else:
         video_keys = [key for key, ft in dataset.features.items() if ft["dtype"] == "video"]
         videos_info = [
@@ -232,63 +232,48 @@ def get_episode_data(dataset: LeRobotDataset | IterableNamespace, episode_index)
     """Get a csv str containing timeseries data of an episode (e.g. state and action).
     This file will be loaded by Dygraph javascript to plot data in real time."""
     columns = []

-    has_state = "observation.state" in dataset.features
-    has_action = "action" in dataset.features
+    selected_columns = [col for col, ft in dataset.features.items() if ft["dtype"] == "float32"]
+    selected_columns.remove("timestamp")

     # init header of csv with state and action names
     header = ["timestamp"]
-    if has_state:
-        dim_state = (
-            dataset.meta.shapes["observation.state"][0]
-            if isinstance(dataset, LeRobotDataset)
-            else dataset.features["observation.state"].shape[0]
-        )
-        header += [f"state_{i}" for i in range(dim_state)]
-        column_names = dataset.features["observation.state"]["names"]
-        while not isinstance(column_names, list):
-            column_names = list(column_names.values())[0]
-        columns.append({"key": "state", "value": column_names})
-    if has_action:
-        dim_action = (
-            dataset.meta.shapes["action"][0]
-            if isinstance(dataset, LeRobotDataset)
-            else dataset.features.action.shape[0]
-        )
-        header += [f"action_{i}" for i in range(dim_action)]
-        column_names = dataset.features["action"]["names"]
-        while not isinstance(column_names, list):
-            column_names = list(column_names.values())[0]
-        columns.append({"key": "action", "value": column_names})
+
+    for column_name in selected_columns:
+        dim_state = (
+            dataset.meta.shapes[column_name][0]
+            if isinstance(dataset, LeRobotDataset)
+            else dataset.features[column_name].shape[0]
+        )
+        header += [f"{column_name}_{i}" for i in range(dim_state)]
+
+        if "names" in dataset.features[column_name] and dataset.features[column_name]["names"]:
+            column_names = dataset.features[column_name]["names"]
+            while not isinstance(column_names, list):
+                column_names = list(column_names.values())[0]
+        else:
+            column_names = [f"motor_{i}" for i in range(dim_state)]
+        columns.append({"key": column_name, "value": column_names})
+
+    selected_columns.insert(0, "timestamp")

     if isinstance(dataset, LeRobotDataset):
         from_idx = dataset.episode_data_index["from"][episode_index]
         to_idx = dataset.episode_data_index["to"][episode_index]
-        selected_columns = ["timestamp"]
-        if has_state:
-            selected_columns += ["observation.state"]
-        if has_action:
-            selected_columns += ["action"]
         data = (
             dataset.hf_dataset.select(range(from_idx, to_idx))
             .select_columns(selected_columns)
-            .with_format("numpy")
+            .with_format("pandas")
         )
-        rows = np.hstack(
-            (np.expand_dims(data["timestamp"], axis=1), *[data[col] for col in selected_columns[1:]])
-        ).tolist()
     else:
         repo_id = dataset.repo_id
-        selected_columns = ["timestamp"]
-        if "observation.state" in dataset.features:
-            selected_columns.append("observation.state")
-        if "action" in dataset.features:
-            selected_columns.append("action")

         url = f"https://huggingface.co/datasets/{repo_id}/resolve/main/" + dataset.data_path.format(
             episode_chunk=int(episode_index) // dataset.chunks_size, episode_index=episode_index
         )
         df = pd.read_parquet(url)
         data = df[selected_columns]  # Select specific columns
         rows = np.hstack(
             (
                 np.expand_dims(data["timestamp"], axis=1),
@@ -379,10 +364,6 @@ def visualize_dataset_html(
             template_folder=template_dir,
         )
     else:
-        image_keys = dataset.meta.image_keys if isinstance(dataset, LeRobotDataset) else []
-        if len(image_keys) > 0:
-            raise NotImplementedError(f"Image keys ({image_keys=}) are currently not supported.")
-
         # Create a symlink from the dataset video folder containing mp4 files to the output directory
         # so that the http server can get access to the mp4 files.
         if isinstance(dataset, LeRobotDataset):
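The net effect of the `get_episode_data` change above: instead of hardcoding `observation.state` and `action`, every `float32` feature except `timestamp` is exported, with a `motor_{i}` fallback when a feature carries no `names`. A condensed, standalone sketch of that selection logic (the `features` dict here is illustrative, not a real `LeRobotDataset`):

```python
# Condensed sketch of the new column-selection logic, run on
# illustrative feature metadata rather than a real dataset object.
features = {
    "timestamp": {"dtype": "float32", "shape": (1,)},
    "observation.state": {"dtype": "float32", "shape": (2,), "names": ["x", "y"]},
    "action": {"dtype": "float32", "shape": (2,)},  # no "names" -> motor_{i} fallback
    "observation.images.top": {"dtype": "video", "shape": (480, 640, 3)},  # skipped
}

selected_columns = [col for col, ft in features.items() if ft["dtype"] == "float32"]
selected_columns.remove("timestamp")

header, columns = ["timestamp"], []
for column_name in selected_columns:
    dim = features[column_name]["shape"][0]
    header += [f"{column_name}_{i}" for i in range(dim)]
    names = features[column_name].get("names") or [f"motor_{i}" for i in range(dim)]
    columns.append({"key": column_name, "value": names})

selected_columns.insert(0, "timestamp")
print(header)   # ['timestamp', 'observation.state_0', 'observation.state_1', 'action_0', 'action_1']
print(columns)  # per-feature name lists used for the Dygraph legend
```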


@@ -98,9 +98,34 @@
           </div>
           <!-- Videos -->
+          <div class="max-w-32 relative text-sm mb-4 select-none"
+               @click.outside="isVideosDropdownOpen = false">
+            <div
+              @click="isVideosDropdownOpen = !isVideosDropdownOpen"
+              class="p-2 border border-slate-500 rounded flex justify-between items-center cursor-pointer"
+            >
+              <span class="truncate">filter videos</span>
+              <div class="transition-transform" :class="{ 'rotate-180': isVideosDropdownOpen }">🔽</div>
+            </div>
+            <div x-show="isVideosDropdownOpen"
+                 class="absolute mt-1 border border-slate-500 rounded shadow-lg z-10">
+              <div>
+                <template x-for="option in videosKeys" :key="option">
+                  <div
+                    @click="videosKeysSelected = videosKeysSelected.includes(option) ? videosKeysSelected.filter(v => v !== option) : [...videosKeysSelected, option]"
+                    class="p-2 cursor-pointer bg-slate-900"
+                    :class="{ 'bg-slate-700': videosKeysSelected.includes(option) }"
+                    x-text="option"
+                  ></div>
+                </template>
+              </div>
+            </div>
+          </div>
           <div class="flex flex-wrap gap-x-2 gap-y-6">
             {% for video_info in videos_info %}
-            <div x-show="!videoCodecError" class="max-w-96 relative">
+            <div x-show="!videoCodecError && videosKeysSelected.includes('{{ video_info.filename }}')" class="max-w-96 relative">
               <p class="absolute inset-x-0 -top-4 text-sm text-gray-300 bg-gray-800 px-2 rounded-t-xl truncate">{{ video_info.filename }}</p>
               <video muted loop type="video/mp4" class="object-contain w-full h-full" @canplaythrough="videoCanPlay" @timeupdate="() => {
                 if (video.duration) {
@@ -250,6 +275,9 @@
             nVideos: {{ videos_info | length }},
             nVideoReadyToPlay: 0,
             videoCodecError: false,
+            isVideosDropdownOpen: false,
+            videosKeys: {{ videos_info | map(attribute='filename') | list | tojson }},
+            videosKeysSelected: [],
             columns: {{ columns | tojson }},
             rowLabels: {{ columns | tojson }}.reduce((colA, colB) => colA.value.length > colB.value.length ? colA : colB).value,
@@ -261,6 +289,7 @@
             if(!canPlayVideos){
               this.videoCodecError = true;
             }
+            this.videosKeysSelected = this.videosKeys.map(opt => opt)
             // process CSV data
             const csvDataStr = {{ episode_data_csv_str|tojson|safe }};