Use -g 2, Fix delta_timestamps, Redo benchmark

2024-05-01 20:00:58 +00:00 · 2024-05-01 20:00:58 +00:00 · 370fb5348e
parent a00102b643
commit 370fb5348e
7 changed files with 480 additions and 261 deletions
--- a/lerobot/common/datasets/_video_benchmark/README.md
+++ b/lerobot/common/datasets/_video_benchmark/README.md
@@ -19,10 +19,10 @@ How to decode videos?
 ## Metrics
 **Percentage of data compression (higher is better)**
-`pc_compression` is the ratio of the memory space on disk taken by the original images to encode, to the memory space taken by the encoded video. For instance, `pc_compression=400%` means that the video takes 4 times less memory space on disk compared to the original images.
+`compression_factor` is the ratio of the memory space on disk taken by the original images to encode, to the memory space taken by the encoded video. For instance, `compression_factor=4` means that the video takes 4 times less memory space on disk compared to the original images.
-**Percentage of loading time (lower is better)**
+**Percentage of loading time (higher is better)**
-`pc_load_time` is the ratio of the time it takes to load original images at given timestamps, to the time it takes to decode the exact same frames from the video. Lower is better. For instance, `pc_load_time=120%` means that decoding from video is a bit slower than loading the original images.
+`load_time_factor` is the ratio of the time it takes to load original images at given timestamps, to the time it takes to decode the exact same frames from the video. Higher is better. For instance, `load_time_factor=0.5` means that decoding from video is 2 times slower than loading the original images.
 **Average L2 error per pixel (lower is better)**
 `avg_per_pixel_l2_error` is the average L2 error between each decoded frame and its corresponding original image over all requested timestamps, and also divided by the number of pixels in the image to be comparable when switching to different image sizes.
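As a rough illustration of the two ratio metrics defined above, here is a minimal sketch. The function names are hypothetical; the argument names mirror the keys written to `info.json` by `run_video_benchmark.py` in this commit.

```python
def compression_factor(sum_original_frames_size_bytes, video_size_bytes):
    # Size of the original images divided by the size of the encoded video.
    # Higher is better: 4.0 means the video is 4 times smaller on disk.
    return sum_original_frames_size_bytes / video_size_bytes


def load_time_factor(avg_load_time_from_images, avg_load_time):
    # Time to load the original images divided by the time to decode the
    # same frames from the video. Higher is better: 0.5 means decoding
    # from video is 2 times slower than loading the images.
    return avg_load_time_from_images / avg_load_time


if __name__ == "__main__":
    print(compression_factor(4000, 1000))
    print(load_time_factor(1.0, 2.0))
```

Both metrics are plain ratios, so a value of 1 means the video neither saves space nor changes loading time relative to the original images.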
@@ -40,7 +40,12 @@ How to decode videos?
 We don't expect the same optimal settings for a dataset of images from a simulation, or from the real world in an apartment, in a factory, or outdoors. Hence, we run this benchmark on two datasets: `pusht` (simulation) and `umi` (real-world outdoor).
 **Requested timestamps**
-In this benchmark, we focus on the loading time of random access, so we are not interested about sequentially loading all frames of a video like in a movie. However, the number of consecutive timestamps requested and their spacing can greatly affect the `pc_load_time`. In fact, it is expected to get faster loading time by decoding a large number of consecutive frames from a video, than to load the same data from individual images. To reflect our robotics use case, we consider a setting where we load 2 consecutive frames with 4 frames of spacing.
+In this benchmark, we focus on the loading time of random access, so we are not interested in sequentially loading all frames of a video, as in a movie. However, the number of consecutive timestamps requested and their spacing can greatly affect the `load_time_factor`. In fact, decoding a large number of consecutive frames from a video is expected to be faster than loading the same data from individual images. To reflect our robotics use case, we consider a few settings:
 - `1_frame`: 1 frame,
 - `2_frames`: 2 consecutive frames (e.g. `[t, t + 1 / fps]`),
 - `2_frames_4_space`: 2 frames spaced 4 frames apart (e.g. `[t, t + 4 / fps]`),
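
The settings can be sketched as a small helper that maps a mode name to the list of requested timestamps. This is an illustration only: `make_timestamps` is a hypothetical name, the mode names are the ones used by the benchmark script and the result headings, and the timestamps count back from `ts` as the script does (rather than forward from `t` as in the examples above).

```python
def make_timestamps(ts, fps, timestamps_mode):
    # ts is the timestamp (in seconds) of the last requested frame.
    if timestamps_mode == "1_frame":
        return [ts]
    if timestamps_mode == "2_frames":
        return [ts - 1 / fps, ts]
    if timestamps_mode == "2_frames_4_space":
        return [ts - 4 / fps, ts]
    if timestamps_mode == "6_frames":
        # 6 consecutive frames, oldest first.
        return [ts - i / fps for i in range(6)][::-1]
    raise ValueError(timestamps_mode)


if __name__ == "__main__":
    for mode in ["1_frame", "2_frames", "2_frames_4_space", "6_frames"]:
        print(mode, make_timestamps(2.0, 10, mode))
```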
 **Data augmentations**
 We might revisit this benchmark and find better settings if we train our policies with various data augmentations to make them more robust (e.g. robust to color changes, compression, etc.).
@@ -48,10 +53,8 @@ We might revisit this benchmark and find better settings if we train our policie
 ## Results
 ### Loading 2 consecutive frames with 4 frames spacing (Diffusion Policy setting)
 **`decoder`**
-| repo_id | decoder | pc_load_time | avg_per_pixel_l2_error |
+| repo_id | decoder | load_time_factor | avg_per_pixel_l2_error |
 | --- | --- | --- | --- |
 | lerobot/pusht | <span style="color: #32CD32;">torchvision</span> | 0.166 | 0.0000119 |
 | lerobot/pusht | ffmpegio | 0.009 | 0.0001182 |
@@ -60,127 +63,274 @@ We might revisit this benchmark and find better settings if we train our policie
 | lerobot/umi_cup_in_the_wild | ffmpegio | 0.010 | 0.0000735 |
 | lerobot/umi_cup_in_the_wild | torchaudio | 0.154 | 0.0000340 |
 ### `1_frame`
 **`pix_fmt`**
-| repo_id | pix_fmt | pc_compression | pc_load_time | avg_per_pixel_l2_error |
+| repo_id | pix_fmt | compression_factor | load_time_factor | avg_per_pixel_l2_error |
 | --- | --- | --- | --- | --- |
-| lerobot/pusht | yuv420p | 3.602 | 0.202 | 0.0000661 |
+| lerobot/pusht | yuv420p | 3.788 | 0.224 | 0.0000760 |
-| lerobot/pusht | <span style="color: #32CD32;">yuv444p</span> | 3.213 | 0.153 | 0.0000110 |
+| lerobot/pusht | yuv444p | 3.646 | 0.185 | 0.0000443 |
-| lerobot/umi_cup_in_the_wild | yuv420p | 8.879 | 0.202 | 0.0000332 |
+| lerobot/umi_cup_in_the_wild | yuv420p | 14.391 | 0.388 | 0.0000469 |
-| lerobot/umi_cup_in_the_wild | <span style="color: #32CD32;">yuv444p</span> | 8.517 | 0.165 | 0.0000175 |
+| lerobot/umi_cup_in_the_wild | yuv444p | 14.932 | 0.329 | 0.0000397 |
 **`g`**
-| repo_id | g | pc_compression | pc_load_time | avg_per_pixel_l2_error |
+| repo_id | g | compression_factor | load_time_factor | avg_per_pixel_l2_error |
 | --- | --- | --- | --- | --- |
-| lerobot/pusht | 1 | 1.308 | 0.190 | 0.0000151 |
+| lerobot/pusht | 1 | 2.543 | 0.204 | 0.0000556 |
-| lerobot/pusht | 5 | 2.739 | 0.184 | 0.0000123 |
+| lerobot/pusht | 2 | 3.646 | 0.182 | 0.0000443 |
-| lerobot/pusht | 10 | 3.213 | 0.144 | 0.0000116 |
+| lerobot/pusht | 3 | 4.431 | 0.174 | 0.0000450 |
-| lerobot/pusht | 15 | 3.460 | 0.137 | 0.0000112 |
+| lerobot/pusht | 4 | 5.103 | 0.163 | 0.0000448 |
-| lerobot/pusht | 20 | 3.559 | 0.118 | 0.0000109 |
+| lerobot/pusht | 5 | 5.625 | 0.163 | 0.0000436 |
-| lerobot/pusht | 30 | 3.697 | 0.104 | 0.0000117 |
+| lerobot/pusht | 6 | 5.974 | 0.155 | 0.0000427 |
-| lerobot/pusht | 40 | 3.763 | 0.092 | 0.0000116 |
+| lerobot/pusht | 10 | 6.814 | 0.130 | 0.0000410 |
-| lerobot/pusht | 60 | 3.925 | 0.068 | 0.0000117 |
+| lerobot/pusht | 15 | 7.431 | 0.105 | 0.0000406 |
-| lerobot/pusht | 100 | 4.010 | 0.054 | 0.0000117 |
+| lerobot/pusht | 20 | 7.662 | 0.097 | 0.0000400 |
-| lerobot/pusht | <span style="color: #32CD32;">None</span> | 4.058 | 0.043 | 0.0000117 |
+| lerobot/pusht | 40 | 8.163 | 0.061 | 0.0000405 |
-| lerobot/umi_cup_in_the_wild | 1 | 4.790 | 0.236 | 0.0000221 |
+| lerobot/pusht | 100 | 8.761 | 0.039 | 0.0000422 |
-| lerobot/umi_cup_in_the_wild | 5 | 7.707 | 0.201 | 0.0000185 |
+| lerobot/pusht | None | 8.909 | 0.024 | 0.0000431 |
-| lerobot/umi_cup_in_the_wild | 10 | 8.517 | 0.172 | 0.0000177 |
+| lerobot/umi_cup_in_the_wild | 1 | 14.411 | 0.444 | 0.0000601 |
-| lerobot/umi_cup_in_the_wild | 15 | 8.830 | 0.152 | 0.0000170 |
+| lerobot/umi_cup_in_the_wild | 2 | 14.932 | 0.345 | 0.0000397 |
-| lerobot/umi_cup_in_the_wild | 20 | 8.961 | 0.133 | 0.0000167 |
+| lerobot/umi_cup_in_the_wild | 3 | 20.174 | 0.282 | 0.0000416 |
-| lerobot/umi_cup_in_the_wild | 30 | 8.850 | 0.113 | 0.0000167 |
+| lerobot/umi_cup_in_the_wild | 4 | 24.889 | 0.271 | 0.0000415 |
-| lerobot/umi_cup_in_the_wild | 40 | 8.996 | 0.109 | 0.0000174 |
+| lerobot/umi_cup_in_the_wild | 5 | 28.825 | 0.260 | 0.0000415 |
-| lerobot/umi_cup_in_the_wild | 60 | 9.113 | 0.081 | 0.0000163 |
+| lerobot/umi_cup_in_the_wild | 6 | 31.635 | 0.249 | 0.0000415 |
-| lerobot/umi_cup_in_the_wild | 100 | 9.278 | 0.051 | 0.0000173 |
+| lerobot/umi_cup_in_the_wild | 10 | 39.418 | 0.195 | 0.0000399 |
-| lerobot/umi_cup_in_the_wild | <span style="color: #32CD32;">None</span> | 9.396 | 0.030 | 0.0000165 |
+| lerobot/umi_cup_in_the_wild | 15 | 44.577 | 0.169 | 0.0000394 |
 | lerobot/umi_cup_in_the_wild | 20 | 47.907 | 0.140 | 0.0000390 |
 | lerobot/umi_cup_in_the_wild | 40 | 52.554 | 0.096 | 0.0000384 |
 | lerobot/umi_cup_in_the_wild | 100 | 58.241 | 0.046 | 0.0000390 |
 | lerobot/umi_cup_in_the_wild | None | 60.530 | 0.022 | 0.0000400 |
 **`crf`**
-| repo_id | crf | pc_compression | pc_load_time | avg_per_pixel_l2_error |
+| repo_id | crf | compression_factor | load_time_factor | avg_per_pixel_l2_error |
 | --- | --- | --- | --- | --- |
-| lerobot/pusht | 0 | 4.529 | 0.041 | 0.0000035 |
+| lerobot/pusht | 0 | 1.699 | 0.175 | 0.0000035 |
-| lerobot/pusht | 5 | 3.138 | 0.040 | 0.0000077 |
+| lerobot/pusht | 5 | 1.409 | 0.181 | 0.0000080 |
-| lerobot/pusht | <span style="color: #32CD32;">10</span> | 4.058 | 0.038 | 0.0000121 |
+| lerobot/pusht | 10 | 1.842 | 0.172 | 0.0000123 |
-| lerobot/pusht | <span style="color: #32CD32;">15</span> | 5.407 | 0.039 | 0.0000195 |
+| lerobot/pusht | 15 | 2.322 | 0.187 | 0.0000211 |
-| lerobot/pusht | <span style="color: #32CD32;">20</span> | 7.335 | 0.039 | 0.0000319 |
+| lerobot/pusht | 20 | 3.050 | 0.181 | 0.0000346 |
-| lerobot/pusht | <span style="color: #32CD32;">None</span> | 8.909 | 0.046 | 0.0000425 |
+| lerobot/pusht | None | 3.646 | 0.189 | 0.0000443 |
-| lerobot/pusht | 25 | 10.213 | 0.039 | 0.0000519 |
+| lerobot/pusht | 25 | 3.969 | 0.186 | 0.0000521 |
-| lerobot/pusht | 30 | 14.516 | 0.041 | 0.0000795 |
+| lerobot/pusht | 30 | 5.687 | 0.184 | 0.0000850 |
-| lerobot/pusht | 40 | 23.546 | 0.041 | 0.0001557 |
+| lerobot/pusht | 40 | 10.818 | 0.193 | 0.0001726 |
-| lerobot/pusht | 50 | 28.460 | 0.042 | 0.0002723 |
+| lerobot/pusht | 50 | 18.185 | 0.183 | 0.0002606 |
-| lerobot/umi_cup_in_the_wild | 0 | 2.318 | 0.012 | 0.0000056 |
+| lerobot/umi_cup_in_the_wild | 0 | 1.918 | 0.165 | 0.0000056 |
-| lerobot/umi_cup_in_the_wild | 5 | 4.899 | 0.019 | 0.0000132 |
+| lerobot/umi_cup_in_the_wild | 5 | 3.207 | 0.171 | 0.0000111 |
-| lerobot/umi_cup_in_the_wild | <span style="color: #32CD32;">10</span> | 9.396 | 0.026 | 0.0000183 |
+| lerobot/umi_cup_in_the_wild | 10 | 4.818 | 0.212 | 0.0000153 |
-| lerobot/umi_cup_in_the_wild | <span style="color: #32CD32;">15</span> | 19.161 | 0.034 | 0.0000241 |
+| lerobot/umi_cup_in_the_wild | 15 | 7.329 | 0.261 | 0.0000218 |
-| lerobot/umi_cup_in_the_wild | <span style="color: #32CD32;">20</span> | 39.311 | 0.039 | 0.0000329 |
+| lerobot/umi_cup_in_the_wild | 20 | 11.361 | 0.312 | 0.0000317 |
-| lerobot/umi_cup_in_the_wild | <span style="color: #32CD32;">None</span> | 60.530 | 0.043 | 0.0000401 |
+| lerobot/umi_cup_in_the_wild | None | 14.932 | 0.339 | 0.0000397 |
-| lerobot/umi_cup_in_the_wild | 25 | 81.048 | 0.046 | 0.0000454 |
+| lerobot/umi_cup_in_the_wild | 25 | 17.741 | 0.297 | 0.0000452 |
-| lerobot/umi_cup_in_the_wild | 30 | 165.189 | 0.051 | 0.0000609 |
+| lerobot/umi_cup_in_the_wild | 30 | 27.983 | 0.406 | 0.0000629 |
-| lerobot/umi_cup_in_the_wild | 40 | 544.478 | 0.056 | 0.0001095 |
+| lerobot/umi_cup_in_the_wild | 40 | 82.449 | 0.468 | 0.0001184 |
-| lerobot/umi_cup_in_the_wild | 50 | 1109.556 | 0.072 | 0.0001815 |
+| lerobot/umi_cup_in_the_wild | 50 | 186.145 | 0.515 | 0.0001879 |
-
+**best**
-### Loading 6 consecutive frames with no spacing (TDMPC setting)
+| repo_id | compression_factor | load_time_factor | avg_per_pixel_l2_error |
 **`decoder`**
 | repo_id | decoder | pc_load_time | avg_per_pixel_l2_error |
 | --- | --- | --- | --- |
-| lerobot/pusht | <span style="color: #32CD32;">torchvision</span> | 0.386 | 0.0000117 |
+| lerobot/pusht | 3.646 | 0.188 | 0.0000443 |
-| lerobot/pusht | ffmpegio | 0.008 | 0.0000117 |
+| lerobot/umi_cup_in_the_wild | 14.932 | 0.339 | 0.0000397 |
-| lerobot/pusht | torchaudio | 0.184 | 0.0000356 |
+
-| lerobot/umi_cup_in_the_wild | <span style="color: #32CD32;">torchvision</span> | 0.448 | 0.0000178 |
+### `2_frames`
 | lerobot/umi_cup_in_the_wild | ffmpegio | 0.009 | 0.0000178 |
 | lerobot/umi_cup_in_the_wild | torchaudio | 0.149 | 0.0000349 |
 **`pix_fmt`**
-| repo_id | pix_fmt | pc_compression | pc_load_time | avg_per_pixel_l2_error |
+| repo_id | pix_fmt | compression_factor | load_time_factor | avg_per_pixel_l2_error |
 | --- | --- | --- | --- | --- |
-| lerobot/pusht | yuv420p | 3.602 | 0.518 | 0.0000651 |
+| lerobot/pusht | yuv420p | 3.788 | 0.314 | 0.0000799 |
-| lerobot/pusht | <span style="color: #32CD32;">yuv444p</span> | 3.213 | 0.401 | 0.0000117 |
+| lerobot/pusht | yuv444p | 3.646 | 0.303 | 0.0000496 |
-| lerobot/umi_cup_in_the_wild | yuv420p | 8.879 | 0.578 | 0.0000334 |
+| lerobot/umi_cup_in_the_wild | yuv420p | 14.391 | 0.642 | 0.0000503 |
-| lerobot/umi_cup_in_the_wild | <span style="color: #32CD32;">yuv444p</span> | 8.517 | 0.479 | 0.0000178 |
+| lerobot/umi_cup_in_the_wild | yuv444p | 14.932 | 0.529 | 0.0000436 |
 **`g`**
-| repo_id | g | pc_compression | pc_load_time | avg_per_pixel_l2_error |
+| repo_id | g | compression_factor | load_time_factor | avg_per_pixel_l2_error |
 | --- | --- | --- | --- | --- |
-| lerobot/pusht | 1 | 1.308 | 0.528 | 0.0000152 |
+| lerobot/pusht | 1 | 2.543 | 0.308 | 0.0000599 |
-| lerobot/pusht | 5 | 2.739 | 0.483 | 0.0000124 |
+| lerobot/pusht | 2 | 3.646 | 0.279 | 0.0000496 |
-| lerobot/pusht | 10 | 3.213 | 0.396 | 0.0000117 |
+| lerobot/pusht | 3 | 4.431 | 0.259 | 0.0000498 |
-| lerobot/pusht | 15 | 3.460 | 0.379 | 0.0000118 |
+| lerobot/pusht | 4 | 5.103 | 0.243 | 0.0000501 |
-| lerobot/pusht | 20 | 3.559 | 0.319 | 0.0000114 |
+| lerobot/pusht | 5 | 5.625 | 0.235 | 0.0000492 |
-| lerobot/pusht | 30 | 3.697 | 0.278 | 0.0000116 |
+| lerobot/pusht | 6 | 5.974 | 0.230 | 0.0000481 |
-| lerobot/pusht | 40 | 3.763 | 0.243 | 0.0000115 |
+| lerobot/pusht | 10 | 6.814 | 0.194 | 0.0000468 |
-| lerobot/pusht | 60 | 3.925 | 0.186 | 0.0000118 |
+| lerobot/pusht | 15 | 7.431 | 0.152 | 0.0000460 |
-| lerobot/pusht | 100 | 4.010 | 0.156 | 0.0000119 |
+| lerobot/pusht | 20 | 7.662 | 0.151 | 0.0000455 |
-| lerobot/pusht | <span style="color: #32CD32;">None</span> | 4.058 | 0.105 | 0.0000121 |
+| lerobot/pusht | 40 | 8.163 | 0.095 | 0.0000454 |
-| lerobot/umi_cup_in_the_wild | 1 | 4.790 | 0.605 | 0.0000221 |
+| lerobot/pusht | 100 | 8.761 | 0.062 | 0.0000472 |
-| lerobot/umi_cup_in_the_wild | 5 | 7.707 | 0.533 | 0.0000183 |
+| lerobot/pusht | None | 8.909 | 0.037 | 0.0000479 |
-| lerobot/umi_cup_in_the_wild | 10 | 8.517 | 0.469 | 0.0000178 |
+| lerobot/umi_cup_in_the_wild | 1 | 14.411 | 0.638 | 0.0000625 |
-| lerobot/umi_cup_in_the_wild | 15 | 8.830 | 0.399 | 0.0000174 |
+| lerobot/umi_cup_in_the_wild | 2 | 14.932 | 0.537 | 0.0000436 |
-| lerobot/umi_cup_in_the_wild | 20 | 8.961 | 0.382 | 0.0000175 |
+| lerobot/umi_cup_in_the_wild | 3 | 20.174 | 0.493 | 0.0000437 |
-| lerobot/umi_cup_in_the_wild | 30 | 8.850 | 0.326 | 0.0000172 |
+| lerobot/umi_cup_in_the_wild | 4 | 24.889 | 0.458 | 0.0000446 |
-| lerobot/umi_cup_in_the_wild | 40 | 8.996 | 0.279 | 0.0000173 |
+| lerobot/umi_cup_in_the_wild | 5 | 28.825 | 0.438 | 0.0000445 |
-| lerobot/umi_cup_in_the_wild | 60 | 9.113 | 0.226 | 0.0000174 |
+| lerobot/umi_cup_in_the_wild | 6 | 31.635 | 0.424 | 0.0000444 |
-| lerobot/umi_cup_in_the_wild | 100 | 9.278 | 0.150 | 0.0000175 |
+| lerobot/umi_cup_in_the_wild | 10 | 39.418 | 0.345 | 0.0000435 |
-| lerobot/umi_cup_in_the_wild | <span style="color: #32CD32;">None</span> | 9.396 | 0.076 | 0.0000176 |
+| lerobot/umi_cup_in_the_wild | 15 | 44.577 | 0.313 | 0.0000417 |
 | lerobot/umi_cup_in_the_wild | 20 | 47.907 | 0.264 | 0.0000421 |
 | lerobot/umi_cup_in_the_wild | 40 | 52.554 | 0.185 | 0.0000414 |
 | lerobot/umi_cup_in_the_wild | 100 | 58.241 | 0.090 | 0.0000420 |
 | lerobot/umi_cup_in_the_wild | None | 60.530 | 0.042 | 0.0000424 |
 **`crf`**
-| repo_id | crf | pc_compression | pc_load_time | avg_per_pixel_l2_error |
+| repo_id | crf | compression_factor | load_time_factor | avg_per_pixel_l2_error |
 | --- | --- | --- | --- | --- |
-| lerobot/pusht | 0 | 4.529 | 0.108 | 0.0000035 |
+| lerobot/pusht | 0 | 1.699 | 0.302 | 0.0000097 |
-| lerobot/pusht | 5 | 3.138 | 0.099 | 0.0000077 |
+| lerobot/pusht | 5 | 1.409 | 0.287 | 0.0000142 |
-| lerobot/pusht | 10 | 4.058 | 0.091 | 0.0000121 |
+| lerobot/pusht | 10 | 1.842 | 0.283 | 0.0000184 |
-| lerobot/pusht | 15 | 5.407 | 0.095 | 0.0000195 |
+| lerobot/pusht | 15 | 2.322 | 0.305 | 0.0000268 |
-| lerobot/pusht | 20 | 7.335 | 0.100 | 0.0000318 |
+| lerobot/pusht | 20 | 3.050 | 0.285 | 0.0000402 |
-| lerobot/pusht | <span style="color: #32CD32;">None</span> | 8.909 | 0.102 | 0.0000422 |
+| lerobot/pusht | None | 3.646 | 0.285 | 0.0000496 |
-| lerobot/pusht | 25 | 10.213 | 0.102 | 0.0000517 |
+| lerobot/pusht | 25 | 3.969 | 0.293 | 0.0000572 |
-| lerobot/pusht | 30 | 14.516 | 0.104 | 0.0000795 |
+| lerobot/pusht | 30 | 5.687 | 0.293 | 0.0000893 |
-| lerobot/pusht | 40 | 23.546 | 0.106 | 0.0001555 |
+| lerobot/pusht | 40 | 10.818 | 0.319 | 0.0001762 |
-| lerobot/pusht | 50 | 28.460 | 0.110 | 0.0002723 |
+| lerobot/pusht | 50 | 18.185 | 0.304 | 0.0002626 |
-| lerobot/umi_cup_in_the_wild | 0 | 2.318 | 0.032 | 0.0000056 |
+| lerobot/umi_cup_in_the_wild | 0 | 1.918 | 0.235 | 0.0000112 |
-| lerobot/umi_cup_in_the_wild | 5 | 4.899 | 0.052 | 0.0000127 |
+| lerobot/umi_cup_in_the_wild | 5 | 3.207 | 0.261 | 0.0000166 |
-| lerobot/umi_cup_in_the_wild | <span style="color: #32CD32;">10</span> | 9.396 | 0.073 | 0.0000176 |
+| lerobot/umi_cup_in_the_wild | 10 | 4.818 | 0.333 | 0.0000207 |
-| lerobot/umi_cup_in_the_wild | <span style="color: #32CD32;">15</span> | 19.161 | 0.097 | 0.0000234 |
+| lerobot/umi_cup_in_the_wild | 15 | 7.329 | 0.406 | 0.0000267 |
-| lerobot/umi_cup_in_the_wild | <span style="color: #32CD32;">20</span> | 39.311 | 0.110 | 0.0000321 |
+| lerobot/umi_cup_in_the_wild | 20 | 11.361 | 0.489 | 0.0000361 |
-| lerobot/umi_cup_in_the_wild | <span style="color: #32CD32;">None</span> | 60.530 | 0.117 | 0.0000393 |
+| lerobot/umi_cup_in_the_wild | None | 14.932 | 0.537 | 0.0000436 |
-| lerobot/umi_cup_in_the_wild | 25 | 81.048 | 0.126 | 0.0000446 |
+| lerobot/umi_cup_in_the_wild | 25 | 17.741 | 0.578 | 0.0000487 |
-| lerobot/umi_cup_in_the_wild | 30 | 165.189 | 0.138 | 0.0000603 |
+| lerobot/umi_cup_in_the_wild | 30 | 27.983 | 0.453 | 0.0000655 |
-| lerobot/umi_cup_in_the_wild | 40 | 544.478 | 0.151 | 0.0001095 |
+| lerobot/umi_cup_in_the_wild | 40 | 82.449 | 0.767 | 0.0001192 |
-| lerobot/umi_cup_in_the_wild | 50 | 1109.556 | 0.167 | 0.0001817 |
+| lerobot/umi_cup_in_the_wild | 50 | 186.145 | 0.816 | 0.0001881 |
 **best**
 | repo_id | compression_factor | load_time_factor | avg_per_pixel_l2_error |
 | --- | --- | --- | --- |
 | lerobot/pusht | 3.646 | 0.283 | 0.0000496 |
 | lerobot/umi_cup_in_the_wild | 14.932 | 0.543 | 0.0000436 |
 ### `2_frames_4_space`
 **`pix_fmt`**
 | repo_id | pix_fmt | compression_factor | load_time_factor | avg_per_pixel_l2_error |
 | --- | --- | --- | --- | --- |
 | lerobot/pusht | yuv420p | 3.788 | 0.257 | 0.0000855 |
 | lerobot/pusht | yuv444p | 3.646 | 0.261 | 0.0000556 |
 | lerobot/umi_cup_in_the_wild | yuv420p | 14.391 | 0.493 | 0.0000476 |
 | lerobot/umi_cup_in_the_wild | yuv444p | 14.932 | 0.371 | 0.0000404 |
 **`g`**
 | repo_id | g | compression_factor | load_time_factor | avg_per_pixel_l2_error |
 | --- | --- | --- | --- | --- |
 | lerobot/pusht | 1 | 2.543 | 0.226 | 0.0000670 |
 | lerobot/pusht | 2 | 3.646 | 0.222 | 0.0000556 |
 | lerobot/pusht | 3 | 4.431 | 0.217 | 0.0000567 |
 | lerobot/pusht | 4 | 5.103 | 0.204 | 0.0000555 |
 | lerobot/pusht | 5 | 5.625 | 0.179 | 0.0000556 |
 | lerobot/pusht | 6 | 5.974 | 0.188 | 0.0000544 |
 | lerobot/pusht | 10 | 6.814 | 0.160 | 0.0000531 |
 | lerobot/pusht | 15 | 7.431 | 0.150 | 0.0000521 |
 | lerobot/pusht | 20 | 7.662 | 0.123 | 0.0000519 |
 | lerobot/pusht | 40 | 8.163 | 0.092 | 0.0000519 |
 | lerobot/pusht | 100 | 8.761 | 0.053 | 0.0000533 |
 | lerobot/pusht | None | 8.909 | 0.034 | 0.0000541 |
 | lerobot/umi_cup_in_the_wild | 1 | 14.411 | 0.409 | 0.0000607 |
 | lerobot/umi_cup_in_the_wild | 2 | 14.932 | 0.381 | 0.0000404 |
 | lerobot/umi_cup_in_the_wild | 3 | 20.174 | 0.355 | 0.0000418 |
 | lerobot/umi_cup_in_the_wild | 4 | 24.889 | 0.346 | 0.0000425 |
 | lerobot/umi_cup_in_the_wild | 5 | 28.825 | 0.354 | 0.0000419 |
 | lerobot/umi_cup_in_the_wild | 6 | 31.635 | 0.336 | 0.0000419 |
 | lerobot/umi_cup_in_the_wild | 10 | 39.418 | 0.314 | 0.0000402 |
 | lerobot/umi_cup_in_the_wild | 15 | 44.577 | 0.269 | 0.0000397 |
 | lerobot/umi_cup_in_the_wild | 20 | 47.907 | 0.246 | 0.0000395 |
 | lerobot/umi_cup_in_the_wild | 40 | 52.554 | 0.171 | 0.0000390 |
 | lerobot/umi_cup_in_the_wild | 100 | 58.241 | 0.091 | 0.0000399 |
 | lerobot/umi_cup_in_the_wild | None | 60.530 | 0.043 | 0.0000409 |
 **`crf`**
 | repo_id | crf | compression_factor | load_time_factor | avg_per_pixel_l2_error |
 | --- | --- | --- | --- | --- |
 | lerobot/pusht | 0 | 1.699 | 0.212 | 0.0000193 |
 | lerobot/pusht | 5 | 1.409 | 0.211 | 0.0000232 |
 | lerobot/pusht | 10 | 1.842 | 0.199 | 0.0000270 |
 | lerobot/pusht | 15 | 2.322 | 0.198 | 0.0000347 |
 | lerobot/pusht | 20 | 3.050 | 0.211 | 0.0000469 |
 | lerobot/pusht | None | 3.646 | 0.206 | 0.0000556 |
 | lerobot/pusht | 25 | 3.969 | 0.210 | 0.0000626 |
 | lerobot/pusht | 30 | 5.687 | 0.223 | 0.0000927 |
 | lerobot/pusht | 40 | 10.818 | 0.227 | 0.0001763 |
 | lerobot/pusht | 50 | 18.185 | 0.223 | 0.0002625 |
 | lerobot/umi_cup_in_the_wild | 0 | 1.918 | 0.147 | 0.0000071 |
 | lerobot/umi_cup_in_the_wild | 5 | 3.207 | 0.182 | 0.0000125 |
 | lerobot/umi_cup_in_the_wild | 10 | 4.818 | 0.222 | 0.0000166 |
 | lerobot/umi_cup_in_the_wild | 15 | 7.329 | 0.270 | 0.0000229 |
 | lerobot/umi_cup_in_the_wild | 20 | 11.361 | 0.325 | 0.0000326 |
 | lerobot/umi_cup_in_the_wild | None | 14.932 | 0.362 | 0.0000404 |
 | lerobot/umi_cup_in_the_wild | 25 | 17.741 | 0.390 | 0.0000459 |
 | lerobot/umi_cup_in_the_wild | 30 | 27.983 | 0.437 | 0.0000633 |
 | lerobot/umi_cup_in_the_wild | 40 | 82.449 | 0.499 | 0.0001186 |
 | lerobot/umi_cup_in_the_wild | 50 | 186.145 | 0.564 | 0.0001879 |
 **best**
 | repo_id | compression_factor | load_time_factor | avg_per_pixel_l2_error |
 | --- | --- | --- | --- |
 | lerobot/pusht | 3.646 | 0.224 | 0.0000556 |
 | lerobot/umi_cup_in_the_wild | 14.932 | 0.368 | 0.0000404 |
 ### `6_frames`
 **`pix_fmt`**
 | repo_id | pix_fmt | compression_factor | load_time_factor | avg_per_pixel_l2_error |
 | --- | --- | --- | --- | --- |
 | lerobot/pusht | yuv420p | 3.788 | 0.660 | 0.0000839 |
 | lerobot/pusht | yuv444p | 3.646 | 0.546 | 0.0000542 |
 | lerobot/umi_cup_in_the_wild | yuv420p | 14.391 | 1.225 | 0.0000497 |
 | lerobot/umi_cup_in_the_wild | yuv444p | 14.932 | 0.908 | 0.0000428 |
 **`g`**
 | repo_id | g | compression_factor | load_time_factor | avg_per_pixel_l2_error |
 | --- | --- | --- | --- | --- |
 | lerobot/pusht | 1 | 2.543 | 0.552 | 0.0000646 |
 | lerobot/pusht | 2 | 3.646 | 0.534 | 0.0000542 |
 | lerobot/pusht | 3 | 4.431 | 0.563 | 0.0000546 |
 | lerobot/pusht | 4 | 5.103 | 0.537 | 0.0000545 |
 | lerobot/pusht | 5 | 5.625 | 0.477 | 0.0000532 |
 | lerobot/pusht | 6 | 5.974 | 0.515 | 0.0000530 |
 | lerobot/pusht | 10 | 6.814 | 0.410 | 0.0000512 |
 | lerobot/pusht | 15 | 7.431 | 0.405 | 0.0000503 |
 | lerobot/pusht | 20 | 7.662 | 0.345 | 0.0000500 |
 | lerobot/pusht | 40 | 8.163 | 0.247 | 0.0000496 |
 | lerobot/pusht | 100 | 8.761 | 0.147 | 0.0000510 |
 | lerobot/pusht | None | 8.909 | 0.100 | 0.0000519 |
 | lerobot/umi_cup_in_the_wild | 1 | 14.411 | 0.997 | 0.0000620 |
 | lerobot/umi_cup_in_the_wild | 2 | 14.932 | 0.911 | 0.0000428 |
 | lerobot/umi_cup_in_the_wild | 3 | 20.174 | 0.869 | 0.0000433 |
 | lerobot/umi_cup_in_the_wild | 4 | 24.889 | 0.874 | 0.0000438 |
 | lerobot/umi_cup_in_the_wild | 5 | 28.825 | 0.864 | 0.0000439 |
 | lerobot/umi_cup_in_the_wild | 6 | 31.635 | 0.834 | 0.0000440 |
 | lerobot/umi_cup_in_the_wild | 10 | 39.418 | 0.781 | 0.0000421 |
 | lerobot/umi_cup_in_the_wild | 15 | 44.577 | 0.679 | 0.0000411 |
 | lerobot/umi_cup_in_the_wild | 20 | 47.907 | 0.652 | 0.0000410 |
 | lerobot/umi_cup_in_the_wild | 40 | 52.554 | 0.465 | 0.0000404 |
 | lerobot/umi_cup_in_the_wild | 100 | 58.241 | 0.245 | 0.0000413 |
 | lerobot/umi_cup_in_the_wild | None | 60.530 | 0.116 | 0.0000417 |
 **`crf`**
 | repo_id | crf | compression_factor | load_time_factor | avg_per_pixel_l2_error |
 | --- | --- | --- | --- | --- |
 | lerobot/pusht | 0 | 1.699 | 0.534 | 0.0000163 |
 | lerobot/pusht | 5 | 1.409 | 0.524 | 0.0000205 |
 | lerobot/pusht | 10 | 1.842 | 0.510 | 0.0000245 |
 | lerobot/pusht | 15 | 2.322 | 0.512 | 0.0000324 |
 | lerobot/pusht | 20 | 3.050 | 0.508 | 0.0000452 |
 | lerobot/pusht | None | 3.646 | 0.518 | 0.0000542 |
 | lerobot/pusht | 25 | 3.969 | 0.534 | 0.0000616 |
 | lerobot/pusht | 30 | 5.687 | 0.530 | 0.0000927 |
 | lerobot/pusht | 40 | 10.818 | 0.552 | 0.0001777 |
 | lerobot/pusht | 50 | 18.185 | 0.564 | 0.0002644 |
 | lerobot/umi_cup_in_the_wild | 0 | 1.918 | 0.401 | 0.0000101 |
 | lerobot/umi_cup_in_the_wild | 5 | 3.207 | 0.499 | 0.0000156 |
 | lerobot/umi_cup_in_the_wild | 10 | 4.818 | 0.599 | 0.0000197 |
 | lerobot/umi_cup_in_the_wild | 15 | 7.329 | 0.704 | 0.0000258 |
 | lerobot/umi_cup_in_the_wild | 20 | 11.361 | 0.834 | 0.0000352 |
 | lerobot/umi_cup_in_the_wild | None | 14.932 | 0.925 | 0.0000428 |
 | lerobot/umi_cup_in_the_wild | 25 | 17.741 | 0.978 | 0.0000480 |
 | lerobot/umi_cup_in_the_wild | 30 | 27.983 | 1.088 | 0.0000648 |
 | lerobot/umi_cup_in_the_wild | 40 | 82.449 | 1.324 | 0.0001190 |
 | lerobot/umi_cup_in_the_wild | 50 | 186.145 | 1.436 | 0.0001880 |
 **best**
 | repo_id | compression_factor | load_time_factor | avg_per_pixel_l2_error |
 | --- | --- | --- | --- |
 | lerobot/pusht | 3.646 | 0.546 | 0.0000542 |
 | lerobot/umi_cup_in_the_wild | 14.932 | 0.934 | 0.0000428 |
--- a/lerobot/common/datasets/_video_benchmark/run_video_benchmark.py
+++ b/lerobot/common/datasets/_video_benchmark/run_video_benchmark.py
@@ -31,7 +31,12 @@ def get_directory_size(directory):
    return total_size
-def run_video_benchmark(output_dir, cfg, seed=1337, timestamps_mode="diffusion"):
+def run_video_benchmark(
    output_dir,
    cfg,
    timestamps_mode,
    seed=1337,
 ):
    output_dir = Path(output_dir)
    if output_dir.exists():
        shutil.rmtree(output_dir)
@@ -73,19 +78,20 @@ def run_video_benchmark(output_dir, cfg, seed=1337, timestamps_mode="diffusion")
    crf = cfg.get("crf")
    pix_fmt = cfg["pix_fmt"]
-    ffmpeg_cmd = ""
+    cmd = f"ffmpeg -r {fps} "
-    ffmpeg_cmd += f"ffmpeg -r {fps} -f image2 "
+    cmd += "-f image2 "
-    ffmpeg_cmd += f"-i {str(imgs_dir / 'frame_%06d.png')} "
+    cmd += "-loglevel error "
-    ffmpeg_cmd += "-vcodec libx264 "
+    cmd += f"-i {str(imgs_dir / 'frame_%06d.png')} "
    cmd += "-vcodec libx264 "
    if g is not None:
-        ffmpeg_cmd += f"-g {g} "  # ensures at least 1 keyframe every 10 frames
+        cmd += f"-g {g} "  # ensures at least 1 keyframe every g frames
-    # ffmpeg_cmd += "-keyint_min 10 " set a minimum of 10 frames between 2 key frames
+    # cmd += "-keyint_min 10 " set a minimum of 10 frames between 2 key frames
-    # ffmpeg_cmd += "-sc_threshold 0 " disable scene change detection to lower the number of key frames
+    # cmd += "-sc_threshold 0 " disable scene change detection to lower the number of key frames
    if crf is not None:
-        ffmpeg_cmd += f"-crf {crf} "
+        cmd += f"-crf {crf} "
-    ffmpeg_cmd += f"-pix_fmt {pix_fmt} "
+    cmd += f"-pix_fmt {pix_fmt} "
-    ffmpeg_cmd += f"{str(video_path)}"
+    cmd += f"{str(video_path)}"
-    subprocess.run(ffmpeg_cmd.split(" "), check=True)
+    subprocess.run(cmd.split(" "), check=True)
    video_size_bytes = video_path.stat().st_size
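
The hunk above assembles the ffmpeg command as one string and splits it on spaces, which would break if `imgs_dir` or `video_path` contained a space. A minimal alternative sketch builds the argument list directly. `build_ffmpeg_cmd` is a hypothetical helper, not part of this commit; it only uses the flags already present in the script.

```python
from pathlib import Path


def build_ffmpeg_cmd(imgs_dir, video_path, fps, pix_fmt, g=None, crf=None):
    # Each argument is its own list element, so paths containing spaces
    # stay intact when the list is passed to subprocess.run.
    cmd = [
        "ffmpeg",
        "-r", str(fps),
        "-f", "image2",
        "-loglevel", "error",
        "-i", str(Path(imgs_dir) / "frame_%06d.png"),
        "-vcodec", "libx264",
    ]
    if g is not None:
        cmd += ["-g", str(g)]  # keyframe interval (group of pictures size)
    if crf is not None:
        cmd += ["-crf", str(crf)]  # constant rate factor (quality setting)
    cmd += ["-pix_fmt", pix_fmt, str(video_path)]
    return cmd


if __name__ == "__main__":
    print(build_ffmpeg_cmd("tmp/imgs", "tmp/video.mp4", fps=10, pix_fmt="yuv444p", g=2))
```

The resulting list could then be passed to `subprocess.run(cmd, check=True)` in place of `cmd.split(" ")`.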
@@ -127,18 +133,23 @@ def run_video_benchmark(output_dir, cfg, seed=1337, timestamps_mode="diffusion")
        # test loading 2 frames that are 4 frames apart, which might be a common setting
        ts = random.randint(fps, ep_num_images - fps) / fps
-        if timestamps_mode == "diffusion":
+        if timestamps_mode == "1_frame":
-            prev_ts = round(ts - 4 / fps, 4)
+            timestamps = [ts]
-            timestamps = [prev_ts, ts]
+        elif timestamps_mode == "2_frames":
-        elif timestamps_mode == "tdmpc":
+            timestamps = [ts - 1 / fps, ts]
-            timestamps = [round(ts - i / fps, 4) for i in range(6)][::-1]
+        elif timestamps_mode == "2_frames_4_space":
            timestamps = [ts - 4 / fps, ts]
        elif timestamps_mode == "6_frames":
            timestamps = [ts - i / fps for i in range(6)][::-1]
        else:
            raise ValueError(timestamps_mode)
        num_frames = len(timestamps)
        start_time_s = time.monotonic()
-        frames = decode_frames_fn(video_path, timestamps=timestamps, device=device, **decoder_kwgs)
+        frames = decode_frames_fn(
            video_path, timestamps=timestamps, tolerance_s=1e-4, device=device, **decoder_kwgs
        )
        avg_load_time = (time.monotonic() - start_time_s) / num_frames
        list_avg_load_time.append(avg_load_time)
@@ -177,25 +188,17 @@ def run_video_benchmark(output_dir, cfg, seed=1337, timestamps_mode="diffusion")
        "video_size_bytes": video_size_bytes,
        "avg_load_time_from_images": avg_load_time_from_images,
        "avg_load_time": avg_load_time,
-        "pc_compression": sum_original_frames_size_bytes / video_size_bytes,
+        "compression_factor": sum_original_frames_size_bytes / video_size_bytes,
-        "pc_load_time": avg_load_time_from_images / avg_load_time,
+        "load_time_factor": avg_load_time_from_images / avg_load_time,
        "avg_per_pixel_l2_error": avg_per_pixel_l2_error,
    }
    for key in info:
        print(key, info[key])
    with open(output_dir / "info.json", "w") as f:
        json.dump(info, f)
    return info
 def main():
    dry_run = True
    bench_dir = Path("tmp/2024_04_29_1049_6_timestamps")
 def display_markdown_table(headers, rows):
    for i, row in enumerate(rows):
        new_row = []
@@ -220,48 +223,59 @@ def main():
    print(markdown_table)
    print()
 def load_info(out_dir):
    with open(out_dir / "info.json") as f:
        info = json.load(f)
    return info
 def main():
    dry_run = False
    repo_ids = ["lerobot/pusht", "lerobot/umi_cup_in_the_wild"]
    timestamps_modes = [
        "1_frame",
        "2_frames",
        "2_frames_4_space",
        "6_frames",
    ]
    for timestamps_mode in timestamps_modes:
        bench_dir = Path(f"tmp/2024_05_01_{timestamps_mode}")
-    # torchvision vs ffmpegio vs torchaudio
-    headers = ["repo_id", "decoder", "pc_load_time", "avg_per_pixel_l2_error"]
-    rows = []
-    for repo_id in repo_ids:
-        for decoder in ["torchvision", "ffmpegio", "torchaudio"]:
-            cfg = {
-                "repo_id": repo_id,
-                # video encoding
-                "g": 10,
-                "crf": 10,
-                "pix_fmt": "yuv444p",
-                # video decoding
-                "device": "cpu",
-                "decoder": decoder,
-                "decoder_kwgs": {},
-            }
-            if not dry_run:
-                run_video_benchmark(bench_dir / repo_id / decoder, cfg=cfg)
-            info = load_info(bench_dir / repo_id / decoder)
-            rows.append([repo_id, decoder, info["pc_load_time"], info["avg_per_pixel_l2_error"]])
-    display_markdown_table(headers, rows)
+        print(f"### `{timestamps_mode}`")
+        print()
+        # print("**`decoder`**")
+        # headers = ["repo_id", "decoder", "load_time_factor", "avg_per_pixel_l2_error"]
+        # rows = []
+        # for repo_id in repo_ids:
+        #     for decoder in ["torchvision", "ffmpegio", "torchaudio"]:
+        #         cfg = {
+        #             "repo_id": repo_id,
+        #             # video encoding
+        #             "pix_fmt": "yuv444p",
+        #             # video decoding
+        #             "device": "cpu",
+        #             "decoder": decoder,
+        #             "decoder_kwgs": {},
+        #         }
+        #         if not dry_run:
+        #             run_video_benchmark(bench_dir / repo_id / decoder, cfg, timestamps_mode)
+        #         info = load_info(bench_dir / repo_id / decoder)
+        #         rows.append([repo_id, decoder, info["load_time_factor"], info["avg_per_pixel_l2_error"]])
+        # display_markdown_table(headers, rows)
-    # yuv444p vs yuv420p
-
-    headers = ["repo_id", "pix_fmt", "pc_compression", "pc_load_time", "avg_per_pixel_l2_error"]
+        print("**`pix_fmt`**")
+        headers = ["repo_id", "pix_fmt", "compression_factor", "load_time_factor", "avg_per_pixel_l2_error"]
        rows = []
        for repo_id in repo_ids:
            for pix_fmt in ["yuv420p", "yuv444p"]:
                cfg = {
                    "repo_id": repo_id,
                    # video encoding
-                "g": 10,
+                    "g": 2,
-                "crf": 10,
+                    "crf": None,
                    "pix_fmt": pix_fmt,
                    # video decoding
                    "device": "cpu",
@ -269,30 +283,28 @@ def main():
                    "decoder_kwgs": {},
                }
                if not dry_run:
-                run_video_benchmark(bench_dir / repo_id / f"torchvision_{pix_fmt}", cfg=cfg)
+                    run_video_benchmark(bench_dir / repo_id / f"torchvision_{pix_fmt}", cfg, timestamps_mode)
                info = load_info(bench_dir / repo_id / f"torchvision_{pix_fmt}")
                rows.append(
                    [
                        repo_id,
                        pix_fmt,
-                    info["pc_compression"],
+                        info["compression_factor"],
-                    info["pc_load_time"],
+                        info["load_time_factor"],
                        info["avg_per_pixel_l2_error"],
                    ]
                )
        display_markdown_table(headers, rows)
-    # g
-
-    headers = ["repo_id", "g", "pc_compression", "pc_load_time", "avg_per_pixel_l2_error"]
+        print("**`g`**")
+        headers = ["repo_id", "g", "compression_factor", "load_time_factor", "avg_per_pixel_l2_error"]
        rows = []
        for repo_id in repo_ids:
-        for g in [1, 5, 10, 15, 20, 30, 40, 60, 100, None]:
+            for g in [1, 2, 3, 4, 5, 6, 10, 15, 20, 40, 100, None]:
                cfg = {
                    "repo_id": repo_id,
                    # video encoding
                    "g": g,
                "crf": 10,
                    "pix_fmt": "yuv444p",
                    # video decoding
                    "device": "cpu",
@ -300,23 +312,28 @@ def main():
                    "decoder_kwgs": {},
                }
                if not dry_run:
-                run_video_benchmark(bench_dir / repo_id / f"torchvision_g_{g}", cfg=cfg)
+                    run_video_benchmark(bench_dir / repo_id / f"torchvision_g_{g}", cfg, timestamps_mode)
                info = load_info(bench_dir / repo_id / f"torchvision_g_{g}")
                rows.append(
-                [repo_id, g, info["pc_compression"], info["pc_load_time"], info["avg_per_pixel_l2_error"]]
+                    [
+                        repo_id,
+                        g,
+                        info["compression_factor"],
+                        info["load_time_factor"],
+                        info["avg_per_pixel_l2_error"],
+                    ]
                )
        display_markdown_table(headers, rows)
-    # crf
-
-    headers = ["repo_id", "crf", "pc_compression", "pc_load_time", "avg_per_pixel_l2_error"]
+        print("**`crf`**")
+        headers = ["repo_id", "crf", "compression_factor", "load_time_factor", "avg_per_pixel_l2_error"]
        rows = []
        for repo_id in repo_ids:
            for crf in [0, 5, 10, 15, 20, None, 25, 30, 40, 50]:
                cfg = {
                    "repo_id": repo_id,
                    # video encoding
-                "g": None,
+                    "g": 2,
                    "crf": crf,
                    "pix_fmt": "yuv444p",
                    # video decoding
@ -325,10 +342,44 @@ def main():
                    "decoder_kwgs": {},
                }
                if not dry_run:
-                run_video_benchmark(bench_dir / repo_id / f"torchvision_crf_{crf}", cfg=cfg)
+                    run_video_benchmark(bench_dir / repo_id / f"torchvision_crf_{crf}", cfg, timestamps_mode)
                info = load_info(bench_dir / repo_id / f"torchvision_crf_{crf}")
                rows.append(
-                [repo_id, crf, info["pc_compression"], info["pc_load_time"], info["avg_per_pixel_l2_error"]]
+                    [
+                        repo_id,
+                        crf,
+                        info["compression_factor"],
+                        info["load_time_factor"],
+                        info["avg_per_pixel_l2_error"],
+                    ]
                )
        display_markdown_table(headers, rows)
+        print("**best**")
+        headers = ["repo_id", "compression_factor", "load_time_factor", "avg_per_pixel_l2_error"]
+        rows = []
+        for repo_id in repo_ids:
+            cfg = {
+                "repo_id": repo_id,
+                # video encoding
+                "g": 2,
+                "crf": None,
+                "pix_fmt": "yuv444p",
+                # video decoding
+                "device": "cpu",
+                "decoder": "torchvision",
+                "decoder_kwgs": {},
+            }
+            if not dry_run:
+                run_video_benchmark(bench_dir / repo_id / "torchvision_best", cfg, timestamps_mode)
+            info = load_info(bench_dir / repo_id / "torchvision_best")
+            rows.append(
+                [
+                    repo_id,
+                    info["compression_factor"],
+                    info["load_time_factor"],
+                    info["avg_per_pixel_l2_error"],
+                ]
+            )
+        display_markdown_table(headers, rows)
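
The two renamed metrics reported in these tables can be read as plain ratios, following the definitions in the benchmark README. A minimal sketch, assuming sizes in bytes and times in seconds; the function names mirror the metric names but are hypothetical helpers, not the benchmark's actual implementation:

```python
def compression_factor(images_size_bytes: int, video_size_bytes: int) -> float:
    """Disk size of the original images divided by disk size of the encoded video.

    Higher is better: 4.0 means the video is 4 times smaller than the images.
    """
    if video_size_bytes <= 0:
        raise ValueError("video_size_bytes must be positive")
    return images_size_bytes / video_size_bytes


def load_time_factor(images_load_time_s: float, video_decode_time_s: float) -> float:
    """Time to load the original images divided by time to decode the same frames.

    Higher is better: 0.5 means decoding from video is 2 times slower.
    """
    if video_decode_time_s <= 0:
        raise ValueError("video_decode_time_s must be positive")
    return images_load_time_s / video_decode_time_s
```

With these definitions, `compression_factor(400, 100)` gives `4.0` and `load_time_factor(1.0, 2.0)` gives `0.5`, matching the examples in the README.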
--- a/lerobot/common/datasets/push_dataset_to_hub/compute_stats.py
+++ b/lerobot/common/datasets/push_dataset_to_hub/compute_stats.py
@ -70,7 +70,7 @@ def compute_stats(dataset: LeRobotDataset | datasets.Dataset, batch_size=32, max
        generator.manual_seed(seed)
        dataloader = torch.utils.data.DataLoader(
            dataset,
-            num_workers=4,
+            num_workers=16,
            batch_size=batch_size,
            shuffle=True,
            drop_last=False,
--- a/lerobot/common/datasets/utils.py
+++ b/lerobot/common/datasets/utils.py
@ -216,9 +216,14 @@ def load_previous_and_future_frames(
        # load frames modality
        item[key] = hf_dataset.select_columns(key)[data_ids][key]
        if isinstance(item[key][0], dict) and "path" in item[key][0]:
            # video mode where frames are expressed as dicts of path and timestamp
            item[key] = item[key]
        else:
            item[key] = torch.stack(item[key])
        item[f"{key}_is_pad"] = is_pad
        item[f"{key}_timestamp"] = query_ts
    return item
--- a/lerobot/common/datasets/video_utils.py
+++ b/lerobot/common/datasets/video_utils.py
@ -1,5 +1,6 @@
 import logging
 import subprocess
+import warnings
 from dataclasses import dataclass, field
 from pathlib import Path
 from typing import Any, ClassVar
@ -26,7 +27,7 @@ def load_from_videos(
            # load multiple frames at once (expected when delta_timestamps is not None)
            timestamps = [frame["timestamp"] for frame in item[key]]
            paths = [frame["path"] for frame in item[key]]
-            if len(set(paths)) == 1:
+            if len(set(paths)) > 1:
                raise NotImplementedError("All video paths are expected to be the same for now.")
            video_path = data_dir / paths[0]
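
The corrected condition rejects a batch of frames only when they point to more than one video file; the previous `== 1` test raised in exactly the supported case. A minimal sketch of that check in isolation, where `single_video_path` is a hypothetical helper name and each frame is assumed to be a dict with `path` and `timestamp` keys, as in the hunk above:

```python
from pathlib import Path


def single_video_path(frames: list[dict], data_dir: Path) -> Path:
    """Return the one video path shared by all frames, or raise if there are several."""
    paths = [frame["path"] for frame in frames]
    if len(set(paths)) > 1:
        raise NotImplementedError("All video paths are expected to be the same for now.")
    return Path(data_dir) / paths[0]


frames = [
    {"path": "videos/episode_0.mp4", "timestamp": 0.0},
    {"path": "videos/episode_0.mp4", "timestamp": 0.1},
]
video_path = single_video_path(frames, Path("data"))
```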
@ -61,9 +62,11 @@ def decode_video_frames_torchvision(
    video_path = str(video_path)
    # set backend
+    keyframes_only = False
    if device == "cpu":
        # explicitly use pyav
        torchvision.set_video_backend("pyav")
+        keyframes_only = True  # pyav doesn't support accurate seek
    elif device == "cuda":
        # TODO(rcadene, aliberts): implement video decoding with GPU
        # torchvision.set_video_backend("cuda")
@ -86,7 +89,7 @@ def decode_video_frames_torchvision(
    # access closest key frame of the first requested frame
    # Note: closest key frame timestamp is usually smaller than `first_ts` (e.g. key frame can be the first frame of the video)
    # for details on what `seek` is doing see: https://pyav.basswood-io.com/docs/stable/api/container.html?highlight=inputcontainer#av.container.InputContainer.seek
-    reader.seek(first_ts)
+    reader.seek(first_ts, keyframes_only=keyframes_only)
    # load all frames until last requested frame
    loaded_frames = []
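
Why a small keyframe interval helps this seek-then-decode pattern can be sketched without any video library. The helper below is hypothetical and rests on two simplifying assumptions: a keyframe sits at every frame index that is a multiple of `g`, and the seek lands on the closest keyframe at or before the first requested frame. Real encoders treat `-g` as a maximum interval and may place keyframes elsewhere.

```python
def frames_to_decode(first_idx: int, last_idx: int, g: int) -> list[int]:
    """Frame indices that must be decoded to return frames first_idx..last_idx.

    Decoding starts at the closest keyframe at or before `first_idx` and
    continues frame by frame until `last_idx` (inclusive).
    """
    if g < 1:
        raise ValueError("g must be >= 1")
    if last_idx < first_idx:
        raise ValueError("last_idx must be >= first_idx")
    keyframe_idx = (first_idx // g) * g
    return list(range(keyframe_idx, last_idx + 1))


# Requesting only frame 7:
with_g_2 = frames_to_decode(7, 7, g=2)    # starts at keyframe 6
with_g_10 = frames_to_decode(7, 7, g=10)  # starts at keyframe 0
```

Under these assumptions, `g=2` decodes at most one extra frame before the first requested one, while `g=10` can decode up to nine.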
@ -130,7 +133,7 @@ def decode_video_frames_torchvision(
 def encode_video_frames(imgs_dir: Path, video_path: Path, fps: int):
-    # For more info this setting, see: `lerobot/common/datasets/_video_benchmark/README.md`
+    """More info on ffmpeg arguments tuning on `lerobot/common/datasets/_video_benchmark/README.md`"""
    video_path = Path(video_path)
    video_path.parent.mkdir(parents=True, exist_ok=True)
@ -140,6 +143,7 @@ def encode_video_frames(imgs_dir: Path, video_path: Path, fps: int):
        "-loglevel error "
        f"-i {str(imgs_dir / 'frame_%06d.png')} "
        "-vcodec libx264 "
        "-g 2 "
        "-pix_fmt yuv444p "
        f"{str(video_path)}"
    )
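
For reference, the same flags can be assembled as an argument list instead of a single string, which avoids problems with paths that contain spaces. This is a sketch only: `build_ffmpeg_cmd` is a hypothetical helper, and the `-framerate` input option is an assumption, since the part of the command that consumes `fps` is not visible in this hunk.

```python
from pathlib import Path


def build_ffmpeg_cmd(imgs_dir: Path, video_path: Path, fps: int, g: int = 2) -> list[str]:
    """Build an ffmpeg argument list for encoding numbered PNG frames with libx264."""
    return [
        "ffmpeg",
        "-loglevel", "error",
        "-framerate", str(fps),  # assumption: not shown in the diff
        "-i", str(Path(imgs_dir) / "frame_%06d.png"),
        "-vcodec", "libx264",
        "-g", str(g),            # keyframe interval, 2 in this commit
        "-pix_fmt", "yuv444p",
        str(video_path),
    ]


cmd = build_ffmpeg_cmd(Path("imgs"), Path("out") / "episode_0.mp4", fps=30)
```

The resulting list can be passed directly to `subprocess.run(cmd, check=True)`.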
@ -168,5 +172,11 @@ class VideoFrame:
        return self.pa_type
-# to make it available in HuggingFace `datasets`
+with warnings.catch_warnings():
+    warnings.filterwarnings(
+        "ignore",
+        "'register_feature' is experimental and might be subject to breaking changes in the future.",
+        category=UserWarning,
+    )
+    # to make VideoFrame available in HuggingFace `datasets`
    register_feature(VideoFrame, "VideoFrame")
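
The `catch_warnings` block silences one specific warning only for the duration of the call and restores the previous filters afterwards. A minimal standalone sketch of the pattern, where `noisy_register` is a hypothetical stand-in for a call that emits the experimental-API warning:

```python
import warnings

MESSAGE = "'register_feature' is experimental and might be subject to breaking changes in the future."


def noisy_register() -> str:
    """Hypothetical stand-in for a call that emits the experimental-API warning."""
    warnings.warn(MESSAGE, category=UserWarning)
    return "registered"


def register_quietly() -> str:
    """Call noisy_register() with only the matching UserWarning suppressed."""
    with warnings.catch_warnings():
        # The message argument is a regex matched against the start of the warning text.
        warnings.filterwarnings("ignore", MESSAGE, category=UserWarning)
        return noisy_register()
```

Because the filter is scoped to the `with` block, the call being silenced has to sit inside it, which is why `register_feature(...)` is indented in the hunk above.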
--- a/lerobot/common/logger.py
+++ b/lerobot/common/logger.py
@ -1,3 +1,6 @@
 # TODO(rcadene, alexander-soare): clean this file
 """Borrowed from https://github.com/fyhMer/fowm/blob/main/src/logger.py"""
 import logging
 import os
 from pathlib import Path
--- a/lerobot/scripts/train.py
+++ b/lerobot/scripts/train.py
@ -350,7 +350,7 @@ def train(cfg: dict, out_dir=None, job_name=None):
    # create dataloader for offline training
    dataloader = torch.utils.data.DataLoader(
        offline_dataset,
-        num_workers=4,
+        num_workers=8,
        batch_size=cfg.policy.batch_size,
        shuffle=True,
        pin_memory=cfg.device != "cpu",