From 13bd0a61847cd51ca60c4c846a76bbeae95fa9f7 Mon Sep 17 00:00:00 2001
From: Adil Zouitine
Date: Sun, 5 May 2024 17:39:09 +0200
Subject: [PATCH] Update dataset upload instructions and add support for new raw formats

---
 README.md | 81 ++++++++++++++++++++++++++-----------------------------
 1 file changed, 38 insertions(+), 43 deletions(-)

diff --git a/README.md b/README.md
index 7729c0c7..7659d016 100644
--- a/README.md
+++ b/README.md
@@ -171,69 +171,64 @@ If you would like to contribute to 🤗 LeRobot, please check out our [contribut
 
 ### Add a new dataset
 
-```python
-# TODO(rcadene, AdilZouitine): rewrite this section
-```
-
-To add a dataset to the hub, first login and use a token generated from [Hugging Face settings](https://huggingface.co/settings/tokens) with write access:
+To add a dataset to the hub, begin by logging in with a token that has write access, which can be generated from the [Hugging Face settings](https://huggingface.co/settings/tokens):
 ```bash
 huggingface-cli login --token ${HUGGINGFACE_TOKEN} --add-to-git-credential
 ```
 
-Then you can upload it to the hub with:
+Then, push your dataset to the hub using the following command:
+
 ```bash
-HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli upload $HF_USER/$DATASET data/$DATASET \
- --repo-type dataset \
- --revision v1.0
+python lerobot/scripts/push_dataset_to_hub.py \
+--data-dir data \
+--dataset-id pusht \
+--raw-format pusht_zarr \
+--community-id lerobot \
+--revision v1.3 \
+--dry-run 0 \
+--save-to-disk 0 \
+--save-tests-to-disk 0 \
+--debug 0
 ```
 
-You will need to set the corresponding version as a default argument in your dataset class:
-```python
- version: str | None = "v1.1",
-```
-See: [`lerobot/common/datasets/pusht.py`](https://github.com/Cadene/lerobot/blob/main/lerobot/common/datasets/pusht.py)
+For detailed explanations of the arguments, consult the help command:
 
-For instance, for [lerobot/pusht](https://huggingface.co/datasets/lerobot/pusht), we used:
 ```bash
-HF_USER=lerobot
-DATASET=pusht
+python lerobot/scripts/push_dataset_to_hub.py --help
 ```
 
-If you want to improve an existing dataset, you can download it locally with:
+We currently support the following raw formats:
+
+```
+pusht_zarr | umi_zarr | aloha_hdf5 | xarm_pkl
+```
+
+For the `revision` parameter, set the version to match `CODEBASE_VERSION` using:
+
 ```bash
-mkdir -p data/$DATASET
-HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download ${HF_USER}/$DATASET \
- --repo-type dataset \
- --local-dir data/$DATASET \
- --local-dir-use-symlinks=False \
- --revision v1.0
+python -c "from lerobot.common.datasets.lerobot_dataset import CODEBASE_VERSION; print(CODEBASE_VERSION)"
 ```
 
-Iterate on your code and dataset with:
+If there is a need to update the unit tests, set `save-tests-to-disk` to 1 to mock the dataset:
+
 ```bash
-DATA_DIR=data python train.py
+python lerobot/scripts/push_dataset_to_hub.py \
+--data-dir data \
+--dataset-id pusht \
+--raw-format pusht_zarr \
+--community-id lerobot \
+--revision v1.3 \
+--dry-run 0 \
+--save-to-disk 0 \
+--save-tests-to-disk 1 \
+--debug 0
 ```
 
-Upload a new version (v2.0 or v1.1 if the changes are respectively more or less significant):
-```bash
-HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli upload $HF_USER/$DATASET data/$DATASET \
- --repo-type dataset \
- --revision v1.1 \
- --delete "*"
-```
+The mock dataset will be located in `tests/data/$COMMUNITY_ID/$DATASET_ID/`, which can be used to update the unit tests.
 
-Then you will need to set the corresponding version as a default argument in your dataset class:
-```python
-version: str | None = "v1.1",
-```
-See: [`lerobot/common/datasets/pusht.py`](https://github.com/Cadene/lerobot/blob/main/lerobot/common/datasets/pusht.py)
+To implement a new raw format, create a file in `lerobot/common/datasets/push_dataset_to_hub/{raw_format}_format.py` and implement the functions: `check_format`, `load_from_raw`, and `to_hf_dataset`. Combine these functions in `from_raw_to_lerobot_format`. You can find examples here: [pusht_zarr](https://github.com/huggingface/lerobot/blob/main/lerobot/common/datasets/push_dataset_to_hub/pusht_zarr_format.py), [umi_zarr](https://github.com/huggingface/lerobot/blob/main/lerobot/common/datasets/push_dataset_to_hub/umi_zarr_format.py), [aloha_hdf5](https://github.com/huggingface/lerobot/blob/main/lerobot/common/datasets/push_dataset_to_hub/aloha_hdf5_format.py), and [xarm_pkl](https://github.com/huggingface/lerobot/blob/main/lerobot/common/datasets/push_dataset_to_hub/xarm_pkl_format.py). Then, add the new format to [`get_from_raw_to_lerobot_format_fn`](https://github.com/huggingface/lerobot/blob/main/lerobot/scripts/push_dataset_to_hub.py#L69) in [`lerobot/scripts/push_dataset_to_hub.py`](https://github.com/huggingface/lerobot/blob/main/lerobot/scripts/push_dataset_to_hub.py). Et voilà! You are now ready to use this new format in [`push_dataset_to_hub.py`](https://github.com/huggingface/lerobot/blob/main/lerobot/scripts/push_dataset_to_hub.py) and can submit a PR to add it 🤗.
 
-Finally, you might want to mock the dataset if you need to update the unit tests as well:
-```bash
-python tests/scripts/mock_dataset.py --in-data-dir data/$DATASET --out-data-dir tests/data/$DATASET
-```
-
 ### Add a pretrained policy
 
 ```python