
# Port DROID 1.0.1 dataset to LeRobotDataset

## Download

TODO

The raw dataset takes 2 TB of local disk space.

## Port on a single computer

First, install the tensorflow dataset utilities needed to read the raw files:

```bash
pip install tensorflow
pip install tensorflow_datasets
```

Then run this script to start porting the dataset:

```bash
python examples/port_datasets/droid_rlds/port_droid.py \
    --raw-dir /your/data/droid/1.0.1 \
    --repo-id your_id/droid_1.0.1 \
    --push-to-hub
```

The ported dataset takes 400 GB of local disk space.

As usual, your LeRobotDataset will be stored in your `huggingface/lerobot` cache folder.

WARNING: porting the dataset locally takes about 7 days and uploading takes another 3 days, so you will need to parallelize over multiple nodes on a slurm cluster.

NOTE: For development, run this script to port a single shard:

```bash
python examples/port_datasets/droid_rlds/port_droid.py \
    --raw-dir /your/data/droid/1.0.1 \
    --repo-id your_id/droid_1.0.1 \
    --num-shards 2048 \
    --shard-index 0
```
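The `--num-shards` / `--shard-index` pair can be thought of as an even partition of the episode list across workers. A minimal sketch of one plausible splitting scheme (the actual script's logic may differ):

```python
def shard_bounds(num_items: int, num_shards: int, shard_index: int) -> tuple[int, int]:
    """Evenly partition num_items across num_shards; earlier shards
    absorb the remainder one item at a time."""
    base, extra = divmod(num_items, num_shards)
    start = shard_index * base + min(shard_index, extra)
    end = start + base + (1 if shard_index < extra else 0)
    return start, end

# With 2048 shards, shard 0 of a hypothetical 100_000-episode dataset covers:
print(shard_bounds(100_000, 2048, 0))  # → (0, 49)
```

Every episode lands in exactly one shard, so the shards can be ported independently and aggregated later.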

## Port over SLURM

Install the slurm utilities from Hugging Face:

```bash
pip install datatrove
```

### 1. Port one shard per job

Run this script to start porting shards of the dataset:

```bash
python examples/port_datasets/droid_rlds/slurm_port_shards.py \
    --raw-dir /your/data/droid/1.0.1 \
    --repo-id your_id/droid_1.0.1 \
    --logs-dir /your/logs \
    --job-name port_droid \
    --partition your_partition \
    --workers 2048 \
    --cpus-per-task 8 \
    --mem-per-cpu 1950M
```

#### Note on how to set your command line arguments

Regarding `--partition`, find yours by running:

```bash
sinfo --format="%R"
```

and select the CPU partition if you have one. No GPU is needed.

Regarding `--workers`, it is the number of slurm jobs launched in parallel. 2048 is the maximum, since there are 2048 shards in DROID. Launching this many jobs will likely max out your cluster.

Regarding `--cpus-per-task` and `--mem-per-cpu`, the defaults request ~16 GB of RAM (8 × 1950M), which is recommended for loading the raw frames, and 8 CPUs, which help parallelize the encoding of the frames.

Find the number of CPUs and the memory of the nodes in your partition by running:

```bash
sinfo -N -p your_partition -h -o "%N cpus=%c mem=%m"
```

#### Useful commands to check progress and debug

Check if your jobs are running:

```bash
squeue -u $USER
```

You should see a list with job indices like `15125385_155`, where `15125385` is the index of the run and `155` is the worker index. The output of this worker is written in real time to `/your/logs/job_name/slurm_jobs/15125385_155.out`. For instance, you can inspect the content of this file by running `less /your/logs/job_name/slurm_jobs/15125385_155.out`.

Check the progression of your jobs by running:

```bash
jobs_status /your/logs
```

If it's not at 100% and no slurm job is still running, some jobs failed. Inspect the logs by running:

```bash
failed_logs /your/logs/job_name
```

If there is an issue in the code, you can fix it in debug mode with `--slurm 0`, which allows you to set breakpoints:

```bash
python examples/port_datasets/droid_rlds/slurm_port_shards.py --slurm 0 ...
```

Then you can relaunch the same command, which will skip the completed jobs:

```bash
python examples/port_datasets/droid_rlds/slurm_port_shards.py --slurm 1 ...
```

Once all jobs are completed, you will have one dataset per shard (e.g. `droid_1.0.1_world_2048_rank_1594`) saved on disk in your `/lerobot/home/dir/your_id` directory. You can find your `/lerobot/home/dir` by running:

```bash
python -c "from lerobot.common.constants import HF_LEROBOT_HOME;print(HF_LEROBOT_HOME)"
```
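To check how many shard datasets have landed so far, you can count the directories matching the naming pattern above. This is a hypothetical helper, not part of the porting scripts; it assumes the `<repo>_world_<W>_rank_<i>` pattern from the example name:

```python
from pathlib import Path

def count_shards(home_dir: str, repo_name: str = "droid_1.0.1", world: int = 2048) -> int:
    # Per-shard datasets are assumed to be named "<repo>_world_<W>_rank_<i>"
    pattern = f"{repo_name}_world_{world}_rank_*"
    return len(list(Path(home_dir).glob(pattern)))
```

For example, `count_shards("/lerobot/home/dir/your_id")` should reach 2048 once all porting jobs have finished.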

### 2. Aggregate all shards

Run this script to start aggregation:

```bash
python examples/port_datasets/droid_rlds/slurm_aggregate_shards.py \
    --repo-id your_id/droid_1.0.1 \
    --logs-dir /your/logs \
    --job-name aggr_droid \
    --partition your_partition \
    --workers 2048 \
    --cpus-per-task 8 \
    --mem-per-cpu 1950M
```

Once all jobs are completed, you will have a single dataset in your `/lerobot/home/dir/your_id/droid_1.0.1` directory.

### 3. Upload dataset

Run this script to start uploading:

```bash
python examples/port_datasets/droid_rlds/slurm_upload.py \
    --repo-id your_id/droid_1.0.1 \
    --logs-dir /your/logs \
    --job-name upload_droid \
    --partition your_partition \
    --workers 50 \
    --cpus-per-task 4 \
    --mem-per-cpu 1950M
```