# Port DROID 1.0.1 dataset to LeRobotDataset
## Download

TODO

The raw dataset takes 2 TB of local disk space.
## Port on a single computer

First, install the TensorFlow dataset utilities to read the raw files:

```bash
pip install tensorflow
pip install tensorflow_datasets
```
Then run this script to start porting the dataset:

```bash
python examples/port_datasets/droid_rlds/port_droid.py \
    --raw-dir /your/data/droid/1.0.1 \
    --repo-id your_id/droid_1.0.1 \
    --push-to-hub
```

The ported dataset will take 400 GB of local disk space.
As usual, your LeRobotDataset will be stored in your `huggingface/lerobot` cache folder.

WARNING: Porting the dataset locally takes about 7 days, and uploading another 3 days, so we will need to parallelize over multiple nodes on a SLURM cluster.
NOTE: For development, run this script to port a single shard:

```bash
python examples/port_datasets/droid_rlds/port.py \
    --raw-dir /your/data/droid/1.0.1 \
    --repo-id your_id/droid_1.0.1 \
    --num-shards 2048 \
    --shard-index 0
```
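Conceptually, `--num-shards` and `--shard-index` let each invocation process its own interleaved slice of the episodes. A minimal sketch of such a round-robin split, using a hypothetical `shard_indices` helper (for illustration only; the real script may partition differently):

```python
def shard_indices(num_items: int, num_shards: int, shard_index: int) -> list[int]:
    """Return the item indices one shard is responsible for (round-robin split)."""
    # Hypothetical helper, not the actual script's code.
    return list(range(shard_index, num_items, num_shards))

# With 10 items and 4 shards, shard 0 handles items 0, 4 and 8;
# together the 4 shards cover every item exactly once.
print(shard_indices(10, 4, 0))
```

Each worker can then run independently on its slice, which is what makes the SLURM parallelization below straightforward.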
## Port over SLURM

Install the SLURM utilities from Hugging Face:

```bash
pip install datatrove
```
### 1. Port one shard per job

Run this script to start porting shards of the dataset:

```bash
python examples/port_datasets/droid_rlds/slurm_port_shards.py \
    --raw-dir /your/data/droid/1.0.1 \
    --repo-id your_id/droid_1.0.1 \
    --logs-dir /your/logs \
    --job-name port_droid \
    --partition your_partition \
    --workers 2048 \
    --cpus-per-task 8 \
    --mem-per-cpu 1950M
```
#### Notes on setting the command line arguments

Regarding `--partition`, find yours by running:

```bash
sinfo --format="%R"
```

and select a CPU partition if you have one. No GPU is needed.

Regarding `--workers`, it is the number of SLURM jobs you will launch in parallel. 2048 is the maximum, since there are 2048 shards in DROID. This many jobs will certainly max out your cluster.

Regarding `--cpus-per-task` and `--mem-per-cpu`, by default each job requests ~16 GB of RAM (8 * 1950M), which is recommended for loading the raw frames, and 8 CPUs, which helps parallelize the encoding of the frames.
Find the number of CPUs and the memory of the nodes in your partition by running:

```bash
sinfo -N -p your_partition -h -o "%N cpus=%c mem=%m"
```
#### Useful commands to check progress and debug

Check if your jobs are running:

```bash
squeue -u $USER
```

You should see a list with job indices like `15125385_155`, where `15125385` is the index of the run and `155` is the worker index. The output of this worker is written in real time to `/your/logs/job_name/slurm_jobs/15125385_155.out`. For instance, you can inspect the content of this file by running `less /your/logs/job_name/slurm_jobs/15125385_155.out`.
Check the progression of your jobs by running:

```bash
jobs_status /your/logs
```

If it's not at 100% and no SLURM job is still running, it means that some of them failed. Inspect the logs by running:

```bash
failed_logs /your/logs/job_name
```
If there is an issue in the code, you can fix it in debug mode with `--slurm 0`, which allows you to set breakpoints:

```bash
python examples/port_datasets/droid_rlds/slurm_port_shards.py --slurm 0 ...
```

You can then relaunch the same command with `--slurm 1`, which will skip the already completed jobs:

```bash
python examples/port_datasets/droid_rlds/slurm_port_shards.py --slurm 1 ...
```
Once all jobs are completed, you will have one dataset per shard (e.g. `droid_1.0.1_world_2048_rank_1594`) saved on disk in your `/lerobot/home/dir/your_id` directory. You can find your `/lerobot/home/dir` by running:

```bash
python -c "from lerobot.common.constants import HF_LEROBOT_HOME; print(HF_LEROBOT_HOME)"
```
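To check how many per-shard datasets have landed on disk so far, a small hypothetical helper can count directories matching the naming pattern shown above (adjust the prefix if your repo id or shard count differs):

```python
from pathlib import Path

def count_shard_datasets(user_dir: str, prefix: str = "droid_1.0.1_world_2048_rank_") -> int:
    """Count directories whose names match the per-shard dataset pattern."""
    # Illustrative helper, not part of LeRobot itself.
    return sum(
        1 for d in Path(user_dir).iterdir() if d.is_dir() and d.name.startswith(prefix)
    )
```

When the count reaches the number of shards (2048 here), all porting jobs have finished and you can move on to aggregation.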
### 2. Aggregate all shards

Run this script to start aggregation:

```bash
python examples/port_datasets/droid_rlds/slurm_aggregate_shards.py \
    --repo-id your_id/droid_1.0.1 \
    --logs-dir /your/logs \
    --job-name aggr_droid \
    --partition your_partition \
    --workers 2048 \
    --cpus-per-task 8 \
    --mem-per-cpu 1950M
```
Once all jobs are completed, you will have one dataset in your `/lerobot/home/dir/your_id/droid_1.0.1` directory.
### 3. Upload dataset

Run this script to start uploading:

```bash
python examples/port_datasets/droid_rlds/slurm_upload.py \
    --repo-id your_id/droid_1.0.1 \
    --logs-dir /your/logs \
    --job-name upload_droid \
    --partition your_partition \
    --workers 50 \
    --cpus-per-task 4 \
    --mem-per-cpu 1950M
```