# Port DROID 1.0.1 dataset to LeRobotDataset

## Download

TODO

The raw dataset takes 2 TB of local disk space.
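
Since the download is large, it can help to check free disk space before starting. The snippet below is a minimal sketch: `has_free_space` is a hypothetical helper (not part of LeRobot), and `/your/data` is a placeholder for your download location.

```bash
# Hypothetical helper: succeed only if the filesystem holding DIR
# has at least NEEDED_GB gigabytes available.
has_free_space() {
    local dir="$1" needed_gb="$2" avail_kb
    # `df -Pk` prints POSIX-format output in 1024-byte blocks;
    # the 4th column of the 2nd line is the available space.
    avail_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
    [ "$avail_kb" -ge $((needed_gb * 1024 * 1024)) ]
}

# Example (2 TB = 2048 GB for the raw data):
# has_free_space /your/data 2048 || echo "Not enough space for raw DROID data"
```

The same check can be reused with `400` for the ported dataset, pointing at your LeRobot cache folder.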

## Port on a single computer

First, install TensorFlow and TensorFlow Datasets to read the raw files:
```bash
pip install tensorflow
pip install tensorflow_datasets
```

Then run this script to start porting the dataset:
```bash
python examples/port_datasets/droid_rlds/port_droid.py \
    --raw-dir /your/data/droid/1.0.1 \
    --repo-id your_id/droid_1.0.1 \
    --push-to-hub
```

The ported dataset takes 400 GB of local disk space.

As usual, your LeRobotDataset will be stored in your huggingface/lerobot cache folder.

WARNING: Porting the dataset locally takes 7 days and uploading takes 3 days, so you will need to parallelize over multiple nodes on a SLURM cluster.

NOTE: For development, run this script to start porting a shard:
```bash
python examples/port_datasets/droid_rlds/port_droid.py \
    --raw-dir /your/data/droid/1.0.1 \
    --repo-id your_id/droid_1.0.1 \
    --num-shards 2048 \
    --shard-index 0
```

## Port over SLURM

Install slurm utilities from Hugging Face:
```bash
pip install datatrove
```


### 1. Port one shard per job

Run this script to start porting shards of the dataset:
```bash
python examples/port_datasets/droid_rlds/slurm_port_shards.py \
    --raw-dir /your/data/droid/1.0.1 \
    --repo-id your_id/droid_1.0.1 \
    --logs-dir /your/logs \
    --job-name port_droid \
    --partition your_partition \
    --workers 2048 \
    --cpus-per-task 8 \
    --mem-per-cpu 1950M
```

**Note on how to set your command line arguments**

Regarding `--partition`, find yours by running:
```bash
sinfo --format="%R"
```
and select the CPU partition if you have one. No GPU needed.

Regarding `--workers`, it is the number of SLURM jobs launched in parallel. 2048 is the maximum, since DROID has 2048 shards. A number this large will almost certainly max out your cluster.

Regarding `--cpus-per-task` and `--mem-per-cpu`, the values above give each job ~16GB of RAM (8*1950M), which is recommended for loading the raw frames, and 8 CPUs, which can help parallelize frame encoding.

Find the number of CPUs and Memory of the nodes of your partition by running:
```bash
sinfo -N -p your_partition -h -o "%N cpus=%c mem=%m"
```
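
To see how these settings relate to your nodes, the arithmetic can be sketched as follows. `jobs_per_node` is a hypothetical helper (not part of LeRobot or datatrove), and the node sizes in the example are made-up values standing in for what `sinfo` reports (`%c` is the CPU count, `%m` is memory in megabytes).

```bash
# Hypothetical helper: how many port jobs fit on one node at the same time.
# Defaults match the command above: 8 CPUs per task, 1950 MB per CPU.
jobs_per_node() {
    local node_cpus="$1" node_mem_mb="$2"
    local cpus_per_task="${3:-8}" mem_per_cpu_mb="${4:-1950}"
    local by_cpu=$((node_cpus / cpus_per_task))
    local by_mem=$((node_mem_mb / (cpus_per_task * mem_per_cpu_mb)))
    # The scarcer resource limits the number of concurrent jobs.
    if [ "$by_cpu" -lt "$by_mem" ]; then
        echo "$by_cpu"
    else
        echo "$by_mem"
    fi
}

echo "$((8 * 1950)) MB per job"   # prints "15600 MB per job"
jobs_per_node 96 192000           # example node: 96 CPUs, 192000 MB -> prints 12
jobs_per_node 64 64000            # memory-bound example node -> prints 4
```

If memory is the limiting factor on your nodes, lowering `--workers` has no effect on the per-job footprint; only `--cpus-per-task` and `--mem-per-cpu` change it.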

**Useful commands to check progress and debug**

Check if your jobs are running:
```bash
squeue -u $USER
```

You should see a list with job indices like `15125385_155` where `15125385` is the index of the run and `155` is the worker index. The output/print of this worker is written in real time in `/your/logs/job_name/slurm_jobs/15125385_155.out`. For instance, you can inspect the content of this file by running `less /your/logs/job_name/slurm_jobs/15125385_155.out`.
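
While jobs are still running, you can also scan all worker outputs at once. This is a minimal sketch, assuming Python tracebacks end up in the `.out` files (depending on your SLURM setup, stderr may be written elsewhere); `logs_with_traceback` is a hypothetical helper.

```bash
# Hypothetical helper: list worker output files that contain a Python traceback.
logs_with_traceback() {
    local slurm_jobs_dir="$1"
    # `grep -l` prints only the names of matching files.
    grep -l "Traceback" "$slurm_jobs_dir"/*.out 2>/dev/null
}

# Example:
# logs_with_traceback /your/logs/job_name/slurm_jobs
```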

Check the progression of your jobs by running:
```bash
jobs_status /your/logs
```

If progress is below 100% and no SLURM job is still running, some jobs have failed. Inspect the logs by running:
```bash
failed_logs /your/logs/job_name
```

If there is an issue in the code, you can debug it with `--slurm 0`, which runs without SLURM and lets you set breakpoints:
```bash
python examples/port_datasets/droid_rlds/slurm_port_shards.py --slurm 0 ...
```

And you can relaunch the same command, which will skip the completed jobs:
```bash
python examples/port_datasets/droid_rlds/slurm_port_shards.py --slurm 1 ...
```

Once all jobs are completed, you will have one dataset per shard (e.g. `droid_1.0.1_world_2048_rank_1594`) saved on disk in your `/lerobot/home/dir/your_id` directory. You can find your `/lerobot/home/dir` by running:
```bash
python -c "from lerobot.common.constants import HF_LEROBOT_HOME;print(HF_LEROBOT_HOME)"
```
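
Before aggregating, a quick sanity check is to list which shard directories are missing. The sketch below assumes ranks run from 0 to `world - 1` and are not zero-padded, following the `droid_1.0.1_world_2048_rank_1594` example; `missing_ranks` is a hypothetical helper. Note that an existing directory does not prove the shard finished porting, so `jobs_status` remains the reference.

```bash
# Hypothetical helper: print every rank whose shard directory does not exist.
missing_ranks() {
    local shard_root="$1" prefix="$2" world="$3" rank=0
    while [ "$rank" -lt "$world" ]; do
        [ -d "${shard_root}/${prefix}_world_${world}_rank_${rank}" ] || echo "$rank"
        rank=$((rank + 1))
    done
}

# Example:
# missing_ranks /lerobot/home/dir/your_id droid_1.0.1 2048
```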


### 2. Aggregate all shards

Run this script to start aggregation:
```bash
python examples/port_datasets/droid_rlds/slurm_aggregate_shards.py \
    --repo-id your_id/droid_1.0.1 \
    --logs-dir /your/logs \
    --job-name aggr_droid \
    --partition your_partition \
    --workers 2048 \
    --cpus-per-task 8 \
    --mem-per-cpu 1950M
```

Once all jobs are completed, you will have a single dataset in your `/lerobot/home/dir/your_id/droid_1.0.1` directory.


### 3. Upload dataset

Run this script to start uploading:
```bash
python examples/port_datasets/droid_rlds/slurm_upload.py \
    --repo-id your_id/droid_1.0.1 \
    --logs-dir /your/logs \
    --job-name upload_droid \
    --partition your_partition \
    --workers 50 \
    --cpus-per-task 4 \
    --mem-per-cpu 1950M
```