<h1 align="center">
DexVLA: Vision-Language Model with Plug-In Diffusion Expert for Visuomotor Policy Learning</h1>
### This is the LeRobot version of DexVLA. For more information, refer to the [original repository](https://github.com/juruobenruo/DexVLA).
## Data Input
DexVLA takes RGB images, language instructions, and robot states as input. In our setting, we use three camera views: a top camera and two wrist cameras.
⭐A major difference between DexVLA and other VLAs is that DexVLA takes raw language in and outputs sub-step reasoning based on the current observations and robot states.
So you have to <font color='red'>add sub-step reasoning to your data for training</font>.
Specifically, your data should include a key ``reasoning``, which is a list of sub-step reasoning strings, one per observation.
For example, if an episode has 10 steps, the list should have length 10 as well, and it may look like:
~~~python
reasoning = [
"This is step 1.",
"This is step 1.",
"This is step 2.",
"This is step 2.",
...
"This is step 4.",
]
~~~
Besides, your data should include another key ``action_is_pad``, which is a boolean mask indicating which actions in each chunk are padded.
For example, suppose the action chunk size is 5 and the episode length is 10. The chunks starting at the last 4 steps must then be padded so that each chunk still contains 5 actions.
The corresponding masks look like this (a runnable sketch for building them follows the example):
~~~
The 6th chunk: [false, false, false, false, true]
The 7th chunk: [false, false, false, true, true]
The 8th chunk: [false, false, true, true, true]
The 9th chunk: [false, true, true, true, true]
~~~
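Below is a minimal Python sketch of how these two keys could be constructed for one episode. The episode length, chunk size, and reasoning strings are illustrative placeholders, not part of the DexVLA API.
~~~python
# Minimal sketch (illustrative): build `reasoning` and `action_is_pad` for one episode.
episode_length = 10  # number of observations/steps in the episode
chunk_size = 5       # number of actions per action chunk

# One sub-step reasoning string per observation.
reasoning = [
    "Pick up the sleeve." if t < 5 else "Fold the shirt in half."
    for t in range(episode_length)
]

# For the chunk starting at step t, actions past the end of the episode are padding.
action_is_pad = []
for t in range(episode_length):
    n_valid = min(chunk_size, episode_length - t)
    action_is_pad.append([False] * n_valid + [True] * (chunk_size - n_valid))

# The chunk starting at step index 6 has one padded action, as in the example above.
assert action_is_pad[6] == [False, False, False, False, True]
~~~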
## 🤗Download Pretrained Weights
### Download official Qwen2-VL weights
We construct the VLM backbone by integrating Qwen2-VL-2B, a powerful and efficient model, into our framework.
Qwen2-VL-2B serves as the core of our architecture, providing robust capabilities for vision-language tasks.
We use the off-the-shelf Qwen2-VL model proposed in [Qwen2-VL](https://arxiv.org/pdf/2409.12191) without any post-training of the VLM itself. You can download the official weights from this link:

| Model | Link |
|---------------------|----------------------------------------------------------------|
| Qwen2-VL (~2B) | [huggingface](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct) |
**❗❗** After downloading the official weights, you have to replace the official ``config.json``
with our [``config.json``](https://github.com/juruobenruo/DexVLA/blob/main/docs/config.json) designed for the VLA.
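If you fetch the weights with ``huggingface_hub``, the download and the config replacement can look roughly like the sketch below; the local paths are placeholders you should adapt to where you saved our ``config.json``.
~~~python
# Sketch (paths are placeholders): download Qwen2-VL-2B and swap in the DexVLA config.json.
import shutil
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="Qwen/Qwen2-VL-2B-Instruct",
    local_dir="./Qwen2-VL-2B-Instruct",
)

# Overwrite the official config with the VLA-specific config.json from the DexVLA repo.
shutil.copy("./docs/config.json", f"{local_dir}/config.json")
~~~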
### Download our pretrained ScaleDP weights (Stage 1)
We release the pretrained ScaleDP weights produced by Stage 1. You can download the weights and directly fine-tune on your own data in Stage 2.

| Model | Link |
|-------------------|----------------------------------------------------------------|
| ScaleDP-H (~1B) | [huggingface](https://huggingface.co/lesjie/scale_dp_h) |
| ScaleDP-L (~400M) | [huggingface](https://huggingface.co/lesjie/scale_dp_l) |
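As an example, the checkpoint can also be fetched programmatically; the filename below follows the path used in the Stage 2 command and should be verified against the Hub repo.
~~~python
# Sketch: fetch the pretrained ScaleDP-H checkpoint (verify the filename on the Hub).
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="lesjie/scale_dp_h",
    filename="open_scale_dp_l_backbone.ckpt",
)
print(ckpt_path)  # pass this path to --policy.pretrain_scaledp_path
~~~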
## 🦾Train
We have already provided the pretrained ScaleDP weights from Stage 1. The following describes the training process for Stage 2 and Stage 3.
### Training Stage 2
~~~shell
python lerobot/scripts/train.py \
--policy.type dexvla \
--policy.qwen2_vl_path /path/to/official/Qwen2-VL-2B-Instruct \
--policy.pretrain_scaledp_path /path/to/pretrained/scale_dp_h/open_scale_dp_l_backbone.ckpt \
--policy.policy_head_size 'ScaleDP_H' \
--policy.training_stage 2 \
--dataset.repo_id folding_blue_tshirt \
--dataset.local_files_only true \
--batch_size 2 \
--policy.using_film true \
--output_dir /path/to/output \
--steps 10000 \
--save_freq 1000 \
--optimizer_lr 2e-5 \
--policy.device=cuda
~~~
### Training Stage 3
Stage 3 can be viewed as continued training on a specific dexterous task, such as laundry folding, in the same way as PI0. So Stage 3 is trained from the Stage 2 weights.
~~~shell
python lerobot/scripts/train.py \
--policy.type dexvla \
--policy.qwen2_vl_path /path/to/official/Qwen2-VL-2B-Instruct \
--policy.pretrained_path /path/to/pretrained/stage2/weights \
--policy.policy_head_size 'ScaleDP_H' \
--policy.training_stage 3 \
--dataset.repo_id folding_blue_tshirt \
--dataset.local_files_only true \
--batch_size 2 \
--policy.using_film true \
--output_dir /path/to/output \
--steps 10000 \
--save_freq 1000 \
--optimizer_lr 2e-5 \
--policy.device=cuda
~~~
## Evaluation
~~~shell
python lerobot/scripts/eval.py \
--policy.type dexvla \
--policy.pretrained_path /path/to/pretrained/stage2/or/stage3/weights \
--env.type aloha \
--env.episode_length 5 \
--policy.qwen2_vl_path /path/to/official/Qwen2-VL-2B-Instruct \
--env.task AlohaInsertion-v0 \
--eval.n_episodes 1 \
--eval.batch_size 1 \
--device cuda
~~~