Improved Dataset Merge Script for Multiple Datasets

This PR addresses an issue regarding merging, converting, and editing datasets. The improved merge.py script provides robust functionality for combining multiple datasets with differing dimensions, tasks, and indices.
Key Improvements:
1. Multi-dataset Merging: Fixed the logic for merging datasets from different sources while preserving data integrity and continuity.
2. Dimension Handling: Added dynamic dimension detection and padding so that all observation and action vectors are consistently sized. The maximum dimension is configurable (default 18).
3. Index Consistency: Implemented continuous global frame indexing so merged datasets have no overlapping or missing indices.
4. Task Mapping: Fixed task_index updates to ensure correct mapping across merged datasets with different task descriptions.
5. FPS Consistency: Added checks for consistent FPS across datasets, with a configurable default value.
6. Directory Structure: Reorganized the output directory into a chunk-based layout for better scalability.
7. Error Logging: Improved error reporting for failed files to aid debugging.
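The dimension-handling step (item 2) amounts to zero-padding every observation/action vector up to a common maximum size. A minimal sketch of the idea; `pad_to_dim` is a hypothetical helper name, not necessarily what merge.py uses:

```python
import numpy as np

def pad_to_dim(vec, max_dim=18):
    """Zero-pad a 1-D observation/action vector up to max_dim entries.

    Illustrative only: the actual merge.py may pad differently or
    detect max_dim dynamically across all source datasets.
    """
    vec = np.asarray(vec, dtype=np.float32)
    if vec.shape[0] > max_dim:
        raise ValueError(f"vector of size {vec.shape[0]} exceeds max_dim={max_dim}")
    return np.pad(vec, (0, max_dim - vec.shape[0]))
```

After padding, vectors from datasets with different state/action sizes can be stacked into a single array of shape `(num_frames, max_dim)`.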
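The continuous global indexing (item 3) can be sketched as an offset that accumulates across datasets; the frame-dict layout and field names here are hypothetical:

```python
def reindex_frames(datasets):
    """Assign continuous global indices across a list of datasets.

    `datasets` is a list of frame lists, each frame a dict with a
    local "frame_index"; a global "index" is written without gaps
    or overlaps. Illustrative structure only.
    """
    merged, offset = [], 0
    for frames in datasets:
        for frame in frames:
            merged.append(dict(frame, index=offset + frame["frame_index"]))
        offset += len(frames)  # next dataset continues where this one ended
    return merged
```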
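Task remapping (item 4) needs a global task table so that identical task descriptions from different datasets share one index. A sketch under an assumed per-dataset layout (`tasks` list plus `frames` with local `task_index` values):

```python
def remap_task_indices(datasets):
    """Deduplicate task descriptions globally and rewrite task_index.

    Returns the global task list and the frames with remapped indices.
    The dataset layout here is illustrative, not merge.py's actual one.
    """
    global_tasks = {}  # task description -> global index
    remapped = []
    for ds in datasets:
        local_to_global = {}
        for local_idx, desc in enumerate(ds["tasks"]):
            if desc not in global_tasks:
                global_tasks[desc] = len(global_tasks)
            local_to_global[local_idx] = global_tasks[desc]
        for frame in ds["frames"]:
            remapped.append(dict(frame, task_index=local_to_global[frame["task_index"]]))
    return list(global_tasks), remapped
```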
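The FPS check (item 5) is essentially: agree on one value, fall back to a default when none is recorded, and fail loudly on a mismatch. A hypothetical helper sketching that logic:

```python
def resolve_fps(fps_values, default_fps=30):
    """Return the common FPS across datasets.

    `fps_values` may contain None for datasets that do not record FPS.
    Raises on disagreement; falls back to default_fps when none is set.
    Illustrative only; merge.py's behavior may differ.
    """
    observed = {f for f in fps_values if f is not None}
    if len(observed) > 1:
        raise ValueError(f"inconsistent FPS across datasets: {sorted(observed)}")
    return observed.pop() if observed else default_fps
```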
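The chunk-based output layout (item 6) groups episodes into fixed-size chunk directories so no single directory grows unbounded. The exact naming scheme below is an assumption modeled on LeRobot-style layouts, not a quote from merge.py:

```python
def chunk_path(episode_index, chunk_size=1000):
    """Map an episode index to a chunked relative path.

    e.g. episode 42 with chunk_size=1000 lands in chunk-000.
    Naming scheme is illustrative.
    """
    chunk = episode_index // chunk_size
    return f"data/chunk-{chunk:03d}/episode_{episode_index:06d}.parquet"
```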
Usage Example:
# Define source folders and output folder
source_folders = [
    "/path/to/dataset1/", 
    "/path/to/dataset2/",
    "/path/to/dataset3/"
]

output_folder = "/path/to/merged_dataset/"

# Merge the datasets with custom parameters
merge_datasets(
    source_folders, 
    output_folder, 
    max_dim=32,  # Set maximum dimension for observation.state and action
    default_fps=20  # Set default FPS if not specified in datasets
)
This commit is contained in:
zhipeng tang, 2025-04-01 15:06:33 +08:00, committed by GitHub
parent e004247ed4
commit 6b2e9448a2
GPG Key ID: B5690EEEBB952194
1 changed file: lerobot/scripts/merge.py (new file, 1255 additions, 0 deletions)