# lerobot/lerobot/configs/policy/tdmpc.yaml

# @package _global_

seed: 1
dataset_repo_id: lerobot/xarm_lift_medium

training:
  offline_steps: 25000
  # TODO(alexander-soare): restore online_steps to 25000 when online training gets reinstated
  online_steps: 0  # online training is not implemented yet
  eval_freq: 5000
  online_steps_between_rollouts: 1
  online_sampling_ratio: 0.5
  online_env_seed: 10000

  batch_size: 256
  grad_clip_norm: 10.0
  lr: 3e-4

  delta_timestamps:
    observation.image: "[i / ${fps} for i in range(${policy.horizon} + 1)]"
    observation.state: "[i / ${fps} for i in range(${policy.horizon} + 1)]"
    action: "[i / ${fps} for i in range(${policy.horizon})]"
    next.reward: "[i / ${fps} for i in range(${policy.horizon})]"
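    # Illustration only, not part of the original config: how the expressions above expand.
    # Assumption: ${fps} resolves to 15 (it is defined outside this file, e.g. by the env config).
    # With policy.horizon = 5:
    #   observation.image, observation.state -> [0/15, 1/15, 2/15, 3/15, 4/15, 5/15]  (horizon + 1 = 6 timestamps)
    #   action, next.reward                  -> [0/15, 1/15, 2/15, 3/15, 4/15]        (horizon = 5 timestamps)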

policy:
  name: tdmpc

  pretrained_model_path:

  # Input / output structure.
  n_action_repeats: 2
  horizon: 5

  input_shapes:
    # TODO(rcadene, alexander-soare): add variables for height and width from the dataset/env?
    observation.image: [3, 84, 84]
    observation.state: ["${env.state_dim}"]
  output_shapes:
    action: ["${env.action_dim}"]

  # Normalization / Unnormalization
  input_normalization_modes: null
  output_normalization_modes:
    action: min_max

  # Architecture / modeling.
  # Neural networks.
  image_encoder_hidden_dim: 32
  state_encoder_hidden_dim: 256
  latent_dim: 50
  q_ensemble_size: 5
  mlp_dim: 512
  # Reinforcement learning.
  discount: 0.9

  # Inference.
  use_mpc: true
  cem_iterations: 6
  max_std: 2.0
  min_std: 0.05
  n_gaussian_samples: 512
  n_pi_samples: 51
  uncertainty_regularizer_coeff: 1.0
  n_elites: 50
  elite_weighting_temperature: 0.5
  gaussian_mean_momentum: 0.1
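  # Illustration only, assuming the usual TD-MPC planner in which Gaussian and policy samples are pooled:
  #   candidates per CEM iteration = n_gaussian_samples + n_pi_samples = 512 + 51 = 563
  #   elites kept per iteration    = n_elites = 50 (about 9% of the pool)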

  # Training and loss computation.
  max_random_shift_ratio: 0.0476
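  # Illustration only: on the 84-pixel image side declared in input_shapes,
  # 0.0476 * 84 is approximately 4.0, so the maximum shift is about 4 pixels
  # (how the product is rounded depends on the implementation).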
  # Loss coefficients.
  reward_coeff: 0.5
  expectile_weight: 0.9
  value_coeff: 0.1
  consistency_coeff: 20.0
  advantage_scaling: 3.0
  pi_coeff: 0.5
  temporal_decay_coeff: 0.5
  # Target model.
  target_model_momentum: 0.995