- Optimized the critic design, improving learner-loop performance by a factor of 2
- Cleaned the code and fixed style issues
- Completed the config with an `actor_learner_config` field containing the host IP and port elements required by the actor-learner servers.
Co-authored-by: Adil Zouitine <adilzouitinegm@gmail.com>
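
For illustration, a minimal sketch of how a nested `actor_learner_config` field could carry the host IP and port. The class names `ActorLearnerConfig` and `SACConfigSketch`, the field names `learner_host` and `learner_port`, the default values, and the helper `learner_address` are all hypothetical; the commit only states that the field holds a host IP and a port.

```python
from dataclasses import dataclass, field


@dataclass
class ActorLearnerConfig:
    # Hypothetical field names and defaults, not taken from the codebase.
    learner_host: str = "127.0.0.1"
    learner_port: int = 50051


@dataclass
class SACConfigSketch:
    # Nested connection settings shared by the actor and learner servers.
    actor_learner_config: ActorLearnerConfig = field(default_factory=ActorLearnerConfig)


def learner_address(cfg: SACConfigSketch) -> str:
    """Build the 'host:port' string a server or client would bind or dial."""
    c = cfg.actor_learner_config
    return f"{c.learner_host}:{c.learner_port}"
```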
- Updated SACConfig to replace standard deviation parameterization with log_std_min and log_std_max for better control over action distributions.
- Modified SACPolicy to streamline action selection and log probability calculations, enhancing stochastic behavior.
- Removed deprecated TanhMultivariateNormalDiag class to simplify the codebase and improve maintainability.
These changes aim to enhance the robustness and performance of the SAC implementation during training and inference.
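
As a rough sketch of the idea behind `log_std_min` / `log_std_max`, the snippet below clamps the log standard deviation, draws a tanh-squashed Gaussian action, and computes its log probability with the usual change-of-variables correction. It uses plain Python rather than the project's tensor code, and the bounds -5 and 2, the `1e-6` epsilon, and the function names are assumptions, not values taken from SACConfig.

```python
import math
import random


def clamp(x, lo, hi):
    return max(lo, min(hi, x))


def sample_tanh_gaussian(mean, log_std, log_std_min=-5.0, log_std_max=2.0, rng=random):
    """Sample a tanh-squashed diagonal Gaussian action and its log probability."""
    action, log_prob = [], 0.0
    for mu, ls in zip(mean, log_std):
        ls = clamp(ls, log_std_min, log_std_max)  # keep std in a sane range
        std = math.exp(ls)
        eps = rng.gauss(0.0, 1.0)
        u = mu + std * eps                        # pre-squash sample
        a = math.tanh(u)                          # squash into (-1, 1)
        # Log density of u under N(mu, std^2).
        log_prob += -0.5 * eps * eps - ls - 0.5 * math.log(2.0 * math.pi)
        # Change of variables for tanh: subtract log(1 - tanh(u)^2).
        log_prob -= math.log(1.0 - a * a + 1e-6)
        action.append(a)
    return action, log_prob
```

Clamping in log space bounds the standard deviation to `[exp(log_std_min), exp(log_std_max)]`, so the policy can become neither fully deterministic nor arbitrarily noisy.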
- Updated standard deviation parameterization in SACConfig to 'softplus' with defined min and max values for improved stability.
- Modified action sampling in SACPolicy to use reparameterized sampling, ensuring better gradient flow and log probability calculations.
- Cleaned up log probability calculations in TanhMultivariateNormalDiag for clarity and efficiency.
- Increased evaluation frequency in YAML configuration to 50000 for more efficient training cycles.
These changes aim to enhance the robustness and performance of the SAC implementation during training and inference.
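
A small sketch of a 'softplus' standard deviation parameterization with min and max bounds, followed by a reparameterized sample. The bounds `1e-5` and `10.0` and the function names are illustrative assumptions, not the values in the YAML configuration.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def bounded_std(raw, std_min=1e-5, std_max=10.0):
    """Map an unconstrained network output to a positive, bounded std."""
    return min(std_max, max(std_min, softplus(raw)))


def rsample(mean, std, eps):
    """Reparameterized sample: a deterministic function of (mean, std)
    given noise eps ~ N(0, 1), so gradients can flow through mean and std."""
    return mean + std * eps
```

Because the noise `eps` is drawn independently of the policy parameters, the sampled action is differentiable with respect to `mean` and `std`, which is what lets the actor loss backpropagate through the sample.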
- Updated action selection to use distribution sampling and log probabilities for better stochastic behavior.
- Enhanced standard deviation clamping to prevent extreme values, ensuring stability in policy outputs.
- Cleaned up code by removing unnecessary comments and improving readability.
These changes aim to refine the SAC implementation, enhancing its robustness and performance during training and inference.
- Added `num_subsample_critics`, `critic_target_update_weight`, and `utd_ratio` to SACConfig.
- Implemented target entropy calculation in SACPolicy if not provided.
- Introduced subsampling of critics to prevent overfitting during updates.
- Updated temperature loss calculation to use the new target entropy.
- Added comments for future UTD update implementation.
These changes improve the flexibility and performance of the SAC implementation.
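
The pieces above can be sketched in plain Python as follows. This is an illustration under stated assumptions: the default target entropy uses the common `-dim(A)` heuristic (the commit does not say which formula SACPolicy uses), `critic_target_update_weight` is read as a Polyak averaging coefficient, and all function names are hypothetical.

```python
import math
import random


def default_target_entropy(action_dim):
    # Assumption: the common SAC heuristic -dim(A).
    return -float(action_dim)


def subsampled_min_q(q_values, num_subsample_critics=None, rng=random):
    """Min over a random subset of the critic ensemble's target Q-values."""
    q_values = list(q_values)
    if num_subsample_critics is not None:
        q_values = rng.sample(q_values, num_subsample_critics)
    return min(q_values)


def temperature_loss(log_alpha, log_probs, target_entropy):
    """Mean of -alpha * (log_pi + target_entropy) over a batch."""
    alpha = math.exp(log_alpha)
    return sum(-alpha * (lp + target_entropy) for lp in log_probs) / len(log_probs)


def soft_update(target, online, critic_target_update_weight=0.005):
    """Polyak averaging of target-critic parameters toward the online critic."""
    w = critic_target_update_weight
    return [w * o + (1.0 - w) * t for t, o in zip(target, online)]
```

With `utd_ratio` greater than 1, the critic update and `soft_update` would typically run several times per environment step; the commit only adds comments for that, so it is not sketched here.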