How severe does this issue affect your experience of using Ray?
- Low: It annoys or frustrates me for a moment.
Hello, I am trying to use SAC with a custom toy environment. PPO solves it very efficiently and robustly, achieving optimal performance right away. However, I am struggling to tune SAC.
SAC achieves the optimal episode_reward_max right away, but episode_reward_min stays low, around 30% of the optimal, and never improves. The mean is about 80% of the optimal, which suggests that only a small fraction of episodes are responsible for the low min. This happens in both training and evaluation.
This doesn’t happen for PPO, which obtains optimal values for episode_reward_min and episode_reward_max efficiently.
I have tried tuning SAC, but I cannot get rid of these rogue episodes or raise the low min reward. The best performance I have found is with a low initial alpha of 0.1.
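For reference, this is roughly the kind of config I am experimenting with. A minimal sketch using the legacy dict-style RLlib config; `MyToyEnv` is a placeholder for my custom environment, and the `initial_alpha` value is just the setting that worked best for me, not a recommendation:

```python
# Sketch of the SAC tuning I tried (legacy RLlib dict config style).
# "MyToyEnv" is a hypothetical placeholder for the custom toy environment.
sac_config = {
    "env": "MyToyEnv",
    # Lowering the initial entropy coefficient gave the best results in my runs.
    "initial_alpha": 0.1,
    # Let SAC keep tuning alpha automatically toward the target entropy.
    "target_entropy": "auto",
}

print(sac_config["initial_alpha"])
```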
Is this expected behaviour? Could it be caused by stochastic action sampling during evaluation, which I should make deterministic?
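To test the stochasticity hypothesis, I was thinking of forcing greedy actions during evaluation. A minimal sketch of the evaluation overrides I would add, assuming the dict-style RLlib config where `"explore": False` in `evaluation_config` disables exploration noise for evaluation rollouts (the episode count is an arbitrary choice):

```python
# Hypothetical evaluation overrides to make evaluation deterministic.
eval_overrides = {
    "evaluation_interval": 1,        # evaluate after every training iteration
    "evaluation_num_episodes": 20,   # arbitrary; enough episodes to see the min
    "evaluation_config": {
        "explore": False,            # greedy (deterministic) actions at eval time
    },
}

print(eval_overrides["evaluation_config"]["explore"])
```

If the min reward recovers with `explore=False` but not during training, that would point to sampling noise rather than a genuinely bad policy.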