1. Severity of the issue: (select one)
High: Completely blocks me.
2. Environment:
- Ray version: 2.42.1
- Python version: 3.10.6
- OS: Linux
- Other libs/tools (if relevant): Julia
3. What happened vs. what you expected:
I am having difficulties training an agent in a rather complex environment. I briefly describe it below for reference (a minimal sketch of the spaces follows the list).
- Obs: 12 (between ±1)
- Act: 5 (mean) (between ±1), + 5 (log_std)
- Short episodes (an expert agent would solve it in about 7 steps)
- Rather complex environment dynamics (space trajectory)
- Per-step rewards of roughly -3 to +5, and the last step gives either +10 (success) or -2 (impact). More or less, the minimum episode return is -13 and the maximum is 30.
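For reference, a minimal sketch of the spaces described above (the env class name is hypothetical and reset/step are omitted):

```python
import gymnasium as gym
import numpy as np

class SpaceTrajectoryEnvSketch(gym.Env):
    """Sketch of the observation / action spaces only (reset/step omitted)."""

    def __init__(self, config=None):
        # 12 observations, each normalized to [-1, 1]
        self.observation_space = gym.spaces.Box(-1.0, 1.0, shape=(12,), dtype=np.float32)
        # 5 continuous actions (the means), each in [-1, 1];
        # the 5 log_std outputs come from the policy network, not from the env
        self.action_space = gym.spaces.Box(-1.0, 1.0, shape=(5,), dtype=np.float32)
```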
I have tried different configurations for my custom net. The one I am using now is composed of 3 nets (a code sketch follows the list):
- Mean net: 2 hidden layers of 128 neurons (tanh) + 1 output layer of 5 neurons (tanh)
- Log_std net: 2 hidden layers of 128 neurons (tanh) + 1 output layer of 5 neurons (tanh)
- Vf net: 2 hidden layers of 256 neurons (tanh) + 1 output layer of 1 neuron (linear)

I also tried a version with a shared mean / log_std structure, and different numbers of neurons / layers (for example 256 for mean / log_std and 1024 for the vf, with 3 hidden layers).
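In PyTorch terms, the separate-nets variant above looks roughly like this (a sketch; the mlp helper and variable names are mine, not RLlib API):

```python
import torch.nn as nn

def mlp(in_dim, hidden, out_dim, n_hidden, out_act=None):
    """n_hidden tanh hidden layers of width `hidden`, then a linear output layer
    with an optional output activation."""
    layers, d = [], in_dim
    for _ in range(n_hidden):
        layers += [nn.Linear(d, hidden), nn.Tanh()]
        d = hidden
    layers.append(nn.Linear(d, out_dim))
    if out_act is not None:
        layers.append(out_act)
    return nn.Sequential(*layers)

obs_dim, act_dim = 12, 5
policy_net_mean = mlp(obs_dim, 128, act_dim, 2, nn.Tanh())     # mean head, tanh output
policy_net_log_std = mlp(obs_dim, 128, act_dim, 2, nn.Tanh())  # log_std head, tanh output
vf_net = mlp(obs_dim, 256, 1, 2)                               # value head, linear output
```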
To get a reasonable std at the beginning of training, I set the biases of the output layer of the log_std net to 10, so that the tanh returns almost 1. In the forward method, I compute low_bound + 0.5 * (up_bound - low_bound) * (self._policy_net_log_std(features) + 1), with low_bound = -5.0 and up_bound = 0.0. This means that, at the beginning of training, the std is almost 1 and should decrease as training proceeds. I also tried different low / up bounds, and different biases, so as not to start the training pinned at either limit. A sketch of this is shown below.
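Concretely, the bias initialization and the bounding look like this (a sketch, reusing the nets defined above):

```python
import torch

LOW_BOUND, UP_BOUND = -5.0, 0.0

# Set the biases of the log_std net's output layer to 10, so tanh(.) is almost 1
# at the start of training and the initial log_std sits near UP_BOUND = 0 (std ~ 1).
with torch.no_grad():
    policy_net_log_std[-2].bias.fill_(10.0)  # [-2] is the last nn.Linear before the final Tanh

def bounded_log_std(features):
    # Map the tanh output from [-1, 1] into [LOW_BOUND, UP_BOUND]
    squashed = policy_net_log_std(features)  # in [-1, 1]
    return LOW_BOUND + 0.5 * (UP_BOUND - LOW_BOUND) * (squashed + 1.0)
```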
As for hyperparameters, here too I have tested different configurations (a config sketch follows the list):
- gamma: 0.99
- lr: 3e-3, 1e-3, 3e-4, 1e-4, and so on
- lambda_: 0.97, 0.95, 0.90
- entropy_coeff: 3e-3, 3e-4, 3e-5, 3e-6
- clip_param: 3e-1, 2e-1, 1e-1, 3e-2, 2e-2, 1e-2
- vf_loss_coeff = 1
- vf_clip_param = 10, None
- train_batch_size_per_learner = 1024, 2048
- minibatch_size = 64, 128, 256, 512
- num_epochs = 1, 3, 5, 10
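For reference, a sketch of one of the tested combinations on the new API stack (the env name is a placeholder, and the custom RLModule spec for the nets above is omitted):

```python
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("SpaceTrajectoryEnv")  # placeholder name
    .training(
        gamma=0.99,
        lr=3e-4,
        lambda_=0.95,
        entropy_coeff=3e-4,
        clip_param=3e-1,
        vf_loss_coeff=1.0,
        vf_clip_param=10.0,
        train_batch_size_per_learner=2048,
        minibatch_size=128,
        num_epochs=5,
    )
)
algo = config.build()
```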
Here comes the problem. For some configurations, everything seems to work up to a certain point of the training. In particular, a high initial std and a low lr seem to be the most promising combination. However, all of a sudden instabilities appear in curr_kl_coeff and mean_kl_loss. Sometimes the policy_loss explodes as well.
Only one configuration “worked” (2 hidden layers of 256 neurons for mean and log_std (separated), with log_std biases set to 10, and 2 hidden layers of 1024 neurons for the vf) with lr = 3e-4, lambda_ = 0.95, entropy_coeff = 3e-4, clip_param = 3e-1, vf_clip_param = None, train_batch_size_per_learner = 2048, minibatch_size = 128, num_epochs = 5. However, even though it learned properly, the entropy remained constant and did not decrease, which makes me think the std did not decrease either. So training seemed successful, but the policy remained entirely stochastic.
In all unsuccessful cases, at some point during training the gradients seem to explode:
WARNING torch_learner.py:260 – Skipping this update. If updates with nan/inf gradients should not be skipped entirely and instead nan/inf gradients set to zero set torch_skip_nan_gradients to False.
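In case it is relevant: gradient clipping was not part of the sweeps above; for reference, this is where it would be set (the clip value below is just a placeholder), and the torch_skip_nan_gradients flag mentioned in the warning is a separate AlgorithmConfig setting:

```python
# Placeholder value, not something I have tested yet.
config = config.training(
    grad_clip=40.0,              # clip gradients ...
    grad_clip_by="global_norm",  # ... by global L2 norm ("value" and "norm" also exist)
)
```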
Has anyone faced similar problems / found a solution to this?
Thank you