So I enabled the two options instead of implementing them myself:
{"log_std_clip_param": 1}
{"free_log_std": True}
It didn’t seem to slow down the training this time (as in how long it takes to process the data), but now my model just doesn’t learn anything.
Update: I commented out {"free_log_std": True} for now and set {"log_std_clip_param"} to 20 instead of 1; if I still get NaNs I can work down from there.
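For reference, one way to pass both options is through the model config dict of the PPO config, roughly like this (just a sketch; the env name is a placeholder and exactly where log_std_clip_param is read may differ between RLlib versions/stacks, so adjust to wherever your version expects model options):

from ray.rllib.algorithms.ppo import PPOConfig

# Sketch only: "MyEnv-v0" is a placeholder, and the exact home of
# log_std_clip_param may differ between RLlib versions/stacks.
config = (
    PPOConfig()
    .environment("MyEnv-v0")
    .training(
        model={
            # "free_log_std": True,  # commented out for now
            "log_std_clip_param": 20,
        }
    )
)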
Hey @Samuel_Fipps, that seems like a good strategy. Setting {"free_log_std": True} makes the log_stds a parameter of the model, and just like you said, I found that the agent normally doesn’t learn anything. As you go down in log_std_clip_param you’ll hopefully find a spot where the agent learns but no NaNs appear. What sort of action space are you working with? Is it normalized?
This is interesting to know. This is the way cleanrl and sb3 handle it, so I wonder what is different in their implementation. Do either of you have a setup you could share with me? I would like to experiment with a few things, but I don’t have a problem handy that exhibits this issue.
I have done some experimenting with hand-coded policies designed to produce the NaNs, and I can confidently say that a log_std value less than -25 will produce NaNs in the backward pass. I set log_std_clip_param to -12 and even that did not prevent NaNs.
One way I did manage to prevent the NaNs was to make the following modification, depending on whether you are using RLModules or not:
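Roughly, the change looks like this (I'm assuming the old-stack TorchDiagGaussian in torch_action_dist.py, and the equivalent spot where the RLModule stack builds its Normal distribution; adjust to wherever your version constructs it):

# Old model API: rllib/models/torch/torch_action_dist.py, TorchDiagGaussian.__init__
self.dist = torch.distributions.normal.Normal(mean, torch.exp(log_std) + 0.00001)

# RLModule stack: wherever the Normal is built from loc/scale
return torch.distributions.normal.Normal(loc, scale + 0.00001)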
Here the value 0.00001 is an epsilon to prevent the NaNs. It corresponds to a log_std of -11.512925464970229; I picked it because it only matters when the std drops below that value and has no appreciable effect on the distribution for larger stds.
I have verified that it does eliminate the NaNs in my cases, but I do not actually have a learning problem with this issue, so I do not know how it will affect learning. I would expect that it would not, but I cannot say for sure.
This seems promising. My model needs a lot of steps to learn, so I won’t know until Monday.
Also, something I noticed that I mentioned already: log_std_clip_param can slow down my progress by 200%. I’m guessing it’s having to do a lot of clipping during training? Not too sure. I gather 51200 steps before training, so that is quite a bit to loop through and clip. However, it still shouldn’t slow things down that much, unless it is not very efficient in the way it goes about it.
Like how numpy.append has to recreate the array every time you call it, so I am wondering if something similar is happening.
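For example, a toy illustration of that pattern (not RLlib code):

import numpy as np

# np.append copies the whole array on every call, so appending in a loop
# does O(n^2) work instead of O(n) -- something similar anywhere in the
# pipeline would add up over 51200 samples.
buf = np.empty(0)
for step in range(51200):
    buf = np.append(buf, step)  # allocates a brand-new array each iteration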
I am not sure why log_std_clip_param is slowing down training so much. The torch clamp operation itself is not very computationally intensive, so I would not expect that to be the cause.
I am kind of taking a shot in the dark here, but my guess would be that a lot of values are being clamped. The clamping operation is not expensive, but a clamped value has a zero gradient, so it kills the gradient. Any sample that is clamped will not be used to train any of the layers below it. That is my guess as to why training slows.
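A quick toy check of what I mean (not RLlib code):

import torch

# Elements that fall outside the clamp range get a zero gradient, so those
# samples stop contributing to the layers that produced them.
log_std = torch.tensor([-30.0, -5.0, 0.5], requires_grad=True)
clamped = torch.clamp(log_std, min=-12.0, max=2.0)
clamped.sum().backward()
print(log_std.grad)  # tensor([0., 1., 1.]) -- the clamped element contributes nothing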
I have run a few tests today with and without log_std_clip_param over 250k timesteps, and I am getting a wall-time difference of ~3% over 10 runs. This is a pretty simple test using CartPole-v1 and small network sizes, but it seems to indicate that something else might be going on with your code and/or setup. When you look at the progress.csv in your ray_results directory, do you see a big difference in time_this_iter_s with and without the parameter? Or do any of the other time metrics increase significantly?
Also, to clarify my point above @mannyv – I have found that using the log_stds as a parameter of the model doesn’t force my agent not to learn, but it does slow down learning. I ran tests using PyFlyt’s Dogfighting Environment where I pit two agents against each other. Using the same model, but with one treating the log_stds as a model parameter and the other clamped, I found that the clamped model outperforms the parameterized model. Now, this is purely anecdotal and not scientific at all, but these were my findings when testing a few months ago. Each model was flipped between agent 1 / agent 2 and run for 50M+ steps around 5 different times.
@Samuel_Fipps is it possible to share your action_space with us? Or how you are jump-starting with the 51K steps? That isn’t all that many timesteps in the grand scheme of things.
Hopefully you have found a fix that works for you currently!
The really big change when enabling free_log_std is that there is now one std for each dimension of the box action space. It is a parameter that can be learned, but it is no longer observation dependent. If the optimal variance of the action changes throughout an episode, that is not possible with this setting.
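Conceptually it looks something like this (a minimal sketch with made-up names, not the actual RLlib implementation):

import torch
import torch.nn as nn

# free_log_std idea: the log_stds become a learned parameter, one per action
# dimension, rather than a network output, so they cannot vary with the observation.
class FreeLogStdHead(nn.Module):
    def __init__(self, hidden_dim: int, action_dim: int):
        super().__init__()
        self.mean_head = nn.Linear(hidden_dim, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, features: torch.Tensor) -> torch.distributions.Normal:
        mean = self.mean_head(features)
        return torch.distributions.Normal(mean, self.log_std.exp().expand_as(mean))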
It occurred to me as I was thinking about this that, instead of using one linear layer to produce both the means and stds of the distribution, you could have two independent layers, one for each. Not sure what effect it would have; see the sketch below.
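Something along these lines (again just a sketch with made-up names, assuming the features come out of a shared trunk):

import torch
import torch.nn as nn

# Two independent heads: one linear layer for the means and a separate
# linear layer for the log_stds, instead of one layer producing both.
class TwoHeadGaussian(nn.Module):
    def __init__(self, hidden_dim: int, action_dim: int):
        super().__init__()
        self.mean_head = nn.Linear(hidden_dim, action_dim)
        self.log_std_head = nn.Linear(hidden_dim, action_dim)

    def forward(self, features: torch.Tensor) -> torch.distributions.Normal:
        mean = self.mean_head(features)
        log_std = self.log_std_head(features)
        return torch.distributions.Normal(mean, log_std.exp())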
I uploaded some initial results of using
return torch.distributions.normal.Normal(loc, scale + 0.00001)
The first picture is without it and it ended with a NaN crash.
The second picture is with the change, and it seems to still be learning well. I don’t think the unstable learning is coming from the addition, but I do plan on running more tests with it in place to see.
Never mind, it still broke; this time I wasn’t using shared layers.
File "\Python\Python311\site-packages\ray\rllib\policy\torch_policy_v2.py", line 1154, in _worker
self.loss(model, self.dist_class, sample_batch)
File "\Python\Python311\site-packages\ray\rllib\algorithms\ppo\ppo_torch_policy.py", line 85, in loss
curr_action_dist = dist_class(logits, model)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "\Python\Python311\site-packages\ray\rllib\models\torch\torch_action_dist.py", line 251, in __init__
self.dist = torch.distributions.normal.Normal(mean, torch.exp(log_std) + 0.00001)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "\Python\Python311\site-packages\torch\distributions\normal.py", line 57, in __init__
super().__init__(batch_shape, validate_args=validate_args)
File "\Python\Python311\site-packages\torch\distributions\distribution.py", line 70, in __init__
raise ValueError(
ValueError: Expected parameter loc (Tensor of shape (256, 3)) of distribution Normal(loc: torch.Size([256, 3]), scale: torch.Size([256, 3])) to satisfy the constraint Real(), but found invalid values:
tensor([[nan, nan, nan],
[nan, nan, nan],
[nan, nan, nan],
[nan, nan, nan],