Policy returning NaN weights and NaN biases. In addition, the policy observation space is different from what I expected

How severe does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Introduction to the problem:

  • I am training a neural network to control the joints of a robot using the PPO reinforcement learning algorithm. However, the network has been producing actions that result in invalid joint states with NaN values, eventually causing all environments to return a reward of 0.0. This starts happening after only a few training iterations (<200). Some testing showed that if I repeatedly sent high-value actions to the robot's joints, the environment engine would start returning NaN values for the joint states, so I immediately assumed that was the problem and tried to shrink the action space. I then let a training experiment run for a long time to see whether the network would learn not to return high action values. After 30 million timesteps the problem persisted, so I restored the most recent checkpoint and investigated the policy.

Problems:

  • The first problem I saw with the restored policy was that the observation_space was different from what I expected. policy.observation_space returned Box(-1.0, 1.0, (1200,), float32). I expected a box of size (1200,), but its range should have been (-1000.0, 1000.0), since the observation_space of the training environment is Dict(body_state: Box(-1000.0, 1000.0, (240,), float64), task_state: Box(-1000.0, 1000.0, (960,), float64)). I also want to note the difference in the data types of the two observation spaces.

  • The second problem I saw with the restored policy was that the majority of the weights and biases of the layers were NaN values. The output of policy.get_weights() shows rows of 'nan' values for every layer of the action-computing model. I also want to note that I am using a custom model as the reinforcement learning function approximator, and that model does NOT have NaN values.

    • Sample output from policy.get_weights():
{ 
'compute_action._model.0._model.0.weight':  array([[nan, nan, nan, ..., nan, nan, nan], ..., dtype=float32),
'compute_action._model.0._model.0.bias': array([ nan, nan, ... -0.11605847, ..., nan, nan ], dtype=float32),
...
}
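A minimal sketch of this kind of NaN check (not my exact code; policy here is the restored policy object shown further below):

import numpy as np

# Scan every layer of the restored policy for NaN entries.
for name, weights in policy.get_weights().items():
    num_nan = int(np.isnan(weights).sum())
    if num_nan > 0:
        print(f"{name}: {num_nan}/{weights.size} values are NaN")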

I guess I'll have to break my experiment down into a small script I can test with. I am posting this beforehand in case anyone has helpful tips or info about this problem.

I changed my lambda value from 0.05 to 0.90 and my clip_param from 0.2 to 0.1 and, surprisingly, the NaNs haven't come back in 18 million timesteps (>3600 iterations). I don't know yet about the observation space of the restored policy, but I'll be sure to check it when training finishes.

Concerning the bad action space: this looks very much like you have trained your policy to output normalized actions. You need to unsquash them to fit your action space. Your Algorithm object normally does that for you, for example when you use compute_single_action().

Also, don't forget to clip if needed:

from ray.rllib.utils.spaces import space_utils

action = space_utils.unsquash_action(action, policy.action_space_struct)
action = space_utils.clip_action(action, policy.action_space_struct)

Furthermore, your policy might have been trained on preprocessed observations, so you'll need a preprocessor if you want to interact with it directly. I also recommend you interact with the Algorithm object if possible.
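For example (a sketch, assuming algorithm is your restored PPO Algorithm object and raw_obs is an unprocessed Dict observation), the Algorithm-level call handles preprocessing as well as unsquashing/clipping internally:

# Sketch: let the Algorithm preprocess the raw Dict observation and
# unsquash/clip the resulting action for you.
action = algorithm.compute_single_action(raw_obs)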

As for the NaNs: please file a GitHub issue with a small reproduction script so we can get to the bottom of it :slightly_smiling_face:

Thanks again arturn!

When restoring and using the policy, I essentially do the following (spread across different functions, but these are the essential calls):

# RESTORE ALGORITHM FROM A CHECKPOINT
ray.tune.registry.register_env(...)
algorithm = PPO(...)
algorithm.restore(...)
self.policy = algorithm.get_policy()

# INITIALIZE OBSERVATION PREPROCESSOR
self.compute_observation_space()
self.observation_processor = rllib.models.catalog.ModelCatalog.get_preprocessor_for_space(
    self.observation_space
)

# OBSERVE, PREPROCESS OBSERVATIONS, AND COMPUTE A SINGLE ACTION
self.observe()
observations = self.observation_processor.transform(self.observations)
action = self.policy.compute_single_action(observations, clip_actions=True)[0]
  1. So, self.policy.observation_space = Box(-1.0, 1.0, (1200,), float32) is OK [given that the observation_space of the training environment is Dict(body_state: Box(-1000.0, 1000.0, (240,), float64), task_state: Box(-1000.0, 1000.0, (960,), float64))]?
  2. a) Does my use of action = self.policy.compute_single_action(observations, clip_actions=True)[0] correctly unsquash my "action" space?
    b) I am confused about why this would affect my observation space being Box(-1.0, 1.0, (1200,), float32).
  3. I'm going to let my model train for a good while longer and then I'll circle back and see if I can create a script that causes consistent NaNs. :+1:
  1. Yes, that's OK. This gap is closed by a preprocessor.
  2. I misread your initial post and thought you were referring to an action space instead of an observation space. But the same holds true there: inputs and outputs are usually both normalized.
  3. Thanks!

@arturn this might be related to the issue here (not because of the algo, but there might be NaNs in the model and I had a similar issue in my PPO)

@MrDracoG From what you describe, the NaNs in your weights might stem from very high losses/gradients. Did you observe any spikes in your losses?
The fact that you were able to mitigate the problem by decreasing lambda and lowering the clipping in the loss might point to very high advantages. Are you also training with very long episodes (and possibly "complete_episodes" as the batch_mode hyperparameter)?

In regard to the squashed observation space: do you normalize your observations with RLlib's MeanStdFilter?
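For reference, that filter is just a config setting (a sketch, not your config):

# Observation normalization in RLlib is enabled via the
# "observation_filter" config key; the default is "NoFilter".
config = {
    "observation_filter": "MeanStdFilter",
}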


A. I'm unsure about spikes in the losses. I'll be sure to look into it when I get around to creating a testing script.

B. My episodes are not very long. I have been training with a small horizon of 30 steps.

C. I do not normalize my observations with RLlib's MeanStdFilter. I do, however, append an AppendBiasLayer at the end of my model, if that means anything. I mention this because I am not sure exactly what that layer does, but my model is based on a model that used it.

@MrDracoG, thank you for the info.

If you ran the experiments, you should have some tfevents files for TensorBoard, where you can then investigate the losses. High losses could be one reason for model parameters turning NaN.

I doubt that the AppendBiasLayer is the reason for the NaNs here, as it only enables free-floating bias terms for the standard deviation of the action distribution. For the weights in there to turn NaN, it needs NaN or really large gradients, and those come from somewhere else.

Since your episodes are not very long (30 steps is really short): are any very large rewards possible? And what are your settings for horizon, no_done_at_end, etc.?
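For reference, these are the config keys I mean (a sketch with placeholder values, not a recommendation):

# Episode-termination-related settings in the (dict-style) config.
config = {
    "horizon": 30,            # artificial per-episode step limit
    "soft_horizon": False,    # if True, the env is not reset when the horizon is hit
    "no_done_at_end": False,  # if True, done is not set at the artificial episode end
}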

Okay, sorry for the delay. I finished my other task and am now back on this problem.

a) I set my horizon to 30, which is why my episodes are 30 steps or less. The first 30 steps are the "most important" actions. I don't set the no_done_at_end or soft_horizon arguments. Maybe I should; I'm not entirely sure what they would do, though. It seems like they would give me control over whether or not an environment gets reset. I am working off the assumption that the environment is getting reset, though I could be wrong.

b) The following warning is occasionally output to the console. I think it usually occurs before the NaN values start appearing, but I am not 100% sure whether it only occurs just before the NaNs.

(PPO pid=406) /usr/local/lib/python3.7/dist-packages/ray/rllib/utils/metrics/learner_info.py:110: RuntimeWarning: Mean of empty slice
(PPO pid=406)   return np.nanmean(tower_data) 

c) I edited the reset and step methods to error out immediately when NaN values are introduced (a minimal sketch of this guard is shown below). I really wanted to see whether I was passing NaN values in as observations or whether the NaN values came from the model. The program exited at the beginning of the step method, right after the check "if ( numpy.isnan( action ).any() ):", and in fact the action variable was a nested tuple of arrays in which every element is NaN: ((array([nan, nan, nan]), array([nan, nan, nan]), ..., array([nan, nan, nan])), (array([nan, nan, nan]), ..., array([nan, nan, nan]))). This leads me to believe that the NaNs are being introduced by the model.
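A minimal sketch of that guard (not my exact code; it assumes the nested tuple-of-arrays action structure shown above):

import numpy

def step(self, action):
    # Fail fast if the policy hands the env a NaN action; numpy.asarray
    # broadcasts over the nested tuple of equally shaped arrays.
    if numpy.isnan(numpy.asarray(action, dtype=numpy.float32)).any():
        raise ValueError(f"NaN action received from the policy: {action}")
    # ... normal step logic continues here ...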

d) I know how to make this problem occur sooner in my program. With the hyperparameter configuration { lambda: 0.95, clip_param: 0.1, lr: 0.0001 }, it takes 50+ million timesteps (7500+ iterations) for training to start producing NaNs. With { lambda: 0.05, clip_param: 0.02, lr: 0.0005 }, it takes only about 30 thousand timesteps (5+ iterations). Both are written out as config overrides below.
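Written out as plain PPO config overrides (everything else unchanged):

# NaNs only appear after 50+ million timesteps with these values:
slow_to_nan = {"lambda": 0.95, "clip_param": 0.1, "lr": 0.0001}

# NaNs appear after only ~30 thousand timesteps with these values:
fast_to_nan = {"lambda": 0.05, "clip_param": 0.02, "lr": 0.0005}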

e) Here are some stats for loss and entropy of the 6-iteration run:

f) The prior run I was talking about had 7 workers (plus the driver), each with access to a portion of the GPU. I just ran another test with 0 workers (just the driver) and no GPU. Training now gave me: ValueError: Could not find key 'grad_gnorm' in some 'input_trees'. Please ensure the structure of all 'input_trees' are compatible with 'shallow_tree'. The last valid path yielded was ('learner_stats', 'entropy_coeff'). I don't know if it's related, but I thought I should add it here while I look into it more.

g) I am using ray[rllib]==2.0.1.

h) I was printing the results passed to LearnerInfoBuilder.add_learn_on_batch_results, and here is a photo showing the switch from working values to NaN values appearing in the results:

i) I get "WARNING env.py:143 -- Your env doesn't have a .spec.max_episode_steps attribute. This is fine if you have set 'horizon' in your config dictionary, or soft_horizon. However, if you haven't, 'horizon' will default to infinity, and your environment will not be reset." I also read somewhere that "If you set the horizon parameter but not max_episode_steps in Ray RLlib, the agent will be able to take a maximum of horizon number of steps within a single rollout, but there will be no limit on the total number of actions the agent can take before resetting the environment." Is this true? If it is, I could see how my environment could become malformed. Would I set max_episode_steps by adding a spec attribute to my environment with the key/value pair max_episode_steps, like so: env.spec = {}; env.spec["max_episode_steps"] = <number>? (I can confirm this doesn't work, because there is no spec.id.) One possible env-level alternative is sketched below.
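One possible env-level alternative (a sketch; MyRobotEnv and my_package are placeholders for my actual env class and module) is gym's TimeLimit wrapper, which returns done=True after max_episode_steps regardless of RLlib's horizon setting:

import gym
from ray.tune.registry import register_env

from my_package import MyRobotEnv  # placeholder import for the actual env class

def env_creator(env_config):
    # TimeLimit ends each episode with done=True after 30 steps.
    return gym.wrappers.TimeLimit(MyRobotEnv(env_config), max_episode_steps=30)

register_env("my_robot_env", env_creator)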

I looked into it more, and it seems the NaNs are introduced via the gradients. The gradients become NaN during the backward call in ray.rllib.policy.torch_policy_v2._multi_gpu_parallel_grad_calc, more specifically in the loss_out[opt_idx].backward(retain_graph=True) call, which ultimately calls Variable._execution_engine.run_backward (which calls out to the torch C extension). The NaN values are then propagated to the model parameters in the torch.optim.Adam.step method.
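For anyone else chasing this, a rough sketch of two generic ways to localize NaN gradients in torch (this is not the exact debugging I did, which was mostly stepping through the RLlib code above):

import torch

# (1) Anomaly detection makes backward() raise as soon as an op produces NaN.
torch.autograd.set_detect_anomaly(True)

# (2) Alternatively, scan the gradients right after loss.backward():
def nan_grad_params(model):
    return [name for name, p in model.named_parameters()
            if p.grad is not None and torch.isnan(p.grad).any()]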

I think this means I have runaway gradients. I am not entirely sure, but from everything I have seen it would make sense. I am going to look more into how to solve the runaway-gradient problem, maybe with gradient clipping, smaller learning rates, etc. (sketched below), and see if I can prevent the NaNs from coming back.
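The knobs I plan to try first, written as plain config overrides (illustrative values, not tuned):

# Possible mitigations for exploding gradients.
config = {
    "grad_clip": 0.5,  # clip the global norm of the gradients
    "lr": 1e-5,        # smaller learning rate
}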