How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Introduction to the problem:
- I am training a neural network to control the joints of a robot using the PPO reinforcement learning algorithm. However, the network has been producing actions that result in invalid joint states with NaN values, which eventually causes every environment to return a reward of 0.0. This starts happening after only a few training iterations (<200). I did some testing and saw that if I fed repeated high-value actions to the robot's joints, the environment engine would start returning NaN values for the joint states, so I initially assumed that was the problem and tried making the action space smaller. I then let a training experiment run for a good amount of time in the hope that the network would learn not to return high action values. After 30 million timesteps the problem was still persisting, so I restored the most recent checkpoint and investigated the policy.
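For reference, this is roughly how I restore the checkpoint and pull out the policy for the inspection below (a simplified sketch: the checkpoint path is a placeholder, and `Algorithm.from_checkpoint` assumes a recent Ray 2.x API; older versions restore through an Algorithm/Trainer instance with `.restore()` instead):

```python
from ray.rllib.algorithms.algorithm import Algorithm

# Restore the trained PPO algorithm from the most recent checkpoint.
# (The path is a placeholder for my actual checkpoint directory.)
algo = Algorithm.from_checkpoint("/tmp/ray_results/PPO_robot/checkpoint_000200")

# Grab the default policy that the two problems below refer to.
policy = algo.get_policy()
print(policy.observation_space)  # this is where I saw Box(-1.0, 1.0, (1200,), float32)
```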
Problems:

- The first problem I saw with the restored policy was that the `observation_space` was different than expected. `policy.observation_space` returned `Box(-1.0, 1.0, (1200,), float32)`. I expected a box of size `(1200,)`, but the range of the box should have been `(-1000.0, 1000.0)`, since the `observation_space` of the training environment is `Dict(body_state: Box(-1000.0, 1000.0, (240,), float64), task_state: Box(-1000.0, 1000.0, (960,), float64))`. I also want to note the difference in data types (`float32` vs. `float64`) between the two observation spaces. (See the first sketch below this list.)
- The second problem I saw with the restored policy was that the majority of the weights and biases of the layers were NaN values. The results from `policy.get_weights()` show rows of "nan" values for each layer of the action-computing model. I also want to note that I am using a model as the reinforcement learning function approximator, and that model does NOT have NaN values. (See the second sketch below this list.)
  - Sample output from `policy.get_weights()`:
    ```
    {
        'compute_action._model.0._model.0.weight': array([[nan, nan, nan, ..., nan, nan, nan], ..., dtype=float32),
        'compute_action._model.0._model.0.bias': array([ nan, nan, ... -0.11605847, ..., nan, nan ], dtype=float32),
        ...
    }
    ```
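Regarding the first problem, here is a minimal sketch of what I expected the flattened observation space to look like (assuming gymnasium-style spaces; the key names are copied from my env description above, and the flattening helper is plain gymnasium, not anything RLlib-specific):

```python
import numpy as np
from gymnasium import spaces

# Dict observation space matching the training environment described above.
env_obs_space = spaces.Dict({
    "body_state": spaces.Box(-1000.0, 1000.0, (240,), np.float64),
    "task_state": spaces.Box(-1000.0, 1000.0, (960,), np.float64),
})

# Flattening the Dict concatenates the sub-spaces: 240 + 960 = 1200 entries,
# but the bounds stay at (-1000, 1000) and the dtype stays float64.
flat = spaces.flatten_space(env_obs_space)
print(flat)  # Box(-1000.0, 1000.0, (1200,), float64)

# The restored policy instead reports Box(-1.0, 1.0, (1200,), float32),
# i.e. different bounds and a different dtype.
```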
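And for the second problem, this is roughly how I am checking how much of each layer is NaN (it assumes `policy` is the restored policy object from the first snippet above):

```python
import numpy as np

# Report the fraction of NaN entries per layer of the restored policy.
weights = policy.get_weights()
for layer_name, values in weights.items():
    nan_fraction = float(np.isnan(values).mean())
    print(f"{layer_name}: shape={values.shape}, NaN fraction={nan_fraction:.1%}")
```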
I guess I'll have to break my experiment down some so I can get a script to test with. I am posting this beforehand in case anyone has any helpful tips or info about this problem.