Hi, I have a question about how to evaluate previously trained models in RLlib. I have trained a model for the Pong-v0 environment with a PPO agent. When I run the rollout.py script with exploration disabled (config['explore'] = False) I get good reward results for the inference episodes.

What I am trying to do now is replicate this process outside RLlib, so I exported my agent's model as a Keras h5 model and I try to run inference on it the same way RLlib does in rollout.py. To do so, I create an environment with RLlib's wrap_deepmind() function (from ray/atari_wrappers.py at master · ray-project/ray · GitHub). Once I have the environment, I load the Keras model and start making predictions from environment observations (with the predict function). The model (a visionnet) produces two outputs (the policy and value outputs). I take the index with the highest value from the policy output as the next action, and with it I step the environment, collecting rewards (sketched below). But when iterating this process until the environment is done, the total cumulative reward is always -21.0, while when running inference with rollout.py I got better values (between -3.0 and 11.0). So I want to know how the rollout.py script really runs inference, i.e. whether it is as simple as taking an env observation, running the model on it, taking the index of the highest policy output as the next action, stepping the environment with that action, and iterating.
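Roughly, my manual inference loop looks like this (the h5 file name is a placeholder from my own script, and the wrap_deepmind import path may differ depending on the Ray version):

```python
import numpy as np
import gym
from tensorflow import keras
from ray.rllib.env.atari_wrappers import wrap_deepmind  # path in my Ray version

# Same Atari preprocessing RLlib applies (84x84 grayscale, 4-frame stack).
env = wrap_deepmind(gym.make("Pong-v0"), dim=84, framestack=True)

model = keras.models.load_model("pong_ppo_model.h5")  # the exported h5 model

obs = env.reset()
done = False
total_reward = 0.0
while not done:
    # The visionnet gives two outputs: [policy_logits, value]; add a batch dim of 1.
    policy_out, value_out = model.predict(np.expand_dims(obs, 0))
    action = int(np.argmax(policy_out))  # greedy: index of the highest policy output
    obs, reward, done, _ = env.step(action)
    total_reward += reward

print("episode reward:", total_reward)  # always comes out as -21.0
```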
When analysing the rollout.py source code I realised that predictions are made by calling the agent's compute_action() function, which directly returns the next action to take. This function calls compute_single_action() in ray/rllib/policy/policy.py, which in turn calls compute_actions() in ray/rllib/policy/tf_policy.py, which runs the TF session to get the data. I'm not very familiar with TF, so I haven't gone deeper into the code. What I did was get this compute_actions() return value and analyse it. It is a three-element tuple: the first element is the action to take (later returned by the agent's compute_action() function) and the third element is an info dict containing relevant information:
(2, [], {'action_prob': 1.0, 'action_logp': 0.0, 'action_dist_inputs': array([15.081385, 12.628634, 16.526398, 12.699818, 8.133007, 8.79564 ],
dtype=float32), 'vf_preds': -0.5892772})
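For reference, this is roughly how I pulled that tuple out (assuming the full_fetch argument is available in my Ray version; otherwise I just printed the return value inside tf_policy.py):

```python
# "agent" is the restored PPOTrainer and "obs" a raw observation from the wrapped env.
# With full_fetch=True, compute_action() returns (action, rnn_state, extra_info)
# instead of only the action.
action, state, extra = agent.compute_action(obs, full_fetch=True)
print(action)                       # e.g. 2
print(extra["action_dist_inputs"])  # policy (logits) output
print(extra["vf_preds"])            # value function output
```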
I understand that the values associated with the action_dist_inputs and vf_preds keys are the network's policy and value outputs, respectively. So I tried to check whether these values are the same as the ones returned by the predict function of the loaded h5 model. I took the same env observation (I saved it to disk) and first called the Keras model's predict on it, getting these values:
[array([[[[-0.07937482, -6.3850718 , -9.9914665 , -6.937888 ,
6.5464735 , 11.156392 ]]]], dtype=float32), array([[11.12601]], dtype=float32)]
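On the Keras side this check is just the following (the file names are placeholders from my test script):

```python
import numpy as np
from tensorflow import keras

model = keras.models.load_model("pong_ppo_model.h5")  # the exported h5 model
obs = np.load("saved_obs.npy")                        # the saved env observation

policy_out, value_out = model.predict(np.expand_dims(obs, 0))
print(policy_out)  # -> the first array shown above
print(value_out)   # -> the second array shown above
```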
Then I called agent.compute_action(), passing only the observation as an argument, and the output of the policy's compute_actions() was:
(2, [], {'action_prob': 1.0, 'action_logp': 0.0, 'action_dist_inputs': array([15.081385, 12.628634, 16.526398, 12.699818, 8.133007, 8.79564 ],
dtype=float32), 'vf_preds': -0.5892772})
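And this is how I restore the agent and query it with the same saved observation (the checkpoint path is a placeholder, and the config mirrors the one I trained with, with exploration disabled):

```python
import numpy as np
import ray
from ray.rllib.agents.ppo import PPOTrainer

ray.init()
config = {"env": "Pong-v0", "explore": False}  # rest of the config as in training
agent = PPOTrainer(config=config)
agent.restore("checkpoints/checkpoint-250/checkpoint-250")  # placeholder path

obs = np.load("saved_obs.npy")      # same saved observation as above
action = agent.compute_action(obs)  # internally runs the policy's compute_actions()
print(action)                       # -> 2
```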
So I want to know how it is possible that these values are so different, and whether this is because RLlib computes them in a more complex way, which I'd like to understand if possible. I can provide the Python scripts where I tested the code if needed.
Thanks in advance!