Question related to inference in RLlib

Hi, I have a question about how to evaluate previously trained models in RLlib. I have trained a model for the Pong-v0 environment with a PPO agent. When I run the rollout.py script with exploration disabled (config['explore'] = False) I get good reward results for the inference episodes. What I am trying to do now is to replicate this process outside RLlib, so I exported my agent's model as a Keras h5 model.

Now I try to run inference with this model the same way RLlib does in rollout.py. To do so, I create an environment with the RLlib function wrap_deepmind() (from ray/atari_wrappers.py at master · ray-project/ray · GitHub). Once I have the environment, I load the Keras model and start making predictions from environment observations (with the predict function). The model (a visionnet) produces two outputs (the policy and the value output). I take the index with the highest value from the policy output as the next action, step the environment with it and collect rewards (see the sketch at the end of this post). But when iterating this process until the environment is done, the total cumulative reward is always -21.0, whereas with rollout.py I got better values (between -3.0 and 11.0).

So I want to know how the rollout.py script really runs inference, i.e. whether it really is as simple as taking an env observation, running the model on it, taking the index of the highest policy output as the next action, stepping the environment with that action and iterating. When analysing the rollout.py source code I realised that predictions are made by calling the agent's compute_action() function, which directly returns the next action to take. This function calls compute_single_action() in ray.rllib.policy.policy, which in turn calls compute_actions() in ray.rllib.policy.tf_policy, which runs a TF session to get the data. I'm not very familiar with TF, so I haven't gone deeper into the code. What I did was to take the return value of compute_actions() and analyse it. It is a three-element tuple: the first element is the action to take (later returned by the agent's compute_action() function) and the third element is an info dict containing relevant information:

(2, [], {'action_prob': 1.0, 'action_logp': 0.0, 'action_dist_inputs': array([15.081385, 12.628634, 16.526398, 12.699818,  8.133007,  8.79564 ],
      dtype=float32), 'vf_preds': -0.5892772})

I understand that the values associated with the action_dist_inputs and vf_preds keys are the network's policy and value outputs, respectively. So I tried to check whether these values were the same as the ones returned by the predict function of the loaded h5 model. Taking the same env observation (which I had saved), I first called the Keras model's predict and got these values:

[array([[[[-0.07937482, -6.3850718 , -9.9914665 , -6.937888  ,
           6.5464735 , 11.156392  ]]]], dtype=float32), array([[11.12601]], dtype=float32)]

Then I called agent.compute_action(), passing only the observation as argument, and the output of the policy's compute_actions() function was:

(2, [], {'action_prob': 1.0, 'action_logp': 0.0, 'action_dist_inputs': array([15.081385, 12.628634, 16.526398, 12.699818,  8.133007,  8.79564 ],
      dtype=float32), 'vf_preds': -0.5892772})

So I want to know how it is possible that these values are so different, and whether this is because RLlib calculates them in a more complex way, which I'd like to understand if possible. I can provide the Python scripts where I tested the code if needed.
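
For reference, this is roughly the loop I am running against the exported h5 model (a minimal sketch; the file name model.h5 and the greedy argmax over the policy output are just my own choices here, not something taken from rollout.py):

import gym
import numpy as np
import tensorflow as tf
import ray.rllib.env.atari_wrappers as wrappers

# Load the exported Keras model and build the DeepMind-wrapped environment.
model = tf.keras.models.load_model("model.h5")
env = wrappers.wrap_deepmind(gym.make("Pong-v0"))

obs = env.reset()
done = False
total_reward = 0.0
while not done:
    # Add the batch dimension the model expects: (1, 84, 84, 4).
    policy_out, value_out = model.predict(obs[np.newaxis, ...])
    # Greedy action: index of the highest policy output.
    action = int(np.argmax(policy_out))
    obs, reward, done, _ = env.step(action)
    total_reward += reward
print("Total episode reward:", total_reward)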

Thanks in advance!

Hey @javigm98, the action_dist_inputs and vf_preds outputs of your model should be the same given the same input (env observation), whether you are using Trainer.compute_action, Policy.compute_actions, or calling the (h5-reloaded) model directly.
It's hard for me to tell what's causing the discrepancy from this distance. If you could provide a small, self-contained reproduction script, that would be great!
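
As a quick sanity check (just a sketch, assuming you have the restored trainer as agent, the reloaded h5 model as model, and a saved observation in image.npy), you could feed the exact same batched observation to both paths and compare the raw outputs:

import numpy as np

# Same saved observation for both paths; add the batch dimension once.
obs = np.load("image.npy")
obs_batch = obs[np.newaxis, ...]

# RLlib path: ask the policy directly for its extra fetches.
actions, _, extra = agent.get_policy().compute_actions(obs_batch, explore=False)

# Keras path: raw model outputs (policy logits and value estimate).
logits, value = model.predict(obs_batch)

print(np.allclose(extra["action_dist_inputs"], logits.reshape(1, -1)))
print(np.allclose(extra["vf_preds"], value.reshape(-1)))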

Hi @sven1977

Here is the process I followed:

In order to save the keras neural network I did the following:

import os
import sys

import ray
import ray.cloudpickle as cloudpickle
import ray.rllib.agents.ppo as ppo


ray.shutdown()
ray.init()
checkpoint_dir = sys.argv[1]   # Path to the checkpoint to restore.
export_name = sys.argv[2]      # Base name for the exported .h5 file.

# Load the training config (params.pkl) saved next to the checkpoint.
config_dir = os.path.dirname(checkpoint_dir)
config_path = os.path.join(config_dir, "params.pkl")
if not os.path.exists(config_path):
    config_path = os.path.join(config_dir, "../params.pkl")
with open(config_path, "rb") as f:
    config = cloudpickle.load(f)
print(config)
config['num_gpus'] = 0
config['num_gpus_per_worker'] = 0
config['explore'] = False

agent = ppo.PPOTrainer(config, env='Pong-v0')
agent.restore(checkpoint_dir)

# Save the policy's underlying Keras model as an .h5 file.
with agent.get_policy().get_session().graph.as_default():
    agent.get_policy().model.base_model.save(export_name + '.h5')

ray.shutdown()

Here I specified a checkpoint dir to restore, for example the one I have here:
Mejorando-el-Aprendizaje-Automatico/checkpoint-11999 at main · javigm98/Mejorando-el-Aprendizaje-Automatico · GitHub.

Later I saved an environment observation as follows:

import ray.rllib.env.atari_wrappers as wrappers
import gym
import numpy as np

env = wrappers.wrap_deepmind(gym.make('Pong-v0'))

obs = env.reset()

with open('image.npy', 'wb') as f:
    np.save(f, obs)

Then I ran inference both with the restored Keras model and with the PPO agent:

import os
import sys

import numpy as np
import tensorflow as tf
import ray
import ray.cloudpickle as cloudpickle
import ray.rllib.agents.ppo as ppo

checkpoint_dir = sys.argv[1]   # Same checkpoint as in the export script.

# Inference with the exported Keras model.
model = tf.keras.models.load_model("model.h5")

obs = np.load("image.npy")
obs2 = obs[np.newaxis, ...]  # Add a batch dimension: the model expects (1, 84, 84, 4) inputs.

pred_keras = model.predict(obs2)
print(pred_keras)

# Inference with the restored PPO agent.
ray.init()
config_dir = os.path.dirname(checkpoint_dir)
config_path = os.path.join(config_dir, "params.pkl")
if not os.path.exists(config_path):
    config_path = os.path.join(config_dir, "../params.pkl")
with open(config_path, "rb") as f:
    config = cloudpickle.load(f)

config['explore'] = False

agent = ppo.PPOTrainer(env='Pong-v0', config=config)
agent.restore(checkpoint_dir)

print(agent.compute_action(obs))

ray.shutdown()

I changed the source code to be able to see the return value of policy.compute_actions(), and that's where I found the values to be different.
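
Just as a minimal sketch of what I mean (assuming compute_action accepts a full_fetch argument in Ray 1.1.0, which would avoid editing the source):

# Return the action together with the extra model outputs in one call.
action, state_out, extra = agent.compute_action(obs, full_fetch=True)
print(extra["action_dist_inputs"], extra["vf_preds"])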

I'm using Ray 1.1.0 and TensorFlow 2.4.1.

I don’t know if you want me to provide more information…

Thanks in advance!

Maybe the mistake is in the way I export the h5 model. Is this the right way to do it?

Have you checked that the input to compute_actions looks the same as your input to the Keras model?
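
For example, something along these lines (a rough sketch; I'm assuming here that the local worker exposes its preprocessors dict and that the default policy id is "default_policy"):

import numpy as np

# Run the trainer's own preprocessor on the saved observation and compare it
# with what is fed to the Keras model.
prep = agent.workers.local_worker().preprocessors["default_policy"]
obs = np.load("image.npy")
transformed = prep.transform(obs)

print(obs.shape, transformed.shape)
if transformed.shape == obs.shape:
    print(np.allclose(obs, transformed))  # True -> no extra preprocessing applied.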

Yes, I think so, and I was also able to see that it was not preprocessed or changed when it was part of the graph's feed dict.