Next action in RLlib VisionNetworks

Hi all, I have a simple question. I'm training the Pong-v0 Gym env with a PPO trainer, and while analyzing the underlying Keras neural network (a CNN, visionnet.py), I noticed it has two outputs, one of size 6 and another of size 1. I've seen that these are the policy and value network outputs. I'd like to know which of these outputs determines the next action to take. For example, when you call agent.compute_action() you provide an env observation and get back the next action, so how is that action calculated?
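For reference, here is a minimal sketch of the setup I mean (assuming the TF framework and the default VisionNetwork, whose underlying Keras model is exposed as `base_model`; exact API may differ by Ray version):

```python
import gym
import ray
from ray.rllib.agents import ppo

ray.init()
trainer = ppo.PPOTrainer(env="Pong-v0", config={"framework": "tf"})

# The default VisionNetwork wraps a Keras model in `base_model`.
# Its summary shows the two heads: logits (size 6) and value (size 1).
model = trainer.get_policy().model
model.base_model.summary()

# compute_action() takes a single observation and returns one action.
obs = gym.make("Pong-v0").reset()
action = trainer.compute_action(obs)
print(action)  # an int in [0, 5]
```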

Thanks in advance!


Hey @javigm98 , the 6-sized output in your case determines the next action to take. Your action space is a gym.spaces.Discrete(6), so actions are int values between 0 and 5 (inclusive). The 6 outputs of your network are interpreted as the "logits" of a categorical distribution. The logits are softmaxed to yield 6 probability values (all >= 0.0 and summing to 1.0); these are your action probabilities. RLlib then samples from these 6 probabilities to get the one action to apply to the env.
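A minimal numpy sketch of that logits → softmax → sample step (illustrative only, not RLlib's actual code path):

```python
import numpy as np

def sample_action(logits, rng=np.random.default_rng()):
    """Turn the 6 raw logits into one sampled action in [0, 5]."""
    # Softmax: exponentiate (shifted for numerical stability) and normalize.
    exp = np.exp(logits - np.max(logits))
    probs = exp / exp.sum()          # all >= 0.0, sums to 1.0
    # Sample one action index according to these probabilities.
    return int(rng.choice(len(probs), p=probs))

# Example: logits straight from the 6-sized policy head.
print(sample_action(np.array([0.1, 2.0, -1.3, 0.5, 0.0, 0.7])))
```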

On the value function output: the 1-sized output of your network is used only in PPO's loss function (not for action computation) to learn a separate value predictor given an observation. These predicted values are subtracted from the returns (the discounted sum of rewards) to reduce the variance of the loss term.
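Schematically (a toy numpy sketch of returns minus value predictions, not PPO's full GAE/loss implementation):

```python
import numpy as np

gamma = 0.99
rewards = np.array([0.0, 0.0, 1.0])      # rewards collected in a short rollout
value_preds = np.array([0.4, 0.6, 0.9])  # outputs of the 1-sized value head

# Discounted returns, computed backwards through the rollout.
returns = np.zeros_like(rewards)
running = 0.0
for t in reversed(range(len(rewards))):
    running = rewards[t] + gamma * running
    returns[t] = running

# Advantages: returns minus the value predictions (variance reduction).
advantages = returns - value_preds
print(returns, advantages)
```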


Thank you so much @sven1977 for your answer. Do you know in which file this logits softmax is implemented? I mean, at which point does Ray transform this 6-value output into the next action to take?


This happens in the different Policies (ray/rllib/policy/…) in compute_actions / compute_actions_from_input_dict. By default, we call the model, take the model's output (the logits), create a distribution object from it, and then sample from that distribution (inside the policy's exploration component's get_exploration_action method, iff it's a StochasticSampling-type exploration).
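In pseudo-Python, that flow looks roughly like this (a simplified sketch, not the actual RLlib source):

```python
# Simplified sketch of what a Policy does per observation batch
# (the real code lives in ray/rllib/policy/ and the exploration classes).
logits, state = policy.model(input_dict)                # forward pass -> the 6 logits
action_dist = policy.dist_class(logits, policy.model)   # e.g. a Categorical distribution
# With StochasticSampling exploration, get_exploration_action()
# ends up sampling from that distribution:
actions = action_dist.sample()
```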


Okay, I think I have the concepts clear now. To put everything in context: I'm trying to reproduce the rollout script the same way RLlib does, but using TFLite models (converted from RLlib TF models trained with PPO on the Pong-v0 env; I take the underlying Keras model). What I do is run inference with the model, and my idea was to take the output with the highest value after applying the softmax function as the next action. But I see that in RLlib actions are sampled from the distribution you mentioned, so I'd like to know what the underlying idea is, and if I only want to run inference, how I should select the next action at each step.

Also, I'd like to know whether, when running inference with the rollout.py script, previous actions or rewards are taken into account when computing the next action, and if so, how this is done.
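Here is a minimal sketch of the two options I'm weighing (greedy argmax vs. sampling from the softmaxed logits); the model path, observation preprocessing, and tensor layout are just placeholders:

```python
import numpy as np
import tensorflow as tf

# Placeholder path to the converted model.
interpreter = tf.lite.Interpreter(model_path="ppo_pong_policy.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]   # assumed: the 6 policy logits

def next_action(obs, greedy=False, rng=np.random.default_rng()):
    # NOTE: obs must already be preprocessed the same way as during training
    # (e.g. resized / stacked Atari frames).
    interpreter.set_tensor(inp["index"], obs.astype(np.float32)[None, ...])
    interpreter.invoke()
    logits = interpreter.get_tensor(out["index"])[0]
    if greedy:
        return int(np.argmax(logits))            # deterministic: most likely action
    exp = np.exp(logits - logits.max())
    probs = exp / exp.sum()
    return int(rng.choice(len(probs), p=probs))  # stochastic, like RLlib's sampling
```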

Thank you in advance!!