Next action in RLlib VisionNetworks

Hi all, I have a simple question. I'm training the Pong-v0 Gym env with a PPO trainer, and while analyzing the underlying Keras neural network (a CNN, visionnet.py), I noticed it has two outputs, one of size 6 and another of size 1. I've seen that these are the policy and value network outputs. I'd like to know which of these outputs determines the next action to take. For example, when you call agent.compute_action() you provide an env observation and get back the next action, so how is that action calculated?
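For reference, here is a minimal sketch of the setup I mean (assuming the TF framework and the default VisionNetwork, whose underlying Keras model is exposed as `base_model`; exact API may differ by Ray version):

```python
import gym
import ray
from ray.rllib.agents import ppo

ray.init()
trainer = ppo.PPOTrainer(env="Pong-v0", config={"framework": "tf"})

# The default VisionNetwork wraps a Keras model in `base_model`.
# Its summary shows the two heads: logits (size 6) and value (size 1).
model = trainer.get_policy().model
model.base_model.summary()

# compute_action() takes a single observation and returns one action.
obs = gym.make("Pong-v0").reset()
action = trainer.compute_action(obs)
print(action)  # an int in [0, 5]
```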

Thanks in advance!


Hey @javigm98 , the 6-sized output in your case determines the next action to take. Your action space is a gym.spaces.Discrete(6), so actions are int values between 0 and 5 (inclusive). The 6 outputs of your network are interpreted as the "logits" of a categorical distribution. The logits are softmaxed to yield 6 probability values (all >= 0.0 and summing to 1.0); these are your action probabilities. RLlib then samples from these 6 probabilities to get the one action to apply to the env.
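A minimal numpy sketch of that logits → softmax → sample step (illustrative only, not RLlib's actual code path):

```python
import numpy as np

def sample_action(logits, rng=np.random.default_rng()):
    """Turn the 6 raw logits into one sampled action in [0, 5]."""
    # Softmax: exponentiate (shifted for numerical stability) and normalize.
    exp = np.exp(logits - np.max(logits))
    probs = exp / exp.sum()          # all >= 0.0, sums to 1.0
    # Sample one action index according to these probabilities.
    return int(rng.choice(len(probs), p=probs))

# Example: logits straight from the 6-sized policy head.
print(sample_action(np.array([0.1, 2.0, -1.3, 0.5, 0.0, 0.7])))
```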

On the value function output: the 1-sized output of your network is used only in PPO's loss function (not for action computation) to learn a separate value predictor given an observation. These predicted values are subtracted from the returns (the discounted sum of rewards) to reduce the variance of the loss term.
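Schematically (a toy numpy sketch of returns minus value predictions, not PPO's full GAE/loss implementation):

```python
import numpy as np

gamma = 0.99
rewards = np.array([0.0, 0.0, 1.0])      # rewards collected in a short rollout
value_preds = np.array([0.4, 0.6, 0.9])  # outputs of the 1-sized value head

# Discounted returns, computed backwards through the rollout.
returns = np.zeros_like(rewards)
running = 0.0
for t in reversed(range(len(rewards))):
    running = rewards[t] + gamma * running
    returns[t] = running

# Advantages: returns minus the value predictions (variance reduction).
advantages = returns - value_preds
print(returns, advantages)
```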


Thank you so much @sven1977 for your answer. Do you know in which file this logits softmax is implemented? I mean, at which point does Ray transform this 6-value output into the next action to take?


This happens in the different Policies (ray/rllib/policy/…) in compute_actions / compute_actions_from_input_dict. By default, we call the model, take the model's output (the logits), create a distribution object from it, and then sample from that distribution (inside the policy's exploration component's get_exploration_action method, iff it's a StochasticSampling-type exploration).
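In pseudo-Python, that flow looks roughly like this (a simplified sketch, not the actual RLlib source):

```python
# Simplified sketch of what a Policy does per observation batch
# (the real code lives in ray/rllib/policy/ and the exploration classes).
logits, state = policy.model(input_dict)                # forward pass -> the 6 logits
action_dist = policy.dist_class(logits, policy.model)   # e.g. a Categorical distribution
# With StochasticSampling exploration, get_exploration_action()
# ends up sampling from that distribution:
actions = action_dist.sample()
```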


Okay, I think I have the concepts clear now. To put everything in context: I'm trying to reproduce the rollout script the same way RLlib does, but using TFLite models (converted from RLlib TF models trained with PPO on the Pong-v0 env; I take the underlying Keras model). What I do is run inference with the model, and my idea was to take the output with the highest value after applying the softmax function as the next action. But I see that in RLlib actions are sampled from the distribution you mentioned, so I'd like to know what the underlying idea is, and if I only want to run inference, how I should select the next action at each step.

Also, I'd like to know whether, when running inference with the rollout.py script, previous actions or rewards are taken into account when computing the next action, and if so, how this is done.
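Here is a minimal sketch of the two options I'm weighing (greedy argmax vs. sampling from the softmaxed logits); the model path, observation preprocessing, and tensor layout are just placeholders:

```python
import numpy as np
import tensorflow as tf

# Placeholder path to the converted model.
interpreter = tf.lite.Interpreter(model_path="ppo_pong_policy.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]   # assumed: the 6 policy logits

def next_action(obs, greedy=False, rng=np.random.default_rng()):
    # NOTE: obs must already be preprocessed the same way as during training
    # (e.g. resized / stacked Atari frames).
    interpreter.set_tensor(inp["index"], obs.astype(np.float32)[None, ...])
    interpreter.invoke()
    logits = interpreter.get_tensor(out["index"])[0]
    if greedy:
        return int(np.argmax(logits))            # deterministic: most likely action
    exp = np.exp(logits - logits.max())
    probs = exp / exp.sum()
    return int(rng.choice(len(probs), p=probs))  # stochastic, like RLlib's sampling
```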

Thank you in advance!!