How do you get action probabilities from a policy?

jwarley · April 6, 2022, 6:47pm

How severe does this issue affect your experience of using Ray?

High: It blocks me to complete my task.

Given a state and a trained policy, how can I compute the action distribution for that state under the policy? Looking at the docs, it sounds to me like that’s what the compute_log_likelihoods function is for, but it doesn’t behave as expected and when I asked about that function specifically I got no answers.

I’ve also looked at the example Querying a policy’s action distribution, but my results using this method aren’t making sense either. In the terminology of that example, I would expect

np.sum([np.e ** dist.logp(a) for a in actions])

to equal the size of the state space |S|, but instead I’m getting a much smaller number, so these logps must not mean what I think, i.e. they are not log(P(action|state)). The example is outdated (from_batch is deprecated, for instance), so I had to make some changes; maybe I’m getting the wrong distribution somehow?

Why is this so hard to find clear/consistent documentation on? Computing an action distribution is one of the simplest things you could possibly want to do with an RL agent. I’d be happy to submit a PR with better docs on this if someone can explain what the intended solution is.

arturn · September 11, 2022, 10:43pm

Hi @jwarley ,

Sorry that this has taken a while. I have responded on your GitHub issue:

The log_likelihoods that you compute are representations of Q-Values because you are using DQN.
Since the Q Values between different state-action pairs in your environment don’t vary a lot, they are very close. Luckily, DQN uses the Epsilon Greedy action sampling function and therefore almost always simply picks the action with the slightly higher Q Value. The likelihoods that you print in your repro script underline this - right is always slightly higher than left.

Cheers

mgerstgrasser · September 22, 2022, 12:46am

@arturn I’ve actually just had this exact same problem come up, but I’m still unsure. Is there no easy built-in way of getting action probabilities for a discrete action space? Basically, an equivalent of algorithm.compute_single_action(), but the output is an entire distribution? E.g. {0: 0.4, 1:0.6} if there’s two actions, and action 0 has probability 0.4 in the current state.

If there isn’t a built-in way, do I understand correctly that I’d have to (1) get the preprocessor and preprocess the observation from my env, (2) get the logits from policy.model (3) create an action_dist (4) get the logp for a given action from that, and (5) and then take the exponent of that - should that directly give me the action probability?

And per jwarley’s original question and your answer to it, does this mean this won’t work for DQN policies? Also not SimpleQ then? Are there any other policies for which this won’t work? Is there any way of doing this for arbitrary policy classes?

Thank you!

mannyv · September 22, 2022, 2:04am

Hi @mgerstgrasser,

If you call algorithm.compute_single_action(…, full_fetch=True)

You should get a 3-tupe back with (action, state, extra_fetches)

extra_fetches will be a dictionary and you should be able to get the probabilities with extra_fetches[SampleBatch.ACTION_PROB]

github.com

ray-project/ray/blob/db2ce69e4a76f35b2af0bc2172c3c9bb2a628fd2/rllib/algorithms/algorithm.py#L1347


      
          if input_dict is not None:
              input_dict[SampleBatch.OBS] = observation
              action, state, extra = policy.compute_single_action(
                  input_dict=input_dict,
                  explore=explore,
                  timestep=timestep,
                  episode=episode,
              )
          # Individual args.
          else:
              action, state, extra = policy.compute_single_action(
                  obs=observation,
                  state=state,
                  prev_action=prev_action,
                  prev_reward=prev_reward,
                  info=info,
                  explore=explore,
                  timestep=timestep,
                  episode=episode,
              )

github.com

ray-project/ray/blob/db2ce69e4a76f35b2af0bc2172c3c9bb2a628fd2/rllib/algorithms/algorithm.py#L1367


      
          
          
    # If we work in normalized action space (normalize_actions=True),
              # we re-translate here into the env's action space.
              if unsquash_action:
                  action = space_utils.unsquash_action(action, policy.action_space_struct)
              # Clip, according to env's action space.
              elif clip_action:
                  action = space_utils.clip_action(action, policy.action_space_struct)
          
          
    # Return 3-Tuple: Action, states, and extra-action fetches.
              if state or full_fetch:
                  return action, state, extra
              # Ensure backward compatibility.
              else:
                  return action
          
          
@PublicAPI
          def compute_actions(
              self,
              observations: TensorStructType,
              state: Optional[List[TensorStructType]] = None,

github.com

ray-project/ray/blob/db2ce69e4a76f35b2af0bc2172c3c9bb2a628fd2/rllib/policy/torch_policy.py#L1013-L1019


      
          if dist_inputs is not None:
              extra_fetches[SampleBatch.ACTION_DIST_INPUTS] = dist_inputs
          
          
# Action-logp and action-prob.
          if logp is not None:
              extra_fetches[SampleBatch.ACTION_PROB] = torch.exp(logp.float())
              extra_fetches[SampleBatch.ACTION_LOGP] = logp

arturn · September 22, 2022, 1:13pm

@mgerstgrasser , would you mind opening an issue with a repro script and a short description of your hardware? We can then maybe take Manny’s approach to the next level and compute a slope for the queue to warn if it stays positive.

mannyv · September 22, 2022, 2:19pm

@arturn,

I think you replied on the wrong thread.

arturn · September 22, 2022, 3:07pm

Thanks @mannyv , you are right.

mgerstgrasser · September 22, 2022, 4:33pm

Oh, thank you! That only returns the action prob for the sampled action, not all of them, but in the specific setting where I need this there are only two actions! Thank you!

mannyv · September 22, 2022, 5:14pm

If you wanted the probabilities for all the actions you could create a new subclass of the policy you are using and implement extra_actions_out that adds an entry with probabilities for all the actions. That method takes the action distribution as an input so it should be pretty straightforward.

In this case you may also need to update the algorithm to pick the new policy. I am still trying to wrap my head around all the 2.0 api updates.

github.com

ray-project/ray/blob/db2ce69e4a76f35b2af0bc2172c3c9bb2a628fd2/rllib/policy/torch_policy.py#L793


      
          def extra_compute_grad_fetches(self) -> Dict[str, Any]:
              """Extra values to fetch and return from compute_gradients().
          
          
    Returns:
                  Extra fetch dict to be added to the fetch dict of the
                  `compute_gradients` call.
              """
              return {LEARNER_STATS_KEY: {}}  # e.g, stats, td error, etc.
          
          
@DeveloperAPI
          def extra_action_out(
              self,
              input_dict: Dict[str, TensorType],
              state_batches: List[TensorType],
              model: TorchModelV2,
              action_dist: TorchDistributionWrapper,
          ) -> Dict[str, TensorType]:
              """Returns dict of extra info to include in experience batch.
          
          
    Args:
                  input_dict: Dict of model input tensors.

Topic		Replies	Views
Confused by output of `compute_log_likelihoods` RLlib	0	324	March 28, 2022
Fetch action probability distribution from trained policy RLlib	7	674	March 18, 2023
Policy.compute_log_likelihoods should allows to compute with/without applying the exploration (e.g. SoftQ exploration) RLlib	1	271	April 16, 2021
How to get DQN action distribution RLlib	2	388	November 3, 2022
Output of PPO with discrete actions RLlib	4	1125	December 15, 2022

How do you get action probabilities from a policy?

Related topics