Explorative action or not?

hridayns · April 19, 2022, 10:59pm

How severely does this issue affect your experience of using Ray?

High: It blocks me from completing my task.

Hello, I am using the Callbacks example here (ray/episode.py at master in ray-project/ray on GitHub) for the PPO algorithm in a custom multi-agent environment with tune.run(…). I can get the “last” action for an agent, but how do I know whether the action was explorative or came directly from the policy outputs?

On a side note, the action received in the step function for an agent does not match the action returned by last_action_for(agent). I am very confused about why this is the case. Please help, and thank you in advance!

sven1977 · April 26, 2022, 10:11am

Hey @hridayns, actually, the actions for PPO are always explorative, as PPO uses StochasticSampling by default (which always just samples from the action distribution).
Unless(!) you set explore=False in your Trainer’s config, in which case it will always use the max-likelihood action.
Also, I believe episode.last_action_for is working as expected. Keep in mind that you may have more than one environment copy inside your worker (check the num_envs_per_worker setting in your config).
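To make the difference between the two modes concrete, here is a small pure-Python sketch (not RLlib's StochasticSampling implementation) of picking a discrete action from categorical logits: with exploration on, the action is sampled from the distribution; with explore=False, the most likely action is taken. The helpers `softmax`, `select_action`, and `is_max_likelihood` and the example logits are hypothetical; only the `explore` and `num_envs_per_worker` config keys come from this thread. Note that a sampled action can still coincide with the argmax, so comparing an action against the logits only tells you whether it matched the greedy choice, not whether sampling was skipped.

```python
import math
import random


def softmax(logits):
    """Convert unnormalized logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_action(logits, explore, rng):
    """Pick a discrete action index from categorical logits (hypothetical helper).

    explore=True  -> sample from the distribution (stochastic sampling).
    explore=False -> take the most likely action (argmax).
    """
    probs = softmax(logits)
    if explore:
        return rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return max(range(len(probs)), key=lambda i: probs[i])


def is_max_likelihood(action, logits):
    """True if `action` is the argmax of the logits.

    Under stochastic sampling this can happen by chance, so it says whether a
    sampled action coincided with the greedy one, not whether sampling happened.
    """
    return action == max(range(len(logits)), key=lambda i: logits[i])


# The two RLlib config keys mentioned in this thread (everything else omitted).
config = {
    "explore": False,          # always take the max-likelihood action
    "num_envs_per_worker": 1,  # a single env copy per worker is easier to debug
}

if __name__ == "__main__":
    rng = random.Random(0)
    logits = [0.2, 1.5]  # hypothetical logits for an agent with 2 actions
    sampled = [select_action(logits, explore=True, rng=rng) for _ in range(10)]
    greedy = select_action(logits, explore=config["explore"], rng=rng)
    print("sampled:", sampled)
    print("greedy:", greedy)
    print("sampled == argmax:", [is_max_likelihood(a, logits) for a in sampled])
```

With explore=True you will see both actions over repeated draws, with action 1 more frequent; with explore=False the same observation always yields action 1.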

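When I run the example script you mentioned and print out (a) the actions sent to the environment and (b) episode.last_action_for(), I get the following (after the first line, each last_action_for() value matches the action sent just before it):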

episode.last_action_for()
Out[2]: 1
action=defaultdict(<class 'dict'>, {0: {'agent0': 0}})
episode.last_action_for()
Out[3]: 0
action=defaultdict(<class 'dict'>, {0: {'agent0': 1}})
episode.last_action_for()
Out[4]: 1
action=defaultdict(<class 'dict'>, {0: {'agent0': 1}})
episode.last_action_for()
Out[5]: 1
action=defaultdict(<class 'dict'>, {0: {'agent0': 0}})
episode.last_action_for()
Out[6]: 0
action=defaultdict(<class 'dict'>, {0: {'agent0': 1}})
episode.last_action_for()
Out[7]: 1
action=defaultdict(<class 'dict'>, {0: {'agent0': 1}})
episode.last_action_for()
Out[8]: 1
action=defaultdict(<class 'dict'>, {0: {'agent0': 1}})
episode.last_action_for()
Out[9]: 1
action=defaultdict(<class 'dict'>, {0: {'agent0': 1}})
episode.last_action_for()
Out[10]: 1
action=defaultdict(<class 'dict'>, {0: {'agent0': 1}})
episode.last_action_for()
Out[11]: 1
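The log above comes from a single env copy (env index 0). With num_envs_per_worker > 1, the action dict holds one entry per env index, and each episode's last_action_for() only reflects its own env copy, so a comparison only makes sense per env index. The following is a minimal pure-Python sketch of that check, not RLlib code: the dict shape mirrors the `action=defaultdict(...)` lines (env index, then agent id), and `find_mismatches` and the per-episode dict are hypothetical bookkeeping you would fill in from your own logging or callback.

```python
from collections import defaultdict


def find_mismatches(sent_actions, last_actions):
    """Compare actions sent to each env copy with last_action_for() values.

    sent_actions: {env_index: {agent_id: action}}, same shape as the
        `action=defaultdict(<class 'dict'>, {...})` lines in the log.
    last_actions: {env_index: {agent_id: action}}, what last_action_for(agent_id)
        returned for the episode running in that env copy (hypothetical logging).
    Returns a list of (env_index, agent_id, sent, reported) tuples that differ.
    """
    mismatches = []
    for env_index, per_agent in sent_actions.items():
        for agent_id, action in per_agent.items():
            reported = last_actions.get(env_index, {}).get(agent_id)
            if reported != action:
                mismatches.append((env_index, agent_id, action, reported))
    return mismatches


if __name__ == "__main__":
    # One env copy, as in the log above: everything lines up.
    sent = defaultdict(dict, {0: {"agent0": 1}})
    print(find_mismatches(sent, {0: {"agent0": 1}}))  # []

    # Two env copies: matching each env index against its own episode is fine...
    sent = defaultdict(dict, {0: {"agent0": 0}, 1: {"agent0": 1}})
    per_episode = {0: {"agent0": 0}, 1: {"agent0": 1}}
    print(find_mismatches(sent, per_episode))  # []

    # ...but reading only env 0's episode for every env copy looks like a bug.
    only_env0 = {i: per_episode[0] for i in sent}
    print(find_mismatches(sent, only_env0))  # [(1, 'agent0', 1, 0)]
```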
