Explorative action or not?

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hello, I am using the Callbacks example here: ray/episode.py at master · ray-project/ray · GitHub, with the PPO algorithm in a custom multi-agent environment using tune.run(…). I can get the “last” action for an agent, but how do I know whether that action was explorative or came straight from the policy outputs?

On a side note, the action received in the step function for an agent does not match the action returned by last_action_for(agent). I am very confused as to why this is the case…please help. Thank you in advance!

Hey @hridayns, actually, the actions for PPO are always explorative, as PPO uses StochasticSampling by default (which always just samples from the action distribution).
Unless(!) you set explore=False in your Trainer’s config, in which case it’ll always use the max-likelihood action.
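To make the difference concrete, here is a minimal pure-Python sketch of the two behaviours for a discrete action space. This is not RLlib's implementation. The function name sample_action and the example probabilities are made up for illustration. It only shows "sample from the distribution" versus "take the most likely action".

```python
import random


def sample_action(probs, explore=True, rng=random):
    """Pick an action index from a categorical distribution (illustrative only)."""
    if explore:
        # Stochastic sampling: draw an action in proportion to its probability.
        return rng.choices(range(len(probs)), weights=probs, k=1)[0]
    # No exploration: always return the most likely action.
    return max(range(len(probs)), key=lambda i: probs[i])


# Hypothetical policy output for three discrete actions.
probs = [0.2, 0.7, 0.1]

# With explore=False the result is always the index of the largest probability.
greedy = sample_action(probs, explore=False)

# With explore=True the result varies from call to call.
rng = random.Random(0)
sampled = [sample_action(probs, explore=True, rng=rng) for _ in range(1000)]
```

So with the default setting there is no separate "explorative" versus "policy" action to tell apart: every action is a sample from the policy's output distribution.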
Also, I believe episode.last_action_for is working as expected. Keep in mind that you may have more than 1 environment copy inside your worker (check your num_envs_per_worker setting in your config).
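As a rough sketch, the two settings mentioned above would sit in the config dict roughly like this. Only the keys explore and num_envs_per_worker come from this thread. Everything else your setup needs (env, multiagent, callbacks, ...) is left out, so treat it as a fragment rather than a complete config.

```python
# Fragment of a Trainer config; merge it into your existing config dict.
config = {
    # Turn off exploration: actions are then the max-likelihood ones.
    "explore": False,
    # Use a single environment copy per worker, so the actions you print in
    # step() and in the callback come from the same environment.
    "num_envs_per_worker": 1,
}

# It would then be passed along the lines of tune.run("PPO", config=config),
# together with the rest of your settings.
```

With a single environment copy per worker it should be easier to line up the action seen in step() with what episode.last_action_for(agent) returns.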

When running the example script you mentioned above and printing out a) actions sent to the environment and b) episode.last_action_for(), I get:

episode.last_action_for()
Out[2]: 1
action=defaultdict(<class 'dict'>, {0: {'agent0': 0}})
episode.last_action_for()
Out[3]: 0
action=defaultdict(<class 'dict'>, {0: {'agent0': 1}})
episode.last_action_for()
Out[4]: 1
action=defaultdict(<class 'dict'>, {0: {'agent0': 1}})
episode.last_action_for()
Out[5]: 1
action=defaultdict(<class 'dict'>, {0: {'agent0': 0}})
episode.last_action_for()
Out[6]: 0
action=defaultdict(<class 'dict'>, {0: {'agent0': 1}})
episode.last_action_for()
Out[7]: 1
action=defaultdict(<class 'dict'>, {0: {'agent0': 1}})
episode.last_action_for()
Out[8]: 1
action=defaultdict(<class 'dict'>, {0: {'agent0': 1}})
episode.last_action_for()
Out[9]: 1
action=defaultdict(<class 'dict'>, {0: {'agent0': 1}})
episode.last_action_for()
Out[10]: 1
action=defaultdict(<class 'dict'>, {0: {'agent0': 1}})
episode.last_action_for()
Out[11]: 1