How severe does this issue affect your experience of using Ray?
- High: It blocks me to complete my task.
Hello, I am using the Callbacks example here: ray/episode.py at master · ray-project/ray · GitHub, with the PPO algorithm in a custom multi-agent environment run via tune.run(…). I can get the "last" action for an agent, but how do I know whether that action was exploratory or taken directly from the policy outputs?
On a side note, the action received in the step function for an agent does not match the action returned by last_action_for(agent). I am very confused as to why this is the case…please help. Thank you in advance!
Hey @hridayns, the actions for PPO are actually always exploratory, since PPO uses `StochasticSampling` by default (which always just samples from the action distribution). Unless(!) you set `explore=False` in your Trainer's config, in which case it'll always use the max-likelihood action.
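To make the difference concrete, here is a minimal sketch of the two behaviours for a discrete action space. It is plain Python for illustration, not RLlib's actual implementation; `select_action` and the example logits are hypothetical names and values.

```python
import math
import random


def select_action(logits, explore, rng=random):
    """Pick an action index from unnormalized logits (illustration only)."""
    if not explore:
        # explore=False: deterministic, take the max-likelihood action.
        return max(range(len(logits)), key=lambda i: logits[i])
    # explore=True: sample from the softmax distribution over the logits.
    top = max(logits)
    weights = [math.exp(value - top) for value in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [0.1, 2.0, -1.0]  # hypothetical policy outputs for 3 actions
rng = random.Random(0)     # seeded so repeated runs draw the same samples

greedy = select_action(logits, explore=False)
sampled = [select_action(logits, explore=True, rng=rng) for _ in range(1000)]
```

With `explore=False` the result is always the index of the largest logit; with `explore=True` the less likely actions still show up occasionally, which is why no single sampled action can be labelled as "the policy output" versus "exploration".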
Also, I believe `episode.last_action_for` is working as expected. Keep in mind that you may have more than one environment copy inside your worker (check the `num_envs_per_worker` setting in your config).
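As a rough sketch, the two settings mentioned so far could be pinned down in the config dict while debugging. Only the keys `explore` and `num_envs_per_worker` come from this thread; the values are a debugging choice, not a recommendation.

```python
# Debugging config fragment (values are assumptions chosen for debugging).
config = {
    # Always take the max-likelihood action instead of sampling.
    "explore": False,
    # One environment copy per worker, so each callback sees a single env.
    "num_envs_per_worker": 1,
}

# This dict would then be merged into the config passed to tune.run(...).
```

With a single environment copy per worker, the action seen in the environment's step function and the one reported in the callback should be easier to line up.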
When running the example script you mentioned above and printing out a) the actions sent to the environment and b) `episode.last_action_for()`, I get:
episode.last_action_for()
Out[2]: 1
action=defaultdict(<class 'dict'>, {0: {'agent0': 0}})
episode.last_action_for()
Out[3]: 0
action=defaultdict(<class 'dict'>, {0: {'agent0': 1}})
episode.last_action_for()
Out[4]: 1
action=defaultdict(<class 'dict'>, {0: {'agent0': 1}})
episode.last_action_for()
Out[5]: 1
action=defaultdict(<class 'dict'>, {0: {'agent0': 0}})
episode.last_action_for()
Out[6]: 0
action=defaultdict(<class 'dict'>, {0: {'agent0': 1}})
episode.last_action_for()
Out[7]: 1
action=defaultdict(<class 'dict'>, {0: {'agent0': 1}})
episode.last_action_for()
Out[8]: 1
action=defaultdict(<class 'dict'>, {0: {'agent0': 1}})
episode.last_action_for()
Out[9]: 1
action=defaultdict(<class 'dict'>, {0: {'agent0': 1}})
episode.last_action_for()
Out[10]: 1
action=defaultdict(<class 'dict'>, {0: {'agent0': 1}})
episode.last_action_for()
Out[11]: 1