ojon
1
How severely does this issue affect your experience of using Ray?
- Medium: It causes significant difficulty in completing my task, but I can work around it.
Hi
I am trying to log the actions, and the observations after each action is performed, during the evaluation steps of my training.
I was able to do that with PPO, but with AlphaZero it seems that the compute_single_action method doesn't work: [rllib] Compute actions with AlphaZero algorithm · Issue #13177 · ray-project/ray · GitHub
I am taking a look at ray/custom_eval.py at master · ray-project/ray · GitHub, but the episodes there don't carry this info.
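For reference, this is roughly the loop that worked for me with PPO (a minimal sketch; "CartPole-v0" and the config are placeholders for my actual setup):
import gym
import ray
from ray.rllib.agents import ppo

ray.init()

# Placeholder env and config, standing in for my real setup.
trainer = ppo.PPOTrainer(env="CartPole-v0", config={"framework": "torch", "num_workers": 0})

env = gym.make("CartPole-v0")
obs = env.reset()
done = False
while not done:
    action = trainer.compute_single_action(obs, explore=False)
    obs, reward, done, info = env.step(action)
    # This is the pair I want to record: the action, and the
    # observation that results from it.
    print(action, obs)
With AlphaZero, the equivalent compute_single_action call fails (see the issue above), which is what led me to the workaround below.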
ojon
2
OK, I was able to do a workaround based on [rllib] Compute_actions() and Compute_actions_from_input_dict() with AlphaZero algorithm · Issue #14477 · ray-project/ray · GitHub. Thanks to andras-kth!
from ray.rllib.evaluation import Episode
from ray.rllib.policy.policy_map import PolicyMap
from ray.rllib.policy.sample_batch import DEFAULT_POLICY_ID

def az0_eval_function(trainer, eval_workers):
    """Custom evaluation function for AlphaZero, plugged in via custom_eval_function."""
    eval_reward_mean = 0
    # Create a local copy of the env; the evaluation workers are not used.
    eval_env = trainer.env_creator(trainer.config["env_config"])
    policy = trainer.get_policy()
    duration = trainer.config["evaluation_duration"]
    extra_logs = {}
    for i in range(duration):
        eval_env.seed(i)  # one fixed seed per episode, for reproducible evals
        obs_dict = eval_env.reset()
        # AlphaZero's policy expects an Episode object; build a dummy one
        # (this is the workaround from issue #14477).
        ep = Episode(
            PolicyMap(0, 0),
            lambda _, __: DEFAULT_POLICY_ID,
            lambda: None,
            lambda _: None,
            0,
        )
        # MCTS starts planning from the env state stored on the episode.
        ep.user_data["initial_state"] = eval_env.get_state()
        eval_results = {"eval_reward": 0, "eval_eps_length": 0}
        done = False
        while not done:
            action, _, _ = policy.compute_single_action(
                obs_dict, episode=ep, explore=False
            )
            obs_dict, reward, done, info = eval_env.step(action)
            eval_results["eval_reward"] += reward
            ep.length += 1
        eval_reward_mean += eval_results["eval_reward"] / duration
        # update_extra_logs() is my own helper that collects the
        # actions/observations I want to log (not shown here).
        extra_logs = update_extra_logs(extra_logs, obs_dict)
    return {
        "reward_mean": eval_reward_mean,
        "extra_logs": extra_logs,
    }
In the end I make no use of the evaluation workers.
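For completeness, this is roughly how I plug the function into the trainer (a sketch; MyEnv and my_env_config stand in for my actual env, and the rest of the AlphaZero config is omitted):
from ray.rllib.contrib.alpha_zero.core.alpha_zero_trainer import AlphaZeroTrainer

trainer = AlphaZeroTrainer(
    env=MyEnv,  # placeholder for my custom env class
    config={
        "env_config": my_env_config,  # placeholder
        "evaluation_interval": 1,     # run evaluation every training iteration
        "evaluation_duration": 10,    # episodes per evaluation, read in the function above
        "custom_eval_function": az0_eval_function,
    },
)
trainer.train()
The dict returned by az0_eval_function then shows up under the evaluation metrics of each training result.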