Actions and observations by AlphaZero in evaluation

How severely does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

Hi

I am trying to log the actions, and the observation after each action is performed, during the evaluation steps of my training.

I was able to do that with PPO, but with AlphaZero it seems that the compute_single_action method doesn't work: [rllib] Compute actions with AlphaZero algorithm · Issue #13177 · ray-project/ray · GitHub
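
For reference, this is roughly the per-step logging loop that works with PPO (a minimal sketch, assuming a trained trainer and a plain single-agent gym env; the printing is just illustrative):

obs = env.reset()
done = False
while not done:
    # Trainer.compute_single_action is enough for PPO; for AlphaZero the
    # policy additionally needs an Episode object carrying the env state.
    action = trainer.compute_single_action(obs, explore=False)
    next_obs, reward, done, info = env.step(action)
    print({"obs": obs, "action": action, "next_obs": next_obs})
    obs = next_obs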

I am also taking a look at ray/custom_eval.py at master · ray-project/ray · GitHub, but the episodes available there don't carry this info.
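
For context, the custom_eval.py example roughly works like this: it samples on the evaluation workers and then only collects episode-level metrics (rewards, lengths), which is why per-step actions and observations are not exposed there. A sketch of that pattern from memory, so details may differ across Ray versions:

import ray
from ray.rllib.evaluation.metrics import collect_episodes, summarize_episodes

def metrics_only_eval_function(trainer, eval_workers):
    # Run one evaluation rollout on each evaluation worker.
    ray.get([w.sample.remote() for w in eval_workers.remote_workers()])
    # Only aggregated RolloutMetrics come back here, no per-step data.
    episodes, _ = collect_episodes(remote_workers=eval_workers.remote_workers())
    return summarize_episodes(episodes)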

OK, I was able to put together a workaround based on [rllib] Compute_actions() and Compute_actions_from_input_dict() with AlphaZero algorithm · Issue #14477 · ray-project/ray · GitHub. Thanks to andras-kth!

from ray.rllib.evaluation import Episode
from ray.rllib.policy.policy_map import PolicyMap
from ray.rllib.policy.sample_batch import DEFAULT_POLICY_ID

def az0_eval_function(trainer, eval_workers):
    # Custom evaluation function: roll out the trained AlphaZero policy on a
    # local copy of the env, so per-step actions and observations are
    # available for logging.
    eval_reward_mean = 0
    eval_env = trainer.env_creator(trainer.config["env_config"])
    policy = trainer.get_policy()
    duration = trainer.config["evaluation_duration"]

    extra_logs = {}
    for i in range(duration):
        eval_env.seed(i)
        obs_dict = eval_env.reset()
        # AlphaZero's compute_single_action needs an Episode object; a dummy
        # one with the default policy mapping is enough (see issue #14477).
        ep = Episode(
            PolicyMap(0, 0),
            lambda _, __: DEFAULT_POLICY_ID,
            lambda: None,
            lambda _: None,
            0,
        )
        # The AlphaZero policy reads the env state for MCTS from user_data.
        ep.user_data["initial_state"] = eval_env.get_state()
        eval_results = {"eval_reward": 0, "eval_eps_length": 0}
        done = False
        while not done:
            action, _, _ = policy.compute_single_action(obs_dict, episode=ep, explore=False)
            obs_dict, reward, done, info = eval_env.step(action)
            eval_results["eval_reward"] += reward
            eval_results["eval_eps_length"] += 1
            ep.length += 1

        eval_reward_mean += eval_results["eval_reward"] / duration
        # update_extra_logs is my own helper that stores per-episode data
        # (e.g. the final observation); see the sketch below.
        extra_logs = update_extra_logs(extra_logs, obs_dict)

    return {
        "reward_mean": eval_reward_mean, "extra_logs": extra_logs,
    }
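
update_extra_logs above is my own helper; a minimal sketch of what it could look like, purely illustrative, if all you want is to keep the final observation of each evaluation episode:

def update_extra_logs(extra_logs, obs_dict):
    # Hypothetical helper: collect the last observation of each eval episode
    # so it shows up under the custom evaluation metrics.
    extra_logs.setdefault("final_observations", []).append(obs_dict)
    return extra_logs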


In the end I make no use of the evaluation workers.
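
For completeness, this is roughly how such a function gets wired into the trainer config via the custom_eval_function hook (key names as in the custom_eval.py example; the env name and settings below are placeholders for my setup):

config = {
    "env": "MyEnv",                             # placeholder env name
    "env_config": {},                           # env settings go here
    "evaluation_interval": 1,                   # evaluate every training iteration
    "evaluation_duration": 10,                  # episodes rolled out in az0_eval_function
    "custom_eval_function": az0_eval_function,  # replaces RLlib's default evaluation
}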