Cannot get a simple Evaluation to work as intended

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I have a saved checkpoint from training, and I would like to run just evaluation episodes now with exploration set to False (which I have configured correctly). The problem is that I am unable to control the number of episodes for which this happens.
...
policy_conf["evaluation_interval"] = 1  # every 1 episode?
policy_conf["evaluation_duration"] = 1  # for 1 episode
policy_conf["evaluation_duration_unit"] = "episodes"
...

stop = {
    "episodes_total": 1,
}

result = tune.run(
    "PPO",
    name="ma-pheromone",
    config=policy_conf,
    metric="episode_reward_mean",
    mode="max",
    trial_name_creator=trial_str_creator,
    stop=stop,
    verbose=1,
    log_to_file=("test.log", "error.log"),
    local_dir="./tune-eval-results",
    restore="./tune-results/ma-pheromone/PPO-_100-2-4-1000.0-0.1-0.1-0-1__0_2022-08-23_20-23-55/checkpoint_000030/checkpoint-30",
    num_samples=10,
)

I have set the evaluation duration to 1 and even set the stop condition episodes_total to 1. I don't understand why I can't run a simple evaluation of 10 episodes, so I fell back to trying to run just 1 evaluation episode, but even that does not work. I set num_samples to 10 because I want to repeat the evaluation 10 times for consistency. Please help me.

It looks like you are using Tune to evaluate. The stop condition you set does not apply to evaluation episodes but to training episodes. Try using the algorithm's evaluate() function instead, since you don't seem to be interested in training, and therefore there is nothing to tune here.
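A minimal sketch of that evaluate() route, assuming the Ray 2.0 RLlib API (PPOConfig, Algorithm.restore, Algorithm.evaluate). The environment name, config values, and checkpoint path are placeholders, not taken from your setup:

```python
def evaluate_checkpoint(checkpoint_path, num_rounds=10):
    """Restore a trained PPO algorithm and run pure evaluation rounds.

    Sketch assuming Ray 2.0's RLlib API; nothing here trains, each
    evaluate() call collects exactly one episode with exploration off.
    """
    # Deferred import so the sketch can be read/imported without Ray installed.
    from ray.rllib.algorithms.ppo import PPOConfig

    config = (
        PPOConfig()
        .environment("CartPole-v1")  # placeholder: replace with your env
        .evaluation(
            evaluation_duration=1,               # one episode per evaluate() call
            evaluation_duration_unit="episodes",
            evaluation_config={"explore": False},  # deterministic policy
        )
    )
    algo = config.build()
    algo.restore(checkpoint_path)

    # Repeat evaluation num_rounds times for consistency; each call returns
    # a metrics dict you can record yourself.
    results = [algo.evaluate() for _ in range(num_rounds)]
    algo.stop()
    return results
```

Calling this as `evaluate_checkpoint("./tune-results/.../checkpoint-30")` gives you the 10 repeated single-episode evaluations directly, without involving Tune's stop conditions at all.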

I have tried that in the past and had trouble using evaluate(). I am also not sure where the evaluation results get stored on the system for visualization. Is there a workaround to keep using Tune, but for evaluation?

This is a bug. I saw the same behavior: the number of evaluation episodes run was larger than the evaluation_duration I specified. This is fixed in Ray 2.0.0.
See also: [RLlib] Excessive evaluation if rollout_fragment_length < timesteps_per_iteration · Issue #27821 · ray-project/ray · GitHub

@hridayns The results would not be visualized, but you can record the statistics yourself, and for most metrics an average will suffice without any visualization. You could also calculate percentiles of the reward distribution, or similar summary statistics.
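Recording and summarizing the per-round rewards yourself needs only the standard library. A sketch, where the reward values are made up for illustration (in practice you would pull them from the metrics returned by each evaluation round):

```python
import json
import statistics

# Hypothetical mean episode rewards collected from 10 evaluation rounds.
rewards = [91.0, 88.5, 95.2, 90.1, 87.9, 93.3, 89.7, 92.4, 90.8, 94.6]

summary = {
    "mean": statistics.mean(rewards),
    "stdev": statistics.stdev(rewards),
    # Cut points at the 25th, 50th, and 75th percentiles of the distribution.
    "quartiles": statistics.quantiles(rewards, n=4),
}

# Persist for later inspection or plotting, instead of relying on
# Tune's result directories.
with open("eval_summary.json", "w") as f:
    json.dump(summary, f, indent=2)
```

This gives you a stored, reproducible summary of the evaluation runs that you can visualize later with any plotting tool if you still want to.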

@RaymondK Are you sure that this is the referenced bug, or even a bug at all? What @hridayns describes is what I would expect to happen.

To be honest, I am not completely clear on what @hridayns is asking, so maybe I should have stated my answer less definitively. I did experience the bug referenced in the GitHub issue, and it was solved by Ray 2.0.0, but I am not sure it solves the problem hridayns has.


I have actually not yet tried to evaluate using tune.run() in Ray 2.0.0. I switched to using evaluate() in Ray 1.11 and have continued doing so; it seemed to work fine until I ran into another issue later, which I made a separate post about.