Cannot get a simple Evaluation to work as intended

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I have a saved checkpoint from training, and I would like to run just evaluation episodes now with exploration set to False (which I have configured correctly). The problem is that I am unable to control the number of episodes for which this happens.
...
policy_conf["evaluation_interval"] = 1  # every 1 episode?
policy_conf["evaluation_duration"] = 1  # for 1 episode
policy_conf["evaluation_duration_unit"] = "episodes"
...

stop = {
    "episodes_total": 1,
}

result = tune.run(
    "PPO",
    name="ma-pheromone",
    config=policy_conf,
    metric="episode_reward_mean",
    mode="max",
    trial_name_creator=trial_str_creator,
    stop=stop,
    verbose=1,
    log_to_file=("test.log", "error.log"),
    local_dir="./tune-eval-results",
    restore="./tune-results/ma-pheromone/PPO-_100-2-4-1000.0-0.1-0.1-0-1__0_2022-08-23_20-23-55/checkpoint_000030/checkpoint-30",
    num_samples=10,
)

I have set the evaluation duration to 1 and even set the stop condition episodes_total to 1. I don't understand why I can't run a simple evaluation of 10 episodes, so I fell back to trying to run just 1 evaluation episode, but even that does not work. I set num_samples to 10 because I want to repeat the evaluation 10 times for consistency. Please help me.

It looks like you are using Tune to evaluate. The stop condition you set does not apply to evaluation episodes but to training episodes. Try using the algorithm's evaluate() function instead, since you don't seem to be interested in training, and therefore there is nothing to tune here.
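A minimal sketch of that evaluate() route, assuming the Ray 2.0 RLlib API (PPOConfig, Algorithm.restore, Algorithm.evaluate). The environment name, config values, and checkpoint path are placeholders, not taken from your setup:

```python
def evaluate_checkpoint(checkpoint_path, num_rounds=10):
    """Restore a trained PPO algorithm and run pure evaluation rounds.

    Sketch assuming Ray 2.0's RLlib API; nothing here trains, each
    evaluate() call collects exactly one episode with exploration off.
    """
    # Deferred import so the sketch can be read/imported without Ray installed.
    from ray.rllib.algorithms.ppo import PPOConfig

    config = (
        PPOConfig()
        .environment("CartPole-v1")  # placeholder: replace with your env
        .evaluation(
            evaluation_duration=1,               # one episode per evaluate() call
            evaluation_duration_unit="episodes",
            evaluation_config={"explore": False},  # deterministic policy
        )
    )
    algo = config.build()
    algo.restore(checkpoint_path)

    # Repeat evaluation num_rounds times for consistency; each call returns
    # a metrics dict you can record yourself.
    results = [algo.evaluate() for _ in range(num_rounds)]
    algo.stop()
    return results
```

Calling this as `evaluate_checkpoint("./tune-results/.../checkpoint-30")` gives you the 10 repeated single-episode evaluations directly, without involving Tune's stop conditions at all.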

I have tried that in the past and had trouble using evaluate(). I am also not sure where the evaluation results get stored on the system for visualization. Is there a workaround to keep using Tune, but for evaluation?

This is a bug. I saw the same behavior: the number of evaluation episodes run was larger than the evaluation_duration I specified. This is fixed in Ray 2.0.0.
See also: [RLlib] Excessive evaluation if rollout_fragment_length < timesteps_per_iteration · Issue #27821 · ray-project/ray · GitHub

@hridayns The results would not be visualized, but you can record the statistics yourself, and for most metrics an average will suffice without any visualization. You could also calculate percentiles of the reward distribution, or similar summary statistics.
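Recording and summarizing the per-round rewards yourself needs only the standard library. A sketch, where the reward values are made up for illustration (in practice you would pull them from the metrics returned by each evaluation round):

```python
import json
import statistics

# Hypothetical mean episode rewards collected from 10 evaluation rounds.
rewards = [91.0, 88.5, 95.2, 90.1, 87.9, 93.3, 89.7, 92.4, 90.8, 94.6]

summary = {
    "mean": statistics.mean(rewards),
    "stdev": statistics.stdev(rewards),
    # Cut points at the 25th, 50th, and 75th percentiles of the distribution.
    "quartiles": statistics.quantiles(rewards, n=4),
}

# Persist for later inspection or plotting, instead of relying on
# Tune's result directories.
with open("eval_summary.json", "w") as f:
    json.dump(summary, f, indent=2)
```

This gives you a stored, reproducible summary of the evaluation runs that you can visualize later with any plotting tool if you still want to.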

@RaymondK Are you sure that this is the referenced bug, or even a bug at all? What @hridayns describes is what I would expect to happen.

To be honest, I am not completely clear on what @hridayns is asking, so maybe I should have stated my answer less definitively. I did experience the bug referenced in the GitHub issue, and it was solved by Ray 2.0.0, but I am not sure it solves the problem hridayns has.


I have actually not yet tried to evaluate using tune.run() in Ray 2.0.0. I switched to using evaluate() in Ray 1.11 and have continued doing so; it seemed to work fine until I ran into another issue later, which I made a separate post about.