Inconsistent number of episodes with 'evaluate'

leo593 · July 11, 2022, 11:51am

How severe does this issue affect your experience of using Ray?

Low: It annoys or frustrates me for a moment.

Hi!

I noticed strange behavior with rllib evaluate on the command line: depending on the environment, the number of evaluation episodes requested on the command line is never reached during the evaluation.

For example, running rllib evaluate <chkpt_path> --run PPO --env CartPole-v0 --episodes 100 will only run about 50-55 evaluation episodes (no matter the number of episodes requested on the command line, as long as it is >55). I have the same problem with other environments/algorithms (but each time with a different “max” number of episodes).

Here is a reproduction script:

import ray
from ray import tune
import gym
import os

ray.tune.run(
    'PPO',
    stop={
        'training_iteration': 50,
    },
    config={'env':'CartPole-v0'},
    checkpoint_freq=25,
    checkpoint_score_attr='episode_reward_mean',
    checkpoint_at_end=True,
    local_dir='checkpoint',
)

os.chdir('checkpoint/PPO')
folders=os.listdir('.')
result=[]
for filename in folders:
    if os.path.isdir(os.path.join(os.path.abspath('.'), filename)):
        result.append(filename)
chkpt_dir = max(result, key=os.path.getmtime)

os.system('rllib evaluate '+chkpt_dir+'/checkpoint_000050/checkpoint-50 --run PPO --env CartPole-v0 --episodes 200')

Lars_Simon_Zehnder · July 12, 2022, 5:31pm

Hi @leo593,
I can replicate this. Looking into the `evaluate.py` script, it might make sense:

parser.add_argument(
    "--steps",
    default=10000,
    help="Number of timesteps to roll out. Rollout will also stop if "
    "`--episodes` limit is reached first. A value of 0 means no "
    "limitation on the number of timesteps run.",
)
parser.add_argument(
    "--episodes",
    default=0,
    help="Number of complete episodes to roll out. Rollout will also stop "
    "if `--steps` (timesteps) limit is reached first. A value of 0 means "
    "no limitation on the number of episodes run.",
)

Could it be that the number of timesteps has reached 10,000 after 50 episodes?
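To make the arithmetic concrete, here is a rough sketch of how the two limits interact. It is a simplified model and not the actual loop in `evaluate.py`: the function name `episodes_completed` is made up for illustration, every episode is assumed to have the same length, and the limits are only checked between episodes. CartPole-v0 caps an episode at 200 timesteps, so a well-trained policy uses up the default 10,000 steps in roughly 50 episodes.

```python
def episodes_completed(episode_len, max_steps=10000, max_episodes=0):
    """Count episodes run before either limit is hit (0 means no limit).

    Simplified model: fixed episode length, limits checked between episodes.
    """
    if not max_steps and not max_episodes:
        raise ValueError("at least one limit must be non-zero")
    steps = 0
    episodes = 0
    while True:
        # Stop once the timestep budget is used up.
        if max_steps and steps >= max_steps:
            break
        # Stop once the requested number of episodes is reached.
        if max_episodes and episodes >= max_episodes:
            break
        steps += episode_len
        episodes += 1
    return episodes


# Episodes that always hit the 200-step cap, default --steps, --episodes 100
print(episodes_completed(200, max_steps=10000, max_episodes=100))  # 50
# Slightly shorter episodes give a few more before the step budget runs out
print(episodes_completed(185, max_steps=10000, max_episodes=100))  # 55
# No timestep limit: the episode limit is reached
print(episodes_completed(200, max_steps=0, max_episodes=100))      # 100
```

Going by the help text above, passing `--steps 0` together with `--episodes 100` should lift the timestep limit so that the requested number of episodes is actually run.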

leo593 · July 18, 2022, 1:58pm

Hi @Lars_Simon_Zehnder, thank you for your help, you’re right! It makes sense now.
