Test reward much lower than training reward


I am trying to use RLlib to train an agent to navigate Gym's FrozenLake environment. On the default version of FrozenLake everything works fine. However, if I modify the environment a little, I see a severe disparity between training and evaluation results.

My environment is Gym's frozen lake with a fixed initial point and a fixed goal, but holes at random positions.

Reward: 2 if the agent reaches the goal, -1 if it reaches a hole or steps outside the grid, 0 otherwise.
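The reward rule above can be sketched as a small helper. This is an illustrative function, not the author's actual environment code; the names `step_reward`, `pos`, `goal`, and `holes` are assumptions:

```python
# Hypothetical sketch of the reward rule described above (names are illustrative).
def step_reward(pos, goal, holes, grid_size=6):
    """Return the reward for landing on `pos` in a grid_size x grid_size grid."""
    row, col = pos
    if not (0 <= row < grid_size and 0 <= col < grid_size):
        return -1  # stepped outside the grid
    if pos in holes:
        return -1  # fell into a hole
    if pos == goal:
        return 2   # reached the goal
    return 0       # ordinary frozen tile
```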

Observation: a 3x6x6 matrix O, where O[0,:,:] is a one-hot indicator of the current agent position, O[1,:,:] marks the frozen positions, and O[2,:,:] is a one-hot indicator of the goal.
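An observation of this shape could be assembled like the following sketch (the function name and arguments are assumptions for illustration, not the author's code):

```python
import numpy as np

def build_observation(agent_pos, frozen_cells, goal_pos, size=6):
    """Stack three one-hot size x size planes: agent position, frozen cells, goal."""
    obs = np.zeros((3, size, size), dtype=np.float32)
    obs[0][agent_pos] = 1.0          # channel 0: current agent position
    for cell in frozen_cells:        # channel 1: all frozen (safe) tiles
        obs[1][cell] = 1.0
    obs[2][goal_pos] = 1.0           # channel 2: goal position
    return obs
```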

When I train with this, the average episode reward converges near 2, which is what I want. But when I evaluate the trained model, or test it from a checkpoint, those good training rewards are never reproduced.

Here is my working notebook:

I would really appreciate it if any of you good people could point out what I am doing wrong.

Thanks in advance,

Hi @krajit,

A likely reason, and the one I would test first, is that you trained a stochastic policy (explore=True) but you are not evaluating the same way. I have seen exactly what you are reporting with some of my environments: changing the explore setting during evaluation reduces performance.
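To see why this matters, here is a toy sketch (plain NumPy, not RLlib code) of the difference between sampling from a categorical policy and acting greedily; the logits are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.array([2.0, 0.5, 0.1, -1.0])       # toy policy logits for 4 actions
probs = np.exp(logits) / np.exp(logits).sum()  # softmax over actions

greedy_action = int(np.argmax(probs))          # explore=False style: deterministic
sampled_action = int(rng.choice(4, p=probs))   # explore=True style: stochastic

# The greedy choice is always the highest-probability action, while
# sampling occasionally picks a lower-probability one, so episode
# returns can differ between the two modes.
```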

Thanks for your reply, @mannyv.

I have tried testing with both explore=True and explore=False. Neither reproduces the mean episode reward reported during the training phase.
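For context, the evaluation settings being toggled would look something like this in an RLlib trainer config (the key names follow RLlib's documented config; verify against your RLlib version):

```python
# Hypothetical evaluation settings; key names follow RLlib's trainer config.
eval_config = {
    "evaluation_interval": 5,        # run evaluation every 5 training iterations
    "evaluation_num_episodes": 20,   # average results over 20 episodes
    "evaluation_config": {
        "explore": False,            # act greedily during evaluation
    },
}
```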

What got me out of this issue was increasing the number of environments per worker.

config = {
    "framework": "torch",
    "env_config": {"size": 6, "numHoles": 10},
    "num_workers": 10,
    "num_envs_per_worker": 200,
    "create_env_on_driver": True,
    "model": {
        "custom_model": MyTorchModel,
        "custom_model_config": {},
    },
}
My environment is a random grid every time; I guess this option exposed the agent to many more environment samples.
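The per-episode random layout described above could be sampled with something like the following illustrative helper (not the author's actual environment code; names and defaults are assumptions matching the env_config shown):

```python
import random

def sample_holes(size=6, num_holes=10, start=(0, 0), goal=(5, 5), seed=None):
    """Sample `num_holes` hole positions, excluding the fixed start and goal cells."""
    rng = random.Random(seed)
    candidates = [(r, c) for r in range(size) for c in range(size)
                  if (r, c) not in (start, goal)]
    return set(rng.sample(candidates, num_holes))
```

Calling this from the environment's reset() would give a fresh hole layout each episode, which is what makes a large num_envs_per_worker effectively a larger sample of layouts per training batch.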