Test reward much lower than training reward


I am trying to use RLlib to train an agent to navigate Gym's FrozenLake environment. On the default version of FrozenLake everything works fine. However, if I modify the environment a little, I see a severe disparity between training and evaluation results.

My environment is Gym's frozen lake with a fixed initial point and a fixed goal, but holes at random positions.

Reward: 2 if the agent reaches the goal, -1 if it reaches a hole or steps outside the grid, 0 otherwise.
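The reward rule above can be sketched as a small helper. This is an illustrative function, not the author's actual environment code; the names `step_reward`, `pos`, `goal`, and `holes` are assumptions:

```python
# Hypothetical sketch of the reward rule described above (names are illustrative).
def step_reward(pos, goal, holes, grid_size=6):
    """Return the reward for landing on `pos` in a grid_size x grid_size grid."""
    row, col = pos
    if not (0 <= row < grid_size and 0 <= col < grid_size):
        return -1  # stepped outside the grid
    if pos in holes:
        return -1  # fell into a hole
    if pos == goal:
        return 2   # reached the goal
    return 0       # ordinary frozen tile
```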

Observation: a 3x6x6 matrix O, where O[0,:,:] is a one-hot indicator of the current agent position, O[1,:,:] marks the frozen positions, and O[2,:,:] is a one-hot indicator of the goal.
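An observation of this shape could be assembled like the following sketch (the function name and arguments are assumptions for illustration, not the author's code):

```python
import numpy as np

def build_observation(agent_pos, frozen_cells, goal_pos, size=6):
    """Stack three one-hot size x size planes: agent position, frozen cells, goal."""
    obs = np.zeros((3, size, size), dtype=np.float32)
    obs[0][agent_pos] = 1.0          # channel 0: current agent position
    for cell in frozen_cells:        # channel 1: all frozen (safe) tiles
        obs[1][cell] = 1.0
    obs[2][goal_pos] = 1.0           # channel 2: goal position
    return obs
```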

When I train with this, the average episode reward converges near 2, which is what I want. But when I evaluate the trained model, or test it from a checkpoint, those good training rewards are never reproduced.

Here is my working notebook:

I would really appreciate it if any of you good people could point out what I am doing wrong.

Thanks in advance,

Hi @krajit,

A likely reason, and the one I would test first, is that you trained a stochastic policy (explore=True) but you are not evaluating the same way. I have seen exactly what you are reporting with some of my environments: changing the explore setting during evaluation reduces performance.
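To see why this matters, here is a toy sketch (plain NumPy, not RLlib code) of the difference between sampling from a categorical policy and acting greedily; the logits are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.array([2.0, 0.5, 0.1, -1.0])       # toy policy logits for 4 actions
probs = np.exp(logits) / np.exp(logits).sum()  # softmax over actions

greedy_action = int(np.argmax(probs))          # explore=False style: deterministic
sampled_action = int(rng.choice(4, p=probs))   # explore=True style: stochastic

# The greedy choice is always the highest-probability action, while
# sampling occasionally picks a lower-probability one, so episode
# returns can differ between the two modes.
```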

Thanks for your reply, @mannyv.

I have tried testing with both explore=True and explore=False. Neither reproduces the mean episode reward reported during the training phase.
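For context, the evaluation settings being toggled would look something like this in an RLlib trainer config (the key names follow RLlib's documented config; verify against your RLlib version):

```python
# Hypothetical evaluation settings; key names follow RLlib's trainer config.
eval_config = {
    "evaluation_interval": 5,        # run evaluation every 5 training iterations
    "evaluation_num_episodes": 20,   # average results over 20 episodes
    "evaluation_config": {
        "explore": False,            # act greedily during evaluation
    },
}
```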

What got me out of this issue was increasing the number of environments per worker.

config = {
    "framework": "torch",
    "env_config": {"size": 6, "numHoles": 10},
    "num_workers": 10,
    "num_envs_per_worker": 200,
    "create_env_on_driver": True,
    "model": {
        "custom_model": MyTorchModel,
        "custom_model_config": {},
    },
}
My environment is a random grid every time; I guess this option exposed the agent to many more environment samples.
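The per-episode random layout described above could be sampled with something like the following illustrative helper (not the author's actual environment code; names and defaults are assumptions matching the env_config shown):

```python
import random

def sample_holes(size=6, num_holes=10, start=(0, 0), goal=(5, 5), seed=None):
    """Sample `num_holes` hole positions, excluding the fixed start and goal cells."""
    rng = random.Random(seed)
    candidates = [(r, c) for r in range(size) for c in range(size)
                  if (r, c) not in (start, goal)]
    return set(rng.sample(candidates, num_holes))
```

Calling this from the environment's reset() would give a fresh hole layout each episode, which is what makes a large num_envs_per_worker effectively a larger sample of layouts per training batch.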