I am trying to use RLlib to train an agent to navigate Gym's FrozenLake environment. On the default version of FrozenLake, everything works fine. However, when I modify the environment a little, I see a severe disparity between training and evaluation results.
My environment is a modified version of Gym's FrozenLake: fixed starting point, fixed goal, but holes at random positions.
Reward: 2 for reaching the goal, -1 for reaching a hole or stepping outside the grid, 0 otherwise.
Observation: a 3x6x6 matrix O, where O[0,:,:] is the one-hot indicator of the current agent position, O[1,:,:] contains the frozen-position indicators, and O[2,:,:] is the one-hot indicator of the goal.
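To make the setup concrete, here is a minimal numpy sketch of the observation encoding and reward scheme described above. The 6x6 grid size is inferred from the observation shape, and the `make_observation` / `step_reward` helpers and the `frozen_mask` representation are my own illustration, not code from the notebook:

```python
import numpy as np

GRID = 6  # assumption: 6x6 grid, inferred from the 3x6x6 observation


def make_observation(agent_pos, frozen_mask, goal_pos):
    """Build the 3x6x6 observation described in the question.

    agent_pos, goal_pos: (row, col) tuples
    frozen_mask: 6x6 boolean array, True where the tile is frozen (safe)
    """
    obs = np.zeros((3, GRID, GRID), dtype=np.float32)
    obs[0][agent_pos] = 1.0                   # one-hot agent position
    obs[1] = frozen_mask.astype(np.float32)   # frozen-tile indicators
    obs[2][goal_pos] = 1.0                    # one-hot goal position
    return obs


def step_reward(agent_pos, frozen_mask, goal_pos):
    """Reward scheme from the question: +2 goal, -1 hole or out of bounds, else 0."""
    r, c = agent_pos
    if not (0 <= r < GRID and 0 <= c < GRID):
        return -1.0   # stepped outside the grid
    if agent_pos == goal_pos:
        return 2.0    # reached the goal
    if not frozen_mask[r, c]:
        return -1.0   # fell into a hole
    return 0.0        # ordinary frozen tile
```

Since the holes are randomized per episode, `frozen_mask` would be resampled in the environment's `reset()` so that O[1,:,:] reflects the current layout.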
When I train with this, the average episode reward converges near 2, which is what I want. But when I evaluate the trained model, or test it after restoring from a checkpoint, those good training rewards are never reproduced.
Here is my working notebook:
I would really appreciate it if any of you good people could point out what I am doing wrong.
Thanks in advance,