TL;DR: RLlib’s rollout command seems to be training the network, not evaluating it.
I’m trying to use Ray RLlib’s DQN to train, save, and evaluate neural networks on a custom-made simulator. To prototype the workflow, I’ve been using OpenAI Gym’s CartPole-v0 environment, and I found some odd results while running the rollout command for evaluation. (I used the exact method described in the RLlib Training APIs - Evaluating Trained Policies documentation.)
First I trained a vanilla DQN network until it reached an episode_reward_mean of 200 points. Then I used the rllib rollout command to test the network for 1000 episodes on CartPole-v0. For the first 135 episodes, the episode rewards were poor, ranging from 10 to 200. From the 136th episode onward, however, the score was consistently 200, which is the maximum score in CartPole-v0.
So it seems like rllib rollout is training the network rather than evaluating it. I know that shouldn’t be the case, since there is no training code in the rollout.py module. But it really does look like training: how else could the score gradually increase as more episodes are played? Moreover, the network seems to “adapt” to different starting positions later in the evaluation, which looks like evidence of training to me.
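To be clear about what I mean by “evaluating”: what I expected rollout to do is roughly the loop below, which only restores the checkpoint and acts with the trained policy, never updating any weights. (This is just my own sketch, not what rollout.py actually does; in particular, I’m assuming that passing explore=False to compute_action yields the greedy action.)

import os
import gym
import ray
from ray.rllib.agents.dqn import DQNTrainer

ray.init()

# Restore the trainer from the same checkpoint used in the rollout command below.
agent = DQNTrainer(config={"env": "CartPole-v0", "num_workers": 0})
agent.restore(os.path.expanduser(
    "~/ray_results/CartPole_Evaluation/DQN_CartPole-v0_13hfd/checkpoint_139/checkpoint-139"))

env = gym.make("CartPole-v0")
for episode in range(1000):
    obs, done, total_reward = env.reset(), False, 0.0
    while not done:
        # explore=False should (I assume) pick the greedy argmax-Q action.
        action = agent.compute_action(obs, explore=False)
        obs, reward, done, _ = env.step(action)
        total_reward += reward
    print(f"Episode #{episode}: reward: {total_reward}")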
If anyone can help me understand why this might be happening, it would be greatly appreciated. The code I used is below:
- Training
from ray import tune

results = tune.run(
    "DQN",
    stop={"episode_reward_mean": 200},  # stop training once the mean episode reward reaches 200
    config={
        "env": "CartPole-v0",
        "num_workers": 6
    },
    checkpoint_freq=0,             # no periodic checkpoints...
    keep_checkpoints_num=1,
    checkpoint_score_attr="episode_reward_mean",
    checkpoint_at_end=True,        # ...only write a checkpoint when training stops
    local_dir=r"/home/ray_results/CartPole_Evaluation"
)
- Evaluation
rllib rollout ~/ray_results/CartPole_Evaluation/DQN_CartPole-v0_13hfd/checkpoint_139/checkpoint-139 \
--run DQN --env CartPole-v0 --episodes 1000
- Result
2021-01-12 17:26:48,764 INFO trainable.py:489 -- Current state after restoring: {'_iteration': 77, '_timesteps_total': None, '_time_total': 128.41606998443604, '_episodes_total': 819}
Episode #0: reward: 21.0
Episode #1: reward: 13.0
Episode #2: reward: 13.0
Episode #3: reward: 27.0
Episode #4: reward: 26.0
Episode #5: reward: 14.0
Episode #6: reward: 16.0
Episode #7: reward: 22.0
Episode #8: reward: 25.0
Episode #9: reward: 17.0
Episode #10: reward: 16.0
Episode #11: reward: 31.0
Episode #12: reward: 10.0
Episode #13: reward: 23.0
Episode #14: reward: 17.0
Episode #15: reward: 41.0
Episode #16: reward: 46.0
Episode #17: reward: 15.0
Episode #18: reward: 17.0
Episode #19: reward: 32.0
Episode #20: reward: 25.0
...
Episode #114: reward: 134.0
Episode #115: reward: 90.0
Episode #116: reward: 38.0
Episode #117: reward: 33.0
Episode #118: reward: 36.0
Episode #119: reward: 114.0
Episode #120: reward: 183.0
Episode #121: reward: 200.0
Episode #122: reward: 166.0
Episode #123: reward: 200.0
Episode #124: reward: 155.0
Episode #125: reward: 181.0
Episode #126: reward: 72.0
Episode #127: reward: 200.0
Episode #128: reward: 54.0
Episode #129: reward: 196.0
Episode #130: reward: 200.0
Episode #131: reward: 200.0
Episode #132: reward: 188.0
Episode #133: reward: 200.0
Episode #134: reward: 200.0
Episode #135: reward: 173.0
Episode #136: reward: 200.0
Episode #137: reward: 200.0
Episode #138: reward: 200.0
Episode #139: reward: 200.0
Episode #140: reward: 200.0
...
Episode #988: reward: 200.0
Episode #989: reward: 200.0
Episode #990: reward: 200.0
Episode #991: reward: 200.0
Episode #992: reward: 200.0
Episode #993: reward: 200.0
Episode #994: reward: 200.0
Episode #995: reward: 200.0
Episode #996: reward: 200.0
Episode #997: reward: 200.0
Episode #998: reward: 200.0
Episode #999: reward: 200.0
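Would overriding the rollout config to disable exploration be the right way to force a purely greedy evaluation? Something like the command below is what I had in mind, though I’m not sure it’s the intended approach:

rllib rollout ~/ray_results/CartPole_Evaluation/DQN_CartPole-v0_13hfd/checkpoint_139/checkpoint-139 \
    --run DQN --env CartPole-v0 --episodes 1000 --config '{"explore": false}'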