`rllib rollout` command seems to be training the network, not evaluating

TL;DR: RLlib’s rollout command seems to be training the network, not evaluating.

I’m trying to use Ray RLlib’s DQN to train, save, and evaluate neural networks on a custom-made simulator. To prototype the workflow, I’ve been using OpenAI Gym’s CartPole-v0 environment, and I ran into some odd results while running the rollout command for evaluation. (I followed the exact method described in the RLlib Training APIs - Evaluating Trained Policies documentation.)

First I trained a vanilla DQN network until it reached an episode_reward_mean of 200 points. Then I used the rllib rollout command to test the network for 1000 episodes on CartPole-v0. For the first 135 episodes, the per-episode reward was poor, ranging from 10 to 200. From the 136th episode onward, however, the score was consistently 200, which is a perfect score in CartPole-v0.

So it seems like rllib rollout is training the network rather than evaluating it. I know that isn’t actually the case, since there is no training code in the rollout.py module, but I have to say it really looks like training. Otherwise, how could the score gradually increase as more episodes go by? Furthermore, the network seems to be “adapting” to different starting positions later in the evaluation run, which, from my perspective, is evidence of training.

If anyone can help me figure out why this might be happening, it would be greatly appreciated. The code I used is below:

  • Training
from ray import tune

results = tune.run(
    "DQN",
    # Stop training once the mean episode reward reaches 200 (max for CartPole-v0).
    stop={"episode_reward_mean": 200},
    config={
        "env": "CartPole-v0",
        "num_workers": 6,
    },
    checkpoint_freq=0,
    keep_checkpoints_num=1,
    checkpoint_score_attr="episode_reward_mean",
    checkpoint_at_end=True,
    local_dir=r"/home/ray_results/CartPole_Evaluation",
)
  • Evaluation
rllib rollout ~/ray_results/CartPole_Evaluation/DQN_CartPole-v0_13hfd/checkpoint_139/checkpoint-139 \
             --run DQN --env CartPole-v0 --episodes 1000
  • Result
2021-01-12 17:26:48,764 INFO trainable.py:489 -- Current state after restoring: {'_iteration': 77, '_timesteps_total': None, '_time_total': 128.41606998443604, '_episodes_total': 819}
Episode #0: reward: 21.0
Episode #1: reward: 13.0
Episode #2: reward: 13.0
Episode #3: reward: 27.0
Episode #4: reward: 26.0
Episode #5: reward: 14.0
Episode #6: reward: 16.0
Episode #7: reward: 22.0
Episode #8: reward: 25.0
Episode #9: reward: 17.0
Episode #10: reward: 16.0
Episode #11: reward: 31.0
Episode #12: reward: 10.0
Episode #13: reward: 23.0
Episode #14: reward: 17.0
Episode #15: reward: 41.0
Episode #16: reward: 46.0
Episode #17: reward: 15.0
Episode #18: reward: 17.0
Episode #19: reward: 32.0
Episode #20: reward: 25.0
...
Episode #114: reward: 134.0
Episode #115: reward: 90.0
Episode #116: reward: 38.0
Episode #117: reward: 33.0
Episode #118: reward: 36.0
Episode #119: reward: 114.0
Episode #120: reward: 183.0
Episode #121: reward: 200.0
Episode #122: reward: 166.0
Episode #123: reward: 200.0
Episode #124: reward: 155.0
Episode #125: reward: 181.0
Episode #126: reward: 72.0
Episode #127: reward: 200.0
Episode #128: reward: 54.0
Episode #129: reward: 196.0
Episode #130: reward: 200.0
Episode #131: reward: 200.0
Episode #132: reward: 188.0
Episode #133: reward: 200.0
Episode #134: reward: 200.0
Episode #135: reward: 173.0
Episode #136: reward: 200.0
Episode #137: reward: 200.0
Episode #138: reward: 200.0
Episode #139: reward: 200.0
Episode #140: reward: 200.0
...
Episode #988: reward: 200.0
Episode #989: reward: 200.0
Episode #990: reward: 200.0
Episode #991: reward: 200.0
Episode #992: reward: 200.0
Episode #993: reward: 200.0
Episode #994: reward: 200.0
Episode #995: reward: 200.0
Episode #996: reward: 200.0
Episode #997: reward: 200.0
Episode #998: reward: 200.0
Episode #999: reward: 200.0

Yeah, this makes sense :slight_smile:
DQN uses the EpsilonGreedy exploration module, which, by default, reduces epsilon (the amount of randomness in your actions) from 1.0 to 0.05 over the first 10k timesteps. To suppress exploration altogether during evaluation (which you should do when “rolling out” DQN), you need to set:

config:
    evaluation_config:
        explore: false
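
In Python terms, the same override can be baked into the config passed to tune.run, so the saved checkpoint already carries it when the rollout restores the trainer. A minimal sketch, assuming rollout merges evaluation_config on top of the training config when restoring (behavior may differ slightly between RLlib versions):

from ray import tune

results = tune.run(
    "DQN",
    stop={"episode_reward_mean": 200},
    config={
        "env": "CartPole-v0",
        "num_workers": 6,
        # Suggested override: settings under evaluation_config are applied
        # on top of the training config during evaluation/rollout, so
        # EpsilonGreedy exploration is switched off there.
        "evaluation_config": {
            "explore": False,
        },
    },
    checkpoint_at_end=True,
)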

For background:
DQN picks actions deterministically (it always chooses the action with the highest Q-value), whereas algos like PPO, IMPALA, and A3C pick actions by sampling from a (network-parameterized) distribution. This is why the latter algos can learn optimal policies even when those policies are stochastic (simple example: the optimal policy for rock-paper-scissors is to act randomly, which PPO, IMPALA, and A3C can all learn, but DQN cannot).

This is why, for DQN, you should disable exploration during rollouts, whereas for PPO, IMPALA, and A3C you may want to keep exploration on (although sometimes it helps to switch it off as well, depending on your environment).
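
If it helps, here is a rough Python sketch of the same idea outside the CLI: restore the checkpoint into a DQNTrainer and force greedy actions per call. The checkpoint path is the one from the question, and the explore argument of compute_action is assumed to be available in your RLlib version:

import os

import gym
import ray
from ray.rllib.agents.dqn import DQNTrainer

ray.init()

# Rebuild the trainer for the same env and restore the saved weights.
trainer = DQNTrainer(config={"env": "CartPole-v0"})
trainer.restore(os.path.expanduser(
    "~/ray_results/CartPole_Evaluation/DQN_CartPole-v0_13hfd/"
    "checkpoint_139/checkpoint-139"))

env = gym.make("CartPole-v0")
obs = env.reset()
done, total_reward = False, 0.0
while not done:
    # explore=False bypasses EpsilonGreedy and always takes the argmax-Q action.
    action = trainer.compute_action(obs, explore=False)
    obs, reward, done, _ = env.step(action)
    total_reward += reward
print("Episode reward:", total_reward)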


Thanks a lot!
I tested with explore set to False as you suggested, and it resulted in getting 200 in every episode. I had totally forgotten that the rollout uses the same DQN configuration I trained with, since the checkpoint is restored into a DQNTrainer.
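
Side note for anyone else reading: if you’d rather not retrain, rllib rollout also appears to accept a --config JSON override that gets merged with the configuration loaded from the checkpoint. Treating the exact flag behavior as an assumption for your RLlib version, something like the following should switch exploration off at rollout time:

rllib rollout ~/ray_results/CartPole_Evaluation/DQN_CartPole-v0_13hfd/checkpoint_139/checkpoint-139 \
    --run DQN --env CartPole-v0 --episodes 1000 \
    --config '{"explore": false}'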
