`rllib rollout` command seems to be training the network, not evaluating

TL;DR: RLlib’s rollout command seems to be training the network, not evaluating.

I’m trying to use Ray RLlib’s DQN to train, save, and evaluate neural networks on a custom-made simulator. To prototype the workflow, I’ve been using OpenAI Gym’s CartPole-v0 environment, and I ran into some odd results while running the rollout command for evaluation. (I followed the exact method described in the RLlib Training APIs - Evaluating Trained Policies documentation.)

First I trained a vanilla DQN network until it reached an episode_reward_mean of 200 points. Then I used the rllib rollout command to test the network for 1000 episodes on CartPole-v0. For the first 135 episodes, the per-episode reward was poor, ranging from 10 to 200. From the 136th episode onward, however, the score was consistently 200, which is a perfect score in CartPole-v0.

So it seems like rllib rollout is training the network rather than evaluating it. I know that isn’t actually the case, since there is no training code in the rollout.py module, but I have to say it really looks like training. Otherwise, how could the score gradually increase as more episodes go by? Furthermore, the network seems to be “adapting” to different starting positions later in the evaluation run, which, from my perspective, is evidence of training.

If anyone can help me figure out why this might be happening, it would be greatly appreciated. The code I used is below:

  • Training
from ray import tune

results = tune.run(
    "DQN",
    # Stop training once the mean episode reward reaches 200 (max for CartPole-v0).
    stop={"episode_reward_mean": 200},
    config={
        "env": "CartPole-v0",
        "num_workers": 6,
    },
    checkpoint_freq=0,
    keep_checkpoints_num=1,
    checkpoint_score_attr="episode_reward_mean",
    checkpoint_at_end=True,
    local_dir=r"/home/ray_results/CartPole_Evaluation",
)
  • Evaluation
rllib rollout ~/ray_results/CartPole_Evaluation/DQN_CartPole-v0_13hfd/checkpoint_139/checkpoint-139 \
             --run DQN --env CartPole-v0 --episodes 1000
  • Result
2021-01-12 17:26:48,764 INFO trainable.py:489 -- Current state after restoring: {'_iteration': 77, '_timesteps_total': None, '_time_total': 128.41606998443604, '_episodes_total': 819}
Episode #0: reward: 21.0
Episode #1: reward: 13.0
Episode #2: reward: 13.0
Episode #3: reward: 27.0
Episode #4: reward: 26.0
Episode #5: reward: 14.0
Episode #6: reward: 16.0
Episode #7: reward: 22.0
Episode #8: reward: 25.0
Episode #9: reward: 17.0
Episode #10: reward: 16.0
Episode #11: reward: 31.0
Episode #12: reward: 10.0
Episode #13: reward: 23.0
Episode #14: reward: 17.0
Episode #15: reward: 41.0
Episode #16: reward: 46.0
Episode #17: reward: 15.0
Episode #18: reward: 17.0
Episode #19: reward: 32.0
Episode #20: reward: 25.0
...
Episode #114: reward: 134.0
Episode #115: reward: 90.0
Episode #116: reward: 38.0
Episode #117: reward: 33.0
Episode #118: reward: 36.0
Episode #119: reward: 114.0
Episode #120: reward: 183.0
Episode #121: reward: 200.0
Episode #122: reward: 166.0
Episode #123: reward: 200.0
Episode #124: reward: 155.0
Episode #125: reward: 181.0
Episode #126: reward: 72.0
Episode #127: reward: 200.0
Episode #128: reward: 54.0
Episode #129: reward: 196.0
Episode #130: reward: 200.0
Episode #131: reward: 200.0
Episode #132: reward: 188.0
Episode #133: reward: 200.0
Episode #134: reward: 200.0
Episode #135: reward: 173.0
Episode #136: reward: 200.0
Episode #137: reward: 200.0
Episode #138: reward: 200.0
Episode #139: reward: 200.0
Episode #140: reward: 200.0
...
Episode #988: reward: 200.0
Episode #989: reward: 200.0
Episode #990: reward: 200.0
Episode #991: reward: 200.0
Episode #992: reward: 200.0
Episode #993: reward: 200.0
Episode #994: reward: 200.0
Episode #995: reward: 200.0
Episode #996: reward: 200.0
Episode #997: reward: 200.0
Episode #998: reward: 200.0
Episode #999: reward: 200.0

Yeah, this makes sense :slight_smile:
DQN uses the EpsilonGreedy exploration module, which, by default, reduces epsilon (the amount of randomness in your actions) from 1.0 to 0.05 over the first 10k timesteps. To suppress exploration altogether during evaluation (which you should do when “rolling out” DQN), you need to set:

config:
    evaluation_config:
        explore: false
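
In Python terms, the same override can be baked into the config passed to tune.run, so the saved checkpoint already carries it when the rollout restores the trainer. A minimal sketch, assuming rollout merges evaluation_config on top of the training config when restoring (behavior may differ slightly between RLlib versions):

from ray import tune

results = tune.run(
    "DQN",
    stop={"episode_reward_mean": 200},
    config={
        "env": "CartPole-v0",
        "num_workers": 6,
        # Suggested override: settings under evaluation_config are applied
        # on top of the training config during evaluation/rollout, so
        # EpsilonGreedy exploration is switched off there.
        "evaluation_config": {
            "explore": False,
        },
    },
    checkpoint_at_end=True,
)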

For background:
DQN picks actions deterministically (it always chooses the action with the highest Q-value), whereas algos like PPO, IMPALA, and A3C pick actions by sampling from a (network-parameterized) distribution. This is why the latter algos can learn optimal policies even when those policies are stochastic (simple example: the optimal policy for rock-paper-scissors is to act randomly, which PPO, IMPALA, and A3C can all learn, but DQN cannot).

This is why, for DQN, you should disable exploration during rollouts, whereas for PPO, IMPALA, and A3C you may want to keep exploration on (although sometimes it helps to switch it off as well, depending on your environment).
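
If it helps, here is a rough Python sketch of the same idea outside the CLI: restore the checkpoint into a DQNTrainer and force greedy actions per call. The checkpoint path is the one from the question, and the explore argument of compute_action is assumed to be available in your RLlib version:

import os

import gym
import ray
from ray.rllib.agents.dqn import DQNTrainer

ray.init()

# Rebuild the trainer for the same env and restore the saved weights.
trainer = DQNTrainer(config={"env": "CartPole-v0"})
trainer.restore(os.path.expanduser(
    "~/ray_results/CartPole_Evaluation/DQN_CartPole-v0_13hfd/"
    "checkpoint_139/checkpoint-139"))

env = gym.make("CartPole-v0")
obs = env.reset()
done, total_reward = False, 0.0
while not done:
    # explore=False bypasses EpsilonGreedy and always takes the argmax-Q action.
    action = trainer.compute_action(obs, explore=False)
    obs, reward, done, _ = env.step(action)
    total_reward += reward
print("Episode reward:", total_reward)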


Thanks a lot!
I tested with explore set to False as you suggested, and it resulted in getting 200 in every episode. I had totally forgotten that the rollout uses the same DQN configuration I trained with, since the checkpoint is restored into a DQNTrainer.
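
Side note for anyone else reading: if you’d rather not retrain, rllib rollout also appears to accept a --config JSON override that gets merged with the configuration loaded from the checkpoint. Treating the exact flag behavior as an assumption for your RLlib version, something like the following should switch exploration off at rollout time:

rllib rollout ~/ray_results/CartPole_Evaluation/DQN_CartPole-v0_13hfd/checkpoint_139/checkpoint-139 \
    --run DQN --env CartPole-v0 --episodes 1000 \
    --config '{"explore": false}'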
