How to get the current epsilon value after a training iteration?

How severely does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

Dear all,

I have a simple SimpleQTrainer. I train it using the built-in EpsilonGreedy exploration. Is there a way to easily get the current epsilon value after an iteration?

This is roughly the code I use.

# config
config_simple = SIMPLE_Q_DEFAULT_CONFIG.copy()
# ... some other things
config_simple["explore"] = True
config_simple["exploration_config"] = {# The Exploration class to use
                                           "type": ray.rllib.utils.exploration.epsilon_greedy.EpsilonGreedy,
                                           # Config for the Exploration class' constructor:
                                           "initial_epsilon": 1.0,
                                           "final_epsilon": 0.01,
                                           "epsilon_timesteps": 5000,   # Timesteps over which to anneal epsilon.
                                           "warmup_timesteps": 5000
                                           }

# agent
agent = SimpleQTrainer(config=config_simple, env=select_env)

# train
for n in range(n_iter):
    result = agent.train()
    print(status.format(n + 1, result["episode_reward_min"], result["episode_reward_mean"], result["episode_reward_max"], result["episode_len_mean"]))
    # --
    # Here I would like to print the current value of epsilon.
    # --

I tried the following, but all of these attempts return a constant value of either 1.0 or 0.01.

current_epsilon = agent.get_policy().get_exploration_info()  # This always gives 1.0
current_epsilon = agent.get_policy().exploration.get_info()["cur_epsilon"]  # This always gives 1.0
current_epsilon = agent.get_policy().exploration.epsilon_schedule.outside_value  # This always gives 0.01

@Stefan-1313 Have you tried out the following?

agent.get_policy().exploration.get_state()["cur_epsilon"]

Hi, thanks for the reply @Lars_Simon_Zehnder .

I tried what you proposed:
current_epsilon = agent.get_policy().exploration.get_state()["cur_epsilon"]
Unfortunately, it also returns current_epsilon = 1.0.

In case it matters, I installed ray using these commands:

pip install ray==1.13.0 
pip install ray[tune]==1.13.0
pip install ray[rllib]==1.13.0

(In this case I don’t use Ray-Tune).

I tried modifying the source code of ray\rllib\utils\exploration\epsilon_greedy.py
I added a print line to print epsilon. I get thousands of prints, but I can clearly see Epsilon decreasing as expected.

Then, if I also print the epsilon in between iterations with the possibilities mentioned in the above forum posts, I get output like this:

...
(RolloutWorker pid=43772) FROM epsilon_greedy.py | Epsilon: 0.8420525714285714, Timestep: 2792
(RolloutWorker pid=43772) FROM epsilon_greedy.py | Epsilon: 0.8418262857142857, Timestep: 2796

INBETWEEN ITERATIONS | Methods in this forum post to get current epsilon (in order of appearance) =>   
      1-> 1.0;   2-> 1.0;   3-> 0.01;   4-> 1.0;

(RolloutWorker pid=31712) FROM epsilon_greedy.py | Epsilon: 0.8302857142857143, Timestep: 3000
(RolloutWorker pid=31712) FROM epsilon_greedy.py | Epsilon: 0.8300594285714286, Timestep: 3004
...

(I have set warmup_timesteps to 0 and changed epsilon_timesteps to 17500.)

So I can confirm that epsilon is decreasing as expected; I just don’t know how to read it through the API without modifying the source code.
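As a sanity check, the logged values line up with a simple linear anneal. Below is a minimal sketch (assuming EpsilonGreedy anneals linearly from initial_epsilon to final_epsilon over epsilon_timesteps; the helper function is mine, not RLlib code):

def annealed_epsilon(timestep, initial_epsilon=1.0, final_epsilon=0.01,
                     epsilon_timesteps=17500, warmup_timesteps=0):
    # Linear interpolation between initial_epsilon and final_epsilon.
    frac = min(max(timestep - warmup_timesteps, 0) / epsilon_timesteps, 1.0)
    return initial_epsilon + frac * (final_epsilon - initial_epsilon)

print(annealed_epsilon(2792))  # 0.84205..., matches the logged value above
print(annealed_epsilon(3000))  # 0.83028..., matches as well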

In addition, I noticed something silly.

I suddenly noticed this code started to report the current epsilon correctly! (Probably most of the other approaches I tried do as well.)

current_epsilon = agent.get_policy().exploration.get_info()["cur_epsilon"]

So I went ahead and looked what things I had changed.

I simply had this line commented out:

config_simple["num_workers"] = 5 

Then epsilon started to be reported correctly. Setting config_simple["num_workers"] = 1 does not help; the line has to be absent (commented out) for the reporting to work.

Is this logical expected behavior? Or a bug?
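In the meantime, is there perhaps a way to ask the workers for their epsilon directly? Something like this is what I would guess (an untested sketch, assuming the trainer exposes its WorkerSet as agent.workers and that foreach_worker only ships the returned float back, not the policy object):

epsilons = agent.workers.foreach_worker(
    lambda worker: worker.get_policy().exploration.get_state()["cur_epsilon"]
)
print(epsilons)  # first entry: local worker, then one entry per remote worker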

Great observation! That looks to me rather like a bug. I debugged a little and it appears to me that the last_timestep attribute of the EpsilonGreedy object does not get updated. Therefore the PiecewiseSchedule always stays at the same epsilon.

Thanks! I also debugged a little (in Ray v1.13.0 and in v1.11.0, which already seems quite different).

When I place this line:

print(f'FROM epsilon_greedy.py | Epsilon: {epsilon}, Timestep: {self.last_timestep}')

below this line in the source code: Ray 1.13.0 - epsilon_greedy.py. It prints epsilon and shows that it is decaying, regardless of whether num_workers is configured or not.

So when num_workers is configured, is the epsilon of 1.0 reported through the API the one actually in use, with the epsilon printed from the source code effectively doing nothing?
Or is only the reporting broken, mistakenly returning 1.0, while the epsilon printed from the source code is the one actually used for action selection?

Side note:
In Ray v1.11.0 I get the warning:

WARNING deprecation.py:45 -- DeprecationWarning: `get_info` has been deprecated. Use `get_state` instead. This will raise an error in the future!

In Ray v1.13.0, the warning is gone, but there is also no error.
However, the way you (@Lars_Simon_Zehnder) proposed does seem to be the correct way to get the current epsilon:

agent.get_policy().exploration.get_state()["cur_epsilon"]

I can confirm that using get_state() instead of get_info() shows the same issue with the same characteristics: it also breaks when the number of workers is specified.

Side note 2:
Setting config_simple["num_workers"] = 0 also makes the epsilon reporting work.
The epsilon reporting even works when setting config_simple["explore"] = False, which seems strange. Is the reported value actually used? With config_simple["explore"] = False I would expect epsilon to be 0.0.
Moreover, since no exploration_config is provided in that case, there is no way to know we want to use epsilon-greedy for exploration; maybe we want some epsilon-free approach.

Hi @Stefan-1313,

When explore is False, EpsilonGreedy always takes the greedy action, so it simply ignores epsilon. This happens here:
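Roughly, the logic is along these lines (a paraphrased sketch for illustration, not the exact RLlib source):

import random

def epsilon_greedy_action(q_values, epsilon, explore):
    # Greedy action: the one with the highest Q-value.
    greedy_action = max(range(len(q_values)), key=lambda a: q_values[a])
    if not explore:
        # explore=False: epsilon is never consulted on this path.
        return greedy_action
    if random.random() < epsilon:
        # Exploratory (uniform random) action with probability epsilon.
        return random.randrange(len(q_values))
    return greedy_action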


@Stefan-1313 , I have debugged a little more and can confirm that the timesteps inside the EpsilonGreedy instances are correct and therefore so are the epsilons. What is needed is a method that can request the current epsilon values from the remote workers.

As described in this thread, some objects hold a threading.RLock() and can therefore not be transferred via Ray. One of these is the policy_map of the RolloutWorker, which is what get_policy() uses. So, at this moment I do not see a way to retrieve the cur_epsilon values via a ray.get() call.
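To illustrate the underlying limitation: an object holding an RLock cannot be pickled and therefore cannot be shipped between processes (a minimal standalone sketch, not RLlib code):

import pickle
import threading

class HoldsLock:
    def __init__(self):
        # The RLock makes the whole object unpicklable.
        self.lock = threading.RLock()

try:
    pickle.dumps(HoldsLock())
except TypeError as err:
    print(f"Cannot pickle: {err}")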

As a solution you could create your own callback that reports the current epsilon after each sample step of a single worker:

from ray.rllib.algorithms.callbacks import DefaultCallbacks
from ray.rllib.policy.policy import Policy
from ray.rllib.policy.sample_batch import SampleBatch

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from ray.rllib.evaluation import RolloutWorker

class MyCallback(DefaultCallbacks):
    
    def __init__(self):
        super().__init__()
        
    def on_sample_end(
        self, *, worker: "RolloutWorker", samples: SampleBatch, **kwargs
    ) -> None:
        cur_epsilon = worker.get_policy().exploration.get_state()["cur_epsilon"]
        print("cur_epsilon sample_end: {}".format(cur_epsilon))
        
    def on_learn_on_batch(
        self, *, policy: Policy, train_batch: SampleBatch, result: dict, **kwargs
    ) -> None:
        # Just for demonstration. Here the local worker does the learning.
        # Therefore, the epsilon will remain at 1.0. 
        cur_epsilon = policy.exploration.get_state()["cur_epsilon"]
        print("cur_epsilon: {}".format(cur_epsilon))

# ..... 
config_simple["callbacks"] = MyCallback

# ....

for n in range(n_iter): 
      result = agent.train()

This should print, for each sample() call of the RolloutWorker, the epsilon value currently used for the rollout.


Thanks! This seems to work partially.

However, only on_sample_end() seems to return the correct current epsilon.
I’m more interested in epsilon at the time on_learn_on_batch() is called, but that one also returns epsilon = 1.0.

Is it possible to have the cur_epsilon returned as a variable instead of printing it to the terminal?
I tried adding result["cur_epsilon"] = cur_epsilon to on_learn_on_batch() (as shown in ‘examples/custom_metrics_and_callbacks.py’), but it seems not to reach the main script.
EDIT: This only makes sense if on_learn_on_batch() returns something other than epsilon = 1.0. Right now, even the 1.0 does not reach the main script.

@Stefan-1313 , the problem with on_learn_on_batch(), as described in the comment, is that it is executed on the local worker and not on the remote workers, and only the latter carry the current epsilon since they do the rollouts. As they cannot be accessed via get_policy() (the policy_map holds an RLock), I see no way to extract the epsilon via a ray.get() call.

However, the epsilon will not change after on_sample_end() is called until sampling starts again. So the last cur_epsilon will be the actual one. If you need all cur_epsilons that were used in the rollout, you could use on_episode_step() and record the epsilons into a container.
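For example, something along these lines should make the values show up in the train() results under custom_metrics (a sketch building on the callback above, using the same imports; the class name is mine, and RLlib aggregates each custom metric into *_mean/_min/_max entries):

class EpsilonMetricCallback(DefaultCallbacks):

    def on_episode_step(
        self, *, worker: "RolloutWorker", base_env, policies=None, episode, **kwargs
    ) -> None:
        # Record the epsilon currently used by this rollout worker.
        epsilon = worker.get_policy().exploration.get_state()["cur_epsilon"]
        episode.custom_metrics["cur_epsilon"] = epsilon

# config_simple["callbacks"] = EpsilonMetricCallback
# result = agent.train()
# print(result["custom_metrics"]["cur_epsilon_mean"])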


Thank you very much.
This helps me a lot and I will try to proceed.
