Can I train DQN with ExternalEnv without using `log_action()`?

RLlib states that “ExternalEnv provides a self.log_action() call to support off-policy actions,” and DQN is an off-policy algorithm (it uses EpsilonGreedy for its behavior policy and Greedy for its training policy). So I assumed I should use log_action() for DQN.

However, the following example class in ray/rllib/tests/ raised some questions for me.

import random

from ray.rllib.env.external_env import ExternalEnv


class PartOffPolicyServing(ExternalEnv):
    def __init__(self, env, off_pol_frac):
        ExternalEnv.__init__(self, env.action_space, env.observation_space)
        self.env = env
        self.off_pol_frac = off_pol_frac

    def run(self):
        eid = self.start_episode()
        obs = self.env.reset()
        while True:
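            if random.random() < self.off_pol_frac:
                # Off-policy step: sample in the env and only log the action.
                action = self.env.action_space.sample()
                self.log_action(eid, obs, action)
            else:
                # On-policy step: ask the trainer-side policy for the action.
                action = self.get_action(eid, obs)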
            obs, reward, done, info = self.env.step(action)
            self.log_returns(eid, reward, info=info)
            if done:
                self.end_episode(eid, obs)
                obs = self.env.reset()
                eid = self.start_episode()

In the class shown above, the exploration part of the algorithm appears to be implemented inside the run() method. However, since DQNTrainer itself implements EpsilonGreedy, do I really have to use log_action()?
That is, since DQN already accounts for the off-policy fraction itself, why would I apply it again in the environment?

Hey @Kai_Yun, I agree, it’s a little confusing. The “off_pol_frac” in the example above is indeed something like an epsilon (as in epsilon-greedy exploration). However, DQN’s own epsilon exploration is applied “on top” of this random action generation whenever get_action is called.

The example above is only the Env part of the loop, in which the following happens:

In n% of the cases, we produce a random action (inside the env, without going through any policy or policy server) and send it to the policy server for off-policy training via log_action. This random action goes directly into DQN’s replay buffer and is thus used for future training. Yes, you are right, this happens in addition to epsilon action sampling via the Policy, so you can probably set epsilon_timesteps to 0 here and handle the epsilon decay entirely inside the env.

In the remaining (100 - n)% of the cases, we go the “normal” way through DQNTrainer’s RolloutWorker, which asks the current Policy for an action. This action also makes it into the buffer, because it’s part of the sample batch produced by a single rollout (using the ExternalEnv).
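
To make the two paths concrete, here is a rough, self-contained sketch in plain Python. It does not use RLlib at all: ToyPolicyServer is a hypothetical stand-in that only borrows the method names get_action and log_action (the real ExternalEnv calls also take an episode ID), and the "greedy" action is a dummy. The only point is that actions from both paths end up in the same buffer.

```python
import random


class ToyPolicyServer:
    """Hypothetical stand-in, NOT RLlib: it only mimics the call names."""

    def __init__(self, n_actions, epsilon, seed=0):
        self.n_actions = n_actions
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.replay_buffer = []  # entries are (obs, action, source)

    def get_action(self, obs):
        # Trainer-side epsilon-greedy; action 0 plays the role of "greedy".
        if self.rng.random() < self.epsilon:
            action = self.rng.randrange(self.n_actions)
        else:
            action = 0
        self.replay_buffer.append((obs, action, "get_action"))
        return action

    def log_action(self, obs, action):
        # The env chose the action itself; it is only recorded for training.
        self.replay_buffer.append((obs, action, "log_action"))


def serve(server, off_pol_frac, steps, seed=1):
    """Env-side loop: mix env-sampled and policy-sampled actions."""
    rng = random.Random(seed)
    for obs in range(steps):
        if rng.random() < off_pol_frac:
            action = rng.randrange(server.n_actions)
            server.log_action(obs, action)
        else:
            server.get_action(obs)


server = ToyPolicyServer(n_actions=2, epsilon=0.1)
serve(server, off_pol_frac=0.5, steps=1000)
from_env = sum(1 for _, _, src in server.replay_buffer if src == "log_action")
print(len(server.replay_buffer), "transitions,", from_env, "logged by the env")
```

In this toy setup, randomness enters twice: once through off_pol_frac in the env loop and once through epsilon inside get_action, which mirrors the "on top" behavior described above.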


Hello @sven1977, thank you for the detailed answer.
As I understand your reply (please correct me if I am wrong), I should implement the EpsilonGreedyExploration part of DQN within the environment when using ExternalEnv. Thus, setting epsilon_timesteps to 0 would be appropriate.
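
For reference, a sketch of where epsilon_timesteps would sit in a DQN config. The key names below are the ones I believe the dict-based RLlib config uses for EpsilonGreedy, but please check them against your installed version. linear_epsilon is a hypothetical helper written only for this post; it approximates a linear decay and is not RLlib's actual schedule code.

```python
# Assumed key names for RLlib's dict-style config; verify for your version.
dqn_config = {
    "explore": True,
    "exploration_config": {
        "type": "EpsilonGreedy",
        "initial_epsilon": 1.0,
        "final_epsilon": 0.02,
        "epsilon_timesteps": 10000,
    },
}


def linear_epsilon(t, initial_epsilon, final_epsilon, epsilon_timesteps):
    """Hypothetical helper: linear decay from initial to final epsilon."""
    if epsilon_timesteps <= 0 or t >= epsilon_timesteps:
        return final_epsilon
    frac = t / epsilon_timesteps
    return initial_epsilon + frac * (final_epsilon - initial_epsilon)


cfg = dqn_config["exploration_config"]
for t in (0, 5000, 10000):
    eps = linear_epsilon(
        t, cfg["initial_epsilon"], cfg["final_epsilon"], cfg["epsilon_timesteps"]
    )
    print(t, eps)
```

Under a schedule like this sketch, zero decay steps would leave epsilon at final_epsilon from the first step, so final_epsilon would also have to be 0 to remove random actions on the trainer side completely.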

Then I have some follow-up questions:

  1. Why is ExternalEnv set up to work like this? DQNTrainer itself can implement off-policy training by simply setting some hyperparameters for EpsilonGreedy. Then what is the need for implementing an exploration policy in the environment? After all, DQNTrainer isn’t training the EpsilonGreedy policy; it’s simply the behavior policy. The policy we’re trying to “train” is the Greedy policy, which we don’t really train in DQN anyway, since we’re just training the network so that it produces better estimates of Q-values. In other words, asking the DQNTrainer for an action via get_action already accounts for off-policy training, since it uses EpsilonGreedy for behavior.

  2. What about noisy exploration and other exploration techniques? Do we have to also implement that in the environment when using ExternalEnv?

Also, since you’re much more of an expert in reinforcement learning, please correct me if I misunderstood anything about DQN. In terms of policies and training in DQN:

  • There is a behavior policy and a training policy, e.g. EpsilonGreedy for behavior and Greedy for training. The behavior policy is the one that actually chooses actions, and the training policy is what we want to optimize/train for future inference.
  • However, DQN is not actually training the Greedy policy since it’s value-based. The algorithm is for training the neural network to produce better Q-values, from which the Greedy policy picks the action accordingly.
  • Thus, DQN using EpsilonGreedy is already an off-policy algorithm.

Thank you again!

You can use ExternalEnv however you want (using its API). This is just an example script to test certain aspects of the ExternalEnv API. I wouldn’t implement it this way either. I would always leave the exploration strategy inside the Trainer (and return explorative actions) and not move this logic into the environment.
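
A minimal sketch of that simpler loop, in which every action comes from get_action and log_action is never called. It is written against any object exposing the ExternalEnv method names (start_episode, get_action, log_returns, end_episode) so that it runs without RLlib. CountdownEnv and RecordingServer are hypothetical stand-ins for illustration only, and the env uses the same 4-tuple step() API as the example in the question.

```python
def run_one_episode(ext_env, env):
    """Drive one episode, always asking the trainer-side policy for actions."""
    eid = ext_env.start_episode()
    obs = env.reset()
    total_reward, done = 0.0, False
    while not done:
        # Exploration (e.g. epsilon-greedy) is left to whatever answers this.
        action = ext_env.get_action(eid, obs)
        obs, reward, done, info = env.step(action)
        ext_env.log_returns(eid, reward, info=info)
        total_reward += reward
    ext_env.end_episode(eid, obs)
    return total_reward


class CountdownEnv:
    """Hypothetical toy env: the episode ends after `length` steps."""

    def __init__(self, length=3):
        self.length = length
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        return self.t, 1.0, self.t >= self.length, {}


class RecordingServer:
    """Hypothetical stand-in for an ExternalEnv; it only records the calls."""

    def __init__(self):
        self.calls = []

    def start_episode(self):
        self.calls.append("start_episode")
        return "episode-0"

    def get_action(self, episode_id, observation):
        self.calls.append("get_action")
        return 0

    def log_returns(self, episode_id, reward, info=None):
        self.calls.append("log_returns")

    def end_episode(self, episode_id, observation):
        self.calls.append("end_episode")


server = RecordingServer()
print(run_one_episode(server, CountdownEnv(length=3)), server.calls)
```

Inside a real ExternalEnv subclass, the body of run() would presumably be this same loop with self in place of ext_env, wrapped in `while True`.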

Yes, all correct. You could theoretically use any policy as a behavior policy in DQN (even a random one).
We are not really training the policy in DQN, just a Q-predictor, from which we then derive the policy (by saying: “always pick the action with the best Q-value”).
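
A small illustration of both points, using tabular Q-learning instead of a neural network (so this is a stand-in for DQN, not DQN itself). The behavior policy is uniformly random, the update target uses the max over next-state Q-values, and the greedy policy is only derived from the table at the end. The chain environment and all names here are made up for this sketch.

```python
import random

N_STATES = 4        # states 0..3; state 3 is terminal
ACTIONS = (0, 1)    # 0 = left, 1 = right
GAMMA, ALPHA = 0.9, 0.5


def step(state, action):
    """Deterministic chain; reward 1.0 only on reaching the terminal state."""
    next_state = max(state - 1, 0) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def train_q_table(episodes=500, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Behavior policy: purely random, never consults q.
            action = rng.choice(ACTIONS)
            next_state, reward, done = step(state, action)
            # The target bootstraps from the greedy (max) value of next_state.
            bootstrap = 0.0 if done else max(q[next_state])
            target = reward + GAMMA * bootstrap
            q[state][action] += ALPHA * (target - q[state][action])
            state = next_state
    return q


def greedy_policy(q):
    """Derive the policy: pick the best-valued action in each non-terminal state."""
    return [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)]


q_table = train_q_table()
print(greedy_policy(q_table))
```

Nothing in the loop optimizes a policy object; only the Q-table is updated, and the greedy policy is read off afterwards.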
