I want to use the policy being trained by Ray for some benchmark comparisons every few training iterations. For that I want to create an instance of the env with a given seed and then run one episode with each of the benchmarks as well as with the Ray policy. In sum, the pseudocode would look like this:
for i in range(n):
  env.seed(i)          # same seed for every benchmarked algorithm
  s = env.reset()
  while True:
    a = alg.agent.get_action(s)
    ns, r, d, _ = env.step(a)
    save_state(s, a, r, d)
    s = ns
    if d:
      break
where alg is one of [ray, random, heuristic1, heuristic2].
My question is: how can I call the current policy for this purpose? I assume agent.compute_action(s) does that. If so, the question is how it handles stochasticity. For example, if the algorithm is DQN, does it use epsilon=0? How does it work in policy-gradient-based algorithms like A2C?
              You’re right, agent.compute_action(s) does that.
I think by default it uses the stochastic policy, but you can pass the argument explore=False to get a deterministic policy:
agent.compute_action(s, explore=False)
As far as I understand, in policy gradient algorithms it takes the argmax, i.e. the action with the highest probability.
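Putting it together, here is a minimal sketch of the kind of evaluation loop this enables. It assumes the older single-agent Gym API (4-tuple step), an RLlib DQNTrainer on CartPole-v0, and placeholder values for the eval frequency and seed, so adapt it to your own env and benchmarks:

import gym
import ray
from ray.rllib.agents.dqn import DQNTrainer

ray.init()
trainer = DQNTrainer(env="CartPole-v0", config={"num_workers": 1})
eval_env = gym.make("CartPole-v0")

for it in range(100):
    trainer.train()
    if it % 10 == 0:                # benchmark every few training iterations
        eval_env.seed(0)            # same seed used for the other benchmarks
        obs = eval_env.reset()
        done, ep_return = False, 0.0
        while not done:
            # deterministic action from the policy currently being trained
            action = trainer.compute_action(obs, explore=False)
            obs, reward, done, _ = eval_env.step(action)
            ep_return += reward
        print(f"iter {it}: eval return {ep_return}")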
              Thanks for the answer!
And one more question. In the compute_actions API code, it says that the observation is an object. When I pass a list or a numpy.array, it returns an error:
AttributeError: 'list' object has no attribute 'items'
What kind of object is the observation supposed to be?
    def compute_actions(self,
                        observations,
                        state=None,
                        prev_action=None,
                        prev_reward=None,
                        info=None,
                        policy_id=DEFAULT_POLICY_ID,
                        full_fetch=False,
                        explore=None):
        """Computes an action for the specified policy on the local Worker.
        Note that you can also access the policy object through
        self.get_policy(policy_id) and call compute_actions() on it directly.
        Args:
            observation (obj): observation from the environment.
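As far as I can tell, the 'items' error suggests that the plural Trainer.compute_actions expects a dict of observations keyed by an agent (or episode) ID and returns a dict of actions, while the singular compute_action takes a single raw observation such as a numpy array. A small sketch of the difference, reusing the trainer from the sketch above (the "agent_0" key is just a placeholder):

import numpy as np

obs = np.array([0.01, -0.02, 0.03, 0.04])   # a single CartPole-style observation

# singular API: pass the raw observation directly
action = trainer.compute_action(obs, explore=False)

# plural API: pass a dict keyed by an agent/episode id, get a dict back
actions = trainer.compute_actions({"agent_0": obs}, explore=False)
action = actions["agent_0"]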