Scoring the trained policy with Ray

I want to use the policy trained by Ray during training for some benchmark comparisons, every few training iterations. For that, I want to create an instance of the env with a given seed and then run one episode with each of the benchmarks as well as the Ray policy. In sum, the pseudocode would look like this:

for i in range(n):
  s = env.reset(seed=i)
  done = False
  while not done:
    a = alg.agent.get_action(s)
    s, r, done, _ = env.step(a)

where alg is one of [ray, random, heuristic1, heuristic2].
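The loop above can be made concrete with a self-contained sketch. The environment and policies here are stand-ins (a stub env and two hypothetical baselines), not RLlib code; in practice, `get_action` for the "ray" entry would wrap the trained trainer's `compute_action`:

```python
import random

class StubEnv:
    """Tiny episodic stand-in env: episode ends after 5 steps, reward 1 per step."""
    def reset(self, seed=None):
        random.seed(seed)
        self.t = 0
        return 0.0  # initial observation
    def step(self, action):
        self.t += 1
        # (next_obs, reward, done, info), matching the old gym step API
        return float(self.t), 1.0, self.t >= 5, {}

def run_episode(get_action, env, seed):
    """Run one episode with a fixed seed and return the total reward."""
    s = env.reset(seed=seed)
    total = 0.0
    while True:
        a = get_action(s)
        s, r, done, _ = env.step(a)
        total += r
        if done:
            return total

env = StubEnv()
# Hypothetical baselines; the trained Ray policy would be another entry here.
policies = {
    "random": lambda s: random.choice([0, 1]),
    "heuristic1": lambda s: 0,
}
# Same seed for every policy, so the episodes are comparable.
returns = {name: run_episode(pi, env, seed=42) for name, pi in policies.items()}
```

Because every policy sees the same seed, the per-policy returns can be compared directly across evaluation rounds.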
My question is: how can I call the current policy for this purpose? I assume agent.compute_action(s) does that. If so, the question becomes how it handles stochasticity. For example, if the algorithm is DQN, does it use epsilon=0? How does it work in policy-gradient-based algorithms like A2C?

You’re right, agent.compute_action(s) does that.
I think by default it uses the stochastic policy, but you can pass the argument explore=False to get a deterministic policy:

agent.compute_action(s, explore=False)

As far as I understand, in policy-gradient algorithms it takes the argmax, i.e. the action with the highest probability.
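The stochastic-vs-deterministic distinction can be illustrated outside RLlib with plain NumPy. This is a sketch of the idea, not RLlib internals: with exploration on, the action is sampled from the policy's softmax distribution over logits; with exploration off, the argmax is taken, so repeated calls on the same observation always return the same action:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical action logits a policy network might output for one observation.
logits = np.array([0.1, 2.0, 0.3])
probs = np.exp(logits) / np.exp(logits).sum()  # softmax over actions

stochastic = rng.choice(len(probs), p=probs)  # explore=True: sample, may vary
deterministic = int(np.argmax(probs))         # explore=False: always action 1 here
```

For DQN-style epsilon-greedy exploration the analogue is setting epsilon to 0, which likewise collapses to the greedy (argmax over Q-values) action.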

Thanks for the answer!
One more question: the compute_action API code says the observation is an object. When I pass a list or numpy.array, it returns an error:
AttributeError: 'list' object has no attribute 'items'

What kind of object is the observation supposed to be?

    def compute_actions(self, ...):
        """Computes an action for the specified policy on the local Worker.

        Note that you can also access the policy object through
        self.get_policy(policy_id) and call compute_actions() on it directly.

        Args:
            observation (obj): observation from the environment.
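The traceback can be reproduced outside RLlib: a list has no .items() method. That is consistent with compute_actions (plural) iterating its observation argument as a dict (a hedged guess: keyed by agent/policy id, multi-agent style), while compute_action (singular) is the entry point that takes a single observation matching the env's observation space:

```python
# Reproducing the error from the post without RLlib.
obs_list = [0.0, 1.0]
try:
    obs_list.items()  # dict-style iteration on a list
    failed = False
except AttributeError as err:
    failed = True
    message = str(err)  # same message as in the traceback above
```

So passing a bare list or array to the plural API would raise exactly this AttributeError.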