I want to use the policy trained by Ray during training for some benchmark comparisons, every few training iterations. For that, I want to create a given instance of the env with a given seed and then run one episode with each of the benchmarks as well as the Ray policy. In sum, the pseudocode would look like this:
for i in range(n):
    s = env.reset(seed_number=n)  # same seed for every algorithm, so the episodes are comparable
    while True:
        a = alg.agent.get_action(s)
        ns, r, d, _ = env.step(a)
        save_state(s, a, r, d)
        s = ns  # advance to the next state
        if d:
            break
in which alg is one of [ray, random, heuristic1, heuristic2].
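For reference, here is a minimal runnable sketch of that loop against an RLlib trainer (assumptions: the old Gym API with env.seed() and a 4-tuple step(), CartPole-v0 as a stand-in env, and a DQN trainer; the random policy and the episode-return bookkeeping are placeholders for your own benchmarks and save_state):

import gym
import numpy as np
import ray
from ray.rllib.agents.dqn import DQNTrainer

ray.init(ignore_reinit_error=True)
trainer = DQNTrainer(env="CartPole-v0")  # stands in for the trainer you are already training

def rllib_policy(obs, env):
    # explore=False -> deterministic action from the current policy
    return trainer.compute_action(obs, explore=False)

def random_policy(obs, env):
    return env.action_space.sample()

def run_episode(policy_fn, env, seed):
    env.seed(seed)  # seed the env the same way for every policy
    obs = env.reset()
    episode_return = 0.0
    while True:
        action = policy_fn(obs, env)
        obs, reward, done, _ = env.step(action)
        episode_return += reward
        if done:
            return episode_return

env = gym.make("CartPole-v0")
for name, policy_fn in [("ray", rllib_policy), ("random", random_policy)]:
    returns = [run_episode(policy_fn, env, seed) for seed in range(5)]
    print(name, np.mean(returns))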
My question is: how can I call the current policy for this purpose? I assume agent.compute_action(s) does that. If so, the question would be how it handles stochasticity. For example, if the algorithm is DQN, does it use epsilon=0? How does it work in policy-gradient-based algorithms like A2C?
You’re right, agent.compute_action(s) does that.
I think by default it uses the stochastic policy, but you can pass the argument explore set to False to get a deterministic policy:
agent.compute_action(s, explore=False)
As far as I understand, in policy gradient algorithms it uses argmax to pick the action with the highest probability.
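To make the difference concrete, here is a small sketch (assuming the RLlib ~1.x import path ray.rllib.agents.a3c.A2CTrainer and CartPole-v0 as an example env): with the default stochastic sampling, repeated calls on the same observation can return different actions, while explore=False always returns the deterministic one (the argmax of the categorical distribution for a discrete A2C/PPO policy, and the greedy action for DQN):

import gym
import ray
from ray.rllib.agents.a3c import A2CTrainer

ray.init(ignore_reinit_error=True)
trainer = A2CTrainer(env="CartPole-v0")

env = gym.make("CartPole-v0")
obs = env.reset()

# Default: sample from the action distribution -> the calls below can differ.
sampled = [trainer.compute_action(obs) for _ in range(5)]

# explore=False: deterministic action, identical on every call.
greedy = [trainer.compute_action(obs, explore=False) for _ in range(5)]

print("sampled:", sampled)
print("greedy: ", greedy)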
Thanks for the answer!
And one more question. In the compute_action API code, it says that the observation is an object. When I pass a list or a numpy.array, it returns an error:
AttributeError: 'list' object has no attribute 'items'
What kind of object is the observation supposed to be?
def compute_actions(self,
                    observations,
                    state=None,
                    prev_action=None,
                    prev_reward=None,
                    info=None,
                    policy_id=DEFAULT_POLICY_ID,
                    full_fetch=False,
                    explore=None):
    """Computes an action for the specified policy on the local Worker.

    Note that you can also access the policy object through
    self.get_policy(policy_id) and call compute_actions() on it directly.

    Args:
        observation (obj): observation from the environment.
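Judging from the .items() in the traceback, compute_actions (plural) seems to expect a dict of observations keyed by agent/episode id, which is convenient for multi-agent or vectorized setups, while the singular compute_action takes a single raw observation (list or numpy array) like the one a Gym env returns. A sketch under that assumption (the key name "agent_0" is arbitrary):

import gym
import ray
from ray.rllib.agents.dqn import DQNTrainer

ray.init(ignore_reinit_error=True)
trainer = DQNTrainer(env="CartPole-v0")

env = gym.make("CartPole-v0")
obs = env.reset()

# Single raw observation: the singular method accepts it directly.
action = trainer.compute_action(obs, explore=False)

# The plural method appears to want a dict keyed by agent/episode id
# (assumption based on the AttributeError above); it returns a dict of actions.
actions = trainer.compute_actions({"agent_0": obs}, explore=False)

print(action, actions)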