I want to use the policy trained by Ray during training for some benchmark comparisons, every few training iterations. For that, I want to create a given instance of the env with a given seed and then run one episode with each of the benchmarks as well as the Ray policy. In sum, the pseudocode would look like this:
for i in range(n):
    s = env.reset(seed_number=n)  # same seed for every algorithm, so the episodes are comparable
    while True:
        a = alg.agent.get_action(s)
        ns, r, d, _ = env.step(a)
        save_state(s, a, r, d)
        s = ns  # advance to the next state
        if d:
            break
in which alg is one of [ray, random, heuristic1, heuristic2].
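For reference, here is a minimal runnable sketch of that loop against an RLlib trainer (assumptions: the old Gym API with env.seed() and a 4-tuple step(), CartPole-v0 as a stand-in env, and a DQN trainer; the random policy and the episode-return bookkeeping are placeholders for your own benchmarks and save_state):

import gym
import numpy as np
import ray
from ray.rllib.agents.dqn import DQNTrainer

ray.init(ignore_reinit_error=True)
trainer = DQNTrainer(env="CartPole-v0")  # stands in for the trainer you are already training

def rllib_policy(obs, env):
    # explore=False -> deterministic action from the current policy
    return trainer.compute_action(obs, explore=False)

def random_policy(obs, env):
    return env.action_space.sample()

def run_episode(policy_fn, env, seed):
    env.seed(seed)  # seed the env the same way for every policy
    obs = env.reset()
    episode_return = 0.0
    while True:
        action = policy_fn(obs, env)
        obs, reward, done, _ = env.step(action)
        episode_return += reward
        if done:
            return episode_return

env = gym.make("CartPole-v0")
for name, policy_fn in [("ray", rllib_policy), ("random", random_policy)]:
    returns = [run_episode(policy_fn, env, seed) for seed in range(5)]
    print(name, np.mean(returns))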
My question is: how can I call the current policy for this purpose? I assume agent.compute_action(s) does that. If so, the question would be how it handles stochasticity. For example, if the algorithm is DQN, does it use epsilon=0? How does it work in policy-gradient-based algorithms like A2C?
You’re right, agent.compute_action(s) does that.
I think by default it uses the stochastic policy, but you can pass the argument explore set to False to get a deterministic policy:
agent.compute_action(s, explore=False)
As far as I understand, in policy gradient algorithms it uses argmax to pick the action with the highest probability.
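To make the difference concrete, here is a small sketch (assuming the RLlib ~1.x import path ray.rllib.agents.a3c.A2CTrainer and CartPole-v0 as an example env): with the default stochastic sampling, repeated calls on the same observation can return different actions, while explore=False always returns the deterministic one (the argmax of the categorical distribution for a discrete A2C/PPO policy, and the greedy action for DQN):

import gym
import ray
from ray.rllib.agents.a3c import A2CTrainer

ray.init(ignore_reinit_error=True)
trainer = A2CTrainer(env="CartPole-v0")

env = gym.make("CartPole-v0")
obs = env.reset()

# Default: sample from the action distribution -> the calls below can differ.
sampled = [trainer.compute_action(obs) for _ in range(5)]

# explore=False: deterministic action, identical on every call.
greedy = [trainer.compute_action(obs, explore=False) for _ in range(5)]

print("sampled:", sampled)
print("greedy: ", greedy)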
Thanks for the answer!
And one more question. In the compute_action API code, it says that the observation is an object. When I pass a list or a numpy.array, it returns an error:
AttributeError: 'list' object has no attribute 'items'
What kind of object is the observation supposed to be?
def compute_actions(self,
                    observations,
                    state=None,
                    prev_action=None,
                    prev_reward=None,
                    info=None,
                    policy_id=DEFAULT_POLICY_ID,
                    full_fetch=False,
                    explore=None):
    """Computes an action for the specified policy on the local Worker.

    Note that you can also access the policy object through
    self.get_policy(policy_id) and call compute_actions() on it directly.

    Args:
        observation (obj): observation from the environment.
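Judging from the .items() in the traceback, compute_actions (plural) seems to expect a dict of observations keyed by agent/episode id, which is convenient for multi-agent or vectorized setups, while the singular compute_action takes a single raw observation (list or numpy array) like the one a Gym env returns. A sketch under that assumption (the key name "agent_0" is arbitrary):

import gym
import ray
from ray.rllib.agents.dqn import DQNTrainer

ray.init(ignore_reinit_error=True)
trainer = DQNTrainer(env="CartPole-v0")

env = gym.make("CartPole-v0")
obs = env.reset()

# Single raw observation: the singular method accepts it directly.
action = trainer.compute_action(obs, explore=False)

# The plural method appears to want a dict keyed by agent/episode id
# (assumption based on the AttributeError above); it returns a dict of actions.
actions = trainer.compute_actions({"agent_0": obs}, explore=False)

print(action, actions)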