I want to refine the action-selection step of the algorithm. By default, the action is chosen greedily:
a = argmax_a Q(f(s), a; θ)
The environment then transitions from the old state to the new one, and the tuple (old state, action, reward, new state) is stored in the replay buffer.
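A minimal sketch of that default step, assuming a q_network that returns one value per action, a preprocessing function f, and a replay buffer with an add method (all of these names are placeholders I am using for illustration, not the actual API):

import numpy as np

def default_step(env, obs, q_network, f, replay_buffer):
    # Greedy action: a = argmax_a Q(f(s), a; theta)
    q_values = q_network(f(obs))        # one Q-value per action
    action = int(np.argmax(q_values))

    # Transition from the old state to the new one
    new_obs, rew, done, info = env.step(action)

    # Store (old state, action, reward, new state) in the buffer
    replay_buffer.add(obs, action, rew, new_obs, done)
    return new_obs, done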
I want to change the get-action function so that it looks like this:
while not condition:
    action = policy.compute_action(obs)
    new_obs, rew, done, info = env.step(action)
    if condition:
        break
    # store the transition before advancing to the new state
    replay_buffer.add(obs, action, rew, new_obs, done)
    obs = new_obs
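For context, this is roughly how I picture wrapping that loop into a rollout function. The collect_rollout name and the stop_condition callable are placeholders of my own, and policy.compute_action / replay_buffer.add are assumed to behave as in the snippet above:

def collect_rollout(env, policy, replay_buffer, stop_condition):
    obs = env.reset()
    while not stop_condition():
        action = policy.compute_action(obs)
        new_obs, rew, done, info = env.step(action)
        # re-check the custom condition after the step, mirroring the snippet
        if stop_condition():
            break
        replay_buffer.add(obs, action, rew, new_obs, done)
        obs = new_obs
    return obs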
How do I customize this?