I want to refine this step of the algorithm. By default, the action is selected greedily:
a = argmax_a Q(f(s), a; θ)
The environment then transitions from the old state to the new one, and the tuple (old state, action, reward, new state) is stored in a replay buffer.
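The default step described above can be sketched roughly as follows. This is a minimal illustration, not any library's actual API: `ReplayBuffer` and `greedy_action` are hypothetical helpers, and the Q-values are a toy array standing in for a real Q(f(s), a; θ) network output.

```python
from collections import deque

import numpy as np


class ReplayBuffer:
    """Fixed-size buffer storing (s, a, r, s', done) transitions.

    Hypothetical stand-in for whatever buffer the framework provides.
    """

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, obs, action, rew, new_obs, done):
        self.buffer.append((obs, action, rew, new_obs, done))

    def __len__(self):
        return len(self.buffer)


def greedy_action(q_values):
    """a = argmax_a Q(f(s), a; θ): pick the action with the highest Q-value."""
    return int(np.argmax(q_values))


# Toy example: Q-values for 3 actions in some state s.
q_values = np.array([0.1, 0.7, 0.2])
action = greedy_action(q_values)  # action 1 has the highest Q-value

# Store the resulting transition (dummy values for reward / next state).
buffer = ReplayBuffer()
buffer.add(obs=0, action=action, rew=1.0, new_obs=1, done=False)
```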
I want to change the get-action function so that it looks like this:
```python
while not condition:
    action = policy.compute_action(obs)
    new_obs, rew, done, info = env.step(action)
    # Store the transition before overwriting obs, so the buffer
    # receives the *old* observation as the first element.
    replay_buffer.add(obs, action, rew, new_obs, done)
    if condition:
        break
    obs = new_obs
```

(Note: in my original draft, `obs = new_obs` ran before `replay_buffer.add(...)`, so the buffer stored the new observation twice; I have moved the `add` call up to fix that.)
How do I customize this?