Hi, I'm trying to modify the IMPALA architecture so that it doesn't use the default NN policy, but a custom, simpler one (a softmax over some handmade features).
Because this policy is very simple and I can update it without PyTorch, the easiest way I can find to do it (tell me if I'm wrong) is to compute this new policy in a custom callback, inside the `on_learn_on_batch` method, where I have almost everything I need for the update: actions, rewards, and observations.
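Roughly what I have in mind (an untested sketch; `update_handmade_policy` is just a placeholder for my own numpy update rule, and the import path is for recent Ray versions, older ones use `ray.rllib.agents.callbacks`):

```python
from ray.rllib.algorithms.callbacks import DefaultCallbacks
from ray.rllib.policy.sample_batch import SampleBatch


class HandmadePolicyCallback(DefaultCallbacks):
    def on_learn_on_batch(self, *, policy, train_batch, result, **kwargs):
        # The train batch already carries almost everything I need.
        obs = train_batch[SampleBatch.OBS]
        actions = train_batch[SampleBatch.ACTIONS]
        rewards = train_batch[SampleBatch.REWARDS]
        # Placeholder: my own softmax-over-handmade-features update,
        # done in plain numpy, no torch involved.
        # update_handmade_policy(obs, actions, rewards)
```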
The only thing I'm missing is the actual output of the value-function branch for each of these observations.
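If those values aren't stored anywhere, I assume I could recompute them inside the same callback with an extra forward pass, relying on the ModelV2 contract that `value_function()` returns the value branch output for the most recent forward pass. A sketch of what I mean, assuming the torch framework and the default model:

```python
import torch

from ray.rllib.policy.sample_batch import SampleBatch


def recompute_vf_values(policy, train_batch):
    """Re-run the model on the batch observations and read the value branch."""
    obs = torch.as_tensor(train_batch[SampleBatch.OBS], device=policy.device)
    with torch.no_grad():
        # ModelV2: run the forward pass first; value_function() then
        # refers to that same pass.
        policy.model({SampleBatch.OBS: obs}, [], None)
        return policy.model.value_function().cpu().numpy()
```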
In short, I plan to ignore the policy branch of IMPALA's neural network, but use its value-function branch to train a simpler policy (no NN). Diagram below:
And update it every N iterations, as in the IMPALA learner:
Is this approach correct? And if so:

- Are these VF values saved anywhere?
- Or do I need to save them manually at some point in the training process (for example, inside `on_episode_step` in my custom callback)?
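For that manual option, I was picturing something like this, stashing the per-step value estimates in `episode.user_data` (again an untested sketch; I'm not sure IMPALA's policy actually writes `vf_preds` into the extra action outputs, since V-trace recomputes values in the learner, so this dict may simply not contain them):

```python
from ray.rllib.algorithms.callbacks import DefaultCallbacks
from ray.rllib.policy.sample_batch import SampleBatch


class SaveVfValuesCallback(DefaultCallbacks):
    def on_episode_start(self, *, worker, base_env, policies, episode, **kwargs):
        episode.user_data["vf_values"] = []

    def on_episode_step(self, *, worker, base_env, policies=None, episode, **kwargs):
        # Extra outputs of the last computed action; algorithms that
        # compute values on the rollout workers include "vf_preds" here.
        extra = episode.last_extra_action_outs_for()
        if SampleBatch.VF_PREDS in extra:
            episode.user_data["vf_values"].append(extra[SampleBatch.VF_PREDS])
```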