Hi @mg64ve
There are 3 (sometimes 4) distinct phases in which the model is called.
Your debugging has revealed 2 of them.
These 3 are from the initialization phase. They come from a dummy batch of all zeros. I ignore these unless they are generating an error.
torch.Size([32, 1, 2])
torch.Size([1, 1, 2])
torch.Size([4, 8, 2])
These are from the rollout phase, when compute_actions is called to sample new trajectories from the environment. Your config has 20 envs per worker, each of which takes 1 step per call.
torch.Size([20, 1, 2])
torch.Size([20, 1, 2])
After you collect 4000 steps, the training phase will run. You have not reported that phase, but when you hit it the input will have shape [num_episodes, max_seq, 2]. The max_seq is dynamic by default, so if none of your episodes lasted 20 steps it will be shorter than that.
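To make the dynamic max_seq concrete, here is a rough sketch in plain Python of how the training-phase shape follows from the episode lengths. The helper name `padded_train_shape` is made up for illustration and is not part of RLlib. It also assumes that episodes longer than the cap (20 here) are split into chunks of at most that length, so the first dimension counts sequences rather than whole episodes in that case.

```python
def padded_train_shape(episode_lens, max_seq_len=20, feature_dim=2):
    """Hypothetical helper: shape of the padded training batch."""
    # Assumption: an episode longer than max_seq_len is split into
    # chunks of at most max_seq_len steps each.
    chunk_lens = []
    for n in episode_lens:
        while n > 0:
            chunk_lens.append(min(n, max_seq_len))
            n -= max_seq_len
    # Dynamic max: pad only up to the longest chunk actually present,
    # not all the way to max_seq_len.
    dynamic_max = max(chunk_lens)
    return (len(chunk_lens), dynamic_max, feature_dim)


# No episode reached 20 steps, so the time dimension is only 12.
print(padded_train_shape([5, 12, 7]))  # (3, 12, 2)
# One long episode is chunked into 20 + 20 + 5 steps.
print(padded_train_shape([5, 45]))     # (4, 20, 2)
```

So seeing a middle dimension smaller than 20 in the training phase is expected and not a bug.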
Happy New Year