You don’t have a termination issue as far as I can see from your description. The dataset is being trained normally.
With such a large batch size and a large num_sgd_iter, it takes a long time to complete an episode. In my opinion the default value for this (32) is quite high and really slows down training. I usually set it to a value between 10 and 15. I do not recommend 1, but I sometimes use it at the beginning of training a new environment or configuration, as a sanity check and to produce timing estimates.
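To get a rough feel for how much num_sgd_iter affects the time per training iteration, here is a minimal sketch. It assumes the usual PPO scheme in which each collected training batch is split into minibatches and the whole batch is passed over num_sgd_iter times. The names train_batch_size and sgd_minibatch_size, the helper function, and the numbers are illustrative assumptions, not values taken from your config.

```python
import math


def sgd_steps_per_train_batch(train_batch_size, sgd_minibatch_size, num_sgd_iter):
    """Approximate number of minibatch SGD updates per training iteration.

    Assumption: the batch is split into ceil(batch / minibatch) minibatches
    and every minibatch is visited once per SGD pass.
    """
    minibatches_per_pass = math.ceil(train_batch_size / sgd_minibatch_size)
    return minibatches_per_pass * num_sgd_iter


# Hypothetical sizes, chosen only to make the arithmetic easy to follow.
steps_at_32 = sgd_steps_per_train_batch(4096, 128, 32)
steps_at_10 = sgd_steps_per_train_batch(4096, 128, 10)
steps_at_1 = sgd_steps_per_train_batch(4096, 128, 1)

print(steps_at_32, steps_at_10, steps_at_1)
```

Under these assumptions, dropping num_sgd_iter from 32 to 10 cuts the number of updates per iteration by roughly a factor of three, which is why a value of 1 is handy for quick timing estimates.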
There are three values that work together to determine how many iterations through the dataset there are during training. It is
Perhaps we should clarify something. In this case, the NaN you are seeing probably does not mean that the rewards collected from the environment are NaN, although in other cases it could.
What is happening here is that the episode_reward value does not update until an episode returns done. Until it completes one full episode, the value shows as NaN. The actual rewards are probably not NaN; it is just the summary reporting metric that shows NaN because there is no data yet.
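The effect can be illustrated with a small sketch. This is not RLlib's actual reporting code; the function and variable names are hypothetical. It only assumes that the summary metric is an average over completed episodes, which is undefined (NaN) while that list is still empty.

```python
import math


def episode_reward_mean(completed_episode_returns):
    """Mean return over completed episodes; NaN if none have finished yet."""
    if not completed_episode_returns:
        return float("nan")
    return sum(completed_episode_returns) / len(completed_episode_returns)


# The per-step rewards themselves are ordinary finite numbers.
step_rewards = [1.0, 0.5, -0.2]

completed = []  # no episode has returned done yet
before = episode_reward_mean(completed)

# Once the episode finishes, its total return is recorded.
completed.append(sum(step_rewards))
after = episode_reward_mean(completed)

print(before, after)
```

Before the first episode finishes the metric is NaN even though every individual reward is a valid number; after it finishes, the metric becomes a normal value.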