See also @mannyv's answer in another topic. You also need to set `"sgd_minibatch_size" > "max_seq_len"`. As I cannot see your code, it's hard to make remote guesses.
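To illustrate that constraint, here is a minimal sketch of the two relevant config keys (the values are made up, and the exact key names can vary between RLlib versions, so check your version's docs):

```python
# Hypothetical PPO-style config fragment: with an RNN model, the minibatch
# must be able to hold at least one full padded sequence, so the minibatch
# size has to exceed the max sequence length.
config = {
    "sgd_minibatch_size": 128,  # SGD minibatch size per epoch
    "max_seq_len": 20,          # RNN rollout fragments get padded to this length
}
assert config["sgd_minibatch_size"] > config["max_seq_len"]
```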
I would - no matter if custom or default - debug a whole trainer/sampler iteration to see what happens to the logits after evaluation/training. You say you use a custom RNN? Either it already outputs the NaNs, or they appear somewhere later in the trainer/sampler iteration; somewhere these values must first occur. When running on Kubernetes you might be able to execute the code in a single container and use `"local_mode"=True`.
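A cheap way to narrow down where the NaNs first show up is to drop a small check after each step of the iteration (after the model's forward pass, after the loss, etc.). A sketch with NumPy (the helper name is my own, not an RLlib API):

```python
import numpy as np

def first_nan_report(name, arr):
    """Return a short diagnostic string if `arr` contains NaNs, else None.

    Call this on the logits right after the forward pass, and again on
    intermediate tensors further down the pipeline, to find the first
    place NaNs appear.
    """
    arr = np.asarray(arr)
    mask = np.isnan(arr)
    if not mask.any():
        return None
    # Index of the first NaN, converted to plain ints for readable output.
    idx = tuple(int(i) for i in np.argwhere(mask)[0])
    return f"{name}: {mask.sum()} NaN(s), first at index {idx}"

# Example: logits with a NaN smuggled in.
logits = np.array([[0.1, 0.9], [float("nan"), 0.5]])
print(first_nan_report("logits", logits))  # → logits: 1 NaN(s), first at index (1, 0)
```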
You could also run a local Minikube and see if you can replicate the error there.
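For reference, local mode is just a flag on `ray.init` - a config sketch (assumes Ray is installed; everything then runs serially in one process, which makes stack traces and debuggers usable):

```python
import ray

# All tasks/actors run sequentially in this single process instead of
# being scheduled on remote workers.
ray.init(local_mode=True)
```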