When I add train.save_checkpoint(epoch, model) to train_func in the train_fasion_mnist_example script, this error occurs:
RuntimeError: Some workers returned results while others didn’t. Make sure that
train.checkpoint() are called the same number of times on all workers.
train.report did not raise an error, but train.save_checkpoint does.
Did I call the method correctly? Or is there some other configuration I should be aware of?
Hmm this usage looks right to me. How many workers are you training with? Are you using spot instances or is there anything that can cause any of the workers to fail?
Also, do you mind sharing the full stack trace? Thanks!
Here’s the full stack trace.
The error only occurs when I add save_checkpoint, with num_workers = 2.
Problem solved. There was a pickle error when saving the checkpoint. When I moved the definition of NeuralNetwork to another file and imported it, everything worked fine.
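For anyone hitting the same thing, here is a rough way to check which entry of a checkpoint fails to serialize before passing it to save_checkpoint. This is only a sketch: find_unpicklable and make_local_model are hypothetical helpers, not part of Ray, and it uses the standard pickle module, which may behave differently from the serializer Ray uses internally. The class defined inside a function is just a stand-in for a model class that cannot be looked up by name.

```python
import pickle


def find_unpicklable(checkpoint):
    """Return the keys of `checkpoint` whose values fail to pickle.

    Hypothetical helper for debugging; not part of any library.
    """
    bad_keys = []
    for key, value in checkpoint.items():
        try:
            pickle.dumps(value)
        except Exception:  # e.g. PicklingError, AttributeError, TypeError
            bad_keys.append(key)
    return bad_keys


def make_local_model():
    # A class defined inside a function cannot be found by its qualified
    # name, so the standard pickle module refuses to serialize instances.
    class LocalNet:
        def __init__(self):
            self.weights = [0.1, 0.2]

    return LocalNet()


if __name__ == "__main__":
    checkpoint = {"epoch": 3, "model": make_local_model()}
    # Prints the keys that could not be pickled.
    print(find_unpicklable(checkpoint))
```

If a key shows up in the result, defining that class at the top level of an importable module (as I did with NeuralNetwork) is one thing to try.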
How did you solve this problem? I have the same problem. Please share your solution when you see this.
Can you share the code you are running and the output traceback?