RuntimeError: Some workers returned results while others didn’t. Make sure that train.report() and train.checkpoint() are called the same number of times on all workers.
I call train.save_checkpoint right after train.report. train.report did not raise an error, but train.save_checkpoint did.
Am I calling the method correctly? Or is there some other configuration I should be aware of?
Hmm, this usage looks right to me. How many workers are you training with? Are you using spot instances, or is there anything else that could cause any of the workers to fail?
Also, do you mind sharing the full stack trace? Thanks!
Problem solved. There was a pickle error when saving the checkpoint. When I moved the definition of NeuralNetwork to another file and imported it, everything worked fine.
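For anyone who lands here with the same error, here is a minimal sketch of the underlying pickle behavior, using only the standard library. It is not Ray Train's actual checkpointing code, and the original training script isn't shown in this thread, so the names below (`can_pickle`, `make_local_model`, `LocalNet`) are hypothetical. The point it illustrates: pickle stores a class by its import path rather than by its code, so an instance of a class that cannot be looked up by that path will most likely fail to serialize, while plain built-in data will not.

```python
import pickle


def can_pickle(obj):
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        # The exact exception type varies across Python versions.
        return False


def make_local_model():
    # Hypothetical stand-in for a model class. Because it is defined
    # inside a function, it has no importable path for pickle to record.
    class LocalNet:
        def __init__(self):
            self.weights = [0.1, 0.2]

    return LocalNet()


model = make_local_model()
plain_state = {"weights": model.weights}  # built-in types only

print("model picklable:", can_pickle(model))
print("plain state picklable:", can_pickle(plain_state))
```

Running a check like this on whatever you pass to train.save_checkpoint, before training starts, may surface the problem earlier than the "Some workers returned results" error does. Moving the class into its own importable module, as described above, or checkpointing only plain state are two possible ways around it.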