Error occurs when calling save_checkpoint

When I add train.save_checkpoint(epoch, model) to train_func in the train_fashion_mnist_example script, an error occurs:

RuntimeError: Some workers returned results while others didn’t. Make sure that train.report() and train.checkpoint() are called the same number of times on all workers.

train.save_checkpoint is called right after train.report. train.report did not raise an error, but train.save_checkpoint did.

Did I call the method correctly? Or is there some other configuration I should be aware of?

Hmm, this usage looks right to me. How many workers are you training with? Are you using spot instances, or is there anything else that could cause any of the workers to fail?

Also, do you mind sharing the full stack trace? Thanks!

Here’s the full stack trace.

The error only occurs when I add save_checkpoint, with num_workers = 2.

Problem solved. There was a pickle error when saving the checkpoint. When I moved the definition of NeuralNetwork to another file and imported it, everything worked fine.
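For anyone hitting a similar error, one quick way to check whether a pickling problem is involved is to try pickling the object yourself before handing it to save_checkpoint. The snippet below is only a rough sketch using the standard-library pickle module. The helper pickle_error and the stand-in LocalModel class are hypothetical names made up for illustration, and the nested class is just an easy way to trigger a pickling failure, not necessarily the same failure as in this thread. Ray's own serializer may accept objects that the standard pickle module rejects, so treat a failure here as a hint rather than proof.

```python
import pickle


def pickle_error(obj):
    """Return None if obj pickles cleanly, else the exception that was raised."""
    try:
        pickle.dumps(obj)
        return None
    except Exception as exc:  # pickle can raise several exception types
        return exc


def make_local_instance():
    # Hypothetical stand-in for a model whose class lives in a nested scope,
    # which the standard pickle module cannot look up by name.
    class LocalModel:
        pass

    return LocalModel()


if __name__ == "__main__":
    # Plain data such as a dict of builtins pickles without trouble.
    print(pickle_error({"epoch": 1}))
    # An instance of a locally defined class does not.
    print(pickle_error(make_local_instance()))
```

If the check reports an exception for your model, moving the class definition into its own importable module (as described above) or checkpointing plain data instead of the object itself are the usual things to try.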

How did you solve this problem? I have the same problem. Please send me your solution when you see this.

Hey @jikuaij,

Can you share the code you are running and the output traceback?