When I add train.save_checkpoint(epoch, model) in train_func of the script train_fashion_mnist_example, an error occurs:
RuntimeError: Some workers returned results while others didn’t. Make sure that train.report() and train.checkpoint() are called the same number of times on all workers.
train.save_checkpoint follows train.report, and train.report did not raise an error, but train.save_checkpoint did.
Did I call the method correctly? Or is there any other configuration I should be aware of?
Hmm this usage looks right to me. How many workers are you training with? Are you using spot instances or is there anything that can cause any of the workers to fail?
Also, do you mind sharing the full stack trace? Thanks!
Here’s the full stack trace.
The error only occurs when I add save_checkpoint, with num_workers = 2.
Problem solved. There was a pickle error when saving the checkpoint. When I moved the definition of NeuralNetwork to another file and imported it, everything worked fine.
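For anyone hitting the same thing: the thread does not show the exact pickle error, so the following is only a rough illustration of the general mechanism, not a reproduction of what Ray does internally. Standard pickle stores a class by its module path and name, so an object whose class cannot be looked up through an import fails to pickle, while the same class defined in an importable file round-trips. The module name model_def, the temp directory, and the plain-Python NeuralNetwork stand-in (no torch) are all assumptions made for the sketch.

```python
import importlib
import pickle
import sys
import tempfile
from pathlib import Path

# Hypothetical stand-in for the model class. The real NeuralNetwork in the
# example script is a torch.nn.Module; torch is not needed to show the idea.
MODEL_SOURCE = '''
class NeuralNetwork:
    def __init__(self, hidden_size=512):
        self.hidden_size = hidden_size
'''


def make_local_model():
    # A class defined inside a function has no importable path,
    # so pickle cannot record a reference to it.
    class NeuralNetwork:
        def __init__(self, hidden_size=512):
            self.hidden_size = hidden_size

    return NeuralNetwork()


def can_pickle(obj):
    # Returns True if standard pickle can serialize obj.
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError):
        return False


def load_model_module(directory):
    # Write the class into its own file (hypothetical name: model_def.py)
    # and import it, mirroring "move the definition to another file".
    (Path(directory) / "model_def.py").write_text(MODEL_SOURCE)
    sys.path.insert(0, str(directory))
    importlib.invalidate_caches()
    return importlib.import_module("model_def")


tmp_dir = tempfile.mkdtemp()
model_def = load_model_module(tmp_dir)

local_ok = can_pickle(make_local_model())
imported_model = model_def.NeuralNetwork(hidden_size=128)
restored = pickle.loads(pickle.dumps(imported_model))

print("locally defined class picklable:", local_ok)
print("imported class restored from module:", type(restored).__module__)
```

In a real training script the equivalent change would be to keep NeuralNetwork in its own file next to the script and import it from there, as described above. The file has to be importable on every worker as well as on the driver.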
How did you solve this problem? I have the same problem. Please send me your solution when you see this.
Hey @jikuaij,
Can you share the code you are running and the output traceback?