Error occurs when calling save_checkpoint

tangcc1127 · January 26, 2022, 4:02am

When I add train.save_checkpoint(epoch, model) to train_func in the train_fashion_mnist_example script, this error occurs:

RuntimeError: Some workers returned results while others didn’t. Make sure that train.report() and train.checkpoint() are called the same number of times on all workers.

train.save_checkpoint is called right after train.report. train.report does not raise an error, but train.save_checkpoint does.

Am I calling the method correctly? Or is there some other configuration I should be aware of?

amogkam · January 26, 2022, 4:05am

Hmm this usage looks right to me. How many workers are you training with? Are you using spot instances or is there anything that can cause any of the workers to fail?

Also, do you mind sharing the full stack trace? Thanks!

tangcc1127 · January 26, 2022, 4:34am

Here’s the full stack trace.

The error only occurs when I add save_checkpoint, with num_workers = 2.

tangcc1127 · January 28, 2022, 3:05am

Problem solved. There was a pickle error when saving the checkpoint. After I moved the definition of NeuralNetwork to another file and imported it, everything works fine.
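For anyone hitting the same message, here is a rough way to check whether the objects you pass to save_checkpoint can be pickled at all before handing them to Ray. This is only a sketch: find_unpicklable and make_local_model are hypothetical helper names, not part of Ray Train, and it uses the standard pickle module, which is stricter than the cloudpickle-based serializer Ray uses. A failure here is a hint about where to look, not proof of the cause.

```python
import pickle


def find_unpicklable(checkpoint: dict) -> dict:
    """Return {key: error message} for every checkpoint value that pickle rejects.

    Hypothetical helper for debugging; not a Ray Train API.
    """
    problems = {}
    for key, value in checkpoint.items():
        try:
            pickle.dumps(value)
        except Exception as exc:  # pickle raises several different exception types
            problems[key] = f"{type(exc).__name__}: {exc}"
    return problems


def make_local_model():
    """Build an instance of a class defined inside a function (illustration only)."""

    # A class defined inside a function cannot be looked up by name,
    # so the standard pickle module refuses to serialize its instances.
    class LocalModel:
        pass

    return LocalModel()


if __name__ == "__main__":
    checkpoint = {
        "epoch": 3,
        "state": {"weights": [0.1, 0.2]},  # plain data pickles fine
        "model": make_local_model(),       # stand-in for a problematic object
    }
    for key, error in find_unpicklable(checkpoint).items():
        print(f"{key} is not picklable: {error}")
```

If a key shows up in the result, moving the class definition into its own importable module (as described above) or saving plain data instead of the object itself are the two things to try first.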

jikuaij · March 7, 2022, 12:02pm

How did you solve this problem? I have the same problem. Please share your solution when you see this.

matthewdeng · March 7, 2022, 10:37pm

Hey @jikuaij,

Can you share the code you are running and the output traceback?
