Hi,
I’m running validation from one worker (out of 4). I’m getting this error:
RuntimeError: Some workers returned results while others didn't. Make sure that train.report() and train.checkpoint() are called the same number of times on all workers.
How do I print worker-specific logs?
Hey @lacruche,
Are you using a specific Callback here? As mentioned in the error message, Ray Train requires each of your workers to call train.report() or train.checkpoint() the same number of times so that they stay synchronized, while filtering of the results is handled on the client (i.e. Trainer) side. For example, the JsonLoggerCallback exposes a workers_to_log arg.
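To make the pattern concrete, here is a minimal plain-Python sketch. It does not call Ray at all: run_worker, NUM_WORKERS and NUM_EPOCHS are hypothetical stand-ins, and appending to a list stands in for train.report(). The point is only that every worker reports once per epoch, even when only rank 0 has a validation metric, and that the filtering happens afterwards on the client side.

```python
from typing import Dict, List, Optional

NUM_WORKERS = 4   # assumption: matches the 4 workers in the question
NUM_EPOCHS = 3    # assumption: arbitrary number of epochs for the sketch


def run_worker(rank: int) -> List[Dict]:
    """Stand-in for a training function; returns what this worker would report."""
    reports: List[Dict] = []
    for epoch in range(NUM_EPOCHS):
        # Only rank 0 runs validation; the metric value here is made up.
        val_loss: Optional[float] = 1.0 / (epoch + 1) if rank == 0 else None
        # Every worker reports exactly once per epoch, even with nothing to share.
        # In a real training function this is where train.report(...) would go.
        reports.append({"rank": rank, "epoch": epoch, "val_loss": val_loss})
    return reports


all_reports = [run_worker(rank) for rank in range(NUM_WORKERS)]

# The synchronization requirement: identical report counts on all workers.
report_counts = [len(worker_reports) for worker_reports in all_reports]

# Client-side filtering: keep only what worker 0 reported.
worker0_results = [
    r for worker_reports in all_reports for r in worker_reports if r["rank"] == 0
]
```

If the append were moved inside an "if rank == 0" branch, report_counts would become uneven, which is the situation the RuntimeError above complains about.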
thanks! no specific callback yet
I’m curious: why is that Callback thing needed?
btw I just tried printing things from the worker and now it works fine… that’s odd
gives me
why would I need to set up the callback, if I can just print things out of the workers this way?
Ah, yes, printing from individual workers is absolutely fine, and a Callback is not necessary!
The benefit of a Callback is that results from all the workers can be processed together on the Trainer. At the moment, this means you can do things such as print the results from all the workers in 1 line or write them to disk on a single node.
This can also be extended to do things such as aggregation across workers, but this isn’t yet provided out-of-the-box! Would you find this useful in any case?
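For illustration, aggregation across workers could look roughly like the sketch below. This is plain Python, not something Ray Train provides: aggregate_results is a hypothetical helper, and it assumes the results arrive as one dict of metrics per worker. It averages every numeric metric that all workers reported.

```python
from statistics import mean
from typing import Any, Dict, List


def aggregate_results(results: List[Dict[str, Any]]) -> Dict[str, float]:
    """Average each numeric metric that is present in every worker's result.

    Hypothetical helper: `results` is assumed to hold one dict per worker.
    """
    if not results:
        return {}

    # Only consider keys that every worker reported.
    shared_keys = set(results[0])
    for worker_result in results[1:]:
        shared_keys &= set(worker_result)

    def is_number(value: Any) -> bool:
        # bool is a subclass of int, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)

    aggregated: Dict[str, float] = {}
    for key in sorted(shared_keys):
        values = [worker_result[key] for worker_result in results]
        if all(is_number(v) for v in values):
            aggregated[key] = mean(values)
    return aggregated


# Example: per-worker results from 4 workers (made-up numbers).
per_worker = [
    {"loss": 1.0, "epoch": 0},
    {"loss": 2.0, "epoch": 0},
    {"loss": 3.0, "epoch": 0},
    {"loss": 2.0, "epoch": 0},
]
averaged = aggregate_results(per_worker)
```

A function like this could be called from a custom Callback on the Trainer side, since that is where the results from all workers are available together.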