I have a Horovod training job using 4 GPUs, 2 GPUs on each of two physical nodes. After the training completes, I noticed that only the logs from the node containing the rank 0 worker get synced to the driver node. The folder structure is like:
There used to be some documentation around this, but it was removed to avoid confusion. The team is working on making checkpointing (in this case with ddp) more intuitive.
To answer the direct question: one assumption we make about ddp is that the model checkpoint and metrics are the same across workers, so we only sync from the rank 0 worker (this applies to both checkpoints and metrics).
One quick question: how are you using HorovodTrainer? Are you using the session.report() API?
Thanks @xwjiang2010 for confirming. We are using HorovodTrainer for Keras models. To report metrics, we are using ray.air.callbacks.keras.Callback. One problem I had is that using ray.air.callbacks.keras.Callback as a Horovod callback doesn't give me test metrics, because model.fit() only gives train and val metrics, and I run model.evaluate() on the test set once at the end of training.
If I add session.report() at the end to report test metrics to MLflow, it works fine for plain training, but I run into issues when using it for HPO. Do you have any suggestions for how to report test metrics for Keras models?
Just so I understand:
When you say "test" metrics, do you mean the evaluation you do after training through all epochs? It seems Keras' model.fit() doesn't support that natively. In that case, there isn't much our AIR Keras Callback can do about it: it basically leverages the Keras Callback API and adapts it to what AIR expects.
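Conceptually, that adaptation looks something like the sketch below. This is an illustration only, not the actual AIR implementation: the real callback subclasses keras.callbacks.Callback and calls session.report() on Keras's hooks, whereas this sketch just returns the metrics dict so the idea is easy to see.

```python
# Illustrative sketch only -- NOT the real AIR Keras callback. The real
# one subclasses keras.callbacks.Callback and calls session.report();
# here we return the metrics so the flow is visible.
class AirKerasCallbackSketch:
    def on_epoch_end(self, epoch, logs=None):
        # Keras hands per-epoch metrics in `logs` (train + val only).
        # The AIR callback forwards these, e.g. session.report(dict(logs)).
        metrics = dict(logs or {})
        return metrics
```

Because a standalone model.evaluate() call after training doesn't go through fit's per-epoch hooks, metrics from that final test pass never reach a callback wired up this way.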
Are test metrics all you care about, or do you want all metrics (training, eval, and test)? If so, do you want them presented together?
How is your training function written? Is it something like:
test_metrics = model.evaluate(test_dataset) # called outside of `model.fit()`
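i.e., roughly this shape? (A hedged sketch assuming Ray AIR's session.report() API; `build_model`, the dataset variables, and the config keys are placeholders, not real names from this thread.)

```python
# Hedged sketch of a HorovodTrainer training function that reports the
# final test metrics once at the end. Placeholder names: build_model,
# train_dataset, val_dataset, test_dataset.
def train_loop_per_worker(config):
    import horovod.keras as hvd
    from ray.air import session
    from ray.air.callbacks.keras import Callback as AirKerasCallback

    hvd.init()
    model = build_model(config)              # placeholder model factory
    model.fit(
        train_dataset,                       # placeholder datasets
        validation_data=val_dataset,
        epochs=config["epochs"],
        callbacks=[AirKerasCallback()],      # streams train/val metrics per epoch
    )

    # model.fit() only surfaces train/val metrics, so run the test pass
    # once after training and report it explicitly.
    test_loss, test_acc = model.evaluate(test_dataset)
    session.report({"test_loss": test_loss, "test_accuracy": test_acc})
```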
I think reporting to MLflow is fine. I'd like to understand why you said it doesn't work for HPO. Are you using the MLflow callback or setup_mlflow?
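For clarity, a hedged sketch of the two options I'm asking about; the module path `ray.air.integrations.mlflow` and exact signatures are assumptions that vary across Ray versions.

```python
# Option A (driver side): attach MLflowLoggerCallback to the run config;
# it logs whatever session.report() emits for every trial, e.g.:
#   from ray.air.integrations.mlflow import MLflowLoggerCallback
#   run_config = air.RunConfig(callbacks=[MLflowLoggerCallback(...)])

# Option B (worker side): call setup_mlflow() inside the training
# function and log metrics directly from there.
def train_func(config):
    from ray.air.integrations.mlflow import setup_mlflow

    mlflow = setup_mlflow(config)   # per-trial mlflow handle
    ...                             # training / evaluation here
    # mlflow.log_metrics({"test_loss": 0.1})
```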