Ray Train sync from worker to head

Hi Ray team,

I have a Horovod training job using 4 GPUs, 2 GPUs on each of two physical nodes. After the training completes, I noticed that only the logs from one node (the one that contains the rank_0 worker) get synced to the driver node. The folder structure looks like:

/HorovodTrainer_{}
- rank_0/
  - checkpoints/
  - tb_logs/
  - worker_0.log
- rank_1/
  - worker_1.log

worker_0.log and worker_1.log are the logs from my training code; however, I am not able to find the rank_2/ and rank_3/ folders. When I log into the worker pod, I can confirm that rank_2/ and rank_3/ exist there.

Can someone point me to a doc/code that explains the worker → head syncing logic, for both training and tuning? My guess is that Ray only syncs whatever is saved on the primary worker to the head. Thanks!

Hi,
There used to be some documentation around this, but it got removed to avoid confusion. The team is working on making checkpointing (in this case with DDP) more intuitive.

To answer the direct question: one assumption we made about DDP is that the model checkpoint and metrics will be the same across all workers, so we only sync from the rank 0 worker (this applies to both checkpoints and metrics).
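To make that assumption concrete, here is a minimal pure-Python sketch of the invariant (this is not Ray's actual implementation; `collect_results` and the data shapes are invented for illustration): because every DDP worker is assumed to report identical metrics and checkpoints, keeping only rank 0's copy loses nothing.

```python
# Hypothetical simulation of the rank-0 sync assumption. NOT Ray's code;
# it only illustrates why syncing rank 0 alone is safe under DDP.
def collect_results(worker_results):
    """worker_results maps rank -> (metrics, checkpoint).

    Under the DDP assumption every worker reports identical data,
    so the driver only needs to keep rank 0's copy.
    """
    assert len(set(worker_results.values())) == 1, "workers diverged"
    return worker_results[0]

# All four workers produce the same metrics and checkpoint name.
reports = {rank: (("loss", 0.25), "checkpoint_000003") for rank in range(4)}
metrics, checkpoint = collect_results(reports)
print(metrics, checkpoint)  # only rank 0's copy reaches the head node
```

The flip side of this assumption is exactly what you observed: anything a non-rank-0 worker writes locally (your rank_2/ and rank_3/ folders) is not part of what gets synced.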

One quick question: how are you using HorovodTrainer? Are you using the session.report() API?

cc @gjoliver on checkpoint/artifacts.

Thanks @xwjiang2010 for confirming. We are using HorovodTrainer for Keras models. To report metrics, we are using ray.air.callbacks.keras.Callback. One problem I ran into: using ray.air.callbacks.keras.Callback as a Horovod callback doesn't give me test metrics, because model.fit() only gives train and val metrics, and I run model.evaluate() on the test set once at the end of training.

If I add session.report() at the end to report the test metrics to MLflow, it works fine for training, but I run into issues when using it for HPO. Do you have any suggestions for how to report test metrics for Keras models?

Just so I understand: when you say "test" metrics, you mean the evaluation you do after training through all epochs? It seems Keras' model.fit() doesn't support that natively. In that case, there isn't much our AIR Keras Callback can do about it; as you can see, it basically leverages the Keras Callback API and adapts it to what AIR expects.

Are test metrics all you care about, or do you want all the metrics: training, eval, and test? If the latter, do you want them presented together?

How is your training function written? Is it something like this?

model.fit()
test_metrics = model.evaluate(test_dataset)  # called outside of `model.fit()`

I think reporting to MLflow is fine. I would like to know why you said it doesn't work for HPO. Are you using the MLflow callback or setup_mlflow?

Sorry for the late reply!

  1. I would like to have train, eval, and test presented together in MLflow. Here is my code structure for train_fn:
from ray.air.callbacks.keras import Callback as TrainReportCallback
model.fit(..., callbacks=[..., TrainReportCallback()])
results = model.evaluate()
# session.report(results)

I can use session.report() in training together with the Keras callback to get the test metrics logged in MLflow, but when I do it the same way for Tune, I get this error:

RuntimeError: Some workers returned results while others didn't. Make sure that `session.report()` are called the same number of times on all workers.

My evaluate() runs on the primary worker only, so the error message kind of makes sense. But I would guess a similar error should happen for training as well, yet it doesn't.
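For reference, that error is a consistency check: every worker must call session.report() the same number of times. Here is a minimal pure-Python sketch of the bookkeeping and the usual fix (the names and counts are invented for illustration; this is not Tune's actual code): either run model.evaluate() on all workers, or have the non-primary ranks make a matching final report.

```python
# Simulation of the report-count check (invented, NOT Tune's real code):
# the number of session.report() calls must match across workers.
def check_reports(report_counts):
    """report_counts maps rank -> number of session.report() calls."""
    if len(set(report_counts.values())) != 1:
        raise RuntimeError(
            "Some workers returned results while others didn't. Make sure "
            "that `session.report()` are called the same number of times "
            "on all workers."
        )

# Buggy pattern: 10 epochs reported everywhere, plus one extra test-metrics
# report on rank 0 only -> counts diverge and the check fails.
try:
    check_reports({0: 11, 1: 10, 2: 10, 3: 10})
except RuntimeError as e:
    print(e)

# Fixed pattern: every rank makes the final report (e.g. run evaluate()
# on all workers, or report a placeholder from non-primary ranks).
check_reports({rank: 11 for rank in range(4)})  # passes
```

This would also explain why training alone doesn't hit the error: the Keras callback fires on every worker, so the per-epoch report counts already match.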

What would your suggestion be to have train, eval, and test metrics logged in MLflow for Keras models?
2. For MLflow, we are using the MLflow callback.

Can you share the scripts for "I can use session.report() in training together with the Keras callback to get the test metrics logged in MLflow" and "when I do it the same way for Tune, I get this error"?

It’s kind of hard to follow as we are talking about more than one scenario now. :slight_smile: