Ray Train sync from worker to head

mizhazha · April 3, 2023, 6:03pm

Hi Ray team,

I have a horovod training job using 4GPU, 2GPU from each physical nodes. After the training complete, I noticed that only logs from one node (that contains the rank_0 worker) is got sync’ed to the driver node. The folder structure is like:

/HorovodTrainer_{}
- rank_0/
  - checkpoints
  - tb_logs/
  - worker_0.log
- rank_1
  - worker_1.log

worker_0.log and worker_1.log is logging from my training code; however, I am not able to find rank_2/ and rank_3/ folder. When I logging into worker pod, I confirmed that rank_2/ and rank_3/ exists.

Can someone point me to a doc/code that explains the worker → head sync’ing logic, for both training and tuning? My guess is that Ray only sync whatever is saved on primary worker to head. Thanks!

xwjiang2010 · April 4, 2023, 10:38pm

Hi,
There used to be some documentation around this. But somehow got removed to avoid confusion. And the team is working on making checkpointing (in this case with ddp) more intuitive.

To answer the direct question, one assumption we made about ddp is that the model checkpoint and metrics will be the same across different workers. So we only sync from rank 0 worker (this applies to both checkpoints and metrics).

One quick question, how are you using HorovodTrainer? Are you using Session.report() API?

xwjiang2010 · April 4, 2023, 10:38pm

cc @gjoliver on checkpoint/artifacts.

mizhazha · April 5, 2023, 11:49pm

Thanks @xwjiang2010 for confirming. We are using horovod trainer for eras models. To report metrics, we are using ray.air.callbacks.keras.Callback. One problem I had is that using ray.air.callbacks.keras.Callback as a horovod callback doesn’t give me test metrics, because model.fit() only give train, val metrics and I do model.evaluate() on test set once at the end of training.

If I add session.report() in the end for reporting test metrics to MLFLow, it works fine for training, but running into issues when using for HPO. Do you have any suggestion for how to report test metric for keras models?

xwjiang2010 · April 6, 2023, 4:00pm

just so I understand.
When you say “test” metrics, you mean the evaluation you do after training through all epochs? Seems like Keras’ model.fit doesn’t support that natively? In that case, it’s not much our AIR Keras Callback can do about it. As you can see, it basically leverages the Keras Callback API and adapt it to what AIR expects.

Is test metrics all you care about or you want all metrics like training, eval and test? If so, do you want them to be presented together?

How is your training function written up? Is it something like?

model.fit()
test_metrics = model.evaluate(test_dataset)  # called outside of `model.fit()`

I think reporting to mlflow is fine. I would like to know why you said it doesn’t work for HPO. Are you using MLFlow callback or setup_mlflow?

mizhazha · April 10, 2023, 3:51am

Sorry for the late reply!

I would like to have train, eval and test in mlflow presented together. Here is my code structure for train_fn

from ray.air.callbacks.keras import Callback as TrainReportCallback
model.fit(..., callbacks = [...,TrainReportCallback()])
results = model.evaluate()
# session.report(results)

I can use session.report() in training together with keras callback to get the test metrics logged in mlflow, but when I do it the same way for tune, I got this error:

RuntimeError: Some workers returned results while others didn't. Make sure that `session.report()` are called the same number of times on all workers.

My evaluate runs on primary worker only, so the error message kind of makes sense. But I guess similar error should happen to train as well, while it isn’t.

What would your suggestion be to have train, eval and test logged in mlflow for keras models?
2. For mlflow, we are using MLFLow callback.

xwjiang2010 · April 10, 2023, 5:32pm

Can you share the script for “I can use session.report() in training together with keras callback to get the test metrics logged in mlflow” and " when I do it the same way for tune, I got this error"?

It’s kind of hard to follow as we are talking about more than one scenario now.

Topic		Replies	Views
Ray Tune Sync Config not syncing logs from worker node Ray Tune	2	418	June 19, 2024
Synchronizing workers during ray train Ray Train	8	767	February 25, 2025
Using Ray Tune in a Ray Cluster, checkpoints not synced back to source node Ray Tune	9	994	April 22, 2021
Getting Tune to read Train checkpoint in ray.train.report Dashboard, Monitoring & Debugging	2	20	April 4, 2025
Ray Train with Horovod does not use all GPUs on the node Ray Train	11	868	June 8, 2022

Ray Train sync from worker to head

Related topics