Ray Train with Horovod does not use all GPUs on the node

Nitin_Pasumarthy · April 7, 2022, 6:57am

How severe does this issue affect your experience of using Ray?

High: It blocks me to complete my task.

I’m having a great time using Ray Train. Its interface is simple and yet very powerful!

However, when I use TensorFlow Keras + Horovod for distributed multi-worker training, my script doesn’t use all the GPUs on the node. It only uses GPU0 on each worker with the setup below.

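from ray.train import Trainer

# train_fn, config, and callbacks are defined elsewhere in my script.
trainer = Trainer(backend="horovod", num_workers=2,
                  resources_per_worker={"GPU": 2, "CPU": 4},
                  use_gpu=True, max_retries=0)

results = trainer.run(train_fn, config, callbacks=callbacks)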

Without Train, it works as expected when I launch the training like this:

./horovodrun -np 4 -H ip1:2,ip2:2 python train_deep_model.py

What am I missing? How can I debug it?

amogkam · April 7, 2022, 4:40pm

Hey @Nitin_Pasumarthy thanks for posting the question!

Can you provide more details on the cluster configuration and also what exactly you are seeing?

It seems like you have 2 nodes with 2 GPUs each, is that right? And only the first GPU is being used on each node?

If the above is correct, then this is expected based on how the Trainer is being instantiated. Unlike horovodrun, Ray Train has no concept of separate hosts or nodes. You simply specify the number of workers (i.e. the number of training processes). So if you have a total of 4 GPUs in your cluster, regardless of whether these GPUs are on a single node or multiple nodes, all you need to do is this:
Trainer(backend="horovod", num_workers=4, use_gpu=True)

This would be equivalent to the horovodrun call that you have.
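
To make the mapping concrete, here is a minimal sketch of how a horovodrun-style host list translates into a single worker count. The helper name total_slots is hypothetical (it is not part of Ray or Horovod), and it assumes every entry has the form host:slots.

```python
def total_slots(hosts: str) -> int:
    """Sum the slot counts in a host list such as 'ip1:2,ip2:2'.

    Hypothetical helper for illustration; assumes each entry is 'host:slots'.
    """
    total = 0
    for entry in hosts.split(","):
        # Split on the last ':' so the slot count is always the final field.
        _host, _sep, slots = entry.strip().rpartition(":")
        total += int(slots)
    return total


# The host list from the horovodrun command earlier in this thread.
num_workers = total_slots("ip1:2,ip2:2")
print(num_workers)  # 4
```

The resulting number is what would be passed as num_workers, with Ray deciding which node each worker lands on.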

Currently, you are only creating 2 training workers, so if your train_fn only uses 1 GPU, then a total of 2 out of the 4 GPUs will be used.
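
One way to check this is to log, from inside each training process, which GPUs that process can see. The sketch below assumes the GPUs assigned to a worker are exposed through the CUDA_VISIBLE_DEVICES environment variable, which Ray typically sets for workers that request GPUs. The helper names are hypothetical and not part of Ray or Horovod.

```python
import os
import socket


def visible_gpus(env=None):
    """Return the GPU ids listed in CUDA_VISIBLE_DEVICES.

    Returns None when the variable is unset (CUDA then applies no
    restriction) and an empty list when it is set but empty.
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    return [gpu.strip() for gpu in raw.split(",") if gpu.strip()]


def describe_worker(env=None):
    """Build a one-line summary suitable for printing from a training function."""
    return f"host={socket.gethostname()} visible_gpus={visible_gpus(env)}"


print(describe_worker())
```

Calling describe_worker() at the top of train_fn and comparing the output across workers should show how many processes were started and which devices each one was given.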

Nitin_Pasumarthy · April 8, 2022, 1:58am

Thanks for taking a look, @amogkam. Yes, you understood my setup correctly - I have 2 nodes with 2 GPUs each where only 1 GPU from each node is being utilized.

Trainer(backend="horovod", num_workers=4, use_gpu=True) throws the error below:

repr=<ray.train.backend.BackendExecutor object at 0x7f90a8fa14d0>)
  File "<SOME_PATH>/site-packages/ray/train/backend.py", line 176, in start
    self._backend.on_start(self.worker_group, self._backend_config)
  File "<SOME_PATH>/site-packages/ray/train/horovod.py", line 145, in on_start
    node_workers=node_workers)
  File "<SOME_PATH>/site-packages/horovod/ray/utils.py", line 72, in detect_nics
    settings)
  File "<SOME_PATH>/site-packages/horovod/ray/driver_service.py", line 47, in _driver_fn
    raise ValueError(f"Number of node actors ({len(node_actors)}) "
ValueError: Number of node actors (4) must match num_hosts (2).

Trainer(backend="horovod", num_workers=3, use_gpu=True) fails with the same error.
Trainer(backend="horovod", num_workers=2, use_gpu=True) works and schedules both workers on the same node, which is good to be honest. This is unlike the setup with the resources_per_worker={"GPU": 2, "CPU": 4} argument, where the workers get scheduled on different nodes, as expected.

But multi-worker training with Horovod is still unsolved.

amogkam · April 8, 2022, 3:48am

Ah @Nitin_Pasumarthy can you use ray 1.12.0.rc1 to fix the ValueError that you saw? There was a bug with horovod training with older versions of ray.

Nitin_Pasumarthy · April 8, 2022, 4:09am

I was partially guessing this was a bug and thought I should mention the version of Ray I was using. Thank you! I’ll give it a try. When is the next stable release of Ray?

amogkam · April 8, 2022, 4:37am

The official 1.12 release should be in the next few days!

Nitin_Pasumarthy · April 16, 2022, 7:12am

Hey @amogkam,
I checked with ray==1.12 and am still unable to use all 4 of my GPUs (2 nodes with 2 GPUs each). It throws the same error. Would it be easier if we debugged this live, together? This is really blocking my progress.

I see some more minor errors / issues which I posted as separate threads or +1s to existing threads.

amogkam · April 18, 2022, 4:33pm

Hey @Nitin_Pasumarthy sorry, I misspoke: this fix was not included in the 1.12.0 release. It should be fixed in master!

Regarding the other issue you are seeing, I will respond on the other thread!

amogkam · April 18, 2022, 4:35pm

And yes would be happy to schedule some time to pair debug this! Feel free to message me on the Ray slack!

Nitin_Pasumarthy · June 3, 2022, 9:46pm

@amogkam was this merged in 1.12.1?

amogkam · June 6, 2022, 9:24pm

Hey @Nitin_Pasumarthy, no, this will be included in the Ray 1.13 release, which will happen in the next few days.

Were you able to try Ray master?

Nitin_Pasumarthy · June 8, 2022, 9:05pm

Got it. Not yet, @amogkam.

I’m working to overcome this limitation - Train with tune doesnt set the right logdir - #7 by matthewdeng which is a blocker for my work.
