Ray Train with Horovod does not use all GPUs on the node

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I’m just having a great time using Ray Train. Its interface is so simple and yet very powerful!

However, when I use TensorFlow Keras + Horovod for distributed multi-worker training, my script doesn’t use all the GPUs on the node. It only uses GPU0 on each worker, with the Trainer set up as shown below.

trainer = Trainer(backend="horovod", num_workers=2, 
                  resources_per_worker={"GPU": 2, "CPU": 4}, 
                  use_gpu=True, max_retries=0)

results = trainer.run(train_fn, config, callbacks=callbacks)

Without Train, it works as expected when I launch the training with horovodrun as follows:

./horovodrun -np 4 -H ip1:2,ip2:2 python train_deep_model.py

What am I missing? How can I debug it?

Hey @Nitin_Pasumarthy thanks for posting the question!

Can you provide more details on the cluster configuration and also what exactly you are seeing?

It seems like you have 2 nodes with 2 GPUs each, is that right? And only the first GPU is being used on each node?

If the above is correct, then this is expected given how the Trainer is being instantiated. Unlike horovodrun, Ray Train has no concept of separate hosts or nodes. You simply specify the number of workers (i.e. the number of training processes). So if you have a total of 4 GPUs in your cluster, regardless of whether they are on a single node or spread across multiple nodes, all you need is this:
Trainer(backend="horovod", num_workers=4, use_gpu=True)

This would be equivalent to the horovodrun call that you have.

Currently, you are only creating 2 training workers, so if your train_fn only uses 1 GPU, then a total of 2 out of the 4 GPUs will be used.
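
To make the mapping concrete, here is a rough sketch of how the `-H` host string from the horovodrun command could be translated into a `num_workers` value. The helper `workers_from_hosts` is hypothetical (it is not part of Ray or Horovod), and it assumes every entry is written as `host:slots`. The Trainer call itself is left as a comment so that the snippet runs without a cluster.

```python
def workers_from_hosts(hosts: str) -> int:
    """Sum the slot counts in a horovodrun-style -H string, e.g. 'ip1:2,ip2:2'.

    Hypothetical helper for illustration only; assumes each entry is 'host:slots'.
    """
    total = 0
    for entry in hosts.split(","):
        entry = entry.strip()
        if not entry:
            continue
        # Split on the last ':' so the slot count is always the final field.
        host, sep, slots = entry.rpartition(":")
        if not sep or not host:
            raise ValueError(f"expected 'host:slots', got {entry!r}")
        total += int(slots)
    return total


# The -np value in `horovodrun -np 4 -H ip1:2,ip2:2 ...` is the sum of the slots.
num_workers = workers_from_hosts("ip1:2,ip2:2")

# One training process per GPU; Ray decides which node each worker lands on.
trainer_kwargs = {"backend": "horovod", "num_workers": num_workers, "use_gpu": True}
print(trainer_kwargs)

# With Ray Train installed and a cluster running, this would become:
# from ray.train import Trainer
# trainer = Trainer(**trainer_kwargs)
```

The point is that `num_workers` plays the role of `-np`: it counts training processes, not nodes, so no per-host slot information is passed to the Trainer.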

Thanks for taking a look, @amogkam. Yes, you understood my setup correctly - I have 2 nodes with 2 GPUs each where only 1 GPU from each node is being utilized.

  • Trainer(backend="horovod", num_workers=4, use_gpu=True) throws the following error:
repr=<ray.train.backend.BackendExecutor object at 0x7f90a8fa14d0>)
  File "<SOME_PATH>/site-packages/ray/train/backend.py", line 176, in start
    self._backend.on_start(self.worker_group, self._backend_config)
  File "<SOME_PATH>/site-packages/ray/train/horovod.py", line 145, in on_start
  File "<SOME_PATH>/site-packages/horovod/ray/utils.py", line 72, in detect_nics
  File "<SOME_PATH>/site-packages/horovod/ray/driver_service.py", line 47, in _driver_fn
    raise ValueError(f"Number of node actors ({len(node_actors)}) "
ValueError: Number of node actors (4) must match num_hosts (2).
  • Trainer(backend="horovod", num_workers=3, use_gpu=True) fails with the same error above.

  • Trainer(backend="horovod", num_workers=2, use_gpu=True) works and schedules both workers on the same node, which is good to be honest. This is unlike the setup with the resources_per_worker={"GPU": 2, "CPU": 4} arg, where the workers get scheduled on different nodes, which is expected.

But multi-worker training with Horovod is still unsolved.

Ah @Nitin_Pasumarthy can you use ray 1.12.0.rc1 to fix the ValueError that you saw? There was a bug with horovod training with older versions of ray.

I was partially guessing this was a bug and thought I should mention the version of Ray I was using. Thank you! I’ll give it a try. When is the next stable release of Ray?

The official 1.12 release should be in the next few days!

Hey @amogkam,
I checked with ray==1.12 and am still unable to use all 4 of my GPUs (2 nodes with 2 GPUs each). It throws the same error. Would it be easier to debug this live, together? This is really blocking my progress :frowning:

I see some more minor errors / issues which I posted as separate threads or +1s to existing threads.

Hey @Nitin_Pasumarthy, sorry, I misspoke: this fix was not included in the 1.12.0 release. It should be fixed in master!

Regarding the other issue you are seeing, I will respond on the other thread!

And yes would be happy to schedule some time to pair debug this! Feel free to message me on the Ray slack!

@amogkam, was this fix included in 1.12.1?

Hey @Nitin_Pasumarthy, no, this will be included in the Ray 1.13 release, which will happen in the next few days.

Were you able to try Ray master?

Got it. Not yet, @amogkam.

I’m first working to overcome the limitation discussed in another thread, "Train with tune doesnt set the right logdir" (#7 by matthewdeng), which is a blocker for my work.