Checking if TorchTrainer is using the available GPUs

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I am trying to train on a given dataset using distributed training in Ray. I ran the sample code from the example on ray.train.torch.TorchTrainer — Ray 2.8.0. I am running it on a large dataset, and during the run I see the following log:

Running: 0.0/48.0 CPU, 0.0/2.0 GPU, 11.84 GiB/18.0 GiB object_store_memory

and a progress bar on the right. Does this mean that I am not using the available GPUs? Checking GPU usage with nvidia-smi shows two processes named Worker__execute.get_next on the GPU, but I am not sure whether they are actually utilizing it.
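For reference, what I am running is essentially the documented example. A minimal sketch looks roughly like this (the real script loads my dataset instead of random tensors, and the exact config values are placeholders):

```python
import torch
import torch.nn as nn

import ray.train.torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker(config):
    # Toy model standing in for the real one; prepare_model moves it to the
    # worker's assigned device (the GPU when use_gpu=True) and wraps it for
    # distributed training.
    model = ray.train.torch.prepare_model(nn.Linear(10, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    device = ray.train.torch.get_device()

    for _ in range(config["num_epochs"]):
        inputs = torch.randn(32, 10, device=device)
        labels = torch.randn(32, 1, device=device)
        loss = nn.functional.mse_loss(model(inputs), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"num_epochs": 1},
    # use_gpu=True reserves one GPU per worker, so num_workers=2 accounts
    # for the "2.0 GPU" shown in the cluster resources line above.
    scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
)
result = trainer.fit()
```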

Does it show up in the Ray Dashboard? You can use the Cluster/Metrics tabs to see if the GPU is being utilized.
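As a quick check independent of the dashboard, you could also log which device each worker ends up on from inside the training function. A small sketch (assuming Ray 2.8's torch utilities):

```python
import torch
import ray.train
import ray.train.torch


def train_loop_per_worker(config):
    # Report the device assigned to this worker and whether CUDA is visible.
    device = ray.train.torch.get_device()
    rank = ray.train.get_context().get_world_rank()
    print(f"worker {rank}: device={device}, cuda_available={torch.cuda.is_available()}")
    # ... rest of the training loop ...
```

If each worker prints a cuda device and GPU memory grows in nvidia-smi while training, the GPUs are being used.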

I haven’t had a chance to port-forward and view the dashboard yet; I’ll let you know soon.