Checking if TorchTrainer is using the available GPUs

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I am trying to train on a given dataset using distributed training in Ray. I ran the sample code from the example on ray.train.torch.TorchTrainer — Ray 2.8.0. I am running it on a large dataset, and during the run I see the following log:

Running: 0.0/48.0 CPU, 0.0/2.0 GPU, 11.84 GiB/18.0 GiB object_store_memory

and a progress bar on the right. Does this mean that I am not using the available GPUs? Checking GPU usage with nvidia-smi shows two processes named Worker__execute.get_next on the GPU, but I am not sure whether they are actually utilizing it.
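For reference, what I am running is essentially the documented example. A minimal sketch looks roughly like this (the real script loads my dataset instead of random tensors, and the exact config values are placeholders):

```python
import torch
import torch.nn as nn

import ray.train.torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker(config):
    # Toy model standing in for the real one; prepare_model moves it to the
    # worker's assigned device (the GPU when use_gpu=True) and wraps it for
    # distributed training.
    model = ray.train.torch.prepare_model(nn.Linear(10, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    device = ray.train.torch.get_device()

    for _ in range(config["num_epochs"]):
        inputs = torch.randn(32, 10, device=device)
        labels = torch.randn(32, 1, device=device)
        loss = nn.functional.mse_loss(model(inputs), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"num_epochs": 1},
    # use_gpu=True reserves one GPU per worker, so num_workers=2 accounts
    # for the "2.0 GPU" shown in the cluster resources line above.
    scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
)
result = trainer.fit()
```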

Does it show up in the Ray Dashboard? You can use the Cluster/Metrics tabs to see if the GPU is being utilized.
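As a quick check independent of the dashboard, you could also log which device each worker ends up on from inside the training function. A small sketch (assuming Ray 2.8's torch utilities):

```python
import torch
import ray.train
import ray.train.torch


def train_loop_per_worker(config):
    # Report the device assigned to this worker and whether CUDA is visible.
    device = ray.train.torch.get_device()
    rank = ray.train.get_context().get_world_rank()
    print(f"worker {rank}: device={device}, cuda_available={torch.cuda.is_available()}")
    # ... rest of the training loop ...
```

If each worker prints a cuda device and GPU memory grows in nvidia-smi while training, the GPUs are being used.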

I haven’t had a chance to port-forward and view the dashboard yet; I’ll let you know soon.