Model loaded to GPU memory but GPU memory is not being utilized

saeid93 · November 22, 2022, 2:13am

How severe does this issue affect your experience of using Ray?

High: It blocks me to complete my task.

Hi ray community,

I am trying to train on a GPU with RLlib. The model seems to be loaded onto the GPU, but strangely the GPU utilization stays at zero the whole time, both in nvidia-smi and in the TensorBoard logs.

Here is also the log from the terminal:

The training is also very slow, which is another reason I strongly doubt that any training is happening on the GPU.

Lars_Simon_Zehnder · November 23, 2022, 10:08pm

@saeid93 To answer the question it helps if you show your configuration.

saeid93 · November 24, 2022, 1:09am

@Lars_Simon_Zehnder Thank you for your response, this is the config file:

    "run_or_experiment": "PG",
    "learn_config": {
        "train_batch_size": 1000,
        "num_gpus": 1,
        "model": {
            "fcnet_hiddens": [64, 64],
            "fcnet_activation": "linear"
        },
        "gamma": 0.99,
        "lr": 0.0003,
        "num_workers": 6,
        "observation_filter": "NoFilter",
        "seed": 203
    },
    "stop": {
        "timesteps_total": 2000000
    }

Lars_Simon_Zehnder · November 24, 2022, 4:01pm

@saeid93 Thank you for the configuration. What does ray status tell you, if you run it? Or do you run locally without a cluster?

And did you also include the GPU request in your Tune resources, as shown in @kai's answer in another thread?
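The `ray status` check can also be done from inside the training script. The following is only a sketch, not something from this thread: `gpu_usage` and `print_gpu_usage` are hypothetical helper names, and it assumes `ray.init()` has already been called. Note that these numbers are Ray's logical reservations, not the hardware utilization that nvidia-smi reports.

```python
def gpu_usage(total, available):
    """Return (gpus_known_to_ray, gpus_currently_reserved).

    Both arguments are resource dicts of the shape returned by
    ray.cluster_resources() and ray.available_resources().
    A missing "GPU" key is treated as zero GPUs.
    """
    total_gpus = total.get("GPU", 0.0)
    free_gpus = available.get("GPU", 0.0)
    return total_gpus, total_gpus - free_gpus


def print_gpu_usage():
    """Print Ray's view of the GPUs (assumes Ray is already initialized)."""
    import ray  # imported lazily so gpu_usage() works without Ray installed

    total, reserved = gpu_usage(ray.cluster_resources(), ray.available_resources())
    print(f"GPUs known to Ray: {total}, currently reserved: {reserved}")
```

If Ray reports zero GPUs in total, it never detected the device. If it reports one GPU that is never reserved while training runs, the trial most likely did not request it.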

saeid93 · November 27, 2022, 4:49pm

Thank you for your reply @Lars_Simon_Zehnder. I start it with ray.init(); is that the cluster mode?
I include it in the ray.init() call as ray.init(local_mode=local_mode, num_gpus=1). Is that what you mean?

Lars_Simon_Zehnder · November 29, 2022, 10:12am

No, by cluster mode I meant that you started your Ray cluster with a YAML file from the command line (it's not a fortunate naming, I know).

Could you maybe run ray.get_gpu_ids() and take a look at CUDA_VISIBLE_DEVICES? Ray sets the environment variable, and that might give some hints about where the problem lies.
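A minimal sketch of that check, assuming a single local GPU. `parse_visible_devices` and `report_gpu_assignment` are hypothetical names used only for illustration. The probe runs as a Ray task that requests one GPU, so it will wait if no GPU is free.

```python
import os


def parse_visible_devices(raw):
    """Turn a CUDA_VISIBLE_DEVICES string into a list of device ids.

    None means the variable is unset; an empty string means no
    device is visible to the process.
    """
    if raw is None:
        return None
    return [part.strip() for part in raw.split(",") if part.strip()]


def report_gpu_assignment():
    """Run one GPU task and return what Ray assigned to it."""
    import ray  # imported lazily so the parser above works without Ray installed

    # Assumption: one local GPU, matching num_gpus=1 from the thread.
    ray.init(num_gpus=1, ignore_reinit_error=True)

    @ray.remote(num_gpus=1)
    def probe():
        return {
            "ray_gpu_ids": ray.get_gpu_ids(),
            "visible": parse_visible_devices(os.environ.get("CUDA_VISIBLE_DEVICES")),
        }

    return ray.get(probe.remote())
```

An empty list for either value inside the task would suggest that the process was scheduled without a GPU.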

Also, you might want to see whether a specific resource allocation via PlacementGroupFactory brings the metrics to life. See here for an example of how to use them in Tune.
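As a rough sketch of what such an allocation could look like for the configuration above (one learner GPU, six workers): `build_bundles` and `make_placement_group_factory` are hypothetical helpers, and the one-CPU-per-process layout is an assumption. RLlib algorithms usually derive their own resource request from num_gpus and num_workers, so whether an explicit factory is accepted for them may depend on the Ray version.

```python
def build_bundles(num_workers, learner_gpus=1, cpus_per_worker=1):
    """Build the resource bundles for one trial.

    The first bundle is for the learner (driver) process and carries
    the GPU; each following bundle is for one rollout worker.
    """
    learner = {"CPU": 1, "GPU": learner_gpus}
    workers = [{"CPU": cpus_per_worker} for _ in range(num_workers)]
    return [learner] + workers


def make_placement_group_factory(num_workers):
    """Wrap the bundles in a Tune PlacementGroupFactory."""
    from ray import tune  # imported lazily so build_bundles() works without Ray

    return tune.PlacementGroupFactory(build_bundles(num_workers), strategy="PACK")
```

With num_workers=6 this yields seven bundles, of which only the first requests a GPU.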
