Model loaded into GPU memory, but the GPU is not being utilized

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hi Ray community,

I am trying to run training on a GPU with RLlib. The model seems to be loaded onto the GPU, but the strange thing is that GPU utilization stays at zero the whole time, both in nvidia-smi and in the TensorBoard logs.


Here is also the log from the terminal:

The training is also very slow, which is another reason I strongly doubt that any training is happening on the GPU.
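
As a quick sanity check outside of RLlib (a minimal sketch, assuming the PyTorch framework is in use), one can confirm that CUDA is visible to the process and that GPU work actually registers in nvidia-smi:

    import torch

    # Confirm this process can see a CUDA device at all.
    print(torch.cuda.is_available())   # expect True
    print(torch.cuda.device_count())   # expect >= 1

    # Push some work onto the GPU and watch nvidia-smi while it runs.
    x = torch.randn(4096, 4096, device="cuda")
    for _ in range(100):
        x = x @ x                      # large matmuls should show clear utilization
    torch.cuda.synchronize()
    print(torch.cuda.memory_allocated())  # bytes currently allocated on the GPU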

@saeid93 To answer the question, it would help if you showed your configuration.

@Lars_Simon_Zehnder Thank you for your response, this is the config file:

    "run_or_experiment": "PG",
    "learn_config": {
        "train_batch_size": 1000,
        "num_gpus": 1,
        "model": {
            "fcnet_hiddens": [64, 64],
            "fcnet_activation": "linear"
        },
        "gamma": 0.99,
        "lr": 0.0003,
        "num_workers": 6,
        "observation_filter": "NoFilter",
        "seed": 203
    },
    "stop": {
        "timesteps_total": 2000000
    }
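
For reference, a minimal sketch of how a config like this might map onto a direct tune.run call (the learn_config wrapper key looks like it comes from a custom runner script, and the env name below is a placeholder assumption):

    import ray
    from ray import tune

    ray.init(num_gpus=1)

    tune.run(
        "PG",
        config={
            "env": "CartPole-v1",   # placeholder env, not from the original config
            "train_batch_size": 1000,
            "num_gpus": 1,          # GPU for the trainer/learner process
            "num_workers": 6,       # rollout workers (CPU-only here)
            "model": {"fcnet_hiddens": [64, 64], "fcnet_activation": "linear"},
            "gamma": 0.99,
            "lr": 0.0003,
            "observation_filter": "NoFilter",
            "seed": 203,
        },
        stop={"timesteps_total": 2000000},
    )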

@saeid93 Thank you for the configuration. What does ray status tell you, if you run it? Or do you run locally without a cluster?

And did you also include the GPU request in your Tune resources, as shown in @kai's answer in another thread?

Thank you for your reply @Lars_Simon_Zehnder. I start it with ray.init(); is that the cluster mode?
I include it in the ray.init() input as ray.init(local_mode=local_mode, num_gpus=1). Is that what you mean?

No, by cluster mode I meant that you started your Ray cluster with a YAML file from the command line (not a fortunate name, I know).
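
One thing worth double-checking on that ray.init() call: if local_mode ever ends up True, Ray runs everything serially in a single process (it is meant for debugging), which can distort resource usage. A minimal sketch to confirm Ray actually sees the GPU:

    import ray

    ray.init(num_gpus=1)  # local_mode left at its default of False

    # Confirm Ray actually registered the GPU with the cluster.
    print(ray.cluster_resources())    # should include "GPU": 1.0
    print(ray.available_resources())  # GPU should be listed while it is unused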

Could you maybe run ray.get_gpu_ids() and take a look at CUDA_VISIBLE_DEVICES? Ray sets that environment variable, and it might give some hints on where the problem lies.
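
For example, something along these lines (a sketch; ray.get_gpu_ids() is meant to be called inside a task or actor that has GPUs assigned to it):

    import os
    import ray

    ray.init(num_gpus=1)

    @ray.remote(num_gpus=1)
    def gpu_check():
        # Ray assigns GPUs to this task and sets CUDA_VISIBLE_DEVICES for it.
        return ray.get_gpu_ids(), os.environ.get("CUDA_VISIBLE_DEVICES")

    print(ray.get(gpu_check.remote()))  # e.g. ([0], "0")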

Also, you might want to see if explicit resource allocation via PlacementGroupFactory brings the metrics to life. See here for an example of how to use it in Tune.
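
For illustration, a rough sketch of attaching a PlacementGroupFactory to a trial (shown with a plain function trainable, since for RLlib trainables Tune usually derives resources from the algorithm config; the import path also differs across Ray versions):

    from ray import tune
    # NOTE: older releases expose this as
    # ray.tune.utils.placement_groups.PlacementGroupFactory instead.
    from ray.tune import PlacementGroupFactory

    def train_fn(config):
        # Placeholder trainable standing in for the real training logic.
        tune.report(score=1.0)

    tune.run(
        train_fn,
        # One bundle reserving a CPU and the GPU for the trial itself.
        resources_per_trial=PlacementGroupFactory([{"CPU": 1, "GPU": 1}]),
    )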