How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Hi Ray community,
I am trying to train on a GPU with RLlib. The model seems to be loaded onto the GPU, but strangely the GPU utilization stays at zero the whole time, both in nvidia-smi and in the TensorBoard logs.
Here is also the log from the terminal:
The training is also very slow, which further makes me doubt that any training is happening on the GPU.
@saeid93 To answer the question, it would help if you showed your configuration.
@Lars_Simon_Zehnder Thank you for your response, this is the config file:
"fcnet_hiddens": [64, 64],
@saeid93 Thank you for the configuration. What does ray status tell you, if you run it? Or do you run locally without a cluster?
And did you also include the GPU request into your tune resources as shown in @kai 's answer in another thread?
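For context, a minimal sketch of what such a GPU request might look like in an old-style RLlib trainer config dict. The environment name and worker count here are illustrative, not taken from the thread; only the fcnet_hiddens value comes from the configuration shown above:

```python
# Illustrative RLlib config (old config-dict API). The key point is
# num_gpus, which reserves a GPU for the trainer/learner process;
# rollout workers stay CPU-only unless num_gpus_per_worker is set.
config = {
    "env": "CartPole-v1",              # placeholder environment
    "framework": "torch",              # assumed framework choice
    "num_gpus": 1,                     # GPU for the learner
    "num_workers": 2,                  # illustrative worker count
    "model": {"fcnet_hiddens": [64, 64]},
}
print(config["num_gpus"])
```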
Thank you for your reply @Lars_Simon_Zehnder, I start it with ray.init(). Is that the cluster mode?
I include it in the ray.init() call as ray.init(local_mode=local_mode, num_gpus=1). Is that what you mean?
No, by cluster mode I meant that you started your Ray cluster with a YAML file from the command line (it's not a fortunate naming, I know).
Could you maybe run ray.get_gpu_ids() and take a look at CUDA_VISIBLE_DEVICES? Ray sets the environment variable, and that might give some hints on where the problem lies.
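A small sketch of that check. The visible_gpu_ids helper is a hypothetical name introduced here just to parse the environment variable; the Ray calls in the comment only make sense inside a running Ray program:

```python
import os

def visible_gpu_ids(value):
    """Parse a CUDA_VISIBLE_DEVICES-style string ("0,1") into a list of ids."""
    if not value:
        return []
    return [v.strip() for v in value.split(",") if v.strip()]

# With Ray installed, you could compare the two views like this:
#
#   import ray
#   ray.init(num_gpus=1)
#   print(ray.get_gpu_ids())   # ids Ray assigned to this worker/driver
#   print(visible_gpu_ids(os.environ.get("CUDA_VISIBLE_DEVICES", "")))
#
# If Ray never sets CUDA_VISIBLE_DEVICES for the training process,
# that suggests the GPU was never actually allocated to it.
print(visible_gpu_ids("0,1"))  # → ['0', '1']
```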
Also, you might want to see if the specific resource allocation via PlacementGroupFactory brings the metrics to life. See here for an example of how to use them.
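As a rough sketch of that idea: PlacementGroupFactory lets Tune reserve explicit resource bundles per trial, typically one bundle for the trainer (holding the GPU) plus one per rollout worker. The bundle shapes and the import path in the comment are assumptions for a Ray 1.x-era setup, not confirmed from the thread:

```python
# Illustrative resource bundles for one trial: the first bundle is for
# the trainer process (with the GPU), the rest are CPU-only bundles,
# one per rollout worker. Values here are placeholders.
num_workers = 2
bundles = [{"CPU": 1, "GPU": 1}] + [{"CPU": 1}] * num_workers

# With Ray installed, these would be passed to Tune roughly like
# (import path assumed for Ray 1.x):
#
#   from ray.tune.utils.placement_groups import PlacementGroupFactory
#   tune.run("PPO", config=config,
#            resources_per_trial=PlacementGroupFactory(bundles))
print(bundles)
```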