How severely does this issue affect your experience of using Ray?
High: It blocks me from completing my task.
Hi Ray community,
I am trying to train on a GPU with RLlib. The model seems to be loaded onto the GPU, but strangely the GPU utilization stays at zero the whole time, both in nvidia-smi and in the TensorBoard logs.
Thank you for your reply @Lars_Simon_Zehnder. I start it with ray.init(); is that the cluster mode?
I include it in the ray.init() call as ray.init(local_mode=local_mode, num_gpus=1). Is that what you mean?
No, by cluster mode I meant that you started your Ray cluster with a YAML file from the command line (it's not a fortunate naming, I know).
Could you maybe run ray.get_gpu_ids() and take a look at CUDA_VISIBLE_DEVICES? Ray sets the environment variable, and that might give some hints on where the problem lies.
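A minimal sketch of that check, assuming one GPU is requested as in the ray.init(num_gpus=1) call above. The helper names gpu_report and check_gpu_on_worker are made up for illustration. The check runs inside a remote task because, as far as I know, ray.get_gpu_ids() only reports assigned GPUs from within a task or actor, not from the driver.

```python
import os


def gpu_report(gpu_ids):
    """Pair the GPU IDs Ray assigned with this process's CUDA_VISIBLE_DEVICES."""
    return {
        "ray_gpu_ids": list(gpu_ids),
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
    }


def check_gpu_on_worker():
    """Start Ray, request one GPU for a task, and report what the task sees."""
    import ray  # imported here so gpu_report stays usable without Ray

    ray.init(num_gpus=1, ignore_reinit_error=True)

    @ray.remote(num_gpus=1)
    def probe():
        # Runs in a worker process that Ray has assigned a GPU to.
        return gpu_report(ray.get_gpu_ids())

    return ray.get(probe.remote())
```

Calling print(check_gpu_on_worker()) should show a non-empty ID list and a matching CUDA_VISIBLE_DEVICES value if the GPU is actually handed to the worker. An empty list or an unset variable would suggest the training process never receives the GPU.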
Also, you might want to see whether a specific resource allocation via PlacementGroupFactory brings the metrics to life. See here for an example of how to use them in Tune.
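A rough sketch of such an allocation, not taken from the linked example. The bundle sizes and the helper names build_bundles and make_placement_group_factory are assumptions for illustration. In Tune, the first bundle is reserved for the trainable itself, so that is where the GPU goes here.

```python
def build_bundles(num_workers, trainer_gpus=1, worker_cpus=1):
    """Build resource bundles: one for the trainable, one per rollout worker."""
    head = {"CPU": 1, "GPU": trainer_gpus}  # first bundle: the trainable itself
    workers = [{"CPU": worker_cpus} for _ in range(num_workers)]
    return [head] + workers


def make_placement_group_factory(num_workers):
    """Wrap the bundles in a PlacementGroupFactory for use with Tune."""
    from ray import tune  # imported here so build_bundles works without Ray

    return tune.PlacementGroupFactory(build_bundles(num_workers), strategy="PACK")
```

With a function trainable (say, a hypothetical my_trainable), the result can be attached via tune.with_resources(my_trainable, make_placement_group_factory(2)) before passing it to Tune. RLlib algorithms define their own default resource request, so this may need adapting for them.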