GPUs not detected

Hi all!
I am trying to run PPO using a GPU for the trainer.
My setup is the following:

Ray v2.0.0
Tensorflow 2.4
Cuda 11.0

Tensorflow works fine with GPUs. However, when I run the PPO algorithm with “rllib train”, the GPUs are not detected and I get the following error:

RuntimeError: GPUs were assigned to this worker by Ray, but your DL framework (tf) reports GPU acceleration is disabled. This could be due to a bad CUDA- or tf installation.

I tried to remove the part that raised the error, but I noticed that the trainer used only the CPU.

Does anybody know what I could do to fix it?


Hi Fot,

I have seen this error multiple times from running my own code.
Do you have a script that you can post?
Are you 100% sure that you use the same Cuda version when testing TF separately?
What was the part that you removed that provokes the error?

Hi!

  1. I use the following configuration file:

cartpole-appo:
  env: CartPole-v0
  run: PPO
  stop:
    timesteps_total: 15000
  config:
    # Works for both torch and tf.
    framework: tf
    train_batch_size: 750
    num_envs_per_worker: 5
    num_workers: 1
    num_cpus_per_worker: 1
    num_gpus: 1
    num_gpus_per_worker: 1

and then I just run “rllib train” with this config file. I started the Ray session with a number of GPUs as well (ray start --num-gpus=4).
When I run this, 2 Ray actors are spawned, I believe the trainer and 1 rollout worker? By checking the log files, I noticed that the rollout worker sees the GPUs, but the trainer (or whatever the other actor is) does not. From what I have seen, the GPU is supposed to be used by the trainer to perform SGD, right?

Also, as far as I understand, in order for a Ray actor to see the GPUs you have to set the num_gpus when declaring the respective class.

  2. I think I use the correct version, yes. If I just define and launch Ray actors, they can find the CUDA libraries and can see the GPUs.

  3. The part that causes the error is in the file rollout_worker.py (lines 500-513):

if not ray.get_gpu_ids():
    logger.debug("Creating policy evaluation worker {}".format(
        worker_index) +
        " on CPU (please ignore any CUDA init errors)")
elif (policy_config["framework"] in ["tf2", "tf", "tfe"] and
      not tf.config.experimental.list_physical_devices("GPU")) or \
     (policy_config["framework"] == "torch" and
      not torch.cuda.is_available()):
    raise RuntimeError(
        "GPUs were assigned to this worker by Ray, but "
        "your DL framework ({}) reports GPU acceleration is "
        "disabled. This could be due to a bad CUDA- or {} "
        "installation.".format(policy_config["framework"],
                               policy_config["framework"]))

The trainer also has a local worker, right? Maybe that worker is not able to see the GPUs, and that causes this error?

Also, can a Ray actor only see the GPUs if num_gpus is provided at its class declaration?

That might very well be. For one GPU you can use the resource assignment from the Ray docs. Note that you are requesting 2 GPUs with your config: num_gpus: 1 for the trainer plus num_gpus_per_worker: 1 for the worker.
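If only the trainer should use a GPU, the resource part of the config would look something like this (a sketch, not your full config):

```yaml
config:
  num_workers: 1
  num_gpus: 1             # local worker / trainer does SGD on the GPU
  num_gpus_per_worker: 0  # rollout workers stay on CPU
```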

Hi!

Yes, @Sertingolix you are right, I need only 1 GPU, and I request 2, thanks for pointing it out!

I figured out there was an issue when the CUDA_VISIBLE_DEVICES was set for the tasks.
The resources for the trainer are the following (requested 1 CPU and 1 GPU):

{'CPU_group_0': [(0, 1.0)], 'CPU_group': [(0, 1.0)], 'GPU_group_3': [(0, 1.0)], 'GPU_group_0': [(0, 1.0)]}

I am not very familiar with how these placement groups are created.
Anyway, the function get_gpu_ids returned the list [0,0] for the trainer.
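My guess (not RLlib's actual code) is that each GPU bundle in the dict above contributes its device index, so two bundles pointing at the same physical GPU produce the duplicate:

```python
# The resource assignment from above; both GPU bundles point at
# physical device 0.
resources = {
    "CPU_group_0": [(0, 1.0)],
    "CPU_group": [(0, 1.0)],
    "GPU_group_3": [(0, 1.0)],
    "GPU_group_0": [(0, 1.0)],
}

# Collect the device index of every GPU bundle entry.
gpu_ids = [
    device
    for name, bundles in resources.items()
    if name.startswith("GPU")
    for device, _fraction in bundles
]
print(gpu_ids)  # -> [0, 0]
```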

Then the function set_cuda_visible_devices set CUDA_VISIBLE_DEVICES=0,0.
I tried running TensorFlow with this value of CUDA_VISIBLE_DEVICES, and it could not detect any GPUs, so that is probably the problem.

I manually set CUDA_VISIBLE_DEVICES=0 for the trainer and it seems to work, but if someone has any idea why this happened with the placement groups, it would be good to know.
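For reference, my workaround boils down to de-duplicating the IDs before exporting them (set_cuda_visible_devices here is my own stand-in, not Ray's internal function):

```python
import os

def set_cuda_visible_devices(gpu_ids):
    # De-duplicate the IDs Ray returns (e.g. [0, 0] from the
    # placement group) so CUDA_VISIBLE_DEVICES becomes "0", not "0,0".
    unique = sorted(set(int(i) for i in gpu_ids))
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in unique)

set_cuda_visible_devices([0, 0])
print(os.environ["CUDA_VISIBLE_DEVICES"])  # -> 0
```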

Thanks!!

@Fot, interesting catch. Yeah, the CUDA_VISIBLE_DEVICES=0,0 probably messes things up here. Glad you could fix your problem.

“When I run this, 2 Ray actors are spawned, I believe the trainer and 1 Rollout worker?”
Yes, this is correct.

Also, the local worker (where training happens) will use the GPU; the rollout workers should not need to see a GPU.
