Hi all!
I am trying to run PPO using a GPU for the trainer.
My setup is the following:
Ray v2.0.0
TensorFlow 2.4
CUDA 11.0
TensorFlow works fine with GPUs on its own. However, when I run the PPO algorithm with "rllib train", the GPUs are not detected and I get the following error:
RuntimeError: GPUs were assigned to this worker by Ray, but your DL framework (tf) reports GPU acceleration is disabled. This could be due to a bad CUDA- or tf installation.
I tried to remove the part that raised the error, but I noticed that the trainer used only the CPU.
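For reference, this is roughly how I verify that TensorFlow sees the GPUs outside of Ray (a minimal sketch, not specific to RLlib):

# Quick sanity check, outside of Ray: can TF see the GPUs, and which
# CUDA version was the TF binary built against?
import tensorflow as tf

print("TF version:", tf.__version__)
print("Visible GPUs:", tf.config.list_physical_devices("GPU"))
print("Build info:", tf.sysconfig.get_build_info())  # includes cuda_version (TF 2.3+)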
I have seen this error multiple times from running my own code.
Do you have a script that you can post?
Are you 100% sure that you use the same CUDA version when testing TF separately?
What was the part that you removed that provokes the error?
cartpole-appo:
    env: CartPole-v0
    run: PPO
    stop:
        timesteps_total: 15000
    config:
        # Works for both torch and tf.
        framework: tf
        train_batch_size: 750
        num_envs_per_worker: 5
        num_workers: 1
        num_cpus_per_worker: 1
        num_gpus: 1
        num_gpus_per_worker: 1
and then I just run "rllib train" providing this config file. I started the Ray session with a number of GPUs as well (ray start --num-gpus=4).
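For completeness, the equivalent Python call would look roughly like this (a sketch of the same settings via tune.run, not the exact script I run):

# Rough Python equivalent of the YAML config above: the same PPO settings
# submitted through ray.tune instead of the "rllib train" CLI.
import ray
from ray import tune

ray.init(num_gpus=4)  # analogous to `ray start --num-gpus=4`

tune.run(
    "PPO",
    stop={"timesteps_total": 15000},
    config={
        "env": "CartPole-v0",
        "framework": "tf",
        "train_batch_size": 750,
        "num_envs_per_worker": 5,
        "num_workers": 1,
        "num_cpus_per_worker": 1,
        "num_gpus": 1,             # GPU for the trainer / local worker
        "num_gpus_per_worker": 1,  # GPU for each rollout worker
    },
)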
When I run this, 2 Ray actors are spawned; I believe these are the trainer and 1 rollout worker? Checking the log files, I noticed that the rollout worker sees the GPUs, but the trainer (or whatever the other actor is) does not. From what I have seen, the GPU is supposed to be used by the trainer to perform SGD, right?
Also, as far as I understand, in order for a Ray actor to see the GPUs, you have to set num_gpus when declaring the respective class.
I think I use the correct version, yes. If I just define and launch Ray actors myself, they can find the CUDA libraries and can see the GPUs.
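Here is roughly the kind of standalone check I mean (a minimal sketch; the actor just requests one GPU and reports what it sees):

# Minimal sketch of that check: an actor that requests one GPU and reports
# what Ray assigned to it and what TF sees inside the actor process.
import os
import ray


@ray.remote(num_gpus=1)
class GpuCheck:
    def report(self):
        import tensorflow as tf  # imported here so it runs in the actor process
        return {
            "ray_gpu_ids": ray.get_gpu_ids(),
            "cuda_visible_devices": os.environ.get("CUDA_VISIBLE_DEVICES"),
            "tf_gpus": tf.config.list_physical_devices("GPU"),
        }


ray.init(num_gpus=4)  # or ray.init(address="auto") to attach to a running cluster
checker = GpuCheck.remote()
print(ray.get(checker.report.remote()))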
The part that causes the error is in the file rollout_worker.py (lines 500-513):
if not ray.get_gpu_ids():
    logger.debug("Creating policy evaluation worker {}".format(
        worker_index) +
        " on CPU (please ignore any CUDA init errors)")
elif (policy_config["framework"] in ["tf2", "tf", "tfe"] and
      not tf.config.experimental.list_physical_devices("GPU")) or \
        (policy_config["framework"] == "torch" and
         not torch.cuda.is_available()):
    raise RuntimeError(
        "GPUs were assigned to this worker by Ray, but "
        "your DL framework ({}) reports GPU acceleration is "
        "disabled. This could be due to a bad CUDA- or {} "
        "installation.".format(policy_config["framework"],
                               policy_config["framework"]))
Yes, @Sertingolix, you are right: I only need 1 GPU but I requested 2. Thanks for pointing it out!
I figured out there was an issue with how CUDA_VISIBLE_DEVICES was set for the tasks.
The resources assigned to the trainer were as requested (1 CPU and 1 GPU).
I am not very familiar with how these placement groups are created.
Anyway, the function get_gpu_ids returned the list [0,0] for the trainer.
Then the function set_cuda_visible_devices was setting CUDA_VISIBLE_DEVICES=0,0.
I tried running TensorFlow with this value for the CUDA_VISIBLE_DEVICES variable, and it could not detect any GPUs, so maybe that is the problem.
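A quick way to reproduce that test (a minimal sketch, setting the variable before TensorFlow initializes CUDA):

# Minimal repro of the check above: force the duplicated device string
# before TensorFlow initializes CUDA, then see what it can detect.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,0"  # the value the trainer ended up with

import tensorflow as tf
print(tf.config.list_physical_devices("GPU"))  # was empty in my case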
I manually set the CUDA_VISIBLE_DEVICES=0 for the trainer and it seems to work, but if someone has any idea why this happened with the placement groups, it would be good to know.