[rllib] Unable to detect AMD GPUs?

I’m running rllib on the latest rocm/pytorch docker container. It is unable to detect any of the 4 GPUs on the server. I have run the same on an NVIDIA system with no issues. I cannot find any literature related specifically to running AMD GPUs and rllib. Are AMD GPUs supported or is there something wrong with my environment?

rllib train -f one.yml --torch

2020-12-21 18:40:49,370 INFO services.py:1092 -- View the Ray dashboard at http://127.0.0.1:8265
Traceback (most recent call last):
  File "/opt/conda/bin/rllib", line 8, in <module>
    sys.exit(cli())
  File "/opt/conda/lib/python3.6/site-packages/ray/rllib/scripts.py", line 34, in cli
    train.run(options, train_parser)
  File "/opt/conda/lib/python3.6/site-packages/ray/rllib/train.py", line 215, in run
    concurrent=True)
  File "/opt/conda/lib/python3.6/site-packages/ray/tune/tune.py", line 490, in run_experiments
    scheduler=scheduler).trials
  File "/opt/conda/lib/python3.6/site-packages/ray/tune/tune.py", line 411, in run
    runner.step()
  File "/opt/conda/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 572, in step
    self.trial_executor.on_no_available_trials(self)
  File "/opt/conda/lib/python3.6/site-packages/ray/tune/trial_executor.py", line 177, in on_no_available_trials
    "Insufficient cluster resources to launch trial: "
ray.tune.error.TuneError: Insufficient cluster resources to launch trial: trial requested 1 CPUs, 1 GPUs, but the cluster has only 56 CPUs, 0 GPUs, 121.29 GiB heap, 38.62 GiB objects (1.0 node:10.217.77.119).

You can adjust the resource requests of RLlib agents by setting num_workers, num_gpus, and other configs. See the DEFAULT_CONFIG defined by each agent for more info.

The config of this agent is: {'framework': 'torch', 'double_q': False, 'dueling': False, 'num_atoms': 1, 'noisy': False, 'prioritized_replay': False, 'n_step': 1, 'target_network_update_freq': 8000, 'lr': 6.25e-05, 'adam_epsilon': 0.00015, 'hiddens': [512], 'learning_starts': 20000, 'buffer_size': 1000000, 'rollout_fragment_length': 4, 'train_batch_size': 32, 'exploration_config': {'epsilon_timesteps': 200000, 'final_epsilon': 0.01}, 'prioritized_replay_alpha': 0.5, 'final_prioritized_replay_beta': 1.0, 'prioritized_replay_beta_annealing_timesteps': 2000000, 'num_gpus': 1, 'timesteps_per_iteration': 10000, 'env': 'BreakoutNoFrameskip-v4'}

Unfortunately, Ray currently doesn't autodetect AMD GPUs or set the related environment variables for them.

That being said, if you specify the GPU count manually with --num-gpus=4, Ray will still assign each of your tasks a specific GPU. If you want to work with Ray tasks directly, you can still get the assigned GPU IDs with ray.get_gpu_ids().
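A minimal Python sketch of the suggestion above, assuming Ray is installed and the GPU count is known in advance. The helper visible_gpu_count and its default of 4 are hypothetical names chosen for illustration; it only reads ROCm's HIP_VISIBLE_DEVICES variable if you happen to have set it, and does not probe the hardware. Whether a ROCm framework then honors Ray's per-task assignment is not something this sketch verifies.

```python
import os


def visible_gpu_count(default, env=None):
    """Return the GPU count to declare to Ray (hypothetical helper).

    If HIP_VISIBLE_DEVICES is set, count its comma-separated entries;
    otherwise fall back to the caller-supplied default.
    """
    env = os.environ if env is None else env
    raw = env.get("HIP_VISIBLE_DEVICES")
    if raw is None:
        return default
    return len([d for d in raw.split(",") if d.strip()])


def main():
    # Imported lazily so the helper above can be used without Ray installed.
    import ray

    # Declare the GPUs by hand, since Ray does not detect them itself here.
    ray.init(num_gpus=visible_gpu_count(default=4))

    @ray.remote(num_gpus=1)
    def which_gpu():
        # IDs of the GPU(s) Ray assigned to this task.
        return ray.get_gpu_ids()

    print(ray.get(which_gpu.remote()))
    ray.shutdown()
```

Calling main() on the GPU server should print the ID Ray assigned to the task; the exact value depends on the Ray version and scheduling.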

After some digging into rllib command options I updated my command to
rllib train -f one.yml --torch --ray-num-gpus 2
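For anyone launching from Python instead of the CLI, a rough equivalent might look like the sketch below. It assumes a Ray release where tune.run accepts the registered trainable name "DQN", and it copies only a few keys from the config printed in the error message; build_dqn_config and the one-iteration stop condition are hypothetical and added for illustration.

```python
def build_dqn_config(num_gpus):
    """Subset of the config from the error message (hypothetical helper)."""
    return {
        "framework": "torch",
        "env": "BreakoutNoFrameskip-v4",
        "num_gpus": num_gpus,  # GPUs requested by the trainer itself
    }


def main():
    # Imported lazily so build_dqn_config can be used without Ray installed.
    import ray
    from ray import tune

    # Same effect as --ray-num-gpus 2: tell Ray how many GPUs the node has.
    ray.init(num_gpus=2)

    # The trial requests 1 GPU, which now fits within the 2 declared above.
    tune.run(
        "DQN",
        config=build_dqn_config(num_gpus=1),
        stop={"training_iteration": 1},  # assumption: a short smoke test
    )
    ray.shutdown()
```

The point is that the number passed to ray.init (what the cluster offers) must be at least the num_gpus in the trainer config (what the trial requests).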

It appears to be running now. Thank you for confirming.

I am trying to set up Ray on an AMD cluster with 8 GPUs.

When I run ray.init(num_gpus=8), as suggested above to make Ray aware of the GPUs, it deadlocks.

It may be of interest that tensorflow (ROCm version) does detect these GPUs and I am able to run training.

@LucaCappelletti94 I made a Ray build based on the TensorFlow 2.4 image. Maybe it will be useful for you.

docker pull peterpirogtf/ray_tf2

Hello @Peter_Pirog, is this TensorFlow compatible with the ROCm TensorFlow? It is a different build altogether from the CUDA tensorflow. Is building Ray for a specific version of TensorFlow complex? Do you think it could be possible to build it with the aforementioned ROCm TensorFlow?

@LucaCappelletti94 I don't know ROCm TensorFlow. Building your own Docker image is not complex, but there are some important details. For me, this Docker course was very useful: https://www.udemy.com/course/docker-and-kubernetes-the-complete-guide/

I am familiar with Docker images; the issue here is how TensorFlow ROCm interacts with Ray. Is there a need to compile Ray against this specific TensorFlow distribution?
I would have expected the two libraries to be fully decoupled.