I'm running RLlib in the latest rocm/pytorch Docker container, and it cannot detect any of the 4 GPUs on the server. The same configuration runs without issues on an NVIDIA system. I can't find any documentation that specifically covers running RLlib on AMD GPUs. Are AMD GPUs supported, or is something wrong with my environment?
rllib train -f one.yml --torch
2020-12-21 18:40:49,370 INFO services.py:1092 -- View the Ray dashboard at http://127.0.0.1:8265
Traceback (most recent call last):
  File "/opt/conda/bin/rllib", line 8, in
    sys.exit(cli())
  File "/opt/conda/lib/python3.6/site-packages/ray/rllib/scripts.py", line 34, in cli
    train.run(options, train_parser)
  File "/opt/conda/lib/python3.6/site-packages/ray/rllib/train.py", line 215, in run
    concurrent=True)
  File "/opt/conda/lib/python3.6/site-packages/ray/tune/tune.py", line 490, in run_experiments
    scheduler=scheduler).trials
  File "/opt/conda/lib/python3.6/site-packages/ray/tune/tune.py", line 411, in run
    runner.step()
  File "/opt/conda/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 572, in step
    self.trial_executor.on_no_available_trials(self)
  File "/opt/conda/lib/python3.6/site-packages/ray/tune/trial_executor.py", line 177, in on_no_available_trials
    "Insufficient cluster resources to launch trial: "
ray.tune.error.TuneError: Insufficient cluster resources to launch trial: trial requested 1 CPUs, 1 GPUs, but the cluster has only 56 CPUs, 0 GPUs, 121.29 GiB heap, 38.62 GiB objects (1.0 node:10.217.77.119).
You can adjust the resource requests of RLlib agents by setting num_workers, num_gpus, and other configs. See the DEFAULT_CONFIG defined by each agent for more info.
The config of this agent is: {'framework': 'torch', 'double_q': False, 'dueling': False, 'num_atoms': 1, 'noisy': False, 'prioritized_replay': False, 'n_step': 1, 'target_network_update_freq': 8000, 'lr': 6.25e-05, 'adam_epsilon': 0.00015, 'hiddens': [512], 'learning_starts': 20000, 'buffer_size': 1000000, 'rollout_fragment_length': 4, 'train_batch_size': 32, 'exploration_config': {'epsilon_timesteps': 200000, 'final_epsilon': 0.01}, 'prioritized_replay_alpha': 0.5, 'final_prioritized_replay_beta': 1.0, 'prioritized_replay_beta_annealing_timesteps': 2000000, 'num_gpus': 1, 'timesteps_per_iteration': 10000, 'env': 'BreakoutNoFrameskip-v4'}
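
To narrow down whether the problem is in the container or in Ray's resource detection, a check along these lines might help. This is only a rough sketch, not a confirmed fix. The function names probe_gpus and ray_gpu_count are made up for illustration. It relies on ROCm builds of PyTorch exposing AMD devices through the usual torch.cuda API, and on ray.init accepting an explicit num_gpus value that declares the resource by hand instead of relying on autodetection. Whether RLlib then actually trains on the AMD devices is an assumption I have not verified.

```python
import importlib


def probe_gpus():
    """Report what PyTorch can see; works even if torch is not installed."""
    report = {"torch_installed": False, "torch_gpu_count": None, "hip_version": None}
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return report
    report["torch_installed"] = True
    # ROCm builds of PyTorch report AMD GPUs through the torch.cuda namespace.
    report["torch_gpu_count"] = (
        torch.cuda.device_count() if torch.cuda.is_available() else 0
    )
    # torch.version.hip is set on ROCm builds and is None (or absent) otherwise.
    report["hip_version"] = getattr(torch.version, "hip", None)
    return report


def ray_gpu_count(num_gpus_override=None):
    """Start a local Ray instance and return how many GPUs it has registered.

    Hypothetical helper: passing num_gpus_override declares the GPU count
    explicitly instead of relying on Ray's autodetection.
    """
    import ray  # imported lazily so probe_gpus() works without Ray installed

    if num_gpus_override is None:
        ray.init()
    else:
        ray.init(num_gpus=num_gpus_override)
    try:
        return ray.cluster_resources().get("GPU", 0)
    finally:
        ray.shutdown()


# Step 1: does PyTorch inside the container see the GPUs at all?
print(probe_gpus())
```

If probe_gpus() reports 4 GPUs but ray_gpu_count() returns 0, the container is probably fine and the gap is in Ray's detection. In that case, ray_gpu_count(4) would show whether declaring the GPUs manually at least makes the "GPU" resource available to the scheduler.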