daniel
July 22, 2021, 6:26pm
1
When I follow the tutorial at RLlib: Scalable Reinforcement Learning — Ray v1.4.1
running
from ray import tune
from ray.rllib.agents.ppo import PPOTrainer
tune.run(PPOTrainer, config={"env": "CartPole-v0"})
and
rllib train --run=PPO --env=CartPole-v0 # -v [-vv] for verbose,
# --config='{"framework": "tf2", "eager_tracing": True}' for eager,
# --torch to use PyTorch OR --config='{"framework": "torch"}'
I can only use the CPU for training. Is there any configuration setting that I am missing for GPU training?
mannyv
July 23, 2021, 1:10pm
2
Hi @daniel,
Try this and see if it works:
rllib train --run=PPO --env=CartPole-v0 --ray-num-gpus=1 --config='{"num_gpus": 1}'
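If you are using the Python API from your first post instead, the equivalent (just a sketch, assuming a single local GPU and the torch framework) would be:

import ray
from ray import tune
from ray.rllib.agents.ppo import PPOTrainer

# Make the GPU visible to Ray, then ask the PPO learner to use it.
ray.init(num_gpus=1)
tune.run(
    PPOTrainer,
    config={
        "env": "CartPole-v0",
        "num_gpus": 1,          # GPUs reserved for the trainer/learner
        "framework": "torch",   # or "tf2"
    },
    stop={"training_iteration": 10},
)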
daniel
July 23, 2021, 3:55pm
3
In the case of rllib/agents/a3c/tests/test_a3c.py, when I change the test example to:
def test_a3c_compilation(self):
    """Test whether an A3CTrainer can be built with both frameworks."""
    config = a3c.DEFAULT_CONFIG.copy()
    config["num_workers"] = 2
    config["num_envs_per_worker"] = 2
    config["framework"] = "torch"
    config["num_gpus"] = 1
    num_iterations = 100

    # Test against all frameworks.
    for _ in framework_iterator(config):
        for env in ["CartPole-v0", "Pendulum-v0", "PongDeterministic-v0"]:
            print("env={}".format(env))
            trainer = a3c.A3CTrainer(config=config, env=env)
            for i in tqdm(range(num_iterations)):
                results = trainer.train()
                # print(results)
            # check_compute_single_action(trainer)
            trainer.stop()
it returns something like:
else:
    logger.info("TorchPolicy (worker={}) running on {} GPU(s).".format(
        worker_idx if worker_idx > 0 else "local", config["num_gpus"]))
    gpu_ids = ray.get_gpu_ids()
    self.devices = [
        torch.device("cuda:{}".format(i))
        for i, id_ in enumerate(gpu_ids) if i < config["num_gpus"]
    ]
>   self.device = self.devices[0]
E   IndexError: list index out of range
It seems that Ray could not locate the available GPU in this case.
How should we use a GPU for training A3C?
mannyv
July 23, 2021, 3:58pm
4
@daniel,
Try changing this part to:
def setUp(self):
    ray.init(num_cpus=4, num_gpus=1)
There have been some bugs reported about detecting GPUs with the torch framework. You could search the issues on GitHub if this does not work.
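If it still fails, a quick sanity check (just a sketch, not part of the test file) is to confirm that Ray itself registers the GPU as a schedulable resource:

import ray

ray.init(num_gpus=1)
# Both calls should report a "GPU" entry if Ray detected the card.
print("cluster resources:  ", ray.cluster_resources())
print("available resources:", ray.available_resources())
ray.shutdown()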
daniel
July 23, 2021, 4:02pm
5
After changing this line, I still get the same error as above. Is there any other solution I can try?
Thanks!
mannyv
July 23, 2021, 4:26pm
6
Hi @daniel,
I think the first thing to do (and maybe you have already done this) is to test that PyTorch is able to see your GPU in the Python interpreter, from the same environment (venv, conda environment, container, …) you run Ray in.
Something like:
import torch
print(torch.cuda.is_available(), torch.cuda.current_device(), torch.cuda.device(0), torch.cuda.device_count(), torch.cuda.get_device_name(0))
daniel
July 23, 2021, 4:27pm
7
mannyv:
print(torch.cuda.is_available(), torch.cuda.current_device(), torch.cuda.device(0), torch.cuda.device_count(), torch.cuda.get_device_name(0))
Here is the returned information
True 0 <torch.cuda.device object at 0x7f3070e976a0> 2 NVIDIA RTX 3090
mannyv
July 23, 2021, 5:37pm
8
OK, then let's wrap it in a Ray remote function call. Try this:
import ray

ray.init(num_gpus=1)

@ray.remote(num_gpus=1)
def test_torchgpu():
    import torch
    print(torch.cuda.is_available(), torch.cuda.current_device(), torch.cuda.device(0), torch.cuda.device_count(), torch.cuda.get_device_name(0))

ray.get(test_torchgpu.remote())
print("ray.get_gpu_ids(): ", ray.get_gpu_ids())
ray.shutdown()
daniel
July 23, 2021, 5:57pm
9
It returns similar output:
(pid=391663) True 0 <torch.cuda.device object at 0x7f4e2e2893c8> 1 NVIDIA RTX 3090
mannyv
July 23, 2021, 6:02pm
10
Hi @daniel, looking at the issues on GitHub, it looks like this has been broken for some weeks now. Several people seem to be waiting on a fix.
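Until that is fixed, one thing you could experiment with (an untested sketch, based only on your remote-task check above seeing the GPU) is to build and run the trainer inside a @ray.remote(num_gpus=1) task, so that ray.get_gpu_ids() is non-empty in the process where the TorchPolicy is created:

import ray
import ray.rllib.agents.a3c as a3c

ray.init(num_cpus=4, num_gpus=1)

# Untested workaround sketch: the remote task holds the GPU, so the trainer's
# local TorchPolicy should see a non-empty ray.get_gpu_ids() in this process.
@ray.remote(num_gpus=1)
def train_a3c(num_iterations=10):
    config = a3c.DEFAULT_CONFIG.copy()
    config["framework"] = "torch"
    config["num_gpus"] = 1
    trainer = a3c.A3CTrainer(config=config, env="CartPole-v0")
    for _ in range(num_iterations):
        result = trainer.train()
    trainer.stop()
    return result["episode_reward_mean"]

print(ray.get(train_a3c.remote()))
ray.shutdown()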
mannyv
July 23, 2021, 6:09pm
11
@daniel ,
This would be a good issue to read and track:
GitHub issue (opened 28 Jun 2021, labels: P1, bug, rllib):
### What is the problem?
When running a simple RLlib training script, almost identical to the example [here](https://docs.ray.io/en/master/rllib-training.html#basic-python-api), I get the following error:
```
Traceback (most recent call last):
File "test.py", line 12, in <module>
trainer = ppo.PPOTrainer(config=config, env="CartPole-v0")
File "/home/cassidy/miniconda3/envs/rllib_test/lib/python3.8/site-packages/ray/rllib/agents/trainer_template.py", line 123, in __init__
Trainer.__init__(self, config, env, logger_creator)
File "/home/cassidy/miniconda3/envs/rllib_test/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 562, in __init__
super().__init__(config, logger_creator)
File "/home/cassidy/miniconda3/envs/rllib_test/lib/python3.8/site-packages/ray/tune/trainable.py", line 100, in __init__
self.setup(copy.deepcopy(self.config))
File "/home/cassidy/miniconda3/envs/rllib_test/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 722, in setup
self._init(self.config, self.env_creator)
File "/home/cassidy/miniconda3/envs/rllib_test/lib/python3.8/site-packages/ray/rllib/agents/trainer_template.py", line 147, in _init
self.workers = self._make_workers(
File "/home/cassidy/miniconda3/envs/rllib_test/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 804, in _make_workers
return WorkerSet(
File "/home/cassidy/miniconda3/envs/rllib_test/lib/python3.8/site-packages/ray/rllib/evaluation/worker_set.py", line 92, in __init__
self._local_worker = self._make_worker(
File "/home/cassidy/miniconda3/envs/rllib_test/lib/python3.8/site-packages/ray/rllib/evaluation/worker_set.py", line 368, in _make_worker
worker = cls(
File "/home/cassidy/miniconda3/envs/rllib_test/lib/python3.8/site-packages/ray/rllib/evaluation/rollout_worker.py", line 545, in __init__
self.policy_map, self.preprocessors = self._build_policy_map(
File "/home/cassidy/miniconda3/envs/rllib_test/lib/python3.8/site-packages/ray/rllib/evaluation/rollout_worker.py", line 1323, in _build_policy_map
policy_map[name] = cls(obs_space, act_space, merged_conf)
File "/home/cassidy/miniconda3/envs/rllib_test/lib/python3.8/site-packages/ray/rllib/policy/policy_template.py", line 256, in __init__
self.parent_cls.__init__(
File "/home/cassidy/miniconda3/envs/rllib_test/lib/python3.8/site-packages/ray/rllib/policy/torch_policy.py", line 156, in __init__
self.device = self.devices[0]
IndexError: list index out of range
```
The script that reproduces the error is below in the reproduction section. It looks like this error is caused by `ray.get_gpu_ids()` returning an empty list (`[]`) despite there being GPUs attached to the system:
```
>>> import ray
>>> ray.init(num_gpus=4)
2021-06-28 12:57:14,182 INFO services.py:1330 -- View the Ray dashboard at http://127.0.0.1:8265
{'node_ip_address': '128.32.175.10', 'raylet_ip_address': '128.32.175.10', 'redis_address': '128.32.175.10:6379', 'object_store_address': '/tmp/ray/session_2021-06-28_12-57-13_130563_33105/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2021-06-28_12-57-13_130563_33105/sockets/raylet', 'webui_url': '127.0.0.1:8265', 'session_dir': '/tmp/ray/session_2021-06-28_12-57-13_130563_33105', 'metrics_export_port': 59050, 'node_id': 'b6854536c3fea8b39eac4a0723a2af43d56571adf35a750f99bc9982'}
>>> ray.get_gpu_ids()
[]
>>> import torch
>>> torch.cuda.is_available()
True
```
I'm not sure why this is happening—it didn't happen with RLlib 1.3. Interestingly, I can train on GPUs using `rllib train` (the RLlib CLI) with no issue.
*Ray version and other system information (Python version, TensorFlow version, OS):*
* Ubuntu 18.04.5
* Python 3.8.10 (anaconda)
* CUDA 11.3
<details>
<summary>Installed Python packages</summary>
```
Package Version
------------------------ -------------------
aiohttp 3.7.4.post0
aiohttp-cors 0.7.0
aioredis 1.3.1
async-timeout 3.0.1
attrs 21.2.0
blessings 1.7
cachetools 4.2.2
certifi 2021.5.30
chardet 4.0.0
click 8.0.1
cloudpickle 1.6.0
colorama 0.4.4
dm-tree 0.1.6
filelock 3.0.12
google-api-core 1.30.0
google-auth 1.32.0
googleapis-common-protos 1.53.0
gpustat 0.6.0
grpcio 1.38.1
gym 0.18.3
hiredis 2.0.0
idna 2.10
jsonschema 3.2.0
msgpack 1.0.2
multidict 5.1.0
numpy 1.21.0
nvidia-ml-py3 7.352.0
opencensus 0.7.13
opencensus-context 0.1.2
opencv-python 4.5.2.54
packaging 20.9
pandas 1.2.5
Pillow 8.2.0
pip 21.1.2
prometheus-client 0.11.0
protobuf 3.17.3
psutil 5.8.0
py-spy 0.3.7
pyasn1 0.4.8
pyasn1-modules 0.2.8
pydantic 1.8.2
pyglet 1.5.15
pyparsing 2.4.7
pyrsistent 0.18.0
python-dateutil 2.8.1
pytz 2021.1
PyYAML 5.4.1
ray 2.0.0.dev0
redis 3.5.3
requests 2.25.1
rsa 4.7.2
scipy 1.7.0
setuptools 52.0.0.post20210125
six 1.16.0
tabulate 0.8.9
torch 1.9.0
typing-extensions 3.10.0.0
urllib3 1.26.6
wheel 0.36.2
yarl 1.6.3
```
</details>
### Reproduction (REQUIRED)
<details>
<summary>test.py</summary>
```
import ray
import ray.rllib.agents.ppo as ppo
from ray.tune.logger import pretty_print

ray.init()
config = ppo.DEFAULT_CONFIG.copy()
config.update({
    "num_gpus": 1,
    "num_workers": 1,
    "framework": "torch",
})
trainer = ppo.PPOTrainer(config=config, env="CartPole-v0")

# Can optionally call trainer.restore(path) to load a checkpoint.
for i in range(1000):
    # Perform one iteration of training the policy with PPO
    result = trainer.train()
    print(pretty_print(result))

    if i % 100 == 0:
        checkpoint = trainer.save()
        print("checkpoint saved at", checkpoint)
```
</details>
- [x] I have verified my script runs in a clean environment and reproduces the issue.
- [x] I have verified the issue also occurs with the [latest wheels](https://docs.ray.io/en/master/installation.html).