daniel
July 22, 2021, 6:26pm
1
When I follow the tutorial at RLlib: Scalable Reinforcement Learning — Ray v1.4.1
running
from ray import tune
from ray.rllib.agents.ppo import PPOTrainer
tune.run(PPOTrainer, config={"env": "CartPole-v0"})
and
rllib train --run=PPO --env=CartPole-v0 # -v [-vv] for verbose,
# --config='{"framework": "tf2", "eager_tracing": True}' for eager,
# --torch to use PyTorch OR --config='{"framework": "torch"}'
I can only use the CPU for training. Is there any configuration setting that I am missing for GPU training?
mannyv
July 23, 2021, 1:10pm
2
Hi @daniel,
Try this and see if it works:
rllib train --run=PPO --env=CartPole-v0 --ray-num-gpus=1 --config='{"num_gpus": 1}'
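If you are using the Python API from your first post instead, the equivalent (just a sketch, assuming a single local GPU and the torch framework) would be:

import ray
from ray import tune
from ray.rllib.agents.ppo import PPOTrainer

# Make the GPU visible to Ray, then ask the PPO learner to use it.
ray.init(num_gpus=1)
tune.run(
    PPOTrainer,
    config={
        "env": "CartPole-v0",
        "num_gpus": 1,          # GPUs reserved for the trainer/learner
        "framework": "torch",   # or "tf2"
    },
    stop={"training_iteration": 10},
)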
daniel
July 23, 2021, 3:55pm
3
In the case of rllib/agents/a3c/tests/test_a3c.py, when I change the test example to:
def test_a3c_compilation(self):
    """Test whether an A3CTrainer can be built with both frameworks."""
    config = a3c.DEFAULT_CONFIG.copy()
    config["num_workers"] = 2
    config["num_envs_per_worker"] = 2
    config["framework"] = "torch"
    config["num_gpus"] = 1
    num_iterations = 100

    # Test against all frameworks.
    for _ in framework_iterator(config):
        for env in ["CartPole-v0", "Pendulum-v0", "PongDeterministic-v0"]:
            print("env={}".format(env))
            trainer = a3c.A3CTrainer(config=config, env=env)
            for i in tqdm(range(num_iterations)):
                results = trainer.train()
                # print(results)
            # check_compute_single_action(trainer)
            trainer.stop()
it returns something like:
else:
    logger.info("TorchPolicy (worker={}) running on {} GPU(s).".format(
        worker_idx if worker_idx > 0 else "local", config["num_gpus"]))
    gpu_ids = ray.get_gpu_ids()
    self.devices = [
        torch.device("cuda:{}".format(i))
        for i, id_ in enumerate(gpu_ids) if i < config["num_gpus"]
    ]
>   self.device = self.devices[0]
E   IndexError: list index out of range
It seems that Ray could not locate the available GPU in this case.
How should we use a GPU for training A3C?
mannyv
July 23, 2021, 3:58pm
4
@daniel,
Try changing this part to:
def setUp(self):
    ray.init(num_cpus=4, num_gpus=1)
There have been some bugs reported about detecting GPUs with the torch framework. You could search the issues on GitHub if this does not work.
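If it still fails, a quick sanity check (just a sketch, not part of the test file) is to confirm that Ray itself registers the GPU as a schedulable resource:

import ray

ray.init(num_gpus=1)
# Both calls should report a "GPU" entry if Ray detected the card.
print("cluster resources:  ", ray.cluster_resources())
print("available resources:", ray.available_resources())
ray.shutdown()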
daniel
July 23, 2021, 4:02pm
5
After changing this line, I still get the same error as above. Is there any other solution I can try?
Thanks!
mannyv
July 23, 2021, 4:26pm
6
Hi @daniel,
I think the first thing to do (and maybe you have already done this) is to test that PyTorch is able to see your GPU in the Python interpreter, from the same environment (venv, conda environment, container, …) you run Ray in.
Something like:
import torch
print(torch.cuda.is_available(), torch.cuda.current_device(), torch.cuda.device(0), torch.cuda.device_count(), torch.cuda.get_device_name(0))
daniel
July 23, 2021, 4:27pm
7
mannyv:
print(torch.cuda.is_available(), torch.cuda.current_device(), torch.cuda.device(0), torch.cuda.device_count(), torch.cuda.get_device_name(0))
Here is the returned information
True 0 <torch.cuda.device object at 0x7f3070e976a0> 2 NVIDIA RTX 3090
mannyv
July 23, 2021, 5:37pm
8
OK, then let's wrap it in a Ray remote function call. Try this:
import ray

ray.init(num_gpus=1)

@ray.remote(num_gpus=1)
def test_torchgpu():
    import torch
    print(torch.cuda.is_available(), torch.cuda.current_device(), torch.cuda.device(0), torch.cuda.device_count(), torch.cuda.get_device_name(0))

ray.get(test_torchgpu.remote())
print("ray.get_gpu_ids(): ", ray.get_gpu_ids())
ray.shutdown()
daniel
July 23, 2021, 5:57pm
9
It returns similar output:
(pid=391663) True 0 <torch.cuda.device object at 0x7f4e2e2893c8> 1 NVIDIA RTX 3090
mannyv
July 23, 2021, 6:02pm
10
Hi @daniel, looking at the issues on GitHub, it looks like this has been broken for some weeks now. Several people seem to be waiting on a fix.
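Until that is fixed, one thing you could experiment with (an untested sketch, based only on your remote-task check above seeing the GPU) is to build and run the trainer inside a @ray.remote(num_gpus=1) task, so that ray.get_gpu_ids() is non-empty in the process where the TorchPolicy is created:

import ray
import ray.rllib.agents.a3c as a3c

ray.init(num_cpus=4, num_gpus=1)

# Untested workaround sketch: the remote task holds the GPU, so the trainer's
# local TorchPolicy should see a non-empty ray.get_gpu_ids() in this process.
@ray.remote(num_gpus=1)
def train_a3c(num_iterations=10):
    config = a3c.DEFAULT_CONFIG.copy()
    config["framework"] = "torch"
    config["num_gpus"] = 1
    trainer = a3c.A3CTrainer(config=config, env="CartPole-v0")
    for _ in range(num_iterations):
        result = trainer.train()
    trainer.stop()
    return result["episode_reward_mean"]

print(ray.get(train_a3c.remote()))
ray.shutdown()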
mannyv
July 23, 2021, 6:09pm
11
@daniel ,
This would be a good issue to read and track:
GitHub issue (opened 28 Jun 2021, labels: P1, bug, rllib):
### What is the problem?
When running a simple RLlib training script, almost identical to the example [here](https://docs.ray.io/en/master/rllib-training.html#basic-python-api), I get the following error:
```
Traceback (most recent call last):
File "test.py", line 12, in <module>
trainer = ppo.PPOTrainer(config=config, env="CartPole-v0")
File "/home/cassidy/miniconda3/envs/rllib_test/lib/python3.8/site-packages/ray/rllib/agents/trainer_template.py", line 123, in __init__
Trainer.__init__(self, config, env, logger_creator)
File "/home/cassidy/miniconda3/envs/rllib_test/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 562, in __init__
super().__init__(config, logger_creator)
File "/home/cassidy/miniconda3/envs/rllib_test/lib/python3.8/site-packages/ray/tune/trainable.py", line 100, in __init__
self.setup(copy.deepcopy(self.config))
File "/home/cassidy/miniconda3/envs/rllib_test/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 722, in setup
self._init(self.config, self.env_creator)
File "/home/cassidy/miniconda3/envs/rllib_test/lib/python3.8/site-packages/ray/rllib/agents/trainer_template.py", line 147, in _init
self.workers = self._make_workers(
File "/home/cassidy/miniconda3/envs/rllib_test/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 804, in _make_workers
return WorkerSet(
File "/home/cassidy/miniconda3/envs/rllib_test/lib/python3.8/site-packages/ray/rllib/evaluation/worker_set.py", line 92, in __init__
self._local_worker = self._make_worker(
File "/home/cassidy/miniconda3/envs/rllib_test/lib/python3.8/site-packages/ray/rllib/evaluation/worker_set.py", line 368, in _make_worker
worker = cls(
File "/home/cassidy/miniconda3/envs/rllib_test/lib/python3.8/site-packages/ray/rllib/evaluation/rollout_worker.py", line 545, in __init__
self.policy_map, self.preprocessors = self._build_policy_map(
File "/home/cassidy/miniconda3/envs/rllib_test/lib/python3.8/site-packages/ray/rllib/evaluation/rollout_worker.py", line 1323, in _build_policy_map
policy_map[name] = cls(obs_space, act_space, merged_conf)
File "/home/cassidy/miniconda3/envs/rllib_test/lib/python3.8/site-packages/ray/rllib/policy/policy_template.py", line 256, in __init__
self.parent_cls.__init__(
File "/home/cassidy/miniconda3/envs/rllib_test/lib/python3.8/site-packages/ray/rllib/policy/torch_policy.py", line 156, in __init__
self.device = self.devices[0]
IndexError: list index out of range
```
The script that reproduces the error is below in the reproduction section. It looks like this error is caused by `ray.get_gpu_ids()` returning an empty list (`[]`) despite there being GPUs attached to the system:
```
>>> import ray
>>> ray.init(num_gpus=4)
2021-06-28 12:57:14,182 INFO services.py:1330 -- View the Ray dashboard at http://127.0.0.1:8265
{'node_ip_address': '128.32.175.10', 'raylet_ip_address': '128.32.175.10', 'redis_address': '128.32.175.10:6379', 'object_store_address': '/tmp/ray/session_2021-06-28_12-57-13_130563_33105/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2021-06-28_12-57-13_130563_33105/sockets/raylet', 'webui_url': '127.0.0.1:8265', 'session_dir': '/tmp/ray/session_2021-06-28_12-57-13_130563_33105', 'metrics_export_port': 59050, 'node_id': 'b6854536c3fea8b39eac4a0723a2af43d56571adf35a750f99bc9982'}
>>> ray.get_gpu_ids()
[]
>>> import torch
>>> torch.cuda.is_available()
True
```
I'm not sure why this is happening—it didn't happen with RLlib 1.3. Interestingly, I can train on GPUs using `rllib train` (the RLlib CLI) with no issue.
*Ray version and other system information (Python version, TensorFlow version, OS):*
* Ubuntu 18.04.5
* Python 3.8.10 (anaconda)
* CUDA 11.3
<details>
<summary>Installed Python packages</summary>
```
Package Version
------------------------ -------------------
aiohttp 3.7.4.post0
aiohttp-cors 0.7.0
aioredis 1.3.1
async-timeout 3.0.1
attrs 21.2.0
blessings 1.7
cachetools 4.2.2
certifi 2021.5.30
chardet 4.0.0
click 8.0.1
cloudpickle 1.6.0
colorama 0.4.4
dm-tree 0.1.6
filelock 3.0.12
google-api-core 1.30.0
google-auth 1.32.0
googleapis-common-protos 1.53.0
gpustat 0.6.0
grpcio 1.38.1
gym 0.18.3
hiredis 2.0.0
idna 2.10
jsonschema 3.2.0
msgpack 1.0.2
multidict 5.1.0
numpy 1.21.0
nvidia-ml-py3 7.352.0
opencensus 0.7.13
opencensus-context 0.1.2
opencv-python 4.5.2.54
packaging 20.9
pandas 1.2.5
Pillow 8.2.0
pip 21.1.2
prometheus-client 0.11.0
protobuf 3.17.3
psutil 5.8.0
py-spy 0.3.7
pyasn1 0.4.8
pyasn1-modules 0.2.8
pydantic 1.8.2
pyglet 1.5.15
pyparsing 2.4.7
pyrsistent 0.18.0
python-dateutil 2.8.1
pytz 2021.1
PyYAML 5.4.1
ray 2.0.0.dev0
redis 3.5.3
requests 2.25.1
rsa 4.7.2
scipy 1.7.0
setuptools 52.0.0.post20210125
six 1.16.0
tabulate 0.8.9
torch 1.9.0
typing-extensions 3.10.0.0
urllib3 1.26.6
wheel 0.36.2
yarl 1.6.3
```
</details>
### Reproduction (REQUIRED)
<details>
<summary>test.py</summary>
```
import ray
import ray.rllib.agents.ppo as ppo
from ray.tune.logger import pretty_print

ray.init()
config = ppo.DEFAULT_CONFIG.copy()
config.update({
    "num_gpus": 1,
    "num_workers": 1,
    "framework": "torch",
})
trainer = ppo.PPOTrainer(config=config, env="CartPole-v0")

# Can optionally call trainer.restore(path) to load a checkpoint.
for i in range(1000):
    # Perform one iteration of training the policy with PPO
    result = trainer.train()
    print(pretty_print(result))

    if i % 100 == 0:
        checkpoint = trainer.save()
        print("checkpoint saved at", checkpoint)
```
</details>
- [x] I have verified my script runs in a clean environment and reproduces the issue.
- [x] I have verified the issue also occurs with the [latest wheels](https://docs.ray.io/en/master/installation.html).