[RLlib] "target_q_model" not using GPU for custom model

I have set `"num_gpus": 1`. `policy.model` is using the GPU, but `policy.target_q_model` is not on the GPU when it is called from the `build_q_losses` method in `simple_q_torch_policy.py`.
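
For reference, the setup looks roughly like this (a minimal sketch; only the trainer, env, and custom-model file come from the traceback below, everything else is illustrative):

```python
from ray.rllib.agents.dqn import SimpleQTrainer
from ray.rllib.models import ModelCatalog

from model import MyQModel                    # custom TorchModelV2 from model.py (class name illustrative)
from env import HierarchicalGraphColorEnv     # env from the traceback (module path illustrative)

ModelCatalog.register_custom_model("my_q_model", MyQModel)

config = {
    "framework": "torch",
    "num_gpus": 1,                             # policy.model ends up on the GPU ...
    "model": {"custom_model": "my_q_model"},   # ... but policy.target_q_model stays on the CPU
}

trainer = SimpleQTrainer(config=config, env=HierarchicalGraphColorEnv)
trainer.train()
```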

Traceback (most recent call last):
File "/home/cs20mtech12003/anaconda3/envs/rllib_env/lib/python3.7/site-packages/ray/tune/function_runner.py", line 248, in run
  self._entrypoint()
File "/home/cs20mtech12003/anaconda3/envs/rllib_env/lib/python3.7/site-packages/ray/tune/function_runner.py", line 316, in entrypoint
  self._status_reporter.get_checkpoint())
File "/home/cs20mtech12003/anaconda3/envs/rllib_env/lib/python3.7/site-packages/ray/tune/function_runner.py", line 580, in _trainable_func
  output = fn()
File "experiment.py", line 33, in experiment
  train_agent = SimpleQTrainer(config=config, env=HierarchicalGraphColorEnv)
File "/home/cs20mtech12003/anaconda3/envs/rllib_env/lib/python3.7/site-packages/ray/rllib/agents/trainer_template.py", line 123, in __init__
  Trainer.__init__(self, config, env, logger_creator)
File "/home/cs20mtech12003/anaconda3/envs/rllib_env/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 584, in __init__
  super().__init__(config, logger_creator)
File "/home/cs20mtech12003/anaconda3/envs/rllib_env/lib/python3.7/site-packages/ray/tune/trainable.py", line 103, in __init__
  self.setup(copy.deepcopy(self.config))
File "/home/cs20mtech12003/anaconda3/envs/rllib_env/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 731, in setup
  self._init(self.config, self.env_creator)
File "/home/cs20mtech12003/anaconda3/envs/rllib_env/lib/python3.7/site-packages/ray/rllib/agents/trainer_template.py", line 152, in _init
  num_workers=self.config["num_workers"])
File "/home/cs20mtech12003/anaconda3/envs/rllib_env/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 819, in _make_workers
  logdir=self.logdir)
File "/home/cs20mtech12003/anaconda3/envs/rllib_env/lib/python3.7/site-packages/ray/rllib/evaluation/worker_set.py", line 103, in __init__
  spaces=spaces,
File "/home/cs20mtech12003/anaconda3/envs/rllib_env/lib/python3.7/site-packages/ray/rllib/evaluation/worker_set.py", line 431, in _make_worker
  spaces=spaces,
File "/home/cs20mtech12003/anaconda3/envs/rllib_env/lib/python3.7/site-packages/ray/rllib/evaluation/rollout_worker.py", line 557, in __init__
  policy_dict, policy_config)
File "/home/cs20mtech12003/anaconda3/envs/rllib_env/lib/python3.7/site-packages/ray/rllib/evaluation/rollout_worker.py", line 1342, in _build_policy_map
  policy_map[name] = cls(obs_space, act_space, merged_conf)
File "/home/cs20mtech12003/anaconda3/envs/rllib_env/lib/python3.7/site-packages/ray/rllib/policy/policy_template.py", line 281, in __init__
  stats_fn=stats_fn,
File "/home/cs20mtech12003/anaconda3/envs/rllib_env/lib/python3.7/site-packages/ray/rllib/policy/policy.py", line 691, in _initialize_loss_from_dummy_batch
  self._loss(self, self.model, self.dist_class, train_batch)
File "/home/cs20mtech12003/anaconda3/envs/rllib_env/lib/python3.7/site-packages/ray/rllib/agents/dqn/simple_q_torch_policy.py", line 91, in build_q_losses
  is_training=True)
File "/home/cs20mtech12003/anaconda3/envs/rllib_env/lib/python3.7/site-packages/ray/rllib/agents/dqn/simple_q_tf_policy.py", line 185, in compute_q_values
  }, [], None)
File "/home/cs20mtech12003/anaconda3/envs/rllib_env/lib/python3.7/site-packages/ray/rllib/models/modelv2.py", line 231, in __call__
  res = self.forward(restored, state or [], seq_lens)
File "/home/cs20mtech12003/ML-Register-Allocation/model/RegAlloc/ggnn_drl/rllib_split_model/src/model.py", line 138, in forward
  x = F.relu(self.fc1(input_dict["obs"]["state"]))
File "/home/cs20mtech12003/anaconda3/envs/rllib_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
  result = self.forward(*input, **kwargs)
File "/home/cs20mtech12003/anaconda3/envs/rllib_env/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 93, in forward
  return F.linear(input, self.weight, self.bias)
File "/home/cs20mtech12003/anaconda3/envs/rllib_env/lib/python3.7/site-packages/torch/nn/functional.py", line 1690, in linear
  ret = torch.addmm(bias, input, weight.t())
RuntimeError: Tensor for 'out' is on CPU, Tensor for argument #1 'self' is on CPU, but expected them to be on GPU (while checking arguments for addmm)

Hey @Siddharth_Jain, could you check the latest master version? We fixed this issue recently, a few PRs ago. :slight_smile:

We are also now running nightly multi-GPU learning tests for all major algos (incl. DQN/SimpleQ), for both tf and torch, to make sure everything runs fine on a 2-GPU machine.

We’ll add LSTM-based tests to these as well (for the RNN-supporting algos, like PPO) in the next 2 weeks.
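
If upgrading right away is not an option, one interim workaround (a minimal sketch, not the actual fix that went into master; `MyQModel` and the layer sizes are illustrative) is to have the custom model's `forward()` move its own weights onto the device of the incoming batch, so the not-yet-moved target net follows the GPU batch the first time it is called:

```python
import torch.nn as nn
import torch.nn.functional as F

from ray.rllib.models.torch.torch_modelv2 import TorchModelV2


class MyQModel(TorchModelV2, nn.Module):
    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        TorchModelV2.__init__(self, obs_space, action_space, num_outputs,
                              model_config, name)
        nn.Module.__init__(self)
        self.fc1 = nn.Linear(64, 128)          # illustrative sizes
        self.fc2 = nn.Linear(128, num_outputs)

    def forward(self, input_dict, state, seq_lens):
        obs = input_dict["obs"]["state"]
        # If this module's weights are not on the same device as the batch
        # (e.g. target_q_model was left on the CPU), move them over once.
        if next(self.parameters()).device != obs.device:
            self.to(obs.device)
        x = F.relu(self.fc1(obs))
        return self.fc2(x), state
```

For `policy.model` the `.to()` call is a no-op (it is already on the GPU), so only the un-moved target network gets migrated on its first forward pass.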