RuntimeError: Expected scalars to be on CPU, got cuda:0 instead

Denys_Ashikhin · March 31, 2023, 6:57pm

How severe does this issue affect your experience of using Ray?

High: It blocks me to complete my task.

Hi all,

I am trying to load in a previously trained model to continue training it, except I get the following error:

Failure # 1 (occurred at 2023-03-31_14-54-08)
e[36mray::PPO.train()e[39m (pid=5616, ip=127.0.0.1, repr=PPO)
  File "python\ray\_raylet.pyx", line 875, in ray._raylet.execute_task
  File "python\ray\_raylet.pyx", line 879, in ray._raylet.execute_task
  File "python\ray\_raylet.pyx", line 819, in ray._raylet.execute_task.function_executor
  File "C:\personal\ai\ray_venv\lib\site-packages\ray\_private\function_manager.py", line 674, in actor_method_executor
    return method(__ray_actor, *args, **kwargs)
  File "C:\personal\ai\ray_venv\lib\site-packages\ray\util\tracing\tracing_helper.py", line 460, in _resume_span
    return method(self, *_args, **_kwargs)
  File "C:\personal\ai\ray_venv\lib\site-packages\ray\tune\trainable\trainable.py", line 384, in train
    raise skipped from exception_cause(skipped)
  File "C:\personal\ai\ray_venv\lib\site-packages\ray\tune\trainable\trainable.py", line 381, in train
    result = self.step()
  File "C:\personal\ai\ray_venv\lib\site-packages\ray\util\tracing\tracing_helper.py", line 460, in _resume_span
    return method(self, *_args, **_kwargs)
  File "C:\personal\ai\ray_venv\lib\site-packages\ray\rllib\algorithms\algorithm.py", line 794, in step
    results, train_iter_ctx = self._run_one_training_iteration()
  File "C:\personal\ai\ray_venv\lib\site-packages\ray\util\tracing\tracing_helper.py", line 460, in _resume_span
    return method(self, *_args, **_kwargs)
  File "C:\personal\ai\ray_venv\lib\site-packages\ray\rllib\algorithms\algorithm.py", line 2810, in _run_one_training_iteration
    results = self.training_step()
  File "C:\personal\ai\ray_venv\lib\site-packages\ray\util\tracing\tracing_helper.py", line 460, in _resume_span
    return method(self, *_args, **_kwargs)
  File "C:\personal\ai\ray_venv\lib\site-packages\ray\rllib\algorithms\ppo\ppo.py", line 420, in training_step
    train_results = train_one_step(self, train_batch)
  File "C:\personal\ai\ray_venv\lib\site-packages\ray\rllib\execution\train_ops.py", line 52, in train_one_step
    info = do_minibatch_sgd(
  File "C:\personal\ai\ray_venv\lib\site-packages\ray\rllib\utils\sgd.py", line 129, in do_minibatch_sgd
    local_worker.learn_on_batch(
  File "C:\personal\ai\ray_venv\lib\site-packages\ray\rllib\evaluation\rollout_worker.py", line 1029, in learn_on_batch
    info_out[pid] = policy.learn_on_batch(batch)
  File "C:\personal\ai\ray_venv\lib\site-packages\ray\rllib\utils\threading.py", line 24, in wrapper
    return func(self, *a, **k)
  File "C:\personal\ai\ray_venv\lib\site-packages\ray\rllib\policy\torch_policy_v2.py", line 663, in learn_on_batch
    self.apply_gradients(_directStepOptimizerSingleton)
  File "C:\personal\ai\ray_venv\lib\site-packages\ray\rllib\policy\torch_policy_v2.py", line 880, in apply_gradients
    opt.step()
  File "C:\personal\ai\ray_venv\lib\site-packages\torch\optim\optimizer.py", line 280, in wrapper
    out = func(*args, **kwargs)
  File "C:\personal\ai\ray_venv\lib\site-packages\torch\optim\optimizer.py", line 33, in _use_grad
    ret = func(self, *args, **kwargs)
  File "C:\personal\ai\ray_venv\lib\site-packages\torch\optim\adam.py", line 141, in step
    adam(
  File "C:\personal\ai\ray_venv\lib\site-packages\torch\optim\adam.py", line 281, in adam
    func(params,
  File "C:\personal\ai\ray_venv\lib\site-packages\torch\optim\adam.py", line 449, in _multi_tensor_adam
    torch._foreach_addcmul_(device_exp_avg_sqs, device_grads, device_grads, 1 - beta2)
RuntimeError: Expected scalars to be on CPU, got cuda:0 instead.

The relevant code:

tune.run("PPO",
         resume='AUTO',
         # param_space=config,
         config=ppo_config.to_dict(),
         name=name, keep_checkpoints_num=None, checkpoint_score_attr="episode_reward_mean",
         max_failures=1,
         # restore="C:\\Users\\denys\\ray_results\\mediumbrawl-attention-256Att-128MLP-L2\\PPOTrainer_RandomEnv_1e882_00000_0_2022-06-02_15-13-44\\checkpoint_000028\\checkpoint-28",
         checkpoint_freq=5, checkpoint_at_end=True)

Thanks!

avnishn · April 13, 2023, 4:13pm

what version of ray are you using and what framework are you using?

Denys_Ashikhin · April 13, 2023, 4:15pm

Latest windows nightly wheel on using pytorch for the ml-framework.

There is a bit more info/activity at: RuntimeError: Expected scalars to be on CPU, got cuda:0 instead · Issue #34159 · ray-project/ray · GitHub

James_Liu · May 23, 2023, 12:56am

I got the same error and cannot figure it out. All tensors(device_exp_avg_sqs, device_grads, beta2) are on device cuda.

Denys_Ashikhin · May 23, 2023, 12:58am

This has been fixed if you look at the github issue I opened above.

I don’t know if the fixed has been pulled into a release yet, but you can always manually do the code changes (only a few lines changed) and it worked for me

kourosh · May 23, 2023, 3:28am

yep should be fixed in the upcoming release (2.5)

Jules_Damji · May 23, 2023, 5:37pm

@Denys_Ashikhin May be close this issue since it’s fixed in Ray 2.5, and we are in the process of releasing it in the coming days?

Denys_Ashikhin · May 23, 2023, 6:56pm

100% - Thanks for the fix!

Topic		Replies	Views
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat1 in method wrapper_addmm) RLlib	4	2905	August 8, 2022
Pytorch+ray train example not working Ray Train	4	809	November 9, 2023
Using checkpoint causes GPU failure and error during training process Checkpointing, Restoring	10	61	July 31, 2025
Ray not finding available GPU on Windows RLlib	4	1012	September 6, 2021
Ray sgd get_model error Ray Tune	4	327	June 8, 2021

RuntimeError: Expected scalars to be on CPU, got cuda:0 instead

Related topics