How severe does this issue affect your experience of using Ray?
- High: It blocks me to complete my task.
I run into the following error when trying to run a dummy trainer on my single-node setup with 4 AMD GPUs:
ray.exceptions.RayTaskError(RuntimeError): ray::_RayTrainWorker__execute.get_next() (pid=1112323, ip=, actor_id=d2813a7f99b2a331c96e65ad01000000, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x152fa5dbc190>)
File "/usr/WS2/amorin1/venv/rocm_6_2/lib/python3.11/site-packages/ray/train/_internal/", line 33, in __execute
raise skipped from exception_cause(skipped)
File "/usr/WS2/amorin1/venv/rocm_6_2/lib/python3.11/site-packages/ray/train/_internal/", line 206, in train_fn
with train_func_context():
File "/usr/WS2/amorin1/venv/rocm_6_2/lib/python3.11/site-packages/ray/train/torch/", line 27, in __enter__
File "/usr/WS2/amorin1/venv/rocm_6_2/lib/python3.11/site-packages/torch/cuda/", line 479, in set_device
RuntimeError: HIP error: invalid device ordinal
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.
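For context, one plausible reading of "invalid device ordinal" is that the index passed to set_device is not smaller than the number of devices visible to that worker process. The sketch below illustrates that check in plain Python. `visible_ordinals` and `is_valid_ordinal` are hypothetical helper names, and the assumption that a comma-separated visibility variable with N entries exposes ordinals 0..N-1 is a simplification, not Ray's or PyTorch's actual logic.

```python
def visible_ordinals(visible_devices):
    """Return the local ordinals a process would see, given the value of a
    visibility variable such as ROCR_VISIBLE_DEVICES (assumed comma-separated).

    None means the variable is unset; this sketch cannot know the device
    count in that case without querying the runtime, so it returns None.
    """
    if visible_devices is None:
        return None
    entries = [e.strip() for e in visible_devices.split(",") if e.strip()]
    # Assumption: visible devices are renumbered from 0 inside the process.
    return list(range(len(entries)))


def is_valid_ordinal(ordinal, visible_devices):
    """Check whether `ordinal` could be selected under the given restriction."""
    ordinals = visible_ordinals(visible_devices)
    if ordinals is None:
        raise ValueError("visibility variable is unset; device count unknown")
    return ordinal in ordinals
```

Under this assumption, a worker restricted to a single device (for example ROCR_VISIBLE_DEVICES=2) could only select ordinal 0, so a call that tries to select ordinal 2 would fail even though the node has four GPUs.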
The code I am running works on a single GPU (num_workers=1):
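import ray
import torch
from ray import train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker():
    # Report how many GPUs this worker process can see.
    print("Device count for this worker: {}".format(torch.cuda.device_count()))

# Assumption: the dataset constructor was lost from the post; ray.data.from_items fits the remaining "[...])".
train_dataset = ray.data.from_items(
    [{"x": x, "y": x + 1} for x in range(32)])

trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True, resources_per_worker={"GPU": 1}),
    datasets={"train": train_dataset})
# Assumption: the call that starts training is missing above; the traceback implies one was made.
result = trainer.fit()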
The code below behaves as expected: I see all four ROCR visible devices. It is only when I move to TorchTrainer that everything fails.
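import os
import ray

# Assumption: the @ray.remote decorators and the remote calls were lost from the post;
# num_gpus=1 is a guess at the original resource request.
@ray.remote(num_gpus=1)
class GPUActor:
    def ping(self):
        print("GPU IDs: {}".format(ray.get_runtime_context().get_accelerator_ids()["GPU"]))
        print("ROCR_VISIBLE_DEVICES: {}".format(os.environ["ROCR_VISIBLE_DEVICES"]))

@ray.remote(num_gpus=1)
def gpu_task():
    print("GPU IDs: {}".format(ray.get_runtime_context().get_accelerator_ids()["GPU"]))
    print("ROCR_VISIBLE_DEVICES: {}".format(os.environ["ROCR_VISIBLE_DEVICES"]))

gpu_actor = GPUActor.remote()
ray.get(gpu_actor.ping.remote())
ray.get(gpu_task.remote())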
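Because the traceback points at the context manager that wraps the training function, a print inside the training loop may never run in the failing case. A small helper that records every visibility variable at once can make the comparison between the working actor and a Train worker easier. This is a minimal sketch: `snapshot_device_env` and `format_snapshot` are hypothetical names, and the list of variables is my assumption about which ones matter on a ROCm build.

```python
import os

# Variables that commonly restrict GPU visibility on ROCm and CUDA builds
# (assumed list; extend it if your setup uses others).
DEVICE_ENV_VARS = (
    "ROCR_VISIBLE_DEVICES",
    "HIP_VISIBLE_DEVICES",
    "CUDA_VISIBLE_DEVICES",
)


def snapshot_device_env(environ=None):
    """Return the current value of each visibility variable, or "<unset>"."""
    env = os.environ if environ is None else environ
    return {name: env.get(name, "<unset>") for name in DEVICE_ENV_VARS}


def format_snapshot(snapshot):
    """Render the snapshot as a single log-friendly line."""
    return "; ".join("{}={}".format(k, v) for k, v in snapshot.items())
```

Calling `print(format_snapshot(snapshot_device_env()))` inside `ping`, `gpu_task`, and the single-worker training loop would show whether the workers are restricted through different variables.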
I’ve also verified:
- torch.cuda.is_available() is True
- torch.cuda.device_count() is 4
- I can set the device with torch.cuda.set_device() using “0,1,2,3” or “cuda:0…cuda:3”
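The manual check above can be turned into a repeatable probe that records which ordinals can be selected. The sketch below is only an illustration: `probe_ordinals` is a hypothetical helper, and it takes the device-setting function as an argument so it does not depend on torch being importable.

```python
def probe_ordinals(set_device, count):
    """Try to select each ordinal in range(count) and record the outcome.

    `set_device` is any callable taking an int, for example
    torch.cuda.set_device (passed in by the caller).
    """
    results = {}
    for ordinal in range(count):
        try:
            set_device(ordinal)
            results[ordinal] = "ok"
        except RuntimeError as exc:
            # Keep only the first line of the error message.
            lines = str(exc).splitlines()
            results[ordinal] = lines[0] if lines else "RuntimeError"
    return results
```

With PyTorch installed, this could be called as `probe_ordinals(torch.cuda.set_device, torch.cuda.device_count())`, both in the driver process and inside a worker, to compare the two environments.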
Python 3.11.7