How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
I run into the following error when trying to run a dummy trainer on my 1-node, 4-GPU AMD setup:
ray.exceptions.RayTaskError(RuntimeError): ray::_RayTrainWorker__execute.get_next() (pid=1112323, ip=192.168.192.149, actor_id=d2813a7f99b2a331c96e65ad01000000, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x152fa5dbc190>)
  File "/usr/WS2/amorin1/venv/rocm_6_2/lib/python3.11/site-packages/ray/train/_internal/worker_group.py", line 33, in __execute
    raise skipped from exception_cause(skipped)
  File "/usr/WS2/amorin1/venv/rocm_6_2/lib/python3.11/site-packages/ray/train/_internal/utils.py", line 206, in train_fn
    with train_func_context():
  File "/usr/WS2/amorin1/venv/rocm_6_2/lib/python3.11/site-packages/ray/train/torch/config.py", line 27, in __enter__
    torch.cuda.set_device(device)
  File "/usr/WS2/amorin1/venv/rocm_6_2/lib/python3.11/site-packages/torch/cuda/__init__.py", line 479, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: HIP error: invalid device ordinal
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.
The code I am running works on a single GPU (num_workers=1):
import ray
import torch
from ray import train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker():
    print("Device count for this worker: {}".format(torch.cuda.device_count()))


ray.init(address="auto")

train_dataset = ray.data.from_items(
    [{"x": x, "y": x + 1} for x in range(32)]
)

trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True, resources_per_worker={"GPU": 1}),
    datasets={"train": train_dataset},
)
trainer.fit()
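To get more signal out of the failing run, an instrumented variant of the train loop could print what each worker actually sees. This is only a sketch (debug_loop_per_worker is just a name I made up, and I'm guessing at which *_VISIBLE_DEVICES variables Ray sets on ROCm). Also note that the crash happens in the TorchConfig context manager before the user function runs, so only workers whose device index happens to be valid would print anything:

import os

import ray
import torch
from ray import train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def debug_loop_per_worker():
    # Only reached on workers where torch.cuda.set_device() succeeded,
    # since that call happens before this function is invoked.
    ctx = train.get_context()
    print("world rank: {}, local rank: {}".format(ctx.get_world_rank(), ctx.get_local_rank()))
    print("ROCR_VISIBLE_DEVICES: {}".format(os.environ.get("ROCR_VISIBLE_DEVICES")))
    print("HIP_VISIBLE_DEVICES: {}".format(os.environ.get("HIP_VISIBLE_DEVICES")))
    print("CUDA_VISIBLE_DEVICES: {}".format(os.environ.get("CUDA_VISIBLE_DEVICES")))
    print("torch.cuda.device_count(): {}".format(torch.cuda.device_count()))


ray.init(address="auto")
trainer = TorchTrainer(
    debug_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
)
trainer.fit()

If only one worker ever prints, that would fit the theory that each worker sees a single device but set_device() is being called with a cluster-wide index.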
The code below behaves as expected; I see all four ROCR-visible devices. It is only when I move to TorchTrainer that everything fails:
import os

import ray

ray.init(address='auto')


@ray.remote(num_gpus=2)
class GPUActor:
    def ping(self):
        print("GPU IDs: {}".format(ray.get_runtime_context().get_accelerator_ids()["GPU"]))
        print("ROCR_VISIBLE_DEVICES: {}".format(os.environ["ROCR_VISIBLE_DEVICES"]))


@ray.remote(num_gpus=2)
def gpu_task():
    print("GPU IDs: {}".format(ray.get_runtime_context().get_accelerator_ids()["GPU"]))
    print("ROCR_VISIBLE_DEVICES: {}".format(os.environ["ROCR_VISIBLE_DEVICES"]))


gpu_actor = GPUActor.remote()
ray.get(gpu_actor.ping.remote())
ray.get(gpu_task.remote())
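My guess (unverified) is that the TorchConfig context manager ends up calling torch.cuda.set_device() with the cluster-wide GPU id while each Train worker only has its one assigned device visible. To mirror the failing configuration (one GPU per worker) outside of Train, a probe along these lines should show whether that ordinal mismatch reproduces. This is only a sketch; set_device_probe is a name I made up:

import ray
import torch

ray.init(address="auto")


@ray.remote(num_gpus=1)
def set_device_probe():
    # Raw accelerator id Ray assigned to this task, e.g. "2".
    gpu_id = ray.get_runtime_context().get_accelerator_ids()["GPU"][0]
    visible = torch.cuda.device_count()
    print("assigned GPU id: {}, visible device count: {}".format(gpu_id, visible))
    # With exactly one visible device, index 0 should always be settable.
    torch.cuda.set_device(0)
    # Replaying set_device with the raw id should reproduce
    # "invalid device ordinal" whenever int(gpu_id) >= visible.
    try:
        torch.cuda.set_device(int(gpu_id))
        print("set_device({}) succeeded".format(gpu_id))
    except RuntimeError as e:
        print("set_device({}) failed: {}".format(gpu_id, e))


ray.get([set_device_probe.remote() for _ in range(4)])

If set_device(0) succeeds in every task but set_device(int(gpu_id)) fails for ids 1-3, that would match the "invalid device ordinal" I'm seeing from TorchTrainer.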
I've also verified:
- torch.cuda.is_available() is True
- torch.cuda.device_count() is 4
- I can call torch.cuda.set_device() with 0, 1, 2, 3 or with "cuda:0" ... "cuda:3"
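Roughly, those checks looked like this, run in a plain Python session on the node (outside of Ray):

import torch

print(torch.cuda.is_available())  # True
print(torch.cuda.device_count())  # 4
for i in range(torch.cuda.device_count()):
    torch.cuda.set_device(i)            # works for 0..3
    torch.cuda.set_device(f"cuda:{i}")  # works for "cuda:0".."cuda:3"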
Python 3.11.7
torch.__version__: '2.5.1+rocm6.2'
ray.__version__: '2.38.0'