TorchTrainer fails on ROCm multi-GPU: invalid device ordinal

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I run into the following error when trying to run a dummy trainer on my 1-node, 4-GPU AMD setup:

ray.exceptions.RayTaskError(RuntimeError): ray::_RayTrainWorker__execute.get_next() (pid=1112323, ip=192.168.192.149, actor_id=d2813a7f99b2a331c96e65ad01000000, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x152fa5dbc190>)
  File "/usr/WS2/amorin1/venv/rocm_6_2/lib/python3.11/site-packages/ray/train/_internal/worker_group.py", line 33, in __execute
    raise skipped from exception_cause(skipped)
  File "/usr/WS2/amorin1/venv/rocm_6_2/lib/python3.11/site-packages/ray/train/_internal/utils.py", line 206, in train_fn
    with train_func_context():
  File "/usr/WS2/amorin1/venv/rocm_6_2/lib/python3.11/site-packages/ray/train/torch/config.py", line 27, in __enter__
    torch.cuda.set_device(device)
  File "/usr/WS2/amorin1/venv/rocm_6_2/lib/python3.11/site-packages/torch/cuda/__init__.py", line 479, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: HIP error: invalid device ordinal
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.
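If I read the traceback right, the failure is the train function context manager in ray/train/torch/config.py calling torch.cuda.set_device() with an index that is out of range for the devices the worker process can actually see. The same error is easy to reproduce by hand (a minimal sketch, assuming a ROCm build of PyTorch with at least one visible GPU):

import torch

# The process only sees torch.cuda.device_count() devices, indexed 0..N-1.
# Asking for an ordinal past that range raises the same HIP error as above.
print("Visible device count:", torch.cuda.device_count())
try:
    torch.cuda.set_device(torch.cuda.device_count())  # one past the last valid index
except RuntimeError as err:
    print("Reproduced:", err)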

The code I am running works on a single GPU (num_workers=1):

import ray
import torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker():
    # Print how many GPUs torch can see from inside each Train worker.
    print("Device count for this worker: {}".format(torch.cuda.device_count()))


ray.init(address="auto")

train_dataset = ray.data.from_items([{"x": x, "y": x + 1} for x in range(32)])

trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True, resources_per_worker={"GPU": 1}),
    datasets={"train": train_dataset},
)

trainer.fit()

The code below behaves as expected; I see all four ROCR-visible devices. It's only when I move to TorchTrainer that everything fails:

import os
import ray

ray.init(address='auto')


@ray.remote(num_gpus=2)
class GPUActor:
    def ping(self):
        print("GPU IDs: {}".format(ray.get_runtime_context().get_accelerator_ids()["GPU"]))
        print("ROCR_VISIBLE_DEVICES: {}".format(os.environ["ROCR_VISIBLE_DEVICES"]))

@ray.remote(num_gpus=2)
def gpu_task():
    print("GPU IDs: {}".format(ray.get_runtime_context().get_accelerator_ids()["GPU"]))
    print("ROCR_VISIBLE_DEVICES: {}".format(os.environ["ROCR_VISIBLE_DEVICES"]))


gpu_actor = GPUActor.remote()
ray.get(gpu_actor.ping.remote())

ray.get(gpu_task.remote())
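To bridge the gap between that working example and the failing TorchTrainer, one more probe I can think of is to import torch inside the same kind of one-GPU task and check whether the GPU id Ray hands out is a valid torch ordinal under Ray's masking (a sketch with a hypothetical probe() task; I haven't run it in exactly this form):

import os

import ray
import torch

ray.init(address="auto")


@ray.remote(num_gpus=1)
def probe():
    # Ray reports the node-level GPU id it assigned to this task...
    gpu_id = ray.get_runtime_context().get_accelerator_ids()["GPU"][0]
    # ...but after ROCR_VISIBLE_DEVICES masking, torch only sees the masked devices.
    visible = os.environ.get("ROCR_VISIBLE_DEVICES")
    count = torch.cuda.device_count()
    try:
        # Fails when the raw id is >= the masked device count.
        torch.cuda.set_device(int(gpu_id))
        ok = True
    except RuntimeError as err:
        ok = False
        print("set_device({}) failed: {}".format(gpu_id, err))
    return gpu_id, visible, count, ok


print(ray.get([probe.remote() for _ in range(4)]))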

I’ve also verified:

  • torch.cuda.is_available() is True
  • torch.cuda.device_count() is 4
  • I can set the device with torch.cuda.set_device() using 0, 1, 2, 3 or "cuda:0"…"cuda:3"
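For completeness, those standalone checks look roughly like this when run directly on the node, outside of any Ray worker (a rough transcription, not a verbatim copy of my session):

import torch

# Basic visibility checks on the bare node, outside of Ray.
assert torch.cuda.is_available()
assert torch.cuda.device_count() == 4

# Both integer ordinals and "cuda:N" strings are accepted.
for i in range(torch.cuda.device_count()):
    torch.cuda.set_device(i)
    torch.cuda.set_device("cuda:{}".format(i))
    print("device {} ok: {}".format(i, torch.cuda.get_device_name(i)))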

Python 3.11.7

torch.__version__
'2.5.1+rocm6.2'

ray.__version__
'2.38.0'

Can you try to run the following example code and see if it works on your AMD devices?

https://docs.ray.io/en/latest/train/api/doc/ray.train.torch.get_device.html

Thank you for the reply! All of these examples pass.

import torch
import os
import ray
from ray.train.torch import get_device

os.environ["CUDA_VISIBLE_DEVICES"] = "2,3"
ray.get_gpu_ids() == [2]
torch.cuda.is_available() == True
get_device() == torch.device("cuda:0")
print("Pass 1")

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
ray.get_gpu_ids() == [2]
torch.cuda.is_available() == True
get_device() == torch.device("cuda:2")
print("Pass 2")

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
ray.get_gpu_ids() == [2,3]
torch.cuda.is_available() == True
get_device() == torch.device("cuda:2")
print("Pass 3")

model = torch.nn.Linear(in_features=1, out_features=1)
model.to(ray.train.torch.get_device())
print("Pass 4")

python ray_help.py
2024-12-13 11:07:57,855 INFO worker.py:1637 -- Connecting to existing Ray cluster at address: 192.168.193.244:23456...
2024-12-13 11:07:57,870 INFO worker.py:1822 -- Connected to Ray cluster.
Pass 1
Pass 2
Pass 3
Pass 4
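One caveat on my side: the bare == comparisons in that example don't assert anything, so "pass" here mostly means nothing raised an exception. Also, the example masks devices with CUDA_VISIBLE_DEVICES, while on this node Ray masks its workers with ROCR_VISIBLE_DEVICES (as in the actor output earlier). A closer-to-my-setup local check might look like this (a sketch; I'm assuming torch's ROCm build honors ROCR_VISIBLE_DEVICES the same way on this machine):

import os

# ROCm masks devices via ROCR_VISIBLE_DEVICES (or HIP_VISIBLE_DEVICES); setting it
# before the torch import ensures the HIP runtime hasn't enumerated devices yet.
os.environ["ROCR_VISIBLE_DEVICES"] = "2,3"

import torch

# With two devices exposed, torch should report them as ordinals 0 and 1;
# selecting ordinal 2 or 3 here would raise the same "invalid device ordinal" error.
print("device count:", torch.cuda.device_count())
torch.cuda.set_device(0)
torch.cuda.set_device(1)
print("ok")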

I see. Could you please file a GitHub issue in ray-project/ray (https://github.com/ray-project/ray)? It might be that Ray Train doesn't handle AMD GPUs well. We will try to make a fix soon.

Sure. Thank you for the help.

I’ve submitted an issue here: [<Ray component: Train>] Ray Train fails for AMD multi-gpu: Invalid Device Ordinal (ray-project/ray#49260).