[Ray core] Ray deadlocks on an AMD HPC system

Daniel_Wiczew · February 3, 2023, 4:37pm

Hello,

I am trying to run code that uses Ray to distribute work across several GPUs. On the CUDA platform it works perfectly well and runs without problems. The problem appears when I run it on an AMD HPC system with the ROCm/HIP platform instead of CUDA: the process deadlocks, and there is no log and no response, so there is nothing to debug. I cannot even diagnose what to try next.

Ray version: 2.2.0
Method to initialize Ray: ray.init(num_gpus=8)
Architecture of the HPC node: a Cray machine with four AMD EPYC 7A53 64-core CPUs and four MI250X GPUs, 256 GB RAM and 512 GB video RAM.
Log level: DEBUG, set via environment variable. The last lines of output with this setting are:
2023-02-03 17:47:01,242 INFO worker.py:1529 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
[2023-02-03 17:47:01,247 I 1451242 1451242] logging.cc:230: Set ray log level from environment variable RAY_BACKEND_LOG_LEVEL to -1
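
Since the hang produces no output at all, one generic way to get at least a Python stack trace is the standard-library faulthandler module. The following is a minimal sketch, not something taken from the original code: the helper name run_with_hang_watchdog and the timeout value are hypothetical, and it only covers Python threads in the process that calls it (the driver script), not Ray's separate worker processes.

```python
import faulthandler
import sys


def run_with_hang_watchdog(fn, timeout_s=60.0, out=None):
    """Run fn(); while it is still running, dump all thread stacks every timeout_s seconds.

    `out` must be a real file with a file descriptor; it defaults to sys.stderr.
    (Hypothetical helper for illustration; not part of Ray.)
    """
    target = out if out is not None else sys.stderr
    # Arm a watchdog thread that writes the tracebacks of all Python threads
    # to `target` each time `timeout_s` elapses without being cancelled.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=target)
    try:
        return fn()
    finally:
        # Disarm the watchdog once fn() returns or raises.
        faulthandler.cancel_dump_traceback_later()
```

Wrapping the part of the script that initializes Ray and waits for results in such a call would show which Python line the driver is blocked on when the timeout fires, which may help narrow down where the deadlock occurs.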

cade · February 4, 2023, 8:38pm

I have never tried Ray on a ROCm platform. Could you share a code snippet so I can try to reproduce on a public cloud? What version of ROCm are you running (driver + userspace)? Are you running in a container?

cc @Chen_Shen have we tested on ROCm before?
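
Part of the information requested above can be gathered with a short standard-library script. This is only a sketch under stated assumptions: the function name collect_env_report is hypothetical, the container check is a rough heuristic based on marker paths that Docker and Singularity/Apptainer usually create, and the ROCm driver and userspace versions are not detected here because where they are recorded depends on the installation.

```python
import os
import platform
import sys
from importlib import metadata

# Environment variables commonly used to restrict which GPUs a process can see.
GPU_ENV_VARS = ("HIP_VISIBLE_DEVICES", "ROCR_VISIBLE_DEVICES", "CUDA_VISIBLE_DEVICES")


def collect_env_report():
    """Return a dict of basic environment facts to paste into a bug report."""
    report = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
        # Heuristic only (assumption): these marker paths usually exist
        # inside Docker and Singularity/Apptainer containers respectively.
        "container_guess": os.path.exists("/.dockerenv")
        or os.path.isdir("/.singularity.d"),
    }
    for name in GPU_ENV_VARS:
        report[name] = os.environ.get(name, "<unset>")
    try:
        # Read the installed package version without importing ray itself.
        report["ray"] = metadata.version("ray")
    except metadata.PackageNotFoundError:
        report["ray"] = "<not installed>"
    return report


if __name__ == "__main__":
    for key, value in collect_env_report().items():
        print(f"{key}: {value}")
```

Running it inside the same job environment as the hanging script, rather than on a login node, should give the most relevant values.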
