[Ray core] Ray deadlocks on an AMD HPC system


I am trying to run code that uses Ray to distribute work across several GPUs. On a CUDA platform it works perfectly well. The problem appears when I run it on an AMD HPC system with the ROCm/HIP platform instead of CUDA: the process deadlocks, and there is no log, no response, nothing to debug, so I cannot even diagnose what to try next.

Ray version: 2.2.0
Method to initialize Ray: ray.init(num_of_gpus=8)
Architecture of the HPC node: a Cray machine with four AMD EPYC 7A53 64-core CPUs and four MI250X GPUs, 256 GB RAM and 512 GB video RAM.
Log level: DEBUG, set via environment variable. The last lines of output are:
2023-02-03 17:47:01,242 INFO worker.py:1529 -- Started a local Ray instance. View the dashboard at
[2023-02-03 17:47:01,247 I 1451242 1451242] logging.cc:230: Set ray log level from environment variable RAY_BACKEND_LOG_LEVEL to -1
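
Since the process hangs without printing anything further, one way to get at least some output is to have Python dump its own thread stacks after a timeout. The following is only a sketch using the standard-library `faulthandler` module. It assumes the hang is visible from the Python driver process; it will not show what Ray's native processes are doing. The helper names `arm_hang_watchdog` and `disarm_hang_watchdog` are hypothetical, chosen for illustration.

```python
import faulthandler
import sys


def arm_hang_watchdog(timeout_s=60.0, log_file=None):
    """Dump the stacks of all Python threads every timeout_s seconds.

    log_file must be an open file with a real file descriptor;
    sys.stderr is used when it is None. The process is not killed.
    """
    target = log_file if log_file is not None else sys.stderr
    faulthandler.dump_traceback_later(
        timeout_s, repeat=True, file=target, exit=False
    )
    return target


def disarm_hang_watchdog():
    """Cancel the pending dump once the suspect call has returned."""
    faulthandler.cancel_dump_traceback_later()
```

The idea would be to call `arm_hang_watchdog()` just before the `ray.init(...)` call and `disarm_hang_watchdog()` after it returns. If the call never returns, the periodic tracebacks show which Python frame the driver is blocked in.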

I have never tried Ray on a ROCm platform. Could you share a code snippet so I can try to reproduce on a public cloud? What version of ROCm are you running (driver + userspace)? Are you running in a container?

cc @Chen_Shen have we tested on ROCm before?