I am trying to run code that uses Ray to distribute work across several GPUs. On the CUDA platform it works perfectly well and runs without any problem. The problem appears when the code is run on an AMD HPC ROCm/HIP platform instead of CUDA: the process deadlocks, and there is no log output, no response, nothing to debug. So I cannot even diagnose what to do next.
Ray version: 2.2.0
Method to initialize Ray: ray.init(num_gpus=8)
Architecture of the HPC node: a Cray machine with four 64-core AMD EPYC 7A53 CPUs and four MI250X GPUs, 256 GB of RAM and 512 GB of GPU memory.
Log level: DEBUG, set via the RAY_BACKEND_LOG_LEVEL environment variable. The last lines of output with this setting are:
2023-02-03 17:47:01,242 INFO worker.py:1529 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
[2023-02-03 17:47:01,247 I 1451242 1451242] logging.cc:230: Set ray log level from environment variable RAY_BACKEND_LOG_LEVEL to -1
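
Since the hang produces no further output, one option that might at least show where the driver process is stuck is Python's standard faulthandler module, which can dump the stack of every Python thread after a timeout. The following is only a minimal sketch, assuming the hang occurs in or shortly after ray.init. The helper names (arm_hang_watchdog, disarm_hang_watchdog, init_ray_with_watchdog) and the 60-second timeout are hypothetical choices for illustration, not part of Ray.

```python
import faulthandler
import sys


def arm_hang_watchdog(timeout_s=60.0, stream=sys.stderr):
    """Dump the tracebacks of all Python threads to `stream` every
    `timeout_s` seconds for as long as the process keeps running.

    `stream` must be a real file object (it needs a file descriptor).
    """
    # Also report fatal signals such as SIGSEGV or SIGABRT.
    faulthandler.enable(file=stream)
    # repeat=True keeps dumping, so a long hang shows whether the stack changes.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=stream)


def disarm_hang_watchdog():
    """Stop the periodic dumps once the suspicious section has finished."""
    faulthandler.cancel_dump_traceback_later()


def init_ray_with_watchdog(num_gpus=8, timeout_s=60.0):
    """Hypothetical wrapper: start Ray with the watchdog armed."""
    import ray  # imported here so the helpers above work without Ray installed

    arm_hang_watchdog(timeout_s)
    ray.init(num_gpus=num_gpus)
    disarm_hang_watchdog()
    return ray.cluster_resources()
```

Calling init_ray_with_watchdog() in place of the plain ray.init call would print a traceback to stderr if initialization does not return within the timeout. This only covers the Python stack of the driver process; it shows nothing about Ray's worker processes or about native code inside the ROCm runtime.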