I am writing this post because, since I started using Slurm, I have not been able to use Ray correctly.
Whenever I use the commands:
ray.init
trainer = A3CTrainer(env="my_env") (I have registered my env with tune)
the program crashes with the following message:
core_worker.cc:137: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory
The program works fine on my computer; the problem only appeared when I started using Slurm. I only ask Slurm for one GPU.
Thank you for reading and maybe answering.
Have a great day!
I see further details on the same question on SO, copying here for visibility:
import ray
from ray.rllib.agents.a3c import A3CTrainer
import tensorflow as tf
from MM1c_queue_env import my_env  # my_env is already registered in tune

# shut down any lingering Ray instance, then start a fresh one
ray.shutdown()
ray.init(ignore_reinit_error=True)

trainer = A3CTrainer(env="my_env")
print("success")
To launch the program with Slurm, I use the following script:
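(The actual sbatch script is not reproduced in this post. Purely as an illustration, a single-GPU submission for this kind of job might look something like the sketch below; the job name, output path, and environment setup are hypothetical and cluster-specific.)

#!/bin/bash
#SBATCH --job-name=rllib-a3c        # hypothetical job name
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:1                # the single GPU mentioned above
#SBATCH --output=rllib-a3c.%j.out

# activate whatever Python environment provides ray/rllib (site-specific)
python my_training_script.py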
Just to share that I had the same problem, and for me it was resolved by setting num_cpus=1 (or whatever number of cores I request from Slurm) in ray.init, as per one of the answers on SO.
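In case it helps, here is a minimal sketch of that workaround. It assumes Slurm exports SLURM_CPUS_PER_TASK for the job (it does when --cpus-per-task is set; otherwise the fallback of 1 applies):

import os
import ray

# Match Ray's CPU count to whatever Slurm actually allocated to this job.
# SLURM_CPUS_PER_TASK is only set when --cpus-per-task is used, hence the fallback.
num_cpus = int(os.environ.get("SLURM_CPUS_PER_TASK", "1"))

ray.shutdown()
ray.init(num_cpus=num_cpus, ignore_reinit_error=True)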
@cade Actually, it turns out this still happens for me! But only if I request just a single CPU core from SLURM (i.e. set -n 1 in sbatch). Two or more cores are fine most of the time, even if Ray then uses more worker processes than I requested cores. This happens even if I run Ray in local mode! I am told by our cluster support that -n 1 does nothing other than pin all the processes to a single physical core.
One thing I should add is that I am not running a whole Ray cluster on top of Slurm; I just want to run a single RLlib experiment per Slurm job (but multiple such Slurm jobs in parallel).
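Putting the two observations above together, each per-job script in this kind of setup can be a plain single-experiment run with the num_cpus workaround applied. A sketch (the env import, the default of 2 cores, and the iteration count are placeholders, not the original setup):

import os
import ray
from ray.rllib.agents.a3c import A3CTrainer
from MM1c_queue_env import my_env  # registers "my_env" with tune

# One self-contained Ray instance per Slurm job, sized to that job's allocation.
# Defaulting to 2 here reflects the observation that two or more cores avoid the error.
ray.init(num_cpus=int(os.environ.get("SLURM_CPUS_PER_TASK", "2")),
         ignore_reinit_error=True)

trainer = A3CTrainer(env="my_env")
for i in range(10):  # placeholder number of training iterations
    result = trainer.train()
    print(i, result["episode_reward_mean"])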