I'm writing this post because, since I started using Slurm, I have not been able to use Ray correctly.
Whenever I run the command:
- trainer = A3CTrainer(env="my_env") (I have registered my env with Tune)
the program crashes with the following message:
core_worker.cc:137: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory
The program works fine on my own computer; the problem only appeared once I started running it through Slurm. I only ask Slurm for one GPU.
Thank you for reading, and for any answers.
Have a great day
I see further details on the same question on SO, copying here for visibility:
from ray.rllib.agents.a3c import A3CTrainer
import tensorflow as tf
from MM1c_queue_env import my_env  # my_env is already registered in Tune
trainer = A3CTrainer(env = "my_env")
To launch the program with Slurm, I use the following script:
module load anaconda3/2020.02/gcc-9.2.0
@Pierre_houdouin can you share how you’re starting the Ray worker nodes? For example, are you following Starting the Ray worker nodes in the docs?
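In the meantime, one thing that may be worth trying (a guess, not a confirmed fix for this error) is to call ray.init() explicitly before creating the trainer, telling Ray only about the resources Slurm actually granted rather than letting it detect the whole node. The sketch below assumes Slurm exports SLURM_CPUS_PER_TASK and sets CUDA_VISIBLE_DEVICES to a comma-separated list of device indices for the job; the helper name slurm_ray_init_kwargs is hypothetical and not part of Ray.

```python
import os


def slurm_ray_init_kwargs(env=os.environ):
    """Build keyword arguments for ray.init() from Slurm's environment.

    Hypothetical helper, not part of Ray. Assumes SLURM_CPUS_PER_TASK and
    CUDA_VISIBLE_DEVICES are set by Slurm for this job; any variable that
    is missing is simply skipped so Ray falls back to its own detection.
    """
    kwargs = {}

    cpus = env.get("SLURM_CPUS_PER_TASK")
    if cpus is not None:
        kwargs["num_cpus"] = int(cpus)

    # Count the GPU indices exposed to this job, e.g. "0" or "0,1".
    visible = env.get("CUDA_VISIBLE_DEVICES", "")
    gpus = [d for d in visible.split(",") if d.strip()]
    if gpus:
        kwargs["num_gpus"] = len(gpus)

    return kwargs


if __name__ == "__main__":
    import ray

    # Start Ray with the Slurm-granted resources, then build the trainer
    # as in the original snippet (A3CTrainer(env="my_env")).
    ray.init(**slurm_ray_init_kwargs())
```

If this changes nothing, the output of the Slurm job together with the full submission script would still help narrow the problem down.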