Ray on slurm - Problems with initialization

Hello everyone,

I am writing this post because, since I started using Slurm, I have not been able to use Ray correctly.
Whenever I use these commands:

  • ray.init
  • trainer = A3CTrainer(env = "my_env") (I have registered my env with Tune)

the program crashes with the following message:

core_worker.cc:137: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory

The program works fine on my own computer; the problem only appeared once I started running it through Slurm. I only ask Slurm for one GPU.

Thank you for reading, and for any answers.
Have a great day.

I see further details on the same question on SO; copying them here for visibility:

import ray
from ray.rllib.agents.a3c import A3CTrainer
import tensorflow as tf
from MM1c_queue_env import my_env  # my_env is already registered in tune

# Shut down any Ray instance already running in this process, then start a fresh one.
ray.shutdown()
ray.init(ignore_reinit_error=True)
trainer = A3CTrainer(env="my_env")

print("success")

To launch the program with Slurm, I use the following script:

#!/bin/bash

#SBATCH --job-name=rl_for_insensitive_policies
#SBATCH --time=0:05:00 
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1
#SBATCH --partition=gpu

module load anaconda3/2020.02/gcc-9.2.0
python test.py

@Pierre_houdouin can you share how you’re starting the Ray worker nodes? For example, are you following the "Starting the Ray worker nodes" section in the docs?
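
In the meantime, one thing that may be worth ruling out (an assumption on my part, not a confirmed cause of this error): as far as I know, a bare ray.init() sizes itself from the CPUs and GPUs it detects on the machine, which on a shared Slurm node can be more than the job was actually allocated. Below is a minimal sketch that passes the allocation to ray.init() explicitly. The helper names ray_resources_from_slurm and init_ray_for_slurm are hypothetical, and the sketch assumes your cluster exports SLURM_CPUS_PER_TASK or SLURM_CPUS_ON_NODE, and sets CUDA_VISIBLE_DEVICES as a comma-separated list when --gres=gpu:N is requested.

```python
import os


def ray_resources_from_slurm(environ=None):
    """Build keyword arguments for ray.init() from Slurm environment variables.

    Hypothetical helper: returns an empty dict when no Slurm variables are
    found, so ray.init() falls back to its own auto-detection.
    """
    if environ is None:
        environ = os.environ
    kwargs = {}

    # SLURM_CPUS_PER_TASK is only set when --cpus-per-task is requested,
    # so fall back to SLURM_CPUS_ON_NODE (assumed to be a plain integer).
    cpus = environ.get("SLURM_CPUS_PER_TASK") or environ.get("SLURM_CPUS_ON_NODE")
    if cpus:
        kwargs["num_cpus"] = int(cpus)

    # Assumption: CUDA_VISIBLE_DEVICES lists one entry per allocated GPU.
    gpus = environ.get("CUDA_VISIBLE_DEVICES", "")
    gpu_ids = [g for g in gpus.split(",") if g.strip()]
    if gpu_ids:
        kwargs["num_gpus"] = len(gpu_ids)

    return kwargs


def init_ray_for_slurm():
    """Start Ray with the resources Slurm reports for this job."""
    import ray  # imported lazily so the helper above works without Ray installed

    return ray.init(ignore_reinit_error=True, **ray_resources_from_slurm())
```

In test.py this would mean calling init_ray_for_slurm() in place of ray.init(ignore_reinit_error=True). If the crash persists with explicit num_cpus and num_gpus, resource over-detection is probably not the issue and the raylet logs would be the next place to look.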