Ray on slurm - Problems with initialization

Hello everyone,

I'm writing this post because, since I started using Slurm, I have not been able to use Ray correctly.
Whenever I use the following commands:

  • ray.init()
  • trainer = A3CTrainer(env="my_env") (I have registered my env with tune)

the program crashes with the following message:

core_worker.cc:137: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory

The program works fine on my computer; the problem appeared when I started using Slurm. I only request one GPU from Slurm.

Thank you for reading, and thanks in advance for any answers.
Have a great day

I see further details on the same question on SO, copying here for visibility:

import ray
from ray.rllib.agents.a3c import A3CTrainer
import tensorflow as tf
from MM1c_queue_env import my_env  # my_env is already registered in tune

trainer = A3CTrainer(env="my_env")


To launch the program with Slurm, I use the following batch script:


#!/bin/bash
#SBATCH --job-name=rl_for_insensitive_policies
#SBATCH --time=0:05:00
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1
#SBATCH --partition=gpu

module load anaconda3/2020.02/gcc-9.2.0
python test.py

@Pierre_houdouin can you share how you’re starting the Ray worker nodes? For example, are you following Starting the Ray worker nodes in the docs?

Just to share that I had the same problem; for me it was resolved by setting num_cpus=1 (or whatever number of cores I request from Slurm) in ray.init, as per one of the answers on SO.
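A minimal sketch of that workaround, assuming the standard Slurm environment variable SLURM_CPUS_PER_TASK is set in the job (it is only exported when --cpus-per-task is given, hence the fallback to 1). ray.init(num_cpus=...) is the documented way to tell Ray how many cores it may use instead of letting it autodetect every core on the node:

```python
import os


def slurm_num_cpus(default=1):
    """Number of CPU cores allotted by Slurm, falling back to `default`.

    SLURM_CPUS_PER_TASK is only set when --cpus-per-task is passed to
    sbatch/srun, so we fall back to a conservative default of 1.
    """
    return int(os.environ.get("SLURM_CPUS_PER_TASK", default))


if __name__ == "__main__":
    import ray

    # Tell Ray exactly how many cores Slurm gave us, rather than letting it
    # autodetect all cores on the node (which Slurm may not let us use).
    ray.init(num_cpus=slurm_num_cpus())
```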


@cade Actually, it turns out this still happens for me! But only if I request a single CPU core from Slurm (i.e. set -n 1 in sbatch). Two or more cores are fine most of the time, even when Ray then uses more worker processes than the cores I requested. This happens even if I run Ray in local mode! I am told by our cluster support that -n 1 does nothing other than pin all the processes to a single physical core.

One thing I should add is that I am not running a whole Ray cluster on top of Slurm; I just want to run a single RLlib experiment per Slurm job (but multiple such Slurm jobs in parallel).

Just a note: this may be due to a large number of OpenBLAS threads when using a large number of CPU cores. See the comment with a workaround here: [Ray component: Core] Failed to register worker. Slurm - srun · Issue #30012 · ray-project/ray · GitHub
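A sketch of the kind of workaround that issue describes: cap the BLAS/OpenMP thread pools per process via the standard environment variables (OPENBLAS_NUM_THREADS, OMP_NUM_THREADS, MKL_NUM_THREADS), which must be set before numpy or anything else linking BLAS is imported. The exact variables that matter depend on which BLAS your build links:

```python
import os

# On a node with many cores, every Ray worker process otherwise starts one
# BLAS thread per core, which can exhaust thread/resource limits under Slurm.
# setdefault() lets a value already exported in the sbatch script win.
os.environ.setdefault("OPENBLAS_NUM_THREADS", "1")
os.environ.setdefault("OMP_NUM_THREADS", "1")
os.environ.setdefault("MKL_NUM_THREADS", "1")

# ...only now import numpy / ray, so the libraries see these limits.
```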


Could you share the output of ulimit -a? I wonder if the file descriptor limit is too low.
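For reference, the commands to check this (a sketch; the exact limit Ray needs depends on the workload, since each worker opens sockets and log files):

```shell
# Show all resource limits for the current shell (what was asked for above)
ulimit -a

# The open-file (file descriptor) soft limit specifically; a low value here
# can surface as "Failed to register worker" style errors.
ulimit -n

# If the soft limit is low, it can be raised toward the hard limit before
# launching, e.g. in the sbatch script before `python test.py`:
#   ulimit -n 8192
```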