Ray on slurm - Problems with initialization

Hello everyone,

I'm writing this post because, since I started using Slurm, I have not been able to use Ray correctly.
Whenever I use the following commands:

  • ray.init()
  • trainer = A3CTrainer(env="my_env") (I have registered my env with tune)

the program crashes with the following message:

core_worker.cc:137: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory

The program works fine on my computer; the problem appeared when I started using Slurm. I only request one GPU from Slurm.

Thank you for reading, and thanks in advance for any answers.
Have a great day

I see further details on the same question on SO, copying here for visibility:

import ray
from ray.rllib.agents.a3c import A3CTrainer
import tensorflow as tf
from MM1c_queue_env import my_env  # my_env is already registered in tune

trainer = A3CTrainer(env="my_env")


To launch the program with Slurm, I use the following batch script:


#!/bin/bash
#SBATCH --job-name=rl_for_insensitive_policies
#SBATCH --time=0:05:00
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1
#SBATCH --partition=gpu

module load anaconda3/2020.02/gcc-9.2.0
python test.py

@Pierre_houdouin can you share how you’re starting the Ray worker nodes? For example, are you following Starting the Ray worker nodes in the docs?

Just to share that I had the same problem; for me it was resolved by setting num_cpus=1 (or whatever number of cores I request from Slurm) in ray.init, as per one of the answers on SO.
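A minimal sketch of that workaround, assuming the standard Slurm environment variable SLURM_CPUS_PER_TASK is set in the job (it is only exported when --cpus-per-task is given, hence the fallback to 1). ray.init(num_cpus=...) is the documented way to tell Ray how many cores it may use instead of letting it autodetect every core on the node:

```python
import os


def slurm_num_cpus(default=1):
    """Number of CPU cores allotted by Slurm, falling back to `default`.

    SLURM_CPUS_PER_TASK is only set when --cpus-per-task is passed to
    sbatch/srun, so we fall back to a conservative default of 1.
    """
    return int(os.environ.get("SLURM_CPUS_PER_TASK", default))


if __name__ == "__main__":
    import ray

    # Tell Ray exactly how many cores Slurm gave us, rather than letting it
    # autodetect all cores on the node (which Slurm may not let us use).
    ray.init(num_cpus=slurm_num_cpus())
```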


@cade Actually, it turns out this still happens for me! But only if I request a single CPU core from Slurm (i.e. set -n 1 in sbatch). Two or more cores are fine most of the time, even when Ray then uses more worker processes than the cores I requested. This happens even if I run Ray in local mode! I am told by our cluster support that -n 1 does nothing other than pin all the processes to a single physical core.

One thing I should add is that I am not running a whole Ray cluster on top of Slurm; I just want to run a single RLlib experiment per Slurm job (but multiple such Slurm jobs in parallel).

Just a note: this may be due to a large number of OpenBLAS threads when using a large number of CPU cores. See the comment with a workaround here: [Ray component: Core] Failed to register worker. Slurm - srun · Issue #30012 · ray-project/ray · GitHub
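A sketch of the kind of workaround that issue describes: cap the BLAS/OpenMP thread pools per process via the standard environment variables (OPENBLAS_NUM_THREADS, OMP_NUM_THREADS, MKL_NUM_THREADS), which must be set before numpy or anything else linking BLAS is imported. The exact variables that matter depend on which BLAS your build links:

```python
import os

# On a node with many cores, every Ray worker process otherwise starts one
# BLAS thread per core, which can exhaust thread/resource limits under Slurm.
# setdefault() lets a value already exported in the sbatch script win.
os.environ.setdefault("OPENBLAS_NUM_THREADS", "1")
os.environ.setdefault("OMP_NUM_THREADS", "1")
os.environ.setdefault("MKL_NUM_THREADS", "1")

# ...only now import numpy / ray, so the libraries see these limits.
```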


Could you share the output of ulimit -a? I wonder if the file descriptor limit is too low.
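For reference, the commands to check this (a sketch; the exact limit Ray needs depends on the workload, since each worker opens sockets and log files):

```shell
# Show all resource limits for the current shell (what was asked for above)
ulimit -a

# The open-file (file descriptor) soft limit specifically; a low value here
# can surface as "Failed to register worker" style errors.
ulimit -n

# If the soft limit is low, it can be raised toward the hard limit before
# launching, e.g. in the sbatch script before `python test.py`:
#   ulimit -n 8192
```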