Client can't find Raylet address in Ray > 1.5

Hi Forum,

I’ve been successfully using Ray 1.5 for a little while. After upgrading to Ray 1.6 and 1.7 I receive the following error message when starting a Ray worker:

global_state_accessor.cc:394: This node has an IP address of ****, while we can not found the matched Raylet address. This maybe come from when you connect the Ray cluster with a different IP address or connect a container.

The worker is started from CLI as shown below. I’ve been playing with the --node-ip-address arg, but it seems to make no difference.

ray start \
    --redis-password ${RAY_REDIS_PASSWORD} \
    --address ****:${RAY_PORT} \
    --block
# also tried with: --node-ip-address $(hostname -i)

Both the head (server) and the worker (client) run from within a Singularity container. I believe this shouldn’t matter; it certainly didn’t with Ray 1.5. I’d be grateful for any pointers on how to debug this.

Thanks.

I have the same error in Ray 2.1 and would be interested in more information about that.

Same error here with Ray 2.1. Advice from older posts didn’t help (i.e. adding a sleep delay, reducing the number of resources).

I’m running:

# number of nodes other than the head node
worker_num=$((SLURM_JOB_NUM_NODES - 1))
for ((i = 1; i <= worker_num; i++)); do
    node_i=${nodes_array[$i]}
    echo "Starting WORKER $i at $node_i"
    this_node_ip=$(srun --nodes=1 --ntasks=1 -w "$node_i" hostname --ip-address)
    srun --nodes=1 --ntasks=1 -w "$node_i" \
        ray start --address "$ip_head" \
        --node-ip-address="$this_node_ip" \
        --num-cpus "${SLURM_CPUS_PER_TASK}" --block &
    sleep 30
done

and getting:

[2022-11-30 17:53:41,136 I 64548 64548] global_state_accessor.cc:357: This node has an IP address of xx.xx.xx.xx, while we can not find the matched Raylet address. This maybe come from when you connect the Ray cluster with a different IP address or connect a container.
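The log line says the node's IP has no matching Raylet address. A rough way to see which addresses the cluster actually knows about is to compare the IP against the output of `ray.nodes()` from a process that can connect (e.g. on the head node). This is only a sketch: `matching_raylets` and `report` are names made up here, and it assumes the `NodeManagerAddress` and `Alive` keys that `ray.nodes()` returns. If the cluster uses a Redis password, pass it through `init_kwargs`.

```python
def matching_raylets(nodes, ip):
    """Return the alive nodes whose raylet address equals `ip`.

    `nodes` is a list of dicts shaped like the output of `ray.nodes()`;
    only the "NodeManagerAddress" and "Alive" keys are read.
    """
    return [
        n for n in nodes
        if n.get("Alive") and n.get("NodeManagerAddress") == ip
    ]


def report(ip, **init_kwargs):
    """Print the registered raylet addresses and whether `ip` is among them."""
    # Lazy import: this part needs Ray installed and a reachable cluster.
    import ray

    ray.init(address="auto", **init_kwargs)
    nodes = ray.nodes()
    registered = sorted(
        {n["NodeManagerAddress"] for n in nodes if n.get("Alive")}
    )
    print("registered raylet addresses:", registered)
    print("matches for", ip, ":", len(matching_raylets(nodes, ip)))
```

If the IP printed in the log is not in the registered list, the worker is announcing itself under a different address than the one the cluster recorded.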

OK, solved by setting up the Redis password properly. My bad: I had not added it to the Python script before. The relevant sections of the sbatch script (e.g. for a worker node) are:

#SBATCH --cpus-per-task=40
#SBATCH --nodes=4
#SBATCH --exclusive
#SBATCH --tasks-per-node=1

and

redis_password=$(uuidgen)
export redis_password

this_node_ip=$(srun --nodes=1 --ntasks=1 -w "$node_i" hostname --ip-address)
srun --nodes=1 --ntasks=1 -w "$node_i" \
    ray start --address "$ip_head" \
    --redis-password="$redis_password" \
    --node-ip-address="$this_node_ip" \
    --num-cpus "${SLURM_CPUS_PER_TASK}" --block &
sleep 10

See slurm-basic.sh. And in Python:

import os
import ray

# The password must be the one exported as redis_password in the sbatch script.
ray.init(address="auto", _redis_password=os.environ["redis_password"])
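Since the fix depends on `redis_password` being exported before Python starts, a small guard can make a missing variable easier to spot than a bare `KeyError`. A minimal sketch; `redis_password_from_env` and `connect` are hypothetical helper names, not part of Ray:

```python
import os


def redis_password_from_env(env=None, var="redis_password"):
    """Return the Redis password exported by the sbatch script.

    Raises a descriptive error when the variable is missing or empty.
    """
    env = os.environ if env is None else env
    password = env.get(var, "")
    if not password:
        raise RuntimeError(
            f"environment variable {var!r} is not set; "
            "export it in the sbatch script before starting Python"
        )
    return password


def connect():
    """Connect to the running cluster using the exported password."""
    # Lazy import so the helper above works without Ray installed.
    import ray

    return ray.init(address="auto", _redis_password=redis_password_from_env())
```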

See also: Ray on SLURM, unmatched Raylet address - #3 by hank7v
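One more thing that may be worth checking: the scripts above pass the output of `hostname --ip-address` straight to `--node-ip-address`, and on hosts with several interfaces that command can print more than one address. The sketch below picks a single IPv4 address from such output. `pick_ipv4` is a made-up helper, and taking the first non-loopback IPv4 address is an assumption that may not be the right choice on every cluster.

```python
import ipaddress


def pick_ipv4(hostname_output):
    """Pick the first non-loopback IPv4 address from `hostname --ip-address` output."""
    for token in hostname_output.split():
        try:
            addr = ipaddress.ip_address(token)
        except ValueError:
            continue  # token is not a parseable IP address
        if addr.version == 4 and not addr.is_loopback:
            return str(addr)
    raise ValueError(f"no usable IPv4 address in {hostname_output!r}")
```

For example, `pick_ipv4("fe80::1 10.1.2.3")` skips the IPv6 entry and returns the IPv4 one.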