Ray cluster starts with errors but still works

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

I start a Ray cluster using a Slurm script. Some errors are printed when the cluster starts, but my program still runs. The error output on one node is shown below:
(raylet, ip=10.6.12.47) Traceback (most recent call last):
(raylet, ip=10.6.12.47)   File "/public/home/lifei/xinzk/envs/ray_base/site-packages/ray/dashboard/agent.py", line 391, in <module>
(raylet, ip=10.6.12.47)     loop.run_until_complete(agent.run())
(raylet, ip=10.6.12.47)   File "/public/software/apps/AI/apps/DeepLearning/PyTorch/cccp/pytorch_1.8-rocm_4.0.1-fastmoe/lib/python3.6/asyncio/base_events.py", line 484, in run_until_complete
(raylet, ip=10.6.12.47)     return future.result()
(raylet, ip=10.6.12.47)   File "/public/home/lifei/xinzk/envs/ray_base/site-packages/ray/dashboard/agent.py", line 178, in run
(raylet, ip=10.6.12.47)     modules = self._load_modules()
(raylet, ip=10.6.12.47)   File "/public/home/lifei/xinzk/envs/ray_base/site-packages/ray/dashboard/agent.py", line 120, in _load_modules
(raylet, ip=10.6.12.47)     c = cls(self)
(raylet, ip=10.6.12.47)   File "/public/home/lifei/xinzk/envs/ray_base/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 163, in __init__
(raylet, ip=10.6.12.47)     dashboard_agent.metrics_export_port)
(raylet, ip=10.6.12.47)   File "/public/home/lifei/xinzk/envs/ray_base/site-packages/ray/_private/metrics_agent.py", line 79, in __init__
(raylet, ip=10.6.12.47)     address=metrics_export_address)))
(raylet, ip=10.6.12.47)   File "/public/home/lifei/xinzk/envs/ray_base/site-packages/ray/_private/prometheus_exporter.py", line 333, in new_stats_exporter
(raylet, ip=10.6.12.47)     options=option, gatherer=option.registry, collector=collector)
(raylet, ip=10.6.12.47)   File "/public/home/lifei/xinzk/envs/ray_base/site-packages/ray/_private/prometheus_exporter.py", line 265, in __init__
(raylet, ip=10.6.12.47)     self.serve_http()
(raylet, ip=10.6.12.47)   File "/public/home/lifei/xinzk/envs/ray_base/site-packages/ray/_private/prometheus_exporter.py", line 320, in serve_http
(raylet, ip=10.6.12.47)     port=self.options.port, addr=str(self.options.address))
(raylet, ip=10.6.12.47)   File "/public/home/lifei/xinzk/envs/ray_base/site-packages/prometheus_client/exposition.py", line 168, in start_wsgi_server
(raylet, ip=10.6.12.47)     TmpServer.address_family, addr = _get_best_family(addr, port)
(raylet, ip=10.6.12.47)   File "/public/home/lifei/xinzk/envs/ray_base/site-packages/prometheus_client/exposition.py", line 157, in _get_best_family
(raylet, ip=10.6.12.47)     infos = socket.getaddrinfo(address, port)
(raylet, ip=10.6.12.47)   File "/public/software/apps/AI/apps/DeepLearning/PyTorch/cccp/pytorch_1.8-rocm_4.0.1-fastmoe/lib/python3.6/socket.py", line 745, in getaddrinfo
(raylet, ip=10.6.12.47)     for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
(raylet, ip=10.6.12.47) socket.gaierror: [Errno -2] Name or service not known
(raylet, ip=10.6.12.47)
(raylet, ip=10.6.12.47) During handling of the above exception, another exception occurred:
(raylet, ip=10.6.12.47)
(raylet, ip=10.6.12.47) Traceback (most recent call last):
(raylet, ip=10.6.12.47)   File "/public/home/lifei/xinzk/envs/ray_base/site-packages/ray/dashboard/agent.py", line 407, in <module>
(raylet, ip=10.6.12.47)     gcs_publisher = GcsPublisher(args.gcs_address)
(raylet, ip=10.6.12.47) TypeError: __init__() takes 1 positional argument but 2 were given

My Slurm script is as follows:

#!/bin/bash
#SBATCH -p normal
#SBATCH --gres=dcu:4
#SBATCH --exclusive

module unload compiler/rocm/2.9
module load apps/ray/hpcx-2.4.1-gcc-7.3.1-rocm4.0.1

redis_password=$(uuidgen)
export redis_password

nodes=$(scontrol show hostnames $SLURM_JOB_NODELIST) # Getting the node names
nodes_array=( $nodes )

node_1=${nodes_array[0]} 
ip=$(srun --nodes=1 --ntasks=1 -w $node_1 hostname --ip-address) # making redis-address
port=6379
ip_head=$ip:$port
export ip_head
echo "IP Head: $ip_head"

echo "STARTING HEAD at $node_1"
srun --nodes=1 --ntasks=1 -w $node_1 start-head.sh $ip $redis_password &
sleep 30
worker_num=$(($SLURM_JOB_NUM_NODES - 1)) #number of nodes other than the head node
for ((  i=1; i <= ${worker_num}; i++ ))
do
  node_i=${nodes_array[$i]}
  echo "STARTING WORKER $i at $node_i"
  srun --nodes=1 --ntasks=1 -w $node_i start-worker.sh $ip_head $redis_password &
  sleep 5
done

which python3
python3 -u ps.py -c $1 -b 16

start-head.sh:

#!/bin/bash
echo "starting ray head node"
# Launch the head node
ray start --head --node-ip-address=$1 --port=6379 --redis-password=$2 --num-gpus=4
sleep infinity

start-worker.sh:

#!/bin/bash
echo "starting ray worker node"
ray start --address $1 --redis-password=$2 --num-gpus=4
sleep infinity

Is there something wrong with how I run the script?

@Alex can you please help?

@xyzyx it looks like your script is trying to start a head node with an external redis server. Is that intentional? (If so, how are you verifying redis is healthy?)

If not, you may want your head start command to omit the Redis, address, and port options:

ray start --head --node-ip-address=$1 --num-gpus=4

Thanks! :smiley:
I actually do not want to start with an external Redis server. So I do not need to specify --redis-password?

Yep, that's correct. In fact, Ray no longer has a hard dependency on Redis and won't use Redis by default now.
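
For reference, the two start scripts from the original post could be reduced to something like the sketch below once the Redis password is dropped. The flags are the ones already used in this thread; keeping --port=6379 so that $ip_head stays valid is an assumption, and RAY_BIN is a hypothetical variable added only so the commands can be dry-run with echo.

```shell
#!/bin/bash
# Sketch only. RAY_BIN is a hypothetical override for dry runs
# (for example RAY_BIN=echo); it is not a Ray setting.
RAY_BIN="${RAY_BIN:-ray}"

start_head() {
  # $1: IP address of the head node
  # --block keeps the process in the foreground, replacing "sleep infinity".
  "$RAY_BIN" start --head --node-ip-address="$1" --port=6379 --num-gpus=4 --block
}

start_worker() {
  # $1: head address in ip:port form (the $ip_head value from the Slurm script)
  "$RAY_BIN" start --address="$1" --num-gpus=4 --block
}
```

Running with RAY_BIN=echo prints the arguments instead of starting Ray, which is a cheap way to confirm what each node would execute.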

I'm no longer passing any Redis options, but the error is still there. The command I run is ray start --block --address=$ip_head

Do you mind verifying the version of Ray that you're using (on both the head and worker nodes)?
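
One way to check this from inside the Slurm job is sketched below. It assumes the same module environment is loaded on every node and that SLURM_JOB_NODELIST is set; the helper names all_same and ray_versions are made up for illustration.

```shell
#!/bin/bash
# Succeeds only if stdin has at least one line and all lines are identical.
all_same() {
  [ "$(sort -u | wc -l | tr -d ' ')" -eq 1 ]
}

# Hypothetical helper: print `ray --version` once per node in the allocation.
# Assumes it runs inside a Slurm job where SLURM_JOB_NODELIST is set.
ray_versions() {
  local node
  for node in $(scontrol show hostnames "$SLURM_JOB_NODELIST"); do
    srun --nodes=1 --ntasks=1 -w "$node" ray --version
  done
}

# Usage inside the job script:
#   ray_versions | all_same || echo "Ray versions differ across nodes" >&2
```

If the versions differ, the ray executable found on PATH on some node is probably not the one from the intended environment.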

heads up @mwtian (who knows more than me)

Same issue on k8s cluster. Have you solved this problem?

What Ray version are you on? And can you share more details about your setup process?

@GoingMyWay please do provide a detailed reproduction on K8s if possible.

I used 1.11.0. :smiley:
I use Ray on a Slurm cluster, and I start it up using a modified script from here.

Re: slurm @tupui might be able to help.

I did not observe such an issue on my cluster. @xyzyx, you say that your program is running, but since you have an exception, is it running in parallel on all nodes or just on the head node? Also, could you try using the latest version of Ray?

My program runs fine but outputs these error messages. It runs in parallel on all nodes.
I will try the latest version of Ray later.
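
The first traceback in the original post ends in socket.getaddrinfo failing with "Name or service not known", which suggests that some address handed to the dashboard agent could not be resolved on that node. Which address that is cannot be told from the log, so the sketch below is only a generic name-resolution check, assuming getent is available on the compute nodes; resolves and check_names are hypothetical helper names.

```shell
#!/bin/bash
# Succeeds if the given name resolves through the system resolver (NSS).
resolves() {
  getent hosts "$1" > /dev/null 2>&1
}

# Report each name that fails to resolve; returns non-zero if any failed.
check_names() {
  local name failed=0
  for name in "$@"; do
    if ! resolves "$name"; then
      echo "cannot resolve: $name" >&2
      failed=1
    fi
  done
  return "$failed"
}

# Usage on a node, for example via srun:
#   check_names "$(hostname)" localhost
```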

Hi @Dmitri, please see this comment: Ray k8s cluster, cannot run new task when previous task failed - #6 by GoingMyWay