Start Ray cluster with error but working

xyzyx · April 22, 2022, 1:41am

How severe does this issue affect your experience of using Ray?

Medium: It contributes to significant difficulty to complete my task, but I can work around it.
I start a Ray cluster using a Slurm script. Some errors appear when the cluster starts, but my program still runs. The error output on one node is shown below:

(raylet, ip=10.6.12.47) Traceback (most recent call last):
(raylet, ip=10.6.12.47)   File "/public/home/lifei/xinzk/envs/ray_base/site-packages/ray/dashboard/agent.py", line 391, in <module>
(raylet, ip=10.6.12.47)     loop.run_until_complete(agent.run())
(raylet, ip=10.6.12.47)   File "/public/software/apps/AI/apps/DeepLearning/PyTorch/cccp/pytorch_1.8-rocm_4.0.1-fastmoe/lib/python3.6/asyncio/base_events.py", line 484, in run_until_complete
(raylet, ip=10.6.12.47)     return future.result()
(raylet, ip=10.6.12.47)   File "/public/home/lifei/xinzk/envs/ray_base/site-packages/ray/dashboard/agent.py", line 178, in run
(raylet, ip=10.6.12.47)     modules = self._load_modules()
(raylet, ip=10.6.12.47)   File "/public/home/lifei/xinzk/envs/ray_base/site-packages/ray/dashboard/agent.py", line 120, in _load_modules
(raylet, ip=10.6.12.47)     c = cls(self)
(raylet, ip=10.6.12.47)   File "/public/home/lifei/xinzk/envs/ray_base/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 163, in __init__
(raylet, ip=10.6.12.47)     dashboard_agent.metrics_export_port)
(raylet, ip=10.6.12.47)   File "/public/home/lifei/xinzk/envs/ray_base/site-packages/ray/_private/metrics_agent.py", line 79, in __init__
(raylet, ip=10.6.12.47)     address=metrics_export_address)))
(raylet, ip=10.6.12.47)   File "/public/home/lifei/xinzk/envs/ray_base/site-packages/ray/_private/prometheus_exporter.py", line 333, in new_stats_exporter
(raylet, ip=10.6.12.47)     options=option, gatherer=option.registry, collector=collector)
(raylet, ip=10.6.12.47)   File "/public/home/lifei/xinzk/envs/ray_base/site-packages/ray/_private/prometheus_exporter.py", line 265, in __init__
(raylet, ip=10.6.12.47)     self.serve_http()
(raylet, ip=10.6.12.47)   File "/public/home/lifei/xinzk/envs/ray_base/site-packages/ray/_private/prometheus_exporter.py", line 320, in serve_http
(raylet, ip=10.6.12.47)     port=self.options.port, addr=str(self.options.address))
(raylet, ip=10.6.12.47)   File "/public/home/lifei/xinzk/envs/ray_base/site-packages/prometheus_client/exposition.py", line 168, in start_wsgi_server
(raylet, ip=10.6.12.47)     TmpServer.address_family, addr = _get_best_family(addr, port)
(raylet, ip=10.6.12.47)   File "/public/home/lifei/xinzk/envs/ray_base/site-packages/prometheus_client/exposition.py", line 157, in _get_best_family
(raylet, ip=10.6.12.47)     infos = socket.getaddrinfo(address, port)
(raylet, ip=10.6.12.47)   File "/public/software/apps/AI/apps/DeepLearning/PyTorch/cccp/pytorch_1.8-rocm_4.0.1-fastmoe/lib/python3.6/socket.py", line 745, in getaddrinfo
(raylet, ip=10.6.12.47)     for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
(raylet, ip=10.6.12.47) socket.gaierror: [Errno -2] Name or service not known
(raylet, ip=10.6.12.47)
(raylet, ip=10.6.12.47) During handling of the above exception, another exception occurred:
(raylet, ip=10.6.12.47)
(raylet, ip=10.6.12.47) Traceback (most recent call last):
(raylet, ip=10.6.12.47)   File "/public/home/lifei/xinzk/envs/ray_base/site-packages/ray/dashboard/agent.py", line 407, in <module>
(raylet, ip=10.6.12.47)     gcs_publisher = GcsPublisher(args.gcs_address)
(raylet, ip=10.6.12.47) TypeError: __init__() takes 1 positional argument but 2 were given

My Slurm script is as follows:

#!/bin/bash
#SBATCH -p normal
#SBATCH --gres=dcu:4
#SBATCH --exclusive

module unload compiler/rocm/2.9
module load apps/ray/hpcx-2.4.1-gcc-7.3.1-rocm4.0.1

redis_password=$(uuidgen)
export redis_password

nodes=$(scontrol show hostnames $SLURM_JOB_NODELIST) # Getting the node names
nodes_array=( $nodes )

node_1=${nodes_array[0]} 
ip=$(srun --nodes=1 --ntasks=1 -w $node_1 hostname --ip-address) # making redis-address
port=6379
ip_head=$ip:$port
export ip_head
echo "IP Head: $ip_head"

echo "STARTING HEAD at $node_1"
srun --nodes=1 --ntasks=1 -w $node_1 start-head.sh $ip $redis_password &
sleep 30
worker_num=$(($SLURM_JOB_NUM_NODES - 1)) #number of nodes other than the head node
for ((  i=1; i <= ${worker_num}; i++ ))
do
  node_i=${nodes_array[$i]}
  echo "STARTING WORKER $i at $node_i"
  srun --nodes=1 --ntasks=1 -w $node_i start-worker.sh $ip_head $redis_password &
  sleep 5
done

which python3
python3 -u ps.py -c $1 -b 16

start-head.sh:

#!/bin/bash
echo "starting ray head node"
# Launch the head node
ray start --head --node-ip-address=$1 --port=6379 --redis-password=$2 --num-gpus=4
sleep infinity

start-worker.sh:

#!/bin/bash
echo "starting ray worker node"
ray start --address $1 --redis-password=$2 --num-gpus=4
sleep infinity

Is there something wrong with the way I run the script?
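
The first traceback above ends in socket.getaddrinfo(address, port) raising gaierror, so one thing worth checking on each node is whether the address handed to Ray resolves at all. Below is a minimal standard-library sketch of that check. The helper name can_resolve and the sample hosts are illustrative assumptions and are not part of Ray.

```python
import socket


def can_resolve(host, port=0):
    """Return True if socket.getaddrinfo can resolve host, else False."""
    try:
        socket.getaddrinfo(host, port)
        return True
    except socket.gaierror:
        # Same exception type as in the traceback ("Name or service not known").
        return False


if __name__ == "__main__":
    # Hypothetical inputs: substitute the value passed to --node-ip-address.
    for host in ("127.0.0.1", "localhost"):
        status = "resolves" if can_resolve(host) else "does NOT resolve"
        print(host, status)
```

Running this on the node that printed the error, with the exact string the Slurm script passes along, shows whether name resolution is the failing step. One possible cause, stated only as a guess: hostname --ip-address can print more than one address separated by spaces, in which case the value forwarded to the start scripts may not be a single resolvable address.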

Ameer_Haj_Ali · April 25, 2022, 11:45am

@Alex can you please help?

Alex · April 25, 2022, 2:27pm

@xyzyx it looks like your script is trying to start a head node with an external Redis server. Is that intentional? (If so, how are you verifying that Redis is healthy?)

If not, you may want to remove the Redis, address, and port options from your head start command:

ray start --head --node-ip-address=$1 --num-gpus=4

xyzyx · April 26, 2022, 2:18am

Thanks!
I actually do not want to start with an external Redis server. So I do not need to specify --redis-password?

Alex · April 27, 2022, 3:00pm

Yep, that’s correct. In fact, Ray no longer has a hard dependency on Redis and won’t use Redis by default now.

xyzyx · April 28, 2022, 4:38am

I no longer mention Redis, but the error is still there. The command I run is: ray start --block --address=$ip_head

Alex · April 29, 2022, 3:45pm

Do you mind verifying the version of Ray that you’re using (on both the head and worker nodes)?
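
One way to run this check, sketched under assumptions: collect the Ray version each node reports and group the nodes by version, so any node that differs stands out. The function names and the sample data are hypothetical. The collector assumes a running cluster reachable via ray.init(address="auto") and relies on the per-node "node:<ip>" resource that Ray attaches to each node.

```python
from collections import defaultdict


def group_by_version(versions):
    """Map each Ray version string to the sorted list of node IPs reporting it."""
    groups = defaultdict(list)
    for ip, version in versions.items():
        groups[version].append(ip)
    return {version: sorted(ips) for version, ips in groups.items()}


def collect_versions():
    """Ask every alive node for its ray.__version__ (needs a running cluster)."""
    import ray  # imported lazily so group_by_version works without Ray

    ray.init(address="auto")

    @ray.remote(num_cpus=0)
    def report():
        return ray.__version__

    versions = {}
    for node in ray.nodes():
        if node["Alive"]:
            ip = node["NodeManagerAddress"]
            # Requesting a sliver of the "node:<ip>" resource pins the task
            # to that specific node.
            ref = report.options(resources={f"node:{ip}": 0.01}).remote()
            versions[ip] = ray.get(ref)
    return versions


if __name__ == "__main__":
    # Made-up sample data; on a live cluster use collect_versions() instead.
    sample = {"10.6.12.46": "1.11.0", "10.6.12.47": "1.11.0"}
    print(group_by_version(sample))
```

If the result has more than one key, the nodes are not all running the same Ray installation. A node whose version is too different may fail to join the cluster at all, in which case it will simply be missing from the output.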

Alex · April 29, 2022, 3:46pm

heads up @mwtian (who knows more than me)

GoingMyWay · June 26, 2022, 8:42am

Same issue on a k8s cluster. Have you solved this problem?

ckw017 · June 27, 2022, 9:46pm

What Ray version are you on? And can you share more details about your setup process?

Dmitri · June 28, 2022, 12:47am

@GoingMyWay please do provide a detailed reproduction on K8s if possible.

xyzyx · June 28, 2022, 12:55am

I used 1.11.0.
I use Ray on a Slurm cluster and start it up using a modified script from here.

Dmitri · June 28, 2022, 1:43am

Re: slurm @tupui might be able to help.

tupui · June 28, 2022, 1:44pm

I did not observe such an issue on my cluster. @xyzyx you are saying that your program is running, but since you have an exception, is it running in parallel on all nodes or just on the head node? Also, could you try using the latest version of Ray?
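
To answer the question of whether work is spread across nodes, a small probe can count how many tasks landed on each host. This is only a sketch and not a method anyone in this thread used. The names summarize_hosts and probe_cluster are hypothetical, and the probe assumes a running cluster reachable via ray.init(address="auto").

```python
import socket
import time
from collections import Counter


def summarize_hosts(hostnames):
    """Count how many tasks ran on each host."""
    return dict(Counter(hostnames))


def probe_cluster(num_tasks=100):
    """Launch short tasks and report which hosts executed them."""
    import ray  # imported lazily so summarize_hosts works without Ray

    ray.init(address="auto")

    @ray.remote
    def where_am_i():
        time.sleep(0.1)  # brief pause so tasks are not all served by one worker
        return socket.gethostname()

    refs = [where_am_i.remote() for _ in range(num_tasks)]
    return summarize_hosts(ray.get(refs))
```

Calling probe_cluster() from the driver and seeing only the head node's hostname in the result would suggest that the workers never joined. Seeing several hostnames suggests the cluster is usable despite the dashboard agent errors.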

xyzyx · June 29, 2022, 12:49am

My program runs fine but outputs these error messages. It runs in parallel on all nodes.
I will try the latest version of Ray later.

GoingMyWay · July 4, 2022, 10:47am

Hi @Dmitri, please see this comment: Ray k8s cluster, cannot run new task when previous task failed - #6 by GoingMyWay
