How severely does this issue affect your experience of using Ray?
- Medium: It contributes to significant difficulty to complete my task, but I can work around it.
I start a Ray cluster using a Slurm script. Some errors appear when the cluster starts, but my program still runs. The error output from one node is shown below:
(raylet, ip=10.6.12.47) Traceback (most recent call last):
(raylet, ip=10.6.12.47)   File "/public/home/lifei/xinzk/envs/ray_base/site-packages/ray/dashboard/agent.py", line 391, in <module>
(raylet, ip=10.6.12.47)     loop.run_until_complete(agent.run())
(raylet, ip=10.6.12.47)   File "/public/software/apps/AI/apps/DeepLearning/PyTorch/cccp/pytorch_1.8-rocm_4.0.1-fastmoe/lib/python3.6/asyncio/base_events.py", line 484, in run_until_complete
(raylet, ip=10.6.12.47)     return future.result()
(raylet, ip=10.6.12.47)   File "/public/home/lifei/xinzk/envs/ray_base/site-packages/ray/dashboard/agent.py", line 178, in run
(raylet, ip=10.6.12.47)     modules = self._load_modules()
(raylet, ip=10.6.12.47)   File "/public/home/lifei/xinzk/envs/ray_base/site-packages/ray/dashboard/agent.py", line 120, in _load_modules
(raylet, ip=10.6.12.47)     c = cls(self)
(raylet, ip=10.6.12.47)   File "/public/home/lifei/xinzk/envs/ray_base/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 163, in __init__
(raylet, ip=10.6.12.47)     dashboard_agent.metrics_export_port)
(raylet, ip=10.6.12.47)   File "/public/home/lifei/xinzk/envs/ray_base/site-packages/ray/_private/metrics_agent.py", line 79, in __init__
(raylet, ip=10.6.12.47)     address=metrics_export_address)))
(raylet, ip=10.6.12.47)   File "/public/home/lifei/xinzk/envs/ray_base/site-packages/ray/_private/prometheus_exporter.py", line 333, in new_stats_exporter
(raylet, ip=10.6.12.47)     options=option, gatherer=option.registry, collector=collector)
(raylet, ip=10.6.12.47)   File "/public/home/lifei/xinzk/envs/ray_base/site-packages/ray/_private/prometheus_exporter.py", line 265, in __init__
(raylet, ip=10.6.12.47)     self.serve_http()
(raylet, ip=10.6.12.47)   File "/public/home/lifei/xinzk/envs/ray_base/site-packages/ray/_private/prometheus_exporter.py", line 320, in serve_http
(raylet, ip=10.6.12.47)     port=self.options.port, addr=str(self.options.address))
(raylet, ip=10.6.12.47)   File "/public/home/lifei/xinzk/envs/ray_base/site-packages/prometheus_client/exposition.py", line 168, in start_wsgi_server
(raylet, ip=10.6.12.47)     TmpServer.address_family, addr = _get_best_family(addr, port)
(raylet, ip=10.6.12.47)   File "/public/home/lifei/xinzk/envs/ray_base/site-packages/prometheus_client/exposition.py", line 157, in _get_best_family
(raylet, ip=10.6.12.47)     infos = socket.getaddrinfo(address, port)
(raylet, ip=10.6.12.47)   File "/public/software/apps/AI/apps/DeepLearning/PyTorch/cccp/pytorch_1.8-rocm_4.0.1-fastmoe/lib/python3.6/socket.py", line 745, in getaddrinfo
(raylet, ip=10.6.12.47)     for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
(raylet, ip=10.6.12.47) socket.gaierror: [Errno -2] Name or service not known
(raylet, ip=10.6.12.47)
(raylet, ip=10.6.12.47) During handling of the above exception, another exception occurred:
(raylet, ip=10.6.12.47)
(raylet, ip=10.6.12.47) Traceback (most recent call last):
(raylet, ip=10.6.12.47)   File "/public/home/lifei/xinzk/envs/ray_base/site-packages/ray/dashboard/agent.py", line 407, in <module>
(raylet, ip=10.6.12.47)     gcs_publisher = GcsPublisher(args.gcs_address)
(raylet, ip=10.6.12.47) TypeError: __init__() takes 1 positional argument but 2 were given
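The first traceback ends in `socket.getaddrinfo(address, port)` raising `socket.gaierror`, so one way to narrow this down might be to repeat that call by hand on the affected node. Below is a minimal sketch of such a check. The helper name `can_resolve`, the port `8080`, and the choice of the node's own hostname as input are my assumptions for illustration; I do not know which address the metrics exporter actually received.

```python
import socket


def can_resolve(address, port):
    """Return True if socket.getaddrinfo can resolve (address, port)."""
    try:
        socket.getaddrinfo(address, port)
        return True
    except socket.gaierror as exc:
        # Same exception type as in the traceback above.
        print("cannot resolve {!r}: {}".format(address, exc))
        return False


if __name__ == "__main__":
    # Hypothetical inputs: this node's own hostname and an arbitrary port.
    # They stand in for whatever address the exporter was given.
    hostname = socket.gethostname()
    print(hostname, can_resolve(hostname, 8080))
```

If this reports that the hostname cannot be resolved on 10.6.12.47 but works on other nodes, that would at least suggest a name-resolution problem on that node and not a problem in the scripts themselves.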
My Slurm script is as follows:
#!/bin/bash
#SBATCH -p normal
#SBATCH --gres=dcu:4
#SBATCH --exclusive
module unload compiler/rocm/2.9
module load apps/ray/hpcx-2.4.1-gcc-7.3.1-rocm4.0.1
redis_password=$(uuidgen)
export redis_password
nodes=$(scontrol show hostnames $SLURM_JOB_NODELIST) # Getting the node names
nodes_array=( $nodes )
node_1=${nodes_array[0]}
ip=$(srun --nodes=1 --ntasks=1 -w $node_1 hostname --ip-address) # making redis-address
port=6379
ip_head=$ip:$port
export ip_head
echo "IP Head: $ip_head"
echo "STARTING HEAD at $node_1"
srun --nodes=1 --ntasks=1 -w $node_1 start-head.sh $ip $redis_password &
sleep 30
worker_num=$(($SLURM_JOB_NUM_NODES - 1)) #number of nodes other than the head node
for (( i=1; i <= ${worker_num}; i++ ))
do
    node_i=${nodes_array[$i]}
    echo "STARTING WORKER $i at $node_i"
    srun --nodes=1 --ntasks=1 -w $node_i start-worker.sh $ip_head $redis_password &
    sleep 5
done
which python3
python3 -u ps.py -c $1 -b 16
start-head.sh:
#!/bin/bash
echo "starting ray head node"
# Launch the head node
ray start --head --node-ip-address=$1 --port=6379 --redis-password=$2 --num-gpus=4
sleep infinity
start-worker.sh:
#!/bin/bash
echo "starting ray worker node"
ray start --address $1 --redis-password=$2 --num-gpus=4
sleep infinity
Is there something wrong with the way I start the cluster in these scripts?