I have been using Ray 1.0.1 to run a distributed Python script on a SLURM system for the past few months. Recently I started having issues with Ray workers unexpectedly crashing and killing my job, so I upgraded to the latest version of Ray (1.6.0 at the time of writing).
Now when I run my job it intermittently fails with the following error:
2021-08-27 12:20:02,445 WARNING worker.py:1215 -- The autoscaler failed with the following error:
Traceback (most recent call last):
  File "/lib/python3.7/site-packages/ray/autoscaler/_private/monitor.py", line 324, in run
    self._run()
  File "/lib/python3.7/site-packages/ray/autoscaler/_private/monitor.py", line 214, in _run
    self.update_load_metrics()
  File "/lib/python3.7/site-packages/ray/autoscaler/_private/monitor.py", line 177, in update_load_metrics
    request, timeout=4)
  File "/lib/python3.7/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/lib/python3.7/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "failed to connect to all addresses"
    debug_error_string = "{"created":"@1630081202.423352369","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":5419,"referenced_errors":[{"created":"@1630081202.423350314","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":397,"grpc_status":14}]}"
>
I'm a little confused by this error because it seems like Ray is crashing while trying to autoscale the number of workers, but since I am using SLURM, the number of worker nodes should be fixed. Here is the script I use to set up my Ray cluster on SLURM (drawn from the Ray documentation here):
#!/bin/bash
# shellcheck disable=SC2206
# THIS FILE IS GENERATED BY AUTOMATION SCRIPT! PLEASE REFER TO ORIGINAL SCRIPT!
# THIS FILE IS MODIFIED AUTOMATICALLY FROM TEMPLATE AND SHOULD BE RUNNABLE!
#SBATCH --partition=shared
#SBATCH --job-name=test_0426-1444
#SBATCH --output=test_%j.log

# This script works for any number of nodes, Ray will find and manage all resources
#SBATCH --nodes=2

# Give all resources to a single Ray task, ray can manage the resources internally
#SBATCH --mem-per-cpu=2500
#SBATCH --cpus-per-task=48
#SBATCH --tasks-per-node=1
#SBATCH -t 0-16:00:00

# Load modules or your own conda environment here
module load pytorch/v1.4.0-gpu
conda activate ${CONDA_ENV}
source activate MY_CONDA_ENV

# ===== DO NOT CHANGE THINGS HERE UNLESS YOU KNOW WHAT YOU ARE DOING =====
# This script is a modification to the implementation suggested by gregSchwartz18 here:
# work with cluster managed by Slurm · Issue #826 · ray-project/ray · GitHub
# redis_password=$(uuidgen)
# export redis_password

nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST") # Getting the node names
nodes_array=($nodes)

node_1=${nodes_array[0]}
ip=$(srun --nodes=1 --ntasks=1 -w "$node_1" --mem-per-cpu=2500 hostname --ip-address) # making redis-address

# If we detect a space character in the head node IP, we'll
# convert it to an ipv4 address. This step is optional.
if [[ "$ip" == *" "* ]]; then
  IFS=' ' read -ra ADDR <<< "$ip"
  if [[ ${#ADDR[0]} -gt 16 ]]; then
    ip=${ADDR[1]}
  else
    ip=${ADDR[0]}
  fi
  echo "IPV6 address detected. We split the IPV4 address as $ip"
fi

port=49160
ip_head=$ip:$port
export ip_head
echo "IP Head: $ip_head"

echo "STARTING HEAD at $node_1"
srun --nodes=1 --ntasks=1 -w "$node_1" --mem-per-cpu=2500 \
  ray start --head --node-ip-address="$ip" --port=$port --block &
sleep 30

worker_num=$((SLURM_JOB_NUM_NODES - 1)) # number of nodes other than the head node
for ((i = 1; i <= worker_num; i++)); do
  node_i=${nodes_array[$i]}
  echo "STARTING WORKER $i at $node_i"
  srun --nodes=1 --ntasks=1 -w "$node_i" --mem-per-cpu=2500 \
    ray start --address "$ip_head" --block &
  sleep 5
done

# ===== Call your code below =====
python -u run_agent.py
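One thing I have considered, in case the intermittent failures are a startup race: instead of relying on the fixed `sleep 30` after starting the head node, probe the head node's port until it actually accepts TCP connections before launching workers. Here is a pure-stdlib sketch of that idea; the helper name `wait_for_port` and the retry/delay values are mine and purely illustrative, not part of Ray:

```python
# Illustrative sketch: poll a host:port until it accepts a TCP connection,
# as a more robust alternative to a fixed sleep after `ray start --head`.
import socket
import time


def wait_for_port(host: str, port: int, retries: int = 30, delay: float = 1.0) -> bool:
    """Return True once host:port accepts a TCP connection, False after all retries."""
    for _ in range(retries):
        try:
            # create_connection raises OSError if the port is not accepting yet
            with socket.create_connection((host, port), timeout=2):
                return True
        except OSError:
            time.sleep(delay)
    return False


if __name__ == "__main__":
    # e.g. probe the head node's Ray port before starting the workers
    print(wait_for_port("127.0.0.1", 49160, retries=3, delay=0.5))
```

This would not fix a crash that happens hours into a run, but it would at least rule out workers racing ahead of a slow-starting head node.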
What’s frustrating is that this error only occurs intermittently, and often only after the job has been running for several hours.
Has anyone else encountered this issue, or have any idea how to correct it?
Thanks in advance for any help!