Autoscaling issue with Ray on SLURM?

I have been using Ray 1.0.1 to run a distributed Python script on a SLURM system for the past few months. Recently I started having issues with Ray workers unexpectedly crashing and killing my job, so I upgraded to the latest version of Ray (1.6.0 at the time of writing).

Now when I run my job it intermittently fails with the following error:

2021-08-27 12:20:02,445 WARNING worker.py:1215 -- The autoscaler failed with the following error:
Traceback (most recent call last):
  File "/lib/python3.7/site-packages/ray/autoscaler/_private/monitor.py", line 324, in run
    self._run()
  File "/lib/python3.7/site-packages/ray/autoscaler/_private/monitor.py", line 214, in _run
    self.update_load_metrics()
  File "/lib/python3.7/site-packages/ray/autoscaler/_private/monitor.py", line 177, in update_load_metrics
    request, timeout=4)
  File "/lib/python3.7/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/lib/python3.7/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "failed to connect to all addresses"
    debug_error_string = "{"created":"@1630081202.423352369","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":5419,"referenced_errors":[{"created":"@1630081202.423350314","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":397,"grpc_status":14}]}"

I'm a little confused by this error because it seems like Ray is crashing while trying to autoscale the number of workers, but since I am using SLURM, the number of worker nodes should be fixed. Here is the script I use to set up my Ray cluster on SLURM (drawn from the Ray documentation):

#!/bin/bash

# shellcheck disable=SC2206

# THIS FILE IS GENERATED BY AUTOMATION SCRIPT! PLEASE REFER TO ORIGINAL SCRIPT!
# THIS FILE IS MODIFIED AUTOMATICALLY FROM TEMPLATE AND SHOULD BE RUNNABLE!

#SBATCH --partition=shared
#SBATCH --job-name=test_0426-1444
#SBATCH --output=test_%j.log

# This script works for any number of nodes; Ray will find and manage all resources
#SBATCH --nodes=2
#SBATCH --mem-per-cpu=2500

# Give all resources to a single Ray task; Ray can manage the resources internally
#SBATCH --cpus-per-task=48
#SBATCH --tasks-per-node=1
#SBATCH -t 0-16:00:00

# Load modules or your own conda environment here
module load pytorch/v1.4.0-gpu
conda activate ${CONDA_ENV}
source activate MY_CONDA_ENV

# ===== DO NOT CHANGE THINGS HERE UNLESS YOU KNOW WHAT YOU ARE DOING =====
# This script is a modification of the implementation suggested by gregSchwartz18 here:
# https://github.com/ray-project/ray/issues/826 ("work with cluster managed by Slurm")

#redis_password=$(uuidgen)
#export redis_password

nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST") # Getting the node names
nodes_array=($nodes)

node_1=${nodes_array[0]}
ip=$(srun --nodes=1 --ntasks=1 -w "$node_1" --mem-per-cpu=2500 hostname --ip-address) # making redis address

# If we detect a space character in the head node IP, we'll
# convert it to an IPv4 address. This step is optional.
if [[ "$ip" == *" "* ]]; then
  IFS=' ' read -ra ADDR <<< "$ip"
  if [[ ${#ADDR[0]} -gt 16 ]]; then
    ip=${ADDR[1]}
  else
    ip=${ADDR[0]}
  fi
  echo "IPV6 address detected. We split the IPV4 address as $ip"
fi

port=49160
ip_head=$ip:$port
export ip_head
echo "IP Head: $ip_head"

echo "STARTING HEAD at $node_1"
srun --nodes=1 --ntasks=1 -w "$node_1" --mem-per-cpu=2500 \
  ray start --head --node-ip-address="$ip" --port=$port --block &
sleep 30

worker_num=$((SLURM_JOB_NUM_NODES - 1)) # number of nodes other than the head node
for ((i = 1; i <= worker_num; i++)); do
  node_i=${nodes_array[$i]}
  echo "STARTING WORKER $i at $node_i"
  srun --nodes=1 --ntasks=1 -w "$node_i" --mem-per-cpu=2500 \
    ray start --address "$ip_head" --block &
  sleep 5
done

# ===== Call your code below =====
python -u run_agent.py
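Side note: since the #SBATCH directives fix the cluster size, the driver can recover the same numbers from SLURM's environment instead of hard-coding a task count. A small sketch of that idea (`slurm_cpu_budget` is an illustrative helper I made up; the variable names are the ones SLURM exports for jobs launched with these directives):

```python
import os

def slurm_cpu_budget():
    """Total CPUs in the allocation, read from SLURM's environment.

    Illustrative helper: each variable defaults to 1 when unset.
    """
    nodes = int(os.environ.get("SLURM_JOB_NUM_NODES", "1"))
    tasks_per_node = int(os.environ.get("SLURM_NTASKS_PER_NODE", "1"))
    cpus_per_task = int(os.environ.get("SLURM_CPUS_PER_TASK", "1"))
    return nodes * tasks_per_node * cpus_per_task

# With the directives above: 2 nodes x 1 task/node x 48 CPUs/task
os.environ["SLURM_JOB_NUM_NODES"] = "2"
os.environ["SLURM_NTASKS_PER_NODE"] = "1"
os.environ["SLURM_CPUS_PER_TASK"] = "48"
print(slurm_cpu_budget())  # → 96
```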

What’s frustrating is that this error only occurs intermittently, and often only after the job has been running for several hours.

Has anyone else encountered this issue, or have any idea how to correct it?

Thanks in advance for any help!

I figured out what was going on here and wanted to post an update in case anyone else runs into the same issue.

There was a bug in my code that was running the same block of code ~20 times, meaning that instead of running ~120 parallel tasks I was trying to launch roughly 2,000 on a cluster with ~100 CPUs. Removing the bug and scaling the number of tasks back down to match the size of the cluster allowed the job to run without any further problems.
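For anyone hitting the same thing: the underlying fix amounts to bounding the number of in-flight tasks to roughly the cluster's capacity. Here's a minimal, Ray-free sketch of that throttling pattern using `concurrent.futures` (`run_bounded` and `max_in_flight` are illustrative names; with Ray itself you'd write the analogous loop with `ray.wait` instead of `wait`):

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def run_bounded(fn, items, max_in_flight):
    """Run fn(item) for every item, keeping at most max_in_flight
    tasks pending at any moment."""
    results = []
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        pending = set()
        for item in items:
            # Before submitting more work, drain at least one
            # finished task if we're at the in-flight cap.
            if len(pending) >= max_in_flight:
                done, pending = wait(pending, return_when=FIRST_COMPLETED)
                results.extend(f.result() for f in done)
            pending.add(pool.submit(fn, item))
        # Collect whatever is still outstanding.
        done, _ = wait(pending)
        results.extend(f.result() for f in done)
    return results

print(sorted(run_bounded(lambda x: x * x, range(10), 4)))
# → [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```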


That’s great to know!