Issues with deploying Ray on Slurm:

I’ve been trying to run the following script on a Slurm cluster (the script works fine on a local machine with Ray installed).

import numpy as np
import pandas as pd
from evaluate_iteration import *
import ray

lr = 27
lb = 29
a = 0.8
x = 0.5
n = 1e5
nMonte = 1000

rr = np.linspace(0.1, 3, lr)
bb = np.linspace(0.45, 0.99, lb)
params = [(beta, r) for r in rr for beta in bb]


ray.init(address='auto')
res = [evaluate_iteration.remote(a, x, par[0], par[1], n,
                nMonte = nMonte, metric = 'power') for par in params]
return_value = ray.get(res)

df = pd.DataFrame()
for r in return_value :
    df = df.append(r, ignore_index=True)
df.to_csv('results.csv')

The submission script is as follows (based on Deploying on Slurm — Ray v2.0.0.dev0):

#!/bin/bash
#SBATCH --job-name=test
#SBATCH --cpus-per-task=5
#SBATCH --mem-per-cpu=2GB
#SBATCH --nodes=5
#SBATCH --tasks-per-node 1
worker_num=3 # Must be one less than the total number of nodes
# module load Langs/Python/3.6.4 # This will vary depending on your environment
# source venv/bin/activate
nodes=$(scontrol show hostnames $SLURM_JOB_NODELIST) # Getting the node names
nodes_array=( $nodes )
node1=${nodes_array[0]}
ip_prefix=$(srun --nodes=1 --ntasks=1 -w $node1 hostname --ip-address) # Making address
suffix=':6379'
ip_head=$ip_prefix$suffix
redis_password=$(uuidgen)
export ip_head # Exporting for later access by trainer.py
srun --nodes=1 --ntasks=1 -w $node1 ray start --block --head --redis-port=6379 --redis-password=$redis_password & # Starting the head
sleep 5
# Make sure the head successfully starts before any worker does, otherwise
# the worker will not be able to connect to redis. In case of longer delay,
# adjust the sleeptime above to ensure proper order.
for ((  i=1; i<=$worker_num; i++ ))
do
  node2=${nodes_array[$i]}
  srun --nodes=1 --ntasks=1 -w $node2 ray start --block --address=$ip_head --redis-password=$redis_password & # Starting the workers
  # Flag --block will keep ray process alive on each compute node.
  sleep 5
done
python -u phase_diagram.py $redis_password 15 # Pass the total number of allocated CPUs

I get the following error in the .out file:
Error: no such option: --redis-port
srun: error: sh02-01n11: task 0: Exited with exit code 2

Hi @richardliaw, @zhz,
It would be great if you can take a look at this.

The argument --redis-port was renamed to --port at some point. If you leave it off, it defaults to 6379.

The change is reflected in the current docs, I think: Deploying on Slurm — Ray v2.0.0.dev0


Thanks @Dmitri. I’ve used the slurm-launch.py script mentioned in Deploying on Slurm.
I now receive the following error:

[kipnisal@sh03-ln01 login ~/PhaseTrans]$ cat PT_0302-1947.log
/var/spool/slurmd/job19469069/slurm_script: line 19: None: command not found
IP Head: 10.19.1.1:6379
STARTING HEAD at sh03-01n01
srun: Job 19469069 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Step created for job 19469069
2021-03-02 19:48:56,326 INFO services.py:1174 -- View the Ray dashboard at http://localhost:8265
STARTING WORKER 1 at sh03-01n02
STARTING WORKER 2 at sh03-01n09
2021-03-02 19:49:35,348 INFO worker.py:655 -- Connecting to existing Ray cluster at address: 10.19.1.1:6379
Traceback (most recent call last):
  File "phase_diagram.py", line 22, in <module>
    ray.init(address='auto')
  File "/home/users/kipnisal/CJinstalled/miniconda/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/users/kipnisal/CJinstalled/miniconda/lib/python3.7/site-packages/ray/worker.py", line 759, in init
    connect_only=True)
  File "/home/users/kipnisal/CJinstalled/miniconda/lib/python3.7/site-packages/ray/node.py", line 153, in __init__
    session_name = _get_with_retry(redis_client, "session_name")
  File "/home/users/kipnisal/CJinstalled/miniconda/lib/python3.7/site-packages/ray/node.py", line 39, in _get_with_retry
    result = redis_client.get(key)
  File "/home/users/kipnisal/CJinstalled/miniconda/lib/python3.7/site-packages/redis/client.py", line 1606, in get
    return self.execute_command('GET', name)
  File "/home/users/kipnisal/CJinstalled/miniconda/lib/python3.7/site-packages/redis/client.py", line 898, in execute_command
    conn = self.connection or pool.get_connection(command_name, **options)
  File "/home/users/kipnisal/CJinstalled/miniconda/lib/python3.7/site-packages/redis/connection.py", line 1192, in get_connection
    connection.connect()
  File "/home/users/kipnisal/CJinstalled/miniconda/lib/python3.7/site-packages/redis/connection.py", line 567, in connect
    self.on_connect()
  File "/home/users/kipnisal/CJinstalled/miniconda/lib/python3.7/site-packages/redis/connection.py", line 643, in on_connect
    auth_response = self.read_response()
  File "/home/users/kipnisal/CJinstalled/miniconda/lib/python3.7/site-packages/redis/connection.py", line 756, in read_response
    raise response
redis.exceptions.ResponseError: WRONGPASS invalid username-password pair

Does the Redis password need to get passed into ray.init?

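Yeah, I think one fix is to pass the password explicitly in the Python script: ray.init(address="auto", _redis_password=sys.argv[1]). This assumes the script imports sys and that the password is the first command-line argument, as in the sbatch script above.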

The other fix (which I would probably recommend) is to remove all references to the Redis password altogether, in both your Python script and your sbatch script.
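
For reference, here is a minimal sketch of how the driver script could handle both fixes. The helper names build_init_kwargs and connect are hypothetical, not part of Ray. The sketch assumes the password, if present, is the first command-line argument, and that _redis_password is accepted by ray.init in the Ray version used in this thread (it is a private parameter and may differ in other versions). If you go with the second fix, the sbatch line has to drop the password argument as well, otherwise the next argument would be mistaken for a password.

```python
import sys


def build_init_kwargs(argv):
    """Build keyword arguments for ray.init() from the command line.

    Assumption: the Redis password, if any, is the first argument
    after the script name (as in the sbatch script in this thread).
    """
    kwargs = {"address": "auto"}  # connect to the already-running cluster
    if len(argv) > 1 and argv[1]:
        # First fix: forward the same password that `ray start` was given.
        kwargs["_redis_password"] = argv[1]
    # Second fix: with no password argument, only the address is passed.
    return kwargs


def connect(argv=None):
    """Connect the driver to the cluster (hypothetical helper)."""
    import ray  # imported lazily so build_init_kwargs can be checked without Ray

    return ray.init(**build_init_kwargs(sys.argv if argv is None else argv))
```

In phase_diagram.py, calling connect() would then replace the bare ray.init(address='auto') line.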