1. Severity of the issue: (select one)
None: I’m just curious or want clarification.
Low: Annoying but doesn’t hinder my work.
Medium: Significantly affects my productivity but can find a workaround.
High: Completely blocks me.
2. Environment:
- Ray version: 2.54.0
- Python version: 3.12.8
- OS: RHEL 8.8
- Cloud/Infrastructure: SLURM
- Other libs/tools (if relevant):
3. What happened vs. what you expected:
- Expected: Ray Tune performs hyperparameter tuning over my search space
- Actual: it gets stuck after startup, and I get no logs or output anywhere
I closely followed the "Deploying on Slurm" guide for Ray 2.54.0.
My SBATCH looks like:
#!/bin/bash
# shellcheck disable=SC2206
#SBATCH --partition=zen3_0512
#SBATCH --qos=zen3_0512
#SBATCH --job-name=MIN-poc
#SBATCH --output=/home/...
#SBATCH --nodes=4
#SBATCH --exclusive
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:0
#SBATCH --mail-user=...
#SBATCH --mail-type=ALL
### Limit time
#SBATCH --time=0:30:00
mail_addr="..."
module purge
module load python/3.12.8-gcc-12.2.0-4y5tbpr
source "/home/impl/.envrc" # activates pip env and some relevant env vars
# Ensure Python prints appear in SLURM logs immediately (not block-buffered)
export PYTHONUNBUFFERED=1
nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
nodes_array=($nodes)
head_node=${nodes_array[0]}
# Resolve head-node IP
head_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
# If multiple addresses are returned, keep IPv4
if [[ "$head_ip" == *" "* ]]; then
    IFS=' ' read -ra ADDR <<< "$head_ip"
    if [[ ${#ADDR[0]} -gt 16 ]]; then
        head_ip=${ADDR[1]}
    else
        head_ip=${ADDR[0]}
    fi
fi
port=6379
ip_head="${head_ip}:${port}"
export ip_head
echo "Head node: ${head_node}"
echo "Head IP: ${ip_head}"
redis_password=$(uuidgen)
export redis_password
NUM_CPUS_PER_NODE="${SLURM_CPUS_ON_NODE:-1}"
NUM_GPUS_PER_NODE="${SLURM_GPUS_ON_NODE:-0}"
echo "NUM_CPUS_PER_NODE=${NUM_CPUS_PER_NODE}"
echo "NUM_GPUS_PER_NODE=${NUM_GPUS_PER_NODE}"
# Single symmetric launch across all nodes.
# Ray starts on all nodes; the entrypoint runs only on the head node.
srun --nodes="${SLURM_JOB_NUM_NODES}" --ntasks="${SLURM_JOB_NUM_NODES}" \
    ray symmetric-run \
        --address "${ip_head}" \
        --min-nodes "${SLURM_JOB_NUM_NODES}" \
        --num-cpus "${NUM_CPUS_PER_NODE}" \
        --num-gpus "0" \
        --redis-password "${redis_password}" \
        -- \
        main.py
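For reference, the IPv4-vs-IPv6 selection in the script above relies on a string-length heuristic: a dotted-quad IPv4 address is at most 15 characters, so a first address longer than 16 characters is assumed to be IPv6 and the second address is taken instead. A small Python sketch of the same logic (illustrative only; `pick_ipv4` is my name for it, not part of the script):

```python
def pick_ipv4(hostname_output: str) -> str:
    """Mimic the SBATCH script's heuristic for `hostname --ip-address`.

    If several whitespace-separated addresses are returned, prefer the
    one that looks like IPv4 (dotted-quad form is at most 15 chars).
    """
    addrs = hostname_output.split()
    if len(addrs) == 1:
        return addrs[0]
    # The script checks `${#ADDR[0]} -gt 16`: a first address longer
    # than 16 chars is assumed to be IPv6, so take the second instead.
    return addrs[1] if len(addrs[0]) > 16 else addrs[0]
```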
Afterwards, the job starts fine and I get no errors, but everything is stuck. The output:
Head node: n3501-021
Head IP: 10.191.1.21:6379
NUM_CPUS_PER_NODE=256
NUM_GPUS_PER_NODE=0
SLURM_JOB_NUM_NODES=4
On worker node. Connecting to Ray cluster at 10.191.1.21:6379...
On head node. Starting Ray cluster head...
On worker node. Connecting to Ray cluster at 10.191.1.21:6379...
On worker node. Connecting to Ray cluster at 10.191.1.21:6379...
Ray cluster is ready!
Ray cluster is ready!
Ray cluster is ready!
2026-03-19 14:41:08,136 INFO scripts.py:1124 -- Local node IP: 10.191.1.29
2026-03-19 14:41:08,170 INFO scripts.py:1124 -- Local node IP: 10.191.2.46
2026-03-19 14:41:08,177 INFO scripts.py:1124 -- Local node IP: 10.191.2.7
2026-03-19 14:41:10,284 SUCC scripts.py:1140 -- --------------------
2026-03-19 14:41:10,284 SUCC scripts.py:1141 -- Ray runtime started.
2026-03-19 14:41:10,284 SUCC scripts.py:1142 -- --------------------
2026-03-19 14:41:10,284 INFO scripts.py:1144 -- To terminate the Ray runtime, run
2026-03-19 14:41:10,284 INFO scripts.py:1145 -- ray stop
2026-03-19 14:41:10,284 INFO scripts.py:1155 -- --block
2026-03-19 14:41:10,285 INFO scripts.py:1156 -- This command will now block forever until terminated by a signal.
2026-03-19 14:41:10,285 INFO scripts.py:1159 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly. Subprocesses exit with SIGTERM will be treated as graceful, thus NOT reported.
2026-03-19 14:41:10,285 INFO scripts.py:1164 -- Process exit logs will be saved to: /tmp/ray/session_2026-03-19_14-41-05_771929_39987/logs/ray_process_exit.log
2026-03-19 14:41:11,320 SUCC scripts.py:1140 -- --------------------
2026-03-19 14:41:11,320 SUCC scripts.py:1141 -- Ray runtime started.
2026-03-19 14:41:11,320 SUCC scripts.py:1142 -- --------------------
2026-03-19 14:41:11,320 INFO scripts.py:1144 -- To terminate the Ray runtime, run
2026-03-19 14:41:11,320 INFO scripts.py:1145 -- ray stop
2026-03-19 14:41:11,321 INFO scripts.py:1155 -- --block
2026-03-19 14:41:11,328 SUCC scripts.py:1140 -- --------------------
2026-03-19 14:41:11,328 SUCC scripts.py:1141 -- Ray runtime started.
2026-03-19 14:41:11,321 INFO scripts.py:1156 -- This command will now block forever until terminated by a signal.
2026-03-19 14:41:11,321 INFO scripts.py:1159 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly. Subprocesses exit with SIGTERM will be treated as graceful, thus NOT reported.
2026-03-19 14:41:11,321 INFO scripts.py:1164 -- Process exit logs will be saved to: /tmp/ray/session_2026-03-19_14-41-05_771929_39987/logs/ray_process_exit.log
2026-03-19 14:41:11,328 SUCC scripts.py:1142 -- --------------------
2026-03-19 14:41:11,328 INFO scripts.py:1144 -- To terminate the Ray runtime, run
2026-03-19 14:41:11,328 INFO scripts.py:1145 -- ray stop
2026-03-19 14:41:11,328 INFO scripts.py:1155 -- --block
2026-03-19 14:41:11,328 INFO scripts.py:1156 -- This command will now block forever until terminated by a signal.
2026-03-19 14:41:11,328 INFO scripts.py:1159 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly. Subprocesses exit with SIGTERM will be treated as graceful, thus NOT reported.
2026-03-19 14:41:11,328 INFO scripts.py:1164 -- Process exit logs will be saved to: /tmp/ray/session_2026-03-19_14-41-05_771929_39987/logs/ray_process_exit.log
Head node started.
=======================
2026-03-19 14:41:11,474 INFO worker.py:1669 -- Using address 10.191.1.21:6379 set in the environment variable RAY_ADDRESS
2026-03-19 14:41:11,479 INFO worker.py:1810 -- Connecting to existing Ray cluster at address: 10.191.1.21:6379...
2026-03-19 14:41:11,618 INFO worker.py:2013 -- Connected to Ray cluster.
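For context, a stripped-down sketch of the pattern my main.py follows (simplified and illustrative; the real trainable and search space are larger, and the toy objective here is my own placeholder):

```python
def objective(config):
    # Toy objective standing in for my real trainable:
    # minimize (x - 3)^2, reported as a dict metric for Tune.
    return {"score": (config["x"] - 3.0) ** 2}

if __name__ == "__main__":
    import ray
    from ray import tune

    # symmetric-run exports RAY_ADDRESS, so attach to the existing cluster
    # rather than starting a local one.
    ray.init(address="auto")

    tuner = tune.Tuner(
        objective,
        param_space={"x": tune.uniform(-10.0, 10.0)},
        tune_config=tune.TuneConfig(num_samples=8, metric="score", mode="min"),
    )
    results = tuner.fit()
    print(results.get_best_result().config)
```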