Ray gets stuck on SLURM cluster without any exception

1. Severity of the issue: (select one)
None: I’m just curious or want clarification.
Low: Annoying but doesn’t hinder my work.
Medium: Significantly affects my productivity but can find a workaround.
High: Completely blocks me.

2. Environment:

  • Ray version: 2.54.0
  • Python version: 3.12.8
  • OS: RHEL 8.8
  • Cloud/Infrastructure: SLURM
  • Other libs/tools (if relevant):

3. What happened vs. what you expected:

  • Expected: Ray Tune performs hyperparameter tuning over my search space
  • Actual: It gets stuck after startup and I get no logs/output anywhere

I closely followed the "Deploying on Slurm" guide for Ray 2.54.0.

My SBATCH looks like:

#!/bin/bash
# shellcheck disable=SC2206
#SBATCH --partition=zen3_0512
#SBATCH --qos=zen3_0512
#SBATCH --job-name=MIN-poc
#SBATCH --output=/home/...
#SBATCH --nodes=4
#SBATCH --exclusive
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:0
#SBATCH --mail-user=...
#SBATCH --mail-type=ALL
### Limit time
#SBATCH --time=0:30:00

mail_addr="..."

module purge
module load python/3.12.8-gcc-12.2.0-4y5tbpr
source "/home/impl/.envrc" # activates pip env and some relevant env vars

# Ensure Python prints appear in SLURM logs immediately (not block-buffered)
export PYTHONUNBUFFERED=1

nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
nodes_array=($nodes)
head_node=${nodes_array[0]}

# Resolve head-node IP
head_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

# If multiple addresses are returned, keep IPv4
if [[ "$head_ip" == *" "* ]]; then
  IFS=' ' read -ra ADDR <<< "$head_ip"
  if [[ ${#ADDR[0]} -gt 16 ]]; then
    head_ip=${ADDR[1]}
  else
    head_ip=${ADDR[0]}
  fi
fi

port=6379
ip_head="${head_ip}:${port}"
export ip_head

echo "Head node: ${head_node}"
echo "Head IP: ${ip_head}"

redis_password=$(uuidgen)
export redis_password

NUM_CPUS_PER_NODE="${SLURM_CPUS_ON_NODE:-1}"
NUM_GPUS_PER_NODE="${SLURM_GPUS_ON_NODE:-0}"

echo "NUM_CPUS_PER_NODE=${NUM_CPUS_PER_NODE}"
echo "NUM_GPUS_PER_NODE=${NUM_GPUS_PER_NODE}"

# Single symmetric launch across all nodes.
# Ray starts on all nodes; the entrypoint runs only on the head node.
srun --nodes="${SLURM_JOB_NUM_NODES}" --ntasks="${SLURM_JOB_NUM_NODES}" \
  ray symmetric-run \
    --address "${ip_head}" \
    --min-nodes "${SLURM_JOB_NUM_NODES}" \
    --num-cpus "${NUM_CPUS_PER_NODE}" \
    --num-gpus "0" \
    --redis-password "${redis_password}" \
    -- \
    main.py

Afterwards, the processes start fine and I get no errors, but everything is stuck:

Head node: n3501-021
Head IP: 10.191.1.21:6379
NUM_CPUS_PER_NODE=256
NUM_GPUS_PER_NODE=0
SLURM_JOB_NUM_NODES=4
On worker node. Connecting to Ray cluster at 10.191.1.21:6379...
On head node. Starting Ray cluster head...
On worker node. Connecting to Ray cluster at 10.191.1.21:6379...
On worker node. Connecting to Ray cluster at 10.191.1.21:6379...
Ray cluster is ready!
Ray cluster is ready!
Ray cluster is ready!
2026-03-19 14:41:08,136	INFO scripts.py:1124 -- Local node IP: 10.191.1.29
2026-03-19 14:41:08,170	INFO scripts.py:1124 -- Local node IP: 10.191.2.46
2026-03-19 14:41:08,177	INFO scripts.py:1124 -- Local node IP: 10.191.2.7
2026-03-19 14:41:10,284	SUCC scripts.py:1140 -- --------------------
2026-03-19 14:41:10,284	SUCC scripts.py:1141 -- Ray runtime started.
2026-03-19 14:41:10,284	SUCC scripts.py:1142 -- --------------------
2026-03-19 14:41:10,284	INFO scripts.py:1144 -- To terminate the Ray runtime, run
2026-03-19 14:41:10,284	INFO scripts.py:1145 --   ray stop
2026-03-19 14:41:10,284	INFO scripts.py:1155 -- --block
2026-03-19 14:41:10,285	INFO scripts.py:1156 -- This command will now block forever until terminated by a signal.
2026-03-19 14:41:10,285	INFO scripts.py:1159 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly. Subprocesses exit with SIGTERM will be treated as graceful, thus NOT reported.
2026-03-19 14:41:10,285	INFO scripts.py:1164 -- Process exit logs will be saved to: /tmp/ray/session_2026-03-19_14-41-05_771929_39987/logs/ray_process_exit.log
2026-03-19 14:41:11,320	SUCC scripts.py:1140 -- --------------------
2026-03-19 14:41:11,320	SUCC scripts.py:1141 -- Ray runtime started.
2026-03-19 14:41:11,320	SUCC scripts.py:1142 -- --------------------
2026-03-19 14:41:11,320	INFO scripts.py:1144 -- To terminate the Ray runtime, run
2026-03-19 14:41:11,320	INFO scripts.py:1145 --   ray stop
2026-03-19 14:41:11,321	INFO scripts.py:1155 -- --block
2026-03-19 14:41:11,328	SUCC scripts.py:1140 -- --------------------
2026-03-19 14:41:11,328	SUCC scripts.py:1141 -- Ray runtime started.
2026-03-19 14:41:11,321	INFO scripts.py:1156 -- This command will now block forever until terminated by a signal.
2026-03-19 14:41:11,321	INFO scripts.py:1159 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly. Subprocesses exit with SIGTERM will be treated as graceful, thus NOT reported.
2026-03-19 14:41:11,321	INFO scripts.py:1164 -- Process exit logs will be saved to: /tmp/ray/session_2026-03-19_14-41-05_771929_39987/logs/ray_process_exit.log
2026-03-19 14:41:11,328	SUCC scripts.py:1142 -- --------------------
2026-03-19 14:41:11,328	INFO scripts.py:1144 -- To terminate the Ray runtime, run
2026-03-19 14:41:11,328	INFO scripts.py:1145 --   ray stop
2026-03-19 14:41:11,328	INFO scripts.py:1155 -- --block
2026-03-19 14:41:11,328	INFO scripts.py:1156 -- This command will now block forever until terminated by a signal.
2026-03-19 14:41:11,328	INFO scripts.py:1159 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly. Subprocesses exit with SIGTERM will be treated as graceful, thus NOT reported.
2026-03-19 14:41:11,328	INFO scripts.py:1164 -- Process exit logs will be saved to: /tmp/ray/session_2026-03-19_14-41-05_771929_39987/logs/ray_process_exit.log
Head node started.
=======================
2026-03-19 14:41:11,474	INFO worker.py:1669 -- Using address 10.191.1.21:6379 set in the environment variable RAY_ADDRESS
2026-03-19 14:41:11,479	INFO worker.py:1810 -- Connecting to existing Ray cluster at address: 10.191.1.21:6379...
2026-03-19 14:41:11,618	INFO worker.py:2013 -- Connected to Ray cluster.

Severity: High (completely blocks you).
Your SLURM and Ray cluster setup appears correct and matches the official Ray SLURM guide, yet Ray Tune hangs after startup with no logs or output. This can happen if the entrypoint script (your impl ... train -l "MIN" -rt command) is not actually running on the head node, if the Ray cluster is not fully ready before the script starts, or if environment/path issues prevent your Python CLI from running as expected.

The Ray logs show that all nodes start and connect, but there is no evidence that your training script ever executed. The docs and the symmetric-run implementation confirm that only the head node runs the entrypoint while all other nodes block. If your CLI is not found, or if dependencies are missing, the process may fail silently. Also ensure that redis_password is set and exported before use; a missing or mismatched password can cause a silent hang.
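One quick, cluster-agnostic check is whether the head node's GCS port is reachable at all from a worker node. A stdlib-only sketch (the host/port arguments are placeholders for your head IP and port 6379; `port_reachable` is a hypothetical helper, not a Ray API):

```python
import socket


def port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Running this via `srun -w <worker_node> python -c ...` against the head IP would rule out a firewall or routing problem between nodes before digging into Ray itself.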

Would you like more detail on debugging steps, such as verifying the entrypoint execution, checking Ray logs, or troubleshooting environment issues?


I tried running an absolute bare-minimum script:

import os
import socket

import ray
import torch
import lightning.pytorch as pl
from ray import tune
from ray.tune.integration.pytorch_lightning import TuneReportCheckpointCallback
from torch import nn
from torch.utils.data import Dataset, DataLoader


# Synthetic data

class RandomDataset(Dataset):
    def __init__(self, n=512, dim=16):
        g = torch.Generator().manual_seed(0)
        self.x = torch.randn(n, dim, generator=g)
        w = torch.randn(dim, 1, generator=g)
        self.y = (self.x @ w > 0).float()

    def __len__(self):
        return len(self.x)

    def __getitem__(self, i):
        return self.x[i], self.y[i]


# Minimal model

class TinyModel(pl.LightningModule):
    def __init__(self, dim=16, lr=1e-3):
        super().__init__()
        self.save_hyperparameters()
        self.net = nn.Sequential(nn.Linear(dim, 8), nn.ReLU(), nn.Linear(8, 1))
        self.loss_fn = nn.BCEWithLogitsLoss()

    def training_step(self, batch, _):
        x, y = batch
        loss = self.loss_fn(self.net(x), y)
        self.log("train_loss", loss, on_epoch=True, prog_bar=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.hparams.lr)


# Training function (what Ray Tune calls per trial)

def train_fn(config):
    dm = DataLoader(RandomDataset(), batch_size=32, shuffle=True)
    model = TinyModel(lr=config["lr"])
    trainer = pl.Trainer(
        max_epochs=config.get("max_epochs", 3),
        accelerator="cpu",
        devices=1,
        enable_progress_bar=False,
        callbacks=[
            TuneReportCheckpointCallback(
                metrics={"train_loss": "train_loss"},
                on="train_epoch_end",
                save_checkpoints=False,
            )
        ],
    )
    trainer.fit(model, dm)


if __name__ == "__main__":
    # Connect to the existing cluster (address from RAY_ADDRESS / auto-detection);
    # note that address="auto" raises if no running cluster is found
    ray.init(address="auto", ignore_reinit_error=True)

    print(f"[DIAG] host={socket.gethostname()} pid={os.getpid()}")
    print(f"[DIAG] cluster_resources={ray.cluster_resources()}")
    print(f"[DIAG] available_resources={ray.available_resources()}")
    print(f"[DIAG] nodes={[n['NodeManagerAddress'] for n in ray.nodes()]}")


    # Run Tune
    tuner = tune.Tuner(
        tune.with_resources(train_fn, resources={"cpu": 1}),
        param_space={"lr": tune.loguniform(1e-4, 1e-2), "max_epochs": 3},
        tune_config=tune.TuneConfig(metric="train_loss", mode="min", num_samples=2),
        run_config=tune.RunConfig(name="minimum_poc", log_to_file=True),
    )
    results = tuner.fit()
    print(f"[DONE] best config: {results.get_best_result().config}")
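Since the driver hangs with no output at all, a stdlib faulthandler watchdog at the top of main.py could at least show which call is blocking. A sketch (the 60-second interval is an arbitrary choice):

```python
import faulthandler
import sys

# Periodically dump every thread's stack to stderr, so if the driver hangs,
# the SLURM log shows exactly which call is blocking (e.g. ray.init or
# tuner.fit). The 60 s interval is arbitrary.
faulthandler.dump_traceback_later(60, repeat=True, file=sys.stderr)

# ... rest of main.py runs as usual; once finished, stop the watchdog:
# faulthandler.cancel_dump_traceback_later()
```

Combined with PYTHONUNBUFFERED=1 already set in the SBATCH script, the periodic tracebacks should land in the SLURM output file even while the job appears stuck.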