Ray + SLURM crashes early in run

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hi all,

I have been trying for a while to get Ray set up and parallelised across multiple nodes on my university's cluster. It always crashes with the following error:

3f962e125c544bf07d0a709c42e2532aad968f5d16d321f Node ID: 5336409fc44f023f104ffbb8a2e56686bad569aee3bcf4941ebfd7d5 Worker IP address: 10.43.77.76 Worker port: 10064 Worker PID: 1345983 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.

I am under-utilising the memory available on the CPUs, so I do not understand why memory would be over-used. Tracking memory usage in wandb, I do not appear to be anywhere near the memory limits. I have spent weeks debugging this issue to no avail. The only way I can get Tune to work is to run one trial per node, which pretty much defeats the point of using Tune.
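
For reference, here is a minimal sketch (not part of my actual script) of how I compare the wandb numbers against Ray's own view of the memory budget from the driver, assuming the cluster started by the SLURM script below is already up:

import ray

# Attach to the already-running cluster started by the SLURM script.
ray.init(address="auto")

# Ray reports "memory" and "object_store_memory" in bytes.
total = ray.cluster_resources()
free = ray.available_resources()
print("Total memory (GB):    ", total.get("memory", 0) / 1e9)
print("Available memory (GB):", free.get("memory", 0) / 1e9)
print("Object store (GB):    ", total.get("object_store_memory", 0) / 1e9)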

SLURM file:

#SBATCH --nodes=20
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=5
#SBATCH --mem-per-cpu=99GB
#SBATCH --time=24:00:00
#SBATCH --mail-type=FAIL
#SBATCH --exclusive

# Load modules and activate Python environment
. /etc/profile.d/modules.sh
module purge
module load rhel8/default-icl
module load python
source /rds/user/sakl2/hpc-work/development/calibration_testing/venv/bin/activate


# Getting the node names
nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
nodes_array=($nodes)

head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

# if we detect a space character in the head node IP, we'll
# convert it to an ipv4 address. This step is optional.
if [[ "$head_node_ip" == *" "* ]]; then
IFS=' ' read -ra ADDR <<<"$head_node_ip"
if [[ ${#ADDR[0]} -gt 16 ]]; then
  head_node_ip=${ADDR[1]}
else
  head_node_ip=${ADDR[0]}
fi
echo "IPV6 address detected. We split the IPV4 address as $head_node_ip"
fi

port=6379
ip_head=$head_node_ip:$port
export ip_head
echo "IP Head: $ip_head"

echo "Starting HEAD at $head_node"
srun --nodes=1 --ntasks=1 -w "$head_node" \
    ray start --head --node-ip-address="$head_node_ip" --port=$port \
    --num-cpus=5 --block &
# saving 1 cpu here for ray processes
# optional, though may be useful in certain versions of Ray < 1.0.
sleep 30

# number of nodes other than the head node
worker_num=$((SLURM_JOB_NUM_NODES - 1))

for ((i = 1; i <= worker_num; i++)); do
    node_i=${nodes_array[$i]}
    echo "Starting WORKER $i at $node_i"
    srun --nodes=1 --ntasks=1 -w "$node_i" \
        ray start --address "$ip_head" \
        --num-cpus=5 --block &
    sleep 30
done
# saving 1 cpu here for ray processes

# Execute the Python script
python -u /rds/user/sakl2/hpc-work/development/calibration_testing/ow/calibration/ml_g/main.py "$SLURM_CPUS_PER_TASK" "$head_node_ip"
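
As a sanity check, my understanding is that the documented way to attach the driver to the cluster started above is to pass an explicit address to ray.init; a minimal sketch (not my actual main.py), assuming the head is listening on port 6379 as configured in the script:

import sys
import ray

# argv[1] is SLURM_CPUS_PER_TASK and argv[2] is the head node IP,
# as passed on the python line above.
head_node_ip = sys.argv[2]
ray.init(address=f"{head_node_ip}:6379")
print("Nodes registered:", len(ray.nodes()))
print("Cluster CPUs:", ray.cluster_resources().get("CPU"))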

Main file:

# Imports (Ray 2.x module paths)
import os
import sys

import ray
import wandb
from ray import tune
from ray.air import RunConfig
from ray.air.integrations.wandb import WandbLoggerCallback
from ray.tune import Tuner, TuneConfig
from ray.tune.schedulers import AsyncHyperBandScheduler
from ray.tune.search import ConcurrencyLimiter
from ray.tune.search.bayesopt import BayesOptSearch

# train_with_tune is the trainable; it is defined elsewhere and not shown here.

if __name__ == "__main__":
    print("Starting")
    ip_head = str(sys.argv[2])
    print("IP head is " + str(ip_head))
    ray.init(_node_ip_address=ip_head)
    print("Nodes in the Ray cluster:")
    print(ray.nodes())
    wandb.login()

    # Define the parameter space for Bayesian Optimization
    config = {
        "batch_size": tune.quniform(10, 1000, 1),
        "dropout": tune.uniform(0.1, 0.7),
        "lr": tune.loguniform(1e-7, 5e-2),
    }

    # Use Bayesian Optimization
    bayesopt_ = BayesOptSearch(
        metric="val_loss",
        mode="min",
        patience=500,
        random_search_steps=100,
    )
    bayesopt = ConcurrencyLimiter(bayesopt_, max_concurrent=80)

    # Optional: Use a scheduler
    scheduler = AsyncHyperBandScheduler(
        metric="val_loss",
        mode="min",
        grace_period=50,
        max_t=5000,  # EPOCHS
        brackets=1,
    )

    local_dir = os.getcwd()

    resources_per_trial = {
        "cpu": 1,
        "gpu": 0,
    }
    tuner = Tuner(
        tune.with_resources(train_with_tune, resources=resources_per_trial),
        tune_config=TuneConfig(
            search_alg=bayesopt,
            scheduler=scheduler,
            num_samples=1000,
        ),
        run_config=RunConfig(
            local_dir=local_dir,
            callbacks=[
                WandbLoggerCallback(
                    project="bigrun",
                    job_type="training",
                )
            ],
        ),
        param_space=config,
    )

    print("Starting 3")
    results = tuner.fit()

    print("Starting 4")
    print("Done")

Error (redacted to fit):
Trial status: 14 RUNNING | 8 TERMINATED | 1 PENDING
Current time: 2024-03-21 04:50:00. Total running time: 3min 0s
Logical resource usage: 16.0/100 CPUs, 0/0 GPUs

(_QueueActor pid=1346264) /rds/user/sakl2/hpc-work/development/calibration_testing/venv/lib/python3.8/site-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker20RunTaskExecutionLoopEv+0xcd) [0x15041d2386fd] ray::core::CoreWorker::RunTaskExecutionLoop() [repeated 2x across cluster]
(_QueueActor pid=1346264) /rds/user/sakl2/hpc-work/development/calibration_testing/venv/lib/python3.8/site-packages/ray/_raylet.so(_ZN3ray4core21CoreWorkerProcessImpl26RunWorkerTaskExecutionLoopEv+0x8c) [0x15041d27a73c] ray::core::CoreWorkerProcessImpl::RunWorkerTaskExecutionLoop() [repeated 2x across cluster]
(_QueueActor pid=1346264) /rds/user/sakl2/hpc-work/development/calibration_testing/venv/lib/python3.8/site-packages/ray/_raylet.so(_ZN3ray4core17CoreWorkerProcess20RunTaskExecutionLoopEv+0x1d) [0x15041d27a8ed] ray::core::CoreWorkerProcess::RunTaskExecutionLoop() [repeated 2x across cluster]
(_QueueActor pid=1346264) /usr/local/software/archive/linux-scientific7-x86_64/gcc-9/python-3.8.2-pxfbotweq2mojfmgod6qp7dtp6zow2rt/lib/libpython3.8.so.1.0(+0xa2fb7) [0x15042efeffb7] method_vectorcall_NOARGS [repeated 2x across cluster]
(_QueueActor pid=1346264) /usr/local/software/archive/linux-scientific7-x86_64/gcc-9/python-3.8.2-pxfbotweq2mojfmgod6qp7dtp6zow2rt/lib/libpython3.8.so.1.0(+0x6be9b) [0x15042efb8e9b] function_code_fastcall [repeated 2x across cluster]
(_QueueActor pid=1346264) /usr/local/software/archive/linux-scientific7-x86_64/gcc-9/python-3.8.2-pxfbotweq2mojfmgod6qp7dtp6zow2rt/lib/libpython3.8.so.1.0(PyEval_EvalCodeEx+0x3e) [0x15042f0c71be] PyEval_EvalCodeEx [repeated 2x across cluster]
(_QueueActor pid=1346264) /usr/local/software/archive/linux-scientific7-x86_64/gcc-9/python-3.8.2-pxfbotweq2mojfmgod6qp7dtp6zow2rt/lib/libpython3.8.so.1.0(PyEval_EvalCode+0x1b) [0x15042f0c71eb] PyEval_EvalCode [repeated 2x across cluster]
(_QueueActor pid=1346264) /usr/local/software/archive/linux-scientific7-x86_64/gcc-9/python-3.8.2-pxfbotweq2mojfmgod6qp7dtp6zow2rt/lib/libpython3.8.so.1.0(+0x1b80cc) [0x15042f1050cc] run_mod [repeated 2x across cluster]
(_QueueActor pid=1346264) /usr/local/software/archive/linux-scientific7-x86_64/gcc-9/python-3.8.2-pxfbotweq2mojfmgod6qp7dtp6zow2rt/lib/libpython3.8.so.1.0(PyRun_FileExFlags+0x92) [0x15042f1068b2] PyRun_FileExFlags [repeated 2x across cluster]
(_QueueActor pid=1346264) /usr/local/software/archive/linux-scientific7-x86_64/gcc-9/python-3.8.2-pxfbotweq2mojfmgod6qp7dtp6zow2rt/lib/libpython3.8.so.1.0(PyRun_SimpleFileExFlags+0xeb) [0x15042f106a0b] PyRun_SimpleFileExFlags [repeated 2x across cluster]
(_QueueActor pid=1346264) /usr/local/software/archive/linux-scientific7-x86_64/gcc-9/python-3.8.2-pxfbotweq2mojfmgod6qp7dtp6zow2rt/lib/libpython3.8.so.1.0(Py_RunMain+0x7a8) [0x15042f1239d8] Py_RunMain [repeated 2x across cluster]
(_QueueActor pid=1346264) /usr/local/software/archive/linux-scientific7-x86_64/gcc-9/python-3.8.2-pxfbotweq2mojfmgod6qp7dtp6zow2rt/lib/libpython3.8.so.1.0(Py_BytesMain+0x43) [0x15042f123dc3] Py_BytesMain [repeated 2x across cluster]
(_QueueActor pid=1346264) /lib64/libc.so.6(__libc_start_main+0xe5) [0x15042dde4d85] __libc_start_main [repeated 2x across cluster]
(_QueueActor pid=1346264) ray::_QueueActor() [0x400fae] [repeated 2x across cluster]
(raylet) *** SIGABRT received at time=1710996684 on cpu 13 *** [repeated 5x across cluster]
(raylet) PC: @ 0x14d7e79cdacf (unknown) raise [repeated 5x across cluster]
(raylet) [2024-03-21 04:51:24,244 E 1346326 1346326] logging.cc:361: *** SIGABRT received at time=1710996684 on cpu 13 *** [repeated 5x across cluster]
(raylet) [2024-03-21 04:51:24,244 E 1346326 1346326] logging.cc:361: PC: @ 0x14d7e79cdacf (unknown) raise [repeated 5x across cluster]
(raylet) [2024-03-21 04:51:24,245 E 1346326 1346326] logging.cc:361: @ 0x14d7e9062a20 (unknown) (unknown) [repeated 5x across cluster]
(raylet) Fatal Python error: Aborted [repeated 5x across cluster]
(raylet) Stack (most recent call first): [repeated 3x across cluster]
(_QueueActor pid=1346264) File "/rds/user/sakl2/hpc-work/development/calibration_testing/venv/lib/python3.8/site-packages/ray/_private/worker.py", line 847 in main_loop [repeated 2x across cluster]
(raylet) File "/rds/user/sakl2/hpc-work/development/calibration_testing/venv/lib/python3.8/site-packages/ray/_private/workers/default_worker.py", line 247 in <module> [repeated 3x across cluster]
(_QueueActor pid=1346094) /rds/user/sakl2/hpc-work/development/calibration_testing/venv/lib/python3.8/site-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker24HandleGetCoreWorkerStatsENS_3rpc25GetCoreWorkerStatsRequestEPNS2_23GetCoreWorkerStatsReplyESt8functionIFvNS_6StatusES6_IFvvEES9_EE+0x899) [0x152b6070a109] ray::core::CoreWorker::HandleGetCoreWorkerStats()
(_QueueActor pid=1346094) /rds/user/sakl2/hpc-work/development/calibration_testing/venv/lib/python3.8/site-packages/ray/_raylet.so(_ZN3ray3rpc14ServerCallImplINS0_24CoreWorkerServiceHandlerENS0_25GetCoreWorkerStatsRequestENS0_23GetCoreWorkerStatsReplyELNS0_8AuthTypeE0EE17HandleRequestImplEb+0x104) [0x152b606ffef4] ray::rpc::ServerCallImpl<>::HandleRequestImpl()
(_QueueActor pid=1346094) /rds/user/sakl2/hpc-work/development/calibration_testing/venv/lib/python3.8/site-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker12RunIOServiceEv+0xc9) [0x152b606e11d9] ray::core::CoreWorker::RunIOService()
(_QueueActor pid=1346094) /rds/user/sakl2/hpc-work/development/calibration_testing/venv/lib/python3.8/site-packages/ray/_raylet.so(+0xb2ee00) [0x152b60aa0e00] thread_proxy
(_QueueActor pid=1346094) /lib64/libpthread.so.0(+0x81ca) [0x152b71dae1ca] start_thread
(_QueueActor pid=1346094) /lib64/libc.so.6(clone+0x43) [0x152b71290e73] clone