Ray cluster is not found at node

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hi! I’m a beginner with Ray. I would like to distribute my own Python task across a cluster using Ray. However, I cannot reliably start Ray on the cluster: it has succeeded only twice and failed several times.

Environment: The cluster is a supercomputer that uses LSF as its scheduler, so I start Ray like this:

My lsf file:

#!/bin/bash
#BSUB -q short 
#BSUB -n 80 
#BSUB -e %J.err 
#BSUB -o %J.out 
#BSUB -R "span[ptile=40]" 


source ~/softwares/python/anaconda3/2022.10/anaconda.2022.10.source
source activate
conda deactivate
conda activate ray

bash -i ~/softwares/ray-integration/ray_launch_cluster.sh -c "python /work/myname/distribution_gp/ray_test.py" -n "ray"
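For context, the repeated "Ray cluster is not found" lines below suggest the launch script probes the head node before `ray start` has finished coming up, so a race between the check and the startup could explain the intermittent failures. A generic retry helper along these lines might make such a check more robust; the `wait_for` function and the `ray status` invocation in the comment are illustrative sketches, not part of `ray_launch_cluster.sh`:

```shell
# wait_for CMD TIMEOUT: poll CMD once per second until it succeeds or
# TIMEOUT seconds elapse. Returns 0 on success, 1 on timeout.
wait_for() {
  local deadline=$(( $(date +%s) + $2 ))
  until eval "$1" >/dev/null 2>&1; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1
    fi
    sleep 1
  done
}

# Hypothetical usage inside the launch script, instead of failing fast:
#   wait_for "ray status --address ${head_node}:${port}" 60
wait_for "true" 5 && echo "cluster is up"
```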

My demo Python file:

import ray
import math
import time
import random
import datetime as dt 

@ray.remote
class ProgressActor:
    def __init__(self, total_num_samples: int):
        self.total_num_samples = total_num_samples
        self.num_samples_completed_per_task = {}

    def report_progress(self, task_id: int, num_samples_completed: int) -> None:
        self.num_samples_completed_per_task[task_id] = num_samples_completed

    def get_progress(self) -> float:
        return (
            sum(self.num_samples_completed_per_task.values()) / self.total_num_samples
        )
        
@ray.remote
def sampling_task(num_samples: int, task_id: int,
                  progress_actor: ray.actor.ActorHandle) -> int:
    num_inside = 0
    for i in range(num_samples):
        x, y = random.uniform(-1, 1), random.uniform(-1, 1)
        if math.hypot(x, y) <= 1:
            num_inside += 1

        # Report progress every 1 million samples.
        if (i + 1) % 1_000_000 == 0:
            # This is async.
            progress_actor.report_progress.remote(task_id, i + 1)

    # Report the final progress.
    progress_actor.report_progress.remote(task_id, num_samples)
    return num_inside

# Change this to match your cluster scale.
NUM_SAMPLING_TASKS = 75
NUM_SAMPLES_PER_TASK = 10_000_000_000
TOTAL_NUM_SAMPLES = NUM_SAMPLING_TASKS * NUM_SAMPLES_PER_TASK

# Create the progress actor.
progress_actor = ProgressActor.remote(TOTAL_NUM_SAMPLES)
# Create and execute all sampling tasks in parallel.
results = [
    sampling_task.remote(NUM_SAMPLES_PER_TASK, i, progress_actor)
    for i in range(NUM_SAMPLING_TASKS)
]
# Query progress periodically.
while True:
    progress = ray.get(progress_actor.get_progress.remote())
    print(f"{dt.datetime.now()} Progress: {int(progress * 100)}%")

    if progress == 1:
        break

    time.sleep(1)

# Get all the sampling tasks results.
total_num_inside = sum(ray.get(results))
pi = (total_num_inside * 4) / TOTAL_NUM_SAMPLES
print(f"Estimated value of π is: {pi}")

Failure output:

Adding host:  r13n15
Adding host:  r07n33
The host list is:  r13n15 r07n33
Head node will use port:  50755
Dashboard will use port:  51743
Num cpus per host is: r13n15 40 r07n33 40
Object store memory for the cluster is set to 4GB
Starting ray head node on:  r13n15
using default object store mem of 4GB make sure your cluster has mem greater than 4GB
Ray cluster is not found at r13n15:50755
2024-01-10 15:49:20,516	INFO usage_lib.py:449 -- Usage stats collection is enabled by default without user confirmation because this terminal is detected to be non-interactive. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
2024-01-10 15:49:20,520	INFO scripts.py:744 -- Local node IP: NODE_IP
2024-01-10 15:49:26,595	SUCC scripts.py:781 -- --------------------
2024-01-10 15:49:26,595	SUCC scripts.py:782 -- Ray runtime started.
2024-01-10 15:49:26,596	SUCC scripts.py:783 -- --------------------
2024-01-10 15:49:26,596	INFO scripts.py:785 -- Next steps
2024-01-10 15:49:26,596	INFO scripts.py:788 -- To add another node to this Ray cluster, run
2024-01-10 15:49:26,596	INFO scripts.py:791 --   ray start --address='NODE_IP:50755'
2024-01-10 15:49:26,596	INFO scripts.py:800 -- To connect to this Ray cluster:
2024-01-10 15:49:26,596	INFO scripts.py:802 -- import ray
2024-01-10 15:49:26,596	INFO scripts.py:803 -- ray.init()
2024-01-10 15:49:26,596	INFO scripts.py:834 -- To terminate the Ray runtime, run
2024-01-10 15:49:26,596	INFO scripts.py:835 --   ray stop
2024-01-10 15:49:26,596	INFO scripts.py:838 -- To view the status of the cluster, use
2024-01-10 15:49:26,596	INFO scripts.py:839 --   ray status
Ray cluster is not found at r13n15:50755
Ray cluster is not found at r13n15:50755
Ray cluster is not found at r13n15:50755
Ray cluster is not found at r13n15:50755
Ray cluster is not found at r13n15:50755

Success output:

Adding host:  r07n36
Adding host:  r06n35
The host list is:  r07n36 r06n35
Head node will use port:  20092
Dashboard will use port:  34444
Num cpus per host is: r07n36 40 r06n35 40
Object store memory for the cluster is set to 4GB
Starting ray head node on:  r07n36
using default object store mem of 4GB make sure your cluster has mem greater than 4GB
2024-01-11 10:35:24,474	INFO usage_lib.py:449 -- Usage stats collection is enabled by default without user confirmation because this terminal is detected to be non-interactive. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
2024-01-11 10:35:24,479	INFO scripts.py:744 -- Local node IP: NODE_IP
2024-01-11 10:35:26,761	SUCC scripts.py:781 -- --------------------
2024-01-11 10:35:26,761	SUCC scripts.py:782 -- Ray runtime started.
2024-01-11 10:35:26,761	SUCC scripts.py:783 -- --------------------
2024-01-11 10:35:26,761	INFO scripts.py:785 -- Next steps
2024-01-11 10:35:26,761	INFO scripts.py:788 -- To add another node to this Ray cluster, run
2024-01-11 10:35:26,762	INFO scripts.py:791 --   ray start --address='NODE_IP:20092'
2024-01-11 10:35:26,762	INFO scripts.py:800 -- To connect to this Ray cluster:
2024-01-11 10:35:26,762	INFO scripts.py:802 -- import ray
2024-01-11 10:35:26,762	INFO scripts.py:803 -- ray.init()
2024-01-11 10:35:26,762	INFO scripts.py:834 -- To terminate the Ray runtime, run
2024-01-11 10:35:26,762	INFO scripts.py:835 --   ray stop
2024-01-11 10:35:26,762	INFO scripts.py:838 -- To view the status of the cluster, use
2024-01-11 10:35:26,762	INFO scripts.py:839 --   ray status
Ray cluster is not found at r07n36:20092
Ray cluster is not found at r07n36:20092
======== Autoscaler status: 2024-01-11 10:35:46.082882 ========
Node status
---------------------------------------------------------------
Active:
 1 node_c1abf0988a852af4d0178f45769fed03965783ee478bd7ebb12cc6be
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/40.0 CPU
 0B/154.43GiB memory
 0B/3.73GiB object_store_memory

Demands:
 (no resource demands)
adding the workers to head node:  r06n35
starting worker on:  r06n35 and using master node:  r07n36
Ray cluster is not found at r07n36:20092
Running user workload:  python /work/aais-wuk/distribution_gp/ray_test.py
2024-01-11 10:35:59,957	INFO scripts.py:926 -- Local node IP: NODE_IP
2024-01-11 10:36:11,069	SUCC scripts.py:939 -- --------------------
2024-01-11 10:36:11,069	SUCC scripts.py:940 -- Ray runtime started.
2024-01-11 10:36:11,069	SUCC scripts.py:941 -- --------------------
2024-01-11 10:36:11,070	INFO scripts.py:943 -- To terminate the Ray runtime, run
2024-01-11 10:36:11,070	INFO scripts.py:944 --   ray stop
2024-01-11 10:37:00.392694Progress: 42%
2024-01-11 10:37:01.397848Progress: 92%
2024-01-11 10:37:02.400529Progress: 92%
2024-01-11 10:37:03.403779Progress: 93%
2024-01-11 10:37:04.406696Progress: 93%
2024-01-11 10:37:05.409985Progress: 93%
2024-01-11 10:37:06.412844Progress: 94%
2024-01-11 10:37:07.416366Progress: 94%
2024-01-11 10:37:08.418810Progress: 94%
2024-01-11 10:37:09.422258Progress: 94%
2024-01-11 10:37:10.424948Progress: 94%
2024-01-11 10:37:11.428028Progress: 94%
2024-01-11 10:37:12.431540Progress: 94%
2024-01-11 10:37:13.435088Progress: 94%
2024-01-11 10:37:14.437817Progress: 94%
2024-01-11 10:37:15.441073Progress: 95%
2024-01-11 10:37:16.444509Progress: 96%
2024-01-11 10:37:17.447705Progress: 97%
2024-01-11 10:37:18.451081Progress: 98%
2024-01-11 10:37:19.454638Progress: 98%
2024-01-11 10:37:20.457979Progress: 99%
2024-01-11 10:37:21.461114Progress: 100%
Estimated value of π is: 3.141507648
Done
Shutting down the Job
Job <6272482> is being terminated

How should I fix this? Thanks in advance!