How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Hi! I'm a beginner with Ray. I would like to distribute my own Python task on a cluster using Ray. However, I cannot reliably start Ray on the cluster: it has only succeeded twice and failed several times.
Environment: The cluster is a supercomputer that uses LSF as the scheduler, so I am starting Ray like this:
My LSF file:
#!/bin/bash
#BSUB -q short
#BSUB -n 80
#BSUB -e %J.err
#BSUB -o %J.out
#BSUB -R "span[ptile=40]"
source ~/softwares/python/anaconda3/2022.10/anaconda.2022.10.source
source activate
conda deactivate
conda activate ray
bash -i ~/softwares/ray-integration/ray_launch_cluster.sh -c "python /work/myname/distribution_gp/ray_test.py" -n "ray"
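For reference, the launcher script starts the head node, joins the second host as a worker, and then runs the workload. My understanding (an assumption on my part, since the launcher exports the cluster address) is that the Python script attaches to the already-running cluster roughly like this, rather than starting a local Ray instance:

import ray

# Attach to the cluster the launcher already started; "auto" should pick up
# the address the launcher exports (assumption: it sets RAY_ADDRESS).
ray.init(address="auto")

# Sanity check that both hosts registered (I expect 80 CPUs in total).
print(ray.cluster_resources())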
My demo Python file:
import ray
import math
import time
import random
import datetime as dt
@ray.remote
class ProgressActor:
    def __init__(self, total_num_samples: int):
        self.total_num_samples = total_num_samples
        self.num_samples_completed_per_task = {}

    def report_progress(self, task_id: int, num_samples_completed: int) -> None:
        self.num_samples_completed_per_task[task_id] = num_samples_completed

    def get_progress(self) -> float:
        return (
            sum(self.num_samples_completed_per_task.values()) / self.total_num_samples
        )


@ray.remote
def sampling_task(num_samples: int, task_id: int,
                  progress_actor: ray.actor.ActorHandle) -> int:
    num_inside = 0
    for i in range(num_samples):
        x, y = random.uniform(-1, 1), random.uniform(-1, 1)
        if math.hypot(x, y) <= 1:
            num_inside += 1

        # Report progress every 1 million samples.
        if (i + 1) % 1_000_000 == 0:
            # This is async.
            progress_actor.report_progress.remote(task_id, i + 1)

    # Report the final progress.
    progress_actor.report_progress.remote(task_id, num_samples)
    return num_inside


# Change this to match your cluster scale.
NUM_SAMPLING_TASKS = 75
NUM_SAMPLES_PER_TASK = 10_000_000_000
TOTAL_NUM_SAMPLES = NUM_SAMPLING_TASKS * NUM_SAMPLES_PER_TASK

# Create the progress actor.
progress_actor = ProgressActor.remote(TOTAL_NUM_SAMPLES)

# Create and execute all sampling tasks in parallel.
results = [
    sampling_task.remote(NUM_SAMPLES_PER_TASK, i, progress_actor)
    for i in range(NUM_SAMPLING_TASKS)
]

# Query progress periodically.
while True:
    progress = ray.get(progress_actor.get_progress.remote())
    print(f"{dt.datetime.now()}Progress: {int(progress * 100)}%")

    if progress == 1:
        break

    time.sleep(1)

# Get all the sampling tasks results.
total_num_inside = sum(ray.get(results))
pi = (total_num_inside * 4) / TOTAL_NUM_SAMPLES
print(f"Estimated value of π is: {pi}")
Failure output:
Adding host: r13n15
Adding host: r07n33
The host list is: r13n15 r07n33
Head node will use port: 50755
Dashboard will use port: 51743
Num cpus per host is: r13n15 40 r07n33 40
Object store memory for the cluster is set to 4GB
Starting ray head node on: r13n15
using default object store mem of 4GB make sure your cluster has mem greater than 4GB
Ray cluster is not found at r13n15:50755
2024-01-10 15:49:20,516 INFO usage_lib.py:449 -- Usage stats collection is enabled by default without user confirmation because this terminal is detected to be non-interactive. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
2024-01-10 15:49:20,520 INFO scripts.py:744 -- Local node IP: NODE_IP
2024-01-10 15:49:26,595 SUCC scripts.py:781 -- --------------------
2024-01-10 15:49:26,595 SUCC scripts.py:782 -- Ray runtime started.
2024-01-10 15:49:26,596 SUCC scripts.py:783 -- --------------------
2024-01-10 15:49:26,596 INFO scripts.py:785 -- Next steps
2024-01-10 15:49:26,596 INFO scripts.py:788 -- To add another node to this Ray cluster, run
2024-01-10 15:49:26,596 INFO scripts.py:791 -- ray start --address='NODE_IP:50755'
2024-01-10 15:49:26,596 INFO scripts.py:800 -- To connect to this Ray cluster:
2024-01-10 15:49:26,596 INFO scripts.py:802 -- import ray
2024-01-10 15:49:26,596 INFO scripts.py:803 -- ray.init()
2024-01-10 15:49:26,596 INFO scripts.py:834 -- To terminate the Ray runtime, run
2024-01-10 15:49:26,596 INFO scripts.py:835 -- ray stop
2024-01-10 15:49:26,596 INFO scripts.py:838 -- To view the status of the cluster, use
2024-01-10 15:49:26,596 INFO scripts.py:839 -- ray status
Ray cluster is not found at r13n15:50755
Ray cluster is not found at r13n15:50755
Ray cluster is not found at r13n15:50755
Ray cluster is not found at r13n15:50755
Ray cluster is not found at r13n15:50755
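For what it's worth, one minimal way to check whether anything is actually listening at the reported head address would be something like this, run from the same allocation (host and port copied from the failed run above; just a sketch, I have not confirmed it is the recommended diagnostic):

import ray

# Attach directly to the head address the launcher reported; if nothing is
# listening there, ray.init should raise a ConnectionError.
ray.init(address="r13n15:50755")
print(ray.nodes())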
Success output:
Adding host: r07n36
Adding host: r06n35
The host list is: r07n36 r06n35
Head node will use port: 20092
Dashboard will use port: 34444
Num cpus per host is: r07n36 40 r06n35 40
Object store memory for the cluster is set to 4GB
Starting ray head node on: r07n36
using default object store mem of 4GB make sure your cluster has mem greater than 4GB
2024-01-11 10:35:24,474 INFO usage_lib.py:449 -- Usage stats collection is enabled by default without user confirmation because this terminal is detected to be non-interactive. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
2024-01-11 10:35:24,479 INFO scripts.py:744 -- Local node IP: NODE_IP
2024-01-11 10:35:26,761 SUCC scripts.py:781 -- --------------------
2024-01-11 10:35:26,761 SUCC scripts.py:782 -- Ray runtime started.
2024-01-11 10:35:26,761 SUCC scripts.py:783 -- --------------------
2024-01-11 10:35:26,761 INFO scripts.py:785 -- Next steps
2024-01-11 10:35:26,761 INFO scripts.py:788 -- To add another node to this Ray cluster, run
2024-01-11 10:35:26,762 INFO scripts.py:791 -- ray start --address='NODE_IP:20092'
2024-01-11 10:35:26,762 INFO scripts.py:800 -- To connect to this Ray cluster:
2024-01-11 10:35:26,762 INFO scripts.py:802 -- import ray
2024-01-11 10:35:26,762 INFO scripts.py:803 -- ray.init()
2024-01-11 10:35:26,762 INFO scripts.py:834 -- To terminate the Ray runtime, run
2024-01-11 10:35:26,762 INFO scripts.py:835 -- ray stop
2024-01-11 10:35:26,762 INFO scripts.py:838 -- To view the status of the cluster, use
2024-01-11 10:35:26,762 INFO scripts.py:839 -- ray status
Ray cluster is not found at r07n36:20092
Ray cluster is not found at r07n36:20092
======== Autoscaler status: 2024-01-11 10:35:46.082882 ========
Node status
---------------------------------------------------------------
Active:
1 node_c1abf0988a852af4d0178f45769fed03965783ee478bd7ebb12cc6be
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/40.0 CPU
0B/154.43GiB memory
0B/3.73GiB object_store_memory
Demands:
(no resource demands)
adding the workers to head node: r06n35
starting worker on: r06n35 and using master node: r07n36
Ray cluster is not found at r07n36:20092
Running user workload: python /work/aais-wuk/distribution_gp/ray_test.py
2024-01-11 10:35:59,957 INFO scripts.py:926 -- Local node IP: NODE_IP
2024-01-11 10:36:11,069 SUCC scripts.py:939 -- --------------------
2024-01-11 10:36:11,069 SUCC scripts.py:940 -- Ray runtime started.
2024-01-11 10:36:11,069 SUCC scripts.py:941 -- --------------------
2024-01-11 10:36:11,070 INFO scripts.py:943 -- To terminate the Ray runtime, run
2024-01-11 10:36:11,070 INFO scripts.py:944 -- ray stop
2024-01-11 10:37:00.392694Progress: 42%
2024-01-11 10:37:01.397848Progress: 92%
2024-01-11 10:37:02.400529Progress: 92%
2024-01-11 10:37:03.403779Progress: 93%
2024-01-11 10:37:04.406696Progress: 93%
2024-01-11 10:37:05.409985Progress: 93%
2024-01-11 10:37:06.412844Progress: 94%
2024-01-11 10:37:07.416366Progress: 94%
2024-01-11 10:37:08.418810Progress: 94%
2024-01-11 10:37:09.422258Progress: 94%
2024-01-11 10:37:10.424948Progress: 94%
2024-01-11 10:37:11.428028Progress: 94%
2024-01-11 10:37:12.431540Progress: 94%
2024-01-11 10:37:13.435088Progress: 94%
2024-01-11 10:37:14.437817Progress: 94%
2024-01-11 10:37:15.441073Progress: 95%
2024-01-11 10:37:16.444509Progress: 96%
2024-01-11 10:37:17.447705Progress: 97%
2024-01-11 10:37:18.451081Progress: 98%
2024-01-11 10:37:19.454638Progress: 98%
2024-01-11 10:37:20.457979Progress: 99%
2024-01-11 10:37:21.461114Progress: 100%
Estimated value of π is: 3.141507648
Done
Shutting down the Job
Job <6272482> is being terminated
How should I fix this? Thanks in advance!