System hangs when the number of tasks is large

  • High: It blocks me from completing my task.
import logging
import random

from typing import Tuple

import numpy as np
import pandas as pd
import pyarrow.parquet as pq
import torch
import ray


def create_rand_tensor(size: Tuple[int, int]) -> torch.Tensor:
    return torch.randn(size=size, dtype=torch.float)


new_tensor = create_rand_tensor((2, 3))

@ray.remote
def transform_rand_tensor(tensor: torch.Tensor) -> torch.Tensor:
    return torch.transpose(tensor, 0, 1)

torch.manual_seed(42)
#
# Create 100 tensors of shape ((i + 1) * 25, 150)
#

tensor_list_obj_ref = [ray.put(create_rand_tensor(((i + 1) * 25, 150))) for i in range(0, 100)]

transformed_object_list = [transform_rand_tensor.remote(t_obj_ref) for t_obj_ref in tensor_list_obj_ref]
print(ray.get(transformed_object_list).size())

Here is the raylet.out:

OS: CentOS 7.9
CPUs: 128
Ray: 2.3
Python: 3.9

@Li_Bin Can you try this notebook? It seems this code snippet is from there.

Also, how many tasks in total are you creating in the for loop? That is, how many object refs or tensors have you created?

Cheers
Jules

I believe @Li_Bin used the notebook and ran into the problem on CentOS. Here is the related Slack thread.

It seems that the problem only happens on CentOS?

I tried to reproduce this but I couldn't. I ran the script you pasted here successfully on a GCP VM with Python 3.9.16, Ray 2.3, and CentOS 7.

btw, you need to modify the last line of your code to print(len(ray.get(transformed_object_list))), since ray.get on a list of object refs returns a plain Python list, which has no .size() method.
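For reference, a minimal sketch of how the results could be inspected once that line is changed (names are reused from your snippet; the example shape is just what the first transposed tensor would be):

transformed = ray.get(transformed_object_list)   # ray.get returns a plain Python list of tensors
print(len(transformed))                          # number of transposed tensors, 100 here
print(transformed[0].size())                     # shape of the first one, i.e. torch.Size([150, 25])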

btw, can you check several things and paste them here?

  • the task table in the Ray dashboard, to look at the errors of the failed tasks
  • while the tasks are pending/failing, the ray status output in the job detail page of the Ray dashboard (a programmatic fallback is sketched below)
  • maybe also attach the monitor.log file?
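(If the dashboard itself becomes unresponsive while things hang, here is a rough sketch of pulling the cluster's resource totals from a second Python shell instead; it only shows resource counts, not per-task state:)

import ray

# Attach to the already-running local cluster rather than starting a new one.
ray.init(address="auto")
print(ray.cluster_resources())     # total resources registered with the raylet
print(ray.available_resources())   # what is still free while the tasks hang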

When Ray hangs, I cannot even click into the task table in the dashboard to check the failed tasks.

I took some screenshots and uploaded the monitor.* files. In case you need more detailed info, I also uploaded the whole logs. Here are the outputs:

I also exported the grafana dashboard and hope it helps.

Please let me know if any further info is needed.

btw, I also tried on my MacBook M1 Pro (8 cores, 16 GB) and a 12-CPU Ubuntu VM, and both work just fine. These issues all come from this CentOS GPU server. Actually, almost all of the examples run into trouble once the task number gets large.

Hi, Jules,
I did try that notebook to learn Ray. The system hangs without changing any code.
Here are some observations:
1) With 30 tasks it is OK and the program exits normally.
2) With 50 tasks it hangs: the last line is reached and 50 is printed, but the program hangs there.
3) With 100 tasks it hangs and the program cannot even reach the last line to execute the print.

I tried several examples and almost all of them have the same trouble when the task number is large.

Thanks anyway for your time.
LiBin

@sangcho maybe you or someone from the core team should take a look?

It seems like this hangs on CentOS but not on any other platform, since I can run it on our AWS cluster and on macOS without any hangs.

Try this simple test on CentOS and see at which NUM_TASKS value it hangs.

(Also note that in the original notebook we are inserting large tensors into the object store, so if you don't have enough memory, you should see messages indicating that objects are spilling to disk.)

import math
import ray
import numpy as np
import random
import logging

@ray.remote
def ray_task(a, b):
    arr = np.random.rand(a,b)
    return math.sqrt(np.sum(arr)) * a * b

if __name__ == "__main__":

    if ray.is_initialized():
        ray.shutdown()
    ray.init(logging_level=logging.ERROR)

    NUM_TASKS = [25, 50, 75, 100]
    for task in NUM_TASKS:
        obj_ref = [ray_task.remote(random.randint(100, 200), random.randint(100, 200)) for i in range(task)]
        print("Num of tasks: {task}")
        results = sum(ray.get(object_refs=obj_ref))
        print(f"Number of arrays: {len(obj_ref)}, Sum of all Numpy arrays: {results:.2f})")
  1. I ran the test code (adding one line, print("Ray starts initialization"), before the init call), but the program hangs like this:

  2. I started Ray with the CLI and ran the test code again; the program terminates with a Python fatal error, but the 250 tasks finish with green bars in the dashboard. Here are the logs.

Thanks and let me know if any info needed.
LiBin

btw, where can I upload files in this forum?

Not sure, unless you upload them to a public Google Drive and share a pointer to it.

I am not sure, but I guess the root cause is that when the OS has many CPUs, say 128, it is difficult for the raylet to spread a large number of tasks across those CPUs quickly enough.
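One way I could test this guess (just a sketch; the num_cpus=16 cap below is an arbitrary value, not a recommendation) would be to limit how many CPUs Ray sees, so the raylet starts far fewer worker processes:

import logging
import ray

# Cap the CPUs Ray schedules on; fewer logical CPUs means fewer
# worker processes get started up front on this 128-core box.
ray.init(num_cpus=16, logging_level=logging.ERROR)
print(ray.available_resources())   # confirm the CPU cap took effect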

Hi Jules,
Any idea on this issue? Do you need more info?

Thanks
LiBin

@Li_Bin,

It seems we are having trouble reproducing the issue. Is there a way we can get access to the exact same hardware and OS?

As @jjyao said, we can't reproduce this on macOS or on our Anyscale AWS instances, and we don't have your hardware setup. The test code I asked you to try seems to work and finish without hanging for us.

I browsed the logs you posted.

  1. A lot of Ray worker processes get started and die soon after (see gcs_server.out below).
  2. The tasks fail to run because the workers died (see python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_1864.log below).
  3. The workers died because of Unhandled exception: St12system_error, what(): Resource temporarily unavailable (see raylet.err below; a sketch for checking the relevant OS limits follows that excerpt).

@jjyao Do you know what are the possible causes of this exception?

gcs_server.out

[2023-03-15 09:04:36,112 W 130953 130953] (gcs_server) gcs_worker_manager.cc:55: Reporting worker exit, worker id = 3c6c4ee9801437c0eb2ccaecc92c6a2361da93456329148268c104a2, node id = ffffffffffffffffffffffffffffffffffffffffffffffffffffffff, address = , exit_type = SYSTEM_ERROR, exit_detail = Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors… Unintentional worker failures have been reported. If there are lots of this logs, that might indicate there are unexpected failures in the cluster.

python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_1864.log

[2023-03-15 09:04:36,208 I 1864 1927] raylet_client.cc:381: Error returning worker: Invalid: Returned worker does not exist any more

[2023-03-15 09:04:36,208 I 1864 1927] task_manager.cc:467: task 91581beb08e6c9deffffffffffffffffffffffff01000000 retries left: 3, oom retries left: -1, task failed due to oom: 0

[2023-03-15 09:04:36,208 I 1864 1927] task_manager.cc:471: Attempting to resubmit task 91581beb08e6c9deffffffffffffffffffffffff01000000 for attempt number: 0

[2023-03-15 09:04:36,209 I 1864 1927] core_worker.cc:350: Will resubmit task after a 0ms delay: Type=NORMAL_TASK, Language=PYTHON, Resources: {CPU: 1, }, function_descriptor={type=PythonFunctionDescriptor, module_name=remote_object, class_name=, function_name=transform_rand_tensor, function_hash=dd13dc2a0abe4e4484330a28e4065226}, task_id=91581beb08e6c9deffffffffffffffffffffffff01000000, task_name=transform_rand_tensor, job_id=01000000, num_args=2, num_returns=1, depth=1, attempt_number=1, max_retries=3

raylet.err

[2023-03-15 09:04:35,991 E 2035 2035] logging.cc:97: Unhandled exception: St12system_error. what(): Resource temporarily unavailable
[2023-03-15 09:04:35,991 E 2049 2049] logging.cc:97: Unhandled exception: St12system_error. what(): Resource temporarily unavailable
[2023-03-15 09:04:35,991 E 2042 2042] logging.cc:97: Unhandled exception: St12system_error. what(): Resource temporarily unavailable
[2023-03-15 09:04:35,991 E 2018 2018] logging.cc:97: Unhandled exception: St12system_error. what(): Resource temporarily unavailable
[2023-03-15 09:04:35,991 E 2064 2064] logging.cc:97: Unhandled exception: St12system_error. what(): Resource temporarily unavailable
[2023-03-15 09:04:35,991 E 2033 2033] logging.cc:97: Unhandled exception: St12system_error. what(): Resource temporarily unavailable
[2023-03-15 09:04:35,991 E 2037 2037] logging.cc:97: Unhandled exception: St12system_error. what(): Resource temporarily unavailable
[2023-03-15 09:04:35,992 E 2040 2040] logging.cc:97: Unhandled exception: St12system_error. what(): Resource temporarily unavailable
[2023-03-15 09:04:35,992 E 1986 1986] logging.cc:97: Unhandled exception: St12system_error. what(): Resource temporarily unavailable
[2023-03-15 09:04:35,992 E 2012 2012] logging.cc:97: Unhandled exception: St12system_error. what(): Resource temporarily unavailable
[2023-03-15 09:04:35,992 E 2010 2010] logging.cc:97: Unhandled exception: St12system_error. what(): Resource temporarily unavailable
[2023-03-15 09:04:35,992 E 2044 2044] logging.cc:97: Unhandled exception: St12system_error. what(): Resource temporarily unavailable
[2023-03-15 09:04:35,992 E 2023 2023] logging.cc:97: Unhandled exception: St12system_error. what(): Resource temporarily unavailable
[2023-03-15 09:04:35,992 E 2047 2047] logging.cc:97: Unhandled exception: St12system_error. what(): Resource temporarily unavailable
[2023-03-15 09:04:35,992 E 2050 2050] logging.cc:97: Unhandled exception: St12system_error. what(): Resource temporarily unavailable
(Two identical worker tracebacks are interleaved in raylet.err; de-interleaved, each reads:)

Traceback (most recent call last):
  File "/home/binli/anaconda3/lib/python3.9/site-packages/ray/_private/workers/default_worker.py", line 210, in <module>
    ray._private.worker.connect(
  File "/home/binli/anaconda3/lib/python3.9/site-packages/ray/_private/worker.py", line 2111, in connect
    worker.import_thread.start()
  File "/home/binli/anaconda3/lib/python3.9/site-packages/ray/_private/import_thread.py", line 61, in start
    self.t.start()
  File "/home/binli/anaconda3/lib/python3.9/threading.py", line 899, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
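For what it's worth, pthread_create failing with "Resource temporarily unavailable" (EAGAIN) on Linux usually points to a process/thread limit being hit. A small sketch, assuming a Linux host with the standard procfs paths, that dumps the limits worth checking on that CentOS box:

import resource

# Per-user limit on processes/threads (what `ulimit -u` reports).
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print(f"RLIMIT_NPROC: soft={soft} hard={hard}")

# System-wide ceilings on thread and pid counts.
for path in ("/proc/sys/kernel/threads-max", "/proc/sys/kernel/pid_max"):
    with open(path) as f:
        print(path, f.read().strip())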

Hmm, that looks like it's related to @Li_Bin's environment.

@Li_Bin Can you provide more details about your env?
Is the GPU server running on a cloud provider, or is it an on-prem GPU server? Is there a way for us to get access to it?
What is the spec of the CPU? Intel? Which model?