System will be halted when tasks number is large

Li_Bin · March 28, 2023, 12:07am

On-prem 3090 GPU server, CentOS 7.9.

sangcho · March 28, 2023, 1:11am

The error indicates a system-level problem. For example, the resource not available error comes from the OS: https://docs.oracle.com/cd/E19455-01/806-1075/msgs-1980/index.html

RuntimeError: can’t start new thread probably happens for the same reason. python - error: can't start new thread - Stack Overflow

Is it possible to try starting Ray in a fresh, clean environment? E.g., can you restart the server or something? Also try raising ulimit before starting Ray, e.g., ulimit -n 60000 && ray start --head

Li_Bin · March 28, 2023, 2:53am

I tried ulimit and also tried restarting the server, but neither helped.

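from typing import Tuple

import ray
import torch


def create_rand_tensor(size: Tuple[int, int]) -> torch.Tensor:
    # Return a float tensor of the given shape filled with standard-normal samples.
    return torch.randn(size=size, dtype=torch.float)


new_tensor = create_rand_tensor((2, 3))


@ray.remote
def transform_rand_tensor(tensor: torch.Tensor) -> torch.Tensor:
    # Swap the two dimensions of the tensor.
    return torch.transpose(tensor, 0, 1)


torch.manual_seed(42)

# Put 36 tensors of shape (X, 10) into the object store, with X = 25, 50, ..., 900.
tensor_list_obj_ref = [ray.put(create_rand_tensor(((i + 1) * 25, 10))) for i in range(0, 36)]

# Launch one remote task per tensor, then wait for all results.
transformed_object_list = [transform_rand_tensor.remote(t_obj_ref) for t_obj_ref in tensor_list_obj_ref]
print(len(ray.get(transformed_object_list)))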

In this program, a task count of 36 definitely works. When I increase this number, the error occurs most of the time. Please note that I am not claiming 36 is a magic number; I don’t want to mislead the troubleshooting.

Li_Bin · March 29, 2023, 12:20am

Hi, Jules,
Did you try launching tasks on a machine with a large number of CPUs, for example more than 100?

Libin

Jules_Damji · March 29, 2023, 12:44am

We run it all the time. This one is running with 72 CPUs and I have more than 100. It is only a matter of time before all the CPUs are used up as my training proceeds.

Li_Bin · March 29, 2023, 12:54am

Does Ray need a specific version of gRPC? Maybe it’s a gRPC bug when starting many clients on some OSes. Not sure yet. I am checking the code.

Jules_Damji · March 29, 2023, 1:08am

Am running the latest on the master.

sangcho · March 29, 2023, 5:40am

I think your server (OS or hardware) probably has some sort of resource limit (like max number of threads, processes, fd, etc.). The error message you are seeing is from the OS syscall saying you don’t have enough resources… I doubt it is a gRPC-related issue, but you can try out the same version as Jules to see…

Li_Bin · April 27, 2023, 10:13am

Hi, guys,
I would like to come back to this issue and share some observations after many days of debugging the Ray code. If I understand it correctly, I hope this helps to improve the system.

When I submit, say, 100 tasks simultaneously from the driver, it seems the raylet schedules 100 workers, one per task. Each worker starts at least one boost thread pool of size 128 (since my computer has 128 cores) to handle the server calls for RPC, in the server_call.cc file:

std::unique_ptr<boost::asio::thread_pool> &_GetServerCallExecutor() {
  static auto thread_pool = std::make_unique<boost::asio::thread_pool>(
      ::RayConfig::instance().num_server_call_thread());
  return thread_pool;
}

That is, at least 12800 threads might be started if possible (actually, the Ray system seems to use far more than 12800 threads). On my CentOS machine the ulimit is 4096, so the thread pools cannot be created. An exception is thrown and the workers die. The error is so fatal that the whole system looks dead.

@sangcho had advised me to increase the ulimit to 60000, but unfortunately it somehow did not work.

Anyway, there will always be cases where you cannot create as many threads as you want. I think a robust system should take this scenario into account. At the least, the system could catch the exception and shrink the thread pool to stay alive. Or the raylet could queue the tasks and keep them waiting for a while. But the bottom line is clear: keep the system highly available.
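The "catch the exception and shrink the pool" idea can be illustrated outside of Ray. The following is only a minimal Python sketch of that general pattern, not Ray's behavior and not its C++ thread pool; the helper name start_pool_with_fallback and the make_thread parameter are hypothetical and exist only for this example. It relies on the fact that CPython raises RuntimeError when the OS refuses to create another thread.

```python
import threading
from typing import Callable, List


def start_pool_with_fallback(
    desired_size: int,
    work: Callable[[], None],
    make_thread: Callable[..., threading.Thread] = threading.Thread,
) -> List[threading.Thread]:
    """Start up to `desired_size` threads, settling for fewer if the OS refuses.

    Hypothetical helper for illustration only; not part of Ray.
    """
    started: List[threading.Thread] = []
    for _ in range(desired_size):
        thread = make_thread(target=work, daemon=True)
        try:
            thread.start()
        except RuntimeError:
            # CPython raises RuntimeError("can't start new thread") when the
            # thread limit is hit; keep the threads that did start.
            break
        started.append(thread)
    if not started:
        # With zero threads there is nothing to fall back to.
        raise RuntimeError("could not start any worker thread")
    return started
```

A caller would then size its work queue from len(pool) rather than from the requested size, so a reduced pool degrades throughput instead of crashing the process.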

Hope ray becomes better
LiBin

sangcho · April 27, 2023, 1:54pm

Btw, you should use ulimit -u to control threads, not ulimit -n.
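On Linux, threads count against the per-user process limit (the value ulimit -u reports), while ulimit -n only covers open file descriptors. A minimal sketch for checking both from Python with the standard resource module (Unix only); the helper name describe_limits is hypothetical:

```python
import resource
from typing import Dict, Tuple


def describe_limits() -> Dict[str, Tuple[int, int]]:
    """Return (soft, hard) limits for processes/threads and open files.

    A value equal to resource.RLIM_INFINITY means "unlimited".
    """
    limits = {
        "max user processes/threads (ulimit -u)": resource.RLIMIT_NPROC,
        "max open files (ulimit -n)": resource.RLIMIT_NOFILE,
    }
    return {name: resource.getrlimit(res) for name, res in limits.items()}


if __name__ == "__main__":
    for name, (soft, hard) in describe_limits().items():
        print(f"{name}: soft={soft}, hard={hard}")
```

Child processes inherit these limits, so running this in the same shell that will run ray start shows what the Ray processes will get.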

Regarding 128 threads, our default config uses

       std::max((int64_t)1, (int64_t)(std::thread::hardware_concurrency() / 4U)))

which means it should be at most 32 threads per worker on a 128-core machine (128 / 4), not 128
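To make the arithmetic concrete, the quoted expression can be evaluated for a given core count. This is a minimal Python sketch that mirrors only the expression quoted above; the function names are hypothetical, and whether a given Ray version actually uses this default (or overrides it elsewhere) is not confirmed here.

```python
def default_server_call_threads(cpu_count: int) -> int:
    """Mirror of max(1, hardware_concurrency() / 4) with integer division."""
    return max(1, cpu_count // 4)


def estimated_pool_threads(num_workers: int, cpu_count: int) -> int:
    """Rough total of server-call threads if every worker creates one pool.

    Counts only these pools; other threads a worker starts are ignored.
    """
    return num_workers * default_server_call_threads(cpu_count)


if __name__ == "__main__":
    print(default_server_call_threads(128))
    print(estimated_pool_threads(100, 128))
```

For 128 cores the expression gives 32 threads per pool, so 100 workers would need about 3200 threads from these pools alone, which already approaches the 4096 limit mentioned earlier in the thread.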

sangcho · April 27, 2023, 1:54pm

If you create an issue, we can investigate the default number of threads on 128-core machines. Please create an issue and tag me!

Li_Bin · April 27, 2023, 2:36pm

Thank you so much. I have created an issue: Ray Core: System may hang when the task number is large on a single machine of many cpu cores · Issue #34829 · ray-project/ray · GitHub
Sorry, but I don’t know how to tag you there.

Li_Bin · April 28, 2023, 2:16am

@sangcho, thanks for your tips. I retried starting Ray after raising the limit with ulimit -u, and now the system works well. The issue came from limited thread resources. For now I can go ahead, and I just hope Ray can handle this case more gracefully in a future version. Thanks anyway.
