System halts when the number of tasks is large

  1. On-prem 3090 GPU server, CentOS 7.9.

The error indicates a system-level problem. For example, the "resource not available" error comes from the OS; see https://docs.oracle.com/cd/E19455-01/806-1075/msgs-1980/index.html

RuntimeError: can't start new thread probably happens for the same reason; see the Stack Overflow question "python - error: can't start new thread".

Is it possible to try starting Ray in a fresh, clean environment? E.g., can you restart the server? Also try starting Ray with a higher ulimit, e.g. ulimit -n 60000 && ray start --head

I tried ulimit and also tried restarting the server, but it didn't help.

from typing import Tuple

import ray
import torch


def create_rand_tensor(size: Tuple[int, int]) -> torch.Tensor:
    # Random float32 tensor of the given 2-D shape.
    return torch.randn(size=size, dtype=torch.float)


new_tensor = create_rand_tensor((2, 3))


@ray.remote
def transform_rand_tensor(tensor: torch.Tensor) -> torch.Tensor:
    # Runs as a Ray task: swap the two dimensions.
    return torch.transpose(tensor, 0, 1)


torch.manual_seed(42)

# Put 36 tensors of shape ((i + 1) * 25, 10) into the object store.
tensor_list_obj_ref = [ray.put(create_rand_tensor(((i + 1) * 25, 10))) for i in range(0, 36)]

# Launch one task per tensor and wait for all results.
transformed_object_list = [transform_rand_tensor.remote(t_obj_ref) for t_obj_ref in tensor_list_obj_ref]
print(len(ray.get(transformed_object_list)))

In this program, 36 tasks will definitely work. When I increase this number, the error occurs most of the time. Please note that I am not saying 36 is a magic number; I don't want to mislead the troubleshooting.

Hi, Jules,
Did you try launching tasks on a computer with a large number of CPUs, for example a machine with >100 CPUs?

Libin

We run it all the time. This one is running with 72 CPUs and I have more than 100. It's only a matter of time before all the CPUs are used up as my training proceeds.

Does Ray need a specific version of gRPC? Maybe it is a gRPC bug when starting many clients on some OSes. Not sure yet. I am checking the code.

I am running the latest master.

I think your server (OS or hardware) probably has some sort of resource limit (like the max number of threads, processes, file descriptors, etc.). The error message you are seeing comes from an OS syscall saying you don't have enough resources… I doubt it is a gRPC-related issue, but you can try the same version as Jules to see…
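One way to see how close a process is to such a limit is to read its thread count from /proc. This is a minimal sketch that assumes Linux; the helper names parse_thread_count and thread_count are hypothetical and are not part of Ray.

```python
import os
from pathlib import Path


def parse_thread_count(status_text: str) -> int:
    """Extract the value of the 'Threads:' field from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split()[1])
    raise ValueError("no 'Threads:' field found")


def thread_count(pid: int) -> int:
    """Number of OS threads currently owned by the given process (Linux only)."""
    return parse_thread_count(Path(f"/proc/{pid}/status").read_text())


# Report the thread count of this process when /proc is available.
status_file = Path(f"/proc/{os.getpid()}/status")
if status_file.exists():
    print(f"pid {os.getpid()} has {thread_count(os.getpid())} thread(s)")
```

Running thread_count against the PID of a raylet or worker process while tasks are being submitted shows whether the thread count climbs toward the limit.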

Hi, guys,
I would like to come back to this issue and share some observations after many days of debugging the Ray code. If my understanding is correct, I hope this helps improve the system.

When I submit, say, 100 tasks simultaneously from the driver, it seems the raylet schedules 100 workers, one per task. Each worker starts at least a Boost thread pool of size 128 (since my computer has 128 cores) to handle the RPC server calls, in the server_call.cc file:

std::unique_ptr<boost::asio::thread_pool> &_GetServerCallExecutor() {
  static auto thread_pool = std::make_unique<boost::asio::thread_pool>(
      ::RayConfig::instance().num_server_call_thread());
  return thread_pool;
}

That is, at least 12800 threads may have to be started (actually, the Ray system seems to use far more than 12800 threads). On my CentOS machine the ulimit is 4096, so the thread pool cannot be created. An exception is thrown and the workers die. The error is so fatal that the whole system appears dead.
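The arithmetic behind that estimate, as a small sketch. The numbers (100 workers, a pool of 128 threads, a limit of 4096) are the ones from this post; the function name is hypothetical.

```python
def estimated_rpc_threads(num_workers: int, pool_size: int) -> int:
    """Lower bound on threads if every worker starts its own pool of `pool_size`."""
    return num_workers * pool_size


workers = 100      # tasks submitted at once, one worker each
pool_size = 128    # pool size assumed in this post (one thread per core)
limit = 4096       # per-user limit on this CentOS machine

needed = estimated_rpc_threads(workers, pool_size)
print(f"threads needed: {needed}, limit: {limit}, exceeds limit: {needed > limit}")
```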

@sangcho had advised me to increase the ulimit to 60000, but unfortunately that did not work for some reason.

Anyway, there will always be cases where you cannot create as many threads as you want. I think a robust system should take this scenario into account. At the very least, the system could catch the exception and reduce the size of the thread pool to stay alive. Or the raylet could keep tasks waiting in a queue for a while. But the bottom line is clear: keep the system highly available.
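To illustrate the "catch the exception and shrink the pool" idea, here is a minimal sketch in Python. It is not Ray's code (the real pool is a C++ boost::asio::thread_pool); start_pool_with_fallback and its thread_factory parameter are hypothetical names used only for this example.

```python
import threading
from typing import Callable, List


def start_pool_with_fallback(
    requested: int,
    worker: Callable[[], None],
    min_size: int = 1,
    thread_factory: Callable[..., threading.Thread] = threading.Thread,
) -> List[threading.Thread]:
    """Start up to `requested` threads, settling for fewer if the OS refuses."""
    started: List[threading.Thread] = []
    for _ in range(requested):
        thread = thread_factory(target=worker, daemon=True)
        try:
            thread.start()
        except RuntimeError:
            # "can't start new thread": stop growing and keep what we have.
            break
        started.append(thread)
    if len(started) < min_size:
        raise RuntimeError(f"could only start {len(started)} of {requested} threads")
    return started


# Usage: ask for 8 threads that block until released, then shut them down.
release = threading.Event()
pool = start_pool_with_fallback(8, release.wait)
print(f"pool running with {len(pool)} thread(s)")
release.set()
for t in pool:
    t.join()
```

The design choice here is to degrade rather than abort: the pool runs with fewer threads, and only fails if it cannot reach min_size.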

Hope ray becomes better
LiBin

Btw, you should use ulimit -u to control threads, not ulimit -n
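To check both limits from inside a process, Python's standard resource module can be used. A small sketch, assuming Linux: RLIMIT_NPROC is the limit that ulimit -u reports and RLIMIT_NOFILE the one that ulimit -n reports. The helper names are hypothetical.

```python
import resource


def current_limits() -> dict:
    """Soft/hard limits for user processes (threads count here) and open files."""
    return {
        "max user processes (ulimit -u)": resource.getrlimit(resource.RLIMIT_NPROC),
        "open files (ulimit -n)": resource.getrlimit(resource.RLIMIT_NOFILE),
    }


def fmt(value: int) -> str:
    """Render a limit value the way ulimit does for unbounded limits."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


for name, (soft, hard) in current_limits().items():
    print(f"{name}: soft={fmt(soft)} hard={fmt(hard)}")
```

Running this in the same shell that starts Ray shows whether the raised limit actually reached the process.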

Regarding 128 threads, our default config uses

       std::max((int64_t)1, (int64_t)(std::thread::hardware_concurrency() / 4U)))

which means it should be at most 32 threads per worker on a 128-core machine (128 / 4), not 128
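Evaluating that expression for a few core counts, as a sketch in Python. The function name is hypothetical, and integer division stands in for the C++ unsigned division.

```python
def default_server_call_threads(hardware_concurrency: int) -> int:
    """Python equivalent of max(1, hardware_concurrency / 4) with integer division."""
    return max(1, hardware_concurrency // 4)


for cores in (2, 12, 72, 128):
    print(f"{cores} cores -> {default_server_call_threads(cores)} thread(s) per worker")
# 2 cores -> 1, 12 cores -> 3, 72 cores -> 18, 128 cores -> 32
```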

If you create an issue, we can investigate the default number of threads on 128-core machines. Please create an issue and tag me!

Thank you so much. I have created an issue: "Ray Core: System may hang when the task number is large on a single machine of many cpu cores" (Issue #34829, ray-project/ray on GitHub).
Sorry, but I don't know how to reach you.

@sangcho, thanks for your tips. I retried starting Ray with the ulimit -u command, and now the system works well. The issue comes from the limit on thread resources. For now I can go ahead, and I just hope Ray can handle this case more gracefully in a future version. Thanks anyway.