(raylet) OpenBLAS

I am getting the following error. Has anyone experienced this? The cluster's scheduler is Slurm. For more context, I ran the same script on another cluster with the same scheduler and it worked.

(raylet) OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 1541207 max
(raylet) OpenBLAS blas_thread_init: pthread_create failed for thread 55 of 64: Resource temporarily unavailable
(raylet) OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 1541207 max

Hey @wrios, thanks for posting this.

A couple of questions from me:

  1. Have you checked the thread limit on your system?
  2. What does your Ray workload look like? How many actors/tasks/nodes do you have on a single machine?
  3. Did you set any other environment variables, such as OPENBLAS_NUM_THREADS?
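
For anyone who wants to answer questions 1 and 3 from inside the same environment the job runs in, here is a minimal sketch using only the Python standard library. The helper name `describe_thread_limits` is made up for illustration. It reports the soft and hard RLIMIT_NPROC values (the "current" and "max" numbers in the OpenBLAS message above) and whether the threading-related variables are set. It assumes a Linux node, where `resource.RLIMIT_NPROC` is available.

```python
import os
import resource


def describe_thread_limits():
    """Collect the per-user process/thread limits and related env vars.

    Hypothetical helper for illustration; not part of Ray or OpenBLAS.
    """
    # RLIMIT_NPROC caps the number of processes/threads the user may create.
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    return {
        "nproc_soft": soft,
        "nproc_hard": hard,
        "cpu_count": os.cpu_count(),
        # None means the variable is not set in this environment.
        "OPENBLAS_NUM_THREADS": os.environ.get("OPENBLAS_NUM_THREADS"),
        "OMP_NUM_THREADS": os.environ.get("OMP_NUM_THREADS"),
    }


if __name__ == "__main__":
    for name, value in describe_thread_limits().items():
        print(f"{name}: {value}")
```

Running this inside the Slurm allocation, rather than on the login node, is likely to matter, since the limits can differ between the two.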

  1. The thread limit is 3083125.

  2. The node has 60 cores. Even if I only use 50, the problem still appears.

  3. When I set OPENBLAS_NUM_THREADS=1 I get a lot of these messages:

30452 thread_pool.cc:253] Waiting for thread pool to idle before forking

So I guess we should set OPENBLAS_NUM_THREADS to the number of CPUs available to the task/actor in Ray (we already do so for OMP_NUM_THREADS). See "[core] Set OPENBLAS_NUM_THREADS to number of cpus automatically", Issue #34724 in ray-project/ray on GitHub.

Alternatively, as a workaround, you can set OPENBLAS_NUM_THREADS to a reasonable number per process (e.g. if you know you have 4 tasks and 60 cores, you could set it to 60/4 = 15).
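
A rough sketch of that workaround, in case it helps. The helper `threads_per_process` is a hypothetical name, and the 60 cores / 4 tasks figures are just the example numbers from above. The variable has to be set before OpenBLAS is loaded, which in practice means before NumPy is imported in that process.

```python
import os


def threads_per_process(total_cores: int, num_processes: int) -> int:
    """Split the cores evenly across processes, never going below 1 thread.

    Hypothetical helper for illustration; not part of Ray or OpenBLAS.
    """
    if total_cores < 1 or num_processes < 1:
        raise ValueError("total_cores and num_processes must be positive")
    return max(1, total_cores // num_processes)


# Example numbers from the post: 60 cores shared by 4 tasks.
n_threads = threads_per_process(60, 4)  # 60 // 4 = 15

# Must run before NumPy (and therefore OpenBLAS) is imported in this process.
os.environ["OPENBLAS_NUM_THREADS"] = str(n_threads)
```

To have the worker processes see the same value, one option is to pass it through Ray's runtime environment, e.g. `ray.init(runtime_env={"env_vars": {"OPENBLAS_NUM_THREADS": "15"}})`, or to export it in the Slurm job script before starting Ray.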
As for the "Waiting for thread pool to idle before forking" message, I believe it's a known issue with gRPC: https://github.com/grpc/grpc/issues/31885