How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
I am running Ray on a SLURM cluster and followed the documentation when writing my sbatch script, and I hadn't had any issues until today. The essence of my code is to take a large input, split it into N parts (one part per CPU core), and use Ray to execute each part on a single CPU core. The processing function is decorated with `@ray.remote(num_cpus=1)`, and the node I'm working on has 128 CPU cores.
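The structure of the code is roughly the following (a minimal sketch; the input, the splitting, and `process_chunk` are stand-ins for my real logic):

```python
import ray

ray.init(address="auto")  # attach to the Ray cluster started by the sbatch script

@ray.remote(num_cpus=1)  # each task should reserve exactly one CPU core
def process_chunk(chunk):
    # stand-in for the real CPU-bound processing of one part of the input
    return sum(chunk)

large_input = list(range(1_000_000))  # stand-in for the real large input
n_parts = 128  # one part per CPU core on this node
chunks = [large_input[i::n_parts] for i in range(n_parts)]

futures = [process_chunk.remote(c) for c in chunks]  # should fan out across 128 cores
results = ray.get(futures)
```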
When I execute the program and view the Ray dashboard, I can see 128 workers being created, but all of the work then ends up being done by a single worker. The Resource Status panel on the dashboard also shows only 1.0/128.0 CPUs in use.
I just updated Ray to version 2.5.1. Earlier today I was on 2.2.0 and the issue I was facing was similar (only 2 workers were initially created and 1 did all the work).
It could also be an issue with my SLURM cluster (I'm not sure what I should be looking for to verify that), but I thought I'd start here given the size of this community. Thanks in advance!
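For reference, this is roughly how one can check what resources Ray itself believes it has (a minimal sketch, run on the node after the cluster is up):

```python
import ray

ray.init(address="auto")  # attach to the existing cluster

# Totals Ray thinks the cluster has; should show {'CPU': 128.0, ...} for this node
print(ray.cluster_resources())
# Resources currently unclaimed by running tasks
print(ray.available_resources())
```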
EDIT: To get some more info, I looked at the stack traces for the workers on the dashboard. For the workers that weren't executing anything, the trace looks like this:
```
Thread 756850 (idle): "MainThread"
pthread_cond_wait@@GLIBC_2.3.2 (libpthread-2.28.so)
boost::asio::detail::scheduler::do_run_one (ray/_raylet.so)
boost::asio::detail::scheduler::run (ray/_raylet.so)
boost::asio::io_context::run (ray/_raylet.so)
ray::core::CoreWorker::RunTaskExecutionLoop (ray/_raylet.so)
ray::core::CoreWorkerProcessImpl::RunWorkerTaskExecutionLoop (ray/_raylet.so)
ray::core::CoreWorkerProcess::RunTaskExecutionLoop (ray/_raylet.so)
run_task_loop (ray/_raylet.so)
main_loop (ray/_private/worker.py:861)
<module> (ray/_private/workers/default_worker.py:262)
```
whereas the stack trace of the one worker that is doing the work looks like this:
```
Thread 756949 (idle): "MainThread"
pthread_cond_wait@@GLIBC_2.3.2 (libpthread-2.28.so)
boost::asio::detail::scheduler::do_run_one (ray/_raylet.so)
boost::asio::detail::scheduler::run (ray/_raylet.so)
boost::asio::io_context::run (ray/_raylet.so)
ray::core::CoreWorker::RunTaskExecutionLoop (ray/_raylet.so)
ray::core::CoreWorkerProcessImpl::RunWorkerTaskExecutionLoop (ray/_raylet.so)
ray::core::CoreWorkerProcess::RunTaskExecutionLoop (ray/_raylet.so)
run_task_loop (ray/_raylet.so)
main_loop (ray/_private/worker.py:861)
<module> (ray/_private/workers/default_worker.py:262)
Thread 763971 (idle): "ray_import_thread"
do_futex_wait (libpthread-2.28.so)
__new_sem_wait_slow (libpthread-2.28.so)
PyThread_acquire_lock_timed (python3.9)
lock_PyThread_acquire_lock (python3.9)
wait (threading.py:316)
_wait_once (grpc/_common.py:106)
wait (grpc/_common.py:148)
result (grpc/_channel.py:733)
_poll_locked (ray/_private/gcs_pubsub.py:217)
poll (ray/_private/gcs_pubsub.py:372)
_run (ray/_private/import_thread.py:74)
run (threading.py:917)
_bootstrap_inner (threading.py:980)
_bootstrap (threading.py:937)
clone (libc-2.28.so)
Thread 772856 (idle): "Thread-20"
epoll_wait (libc-2.28.so)
0x1512bff09fda (grpc/_cython/cygrpc.cpython-39-x86_64-linux-gnu.so)
0x1512bffb184c (grpc/_cython/cygrpc.cpython-39-x86_64-linux-gnu.so)
0x1512c0014a55 (grpc/_cython/cygrpc.cpython-39-x86_64-linux-gnu.so)
0x1512c008f12f (grpc/_cython/cygrpc.cpython-39-x86_64-linux-gnu.so)
0x1512c0090a5d (grpc/_cython/cygrpc.cpython-39-x86_64-linux-gnu.so)
channel_spin (grpc/_channel.py:1258)
0x1512bfff9dec (grpc/_cython/cygrpc.cpython-39-x86_64-linux-gnu.so)
run (threading.py:917)
_bootstrap_inner (threading.py:980)
_bootstrap (threading.py:937)
clone (libc-2.28.so)
```