Severity:
- High: It blocks me from completing my task.
For many-actor jobs, we regularly run into a problem where making a remote actor call blocks forever (it happens on almost every one of our large jobs). This occurs after remote actor calls have been working for a while, and the only way to restore liveness to the actor is to manually SIGKILL the actor process, which triggers a restart.
History
We were previously hitting this issue when using Ray Serve, though we thought it was a Serve issue at the time (links in “Related” at the bottom). The behavior and `py-spy` traces are identical to what we saw then, even though we now use pure Ray Core instead.
Potential Root Cause
The problem appears to be that once the remote call enters the C/C++ code in the core worker, it blocks indefinitely; because the C code holds the GIL, there is no way for Python to take back control, so the Ray actor itself is stuck as well.
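This is also why a Python-level timeout can't help once the call has blocked. A minimal sketch (not our actual code; `do_work` is the method from our traces below, and the alarm-based wrapper is purely illustrative):

```python
import signal

def on_alarm(signum, frame):
    raise TimeoutError("remote call timed out")

def call_with_alarm(actor, timeout_s=5):
    """Try to bound a remote call with SIGALRM (this does NOT work here)."""
    signal.signal(signal.SIGALRM, on_alarm)
    signal.alarm(timeout_s)  # request SIGALRM after timeout_s seconds
    try:
        # If this blocks inside C code that never releases the GIL, the
        # interpreter can't execute on_alarm(): Python signal handlers run
        # between bytecode instructions, and that requires holding the GIL.
        return actor.do_work.remote()
    finally:
        signal.alarm(0)  # cancel the alarm
```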
We were able to repro this problem using a basic Python C extension here: python-c-extension/00-HelloWorld at master · spolcyn/python-c-extension · GitHub (use `make` to build, then `python test.py` to demo; the `sleep(500)` in `libmypy.c` mimics the blocking remote actor call). The behavior and `py-spy` traces are exactly what we observe with the hung Ray actors.
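For reference, a minimal sketch of what the repro demonstrates (assuming the extension builds as a module named `libmypy` that exposes `hello()`):

```python
import threading
import time

import libmypy  # built from libmypy.c via `make` (module name assumed)

# Run the blocking C call off the main thread, analogous to the actor's
# remote call blocking inside the core worker.
t = threading.Thread(target=libmypy.hello, daemon=True)
t.start()

# With the GIL macros in libmypy.c left commented out, hello() holds the
# GIL for its entire sleep(500), so this loop stops making progress almost
# immediately -- the same picture py-spy shows for our hung actors.
for i in range(10):
    print(f"main thread alive: {i}")
    time.sleep(1)
```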
To restore the expected behavior of the C function call (`hello`) not blocking the interpreter, either use a `ThreadPool` and uncomment the GIL macros in `libmypy.c`, or leave the GIL macros commented out and use a `ProcessPool`.
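For example, with `Py_BEGIN_ALLOW_THREADS`/`Py_END_ALLOW_THREADS` uncommented so `hello()` releases the GIL while it sleeps, a `ThreadPool` lets the caller impose a timeout from Python (a sketch, under the same `libmypy` assumption as above):

```python
from multiprocessing import TimeoutError as PoolTimeout
from multiprocessing.pool import ThreadPool

import libmypy  # assumed module name, as above

# Because hello() now releases the GIL while blocked, the main thread
# stays responsive and can abandon the call after a deadline.
with ThreadPool(processes=1) as pool:
    result = pool.apply_async(libmypy.hello)
    try:
        result.get(timeout=5)
    except PoolTimeout:
        print("hello() still blocked after 5s; giving up on it")
```

With the `ProcessPool` variant, the GIL state inside the extension doesn't matter, because the blocking call runs in a separate interpreter process with its own GIL.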
We’ve tried using a `ProcessPool` with Ray, but we’ve found that sending remote functions via `pickle` into `multiprocessing` doesn’t work.
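Roughly the shape of what we tried (a hedged sketch; `Worker` and `call_actor` are illustrative names, not our real code):

```python
import multiprocessing as mp

import ray

@ray.remote
class Worker:
    def do_work(self):
        return 1

def call_actor(handle):
    return ray.get(handle.do_work.remote())

if __name__ == "__main__":
    ray.init()
    actor = Worker.remote()
    with mp.Pool(processes=1) as pool:
        # This is where it breaks down for us: multiprocessing serializes
        # call_actor and its arguments with the stdlib pickler, and sending
        # the Ray remote machinery across that boundary fails.
        pool.apply(call_actor, (actor,))
```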
Is there a way to add a timeout within the worker code to deal with Ray remote calls that block forever, or a way to run these calls within a subprocess so we can cancel them from Python?
Thanks!
Debug Data and Related Issues
Platform Info: Python 3.9.10, Ray 1.13.0, Ubuntu Linux 20.04
Here’s the `py-spy` dump of the actor stuck on the method call:

```
Thread 0x7F691F178740 (active): "MainThread"
    main_loop (ray/worker.py:451)
    <module> (ray/workers/default_worker.py:238)
Thread 2569 (idle): "ray_import_thread"
    wait (threading.py:316)
    _wait_once (grpc/_common.py:106)
    wait (grpc/_common.py:148)
    result (grpc/_channel.py:733)
    _poll_locked (ray/_private/gcs_pubsub.py:249)
    poll (ray/_private/gcs_pubsub.py:385)
    _run (ray/_private/import_thread.py:70)
    run (threading.py:910)
    _bootstrap_inner (threading.py:973)
    _bootstrap (threading.py:930)
Thread 2574 (idle): "AsyncIO Thread: default"
    _actor_method_call (ray/actor.py:1075)  <-- Calls the core worker actor C code
    invocation (ray/actor.py:165)
    _remote (ray/actor.py:178)
    _start_span (ray/util/tracing/tracing_helper.py:421)
    remote (ray/actor.py:132)
    do_work (our_file.py:170)  <-- The function we call
    _run (asyncio/events.py:80)
    _run_once (asyncio/base_events.py:1890)
    run_forever (asyncio/base_events.py:596)
    run (threading.py:910)
    _bootstrap_inner (threading.py:973)
    _bootstrap (threading.py:930)
Thread 0x7F685F5FE700 (active)
Thread 2581 (idle): "asyncio_0"
    _worker (concurrent/futures/thread.py:81)
    run (threading.py:910)
    _bootstrap_inner (threading.py:973)
    _bootstrap (threading.py:930)
Thread 2596 (idle): "Thread-2"
    wait (threading.py:316)
    wait (threading.py:574)
    run (tqdm/_monitor.py:60)
    _bootstrap_inner (threading.py:973)
    _bootstrap (threading.py:930)
Thread 3437 (idle): "Thread-20"
    channel_spin (grpc/_channel.py:1258)
    run (threading.py:910)
    _bootstrap_inner (threading.py:973)
    _bootstrap (threading.py:930)
Thread 2564 (idle)
```
Related:
- Actor remote function blocks on client – we’re already using the latest version of Ray, so our problem doesn’t seem to be addressed by this (which is a Ray Client issue)
- Serve Handle Remote Calls Block Forever – posted previously, when we thought it was a Ray Serve issue; the symptoms there are identical to this problem
- [Serve] Serve stalls out after repeated replica failures · Issue #24419 · ray-project/ray · GitHub – this fixed a different issue than the one described here