- High: It blocks me from completing my task.
For many-actor jobs, we regularly run into a problem where a remote actor call blocks forever (it happens on almost all of our large jobs). This occurs after remote actor calls have worked fine for a while, and the only way to restore liveness to the actor is to SIGKILL the actor process manually, which triggers a restart.
We previously hit this issue when using Ray Serve, though we thought it was a Serve issue at the time (links in “Related” at the bottom). The behavior and py-spy traces are identical to what we saw then, even though we now use pure Ray Core instead.
The problem appears to be that once the remote call enters the C/C++ code in the core worker, it blocks indefinitely; and since the C code holds the GIL, there is no way for Python to take back control, so the Ray actor itself is also stuck.
We were able to repro this problem using a basic Python C extension here: python-c-extension/00-HelloWorld at master · spolcyn/python-c-extension · GitHub (use `make` to build, then `python test.py` to demo; `libmypy.c` mimics the blocking remote actor call). The behavior and py-spy traces are exactly what we observe with the hung Ray actors.
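The same GIL-holding behavior can also be reproduced without a custom extension, using only `ctypes`. This is a minimal sketch (it assumes a loadable glibc-style `libc` with `sleep()`): `ctypes.CDLL` releases the GIL around each foreign call, while `ctypes.PyDLL` does not, so the `PyDLL` variant freezes every other Python thread for the duration of the C call, which is the same shape as our hung actors.

```python
import ctypes
import ctypes.util
import threading
import time

# Assumption: a loadable C library exposing sleep(); "libc.so.6" is the glibc name.
libc_name = ctypes.util.find_library("c") or "libc.so.6"

def count_while(call_sleep):
    """Count ticks in a background thread while the main thread sleeps in C."""
    ticks = [0]
    stop = threading.Event()

    def counter():
        while not stop.is_set():
            ticks[0] += 1
            time.sleep(0.01)

    t = threading.Thread(target=counter)
    t.start()
    call_sleep(1)  # one-second sleep inside the C library
    stop.set()
    t.join()
    return ticks[0]

# CDLL releases the GIL around the foreign call: the counter keeps running.
released = count_while(ctypes.CDLL(libc_name).sleep)

# PyDLL holds the GIL for the whole call: every other Python thread freezes,
# analogous to the blocked core-worker call described above.
held = count_while(ctypes.PyDLL(libc_name).sleep)

print(released, held)  # `released` is much larger than `held`
```

The `CDLL`/`PyDLL` split mirrors the commented-out GIL macros in the C-extension repro.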
To restore the expected behavior of the C function call (`hello`) not blocking the interpreter, either use a `ThreadPool` and uncomment the GIL macros in `libmypy.c`, or leave the GIL macros commented out and use a `ProcessPool`.
We’ve tried using a `ProcessPool` with Ray, but we’ve found that sending remote functions into `multiprocessing` via `pickle` doesn’t work.
Is there a way to add a timeout within the worker code to deal with Ray remote calls that block forever, or a way to run these calls within a subprocess so we can cancel them from Python?
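For the subprocess option, this is the kind of thing we have in mind, as a sketch (the wrapper name `call_with_timeout` is ours, not a Ray API, and it assumes the callable, its arguments, and its result are picklable): run the call in a `multiprocessing.Process` and SIGKILL it on timeout. Unlike a watchdog thread, this works even when the child is stuck in C code holding its own GIL, because the parent process has a separate interpreter.

```python
import multiprocessing as mp
import time

def _runner(fn, args, q):
    # Runs in the child process; even if fn blocks forever inside C code
    # while holding the child's GIL, the parent stays responsive.
    q.put(fn(*args))

def call_with_timeout(fn, args=(), timeout=5.0):
    """Hypothetical wrapper: run fn in a subprocess we can SIGKILL on hang."""
    q = mp.Queue()
    p = mp.Process(target=_runner, args=(fn, args, q))
    p.start()
    p.join(timeout)
    if p.is_alive():
        p.kill()  # SIGKILL -- the same hammer we currently apply by hand
        p.join()
        raise TimeoutError(f"call did not finish within {timeout}s")
    return q.get()

if __name__ == "__main__":
    print(call_with_timeout(sorted, ([3, 1, 2],), timeout=5.0))
    try:
        call_with_timeout(time.sleep, (10,), timeout=0.5)
    except TimeoutError as exc:
        print("timed out:", exc)
```

The picklability requirement is exactly where we got stuck with `ProcessPool`, so this only helps if the blocking call can be wrapped in an importable, module-level function.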
Platform Info: Python 3.9.10, Ray 1.13.0, Ubuntu Linux 20.04
py-spy dump of the actor stuck on the method call:
```
Thread 0x7F691F178740 (active): "MainThread"
    main_loop (ray/worker.py:451)
    <module> (ray/workers/default_worker.py:238)
Thread 2569 (idle): "ray_import_thread"
    wait (threading.py:316)
    _wait_once (grpc/_common.py:106)
    wait (grpc/_common.py:148)
    result (grpc/_channel.py:733)
    _poll_locked (ray/_private/gcs_pubsub.py:249)
    poll (ray/_private/gcs_pubsub.py:385)
    _run (ray/_private/import_thread.py:70)
    run (threading.py:910)
    _bootstrap_inner (threading.py:973)
    _bootstrap (threading.py:930)
Thread 2574 (idle): "AsyncIO Thread: default"
    _actor_method_call (ray/actor.py:1075)   <-- calls the core worker actor C code
    invocation (ray/actor.py:165)
    _remote (ray/actor.py:178)
    _start_span (ray/util/tracing/tracing_helper.py:421)
    remote (ray/actor.py:132)
    do_work (our_file.py:170)                <-- the function we call
    _run (asyncio/events.py:80)
    _run_once (asyncio/base_events.py:1890)
    run_forever (asyncio/base_events.py:596)
    run (threading.py:910)
    _bootstrap_inner (threading.py:973)
    _bootstrap (threading.py:930)
Thread 0x7F685F5FE700 (active)
Thread 2581 (idle): "asyncio_0"
    _worker (concurrent/futures/thread.py:81)
    run (threading.py:910)
    _bootstrap_inner (threading.py:973)
    _bootstrap (threading.py:930)
Thread 2596 (idle): "Thread-2"
    wait (threading.py:316)
    wait (threading.py:574)
    run (tqdm/_monitor.py:60)
    _bootstrap_inner (threading.py:973)
    _bootstrap (threading.py:930)
Thread 3437 (idle): "Thread-20"
    channel_spin (grpc/_channel.py:1258)
    run (threading.py:910)
    _bootstrap_inner (threading.py:973)
    _bootstrap (threading.py:930)
Thread 2564 (idle)
```
Related:
- Actor remote function blocks on client – we’re already using the latest version of Ray, so our problem doesn’t seem to be addressed by this (which is a Ray Client issue)
- Serve Handle Remote Calls Block Forever – posted previously, when we thought it was a Ray Serve issue. The symptoms are identical to this problem.
- [Serve] Serve stalls out after repeated replica failures · Issue #24419 · ray-project/ray · GitHub – this fixed a different issue than the one described here