Ray actor.remote() call blocks forever

Severity:

  • High: It blocks me from completing my task.

For many-actor jobs, we regularly run into a problem (it almost always happens on our large jobs) where making a remote actor call blocks forever. This occurs after remote actor calls have worked for a while, and the only way to restore liveness to the actor is to manually SIGKILL the actor process, which triggers a restart.

History

We were previously hitting this issue when using Ray Serve, though we thought it was a Serve issue at the time (links in “Related” at the bottom). Behavior and py-spy traces are identical to what we saw earlier, even though we now use pure Ray Core instead.

Potential Root Cause

The problem appears to be that once the remote call enters the C/C++ code in the core worker, it blocks indefinitely; because the C code holds the GIL, there is no way for Python to take back control, so the Ray actor itself is also stuck.

We were able to repro this problem using a basic Python C extension here: python-c-extension/00-HelloWorld at master · spolcyn/python-c-extension · GitHub (use make to build, then python test.py to demo; the sleep(500) in libmypy.c mimics the blocking remote actor call). The behavior and py-spy traces are exactly what we observe with the hung Ray actors.

To restore the expected behavior of the C function call (hello) not blocking the interpreter, either use a ThreadPool and uncomment the GIL macros in libmypy.c, or leave the GIL macros commented out and use a ProcessPool.
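
For illustration, the ProcessPool variant would look something like this (a rough sketch, not code from the linked repo: blocking_call is just a stand-in for the extension's hello(), and the 5-second timeout is arbitrary):

    import time
    from multiprocessing import Pool, TimeoutError

    def blocking_call():
        # Mimics a C call that holds the GIL and doesn't return promptly,
        # like the sleep(500) in libmypy.c.
        time.sleep(500)

    if __name__ == "__main__":
        with Pool(processes=1) as pool:
            async_result = pool.apply_async(blocking_call)
            try:
                async_result.get(timeout=5)   # the caller can time out from Python
            except TimeoutError:
                pool.terminate()              # and tear down the stuck worker
        print("interpreter stayed responsive")

A ThreadPool only restores responsiveness if the C code releases the GIL, i.e. with the GIL macros (presumably Py_BEGIN/END_ALLOW_THREADS) uncommented in libmypy.c.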

We’ve tried using a ProcessPool with Ray, but we’ve found sending remote functions via pickle into multiprocessing doesn’t work.

Is there a way to add a timeout within the worker code to deal with Ray remote calls that block forever, or a way to run these calls within a subprocess so we can cancel them from Python?

Thanks!

Debug Data and Related Issues

Platform Info: Python 3.9.10, Ray 1.13.0, Ubuntu Linux 20.04

Here’s the py-spy dump of the actor stuck on the method call:

			Thread 0x7F691F178740 (active): "MainThread"
			    main_loop (ray/worker.py:451)
			    <module> (ray/workers/default_worker.py:238)
			Thread 2569 (idle): "ray_import_thread"
			    wait (threading.py:316)
			    _wait_once (grpc/_common.py:106)
			    wait (grpc/_common.py:148)
			    result (grpc/_channel.py:733)
			    _poll_locked (ray/_private/gcs_pubsub.py:249)
			    poll (ray/_private/gcs_pubsub.py:385)
			    _run (ray/_private/import_thread.py:70)
			    run (threading.py:910)
			    _bootstrap_inner (threading.py:973)
			    _bootstrap (threading.py:930)
			Thread 2574 (idle): "AsyncIO Thread: default"
			    _actor_method_call (ray/actor.py:1075) <-- Calls the core worker actor C code
			    invocation (ray/actor.py:165)
			    _remote (ray/actor.py:178)
			    _start_span (ray/util/tracing/tracing_helper.py:421)
			    remote (ray/actor.py:132)
			    do_work (our_file.py:170) <-- The function we call
			    _run (asyncio/events.py:80)
			    _run_once (asyncio/base_events.py:1890)
			    run_forever (asyncio/base_events.py:596)
			    run (threading.py:910)
			    _bootstrap_inner (threading.py:973)
			    _bootstrap (threading.py:930)
			Thread 0x7F685F5FE700 (active)
			Thread 2581 (idle): "asyncio_0"
			    _worker (concurrent/futures/thread.py:81)
			    run (threading.py:910)
			    _bootstrap_inner (threading.py:973)
			    _bootstrap (threading.py:930)
			Thread 2596 (idle): "Thread-2"
			    wait (threading.py:316)
			    wait (threading.py:574)
			    run (tqdm/_monitor.py:60)
			    _bootstrap_inner (threading.py:973)
			    _bootstrap (threading.py:930)
			Thread 3437 (idle): "Thread-20"
			    channel_spin (grpc/_channel.py:1258)
			    run (threading.py:910)
			    _bootstrap_inner (threading.py:973)
			    _bootstrap (threading.py:930)
			Thread 2564 (idle)

Related:

  1. Actor remote function blocks on client – we’re already using the latest version of Ray, so this doesn’t seem to be addressed by it (that one is a client issue)
  2. Serve Handle Remote Calls Block Forever – posted previously, back when we thought it was a Ray Serve issue. The symptoms there are identical to this problem.
  3. [Serve] Serve stalls out after repeated replica failures · Issue #24419 · ray-project/ray · GitHub – this fixed a different issue than the one described here

@spolcyn thanks for reporting this. I think this might be related to [Core] Deadlock when actor is failed · Issue #26414 · ray-project/ray · GitHub

The theory is that _actor_method_call shouldn’t be blocking forever and I’m still investigating what’s happening there.

If you are interested, you can probably give this PR a try.

@spolcyn do you have a simple script to reproduce this? I’m struggling to find an easy way to reproduce it myself. If you know how, it’ll be a big help to me.

@yic I haven’t found a simple script thus far that exactly reproduces this – did you get a chance to look at the repro I linked using the basic C extension?

I was able to mimic the blocking behavior via a sleep(500) call in the extension. What do you think about putting that on the other side of _actor_method_call to test the fix for it? Also happy to arrange a call to talk through what I’ve tried and seen so far.

Also, noticed your comment here ([coe] Remove gil when submit actor and tasks to avoid deadlock for some cases. by iycheng · Pull Request #26421 · ray-project/ray · GitHub) – was that a successful repro/root cause diagnosis?

@spolcyn I checked your C example. I understand the issue you mentioned. The question here is why the program enters the C part and hangs there; it should be expected to finish within a reasonable time.

For the fix: I now know the root cause, but I’m not sure whether it’s the same as your case. The deadlock happens when an actor dies: the client side may hang because the task submission is holding the GIL, while the actor-death callback wants to call Python code. Both are trying to hold a mutex that protects the same data, and that’s where the deadlock happens.

To encounter this, you’ll need to await an object ref and have the actor die somehow.
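
Roughly, the scenario would look like this (illustrative only; the actor names and sleep durations are made up, and whether it actually deadlocks depends on timing and the Ray version):

    import time
    import ray

    @ray.remote
    class Worker:
        def do_work(self):
            time.sleep(10)          # long enough for us to kill the actor mid-call
            return "done"

    @ray.remote
    class Caller:
        async def run(self, worker):
            ref = worker.do_work.remote()   # task submission path (holds the GIL)
            return await ref                # await the object ref

    ray.init()
    worker = Worker.remote()
    caller = Caller.remote()
    result_ref = caller.run.remote(worker)
    time.sleep(1)                           # let the call get in flight
    ray.kill(worker, no_restart=True)       # have the downstream actor die
    try:
        print(ray.get(result_ref))
    except Exception as e:                  # expected: an actor-death failure
        print("call failed:", type(e).__name__)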

@yic Got it, I see your point on figuring out why the call is not finishing in a reasonable amount of time.

When using Ray Serve, we noticed this happened extremely frequently right after a bunch of Serve Replicas (actors) died all at once (e.g., because a node was terminated) – thus, it seems likely the root cause is the same.

To double-check, the conditions we’d need to reproduce would be:

  1. Actor 1 (client) calls an actor method on Actor 2 (remote); Actor 1 holds the GIL and enters the C++ code
  2. Actor 2 (remote) dies
  3. The global worker for Actor 1 receives notification that Actor 2 has died and attempts to acquire the GIL to run a callback
  4. The global worker for Actor 1 deadlocks, causing Actor 1’s remote actor method call to deadlock

So, in other words, if Actor 1 makes a remote actor call to Actor 2 before knowing that Actor 2 has died, the deadlock will occur?

@spolcyn this issue is fixed in the master ([coe] Remove gil when submit actor and tasks to avoid deadlock for some cases. by iycheng · Pull Request #26421 · ray-project/ray · GitHub)

Do you want to give it another try?

@yic Sure, we’ll see if we can get one of our bigger jobs running on the nightly build and whether we run into the error or not. Thanks!

We’re facing the same issue when trying to make the remote() call from a Python thread. Were you also using threads in your original failing workflow, or was the remote() call invoked from the main thread? Thanks!

IIRC it was invoked from the main thread, with tracking of remote() calls managed either via the Ray API (ray.wait, ray.get, etc.) or via Python’s asyncio library (treating ObjectRefs as Awaitable objects).
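
For concreteness, the two tracking styles looked roughly like this (do_work is just a placeholder task, not our actual code):

    import asyncio
    import ray

    @ray.remote
    def do_work(x):
        return x * 2

    ray.init()

    # Style 1: track refs with the Ray API (ray.wait / ray.get).
    refs = [do_work.remote(i) for i in range(4)]
    ready, not_ready = ray.wait(refs, num_returns=len(refs))
    print(ray.get(ready))

    # Style 2: treat ObjectRefs as awaitables under asyncio.
    async def gather_results():
        return await asyncio.gather(*[do_work.remote(i) for i in range(4)])

    print(asyncio.run(gather_results()))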

We run some long-running computations in Ray Serve and noticed the issue of actors getting killed by the Serve Controller due to failing health checks, even though that was supposed to be fixed in [Serve] Move replica health check to separate concurrency group · Issue #24554 · ray-project/ray · GitHub.

We realized this was caused by the GIL not being released in a Python extension we were calling in the Ray Serve deployment. Releasing the GIL resolved the issue.

Is there a way to add a timeout within the worker code to deal with Ray remote calls that block forever, or a way to run these calls within a subprocess so we can cancel them from Python?

We are making use of the health check feature of Serve deployments to somewhat achieve this - Add End-to-End Fault Tolerance — Ray 2.8.0. The duration of ongoing calls is tracked, and if it exceeds the timeout, check_health fails, which results in the replica being killed and a new one being created.
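
Roughly, the pattern looks like this (class and constant names are made up; in the Serve versions we’ve used, check_health signals an unhealthy replica by raising, so adapt to your version’s semantics):

    import time
    from ray import serve

    MAX_CALL_SECONDS = 300  # assumed per-call timeout

    @serve.deployment
    class LongRunner:
        def __init__(self):
            self._in_flight = {}  # call key -> start time

        def __call__(self, request):
            key = object()        # unique key for this call
            self._in_flight[key] = time.time()
            try:
                return self._do_long_computation(request)
            finally:
                del self._in_flight[key]

        def _do_long_computation(self, request):
            ...  # the real work, possibly inside a non-GIL-releasing extension

        def check_health(self):
            # Called periodically by the Serve controller; if any in-flight
            # call has exceeded the timeout, report this replica as stuck so
            # it gets killed and replaced.
            now = time.time()
            if any(now - start > MAX_CALL_SECONDS
                   for start in list(self._in_flight.values())):
                raise RuntimeError("a call exceeded MAX_CALL_SECONDS; replica looks stuck")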

Is there a way to add a timeout within the worker code to deal with Ray remote calls that block forever

There’s no support for this feature. Even if it were supported, it probably wouldn’t work with custom extensions that don’t release the GIL, since it is difficult to interrupt non-Python code (maybe we could send signals, but that depends on whether your custom extension can handle the interrupt and return to Python code). Interrupting an actor call is also actually harmful because it can mess up your state.

That said, if your workload requires long-running functions that don’t release the GIL, it may be better to start a subprocess. For that, I believe you can simply start one using the standard subprocess module (subprocess — Subprocess management — Python 3.12.0 documentation).
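
For example, something along these lines (worker_script.py is a hypothetical script that does the long, non-GIL-releasing computation and writes its result to stdout):

    import subprocess

    try:
        completed = subprocess.run(
            ["python", "worker_script.py"],   # hypothetical worker script
            capture_output=True,
            timeout=300,                      # enforce a timeout from the caller
            check=True,
        )
        result = completed.stdout
    except subprocess.TimeoutExpired:
        # The child is killed when the timeout expires, and the calling
        # actor keeps control of its own interpreter.
        result = None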