Actor remote function blocks on client

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.

Sometimes a call to a remote actor function blocks without ever returning. As I understand it, a `.remote()` call is supposed to return immediately.
I've looked through all the logs in verbose mode and can't find anything (I don't see the actor function starting in the worker).
A little bit about my client:
I use Ray Client from a process outside the Ray cluster and call it from multiple threads in that process.

I am new to Ray, and I would appreciate any help in understanding which logs will show what the client is actually doing.
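
For reference, here is a minimal sketch of the pattern (roughly what my code does; the cluster address, actor, and method names are placeholders, not my real code):

```python
import ray
from concurrent.futures import ThreadPoolExecutor

# Connect with Ray Client from a process running outside the cluster.
ray.init("ray://head-node:10001")  # placeholder address

@ray.remote
class SliceProcessor:  # placeholder actor
    def process(self, data):
        return len(data)

actor = SliceProcessor.remote()

def _thread_process_slice(data):
    # This .remote() call is the one that occasionally never returns.
    ref = actor.process.remote(data)
    return ray.get(ref)

# Several threads submit tasks to the same actor in a loop.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(_thread_process_slice, [b"x"] * 100))
```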

When this happens, can you use py-spy to get a stack trace of the hanging process (for example, `py-spy dump --pid <client process pid>`)?

Hanging thread:
Thread 21973 (idle): “processing_0”
_async_send (ray/util/client/dataclient.py:281)
ReleaseObject (ray/util/client/dataclient.py:363)
_release_server (ray/util/client/worker.py:528)
call_release (ray/util/client/worker.py:522)
call_release (ray/util/client/api.py:118)
notify (threading.py:354)
put (queue.py:151)
_async_send (ray/util/client/dataclient.py:287)
Schedule (ray/util/client/dataclient.py:368)
_call_schedule_for_task (ray/util/client/worker.py:496)
call_remote (ray/util/client/worker.py:455)
call_remote (ray/util/client/api.py:106)
remote (ray/util/client/common.py:346)
_run_ray_task (task_code.py:89) – my code running function on actor
_thread_process_slice (task_code.py:35) – my code
run (concurrent/futures/thread.py:57)
_worker (concurrent/futures/thread.py:80)
run (threading.py:870)
_bootstrap_inner (threading.py:932)
_bootstrap (threading.py:890)

For completeness, here are all my other threads:
Thread 20164 (idle): “MainThread”
wait (threading.py:302)
wait (threading.py:558)
<module> (my_module/main.py:15) – my main module
_run_code (runpy.py:85)
_run_module_as_main (runpy.py:192)
Thread 20254 (idle): “Thread-1”
wait (threading.py:302)
get (queue.py:170)
dequeue (logging/handlers.py:1427)
_monitor (logging/handlers.py:1478)
run (threading.py:870)
_bootstrap_inner (threading.py:932)
_bootstrap (threading.py:890)
Thread 21445 (idle): “Thread-2”
_poll_connectivity (grpc/_channel.py:1391)
run (threading.py:870)
_bootstrap_inner (threading.py:932)
_bootstrap (threading.py:890)
Thread 21452 (idle): “ray_client_streaming_rpc”
_process_response (ray/util/client/dataclient.py:133)
_data_main (ray/util/client/dataclient.py:99)
run (threading.py:870)
_bootstrap_inner (threading.py:932)
_bootstrap (threading.py:890)
Thread 21453 (idle): “Thread-7”
wait (threading.py:306)
_wait_once (grpc/_common.py:106)
wait (grpc/_common.py:141)
_next (grpc/_channel.py:817)
next (grpc/_channel.py:426)
_log_main (ray/util/client/logsclient.py:68)
run (threading.py:870)
_bootstrap_inner (threading.py:932)
_bootstrap (threading.py:890)
Thread 21454 (idle): “Thread-8”
channel_spin (grpc/_channel.py:1258)
run (threading.py:870)
_bootstrap_inner (threading.py:932)
_bootstrap (threading.py:890)
Thread 21455 (idle): “Thread-9”
__enter__ (threading.py:247)
get (queue.py:164)
consume_request_iterator (grpc/_channel.py:203)
run (threading.py:870)
_bootstrap_inner (threading.py:932)
_bootstrap (threading.py:890)
Thread 21456 (idle): “Thread-10”
wait (threading.py:302)
get (queue.py:170)
consume_request_iterator (grpc/_channel.py:203)
run (threading.py:870)
_bootstrap_inner (threading.py:932)
_bootstrap (threading.py:890)
Thread 21754 (idle): “grpc_server”
wait (threading.py:306)
wait (threading.py:558)
_wait_once (grpc/_common.py:106)
wait (grpc/_common.py:141)
wait_for_termination (grpc/_server.py:985)
run (grpc/app.py:63)
run (threading.py:870)
_bootstrap_inner (threading.py:932)
_bootstrap (threading.py:890)
Thread 21761 (idle): “Thread-11”
_serve (grpc/_server.py:879)
run (threading.py:870)
_bootstrap_inner (threading.py:932)
_bootstrap (threading.py:890)
Thread 21762 (idle): “grpc_client_0”
wait (threading.py:302)
result (concurrent/futures/_base.py:434)
DoWork (flow_control.py:31) - my code receiving a gRPC call from another server
_call_behavior (grpc/_server.py:443)
_unary_response_in_pool (grpc/_server.py:560)
run (concurrent/futures/thread.py:57)
_worker (concurrent/futures/thread.py:80)
run (threading.py:870)
_bootstrap_inner (threading.py:932)
_bootstrap (threading.py:890)

I’ve upgraded to 1.12.0, and this is the call stack that I get:
Thread 17484 (idle): “slice processing_0”
_async_send (ray/util/client/dataclient.py:426)
ReleaseObject (ray/util/client/dataclient.py:531)
_release_server (ray/util/client/worker.py:628)
call_release (ray/util/client/worker.py:622)
__del__ (ray/util/client/common.py:110)
put (queue.py:132)
_async_send (ray/util/client/dataclient.py:432)
Schedule (ray/util/client/dataclient.py:535)
_call_schedule_for_task (ray/util/client/worker.py:592)
call_remote (ray/util/client/worker.py:550)
call_remote (ray/util/client/api.py:109)
remote (ray/util/client/common.py:540)
(detection_scan_server/processing_manager.py:116)
_run_defects_creation (detection_scan_server/processing_manager.py:116)
_thread_process_slice (detection_scan_server/processing_manager.py:57)
run (concurrent/futures/thread.py:57)
_worker (concurrent/futures/thread.py:80)
run (threading.py:870)
_bootstrap_inner (threading.py:932)
_bootstrap (threading.py:890)

To me it looks like the issue is in

I just run actor remote functions in a loop and it happens.
So I don't understand what I am doing wrong, since it seems like everyone should be suffering from this bug.

I found the following issue on GitHub, but it was reverted

@shiranbi yes, this is a known problem for which we don't have a good fix on Python 3.6. Are you using Python 3.7 or above? If so, we can try to merge in a mitigation for it.

Hi @Mingwei,
Yes, I am working with Python 3.8 and trying to migrate to Python 3.9, but that might still be a while.
I would really appreciate anything that could help; I had to disable my entire project since this happens so often.

That is really unfortunate. I’m optimistic that [Ray client] use `SimpleQueue` on Python 3.7 and newer for async `dataclient` by mwtian · Pull Request #23995 · ray-project/ray · GitHub can help in this case by avoiding the deadlock during Python GC. Would you be willing to try out the Ray nightly release once that PR is merged?
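
For background, here is a contrived sketch (my own illustration, not Ray's actual code) of the failure mode: `queue.Queue.put()` holds a non-reentrant lock, so if a GC-triggered `__del__` fires at that moment and tries to `put()` a release request on the same queue, the nested call blocks forever. `queue.SimpleQueue.put()` is reentrant, which is why the PR switches to it on Python 3.7+.

```python
import queue
import threading

q = queue.Queue()

def nested_put():
    # Stand-in for: put() already holds the queue's internal lock when a
    # GC-triggered __del__ tries to put() on the same queue.
    with q.mutex:                 # Queue.put() acquires this same non-reentrant lock
        q.put("release-request")  # blocks forever

t = threading.Thread(target=nested_put, daemon=True)
t.start()
t.join(timeout=1)
print("deadlocked:", t.is_alive())  # True: the nested put() never completed

# queue.SimpleQueue.put() is reentrant (safe to call from __del__),
# so the same pattern cannot self-deadlock.
sq = queue.SimpleQueue()
sq.put("release-request")
print(sq.get())  # "release-request"
```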

I have no problem trying the nightly release. I tried running with nightly a few days ago and it failed on something else entirely, but I have no problem trying it again once the pull request is merged.

@Mingwei I used the nightly build, and from what I can tell it includes your fix.
I ran it several times and my tests pass. I will start using this version at larger scale next week and will be able to see it in more use cases.
I have noticed that, since it is a timing issue, changing one thing sometimes makes the deadlock less frequent, so only time will tell if it solved the entire problem.
Thank you very much.

Thanks for testing it out! And great that the fix shows promise. The nightly build now definitely includes [Ray client] use `SimpleQueue` on Python 3.7 and newer in async `dataclient` by mwtian · Pull Request #23995 · ray-project/ray · GitHub. You are right that this is a timing issue. There is one more fix I want to make to fully resolve the issue.