Actor remote function blocks on client

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.

Sometimes a call to a remote actor function blocks without ever returning. As I understand it, a `.remote()` call is supposed to return immediately.
I've looked through all the logs in verbose mode and can't find anything (I don't see the actor function starting in the worker).
A little bit about my client:
I use Ray Client from a process outside the Ray cluster and call it from multiple threads in that process.

I am new to Ray, and I would appreciate any help in understanding which logs will show what the client is actually doing.
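
For reference, here is a minimal sketch of the pattern (roughly what my code does; the cluster address, actor, and method names are placeholders, not my real code):

```python
import ray
from concurrent.futures import ThreadPoolExecutor

# Connect with Ray Client from a process running outside the cluster.
ray.init("ray://head-node:10001")  # placeholder address

@ray.remote
class SliceProcessor:  # placeholder actor
    def process(self, data):
        return len(data)

actor = SliceProcessor.remote()

def _thread_process_slice(data):
    # This .remote() call is the one that occasionally never returns.
    ref = actor.process.remote(data)
    return ray.get(ref)

# Several threads submit tasks to the same actor in a loop.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(_thread_process_slice, [b"x"] * 100))
```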

When this happens, can you use py-spy to get a stack trace of the hanging process (for example, `py-spy dump --pid <client process pid>`)?

Hanging thread:
Thread 21973 (idle): “processing_0”
_async_send (ray/util/client/dataclient.py:281)
ReleaseObject (ray/util/client/dataclient.py:363)
_release_server (ray/util/client/worker.py:528)
call_release (ray/util/client/worker.py:522)
call_release (ray/util/client/api.py:118)
notify (threading.py:354)
put (queue.py:151)
_async_send (ray/util/client/dataclient.py:287)
Schedule (ray/util/client/dataclient.py:368)
_call_schedule_for_task (ray/util/client/worker.py:496)
call_remote (ray/util/client/worker.py:455)
call_remote (ray/util/client/api.py:106)
remote (ray/util/client/common.py:346)
_run_ray_task (task_code.py:89) – my code running function on actor
_thread_process_slice (task_code.py:35) – my code
run (concurrent/futures/thread.py:57)
_worker (concurrent/futures/thread.py:80)
run (threading.py:870)
_bootstrap_inner (threading.py:932)
_bootstrap (threading.py:890)

For completeness, here are all my other threads:
Thread 20164 (idle): “MainThread”
wait (threading.py:302)
wait (threading.py:558)
<module> (my_module/main.py:15) – my main module
_run_code (runpy.py:85)
_run_module_as_main (runpy.py:192)
Thread 20254 (idle): “Thread-1”
wait (threading.py:302)
get (queue.py:170)
dequeue (logging/handlers.py:1427)
_monitor (logging/handlers.py:1478)
run (threading.py:870)
_bootstrap_inner (threading.py:932)
_bootstrap (threading.py:890)
Thread 21445 (idle): “Thread-2”
_poll_connectivity (grpc/_channel.py:1391)
run (threading.py:870)
_bootstrap_inner (threading.py:932)
_bootstrap (threading.py:890)
Thread 21452 (idle): “ray_client_streaming_rpc”
_process_response (ray/util/client/dataclient.py:133)
_data_main (ray/util/client/dataclient.py:99)
run (threading.py:870)
_bootstrap_inner (threading.py:932)
_bootstrap (threading.py:890)
Thread 21453 (idle): “Thread-7”
wait (threading.py:306)
_wait_once (grpc/_common.py:106)
wait (grpc/_common.py:141)
_next (grpc/_channel.py:817)
next (grpc/_channel.py:426)
_log_main (ray/util/client/logsclient.py:68)
run (threading.py:870)
_bootstrap_inner (threading.py:932)
_bootstrap (threading.py:890)
Thread 21454 (idle): “Thread-8”
channel_spin (grpc/_channel.py:1258)
run (threading.py:870)
_bootstrap_inner (threading.py:932)
_bootstrap (threading.py:890)
Thread 21455 (idle): “Thread-9”
__enter__ (threading.py:247)
get (queue.py:164)
consume_request_iterator (grpc/_channel.py:203)
run (threading.py:870)
_bootstrap_inner (threading.py:932)
_bootstrap (threading.py:890)
Thread 21456 (idle): “Thread-10”
wait (threading.py:302)
get (queue.py:170)
consume_request_iterator (grpc/_channel.py:203)
run (threading.py:870)
_bootstrap_inner (threading.py:932)
_bootstrap (threading.py:890)
Thread 21754 (idle): “grpc_server”
wait (threading.py:306)
wait (threading.py:558)
_wait_once (grpc/_common.py:106)
wait (grpc/_common.py:141)
wait_for_termination (grpc/_server.py:985)
run (grpc/app.py:63)
run (threading.py:870)
_bootstrap_inner (threading.py:932)
_bootstrap (threading.py:890)
Thread 21761 (idle): “Thread-11”
_serve (grpc/_server.py:879)
run (threading.py:870)
_bootstrap_inner (threading.py:932)
_bootstrap (threading.py:890)
Thread 21762 (idle): “grpc_client_0”
wait (threading.py:302)
result (concurrent/futures/_base.py:434)
DoWork (flow_control.py:31) - my code receiving a gRPC call from another server
_call_behavior (grpc/_server.py:443)
_unary_response_in_pool (grpc/_server.py:560)
run (concurrent/futures/thread.py:57)
_worker (concurrent/futures/thread.py:80)
run (threading.py:870)
_bootstrap_inner (threading.py:932)
_bootstrap (threading.py:890)

I’ve upgraded to 1.12.0, and this is the call stack that I get:
Thread 17484 (idle): “slice processing_0”
_async_send (ray/util/client/dataclient.py:426)
ReleaseObject (ray/util/client/dataclient.py:531)
_release_server (ray/util/client/worker.py:628)
call_release (ray/util/client/worker.py:622)
__del__ (ray/util/client/common.py:110)
put (queue.py:132)
_async_send (ray/util/client/dataclient.py:432)
Schedule (ray/util/client/dataclient.py:535)
_call_schedule_for_task (ray/util/client/worker.py:592)
call_remote (ray/util/client/worker.py:550)
call_remote (ray/util/client/api.py:109)
remote (ray/util/client/common.py:540)
(detection_scan_server/processing_manager.py:116)
_run_defects_creation (detection_scan_server/processing_manager.py:116)
_thread_process_slice (detection_scan_server/processing_manager.py:57)
run (concurrent/futures/thread.py:57)
_worker (concurrent/futures/thread.py:80)
run (threading.py:870)
_bootstrap_inner (threading.py:932)
_bootstrap (threading.py:890)

To me it looks like the issue is in

I just run actor remote functions in a loop and it happens.
So I don't understand what I am doing wrong, since it seems like everyone should be suffering from this bug.

I found the following issue on GitHub, but it was reverted

@shiranbi yes, this is a known problem for which we don't have a good fix on Python 3.6. Are you using Python 3.7 or above? If so, we can try to merge in a mitigation for it.

Hi @Mingwei,
Yes, I am working with Python 3.8 and trying to migrate to Python 3.9, but that might still be a while.
I would really appreciate anything that could help; I had to disable my entire project since this happens so often.

That is really unfortunate. I’m optimistic that [Ray client] use `SimpleQueue` on Python 3.7 and newer for async `dataclient` by mwtian · Pull Request #23995 · ray-project/ray · GitHub can help in this case by avoiding the deadlock during Python GC. Would you be willing to try out the Ray nightly release once that PR is merged?
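
For background, here is a contrived sketch (my own illustration, not Ray's actual code) of the failure mode: `queue.Queue.put()` holds a non-reentrant lock, so if a GC-triggered `__del__` fires at that moment and tries to `put()` a release request on the same queue, the nested call blocks forever. `queue.SimpleQueue.put()` is reentrant, which is why the PR switches to it on Python 3.7+.

```python
import queue
import threading

q = queue.Queue()

def nested_put():
    # Stand-in for: put() already holds the queue's internal lock when a
    # GC-triggered __del__ tries to put() on the same queue.
    with q.mutex:                 # Queue.put() acquires this same non-reentrant lock
        q.put("release-request")  # blocks forever

t = threading.Thread(target=nested_put, daemon=True)
t.start()
t.join(timeout=1)
print("deadlocked:", t.is_alive())  # True: the nested put() never completed

# queue.SimpleQueue.put() is reentrant (safe to call from __del__),
# so the same pattern cannot self-deadlock.
sq = queue.SimpleQueue()
sq.put("release-request")
print(sq.get())  # "release-request"
```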

I have no problem trying the nightly release. I tried running with nightly a few days ago and it failed on something else entirely, but I have no problem trying it again once the pull request is merged.

@Mingwei I used the nightly build, and from what I can tell it includes your fix.
I ran it several times and my tests pass. I will start using this version at larger scale next week and will be able to see it in more use cases.
I have noticed that, since it is a timing issue, changing one thing sometimes makes the deadlock less frequent, so only time will tell if it solved the entire problem.
Thank you very much.

Thanks for testing it out! And great that the fix shows promise. The nightly build now definitely includes [Ray client] use `SimpleQueue` on Python 3.7 and newer in async `dataclient` by mwtian · Pull Request #23995 · ray-project/ray · GitHub. You are right that this is a timing issue. There is one more fix I want to make to fully resolve the issue.