Serve Handle Remote Calls Block Forever

spolcyn · April 21, 2022, 3:13pm

How severe does this issue affect your experience of using Ray?

High: It blocks me to complete my task.

Consistently reproducible on Ray 1.11 & Ray 1.12, though we don’t have a small repro script yet.

Background:

Running on k8s cluster
Sending ~100s of requests from an actor to an autoscaling Serve deployment.
Many actors (3-7), each sending to one autoscaling Serve deployment.

On a job as described in the background, we routinely (at least one actor, typically more) will stall out at the beginning or midway through the job. Typically (but not always) this happens after at least one Serve replica fails (usually due to k8s pod shuffling), the stack trace of which we see in the logs of the Ray Actor that makes the call to the Serve Handle. However, that exception doesn’t seem to cause the calling actor to fail, or the remote() call to the Serve Handle.

Both seem to stall out on the _try_assign_replica call.

Any ideas on why the remote() call would never complete, or mitigations to avoid this deadlocking?

Below is py-spy dump and top output of the actor calling the deployment, for the two common cases seen:
Case #1:

Dump #1:

Process 9756: ray::InferenceJobHandlerActor.handle_job()
Python v3.7.12 (/venv/bin/python3.7)

Thread 9756 (idle): "MainThread"
    wait (threading.py:296)
    result (concurrent/futures/_base.py:430)
    remote (ray/serve/handle.py:236)
   < our application code >
    _resume_span (ray/util/tracing/tracing_helper.py:462)
    actor_method_executor (ray/_private/function_manager.py:701)
    main_loop (ray/worker.py:449)
    <module> (ray/workers/default_worker.py:235)
Thread 9783 (idle): "ray_import_thread"
    wait (threading.py:300)
    _wait_once (grpc/_common.py:106)
    wait (grpc/_common.py:148)
    result (grpc/_channel.py:735)
    _poll_locked (ray/_private/gcs_pubsub.py:274)
    poll (ray/_private/gcs_pubsub.py:410)
    _run (ray/_private/import_thread.py:67)
    run (threading.py:870)
    _bootstrap_inner (threading.py:926)
    _bootstrap (threading.py:890)
Thread 9792 (idle): "Thread-2"
    _actor_method_call (ray/actor.py:1176)
    invocation (ray/actor.py:161)
    _remote (ray/actor.py:168)
    _start_span (ray/util/tracing/tracing_helper.py:421)
    remote (ray/actor.py:122)
    _try_assign_replica (ray/serve/router.py:114)
    assign_replica (ray/serve/router.py:162)
    assign_request (ray/serve/router.py:217)
    _run (asyncio/events.py:88)
    _run_once (asyncio/base_events.py:1786)
    run_forever (asyncio/base_events.py:541)
    run (threading.py:870)
    _bootstrap_inner (threading.py:926)
    _bootstrap (threading.py:890)
Thread 9795 (idle): "Thread-4"
    wait (threading.py:300)
    wait (threading.py:552)
    run (tqdm/_monitor.py:60)
    _bootstrap_inner (threading.py:926)
    _bootstrap (threading.py:890)
Thread 9952 (idle): "Thread-26"
    channel_spin (grpc/_channel.py:1258)
    run (threading.py:870)
    _bootstrap_inner (threading.py:926)
    _bootstrap (threading.py:890)
Thread 9778 (idle): "Dummy-3"
    _write_to_self (asyncio/selector_events.py:137)
    call_soon_threadsafe (asyncio/base_events.py:742)
    _call_set_state (asyncio/futures.py:365)
    _invoke_callbacks (concurrent/futures/_base.py:324)
    set_exception (concurrent/futures/_base.py:537)

Top #1:

Collecting samples from 'ray::<our_actor>.<our_method>()' (python v3.7.12)
Total Samples 1300
GIL: 0.00%, Active: 0.00%, Threads: 6

  %Own   %Total  OwnTime  TotalTime  Function (filename:line)
  0.00%   0.00%   0.060s    0.060s   _write_to_self (asyncio/selector_events.py:137)
  0.00%   0.00%   0.020s    0.020s   channel_spin (grpc/_channel.py:1258)
  0.00%   0.00%   0.000s    0.020s   _bootstrap (threading.py:890)
  0.00%   0.00%   0.000s    0.060s   _invoke_callbacks (concurrent/futures/_base.py:324)
  0.00%   0.00%   0.000s    0.060s   set_exception (concurrent/futures/_base.py:537)
  0.00%   0.00%   0.000s    0.060s   _call_set_state (asyncio/futures.py:365)
  0.00%   0.00%   0.000s    0.020s   run (threading.py:870)
  0.00%   0.00%   0.000s    0.020s   _bootstrap_inner (threading.py:926)
  0.00%   0.00%   0.000s    0.060s   call_soon_threadsafe (asyncio/base_events.py:742)

Case #2:
Dump #2:

Python v3.7.12 (/venv/bin/python3.7)

Thread 413 (idle): "MainThread"
    wait (threading.py:296)
    result (concurrent/futures/_base.py:430)
    remote (ray/serve/handle.py:234)
   < our application code >
    _resume_span (ray/util/tracing/tracing_helper.py:451)
    actor_method_executor (ray/_private/function_manager.py:648)
    main_loop (ray/worker.py:430)
    <module> (ray/workers/default_worker.py:224)
Thread 520 (idle): "ray_import_thread"
    wait (threading.py:300)
    _wait_once (grpc/_common.py:106)
    wait (grpc/_common.py:148)
    result (grpc/_channel.py:735)
    _poll_locked (ray/_private/gcs_pubsub.py:270)
    poll (ray/_private/gcs_pubsub.py:409)
    _run (ray/_private/import_thread.py:71)
    run (threading.py:870)
    _bootstrap_inner (threading.py:926)
    _bootstrap (threading.py:890)
Thread 14879 (idle): "Thread-16"
    _actor_method_call (ray/actor.py:1022)
    invocation (ray/actor.py:166)
    _remote (ray/actor.py:172)
    _start_span (ray/util/tracing/tracing_helper.py:410)
    remote (ray/actor.py:124)
    _try_assign_replica (ray/serve/router.py:121)
    assign_replica (ray/serve/router.py:148)
    assign_request (ray/serve/router.py:220)
    _run (asyncio/events.py:88)
    _run_once (asyncio/base_events.py:1786)
    run_forever (asyncio/base_events.py:541)
    run (threading.py:870)
    _bootstrap_inner (threading.py:926)
    _bootstrap (threading.py:890)
Thread 14884 (idle): "Thread-17"
    wait (threading.py:300)
    wait (threading.py:552)
    run (tqdm/_monitor.py:60)
    _bootstrap_inner (threading.py:926)
    _bootstrap (threading.py:890)
Thread 12899 (idle): "Thread-123"
    channel_spin (grpc/_channel.py:1258)
    run (threading.py:870)
    _bootstrap_inner (threading.py:926)
    _bootstrap (threading.py:890)
Thread 499 (idle): "Dummy-9"

simon-mo · April 21, 2022, 6:37pm

Sending ~100s of requests from an actor to an autoscaling Serve deployment.

Through HTTP or Python? We do cancel requests that’s still waiting on replicas if their client drop the HTTP call.

spolcyn · April 21, 2022, 7:01pm

This is through the Python ServeHandle API.

simon-mo · April 21, 2022, 8:23pm

Ah unfortunately we don’t have a cancellation mechanism right now; nor a way to reject new submission if the handle is overloaded. Both should be straightforward to implement, can you help open new Github Issue for this?

simon-mo · April 21, 2022, 8:25pm

Using async handle might help. You can do it by getting handle with sync=False flag. And instead of doing ref = handle.remote() you can do ref = await handle.remote() to make it non blocking

spolcyn · April 21, 2022, 8:59pm

Thanks! I can try the sync=False flag. Will also post a Github issue.

Is your current thought that the Serve Handle is overloaded, and if it receives a request in that state, then it gets into some sort of inconsistent state and can’t make any progress?

spolcyn · April 26, 2022, 6:47pm

As an update to this, using the async handle did not resolve the problem – the key issue seems to be that the the bottom part of the call stack (below) calls synchronous code. When the call to replica.actor_handle.handle_request.remote() is called in _try_assign_replica, it seems to go to synchronous code that isn’t returning.

We got around this by making the call in a pre-emptible thread with a timeout. I’ll add that to the Github issue.

 _actor_method_call (ray/actor.py:1176)
invocation (ray/actor.py:161)
_remote (ray/actor.py:168)
_start_span (ray/util/tracing/tracing_helper.py:421)
remote (ray/actor.py:122)
_try_assign_replica (ray/serve/router.py:114)
assign_replica (ray/serve/router.py:142)

eladgm · April 16, 2023, 6:03pm

We’re facing the same issue when trying to make this remote() call from a pythonic thread. Were you also used threads on your original failed workflow, or was the remote() call invoked from the main thread?

Topic		Replies	Views
Ray actor.remote() call blocks forever Ray Core	11	1914	April 26, 2023
Actor remote function blocks on client Ray Core	10	499	April 20, 2022
Multiple requests to a single Actor Ray Core	5	450	September 29, 2021
.remote() call occasionally hangs Ray Serve	3	390	October 7, 2024
Calling serve._run hangs Ray Serve	2	59	August 13, 2024

Serve Handle Remote Calls Block Forever

Related topics