Serve Handle Remote Calls Block Forever

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.

Consistently reproducible on Ray 1.11 & Ray 1.12, though we don’t have a small repro script yet.

Background:

  1. Running on k8s cluster
  2. Sending ~100s of requests from an actor to an autoscaling Serve deployment.
  3. Many actors (3-7), each sending to one autoscaling Serve deployment.

On a job as described in the background, we routinely (at least one actor, typically more) will stall out at the beginning or midway through the job. Typically (but not always) this happens after at least one Serve replica fails (usually due to k8s pod shuffling), the stack trace of which we see in the logs of the Ray Actor that makes the call to the Serve Handle. However, that exception doesn’t seem to cause the calling actor to fail, or the remote() call to the Serve Handle.

Both seem to stall out on the _try_assign_replica call.

Any ideas on why the remote() call would never complete, or mitigations to avoid this deadlocking?

Below is py-spy dump and top output of the actor calling the deployment, for the two common cases seen:
Case #1:

Dump #1:

Process 9756: ray::InferenceJobHandlerActor.handle_job()
Python v3.7.12 (/venv/bin/python3.7)

Thread 9756 (idle): "MainThread"
    wait (threading.py:296)
    result (concurrent/futures/_base.py:430)
    remote (ray/serve/handle.py:236)
   < our application code >
    _resume_span (ray/util/tracing/tracing_helper.py:462)
    actor_method_executor (ray/_private/function_manager.py:701)
    main_loop (ray/worker.py:449)
    <module> (ray/workers/default_worker.py:235)
Thread 9783 (idle): "ray_import_thread"
    wait (threading.py:300)
    _wait_once (grpc/_common.py:106)
    wait (grpc/_common.py:148)
    result (grpc/_channel.py:735)
    _poll_locked (ray/_private/gcs_pubsub.py:274)
    poll (ray/_private/gcs_pubsub.py:410)
    _run (ray/_private/import_thread.py:67)
    run (threading.py:870)
    _bootstrap_inner (threading.py:926)
    _bootstrap (threading.py:890)
Thread 9792 (idle): "Thread-2"
    _actor_method_call (ray/actor.py:1176)
    invocation (ray/actor.py:161)
    _remote (ray/actor.py:168)
    _start_span (ray/util/tracing/tracing_helper.py:421)
    remote (ray/actor.py:122)
    _try_assign_replica (ray/serve/router.py:114)
    assign_replica (ray/serve/router.py:162)
    assign_request (ray/serve/router.py:217)
    _run (asyncio/events.py:88)
    _run_once (asyncio/base_events.py:1786)
    run_forever (asyncio/base_events.py:541)
    run (threading.py:870)
    _bootstrap_inner (threading.py:926)
    _bootstrap (threading.py:890)
Thread 9795 (idle): "Thread-4"
    wait (threading.py:300)
    wait (threading.py:552)
    run (tqdm/_monitor.py:60)
    _bootstrap_inner (threading.py:926)
    _bootstrap (threading.py:890)
Thread 9952 (idle): "Thread-26"
    channel_spin (grpc/_channel.py:1258)
    run (threading.py:870)
    _bootstrap_inner (threading.py:926)
    _bootstrap (threading.py:890)
Thread 9778 (idle): "Dummy-3"
    _write_to_self (asyncio/selector_events.py:137)
    call_soon_threadsafe (asyncio/base_events.py:742)
    _call_set_state (asyncio/futures.py:365)
    _invoke_callbacks (concurrent/futures/_base.py:324)
    set_exception (concurrent/futures/_base.py:537)

Top #1:

Collecting samples from 'ray::<our_actor>.<our_method>()' (python v3.7.12)
Total Samples 1300
GIL: 0.00%, Active: 0.00%, Threads: 6

  %Own   %Total  OwnTime  TotalTime  Function (filename:line)
  0.00%   0.00%   0.060s    0.060s   _write_to_self (asyncio/selector_events.py:137)
  0.00%   0.00%   0.020s    0.020s   channel_spin (grpc/_channel.py:1258)
  0.00%   0.00%   0.000s    0.020s   _bootstrap (threading.py:890)
  0.00%   0.00%   0.000s    0.060s   _invoke_callbacks (concurrent/futures/_base.py:324)
  0.00%   0.00%   0.000s    0.060s   set_exception (concurrent/futures/_base.py:537)
  0.00%   0.00%   0.000s    0.060s   _call_set_state (asyncio/futures.py:365)
  0.00%   0.00%   0.000s    0.020s   run (threading.py:870)
  0.00%   0.00%   0.000s    0.020s   _bootstrap_inner (threading.py:926)
  0.00%   0.00%   0.000s    0.060s   call_soon_threadsafe (asyncio/base_events.py:742)

Case #2:
Dump #2:

Python v3.7.12 (/venv/bin/python3.7)

Thread 413 (idle): "MainThread"
    wait (threading.py:296)
    result (concurrent/futures/_base.py:430)
    remote (ray/serve/handle.py:234)
   < our application code >
    _resume_span (ray/util/tracing/tracing_helper.py:451)
    actor_method_executor (ray/_private/function_manager.py:648)
    main_loop (ray/worker.py:430)
    <module> (ray/workers/default_worker.py:224)
Thread 520 (idle): "ray_import_thread"
    wait (threading.py:300)
    _wait_once (grpc/_common.py:106)
    wait (grpc/_common.py:148)
    result (grpc/_channel.py:735)
    _poll_locked (ray/_private/gcs_pubsub.py:270)
    poll (ray/_private/gcs_pubsub.py:409)
    _run (ray/_private/import_thread.py:71)
    run (threading.py:870)
    _bootstrap_inner (threading.py:926)
    _bootstrap (threading.py:890)
Thread 14879 (idle): "Thread-16"
    _actor_method_call (ray/actor.py:1022)
    invocation (ray/actor.py:166)
    _remote (ray/actor.py:172)
    _start_span (ray/util/tracing/tracing_helper.py:410)
    remote (ray/actor.py:124)
    _try_assign_replica (ray/serve/router.py:121)
    assign_replica (ray/serve/router.py:148)
    assign_request (ray/serve/router.py:220)
    _run (asyncio/events.py:88)
    _run_once (asyncio/base_events.py:1786)
    run_forever (asyncio/base_events.py:541)
    run (threading.py:870)
    _bootstrap_inner (threading.py:926)
    _bootstrap (threading.py:890)
Thread 14884 (idle): "Thread-17"
    wait (threading.py:300)
    wait (threading.py:552)
    run (tqdm/_monitor.py:60)
    _bootstrap_inner (threading.py:926)
    _bootstrap (threading.py:890)
Thread 12899 (idle): "Thread-123"
    channel_spin (grpc/_channel.py:1258)
    run (threading.py:870)
    _bootstrap_inner (threading.py:926)
    _bootstrap (threading.py:890)
Thread 499 (idle): "Dummy-9"

Sending ~100s of requests from an actor to an autoscaling Serve deployment.

Through HTTP or Python? We do cancel requests that’s still waiting on replicas if their client drop the HTTP call.

This is through the Python ServeHandle API.

Ah unfortunately we don’t have a cancellation mechanism right now; nor a way to reject new submission if the handle is overloaded. Both should be straightforward to implement, can you help open new Github Issue for this?

Using async handle might help. You can do it by getting handle with sync=False flag. And instead of doing ref = handle.remote() you can do ref = await handle.remote() to make it non blocking

Thanks! I can try the sync=False flag. Will also post a Github issue.

Is your current thought that the Serve Handle is overloaded, and if it receives a request in that state, then it gets into some sort of inconsistent state and can’t make any progress?

As an update to this, using the async handle did not resolve the problem – the key issue seems to be that the the bottom part of the call stack (below) calls synchronous code. When the call to replica.actor_handle.handle_request.remote() is called in _try_assign_replica, it seems to go to synchronous code that isn’t returning.

We got around this by making the call in a pre-emptible thread with a timeout. I’ll add that to the Github issue.

 _actor_method_call (ray/actor.py:1176)
invocation (ray/actor.py:161)
_remote (ray/actor.py:168)
_start_span (ray/util/tracing/tracing_helper.py:421)
remote (ray/actor.py:122)
_try_assign_replica (ray/serve/router.py:114)
assign_replica (ray/serve/router.py:142)