How severe does this issue affect your experience of using Ray?
- High: It blocks me to complete my task.
Consistently reproducible on Ray 1.11 & Ray 1.12, though we don’t have a small repro script yet.
Background:
- Running on k8s cluster
- Sending ~100s of requests from an actor to an autoscaling Serve deployment.
- Many actors (3-7), each sending to one autoscaling Serve deployment.
On a job as described in the background, we routinely (at least one actor, typically more) will stall out at the beginning or midway through the job. Typically (but not always) this happens after at least one Serve replica fails (usually due to k8s pod shuffling), the stack trace of which we see in the logs of the Ray Actor that makes the call to the Serve Handle. However, that exception doesn’t seem to cause the calling actor to fail, or the remote()
call to the Serve Handle.
Both seem to stall out on the _try_assign_replica
call.
Any ideas on why the remote()
call would never complete, or mitigations to avoid this deadlocking?
Below is py-spy
dump
and top
output of the actor calling the deployment, for the two common cases seen:
Case #1:
Dump #1:
Process 9756: ray::InferenceJobHandlerActor.handle_job()
Python v3.7.12 (/venv/bin/python3.7)
Thread 9756 (idle): "MainThread"
wait (threading.py:296)
result (concurrent/futures/_base.py:430)
remote (ray/serve/handle.py:236)
< our application code >
_resume_span (ray/util/tracing/tracing_helper.py:462)
actor_method_executor (ray/_private/function_manager.py:701)
main_loop (ray/worker.py:449)
<module> (ray/workers/default_worker.py:235)
Thread 9783 (idle): "ray_import_thread"
wait (threading.py:300)
_wait_once (grpc/_common.py:106)
wait (grpc/_common.py:148)
result (grpc/_channel.py:735)
_poll_locked (ray/_private/gcs_pubsub.py:274)
poll (ray/_private/gcs_pubsub.py:410)
_run (ray/_private/import_thread.py:67)
run (threading.py:870)
_bootstrap_inner (threading.py:926)
_bootstrap (threading.py:890)
Thread 9792 (idle): "Thread-2"
_actor_method_call (ray/actor.py:1176)
invocation (ray/actor.py:161)
_remote (ray/actor.py:168)
_start_span (ray/util/tracing/tracing_helper.py:421)
remote (ray/actor.py:122)
_try_assign_replica (ray/serve/router.py:114)
assign_replica (ray/serve/router.py:162)
assign_request (ray/serve/router.py:217)
_run (asyncio/events.py:88)
_run_once (asyncio/base_events.py:1786)
run_forever (asyncio/base_events.py:541)
run (threading.py:870)
_bootstrap_inner (threading.py:926)
_bootstrap (threading.py:890)
Thread 9795 (idle): "Thread-4"
wait (threading.py:300)
wait (threading.py:552)
run (tqdm/_monitor.py:60)
_bootstrap_inner (threading.py:926)
_bootstrap (threading.py:890)
Thread 9952 (idle): "Thread-26"
channel_spin (grpc/_channel.py:1258)
run (threading.py:870)
_bootstrap_inner (threading.py:926)
_bootstrap (threading.py:890)
Thread 9778 (idle): "Dummy-3"
_write_to_self (asyncio/selector_events.py:137)
call_soon_threadsafe (asyncio/base_events.py:742)
_call_set_state (asyncio/futures.py:365)
_invoke_callbacks (concurrent/futures/_base.py:324)
set_exception (concurrent/futures/_base.py:537)
Top #1:
Collecting samples from 'ray::<our_actor>.<our_method>()' (python v3.7.12)
Total Samples 1300
GIL: 0.00%, Active: 0.00%, Threads: 6
%Own %Total OwnTime TotalTime Function (filename:line)
0.00% 0.00% 0.060s 0.060s _write_to_self (asyncio/selector_events.py:137)
0.00% 0.00% 0.020s 0.020s channel_spin (grpc/_channel.py:1258)
0.00% 0.00% 0.000s 0.020s _bootstrap (threading.py:890)
0.00% 0.00% 0.000s 0.060s _invoke_callbacks (concurrent/futures/_base.py:324)
0.00% 0.00% 0.000s 0.060s set_exception (concurrent/futures/_base.py:537)
0.00% 0.00% 0.000s 0.060s _call_set_state (asyncio/futures.py:365)
0.00% 0.00% 0.000s 0.020s run (threading.py:870)
0.00% 0.00% 0.000s 0.020s _bootstrap_inner (threading.py:926)
0.00% 0.00% 0.000s 0.060s call_soon_threadsafe (asyncio/base_events.py:742)
Case #2:
Dump #2:
Python v3.7.12 (/venv/bin/python3.7)
Thread 413 (idle): "MainThread"
wait (threading.py:296)
result (concurrent/futures/_base.py:430)
remote (ray/serve/handle.py:234)
< our application code >
_resume_span (ray/util/tracing/tracing_helper.py:451)
actor_method_executor (ray/_private/function_manager.py:648)
main_loop (ray/worker.py:430)
<module> (ray/workers/default_worker.py:224)
Thread 520 (idle): "ray_import_thread"
wait (threading.py:300)
_wait_once (grpc/_common.py:106)
wait (grpc/_common.py:148)
result (grpc/_channel.py:735)
_poll_locked (ray/_private/gcs_pubsub.py:270)
poll (ray/_private/gcs_pubsub.py:409)
_run (ray/_private/import_thread.py:71)
run (threading.py:870)
_bootstrap_inner (threading.py:926)
_bootstrap (threading.py:890)
Thread 14879 (idle): "Thread-16"
_actor_method_call (ray/actor.py:1022)
invocation (ray/actor.py:166)
_remote (ray/actor.py:172)
_start_span (ray/util/tracing/tracing_helper.py:410)
remote (ray/actor.py:124)
_try_assign_replica (ray/serve/router.py:121)
assign_replica (ray/serve/router.py:148)
assign_request (ray/serve/router.py:220)
_run (asyncio/events.py:88)
_run_once (asyncio/base_events.py:1786)
run_forever (asyncio/base_events.py:541)
run (threading.py:870)
_bootstrap_inner (threading.py:926)
_bootstrap (threading.py:890)
Thread 14884 (idle): "Thread-17"
wait (threading.py:300)
wait (threading.py:552)
run (tqdm/_monitor.py:60)
_bootstrap_inner (threading.py:926)
_bootstrap (threading.py:890)
Thread 12899 (idle): "Thread-123"
channel_spin (grpc/_channel.py:1258)
run (threading.py:870)
_bootstrap_inner (threading.py:926)
_bootstrap (threading.py:890)
Thread 499 (idle): "Dummy-9"