we are doing some ray data repartition, batch etc. And then the program seems to hangs at repartition.
ray 2.4.0
did a py-spy for ray::main on head node and found that
$ py-spy dump -p 737
Process 737: ray::main
Python v3.10.2 (/export/apps/python/3.10.2/bin/python3.10)
Thread 737 (idle): "MainThread"
get_output_blocking (ray/data/_internal/execution/streaming_executor_state.py:222)
get_next (ray/data/_internal/execution/streaming_executor.py:103)
__next__ (ray/data/_internal/execution/interfaces.py:484)
_bundles_to_block_list (ray/data/_internal/execution/legacy_compat.py:334)
execute_to_legacy_block_list (ray/data/_internal/execution/legacy_compat.py:114)
execute (ray/data/_internal/plan.py:580)
write_datasource (ray/data/dataset.py:2927)
main (jobuser/synopai/summarization.py:147)
main_loop (ray/_private/worker.py:844)
<module> (ray/_private/workers/default_worker.py:258)
Thread 804 (idle): "ray_import_thread"
wait (threading.py:324)
_wait_once (grpc/_common.py:106)
wait (grpc/_common.py:148)
result (grpc/_channel.py:733)
_poll_locked (ray/_private/gcs_pubsub.py:258)
poll (ray/_private/gcs_pubsub.py:399)
_run (ray/_private/import_thread.py:77)
run (threading.py:946)
_bootstrap_inner (threading.py:1009)
_bootstrap (threading.py:966)
Thread 1586 (idle): "Thread-2"
get_objects (ray/_private/worker.py:742)
get (ray/_private/worker.py:2515)
wrapper (ray/_private/client_mode_hook.py:105)
_split_all_blocks (ray/data/_internal/split.py:196)
_split_at_indices (ray/data/_internal/split.py:283)
split_at_indices (ray/data/dataset.py:1452)
fast_repartition (ray/data/_internal/fast_repartition.py:35)
do_fast_repartition (ray/data/_internal/stage_impl.py:86)
bulk_fn (ray/data/_internal/execution/legacy_compat.py:312)
inputs_done (ray/data/_internal/execution/operators/all_to_all_operator.py:65)
process_completed_tasks (ray/data/_internal/execution/streaming_executor_state.py:328)
_scheduling_loop_step (ray/data/_internal/execution/streaming_executor.py:205)
run (ray/data/_internal/execution/streaming_executor.py:157)
_bootstrap_inner (threading.py:1009)
_bootstrap (threading.py:966)
Thread 431064 (idle): "Thread-302 (_run)"
channel_spin (grpc/_channel.py:1258)
run (threading.py:946)
_bootstrap_inner (threading.py:1009)
_bootstrap (threading.py:966)
The ray_import-thread is blocked at a grpc call
# Wait for result to become available, or cancel if the
# subscriber has closed.
while True:
try:
# Use 1s timeout to check for subscriber closing
# periodically.
fut.result(timeout=1)
break
Want to understand whether this ray_import_thread is on the critical path and whether it is the root cause of hanging.