Hey.
I have a Ray cluster with two workers and I am trying to get Ray Train working. It hangs, and I would like to get to the bottom of it.
In the thread "Ray Train hangs for long time", @kai mentioned using py-spy. Where exactly should I run the py-spy command?
kai
The best approach is to list the Ray processes on the machine that runs the Ray workers, with something like
ps a | grep ray
The output might look something like this:
...
19790 s001 SN+ 0:01.86 ray::IDLE
19791 s001 SN+ 0:01.85 ray::IDLE
19857 s001 SN+ 0:01.07 ray::Actor
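With many processes on a node, picking out the worker PIDs can be scripted. The following is only a minimal sketch: `find_ray_workers` and `list_local_ray_workers` are hypothetical helper names, not part of Ray, and the parsing assumes the `ps a` layout shown above (PID in the first column, process title starting at the first `ray::` token).

```python
import subprocess
from typing import List, Tuple


def find_ray_workers(ps_output: str) -> List[Tuple[int, str]]:
    """Return (pid, title) pairs for every line that shows a ray:: process."""
    workers = []
    for line in ps_output.splitlines():
        fields = line.split()
        # Skip blank lines and the header row (its first field is not a PID).
        if not fields or not fields[0].isdigit():
            continue
        for i, field in enumerate(fields):
            if field.startswith("ray::"):
                # Keep everything from the ray:: token onward as the title.
                workers.append((int(fields[0]), " ".join(fields[i:])))
                break
    return workers


def list_local_ray_workers() -> List[Tuple[int, str]]:
    """Run `ps a` on this machine and filter it (assumes `ps` is on PATH)."""
    listing = subprocess.run(
        ["ps", "a"], capture_output=True, text=True, check=True
    ).stdout
    return find_ray_workers(listing)
```

Calling `list_local_ray_workers()` on each node of the cluster would give the candidate PIDs for that node. The `grep ray` process itself is not matched, because its arguments do not start with `ray::`.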
You can then run py-spy against the PID of your Ray worker (for example, the process running the “Actor” class). This usually needs sudo:
> sudo py-spy dump --pid 19857
Password:
Process 19857: ray::Actor
Python v3.7.7 (/Users/kai/.pyenv/versions/3.7.7/bin/python3.7)
Thread 0x1134DB600 (idle): "MainThread"
    main_loop (ray/_private/worker.py:754)
    <module> (ray/_private/workers/default_worker.py:237)
Thread 0x70000EFBC000 (idle): "ray_import_thread"
    wait (threading.py:300)
    _wait_once (grpc/_common.py:106)
    wait (grpc/_common.py:148)
    result (grpc/_channel.py:735)
    _poll_locked (ray/_private/gcs_pubsub.py:249)
    poll (ray/_private/gcs_pubsub.py:385)
    _run (ray/_private/import_thread.py:70)
    run (threading.py:870)
    _bootstrap_inner (threading.py:926)
    _bootstrap (threading.py:890)
Thread 0x70000F4BF000 (idle): "Thread-1"
    channel_spin (grpc/_channel.py:1258)
    run (threading.py:870)
    _bootstrap_inner (threading.py:926)
    _bootstrap (threading.py:890)
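When chasing a hang, the first frame listed under each thread is usually the most useful, since py-spy prints the most recent call first. Below is a rough sketch for condensing a saved dump into one entry per thread. `summarize_threads` is a hypothetical helper name, and the regular expression assumes thread headers of the form seen in the dump above (`Thread <id> (<state>): "<name>"`).

```python
import re
from typing import Dict

# Matches headers such as: Thread 0x1134DB600 (idle): "MainThread"
# The quoted name is treated as optional in case a thread is unnamed.
THREAD_HEADER = re.compile(r'^Thread (\S+) \(([^)]*)\)(?:: "(.*)")?$')


def summarize_threads(dump_text: str) -> Dict[str, dict]:
    """Map each thread id to its name, state, and first listed frame."""
    threads: Dict[str, dict] = {}
    current = None
    for raw in dump_text.splitlines():
        line = raw.strip()
        match = THREAD_HEADER.match(line)
        if match:
            current = {"name": match.group(3), "state": match.group(2), "top": None}
            threads[match.group(1)] = current
        elif current is not None and line and current["top"] is None:
            # First non-empty line after a header is the most recent call.
            current["top"] = line
    return threads
```

Applied to the dump above, this would report `main_loop (ray/_private/worker.py:754)` as the top frame of "MainThread" with state `idle`, which would suggest that this particular worker is waiting for work rather than stuck inside training code. A worker whose main thread sits in your own training function is the more likely place to look.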