How to analyze or debug the connection procedure

As mentioned before ([ray1.0.0] stuck when connecting to existing ray cluster), I have hit the same problem again: the worker node can't connect to the head node, even though all the required ports are open (tested by telnet).

>>> ray.init(address="auto")
2021-03-17 16:09:09,013 INFO worker.py:634 -- Connecting to existing Ray cluster at address: 192.168.250.10:6379

It just hangs with no output.
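A way to get a bit more output while it hangs is to raise the driver-side log level; a minimal sketch, assuming the logging_level argument of ray.init:

import logging
import ray

# More verbose driver-side logs while connecting; logging_level is an
# argument of ray.init. The backend (raylet) can reportedly also be made
# more verbose by exporting RAY_BACKEND_LOG_LEVEL=debug before `ray start`
# (an assumption based on the Ray logging docs, not verified in this thread).
ray.init(address="auto", logging_level=logging.DEBUG)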

What version of Ray are you using, and what's your operating system?

  • CentOS Linux release 7.8.2003
  • Ray 1.0.0
  • Python 3.6.9

It seems to be stuck in ray._raylet.CoreWorker.

Do you use a VPN? Also, if you just do this:

ray start --head
# start a script that runs ray.init(address='auto')

Does it still hang?
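For example, a minimal script along these lines is enough; the ping task here is just an illustration, not part of your project:

import ray

# Connects to the cluster started by `ray start --head` on the same machine.
ray.init(address="auto")

@ray.remote
def ping():
    return "pong"

# If the connection works, this should return almost immediately.
print(ray.get(ping.remote()))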

I’m using docker run --net host ..., no VPN.
I did what you suggested, but sometimes it hangs on the head node as well.

Hmm, this kind of hang usually has many different causes. Does it still happen if you use a more recent version of Ray? If it does, I'd like to set up a pair-debugging session to help you resolve it.

I tried Ray 1.2.0. It can connect to the head, but all tasks are executed on the node that runs the Python script (even though all the other nodes are idle), and there are not many detailed logs. The raylet.out shows:

Actually, Ray 1.2 has more strange problems compared to Ray 1.0 when I run the same project code.
In Ray 1.2.0, the driver shows:


debug_state.txt

  1. Tasks executing on the same node are expected: tasks only start to be scheduled on remote nodes once the local node is saturated. If you set your head node's CPUs to 0, are those tasks scheduled on the worker nodes? (See the sketch below.)
  2. Those messages can happen for various reasons. Are your actors eventually created, or does it keep hanging?
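For item 1, a rough way to check: start the head with `ray start --head --num-cpus=0` so it advertises no CPUs, start the workers as usual, and then run something like the sketch below (the where_am_i task is only an illustration):

import socket
import ray

ray.init(address="auto")

@ray.remote
def where_am_i():
    # Reports the hostname of the node that actually ran the task.
    return socket.gethostname()

# With the head advertising 0 CPUs, these should all come back from worker nodes.
print(ray.get([where_am_i.remote() for _ in range(10)]))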

Also, if you are in the public Slack, I'd love to take a look at your issue in a video chat.