How to analyze or debug the connection procedure

As mentioned before ([ray1.0.0] stuck when connecting to existing ray cluster), I have hit the same problem again: the worker node can't connect to the head node, even though all the required ports are open (tested by telnet).

>>> ray.init(address="auto")
2021-03-17 16:09:09,013 INFO worker.py:634 -- Connecting to existing Ray cluster at address: 192.168.250.10:6379

It just hangs with no output.
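A way to get a bit more output while it hangs is to raise the driver-side log level; a minimal sketch, assuming the logging_level argument of ray.init:

import logging
import ray

# More verbose driver-side logs while connecting; logging_level is an
# argument of ray.init. The backend (raylet) can reportedly also be made
# more verbose by exporting RAY_BACKEND_LOG_LEVEL=debug before `ray start`
# (an assumption based on the Ray logging docs, not verified in this thread).
ray.init(address="auto", logging_level=logging.DEBUG)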

What version of Ray are you using, and what's your operating system?

  • CentOS Linux release 7.8.2003
  • Ray 1.0.0
  • Python 3.6.9

It seems to be stuck in ray._raylet.CoreWorker.

Do you use a VPN? Also, if you just do this:

ray start --head
# start a script that runs ray.init(address='auto')

Does it still hang?
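For example, a minimal script along these lines is enough; the ping task here is just an illustration, not part of your project:

import ray

# Connects to the cluster started by `ray start --head` on the same machine.
ray.init(address="auto")

@ray.remote
def ping():
    return "pong"

# If the connection works, this should return almost immediately.
print(ray.get(ping.remote()))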

I’m using docker run --net host ..., no VPN.
I did what you suggested, but sometimes it hangs on the head node as well.

Hmm, this kind of hang usually has many different causes. Does it still happen if you use a more recent version of Ray? If it does, I'd like to set up a pair-debugging session to help you resolve it.

I tried Ray 1.2.0. It can connect to the head, but all tasks are executed on the node that runs the Python script (even though all the other nodes are idle), and there are not many detailed logs. The raylet.out shows:

Actually, Ray 1.2 has more strange problems compared to Ray 1.0 when I run the same project code.
In Ray 1.2.0, the driver shows:


debug_state.txt

  1. Tasks executing on the same node are expected: tasks only start to be scheduled on remote nodes once the local node is saturated. If you set your head node's CPUs to 0, are those tasks scheduled on the worker nodes? (See the sketch below.)
  2. Those messages can happen for various reasons. Are your actors eventually created, or does it keep hanging?
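For item 1, a rough way to check: start the head with `ray start --head --num-cpus=0` so it advertises no CPUs, start the workers as usual, and then run something like the sketch below (the where_am_i task is only an illustration):

import socket
import ray

ray.init(address="auto")

@ray.remote
def where_am_i():
    # Reports the hostname of the node that actually ran the task.
    return socket.gethostname()

# With the head advertising 0 CPUs, these should all come back from worker nodes.
print(ray.get([where_am_i.remote() for _ in range(10)]))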

Also, if you are in the public Slack, I'd love to take a look at your issue in a video chat.