Problem connecting client to cluster

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Currently, when I use ray.init to connect a client to the head node of a cluster (of two machines), I get this error message:

ConnectionError: Request can’t be sent because the Ray client has already been disconnected due to an error. Last exception: <_MultiThreadedRendezvous of RPC that terminated with:

status = StatusCode.NOT_FOUND
details = "Attempted to reconnect a session that has already been cleaned up"
debug_error_string = "{"created":"@1668701483.254000000","description":"Error received from peer ipv4::","file":"src/core/lib/surface/call.cc","file_line":1075,"grpc_message":"Attempted to reconnect a session that has already been cleaned up","grpc_status":5}"

When I run only the head node standalone (without any other cluster machines connected), I do not get this error. It only happens when I try to connect other machines to the head node. I am running on Windows. Happy to provide more info! I'm just frustrated because I've been stuck on this for a while.
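
For context, a Ray Client connection of the kind described above typically looks like the sketch below. The host name is a placeholder and the port assumes the default Ray Client server port (10001); neither comes from the original post, and the helper names are made up for illustration.

```python
def client_address(host, port=10001):
    """Build a Ray Client address; 10001 is the default Ray Client server port."""
    return f"ray://{host}:{port}"


def connect_and_report(host):
    """Connect via Ray Client, print the cluster resources, then disconnect."""
    import ray  # imported here so client_address works without Ray installed

    ray.init(address=client_address(host))
    try:
        # With two machines joined, resources from both nodes should show up here.
        print(ray.cluster_resources())
    finally:
        ray.shutdown()


if __name__ == "__main__":
    connect_and_report("head-node.example")  # placeholder host, not from the post
```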

Hi @Albert
I'm happy to help here. Do you mind sharing some logs from your head node?
The logs should be in /tmp/ray/session_latest/logs/ by default.

Could you check the gcs_server, ray_client_server, and raylet logs to see whether there is anything abnormal?
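
If it helps, here is a minimal sketch for pulling suspicious lines out of those three logs instead of reading them by hand. The function name `find_suspicious_lines` and the keyword list are illustrative choices, not part of Ray, and the directory is simply the default mentioned above; adjust it if the session directory lives elsewhere on your machine.

```python
from pathlib import Path

# Components named above; Ray writes files such as gcs_server.out and raylet.err.
COMPONENTS = ("gcs_server", "ray_client_server", "raylet")
# Assumed keywords that often mark a problem; extend as needed.
KEYWORDS = ("ERROR", "Failed", "Traceback")


def find_suspicious_lines(log_dir, components=COMPONENTS, keywords=KEYWORDS):
    """Return (file name, line number, text) for each log line containing a keyword."""
    hits = []
    log_dir = Path(log_dir)
    if not log_dir.is_dir():
        return hits
    prefixes = tuple(components)
    for path in sorted(log_dir.iterdir()):
        # Only look at files belonging to the components of interest.
        if not path.is_file() or not path.name.startswith(prefixes):
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if any(keyword in line for keyword in keywords):
                hits.append((path.name, number, line.strip()))
    return hits


if __name__ == "__main__":
    for name, number, line in find_suspicious_lines("/tmp/ray/session_latest/logs"):
        print(f"{name}:{number}: {line}")
```

Running it on the head node prints one line per match, which makes it easier to paste only the relevant parts here.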

In ray_client_server.err

INFO proxier.py:670 -- New data connection from client [xxxx]
INFO proxier.py:340 -- SpecificServer started on port: 23000 with PID: 8292 for client [xxxx]
ERROR proxier.py:723 -- Proxying Datapath failed!
Traceback (most recent call last):
  File "… ray\util\client\server\proxier.py", line 716, in Datapath
    for rep in rep_stream:
  File "…grpc\_channel.py", line 426, in __next__
    return self._next()
  File "…grpc\_channel.py", line 826, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.UNKNOWN
details = "Stream removed"
debug_error_string = "{"created":"@16688818856.548000000","description":"Error received from peer ipv4:127.0.0.1:23000","file":"src/core/lib/surface/call.cc","file_line":1075,"grpc_message":"Stream removed","grpc_status":2}"

ERROR proxier.py:797 -- Proxying Logstream failed!
Traceback (most recent call last):
  File "… ray\util\client\server\proxier.py", line 794, in Logstream
    for rep in rep_stream:
  File "…grpc\_channel.py", line 426, in __next__
    return self._next()
  File "…grpc\_channel.py", line 826, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.UNKNOWN
details = "Stream removed"
debug_error_string = "{"created":"@16688818856.548000000","description":"Error received from peer ipv4:127.0.0.1:23000","file":"src/core/lib/surface/call.cc","file_line":1075,"grpc_message":"Stream removed","grpc_status":2}"

INFO proxier.py:390 -- Specific server [xxxxx] is no longer running, freeing its port 23000
INFO proxier.py:742 -- [xxxxx] last started stream at 1668818843.641747. Current stream started at 1668818843.641747.

In gcs_server.out, I get the following error message repeatedly:

[datetime] (gcs_server.exe) gcs_server.cc:285: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:

I did not see anything unusual in raylet.out.
Let me know if I can provide any additional information.

Is there anything that can be gleaned from what I posted earlier? Or maybe a general direction in which I should investigate to try to resolve this issue? It only happens when I connect an additional machine to the head.
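
One direction that may be worth ruling out: the gcs_server.out message says "failed to connect to all addresses", which generally means a TCP connection could not be established, so plain network reachability between the two machines (for example, a Windows firewall rule blocking Ray's ports) is a candidate. Below is a minimal sketch using only the standard library. `can_connect` is a hypothetical helper, the host name is a placeholder, and the ports are assumptions (6379 is the default for `ray start --head`, 10001 the default Ray Client server port), so substitute whatever the cluster actually uses.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    head_host = "head-node.example"  # placeholder, not taken from the thread
    for port in (6379, 10001):  # assumed default ports; adjust to your setup
        status = "reachable" if can_connect(head_host, port) else "NOT reachable"
        print(f"{head_host}:{port} is {status}")
```

Running this from the second machine toward the head, and then from the head toward the second machine, would show whether either direction is blocked. This checks only connectivity and does not rule out other causes.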