Problem connecting client to cluster

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.

Currently, when I try to use ray.init to connect a client to the head node of a cluster (of two machines), I get the error message:

ConnectionError: Request can’t be sent because the Ray client has already been disconnected due to an error. Last exception: <_MultiThreadedRendezvous of RPC that terminated with:

status = StatusCode.NOT_FOUND
details = “Attempted to reconnect a session that has already been cleaned up”
debug_error_string = “{“created”:”@1668701483.254000000",“description”:“Error received from peer ipv4::”,“file”:“src/core/lib/surface/call.cc”,“file_line”:1075,“grpc_message”:“Attempted to reconnect a session that has already been cleaned up”,“grpc_status”:5}"

When I run only the head node standalone (without any other cluster machines connected), I do not get this error. It's only when I try to connect other machines to the head node. I am running on Windows. Happy to provide more info! Just frustrated because I’ve been stuck on this for a while.


Hi @Albert
I’m happy to help here. Do you mind sharing some logs from your head node?
The logs should be in /tmp/ray/session_latest/logs/ by default.

Could you check the gcs_server, ray_client_server, and raylet logs to see whether there is anything abnormal?
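To narrow the logs down quickly, something like the following could help. It is only a rough sketch: it assumes the default log directory mentioned above, treats the three component names as file-name prefixes, and uses a hypothetical marker list (`ERROR`, `Traceback`, `Failed`) as a stand-in for "abnormal". On Windows the directory will likely differ, so pass the actual path.

```python
from pathlib import Path

# Default log location mentioned above; pass another directory if yours differs.
DEFAULT_LOG_DIR = Path("/tmp/ray/session_latest/logs")
# File-name prefixes of the three components named above.
PREFIXES = ("gcs_server", "ray_client_server", "raylet")
# Strings treated as signs of trouble (an assumption; adjust as needed).
MARKERS = ("ERROR", "Traceback", "Failed")


def find_suspicious_lines(log_dir=DEFAULT_LOG_DIR, prefixes=PREFIXES, markers=MARKERS):
    """Return (file name, line number, text) for every line containing a marker."""
    hits = []
    for path in sorted(Path(log_dir).glob("*")):
        # Skip sub-directories and logs of components we are not interested in.
        if not path.is_file() or not path.name.startswith(tuple(prefixes)):
            continue
        with path.open(errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if any(marker in line for marker in markers):
                    hits.append((path.name, number, line.rstrip()))
    return hits


def print_report(log_dir=DEFAULT_LOG_DIR):
    """Print one 'file:line: text' row per suspicious line."""
    for name, number, text in find_suspicious_lines(log_dir):
        print(f"{name}:{number}: {text}")
```

Calling `print_report()` on the head node prints the matching lines with their file names, which is usually easier to paste into a reply than whole log files.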

In ray_client_server.err

INFO proxier.py:670 – New data connection from client [xxxx]
INFO proxier.py:340 – SpecificServer started on port: 23000 with PID: 8292 for client [xxxx]
ERROR proxier.py:723 – Proxying Datapath failed!
Traceback (most recent call last):
File “… ray\util\client\server\proxier.py”, line 716, in Datapath
for rep in rep_stream:
File “…grpc_channel.py”, line 426, in next
return self._next()
File “…grpc_channel.py”, line 826, in _next
raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.UNKNOWN
details = “Stream removed”
debug_error_string = “{“created”:”@16688818856.548000000",“description”:“Error received from peer ipv4:127.0.0.1:23000”,“file”:“src/core/lib/surface/call.cc”,“file_line”:1075,“grpc_message”:“Stream removed”,“grpc_status”:2}"

ERROR proxier.py:797 – Proxying Logstream failed!
Traceback (most recent call last):
File “… ray\util\client\server\proxier.py”, line 794, in Logstream
for rep in rep_stream:
File “…grpc_channel.py”, line 426, in next
return self._next()
File “…grpc_channel.py”, line 826, in _next
raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.UNKNOWN
details = “Stream removed”
debug_error_string = “{“created”:”@16688818856.548000000",“description”:“Error received from peer ipv4:127.0.0.1:23000”,“file”:“src/core/lib/surface/call.cc”,“file_line”:1075,“grpc_message”:“Stream removed”,“grpc_status”:2}"

INFO proxier.py:390 – Specific server [xxxxx] is no longer running, freeing its port 23000
INFO proxier.py:742 – [xxxxx] last started stream at 1668818843.641747. Current stream started at 1668818843.641747.

In gcs_server.out… I get the following repeated error messages:

[datetime] (gcs_server.exe) gcs_server.cc:285: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:

I did not see anything weird in raylet.out
Let me know if I can provide any additional information.

Is there anything that can be gleaned from what I posted earlier? Or maybe a general direction in which I should investigate to try to resolve this issue? It only happens when I connect an additional machine to the head.
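One direction that may be worth ruling out: the "failed to connect to all addresses" message in gcs_server.out, together with the fact that the problem only appears once a second machine joins, would be consistent with (though it does not prove) a firewall or routing problem between the machines. A minimal sketch for checking plain TCP reachability is below. The port numbers are assumptions based on Ray's usual defaults (6379 for the head node, 10001 for the Ray Client server), and `check_head_ports` is a hypothetical helper name; substitute whatever ports your `ray start` command actually uses.

```python
import socket


def port_is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_head_ports(host, ports=(6379, 10001), timeout=3.0):
    """Map each port to True/False depending on whether it accepts connections.

    The default ports are assumed Ray defaults, not values taken from this thread.
    """
    return {port: port_is_reachable(host, port, timeout) for port in ports}
```

Running `check_head_ports("<head node IP>")` from the worker machine and from the client machine shows whether each can open a connection to the head. If a port comes back `False` from one machine but `True` locally on the head, the firewall on the head (Windows Defender Firewall, in this case) is a reasonable next thing to look at.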

Hello, I am seeing the same error after upgrading our Ray image to versions later than 2.3.1 (tried KubeRay 0.5.0 and 1.1.0, and Ray versions 2.9, 2.11, and 2.12). I know the docs say this error is indicative of the Ray head recently restarting, but my head node has 0 restarts and I’m still seeing this. Any ideas?

2024-04-30 11:51:36.557 | INFO | am_analytics.utils.ray_config:ray_init:13 - Using existing cluster: ray://ray-kuberay-head-svc..svc.cluster.local:10001

2024-04-30 11:51:36,591 INFO client_builder.py:244 – Passing the following kwargs to ray.init() on the server: logging_level

2024-04-30 11:51:36,629 DEBUG worker.py:378 – client gRPC channel state change: ChannelConnectivity.IDLE

2024-04-30 11:51:36,831 DEBUG worker.py:378 – client gRPC channel state change: ChannelConnectivity.CONNECTING

2024-04-30 11:51:36,836 DEBUG worker.py:378 – client gRPC channel state change: ChannelConnectivity.READY

2024-04-30 11:51:36,837 DEBUG worker.py:818 – Pinging server.

SIGTERM handler is not set because current thread is not the main thread.

2024-04-30 11:52:19,358 DEBUG dataclient.py:333 – Recoverable error in data channel.

2024-04-30 11:52:19,358 DEBUG dataclient.py:334 – <_MultiThreadedRendezvous of RPC that terminated with:

status = StatusCode.UNAVAILABLE

details = “Socket closed”

debug_error_string = “UNKNOWN:Error received from peer {created_time:“2024-04-30T11:52:19.358331754+00:00”, grpc_status:14, grpc_message:“Socket closed”}”

2024-04-30 11:52:19,359 DEBUG worker.py:818 – Pinging server.

2024-04-30 11:52:19,361 ERROR dataclient.py:330 – Unrecoverable error in data channel.

2024-04-30 11:52:19,361 DEBUG dataclient.py:331 – <_MultiThreadedRendezvous of RPC that terminated with:

status = StatusCode.NOT_FOUND

details = “Attempted to reconnect a session that has already been cleaned up”

debug_error_string = “UNKNOWN:Error received from peer {created_time:“2024-04-30T11:52:19.360925106+00:00”, grpc_status:5, grpc_message:“Attempted to reconnect a session that has already been cleaned up”}”

2024-04-30 11:52:19,361 DEBUG dataclient.py:285 – Shutting down data channel.
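In the client log above, the data channel first reports UNAVAILABLE ("Socket closed") and only then NOT_FOUND on the reconnect attempt, so the order of the failures seems to matter more than the final message. For longer logs, a small helper like this one can pull out that sequence. It is a sketch that assumes the `status = StatusCode.X` / `details = “...”` layout shown in this thread; `grpc_failures` is a hypothetical name, not part of Ray or gRPC.

```python
import re

# Matches lines such as: status = StatusCode.UNAVAILABLE
STATUS_RE = re.compile(r"status = StatusCode\.(\w+)")
# Matches lines such as: details = “Socket closed” (straight or curly quotes)
DETAILS_RE = re.compile(r'details = ["“](.*?)["”]')


def grpc_failures(log_text):
    """Return (status, details) pairs in the order they appear in the log."""
    failures = []
    status = None
    for line in log_text.splitlines():
        status_match = STATUS_RE.search(line)
        if status_match:
            # Remember the status until its matching details line shows up.
            status = status_match.group(1)
            continue
        details_match = DETAILS_RE.search(line)
        if details_match and status is not None:
            failures.append((status, details_match.group(1)))
            status = None
    return failures
```

Applied to the excerpt above, this gives the UNAVAILABLE entry followed by the NOT_FOUND entry, which makes it easier to see what the connection was doing in the roughly 40 seconds before the socket closed.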