Ray client fails to reconnect

Tanvi_Thakur · November 18, 2021, 1:31am

Hey, we're trying to run a simple ML model with Ray, which also uses Ray Tune. We are using an EKS cluster with the ray-operator. When we run the model in a Jupyter notebook, it fails with this error:

Request can’t be sent because the Ray client has already been disconnected due to an error. Last exception: Failed to reconnect within the reconnection grace period (30s)

Any insights on this? We tried changing our cluster config to use higher-CPU, better EC2 instances, but it still throws the same error. Is there a way we can increase the number of threads?

samrogers226 · November 18, 2021, 7:39pm

Hi @Tanvi_Thakur,

Is it possible that you are opening too many clients to the cluster? The limit on client connections can be raised by setting the RAY_CLIENT_SERVER_MAX_THREADS environment variable on the server side. I believe the default is 50.
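As a rough sketch of what "on the server side" means here: the variable has to be in the environment of the process that starts the Ray head, not in the notebook. The example below assumes the head is launched by hand with `ray start --head`. The helper names `head_env` and `start_head` and the value 200 are made up for illustration. On a ray-operator deployment the head is started from the pod spec instead, so treat this only as an illustration of the idea.

```python
import os
import subprocess


def head_env(max_threads, base_env=None):
    """Return a copy of the environment with the client-server thread cap set.

    head_env is a hypothetical helper, not part of Ray.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Environment variable values must be strings.
    env["RAY_CLIENT_SERVER_MAX_THREADS"] = str(max_threads)
    return env


def start_head(max_threads):
    """Start a Ray head with the variable set in its environment.

    Assumes the `ray` CLI is installed and on PATH; not called in this sketch.
    """
    return subprocess.run(
        ["ray", "start", "--head"],
        env=head_env(max_threads),
        check=True,
    )


# 200 is an arbitrary illustrative value, not a recommendation.
example_env = head_env(200, base_env={"PATH": "/usr/bin"})
```

Setting the variable in the notebook process would not have any effect on the head, since the limit is read by the server.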

Hope this helps.

Tanvi_Thakur · November 18, 2021, 8:09pm

In the rayclusters K8s CRD, I tried changing RAY_gcs_server_rpc_server_thread_num, but the maximum it accepts is 10. Above that, it errors out because the head stops accepting connections. Where is the RAY_CLIENT_SERVER_MAX_THREADS variable set?
