Ray client fails to reconnect

Hey, we are trying to run a simple ML model using Ray, which also uses Ray Tune. We are using an EKS cluster with the ray-operator to run Ray. When we try to run our ML model in a Jupyter notebook, it fails with this error:

Request can’t be sent because the Ray client has already been disconnected due to an error. Last exception: Failed to reconnect within the reconnection grace period (30s)

Any insights on this? We tried changing our cluster config to use higher-CPU, better EC2 instances, but it still throws the error. Is there a way we can increase the number of threads?

Hi @Tanvi_Thakur,

Is it possible that you are opening too many clients to the cluster? The thread limit can be raised by setting the RAY_CLIENT_SERVER_MAX_THREADS environment variable on the server side. I believe the default is 50.
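
To make "on the server side" concrete, here is a minimal sketch, assuming the head node is launched from a Python wrapper with `ray start --head`. The helper names `head_env` and `start_head` and the value 200 are made up for illustration. On Kubernetes the same variable would presumably go into the environment of the head container instead, but I have not verified that against your setup.

```python
import os
import subprocess

# Assumption: 200 is an arbitrary illustrative value, not a recommendation.
MAX_THREADS = "200"


def head_env(base_env=None, max_threads=MAX_THREADS):
    """Return a copy of the environment with the Ray client server thread limit set."""
    env = dict(os.environ if base_env is None else base_env)
    # The variable has to be visible to the process that runs the Ray client
    # server (the head node), not to the notebook that calls ray.init().
    env["RAY_CLIENT_SERVER_MAX_THREADS"] = str(max_threads)
    return env


def start_head():
    """Hypothetical launcher: start the head node with the modified environment."""
    # Assumption: the `ray` CLI is on PATH and the head is started this way.
    subprocess.run(["ray", "start", "--head"], env=head_env(), check=True)
```

Setting the variable in the notebook would not help, since it is read by the server process.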

Hope this helps.

In the RayCluster K8s CRD, I tried changing RAY_gcs_server_rpc_server_thread_num, but the maximum it accepts is 10; beyond that it gives an error because the head stops accepting connections. Where is the RAY_CLIENT_SERVER_MAX_THREADS variable set?