The Ray cluster runs on Kubernetes (k8s). When I initialize Ray locally and try to submit code to the Ray cluster, I get an error. However, submitting tasks directly from within the Ray pod works fine. The error message contains 192.168.85.116:6379, which is the pod’s IP address, even though my local connection specifies the Ray cluster IP and NodePort. I don’t understand why the pod IP shows up when I submit locally. After restarting the Ray pod, everything works normally for a while, but the same issue recurs after some time. Currently, my only workaround is to keep restarting Ray. Is there any way to resolve this?
The error message is as follows:
gcs_rpc_client.h:179: Failed to connect to GCS at address 192.168.85.116:6379 within 5 seconds.
Hi there tao_sun,
Someone else recently ran into a similar issue. Maybe you can look at what they did and see whether it unblocks you as well? They had to allow connections to the port.
This does not solve my issue. My Ray server and the client that initiates tasks are not in the same Kubernetes cluster. At the time of Ray initialization, my client was functioning normally, using the host machine’s IP and NodePort. However, after some time, for reasons unknown, my client changes the Ray server’s IP to the Pod IP and port, which results in connection failures.
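One way to confirm the mismatch described above is to pull the address out of the error line and compare it with the address the client was configured with. The following is a minimal sketch, not part of Ray: the helper names are made up for illustration, and the configured host and port (203.0.113.10:30379) are placeholder values standing in for the real node IP and NodePort.

```python
import re

# Matches the "Failed to connect to GCS at address <ip>:<port>" part of the log line.
GCS_ADDRESS_RE = re.compile(
    r"Failed to connect to GCS at address (\d{1,3}(?:\.\d{1,3}){3}):(\d+)"
)


def extract_gcs_address(log_line):
    """Return (host, port) from a GCS connection error line, or None if absent."""
    match = GCS_ADDRESS_RE.search(log_line)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def dialed_configured_address(log_line, configured_host, configured_port):
    """True if the address in the error equals the one the client was given."""
    return extract_gcs_address(log_line) == (configured_host, configured_port)


# Error line quoted earlier in this thread.
log_line = (
    "gcs_rpc_client.h:179: Failed to connect to GCS at address "
    "192.168.85.116:6379 within 5 seconds."
)

# Placeholder values (assumptions): substitute the real node IP and NodePort.
CONFIGURED_HOST = "203.0.113.10"
CONFIGURED_PORT = 30379

print(extract_gcs_address(log_line))
print(dialed_configured_address(log_line, CONFIGURED_HOST, CONFIGURED_PORT))
```

If the second value is False, the client is dialing an address other than the one it was given, which matches the symptom described here.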
Hi tao, thanks for replying! After giving it some thought, I have a few suggestions. Can you let me know whether you’ve tried any of these?
1. Pod IP vs. NodePort Confusion
When you connect to Ray using the NodePort, it should keep that connection.
However, if Ray later tries to connect using the Pod IP, it might be because the GCS (Global Control Service) or another component restarted, causing Ray to resolve the address differently.
2. GCS & Redis Issues
If GCS is backed by Redis and GCS restarts, the stored IP address may change.
Try using a fully qualified domain name (FQDN) instead of an IP address so that all nodes consistently resolve to the correct GCS instance.
3. Consider Using KubeRay
If you aren’t already using KubeRay, consider switching to it.
KubeRay improves fault tolerance and lifecycle management for Ray components in Kubernetes!
4. Increase Reconnect Timeout
Maybe you can try setting the environment variable RAY_gcs_rpc_server_reconnect_timeout_s to a higher value, which gives Ray nodes more time to reconnect after a GCS restart.
5. Check Your Network Configuration
Verify that NodePort is correctly set and that your local machine can persistently connect to the Ray cluster.
Run kubectl get services to ensure the correct external IP is used.
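For the network check in point 5, a quick TCP probe run from the local machine can show which addresses are reachable at all. This is a rough sketch using only the Python standard library. The function name `can_connect` is made up for illustration, and the node address in the usage comment is a placeholder. It only tests whether a TCP connection opens, not whether Ray itself is healthy.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Example usage (placeholder node address; the Pod address is from the error message):
#   can_connect("203.0.113.10", 30379)    # node IP + NodePort
#   can_connect("192.168.85.116", 6379)   # Pod IP from the error message
```

If the NodePort address is reachable but the Pod IP is not, the failure is consistent with the client being handed an in-cluster address that it cannot route to from outside the cluster.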
Thank you for your professional reply and suggestions. We are already using KubeRay. If the address resolution error is really caused by GCS, I don’t think I can solve it myself. I can try increasing RAY_gcs_rpc_server_reconnect_timeout_s to allow more time for reconnecting.