Gcs_rpc_client.h:179: Failed to connect to GCS at address 192.168.85.116:6379 within 5 seconds

tao_sun · January 23, 2025, 8:14pm

The Ray cluster runs on Kubernetes (k8s). When I initialize Ray locally and try to submit code to the Ray cluster, it results in an error. However, submitting tasks directly from within the Ray pod works fine. The error message contains 192.168.85.116:6379, which is the IP address of the pod, but my local connection specifies the Ray cluster IP and NodePort. I don’t understand why the pod IP is thrown after submitting locally. After restarting the Ray pod, everything works normally for a while, but the same issue reoccurs after some time. Currently, I can only keep restarting Ray. Is there any way to resolve this?

The error message is as follows:
gcs_rpc_client.h:179: Failed to connect to GCS at address 192.168.85.116:6379 within 5 seconds.

christina · January 29, 2025, 12:01am

Hi there tao_sun,
Someone else recently ran into a similar issue. Maybe you can look at what they did and see whether it unblocks you as well? They had to allow connections to the port.

Christy

tao_sun · February 11, 2025, 2:44am

This does not solve my issue. My Ray server and the client that initiates tasks are not in the same Kubernetes cluster. At the time of Ray initialization, my client was functioning normally, using the host machine’s IP and NodePort. However, after some time, for reasons unknown, my client changes the Ray server’s IP to the Pod IP and port, which results in connection failures.

christina · February 11, 2025, 8:22pm

Hi tao, thanks for replying! After doing some thinking, can you let me know if you’ve tried these things out?

1. Pod IP vs. NodePort Confusion

When you connect to Ray using the NodePort, it should keep that connection.
However, if Ray later tries to connect using the Pod IP, it might be because the GCS (Global Control Service) or another component restarted, causing Ray to resolve the address differently.

2. GCS & Redis Issues

If GCS is backed by Redis and it restarts, it may change the stored IP address.
Try using a fully qualified domain name (FQDN) instead of an IP address so that all nodes consistently resolve to the correct GCS instance.
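To make the FQDN suggestion concrete, here is a minimal sketch of how a Kubernetes Service name maps to its in-cluster DNS name and how to check what it resolves to. The service name `raycluster-head-svc` and the namespace `default` are hypothetical placeholders, and `cluster.local` is only the default cluster domain. Note that this name normally resolves only from inside the cluster, so it helps Ray nodes find the GCS but may not help a client running outside Kubernetes.

```python
import socket


def service_fqdn(service: str, namespace: str, cluster_domain: str = "cluster.local") -> str:
    """Build the in-cluster DNS name of a Kubernetes Service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


def resolve(hostname: str, port: int) -> list[str]:
    """Return the sorted IP addresses a hostname resolves to, or [] if it does not resolve."""
    try:
        infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


# Hypothetical names; replace with your own head Service and namespace.
gcs_host = service_fqdn("raycluster-head-svc", "default")
```

Running `resolve(gcs_host, 6379)` from a pod in the cluster should list the address behind the head Service. An empty list means the name does not resolve from where you ran it.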

3. Consider Using KubeRay

If you aren’t already using KubeRay, consider switching to it.
KubeRay improves fault tolerance and lifecycle management for Ray components in Kubernetes!

4. Increase Reconnect Timeout

Maybe you can try setting the environment variable RAY_gcs_rpc_server_reconnect_timeout_s to a higher value, which gives Ray nodes more time to reconnect after a GCS restart.

5. Check Your Network Configuration

Verify that NodePort is correctly set and that your local machine can persistently connect to the Ray cluster.
Run kubectl get services to ensure the correct external IP is used.
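As a rough way to verify that your local machine can still reach the cluster, the sketch below tries a plain TCP connection to a host and port. It uses only the Python standard library and does not talk to Ray itself, so it only tells you whether the port is reachable, not whether the GCS behind it is healthy. The 5-second default mirrors the timeout in the error message; the address you pass in is whatever node IP and NodePort you use in your setup.

```python
import socket


def can_connect(host: str, port: int, timeout_s: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers connection refused, unreachable hosts, DNS failures, and timeouts.
        return False
```

For example, calling `can_connect("<node-ip>", <node-port>)` from the client machine when the error appears, and again with the Pod IP and 6379, can show which of the two addresses is actually reachable from outside the cluster.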


tao_sun · February 12, 2025, 1:20am

Thank you for your professional reply and suggestions. We are currently using KubeRay. If the address resolution error is really caused by GCS, I don’t think I can solve it myself. I can try increasing RAY_gcs_rpc_server_reconnect_timeout_s to allow more time for reconnecting.
