GCS fails to start in Kubernetes using the KubeRay Operator

EuanScottWatson · December 4, 2023, 2:33pm

I am having an issue with KubeRay on Kubernetes. For my company, I am trying to get a namespace working so that users can programmatically start up Ray clusters and give them jobs to do. I have the KubeRay operator running correctly, as far as I know. However, when I try to get the cluster up and running, the program fails with this error in the ray-head container:

RuntimeError: Failed to start GCS.  Last 0 lines of error files:

But when I check the log file specified by this error, it shows that the server started correctly:

[2023-11-29 17:26:07,320 I 19 19] (gcs_server) io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2023-11-29 17:26:07,321 I 19 19] (gcs_server) event.cc:234: Set ray event level to warning
[2023-11-29 17:26:07,321 I 19 19] (gcs_server) event.cc:342: Ray Event initialized for GCS
[2023-11-29 17:26:07,520 I 19 19] (gcs_server) gcs_server.cc:58: GCS storage type is memory
[2023-11-29 17:26:07,520 I 19 19] (gcs_server) gcs_init_data.cc:44: Loading job table data.
[2023-11-29 17:26:07,520 I 19 19] (gcs_server) gcs_init_data.cc:56: Loading node table data.
[2023-11-29 17:26:07,520 I 19 19] (gcs_server) gcs_init_data.cc:68: Loading cluster resources table data.
[2023-11-29 17:26:07,520 I 19 19] (gcs_server) gcs_init_data.cc:95: Loading actor table data.
[2023-11-29 17:26:07,520 I 19 19] (gcs_server) gcs_init_data.cc:108: Loading actor task spec table data.
[2023-11-29 17:26:07,520 I 19 19] (gcs_server) gcs_init_data.cc:81: Loading placement group table data.
[2023-11-29 17:26:07,520 I 19 19] (gcs_server) gcs_init_data.cc:48: Finished loading job table data, size = 0
[2023-11-29 17:26:07,520 I 19 19] (gcs_server) gcs_init_data.cc:60: Finished loading node table data, size = 0
[2023-11-29 17:26:07,520 I 19 19] (gcs_server) gcs_init_data.cc:72: Finished loading cluster resources table data, size = 0
[2023-11-29 17:26:07,520 I 19 19] (gcs_server) gcs_init_data.cc:99: Finished loading actor table data, size = 0
[2023-11-29 17:26:07,520 I 19 19] (gcs_server) gcs_init_data.cc:112: Finished loading actor task spec table data, size = 0
[2023-11-29 17:26:07,520 I 19 19] (gcs_server) gcs_init_data.cc:86: Finished loading placement group table data, size = 0
[2023-11-29 17:26:07,722 I 19 19] (gcs_server) grpc_server.cc:140: GcsServer server started, listening on port 6379.

When I check the autoscaler logs, I also see this:

Traceback (most recent call last):
  File "/default-pegasus-venv/lib/python3.8/site-packages/ray/_private/gcs_utils.py", line 123, in check_health
    resp = stub.CheckAlive(req, timeout=timeout)
  File "/default-pegasus-venv/lib/python3.8/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/default-pegasus-venv/lib/python3.8/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
 status = StatusCode.UNAVAILABLE
 details = "failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused"
 debug_error_string = "UNKNOWN:Failed to pick subchannel {created_time:"2023-12-01T16:59:18.428204444+00:00", children:[UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused {created_time:"2023-12-01T16:59:18.42819749+00:00", grpc_status:14}]}"

So I’m not sure why the GCS server is refusing connections, given that its own log says it is listening on port 6379. The dashboard is also not operational, and the ray.init(address) call times out when trying to connect to the cluster.
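
For anyone debugging the same "Connection refused" symptom, a minimal sketch of a plain TCP probe against the GCS port may help separate "nothing is listening" from "listening but unhealthy". It uses only the Python standard library. The function name gcs_port_reachable is hypothetical, and the port default of 6379 is taken from the gcs_server log above. A successful connect only shows that something accepts TCP connections on that port; it is not the same as Ray's own CheckAlive health check.

```python
import socket


def gcs_port_reachable(host: str, port: int = 6379, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Hypothetical helper for illustration; 6379 is the port reported in the
    gcs_server log. This checks TCP reachability only, not GCS health.
    """
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name-resolution failures.
        return False
```

Run from inside the head pod against 127.0.0.1, and then from another pod against the head service address, this can hint at whether the process itself is down or whether the problem sits between pods.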

EuanScottWatson · December 6, 2023, 1:43pm

In case anyone else comes across this, one simple fix was to massively increase the resources. I had moved from KubeRay v0.5 to v1.0 and hadn’t thought to increase the limits. My cluster is now up and running. I also made sure I had all the correct permissions set in my operator YAML, as it was missing some key features.
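
As a rough illustration of the resources part of that fix, here is a sketch that inspects a RayCluster manifest (already loaded into a Python dict) and reports head-pod containers with no CPU or memory limit. It assumes the usual RayCluster layout, where the head pod template lives under spec.headGroupSpec.template. The function name head_resource_gaps is hypothetical, and the sketch does not say how large the limits need to be, since the post does not give the values that worked.

```python
from typing import Any, Dict, List


def head_resource_gaps(manifest: Dict[str, Any]) -> List[str]:
    """Report head-pod containers that have no cpu or memory limit set.

    Hypothetical helper; assumes the head pod template is found at
    spec.headGroupSpec.template.spec.containers in the manifest.
    """
    containers = (
        manifest.get("spec", {})
        .get("headGroupSpec", {})
        .get("template", {})
        .get("spec", {})
        .get("containers", [])
    )
    gaps: List[str] = []
    for container in containers:
        # Treat a missing or null `resources`/`limits` section as empty.
        limits = (container.get("resources") or {}).get("limits") or {}
        for key in ("cpu", "memory"):
            if key not in limits:
                gaps.append(f"{container.get('name', '<unnamed>')}: no {key} limit")
    return gaps
```

An empty result only means that limits are present, not that they are high enough for the head node to start GCS.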
