GCS fails to start in Kubernetes using the KubeRay Operator

I am having an issue with KubeRay on Kubernetes. For my company, I am trying to set up a namespace in which users can programmatically start up Ray clusters and submit jobs to them. As far as I can tell, the KubeRay operator is running correctly. However, when I try to get a cluster up and running, it fails with this error in the ray-head container (a trimmed sketch of my RayCluster manifest is included below the error):

RuntimeError: Failed to start GCS.  Last 0 lines of error files:

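For context, the RayCluster manifest I apply looks roughly like this (trimmed down; the image tag and resource values are illustrative rather than my exact config):

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: example-cluster
spec:
  rayVersion: "2.8.0"
  headGroupSpec:
    rayStartParams:
      dashboard-host: "0.0.0.0"   # expose the dashboard outside the head pod
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.8.0
            resources:            # requests/limits shown here are illustrative
              requests:
                cpu: "500m"
                memory: 512Mi
              limits:
                cpu: "500m"
                memory: 512Mi
  workerGroupSpecs:
    - groupName: workers
      replicas: 1
      minReplicas: 1
      maxReplicas: 4
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray:2.8.0
              resources:
                requests:
                  cpu: "500m"
                  memory: 512Mi
                limits:
                  cpu: "500m"
                  memory: 512Mi
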
But when I check the log file referenced by the RuntimeError, it shows that the GCS server started correctly:

[2023-11-29 17:26:07,320 I 19 19] (gcs_server) io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2023-11-29 17:26:07,321 I 19 19] (gcs_server) event.cc:234: Set ray event level to warning
[2023-11-29 17:26:07,321 I 19 19] (gcs_server) event.cc:342: Ray Event initialized for GCS
[2023-11-29 17:26:07,520 I 19 19] (gcs_server) gcs_server.cc:58: GCS storage type is memory
[2023-11-29 17:26:07,520 I 19 19] (gcs_server) gcs_init_data.cc:44: Loading job table data.
[2023-11-29 17:26:07,520 I 19 19] (gcs_server) gcs_init_data.cc:56: Loading node table data.
[2023-11-29 17:26:07,520 I 19 19] (gcs_server) gcs_init_data.cc:68: Loading cluster resources table data.
[2023-11-29 17:26:07,520 I 19 19] (gcs_server) gcs_init_data.cc:95: Loading actor table data.
[2023-11-29 17:26:07,520 I 19 19] (gcs_server) gcs_init_data.cc:108: Loading actor task spec table data.
[2023-11-29 17:26:07,520 I 19 19] (gcs_server) gcs_init_data.cc:81: Loading placement group table data.
[2023-11-29 17:26:07,520 I 19 19] (gcs_server) gcs_init_data.cc:48: Finished loading job table data, size = 0
[2023-11-29 17:26:07,520 I 19 19] (gcs_server) gcs_init_data.cc:60: Finished loading node table data, size = 0
[2023-11-29 17:26:07,520 I 19 19] (gcs_server) gcs_init_data.cc:72: Finished loading cluster resources table data, size = 0
[2023-11-29 17:26:07,520 I 19 19] (gcs_server) gcs_init_data.cc:99: Finished loading actor table data, size = 0
[2023-11-29 17:26:07,520 I 19 19] (gcs_server) gcs_init_data.cc:112: Finished loading actor task spec table data, size = 0
[2023-11-29 17:26:07,520 I 19 19] (gcs_server) gcs_init_data.cc:86: Finished loading placement group table data, size = 0
[2023-11-29 17:26:07,722 I 19 19] (gcs_server) grpc_server.cc:140: GcsServer server started, listening on port 6379.

When I check the autoscaler logs, I also see this:

Traceback (most recent call last):
  File "/default-pegasus-venv/lib/python3.8/site-packages/ray/_private/gcs_utils.py", line 123, in check_health
    resp = stub.CheckAlive(req, timeout=timeout)
  File "/default-pegasus-venv/lib/python3.8/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/default-pegasus-venv/lib/python3.8/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
 status = StatusCode.UNAVAILABLE
 details = "failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused"
 debug_error_string = "UNKNOWN:Failed to pick subchannel {created_time:"2023-12-01T16:59:18.428204444+00:00", children:[UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused {created_time:"2023-12-01T16:59:18.42819749+00:00", grpc_status:14}]}"

So I’m not sure why the GCS server cannot be reached. The dashboard is also not operational, and the ray.init(address) call times out when trying to connect to the cluster.

In case anyone else comes across this: one simple fix was to massively increase the resource requests and limits for the Ray pods. I had moved from KubeRay v0.5 to v1.0 and didn’t think to increase the limits. My cluster is now up and running. I also made sure all the correct permissions were set in my operator YAML, as it was missing some key pieces.
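
For anyone who wants a concrete reference, the change that got things working was essentially bumping the head (and worker) container resources in the RayCluster manifest, roughly like this (the exact numbers are illustrative; use whatever your nodes can accommodate):

# headGroupSpec.template.spec.containers[0].resources in the RayCluster manifest
resources:
  requests:
    cpu: "2"
    memory: 4Gi
  limits:
    cpu: "2"
    memory: 4Gi

If I remember correctly, the KubeRay docs also recommend setting requests equal to limits for Ray containers, which is what I did here.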