I am having an issue with KubeRay on Kubernetes. For my company, I am trying to set up a namespace in which users can programmatically start Ray clusters and submit jobs to them. As far as I can tell, the KubeRay operator is running correctly. However, when I try to get a cluster up and running, it fails with this error in the ray-head container:
RuntimeError: Failed to start GCS. Last 0 lines of error files:
But when I check the log file referenced by this error, it shows that the server started correctly:
[2023-11-29 17:26:07,320 I 19 19] (gcs_server) io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2023-11-29 17:26:07,321 I 19 19] (gcs_server) event.cc:234: Set ray event level to warning
[2023-11-29 17:26:07,321 I 19 19] (gcs_server) event.cc:342: Ray Event initialized for GCS
[2023-11-29 17:26:07,520 I 19 19] (gcs_server) gcs_server.cc:58: GCS storage type is memory
[2023-11-29 17:26:07,520 I 19 19] (gcs_server) gcs_init_data.cc:44: Loading job table data.
[2023-11-29 17:26:07,520 I 19 19] (gcs_server) gcs_init_data.cc:56: Loading node table data.
[2023-11-29 17:26:07,520 I 19 19] (gcs_server) gcs_init_data.cc:68: Loading cluster resources table data.
[2023-11-29 17:26:07,520 I 19 19] (gcs_server) gcs_init_data.cc:95: Loading actor table data.
[2023-11-29 17:26:07,520 I 19 19] (gcs_server) gcs_init_data.cc:108: Loading actor task spec table data.
[2023-11-29 17:26:07,520 I 19 19] (gcs_server) gcs_init_data.cc:81: Loading placement group table data.
[2023-11-29 17:26:07,520 I 19 19] (gcs_server) gcs_init_data.cc:48: Finished loading job table data, size = 0
[2023-11-29 17:26:07,520 I 19 19] (gcs_server) gcs_init_data.cc:60: Finished loading node table data, size = 0
[2023-11-29 17:26:07,520 I 19 19] (gcs_server) gcs_init_data.cc:72: Finished loading cluster resources table data, size = 0
[2023-11-29 17:26:07,520 I 19 19] (gcs_server) gcs_init_data.cc:99: Finished loading actor table data, size = 0
[2023-11-29 17:26:07,520 I 19 19] (gcs_server) gcs_init_data.cc:112: Finished loading actor task spec table data, size = 0
[2023-11-29 17:26:07,520 I 19 19] (gcs_server) gcs_init_data.cc:86: Finished loading placement group table data, size = 0
[2023-11-29 17:26:07,722 I 19 19] (gcs_server) grpc_server.cc:140: GcsServer server started, listening on port 6379.
When I check the autoscaler logs, I also see this:
Traceback (most recent call last):
  File "/default-pegasus-venv/lib/python3.8/site-packages/ray/_private/gcs_utils.py", line 123, in check_health
    resp = stub.CheckAlive(req, timeout=timeout)
  File "/default-pegasus-venv/lib/python3.8/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/default-pegasus-venv/lib/python3.8/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused"
debug_error_string = "UNKNOWN:Failed to pick subchannel {created_time:"2023-12-01T16:59:18.428204444+00:00", children:[UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused {created_time:"2023-12-01T16:59:18.42819749+00:00", grpc_status:14}]}"
So I’m not sure why the GCS server is refusing connections even though its log says it is listening on port 6379. The dashboard is also not operational, and the ray.init(address) call times out when trying to connect.
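To narrow this down, a plain TCP check against the GCS port can show whether anything is accepting connections there at all, independent of gRPC. The following is a minimal sketch, assuming it is run from inside the ray-head container (so that 127.0.0.1 reaches the GCS) and that the port is 6379 as shown in the log above. `can_connect` is a hypothetical helper name, not part of Ray.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    This only tests reachability at the TCP level; it does not speak gRPC
    and does not prove that the GCS itself is healthy.
    """
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Assumed values: loopback inside the head container, GCS port from the log.
    host, port = "127.0.0.1", 6379
    state = "open" if can_connect(host, port) else "refused or unreachable"
    print(f"{host}:{port} is {state}")
```

If this reports the port as open from inside the container but a connection to the pod IP or the head service name fails, the problem would more likely be in how the address is resolved or exposed than in the GCS process itself.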