### What happened + What you expected to happen
The Ray head crashes when the worker node replicas are scaled down by the Kubernetes HPA.
```
[2024-04-11 16:39:09,766 E 539 649] gcs_rpc_client.h:552: Failed to connect to GCS within 60 seconds. GCS may have been killed. It's either GCS is terminated by `ray stop` or is killed unexpectedly. If it is killed unexpectedly, see the log file gcs_server.out. https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure. The program will terminate.
```
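If it helps, my understanding is that the 60-second window in that message comes from the internal `gcs_rpc_server_reconnect_timeout_s` config. A sketch of how I'd raise it when starting a worker node, assuming the `RAY_...` env-var override applies to this config (the head address below is a placeholder):

```python
import os
import subprocess

# Sketch: start a worker with a longer GCS reconnect window. Assumes the
# RAY_<name> env-var override applies to gcs_rpc_server_reconnect_timeout_s;
# the head address below is a placeholder.
env = dict(os.environ, RAY_gcs_rpc_server_reconnect_timeout_s="300")
subprocess.run(["ray", "start", "--address=10.155.0.1:6379"], env=env, check=True)
```

That only widens the reconnect window on workers, though; it doesn't explain why the GCS process itself segfaults.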
From `gcs_server.out`:
```
[2024-04-11 16:44:10,918 I 240 240] (gcs_server) gcs_actor_manager.cc:1038: Node 96bdb79bfcf67cdd29a31b4ea457ef6066269e14966ad5e69a352b03 failed, reconstructing actors.
[2024-04-11 16:44:10,918 I 240 240] (gcs_server) gcs_job_manager.cc:303: Node 96bdb79bfcf67cdd29a31b4ea457ef6066269e14966ad5e69a352b03 failed, mark all jobs from this node as finished
[2024-04-11 16:44:11,039 I 240 240] (gcs_server) gcs_node_manager.cc:99: Draining node info, node id = 84aa3962753455d7c57d809262faaac4ece059f5a8fd7237f648b6b1
[2024-04-11 16:44:11,039 I 240 240] (gcs_server) gcs_node_manager.cc:215: Removing node, node id = 84aa3962753455d7c57d809262faaac4ece059f5a8fd7237f648b6b1, node name = 10.155.5.123
[2024-04-11 16:44:11,039 I 240 240] (gcs_server) gcs_placement_group_manager.cc:763: Node 84aa3962753455d7c57d809262faaac4ece059f5a8fd7237f648b6b1 failed, rescheduling the placement groups on the dead node.
[2024-04-11 16:44:11,039 I 240 240] (gcs_server) gcs_actor_manager.cc:1038: Node 84aa3962753455d7c57d809262faaac4ece059f5a8fd7237f648b6b1 failed, reconstructing actors.
[2024-04-11 16:44:11,039 I 240 240] (gcs_server) gcs_job_manager.cc:303: Node 84aa3962753455d7c57d809262faaac4ece059f5a8fd7237f648b6b1 failed, mark all jobs from this node as finished
[2024-04-11 16:44:11,067 I 240 240] (gcs_server) gcs_node_manager.cc:99: Draining node info, node id = d6e171bca8f50153d4ce99e92c20b9f760cde0b05a3d78392b123854
[2024-04-11 16:44:11,067 I 240 240] (gcs_server) gcs_node_manager.cc:215: Removing node, node id = d6e171bca8f50153d4ce99e92c20b9f760cde0b05a3d78392b123854, node name = 10.155.143.110
[2024-04-11 16:44:11,067 I 240 240] (gcs_server) gcs_placement_group_manager.cc:763: Node d6e171bca8f50153d4ce99e92c20b9f760cde0b05a3d78392b123854 failed, rescheduling the placement groups on the dead node.
[2024-04-11 16:44:11,067 I 240 240] (gcs_server) gcs_actor_manager.cc:1038: Node d6e171bca8f50153d4ce99e92c20b9f760cde0b05a3d78392b123854 failed, reconstructing actors.
[2024-04-11 16:44:11,067 I 240 240] (gcs_server) gcs_job_manager.cc:303: Node d6e171bca8f50153d4ce99e92c20b9f760cde0b05a3d78392b123854 failed, mark all jobs from this node as finished
[2024-04-11 16:44:11,090 I 240 240] (gcs_server) gcs_node_manager.cc:140: Raylet 96bdb79bfcf67cdd29a31b4ea457ef6066269e14966ad5e69a352b03 is drained. Status GrpcUnavailable: RPC Error message: Cancelling all calls; RPC Error details: . The information will be published to the cluster.
[2024-04-11 16:44:11,193 I 240 240] (gcs_server) gcs_node_manager.cc:140: Raylet 84aa3962753455d7c57d809262faaac4ece059f5a8fd7237f648b6b1 is drained. Status GrpcUnavailable: RPC Error message: Cancelling all calls; RPC Error details: . The information will be published to the cluster.
[2024-04-11 16:44:11,214 I 240 240] (gcs_server) gcs_node_manager.cc:140: Raylet d6e171bca8f50153d4ce99e92c20b9f760cde0b05a3d78392b123854 is drained. Status GrpcUnavailable: RPC Error message: Cancelling all calls; RPC Error details: . The information will be published to the cluster.
[2024-04-11 16:44:11,228 I 240 270] (gcs_server) ray_syncer-inl.h:308: Failed to read the message from: 84aa3962753455d7c57d809262faaac4ece059f5a8fd7237f648b6b1
[2024-04-11 16:44:11,237 I 240 240] (gcs_server) gcs_node_manager.cc:99: Draining node info, node id = 4e8107ce80e786fc6b6175d88425aecdad253bf1d685659edc07c4bd
[2024-04-11 16:44:11,237 I 240 240] (gcs_server) gcs_node_manager.cc:215: Removing node, node id = 4e8107ce80e786fc6b6175d88425aecdad253bf1d685659edc07c4bd, node name = 10.155.158.61
[2024-04-11 16:44:11,237 I 240 240] (gcs_server) gcs_placement_group_manager.cc:763: Node 4e8107ce80e786fc6b6175d88425aecdad253bf1d685659edc07c4bd failed, rescheduling the placement groups on the dead node.
[2024-04-11 16:44:11,237 I 240 240] (gcs_server) gcs_actor_manager.cc:1038: Node 4e8107ce80e786fc6b6175d88425aecdad253bf1d685659edc07c4bd failed, reconstructing actors.
[2024-04-11 16:44:11,237 I 240 240] (gcs_server) gcs_job_manager.cc:303: Node 4e8107ce80e786fc6b6175d88425aecdad253bf1d685659edc07c4bd failed, mark all jobs from this node as finished
[2024-04-11 16:44:11,241 E 240 281] (gcs_server) logging.cc:361: *** SIGSEGV received at time=1712853851 on cpu 0 ***
[2024-04-11 16:44:11,241 E 240 281] (gcs_server) logging.cc:361: PC: @ 0x5574dc4b433d (unknown) absl::lts_20220623::inlined_vector_internal::Storage<>::DestroyContents()
[2024-04-11 16:44:11,241 E 240 281] (gcs_server) logging.cc:361: @ 0x7f5838f67980 1552 (unknown)
[2024-04-11 16:44:11,241 E 240 281] (gcs_server) logging.cc:361: @ 0x5574dc4b4a19 64 absl::lts_20220623::Status::UnrefNonInlined()
[2024-04-11 16:44:11,241 E 240 281] (gcs_server) logging.cc:361: @ 0x5574dc44617d 32 grpc_core::promise_filter_detail::ServerCallData::~ServerCallData()
[2024-04-11 16:44:11,241 E 240 281] (gcs_server) logging.cc:361: @ 0x5574dc156ba0 32 _ZZN9grpc_core22MakePromiseBasedFilterINS_16HttpServerFilterELNS_14FilterEndpointE1ELh1EEENSt9enable_ifIXsrSt10is_base_ofINS_13ChannelFilterET_E5valueE19grpc_channel_filterE4typeEPKcENUlP17grpc_call_elementPK20grpc_call_final_infoP12grpc_closureE4_4_FUNESE_SH_SJ_
[2024-04-11 16:44:11,241 E 240 281] (gcs_server) logging.cc:361: @ 0x5574dc42019d 80 grpc_call_stack_destroy()
[2024-04-11 16:44:11,241 E 240 281] (gcs_server) logging.cc:361: @ 0x5574dc42ed07 64 grpc_core::FilterStackCall::DestroyCall()
[2024-04-11 16:44:11,241 E 240 281] (gcs_server) logging.cc:361: @ 0x5574dc491134 80 grpc_core::ExecCtx::Flush()
[2024-04-11 16:44:11,241 E 240 281] (gcs_server) logging.cc:361: @ 0x5574dc422f08 304 cq_next()
[2024-04-11 16:44:11,241 E 240 281] (gcs_server) logging.cc:361: @ 0x5574dc0cd0b5 48 grpc::(anonymous namespace)::CallbackAlternativeCQ::Ref()::{lambda()#1}::_FUN()
[2024-04-11 16:44:11,241 E 240 281] (gcs_server) logging.cc:361: @ 0x5574dc4af0f6 112 grpc_core::(anonymous namespace)::ThreadInternalsPosix::ThreadInternalsPosix()::{lambda()#1}::_FUN()
[2024-04-11 16:44:11,241 E 240 281] (gcs_server) logging.cc:361: @ 0x7f5838f5c6db (unknown) start_thread
```
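The node IDs in that log can be correlated with pod IPs via the Ray state API (a sketch; I'm assuming `list_nodes` also reports `DEAD` nodes and that a head is reachable when it runs):

```python
from ray.util.state import list_nodes

# Sketch: list node states as tracked by the GCS, to match the node IDs
# in gcs_server.out against pod IPs. Assumes a reachable head node.
for node in list_nodes(detail=True):
    print(node.node_id, node.state, node.node_ip)
```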
I don’t have any clues about this problem yet. Could you give me some directions and suggestions for troubleshooting?
By the way, I also found lots of raylet logs like the following:
```
[2024-04-11 16:44:31,287 I 455 455] (raylet) ray_syncer.cc:229: Connection is broken. Reconnect to node: 00000000000000000000000000000000000000000000000000000000
[2024-04-11 16:44:31,287 I 455 455] (raylet) ray_syncer-inl.h:308: Failed to read the message from: 00000000000000000000000000000000000000000000000000000000
[2024-04-11 16:44:31,287 I 455 455] (raylet) ray_syncer-inl.h:292: Failed to send the message to: 00000000000000000000000000000000000000000000000000000000
```
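The all-zero ID looks like the nil node ID, so I read these as the raylet losing its syncer connection to the GCS itself (my assumption). For reference, a quick way to check whether the GCS port on the head is still reachable from a worker pod (a sketch; the head IP is a placeholder, 6379 is Ray's default GCS port):

```python
import socket

# Sketch: probe the GCS port on the head from a worker pod.
# The head IP is a placeholder; 6379 is Ray's default GCS port.
HEAD_GCS = ("10.155.0.1", 6379)
try:
    with socket.create_connection(HEAD_GCS, timeout=5):
        print("GCS port is reachable")
except OSError as err:
    print(f"GCS port unreachable: {err}")
```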
### Versions / Dependencies
ray==2.8.0
python3.10
### Reproduction script
None yet.
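For context, the shape of the workload when this happens is roughly the following (an illustrative sketch, not a verified reproduction; the actor workload stands in for our real jobs):

```python
import ray

ray.init(address="auto")

@ray.remote
class Pinger:
    def ping(self):
        return "ok"

# Long-lived actors spread across the worker nodes. While they run, the
# Kubernetes HPA scales the worker replicas down, killing several pods
# at roughly the same time; shortly afterwards the GCS on the head crashes.
actors = [Pinger.remote() for _ in range(16)]
print(ray.get([a.ping.remote() for a in actors]))
```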
### Issue Severity
High: It blocks me from completing my task.