Some confusion about raylet.out

vox · April 12, 2024, 8:37am

I found lots of logs like:

[2024-04-11 16:44:11,284 I 455 455] (raylet) ray_syncer-inl.h:308: Failed to read the message from: 00000000000000000000000000000000000000000000000000000000
[2024-04-11 16:44:13,284 I 455 455] (raylet) ray_syncer.cc:229: Connection is broken. Reconnect to node: 00000000000000000000000000000000000000000000000000000000
[2024-04-11 16:44:13,284 I 455 455] (raylet) ray_syncer-inl.h:308: Failed to read the message from: 00000000000000000000000000000000000000000000000000000000
[2024-04-11 16:44:13,284 I 455 455] (raylet) ray_syncer-inl.h:292: Failed to send the message to: 00000000000000000000000000000000000000000000000000000000
[2024-04-11 16:44:15,285 I 455 455] (raylet) ray_syncer.cc:229: Connection is broken. Reconnect to node: 00000000000000000000000000000000000000000000000000000000
...
...

why node is 00000000000000000000000000000000000000000000000000000000

what do these mean？What are the possible situations where this kind of log will print?

ruisearch42 · April 17, 2024, 6:19pm

It looks like the node_id is incorrectly set for this node. How did you start Ray? Is it a cluster or from a script?

vox · April 18, 2024, 3:54am

The triggering condition seems to occur when gcs server is restarted.

github.com/ray-project/ray

head node crash by Failed to connect to GCS within 60 seconds

opened 09:20AM - 12 Apr 24 UTC

vonsago

bug triage core

### What happened + What you expected to happen ray head crash when work node r…educe replica by k8s hpa. `[2024-04-11 16:39:09,766 E 539 649] gcs_rpc_client.h:552: Failed to connect to GCS within 60 seconds. GCS may have been killed. It's either GCS is terminated by `ray stop` or is killed unexpectedly. If it is killed unexpectedly, see the log file gcs_server.out. https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure. The program will terminate.'` gcs_server.out ``` [2024-04-11 16:44:10,918 I 240 240] (gcs_server) gcs_actor_manager.cc:1038: Node 96bdb79bfcf67cdd29a31b4ea457ef6066269e14966ad5e69a352b03 failed, reconstructing actors. [2024-04-11 16:44:10,918 I 240 240] (gcs_server) gcs_job_manager.cc:303: Node 96bdb79bfcf67cdd29a31b4ea457ef6066269e14966ad5e69a352b03 failed, mark all jobs from this node as finished [2024-04-11 16:44:11,039 I 240 240] (gcs_server) gcs_node_manager.cc:99: Draining node info, node id = 84aa3962753455d7c57d809262faaac4ece059f5a8fd7237f648b6b1 [2024-04-11 16:44:11,039 I 240 240] (gcs_server) gcs_node_manager.cc:215: Removing node, node id = 84aa3962753455d7c57d809262faaac4ece059f5a8fd7237f648b6b1, node name = 10.155.5.123 [2024-04-11 16:44:11,039 I 240 240] (gcs_server) gcs_placement_group_manager.cc:763: Node 84aa3962753455d7c57d809262faaac4ece059f5a8fd7237f648b6b1 failed, rescheduling the placement groups on the dead node. [2024-04-11 16:44:11,039 I 240 240] (gcs_server) gcs_actor_manager.cc:1038: Node 84aa3962753455d7c57d809262faaac4ece059f5a8fd7237f648b6b1 failed, reconstructing actors. [2024-04-11 16:44:11,039 I 240 240] (gcs_server) gcs_job_manager.cc:303: Node 84aa3962753455d7c57d809262faaac4ece059f5a8fd7237f648b6b1 failed, mark all jobs from this node as finished [2024-04-11 16:44:11,067 I 240 240] (gcs_server) gcs_node_manager.cc:99: Draining node info, node id = d6e171bca8f50153d4ce99e92c20b9f760cde0b05a3d78392b123854 [2024-04-11 16:44:11,067 I 240 240] (gcs_server) gcs_node_manager.cc:215: Removing node, node id = d6e171bca8f50153d4ce99e92c20b9f760cde0b05a3d78392b123854, node name = 10.155.143.110 [2024-04-11 16:44:11,067 I 240 240] (gcs_server) gcs_placement_group_manager.cc:763: Node d6e171bca8f50153d4ce99e92c20b9f760cde0b05a3d78392b123854 failed, rescheduling the placement groups on the dead node. [2024-04-11 16:44:11,067 I 240 240] (gcs_server) gcs_actor_manager.cc:1038: Node d6e171bca8f50153d4ce99e92c20b9f760cde0b05a3d78392b123854 failed, reconstructing actors. [2024-04-11 16:44:11,067 I 240 240] (gcs_server) gcs_job_manager.cc:303: Node d6e171bca8f50153d4ce99e92c20b9f760cde0b05a3d78392b123854 failed, mark all jobs from this node as finished [2024-04-11 16:44:11,090 I 240 240] (gcs_server) gcs_node_manager.cc:140: Raylet 96bdb79bfcf67cdd29a31b4ea457ef6066269e14966ad5e69a352b03 is drained. Status GrpcUnavailable: RPC Error message: Cancelling all calls; RPC Error details: . The information will be published to the cluster. [2024-04-11 16:44:11,193 I 240 240] (gcs_server) gcs_node_manager.cc:140: Raylet 84aa3962753455d7c57d809262faaac4ece059f5a8fd7237f648b6b1 is drained. Status GrpcUnavailable: RPC Error message: Cancelling all calls; RPC Error details: . The information will be published to the cluster. [2024-04-11 16:44:11,214 I 240 240] (gcs_server) gcs_node_manager.cc:140: Raylet d6e171bca8f50153d4ce99e92c20b9f760cde0b05a3d78392b123854 is drained. Status GrpcUnavailable: RPC Error message: Cancelling all calls; RPC Error details: . The information will be published to the cluster. [2024-04-11 16:44:11,228 I 240 270] (gcs_server) ray_syncer-inl.h:308: Failed to read the message from: 84aa3962753455d7c57d809262faaac4ece059f5a8fd7237f648b6b1 [2024-04-11 16:44:11,237 I 240 240] (gcs_server) gcs_node_manager.cc:99: Draining node info, node id = 4e8107ce80e786fc6b6175d88425aecdad253bf1d685659edc07c4bd [2024-04-11 16:44:11,237 I 240 240] (gcs_server) gcs_node_manager.cc:215: Removing node, node id = 4e8107ce80e786fc6b6175d88425aecdad253bf1d685659edc07c4bd, node name = 10.155.158.61 [2024-04-11 16:44:11,237 I 240 240] (gcs_server) gcs_placement_group_manager.cc:763: Node 4e8107ce80e786fc6b6175d88425aecdad253bf1d685659edc07c4bd failed, rescheduling the placement groups on the dead node. [2024-04-11 16:44:11,237 I 240 240] (gcs_server) gcs_actor_manager.cc:1038: Node 4e8107ce80e786fc6b6175d88425aecdad253bf1d685659edc07c4bd failed, reconstructing actors. [2024-04-11 16:44:11,237 I 240 240] (gcs_server) gcs_job_manager.cc:303: Node 4e8107ce80e786fc6b6175d88425aecdad253bf1d685659edc07c4bd failed, mark all jobs from this node as finished [2024-04-11 16:44:11,241 E 240 281] (gcs_server) logging.cc:361: *** SIGSEGV received at time=1712853851 on cpu 0 *** [2024-04-11 16:44:11,241 E 240 281] (gcs_server) logging.cc:361: PC: @ 0x5574dc4b433d (unknown) absl::lts_20220623::inlined_vector_internal::Storage<>::DestroyContents() [2024-04-11 16:44:11,241 E 240 281] (gcs_server) logging.cc:361: @ 0x7f5838f67980 1552 (unknown) [2024-04-11 16:44:11,241 E 240 281] (gcs_server) logging.cc:361: @ 0x5574dc4b4a19 64 absl::lts_20220623::Status::UnrefNonInlined() [2024-04-11 16:44:11,241 E 240 281] (gcs_server) logging.cc:361: @ 0x5574dc44617d 32 grpc_core::promise_filter_detail::ServerCallData::~ServerCallData() [2024-04-11 16:44:11,241 E 240 281] (gcs_server) logging.cc:361: @ 0x5574dc156ba0 32 _ZZN9grpc_core22MakePromiseBasedFilterINS_16HttpServerFilterELNS_14FilterEndpointE1ELh1EEENSt9enable_ifIXsrSt10is_base_ofINS_13ChannelFilterET_E5valueE19grpc_channel_filterE4typeEPKcENUlP17grpc_call_elementPK20grpc_call_final_infoP12grpc_closureE4_4_FUNESE_SH_SJ_ [2024-04-11 16:44:11,241 E 240 281] (gcs_server) logging.cc:361: @ 0x5574dc42019d 80 grpc_call_stack_destroy() [2024-04-11 16:44:11,241 E 240 281] (gcs_server) logging.cc:361: @ 0x5574dc42ed07 64 grpc_core::FilterStackCall::DestroyCall() [2024-04-11 16:44:11,241 E 240 281] (gcs_server) logging.cc:361: @ 0x5574dc491134 80 grpc_core::ExecCtx::Flush() [2024-04-11 16:44:11,241 E 240 281] (gcs_server) logging.cc:361: @ 0x5574dc422f08 304 cq_next() [2024-04-11 16:44:11,241 E 240 281] (gcs_server) logging.cc:361: @ 0x5574dc0cd0b5 48 grpc::(anonymous namespace)::CallbackAlternativeCQ::Ref()::{lambda()#1}::_FUN() [2024-04-11 16:44:11,241 E 240 281] (gcs_server) logging.cc:361: @ 0x5574dc4af0f6 112 grpc_core::(anonymous namespace)::ThreadInternalsPosix::ThreadInternalsPosix()::{lambda()#1}::_FUN() [2024-04-11 16:44:11,241 E 240 281] (gcs_server) logging.cc:361: @ 0x7f5838f5c6db (unknown) start_thread ``` I don’t have any clue about this problem now. Could you give me some directions and suggestions for troubleshooting? btw, I found the lots of logs like `[2024-04-11 16:44:31,287 I 455 455] (raylet) ray_syncer.cc:229: Connection is broken. Reconnect to node: 00000000000000000000000000000000000000000000000000000000 [2024-04-11 16:44:31,287 I 455 455] (raylet) ray_syncer-inl.h:308: Failed to read the message from: 00000000000000000000000000000000000000000000000000000000 [2024-04-11 16:44:31,287 I 455 455] (raylet) ray_syncer-inl.h:292: Failed to send the message to: 00000000000000000000000000000000000000000000000000000000`. ### Versions / Dependencies ray==2.8.0 python3.10 ### Reproduction script none ### Issue Severity High: It blocks me from completing my task.

Topic		Replies	Views
Very rare error that occurs when nodes disconnect and then reconnect Ray Core	3	371	April 20, 2023
Exception info in warning and/or debug logs Monitoring & Debugging	2	537	October 21, 2022
Log monitor failing Ray Core	5	790	December 15, 2022
(raylet) object_manager.cc:293: Couldn't send pull request from Ray Core	2	302	January 20, 2023
Failed to get the system config from Raylet: IOError: 14: failed to connect to all addresses Ray Core	1	831	July 28, 2021

Some confusion about raylet.out

Related Topics