Raylet error: some workers have not registered within the timeout

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

This issue really bothers me: I always get the error below, and Ray just keeps printing the same message over and over.
I am using Slurm. I installed Ray with pip install ray, and my Ray version is 1.12.1.
The error below blocks me from completing my job.
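For context, the driver I run is roughly of this shape (a simplified, hypothetical sketch; it connects to a head node started beforehand with ray start --head):

import ray

# Connect to the head node started earlier with: ray start --head
ray.init(address="auto")

@ray.remote(num_gpus=1)
def which_host():
    import socket
    return socket.gethostname()

# When workers fail to register with the raylet within the timeout,
# calls like this hang while the warning keeps being printed.
print(ray.get([which_host.remote() for _ in range(8)]))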


@tupui Do you have any ideas about this one?

Here, ray status shows 1 node with 8 GPUs.

This might again be the networking issue. @zyc-bit, do you have multiple network interfaces? And if so, is there some specific isolation between one network and the other?
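If it helps, here is a rough way to check from Python which address Ray is likely to auto-detect; Ray's own detection opens a UDP socket in a similar way and reads the local address:

import socket

# No packets are sent by a UDP connect; this only asks the kernel which
# local address would be used to reach an external host.
s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.connect(("8.8.8.8", 80))
print("auto-detected node IP:", s.getsockname()[0])
s.close()

print("hostname resolves to:", socket.gethostbyname(socket.gethostname()))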

Thank you @tupui. I'll ask my Slurm cluster administrator about the information you mentioned above, and I'll reply once I get answers from them.

And by the way, here I only start a Ray head node.

Hi @tupui, sorry it took so long to get back to you. I do have a lot of network interfaces, as shown in the picture. The admin says there is no special isolation; they are all interoperable except for the ib (InfiniBand) interface.

And I found the link below in the Slurm section of the Ray documentation. Maybe the feature being added in this issue will solve my problem? I saw you are working on this issue too.
[Feature] [core] Selecting network interface · Issue #22732 · ray-project/ray (github.com)

It might be linked, yes.
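In the meantime, one possible workaround to sketch (not a confirmed fix): pin the address explicitly when starting the head, e.g. ray start --head --node-ip-address=10.140.1.24 --port=6379, and connect to that same address from the driver:

import ray

# Hypothetical: 10.140.1.24 is only an illustrative address; use the IP of
# the interface that should carry Ray traffic on your head node.
ray.init(address="10.140.1.24:6379")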

Thank you for helping me all the way here.
I only start a head node. So even on that single node, the head node, there can also be network problems?

And by the way, do you know when the GitHub issue mentioned above will be finished?

Can you share more info? The networking issues related to multiple network interfaces shouldn't cause problems when you're running only on the head node. Are you seeing failures on a single node, and if so, can you share the log?

Also, the GitHub issue is actively being worked on; I anticipate it will go out in Ray 1.14, but that's not a guarantee.

Yes, even when I run only on the head node, I still get the (raylet) errors mentioned above.

Which of the logs should I show? raylet.out, or raylet.err?

Share both if you can :slight_smile:
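If it's easier, both files live under Ray's default log directory; here is a small sketch to locate them (assuming the default temp dir /tmp/ray is used):

from pathlib import Path

# session_latest is a symlink to the most recent Ray session.
log_dir = Path("/tmp/ray/session_latest/logs")
for name in ("raylet.out", "raylet.err"):
    path = log_dir / name
    if path.exists():
        print(path, f"({path.stat().st_size} bytes)")
    else:
        print(path, "(not found)")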

Hi @cade. I'm busy responding to review comments for my paper today, so I apologize for the late reply.
First, I noticed that when the errors happened, many Python processes were being created.


And in the log folder there are a lot of files (298 in total). How do I let you know what information is in these files?

The raylet.out is too long to post here; how else can I share it?
The raylet.err contains the same error information mentioned above. I show raylet.err below:

cc @Chen_Shen, what logs are helpful here?

@zyc-bit if it's a network interface issue, any of those python-core-worker-* logs should have some error messages.
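A quick sketch to pull out only the warning/error lines from those logs, so the full files don't have to be posted; the severity is the single letter (I/W/E) after the timestamp in each line, and the path assumes the default log directory:

from pathlib import Path

log_dir = Path("/tmp/ray/session_latest/logs")
for log in sorted(log_dir.glob("python-core-worker-*.log")):
    lines = log.read_text(errors="ignore").splitlines()
    bad = [l for l in lines if " W " in l or " E " in l]
    if bad:
        print(f"=== {log.name} ===")
        print("\n".join(bad[:20]))  # only the first 20 flagged lines per file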

@Chen_Shen the interesting thing is that it fails even on a single node. Can worker processes fail to connect to the head node if there are multiple NICs, even when they're on the same node?

@cade yeah good point. @zyc-bit is it possible to share the full log folder if there is no sensitive data? I think it should be very easy to tell from the error logs.

Thank you @Chen_Shen and @cade, I really appreciate your help.
There is no sensitive data, and I would really like to share the full log folder with you so we can solve the problem. How can I share it? I don't see a button for sharing a folder at the top, and the contents are too long to post here.

@zyc-bit I think you can first paste the python-core-worker-* logs and raylet.out; that might be sufficient to diagnose the issue.

Hi @Chen_Shen, thank you for your reply.
I uploaded my log folder to GitHub. You can see the full Ray log folder here:
zyc-bit/raylog: ray logs (github.com)
In particular, some files are too long to display below.
raylet.out has 197,029 lines, which is far too long to post here.


There are lots of python-core-worker-* logs, and many of them are very long, so I post two of them:

python-core-worker-0cd1786eab7068a85030519ca4a1d3bad53e955c1aba88c4fbc65e51_29713.log

[2022-06-16 22:22:17,322 I 29713 29713] core_worker_process.cc:120: Constructing CoreWorkerProcess. pid: 29713
[2022-06-16 22:22:18,194 I 29713 29713] grpc_server.cc:105: worker server started, listening on port 10060.
[2022-06-16 22:22:18,196 I 29713 29713] core_worker.cc:175: Initializing worker at address: 10.140.1.24:10060, worker ID 0cd1786eab7068a85030519ca4a1d3bad53e955c1aba88c4fbc65e51, raylet 5362b08c08075f2f5bd7b1d3e1df0d75415293f6ede64c2d4cee5e5c
[2022-06-16 22:22:18,197 I 29713 30504] gcs_server_address_updater.cc:32: GCS Server updater thread id: 139837684311808
[2022-06-16 22:22:18,652 I 29713 29713] io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2022-06-16 22:22:18,668 I 29713 30518] core_worker.cc:494: Event stats:


Global stats: 17 total (10 active)
Queueing time: mean = 108.245 ms, max = 355.299 ms, min = 20.330 ms, total = 1.840 s
Execution time:  mean = 25.376 us, total = 431.395 us
Event stats:
	PeriodicalRunner.RunFnPeriodically - 7 total (2 active, 1 running), CPU time: mean = 28.847 us, total = 201.932 us
	UNKNOWN - 3 total (3 active), CPU time: mean = 0.000 s, total = 0.000 s
	WorkerInfoGcsService.grpc_client.AddWorkerInfo - 1 total (0 active), CPU time: mean = 16.371 us, total = 16.371 us
	InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
	GcsClient.deadline_timer.check_gcs_connection - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
	CoreWorker.deadline_timer.flush_profiling_events - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
	NodeManagerService.grpc_client.ReportWorkerBacklog - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
	InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 1 total (0 active), CPU time: mean = 213.092 us, total = 213.092 us
	NodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s


[2022-06-16 22:22:18,668 I 29713 30518] accessor.cc:599: Received notification for node id = 5362b08c08075f2f5bd7b1d3e1df0d75415293f6ede64c2d4cee5e5c, IsAlive = 1
[2022-06-16 22:22:19,359 I 29713 29713] direct_actor_task_submitter.cc:33: Set max pending calls to -1 for actor 94fe4d27cfeb53ada734bf410b000000
[2022-06-16 22:22:19,359 I 29713 29713] direct_actor_task_submitter.cc:217: Connecting to actor 94fe4d27cfeb53ada734bf410b000000 at worker 0cd1786eab7068a85030519ca4a1d3bad53e955c1aba88c4fbc65e51
[2022-06-16 22:22:19,359 I 29713 29713] core_worker.cc:2317: Creating actor: 94fe4d27cfeb53ada734bf410b000000
[2022-06-16 22:23:21,611 I 29713 30518] core_worker.cc:494: Event stats:


Global stats: 891 total (9 active)
Queueing time: mean = 29.958 ms, max = 3.380 s, min = -0.066 s, total = 26.692 s
Execution time:  mean = 8.328 ms, total = 7.420 s
Event stats:
	UNKNOWN - 661 total (5 active, 1 running), CPU time: mean = 8.880 ms, total = 5.870 s
	GcsClient.deadline_timer.check_gcs_connection - 58 total (1 active), CPU time: mean = 524.995 us, total = 30.450 ms
	CoreWorker.deadline_timer.flush_profiling_events - 58 total (1 active), CPU time: mean = 8.139 ms, total = 472.043 ms
	NodeManagerService.grpc_client.ReportWorkerBacklog - 57 total (1 active), CPU time: mean = 5.614 us, total = 320.020 us
	CoreWorkerService.grpc_server.GetCoreWorkerStats - 45 total (0 active), CPU time: mean = 22.598 ms, total = 1.017 s
	PeriodicalRunner.RunFnPeriodically - 7 total (0 active), CPU time: mean = 2.152 ms, total = 15.062 ms
	WorkerInfoGcsService.grpc_client.AddWorkerInfo - 1 total (0 active), CPU time: mean = 16.371 us, total = 16.371 us
	InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
	CoreWorkerService.grpc_server.PushTask - 1 total (0 active), CPU time: mean = 15.032 ms, total = 15.032 ms
	InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 1 total (0 active), CPU time: mean = 213.092 us, total = 213.092 us
	NodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (0 active), CPU time: mean = 40.564 us, total = 40.564 us


[2022-06-16 22:23:31,834 I 29713 29713] direct_actor_transport.cc:144: Actor creation task finished, task_id: ffffffffffffffff94fe4d27cfeb53ada734bf410b000000, actor_id: 94fe4d27cfeb53ada734bf410b000000
[2022-06-16 22:24:21,949 I 29713 30518] core_worker.cc:494: Event stats:


Global stats: 1742 total (8 active)
Queueing time: mean = 33.239 ms, max = 4.327 s, min = -0.066 s, total = 57.902 s
Execution time:  mean = 8.316 ms, total = 14.487 s
Event stats:
	UNKNOWN - 1302 total (5 active, 1 running), CPU time: mean = 8.137 ms, total = 10.595 s
	GcsClient.deadline_timer.check_gcs_connection - 112 total (1 active), CPU time: mean = 586.926 us, total = 65.736 ms
	CoreWorker.deadline_timer.flush_profiling_events - 112 total (1 active), CPU time: mean = 7.300 ms, total = 817.561 ms
	NodeManagerService.grpc_client.ReportWorkerBacklog - 111 total (0 active), CPU time: mean = 744.580 us, total = 82.648 ms
	CoreWorkerService.grpc_server.GetCoreWorkerStats - 88 total (0 active), CPU time: mean = 32.277 ms, total = 2.840 s
	PeriodicalRunner.RunFnPeriodically - 7 total (0 active), CPU time: mean = 2.152 ms, total = 15.062 ms
	StatsGcsService.grpc_client.AddProfileData - 3 total (0 active), CPU time: mean = 3.347 ms, total = 10.042 ms
	CoreWorkerService.grpc_server.PushTask - 2 total (0 active), CPU time: mean = 15.292 ms, total = 30.585 ms
	WorkerInfoGcsService.grpc_client.AddWorkerInfo - 1 total (0 active), CPU time: mean = 16.371 us, total = 16.371 us
	InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
	CoreWorkerService.grpc_server.DirectActorCallArgWaitComplete - 1 total (0 active), CPU time: mean = 30.085 ms, total = 30.085 ms
	InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 1 total (0 active), CPU time: mean = 213.092 us, total = 213.092 us
	NodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (0 active), CPU time: mean = 40.564 us, total = 40.564 us


[2022-06-16 22:24:31,825 I 29713 30518] core_worker.cc:3086: Force kill actor request has received. exiting immediately...
[2022-06-16 22:24:31,825 I 29713 30518] core_worker.cc:591: Disconnecting to the raylet.
[2022-06-16 22:24:31,855 I 29713 30518] raylet_client.cc:162: RayletClient::Disconnect, exit_type=INTENDED_EXIT, has creation_task_exception_pb_bytes=0

python-core-driver-11000000ffffffffffffffffffffffffffffffffffffffffffffffff_108246.log

[2022-06-16 23:31:30,392 I 108246 108246] core_worker_process.cc:120: Constructing CoreWorkerProcess. pid: 108246
[2022-06-16 23:31:31,454 I 108246 108246] grpc_server.cc:105: driver server started, listening on port 10100.
[2022-06-16 23:31:31,457 I 108246 108246] core_worker.cc:175: Initializing worker at address: 10.140.1.24:10100, worker ID 11000000ffffffffffffffffffffffffffffffffffffffffffffffff, raylet 5362b08c08075f2f5bd7b1d3e1df0d75415293f6ede64c2d4cee5e5c
[2022-06-16 23:31:31,457 I 108246 109886] gcs_server_address_updater.cc:32: GCS Server updater thread id: 139635007149824
[2022-06-16 23:31:31,637 I 108246 108246] io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2022-06-16 23:31:31,691 I 108246 109896] core_worker.cc:494: Event stats:


Global stats: 15 total (10 active)
Queueing time: mean = 18.648 ms, max = 79.653 ms, min = 308.603 us, total = 279.715 ms
Execution time:  mean = 23.626 us, total = 354.389 us
Event stats:
	PeriodicalRunner.RunFnPeriodically - 6 total (2 active, 1 running), CPU time: mean = 25.735 us, total = 154.410 us
	UNKNOWN - 2 total (2 active), CPU time: mean = 0.000 s, total = 0.000 s
	GcsClient.deadline_timer.check_gcs_connection - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
	CoreWorker.deadline_timer.flush_profiling_events - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
	InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
	NodeManagerService.grpc_client.ReportWorkerBacklog - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
	NodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
	InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 1 total (0 active), CPU time: mean = 199.979 us, total = 199.979 us
	WorkerInfoGcsService.grpc_client.AddWorkerInfo - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s


[2022-06-16 23:31:31,691 I 108246 109896] accessor.cc:599: Received notification for node id = 5362b08c08075f2f5bd7b1d3e1df0d75415293f6ede64c2d4cee5e5c, IsAlive = 1
[2022-06-16 23:32:31,769 I 108246 109896] core_worker.cc:494: Event stats:


Global stats: 840 total (7 active)
Queueing time: mean = 12.712 ms, max = 1.837 s, min = -0.060 s, total = 10.678 s
Execution time:  mean = 6.191 ms, total = 5.201 s
Event stats:
	UNKNOWN - 607 total (4 active, 1 running), CPU time: mean = 5.416 ms, total = 3.287 s
	GcsClient.deadline_timer.check_gcs_connection - 58 total (1 active), CPU time: mean = 1.386 ms, total = 80.378 ms
	CoreWorker.deadline_timer.flush_profiling_events - 58 total (1 active), CPU time: mean = 6.054 ms, total = 351.145 ms
	NodeManagerService.grpc_client.ReportWorkerBacklog - 57 total (0 active), CPU time: mean = 1.421 ms, total = 80.999 ms
	CoreWorkerService.grpc_server.GetCoreWorkerStats - 50 total (0 active), CPU time: mean = 26.936 ms, total = 1.347 s
	PeriodicalRunner.RunFnPeriodically - 6 total (0 active), CPU time: mean = 8.994 ms, total = 53.961 ms
	InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
	NodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (0 active), CPU time: mean = 35.732 us, total = 35.732 us
	InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 1 total (0 active), CPU time: mean = 199.979 us, total = 199.979 us
	WorkerInfoGcsService.grpc_client.AddWorkerInfo - 1 total (0 active), CPU time: mean = 16.397 us, total = 16.397 us


[2022-06-16 23:32:37,809 I 108246 108246] direct_actor_task_submitter.cc:33: Set max pending calls to -1 for actor 6ea2aaf4797ab459b3d497d311000000
[2022-06-16 23:32:38,017 I 108246 108246] direct_actor_task_submitter.cc:33: Set max pending calls to -1 for actor 7136362e7b3835d8c9786a7d11000000
[2022-06-16 23:32:38,858 I 108246 108246] direct_actor_task_submitter.cc:33: Set max pending calls to -1 for actor 9ee24193beb1dc62ce2fa96711000000
[2022-06-16 23:32:38,858 I 108246 108246] direct_actor_task_submitter.cc:33: Set max pending calls to -1 for actor 7cbe8c1381ac557dd5e9367511000000
[2022-06-16 23:32:38,858 I 108246 108246] direct_actor_task_submitter.cc:33: Set max pending calls to -1 for actor bf2af9b118af148c0ad26a0a11000000
[2022-06-16 23:32:40,246 I 108246 109896] actor_manager.cc:246: received notification on actor, state: DEPENDENCIES_UNREADY, actor_id: 9ee24193beb1dc62ce2fa96711000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-06-16 23:32:40,305 I 108246 109896] actor_manager.cc:246: received notification on actor, state: DEPENDENCIES_UNREADY, actor_id: 7cbe8c1381ac557dd5e9367511000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-06-16 23:32:40,305 I 108246 109896] actor_manager.cc:246: received notification on actor, state: DEPENDENCIES_UNREADY, actor_id: bf2af9b118af148c0ad26a0a11000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-06-16 23:32:40,446 I 108246 109896] actor_manager.cc:246: received notification on actor, state: DEPENDENCIES_UNREADY, actor_id: 6ea2aaf4797ab459b3d497d311000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-06-16 23:32:40,446 I 108246 109896] actor_manager.cc:246: received notification on actor, state: DEPENDENCIES_UNREADY, actor_id: 7136362e7b3835d8c9786a7d11000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-06-16 23:32:40,447 I 108246 109896] actor_manager.cc:246: received notification on actor, state: DEPENDENCIES_UNREADY, actor_id: 9ee24193beb1dc62ce2fa96711000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-06-16 23:32:40,447 I 108246 109896] actor_manager.cc:246: received notification on actor, state: DEPENDENCIES_UNREADY, actor_id: 7cbe8c1381ac557dd5e9367511000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-06-16 23:32:40,447 I 108246 109896] actor_manager.cc:246: received notification on actor, state: DEPENDENCIES_UNREADY, actor_id: bf2af9b118af148c0ad26a0a11000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-06-16 23:32:40,447 I 108246 109896] actor_manager.cc:246: received notification on actor, state: PENDING_CREATION, actor_id: 6ea2aaf4797ab459b3d497d311000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-06-16 23:32:40,452 I 108246 109896] actor_manager.cc:246: received notification on actor, state: PENDING_CREATION, actor_id: 7136362e7b3835d8c9786a7d11000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-06-16 23:32:40,452 I 108246 109896] actor_manager.cc:246: received notification on actor, state: PENDING_CREATION, actor_id: 9ee24193beb1dc62ce2fa96711000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-06-16 23:32:40,452 I 108246 109896] actor_manager.cc:246: received notification on actor, state: PENDING_CREATION, actor_id: 7cbe8c1381ac557dd5e9367511000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-06-16 23:32:40,452 I 108246 109896] actor_manager.cc:246: received notification on actor, state: PENDING_CREATION, actor_id: bf2af9b118af148c0ad26a0a11000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-06-16 23:33:32,039 I 108246 109896] core_worker.cc:494: Event stats:


Global stats: 1677 total (14 active)
Queueing time: mean = 27.565 ms, max = 3.279 s, min = -0.060 s, total = 46.226 s
Execution time:  mean = 8.041 ms, total = 13.485 s
Event stats:
	UNKNOWN - 1179 total (4 active, 1 running), CPU time: mean = 6.694 ms, total = 7.892 s
	GcsClient.deadline_timer.check_gcs_connection - 112 total (1 active), CPU time: mean = 1.079 ms, total = 120.864 ms
	CoreWorker.deadline_timer.flush_profiling_events - 112 total (1 active), CPU time: mean = 6.093 ms, total = 682.464 ms
	NodeManagerService.grpc_client.ReportWorkerBacklog - 111 total (2 active), CPU time: mean = 732.792 us, total = 81.340 ms
	CoreWorkerService.grpc_server.GetCoreWorkerStats - 91 total (0 active), CPU time: mean = 30.758 ms, total = 2.799 s
	Subscriber.HandlePublishedMessage_GCS_ACTOR_CHANNEL - 10 total (0 active), CPU time: mean = 7.418 us, total = 74.181 us
	InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 7 total (1 active), CPU time: mean = 7.848 ms, total = 54.934 ms
	PeriodicalRunner.RunFnPeriodically - 6 total (0 active), CPU time: mean = 8.994 ms, total = 53.961 ms
	ActorCreator.AsyncRegisterActor - 5 total (0 active), CPU time: mean = 35.180 ms, total = 175.902 ms
	NodeManagerService.grpc_client.PinObjectIDs - 5 total (0 active), CPU time: mean = 7.062 ms, total = 35.310 ms
	ActorInfoGcsService.grpc_client.GetActorInfo - 5 total (0 active), CPU time: mean = 90.510 ms, total = 452.549 ms
	ActorInfoGcsService.grpc_client.RegisterActor - 5 total (0 active), CPU time: mean = 11.060 ms, total = 55.301 ms
	InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 5 total (0 active), CPU time: mean = 168.629 ms, total = 843.147 ms
	ActorInfoGcsService.grpc_client.CreateActor - 5 total (5 active), CPU time: mean = 0.000 s, total = 0.000 s
	CoreWorkerService.grpc_server.WaitForActorOutOfScope - 5 total (0 active), CPU time: mean = 4.800 us, total = 23.999 us
	CoreWorkerDirectActorTaskSubmitter::SubmitTask - 5 total (0 active), CPU time: mean = 10.020 ms, total = 50.101 ms
	CoreWorkerService.grpc_server.UpdateObjectLocationBatch - 2 total (0 active), CPU time: mean = 60.569 ms, total = 121.139 ms
	StatsGcsService.grpc_client.AddProfileData - 2 total (0 active), CPU time: mean = 11.056 us, total = 22.111 us
	CoreWorkerService.grpc_server.PubsubCommandBatch - 2 total (0 active), CPU time: mean = 17.591 ms, total = 35.181 ms
	CoreWorkerService.grpc_server.PubsubLongPolling - 1 total (0 active), CPU time: mean = 31.061 ms, total = 31.061 ms
	NodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (0 active), CPU time: mean = 35.732 us, total = 35.732 us
	WorkerInfoGcsService.grpc_client.AddWorkerInfo - 1 total (0 active), CPU time: mean = 16.397 us, total = 16.397 us


[2022-06-16 23:34:13,923 I 108246 109896] actor_manager.cc:246: received notification on actor, state: ALIVE, actor_id: 7cbe8c1381ac557dd5e9367511000000, ip address: 10.140.1.24, port: 10104, worker_id: 58a268000ebca54cb66f2c3b7887066a61d9a4986ea5f6093e5e528f, raylet_id: 5362b08c08075f2f5bd7b1d3e1df0d75415293f6ede64c2d4cee5e5c, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-06-16 23:34:14,020 I 108246 109896] direct_actor_task_submitter.cc:217: Connecting to actor 7cbe8c1381ac557dd5e9367511000000 at worker 58a268000ebca54cb66f2c3b7887066a61d9a4986ea5f6093e5e528f
[2022-06-16 23:34:14,282 I 108246 109896] actor_manager.cc:246: received notification on actor, state: ALIVE, actor_id: 7136362e7b3835d8c9786a7d11000000, ip address: 10.140.1.24, port: 10101, worker_id: e27ca511972afa0d8f1ece09a21e4e680e83ed806f3a70a14d617e05, raylet_id: 5362b08c08075f2f5bd7b1d3e1df0d75415293f6ede64c2d4cee5e5c, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-06-16 23:34:14,282 I 108246 109896] direct_actor_task_submitter.cc:217: Connecting to actor 7136362e7b3835d8c9786a7d11000000 at worker e27ca511972afa0d8f1ece09a21e4e680e83ed806f3a70a14d617e05
[2022-06-16 23:34:14,282 I 108246 109896] actor_manager.cc:246: received notification on actor, state: ALIVE, actor_id: 6ea2aaf4797ab459b3d497d311000000, ip address: 10.140.1.24, port: 10103, worker_id: 09f217361f14f3de5c40e63d6afab541e28f1e669db6573b76e71255, raylet_id: 5362b08c08075f2f5bd7b1d3e1df0d75415293f6ede64c2d4cee5e5c, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-06-16 23:34:14,283 I 108246 109896] direct_actor_task_submitter.cc:217: Connecting to actor 6ea2aaf4797ab459b3d497d311000000 at worker 09f217361f14f3de5c40e63d6afab541e28f1e669db6573b76e71255
[2022-06-16 23:34:14,283 I 108246 109896] actor_manager.cc:246: received notification on actor, state: ALIVE, actor_id: bf2af9b118af148c0ad26a0a11000000, ip address: 10.140.1.24, port: 10102, worker_id: a73fb2dbb422661802333bc66f5748c21a29a3e58091138f00f4d40a, raylet_id: 5362b08c08075f2f5bd7b1d3e1df0d75415293f6ede64c2d4cee5e5c, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-06-16 23:34:14,283 I 108246 109896] direct_actor_task_submitter.cc:217: Connecting to actor bf2af9b118af148c0ad26a0a11000000 at worker a73fb2dbb422661802333bc66f5748c21a29a3e58091138f00f4d40a
[2022-06-16 23:34:14,283 I 108246 109896] actor_manager.cc:246: received notification on actor, state: ALIVE, actor_id: 9ee24193beb1dc62ce2fa96711000000, ip address: 10.140.1.24, port: 10105, worker_id: 4c030ba15be687abaf9d45f9a81b128e5ccd48482ac6bfc4145b36da, raylet_id: 5362b08c08075f2f5bd7b1d3e1df0d75415293f6ede64c2d4cee5e5c, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-06-16 23:34:14,283 I 108246 109896] direct_actor_task_submitter.cc:217: Connecting to actor 9ee24193beb1dc62ce2fa96711000000 at worker 4c030ba15be687abaf9d45f9a81b128e5ccd48482ac6bfc4145b36da
[2022-06-16 23:34:22,369 I 108246 108246] core_worker.cc:591: Disconnecting to the raylet.
[2022-06-16 23:34:22,369 I 108246 108246] raylet_client.cc:162: RayletClient::Disconnect, exit_type=INTENDED_EXIT, has creation_task_exception_pb_bytes=0
[2022-06-16 23:34:22,369 I 108246 108246] core_worker.cc:539: Shutting down a core worker.
[2022-06-16 23:34:22,369 I 108246 108246] core_worker.cc:563: Disconnecting a GCS client.
[2022-06-16 23:34:22,369 I 108246 108246] core_worker.cc:567: Waiting for joining a core worker io thread. If it hangs here, there might be deadlock or a high load in the core worker io service.
[2022-06-16 23:34:22,369 I 108246 109896] core_worker.cc:679: Core worker main io service stopped.
[2022-06-16 23:34:22,369 I 108246 108246] core_worker.cc:576: Core worker ready to be deallocated.
[2022-06-16 23:34:22,377 I 108246 108246] core_worker_process.cc:298: Removed worker 11000000ffffffffffffffffffffffffffffffffffffffffffffffff
[2022-06-16 23:34:22,397 I 108246 108246] core_worker.cc:530: Core worker is destructed
[2022-06-16 23:34:22,790 I 108246 108246] core_worker_process.cc:154: Destructing CoreWorkerProcessImpl. pid: 108246
[2022-06-16 23:34:22,791 I 108246 108246] io_service_pool.cc:47: IOServicePool is stopped.

Looking at the logs, this is unlikely to be a network issue. Do you have the logs of the workers that failed to start?
They should be named python-core-worker-*_{pid}.log, for example python-core-worker-*_68821.log.
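For example, a small sketch to locate the log for a given worker PID (68821 below is just the example PID from above; the path assumes the default log directory):

from pathlib import Path

# Substitute the PID printed in your raylet error message.
pid = 68821
pattern = f"python-core-worker-*_{pid}.log"
print(list(Path("/tmp/ray/session_latest/logs").glob(pattern)))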

Also, how long have your clusters been running? We have noticed similar issues here: [Core][Nightly-test] scheduling_test_many_0s_tasks_many_nodes failed · Issue #24234 · ray-project/ray · GitHub