Remote worker nodes only alive for 30 seconds

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.

Hi,
I’m trying to setup a small on premise cluster of high end GPU machines. I managed to connect the worker nodes to the head node but once they appear on the dashboard they are only alive for ~ 30 seconds. Also it is not possible to access their logs from the dashboard. I created firewall inbound rules for python, ray, raylet and gcs server executables to allow all port connections on both the head node and the worker machines.
My setup:
Windows 10 Pro
Python 3.7.9
Ray 2.1.0

raylet.out

[state-dump] Event stats:
[state-dump] 	PeriodicalRunner.RunFnPeriodically - 8 total (1 active, 1 running), CPU time: mean = 252.338 us, total = 2.019 ms
[state-dump] 	UNKNOWN - 3 total (3 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] 	NodeInfoGcsService.grpc_client.GetInternalConfig - 1 total (0 active), CPU time: mean = 190.864 ms, total = 190.864 ms
[state-dump] 	NodeInfoGcsService.grpc_client.RegisterNode - 1 total (0 active), CPU time: mean = 280.800 us, total = 280.800 us
[state-dump] 	InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] 	ObjectManager.UpdateAvailableMemory - 1 total (0 active), CPU time: mean = 7.200 us, total = 7.200 us
[state-dump] 	NodeManager.deadline_timer.debug_state_dump - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] 	NodeManager.deadline_timer.record_metrics - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] 	RayletWorkerPool.deadline_timer.kill_idle_workers - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] 	InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] 	NodeManager.deadline_timer.flush_free_objects - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] DebugString() time ms: 1
[state-dump] 
[state-dump] 
[2022-11-18 15:10:14,582 I 14604 2452] (raylet.exe) accessor.cc:608: Received notification for node id = 8cc668ae845aca22294702a29e89487ae7d358b4bf7d5ba5d59ff9fa, IsAlive = 1
[2022-11-18 15:10:14,582 I 14604 2452] (raylet.exe) accessor.cc:608: Received notification for node id = d95f9e10cddcc512ed19448bcfa76c6660cdbaf48f60b7c5a481d044, IsAlive = 1
[2022-11-18 15:10:15,718 I 14604 2452] (raylet.exe) agent_manager.cc:40: HandleRegisterAgent, ip: 10.14.228.74, port: 59081, id: 15724
[2022-11-18 15:10:43,587 I 14604 2452] (raylet.exe) accessor.cc:608: Received notification for node id = 8cc668ae845aca22294702a29e89487ae7d358b4bf7d5ba5d59ff9fa, IsAlive = 0
[2022-11-18 15:10:43,591 C 14604 2452] (raylet.exe) node_manager.cc:1057: [Timeout] Exiting because this node manager has mistakenly been marked as dead by the GCS: GCS didn't receive heartbeats from this node for 30000 ms. This is likely because the machine or raylet has become overloaded.
*** StackTrace Information ***
unknown
unknown
unknown
unknown
unknown
unknown
unknown
unknown
unknown
unknown
unknown
unknown
unknown
BaseThreadInitThunk
RtlUserThreadStart

Do you mind posting your firewall configuration? In particular do you have bidirectional communication on these ports? Configuring Ray — Ray 2.1.0

Hi Alex,
thanks for your time. The interesting thing is that I tried the same at my personal network at home (Same OS and Python env) and there it seems to work. The following settings where created automatically when the windows defender popup was accepted. So I copied these settings to our company machines but somehow I have the problem with the nodes being alive only a few seconds there.
Here are snippets from my firewall Inbound rules :