Remote worker nodes only alive for 30 seconds

bananajoe182 · November 22, 2022, 3:16pm

How severe does this issue affect your experience of using Ray?

High: It blocks me to complete my task.

Hi,
I’m trying to set up a small on-premises cluster of high-end GPU machines. I managed to connect the worker nodes to the head node, but once they appear on the dashboard they stay alive for only ~30 seconds. It is also not possible to access their logs from the dashboard. I created firewall inbound rules for the python, ray, raylet and gcs_server executables to allow connections on all ports, on both the head node and the worker machines.
My setup:
Windows 10 Pro
Python 3.7.9
Ray 2.1.0

raylet.out

[state-dump] Event stats:
[state-dump] 	PeriodicalRunner.RunFnPeriodically - 8 total (1 active, 1 running), CPU time: mean = 252.338 us, total = 2.019 ms
[state-dump] 	UNKNOWN - 3 total (3 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] 	NodeInfoGcsService.grpc_client.GetInternalConfig - 1 total (0 active), CPU time: mean = 190.864 ms, total = 190.864 ms
[state-dump] 	NodeInfoGcsService.grpc_client.RegisterNode - 1 total (0 active), CPU time: mean = 280.800 us, total = 280.800 us
[state-dump] 	InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] 	ObjectManager.UpdateAvailableMemory - 1 total (0 active), CPU time: mean = 7.200 us, total = 7.200 us
[state-dump] 	NodeManager.deadline_timer.debug_state_dump - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] 	NodeManager.deadline_timer.record_metrics - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] 	RayletWorkerPool.deadline_timer.kill_idle_workers - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] 	InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] 	NodeManager.deadline_timer.flush_free_objects - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] DebugString() time ms: 1
[state-dump] 
[state-dump] 
[2022-11-18 15:10:14,582 I 14604 2452] (raylet.exe) accessor.cc:608: Received notification for node id = 8cc668ae845aca22294702a29e89487ae7d358b4bf7d5ba5d59ff9fa, IsAlive = 1
[2022-11-18 15:10:14,582 I 14604 2452] (raylet.exe) accessor.cc:608: Received notification for node id = d95f9e10cddcc512ed19448bcfa76c6660cdbaf48f60b7c5a481d044, IsAlive = 1
[2022-11-18 15:10:15,718 I 14604 2452] (raylet.exe) agent_manager.cc:40: HandleRegisterAgent, ip: 10.14.228.74, port: 59081, id: 15724
[2022-11-18 15:10:43,587 I 14604 2452] (raylet.exe) accessor.cc:608: Received notification for node id = 8cc668ae845aca22294702a29e89487ae7d358b4bf7d5ba5d59ff9fa, IsAlive = 0
[2022-11-18 15:10:43,591 C 14604 2452] (raylet.exe) node_manager.cc:1057: [Timeout] Exiting because this node manager has mistakenly been marked as dead by the GCS: GCS didn't receive heartbeats from this node for 30000 ms. This is likely because the machine or raylet has become overloaded.
*** StackTrace Information ***
unknown
unknown
unknown
unknown
unknown
unknown
unknown
unknown
unknown
unknown
unknown
unknown
unknown
BaseThreadInitThunk
RtlUserThreadStart

Alex · November 22, 2022, 6:35pm

Do you mind posting your firewall configuration? In particular, do you have bidirectional communication on these ports? See "Configuring Ray" in the Ray 2.1.0 documentation.
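One way to test reachability is a plain TCP connect from each machine to the other. The sketch below uses only the Python standard library. The helper names are made up for illustration, and the IP address and port list in the comments are placeholders: 6379 is Ray's default head port and 8265 the default dashboard port, but the node manager and object manager ports are chosen at startup unless they are pinned, so substitute the ports your cluster actually uses.

```python
import socket


def port_is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_ports(host, ports, timeout=3.0):
    """Map each port to True/False depending on whether it accepted a connection."""
    return {port: port_is_reachable(host, port, timeout) for port in ports}


# Example (placeholder values, run on a worker against the head node):
#   print(check_ports("10.0.0.1", [6379, 8265]))
# Then run the same check on the head node against each worker's ports.
```

A successful connect only proves one direction, so it is worth running the check from the worker to the head and from the head to the worker. A `False` result does not say which firewall dropped the connection, only that it did not get through.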

bananajoe182 · November 24, 2022, 9:54am

Hi Alex,
thanks for your time. The interesting thing is that I tried the same setup on my personal network at home (same OS and Python env), and there it seems to work. The following settings were created automatically when the Windows Defender popup was accepted. I copied these settings to our company machines, but there the nodes still stay alive for only a few seconds.
Here are snippets from my firewall inbound rules:

bananajoe182 · December 20, 2022, 1:43pm

Hey,
the problem is solved now. I identified 2 problems:

1. The firewall inbound rules were overridden by Group Policy settings, so the local settings were ignored.
2. Because I used a Python virtual env, I also had to create a rule for the base Python 3.7 python.exe, in addition to the venv python.exe. The missing rule was the reason the logs were not accessible.
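To see which two interpreters are involved, the following sketch prints the interpreter that is currently running and a guess at the base interpreter behind the virtual env. It relies only on `sys.prefix`, `sys.base_prefix` and `sys.executable`. The function name is made up for illustration, and the base path is an assumption based on the usual install layout, so verify that the file exists before creating a rule for it.

```python
import os
import sys


def interpreter_paths():
    """Return (active_interpreter, guessed_base_interpreter, in_venv)."""
    # Inside a venv, sys.prefix points at the venv and sys.base_prefix
    # at the installation the venv was created from.
    in_venv = sys.prefix != sys.base_prefix
    exe_name = os.path.basename(sys.executable)
    if os.name == "nt":
        # Usual Windows layout: python.exe sits directly in the install dir.
        base = os.path.join(sys.base_prefix, exe_name)
    else:
        base = os.path.join(sys.base_prefix, "bin", exe_name)
    return sys.executable, base, in_venv


if __name__ == "__main__":
    active, base, in_venv = interpreter_paths()
    print("active interpreter:", active)
    print("base interpreter (guess):", base)
    print("running inside a venv:", in_venv)
```

Run it with the venv activated on each machine. If the two paths differ, both executables may need their own firewall rule.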

varun_raju · February 23, 2023, 12:35pm

How did you solve the problems mentioned above?

doryan607 · April 10, 2023, 1:20am

@bananajoe182

Hello, I have the same problem as you mentioned. I have a quick question about the solutions you presented.

What group policy settings did you edit?
Which rule should I create for the python 3.7 .exe and venv python.exe? (Group policy, inbound, etc.)

Thank you for your help in advance!

bananajoe182 · April 10, 2023, 7:53am

Hi,

You have to allow all inbound connections for the system python.exe as well as the venv python.exe if you use a venv. Additionally, allow all for ray.exe and gcs_server.exe. Do this for all worker nodes and the head node. If your machine is part of a company network, make sure those rules are set as administrator so that they override the local user settings.

If it still doesn’t work try adding the same for outbound connections.
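As a rough sketch, the rules can also be written as `netsh advfirewall firewall add rule` commands instead of being clicked together in the GUI. The script below only builds the command strings; it does not run them. The function name, the rule names and every path in the example are placeholders, so replace them with the real locations on your machines and run the printed commands from an elevated prompt. If Group Policy overrides local rules, as in my case, locally created rules may still be ignored and the rules have to be set through the policy instead.

```python
import ntpath


def netsh_allow_rule(program_path, direction="in", name_prefix="Ray"):
    """Build a netsh command that allows all traffic for one executable."""
    if direction not in ("in", "out"):
        raise ValueError("direction must be 'in' or 'out'")
    # ntpath handles Windows-style paths regardless of the OS running this.
    exe = ntpath.basename(program_path)
    rule_name = f"{name_prefix} {exe} ({direction})"
    return (
        f'netsh advfirewall firewall add rule name="{rule_name}" '
        f'dir={direction} action=allow program="{program_path}" enable=yes'
    )


if __name__ == "__main__":
    # Placeholder paths: substitute the real ones from your installation.
    programs = [
        r"C:\path\to\base\python.exe",
        r"C:\path\to\venv\Scripts\python.exe",
        r"C:\path\to\venv\Scripts\ray.exe",
        r"C:\path\to\gcs_server.exe",
    ]
    for path in programs:
        for direction in ("in", "out"):
            print(netsh_allow_rule(path, direction))
```

Leave out the `"out"` direction if inbound rules alone turn out to be enough.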

Hope that helps!

MiJi · April 24, 2025, 8:34am

Hi,
I had a similar issue on an Azure VM, inside a Docker container.
The container needed to be started with the “host” network; otherwise it uses its own isolated bridge network.
With the bridge network you can still add the container to the Ray head as a node, but it vanishes after 30 seconds.

With docker-compose, you only need to add the following line to the service definition in the docker-compose.yml file:
network_mode: "host"

Hope that helps!
