Remote worker nodes only alive for 30 seconds

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.

Hi,
I’m trying to set up a small on-premises cluster of high-end GPU machines. I managed to connect the worker nodes to the head node, but once they appear on the dashboard they stay alive for only ~30 seconds. It is also not possible to access their logs from the dashboard. I created firewall inbound rules for the python, ray, raylet and gcs server executables that allow connections on all ports, on both the head node and the worker machines.
My setup:
Windows 10 Pro
Python 3.7.9
Ray 2.1.0

raylet.out

[state-dump] Event stats:
[state-dump] 	PeriodicalRunner.RunFnPeriodically - 8 total (1 active, 1 running), CPU time: mean = 252.338 us, total = 2.019 ms
[state-dump] 	UNKNOWN - 3 total (3 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] 	NodeInfoGcsService.grpc_client.GetInternalConfig - 1 total (0 active), CPU time: mean = 190.864 ms, total = 190.864 ms
[state-dump] 	NodeInfoGcsService.grpc_client.RegisterNode - 1 total (0 active), CPU time: mean = 280.800 us, total = 280.800 us
[state-dump] 	InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] 	ObjectManager.UpdateAvailableMemory - 1 total (0 active), CPU time: mean = 7.200 us, total = 7.200 us
[state-dump] 	NodeManager.deadline_timer.debug_state_dump - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] 	NodeManager.deadline_timer.record_metrics - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] 	RayletWorkerPool.deadline_timer.kill_idle_workers - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] 	InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] 	NodeManager.deadline_timer.flush_free_objects - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] DebugString() time ms: 1
[state-dump] 
[state-dump] 
[2022-11-18 15:10:14,582 I 14604 2452] (raylet.exe) accessor.cc:608: Received notification for node id = 8cc668ae845aca22294702a29e89487ae7d358b4bf7d5ba5d59ff9fa, IsAlive = 1
[2022-11-18 15:10:14,582 I 14604 2452] (raylet.exe) accessor.cc:608: Received notification for node id = d95f9e10cddcc512ed19448bcfa76c6660cdbaf48f60b7c5a481d044, IsAlive = 1
[2022-11-18 15:10:15,718 I 14604 2452] (raylet.exe) agent_manager.cc:40: HandleRegisterAgent, ip: 10.14.228.74, port: 59081, id: 15724
[2022-11-18 15:10:43,587 I 14604 2452] (raylet.exe) accessor.cc:608: Received notification for node id = 8cc668ae845aca22294702a29e89487ae7d358b4bf7d5ba5d59ff9fa, IsAlive = 0
[2022-11-18 15:10:43,591 C 14604 2452] (raylet.exe) node_manager.cc:1057: [Timeout] Exiting because this node manager has mistakenly been marked as dead by the GCS: GCS didn't receive heartbeats from this node for 30000 ms. This is likely because the machine or raylet has become overloaded.
*** StackTrace Information ***
unknown
unknown
unknown
unknown
unknown
unknown
unknown
unknown
unknown
unknown
unknown
unknown
unknown
BaseThreadInitThunk
RtlUserThreadStart

Do you mind posting your firewall configuration? In particular, do you have bidirectional communication on these ports? See "Configuring Ray" in the Ray 2.1.0 documentation.
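One way to check the "bidirectional communication" question is a plain TCP connect test, run once from a worker toward the head node and once in the other direction. The snippet below is only a minimal sketch, not part of Ray: `can_connect` is a hypothetical helper, and the IP address and port in the commented usage are placeholders (6379 is assumed here as the head node port; use whatever ports your cluster was actually started with).

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Example usage (placeholder address and port, adjust to your cluster):
# print(can_connect("10.0.0.1", 6379))
```

A `False` result only says that the connection could not be established from this machine. It does not tell you whether a firewall, Group Policy, or a process that is not listening is the cause, and the test has to be repeated from the other side to cover both directions.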

Hi Alex,
thanks for your time. Interestingly, I tried the same setup on my home network (same OS and Python environment) and it seems to work there. The following settings were created automatically when the Windows Defender popup was accepted. I copied these settings to our company machines, but there the nodes still stay alive for only a few seconds.
Here are snippets from my firewall inbound rules:



Hey,
the problem is solved now. I identified 2 problems:

  1. The firewall inbound rules were overridden by Group Policy settings, so the local settings were ignored.
  2. Because I used a Python virtual env, I had to create a rule for the base Python 3.7 python.exe in addition to the venv python.exe. That was why the logs were not accessible.
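To find the two executables mentioned in point 2, you can ask the interpreter itself. This is a small sketch using only the standard library. `interpreter_paths` is a hypothetical helper name, and the assumption that the base `python.exe` sits directly in `sys.base_prefix` holds for a standard Windows installer layout but may not hold for other distributions.

```python
import os
import sys


def interpreter_paths() -> dict:
    """Return the running interpreter and the base interpreter a venv was created from."""
    # Inside a virtual environment, sys.prefix points at the venv and
    # sys.base_prefix at the installation it was created from.
    in_venv = sys.prefix != sys.base_prefix
    if os.name == "nt":
        # Assumption: standard Windows layout, python.exe directly in the base prefix.
        base = os.path.join(sys.base_prefix, "python.exe")
    else:
        base = os.path.join(sys.base_prefix, "bin", os.path.basename(sys.executable))
    return {"in_venv": in_venv, "current": sys.executable, "base": base}


# Example usage: run this with the venv activated.
# for key, value in interpreter_paths().items():
#     print(key, value)
```

When run inside the venv, `current` and `base` should be the two paths that each need their own firewall rule.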

How did you solve the above-mentioned problems?

@bananajoe182

Hello, I have the same problem as you mentioned. I have a quick question about the solutions you presented.

  1. What group policy settings did you edit?
  2. Which rule should I create for the python 3.7 .exe and venv python.exe? (Group policy, inbound, etc.)

Thank you for your help in advance!

Hi,

You have to allow all inbound connections for the system python.exe as well as the venv python.exe if you use a venv. Additionally, allow all for ray.exe and gcs_server.exe. Do this on all worker nodes and on the head node. If your machine is part of a company network, make sure the rules are set as administrator so that they override the local user settings.

If it still doesn’t work, try adding the same rules for outbound connections.
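For repeating this on several machines, the rules can be generated rather than clicked together. The sketch below only builds `netsh advfirewall firewall add rule` command lines. The helper names, rule names, and every path in the commented usage are hypothetical placeholders, not taken from the setup described above.

```python
def firewall_rule_cmd(name: str, program: str, direction: str = "in") -> list:
    """Build a netsh command that allows all traffic for one executable."""
    if direction not in ("in", "out"):
        raise ValueError("direction must be 'in' or 'out'")
    return [
        "netsh", "advfirewall", "firewall", "add", "rule",
        f"name={name}",
        f"dir={direction}",
        "action=allow",
        f"program={program}",
        "enable=yes",
    ]


def build_rules(programs: dict, directions=("in",)) -> list:
    """Build one command per (executable, direction) pair."""
    return [
        firewall_rule_cmd(f"ray-{label}-{direction}", path, direction)
        for label, path in programs.items()
        for direction in directions
    ]


# Example usage (placeholder paths, replace with the real locations):
# programs = {
#     "base-python": r"C:\Python37\python.exe",
#     "venv-python": r"C:\venvs\ray\Scripts\python.exe",
#     "ray": r"C:\venvs\ray\Scripts\ray.exe",
# }
# for cmd in build_rules(programs, directions=("in", "out")):
#     print(" ".join(cmd))
```

The printed commands would have to be run from an elevated (administrator) prompt on every node, with the remaining Ray executables such as gcs_server.exe added to the dictionary. On a company network, Group Policy may still take precedence over rules created this way, as described earlier in the thread.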

Hope that helps!
