The heartbeat between the worker and the head node has failed

Dunty_Z · July 15, 2024, 7:09am

High: It blocks me from completing my task.

Hi hi,

I hope you can help me with the following error:

“The node with node id: * and address: * and node name: * has been marked dead because the detector has missed too many heartbeats from it. This can happen when a (1) raylet crashes unexpectedly (OOM, preempted node, etc.) (2) raylet has lagging heartbeats due to slow network or busy workload.”

How should I resolve this issue? It seems like the heartbeat connection between the worker and the head node is failing. Which port do I need to make accessible so the worker can communicate with the head node? Also, how do I set this port when starting the worker and the head node?

Thank you for your assistance.

BR

Sam_Chan · July 15, 2024, 4:28pm

Does this occur after running for a while or even right on new Ray Worker startup?

Dunty_Z · July 16, 2024, 6:12am

It happened at boot time. I think there may be a port on the worker that can't be reached.
I would like to ask how the system works. After the head node starts, does the worker connect to a TCP server on the head node through one of its ports and establish the heartbeat over that connection, or does the worker start a server and ask the head node to connect to it?

Sam_Chan · July 16, 2024, 4:36pm

It goes Ray Head > Worker; it’s a Driver > Slave distributed paradigm. So your Compute fleet would need all ports available and open in order for communications to work.

What does your compute infra setup look like? Are you running on Amazon/Google/one of the cloud providers?

Dunty_Z · July 17, 2024, 6:09am

Yes, my head node is on an AWS server, but I don’t want to allow access to all ports. Could you please let me know which specific ports I need to allow?

Sam_Chan · July 17, 2024, 6:34am

https://docs.ray.io/en/master/ray-core/configure.html#ports-configurations
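
Once the ports are pinned and allowed in the security group, it can help to confirm from the other machine that they are actually reachable. Below is a minimal sketch using only the Python standard library; it is not part of Ray, and the helper names `port_is_open` and `check_ports` are made up for illustration. It only tests whether a TCP connection can be opened, so a port with no process listening will also show up as closed.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_ports(host: str, ports: list[int], timeout: float = 3.0) -> dict[int, bool]:
    """Map each port in `ports` to whether it accepted a TCP connection."""
    return {port: port_is_open(host, port, timeout) for port in ports}
```

For example, running `check_ports("<head-ip>", [6379])` on the worker machine tests the default GCS port on the head node (6379 is the default for `ray start --head` unless `--port` was changed). The same check can be run from the head node against the worker's IP for whichever node manager, object manager, and worker ports you have pinned according to the page above.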
