- High: It blocks me from completing my task.
Hi hi,
I hope you can help me with the following error:
“The node with node id: * and address: * and node name: * has been marked dead because the detector has missed too many heartbeats from it. This can happen when a (1) raylet crashes unexpectedly (OOM, preempted node, etc.) (2) raylet has lagging heartbeats due to slow network or busy workload.”
How should I resolve this issue? It seems the heartbeat connection between the worker and the head node is failing. Which port needs to be accessible for the worker to communicate with the head node? Also, how do I set this port when starting the worker and the head node?
Thank you for your assistance.
BR
Does this occur after running for a while, or right at startup of a new Ray worker?
It happens at startup. I think there may be a port through which the worker can’t be reached.
I would like to ask how the system works. After the head node starts, does the worker connect to a TCP server on the head node through one of its ports and establish the heartbeat over that connection, or does the worker start a server and ask the head node to connect to it?
It goes Ray Head > Worker; it’s a Driver > Slave distributed paradigm. So your Compute fleet would need all ports available and open in order for communications to work.
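To narrow down which port is blocked, a plain TCP connection test run from the worker machine (and again from the head toward the worker) can help. Below is a minimal sketch using only Python’s standard library. The function name `can_connect`, the host, and the port are illustrative placeholders, not part of Ray. It only checks that a TCP connection can be opened, not that Ray itself is healthy.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Placeholder values (assumptions): replace with the head node's address
    # and the port that was passed to `ray start --head --port=...`.
    head_host = "127.0.0.1"
    head_port = 6379
    status = "reachable" if can_connect(head_host, head_port) else "NOT reachable"
    print(f"{head_host}:{head_port} is {status}")
```

If the check fails for a port that Ray is listening on, the security group or firewall between the two machines is the likely place to look.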
What does your compute infra setup look like? Are you running on Amazon, Google, or another cloud provider?
Yes, my head node is on an AWS server, but I don’t want to allow access to all ports. Could you please let me know which specific ports I need to allow?