I’m using Ray on a GCP GPU cluster for hyperparameter tuning, training, and prediction. In all of these use cases, roughly 25% of the time Ray crashes with the following message:
The node with node id: xxx and ip: xxx has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats.
On the nodes that get marked dead, I check raylet.out and see many messages like:
Last resource report was sent 612 ms ago. There might be resource pressure on this node. If resource reports keep lagging, scheduling decisions of other nodes may become stale
Last heartbeat was sent 515 ms ago. There might be resource pressure on this node. If heartbeat keeps lagging, this node can be marked as dead mistakenly
Most of the warnings are in the 500 ms range, but I do see a few as high as 20 seconds, and when the lag hits 30 seconds the node gets marked dead. If it helps: the warnings tend to appear while I’m moving data between the nodes and cloud storage, which I do with multithreaded rsync. Reducing the number of rsync threads hasn’t helped, and the nodes are nowhere near their memory or CPU limits when the error occurs.
Any suggestions? I feel like increasing the heartbeat timeout would help, but I can’t find an option for that anywhere in the docs. Thanks!
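Edit: digging through the Ray source, the closest thing I could find is a `num_heartbeats_timeout` key that can be passed via `--system-config` on the head node. I haven’t tested this, and I’m not sure it’s a public/supported setting rather than an internal one, so treat the snippet below as a guess. Is something like this the right knob?

```shell
# Untested guess: raise the number of missed heartbeats tolerated
# before a node is marked dead (num_heartbeats_timeout may be an
# internal, unsupported setting). Head node only.
ray start --head --system-config='{"num_heartbeats_timeout":300}'
```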