Nodes marked dead near end of hyperparameter search


I’m running a distributed hyperparameter search using hyperopt on a dataproc gpu cluster.

About half the time, when the last couple of trials are running, and many of the nodes are idle, I get the following error:

2021-06-27 16:55:11,688 WARNING -- The node with node id: ced0c9c4c8df26aa204c51b02b40f5c2bae4ffa68fded19af4c8102a and ip: has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats.
(raylet, [2021-06-27 16:55:12,126 C 10863 10863]  Check failed: node_id != self_node_id_ Exiting because this node manager has mistakenly been marked dead by the monitor: GCS didn't receive heartbeats within timeout 30000 ms. This is likely since the machine or raylet became overloaded.

AFAIK, this error has never happened in the beginning or middle of the hyperparameter search. Also, I’ve seen it happen on nodes that were done processing all of their trials.

Finally I use ray for many other non-tune applications, and it’s never happened anywhere else.

Does anyone have any suggestions for avoiding this error? Thanks.