Hi,
I’m using Ray on a GCP GPU cluster for hyperparameter tuning, training, and prediction. In all of these use cases, Ray crashes about 25% of the time with the following message:
The node with node id: xxx and ip: xxx has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats.
On nodes that are marked dead, I check raylet.out and see many messages like
Last resource report was sent 612 ms ago. There might be resource pressure on this node. If resource reports keep lagging, scheduling decisions of other nodes may become stale
and
Last heartbeat was sent 515 ms ago. There might be resource pressure on this node. If heartbeat keeps lagging, this node can be marked as dead mistakenly
Most warnings are in the 500 ms range, but I do see a few as high as 20 seconds, and when the lag hits 30 seconds the node gets marked dead. If it helps, most of the warnings appear while I’m moving data between the nodes and cloud storage. For those transfers I’m running rsync multithreaded (several rsync processes in parallel); reducing the number of threads hasn’t helped. Also, the nodes are nowhere near their memory or CPU limits when the error occurs.
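In case the transfer pattern matters, here’s a simplified sketch of it (shard names, paths, and worker counts are placeholders; the real destination is a mounted cloud-storage path):

```python
# Simplified placeholder for the transfer step: "multithreaded rsync" here
# means one rsync process per shard, launched from a small thread pool.
import subprocess
from concurrent.futures import ThreadPoolExecutor

SHARDS = ["shard-00", "shard-01", "shard-02", "shard-03"]  # placeholder names

def sync_shard(shard: str) -> int:
    # Flags are illustrative; the destination stands in for our storage mount.
    cmd = ["rsync", "-az", f"/data/{shard}/", f"/mnt/gcs-bucket/{shard}/"]
    return subprocess.run(cmd, check=False).returncode

# Lowering max_workers hasn’t stopped the heartbeat warnings.
with ThreadPoolExecutor(max_workers=4) as pool:
    exit_codes = list(pool.map(sync_shard, SHARDS))
```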
Any suggestions? My instinct is that increasing the heartbeat timeout would help, but I don’t see a documented option for it anywhere.
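The closest thing I’ve turned up is Ray’s internal, undocumented `_system_config`. A minimal sketch of what I’m considering, assuming I’ve read the defaults right for my Ray version (a raylet heartbeat period of 100 ms and `num_heartbeats_timeout` of 300, which would explain the 30-second cutoff I’m seeing):

```python
# Untested sketch: raise the dead-node threshold via Ray’s internal
# system config. This only takes effect when starting the head node.
import ray

ray.init(
    _system_config={
        # Number of missed heartbeats before a node is marked dead.
        # If the heartbeat period really is 100 ms, the default of 300
        # gives the 30 s cutoff I’m hitting; 1200 would allow ~2 min.
        "num_heartbeats_timeout": 1200,
    },
)
```

With the cluster launcher I’d presumably pass the same setting through the head start command in the cluster YAML, e.g. `ray start --head --system-config='{"num_heartbeats_timeout": 1200}'`. Can anyone confirm this is the right knob, or whether raising it would just mask whatever is stalling the raylet? Thanks.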