How to recover or re-run actor task on a specific worker node after raylet crashed

Hi Ray team,

Environment: 3 nodes K8S, Ray 1.9.2 and Python 3.7

I followed Distributed data loading and compute example to load a large parquet dataset and distribute data across the 3 worker nodes (using placement_group with strict_spread). The data loaded and spread across the nodes correctly.

However, the raylet process crashed

The node with node id: 7151fe9c00f74fa75a834c3bddf657c094fcfd017a97d9b731c76720 and ip: 10.244.2.11 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats.

In this case, restarting a worker node did not help. Is there way to recover it from the failure? Ideally, I would like to Ray start the process and re-scheduled or run a task again to recover the dataset for that node.

I really appreciate your feedback to resolve this issue.

I think lineage reconstruction can help in this case. @Stephanie_Wang could you help check this thread?

@mmuru it’s likely that K8s killed your node due to exceeding a memory limit or running out of disk space. Ray’s object store will spill data to disk when out of memory, but it is not currently enforcing any disk limit. Do you have an indication of whether one of these scenarios is happening? One way to check here is to configure higher disk space / memory limits for the node, or monitoring usage stats for these resources.