How to recover or re-run actor task on a specific worker node after raylet crashed

mmuru · February 16, 2022, 6:01pm

Hi Ray team,

Environment: 3 nodes K8S, Ray 1.9.2 and Python 3.7

I followed the "Distributed data loading and compute" example to load a large parquet dataset and distribute the data across the 3 worker nodes (using placement_group with strict_spread). The data loaded and spread across the nodes correctly.

However, the raylet process then crashed with the following message:

The node with node id: 7151fe9c00f74fa75a834c3bddf657c094fcfd017a97d9b731c76720 and ip: 10.244.2.11 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats.

In this case, restarting the worker node did not help. Is there a way to recover from this failure? Ideally, I would like Ray to restart the process and re-schedule or re-run the task to recover the dataset for that node.

I would really appreciate your feedback on resolving this issue.

yic · February 17, 2022, 3:20am

I think lineage reconstruction can help in this case. @Stephanie_Wang could you help check this thread?
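
As far as I understand, lineage reconstruction re-executes tasks to rebuild lost objects, while state held inside an actor is recovered by restarting the actor. Below is a minimal sketch of that second route, assuming each loader actor reads a fixed list of parquet files: `max_restarts` and `max_task_retries` are Ray actor options, but `shard_for`, `start_loaders`, `ShardLoader` and the one-CPU bundles are hypothetical names and choices made up for illustration. The placement group arguments follow the Ray 1.x style used in this thread; newer releases may prefer a scheduling strategy object instead.

```python
from typing import List


def shard_for(paths: List[str], num_shards: int, index: int) -> List[str]:
    """Pick a deterministic subset of files for shard `index`.

    Sorting first means a restarted actor gets exactly the same files again.
    """
    return sorted(paths)[index::num_shards]


def start_loaders(paths: List[str], num_shards: int):
    """Start one restartable loader actor per placement group bundle (hypothetical helper)."""
    # Imported lazily so shard_for can be used and tested without Ray installed.
    import ray
    from ray.util.placement_group import placement_group

    # -1 means "no limit" for both options.
    @ray.remote(max_restarts=-1, max_task_retries=-1)
    class ShardLoader:
        def __init__(self, shard_paths):
            # __init__ runs again on every restart, so the shard is reloaded.
            import pyarrow.parquet as pq
            self.tables = [pq.read_table(p) for p in shard_paths]

        def num_rows(self):
            return sum(t.num_rows for t in self.tables)

    # Assumption: one CPU per bundle, one bundle per node.
    pg = placement_group([{"CPU": 1}] * num_shards, strategy="STRICT_SPREAD")
    ray.get(pg.ready())
    return [
        ShardLoader.options(
            placement_group=pg, placement_group_bundle_index=i
        ).remote(shard_for(paths, num_shards, i))
        for i in range(num_shards)
    ]
```

With this layout, a call such as `ray.get(loader.num_rows.remote())` should be retried after the actor comes back instead of failing immediately, provided a node is available to host the bundle again.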

ericl · May 3, 2022, 11:39pm

@mmuru it’s likely that K8s killed your node because it exceeded a memory limit or ran out of disk space. Ray’s object store spills data to disk when it runs out of memory, but it does not currently enforce any disk limit. Do you have any indication that one of these scenarios is happening? One way to check is to configure higher disk space / memory limits for the node, or to monitor usage stats for these resources.
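
For the monitoring part, here is a rough sketch using only the Python standard library that could be run inside the worker container. The function names are hypothetical, the cgroup file paths are the usual Linux locations and may differ on your nodes, and `/tmp` is only an assumption about where Ray's session directory lives in your setup.

```python
import shutil
from pathlib import Path
from typing import Optional


def disk_usage_fraction(path: str = "/tmp") -> float:
    """Return the used fraction (0.0 to 1.0) of the filesystem containing `path`."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return usage.used / usage.total


def cgroup_memory_fraction() -> Optional[float]:
    """Return memory used / memory limit for this container, or None if unknown."""
    base = Path("/sys/fs/cgroup")
    candidates = [
        # cgroup v2 layout
        (base / "memory.current", base / "memory.max"),
        # cgroup v1 layout
        (base / "memory" / "memory.usage_in_bytes",
         base / "memory" / "memory.limit_in_bytes"),
    ]
    for used_file, limit_file in candidates:
        try:
            used = int(used_file.read_text().strip())
            limit_text = limit_file.read_text().strip()
        except (OSError, ValueError):
            continue  # files missing or unreadable; try the next layout
        if limit_text == "max" or not limit_text.isdigit() or int(limit_text) == 0:
            return None  # no usable limit configured
        return used / int(limit_text)
    return None


if __name__ == "__main__":
    print(f"disk used: {disk_usage_fraction():.0%}")
    mem = cgroup_memory_fraction()
    print("memory used:", "unknown" if mem is None else f"{mem:.0%}")
```

If either value climbs toward 100% shortly before the node is marked dead, that would point to the memory or disk limit as the cause.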
