Hi Ray team,
Environment: 3 nodes K8S, Ray 1.9.2 and Python 3.7
I followed Distributed data loading and compute example to load a large parquet dataset and distribute data across the 3 worker nodes (using placement_group with strict_spread). The data loaded and spread across the nodes correctly.
However, the raylet process crashed
The node with node id: 7151fe9c00f74fa75a834c3bddf657c094fcfd017a97d9b731c76720 and ip: 10.244.2.11 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats.
In this case, restarting a worker node did not help. Is there way to recover it from the failure? Ideally, I would like to Ray start the process and re-scheduled or run a task again to recover the dataset for that node.
I really appreciate your feedback to resolve this issue.