How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
My whole cluster recently crashed because the head node died, and I can’t figure out how to debug it. All I can see in the logs are thousands of the following errors:
worker-xxxxx-01000000-21683.err:
The actor is dead because its node has died.
Node Id: 01c3cfea329fb8c3169abb71cd67e6666323095c4cecc29854e97425
worker-xxxxx-01000000-21683.err:
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
I don’t know where to look to find out what caused the node to die, and I’m not sure what to search for.
Has anyone else experienced this issue, or does anyone know how to debug, prevent, or recover from these failures?