How to debug when a node dies?

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

My whole cluster recently crashed because the head node died, and I can’t figure out how to debug it. All I can see in the logs are thousands of copies of the following errors:

worker-xxxxx-01000000-21683.err:
The actor is dead because its node has died. 
Node Id: 01c3cfea329fb8c3169abb71cd67e6666323095c4cecc29854e97425
worker-xxxxx-01000000-21683.err:
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.

I don’t know where to look to find out what caused the node to die, and am not sure what to search for.

Has anyone else experienced this issue, or does anyone know how to debug, prevent, or recover from these failures?

Sorry @steventrouble, I’m only seeing this now. Could you share the entire log file (it’s fine to censor worker IDs as you’ve done here)?

cc @Chen_Shen who might have insight on how to debug head node failures.

@steventrouble, can you share the contents of gcs_server.out and raylet.out on the head node, along with any errors in them? A head node death usually leaves some hints there.
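
If those files are large, a small script can narrow down which lines to paste. The following is a minimal sketch, not an official Ray tool. It assumes the logs live in Ray's default location, /tmp/ray/session_latest/logs, on the head node (adjust the path if you start Ray with a custom temp directory). The function name scan_logs and the keyword list are hypothetical, chosen only for illustration, so expect both false positives and misses.

```python
from pathlib import Path

# Assumption: Ray's default log directory on the head node.
DEFAULT_LOG_DIR = Path("/tmp/ray/session_latest/logs")

# Hypothetical keyword list for illustration; not an official set of Ray failure markers.
KEYWORDS = ("error", "fatal", "check failed", "killed", "out of memory")


def scan_logs(log_dir=DEFAULT_LOG_DIR,
              names=("gcs_server.out", "raylet.out"),
              keywords=KEYWORDS,
              tail=2000):
    """Return (file name, line number, line) for suspicious lines near the end of each log."""
    hits = []
    for name in names:
        path = Path(log_dir) / name
        if not path.is_file():
            continue  # skip logs that are absent on this machine
        lines = path.read_text(errors="replace").splitlines()
        # Only look at the last `tail` lines, where a crash is most likely recorded.
        start = max(0, len(lines) - tail)
        for lineno, line in enumerate(lines[start:], start=start + 1):
            lowered = line.lower()
            if any(keyword in lowered for keyword in keywords):
                hits.append((name, lineno, line.rstrip()))
    return hits


if __name__ == "__main__":
    for name, lineno, line in scan_logs():
        print(f"{name}:{lineno}: {line}")
```

Run it on the head node after the crash. The matching lines, plus a few lines of context around the last ones, are usually the most useful part to share in a thread like this.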