The best option right now is to look at the worker's log files (see Ray Logging, Ray v2.0.0.dev0), both core-worker-* and worker-*. There are usually 3 possibilities: 1. A node failure caused the worker to be killed. 2. An application error occurred. 3. The worker hit an unexpected system error.
1 can be verified if you see a log message in your driver saying that node X has died.
2 can be verified by looking at the worker log files.
For 3, the error message will say that the failure is due to a system error.
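If it helps, here is a rough sketch of how you could scan those worker logs instead of opening them one by one. It assumes the logs live under /tmp/ray/session_latest/logs (the usual default on a single machine, so adjust it if your setup differs). The function name scan_worker_logs and the keyword list are made up for illustration and are not part of Ray.

```python
from pathlib import Path

# Assumption: Ray's usual default log location; pass another path if yours differs.
DEFAULT_LOG_DIR = Path("/tmp/ray/session_latest/logs")

# Hypothetical keywords picked for illustration; tune them for your application.
KEYWORDS = ("error", "traceback", "killed", "died")


def scan_worker_logs(log_dir=DEFAULT_LOG_DIR, keywords=KEYWORDS):
    """Return (file name, line number, line) for worker-log lines containing a keyword."""
    log_dir = Path(log_dir)
    hits = []
    # "*core-worker-*" and "worker-*" cover the two log file families mentioned above.
    paths = sorted(set(log_dir.glob("*core-worker-*")) | set(log_dir.glob("worker-*")))
    for path in paths:
        if not path.is_file():
            continue
        with path.open(errors="replace") as f:
            for lineno, line in enumerate(f, start=1):
                # Case-insensitive substring match against each keyword.
                if any(k in line.lower() for k in keywords):
                    hits.append((path.name, lineno, line.rstrip()))
    return hits
```

Calling scan_worker_logs() with no arguments scans the default directory. Each hit gives you the file and line to open, which is usually enough to tell an application traceback (case 2) apart from a system error message (case 3).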
I agree we need more comprehensive ways to do this, though. Feel free to make a suggestion if you have any ideas. I don't think we currently have any immediate plans to improve the debugging experience here (but again, I definitely agree it is important, so we should start thinking about it).