Best practice for understanding why tasks get killed

When running a larger number of tasks, some of them get ‘killed’, with no further obvious information.

Is there already a best practice for getting to the bottom of this as quickly as possible? For example, which tools to use, which logs to look at first, and so on?


What do you mean by tasks getting killed, and how did you figure that out?

When executing certain tasks, this sometimes shows up in the logs:

(pid=894) Killed

I suspect this might be related to resource constraints. I could use traditional techniques to get to the root cause, but I felt there might be something Ray-specific: if this is happening across many tasks, you need a scalable approach for getting to the bottom of such issues.

(pid=894) Killed

Just to be clear, did you add this log on your own? I don’t think Ray has any log that only prints Killed in this way.

(I agree about the scalable approach, but I’d just like to make sure I understood how you discover this first).

No, I haven't added anything that produces this log. Without having drilled into the code yet, I can imagine this is related to a process I'm launching from the task. It could be caused by memory shortages or other factors. I could try to drill into this manually, but before doing that I wanted to take a step back and see whether there is a best practice or a fit-for-purpose tool for drilling into the root cause of this issue.
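If the 'Killed' line does come from a process launched inside the task, one cheap first check is the child's return code. Below is a minimal sketch, assuming a POSIX system and that the process is started with Python's `subprocess` module; `describe_exit` and `run_and_report` are hypothetical helper names, not part of Ray.

```python
import signal
import subprocess


def describe_exit(returncode):
    """Turn a subprocess return code into a short human-readable description."""
    if returncode < 0:
        # On POSIX, a return code of -N means the child was terminated by signal N.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = "signal {}".format(-returncode)
        return "terminated by {}".format(name)
    if returncode == 0:
        return "exited normally"
    return "exited with status {}".format(returncode)


def run_and_report(cmd):
    """Run cmd (a list of arguments) and describe how it ended."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return describe_exit(proc.returncode)
```

Logging the result of `describe_exit` from inside the task would show whether the child was terminated by SIGKILL, which is what one would expect if something like the kernel's out-of-memory handling ended it. That cause is only a guess at this point, not something confirmed in this thread.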

The best solution right now is to look at the worker's log files (Ray Logging — Ray v2.0.0.dev0), both core-worker-* and worker-*. There are usually 3 possibilities: 1. A node failure causes the worker to be killed. 2. An application error occurred. 3. There are unexpected system errors from the worker.

1 can be verified if you see a log in your driver saying node X has died.
2 can be verified by looking at the log files.
For 3, the error message will say the error is due to a system error.
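To make the log check a bit more scalable across many tasks, something like the following could be run on each node. It is only a rough sketch: the default directory `/tmp/ray/session_latest/logs` is an assumption about where Ray writes logs on your nodes, the search pattern is a guess at useful keywords, and `scan_worker_logs` is a hypothetical helper, not a Ray API.

```python
import glob
import os
import re

# Assumption: Ray's default log directory on a node; adjust for a custom temp dir.
DEFAULT_LOG_DIR = "/tmp/ray/session_latest/logs"

# Assumption: keywords worth flagging; tune these for your application.
SUSPICIOUS = re.compile(r"killed|out of memory|traceback|system error", re.IGNORECASE)


def scan_worker_logs(log_dir=DEFAULT_LOG_DIR, pattern=SUSPICIOUS):
    """Return (path, line_number, line) for every matching line in worker logs."""
    paths = set()
    # Only look at the two file families mentioned above.
    for name_glob in ("worker-*", "*core-worker-*"):
        paths.update(glob.glob(os.path.join(log_dir, name_glob)))

    hits = []
    for path in sorted(paths):
        if not os.path.isfile(path):
            continue
        with open(path, errors="replace") as fh:
            for lineno, line in enumerate(fh, start=1):
                if pattern.search(line):
                    hits.append((path, lineno, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    for path, lineno, line in scan_worker_logs():
        print("{}:{}: {}".format(path, lineno, line))
```

The file name usually contains the worker's pid, so a hit can be matched against the `(pid=...)` prefix seen in the driver output.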

I agree we’ll need more comprehensive ways to do this though. Feel free to make a suggestion if you have any ideas. I don’t think we currently have any immediate plan to improve the debugging experience here (but again, I definitely agree it is important, so we should start thinking about it).

Also, one surprising thing is that the log you posted (Killed) is not in our repo, so it must have been printed by another library within that worker or something similar. In this case you can also try using this: Ray Debugger — Ray v2.0.0.dev0

I think it must be this kind of ‘Killed’: