Best practice for understanding why tasks get killed

mbehrendt · March 2, 2021, 12:40pm

When running a larger number of tasks, some of them get ‘killed’, with no further obvious information.

Is there already a best practice for getting to the bottom of this as quickly as possible, e.g. which tools to use and which logs to look at first?

sangcho · March 2, 2021, 4:57pm

What do you mean by tasks get killed and how did you figure that out?

mbehrendt · March 2, 2021, 6:27pm

When executing certain tasks, this sometimes shows up in the logs:

(pid=894) Killed

I suspect this might be related to resource constraints. I could use traditional techniques to find the root cause, but I felt there might be something Ray-specific: if this happens across many tasks, you need a scalable approach for getting to the bottom of such issues.

sangcho · March 2, 2021, 10:11pm

(pid=894) Killed

Just to be clear, did you add this log on your own? I don’t think Ray has any log that only prints Killed in this way.

(I agree about the scalable approach, but I’d just like to make sure I understand how you discovered this first).

mbehrendt · March 2, 2021, 10:27pm

No, I haven’t added anything that produces this log. Without having drilled into the code yet, I can imagine this is related to a process I’m launching from the task. This could be caused by memory shortages or other factors. I could try to drill into this manually, but before doing that I wanted to take a step back and see whether there is a best practice or a special fit-for-purpose tool for drilling into the root cause of this issue.
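For the child-process case described above, one generic (not Ray-specific) check is to inspect the child's return code. A bare "Killed" is what a shell typically prints when a child is terminated by SIGKILL, and the Linux out-of-memory killer is one common source of that signal. The sketch below assumes the process is launched with Python's `subprocess` module; `describe_exit` and `run_and_report` are hypothetical helper names, not part of Ray.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a short human-readable description."""
    if returncode < 0:
        # On POSIX, a return code of -N means the child was terminated by signal N.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"terminated by {name}"
    return f"exited with status {returncode}"


def run_and_report(cmd):
    """Run cmd (a list of arguments) and describe how it ended."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return describe_exit(proc.returncode)


if __name__ == "__main__":
    # Demo: a child that sends SIGKILL to itself, standing in for an external kill.
    child = [sys.executable, "-c",
             "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"]
    print(run_and_report(child))
```

Logging this description from inside each task would at least record which signal ended the child, although it does not say who sent the signal.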

sangcho · March 3, 2021, 4:31am

The best solution for now is to look at the worker's log files (Ray Logging — Ray v2.0.0.dev0), both core-worker-* and worker-*. There are usually 3 possibilities:

1. A node failure causes the worker to be killed. You can verify this if you see a log in your driver saying node X has died.
2. An application error occurred. You can verify this by looking at the log files.
3. There is an unexpected system error from the worker. In this case, the error message will say the error is due to a system error.
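When many workers are involved, the log check described above can be scripted. The following is a minimal sketch, not an official Ray tool: it assumes the logs live in Ray's default location `/tmp/ray/session_latest/logs` on the node being inspected, and the keyword list and the `scan_worker_logs` helper are illustrative assumptions to adapt to your own setup.

```python
import re
from pathlib import Path

# Assumption: Ray's default log directory on a node; adjust if you use a custom temp dir.
DEFAULT_LOG_DIR = Path("/tmp/ray/session_latest/logs")
# File name patterns for the core-worker-* and worker-* logs mentioned above.
PATTERNS = ("*core-worker-*", "worker-*")
# Hypothetical keywords worth flagging; tune these for your workload.
KEYWORDS = re.compile(r"killed|died|traceback|system error|out of memory", re.IGNORECASE)


def scan_worker_logs(log_dir=DEFAULT_LOG_DIR, patterns=PATTERNS, keywords=KEYWORDS):
    """Return (file name, line number, line) for every log line matching a keyword."""
    log_dir = Path(log_dir)
    hits = []
    seen = set()
    for pattern in patterns:
        for path in sorted(log_dir.glob(pattern)):
            if path in seen or not path.is_file():
                continue
            seen.add(path)
            # errors="replace" keeps the scan going if a log contains odd bytes.
            with path.open(errors="replace") as fh:
                for lineno, line in enumerate(fh, start=1):
                    if keywords.search(line):
                        hits.append((path.name, lineno, line.rstrip()))
    return hits


if __name__ == "__main__":
    for name, lineno, line in scan_worker_logs():
        print(f"{name}:{lineno}: {line}")
```

This only inspects the node it runs on, so on a multi-node cluster it would have to be run on each node, or against logs collected into one place.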

I agree we’ll need more comprehensive ways to do this though. Feel free to make a suggestion if you have any ideas. I don’t think we currently have any immediate plan to improve the debugging experience here (but again, I definitely agree it is important, so we should start thinking about it).

sangcho · March 3, 2021, 4:32am

Also, one surprising thing is that the log you posted (Killed) is not in our repo, so it must have been printed by another library within that worker or something similar. In this case you can also try using this: Ray Debugger — Ray v2.0.0.dev0

mbehrendt · March 4, 2021, 1:05pm

I think it must be this kind of ‘Killed’:
