Failed Tasks Debugging

How severely does this issue affect your experience of using Ray?

  • Low: It annoys or frustrates me for a moment.

Hi there,
I am noticing that when I submit a job that has more tasks than the number of CPU cores on the cluster, some tasks immediately Fail and then go to Pending Scheduling.

I would like to know if there is a log file or similar that explains why the Task failed in the first place.

On a related note, a blog post or other reference on how to make use of all the different types of log files would be great. There can be hundreds of log files, especially for worker and python, and it’s not clear how to find the one corresponding to the Failed task.

Thanks, Charu

Hi @ckapoor, to monitor task status you might find Monitoring Ray States — Ray 2.3.0 useful, particularly ray list tasks or ray get tasks <TASK_ID>. In general, that page could be a good starting point for debugging Ray task info.
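
The same state information can also be read from Python. Below is a minimal sketch, not an official recipe. It assumes the state API is importable as ray.util.state in newer releases or ray.experimental.state.api in Ray 2.3-era releases, and that task records carry name and error_type fields, which may be absent or named differently depending on the Ray version. The helper names fetch_failed_tasks, field, and summarize are hypothetical.

```python
from collections import Counter


def fetch_failed_tasks(limit=100):
    """Ask a running Ray cluster for tasks currently in the FAILED state."""
    # Assumption: the import path depends on the installed Ray version.
    try:
        from ray.util.state import list_tasks  # newer Ray releases
    except ImportError:
        from ray.experimental.state.api import list_tasks  # Ray 2.3-era path
    return list_tasks(filters=[("state", "=", "FAILED")], detail=True, limit=limit)


def field(record, name, default=None):
    """Read a field from a task record, whether it is a dict or an object."""
    if isinstance(record, dict):
        return record.get(name, default)
    return getattr(record, name, default)


def summarize(records):
    """Count task records by (task name, error type).

    error_type is an assumed field; it falls back to None when missing.
    """
    return Counter((field(r, "name"), field(r, "error_type")) for r in records)
```

Against a live cluster, summarize(fetch_failed_tasks()) would give a count of failed tasks per task name and error type. Keep in mind that a task that failed and was then retried may no longer be reported as FAILED by the time you query it.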