Random Halt and No Error/Warnings

Hi,

For some reason, my runs keep halting, and I cannot find any clear error. Could you help with this, or point me to a better place to ask this question? The Ray logs are attached for your reference:

Link to all the important logs: Ray Logs - Failed Experiment - Pastebin.com

I also have this problem

@vpcom @kelzheng What workload are you running? Are you just executing a Ray Dataset, or is training also involved? Do you have any relevant driver logs that you could show?

@justinvyu Thanks for your reply. I think setting RAY_TASK_MAX_RETRIES=0 resolved our issue. For some reason, in certain environments the halt occurs unless RAY_TASK_MAX_RETRIES=0 is set. I have already posted all the relevant logs in the Pastebin. There was no specific error in the driver log either: the run simply stopped doing anything, with no resource utilization at all. The workload was a training task, with a Ray Dataset loading the data on the same node, and the node was large enough for both.
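
For anyone landing here with the same symptom, this is a minimal sketch of how the variable can be set from the driver script. The variable name is the one mentioned above; the helper `retry_setting` is only a hypothetical sanity check. I am assuming the variable has to be in the environment before Ray starts, and that Ray is started locally from this script. If you connect to an already running cluster, it probably has to be exported where the cluster is launched instead.

```python
import os

# Variable name taken from this thread. Set it before Ray starts so that
# processes launched from this driver inherit it (assumption: local start).
os.environ["RAY_TASK_MAX_RETRIES"] = "0"


def retry_setting() -> int:
    """Hypothetical helper: return the retry count exported in this process."""
    return int(os.environ["RAY_TASK_MAX_RETRIES"])


print(retry_setting())

# Start Ray only after the variable is in place, for example:
# import ray
# ray.init()
```

Disabling retries does not explain the halt; it only makes a failing task surface instead of being retried silently, which is why it can help with debugging.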