[Core] Timeout individual remote tasks

Consider submitting a collection of work to a remote method, where most tasks will complete in under a minute, but some may take orders of magnitude longer. The goal is to kill the outliers.

Assume that the number of tasks far exceeds the number of Ray workers. A user could implement this if, given an incomplete task_id, there were a way to tell whether the task is still queued or already assigned to a worker. Is there a way to do that?

Thanks, Eddie

Hey @eddie, great question, and sorry for the delay, which was caused by the question being “uncategorized”. It helps if you set a category (e.g. “RLlib”) when you post a new question. That way, we’ll find it more easily and can assign the right person to answer it.

Hey @Clark_Zinzow, could someone from the Ray core team answer this one here? Thanks :slight_smile:

Hi @eddie, unfortunately, there currently isn’t a way to set a per-task timeout, nor an API to determine whether a task is queued or running. A recently opened GitHub issue lists a few different options, such as using the ray.wait API or using an actor to register, watch, and cancel tasks.

Please let us know if any of those patterns will work for you!

Hi. Thanks for picking up this topic. I’ve gone through the GitHub enhancement discussion and don’t understand how it would work for a thousand tasks running with the number of workers dynamically changing between one worker and many. Also, it looks like asm582 did not succeed in getting the proposed actor-as-intermediary working.

Implementing something in the core is probably the simplest way to fix this. ray.wait() could then return 3 lists: completed, not-completed, timed-out.
Thanks again for considering this issue.

Thanks for bringing this up, @eddie. I raised it again, and we will investigate what the best fix is (you can see that the usability-hotfix label has been added to the issue). Feel free to follow up on the GitHub issue if things seem to be going slowly!