Tracing reason for get_next_unordered Timeout

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

I’m using the ActorPool class and calling has_next and get_next_unordered, but get_next_unordered frequently hangs. I implemented some basic retry mechanisms but still encounter frequent get_next_unordered blocking.

What recommendations does the community have for troubleshooting frequent hanging of ActorPool.get_next_unordered (ray.wait)?

Can you share a repro script?

has_next returns True if there are submitted tasks whose results have not yet been retrieved with get_next[_unordered]. Such a task may still be running, so the call to get_next_unordered can block on it. What’s your expected behavior?

The way you stated the situation is the understanding I came to after thinking about the problem. My incorrect assumption was that has_next should return True only if there is completed work to get. After changing my understanding, I want to propose a new function on ActorPool like this (maybe my first pull request, if people like the idea):

import ray
from ray.util.actor_pool import ActorPool

class NewActorPool(ActorPool):
    def has_next_completed_result(self):
        """Returns whether there are any completed results to return.

        Returns:
            True if there are any completed results not yet returned.

        Examples:
            .. testcode::

                import ray
                from time import sleep
                from ray.util.actor_pool import ActorPool

                @ray.remote
                class Actor:
                    def double(self, v):
                        sleep(6)
                        return 2 * v

                    def double_sleep_long(self, v):
                        sleep(60)
                        return 2 * v

                a1, a2 = Actor.remote(), Actor.remote()
                pool = NewActorPool([a1, a2])
                pool.submit(lambda a, v: a.double.remote(v), 1)
                pool.submit(lambda a, v: a.double_sleep_long.remote(v), 1)

                print(pool.has_next_completed_result())
                sleep(10)
                print(pool.has_next_completed_result())
                print(pool.get_next())
                print(pool.has_next_completed_result())
                sleep(70)
                print(pool.has_next_completed_result())
                print(pool.get_next())

            .. testoutput::

                False
                True
                2
                False
                True
                2
        """
        if not self._future_to_actor:
            return False

        ready, _ = ray.wait(list(self._future_to_actor), num_returns=1, timeout=0)
        return len(ready) > 0

With this function I can use has_next_completed_result as the condition of a while loop without having to deal with the timeout/blocking/retry logic of get_next_unordered/get_next.
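For concreteness, the loop I have in mind looks roughly like this. It is only a sketch: it assumes the NewActorPool class above plus a placeholder Actor with a one-second double method.

import ray
from time import sleep

@ray.remote
class Actor:
    def double(self, v):
        sleep(1)
        return 2 * v

pool = NewActorPool([Actor.remote(), Actor.remote()])
for v in range(4):
    pool.submit(lambda a, v: a.double.remote(v), v)

results = []
while pool.has_next():                        # tasks still outstanding
    if pool.has_next_completed_result():      # a result is actually ready
        results.append(pool.get_next_unordered())
    else:
        sleep(0.1)                            # free to do other work here
print(results)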

What do you think, @Ruiyang_Wang? If this function is useful for others, or simply a good addition, I will open the pull request.

On another point related to ActorPool, I’m working on some changes that will allow the ActorPool to more gracefully/intelligently handle Actors that fail or are killed due to OOM errors. Do you have any thoughts on strategies around recovering those Actor failures?

Hmm, I wonder if we can make it more generic by exposing the Ray objects, so users could just do ray.wait on their own. The reason being: using Ray objects enables all the established object methods, like ray.cancel, or passing the ref to another Ray task.

I’m not following. My goal was to only call get_next/get_next_unordered if there was guaranteed to be a result. Do you mean something like exposing the ObjectRefs directly in the ActorPool class?

In my case I don’t care which actor returns a result first, but I only want to call get_next_unordered if there is actually a result. I guess if the ObjectRefs were available I could iterate over them or pass the list to ray.wait. That would work, of course, but I also like having a convenience method to know whether there is completed work.

Exposing the ObjectRefs directly might also support my final question: when my actor fails, how do I know it failed, and how do I recover the arguments passed to the remote function so that I can handle the failure gracefully in my application?

For instance, if an Actor is killed due to an OOM error, I think (please correct me) that the arguments I passed to the remote function are gone. If the ActorPool exposed those task-related objects and arguments, we might be able to create a more reliable ActorPool. Maybe some functions like:

ActorPool.pending_tasks → returns future + args
ActorPool.running_tasks → returns ObjectRef + args
ActorPool.complete_tasks → returns ObjectRef to task result + args
ActorPool.failed_tasks → returns args + exception class

Those could return other metadata as well, such as how long a task has been running; in that case the user could call ray.cancel (as you suggest) if a task has been running for too long.
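To make the idea concrete, here is a rough sketch of the bookkeeping I have in mind, done outside ActorPool with plain ObjectRefs. The Actor class, the round-robin submission, and the retry policy are placeholders, not proposed ActorPool APIs:

import ray
from ray.exceptions import RayActorError

@ray.remote
class Actor:
    def double(self, v):
        return 2 * v

actors = [Actor.remote(), Actor.remote()]
ref_to_args = {}  # ObjectRef -> the argument the task was submitted with

# Submit work round-robin, remembering each task's argument.
for i, v in enumerate(range(8)):
    ref = actors[i % len(actors)].double.remote(v)
    ref_to_args[ref] = v

results = []
while ref_to_args:
    ready, _ = ray.wait(list(ref_to_args), num_returns=1)
    ref = ready[0]
    args = ref_to_args.pop(ref)
    try:
        results.append(ray.get(ref))
    except RayActorError:
        # The actor died (e.g. killed for OOM), but the argument is still
        # known, so the task can be resubmitted to a replacement actor.
        replacement = Actor.remote()
        actors.append(replacement)
        retry_ref = replacement.double.remote(args)
        ref_to_args[retry_ref] = args

print(sorted(results))

If ActorPool tracked something like ref_to_args internally, the failed_tasks function above could be built on top of it.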

@Ruiyang_Wang It might also be interesting to keep track of the number of actors in the ActorPool, and possibly other metadata about the Actors in the pool: working, idle, failed… I might want to make sure the pool keeps a target number of actors. I’m working on this logic now, so I think I’ll have to inspect _idle_actors and _future_to_actor to understand the current state of the pool.

How do I know if an idle actor was killed or died? How do I know if _future_to_actor holds an exception instead of a completed result? It would be nice if the ActorPool itself were a window into that state, rather than having to evaluate the result of get_next_unordered, find an error, then recover the actor and resubmit the work.

I’m only now addressing these cases in my application, so I thought I would ask here, maybe you or some other devs have some thoughts or suggestions.

@Ruiyang_Wang Any thoughts?

Sorry for the delayed response.

The has_next_completed_result you proposed can already be achieved with pool.get_next_unordered(timeout=0), which raises TimeoutError if nothing is ready and returns a value if an object is ready.
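Roughly, as a sketch assuming a pool with submitted tasks like in your example:

from time import sleep

results = []
while pool.has_next():
    try:
        # Returns immediately: either a completed result or TimeoutError.
        results.append(pool.get_next_unordered(timeout=0))
    except TimeoutError:
        sleep(0.1)  # nothing ready yet; do other work instead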

For failed actors, ActorPool currently does not provide any fault tolerance. You can set @ray.remote(max_restarts=-1) in the Actor definition to make it restart automatically if the actor dies. For more nuanced use cases, maybe we need Ray Data.

Thanks @Ruiyang_Wang, I’ll look into max_restarts and Ray Data. I’ll open another conversation with more specific questions if needed. I think we can close this.
