How to detect when creating an actor fails

Is there any way to detect when the creation of an actor fails due to not having the required resources?

I understand about using a ready() method to test for a successful actor creation; however, if there’s not enough memory then Ray appears to hang.

Is there any way to trap this and recover gracefully? In our experience with Ray on K8s, this creates an unrecoverable condition which requires a restart of Ray.

Consider the following example code:

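import ray

@ray.remote
class FailingActor:
    def __init__(self, fail=False):
        if fail:
            raise Exception("dead")

    def ready(self):
        # Returns once the actor has been created and can handle calls.
        return

# Request far more memory (1000 ** 4 bytes, about 1 TB) than the machine has.
h = FailingActor.options(
    memory=1000 ** 4,
).remote(fail=False)

try:
    # Blocks indefinitely when the actor cannot be scheduled.
    ray.get(h.ready.remote())
except ray.exceptions.RayActorError as e:
    print(e)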

Suppose that the memory setting for memory-aware scheduling is larger than the available system memory.

Ray will give a warning, such as:

WARNING worker.py:1227 -- The actor or task with ID ffffffffffffffff050fc0bc7c0a14cc23c5f8f201000000 cannot be scheduled right now. It requires {memory: 48828125000.000000 GiB}, {CPU: 1.000000} for placement, however the cluster currently cannot provide the requested resources. The required resources may be added as autoscaling takes place or placement groups are scheduled. Otherwise, consider reducing the resource requirements of the task.

Then the process appears to hang. A keyboard interrupt shows:

Traceback (most recent call last):
  File "x.py", line 30, in <module>
    ray.get(h.ready.remote())
  File "/Users/paco/src/ffurf/venv/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 89, in wrapper
    return func(*args, **kwargs)
  File "/Users/paco/src/ffurf/venv/lib/python3.7/site-packages/ray/worker.py", line 1615, in get
    object_refs, timeout=timeout)
  File "/Users/paco/src/ffurf/venv/lib/python3.7/site-packages/ray/worker.py", line 348, in get_objects
    object_refs, self.current_task_id, timeout_ms)
  File "python/ray/_raylet.pyx", line 1076, in ray._raylet.CoreWorker.get_objects
  File "python/ray/_raylet.pyx", line 156, in ray._raylet.check_status
KeyboardInterrupt

BTW, this is related to https://discuss.ray.io/t/creating-actors-when-their-amount-is-more-than-num-cpus/1810/8 although it concerns a different aspect of the problem, one that is not solved by the excellent approach of using a ready() method.

Also tried limiting restarts, based on https://docs.ray.io/en/latest/fault-tolerance.html?#fault-tolerance (thanks @dmatrix !)

@ray.remote(max_restarts=0)
class FailingActor:

and with Ray 1.9.2 on macOS, this gets into a loop with the message:

(scheduler +1m52s) Error: No available node types can fulfill resource request {'memory': 1000000000000.0}. Add suitable node types to this cluster to resolve this issue.

although the process still hangs.

If a task requires resources that can never be satisfied, it is called an infeasible task. Right now, the best way to catch infeasible tasks is to rely on a timeout (basically, pass a timeout to ray.get).

We are also planning to implement InfeasibleTaskException, which would raise an exception in this case, but work on it has not started yet (a proposal has been prepared, but it hasn’t been prioritized).

Thank you, yes, we’ll wait for the InfeasibleTaskException implementation.

Btw @ceteri, a short-term mitigation could be to use ray.wait([actor.ready.remote()], timeout=10), the idea being that if the actor doesn’t start within the timeout, then it probably won’t start.
