How to detect when creating an actor fails

ceteri · January 21, 2022, 10:04pm

Is there any way to detect when the creation of an actor fails due to not having the required resources?

I understand about using a ready() method to test for a successful actor creation; however, if there’s not enough memory then Ray appears to hang.

Is there any way to trap this and recover gracefully? In our experience with Ray on K8s, this creates an unrecoverable condition which requires a restart of Ray.

Consider the following example code:

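import ray

@ray.remote
class FailingActor:
    def __init__(self, fail=False):
        # Simulate a constructor failure on demand.
        if fail:
            raise Exception("dead")

    def ready(self):
        # Returns only after __init__ has completed.
        return

# Request 1000 ** 4 bytes (about 1 TB) of memory, more than the node has.
h = FailingActor.options(
    memory=1000 ** 4,
).remote(fail=False)

try:
    ray.get(h.ready.remote())
except ray.exceptions.RayActorError as e:
    print(e)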

Suppose that the memory setting for memory-aware scheduling is larger than the available system memory.

Ray will give a warning, such as:

WARNING worker.py:1227 -- The actor or task with ID ffffffffffffffff050fc0bc7c0a14cc23c5f8f201000000 cannot be scheduled right now. It requires {memory: 48828125000.000000 GiB}, {CPU: 1.000000} for placement, however the cluster currently cannot provide the requested resources. The required resources may be added as autoscaling takes place or placement groups are scheduled. Otherwise, consider reducing the resource requirements of the task.

Then the process appears to hang. A keyboard interrupt shows:

Traceback (most recent call last):
  File "x.py", line 30, in <module>
    ray.get(h.ready.remote())
  File "/Users/paco/src/ffurf/venv/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 89, in wrapper
    return func(*args, **kwargs)
  File "/Users/paco/src/ffurf/venv/lib/python3.7/site-packages/ray/worker.py", line 1615, in get
    object_refs, timeout=timeout)
  File "/Users/paco/src/ffurf/venv/lib/python3.7/site-packages/ray/worker.py", line 348, in get_objects
    object_refs, self.current_task_id, timeout_ms)
  File "python/ray/_raylet.pyx", line 1076, in ray._raylet.CoreWorker.get_objects
  File "python/ray/_raylet.pyx", line 156, in ray._raylet.check_status
KeyboardInterrupt

ceteri · January 21, 2022, 10:16pm

BTW, this is related to https://discuss.ray.io/t/creating-actors-when-their-amount-is-more-than-num-cpus/1810/8, although it's a different aspect of the problem and isn't solved by the excellent approach of using a ready() method.

ceteri · January 22, 2022, 1:13am

Also tried limiting restarts, based on https://docs.ray.io/en/latest/fault-tolerance.html?#fault-tolerance (thanks @dmatrix !)

@ray.remote(max_restarts=0)
class FailingActor:

and with Ray 1.9.2 on macOS, this gets into a loop with the message:

(scheduler +1m52s) Error: No available node types can fulfill resource request {'memory': 1000000000000.0}. Add suitable node types to this cluster to resolve this issue.

although the process still hangs.

sangcho · January 23, 2022, 11:17pm

If a task requires resources that can never be satisfied, it is called an infeasible task. Right now, the best approach to catch infeasible errors is to rely on a timeout (basically, adding a timeout to ray.get).

We are also planning to implement InfeasibleTaskException, which would raise an exception in this case, but this is not in progress yet (a proposal has been prepared, but it hasn’t been prioritized yet).

ceteri · January 24, 2022, 2:35am

Thank you, yes, we’ll wait for the InfeasibleTaskException implementation.

Alex · January 24, 2022, 3:18pm

Btw @ceteri, a short-term mitigation could be to use ray.wait([actor.ready.remote()], timeout=10); the idea is that if the actor doesn’t start within the timeout, it probably won’t start.
