Is there any way to detect when the creation of an actor fails due to not having the required resources?
I understand about using a ready() method to test for a successful actor creation; however, if there’s not enough memory then Ray appears to hang.
Is there any way to trap this and recover gracefully? In our experience with Ray on K8s, this creates an unrecoverable condition which requires a restart of Ray.
Consider the following example code:
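import ray

ray.init()


@ray.remote
class FailingActor:
    def __init__(self, fail=False):
        # Simulate a constructor failure when requested.
        if fail:
            raise Exception("dead")

    def ready(self):
        return


# Request 1000 ** 4 bytes of memory, more than the system can provide.
h = FailingActor.options(
    memory=1000 ** 4,
).remote(fail=False)

try:
    ray.get(h.ready.remote())
except ray.exceptions.RayActorError as e:
    print(e)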
Suppose that the memory setting for memory-aware scheduling is larger than the available system memory.
Ray will give a warning, such as:
WARNING worker.py:1227 -- The actor or task with ID ffffffffffffffff050fc0bc7c0a14cc23c5f8f201000000 cannot be scheduled right now. It requires {memory: 48828125000.000000 GiB}, {CPU: 1.000000} for placement, however the cluster currently cannot provide the requested resources. The required resources may be added as autoscaling takes place or placement groups are scheduled. Otherwise, consider reducing the resource requirements of the task.
Then the process appears to hang. A keyboard interrupt shows:
Traceback (most recent call last):
  File "x.py", line 30, in <module>
    ray.get(h.ready.remote())
  File "/Users/paco/src/ffurf/venv/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 89, in wrapper
    return func(*args, **kwargs)
  File "/Users/paco/src/ffurf/venv/lib/python3.7/site-packages/ray/worker.py", line 1615, in get
    object_refs, timeout=timeout)
  File "/Users/paco/src/ffurf/venv/lib/python3.7/site-packages/ray/worker.py", line 348, in get_objects
    object_refs, self.current_task_id, timeout_ms)
  File "python/ray/_raylet.pyx", line 1076, in ray._raylet.CoreWorker.get_objects
  File "python/ray/_raylet.pyx", line 156, in ray._raylet.check_status
KeyboardInterrupt
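For reference, the traceback shows that ray.get() accepts a timeout argument, so one thing we could try is bounding the wait instead of blocking forever. Below is a minimal sketch of that idea, not a confirmed fix. The helper name wait_for_actor and its ray_module parameter are hypothetical (the module is passed in only so the helper can be exercised without a cluster). It assumes ray.get(..., timeout=...) raises ray.exceptions.GetTimeoutError when the actor is still pending, and that ray.kill() on the handle withdraws the pending request. Whether that avoids the stuck state we see on K8s is untested.

```python
def wait_for_actor(ray_module, handle, timeout_s=10.0):
    """Return True if the actor answers ready() within timeout_s seconds.

    ray_module is the imported `ray` package (hypothetical parameter,
    passed explicitly so a stand-in can be used when testing).
    """
    try:
        # Block for at most timeout_s seconds instead of indefinitely.
        ray_module.get(handle.ready.remote(), timeout=timeout_s)
        return True
    except ray_module.exceptions.GetTimeoutError:
        # The actor never became ready, e.g. because its resource request
        # cannot be satisfied. Assumption: killing the handle withdraws
        # the pending actor so it does not linger in the scheduler queue.
        ray_module.kill(handle)
        return False
    except ray_module.exceptions.RayActorError:
        # The actor was scheduled but its constructor raised.
        return False
```

With the example above, the call would be wait_for_actor(ray, h, timeout_s=10.0), and a False result would be the signal to retry with a smaller memory request or report the failure.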