[Data] How to limit the number of retries from system failures for dataset.map?

amtn · October 23, 2024, 9:53pm

How severe does this issue affect your experience of using Ray?

Medium: It contributes to significant difficulty to complete my task, but I can work around it.

I tried running dataset.map with sys.exit in it, and it looks like the tasks are retried forever. Is there a way to limit such retries? This is not a true system failure, so it would also work if these exits were treated as application failures that count towards the max retry limit.

Minimal example:

import sys
import ray

ray.data.from_items(range(10)).map(lambda x: sys.exit(0)).take_all()

This never ends.

The example is artificial, but a library may have a bug in non-Python code and kill the process without raising an exception. Ray shouldn’t keep retrying in that case.
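For the narrow case where the exit comes from Python itself (sys.exit raises SystemExit), one possible way to have it counted as an application failure is to convert it into an ordinary exception before it reaches the worker. The sketch below is only an illustration: exit_as_exception is a hypothetical helper name, not part of Ray, and it does nothing for crashes in native code, which never raise a Python exception.

```python
import functools
import sys


def exit_as_exception(fn):
    """Wrap fn so that a SystemExit raised inside it surfaces as RuntimeError.

    Hypothetical helper for illustration; it is not a Ray API.
    """

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except SystemExit as exc:
            # SystemExit derives from BaseException, so it must be caught explicitly.
            raise RuntimeError(
                f"{fn.__name__} called sys.exit({exc.code!r})"
            ) from exc

    return wrapper


@exit_as_exception
def bad_row(row):
    # Stand-in for user code that exits instead of raising.
    sys.exit(3)
```

The wrapped function can then be passed to dataset.map in place of the original, e.g. ds.map(exit_as_exception(my_fn)), where my_fn stands for your own function. I would expect the RuntimeError to be reported as a regular task error rather than a worker crash, but I have not verified this across Ray versions.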

amtn · October 23, 2024, 10:47pm

import sys
import ray

@ray.remote
def fail():
    sys.exit(0)

ray.get(fail.remote())

This raises ray.exceptions.WorkerCrashedError: The worker died unexpectedly while executing this task. Check python-core-worker-*.log files for more information. after a few retries. I would expect similar behavior from a Ray dataset.

amtn · October 31, 2024, 7:34pm

In case anyone stumbles upon this: I found a workaround. Wrap the function in a class so that it runs as an actor, and check num_restarts in the actor state. The following ends once the actor has been restarted more than 5 times. This was inspired by the implementation of was_current_actor_reconstructed.

import sys

import ray
import ray.util.state

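class Actor:
    def __init__(self):
        # Look up this actor's own state to see how often it has been restarted.
        runtime_context = ray.runtime_context.get_runtime_context()
        actors = ray.util.state.list_actors(detail=True, filters=[('actor_id', '=', runtime_context.actor_id.hex())])
        # Refuse to start again once the restart budget is used up.
        if actors and actors[0]['num_restarts'] > 5:
            raise ValueError('Too many restarts')

    def __call__(self, x):
        # Simulate a crash that kills the worker without raising an exception.
        sys.exit(1)
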

def main():
    ray.init()
    print(ray.data.from_items(range(10)).map(Actor, concurrency=1).take_all())

if __name__ == '__main__':
    main()

amtn · November 1, 2024, 3:24pm

Setting max_retries explicitly also stopped the infinite retries.

ray.data.from_items(range(10)).map(lambda x: sys.exit(1), max_retries=3).take_all()
