How to prevent Ray from retrying an actor task while the actor is restarting?

cemk · October 30, 2023, 3:20pm

How severely does this issue affect your experience of using Ray?

Medium: It makes completing my task significantly more difficult, but I can work around it.

I have a use case where I create many actors per worker node and submit actor tasks to them from the driver script. The actors run with max_restarts = -1 (infinite restarts) and max_task_retries = 3 (three task retries). I noticed that when a task fails because its actor died, Ray retries the task while the actor is still restarting. The retry then fails again, and this repeats until the task retry counter is exhausted. My questions are:

Am I correct that Ray may retry a failed task while the actor is unreachable? If so, how can I prevent a task from being retried on an actor while that actor is restarting?

sangcho · October 31, 2023, 1:13am

Hmm, if that is the case, it looks like a bug rather than intended behavior.

Can you create an issue for this? We can try to fix it. For now, the best options are:

Increase the retry count (set max_task_retries higher than 3).
Do not use max_task_retries, and retry in your application layer instead (i.e., catch the exception from ray.get and retry).
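
For the second suggestion, here is a minimal sketch of what an application-layer retry could look like. It is not taken from Ray itself: the helper name call_with_retry, its parameters, and the backoff values are all assumptions chosen for illustration. The helper is kept independent of Ray so it can be tried without a cluster; the comment at the end shows how it might be wired to ray.get, assuming a hypothetical actor method named process and an actor created with max_task_retries left at 0 so that Ray does not retry on its own.

```python
import time
from typing import Any, Callable, Tuple, Type


def call_with_retry(
    submit: Callable[[], Any],
    fetch: Callable[[Any], Any],
    retry_on: Tuple[Type[BaseException], ...],
    max_attempts: int = 5,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Submit a call and fetch its result, retrying with exponential backoff.

    Hypothetical helper (not part of Ray). `submit` starts the work and
    returns a handle; `fetch` blocks on that handle and returns the result.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            # Resubmit on every attempt so a fresh task is sent to the actor.
            return fetch(submit())
        except retry_on:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error to the caller.
            # Wait 1x, 2x, 4x, ... the base delay to give the actor
            # time to finish restarting before the next attempt.
            sleep(base_delay_s * 2 ** (attempt - 1))


# Possible use with Ray (assumes `actor` has a remote method `process`,
# which is a made-up name for this example):
#
#   import ray
#   from ray.exceptions import RayActorError
#
#   result = call_with_retry(
#       submit=lambda: actor.process.remote(item),
#       fetch=ray.get,
#       retry_on=(RayActorError,),
#   )
```

The backoff is the part that matters here: it spaces the attempts out so they are less likely to all land while the actor is still restarting. The right delay and attempt count depend on how long your actors take to come back, so treat the defaults as placeholders.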
