How to prevent Ray from retrying an actor task while the actor is restarting?

How severely does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

I have a use case where I create many actors per worker node and submit actor tasks to them from the driver script. I run these actors with max_restarts = -1 (infinite restarts) and max_task_retries = 3 (3 task retries). I noticed that if a task fails because its actor died, Ray retries the task while the actor is still restarting. The retry then fails again, and this repeats until the task retry counter is depleted. My question is:

Am I correct in saying that Ray may retry a failed task while an actor is unreachable? And if so, how can I prevent the task from being retried on an actor while the actor is restarting?

Hmm, if that is the case, it seems like a bug, not intended behavior.

Can you create an issue for this? We can try fixing it sooner or later. The best things you can try for now are:

  1. Increase the retry count (set max_task_retries higher than 3).
  2. Do not use max_task_retries, and retry in your application layer instead (i.e., catch the exception from ray.get and retry).
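
For option 2, here is a rough sketch of what an application-layer retry could look like. The names get_with_retry, demo_with_ray and Worker are made up for illustration, and the backoff values are arbitrary. It assumes the failure surfaces from ray.get as ray.exceptions.RayActorError, which may depend on your Ray version, so check which exception you actually see before relying on it.

```python
import time


def get_with_retry(submit_and_get, retry_on, max_attempts=5,
                   base_delay_s=1.0, sleep=time.sleep):
    """Call submit_and_get() until it succeeds or max_attempts is reached.

    submit_and_get: zero-argument callable that submits the actor task and
        blocks on its result, e.g. lambda: ray.get(actor.work.remote(x)).
    retry_on: exception class (or tuple of classes) that triggers a retry.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return submit_and_get()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Exponential backoff gives the actor time to finish restarting.
            sleep(base_delay_s * 2 ** (attempt - 1))


def demo_with_ray():
    """Hypothetical usage; needs Ray installed and is not called automatically."""
    import ray
    from ray.exceptions import RayActorError

    # max_task_retries=0 turns off Ray's built-in task retries, so only the
    # application-level loop above resubmits the task.
    @ray.remote(max_restarts=-1, max_task_retries=0)
    class Worker:  # hypothetical actor, stands in for your own class
        def work(self, x):
            return x * 2

    ray.init()
    worker = Worker.remote()
    result = get_with_retry(
        lambda: ray.get(worker.work.remote(21)),
        retry_on=RayActorError,
    )
    print(result)
```

Each attempt resubmits the task, so this is only safe if the task is idempotent or you can tolerate it running more than once. Call demo_with_ray() on a machine with Ray installed to try it against a real actor.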