How severely does this issue affect your experience of using Ray?
- Medium: It contributes to significant difficulty in completing my task, but I can work around it.
I have a use case where I create many actors per worker node and submit actor tasks to them from the driver script. The actors run with max_restarts = -1 (infinite restarts) and max_task_retries = 3 (3 task retries). I noticed that when a task fails because its actor died, Ray retries the task while the actor is still restarting. The retry then fails as well, and this repeats until the task retry counter is exhausted. My question is:
Am I correct in saying that Ray may retry a failed task while the actor is unreachable? If so, how can I prevent the task from being retried on an actor while that actor is restarting?