Ray cluster raylet is down but the worker doesn't come back up

How severe does this issue affect your experience of using Ray?

  • None: Just asking a question out of curiosity

We launch our Ray cluster on a cluster, and we have recently seen an issue where some workers fail to join the Ray cluster because the raylet running on them went down for some reason, leaving them as idle workers. My question is: in this case, can the autoscaler/KubeRay restart those workers or launch new ones to replace them? Also, can the autoscaler/KubeRay terminate those idle workers to release their resources?

We use Ray 1.12.0.


Yes, the autoscaler and KubeRay will try to replace unhealthy nodes with healthy ones.
In addition, the autoscaler can scale down idle workers.
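
If it helps to confirm which nodes Ray itself considers dead before relying on the autoscaler, here is a minimal sketch built on `ray.nodes()`, which returns one dict per node with fields such as `NodeID`, `Alive`, and `NodeManagerAddress`. The helper names `dead_raylets` and `check_cluster` are made up for illustration, and the sketch assumes it runs on a machine that can reach the running cluster via `ray.init(address="auto")`.

```python
from typing import Any, Dict, List, Tuple


def dead_raylets(nodes: List[Dict[str, Any]]) -> List[Tuple[str, str]]:
    """Return (NodeID, NodeManagerAddress) for every node not marked alive.

    `nodes` is expected to look like the output of ray.nodes().
    A node with no "Alive" field is treated as dead here (an assumption
    of this sketch, chosen to err on the side of reporting it).
    """
    return [
        (n.get("NodeID", "?"), n.get("NodeManagerAddress", "?"))
        for n in nodes
        if not n.get("Alive", False)
    ]


def check_cluster() -> None:
    """Connect to the running cluster and print nodes reported as dead."""
    import ray  # imported lazily so dead_raylets() can be used without Ray

    ray.init(address="auto")  # attach to the existing cluster
    for node_id, address in dead_raylets(ray.nodes()):
        print(f"dead raylet: node {node_id} at {address}")
```

Calling `check_cluster()` from a driver on the head node prints one line per node whose raylet is no longer reporting as alive. It only reports state; replacing or terminating those workers is still left to the autoscaler/KubeRay.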