I launch a cluster with the autoscaler and, at some point, use Helm to change max_workers, which causes the cluster to downscale. We observe that the nodes with the slowest heartbeat get killed. Such nodes are often busy with computations, which is why they are slow to respond.
- Is there a way to change this policy?
- When a cluster is downscaled, can we search for idle nodes and kill those instead of killing Ray nodes at random?
- When a cluster is downscaled, do we keep accounting information about which Ray nodes are running actors versus tasks?
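
To make the second and third questions concrete, here is a rough sketch of the kind of selection policy I have in mind. It is plain Python and does not use any Ray or autoscaler API: `NodeInfo`, its fields, and `pick_nodes_to_terminate` are hypothetical names, and the per-node actor/task counts are assumed to be available from some accounting source. The preference for removing task-only nodes before nodes hosting actors is also just my assumption (tasks can usually be retried, actors hold state).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeInfo:
    # Hypothetical per-node accounting record; not a Ray data structure.
    node_id: str
    num_actors: int  # actors currently hosted on the node
    num_tasks: int   # tasks currently running on the node


def pick_nodes_to_terminate(nodes: List[NodeInfo], num_to_remove: int) -> List[str]:
    """Return the IDs of nodes to remove, preferring idle nodes."""

    def rank(node: NodeInfo):
        busy = (node.num_actors + node.num_tasks) > 0
        # Idle nodes sort first; among busy nodes, those with fewer actors
        # and then fewer tasks sort first. node_id makes the order stable.
        return (busy, node.num_actors, node.num_tasks, node.node_id)

    ranked = sorted(nodes, key=rank)
    return [node.node_id for node in ranked[: max(0, num_to_remove)]]


if __name__ == "__main__":
    cluster = [
        NodeInfo("a", num_actors=2, num_tasks=0),
        NodeInfo("b", num_actors=0, num_tasks=0),
        NodeInfo("c", num_actors=0, num_tasks=3),
        NodeInfo("d", num_actors=0, num_tasks=0),
    ]
    # Idle nodes "b" and "d" are chosen before any busy node.
    print(pick_nodes_to_terminate(cluster, 3))
```

With a policy like this, a busy node would only be killed once no idle nodes are left, regardless of how slow its heartbeat is.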