How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
The KubeRay operator sometimes ends up in a bad state where it never removes the workers the autoscaler has scheduled for deletion, so the autoscaler backs off indefinitely while waiting (note the ~50-minute gap between the monitor and operator timestamps below).
Logs that might be relevant:
monitor.log
2023-03-03 07:20:43,762 INFO node_provider.py:239 -- Listing pods for RayCluster ray in namespace default at pods resource version >= 4144232.
2023-03-03 07:20:43,816 INFO node_provider.py:257 -- Fetched pod data at resource version 4144232.
2023-03-03 07:20:43,820 INFO autoscaler.py:143 -- The autoscaler took 0.106 seconds to fetch the list of non-terminated nodes.
2023-03-03 07:20:43,820 WARNING node_provider.py:328 -- Waiting for operator to remove worker ray-worker-ray-n1-standard-8-ssd-0-gpu-1-9bf4k.
2023-03-03 07:20:43,820 INFO autoscaler.py:396 -- Backing off of autoscaler update. Will try again in 5 seconds.
2023-03-03 07:20:48,928 INFO node_provider.py:239 -- Listing pods for RayCluster ray in namespace default at pods resource version >= 4144304.
2023-03-03 07:20:48,983 INFO node_provider.py:257 -- Fetched pod data at resource version 4144304.
2023-03-03 07:20:48,989 INFO autoscaler.py:143 -- The autoscaler took 0.113 seconds to fetch the list of non-terminated nodes.
2023-03-03 07:20:48,989 WARNING node_provider.py:328 -- Waiting for operator to remove worker ray-worker-ray-n1-standard-8-ssd-0-gpu-1-9bf4k.
2023-03-03 07:20:48,990 INFO autoscaler.py:396 -- Backing off of autoscaler update. Will try again in 5 seconds.
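For context on the deletion flow as I understand it from these logs (hedged, not confirmed): the autoscaler records the pods it wants removed in the RayCluster's scaleStrategy.workersToDelete and then waits for the operator to actually delete them, which is what the "Waiting for operator to remove worker" warning above reflects. A minimal diagnostic sketch to check whether the stuck pod is still listed there, assuming the kubernetes Python client and that the RayCluster CRD is served at ray.io/v1alpha1:

```python
# Diagnostic sketch: list the workers each group is still waiting on.
# Assumes the cluster is named "ray" in the "default" namespace, as in
# the logs above, and the CRD version is ray.io/v1alpha1.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() from inside the cluster
api = client.CustomObjectsApi()

rc = api.get_namespaced_custom_object(
    group="ray.io", version="v1alpha1",
    namespace="default", plural="rayclusters", name="ray",
)
for group in rc["spec"]["workerGroupSpecs"]:
    pending = group.get("scaleStrategy", {}).get("workersToDelete") or []
    print(group["groupName"], "->", pending)
```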
operator logs
2023-03-03T08:11:49.945Z INFO controllers.RayCluster reconciling RayCluster {"cluster name": "ray"}
2023-03-03T08:11:49.945Z INFO controllers.RayCluster reconcileServices {"headService service found": "ray-head-svc"}
2023-03-03T08:11:49.946Z INFO controllers.RayCluster reconcilePods {"head pod found": "ray-head-bmgrb"}
2023-03-03T08:11:49.946Z INFO controllers.RayCluster reconcilePods {"head pod is up and running... checking workers": "ray-head-bmgrb"}
2023-03-03T08:11:49.946Z INFO controllers.RayCluster reconcilePods {"removing the pods in the scaleStrategy of": "ray-n1-standard-8-ssd-0-gpu-0"}
2023-03-03T08:11:49.946Z INFO controllers.RayCluster reconcilePods {"all workers already exist for group": "ray-n1-standard-8-ssd-0-gpu-0"}
2023-03-03T08:11:49.947Z INFO controllers.RayCluster reconcilePods {"removing the pods in the scaleStrategy of": "ray-n1-standard-8-ssd-1-gpu-0"}
2023-03-03T08:11:49.947Z INFO controllers.RayCluster reconcilePods {"all workers already exist for group": "ray-n1-standard-8-ssd-1-gpu-0"}
2023-03-03T08:11:49.947Z INFO controllers.RayCluster reconcilePods {"removing the pods in the scaleStrategy of": "ray-n1-highmem-8-ssd-1-gpu-0"}
2023-03-03T08:11:49.947Z INFO controllers.RayCluster reconcilePods {"all workers already exist for group": "ray-n1-highmem-8-ssd-1-gpu-0"}
2023-03-03T08:11:49.948Z INFO controllers.RayCluster reconcilePods {"removing the pods in the scaleStrategy of": "ray-e2-highmem-16-ssd-0-gpu-0"}
2023-03-03T08:11:49.948Z INFO controllers.RayCluster reconcilePods {"all workers already exist for group": "ray-e2-highmem-16-ssd-0-gpu-0"}
2023-03-03T08:11:49.948Z INFO controllers.RayCluster reconcilePods {"removing the pods in the scaleStrategy of": "ray-n2-highmem-32-ssd-0-gpu-0"}
2023-03-03T08:11:49.948Z INFO controllers.RayCluster reconcilePods {"all workers already exist for group": "ray-n2-highmem-32-ssd-0-gpu-0"}
2023-03-03T08:11:49.948Z INFO controllers.RayCluster reconcilePods {"removing the pods in the scaleStrategy of": "ray-n1-standard-8-ssd-0-gpu-1"}
2023-03-03T08:11:49.948Z INFO controllers.RayCluster reconcilePods {"all workers already exist for group": "ray-n1-standard-8-ssd-0-gpu-1"}
2023-03-03T08:11:49.949Z INFO controllers.RayCluster reconcilePods {"removing the pods in the scaleStrategy of": "ray-n1-standard-8-ssd-1-gpu-1"}
2023-03-03T08:11:49.949Z INFO controllers.RayCluster reconcilePods {"all workers already exist for group": "ray-n1-standard-8-ssd-1-gpu-1"}
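A possible manual workaround (untested sketch, same assumptions as the diagnostic above) would be to clear the stale workersToDelete entries so the autoscaler can make progress again. This should only be done for pod names that no longer exist in the cluster, since it mutates the live RayCluster spec:

```python
# Untested workaround sketch: drop stale workersToDelete entries across all
# worker groups. Verify with the diagnostic sketch first, and only clear
# names of pods that are already gone (e.g. the -9bf4k worker above).
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

rc = api.get_namespaced_custom_object(
    group="ray.io", version="v1alpha1",
    namespace="default", plural="rayclusters", name="ray",
)
for group in rc["spec"]["workerGroupSpecs"]:
    group.setdefault("scaleStrategy", {})["workersToDelete"] = []

api.replace_namespaced_custom_object(
    group="ray.io", version="v1alpha1",
    namespace="default", plural="rayclusters", name="ray", body=rc,
)
```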
I have more logs from the Ray session, but I'm not sure which parts would be useful.