How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
The KubeRay operator sometimes ends up in a bad state where it never removes the workers the autoscaler has scheduled for deletion, so the autoscaler backs off indefinitely while waiting (note the ~50-minute gap between the monitor and operator timestamps below).
Logs that might be relevant:
monitor.log
2023-03-03 07:20:43,762 INFO node_provider.py:239 -- Listing pods for RayCluster ray in namespace default at pods resource version >= 4144232.
2023-03-03 07:20:43,816 INFO node_provider.py:257 -- Fetched pod data at resource version 4144232.
2023-03-03 07:20:43,820 INFO autoscaler.py:143 -- The autoscaler took 0.106 seconds to fetch the list of non-terminated nodes.
2023-03-03 07:20:43,820 WARNING node_provider.py:328 -- Waiting for operator to remove worker ray-worker-ray-n1-standard-8-ssd-0-gpu-1-9bf4k.
2023-03-03 07:20:43,820 INFO autoscaler.py:396 -- Backing off of autoscaler update. Will try again in 5 seconds.
2023-03-03 07:20:48,928 INFO node_provider.py:239 -- Listing pods for RayCluster ray in namespace default at pods resource version >= 4144304.
2023-03-03 07:20:48,983 INFO node_provider.py:257 -- Fetched pod data at resource version 4144304.
2023-03-03 07:20:48,989 INFO autoscaler.py:143 -- The autoscaler took 0.113 seconds to fetch the list of non-terminated nodes.
2023-03-03 07:20:48,989 WARNING node_provider.py:328 -- Waiting for operator to remove worker ray-worker-ray-n1-standard-8-ssd-0-gpu-1-9bf4k.
2023-03-03 07:20:48,990 INFO autoscaler.py:396 -- Backing off of autoscaler update. Will try again in 5 seconds.
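For context on the deletion flow as I understand it from these logs (hedged, not confirmed): the autoscaler records the pods it wants removed in the RayCluster's scaleStrategy.workersToDelete and then waits for the operator to actually delete them, which is what the "Waiting for operator to remove worker" warning above reflects. A minimal diagnostic sketch to check whether the stuck pod is still listed there, assuming the kubernetes Python client and that the RayCluster CRD is served at ray.io/v1alpha1:

```python
# Diagnostic sketch: list the workers each group is still waiting on.
# Assumes the cluster is named "ray" in the "default" namespace, as in
# the logs above, and the CRD version is ray.io/v1alpha1.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() from inside the cluster
api = client.CustomObjectsApi()

rc = api.get_namespaced_custom_object(
    group="ray.io", version="v1alpha1",
    namespace="default", plural="rayclusters", name="ray",
)
for group in rc["spec"]["workerGroupSpecs"]:
    pending = group.get("scaleStrategy", {}).get("workersToDelete") or []
    print(group["groupName"], "->", pending)
```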
operator logs
2023-03-03T08:11:49.945Z INFO controllers.RayCluster reconciling RayCluster {"cluster name": "ray"}
2023-03-03T08:11:49.945Z INFO controllers.RayCluster reconcileServices {"headService service found": "ray-head-svc"}
2023-03-03T08:11:49.946Z INFO controllers.RayCluster reconcilePods {"head pod found": "ray-head-bmgrb"}
2023-03-03T08:11:49.946Z INFO controllers.RayCluster reconcilePods {"head pod is up and running... checking workers": "ray-head-bmgrb"}
2023-03-03T08:11:49.946Z INFO controllers.RayCluster reconcilePods {"removing the pods in the scaleStrategy of": "ray-n1-standard-8-ssd-0-gpu-0"}
2023-03-03T08:11:49.946Z INFO controllers.RayCluster reconcilePods {"all workers already exist for group": "ray-n1-standard-8-ssd-0-gpu-0"}
2023-03-03T08:11:49.947Z INFO controllers.RayCluster reconcilePods {"removing the pods in the scaleStrategy of": "ray-n1-standard-8-ssd-1-gpu-0"}
2023-03-03T08:11:49.947Z INFO controllers.RayCluster reconcilePods {"all workers already exist for group": "ray-n1-standard-8-ssd-1-gpu-0"}
2023-03-03T08:11:49.947Z INFO controllers.RayCluster reconcilePods {"removing the pods in the scaleStrategy of": "ray-n1-highmem-8-ssd-1-gpu-0"}
2023-03-03T08:11:49.947Z INFO controllers.RayCluster reconcilePods {"all workers already exist for group": "ray-n1-highmem-8-ssd-1-gpu-0"}
2023-03-03T08:11:49.948Z INFO controllers.RayCluster reconcilePods {"removing the pods in the scaleStrategy of": "ray-e2-highmem-16-ssd-0-gpu-0"}
2023-03-03T08:11:49.948Z INFO controllers.RayCluster reconcilePods {"all workers already exist for group": "ray-e2-highmem-16-ssd-0-gpu-0"}
2023-03-03T08:11:49.948Z INFO controllers.RayCluster reconcilePods {"removing the pods in the scaleStrategy of": "ray-n2-highmem-32-ssd-0-gpu-0"}
2023-03-03T08:11:49.948Z INFO controllers.RayCluster reconcilePods {"all workers already exist for group": "ray-n2-highmem-32-ssd-0-gpu-0"}
2023-03-03T08:11:49.948Z INFO controllers.RayCluster reconcilePods {"removing the pods in the scaleStrategy of": "ray-n1-standard-8-ssd-0-gpu-1"}
2023-03-03T08:11:49.948Z INFO controllers.RayCluster reconcilePods {"all workers already exist for group": "ray-n1-standard-8-ssd-0-gpu-1"}
2023-03-03T08:11:49.949Z INFO controllers.RayCluster reconcilePods {"removing the pods in the scaleStrategy of": "ray-n1-standard-8-ssd-1-gpu-1"}
2023-03-03T08:11:49.949Z INFO controllers.RayCluster reconcilePods {"all workers already exist for group": "ray-n1-standard-8-ssd-1-gpu-1"}
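A possible manual workaround (untested sketch, same assumptions as the diagnostic above) would be to clear the stale workersToDelete entries so the autoscaler can make progress again. This should only be done for pod names that no longer exist in the cluster, since it mutates the live RayCluster spec:

```python
# Untested workaround sketch: drop stale workersToDelete entries across all
# worker groups. Verify with the diagnostic sketch first, and only clear
# names of pods that are already gone (e.g. the -9bf4k worker above).
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

rc = api.get_namespaced_custom_object(
    group="ray.io", version="v1alpha1",
    namespace="default", plural="rayclusters", name="ray",
)
for group in rc["spec"]["workerGroupSpecs"]:
    group.setdefault("scaleStrategy", {})["workersToDelete"] = []

api.replace_namespaced_custom_object(
    group="ray.io", version="v1alpha1",
    namespace="default", plural="rayclusters", name="ray", body=rc,
)
```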
I have more logs from the Ray session, but I'm not sure which parts would be useful.