Error setting up operator after 1.11.0 release

catica · March 25, 2022, 8:29am

How severely does this issue affect your experience of using Ray?

High: It blocks me from completing my task.

Hello everyone. My team and I have been successfully and happily using Ray and Ray Tune for hyperparameter search and cluster setup on AWS (using K8s) for the past year. In our current infrastructure, we deploy a ray-operator pod on AWS, which sets up the head and scales up to 5 workers. Life was going well until the 10th of March! Suddenly, we could not run any more experiments. After 3 days of digging, we figured out that the problem was related to the Ray cluster deployment. Something was wrong with the ray-operator: although the operator was running, we could never reach a healthy status (FYI: we were using ray:latest as the operator image). We then changed the ray-operator image to ray:1.10.0-py37 and were finally able to get a healthy status and launch jobs on it.
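
For anyone hitting the same thing, the workaround boils down to replacing a floating tag with a fixed one. Here is a minimal sketch of that idea in Python. The helper names `is_pinned` and `pin_image` are made up for illustration and are not part of Ray or Kubernetes; the image strings are the ones mentioned above.

```python
def is_pinned(image: str) -> bool:
    """Return True if the image reference uses a digest or an explicit non-"latest" tag."""
    if "@" in image:
        # A digest reference (name@sha256:...) always points at one exact image.
        return True
    # Only look at the last path component, so a registry port
    # (e.g. "localhost:5000/ray") is not mistaken for a tag.
    name = image.rsplit("/", 1)[-1]
    if ":" not in name:
        # No tag at all: container runtimes default to "latest".
        return False
    tag = name.split(":", 1)[1]
    return tag != "latest"


def pin_image(image: str, tag: str) -> str:
    """Return the same image reference with its tag (or digest) replaced by `tag`."""
    repo_path, _, last = image.rpartition("/")
    # Drop any existing digest or tag from the last path component.
    name = last.split("@", 1)[0].split(":", 1)[0]
    prefix = repo_path + "/" if repo_path else ""
    return f"{prefix}{name}:{tag}"


if __name__ == "__main__":
    operator_image = "ray:latest"  # what we were using when things broke
    if not is_pinned(operator_image):
        operator_image = pin_image(operator_image, "1.10.0-py37")
    print(operator_image)
```

A check like this could run in CI against the operator manifest so that a floating tag does not silently pull in a new release, though it does not explain what actually changed between releases.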

I would like to ask the community for help in understanding what happened. Does anyone have any idea what changed so much between the previous and latest releases that it broke a system we had been using for a year? And how could we fix it, other than by pinning our ray-operator image to an older version?
