How severe does this issue affect your experience of using Ray?
- High: It blocks me to complete my task.
Hello everyone. My team and I have been happily using Ray and Ray Tune for hyperparameter search, with the cluster set up on AWS (using K8s), for the past year. In our current infrastructure, we deploy a ray-operator pod on AWS, which sets up the head and scales up to 5 workers. Everything was going well until the 10th of March, when suddenly we could no longer run any experiments. After 3 days of digging, we figured out that the problem was related to the Ray cluster deployment. Something was wrong with the ray-operator: although the operator was running, we could never reach a healthy status (FYI: we were using ray:latest as the operator image). We then changed the ray-operator image to ray:1.10.0-py37, and we were finally able to get a healthy status and launch jobs on the cluster.
I would like to ask the community for help in understanding what happened. Does anyone have any idea what changed between the previous and latest releases that would break a setup we had been using for a year? And how could we fix it, other than by pinning our ray-operator image to an older version?