I’m currently working on switching our K8s cluster to the Ray Autoscaler and wanted to try out request_resources from ray.autoscaler.sdk. However:

- the operator does not spin up new workers
- if a worker node fails, it is not spun up again, and the cluster stays below the min_workers count specified in the operator YAML
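For context, this is roughly how we’re calling it from a driver connected to the cluster (a minimal sketch; the resource amounts here are placeholders, not our real values):

```python
import ray
from ray.autoscaler.sdk import request_resources

# Connect to the existing cluster (run from a pod inside the cluster).
ray.init(address="auto")

# Ask the autoscaler to scale up to at least 8 CPUs of worker capacity...
request_resources(num_cpus=8)

# ...or request specific resource bundles, e.g. two 1-CPU workers.
request_resources(bundles=[{"CPU": 1}, {"CPU": 1}])
```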
Could you please share more about how you are deploying, the configs you’re using, which version of Ray you’re using, where request_resources is being run from, and any other potentially helpful details that come to mind?
One other user reported something similar – summed CPU was being set to 0 internally for some reason.
I was never able to reproduce the problem, though. @rico-ci, could you file an issue on the Ray GitHub with reproduction info?
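If you do, including the output of something like the following (a rough diagnostic sketch, run from a driver in the cluster) would help confirm whether the resource totals are being computed correctly:

```python
import ray

ray.init(address="auto")

# Total resources the cluster currently advertises...
print("cluster_resources:", ray.cluster_resources())
# ...and what is currently unused.
print("available_resources:", ray.available_resources())
```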
We’re deploying with Helm; the configs are just adapted into Helm charts from here. We’re using Ray 1.2.0 with a custom image for the worker and head nodes; for the operator node we use the rayproject/ray:1.2.0 image.
From what I’ve gathered so far, I believe this is still very much under development? In any case, a more helpful error message would be nice to have, as I really don’t know what the variable v in that error log is.
@rico-ci This should actually work well. I am not sure what the issue is, but I believe @Dmitri will figure it out. If you file an issue on the Ray GitHub repo, we will hopefully be able to reproduce it and add a test so this doesn’t happen again.