Autoscaler SDK request_resources fails on EKS

Hi there,

I’m currently working on switching our K8s cluster to the Ray Autoscaler and wanted to try out request_resources from ray.autoscaler.sdk. However, the operator node:

  • does not spin up new workers
  • does not spin a worker node up again if it fails, so the cluster stays below the min_worker number specified in the operator YAML
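
For context, the kind of call in question might look like the sketch below. It assumes the script runs somewhere that can reach the running Ray cluster (for example on the head pod), so that `ray.init(address="auto")` can connect. `build_bundles` and `request_capacity` are hypothetical helper names, and the worker counts are made up for illustration.

```python
from typing import Dict, List


def build_bundles(num_workers: int, cpus_per_worker: int) -> List[Dict[str, int]]:
    """Build one resource bundle per desired worker (hypothetical helper)."""
    return [{"CPU": cpus_per_worker} for _ in range(num_workers)]


def request_capacity(num_workers: int, cpus_per_worker: int) -> None:
    """Ask the autoscaler to scale so that all bundles fit (hypothetical helper)."""
    # Imported lazily so this file can be loaded without Ray installed.
    import ray
    from ray.autoscaler.sdk import request_resources

    # Assumption: we run on a node of an already running cluster.
    ray.init(address="auto")
    request_resources(bundles=build_bundles(num_workers, cpus_per_worker))


if __name__ == "__main__":
    # Example values only: two workers with four CPUs each.
    request_capacity(num_workers=2, cpus_per_worker=4)
```

Stating where a script like this runs (head pod, operator pod, or outside the cluster) helps when reporting the problem.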

cc @Dmitri @Ameer_Haj_Ali Can you guys take a look at this?

Hi!

Could you please share more about how you are deploying, the configs you’re using, which version of Ray you’re using, where request_resources is being run from, and any other potentially helpful details that come to mind?

I suspect one of the resources is 0, which causes a division by zero when the autoscaler calculates utilization.
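
To illustrate that suspicion, here is a minimal sketch of how a utilization calculation could hit a division by zero. This is not the actual Ray autoscaler code; `utilization` and `safe_utilization` are hypothetical functions written only to show the failure mode and one possible guard.

```python
from typing import Dict


def utilization(used: Dict[str, float], total: Dict[str, float]) -> Dict[str, float]:
    """Naive version: raises ZeroDivisionError if any total is 0."""
    return {name: used.get(name, 0.0) / total[name] for name in total}


def safe_utilization(used: Dict[str, float], total: Dict[str, float]) -> Dict[str, float]:
    """Guarded version: skips resources whose total capacity is not positive."""
    result = {}
    for name, capacity in total.items():
        if capacity <= 0:
            continue  # nothing to divide by, so leave this resource out
        result[name] = used.get(name, 0.0) / capacity
    return result


if __name__ == "__main__":
    used = {"CPU": 1.0, "GPU": 0.0}
    total = {"CPU": 4.0, "GPU": 0.0}
    print(safe_utilization(used, total))
    try:
        utilization(used, total)
    except ZeroDivisionError as err:
        print("naive version failed:", err)
```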

One other user reported something similar – summed CPU was being set to 0 internally for some reason.
I was never able to reproduce the problem though.
@rico-ci Could you file an issue on the Ray GitHub with reproduction info?

Absolutely!

We’re deploying with Helm. The configs are just adapted to Helm charts from here. We’re using Ray 1.2.0 on a custom image for the worker and head nodes; for the operator node we use the rayproject/ray:1.2.0 image.

From what I’ve gathered so far, this is still very much under development? Still, a more helpful error message would be nice to have, as I really don’t know what the variable v in that error log is :sweat_smile:.

Hi Dmitri,

Yeah, I’ll gladly do so and link it here. Thanks a lot for your time, guys!


@rico-ci, this should actually work well. I am not sure what the issue is, but I believe @Dmitri will figure it out. If you file an issue on the Ray GitHub repo, we will hopefully be able to reproduce it and add a test so this does not happen again.
