I’m currently working on switching our K8s cluster to the Ray Autoscaler and wanted to try out request_resources from ray.autoscaler.sdk. However:

- the operator does not spin up new workers
- if a worker node fails, it is not spun up again, and the cluster stays below the min_workers count specified in the operator YAML
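For context, this is roughly how we’re calling it from a driver connected to the cluster (a minimal sketch; the resource amounts here are placeholders, not our real values):

```python
import ray
from ray.autoscaler.sdk import request_resources

# Connect to the existing cluster (run from a pod inside the cluster).
ray.init(address="auto")

# Ask the autoscaler to scale up to at least 8 CPUs of worker capacity...
request_resources(num_cpus=8)

# ...or request specific resource bundles, e.g. two 1-CPU workers.
request_resources(bundles=[{"CPU": 1}, {"CPU": 1}])
```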
Could you please share more about how you are deploying, the configs you’re using, which version of Ray you’re using, where request_resources is being run from, and any other potentially helpful details that come to mind?
One other user reported something similar – summed CPU was being set to 0 internally for some reason.
I was never able to reproduce the problem, though. @rico-ci, could you file an issue on the Ray GitHub with reproduction info?
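If you do, including the output of something like the following (a rough diagnostic sketch, run from a driver in the cluster) would help confirm whether the resource totals are being computed correctly:

```python
import ray

ray.init(address="auto")

# Total resources the cluster currently advertises...
print("cluster_resources:", ray.cluster_resources())
# ...and what is currently unused.
print("available_resources:", ray.available_resources())
```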
We’re deploying with Helm; the configs are just adapted into Helm charts from here. We’re using Ray 1.2.0 with a custom image for the worker and head nodes; for the operator node we use the rayproject/ray:1.2.0 image.
From what I’ve gathered so far, I believe this is still very much under development? In any case, a more helpful error message would be nice to have, as I really don’t know what the variable v in that error log is.
@rico-ci This should actually work well. I am not sure what the issue is, but I believe @Dmitri will figure it out. If you file an issue on the Ray GitHub repo, we will hopefully be able to reproduce it and add a test so this doesn’t happen again.