In the course of my current k8s autoscaler work, I found that my operator node never fulfills the resource demands specified in my ray.remote decorator. I also tried this:
Where Custom is the custom resource I have defined in deployment_ray_autoscaler.yaml. I have also requested CPU, num_cpus, and worker-node to see whether any of these work (separately and together). I also tried request_resources(bundles=[{"worker-nodes": 5}, {"GeoDataBuilder": 5}, {"CPU": 5}]). The operator logs show no failure, but I get this log:
In my deployment_ray_autoscaler.yaml I have specified rayResources: {"GeodataBuildWorker": 1, "is_spot": 0} under the worker-nodes spec (the YAML template I used is from here).
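For context, the Python side of what I'm attempting looks roughly like this (a simplified sketch rather than my exact code; the resource name matches the rayResources entry above):

```python
import ray
from ray.autoscaler.sdk import request_resources

ray.init(address="auto")

# Ask the autoscaler to scale up nodes that provide the custom resource
# defined under rayResources in deployment_ray_autoscaler.yaml.
request_resources(bundles=[{"GeodataBuildWorker": 5}])

# A task that should only be scheduled on a node advertising that resource.
@ray.remote(resources={"GeodataBuildWorker": 1})
def build_geodata(chunk):
    return chunk  # placeholder for the actual work

futures = [build_geodata.remote(i) for i in range(5)]
print(ray.get(futures))
```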
I’d recommend using nightly versions of Ray everywhere, if you’re not already: a freshly pulled rayproject/ray:nightly in the operator pod if you’re using the Operator, a local nightly build of Ray if you’re using the Ray Cluster Launcher, and Ray nodes running at least Ray 1.2.0.
So nightly won’t work for us because we need a custom image for the head/worker nodes. Also, because this will eventually end up in production, I would really rather have a fixed version number (last month we lost a few days' worth of work because another framework introduced a bug in their newest version). I held off on the autoscaler implementation until 1.2.0 was released, in the hope that I could get the autoscaler working with that.
My issue, however, could also be related to some misconfiguration of the Custom resource. Do I have to add anything to the rayResources definition in my Helm chart, or do I have to do something programmatically for it to work?
It’s not ideal, but for reproducibility you can specify an image pinned to a particular commit using the first six characters of its hash, for example rayproject/ray:4846a6.
The Ray version and commit can be determined by checking ray.__version__ and ray.__commit__ in a Python shell.
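For example, a quick way to find the matching image tag (a minimal sketch; the output depends on the Ray build you have installed):

```python
import ray

# Installed Ray version and the commit it was built from.
print(ray.__version__)   # e.g. "1.2.0"
print(ray.__commit__)    # full commit SHA

# The first six characters of the commit give the corresponding image tag.
print("rayproject/ray:" + ray.__commit__[:6])  # e.g. "rayproject/ray:4846a6"
```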