Autoscaler doesn't scale workers on K8s

Hi there,

While working on my k8s autoscaler setup, I also found that the operator would never fulfill the demands specified in my ray.remote decorator. Among other things, I tried this:

import platform
import time
import ray

@ray.remote(resources={'Custom': 5, 'CPU': 5, 'num_cpus': 5, 'worker-node': 5})
def f(x):
    time.sleep(0.01)
    return x + (platform.node(),)

where Custom is the custom resource I have defined in deployment_ray_autoscaler.yaml, and CPU, num_cpus, and worker-node were requested (separately and together) to see whether any of them work. I then also tried request_resources(bundles=[{"worker-nodes": 5}, {"GeoDataBuilder": 5}, {"CPU": 5}]) (see the sketch below). While the logs on the operator show no failure, I do get a log message pointing me to the autoscaler documentation.

The suggested page (https://docs.ray.io/en/master/cluster/autoscaling.html#multiple-node-type-autoscaling) is actually not reachable, and I'm not sure where to go from here.
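For completeness, here is roughly how I issue the request_resources call (a sketch; worker-nodes and GeoDataBuilder are custom resource names from my own setup):

import ray
from ray.autoscaler.sdk import request_resources

ray.init(address="auto")  # connect to the running cluster from inside a pod

# Ask the autoscaler for enough capacity to hold these bundles;
# each bundle must fit on a single node of some type.
request_resources(bundles=[{"worker-nodes": 5},
                           {"GeoDataBuilder": 5},
                           {"CPU": 5}])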

In my deployment_ray_autoscaler.yaml I have specified rayResources: {"GeodataBuildWorker": 1, "is_spot": 0} under the worker-nodes spec (the YAML template I used is from here).
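My understanding is that the resource name requested in the decorator has to match a key advertised in rayResources exactly (the match is case-sensitive, and built-in CPU is requested via the num_cpus argument rather than the resources dict), so a minimal sketch would look like:

import ray

# "GeodataBuildWorker" must match the rayResources key character for character.
@ray.remote(num_cpus=1, resources={"GeodataBuildWorker": 1})
def g():
    return "runs on a node advertising GeodataBuildWorker"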

Am I missing something?

Hey @Dmitri, this seems to be an issue. Can you take a look at it?

Also, @rico-ci, which version of Ray are you using?

Also, can you address this at the same time? TypeError on Ray Cluster with Ray 1.2.0

I'd recommend using nightly versions of Ray everywhere, if you're not already: a freshly pulled rayproject/ray:nightly in the operator pod if using the Operator, a local nightly build of Ray if using the Ray Cluster Launcher, and Ray nodes running at least Ray 1.2.0.

Hi Dmitri!

So nightly won't work for us, because we need a custom image for the head/worker nodes. Also, because this will eventually end up in production, I would really rather pin a fixed version number (last month we lost a few days' worth of work because another framework introduced a bug in its newest version). I held off on the autoscaler implementation until 1.2.0 was out, in the hope that I could get the autoscaler working with it.

My issue, however, could also be related to a misconfiguration of the custom resource. Do I have to add anything to the rayResources definition in my Helm chart, or do I have to do something programmatically for it to work?

For this part I mostly just followed the YAMLs from here.

There was a critical fix for the operator that unfortunately didn’t make it into 1.2.0 ([autoscaler][kubernetes] autoscaling hotfix by DmitriGekhtman · Pull Request #14024 · ray-project/ray · GitHub).
Just replacing the operator image with a newer one should improve the situation (everything else can be left alone).

It's not ideal, but for reproducibility you can pin an image to a particular commit using the first six characters of its hash, for example rayproject/ray:4846a6.
The version and commit of Ray can be determined by evaluating ray.__version__ and ray.__commit__ in a Python shell.
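For example, in an environment running the Ray build you want to pin (the printed values here are just illustrations):

import ray

print(ray.__version__)     # e.g. "1.2.0"
print(ray.__commit__[:6])  # first six characters give the image tag, e.g. "4846a6"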