In the course of my current k8s autoscaler work, I found that my operator node never fulfills the resource demands specified in my ray.remote decorator. I also tried this:
Where Custom is the custom resource I have defined in deployment_ray_autoscaler.yaml. I have also requested CPU, num_cpus, and worker-node to see whether any of these work (separately and together). I also tried request_resources(bundles=[{"worker-nodes": 5}, {"GeoDataBuilder": 5}, {"CPU": 5}]). The operator logs show no failure, but I get this log:
In my deployment_ray_autoscaler.yaml I have specified rayResources: {"GeodataBuildWorker": 1, "is_spot": 0} under the worker-nodes spec (the YAML template I used is from here).
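For context, the Python side of what I'm attempting looks roughly like this (a simplified sketch rather than my exact code; the resource name matches the rayResources entry above):

```python
import ray
from ray.autoscaler.sdk import request_resources

ray.init(address="auto")

# Ask the autoscaler to scale up nodes that provide the custom resource
# defined under rayResources in deployment_ray_autoscaler.yaml.
request_resources(bundles=[{"GeodataBuildWorker": 5}])

# A task that should only be scheduled on a node advertising that resource.
@ray.remote(resources={"GeodataBuildWorker": 1})
def build_geodata(chunk):
    return chunk  # placeholder for the actual work

futures = [build_geodata.remote(i) for i in range(5)]
print(ray.get(futures))
```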
I’d recommend using nightly versions of Ray everywhere, if you’re not already: a freshly pulled rayproject/ray:nightly in the operator pod if you’re using the Operator, a local nightly build of Ray if you’re using the Ray Cluster Launcher, and Ray nodes running at least Ray 1.2.0.
So nightly won’t work for us because we need a custom image for the head/worker nodes. Also, because this will eventually end up in production, I would really rather have a fixed version number (last month we lost a few days' worth of work because another framework introduced a bug in their newest version). I held off on the autoscaler implementation until 1.2.0 was released, in the hope that I could get the autoscaler working with that.
My issue, however, could also be related to some misconfiguration of the Custom resource. Do I have to add anything to the rayResources definition in my Helm chart, or do I have to do something programmatically for it to work?
It’s not ideal, but for reproducibility you can specify an image pinned to a particular commit using the first six characters of its hash, for example rayproject/ray:4846a6.
The Ray version and commit can be determined by checking ray.__version__ and ray.__commit__ in a Python shell.
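For example, a quick way to find the matching image tag (a minimal sketch; the output depends on the Ray build you have installed):

```python
import ray

# Installed Ray version and the commit it was built from.
print(ray.__version__)   # e.g. "1.2.0"
print(ray.__commit__)    # full commit SHA

# The first six characters of the commit give the corresponding image tag.
print("rayproject/ray:" + ray.__commit__[:6])  # e.g. "rayproject/ray:4846a6"
```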