Why does KubeRay disable autoscaling by default?

How severely does this issue affect your experience of using Ray?

  • None: Just asking a question out of curiosity

According to the KubeRay doc on Autoscaling, autoscaling is an option when we deploy a Ray cluster.

What considerations did the Ray team weigh when deciding to turn off autoscaling by default?

Suppose I deploy a Ray cluster with the minimum worker nodes set to 1 and the maximum worker nodes set to 5. If I don’t set enableInTreeAutoscaling to true, does that mean the number of worker nodes is fixed at 1 and will never change?

What considerations did the Ray team weigh when deciding to turn off autoscaling by default?

It simply started off as disabled by default because it’s a newer feature and there are no (known) large scale production users of autoscaling with KubeRay at the moment.

Suppose I deploy a Ray cluster with the minimum worker nodes set to 1 and the maximum worker nodes set to 5. If I don’t set enableInTreeAutoscaling to true, does that mean the number of worker nodes is fixed at 1 and will never change?

Yep

Hi Alex, thanks for your kind reply.

it’s a newer feature and there are no (known) large scale production users of autoscaling with KubeRay at the moment.

It’s interesting; I really love the autoscaling feature when I deploy on GCP or use the legacy Ray operator on K8s. Autoscaling lets me have a Ray cluster with 0 ~ 10 Ray worker nodes. It saves money. lol

Under what circumstances would users want a fixed-size Ray cluster on K8s?

Some users spin up short-lived Ray clusters for their jobs. You can imagine that for some batch data processing job (which practically runs on a fixed-size cluster anyway), if you spin up the cluster, run the job, then tear down the cluster, autoscaling doesn’t provide as much benefit.

Others simply use some more domain-specific autoscaling logic. Note that with KubeRay, you can update the RayCluster CR and the operator will add or remove pods in an existing cluster.
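
For illustration, a minimal sketch of that kind of manual or custom scaling (the namespace, cluster name, and worker group index below are placeholders, and the exact field path in the RayCluster CR may vary across KubeRay versions):

    # Hypothetical example: bump the desired worker count of the first worker
    # group of an existing RayCluster; the operator then reconciles the pods.
    kubectl -n my-namespace patch raycluster my-raycluster --type='json' \
      -p='[{"op": "replace", "path": "/spec/workerGroupSpecs/0/replicas", "value": 4}]'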

Some users spin up short-lived Ray clusters for their jobs. You can imagine that for some batch data processing job (which practically runs on a fixed-size cluster anyway), if you spin up the cluster, run the job, then tear down the cluster, autoscaling doesn’t provide as much benefit.

Got it.
For the short-lived ETL job, I can imagine two workflows with KubeRay. Which one do you suggest?

  1. Fixed-size and short-lived cluster (as you described above):

    helm install [MY-FIXED-SIZE-CLUSTER] --namespace [SPECIAL-NAMESPACE] .
    # Run the Ray program with Ray submit or Ray Client (see the sketch after this list)
    # The short-lived ETL job finishes
    helm delete [MY-FIXED-SIZE-CLUSTER] --namespace [SPECIAL-NAMESPACE]
    

    Some things I can imagine:
    pros:
    - There is no dedicated Ray Head pod standing by forever
    - This fixed-size cluster only serves the short-lived ETL job
    cons:
    - Users must have permission to use kubectl and helm commands

  2. A Ray cluster with autoscaling:
    Assume I set the Ray workers to 0 ~ 100 and enable autoscaling.
    Some things I can imagine:
    pros:
    - Users can connect to the Ray cluster using Ray Client without kubectl and helm commands
    cons:
    - There is a dedicated Ray Head pod
    - The resources of the Ray Head pod are fixed. A high-spec Ray Head would waste money; a low-spec Ray Head would hurt performance
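
For the "run the Ray program" step in workflow 1, here is a minimal sketch (the namespace, service name, and script name are placeholders; the head service name follows the usual KubeRay <cluster>-head-svc pattern but may differ in your chart, and the ray job CLI assumes a Ray version that ships the job submission API):

    # Hypothetical example: reach the cluster's dashboard from outside K8s and
    # submit the short-lived ETL script as a Ray job; helm delete then tears
    # the whole cluster down.
    kubectl -n [SPECIAL-NAMESPACE] port-forward service/[MY-FIXED-SIZE-CLUSTER]-head-svc 8265:8265 &
    ray job submit --address http://127.0.0.1:8265 -- python my_etl_job.py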

Others simply use some more domain-specific autoscaling logic. Note that with KubeRay, you can update the RayCluster CR and the operator will add or remove pods in an existing cluster.

Sounds great. We can have custom autoscaling with KubeRay.

Here is some configuration related to this question. I use the Helm chart to deploy the Ray cluster.

Since autoscaling has become an optional sidecar, the replicas, miniReplicas, and maxiReplicas fields can be a little confusing.


Case 1:

head:
  enableInTreeAutoscaling: false
worker:
  replicas: 3
  miniReplicas: 1
  maxiReplicas: 5

This Ray cluster starts with 3 Ray workers and the number of workers doesn’t change.


Case 2:

head:
  enableInTreeAutoscaling: false
worker:
  # replicas: 3
  miniReplicas: 1
  maxiReplicas: 5

This Ray cluster starts with 1 Ray worker and the number of workers doesn’t change.


Case 3:

head:
  enableInTreeAutoscaling: true
worker:
  # replicas: 3
  miniReplicas: 1
  maxiReplicas: 5

This Ray cluster starts with 1 Ray worker and the number of workers changes between 1 and 5 depending on the demand from tasks/actors.


Case 4:

head:
  enableInTreeAutoscaling: true
worker:
  replicas: 3
  miniReplicas: 1
  maxiReplicas: 5

This Ray cluster starts with 3 Ray workers, but the workers scale down to 1 after a few seconds.
The number of workers then changes between 1 and 5 depending on the demand from tasks/actors.


Cases 1 and 2 are reasonable. The Ray cluster looks at either replicas or miniReplicas because enableInTreeAutoscaling is off.

Case 3 is also reasonable.

Case 4 surprised me. It seems that when the Ray cluster starts up, it uses replicas; once the cluster is up, the autoscaler takes over.
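
If that initial scale-down is unwanted, one workaround is to start the autoscaling cluster with replicas already equal to miniReplicas. This is a sketch only: the --set paths mirror the values shown above and may differ between chart versions, and the release and namespace names are placeholders.

    # Hypothetical example: avoid the Case 4 surprise by not requesting more
    # initial workers than the autoscaler's lower bound.
    helm install [MY-AUTOSCALING-CLUSTER] . --namespace [SPECIAL-NAMESPACE] \
      --set head.enableInTreeAutoscaling=true \
      --set worker.replicas=1 \
      --set worker.miniReplicas=1 \
      --set worker.maxiReplicas=5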

Fixed-size and short-lived cluster (as you described above):

A pro here is that it’s a lot easier to specify different Docker images for different jobs.
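
For example, a sketch of per-job image overrides at install time (image.repository and image.tag are assumed value keys, so check the chart’s values.yaml for the actual names; the release, namespace, and image names are placeholders):

    # Hypothetical example: give each short-lived cluster the Docker image its
    # job needs, instead of one image shared by every workload.
    helm install [ETL-JOB-A-CLUSTER] . --namespace [SPECIAL-NAMESPACE] \
      --set image.repository=my-registry/ray-etl-job-a \
      --set image.tag=v1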

A Ray cluster with autoscaling:

Another advantage here is that job submission may be faster since there may already be hot nodes.

A con here is that dependency management for anything runtime envs can’t handle can be quite a bit more annoying.
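
For the part runtime envs can handle (pip packages, environment variables, a working_dir), a per-job submission against the shared cluster might look like this sketch (the address, packages, and script name are placeholders, and it assumes a Ray version with the job submission CLI):

    # Hypothetical example: ship per-job pip dependencies with the job itself
    # instead of baking them into the shared cluster's image.
    ray job submit --address http://127.0.0.1:8265 \
      --runtime-env-json '{"pip": ["pandas", "pyarrow"]}' \
      -- python my_etl_job.py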


Ultimately, you should pick between the options based on which pros/cons matter most to you, but most users with simple batch jobs tend to go with the cluster-per-job approach (which, of course, you can still use autoscaling with if it helps). Serving users tend to value the single-large-cluster approach more.
