Horizontal Autoscaler?

samrogers226 · August 13, 2021, 9:28pm

Hello,

I was wondering whether there has been any thought about implementing a horizontal autoscaler, i.e. instead of adding more workers, adding another cluster and keeping it behind the same load balancer.

Alex · August 16, 2021, 8:32pm

Hey, in general I’d highly recommend achieving larger scale by making a larger Ray cluster rather than scaling horizontally. If you haven’t already had a chance, I’d recommend checking out the Ray Deployment Guide (Ray v2.0.0.dev0) for some best practices on how to deploy larger clusters.

If you really want to, you may be interested in checking out the Prometheus metrics that Ray exports for autoscaling data (or using the same gRPC endpoints as the autoscaler). In general, though, we recommend scaling a single cluster.
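For anyone going the metrics route, here is a minimal sketch of reading a gauge out of Prometheus text-format output using only the standard library. The metric name `example_node_cpu_utilization` and the sample text are hypothetical placeholders, not actual Ray metric names; check what your cluster's metrics endpoint really exports. The parser is deliberately simplified and assumes label values do not contain a `}` after the last label.

```python
def sample_values(text, metric_name):
    """Collect the sample values for one metric from Prometheus text-format output."""
    values = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        if "{" in line:
            # Form: name{label="x",...} value [timestamp]
            name = line[: line.index("{")]
            rest = line[line.rindex("}") + 1:]
        else:
            # Form: name value [timestamp]
            name, _, rest = line.partition(" ")
        if name != metric_name:
            continue
        fields = rest.split()
        if fields:
            values.append(float(fields[0]))
    return values


# Hypothetical scrape output, for illustration only.
SAMPLE = """\
# HELP example_node_cpu_utilization Hypothetical per-node gauge.
# TYPE example_node_cpu_utilization gauge
example_node_cpu_utilization{node="a"} 0.5
example_node_cpu_utilization{node="b"} 0.75
other_metric 3
"""

readings = sample_values(SAMPLE, "example_node_cpu_utilization")
average = sum(readings) / len(readings)
```

In practice the text would come from scraping the cluster's metrics endpoint (or from a Prometheus server query) rather than from a hard-coded string.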

samrogers226 · August 16, 2021, 8:50pm

Alex, thanks for chiming in! The challenge we’ve seen (and I’ve chatted with a few other Ray team members about this in the past) is that things slow to a crawl when numerous (50+) clients are connected. Simply adding more workers or making the head node larger doesn’t do much for us. My understanding is that Redis is a big driver of this.

Alex · August 16, 2021, 9:25pm

I see. It would be great to hear more about the workload and the scale you’re trying to achieve. (We can also take this offline if needed.)

Dmitri · August 16, 2021, 11:10pm

cc @jiaodong – I think you were interested in cluster sharding for other reasons (fault tolerance)

asm582 · August 16, 2021, 11:20pm

While the idea is interesting, isn’t it the case that multiple GCS instances would have to coordinate for task management in the system?

Dmitri · August 16, 2021, 11:45pm

It would work if the workloads running on each cluster were completely decoupled / embarrassingly parallel.

samrogers226 · August 17, 2021, 12:31am

Alex, happy to talk more. I’ll message you in the morning.

@Dmitri is correct, the jobs are totally independent of one another. At the moment we’re basically sharding: a bunch of clusters running in K8s behind one load balancer and selector. In the near term, we’re thinking we’ll implement a stand-alone job that adds or removes Ray clusters via the Python SDK based on cluster utilization. I’d love to attempt a PR adding this to the autoscaler at some point, but at the moment I’ve got a full plate.
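As a rough illustration of the decision such a stand-alone job might make, here is a minimal sketch that picks a cluster count from per-cluster utilization. It uses a proportional rule (total load divided by a target utilization, rounded up), the same shape as the formula used by the Kubernetes Horizontal Pod Autoscaler. All names and thresholds (`ScalePolicy`, `desired_cluster_count`, the 0.6 target) are hypothetical, and actually creating or deleting clusters is left out because it depends on the deployment.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalePolicy:
    """Hypothetical tuning knobs for the sketch below."""
    target_utilization: float = 0.6  # desired average load per cluster, in (0, 1]
    min_clusters: int = 1
    max_clusters: int = 20


def desired_cluster_count(utilizations, policy=ScalePolicy()):
    """Return how many clusters should exist.

    utilizations: one value in [0, 1] per currently running cluster.
    """
    if not utilizations:
        return policy.min_clusters
    # Total load in "cluster units", spread so that each cluster
    # would sit at roughly the target utilization.
    total_load = sum(utilizations)
    wanted = math.ceil(total_load / policy.target_utilization)
    # Clamp to the configured bounds.
    return max(policy.min_clusters, min(policy.max_clusters, wanted))


# Three busy clusters: scale out.
scale_out = desired_cluster_count([0.9, 0.9, 0.9])
# Four mostly idle clusters: scale in.
scale_in = desired_cluster_count([0.1, 0.1, 0.1, 0.1])
```

A real job would probably also want a cooldown between changes and would need to drain a cluster before removing it, since this only works when the jobs on each cluster are independent.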
