Hello,
Was wondering if there has been any thought around implementing a horizontal autoscaler? I.e. instead of adding more workers, add an additional cluster and keep it behind the same load balancer.
Hey, in general I'd highly recommend achieving larger scale by making a larger Ray cluster rather than horizontally scaling. If you haven't already had a chance, I'd recommend checking out the Ray Deployment Guide (Ray v2.0.0.dev0) for some best practices on how to deploy larger clusters.
If you really want to go that route, you may be interested in checking out the Prometheus metrics that Ray exports for autoscaling data (or in using the same gRPC endpoints the autoscaler uses). In general, though, we recommend scaling a single cluster.
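For example, a minimal sketch of polling that data from a Prometheus server might look like the following. The server address and the metric/label names are placeholders here, not the actual names Ray exports; substitute whatever your Prometheus setup and Ray version expose:

```python
import requests

# A minimal sketch, not a definitive implementation. The Prometheus
# address, metric name, and label below are placeholders.
PROMETHEUS_URL = "http://prometheus.internal:9090/api/v1/query"

def cluster_cpu_utilization(cluster_label: str) -> float:
    """Average CPU utilization (0-1) across one Ray cluster's nodes."""
    # 'ray_node_cpu_utilization' is a hypothetical metric name.
    query = f'avg(ray_node_cpu_utilization{{ray_cluster="{cluster_label}"}})'
    resp = requests.get(PROMETHEUS_URL, params={"query": query}, timeout=5)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    # Prometheus instant-query values come back as [timestamp, "value"].
    return float(result[0]["value"][1]) if result else 0.0

if __name__ == "__main__":
    if cluster_cpu_utilization("cluster-a") > 0.8:
        print("cluster-a is hot; a new cluster could be added to the pool")
```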
Alex, thanks for chiming in! The challenge we've seen (and I've chatted with a few other Ray team members about this in the past) is that things slow to a crawl when we run numerous (50+) clients. Simply adding more workers or making the head node larger doesn't do much for us; my understanding is that Redis is a big driver of this.
I see. It would be great to hear more about the workload and the scale you're trying to achieve. (We can also take this offline if needed.)
cc @jiaodong – I think you were interested in cluster sharding for other reasons (fault tolerance)
While the idea is interesting, isn't it the case that multiple GCS instances would have to coordinate for task management in the system?
It would work if the workloads running on each cluster were completely decoupled / embarrassingly parallel.
Alex, happy to talk more. I’ll message you in the morning.
@Dmitri is correct, jobs are totally independent of one another. At the moment, we're basically sharding: a bunch of clusters running in K8s behind one load balancer and selector. In the near term, we're thinking we'll implement a stand-alone job that adds/removes Ray clusters via the Python SDK based on cluster utilization; a rough sketch is below. I'd love to attempt a PR for adding this to the autoscaler at some point, but at the moment I've got a full plate.
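To make that concrete, here's the kind of control loop we have in mind (not battle-tested). The YAML paths, the thresholds, and the `head_address_for` lookup are all placeholders, and it assumes each shard is launched from its own cluster config via `ray.autoscaler.sdk`:

```python
import time

import ray
from ray.autoscaler.sdk import create_or_update_cluster, teardown_cluster

# Placeholders: one Ray cluster YAML per shard, plus made-up thresholds.
CLUSTER_CONFIGS = ["cluster-a.yaml", "cluster-b.yaml", "cluster-c.yaml"]
SCALE_UP_AT = 0.80    # add a cluster above this average CPU utilization
SCALE_DOWN_AT = 0.20  # tear one down below this

def head_address_for(cfg: str) -> str:
    # Placeholder: resolve a config to its head node, e.g. via your K8s
    # service discovery. Hardcoded here purely for the sketch.
    return {"cluster-a.yaml": "10.0.0.1",
            "cluster-b.yaml": "10.0.0.2",
            "cluster-c.yaml": "10.0.0.3"}[cfg]

def cpu_utilization(head_address: str) -> float:
    """Connect to one cluster via the Ray client and compute CPU usage."""
    ray.init(address=f"ray://{head_address}:10001")
    try:
        total = ray.cluster_resources().get("CPU", 0)
        free = ray.available_resources().get("CPU", 0)
        return 1.0 - free / total if total else 0.0
    finally:
        ray.shutdown()

active = [CLUSTER_CONFIGS[0]]  # start with one cluster behind the LB
while True:
    usages = [cpu_utilization(head_address_for(cfg)) for cfg in active]
    avg = sum(usages) / len(usages)
    if avg > SCALE_UP_AT and len(active) < len(CLUSTER_CONFIGS):
        cfg = CLUSTER_CONFIGS[len(active)]
        create_or_update_cluster(cfg)   # launch the next shard;
        active.append(cfg)              # the LB/selector picks it up
    elif avg < SCALE_DOWN_AT and len(active) > 1:
        teardown_cluster(active.pop())  # retire the most recent shard
    time.sleep(60)
```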