One large Ray cluster vs set of specialized clusters

What is the best deployment practice: one large Ray cluster with dedicated resource groups, or a set of smaller dedicated clusters? The latter seems better from a maintainability point of view, but raises the question of communication between clusters. The former makes communication simpler, but raises the question of dynamically adding specific resource groups with custom library installations.
So what is the official answer?

It depends on the particular application details, but generally I think the second setup is better: one Ray cluster per application concern.

@eoakes would you agree?

Yes, unless you have a strong reason for sharing a cluster (like rapid scaling up and down of different applications), I’d suggest multiple clusters.

Note that there is some support in the works for specifying dependencies per task or actor: [RFC] runtime_env for actors and tasks (ray-project/ray issue #14019)
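For anyone reading later: in Ray releases that shipped this feature, it is exposed as the runtime_env option on tasks and actors. Here is a minimal sketch, assuming such a release and network access for pip. The helper name build_runtime_env and the choice of the requests package are mine, purely for illustration.

```python
def build_runtime_env(pip_packages, env_vars=None):
    """Return a runtime_env dict with pip packages and optional env vars.

    Hypothetical helper: it only assembles the dict that Ray expects.
    """
    env = {"pip": list(pip_packages)}
    if env_vars:
        env["env_vars"] = dict(env_vars)
    return env


def run_with_dependencies():
    """Run one task whose worker gets its own pip environment."""
    import ray  # imported lazily so the helper above works without Ray

    ray.init()

    @ray.remote(runtime_env=build_runtime_env(["requests"]))
    def requests_version():
        import requests  # installed into this task's runtime_env

        return requests.__version__

    try:
        return ray.get(requests_version.remote())
    finally:
        ray.shutdown()
```

Calling run_with_dependencies() on a machine with Ray installed should return the version of requests that the worker installed, without that package having to be present on every node in advance.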

Thanks @Dmitri. In this case, is it possible to dynamically add specific resource groups to a running cluster?

Thanks guys, so how can I communicate between clusters? Submit a task from cluster A to cluster B?

This is an interesting thread. When chatting with @rliaw about this, I think he described it as: "if it's a smaller workload, it would make sense to dynamically bring the Ray cluster up and down. But if it's a larger workload (and/or one that is running continuously), then you'd rather have a longer-running, larger shared cluster."
From a pure architectural perspective, "one cluster per workload" feels a bit cleaner, but it might come with additional overhead. You would also have to implement client-side code that brings up the cluster itself, which adds some delay before execution starts. So there are trade-offs to consider, and it is worth discussing which of them matter most for your case.


I like the idea of dedicated clusters, but I was thinking more of long-lived dedicated clusters, not necessarily dynamically created ones. The question remains: how do I submit from cluster A to cluster B?
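One possible approach, which nobody in this thread proposed, is to have code running on cluster A submit a job to cluster B's head node over HTTP. The sketch below assumes a Ray release that includes the Job Submission SDK and that cluster B's dashboard port is reachable from cluster A. The helper names, the hostname, and the script name are hypothetical.

```python
def dashboard_address(head_host, port=8265):
    """Build the HTTP address of a cluster's job server.

    8265 is Ray's default dashboard port; adjust it if yours differs.
    """
    return f"http://{head_host}:{port}"


def submit_to_cluster(head_host, entrypoint, runtime_env=None):
    """Submit a shell entrypoint as a job to another cluster.

    Returns the job ID assigned by the remote cluster.
    """
    # Lazy import so dashboard_address() is usable without Ray installed.
    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient(dashboard_address(head_host))
    return client.submit_job(entrypoint=entrypoint, runtime_env=runtime_env)
```

From a task or driver on cluster A, this could be called as submit_to_cluster("cluster-b-head.internal", "python process.py"). Jobs are coarse-grained, so results would have to come back through shared storage or a service endpoint rather than as object references. That makes this a rough substitute for a remote call, not an equivalent.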