A few years ago, IIRC, the recommendation was for a single Ray Cluster to be devoted to a single model training/use case. Right now, the Jobs API seems to allow multiple training runs to happen on the same Ray Cluster in parallel.
What would the recommendation be today with Ray AIR: disposable per-job Ray clusters (through KubeRay’s RayJob, for example) or a single long-lived Ray Cluster that takes in many jobs in parallel?
The answer really depends on what your requirements are.
Do you need strict isolation between the jobs? For example, should each job have its own object store? If so, then using one ephemeral cluster per job would be the best approach.
If the requirements are more lax, then it should be safe to run multiple jobs on the same cluster.
That being said, we still recommend running only one job at a time. So the same cluster can run multiple jobs one after another, but you may run into issues if these jobs run concurrently.
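If you do go with a shared cluster, one way to follow the "one job at a time" advice is to submit jobs sequentially and wait for each to finish before starting the next. Below is a minimal sketch of that idea, not an official recipe. It assumes a client object that behaves like `ray.job_submission.JobSubmissionClient` (i.e. it has `submit_job` and `get_job_status`). The helper name `run_jobs_one_at_a_time`, the script names, and the address in the commented usage are made up for illustration.

```python
import time

# Terminal states reported by the Ray Jobs API (JobStatus values).
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED"}


def run_jobs_one_at_a_time(client, entrypoints, poll_seconds=5.0):
    """Submit each entrypoint only after the previous job has finished.

    `client` is assumed to expose submit_job(entrypoint=...) and
    get_job_status(job_id), like ray.job_submission.JobSubmissionClient.
    Returns a list of (job_id, final_state) tuples in submission order.
    """
    results = []
    for entrypoint in entrypoints:
        job_id = client.submit_job(entrypoint=entrypoint)
        while True:
            status = client.get_job_status(job_id)
            # JobStatus is a string enum; normalize it to a plain string.
            state = str(getattr(status, "value", status))
            if state in TERMINAL_STATES:
                break
            time.sleep(poll_seconds)
        results.append((job_id, state))
    return results


# Hypothetical usage against a running cluster (requires Ray installed):
#
#   from ray.job_submission import JobSubmissionClient
#   client = JobSubmissionClient("http://127.0.0.1:8265")
#   run_jobs_one_at_a_time(client, ["python train_a.py", "python train_b.py"])
```

This only serializes jobs submitted through this one helper. It does not stop someone else from submitting to the same cluster at the same time, so if you need a hard guarantee, ephemeral per-job clusters are still the safer option.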
Hey Amog, thank you! That’s really helpful, the use-case is concurrent runs indeed, so it sounds like it’s probably best to continue with ephemeral clusters.