How severe does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Right now we provide an ML training platform by deploying a single auto-scaling Ray cluster that multiple users submit jobs to, potentially at the same time. That has been working well so far. Now we want to start offering hyperparameter tuning with Ray Tune. If multi-tenancy is not supported, is there a recommended way to handle this use case?
Now that we want to incorporate Ray Tune into our platform, what is the recommended way to support our use case?
One job could run all of its trials at the same time, while another job waits for a long time before it gets enough resources to run even its first trial.
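For illustration, a minimal sketch of how this can happen with Tune's defaults (the function name, cluster size, and sample count are only placeholders): each trial requests 1 CPU and there is no cap on concurrency, so one job with enough samples can occupy every free CPU while another user's job queues.

```python
from ray import tune

def train_fn(config):
    # Placeholder trainable; returns a dummy final metric.
    return {"loss": 0.0}

# With the defaults, each trial requests 1 CPU and there is no limit on
# concurrent trials, so on a hypothetical 32-CPU cluster this job can
# start 32 trials at once and keep the cluster saturated until its
# samples drain, while a second user's tuning job sits pending.
tuner = tune.Tuner(
    train_fn,
    param_space={"lr": tune.loguniform(1e-4, 1e-1)},
    tune_config=tune.TuneConfig(num_samples=64),
)
tuner.fit()
```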
Can you elaborate on this? In Tune, you can also set per-trial resources like this (see the Ray Tune FAQ — Ray 2.8.0), right?
```python
from ray import tune

# Request 2 CPUs, half a GPU, and 80 units of a custom "hdd" resource per trial.
tuner = tune.Tuner(
    tune.with_resources(
        train_fn, resources={"cpu": 2, "gpu": 0.5, "custom_resources": {"hdd": 80}}
    ),
)
tuner.fit()
```
Or is it always trying to use all resources available on the cluster?
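To make my question concrete, here is how I currently understand it as a sketch (the cluster sizes are just an illustrative assumption, and train_fn is a placeholder):

```python
from ray import tune

def train_fn(config):
    # Placeholder trainable; returns a dummy final metric.
    return {"loss": 0.0}

# Illustrative assumption: the shared cluster has 32 CPUs and 4 GPUs.
# With 2 CPUs and 0.5 GPU requested per trial, I would expect at most
# min(32 / 2, 4 / 0.5) = 8 trials of this job to run at the same time,
# i.e. the per-trial request indirectly bounds how much of the cluster
# the job can take, rather than the job grabbing everything.
tuner = tune.Tuner(
    tune.with_resources(train_fn, resources={"cpu": 2, "gpu": 0.5}),
    param_space={"lr": tune.loguniform(1e-4, 1e-1)},
    tune_config=tune.TuneConfig(num_samples=16),
)
tuner.fit()
```

Is that understanding correct?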
I’ve read many posts on this topic, but none of them seem to give a concrete answer.