How severe does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Right now we provide an ML training platform by deploying a single auto-scaling Ray cluster that multiple users submit jobs to, potentially at the same time. That has been working well so far. Now we want to start offering hyperparameter tuning with Ray Tune. If multi-tenancy is not supported, is there a recommended way to handle this use case?
Now that we want to incorporate Ray Tune into our platform, what is the recommended way to support our use case?
One job could run all of its trials at the same time, while another job waits for a long time before it gets enough resources to run even its first trial.
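For illustration, a minimal sketch of how this can happen with Tune's defaults (the function name, cluster size, and sample count are only placeholders): each trial requests 1 CPU and there is no cap on concurrency, so one job with enough samples can occupy every free CPU while another user's job queues.

```python
from ray import tune

def train_fn(config):
    # Placeholder trainable; returns a dummy final metric.
    return {"loss": 0.0}

# With the defaults, each trial requests 1 CPU and there is no limit on
# concurrent trials, so on a hypothetical 32-CPU cluster this job can
# start 32 trials at once and keep the cluster saturated until its
# samples drain, while a second user's tuning job sits pending.
tuner = tune.Tuner(
    train_fn,
    param_space={"lr": tune.loguniform(1e-4, 1e-1)},
    tune_config=tune.TuneConfig(num_samples=64),
)
tuner.fit()
```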
Can you elaborate on this? In Tune, you can also set per-trial resources like this (see the Ray Tune FAQ — Ray 2.8.0), right?
```python
from ray import tune

# Request 2 CPUs, half a GPU, and 80 units of a custom "hdd" resource per trial.
tuner = tune.Tuner(
    tune.with_resources(
        train_fn, resources={"cpu": 2, "gpu": 0.5, "custom_resources": {"hdd": 80}}
    ),
)
tuner.fit()
```
Or is it always trying to use all resources available on the cluster?
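To make my question concrete, here is how I currently understand it as a sketch (the cluster sizes are just an illustrative assumption, and train_fn is a placeholder):

```python
from ray import tune

def train_fn(config):
    # Placeholder trainable; returns a dummy final metric.
    return {"loss": 0.0}

# Illustrative assumption: the shared cluster has 32 CPUs and 4 GPUs.
# With 2 CPUs and 0.5 GPU requested per trial, I would expect at most
# min(32 / 2, 4 / 0.5) = 8 trials of this job to run at the same time,
# i.e. the per-trial request indirectly bounds how much of the cluster
# the job can take, rather than the job grabbing everything.
tuner = tune.Tuner(
    tune.with_resources(train_fn, resources={"cpu": 2, "gpu": 0.5}),
    param_space={"lr": tune.loguniform(1e-4, 1e-1)},
    tune_config=tune.TuneConfig(num_samples=16),
)
tuner.fit()
```

Is that understanding correct?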
I’ve read many posts on this topic, but none of them seem to give a concrete answer.