How severely does this issue affect your experience of using Ray?
Medium: It contributes to significant difficulty in completing my task, but I can work around it.
I'm running ray.tune trials on a Ray cluster. My problem is that ray.tune only runs as many concurrent trials as there are CPUs on the head node. How can I manually increase the number of concurrent trials so that the cluster's autoscaler is triggered?
-------- PS -------
I understand that I can specify CPU resources for each trial. However, in my case each trial only needs 1 CPU. I only want to increase the number of concurrent trials.
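For reference, here is a minimal sketch of the kind of setup described above; the training function, search space, and num_samples are illustrative placeholders, not the actual code.

```python
import ray
from ray import tune


def train_fn(config):
    # Placeholder trainable -- stands in for the real training code.
    tune.report(score=config["x"] ** 2)


ray.init(address="auto")  # connect to the existing cluster

tune.run(
    train_fn,
    config={"x": tune.uniform(0, 1)},
    resources_per_trial={"cpu": 1},  # each trial only needs 1 CPU
    num_samples=100,  # many more trials than the head node has CPUs
)
```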
We've discussed this on Slack before, but I'm pasting my answer here for future reference:
The default for this setting depends on the search algorithm, but autoscaling should always be triggered if you request more trials than can fit on the current cluster and the cluster is configured for autoscaling. If you want to speed up autoscaling, you can try adjusting the TUNE_MAX_PENDING_TRIALS_PG environment variable.
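For example, a minimal sketch of one way to raise that limit; the value 64 is only an illustrative choice.

```python
import os

# Raise the cap on pending trials so the autoscaler sees more demand.
# Set this before tune.run() is called (or export it in the shell that
# launches the Tune script).
os.environ["TUNE_MAX_PENDING_TRIALS_PG"] = "64"
```

Equivalently, `export TUNE_MAX_PENDING_TRIALS_PG=64` in the shell before launching the script has the same effect.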
FWIW, I'm seeing the same behaviour: my head node has 8 CPUs, and my four workers each have 64 CPUs and 8 GPUs. Each trial needs half a GPU, and ray.tune never schedules more than 8 concurrent trials. I can get it to use all 4 worker nodes by setting TUNE_MAX_PENDING_TRIALS_PG to a large number. However, when I look at the code in execution/trial_runner.py, I see that it determines the max number of pending trials by calling ray.cluster_resources().get("CPU", 1.0), which for my cluster returns 264. So I'm confused as to how it's getting set to 8. Any suggestions?
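For anyone else debugging this, the call in question can be checked directly; a quick sketch, assuming it is run from a node attached to the cluster:

```python
import ray

ray.init(address="auto")  # attach to the running cluster

# Mirrors the call referenced above; on the cluster described here this
# prints 264.0 (8 head-node CPUs + 4 workers x 64 CPUs).
print(ray.cluster_resources().get("CPU", 1.0))
print(ray.cluster_resources().get("GPU", 0.0))
```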