Hello, I’m using Ray Tune on a server to optimize the performance of a PyTorch model, and I have a few questions. My search algorithm is BOHB.
I’m using 8 nodes, each with 4 GPUs, which I managed to set up as described in Deploying on Slurm — Ray 1.11.0.
I have in total 576 CPUs and 32 GPUs, and I’d like to run 8 concurrent trials.
My first question is: does Ray automatically assign each trial GPUs from within a single node, rather than GPUs split across different nodes?
My second question is: when I check the log, one trial is marked RUNNING, another is marked PENDING, and all other trials are marked PAUSED or TERMINATED. Does that mean only one trial is actually running concurrently? How can I change this so that all 8 nodes are used at the same time?
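For context, here is the resource math I expect: with 576 CPUs, 32 GPUs, and 8 concurrent trials, each trial should reserve 72 CPUs and 4 GPUs (i.e. one node’s worth of GPUs). A minimal sketch of how I believe this is expressed in Ray Tune 1.x (the `tune.run(..., resources_per_trial=...)` call in the comment is my reading of the 1.11 docs; please correct me if that is not the right way to force 8 concurrent trials):

```python
# Per-trial resource split for the cluster described above.
TOTAL_CPUS = 576
TOTAL_GPUS = 32
CONCURRENT_TRIALS = 8

cpus_per_trial = TOTAL_CPUS // CONCURRENT_TRIALS  # 72 CPUs per trial
gpus_per_trial = TOTAL_GPUS // CONCURRENT_TRIALS  # 4 GPUs per trial

# My understanding (sketch, Ray Tune 1.x API) is that this would then be:
#
#   tune.run(
#       trainable,
#       resources_per_trial={"cpu": cpus_per_trial, "gpu": gpus_per_trial},
#       ...,
#   )
#
# so that the scheduler can fit exactly 8 trials on the cluster at once.
print(cpus_per_trial, gpus_per_trial)
```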
EDIT: I checked the log again; it initially used all 8 nodes, then started using only one.