I’ve created a cluster manually by running ray start --head --port=6379 on the head node (machine-1) and ray start --address='<address>:6379' on the worker node (machine-2).
Then I start tuning with code that looks like the snippet below (some parts are omitted):
```python
import ray
from ray import tune

# Connect to the existing cluster started on machine-1.
ray.init(address="<address>:6379")
...
# Each trial requests 4 GPUs and 20 CPUs via a single placement group bundle.
trainable_with_gpu = tune.with_resources(
    train,
    resources=tune.PlacementGroupFactory([
        {'GPU': 4, 'CPU': 20},
    ]),
)
...
tuner = tune.Tuner(
    trainable_with_gpu,
    ...
)
results = tuner.fit()
```
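To confirm that both machines actually joined the cluster and that their resources are visible, a minimal check I could run (a sketch, reusing the same <address> placeholder and Ray's public ray.cluster_resources() / ray.nodes() calls) would be:

```python
import ray

# Connect to the same cluster (same placeholder address as above).
ray.init(address="<address>:6379")

# Aggregate resources across all alive nodes; this should show 12 GPUs and
# 104 CPUs if both machine-1 and machine-2 joined.
print(ray.cluster_resources())

# Per-node view: address, liveness, and the resources each node reports.
for node in ray.nodes():
    print(node["NodeManagerAddress"], node["Alive"], node["Resources"])
```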
Here I’ve specified that each trial should use 4 GPUs and 20 CPUs. machine-1 has 8 GPUs and 40 CPUs, and machine-2 has 4 GPUs and 64 CPUs. From my understanding, the limiting factor here is the GPUs, so no more than three trials should run in parallel (two on machine-1 and one on machine-2). Is that correct?
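To spell out the arithmetic behind my expectation (a rough sketch; it assumes that a single placement group bundle must fit entirely on one node, as Ray documents):

```python
# Per-trial bundle from the PlacementGroupFactory above; a bundle cannot be
# split across nodes, so capacity is computed node by node.
per_trial = {'GPU': 4, 'CPU': 20}
nodes = {
    'machine-1': {'GPU': 8, 'CPU': 40},
    'machine-2': {'GPU': 4, 'CPU': 64},
}

total = 0
for name, res in nodes.items():
    # The most constrained resource on this node decides how many bundles fit.
    fits = min(res[k] // per_trial[k] for k in per_trial)
    print(f"{name}: up to {fits} concurrent trial(s)")
    total += fits

print(f"cluster-wide: up to {total} concurrent trials")  # 2 + 1 = 3
```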
Next, some evidence supporting my claim that resources from the second node are not being utilized.
In the screenshot below you can see the output of nvidia-smi on machine-1 and machine-2, from left to right respectively. As you can see, no job is scheduled using the resources of machine-2 (right).
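If it helps with debugging, one thing I could add (a sketch, using Ray's ray.util.get_node_ip_address() helper) is a log line inside the trainable so the Tune output shows which node each trial is scheduled on:

```python
import ray

def train(config):
    # Print the IP of the node this trial was scheduled on, so the logs show
    # whether any trial ever lands on machine-2.
    node_ip = ray.util.get_node_ip_address()
    print(f"trial starting on node {node_ip}")
    # ... actual training code omitted, as above ...
```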
What am I missing?
Regards,
J.