I’ve created a cluster manually by running ray start --head --port=6379 on the head node (machine-1) and ray start --address='<address>:6379' on the worker node (machine-2).
Then I start tuning with code that looks like the snippet below (some parts are omitted):
```python
import ray
from ray import tune

# Connect to the existing cluster started on machine-1.
ray.init(address="<address>:6379")
...
# Each trial requests 4 GPUs and 20 CPUs via a single placement group bundle.
trainable_with_gpu = tune.with_resources(
    train,
    resources=tune.PlacementGroupFactory([
        {'GPU': 4, 'CPU': 20},
    ]),
)
...
tuner = tune.Tuner(
    trainable_with_gpu,
    ...
)
results = tuner.fit()
```
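To confirm that both machines actually joined the cluster and that their resources are visible, a minimal check I could run (a sketch, reusing the same <address> placeholder and Ray's public ray.cluster_resources() / ray.nodes() calls) would be:

```python
import ray

# Connect to the same cluster (same placeholder address as above).
ray.init(address="<address>:6379")

# Aggregate resources across all alive nodes; this should show 12 GPUs and
# 104 CPUs if both machine-1 and machine-2 joined.
print(ray.cluster_resources())

# Per-node view: address, liveness, and the resources each node reports.
for node in ray.nodes():
    print(node["NodeManagerAddress"], node["Alive"], node["Resources"])
```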
Here I’ve specified that each trial should use 4 GPUs and 20 CPUs. machine-1 has 8 GPUs and 40 CPUs, and machine-2 has 4 GPUs and 64 CPUs. From my understanding, the limiting factor here is the GPUs, so no more than three trials should run in parallel (two on machine-1 and one on machine-2). Is that correct?
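To spell out the arithmetic behind my expectation (a rough sketch; it assumes that a single placement group bundle must fit entirely on one node, as Ray documents):

```python
# Per-trial bundle from the PlacementGroupFactory above; a bundle cannot be
# split across nodes, so capacity is computed node by node.
per_trial = {'GPU': 4, 'CPU': 20}
nodes = {
    'machine-1': {'GPU': 8, 'CPU': 40},
    'machine-2': {'GPU': 4, 'CPU': 64},
}

total = 0
for name, res in nodes.items():
    # The most constrained resource on this node decides how many bundles fit.
    fits = min(res[k] // per_trial[k] for k in per_trial)
    print(f"{name}: up to {fits} concurrent trial(s)")
    total += fits

print(f"cluster-wide: up to {total} concurrent trials")  # 2 + 1 = 3
```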
Next, some evidence supporting my claim that resources from the second node are not being utilized.
In the screenshot below you can see the output of nvidia-smi on machine-1 and machine-2, from left to right respectively. As you can see, no job is scheduled using the resources of machine-2 (right).
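If it helps with debugging, one thing I could add (a sketch, using Ray's ray.util.get_node_ip_address() helper) is a log line inside the trainable so the Tune output shows which node each trial is scheduled on:

```python
import ray

def train(config):
    # Print the IP of the node this trial was scheduled on, so the logs show
    # whether any trial ever lands on machine-2.
    node_ip = ray.util.get_node_ip_address()
    print(f"trial starting on node {node_ip}")
    # ... actual training code omitted, as above ...
```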
What am I missing?
Regards,
J.