[Tune] [SGD] [RLlib] Distribute Training Across Nodes with Different GPUs

I have the following systems that I wish to run distributed GPU training on

  1. 32 core CPU + 2070S + 2070S
  2. 16 core CPU + 2070S + 1080
  3. 16 core CPU + 1080 + 1080

What are the downsides of using all 3 systems to perform distributed training of a single model? Node #2 has a mix of two different GPUs, while nodes #1 and #3 each have a matched pair, but of different models from each other.

When tuning hyperparameters (with Ray Tune), is it better to train one model across all 3 systems, or to train 3 models in parallel with 3 different sets of hyperparameters, where each system trains using its own set?

Hi @rliaw, could you take a look at this? (Hopefully you are the right person to ask about this question :slight_smile: )

I have the same question.
For example, I have two types of GPU nodes. One node has cards with 4 GB of GPU memory, and the other has cards with 12 GB. My task only needs 4 GB of memory and is not computation-heavy. So how can I run one task per card on the first node, and three tasks per card on the second? In other words, if each node has two cards, the cluster should run 8 tasks simultaneously.
I have noticed an inelegant method that resolves this partly: for each node, I can use ray start --resources to define a custom resource equal to the GPU memory size of its cards (rough sketch below), but this is not convenient. Besides, it seems that it cannot control other resources like CPU or memory.
So is there any elegant method?
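To make the workaround concrete, here is roughly what I mean (gpu_mem_gb and train_task are made-up names for this sketch, and the custom resource is defined per node, not per card):

import ray

# Started from the shell on each worker node (values are examples):
#   ray start --address=<head_address> --num-gpus=2 --resources='{"gpu_mem_gb": 8}'   # 2 x 4 GB cards
#   ray start --address=<head_address> --num-gpus=2 --resources='{"gpu_mem_gb": 24}'  # 2 x 12 GB cards

ray.init(address="auto")

# Each task asks for 4 GB of the custom "gpu_mem_gb" resource plus a slice of a GPU,
# so the small node fits 2 such tasks and the large node fits 6 (8 in total).
# Note: this does not pin a task to a specific physical card.
@ray.remote(num_gpus=0.3, resources={"gpu_mem_gb": 4})
def train_task():
    ...


futures = [train_task.remote() for _ in range(8)]
ray.get(futures)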

I have the same problem. In my cluster I have GPUs of type 2080 Ti and 3090. I think I could give the latter more work, because more VRAM is available and stays free if I configure 1 GPU per trial.

Is there maybe already a solution?

There is currently no nice out-of-the-box solution in Ray to handle this, so the solution will be custom to your environment.

You can use tune.with_resources to dynamically specify the resources that should be allocated to a trial.

Ray automatically creates device-specific resources:

>>> ray.cluster_resources()
{'object_store_memory': 74558149015.0, 'node:172.31.76.223': 1.0, 'CPU': 36.0, 'memory': 192337857739.0, 'GPU': 4.0, 'accelerator_type:V100': 4.0, 'node:172.31.71.184': 1.0, 'node:172.31.68.10': 1.0, 'node:172.31.72.165': 1.0, 'node:172.31.90.28': 1.0}

Note the 'accelerator_type:V100': 4.0 above (this cluster happens to have just one accelerator type).
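Outside of Tune, you can use these auto-created resources directly as a scheduling constraint for a task or actor. A minimal sketch (train_step is a made-up function name; requesting a small fraction of the accelerator_type resource just constrains which nodes the task can run on):

import ray

ray.init(address="auto")

# Requesting a small slice of the auto-created "accelerator_type:V100" resource
# restricts scheduling to nodes that actually have a V100, while num_gpus=1
# reserves the GPU itself.
@ray.remote(num_gpus=1, resources={"accelerator_type:V100": 0.01})
def train_step():
    ...


ray.get(train_step.remote())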

What you could do is randomly sample one of the accelerator types for each trial, e.g. like this:

import random

from ray import tune

# One resource bundle per accelerator type in the cluster
# (replace "accelerator_type:A"/"accelerator_type:B" with your actual types,
# e.g. "accelerator_type:V100" from the output above).
items = [
    {"cpu": 8, "gpu": 0.5, "custom_resources": {"accelerator_type:A": 0.5}},
    {"cpu": 8, "gpu": 1, "custom_resources": {"accelerator_type:B": 1}},
]

tuner = tune.Tuner(
    tune.with_resources(
        trainable=train_fn,  # your training function or Trainable class
        # Randomly pick one of the resource bundles for each trial.
        resources=lambda config: random.choice(items),
    ),
)
tuner.fit()

This is not ideal, as we could theoretically always sample the same device type and hence not utilize some of the GPUs. With a large number of trials this shouldn't be a problem, but with e.g. only 6 trials it wouldn't be great.
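If that is a concern, one option (just a sketch, assuming Tune resolves the resources callable once per trial) is to cycle through the resource bundles deterministically instead of sampling them, so every accelerator type gets assigned in turn:

import itertools

from ray import tune

# Reuse the `items` list and `train_fn` from above and hand out the
# resource bundles round-robin.
resource_iter = itertools.cycle(items)

tuner = tune.Tuner(
    tune.with_resources(
        trainable=train_fn,
        # Consecutive trials get alternating resource bundles.
        resources=lambda config: next(resource_iter),
    ),
)
tuner.fit()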
