[Tune] [SGD] [RLlib] Distribute Training Across Nodes with Different GPUs

I have the following systems that I wish to run distributed GPU training on

  1. 32 core CPU + 2070S + 2070S
  2. 16 core CPU + 2070S + 1080
  3. 16 core CPU + 1080 + 1080

What are the downsides of using all 3 systems to perform distributed training of a single model? Node #2 has a mix of two different GPUs, while nodes #1 and #3 each have a matched pair, but of different models from each other.

When tuning hyperparameters (with Ray Tune), is it better to train one model across all 3 systems, or to train 3 models in parallel with 3 different sets of hyperparameters, where each system trains using its own set?

Hi @rliaw, could you take a look at this? (Hopefully you are the right person to ask about this question :slight_smile: )

I have the same question.
For example, I have two types of GPU nodes. One node has cards with 4 GB of GPU memory, and the other has cards with 12 GB. My task only needs 4 GB of memory and is not computation-heavy. So how can I run one task per card on the first node, and three tasks per card on the second? In other words, if each node has two cards, the cluster should run 8 tasks simultaneously.
I have noticed an inelegant method that resolves this partly: for each node, I can use ray start --resources to define a custom resource equal to the GPU memory size of its cards (rough sketch below), but this is not convenient. Besides, it seems that it cannot control other resources like CPU or memory.
So is there any elegant method?
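To make the workaround concrete, here is roughly what I mean (gpu_mem_gb and train_task are made-up names for this sketch, and the custom resource is defined per node, not per card):

import ray

# Started from the shell on each worker node (values are examples):
#   ray start --address=<head_address> --num-gpus=2 --resources='{"gpu_mem_gb": 8}'   # 2 x 4 GB cards
#   ray start --address=<head_address> --num-gpus=2 --resources='{"gpu_mem_gb": 24}'  # 2 x 12 GB cards

ray.init(address="auto")

# Each task asks for 4 GB of the custom "gpu_mem_gb" resource plus a slice of a GPU,
# so the small node fits 2 such tasks and the large node fits 6 (8 in total).
# Note: this does not pin a task to a specific physical card.
@ray.remote(num_gpus=0.3, resources={"gpu_mem_gb": 4})
def train_task():
    ...


futures = [train_task.remote() for _ in range(8)]
ray.get(futures)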

I have the same problem. In my cluster I have GPUs of type 2080 Ti and 3090. I think I could give the latter more work, because more VRAM is available and stays free if I configure 1 GPU per trial.

Is there maybe already a solution?

There is currently no nice out-of-the-box solution in Ray to handle this, so the solution will be custom to your environment.

You can use tune.with_resources to dynamically specify the resources that should be allocated to a trial.

Ray automatically creates device-specific resources:

>>> ray.cluster_resources()
{'object_store_memory': 74558149015.0, 'node:172.31.76.223': 1.0, 'CPU': 36.0, 'memory': 192337857739.0, 'GPU': 4.0, 'accelerator_type:V100': 4.0, 'node:172.31.71.184': 1.0, 'node:172.31.68.10': 1.0, 'node:172.31.72.165': 1.0, 'node:172.31.90.28': 1.0}

Note the 'accelerator_type:V100': 4.0 above (this cluster happens to have just one accelerator type).
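Outside of Tune, you can use these auto-created resources directly as a scheduling constraint for a task or actor. A minimal sketch (train_step is a made-up function name; requesting a small fraction of the accelerator_type resource just constrains which nodes the task can run on):

import ray

ray.init(address="auto")

# Requesting a small slice of the auto-created "accelerator_type:V100" resource
# restricts scheduling to nodes that actually have a V100, while num_gpus=1
# reserves the GPU itself.
@ray.remote(num_gpus=1, resources={"accelerator_type:V100": 0.01})
def train_step():
    ...


ray.get(train_step.remote())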

What you could do is randomly sample one of the accelerator types for each trial, e.g. like this:

import random

from ray import tune

# One resource bundle per accelerator type in the cluster
# (replace "accelerator_type:A"/"accelerator_type:B" with your actual types,
# e.g. "accelerator_type:V100" from the output above).
items = [
    {"cpu": 8, "gpu": 0.5, "custom_resources": {"accelerator_type:A": 0.5}},
    {"cpu": 8, "gpu": 1, "custom_resources": {"accelerator_type:B": 1}},
]

tuner = tune.Tuner(
    tune.with_resources(
        trainable=train_fn,  # your training function or Trainable class
        # Randomly pick one of the resource bundles for each trial.
        resources=lambda config: random.choice(items),
    ),
)
tuner.fit()

This is not ideal, as we could theoretically always sample the same device type and hence not utilize some of the GPUs. With a large number of trials this shouldn't be a problem, but with e.g. only 6 trials it wouldn't be great.
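If that is a concern, one option (just a sketch, assuming Tune resolves the resources callable once per trial) is to cycle through the resource bundles deterministically instead of sampling them, so every accelerator type gets assigned in turn:

import itertools

from ray import tune

# Reuse the `items` list and `train_fn` from above and hand out the
# resource bundles round-robin.
resource_iter = itertools.cycle(items)

tuner = tune.Tuner(
    tune.with_resources(
        trainable=train_fn,
        # Consecutive trials get alternating resource bundles.
        resources=lambda config: next(resource_iter),
    ),
)
tuner.fit()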
