[Tune] [SGD] [RLlib] Distribute Training Across Nodes with Different GPUs

I have the following systems that I wish to run distributed GPU training on:

  1. 32 core CPU + 2070S + 2070S
  2. 16 core CPU + 2070S + 1080
  3. 16 core CPU + 1080 + 1080

What are the downsides of using all 3 systems to perform distributed training of a single model? Node #2 has a mix of two different GPUs, while nodes #1 and #3 each have matching GPUs, but of different models from one another.

When tuning hyperparameters (with Ray Tune), is it better to train one model across all 3 systems, or to train 3 models in parallel with 3 different sets of hyperparameters, where each system trains using its own set?
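
For concreteness, here is a rough sketch of what I mean by the second option (the trainable function and the resource numbers are just placeholders, not my actual setup):

```python
import ray
from ray import tune

ray.init(address="auto")  # connect to the existing 3-node cluster


def trainable(config):
    # placeholder training loop; reports a dummy metric
    for step in range(10):
        tune.report(loss=config["lr"] * step)


# One trial per pair of GPUs, so with 3 trials each node can
# effectively evaluate its own hyperparameter set.
tune.run(
    trainable,
    config={"lr": tune.grid_search([1e-2, 1e-3, 1e-4])},
    resources_per_trial={"cpu": 8, "gpu": 2},
)
```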

Hi @rliaw, could you take a look at this? (Hopefully you are the right person to ask for this question :slight_smile: )

I have the same question.
For example, I have two types of GPU nodes. One node has cards with 4 GB of GPU memory, and the other has 12 GB cards. My task only needs 4 GB of memory and is not computation heavy. So how can I run one task per card on the first node, and three tasks per card on the second? In other words, if each node has two cards, they would run 8 tasks simultaneously.
I noticed an inelegant method that partly resolves this: for each type of card, I can use `ray start --resources` to define a custom resource equal to the card's GPU memory size (roughly like the sketch below), but this is not convenient. Besides, it seems it cannot control other resources like CPU or memory.
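
What I have been doing looks roughly like this (the `gpu_mem` resource name and the numbers are made up for illustration):

```python
# On the node with 2 x 4 GB cards:
#   ray start --address=<head_ip:port> --num-gpus=2 --resources='{"gpu_mem": 8}'
# On the node with 2 x 12 GB cards:
#   ray start --address=<head_ip:port> --num-gpus=2 --resources='{"gpu_mem": 24}'

import ray

ray.init(address="auto")


# Each task asks for 4 units of the custom "gpu_mem" resource plus a
# fraction of a GPU, so the 4 GB node fits 2 tasks and the 12 GB node
# fits 6 (Ray still decides which physical card each task lands on).
@ray.remote(num_gpus=1 / 3, resources={"gpu_mem": 4})
def train_task(task_id):
    # real training code would go here
    return task_id


print(ray.get([train_task.remote(i) for i in range(8)]))
```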
So is there any elegant method?