[Train, Tune, Cluster] Handling different GPUs (with different GPU memories) in a Ray Cluster

How severe does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hi everyone,

We have been using Ray Train and Ray Tune independently on each of our on-premise machines (SSH into a given machine and start the training/hyperparameter tuning scripts manually). Now that we have some spare time to re-think/re-organize our on-premise infrastructure, I would like to ask for some architectural/conceptual support/recommendations from you — the black belts in Ray clusters.

Current architecture/methodology
We have 3 on-prem machines with different GPU setups (a sketch of how I picture joining them into one Ray cluster follows this list):

  • machine-1: 12x vCPU + 128GB RAM + 3x 12GB GPU
  • machine-2: 8x vCPU + 128GB RAM + 4x 12GB GPU
  • machine-3: 64x vCPU + 256GB RAM + 2x 48GB GPU
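
For context, here is roughly how I imagine connecting the three machines into a single Ray cluster, tagging each node with a custom resource for its GPU memory class. The `gpu_12gb`/`gpu_48gb` labels are just names I made up for illustration; only the standard `ray start` flags are real:

```python
# On machine-1 (head node):
#   ray start --head --num-cpus=12 --num-gpus=3 --resources='{"gpu_12gb": 3}'
# On machine-2:
#   ray start --address=<head-ip>:6379 --num-cpus=8 --num-gpus=4 --resources='{"gpu_12gb": 4}'
# On machine-3:
#   ray start --address=<head-ip>:6379 --num-cpus=64 --num-gpus=2 --resources='{"gpu_48gb": 2}'
#
# "gpu_12gb"/"gpu_48gb" are made-up custom resource names, not built-in Ray labels.

import ray

ray.init(address="auto")          # connect to the running cluster from any node
print(ray.cluster_resources())    # should list CPU, GPU, gpu_12gb and gpu_48gb totals
```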

machine-1 and machine-2 are mainly used for

  • training/hyperparameter tuning smaller (e.g.: time-series data analysis) models
  • training/testing larger computer vision models with small batch sizes (typically batch-size=1)

machine-3 is mainly used for

  • training/hyperparameter tuning larger computer vision models (with medium batch sizes)
  • training/hyperparameter tuning smaller (e.g.: time-series data analysis) models with large batch sizes

To keep the calculations simple, let's assume that, apart from their memory size, the 12GB and 48GB GPUs are similar to one another, and that (see the sketch after this list):

  • each 12GB GPU can fit 2 time-series model trainer replicas
  • each 12GB GPU can fit 1 computer vision model testing process (the testing code is not related to Ray Train or Tune)
  • each 48GB GPU can fit 10 time-series model trainer replicas
  • each 48GB GPU can fit 3 computer vision model trainer replicas
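
If that simplification holds, the way I currently picture expressing these densities in Ray is as fractional `num_gpus` requests paired with the custom resource labels from the cluster sketch above. Again, `gpu_12gb`/`gpu_48gb` and the function names are hypothetical, and I don't know if this is the intended approach:

```python
import ray

# One time-series trainer replica on a 12GB GPU: 2 replicas per GPU -> 0.5 GPU each.
@ray.remote(num_gpus=0.5, resources={"gpu_12gb": 0.5})
def timeseries_trainer_12gb(config):
    ...  # placeholder for the actual training code

# One time-series trainer replica on a 48GB GPU: 10 replicas per GPU -> 0.1 GPU each.
@ray.remote(num_gpus=0.1, resources={"gpu_48gb": 0.1})
def timeseries_trainer_48gb(config):
    ...  # placeholder for the actual training code

# One CV testing process per 12GB GPU (plain Ray task, no Train/Tune involved).
@ray.remote(num_gpus=1, resources={"gpu_12gb": 1})
def cv_testing(checkpoint_path):
    ...  # placeholder for the existing testing code
```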

Desired architecture/methodology
We would like to be able to run the following scenarios (with minimal manual configuration each time):

scenario-1
starting a SINGLE time-series model hyperparameter tuning run on a pre-defined number (1…7) of 12GB GPUs (2 trainer replicas each) and on a pre-defined number (1…2) of 48GB GPUs (10 trainer replicas each)
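
What I cannot figure out is how to express the mixed density (2 replicas per 12GB GPU and 10 per 48GB GPU) inside a single tuning run. The closest thing I have found is `tune.with_resources()` with a placement group bundle, which at least lets every trial take a fixed slice of one GPU class; the trainable and search space below are placeholders:

```python
from ray import tune

def train_timeseries(config):
    ...  # placeholder for the actual time-series training loop

# Each trial reserves half of a 12GB GPU, i.e. 2 concurrent trials per 12GB GPU.
# "gpu_12gb" is the hypothetical custom resource from the cluster sketch above.
trainable_12gb = tune.with_resources(
    train_timeseries,
    tune.PlacementGroupFactory([{"CPU": 2, "GPU": 0.5, "gpu_12gb": 0.5}]),
)

tuner = tune.Tuner(
    trainable_12gb,
    param_space={"lr": tune.loguniform(1e-4, 1e-1)},
    tune_config=tune.TuneConfig(num_samples=20),
)
results = tuner.fit()
```

But I don't see how to cap this at a pre-defined number of 12GB GPUs (1…7), or how to let the same run also use the 48GB GPUs at a different density — that is exactly what I am asking about.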

scenario-2

  • starting a SINGLE time-series model hyperparameter tuning run on e.g. 5x 12GB GPUs (2 trainer replicas each)
  • starting 2 separate computer vision model testing jobs on the remaining 2x 12GB GPUs (1 testing process per GPU)
  • starting a SINGLE computer vision model hyperparameter tuning run on the 2x 48GB GPUs (3 trainer replicas each) — see the sketch after this list
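
Here is how I imagine the standalone testing part; the two tuning runs would follow the scenario-1 pattern with `gpu_12gb` and `gpu_48gb` bundles respectively. The model paths and function names are made up:

```python
import ray

ray.init(address="auto")

# Two standalone CV testing jobs, each pinned to a full 12GB GPU via the
# hypothetical "gpu_12gb" custom resource (no Train/Tune involved).
@ray.remote(num_gpus=1, resources={"gpu_12gb": 1})
def run_cv_testing(model_path):
    ...  # placeholder for the existing testing code
    return model_path

futures = [run_cv_testing.remote(p) for p in ["cv_model_a.pt", "cv_model_b.pt"]]

# Meanwhile, the time-series tuning run would request {"GPU": 0.5, "gpu_12gb": 0.5}
# per trial and the CV tuning run {"GPU": 0.33, "gpu_48gb": 0.33} per trial,
# so the three workloads should not compete for the same GPUs.
print(ray.get(futures))
```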

scenario-N

As our GPUs have different amounts of memory, setting for example resources={"cpu": 8, "gpu": 0.33} for a trainable would reserve a different amount of GPU memory on a 12GB GPU than on a 48GB one. Furthermore, the CPU and RAM setups of our machines vary as well.
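
The only workaround I can think of is to never rely on the bare GPU fraction alone, but to always pair it with a per-memory-class custom resource, so the fraction is always taken from a known GPU size. For Ray Train that would look roughly like the sketch below — though I am not sure custom resources are even allowed in `resources_per_worker`, so please correct me if this is wrong (the training loop is a placeholder and `gpu_48gb` is my made-up label):

```python
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker():
    ...  # placeholder for the actual per-worker training loop

# 3 CV trainer replicas per 48GB GPU -> each worker asks for roughly a third of a
# GPU, but only on nodes that expose the hypothetical "gpu_48gb" custom resource.
scaling_config = ScalingConfig(
    num_workers=6,  # e.g. 2x 48GB GPUs * 3 replicas each
    use_gpu=True,
    resources_per_worker={"GPU": 0.33, "gpu_48gb": 0.33},
)

trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    scaling_config=scaling_config,
)
result = trainer.fit()
```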

So I am a little bit confused about whether Ray is flexible enough to handle all the scenarios above. If it is, how would you set up the Ray cluster(s) and configure them for these scenarios?

Please note that I am a newbie with Ray clusters, so any detailed answer/solution would be very much appreciated :slight_smile: