[Tune] [SGD] [RLlib] Distribute Training Across Nodes with Different GPUs

There is currently no nice out-of-the-box solution in Ray to handle this, so the solution will be custom to your environment.

You can use tune.with_resources to dynamically specify the resources that should be allocated to a trial.

Ray automatically creates device-specific resources:

>>> ray.cluster_resources()
{'object_store_memory': 74558149015.0, 'node:172.31.76.223': 1.0, 'CPU': 36.0, 'memory': 192337857739.0, 'GPU': 4.0, 'accelerator_type:V100': 4.0, 'node:172.31.71.184': 1.0, 'node:172.31.68.10': 1.0, 'node:172.31.72.165': 1.0, 'node:172.31.90.28': 1.0}

Note the 'accelerator_type:V100': 4.0 above (in this cluster there is just one accelerator type).
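If you don't want to hard-code the available types, a small sketch (assuming a running cluster) that filters out just the accelerator-type resources from the output above:

import ray

ray.init()  # or ray.init(address="auto") when attaching to an existing cluster

# Collect only the accelerator-type resources; on a heterogeneous cluster
# you would see one entry per GPU model.
accelerator_types = {
    name: amount
    for name, amount in ray.cluster_resources().items()
    if name.startswith("accelerator_type:")
}
print(accelerator_types)  # e.g. {'accelerator_type:V100': 4.0}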

One option is to randomly sample one of the accelerator types for each trial, e.g. like this:

import random

from ray import tune

# One resource spec per GPU type in the cluster; the accelerator_type names
# below are placeholders for whatever your cluster actually reports.
items = [
    {"cpu": 8, "gpu": 0.5, "custom_resources": {"accelerator_type:A": 0.5}},
    {"cpu": 8, "gpu": 1, "custom_resources": {"accelerator_type:B": 1}},
]

tuner = tune.Tuner(
    tune.with_resources(
        trainable=train_fn,  # your training function
        # Pick a random resource spec (and hence GPU type) for each trial.
        resources=lambda config: random.choice(items),
    ),
)
tuner.fit()

This is not ideal, as we could theoretically always sample the same device and hence leave one of the GPUs unused. With a large number of trials this shouldn't be a problem, but with e.g. only 6 trials it wouldn't be great.
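One way around this (a sketch reusing the items list and train_fn from above) is to cycle through the resource specs deterministically instead of sampling them, so consecutive trials alternate between the GPU types:

from itertools import cycle

from ray import tune

# Deterministic round-robin over the resource specs defined above, so every
# GPU type gets used even with a small number of trials.
resource_iter = cycle(items)

tuner = tune.Tuner(
    tune.with_resources(
        trainable=train_fn,
        resources=lambda config: next(resource_iter),
    ),
)
tuner.fit()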
