Getting DistributedTrainableCreator to train with all GPUs

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

I have 4 GPUs and am trying to do distributed tuning where each tuning experiment uses 2 GPUs for distributed training (i.e., 2 tuning experiments running in parallel, using all 4 GPUs). However, when I run a simple CIFAR-10 example, it only ever uses 2 out of the 4 GPUs and appears to be running just one tuning experiment at a time. I think I may be misunderstanding how to allocate num_workers and num_gpus_per_worker.

    from ray import tune
    from ray.tune.integration.torch import DistributedTrainableCreator

    trainable_cls = DistributedTrainableCreator(
        train_cifar,
        num_workers=2,
        num_gpus_per_worker=2,
        num_cpus_per_worker=8)

    analysis = tune.run(
        trainable_cls,
        config=config,
        num_samples=4,
        stop={"training_iteration": 10},
        metric="accuracy",
        mode="max")

Hey @jlc, thanks for the question!

First, I’d like to mention that DistributedTrainableCreator is being deprecated, and I’d encourage you to try Ray Train instead! Ray Train also provides a Ray Tune integration.
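As a rough sketch of what that could look like with the Ray Train Trainer API (exact import paths and arguments depend on your Ray version, and train_cifar would need to follow Ray Train's function signature, i.e. accept a config dict):

    from ray import tune
    from ray.train import Trainer

    # One Trainer per trial: 2 workers x 1 GPU each = 2 GPUs per trial,
    # so two trials can run in parallel on a 4-GPU machine.
    trainer = Trainer(backend="torch", num_workers=2, use_gpu=True)
    trainable = trainer.to_tune_trainable(train_cifar)

    analysis = tune.run(
        trainable,
        config=config,
        num_samples=4,
        stop={"training_iteration": 10},
        metric="accuracy",
        mode="max")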

Second, with num_workers=2 and num_gpus_per_worker=2, each Trial requests 2 × 2 = 4 GPUs, so only one Trial can fit on your 4-GPU machine at a time. If you want each Trial to do distributed training across 2 workers with 1 GPU each (2 GPUs per Trial, so 2 Trials in parallel), you should set num_gpus_per_worker to 1:

-        num_gpus_per_worker=2,
+        num_gpus_per_worker=1,
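
With that change, the creator call from your snippet becomes:

    trainable_cls = DistributedTrainableCreator(
        train_cifar,
        num_workers=2,
        num_gpus_per_worker=1,  # 2 workers x 1 GPU = 2 GPUs per Trial
        num_cpus_per_worker=8)

Each Trial then reserves 2 GPUs, so Tune can schedule 2 of your 4 samples concurrently across the 4 GPUs.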